CN108304359B - Method for constructing an unsupervised-learning unified feature extractor - Google Patents

Method for constructing an unsupervised-learning unified feature extractor

Info

Publication number: CN108304359B
Authority: CN (China)
Prior art keywords: training, data, news, layer, auto-encoder
Prior art date: 2018-02-06
Legal status: Active (granted)
Application number: CN201810117102.XA
Other languages: Chinese (zh)
Other versions: CN108304359A (application publication)
Inventors: 杨楠 (Yang Nan), 曹三省 (Cao Sansheng)
Current and original assignee: Communication University of China
Filing date: 2018-02-06
Publication of application CN108304359A: 2018-07-20
Grant and publication of CN108304359B: 2019-06-14


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0631: Item recommendations

Abstract

The application provides a method for constructing an unsupervised-learning unified feature extractor, characterized in that: actual news text data is obtained from the server side and a news feature training data set is generated; the data in the news feature training data set is processed and vectorized to obtain a news feature training vector set; the news data set is sorted according to user access data to form a user feature training data set; a stacked asymmetric denoising contractive auto-encoder with multiple hidden layers is constructed, and the deep auto-encoder is trained with a purpose-built objective function; after the deep auto-encoder completes training, the decoder part is deleted and a binarization generation layer is added, completing the construction of the unsupervised-learning unified feature extractor. The unified feature extractor provided by the application can realize the unification of news features and user features and the unification of content-based and collaborative-filtering recommendation, and improves the efficiency of real-time recommendation.

Description

Method for constructing an unsupervised-learning unified feature extractor
Technical field
The invention belongs to the field of artificial intelligence, and more particularly relates to a method for constructing an unsupervised-learning unified feature extractor.
Background art
Current recommender systems (recommendation engines) generally fall into content-based recommendation, collaborative-filtering recommendation, hybrid recommendation and similar types. Together with search engines they are among the most important information tools of today's society and are widely used in fields such as e-commerce and media recommendation. The currently popular collaborative-filtering methods are mainly based on commonality: the ratings certain users give to goods or media content (collectively called "items") are used to compute the similarity between users and the similarity between items; a user's rating of a new item is then inferred from the ratings of other users with similar interests, or predicted from the similarity to items the user was once interested in, which is why this is also called rating prediction. Its disadvantages are that personalization is insufficient and that prediction is difficult when rating data is scarce.
Content-based recommendation mainly models the preferences of individual users and the attributes of items, and recommends according to the user's preferences and interests. It is more strongly personalized, but the modeling and matching of user preferences and item attributes are the difficult points. Past user-preference modeling required direct features such as demographic statistics, which easily intrudes on personal privacy.
Deep learning is a machine learning method that has risen in recent years; it can be divided into supervised learning and unsupervised learning. The auto-encoder (AutoEncoder, AE) is a current research frontier of unsupervised learning, but most existing deep auto-encoding schemes have their respective strengths and weaknesses, suffer from drawbacks such as easy over-fitting, and for the most part have not yet realized unsupervised learning in the full sense, which greatly constrains their capability.
With the rapid development of technologies such as artificial intelligence, deep learning and unsupervised learning, new techniques and new methods need to be studied to renew the technical foundation of recommender systems, effectively realize hybrid recommendation, and vigorously improve the efficiency of online recommendation.
Summary of the invention
In view of the problems in applications such as today's converged-media news recommendation (insufficient personalization, difficult user feature extraction, difficulty in unifying different methods into an effective hybrid recommendation method, privacy violations during user feature extraction, and real-time recommendation efficiency in need of improvement), and drawing on current artificial intelligence technology, this application discloses a construction method for an Unsupervised Learning Unified Feature Extractor (ULUFE), used to extract a Unified Representation Based on Content (URBC). The construction method comprises the following steps:
S1. Obtain actual news text data and user access data from the server side, and generate a news feature training data set after sorting and randomization;
S2. Pre-process the data in the news feature training data set with a current Chinese word segmentation tool to obtain a pre-processed news feature training data set;
S3. Obtain a news feature training vector set from the pre-processed news feature training data set by the TF-IDF method;
S4. Sort the news feature training vector set according to the user access data to form a user feature training data set;
S5. Construct a stacked asymmetric denoising contractive auto-encoder with multiple hidden layers, using J_SA-CDAE as the objective function:

J_SA-CDAE(θ) = Σ_{x_i ∈ t} L_MC(x_i, g_θ(f_θ(x_i))) + λ‖J(x)‖_F²

where

L_MC(x_i, x̂_i) = −k_σ(x_i − x̂_i)

and k_σ is a Gaussian kernel whose standard deviation σ is taken as 1.0:

k_σ(z) = (1/(√(2π)·σ)) · exp(−z²/(2σ²))

Here x denotes the encoder input, f_θ(·) the encoder output, and g_θ(·) the decoder output; L_MC(·) is the cost function of a single input, λ is the regularization parameter of the contractive auto-encoder, ‖·‖_F is the Frobenius norm, J(x) is the encoder Jacobian matrix, θ is the parameter set of the deep auto-encoder, x_i is the encoder input in one training pass, x̂_i is the decoder's reconstructed output, t is the training set, and z is the argument of the Gaussian kernel;
S6. Train the deep auto-encoder, with the following training steps:
S61. Use the news feature training vector set as the training data of the deep auto-encoder;
S62. Add white Gaussian noise to the training data to generate noisy input data;
S63. Feed the noisy input data to the deep auto-encoder; train with mini-batch gradient descent, first performing unsupervised layer-by-layer pre-training to obtain the initial parameters of each hidden layer and the output data of the output layer;
S64. Compare the input training data with the output data in the objective function, realize back-propagation of the gradient, and adjust the initial parameters of each hidden layer;
S65. After training is complete, obtain the parameter set of the deep auto-encoder;
S7. Delete the decoder part of the deep auto-encoder and add a binarization generation layer after the output of the last hidden layer, completing the construction of the unsupervised-learning unified feature extractor.
Preferably, step S1, obtaining actual news text data and user access data from the server side and generating a news feature training data set after sorting and randomization, specifically includes the following steps:
S11. Collect the news data and user access data on the server within a certain period;
S12. Remove pictures and videos from the news data, encode uniformly as UTF-8, set a serial number for each news item, and form a news data set;
S13. Randomly re-order the news items in the news data set by serial number, then divide them in a certain proportion into the news feature training data sets for the layer-by-layer unsupervised pre-training stage and the global training stage.
Preferably, the stacked asymmetric denoising contractive auto-encoder with multiple hidden layers constructed in step S5 includes two hidden layers.
Preferably, the coding function of the first hidden layer is h_1(x_i) = S(w_1 x_i + b_1), with pre-training decoding function ĝ_1(h_1) = S(w_1′ h_1 + b_1′); the coding function of the second hidden layer is h_2(h_1) = S(w_2 h_1 + b_2), with pre-training decoding function ĝ_2(h_2) = S(w_2′ h_2 + b_2′);
the global-training decoding function from the second hidden layer to the output layer is g_o(h_2) = S(w_o h_2 + b_o);
the initial parameters of each layer use random numbers in [0, 1], and the nonlinear activation function S(·) is uniformly the Sigmoid function; e is Euler's number, h denotes the coding function of a hidden layer, g a decoding function, b a bias, x the encoder input, and w_1, w_2 the weight parameters of the first and second hidden layers respectively.
Preferably, the binarization generation layer in step S7 has the same dimension as the last hidden layer of the deep auto-encoder, with a one-to-one connection to each neuron of the last hidden layer; the binarization generation layer sets a weight adjuster according to the output of the last hidden layer to realize threshold adjustment, the threshold T in the weight adjuster being chosen so that the output results of one complete training pass are divided into two classes with maximal between-class variance.
Preferably, the method further includes S8: input the user feature training vector set into the unsupervised-learning unified feature extractor to obtain user preference models, and, according to the preference model of each user, generate a unified user neighbor table through similarity comparison.
The advantages of the application are:
1. The manually labeled data required by supervised learning is hard to obtain in real time for fast network-media recommendation, and existing deep auto-encoders still require supervised fine-tuning after unsupervised layer-by-layer pre-training; the deep auto-encoder of the present invention can realize fully unsupervised learning.
2. A deep structure replaces the single-hidden-layer configuration, further improving the ability to learn the high-order latent explanatory factors of the content.
3. The encoder and decoder are asymmetric and the hidden-layer dimension is lower than the input-layer dimension, so the nonlinear manifold of the data can be learned and dimensionality is reduced while features are extracted, outperforming linear-manifold methods such as PCA; the asymmetry also serves as a means of addressing the tendency of auto-encoders to over-fit.
4. The features output by the auto-encoder of the present invention lend themselves to binarization; after the binarization generation layer is added, binarized features can be generated, so that in recommendation the fast user-similarity comparison problem in converged media and news can be solved by methods such as cosine-similarity comparison, Hamming-distance comparison and hashing, with an evident effect on the fast recommendation of breaking news in mobile media.
5. In application, the features extracted from news data (unified representations based on content) serve as the features of both the news to be recommended and the users, realizing the unification of the two kinds of features as well as the unification of content-based and collaborative-filtering recommendation methods, while effectively protecting user privacy, innovating the recommendation method and improving recommendation efficiency.
Detailed description of the invention
Fig. 1 is the SA-CDAE design diagram of the invention;
Fig. 2 is the training schematic diagram of the invention;
Fig. 3 shows the unsupervised-learning feature extractor of the invention;
Fig. 4 is the online recommendation schematic diagram of the invention;
Fig. 5 is the precision comparison chart of the invention;
Fig. 6 is the recall comparison chart of the invention.
Specific embodiment
The specific embodiment and detailed steps of the method of the invention for constructing an unsupervised-learning unified feature extractor are further described below:
Step 1: data acquisition and preparation
The invention is mainly directed at website text news and mobile-phone news-client text news in today's converged media. The news text data and the user access data are both located at the server side. This step generates the "news feature training data set"; the specific process is as follows:
1) Collect the news data and user access data on the server within a certain period; the news data includes the historical news on the server, and the user access data includes the list of news IDs each user read within the period;
2) Remove irrelevant content such as pictures and videos from the news data, encode uniformly as UTF-8, set a serial number for each news item, and form a news data set;
3) Randomly re-order the news items in the news data set by serial number, then divide them in a certain proportion into the "news feature training data sets" for the layer-by-layer unsupervised pre-training stage and the global training stage (a minimal code sketch of this step follows).
Step 2: text data pre-processing
Using a current Chinese word segmentation tool, the data in the news feature training data set undergo Chinese word segmentation, stop-word removal and similar processing, yielding the pre-processed news feature training data set (a sketch follows).
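A minimal sketch of this pre-processing, assuming the jieba tokenizer stands in for the unnamed "current Chinese word segmentation tool" and that a stop-word set is supplied by the caller:

```python
import jieba  # one widely used Chinese word segmentation tool; the patent
              # only says "a current Chinese word segmentation tool"

def preprocess(texts, stop_words):
    """Sketch of Step 2: segment each news text and drop stop words."""
    corpus = []
    for text in texts:
        tokens = [w for w in jieba.lcut(text)
                  if w.strip() and w not in stop_words]
        corpus.append(tokens)
    return corpus
```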
Step 3: news text data vectorization
The pre-processed news feature training data set is vectorized with the TF-IDF method, obtaining the news feature training vector set corresponding to the news feature training data set; TF-IDF is the abbreviation of "term frequency - inverse document frequency".
TF means term frequency, calculated as:
TF(w, d) = (number of occurrences of word w in document d) / (total number of words in document d)
IDF means inverse document frequency, calculated as:
IDF(w) = log( (total number of documents) / (1 + number of documents containing word w) )
Keeping the relative positions of the words in the news feature training data set, the initial feature vector of each data item is obtained by the TF-IDF method, forming the news feature training vector set, where:
TF-IDF = TF × IDF
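A minimal sketch of the TF-IDF vectorization under the formulas above; the fixed vocabulary and the +1 smoothing in the IDF denominator are common conventions assumed here, not mandated by the patent:

```python
import math
from collections import Counter

def tf_idf_vectors(corpus, vocab):
    """Sketch of Step 3: TF-IDF vectors per the formulas above.

    corpus is a list of token lists; vocab is an ordered word list that
    fixes the vector dimension and keeps word positions stable across
    documents, as the patent requires."""
    n_docs = len(corpus)
    doc_sets = [set(doc) for doc in corpus]
    df = {w: sum(1 for s in doc_sets if w in s) for w in vocab}
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc) or 1
        vec = []
        for w in vocab:
            tf = counts[w] / total                 # term frequency
            idf = math.log(n_docs / (1 + df[w]))   # inverse document frequency
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors
```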
Step 4: obtaining the user feature training data set
Sorting the news feature training vector set according to the user access data yields the "user feature training vector set" (a sketch follows).
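A minimal sketch of this grouping, assuming the user access data has already been reduced to a mapping from user IDs to the serial numbers of the news items each user read (names are illustrative):

```python
def user_training_sets(news_vectors, access_log):
    """Sketch of Step 4: group news feature vectors by the users who read
    them. access_log is assumed to map a user id to the list of serial
    numbers of the news items that user read."""
    return {user: [news_vectors[news_id] for news_id in news_ids]
            for user, news_ids in access_log.items()}
```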
Step 5: constructing a stacked asymmetric denoising contractive auto-encoder
The core component of the unsupervised-learning unified feature extractor in this application is a specially designed deep auto-encoder, whose role in the invention lies mainly in two aspects: feature extraction and dimensionality reduction. Around the application goal of intelligent converged-media recommendation, and combining the advantages of the contractive auto-encoder and the denoising auto-encoder, the invention designs a stacked (deep) asymmetric denoising contractive auto-encoder with 2 or 3 hidden layers (Stacked Asymmetric Denoising Contractive Auto-encoder, SA-CDAE), as shown in Fig. 1. Structurally, multiple hidden layers are used to improve on the feature-extraction ability of a single hidden layer; the input and output layers have the same dimension, the hidden-layer dimensions are smaller than the input layer and decrease layer by layer in proportion, and the coding and decoding structure is asymmetric to strengthen resistance to over-fitting. The news initial training vector set obtained after the early preparation and pre-processing is on the whole independent and identically distributed, but contains a certain amount of disturbance whose specific distribution is unknown; denote it D = {x_1, x_2, ..., x_n}, x_i ∈ R^d, n ∈ N. Then:
the coding function of the first hidden layer is h_1(x_i) = S(w_1 x_i + b_1);
the pre-training decoding function of the first hidden layer is ĝ_1(h_1) = S(w_1′ h_1 + b_1′);
the coding function of the second hidden layer is h_2(h_1) = S(w_2 h_1 + b_2);
the pre-training decoding function of the second hidden layer is ĝ_2(h_2) = S(w_2′ h_2 + b_2′);
the global-training decoding function from the second hidden layer to the output layer is g_o(h_2) = S(w_o h_2 + b_o).
The initial parameters of each layer use random numbers in the open interval (0, 1), and the nonlinear activation function S(·) is uniformly the Sigmoid function S(z) = 1/(1 + e^(−z)).
Here D is the news initial training vector set, R the set of real numbers, N the set of natural numbers; h denotes the coding function of a hidden layer, g a decoding function, b a bias, e Euler's number, x the encoder input, w_1 and w_2 the weight parameters of the first and second hidden layers, and x_i the encoder input in one training pass.
The auto-encoder borrows characteristics of the human brain. Its principle is to train a coding-decoding mechanism so that the encoder's input can be reproduced at the decoder's output, where the encoder part is also called the hidden layer(s) and the decoder part the output layer. Perfectly reconstructing the input at the output is neither easy nor of practical significance; but by designing a special structure, adding suitable constraints to the copying, and using a special cost function and training method, the model can be restricted to approximate reproduction only, forcing it to copy the input through its weights, so that useful distributional features of the data are built up in the encoder. This has become a frontier of generative-model research in recent years. The prototype auto-encoder already embodies good feature-extraction ability, but in use it is prone to problems such as over-fitting and loss of generalization to real data, and derivative auto-encoders that successively improve and optimize the prototype have therefore appeared.
The deep auto-encoder of the invention considers both adding noise and reducing noise (disturbance) in its design. Adding noise draws on the thinking of denoising auto-encoders: white noise of Gaussian distribution is added to the input X, forcing the decoder to eliminate the noise in its output, which improves the system's resistance to over-fitting; adding white Gaussian noise to the input during training gives the auto-encoder denoising characteristics and further lowers the risk of over-fitting. The parameter set θ of the neural network is trained by back-propagation and stochastic gradient descent (SGD).
Reducing noise (disturbance) refers to improving, during training, the system's resistance to noise and disturbances of non-Gaussian distribution. To further reduce the influence of outliers in the news feature data set and the user feature data set, and to provide a basis for the binarization generation used later in the scheme, the design also partly adopts the characteristics of the contractive auto-encoder. The contractive auto-encoder adds an analytic contraction penalty factor to the cost-function expression of the prototype auto-encoder, reducing the degrees of freedom of the feature representation and driving the hidden neurons toward saturation, so that the output data is confined to a certain region of parameter space. The penalty factor is in fact the Frobenius norm of the encoder's Jacobian matrix; its effect is to reduce the influence of outliers on the encoder and to suppress disturbances of the training samples (which lie on a low-dimensional manifold surface) in all directions, helping the encoder learn useful data features. In addition, the distributed representation learned by the contractive auto-encoder has a "saturated" character: the values of most hidden units are close to the two ends (0 or 1), and their partial derivatives with respect to the input are close to 0.
The training of general auto-encoders often uses the mean square error (MSE) as the cost function, which has a certain tolerance to noise of Gaussian distribution; but in this example, considering the presence of disturbances such as small outliers (for instance accidental reads outside the user's preference), the present embodiment uses the maximum correntropy (MC) as the cost function to improve robustness:

L_MC(x_i, x̂_i) = −k_σ(x_i − x̂_i)

where k_σ is a Gaussian kernel whose standard deviation σ is taken as 1.0:

k_σ(z) = (1/(√(2π)·σ)) · exp(−z²/(2σ²))

The overall objective function of the deep auto-encoder in the invention is:

J_SA-CDAE(θ) = Σ_{x_i ∈ t} L_MC(x_i, g_θ(f_θ(x_i))) + λ‖J(x)‖_F²

In the formulas above, f_θ(·) denotes the encoder output and g_θ(·) the decoder output; L_MC(·) is the cost function of a single input, λ is the regularization parameter of the contractive auto-encoder, ‖·‖_F is the Frobenius norm, J(x) is the encoder Jacobian matrix, θ is the parameter set of the deep auto-encoder, x_i is the encoder input in one training pass, x̂_i is the decoder's reconstructed output, t is the training set, and z is the argument of the Gaussian kernel (a code sketch of this objective follows).
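The following PyTorch sketch illustrates one possible reading of the SA-CDAE and its objective. The layer sizes, λ, and the decision to evaluate the contractive penalty in closed form on the first sigmoid layer are assumptions; the patent states the penalty abstractly as λ‖J(x)‖_F² on the encoder Jacobian.

```python
import math

import torch
import torch.nn as nn

class SACDAE(nn.Module):
    """Sketch of the SA-CDAE: two shrinking sigmoid hidden layers (the
    encoder) and a single decoding layer back to the input dimension,
    making coding and decoding asymmetric. Layer sizes are illustrative."""

    def __init__(self, d_in=2000, d_h1=800, d_h2=300):
        super().__init__()
        self.enc1 = nn.Linear(d_in, d_h1)
        self.enc2 = nn.Linear(d_h1, d_h2)
        self.dec = nn.Linear(d_h2, d_in)  # one decoding layer: h2 -> output

    def forward(self, x):
        h1 = torch.sigmoid(self.enc1(x))
        h2 = torch.sigmoid(self.enc2(h1))
        return h1, h2, torch.sigmoid(self.dec(h2))

def sa_cdae_loss(model, x, x_noisy, sigma=1.0, lam=1e-4):
    """J_SA-CDAE = maximum-correntropy reconstruction cost + lam * ||J||_F^2.

    The contractive term is evaluated in the closed form for the first
    sigmoid layer, sum_j (h_j(1-h_j))^2 * sum_i w_ji^2 (an assumption;
    the patent states it abstractly as the F-norm of the encoder Jacobian)."""
    h1, _, x_hat = model(x_noisy)
    # L_MC: negated Gaussian kernel of the reconstruction residual.
    z = x - x_hat
    kernel = torch.exp(-z ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    l_mc = -kernel.sum()
    # Analytic Frobenius norm of the first hidden layer's Jacobian.
    w_sq = (model.enc1.weight ** 2).sum(dim=1)        # sum_i w_ji^2 per unit j
    contractive = ((h1 * (1 - h1)) ** 2 * w_sq).sum()
    return l_mc + lam * contractive
```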
Step 6: training the deep auto-encoder
Training a neural network means taking cleaned and sorted data as input and, through the two links of forward propagation and back propagation, gradually making the parameters of the network's objective function converge, so that higher-order statistical characteristics are learned. As shown in Fig. 2, the deep auto-encoder is trained offline; the main training steps are as follows:
1) Use the news feature training vector set as the training data of the deep auto-encoder, denoted X. It can be seen that the training data in this application uses news data itself, needing neither manual labeling nor a third-party corpus;
2) Add white Gaussian noise to the training data X to generate the noisy input data X_1;
3) Feed X_1 to the deep auto-encoder; train with mini-batch gradient descent, first performing unsupervised layer-by-layer pre-training to obtain the initial parameters of each hidden layer and the output X̂ of the output layer;
4) Compare X and X̂ in the objective function and realize global back-propagation of the gradient, adjusting the initial parameters of each hidden layer;
5) After training is complete, obtain the parameter set of the deep auto-encoder for the next step, building the unsupervised-learning unified feature extractor (a training-loop sketch follows).
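A matching training-loop sketch, reusing SACDAE and sa_cdae_loss from the sketch above; the noise level, learning rate, batch size and epoch count are illustrative, and the layer-by-layer pre-training phase is folded into one loop for brevity (the patent pre-trains each hidden layer with its own temporary decoder before the global pass):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_sa_cdae(model, vectors, noise_std=0.1, epochs=20, batch=64, lr=0.01):
    """Sketch of Step 6: mini-batch training with white Gaussian input noise."""
    x_all = torch.tensor(vectors, dtype=torch.float32)
    loader = DataLoader(TensorDataset(x_all), batch_size=batch, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (x,) in loader:
            x_noisy = x + noise_std * torch.randn_like(x)  # add Gaussian noise
            loss = sa_cdae_loss(model, x, x_noisy)
            opt.zero_grad()
            loss.backward()   # global back-propagation of the gradient
            opt.step()
    return model
```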
The deep auto-encoder of this application embodies completely new considerations in structure, cost function and training; it realizes dimensionality reduction at the same time as feature extraction and can learn nonlinear manifolds, significantly outperforming linear-manifold methods such as PCA in dimensionality reduction. In addition, exploiting the parallel nature of neural networks, GPU parallel computation is used to accelerate the main training steps of the deep auto-encoder of the invention, greatly increasing its training efficiency and improving its practical efficiency in recommender systems.
Step 7: building the unsupervised-learning unified feature extractor
The output of the trained deep auto-encoder has the characteristic of being easy to binarize. Therefore, after the deep auto-encoder completes training, the decoder part is deleted and a binarization generation layer is added after the output of the last hidden layer; this layer completes the binarization, as shown in Fig. 3, finishing the construction of the unsupervised-learning unified feature extractor.
In this embodiment about 70% of the deep auto-encoder's output values are close to 0 or 1; how the remaining 30% are handled directly affects the precision of the overall binarized extraction and of the subsequent similarity comparison. For this purpose, a binarization generation layer with the same dimension as the last hidden layer of the deep auto-encoder is designed structurally, each of its neurons in one-to-one connection with the last hidden layer. Internally, the binarization generation layer does not use a common fixed threshold; instead, a weight adjuster realizes threshold adjustment according to the actual distribution of the last hidden layer's output. The principle for choosing the threshold T in the weight adjuster is that the output results of one complete training pass can be divided into two classes whose between-class variance is maximal.
Let the total output set of the hidden-layer units after one complete training pass be K, containing N distinct data values. Sort K from small to large to obtain the data set K = (k_1, k_2, ..., k_i, ...), and suppose it can be divided into two groups K_1 and K_2 of sizes t and N − t. Let the frequency of occurrence of each k_i be n_i, i ∈ [1, N]; the probabilities of the two groups within the whole are ε_1 and ε_2, and the means of the two groups are β_1 and β_2. Then the probability of k_i is p_i = n_i / N, ε_2 = 1 − ε_1, and the mean of the data set K is β = ε_1 β_1 + ε_2 β_2. The between-class variance of the two groups is defined as δ(t) = ε_1(β_1 − β)² + ε_2(β_2 − β)². Find T = argmax_t δ(t), i.e. the t for which δ(t) is maximal, take the corresponding value in K as the threshold, set values ≤ T to 0 and the rest to 1, and thereby realize the binarization of the hidden-layer output (a sketch of this threshold search follows).
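A minimal sketch of the weight adjuster's threshold search, which is Otsu's criterion applied to the pooled hidden-layer outputs:

```python
import numpy as np

def otsu_threshold(outputs):
    """Threshold search of the weight adjuster: maximize the between-class
    variance delta(t) = eps1*(beta1 - beta)^2 + eps2*(beta2 - beta)^2."""
    k = np.sort(np.asarray(outputs, dtype=float).ravel())
    beta = k.mean()                      # overall mean of the output set K
    best_t, best_var = k[0], -1.0
    for t in np.unique(k)[:-1]:          # keep both groups non-empty
        left, right = k[k <= t], k[k > t]
        eps1, eps2 = left.size / k.size, right.size / k.size
        var = eps1 * (left.mean() - beta) ** 2 + eps2 * (right.mean() - beta) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

def binarize(outputs, t):
    """Values <= T become 0, the rest 1, as described above."""
    return (np.asarray(outputs) > t).astype(np.uint8)
```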
Step 8: obtaining the user preference models and the user neighbor table
After the construction of the unsupervised-learning unified feature extractor is complete, the user feature training vector set is input into it to obtain the user preference models; according to the preference model of each user, a unified user neighbor table is generated through similarity comparison.
Fig. 4 shows an example of personalized recommendation with the unsupervised-learning unified feature extractor. All news texts to be recommended are pre-processed, vectorized and input into the extractor, yielding the to-be-recommended news feature vectors as unified representations based on content. Similarity comparison between these vectors and a user's preference model generates the content-based recommendation list; using the user neighbor table, the news read by users similar to user A1 generates the collaborative-filtering recommendation list; after weighted mixing, the Top-N list of the hybrid recommendation is obtained (a sketch follows).
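A minimal sketch of the neighbor table and the weighted mixing, assuming binarized user codes from the extractor and per-item scores from the two recommendation lists; the blend weight alpha is illustrative:

```python
import numpy as np

def hamming_neighbors(user_codes, top_k=10):
    """Build the user neighbor table by pairwise Hamming distance over the
    binarized preference codes (user_codes: user id -> 0/1 numpy vector)."""
    table = {}
    for u, code_u in user_codes.items():
        dists = [(v, int(np.count_nonzero(code_u != code_v)))
                 for v, code_v in user_codes.items() if v != u]
        table[u] = [v for v, _ in sorted(dists, key=lambda p: p[1])[:top_k]]
    return table

def mixed_top_n(content_scores, cf_scores, alpha=0.5, n=10):
    """Weighted blend of the content-based and collaborative-filtering
    lists (item id -> score); the weight alpha is illustrative."""
    items = set(content_scores) | set(cf_scores)
    blended = {i: alpha * content_scores.get(i, 0.0)
                  + (1 - alpha) * cf_scores.get(i, 0.0)
               for i in items}
    return sorted(blended, key=blended.get, reverse=True)[:n]
```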
The unsupervised-learning unified feature extractor disclosed by the invention is innovative throughout its overall design and mode of operation:
1. Innovation in design: the design of the deep auto-encoder merges the characteristics of the contractive auto-encoder and the denoising auto-encoder and devises a new objective function. Structurally, the deep structure (2-3 hidden layers) improves the extraction of higher-order statistical information, and the number of neurons decreases from hidden layer to hidden layer, realizing the asymmetry of coding and decoding; this helps remedy the over-fitting that auto-encoders are prone to, improves the robustness of feature extraction, and realizes dimensionality reduction while extracting features. After training, a binarization generation layer replaces the output layer, giving an unsupervised-learning feature extractor that can generate binarized features, convenient for Hamming-distance comparison and for further operations such as hash comparison.
2. Innovation in the training method: in the past, single-hidden-layer auto-encoders were usually trained by comparing the input and output data and back-propagating the resulting error to update the network parameters, while multi-layer auto-encoders generally add a classifier such as softmax after the last hidden layer following unsupervised layer-by-layer pre-training and perform supervised learning with class labels, so that the whole is semi-supervised learning. The deep auto-encoder of the invention comprehensively considers network depth and computational efficiency and also compares the input data at the output end, back-propagating the resulting error, thereby realizing completely unsupervised learning.
3. Innovation in application: the method not only improves the efficiency of applications such as recommender systems but also effectively protects personal privacy. For recommendation efficiency, by taking the features extracted from news texts as the user's news-interest and preference features, unified feature extraction (user features and item features) is realized, as is the unification of hybrid recommendation methods on a common technical foundation. Moreover, data such as the user's demographic information is avoided: what is extracted is an abstract higher-order statistical vector containing no explicit user data, so even if it were illegally obtained it would not leak user information, thereby realizing privacy protection and meeting the country's increasingly strict requirements for protecting personal privacy information.
4. Innovation in the training data: existing methods such as collaborative filtering compute the similarity between users and between items from users' ratings of goods or media content, but current users seldom rate the news they read, so rating data is sparse and training data insufficient. This application directly uses news data and user access data as the training data of the deep auto-encoder, which has the following characteristics: first, it avoids the data scarcity in training; second, it uses no third-party corpus and stays closer to reality.
In practical applications, precision and recall are the two most important indexes used in evaluating recommender systems. Actual tests show that the features extracted with the unsupervised-learning unified feature extractor constructed in this application match the recommendation method well, so that the new personalized recommendation method achieves good results in both precision and recall compared with currently popular methods, as shown in Fig. 5 and Fig. 6.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced; such modifications or replacements do not depart the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the invention.

Claims (6)

1. A construction method for an unsupervised-learning unified feature extractor, characterized in that the construction method comprises the following steps:
S1. obtaining actual news text data and user access data from the server side, and generating a news feature training data set after sorting and randomization;
S2. pre-processing the data in the news feature training data set with a current Chinese word segmentation tool to obtain a pre-processed news feature training data set;
S3. obtaining a news feature training vector set from the pre-processed news feature training data set by the TF-IDF method;
S4. sorting the news feature training vector set according to the user access data to form a user feature training data set;
S5. constructing a stacked asymmetric denoising contractive auto-encoder with multiple hidden layers, using J_SA-CDAE as the objective function:

J_SA-CDAE(θ) = Σ_{x_i ∈ t} L_MC(x_i, g_θ(f_θ(x_i))) + λ‖J(x)‖_F²

where

L_MC(x_i, x̂_i) = −k_σ(x_i − x̂_i)

and k_σ is a Gaussian kernel whose standard deviation σ is taken as 1.0:

k_σ(z) = (1/(√(2π)·σ)) · exp(−z²/(2σ²))

where x denotes the encoder input, f_θ(·) the encoder output, and g_θ(·) the decoder output; L_MC(·) is the cost function of a single input, λ is the regularization parameter of the contractive auto-encoder, ‖·‖_F is the Frobenius norm, J(x) is the encoder Jacobian matrix, θ is the parameter set of the deep auto-encoder, x_i is the encoder input in one training pass, x̂_i is the decoder's reconstructed output, t is the training set, and z is the argument of the Gaussian kernel;
S6. training the deep auto-encoder, with the following training steps:
S61. using the news feature training vector set as the training data of the deep auto-encoder;
S62. adding white Gaussian noise to the training data to generate noisy input data;
S63. feeding the noisy input data to the deep auto-encoder; training with mini-batch gradient descent, first performing unsupervised layer-by-layer pre-training to obtain the initial parameters of each hidden layer and the output data of the output layer;
S64. comparing the input training data with the output data in the objective function, realizing back-propagation of the gradient, and adjusting the initial parameters of each hidden layer;
S65. after training is complete, obtaining the parameter set of the deep auto-encoder;
S7. removing the decoder part of the deep auto-encoder and adding a binarization generation layer after the output of the last hidden layer, completing the construction of the unsupervised-learning unified feature extractor.
2. The construction method for an unsupervised-learning unified feature extractor according to claim 1, characterized in that:
step S1, obtaining actual news text data and user access data from the server side and generating a news feature training data set after sorting and randomization, specifically includes the following steps:
S11. collecting the news data and user access data on the server within a certain period;
S12. removing pictures and videos from the news data, encoding uniformly as UTF-8, setting a serial number for each news item, and forming a news data set;
S13. randomly re-ordering the news items in the news data set by serial number, then dividing them in a certain proportion into the news feature training data sets for the layer-by-layer unsupervised pre-training stage and the global training stage.
3. The construction method for an unsupervised-learning unified feature extractor according to claim 1, characterized in that:
the stacked asymmetric denoising contractive auto-encoder with multiple hidden layers constructed in step S5 includes two hidden layers.
4. The construction method for an unsupervised-learning unified feature extractor according to claim 3, characterized in that:
the coding function of the first hidden layer is h_1(x_i) = S(w_1 x_i + b_1), with pre-training decoding function ĝ_1(h_1) = S(w_1′ h_1 + b_1′);
the coding function of the second hidden layer is h_2(h_1) = S(w_2 h_1 + b_2), with pre-training decoding function ĝ_2(h_2) = S(w_2′ h_2 + b_2′);
the global-training decoding function from the second hidden layer to the output layer is g_o(h_2) = S(w_o h_2 + b_o);
the initial parameters of each layer use random numbers in [0, 1], and the nonlinear activation function S(·) is uniformly the Sigmoid function; e is Euler's number, h denotes the coding function of a hidden layer, g a decoding function, b a bias, x the encoder input, and w_1, w_2 the weight parameters of the first and second hidden layers respectively.
5. The construction method for an unsupervised-learning unified feature extractor according to claim 1, characterized in that:
the binarization generation layer in step S7 has the same dimension as the last hidden layer of the deep auto-encoder, with a one-to-one connection to each neuron of the last hidden layer; the binarization generation layer sets a weight adjuster according to the output of the last hidden layer to realize threshold adjustment, the threshold T in the weight adjuster being chosen so that the output results of one complete training pass are divided into two classes whose between-class variance is maximal.
6. The construction method for an unsupervised-learning unified feature extractor according to claim 1, characterized by further including:
S8. inputting the user feature training vector set into the unsupervised-learning unified feature extractor to obtain user preference models, and, according to the preference model of each user, generating a unified user neighbor table through similarity comparison.
CN201810117102.XA 2018-02-06 2018-02-06 Method for constructing an unsupervised-learning unified feature extractor Active CN108304359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810117102.XA 2018-02-06 2018-02-06 Method for constructing an unsupervised-learning unified feature extractor

Publications (2)

Publication Number Publication Date
CN108304359A (en) 2018-07-20
CN108304359B (en) 2019-06-14

Family

ID=62864632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810117102.XA Active CN108304359B (en) Method for constructing an unsupervised-learning unified feature extractor

Country Status (1)

Country Link
CN (1) CN108304359B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344391B (en) * 2018-08-23 2022-10-21 昆明理工大学 Multi-feature fusion Chinese news text abstract generation method based on neural network
CN109614984A (en) * 2018-10-29 2019-04-12 深圳北斗应用技术研究院有限公司 A kind of homologous image detecting method and system
CN109598336A (en) * 2018-12-05 2019-04-09 国网江西省电力有限公司信息通信分公司 A kind of Data Reduction method encoding neural network certainly based on stack noise reduction
CN109635303B (en) * 2018-12-19 2020-08-25 中国科学技术大学 Method for recognizing meaning-changing words in specific field
CN110022313B (en) * 2019-03-25 2021-09-17 河北师范大学 Polymorphic worm feature extraction and polymorphic worm identification method based on machine learning
CN110136226B (en) * 2019-04-08 2023-12-22 华南理工大学 News automatic image distribution method based on image group collaborative description generation
KR20210011844A (en) * 2019-07-23 2021-02-02 삼성전자주식회사 Electronic apparatus and method for controlling thereof
CN110442804A (en) * 2019-08-13 2019-11-12 北京市商汤科技开发有限公司 A kind of training method, device, equipment and the storage medium of object recommendation network
CN110648282B (en) * 2019-09-29 2021-03-23 燕山大学 Image super-resolution reconstruction method and system based on width neural network
CN111368205B (en) * 2020-03-09 2021-04-06 腾讯科技(深圳)有限公司 Data recommendation method and device, computer equipment and storage medium
CN113497938A (en) * 2020-03-19 2021-10-12 华为技术有限公司 Method and device for compressing and decompressing image based on variational self-encoder
CN112116029A (en) * 2020-09-25 2020-12-22 天津工业大学 Intelligent fault diagnosis method for gearbox with multi-scale structure and characteristic fusion
CN115146689A (en) * 2021-03-16 2022-10-04 天津大学 Deep learning-based power system high-dimensional measurement data dimension reduction method
CN113441421B (en) * 2021-07-22 2022-12-13 北京信息科技大学 Automatic garbage classification system and method
CN114417427B (en) * 2022-03-30 2022-08-02 浙江大学 Deep learning-oriented data sensitivity attribute desensitization system and method


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9668699B2 (en) * 2013-10-17 2017-06-06 Siemens Healthcare Gmbh Method and system for anatomical object detection using marginal space deep neural networks
CN106295245B (en) * 2016-07-27 2019-08-30 广州麦仑信息科技有限公司 Method of the storehouse noise reduction based on Caffe from coding gene information feature extraction
CN106803062A (en) * 2016-12-20 2017-06-06 陕西师范大学 The recognition methods of stack noise reduction own coding neutral net images of gestures

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
CN105550677A (en) * 2016-02-02 2016-05-04 河北大学 3D palm print identification method
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhao Feixiang, "Radar target recognition method based on a stacked denoising sparse auto-encoder", 《雷达学报》 (Journal of Radars), vol. 6, no. 2, 2017-04-30, full text
Qiu Shuang et al., "Chinese short text classification based on stacked denoising auto-encoders", 《内蒙古民族大学学报》 (Journal of Inner Mongolia University for Nationalities), vol. 32, no. 5, 2017-09-30, full text

Also Published As

Publication number Publication date
CN108304359A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108304359B (en) Method for constructing an unsupervised-learning unified feature extractor
Zhou et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt
Brown et al. Smooth-ap: Smoothing the path towards large-scale image retrieval
Yang et al. Fashion captioning: Towards generating accurate descriptions with semantic rewards
Agrawal Clickbait detection using deep learning
CN109145112A (en) Commodity-review classification method based on a global-information attention mechanism
Sheu et al. Knowledge-guided article embedding refinement for session-based news recommendation
Xiao et al. Using convolution control block for Chinese sentiment analysis
Sun et al. Chinese herbal medicine image recognition and retrieval by convolutional neural network
Wang et al. Cross-domain recommendation with user personality
CN113569001A (en) Text processing method and device, computer equipment and computer readable storage medium
Yang et al. Read, attend and comment: A deep architecture for automatic news comment generation
Hu et al. Attentive interactive convolutional matching for community question answering in social multimedia
CN111814453A (en) Fine-grained emotion analysis method based on BiLSTM-TextCNN
Liu et al. Identifying experts in community question answering website based on graph convolutional neural network
Du et al. Sentiment analysis method based on piecewise convolutional neural network and generative adversarial network
CN111680190A (en) Video thumbnail recommendation method fusing visual semantic information
Yilmaz et al. Inferring political alignments of Twitter users
CN117033751A (en) Recommended information processing method, recommended information processing device, storage medium and equipment
Banerjee et al. Recommendation of compatible outfits conditioned on style
Cao et al. Video-based recipe retrieval
Deng et al. A depression tendency detection model fusing weibo content and user behavior
Jaya et al. Analysis of convolution neural network for transfer learning of sentiment analysis in Indonesian tweets
Qian et al. Multi-hop interactive attention based classification network for expert recommendation
CN110909167A (en) Microblog text classification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant