CN108304359B - Method for constructing an unsupervised learning unified feature extractor - Google Patents

Method for constructing an unsupervised learning unified feature extractor

Info

Publication number
CN108304359B
CN108304359B (application CN201810117102.XA)
Authority
CN
China
Prior art keywords
training
news
data
encoder
layer
Prior art date
Legal status
Active
Application number
CN201810117102.XA
Other languages
Chinese (zh)
Other versions
CN108304359A (en)
Inventor
杨楠
曹三省
Current Assignee
Communication University of China
Original Assignee
Communication University of China
Priority date
Filing date
Publication date
Application filed by Communication University of China
Priority to CN201810117102.XA
Publication of CN108304359A
Application granted
Publication of CN108304359B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES
    • G06Q 30/00 Commerce
    • G06Q 30/06 Buying, selling or leasing transactions
    • G06Q 30/0601 Electronic shopping [e-shopping]
    • G06Q 30/0631 Item recommendations


Abstract

The application provides a method for constructing an unsupervised learning unified feature extractor, characterized in that: actual news text data are obtained from the server side and a news feature training data set is generated; the data in the news feature training data set are preprocessed and vectorized to obtain a news feature training vector set; the news data set is classified according to user access data to form a user feature training data set; a stacked asymmetric contractive denoising autoencoder with multiple hidden layers is constructed, and the deep autoencoder is trained with a purpose-built objective function; after the deep autoencoder completes training, the decoder part is deleted and a binarization generation layer is added, completing the construction of the unsupervised learning unified feature extractor. The unsupervised learning unified feature extractor provided by the application can realize the unification of news features and user features, the unification of content-based recommendation and collaborative filtering recommendation, and improve the efficiency of real-time recommendation.

Description

Method for constructing an unsupervised learning unified feature extractor
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a method for constructing an unsupervised learning unified feature extractor.
Background
Current recommendation systems or recommendation engines are generally classified into content-based recommendation, collaborative filtering recommendation, hybrid recommendation and other types. They are information tools as important to today's society as search engines and are widely applied in fields such as e-commerce and media recommendation. The currently popular collaborative filtering methods are mainly based on commonality: the similarity between users and the similarity between items are calculated from some users' ratings of commodities or media contents (collectively called "items"); the ratings of other users with similar interests are then used to infer a user's rating of a new item, or the rating is predicted from the similarity between the new item and items the user has shown interest in. The approach is therefore also called rating prediction, but it suffers from insufficient personalization and is hard to apply when rating data are sparse.
Content-based recommendation mainly models a user's preferences and the attributes of items and recommends by matching the two. It is strongly personalized, but modeling and matching user preferences and item attributes is difficult. Past user preference modeling required direct features such as demographics and was prone to invading personal privacy.
Deep learning is a machine learning method that has emerged in recent years and can be divided into supervised learning and unsupervised learning. The autoencoder (AE) is at the leading edge of current unsupervised learning research, but most current deep autoencoding systems have drawbacks, such as susceptibility to overfitting; most do not realize unsupervised learning in the complete sense, which greatly restricts the capability of the deep autoencoder.
With technologies such as artificial intelligence, deep learning and unsupervised learning developing rapidly, new techniques and methods need to be researched to renew the technical basis of recommendation systems, realize effective hybrid recommendation, and greatly improve online recommendation efficiency.
Disclosure of Invention
Aiming at problems in applications such as current converged-media news recommendation (insufficient personalization, difficulty of user feature extraction, the need to unify different methods into an effective hybrid recommendation method, privacy violations during user feature extraction, and real-time recommendation efficiency in need of improvement), and drawing on current novel artificial intelligence technology, the application discloses a method for constructing an Unsupervised Learning Unified Feature Extractor (ULUFE) that extracts a Unified Representation Based on Content (URBC). The construction method comprises the following steps:
S1, acquiring actual news text data and user access data from a server, and generating a news feature training data set after sorting and randomizing;
S2, preprocessing the data in the news feature training data set with a current Chinese word segmentation tool to obtain a preprocessed news feature training data set;
S3, obtaining a news feature training vector set from the preprocessed news feature training data set through the TF-IDF method;
S4, classifying the news feature training vector set according to the user access data to form a user feature training data set;
S5, constructing a stacked asymmetric contractive denoising autoencoder with a plurality of hidden layers, using $J_{SA\text{-}CDAE}$ as the objective function:

$$J_{SA\text{-}CDAE} = \min_{\theta} \sum_{x_i \in t} \left[ L_{MC}\big(x_i,\, g_{\theta}(f_{\theta}(x_i))\big) + \lambda \,\| J(x_i) \|_F^2 \right]$$

wherein

$$L_{MC}\big(x_i,\, \hat{x}_i\big) = -\sum_i k_{\sigma}\big(\hat{x}_i - x_i\big)$$

where $k_{\sigma}$ is a Gaussian kernel with standard deviation $\sigma = 1.0$:

$$k_{\sigma}(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{z^2}{2\sigma^2}\right)$$

where $x$ denotes the input of the encoder, $f_{\theta}(\cdot)$ the encoder output and $g_{\theta}(\cdot)$ the decoder output; $L_{MC}(\cdot)$ is the cost function for a single input, $\lambda$ is the regularization parameter of the contractive autoencoder, $\|\cdot\|_F$ is the Frobenius norm, $J(x)$ is the encoder Jacobian matrix, $\theta$ is the parameter set of the deep autoencoder, $x_i$ is the input of the encoder in one training pass, $\hat{x}_i$ is the output reconstructed by the decoder, $t$ is the training set, and $z$ is the argument of the Gaussian kernel;
S6, training the deep autoencoder, the training steps being as follows:
S61, taking the news feature training vector set as the training data of the deep autoencoder;
S62, adding Gaussian white noise to the training data to generate noisy input data;
S63, taking the noisy input data as the input of the deep autoencoder; during training, a mini-batch gradient descent method is adopted, and unsupervised layer-by-layer pre-training is performed first to obtain the initial parameters of each hidden layer and the output data of the output layer;
S64, comparing the input training data with the output data in the objective function to realize back-propagation of the gradient and adjust the initial parameters of each hidden layer;
S65, obtaining the parameter set of the deep autoencoder after the training is finished;
S7, deleting the decoder part of the deep autoencoder and adding a binarization generation layer after the output of the last hidden layer, completing the construction of the unsupervised learning unified feature extractor.
Preferably, the step S1 of acquiring actual news text data and user access data from the server and generating the news feature training data set after sorting and randomizing specifically includes the following steps:
S11, collecting news data and user access data within a certain time period on the server;
S12, removing pictures and videos from the news data, uniformly encoding the text as UTF-8, and setting a sequence number for each piece of news to form a news data set;
S13, randomizing and reordering the news in the news data set according to the sequence numbers, then using the news, in a certain proportion, as the news feature training data sets of the layer-by-layer unsupervised pre-training stage and the global training stage respectively.
Preferably, the stacked asymmetric contractive denoising autoencoder with multiple hidden layers constructed in step S5 comprises 2 hidden layers.
Preferably, the coding function of the first hidden layer is $h_1(x_i) = S(w_1 x_i + b_1)$ and its pre-training decoding function is $\hat{g}_1(h_1) = S(w_1' h_1 + b_1')$; the coding function of the second hidden layer is $h_2(h_1) = S(w_2 h_1 + b_2)$ and its pre-training decoding function is $\hat{g}_2(h_2) = S(w_2' h_2 + b_2')$;
the global-training decoding function from the second hidden layer to the output layer is $g_o(h_2) = S(w_o h_2 + b_o)$;
the initial parameters of each layer are random numbers in the open interval $(0, 1)$, and the nonlinear activation function $S(\cdot)$ is uniformly the Sigmoid function $S(z) = 1/(1 + e^{-z})$, where $e$ is Euler's number, $h$ denotes the coding function of a hidden layer, $g$ the decoding function, $b$ the bias, $x$ the input of the encoder, and $w_1$, $w_2$ the weight parameters of the first and second hidden layers respectively.
Preferably, the dimension of the binarization generation layer in step S7 is the same as that of the last hidden layer of the deep autoencoder, with a one-to-one connection to each neuron of the last hidden layer; the binarization generation layer is provided with a weight regulator that realizes threshold adjustment according to the output of the last hidden layer, the threshold T in the weight regulator being selected so that the output of one complete training pass is divided into two classes whose between-class variance is maximal.
Preferably, the method further includes step S8: inputting the user feature training vector set into the unsupervised learning unified feature extractor to obtain a user preference model, and generating a unified user neighbor table through similarity comparison of the user preference models of all users.
The advantages of the application are:
1. Aiming at the problems that the manually labeled data required by supervised learning are hard to obtain in real time for rapid network-media recommendation, and that conventional deep autoencoders still need supervised fine-tuning after unsupervised layer-by-layer pre-training, the deep autoencoder of the application realizes whole-process unsupervised learning;
2. A deep structure replaces the single-hidden-layer structure, further improving the ability to learn high-order latent explanatory factors of the content;
3. The encoder and decoder are asymmetric and the hidden-layer dimension is lower than the input-layer dimension, so the nonlinear manifold of the data can be learned and dimensionality reduction is achieved while features are extracted, which is superior to linear manifold methods such as PCA; the asymmetry also serves as a remedy for the autoencoder's tendency to overfit;
4. The features output by the autoencoder lend themselves to binarization; after the binarization generation layer is added, binary features can be generated, so that rapid similarity comparison of users and news in converged media can be handled in recommendation by cosine similarity, Hamming distance, hashing and other methods, with a pronounced effect on rapid recommendation of short news in mobile media;
5. In application, the features extracted from the news data (the unified representation based on content) are used as the features of both the news to be recommended and the user, realizing the unification of the two kinds of features and the unification of content-based recommendation and collaborative filtering recommendation; the recommendation method is thus innovated while user privacy is effectively protected and recommendation efficiency is improved.
Drawings
FIG. 1 is a schematic design of an SA-CDAE according to the present invention;
FIG. 2 is a schematic diagram of the training of the present invention;
FIG. 3 is an unsupervised learning feature extractor of the present invention;
FIG. 4 is a schematic diagram of an online recommendation of the present invention;
FIG. 5 is a graph comparing accuracy rates of the present invention;
FIG. 6 is a chart comparing recall rates of the present invention.
Detailed Description
The specific implementation and detailed steps of the method for constructing the unsupervised learning unified feature extractor of the present invention are further described below:
the method comprises the following steps: data acquisition and preparation
The invention is mainly aimed at website text news and mobile-phone news-client text news in current converged media. Both the news text data and the user access data reside at the server side; in this step the "news feature training data set" is generated, the specific process being as follows:
1) collecting news data and user access data within a certain time period on the server, the news data comprising the historical news on the server and the user access data comprising the list of news IDs read by each user within the period;
2) removing irrelevant content such as pictures and videos from the news data, uniformly encoding the text as UTF-8, and setting a sequence number for each piece of news to form a news data set;
3) randomizing and reordering the news in the news data set according to the sequence numbers, then using the news, in a certain proportion, as the "news feature training data sets" of the layer-by-layer unsupervised pre-training stage and the global training stage respectively (see the sketch below).
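A minimal Python sketch of this randomization and split; the 80/20 proportion between the pre-training and global-training subsets is an assumption, since the text only specifies "a certain proportion":

```python
import random

# hypothetical news data set: (sequence number, UTF-8 news text) pairs
news_dataset = [(i, f"news text {i}") for i in range(1000)]

random.shuffle(news_dataset)              # randomize and reorder
split = int(0.8 * len(news_dataset))      # assumed 80/20 proportion
pretrain_set = news_dataset[:split]       # layer-by-layer unsupervised pre-training stage
global_set = news_dataset[split:]         # global training stage
```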
Step two: text data preprocessing
Chinese word segmentation, stop-word removal and other processing are performed on the data in the news feature training data set with a current Chinese word segmentation tool, giving the preprocessed news feature training data set; a minimal sketch follows.
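A sketch of this preprocessing, assuming the jieba segmenter as the "current Chinese word segmentation tool" and a tiny illustrative stop-word list (a real list would be much larger):

```python
import jieba

STOPWORDS = {"的", "了", "是"}  # illustrative stop-word list, an assumption of this sketch

def preprocess(text):
    # segment Chinese text, then drop stop words and whitespace-only tokens
    return [w for w in jieba.lcut(text) if w.strip() and w not in STOPWORDS]

corpus = [preprocess("人工智能推动了媒体融合的发展")]
```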
Step three: news text data vectorization
The preprocessed news feature training data set is vectorized with the TF-IDF method to obtain the news feature training vector set corresponding to the news feature training data set; TF-IDF is the abbreviation of "term frequency-inverse document frequency".
TF means term frequency; for a term $w$ in a document $d$ it is calculated as:

$$TF(w, d) = \frac{n_{w,d}}{\sum_{k} n_{k,d}}$$

where $n_{w,d}$ is the number of occurrences of $w$ in $d$ and the denominator is the total number of terms in $d$.
IDF means inverse document frequency; over a corpus $D$ it is calculated as:

$$IDF(w) = \log \frac{|D|}{1 + |\{d \in D : w \in d\}|}$$

where the denominator counts the documents containing $w$ (conventionally smoothed by adding 1).
On the premise of keeping the relative positions of terms in the news feature training data set, the initial feature vectors of the data in the news feature training data set are obtained by the TF-IDF method to form the news feature training vector set, the TF-IDF value being calculated as:
TF-IDF = TF * IDF
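A minimal sketch of this vectorization using scikit-learn's TfidfVectorizer; the whitespace-joined hand-off from the segmented corpus and the custom token pattern are assumptions of the illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# each document is the whitespace-joined token list produced in step two
segmented = ["人工智能 推动 媒体 融合 发展", "媒体 新闻 推荐 发展"]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")  # keep pre-segmented tokens intact
news_vectors = vectorizer.fit_transform(segmented)      # sparse (n_docs, n_terms) TF*IDF matrix
```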
step four: obtaining a user feature training dataset
The news feature training vector set is classified according to the user access data to obtain the user feature training vector set; a grouping sketch follows.
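A minimal sketch of this grouping; the access-log format (user mapped to row indices of read news) is an assumption of the illustration:

```python
import numpy as np

news_vectors = np.random.rand(4, 50)                  # stand-in for the TF-IDF rows of step three
access_log = {"user_A1": [0, 2], "user_B7": [1, 3]}   # hypothetical user -> indices of read news

# per-user feature training set: the vectors of the news each user has read
user_feature_sets = {user: news_vectors[rows] for user, rows in access_log.items()}
```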
step five: constructing a stacked asymmetric noise reduction contraction self-encoder
The core component of the unsupervised learning unified feature extractor of the application is a specially designed deep autoencoder, whose role in the invention is mainly embodied in two aspects: feature extraction and dimensionality reduction. Around the application target of converged-media intelligent recommendation, the invention combines the advantages of the contractive autoencoder and the denoising autoencoder and designs a stacked (deep) asymmetric contractive denoising autoencoder (SA-CDAE) with 2 or 3 hidden layers, as shown in FIG. 1. Structurally, multiple hidden layers are adopted to improve on the feature extraction capability of a single hidden layer; the input layer and output layer have the same dimension, the hidden-layer dimension is smaller than the input-layer dimension and decreases proportionally layer by layer, and the encoding and decoding structure is asymmetric, which improves resistance to overfitting. The initial news training vector set obtained after the preceding preparation and preprocessing is on the whole independent and identically distributed, but contains a certain amount of disturbance whose specific distribution is unknown. Denote the vector set as $D = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, $n \in \mathbb{N}$; then:
the coding function of the first hidden layer is $h_1(x_i) = S(w_1 x_i + b_1)$,
the pre-training decoding function of the first hidden layer is $\hat{g}_1(h_1) = S(w_1' h_1 + b_1')$,
the coding function of the second hidden layer is $h_2(h_1) = S(w_2 h_1 + b_2)$,
the pre-training decoding function of the second hidden layer is $\hat{g}_2(h_2) = S(w_2' h_2 + b_2')$,
the global-training decoding function from the second hidden layer to the output layer is $g_o(h_2) = S(w_o h_2 + b_o)$;
the initial parameters of each layer are random numbers in the open interval $(0, 1)$, and the nonlinear activation function $S(\cdot)$ is uniformly the Sigmoid function $S(z) = 1/(1 + e^{-z})$,
where $D$ denotes the initial news training vector set, $\mathbb{R}$ the set of real numbers, $\mathbb{N}$ the set of natural numbers, $h$ the coding function of a hidden layer, $g$ the decoding function, $b$ the bias, $e$ Euler's number, $x$ the input of the encoder, $w_1$, $w_2$ the weight parameters of the first and second hidden layers respectively, and $x_i$ the input of the encoder in one training pass.
The principle of the autoencoder is to make the input of the encoder reappear at the output of the decoder by training an encoding-decoding mechanism; the encoder part is also called the hidden layer and the decoder part the output layer. Reconstructing the input perfectly at the output is neither easy nor useful. Instead, by designing a special structure, adding suitable constraints to the copying, and using special cost functions and training methods, only approximate replication is achieved; the model is forced to copy the input selectively according to the weights, so that useful distributed features of the data are constructed in the encoder. Autoencoders have become a research frontier of generative models in recent years. The prototype autoencoder exhibits good feature extraction capability but easily overfits in use and loses generalization over real data; derivative autoencoders improving and optimizing the prototype were subsequently developed.
The deep autoencoder of the present invention is designed to consider both adding noise and reducing noise (disturbances). Adding noise means, following the idea of denoising autoencoders, adding Gaussian white noise to the input X so that the decoder is forced to remove the interference of the noise when producing the output, improving the system's resistance to overfitting; adding Gaussian white noise to the input during training confers the denoising-autoencoder property and further reduces the overfitting risk. The parameter set θ of the neural network is trained by back-propagation and stochastic gradient descent (SGD).
Reducing noise (disturbances) refers to improving the system's resistance to non-Gaussian noise and disturbances during training. To further reduce the influence of outliers in the news feature data set and the user feature data set, and to lay a basis for the binarization generation adopted later in the scheme, characteristics of the contractive autoencoder are partially adopted in the design. The contractive autoencoder adds an analytic contraction penalty term to the cost function of the prototype autoencoder to reduce the degrees of freedom of the feature representation, so that hidden-layer neurons approach saturation and output data are confined to a certain region of the parameter space. The penalty term is in fact the Frobenius norm of the encoder's Jacobian matrix; it reduces the influence of outliers on the encoder, suppresses disturbances of training samples (on the low-dimensional manifold surface) in all directions, and helps the encoder learn useful data features. Moreover, the distributed representation learned by the contractive autoencoder has the "saturation" property: most hidden-layer units take values close to the two ends (0 or 1) and have partial derivatives with respect to the input close to 0.
In ordinary autoencoder training, the mean squared error (MSE) is often used as the cost function and shows a certain tolerance to Gaussian noise; in this example, however, considering the existence of non-Gaussian disturbances such as occasional reads outside a user's preference, the present embodiment uses maximum correntropy (MC) as the cost function in order to improve robustness:

$$L_{MC}\big(x_i,\, \hat{x}_i\big) = -\sum_i k_{\sigma}\big(\hat{x}_i - x_i\big)$$

where $k_{\sigma}$ is a Gaussian kernel with standard deviation $\sigma = 1.0$:

$$k_{\sigma}(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{z^2}{2\sigma^2}\right)$$

The overall objective function of the deep autoencoder in the invention is:

$$J_{SA\text{-}CDAE} = \min_{\theta} \sum_{x_i \in t} \left[ L_{MC}\big(x_i,\, g_{\theta}(f_{\theta}(x_i))\big) + \lambda \,\| J(x_i) \|_F^2 \right]$$

In the above formulas, $f_{\theta}(\cdot)$ denotes the encoder output and $g_{\theta}(\cdot)$ the decoder output; $L_{MC}(\cdot)$ is the cost function for a single input, $\lambda$ is the regularization parameter of the contractive autoencoder, $\|\cdot\|_F$ is the Frobenius norm, $J(x)$ is the encoder Jacobian matrix, $\theta$ is the parameter set of the deep autoencoder, $x_i$ is the input of the encoder in one training pass, $\hat{x}_i$ is the output reconstructed by the decoder, $t$ is the training set, and $z$ is the argument of the Gaussian kernel.
Step six: training depth autoencoder
Training a neural network means taking cleaned and sorted data as input and letting the parameters of the network's objective function converge gradually through the two links of forward propagation and back-propagation, thereby learning high-order statistical features. As shown in FIG. 2, the deep autoencoder is trained offline; the main training steps are as follows:
1) the news feature training vector set is taken as the training data of the deep autoencoder, denoted X; the training data of the application are thus news data themselves, needing neither manual labeling nor a third-party corpus;
2) Gaussian white noise is added to the training data X to generate noisy input data X₁;
3) X₁ is taken as the input of the deep autoencoder; during training, a mini-batch gradient descent method is adopted, and unsupervised layer-by-layer pre-training is performed first to obtain the initial parameters of each hidden layer and the output X̂ of the output layer;
4) X and X̂ are compared in the objective function to realize global back-propagation of the gradient and adjust the initial parameters of all hidden layers;
5) after the training is finished, the parameter set of the deep autoencoder is obtained and used in the next step to construct the unsupervised learning unified feature extractor. A sketch of this schedule follows the next paragraph.
The deep autoencoder of the application takes a brand-new approach to structure, cost function and training mode; it can reduce dimensionality while extracting features, can learn nonlinear manifolds, and in dimensionality reduction is greatly superior to linear manifold methods such as PCA (principal component analysis). In addition, exploiting the parallel nature of neural networks, GPU parallel computing is used to accelerate the main training steps of the deep autoencoder, greatly improving its training efficiency and hence the practical efficiency of the recommendation system.
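A hedged sketch of the schedule in steps 1)-5), reusing SACDAE and correntropy_loss from the previous sketch; the noise level, epoch count and learning rate are assumed values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def add_noise(x, std=0.1):
    return x + std * torch.randn_like(x)         # step 2): Gaussian white noise

def pretrain_layer(enc, loader, epochs=5, lr=0.1):
    """Step 3): pre-train one hidden layer as a stand-alone denoising
    autoencoder; the throwaway decoder is discarded afterwards."""
    act = torch.nn.Sigmoid()
    dec = torch.nn.Linear(enc.out_features, enc.in_features)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        for (x,) in loader:
            loss = correntropy_loss(x, act(dec(act(enc(add_noise(x))))))
            opt.zero_grad(); loss.backward(); opt.step()

model = SACDAE()                                 # from the previous sketch
X = torch.rand(1024, 2000)                       # news feature training vectors
pretrain_layer(model.enc1, DataLoader(TensorDataset(X), batch_size=32, shuffle=True))
with torch.no_grad():                            # layer-2 pre-training inputs are the layer-1 codes
    H1 = torch.sigmoid(model.enc1(X))
pretrain_layer(model.enc2, DataLoader(TensorDataset(H1), batch_size=32, shuffle=True))
# Steps 4)-5): global fine-tuning then back-propagates the full J_SA-CDAE
# objective through all layers, exactly as in the single step shown earlier.
```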
Step seven: constructing an unsupervised learning unified feature extractor
The trained output of the deep autoencoder is easy to binarize. For this reason, after the deep autoencoder finishes training, the decoder part is deleted and a binarization generation layer is added after the output of the last hidden layer to perform the binarization; as shown in FIG. 3, this completes the construction of the unsupervised learning unified feature extractor.
In this embodiment, about 70% of the outputs of the deep autoencoder are close to 0 or 1 and easy to binarize, but how the remaining 30% are handled directly affects the overall binarization effect and the accuracy of subsequent similarity comparison. Structurally, therefore, the binarization generation layer is designed with the same dimension as the last hidden layer of the deep autoencoder and connected one-to-one to each of its neurons. Internally, instead of a common fixed threshold, a weight regulator realizes threshold adjustment according to the actual distribution of the last hidden layer's output; the selection principle for the threshold T in the weight regulator is that the output of one complete training pass can be divided into two classes whose between-class variance is maximal.
After one complete training pass, let K be the total output set of all hidden-layer units, containing N distinct values. Sorting K from small to large gives the sequence $(k_1, k_2, \ldots, k_N)$, each $k_i$ occurring $n_i$ times, $i \in [1, N]$, so that $k_i$ occurs with probability $p_i = n_i / \sum_j n_j$. A candidate threshold divides the sequence into two groups $K_1$ and $K_2$ of sizes $t$ and $N - t$, whose probabilities of occurrence in the whole are $\varepsilon_1 = \sum_{i \le t} p_i$ and $\varepsilon_2 = 1 - \varepsilon_1$, and whose means are $\beta_1 = \frac{1}{\varepsilon_1}\sum_{i \le t} p_i k_i$ and $\beta_2 = \frac{1}{\varepsilon_2}\sum_{i > t} p_i k_i$. The mean of the data set K is $\beta = \varepsilon_1 \beta_1 + \varepsilon_2 \beta_2$, and the between-class variance of the two groups is defined as $\delta(t) = \varepsilon_1(\beta_1 - \beta)^2 + \varepsilon_2(\beta_2 - \beta)^2$. Taking $T = \arg\max_t \delta(t)$, i.e. the value of K at the position where $\delta(t)$ is maximal, as the threshold, values less than or equal to T are set to 0 and the rest to 1, realizing the binarization of the hidden-layer output.
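A minimal NumPy sketch of this between-class-variance (Otsu-style) threshold search; the 256-point candidate grid and the random stand-in for pooled activations are assumptions of the illustration:

```python
import numpy as np

def best_threshold(outputs, candidates=256):
    """Pick T maximizing the between-class variance delta(t)."""
    vals = np.sort(outputs.ravel())
    best_t, best_var = vals[0], -1.0
    for t in np.linspace(vals[0], vals[-1], candidates):
        g1, g2 = vals[vals <= t], vals[vals > t]
        if g1.size == 0 or g2.size == 0:
            continue
        e1, e2 = g1.size / vals.size, g2.size / vals.size   # group probabilities
        b1, b2 = g1.mean(), g2.mean()                       # group means
        b = e1 * b1 + e2 * b2                               # overall mean
        var = e1 * (b1 - b) ** 2 + e2 * (b2 - b) ** 2       # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

outputs = np.random.rand(10000)          # stand-in for pooled hidden-layer activations
T = best_threshold(outputs)
binary = (outputs > T).astype(np.uint8)  # values <= T -> 0, the rest -> 1
```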
Step eight: obtaining a user preference model and a user neighbor table
After the construction of the unsupervised learning unified feature extractor is completed, the user feature training vector set is input into it to obtain each user's preference model, and a unified user neighbor table is generated through similarity comparison of the user preference models of all users.
FIG. 4 shows an example of personalized recommendation using the unsupervised learning unified feature extractor. All news texts to be recommended are preprocessed and vectorized, then input into the unsupervised learning unified feature extractor to obtain the feature vectors of the news to be recommended, expressed as unified representations based on content. The similarity between each news feature vector and a user's preference model is compared to generate a content-based recommendation list; the user neighbor table is used to generate a collaborative filtering recommendation list from the news read by users similar to user A1; after weighted mixing, the Top-N recommendation list of the hybrid recommendation is obtained. A similarity-comparison sketch follows.
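A minimal sketch of the online similarity comparison over the binary features; the vector sizes, the Hamming metric on the content side, the stand-in collaborative scores and the equal mixing weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
user_model = rng.integers(0, 2, 128)            # binary preference model of user A1
news_feats = rng.integers(0, 2, (500, 128))     # binary features of candidate news

# content-based list: news with the smallest Hamming distance to the preference model
hamming = (news_feats != user_model).sum(axis=1)
content_list = np.argsort(hamming)[:10]

# collaborative list: stand-in scores that would come from the user neighbor table
collab_scores = rng.random(500)

# weighted mixing of the two signals into the final Top-N (assumed equal weights)
mixed = 0.5 * (1 - hamming / 128) + 0.5 * collab_scores
top_n = np.argsort(mixed)[::-1][:10]
```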
The unsupervised learning unified feature extractor disclosed by the invention is innovative in overall design and in mode of application:
1. Innovation in design: the design of the deep autoencoder integrates the characteristics of the contractive autoencoder and the denoising autoencoder and introduces a new objective function; the deep structure (2-3 hidden layers) structurally improves the extraction of high-order statistical information, and the number of neurons decreases from hidden layer to hidden layer, making the encoding and decoding of the deep autoencoder asymmetric. This helps remedy the overfitting an autoencoder is prone to, improves the robustness of feature extraction, and achieves dimensionality reduction while extracting features. After training, a binarization generation layer replaces the output layer to obtain the unsupervised learning feature extractor, which can generate binary features convenient for Hamming-distance comparison, hash comparison and similar operations.
2. Innovation in training mode: in the past, single-hidden-layer autoencoders used the input and output wholly as comparison data and updated the network parameters by back-propagating the resulting error; multi-layer autoencoders generally added, after unsupervised layer-by-layer pre-training, a classifier such as softmax behind the last hidden layer for supervised learning against class labels, so the whole remained semi-supervised. The deep autoencoder of the application comprehensively considers network depth and computational efficiency: the input data are likewise used for comparison at the output end and the resulting error is back-propagated, realizing complete unsupervised learning.
3. Innovation in application: the efficiency of applications such as recommendation systems is improved and personal privacy is effectively protected. For recommendation efficiency, the features extracted from news texts are used as the user's likes and preferences for news, realizing unified extraction of features (user features and item features) and the unification of the hybrid recommendation method on a common technical basis. Data such as user demographics are avoided thanks to the high-order statistical features: the extracted vectors are abstract information containing no explicit user data, so user information cannot be leaked even if the vectors are obtained illegally, realizing privacy protection and meeting the increasingly strict national requirements for protecting personal private information.
4. Innovation in training data: existing methods such as collaborative filtering compute user-user and item-item similarity from users' ratings of commodities or media content, but today's users rarely rate the news they read, so rating data are scarce and training data insufficient. The application uses news data and user access data directly as the training data of the deep autoencoder, which firstly avoids the lack of training data and secondly uses no third-party corpus, making the method more practical.
In practical applications, precision and recall are the two most important indicators used in recommendation-system evaluation. Practical tests show that the features extracted by the unsupervised learning unified feature extractor constructed by the method match the recommendation method well. Compared with currently popular methods, the novel personalized recommendation method performs well in both precision and recall, as shown in FIGS. 5 and 6.
Finally, it should be noted that: the above-mentioned embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method for constructing an unsupervised learning unified feature extractor, characterized by comprising the following steps:
S1, acquiring actual news text data and user access data from a server, and generating a news feature training data set after sorting and randomizing;
S2, preprocessing the data in the news feature training data set with a current Chinese word segmentation tool to obtain a preprocessed news feature training data set;
S3, obtaining a news feature training vector set from the preprocessed news feature training data set through the TF-IDF method;
S4, classifying the news feature training vector set according to the user access data to form a user feature training data set;
S5, constructing a stacked asymmetric contractive denoising autoencoder with a plurality of hidden layers, using $J_{SA\text{-}CDAE}$ as the objective function:

$$J_{SA\text{-}CDAE} = \min_{\theta} \sum_{x_i \in t} \left[ L_{MC}\big(x_i,\, g_{\theta}(f_{\theta}(x_i))\big) + \lambda \,\| J(x_i) \|_F^2 \right]$$

wherein

$$L_{MC}\big(x_i,\, \hat{x}_i\big) = -\sum_i k_{\sigma}\big(\hat{x}_i - x_i\big)$$

where $k_{\sigma}$ is a Gaussian kernel with standard deviation $\sigma = 1.0$:

$$k_{\sigma}(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{z^2}{2\sigma^2}\right)$$

where $x$ denotes the input of the encoder, $f_{\theta}(\cdot)$ the encoder output and $g_{\theta}(\cdot)$ the decoder output; $L_{MC}(\cdot)$ is the cost function for a single input, $\lambda$ is the regularization parameter of the contractive autoencoder, $\|\cdot\|_F$ is the Frobenius norm, $J(x)$ is the encoder Jacobian matrix, $\theta$ is the parameter set of the deep autoencoder, $x_i$ is the input of the encoder in one training pass, $\hat{x}_i$ is the output reconstructed by the decoder, $t$ is the training set, and $z$ is the argument of the Gaussian kernel;
S6, training the deep autoencoder, the training steps being as follows:
S61, taking the news feature training vector set as the training data of the deep autoencoder;
S62, adding Gaussian white noise to the training data to generate noisy input data;
S63, taking the noisy input data as the input of the deep autoencoder; during training, a mini-batch gradient descent method is adopted, and unsupervised layer-by-layer pre-training is performed first to obtain the initial parameters of each hidden layer and the output data of the output layer;
S64, comparing the input training data with the output data in the objective function to realize back-propagation of the gradient and adjust the initial parameters of each hidden layer;
S65, obtaining the parameter set of the deep autoencoder after the training is finished;
S7, removing the decoder part of the deep autoencoder and adding a binarization generation layer after the output of the last hidden layer to complete the construction of the unsupervised learning unified feature extractor.
2. The method for constructing an unsupervised learning unified feature extractor according to claim 1, characterized in that:
the step S1 of acquiring actual news text data and user access data from the server and generating the news feature training data set after sorting and randomizing specifically includes the following steps:
S11, collecting news data and user access data within a certain time period on the server;
S12, removing pictures and videos from the news data, uniformly encoding the text as UTF-8, and setting a sequence number for each piece of news to form a news data set;
S13, randomizing and reordering the news in the news data set according to the sequence numbers, then using the news, in a certain proportion, as the news feature training data sets of the layer-by-layer unsupervised pre-training stage and the global training stage respectively.
3. The method for constructing an unsupervised learning unified feature extractor according to claim 1, characterized in that:
the stacked asymmetric contractive denoising autoencoder with multiple hidden layers constructed in step S5 comprises 2 hidden layers.
4. The method for constructing an unsupervised learning unified feature extractor according to claim 3, characterized in that:
the coding function of the first hidden layer is $h_1(x_i) = S(w_1 x_i + b_1)$ and its pre-training decoding function is $\hat{g}_1(h_1) = S(w_1' h_1 + b_1')$;
the coding function of the second hidden layer is $h_2(h_1) = S(w_2 h_1 + b_2)$ and its pre-training decoding function is $\hat{g}_2(h_2) = S(w_2' h_2 + b_2')$;
the global-training decoding function from the second hidden layer to the output layer is $g_o(h_2) = S(w_o h_2 + b_o)$;
the initial parameters of each layer are random numbers in the open interval $(0, 1)$, and the nonlinear activation function $S(\cdot)$ is uniformly the Sigmoid function $S(z) = 1/(1 + e^{-z})$, where $e$ is Euler's number, $h$ denotes the coding function of a hidden layer, $g$ the decoding function, $b$ the bias, $x$ the input of the encoder, and $w_1$, $w_2$ the weight parameters of the first and second hidden layers respectively.
5. The method for constructing an unsupervised learning unified feature extractor according to claim 1, characterized in that:
the dimension of the binarization generation layer in step S7 is the same as that of the last hidden layer of the deep autoencoder, with a one-to-one connection to each neuron of the last hidden layer; the binarization generation layer is provided with a weight regulator that realizes threshold adjustment according to the output of the last hidden layer, the threshold T in the weight regulator being selected so that the output of one complete training pass is divided into two classes whose between-class variance is maximal.
6. The method for constructing an unsupervised learning unified feature extractor according to claim 1, further comprising:
S8, inputting the user feature training vector set into the unsupervised learning unified feature extractor to obtain a user preference model, and generating a unified user neighbor table through similarity comparison of the user preference models of all users.
CN201810117102.XA 2018-02-06 2018-02-06 Method for constructing an unsupervised learning unified feature extractor Active CN108304359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810117102.XA CN108304359B (en) 2018-02-06 2018-02-06 Method for constructing an unsupervised learning unified feature extractor


Publications (2)

Publication Number Publication Date
CN108304359A CN108304359A (en) 2018-07-20
CN108304359B (granted) 2019-06-14

Family

ID=62864632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810117102.XA Active Method for constructing an unsupervised learning unified feature extractor CN108304359B (en)

Country Status (1)

Country Link
CN (1) CN108304359B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344391B (en) * 2018-08-23 2022-10-21 昆明理工大学 Multi-feature fusion Chinese news text abstract generation method based on neural network
CN109614984A (en) * 2018-10-29 2019-04-12 深圳北斗应用技术研究院有限公司 A kind of homologous image detecting method and system
CN109598336A (en) * 2018-12-05 2019-04-09 国网江西省电力有限公司信息通信分公司 A kind of Data Reduction method encoding neural network certainly based on stack noise reduction
CN109635303B (en) * 2018-12-19 2020-08-25 中国科学技术大学 Method for recognizing meaning-changing words in specific field
CN110022313B (en) * 2019-03-25 2021-09-17 河北师范大学 Polymorphic worm feature extraction and polymorphic worm identification method based on machine learning
CN110136226B (en) * 2019-04-08 2023-12-22 华南理工大学 News automatic image distribution method based on image group collaborative description generation
KR20210011844A (en) * 2019-07-23 2021-02-02 삼성전자주식회사 Electronic apparatus and method for controlling thereof
CN110442804A (en) * 2019-08-13 2019-11-12 北京市商汤科技开发有限公司 A kind of training method, device, equipment and the storage medium of object recommendation network
CN110648282B (en) * 2019-09-29 2021-03-23 燕山大学 Image super-resolution reconstruction method and system based on width neural network
CN112651221A (en) * 2019-10-10 2021-04-13 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111368205B (en) * 2020-03-09 2021-04-06 腾讯科技(深圳)有限公司 Data recommendation method and device, computer equipment and storage medium
CN113497938A (en) * 2020-03-19 2021-10-12 华为技术有限公司 Method and device for compressing and decompressing image based on variational self-encoder
CN112116029A (en) * 2020-09-25 2020-12-22 天津工业大学 Intelligent fault diagnosis method for gearbox with multi-scale structure and characteristic fusion
CN115146689A (en) * 2021-03-16 2022-10-04 天津大学 Deep learning-based power system high-dimensional measurement data dimension reduction method
CN113441421B (en) * 2021-07-22 2022-12-13 北京信息科技大学 Automatic garbage classification system and method
CN114417427B (en) * 2022-03-30 2022-08-02 浙江大学 Deep learning-oriented data sensitivity attribute desensitization system and method


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9668699B2 (en) * 2013-10-17 2017-06-06 Siemens Healthcare Gmbh Method and system for anatomical object detection using marginal space deep neural networks
CN106295245B (en) * 2016-07-27 2019-08-30 广州麦仑信息科技有限公司 Method of the storehouse noise reduction based on Caffe from coding gene information feature extraction
CN106803062A (en) * 2016-12-20 2017-06-06 陕西师范大学 The recognition methods of stack noise reduction own coding neutral net images of gestures

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
CN105550677A (en) * 2016-02-02 2016-05-04 河北大学 3D palm print identification method
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Radar target recognition method based on a stacked denoising sparse autoencoder; Zhao Feixiang; Journal of Radars; 2017-04-30; vol. 6, no. 2; full text
Chinese short text classification based on a stacked denoising autoencoder; Qiu Shuang et al.; Journal of Inner Mongolia University for Nationalities; 2017-09-30; vol. 32, no. 5; full text

Also Published As

Publication number Publication date
CN108304359A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108304359B (en) Method for constructing an unsupervised learning unified feature extractor
CN108363804B (en) Local model weighted fusion Top-N movie recommendation method based on user clustering
CN109145112B (en) Commodity comment classification method based on global information attention mechanism
CN107832663B (en) Multi-modal emotion analysis method based on quantum theory
CN107608956B (en) Reader emotion distribution prediction algorithm based on CNN-GRNN
Zhang et al. Face sketch synthesis via sparse representation-based greedy search
Zhang et al. Preference preserving hashing for efficient recommendation
CN109886020A (en) Software vulnerability automatic classification method based on deep neural network
CN112417306B (en) Method for optimizing performance of recommendation algorithm based on knowledge graph
CN111127146B (en) Information recommendation method and system based on convolutional neural network and noise reduction self-encoder
CN107895303B (en) Personalized recommendation method based on OCEAN model
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
CN110781401A (en) Top-n project recommendation method based on collaborative autoregressive flow
CN113343125A (en) Academic-precision-recommendation-oriented heterogeneous scientific research information integration method and system
CN117333037A (en) Industrial brain construction method and device for publishing big data
CN113204522A (en) Large-scale data retrieval method based on Hash algorithm combined with generation countermeasure network
CN116595975A (en) Aspect-level emotion analysis method for word information enhancement based on sentence information
CN116680363A (en) Emotion analysis method based on multi-mode comment data
CN115408605A (en) Neural network recommendation method and system based on side information and attention mechanism
CN112085158A (en) Book recommendation method based on stack noise reduction self-encoder
Lu et al. Recommender system based on scarce information mining
Chen et al. Exploiting visual contents in posters and still frames for movie recommendation
CN115481236A (en) News recommendation method based on user interest modeling
Li et al. Otcmr: Bridging heterogeneity gap with optimal transport for cross-modal retrieval
CN114817566A (en) Emotion reason pair extraction method based on emotion embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant