CN110334204B - Exercise similarity calculation recommendation method based on user records - Google Patents


Info

Publication number
CN110334204B
Authority
CN
China
Prior art keywords
exercise
exercises
target
user
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910444120.3A
Other languages
Chinese (zh)
Other versions
CN110334204A (en)
Inventor
王汉武
骆益军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN201910444120.3A priority Critical patent/CN110334204B/en
Publication of CN110334204A publication Critical patent/CN110334204A/en
Application granted granted Critical
Publication of CN110334204B publication Critical patent/CN110334204B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a method for calculating exercise similarity and recommending exercises based on user records. It effectively combines the advantages of the item2vec idea and a convolutional neural network to solve a long-standing difficulty in exercise recommendation: exercises contain many formula symbols and have a complex content structure, so similar exercise types are hard to match semantically. The method segments each exercise into words from a natural-language-processing perspective, learns the specific grammatical and semantic meaning of the exercise, and matches similar exercise types at the level of word meaning. As a result, the recommendation system can recommend better-matched similar exercise types, and the quality of exercise recommendation is improved.

Description

Exercise similarity calculation recommendation method based on user records
Technical field:
The invention belongs to the field of software, and particularly relates to a method for calculating exercise similarity and recommending exercises based on user records.
Background art:
The most commonly used machine-learning algorithms for detecting text similarity, such as TF-IDF, LSA and LDA, can reach a certain accuracy when the data formats are cleaned and the comparisons are set up properly, but the similarity they capture is only weakly semantic, so the effect in practical recommendation use is not very good: the recommended exercises are basically only superficially similar. Improving the algorithms' understanding at the semantic level, so that the true semantic correlation between exercises is obtained, is therefore very important. Deep-learning-based algorithms are used in many scenarios; models based on LSTM and CNN can learn and represent the semantics of sentences to a certain extent, so deep-learning-based text similarity matching outperforms traditional machine-learning methods. However, exercise text differs fundamentally from ordinary text: its meaning is more convoluted and variable, and it contains various text impurities (mathematical symbols, formulas, etc.), which greatly reduce the accuracy of basic sentence matching on these models. Existing deep-learning-based approaches therefore also have difficulty achieving satisfactory results.
Explanation of terms:
word2vec: a word embedding model proposed by Google in 2013. It is in fact a shallow neural network model with two network structures, CBOW and Skip-gram. This patent mainly uses word2vec with the Skip-gram network structure.
item2vec: applies the word2vec method to recommendation systems: commodity items play the role of words in word2vec, and the set of items a user purchases at one time plays the role of a sentence.
skip-gram network model: a neural network composed of an input layer, a mapping layer and an output layer that infers the context from a target word; that is, the target word is input and the context words are output.
softmax: a normalized exponential function that maps the outputs of multiple neurons into the (0, 1) interval, so that they can be interpreted as probabilities.
Cross entropy: measures the distance between the actual output and the expected output, and is mainly used as the loss function of the model; the smaller the cross-entropy value, the closer the actual output is to the expected output.
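The softmax and cross-entropy definitions above can be illustrated with a small NumPy sketch (all values are illustrative, not from the patent):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability, then normalize the
    # exponentials so the outputs lie in (0, 1) and sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(expected, actual):
    # Distance between the expected (one-hot) distribution and the
    # actual output distribution; smaller means the outputs are closer.
    return -np.sum(expected * np.log(actual + 1e-12))

logits = np.array([2.0, 1.0, 0.1])   # raw neuron outputs
probs = softmax(logits)              # mapped into (0, 1), sums to 1
target = np.array([1.0, 0.0, 0.0])   # expected output
loss = cross_entropy(target, probs)
```

Note that moving the actual output closer to the expected one (e.g. raising the first logit) shrinks the cross-entropy value, which is exactly what the model's training exploits.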
One exercise: the set of exercises a user does within a certain period of time, or a set number of exercises the user does at one time.
Exercises of the same type: two exercises that are the same form of problem under one subject or one knowledge point.
Summary of the invention:
The invention discloses an exercise similarity calculation and recommendation method based on user records. It addresses the inaccurate exercise recommendation that results when the complex structure of exercise content leads to improper similarity calculation, and effectively improves exercise recommendation accuracy.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for calculating and recommending exercise similarity based on user records comprises the following steps:
Step one: treat each exercise as a sentence and perform word segmentation to obtain a word embedding vector for each segmented word in the exercise; then connect the word embedding vectors of all words in each exercise into a matrix, in the order in which the words appear in the exercise, to obtain an exercise matrix representing the exercise information, and process the exercise matrix with a convolutional neural network model: the model convolves with filters of different sizes to obtain multiple output features, and the pooled results of these output features are spliced into a vector1;
Step two: treat each exercise as a whole and calculate the similarity between exercises: each exercise is treated as a word, and the set of exercises a user does at one time is treated as a sentence; the probability that two exercises appear in the same exercise set at the same time is calculated as the similarity of the two exercises; finally, an embedded vector of each exercise, vector2, is obtained;
Step three: splice vector1 and vector2 to obtain a final vector, and train with this vector to obtain a trained model;
Step four: input the user's most recent exercises into the trained model. The output is, for each exercise in the exercise library, the probability that it belongs to the same category as the exercises the user has done, which serves as its recommendation probability. Sort all exercises in the result by this probability, and select the a exercises with the highest recommendation probability that the user has not yet done to display to the user, completing the recommendation task; a is the set number of exercises to recommend.
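The ranking in step four can be sketched as follows (the probabilities, done-set, and a are illustrative stand-ins, not values from the patent):

```python
import numpy as np

# Model output: one recommendation probability per exercise in the bank.
probs = np.array([0.05, 0.30, 0.10, 0.25, 0.02, 0.28])
done = {1, 3}   # indices of exercises the user has already done
a = 2           # set number of exercises to recommend

# Sort by recommendation probability (descending), drop exercises the
# user has already done, and keep the top-a remaining ones.
ranked = np.argsort(probs)[::-1]
recommended = [int(i) for i in ranked if i not in done][:a]
```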
In a further improvement, the first step comprises the following steps:
First, each exercise is segmented with the Chinese word segmentation component of the third-party library jieba, and the resulting segments are trained with word2vec's skip-gram network model, mapping each word in the exercise to a d-dimensional word vector; the word vectors of all segmented words in each exercise are connected according to their semantic order in the exercise to obtain a matrix representing the exercise. Take the exercise with the largest number of words, and let that number be n; process every exercise into an n×d matrix, padding exercises with fewer than n words with zeros so that the dimensionality of the input data is consistent. The exercise matrix is then learned with a convolution model: three filter sizes, 2×d, 3×d and 5×d, are set, with three filters used for each size, and a maximum pooling operation is applied to the output features; the nine pooled output features are spliced into a vector1 containing the semantic information of the exercise.
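The convolution-and-pooling procedure above can be sketched in NumPy with random stand-ins for the word vectors and filter weights (n, d, and all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 8                         # n: max word count, d: word-vector dimension
exercise = rng.normal(size=(n, d))   # stand-in for a zero-padded exercise matrix

def conv_max_pool(matrix, filters):
    # Slide each h×d filter down the n×d matrix with stride 1, producing a
    # feature map of length n-h+1, then keep only its maximum (max pooling).
    pooled = []
    for f in filters:
        h = f.shape[0]
        feature_map = [np.sum(matrix[i:i + h] * f)
                       for i in range(matrix.shape[0] - h + 1)]
        pooled.append(max(feature_map))
    return pooled

vector1 = []
for h in (2, 3, 5):                  # the three filter sizes 2×d, 3×d, 5×d
    filters = [rng.normal(size=(h, d)) for _ in range(3)]  # three filters per size
    vector1.extend(conv_max_pool(exercise, filters))
vector1 = np.array(vector1)          # nine pooled features spliced into vector1
```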
In a further improvement, the second step includes the following steps:
An embedded vector of each exercise is obtained with a skip-gram network model: first, the exercises a user does in one exercise session are taken as a set; let the number of exercises the user did be S, denoted W1, W2, W3, …, WS. A current target exercise Wi is selected, and the skip-gram network model outputs the other exercises that co-occur with Wi in the exercise set, i.e., the positive samples. The model is trained so that, over all exercise sets, the conditional probability of the target exercise Wi co-occurring with every other exercise in the user's session is maximized, i.e., so that

$$\frac{1}{S}\sum_{i=1}^{S}\sum_{\substack{1\le j\le S \\ j\neq i}}\log p(W_j\mid W_i)$$

is maximal, where

$$p(W_j\mid W_i)=\frac{\exp(u_i^{\top}v_j)}{\sum_{k\in I}\exp(u_i^{\top}v_k)}$$

Here $u_i$ is the vector of the target exercise Wi, $v_j$ is the vector of an exercise that appears in the set together with Wi, I denotes the question bank containing all exercises, k ranges over the exercises in the question bank, and Wj denotes an exercise in the user's session that is different from the target exercise Wi.
A negative sampling method is applied: several exercises that are not in the same set as the target exercise Wi, i.e., negative samples, are randomly extracted to optimize the output and reduce the training computation of the model. Finally, the embedded vector representation of each exercise itself is obtained: vector2.
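The negative-sampling training of the exercise embeddings can be sketched in NumPy on toy data. The session data, pool structure, dimensions, and learning rate below are illustrative assumptions, not from the patent; each (target, co-occurring) pair in a session is a positive example, and randomly drawn exercises outside the session are negatives:

```python
import numpy as np

rng = np.random.default_rng(1)
num_exercises, dim = 50, 16
U = rng.normal(scale=0.1, size=(num_exercises, dim))  # target vectors u_i
V = rng.normal(scale=0.1, size=(num_exercises, dim))  # context vectors v_j

# Toy sessions: each session draws only from one of two pools, so exercises
# within a pool co-occur and exercises across pools never do.
pools = [np.arange(0, 25), np.arange(25, 50)]
sessions = [rng.choice(pools[t % 2], size=5, replace=False) for t in range(300)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, num_neg = 0.05, 3
for session in sessions:
    for i in session:
        for j in session:
            if i == j:
                continue
            # Positive pair: raise the score u_i . v_j toward 1.
            g = sigmoid(U[i] @ V[j]) - 1.0
            ui = U[i].copy()
            U[i] -= lr * g * V[j]
            V[j] -= lr * g * ui
            # Negative samples: lower u_i . v_k for random exercises
            # outside the current session.
            for k in rng.choice(num_exercises, size=num_neg):
                if k in session:
                    continue
                g = sigmoid(U[i] @ V[k])
                ui = U[i].copy()
                U[i] -= lr * g * V[k]
                V[k] -= lr * g * ui

vector2 = U  # row i is the embedded vector (vector2) of exercise i
```

After training, exercises that co-occur in sessions score higher against each other than exercises that never co-occur, which is the similarity signal the method relies on.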
In a further improvement, the third step comprises the following steps:
Splice vector1 and vector2 to obtain a final vector, input this vector into a fully-connected neural network, and then perform learning and training with the final vector: exercises of the same type form one training group, and several such groups form the training set. A target exercise is input, and the expected outputs are the other exercises belonging to the same type as the target exercise, so that the output probability of exercises of the same type as the current target exercise is maximized and the computed probability of exercises not of the same type as the current target exercise is minimized, thereby obtaining the trained model.
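A minimal forward-pass sketch of this splice-and-classify step: vector1 (CNN features) and vector2 (co-occurrence embedding) are concatenated and fed through a small fully-connected network with a softmax over the question bank. All dimensions and weights are illustrative random stand-ins, not trained values:

```python
import numpy as np

rng = np.random.default_rng(2)
bank_size = 30                  # number of exercises in the question bank
d1, d2 = 9, 16                  # dims of vector1 (CNN) and vector2 (item2vec)

vector1 = rng.normal(size=d1)   # semantic features from the CNN
vector2 = rng.normal(size=d2)   # co-occurrence embedding from item2vec
final_vector = np.concatenate([vector1, vector2])  # spliced input

# One hidden layer plus a softmax output over the whole question bank.
W1 = rng.normal(scale=0.1, size=(d1 + d2, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, bank_size)); b2 = np.zeros(bank_size)

hidden = np.tanh(final_vector @ W1 + b1)
logits = hidden @ W2 + b2
e = np.exp(logits - logits.max())
probs = e / e.sum()   # probability each bank exercise is of the same type
```

Training would then push `probs` toward 1 for same-type exercises and toward 0 for the rest, e.g. with the cross-entropy loss defined earlier.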
In a further improvement, a negative sampling method is adopted to accelerate training: for an input target exercise, e exercises that are not in the same set as the target exercise, i.e., negative samples, are randomly extracted to optimize the parameter update process, which reduces the computation and accelerates the training of the network.
The invention has the following beneficial effects: it addresses the inaccurate exercise recommendation that results when the complex structure of exercise content leads to improper similarity calculation, and effectively improves exercise recommendation accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow chart of step one.
Fig. 2 is a schematic of the final flow of the present invention.
Detailed description of the embodiments:
the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings, and it should be understood that the embodiments described herein are for purposes of illustration and explanation, and are not intended to limit the invention.
Example 1
The specific steps of the invention are shown in fig. 1 and fig. 2:
1) First, each exercise is segmented using the Chinese word segmentation in the third-party library jieba, and the resulting segments are trained with word2vec's skip-gram network model to map each word into a d-dimensional word vector. The word vectors of all segmented words of an exercise are connected to obtain a matrix representing the exercise. Take the exercise with the largest word count and let that count be n; process every exercise into an n×d matrix, padding exercises with fewer than n words with zeros so the input dimensions stay consistent. All exercises are thus finally represented as n×d matrices. A convolutional neural network then learns the exercise matrix: three filter sizes, 2×d, 3×d and 5×d, are set for the convolution operation, with three filters of each size, and a max pooling operation outputs the maximum of each output feature. The results of processing the nine output features are spliced into a vector1 containing the semantic information of the exercise.
2) Taking each exercise as a whole, we follow the idea of item2vec and use a skip-gram network model to obtain an embedded vector for each exercise. The exercises a user does at one time are taken as a set; let the number of exercises the user did this time be S, denoted W1, W2, W3, …, WS. We select a current target exercise Wi; the skip-gram network is then required to output the other exercises that co-occur with the current target exercise in the set, i.e., the positive samples, while exercises that do not occur in the set are negative samples. The model is trained so that the conditional probability of two exercises co-occurring in an exercise set is maximized. The corresponding objective function of the model is:

$$\frac{1}{S}\sum_{i=1}^{S}\sum_{\substack{1\le j\le S \\ j\neq i}}\log p(W_j\mid W_i)$$

where $p(W_j\mid W_i)$ is a softmax function:

$$p(W_j\mid W_i)=\frac{\exp(u_i^{\top}v_j)}{\sum_{k\in I}\exp(u_i^{\top}v_k)}$$

Here $u_i$ is the vector of the target exercise Wi, $v_j$ is the vector of an exercise that appears in the set together with Wi, I denotes the question bank containing all exercises, and k ranges over the exercises in the question bank.
A negative sampling method is applied: several exercises not in the current set, i.e., negative samples, are randomly extracted to optimize the output, so that only a small number of parameters need to be updated each time, which accelerates training. The embedded vector representation of each exercise is finally obtained: vector2.
3) Splice vector1 and vector2 to obtain a final vector, input this vector into a fully-connected neural network, and then perform learning and training: a target exercise is input, and the expected outputs are the other exercises belonging to the same type as the target exercise. Specifically, a target exercise vector is input and, after passing through a multilayer neural network and being normalized by a softmax function, yields the probability that each exercise in the question bank is of the same type as the current exercise. The fitting target of the model is to maximize the computed probability of exercises of the same type as the current target exercise and minimize the computed probability of exercises not of the same type. After training, the model can compute, from a target exercise vector, the probability that each other exercise in the question bank is of the same type, i.e., the probability of recommending that exercise.
Because the number of exercises is large, a normal training scheme over the full output would require a great deal of computation and time, so the idea of negative sampling is adopted to optimize the output. The specific measure is to randomly select several negative samples for a target exercise (generally 3-7), and to train in cross-entropy form, thereby completing the model training while saving training computation and time compared with training over the full set of exercises.
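The loss for one target exercise under this scheme can be sketched as follows: binary cross entropy over one positive (same-type) exercise plus a few random negatives replaces a softmax over the whole bank. The bank of random vectors, indices, and sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
bank_size, dim = 1000, 16
target = rng.normal(size=dim)               # vector of the target exercise
bank = rng.normal(size=(bank_size, dim))    # output vectors of the whole bank

pos = 7                                     # index of a same-type exercise
negs = rng.choice(bank_size, size=5)        # a handful of random negatives

# Binary cross-entropy over 1 positive + 5 negatives instead of a softmax
# over all 1000 exercises: far fewer terms per parameter update.
loss = -np.log(sigmoid(target @ bank[pos]) + 1e-12)
for k in negs:
    loss -= np.log(1.0 - sigmoid(target @ bank[k]) + 1e-12)
```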
4) Input a sample of exercises the user has done into the model trained in step 3). The output is, for each exercise, the probability that it belongs to the same category as the user's exercises, i.e., its recommendation probability. Sort all exercises in the result by this probability, and select the highest-probability exercises the user has not yet done to display to the user, completing the recommendation task.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (3)

1. A method for calculating and recommending exercise similarity based on user records is characterized by comprising the following steps:
step one, treat each exercise as a sentence and perform word segmentation to obtain word embedding vectors of the segmented words in the exercise; connect the word embedding vectors of all words in each exercise into a matrix, in the order in which the words appear in the exercise, to obtain an exercise matrix representing the exercise information, and process the exercise matrix with a convolutional neural network model: the convolutional neural network model convolves with filters of different sizes to obtain multiple output features, and the pooled results of these output features are spliced into a vector1;
step two, treat each exercise as a whole and calculate the similarity between exercises: each exercise is treated as a word, and the set of exercises a user does at one time is treated as a sentence; the probability that two exercises appear in the same exercise set at the same time is calculated as the similarity of the two exercises; finally, an embedded vector of each exercise, vector2, is obtained;
an embedded vector of each exercise is obtained with a skip-gram network model: first, the exercises a user does in one exercise session are taken as a set; let the number of exercises the user did be S, denoted W1, W2, W3, …, WS; a current target exercise Wi is selected, and the skip-gram network model outputs the other exercises co-occurring with Wi in the exercise set, i.e., the positive samples; the model is trained so that, over all exercise sets, the conditional probability of the target exercise Wi co-occurring with every other exercise in the user's session is maximized, i.e., so that

$$\frac{1}{S}\sum_{i=1}^{S}\sum_{\substack{1\le j\le S \\ j\neq i}}\log p(W_j\mid W_i)$$

is maximal, where

$$p(W_j\mid W_i)=\frac{\exp(u_i^{\top}v_j)}{\sum_{k\in I}\exp(u_i^{\top}v_k)}$$

here $u_i$ is the vector of the target exercise Wi, $v_j$ is the vector of an exercise that appears in the set together with Wi, I denotes the question bank containing all exercises, k ranges over the exercises in the question bank, and Wj denotes an exercise in the user's session that is different from the target exercise Wi;
a negative sampling method is applied: several exercises not in the same set as the target exercise Wi, i.e., negative samples, are randomly extracted to optimize the output and reduce the training computation of the model; finally, the embedded vector representation of each exercise is obtained: vector2;
step three, splice vector1 and vector2 to obtain a final vector, and train with this vector to obtain a trained model: splice vector1 and vector2 to obtain the final vector, input this vector into a fully-connected neural network, and then perform learning and training with the final vector: exercises of the same type form one training group, and several such groups form the training set; a target exercise is input, and the expected outputs are the other exercises belonging to the same type as the target exercise, so that the output probability of exercises of the same type as the current target exercise is maximized and the computed probabilities of exercises not of the same type as the current target exercise are minimized, thereby obtaining a trained model;
step four, input the latest exercises the user has done into the trained model; the output is, for each exercise in the exercise library, the probability that it belongs to the same category as the user's exercises, which serves as its recommendation probability; sort all exercises in the result by this probability, and select the a exercises with the highest recommendation probability that the user has not done to display to the user, completing the recommendation task; a is the set number of exercises to recommend.
2. The method of claim 1, wherein the step one comprises the steps of:
step one, segmenting each exercise by using a third-party library jieba Chinese segmentation component, training the obtained segmentation by using a skip-gram network model of word2vec, mapping each word in the exercise into a d-dimensional word vector, and connecting the word vectors of all the segments in each exercise according to the semantic sequence in the exercise to obtain a representative exercise matrix; taking the exercise with the maximum number of words, wherein the number of words is n; processing each problem into an n x d matrix, and performing 0 complementing operation on the problem with the word number less than n so as to keep the dimension of input data consistent; learning a problem matrix by using a convolution model, setting three sizes of 2 × d,3 × d and 5 × d, performing convolution operation on each size by using three filters respectively, and performing maximum pooling operation on output features; and splicing the processed results of the nine output features into a vector1 containing the semantic information of the problem.
3. The exercise similarity calculation recommendation method based on user records as claimed in claim 1, wherein a negative sampling method is adopted to accelerate the training in the training process, that is, for the input of a target exercise, e exercises which are not in the same set as the target exercise, that is, negative samples are randomly extracted to optimize the updating process of the parameters, so that the calculation amount is reduced, and the training speed of the network is accelerated.
CN201910444120.3A 2019-05-27 2019-05-27 Exercise similarity calculation recommendation method based on user records Active CN110334204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910444120.3A CN110334204B (en) 2019-05-27 2019-05-27 Exercise similarity calculation recommendation method based on user records

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910444120.3A CN110334204B (en) 2019-05-27 2019-05-27 Exercise similarity calculation recommendation method based on user records

Publications (2)

Publication Number Publication Date
CN110334204A CN110334204A (en) 2019-10-15
CN110334204B true CN110334204B (en) 2022-10-18

Family

ID=68140298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910444120.3A Active CN110334204B (en) 2019-05-27 2019-05-27 Exercise similarity calculation recommendation method based on user records

Country Status (1)

Country Link
CN (1) CN110334204B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143604B (en) * 2019-12-25 2024-02-02 腾讯音乐娱乐科技(深圳)有限公司 Similarity matching method and device for audio frequency and storage medium
CN117688248B (en) * 2024-02-01 2024-04-26 安徽教育网络出版有限公司 Online course recommendation method and system based on convolutional neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832453A (en) * 2017-11-24 2018-03-23 重庆科技学院 Virtual test paper recommendation method oriented to personalized learning scheme
CN109271401A (en) * 2018-09-26 2019-01-25 杭州大拿科技股份有限公司 Method, apparatus, electronic equipment and storage medium are corrected in a kind of search of topic
CN109299380A (en) * 2018-10-30 2019-02-01 浙江工商大学 Exercise personalized recommendation method in online education platform based on multidimensional characteristic
CN109635100A (en) * 2018-12-24 2019-04-16 上海仁静信息技术有限公司 A kind of recommended method, device, electronic equipment and the storage medium of similar topic

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275820B2 (en) * 2017-01-31 2019-04-30 Walmart Apollo, Llc Systems and methods for utilizing a convolutional neural network architecture for visual product recommendations

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832453A (en) * 2017-11-24 2018-03-23 重庆科技学院 Virtual test paper recommendation method oriented to personalized learning scheme
CN109271401A (en) * 2018-09-26 2019-01-25 杭州大拿科技股份有限公司 Method, apparatus, electronic equipment and storage medium are corrected in a kind of search of topic
CN109299380A (en) * 2018-10-30 2019-02-01 浙江工商大学 Exercise personalized recommendation method in online education platform based on multidimensional characteristic
CN109635100A (en) * 2018-12-24 2019-04-16 上海仁静信息技术有限公司 A kind of recommended method, device, electronic equipment and the storage medium of similar topic

Also Published As

Publication number Publication date
CN110334204A (en) 2019-10-15

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110162593B (en) Search result processing and similarity model training method and device
CN111415740B (en) Method and device for processing inquiry information, storage medium and computer equipment
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
WO2019153737A1 (en) Comment assessing method, device, equipment and storage medium
CN109902177B (en) Text emotion analysis method based on dual-channel convolutional memory neural network
CN110188272B (en) Community question-answering website label recommendation method based on user background
CN108197109A (en) A kind of multilingual analysis method and device based on natural language processing
CN109977199B (en) Reading understanding method based on attention pooling mechanism
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN109726745B (en) Target-based emotion classification method integrating description knowledge
CN110516070B (en) Chinese question classification method based on text error correction and neural network
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN109840328B (en) Deep learning commodity comment text sentiment tendency analysis method
CN111400455A (en) Relation detection method of question-answering system based on knowledge graph
CN111552773A (en) Method and system for searching key sentence of question or not in reading and understanding task
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN110874392B (en) Text network information fusion embedding method based on depth bidirectional attention mechanism
CN110569355B (en) Viewpoint target extraction and target emotion classification combined method and system based on word blocks
CN113486645A (en) Text similarity detection method based on deep learning
CN110969005B (en) Method and device for determining similarity between entity corpora
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN114743029A (en) Image text matching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant