CN114969487A - Course recommendation method and device, computer equipment and storage medium - Google Patents
Info
- Publication number
- CN114969487A (application number CN202110190358.5A)
- Authority
- CN
- China
- Prior art keywords
- course
- target
- network model
- reinforcement learning
- courses
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F16/00—Information retrieval; Database structures therefor; File system structures therefor › G06F16/90—Details of database functions independent of the retrieved data types › G06F16/95—Retrieval from the web › G06F16/953—Querying, e.g. by the use of web search engines › G06F16/9535—Search customisation based on user profiles and personalisation
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F16/00—Information retrieval; Database structures therefor; File system structures therefor › G06F16/90—Details of database functions independent of the retrieved data types › G06F16/95—Retrieval from the web › G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F18/00—Pattern recognition › G06F18/20—Analysing › G06F18/23—Clustering techniques
- G—PHYSICS › G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES › G06Q50/00—ICT specially adapted for implementation of business processes of specific business sectors › G06Q50/10—Services › G06Q50/20—Education › G06Q50/205—Education administration or guidance
Abstract
The application discloses a course recommendation method, a course recommendation device, computer equipment and a storage medium, wherein the method comprises the steps of determining the current state of a target user according to first historical course browsing data of the target user; obtaining candidate course classes meeting the screening condition according to the current state and a pre-trained target reinforcement learning network model, wherein the target reinforcement learning network model takes the course class to which the network course belongs as an action space, and the output quantity of the output action space is the same as the total quantity of the course classes; and screening a set number of target courses from the network courses corresponding to each candidate course category and pushing the target courses to the target user. By utilizing the method, the effect that the network course recommended to the user can bring long-term benefits to the network teaching platform is achieved by adopting a processing form of a reinforcement learning network model; meanwhile, the dimension reduction processing of the action space output in the reinforcement learning network model is realized, the effective recommendation of the network course to the user side is ensured, and the user experience is improved.
Description
Technical Field
The present application relates to the field of information recommendation technologies, and in particular, to a course recommendation method and apparatus, a computer device, and a storage medium.
Background
With the rapid development, popularization, and application of internet technology, the digitalized online learning mode is more and more widely accepted by the public, and the number of network courses available for users to learn on network teaching platforms has grown explosively. Faced with such a huge amount of information, it is difficult for users to quickly find the courses they are interested in or want to learn.
Currently, the platform side often addresses this problem by actively or passively recommending courses to the user. Most existing course recommendation is realized by building correlation models on the user's historical operation data on the network teaching platform, so as to predict and recommend courses the user is interested in. This modeling approach only considers the user's short-term preferences and revenue, ignores the long-term revenue of the platform as a whole, and cannot be matched with the interest targets of the platform side. Existing recommendation methods that do consider long-term revenue cannot be applied to large-scale network course recommendation.
Disclosure of Invention
In view of this, embodiments of the present application provide a course recommendation method, an apparatus, a computer device, and a storage medium, which can implement effective recommendation for large-scale network courses on the basis of ensuring long-term revenue of a platform.
In a first aspect, an embodiment of the present application provides a course recommendation method, including:
determining the current state of a target user according to first historical course browsing data of the target user;
obtaining candidate course classes meeting screening conditions according to the current state and a pre-trained target reinforcement learning network model, wherein the target reinforcement learning network model takes the course class to which the network course belongs as an action space, and the output quantity of the output action space is the same as the total quantity of the course classes;
and screening a set number of target courses from the network courses corresponding to the candidate course categories and pushing the target courses to the target user.
Further, the step of dividing the course categories to which the network courses belong includes:
obtaining second historical course browsing data of the selected users from the message queue, and forming course browsing sequences corresponding to the users;
taking each course browsing sequence as a sentence to be processed, and obtaining a course vector of each network course through a word vector division model to form a course vector set;
and clustering the course vector set to obtain the clustering clusters of the output quantity, and correspondingly determining the clustering center vector of each clustering cluster as a course category.
Further, the determining the current state of the target user according to the historical course browsing data of the target user includes:
performing word segmentation on first historical course browsing data of the target user in a set time period, and determining a browsed course vector of a browsed course corresponding to the target user;
and determining the current state of the target user by the average vector of the browsed course vectors.
Further, the obtaining candidate course categories meeting the screening condition according to the current state and the pre-trained target reinforcement learning network model includes:
inputting the current state into the target reinforcement learning network model, and outputting, through the target reinforcement learning network model, the output quantity of candidate vectors as the action spaces, wherein each candidate vector identifies one course category;
determining the accumulated return value of each course category through a given accumulated return value model and by combining the current state and the current network parameters of the target reinforcement learning network model;
and ranking the course categories according to the accumulated return values, and taking the course categories ranked within a first set rank as candidate course categories.
Further, the step of screening a set number of target courses from the network courses corresponding to each of the candidate course categories and pushing the target courses to the target user includes:
aiming at each candidate course category, acquiring a clustering center vector of the candidate course category;
determining a distance value between each course vector in a cluster associated with the cluster center vector and the cluster center vector;
ranking the course vectors according to the distance values, and taking the course vectors ranked within a second set rank as courses to be recommended;
and selecting target courses meeting fine-grained screening conditions from the courses to be recommended and pushing the target courses to the target users respectively.
Further, the training step of the target reinforcement learning network model comprises:
respectively recording two reinforcement learning network models with the same network structure and different network parameters as a real-time training network model and an initial reinforcement learning network model;
constructing a training sample set for model training according to the course categories identified by the clustering center vectors and the second historical course browsing data of the selected users, wherein each training sample in the training sample set comprises: the method comprises the steps of obtaining a first state sequence of a current state of a user, a target clustering center vector, an instantaneous return value and a second state sequence of a next state;
and performing loss function fitting according to output results of the training samples under the real-time training network model and the initial reinforcement learning network model respectively, and obtaining a target reinforcement learning network model through reverse learning of the fitted loss function.
Further, the performing loss function fitting according to the output results of the training samples under the real-time training network model and the initial reinforcement learning network model, and obtaining the target reinforcement learning network model through reverse learning of the fitted loss function includes:
for each training sample, determining the current accumulated return value of each action space vector output by the first state sequence under the real-time training network model, and determining the maximum current accumulated return value;
determining a standard cumulative return value of the first state sequence relative to the target clustering center vector under the initial reinforcement learning network model;
performing loss function fitting according to the maximum current accumulated return value and the standard accumulated return value corresponding to each training sample;
updating the network parameters of the real-time training network model according to the fitted loss function, and replacing the network parameters of the initial reinforcement learning network model with the network parameters of the real-time training network model when the updating times meet a parameter replacement period;
and determining the initial reinforcement learning network model after parameter replacement as the target reinforcement learning network model.
In a second aspect, an embodiment of the present application provides a course recommending apparatus, including:
the information determining module is used for determining the current state of a target user according to first historical course browsing data of the target user;
the candidate determining module is used for obtaining candidate course categories meeting the screening condition according to the current state and a pre-trained target reinforcement learning network model, wherein the target reinforcement learning network model takes the course category to which the network course belongs as an action space, and the output quantity of the output action spaces is the same as the total quantity of the course categories;
and the target recommendation module is used for screening a set number of target courses from the network courses corresponding to each candidate course category and pushing the target courses to the target user.
In a third aspect, an embodiment of the present application further provides a computer device, including: a memory and one or more processors;
the memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the course recommendation method as described in the first aspect above.
In a fourth aspect, embodiments of the present application further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform the course recommendation method according to the first aspect.
According to the course recommendation method, the apparatus, the computer device and the storage medium, firstly, the current state of a target user is determined according to first historical course browsing data of the target user; then, candidate course categories meeting the screening condition are obtained according to the current state and a pre-trained target reinforcement learning network model, wherein each action space output in reinforcement learning is identified by a course category to which the network courses belong, and the output quantity of the action spaces is equal to the total quantity of course categories; finally, a set number of target courses are screened from the network courses included in the candidate course categories and pushed to the target user. According to the technical scheme, a processing form of a reinforcement learning network model is adopted to achieve the effect that the network courses recommended to the target user can bring long-term benefits to the network teaching platform; meanwhile, the problem that reinforcement learning cannot adapt to large-scale data volume processing is solved by dimension reduction of the output action space in the reinforcement learning network model, that is, the output quantity of the output action spaces is kept the same as the quantity of course categories of the network courses, so that effective recommendation of network courses to the user side is realized and user experience is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
fig. 1 is a flowchart illustrating a course recommendation method according to an embodiment of the present application;
fig. 2 is a flowchart illustrating a course recommending method according to a second embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a course recommending apparatus according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings. It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, nor should be construed to indicate or imply relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Example one
Fig. 1 is a schematic flowchart of a course recommendation method according to an embodiment of the present application, where the method is suitable for performing network course recommendation in a network teaching platform for a user. The method may be performed by a course recommender, which may be implemented in hardware and/or software and is typically integrated in a computer device.
It should be noted that, when online network teaching is taken as the application scenario of this embodiment, the computer device integrating the method provided in this embodiment may serve as the platform server for online network teaching. Generally, thousands of network courses for users to learn exist on a network teaching platform, presenting a large-scale network course volume. Adopting a traditional reinforcement learning approach can address the long-term benefit problem; however, the reinforcement learning output is treated as the candidate set of recommendation information, and when the raw data from which the candidate set is determined is large in scale, effective execution of reinforcement learning in information recommendation cannot be guaranteed.
The course recommendation method provided by the embodiment can effectively solve the problem that the network course is large in scale and cannot be recommended through reinforcement learning.
As shown in fig. 1, a course recommendation method provided in this embodiment specifically includes the following steps:
s101, determining the current state of a target user according to first historical course browsing data of the target user.
In this embodiment, a user with learning needs enters the network teaching platform through registration and login operations. The network teaching platform side records the user information of each registered user. Generally, the network teaching platform side recommends network courses to each online user; this embodiment can regard each user who enters the network teaching interface through a login operation as a target user, and for each target user, corresponding course recommendation can be achieved through the method provided by this embodiment.
In this embodiment, historical course browsing data may be specifically understood as data generated when a user browses each displayed interface within a period of time (e.g., one day, one week, or even one month) after entering the network teaching platform. The specific data may be data generated when browsing courses, for example, which courses were browsed and how many times a course was browsed within a period of time, where different courses may be distinguished by course identification numbers. For ease of distinction, this embodiment records the historical course browsing data acquired for the target user as the first historical course browsing data.
In this embodiment, the current state may be specifically understood as the operation state of the target user's network course learning before the next action operation is performed, when network course learning is taken as the application environment; for example, having finished browsing course A corresponds to one state of the user.
Specifically, the current state may be determined by analyzing the first historical course browsing data corresponding to the target user. The first historical course browsing data includes operation information of the target user relative to the network courses within a period of time; this step can regard the operation information as a sentence, analyze it to obtain keyword information of the target user within this period, such as the IDs of the network courses operated on, and then form vector information capable of representing the current state by processing and summarizing the keyword information.
And S102, obtaining candidate course classes meeting screening conditions according to the current state and a pre-trained target reinforcement learning network model.
In this embodiment, this step corresponds to the operation of determining, through the reinforcement learning network model, the candidate courses necessary for course recommendation. Reinforcement learning is specifically understood to mean that, in an application environment, the actions to be performed are continuously selected according to the current state of the environment in an intelligent learning manner, so as to ensure that the reward obtained by performing the selected actions is maximized. The specific execution of selecting the action in each state while ensuring the maximum final reward can be realized by the reinforcement learning network model.
In this embodiment, the reinforcement learning network model preferably used is a Deep Q-Network (DQN) model, and the reinforcement learning network model formed after learning on the set training samples is recorded as the target reinforcement learning network model. Generally, the input data of the reinforcement learning network model is a current state value, and the output data is action space data related to the application environment, wherein each action space represents the data information of one candidate execution action. In principle, the output quantity of the action spaces output by the target reinforcement learning network model equals the quantity of executable actions in the application environment, but this mode is better suited to scenarios comprising small-scale executable actions.
In this embodiment, each network course included in the network teaching platform is basically equivalent to one candidate action that the user can execute. Because the number of network courses is large, each network course cannot directly serve as an executable candidate action output by the target reinforcement learning network model. This embodiment therefore considers dividing the network courses into course categories and taking each course category as an action space output by the target reinforcement learning network model. That is, the target reinforcement learning network model takes the course category to which the network courses belong as the action space, and the output quantity of the output action spaces is the same as the total quantity of course categories.
Specifically, in this step, the current state of the target user may be used as the input data of the target reinforcement learning network model, and a plurality of different pieces of action space data information may be output, where the data information of each action space represents one course category in the network teaching platform. Each output course category can then be regarded as a candidate set, from which candidate course categories meeting the conditions can be screened out. The screening condition may be set based on the accumulated return values corresponding to the action spaces output by the target reinforcement learning network model. Generally, a greedy algorithm strategy is adopted to determine the action space corresponding to the optimal accumulated return value as the coarse-grained screening result of this step. In this embodiment, it is considered that an optimal solution determined by a greedy algorithm strategy cannot guarantee diversity of the screening result, so a non-greedy algorithm is preferably adopted to guarantee diversity of the coarse-grained screening result of this step; for example, a plurality of course categories with top-ranked accumulated return values are respectively taken as candidate course categories.
It should be noted that, in this embodiment, the network courses may be clustered and divided according to the course browsing habits of the user through a clustering process, and the cluster center vector of each clustered and divided cluster may be regarded as data information of a course category.
S103, a set number of target courses are screened from the network courses corresponding to the candidate course categories and pushed to the target user.
In this embodiment, each candidate course category contains one or more network courses belonging to it, and these network courses may be used as the candidate courses for course recommendation. This step may filter out a certain number of target courses from the candidate course set formed by the candidate course categories through a certain screening policy and push the target courses to the target user.
Specifically, for the screening of target courses, each candidate course included in a candidate course category may be analyzed, and the network courses with high importance or high weight values may be taken as target courses. In this step, the target course screening operation can be performed for each candidate course category, and the screened target courses are pushed to the target user in no particular order, so that the information of each target course is displayed on the home page of the target user's client.
According to the course recommendation method provided by this embodiment of the application, a processing form of a reinforcement learning network model is mainly adopted to achieve the effect that the network courses recommended to the target user can bring long-term benefits to the network teaching platform; meanwhile, the problem that reinforcement learning cannot adapt to large-scale data volume processing is solved by dimension reduction of the output action space in the reinforcement learning network model, that is, by ensuring that the output quantity of the output action spaces is only the same as the quantity of course categories of the network courses, so that effective recommendation of network courses to the user side is realized and user experience is improved.
As an optional embodiment of the present application, on the basis of the foregoing embodiment, the step of dividing the course categories to which the network courses belong may include:
it can be understood that the candidate course category determination through the target reinforcement learning network model in the embodiment is premised on that it is predetermined which course categories exist for the network courses that the user currently tends to browse in the network teaching platform. Thus, the alternative embodiment provides a specific implementation of class classification according to the class browsing behavior data of the user.
a1) And acquiring second historical course browsing data of the selected users from the message queue, and forming a course browsing sequence corresponding to the users.
A message queue is understood to mean, in particular, a buffer queue intended to buffer user behavior data generated by a user interacting with a backend. In this step, historical course browsing data corresponding to each registered user or each selected user participating in the course category division within a certain time period may be obtained from the set message queue, and this embodiment is recorded as second historical course browsing data.
In this step, by analyzing and extracting the second historical course browsing data of each user, sequence information containing only the course IDs of the network courses browsed by that user is formed and recorded as the course browsing sequence corresponding to the user; the course browsing sequence corresponding to each user is equivalent to a sentence on which word segmentation processing can be performed.
b1) And taking each course browsing sequence as a sentence to be processed, and obtaining the course vector of each network course through a word vector division model to form a course vector set.
Wherein the word vector division model is preferably word2vec, a correlation model for generating word vectors. In this step, each course browsing sequence can be regarded as a sentence to be processed and used as input data of the word vector division model, so that a vector set output by the word vector division model can be obtained; each vector in the vector set represents the course vector of one network course, and the vector set comprising the course vectors is preferably recorded as the course vector set.
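To make this step concrete, the following is a minimal sketch of feeding course browsing sequences to a word2vec implementation. The use of gensim, the vector dimension, and the sample course IDs are illustrative assumptions, not specified by the patent.

```python
# Illustrative sketch: course vectors from browsing sequences via word2vec.
# gensim, vector_size=64, and the sample data are assumptions, not from the patent.
from gensim.models import Word2Vec

# Each "sentence" is one user's course browsing sequence (course IDs as words).
browse_sequences = [
    ["c101", "c205", "c307", "c205"],   # user 1
    ["c307", "c101", "c412"],           # user 2
    ["c205", "c412", "c101", "c307"],   # user 3
]

# Skip-gram word2vec treats each course ID as a token and each
# browsing sequence as a sentence to be processed.
model = Word2Vec(browse_sequences, vector_size=64, window=5,
                 min_count=1, sg=1, epochs=50)

# Course vector set: one vector per network course.
course_vectors = {cid: model.wv[cid] for cid in model.wv.index_to_key}
print(len(course_vectors), course_vectors["c101"].shape)  # e.g. 4 (64,)
```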
c1) And clustering the course vector set to obtain the clustering clusters of the output quantity, and correspondingly determining the clustering center vector of each clustering cluster as a course category.
In this embodiment, the course vector set may be clustered through the K-means clustering algorithm, the K value of the formed clusters may be determined through the knee point (elbow) method, and the determined K value may be used as the output quantity of the action spaces output by the target reinforcement learning network model in this embodiment. The process of determining the K value by the knee point method can be described as follows:

All possible values of K in a certain range are searched, starting from K = 1; for each possible value, clustering is performed with that value to obtain a corresponding clustering result, and the sum of squared errors of the clustering result under each K value is then calculated. The sum-of-squared-errors formula is expressed as:

$$SSE = \sum_{i=1}^{K} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$

where K is the selected number of clusters, $C_i$ is the i-th cluster in the clustering result corresponding to that K value, $p$ represents a point (course vector) in the cluster, and $m_i$ represents the cluster center of the i-th cluster.

Through this formula, a sum of squared errors is obtained for each K value; connecting the sums of squared errors, the K value at which the slope of the connecting line changes most sharply can be taken as the optimal K value.
In this embodiment, after the optimal K value is determined, the K value is used as the output quantity of this embodiment, cluster clusters of the output quantity can be obtained, and the cluster center vector of each cluster can be regarded as the vector information representing one course category, thereby achieving the course category division required by this embodiment.
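A sketch of this clustering and knee-point selection is given below; scikit-learn's KMeans, the random data, the search range for K, and the second-difference heuristic for the knee are assumptions for illustration.

```python
# Illustrative sketch: choose K by the knee of the SSE curve, then use the
# cluster centers as course categories. scikit-learn and the K range are assumed.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 64)   # stand-in for the course vector set

sse = {}
for k in range(1, 16):        # search candidate K values in a set range
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_      # sum of squared errors for this K

ks = sorted(sse)
curve = np.array([sse[k] for k in ks])
slope_change = np.diff(curve, n=2)              # change in slope along the line
best_k = ks[int(np.argmax(slope_change)) + 1]   # K where slope changes most sharply

final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
category_vectors = final.cluster_centers_       # one course category per center
```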
It should be noted that another premise of the present embodiment for determining the class of the candidate lesson through the target reinforcement learning network model is that: it is also necessary to ensure that the adopted target reinforcement learning network model is a pre-trained network model, and this embodiment also provides another optional embodiment to implement training of the reinforcement learning network model.
Specifically, as another optional embodiment of the present application, on the basis of the above optional embodiment, the embodiment may express the training step of the target reinforcement learning network model as:
a2) and respectively recording two reinforcement learning network models with the same network structure and different network parameters as a real-time training network model and an initial reinforcement learning network model.
Through relevant analysis of reinforcement learning, it can be known that two neural network models with the same network structure but different network parameters need to be provided in the reinforcement learning training process. One of them can be recorded as the real-time training network model, which needs to be trained in real time; the other can be recorded as the initial reinforcement learning network model, which needs to be continuously updated and is used in the actual application scenario.
b2) And constructing a training sample set for model training according to the class identified by the clustering center vector and the second historical course browsing data of the selected user.
In this embodiment, in order to ensure that the reinforcement learning network model obtained after training can match the course recommendation application scenario of this embodiment, a training sample set required by model training needs to be set based on the course recommendation application scenario of this embodiment. In this step, the clustering center vector that identifies each course category when performing the course category division may be obtained, and second historical course browsing data that is used when performing the course category division may also be obtained.
Through the analysis of the second historical course browsing data corresponding to each user, the construction of the training samples can be performed, wherein each training sample in the training sample set comprises: a first state sequence of a current state of the user, a target cluster center vector, an instantaneous reward value, and a second state sequence of a next state.
Specifically, in order to form a training sample, the present embodiment determines what information should be provided in a training sample from the perspective of the required parameters in the reinforcement learning scenario. In general, the main parameters in the reinforcement learning scenario include the current state of the environment, the actions that the user can perform (e.g., which lessons to browse), the next state of the environment after the user performs the actions, and the instantaneous reward generated after the user performs the actions. Through the second historical course browsing data, which courses are browsed by the user in the historical time period can be known, so that a state sequence of the user in the application environment can be formed, and further parameter information of each parameter in the training sample can be determined.
For example, assuming that the second historical course browsing data is analyzed to determine that the user browsed four courses, i.e., course A, course B, course C and course D, in sequence within a period of time, this embodiment may construct the current state data of the user based on the first 3 courses and construct the next state data of the user based on all four courses.
As described above, in this embodiment each course has corresponding course vector information, so a first state sequence representing the current state of the user is formed based on the course vectors of the first 3 courses; for example, the average vector obtained by averaging the course vectors of the first 3 courses is regarded as the first state sequence. Meanwhile, a second state sequence representing the next state of the user can be formed based on the course vectors of all 4 courses; similarly, it can be obtained by averaging the course vectors of the 4 courses.
Knowing the 4th course (course D) is equivalent to knowing the action space executed when the user switches from the current state to the next state. In this embodiment, the vector data of course D is not directly used as the action space; instead, the cluster to which course D belongs is determined first, and the cluster center vector of that cluster is then used as the action space information executed when the user switches from the current state to the next state, that is, the target clustering center vector in the training sample required by this embodiment.
Similarly, after the user switches from the current state to the next state, an instantaneous return value is immediately fed back according to the action the user performed; this fed-back instantaneous return value likewise constitutes one piece of parameter information in the training sample.
In this step, a corresponding training sample may be determined for each user in the manner described above.
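Assembling one training tuple from the four-course example above can be sketched as follows; the helper name, reward value, and vector lookups are hypothetical, introduced only for illustration.

```python
# Illustrative sketch: build one (s, a, r, s') training sample from a browse
# history of four courses. Vector lookups and the reward value are assumed.
import numpy as np

def build_sample(course_vecs, cluster_of, centers, history, reward):
    """history: ordered list of browsed course IDs, e.g. ['A', 'B', 'C', 'D']."""
    vecs = [course_vecs[c] for c in history]
    s = np.mean(vecs[:-1], axis=0)             # first state sequence: mean of first 3
    s_next = np.mean(vecs, axis=0)             # second state sequence: mean of all 4
    action = centers[cluster_of[history[-1]]]  # target cluster center of course D
    return s, action, reward, s_next

# Hypothetical usage with 64-dim course vectors and 2 clusters:
course_vecs = {c: np.random.rand(64) for c in "ABCD"}
cluster_of = {"A": 0, "B": 1, "C": 0, "D": 1}
centers = np.random.rand(2, 64)
sample = build_sample(course_vecs, cluster_of, centers, list("ABCD"), reward=1.0)
```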
c2) And performing loss function fitting according to output results of the training samples under the real-time training network model and the initial reinforcement learning network model respectively, and obtaining a target reinforcement learning network model through reverse learning of the fitted loss function.
It can be known that the training process of the network model is equivalent to a process of performing reverse learning adjustment on the network parameters in the network model by comparing an actual output value obtained by inputting input data in a training sample into the network model to be trained with a standard output value in the training sample in a certain manner. The actual output value is compared with the standard output value in a certain way, and the comparison is mainly realized by fitting a loss function.
Based on this, in this step, the current state (i.e., the first state sequence) in a training sample may be used as the input data of the real-time training network model, the action space with the maximum accumulated return value among the output action spaces may be used as the actual output value, and the loss function may be fitted by combining the target clustering center vector given in the training sample as the standard output value.

In the loss function fitting of this embodiment, the maximum accumulated return value corresponding to the actual output value and the accumulated return value of the standard output value under the initial reinforcement learning network model are mainly used. As training and learning proceed, the network parameters of the real-time training network model can be adjusted through the fitted loss function; then, when the condition for adjusting the parameters of the initial reinforcement learning network model is met, its network parameters can also be adjusted, and finally the initial reinforcement learning network model with adjusted network parameters can be used as the currently available target reinforcement learning network model.
Further, this embodiment may specifically implement the step c2) as the following steps:
it should be noted that, the determination process of the target reinforcement learning network model requires the participation of each training sample, and the steps provided in the following description of the present alternative embodiment need to be performed for each training sample.
c21) And determining the current accumulated return value of each action space vector output by the first state sequence under the real-time training network model and determining the maximum current accumulated return value aiming at each training sample.
In this step, the specific implementation mainly includes: first, the first state sequence in the training sample is input as input data into the real-time training network model, and the action space vectors (corresponding to the cluster center vectors of the course categories formed by cluster division) output by the operation of the real-time training network model are obtained.
Then, the current accumulated return value of each action space vector can be obtained, and the maximum current accumulated return value can be determined by comparing the current accumulated return values. The current accumulated return values can be obtained by calculation with a known return value determination function, and the target value required for the loss function fitting is determined using the Bellman equation, which can be expressed as:

$$Y = R_{t+1} + \gamma \, Q\big(S_{t+1}, \arg\max_{a} Q(S_{t+1}, a, \theta_t), \theta_t'\big)$$

where Y represents the target value required for the loss function fitting, $R_{t+1}$ represents the instantaneous return value in the training sample, $\gamma$ is a preset parameter, $Q(S_{t+1}, a, \theta_t)$ is the accumulated return value corresponding to action space $a$ when converting from the current state to the next state under the real-time training network model with network parameters $\theta_t$, and the outer $Q(\cdot, \cdot, \theta_t')$ is the accumulated return value that the initial reinforcement learning network model with network parameters $\theta_t'$ determines for the selected target action space.
c22) And determining a standard accumulated return value of the first state sequence relative to the target clustering center vector under the initial reinforcement learning network model.
The specific implementation of this step includes: the first state sequence is input as input data into the initial reinforcement learning network model, the target clustering center vector in the training sample is found among the output action space vectors, and the corresponding standard accumulated return value is obtained.
c23) And performing loss function fitting according to the corresponding maximum current accumulated return value and the standard accumulated return value under each training sample.
In this embodiment, the target value required for the loss function fitting may be determined using the Bellman equation.

The Bellman equation can be expressed as:

$$Y = R_{t+1} + \gamma \, Q\big(S_{t+1}, \arg\max_{a} Q(S_{t+1}, a, \theta_t), \theta_t'\big)$$

where the Bellman equation is calculated for each training sample: Y represents the target value required for the loss function fitting; $R_{t+1}$ represents the instantaneous return value in the training sample; $\gamma$ is a preset parameter; $Q(S_{t+1}, a, \theta_t)$ is the current accumulated return value corresponding to action space $a$ in the transition from the current state $S_t$ to the next state $S_{t+1}$ under the real-time training network model with network parameters $\theta_t$, and the maximum accumulated return value can be determined from these current accumulated return values; $Q(\cdot, \cdot, \theta_t')$ is the standard accumulated return value that the initial reinforcement learning network model with network parameters $\theta_t'$ determines for the determined target action space. The target action space is equivalent to the action space corresponding to the maximum accumulated return value, and is often the target clustering center vector in the training sample.

Then, a specific loss function value may be determined from a given loss function, where the loss function may be expressed as:

$$L(\theta_t) = \frac{1}{n} \sum \big(Y - Q(S_t, a_t, \theta_t)\big)^2$$

where the sum runs over the n training samples, $Q(S_t, a_t, \theta_t)$ is the accumulated return value output by the real-time training network model with network parameters $\theta_t$ for the current state $S_t$ and the executed action space $a_t$, Y is the determined target value, and n is the number of training samples; this expression is mainly used to fit the mean square error between the actual value of the real-time training network model and the target value Y.
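A sketch of this target and loss in PyTorch follows. The layer sizes, γ, batch layout, and the use of category indices instead of cluster-center vectors as actions are simplifying assumptions, not the patent's specification.

```python
# Illustrative sketch of the Double-DQN-style target and MSE loss described
# above. PyTorch, the layer sizes, and GAMMA are assumptions for illustration.
import torch
import torch.nn as nn

K, DIM, GAMMA = 8, 64, 0.9   # K course categories, 64-dim state vectors
online = nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, K))  # theta_t
target = nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, K))  # theta_t'
target.load_state_dict(online.state_dict())

def dqn_loss(s, a_idx, r, s_next):
    """s/s_next: (n, DIM) states; a_idx: (n,) chosen category index; r: (n,) reward."""
    with torch.no_grad():
        best = online(s_next).argmax(dim=1, keepdim=True)   # action picked by theta_t
        y = r + GAMMA * target(s_next).gather(1, best).squeeze(1)  # Bellman target Y
    q = online(s).gather(1, a_idx.unsqueeze(1)).squeeze(1)  # Q(S_t, a_t, theta_t)
    return nn.functional.mse_loss(q, y)                     # mean square error over n

n = 32
loss = dqn_loss(torch.rand(n, DIM), torch.randint(0, K, (n,)),
                torch.rand(n), torch.rand(n, DIM))
loss.backward()
```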
c24) And updating the network parameters of the real-time training network model according to the fitted loss function, and replacing the network parameters of the initial reinforcement learning network model by adopting the network parameters of the real-time training network model when the updating times meet a parameter replacement cycle.
Specifically, the real-time training network model can be trained in reverse through the mean square error, so that the network parameters are adjusted and the network model is updated. The network parameters of the real-time training network model are updated in real time and the number of updates is counted in real time; when the accumulated number of updates reaches the parameter replacement period, the network parameters of the initial reinforcement learning network model are replaced with the network parameters of the real-time training network model at that moment, thereby updating the initial reinforcement learning network model.

In this embodiment, the preferred parameter replacement period is a count of updates accumulated from 0 to a set value, which can be determined according to historical experience, for example, 50 updates.
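Continuing the sketch above (reusing its `online`, `target`, `dqn_loss`, `DIM` and `K`), the periodic replacement could look like this; the optimizer choice and the stand-in batches are assumptions.

```python
# Illustrative sketch: update the real-time (online) network each step and copy
# its parameters into the initial (target) network every REPLACE_PERIOD updates.
import torch

REPLACE_PERIOD = 50   # example replacement period from the text above
optimizer = torch.optim.Adam(online.parameters(), lr=1e-3)

for step in range(1, 201):
    # Stand-in batch; in practice these come from the constructed training samples.
    s, a_idx, r, s_next = (torch.rand(32, DIM), torch.randint(0, K, (32,)),
                           torch.rand(32), torch.rand(32, DIM))
    optimizer.zero_grad()
    dqn_loss(s, a_idx, r, s_next).backward()   # fitted loss from the sketch above
    optimizer.step()
    if step % REPLACE_PERIOD == 0:
        # Replace the initial network's parameters with the real-time training
        # network's parameters once the update count meets the period.
        target.load_state_dict(online.state_dict())
```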
c25) And determining the initial reinforcement learning network model after parameter replacement as the target reinforcement learning network model.
This optional embodiment specifically provides the implementation of course category division of the network courses and the training of a reinforcement learning network model that takes the divided course categories as the action space dimensions. The scale of the network courses to be recommended can be effectively reduced through the course category division, so that the action space dimensionality in reinforcement learning is reduced and the effective application of reinforcement learning in large-scale data recommendation scenarios is ensured.
Example two
Fig. 2 is a flowchart of a course recommendation method according to a second embodiment of the present application, where the present embodiment is based on the foregoing embodiment, and in this embodiment, a current state of a target user may be determined according to historical course browsing data of the target user, and specifically expressed as: performing word segmentation on first historical course browsing data of the target user in a set time period, and determining a browsed course vector of a browsed course corresponding to the target user; and determining the current state of the target user by the average vector of the browsed course vectors.
Meanwhile, this embodiment may further specifically express the obtaining of candidate course categories meeting the screening condition according to the current state and the pre-trained target reinforcement learning network model as: inputting the current state into the target reinforcement learning network model, and outputting the output quantity of candidate vectors as the action spaces through the target reinforcement learning network model, wherein each candidate vector identifies one course category; determining the accumulated return value of each course category through a given accumulated return value model, combined with the current state and the current network parameters of the target reinforcement learning network model; and ranking the course categories according to the accumulated return values, and taking the course categories ranked within a first set rank as candidate course categories.
In addition, in this embodiment, the screening of a set number of target courses from the network courses corresponding to each candidate course category and pushing them to the target user may be specifically expressed as: aiming at each candidate course category, acquiring the clustering center vector of the candidate course category; determining the distance value between each course vector in the cluster associated with the cluster center vector and the cluster center vector; ranking the course vectors according to the distance values, and taking the course vectors ranked within a second set rank as courses to be recommended; and selecting target courses meeting fine-grained screening conditions from the courses to be recommended and pushing the target courses to the target user.
As shown in fig. 2, a course recommendation method provided in the second embodiment of the present application specifically includes the following operations:
s201, performing word segmentation on the first historical course browsing data of the target user in a set time period, and determining the browsed course vector of the browsed course corresponding to the target user.
Illustratively, the word vector correlation model word2vec may also be used to implement the analysis process, so that the course vectors of all the web courses that the user has browsed within the set time period may be obtained, and this step is denoted as the browsed course vector.
S202, determining the current state of the target user according to the average vector of the browsed course vectors.
This step corresponds to a specific implementation of determining the current state, that is, an average vector representing the current state of the target user is obtained by performing an average calculation on the browsed course vectors.
S203, inputting the current state into the target reinforcement learning network model, and outputting the output quantity of candidate vectors as an action space through the target reinforcement learning network model.
This step is equivalent to the specific application of the target reinforcement learning network model: corresponding to the input current state, a plurality of action spaces can be output, each action space is represented in vector form, and each vector is recorded as a candidate vector, wherein each candidate vector identifies one course category; that is, one output action space identifies one course category. The output quantity thus corresponds to the K value determined when the course categories were divided by the clustering algorithm.
S204, aiming at each course category, determining the accumulated return value of the course category through a given accumulated return value model and combining the current state and the current network parameters of the target reinforcement learning network model.
It is understood that each course category characterized by a cluster center vector can be obtained through the above course category division, and this step is an operation performed for each course category. The specific determination of the accumulated return value in this step may preferably be calculated using the accumulated return value function described above. The required known information includes the candidate vector (cluster center vector) corresponding to the course category, the current network parameters of the target reinforcement learning network model, and the next state corresponding to the reinforcement learning process.
S205, ranking the course categories according to the accumulated return value, and taking the course category with the first set ranking as a candidate course category.
The accumulated return value corresponding to each course category can be determined through the above steps, and this step can rank the accumulated return values, so that a plurality of top-ranked course categories can be selected as candidate course categories, ensuring the diversity of the obtained coarse-grained screening result. Wherein the first set rank may preferably be 2, i.e., the course categories ranked in the top two.
The following S206 to S209 give a specific implementation of fine-grained screening of target courses.
S206, aiming at each candidate course category, obtaining the clustering center vector of the candidate course category.
It can be known that this step and the following steps S207 and S208 are all operations performed for each candidate course category, and one candidate course category corresponds to the clustering center of one cluster.
And S207, determining the distance value between each course vector in the cluster associated with the cluster center vector and the cluster center vector.
In this embodiment, a cluster includes at least one network course belonging to the cluster center, and each network course is represented by a corresponding course vector. The step can calculate the distance value from each course vector to the clustering center vector.
And S208, ranking the course vectors according to the distance value, and taking the course vector with the second previous set name as a course to be recommended.
This step may rank the obtained distance values so as to obtain a plurality of top-ranked course vectors; this step preferably takes the top 20 courses as the courses to be recommended, that is, the second set rank is preferably 20.
S209, selecting target courses meeting fine-grained screening conditions from the courses to be recommended and pushing the target courses to the target users respectively.
In the step, the courses to be recommended corresponding to each candidate course category can be summarized, and the courses to be recommended can be sorted again according to the given fine-grained screening conditions, so that a proper number of network courses are selected as the target courses. The fine-grained screening conditions can be selected according to a specific application scene, and after selection, the sorting reference dimension for sorting the courses to be recommended can be determined.
In one implementation given in this embodiment, 4 network courses may be randomly selected from the courses to be recommended corresponding to each candidate course category and pushed to the target user as target courses. The pushed target courses can be displayed in the home page interface of the user client for the user to select and browse.
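A minimal sketch of the fine-grained screening in S206 to S209 under the preferred values named above (top 20 by distance, 4 pushed at random per category); all identifiers here are illustrative assumptions:

```python
import random
import numpy as np

def fine_grained_screening(candidate_centers, clusters, n_shortlist=20, n_push=4):
    """Sketch of S206-S209: shortlist by distance, then randomly push.

    candidate_centers -- clustering center vectors of the candidate categories
    clusters          -- assumed dict mapping a center (as a tuple) to the
                         list of (course_id, course_vector) pairs in its cluster
    n_shortlist       -- the "second set rank"; the embodiment prefers 20
    n_push            -- courses pushed per category; the embodiment prefers 4
    """
    pushed = []
    for center in candidate_centers:
        courses = list(clusters[tuple(center)])
        # Rank the cluster's courses by distance to the clustering center.
        courses.sort(key=lambda c: np.linalg.norm(c[1] - center))
        shortlist = courses[:n_shortlist]
        # Fine-grained screening: here, a random pick of n_push courses.
        pushed.extend(random.sample(shortlist, min(n_push, len(shortlist))))
    return [course_id for course_id, _ in pushed]
```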
The course recommendation method provided by the second embodiment of the present invention details the process of determining the current state of the target user, as well as the process of determining the candidate course categories and the process of screening the target courses. The method mainly adopts a reinforcement learning network model, so that the network courses recommended to the target user bring long-term benefit to the network education platform. Meanwhile, by reducing the dimensionality of the output action space of the reinforcement learning network model, i.e. keeping the output quantity of the action space equal only to the number of course categories, the problem that reinforcement learning cannot adapt to large-scale data processing is solved, effective recommendation of network courses to the user side is realized, and user experience is improved.
EXAMPLE III
Fig. 3 is a block diagram of a course recommending apparatus according to a third embodiment of the present application, where the apparatus is suitable for recommending a network course in a network teaching platform to a user. The apparatus may be implemented by hardware and/or software and is typically integrated in a computer device. As shown in fig. 3, the apparatus includes: an information determination module 31, a candidate determination module 32 and a target recommendation module 33.
An information determining module 31, configured to determine a current state of a target user according to first historical course browsing data of the target user;
a candidate determining module 32, configured to obtain candidate course categories meeting the screening conditions according to the current state and a pre-trained target reinforcement learning network model, where the target reinforcement learning network model takes the course category to which a network course belongs as its action space, and the output quantity of the action space is the same as the total quantity of course categories;
and a target recommending module 33, configured to screen a set number of target courses from the network courses corresponding to each candidate course category and push the selected target courses to the target user.
The course recommending apparatus provided by the third embodiment mainly adopts a reinforcement learning network model, so that the network courses recommended to the target user bring long-term benefit to the network education platform. Meanwhile, by reducing the dimensionality of the output action space of the reinforcement learning network model, i.e. ensuring that the output quantity of the action space is the same only as the number of course categories of the network courses, the problem that reinforcement learning cannot adapt to large-scale data processing is solved, effective recommendation of network courses to the user side is realized, and user experience is improved.
Further, the apparatus may further include a course category division module. The course category division module may be specifically configured to:
acquiring second historical course browsing data of the selected users from the message queue, and forming course browsing sequences corresponding to the users;
taking each course browsing sequence as a sentence to be processed, and obtaining a course vector of each network course through a word vector model to form a course vector set;
and clustering the course vector set to obtain a number of clusters equal to the output quantity, and correspondingly determining the clustering center vector of each cluster as a course category (see the sketch below).
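A hedged sketch of this category-division pipeline using off-the-shelf tools (gensim's Word2Vec as the word vector model and scikit-learn's KMeans for the clustering; the browsing data and the K value below are placeholders, not the patent's data):

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Each user's course browsing sequence is treated as one sentence;
# course IDs play the role of words (placeholder data).
browse_sequences = [["course_12", "course_7", "course_31"],
                    ["course_7", "course_44", "course_12"]]

w2v = Word2Vec(sentences=browse_sequences, vector_size=64,
               window=5, min_count=1)
course_ids = list(w2v.wv.index_to_key)
course_vectors = w2v.wv[course_ids]        # the course vector set

K = 2  # in practice: the output quantity, i.e. the number of course categories
kmeans = KMeans(n_clusters=K, n_init=10).fit(course_vectors)
cluster_centers = kmeans.cluster_centers_  # one center vector per category
```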
Further, the information determining module 31 may specifically be configured to:
performing word segmentation on first historical course browsing data of the target user in a set time period, and determining a browsed course vector of a browsed course corresponding to the target user;
and determining the average vector of the browsed course vectors as the current state of the target user (see the sketch below).
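For example (a sketch in which the lookup table and the handling of the time window are assumptions), the current state reduces to an element-wise mean over the vectors of the courses browsed in the set period:

```python
import numpy as np

def current_state(browsed_course_ids, course_vector_lookup):
    """Average the vectors of the courses browsed within the set time
    period to obtain the target user's current state."""
    vectors = [course_vector_lookup[cid] for cid in browsed_course_ids]
    return np.mean(vectors, axis=0)
```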
Further, the candidate determination module 32 may be specifically configured to:
inputting the current state into the target reinforcement learning network model, and outputting, through the target reinforcement learning network model, the output quantity of candidate vectors as the action space, where each candidate vector identifies a course category;
determining the accumulated return value of each course category through a given accumulated return value model and by combining the current state and the current network parameters of the target reinforcement learning network model;
and ranking the course categories according to the accumulated return values, and taking the course categories ranked within a first set rank as candidate course categories.
Further, the target recommendation module 33 may specifically be configured to:
aiming at each candidate course category, acquiring a clustering center vector of the candidate course category;
determining a distance value between each course vector in a cluster associated with the cluster center vector and the cluster center vector;
ranking the course vectors according to the distance values, and taking the course vectors ranked within a second set rank as courses to be recommended;
and selecting target courses meeting fine-grained screening conditions from the courses to be recommended and pushing the target courses to the target user.
Further, the apparatus may further include a model training module, wherein the model training module may include:
the information initialization unit is used for respectively recording two reinforcement learning network models with the same network structure and different network parameters as a real-time training network model and an initial reinforcement learning network model;
a sample determining unit, configured to construct a training sample set for model training according to the course categories identified by the clustering center vectors and the second historical course browsing data of the selected users, where each training sample in the training sample set includes: a first state sequence of a user's current state, a target clustering center vector, an instantaneous return value, and a second state sequence of the next state;
and a target obtaining unit, configured to perform loss function fitting according to the output results of the training samples under the real-time training network model and the initial reinforcement learning network model respectively, and to obtain the target reinforcement learning network model through reverse learning (back propagation) of the fitted loss function (see the sketch below).
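The training samples described by the sample determining unit amount to transition tuples; one possible representation (names are assumptions) is:

```python
from collections import namedtuple

# One training sample: first state sequence, target clustering center
# vector (the action), instantaneous return value, second state sequence.
Transition = namedtuple(
    "Transition", ["state", "action_center", "reward", "next_state"])
```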
Further, the target obtaining unit may specifically be configured to:
for each training sample, determining the current accumulated return value of each action space vector output by the first state sequence under the real-time training network model, and determining the maximum current accumulated return value;
determining a standard cumulative return value of the first state sequence relative to the target clustering center vector under the initial reinforcement learning network model;
performing loss function fitting according to the maximum current accumulated return value and the standard accumulated return value corresponding to each training sample;
updating the network parameters of the real-time training network model according to the fitted loss function, and replacing the network parameters of the initial reinforcement learning network model with the network parameters of the real-time training network model when the updating times meet a parameter replacement period;
and determining the initial reinforcement learning network model after parameter replacement as the target reinforcement learning network model.
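These steps read like a conventional DQN-style update with an online ("real-time training") network and a periodically synchronized target ("initial reinforcement learning") network. The sketch below follows that reading; the layer sizes, discount factor, replacement period, and the exact pairing of networks and states are assumptions rather than the patent's specification:

```python
import torch
import torch.nn as nn

STATE_DIM, N_CATEGORIES = 64, 50   # assumed dimensions
GAMMA, SYNC_PERIOD = 0.9, 100      # assumed discount factor / replacement period

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                         nn.Linear(128, N_CATEGORIES))

online_net = make_net()            # the real-time training network model
target_net = make_net()            # the initial reinforcement learning model
target_net.load_state_dict(online_net.state_dict())  # same structure, synced
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)

def train_step(step, state, action_idx, reward, next_state):
    # Current accumulated return of the chosen action under the online net.
    q_current = online_net(state).gather(1, action_idx.view(-1, 1)).squeeze(1)
    # Standard accumulated return under the frozen target net.
    with torch.no_grad():
        q_standard = reward + GAMMA * target_net(next_state).max(dim=1).values
    loss = nn.functional.mse_loss(q_current, q_standard)  # loss function fitting
    optimizer.zero_grad()
    loss.backward()                                       # reverse learning
    optimizer.step()
    # Replace the target net's parameters when the update count meets the period.
    if step % SYNC_PERIOD == 0:
        target_net.load_state_dict(online_net.state_dict())
```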
EXAMPLE IV
Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present application. The computer device includes: a processor 40, a memory 41, a display 42, an input device 43, and an output device 44. The number of processors 40 in the computer device may be one or more, and one processor 40 is taken as an example in fig. 4; likewise, the number of memories 41 may be one or more, and one memory 41 is taken as an example in fig. 4. The processor 40, the memory 41, the display 42, the input device 43 and the output device 44 of the computer device may be connected by a bus or by other means; fig. 4 takes the bus connection as an example. In an embodiment, the computer device may be a computer, a notebook computer, a smart tablet, or the like.
The memory 41 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the course recommendation method according to any embodiment of the present invention (for example, the information determination module 31, the candidate determination module 32, and the target recommendation module 33 in the course recommendation apparatus). The memory 41 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the device, and the like. Further, the memory 41 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 41 may further include memory located remotely from the processor 40, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The display 42 may be a touch-enabled display screen, which may be a capacitive screen, an electromagnetic screen, or an infrared screen. Generally, the display 42 is used for displaying data according to instructions of the processor 40, and is also used for receiving touch operations applied to the display screen and sending the corresponding signals to the processor 40 or other devices.
The input device 43 may be used for receiving input numeric or character information and generating key signal inputs related to user settings and function control of the device, and may include a camera for acquiring images and a sound pickup device for acquiring audio data. The output device 44 may include an audio device such as a speaker. The specific composition of the input device 43 and the output device 44 can be set according to actual conditions.
The processor 40 executes the various functional applications and data processing of the device by running the software programs, instructions and modules stored in the memory 41, i.e. implements the course recommendation method described above.
The computer device provided above can be used to execute the course recommendation method provided by any of the above embodiments, and has the corresponding functions and advantages.
EXAMPLE V
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a course recommendation method comprising:
determining the current state of a target user according to first historical course browsing data of the target user;
obtaining candidate course categories meeting screening conditions according to the current state and a pre-trained target reinforcement learning network model, wherein the target reinforcement learning network model takes the course category to which a network course belongs as an action space, and the output quantity of the action space is the same as the total quantity of the course categories;
and screening a set number of target courses from the network courses corresponding to the candidate course categories and pushing the target courses to the target user.
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the operations of the course recommendation method described above, and may also perform related operations in the course recommendation method provided by any embodiment of the present invention, with the corresponding functions and advantages.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application can be implemented by software plus necessary general-purpose hardware, and certainly also by hardware alone, although the former is the preferred embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a robot, a personal computer, a server, or a network device) to execute the course recommendation method according to any embodiment of the present application.
It should be noted that the units and modules included in the course recommendation apparatus are divided only according to functional logic, and the division is not limited thereto as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only used to distinguish them from one another and are not intended to limit the protection scope of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It is to be noted that the foregoing is only illustrative of the preferred embodiments and the technical principles applied in the present invention. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of the application. Therefore, although the present application has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit; the scope of the present application is determined by the scope of the appended claims.
Claims (10)
1. A course recommendation method, comprising:
determining the current state of a target user according to first historical course browsing data of the target user;
obtaining candidate course categories meeting screening conditions according to the current state and a pre-trained target reinforcement learning network model, wherein the target reinforcement learning network model takes the course category to which a network course belongs as an action space, and the output quantity of the action space is the same as the total quantity of the course categories;
and screening a set number of target courses from the network courses corresponding to the candidate course categories and pushing the target courses to the target user.
2. The method as claimed in claim 1, wherein the step of dividing the course categories of the network courses comprises:
obtaining second historical course browsing data of the selected users from the message queue, and forming course browsing sequences corresponding to the users;
taking each course browsing sequence as a sentence to be processed, and obtaining a course vector of each network course through a word vector model to form a course vector set;
and clustering the course vector set to obtain a number of clusters equal to the output quantity, and correspondingly determining the clustering center vector of each cluster as a course category.
3. The method as claimed in claim 1, wherein the determining the current state of the target user according to the first historical course browsing data of the target user comprises:
performing word segmentation on first historical course browsing data of the target user in a set time period, and determining a browsed course vector of a browsed course corresponding to the target user;
and determining the average vector of the browsed course vectors as the current state of the target user.
4. The method as claimed in claim 1, wherein the obtaining candidate course categories meeting the screening conditions according to the current state and the pre-trained target reinforcement learning network model comprises:
inputting the current state into the target reinforcement learning network model, and outputting, through the target reinforcement learning network model, the output quantity of candidate vectors as the action space, wherein each candidate vector identifies a course category;
determining the accumulated return value of each course category through a given accumulated return value model and by combining the current state and the current network parameters of the target reinforcement learning network model;
and ranking the course categories according to the accumulated return values, and taking the course categories ranked within a first set rank as candidate course categories.
5. The method as claimed in claim 2, wherein the step of screening a set number of target courses from the network courses corresponding to each of the candidate course categories comprises:
aiming at each candidate course category, acquiring a clustering center vector of the candidate course category;
determining a distance value between each course vector in a cluster associated with the cluster center vector and the cluster center vector;
ranking the course vectors according to the distance values, and taking the course vectors ranked within a second set rank as courses to be recommended;
and selecting target courses meeting fine-grained screening conditions from the courses to be recommended and pushing the target courses to the target user.
6. The method of claim 2, wherein the training step of the target reinforcement learning network model comprises:
respectively recording two reinforcement learning network models with the same network structure and different network parameters as a real-time training network model and an initial reinforcement learning network model;
constructing a training sample set for model training according to the course categories identified by the clustering center vectors and the second historical course browsing data of the selected users, wherein each training sample in the training sample set comprises: a first state sequence of a user's current state, a target clustering center vector, an instantaneous return value, and a second state sequence of the next state;
and performing loss function fitting according to output results of the training samples under the real-time training network model and the initial reinforcement learning network model respectively, and obtaining a target reinforcement learning network model through reverse learning of the fitted loss function.
7. The method according to claim 6, wherein the performing a loss function fitting according to the output results of the training samples under the real-time training network model and the initial reinforcement learning network model, respectively, and obtaining the target reinforcement learning network model through reverse learning of the fitted loss function comprises:
for each training sample, determining the current accumulated return value of each action space vector output by the first state sequence under the real-time training network model, and determining the maximum current accumulated return value;
determining a standard accumulated return value of the first state sequence relative to the target clustering center vector under the initial reinforcement learning network model;
performing loss function fitting according to the maximum current accumulated return value and the standard accumulated return value corresponding to each training sample;
updating the network parameters of the real-time training network model according to the fitted loss function, and replacing the network parameters of the initial reinforcement learning network model with the network parameters of the real-time training network model when the updating times meet a parameter replacement period;
and determining the initial reinforcement learning network model after parameter replacement as the target reinforcement learning network model.
8. A course recommending apparatus, comprising:
the information determining module is used for determining the current state of a target user according to first historical course browsing data of the target user;
the candidate determining module is used for obtaining candidate course categories meeting screening conditions according to the current state and a pre-trained target reinforcement learning network model, wherein the target reinforcement learning network model takes the course category to which a network course belongs as an action space, and the output quantity of the action space is the same as the total quantity of the course categories;
and the target recommending module is used for screening a set number of target courses from the network courses corresponding to the candidate course categories and pushing the target courses to the target user.
9. A computer device, comprising: a memory and one or more processors;
the memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method of any one of claims 1-7.
10. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110190358.5A CN114969487A (en) | 2021-02-18 | 2021-02-18 | Course recommendation method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114969487A (en) | 2022-08-30 |
Family
ID=82954252
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110190358.5A Pending CN114969487A (en) | 2021-02-18 | 2021-02-18 | Course recommendation method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114969487A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116228484A (en) * | 2023-05-06 | 2023-06-06 | 中诚华隆计算机技术有限公司 | Course combination method and device based on quantum clustering algorithm |
CN116228484B (en) * | 2023-05-06 | 2023-07-07 | 中诚华隆计算机技术有限公司 | Course combination method and device based on quantum clustering algorithm |
CN116934486A (en) * | 2023-09-15 | 2023-10-24 | 深圳格隆汇信息科技有限公司 | Decision evaluation method and system based on deep learning |
CN116934486B (en) * | 2023-09-15 | 2024-01-12 | 深圳市蓝宇飞扬科技有限公司 | Decision evaluation method and system based on deep learning |
CN117997957A (en) * | 2024-03-06 | 2024-05-07 | 广州市云蝶教育科技有限公司 | Member portrait-based online class self-adaptive distribution method for Internet |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |