CN114969487A - Course recommendation method and device, computer equipment and storage medium
- Publication number
- CN114969487A (Application No. CN202110190358.5A)
- Authority
- CN
- China
- Prior art keywords
- course
- target
- reinforcement learning network model
- Legal status: Granted
Classifications
- G06F16/9535—Search customisation based on user profiles and personalisation
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F18/23—Clustering techniques
- G06Q50/205—Education administration or guidance
Abstract
The present application discloses a course recommendation method, apparatus, computer device and storage medium. The method includes: determining the current state of a target user according to the target user's first historical course browsing data; obtaining candidate course categories that satisfy a screening condition according to the current state and a pre-trained target reinforcement learning network model, where the model takes the course categories to which online courses belong as its action space and the number of action space outputs equals the total number of course categories; and selecting a set number of target courses from the online courses corresponding to each candidate course category and pushing them to the target user. By adopting a reinforcement learning network model, the online courses recommended to the user can bring long-term benefits to the online teaching platform; at the same time, the action space output by the model is dimensionally reduced, which ensures effective recommendation of online courses to the user and improves the user experience.
Description
Technical Field
The present application relates to the technical field of information recommendation, and in particular to a course recommendation method, apparatus, computer device and storage medium.
Background
With the rapid development and widespread adoption of Internet technology, digital online learning has been increasingly accepted by the public, and the number of online courses available for users to study on online teaching platforms has grown explosively. Faced with such a huge amount of information, it is difficult for users to quickly find the courses they are interested in or want to learn.
At present, platform operators usually address this problem by actively or passively recommending courses to users. Most existing course recommendation approaches build correlation models from users' historical operation data on the online teaching platform to predict and recommend courses of interest. Such modeling only considers users' short-term preferences and gains while ignoring the long-term benefit of the platform as a whole, so it cannot match the platform operator's interests. Meanwhile, existing recommendation methods that do consider long-term benefit cannot be applied to large-scale online course recommendation.
Summary of the Invention
In view of this, the embodiments of the present application provide a course recommendation method, apparatus, computer device and storage medium, which can achieve effective recommendation for large-scale online courses while ensuring the platform's long-term benefit.
In a first aspect, an embodiment of the present application provides a course recommendation method, including:
determining the current state of a target user according to first historical course browsing data of the target user;
obtaining candidate course categories that satisfy a screening condition according to the current state and a pre-trained target reinforcement learning network model, where the target reinforcement learning network model takes the course categories to which online courses belong as its action space, and the number of action space outputs equals the total number of course categories; and
selecting a set number of target courses from the online courses corresponding to each candidate course category and pushing them to the target user.
Further, the step of dividing online courses into course categories includes:
obtaining second historical course browsing data of each selected user from a message queue, and forming a course browsing sequence for each user;
treating each course browsing sequence as a sentence to be processed, obtaining a course vector for each online course through a word vector model, and forming a course vector set; and
clustering the course vector set to obtain the aforementioned output number of clusters, and determining the cluster center vector of each cluster as one course category.
Further, determining the current state of the target user according to the historical course browsing data of the target user includes:
performing word segmentation on the first historical course browsing data of the target user within a set time period, and determining the browsed-course vector of each course the target user has browsed; and
determining the current state of the target user as the average of the browsed-course vectors.
Further, obtaining candidate course categories that satisfy the screening condition according to the current state and the pre-trained target reinforcement learning network model includes:
inputting the current state into the target reinforcement learning network model, which outputs the aforementioned number of candidate vectors as the action space, where each candidate vector identifies one course category;
for each course category, determining its cumulative reward value through a given cumulative reward model, combined with the current state and the current network parameters of the target reinforcement learning network model; and
ranking the course categories by cumulative reward value, and taking the categories within a first set number of top ranks as candidate course categories.
Further, selecting a set number of target courses from the online courses corresponding to each candidate course category and pushing them to the target user includes:
for each candidate course category, obtaining the cluster center vector of the candidate course category;
determining the distance between each course vector in the cluster associated with that cluster center vector and the cluster center vector;
ranking the course vectors by distance, and taking the course vectors within a second set number of top ranks as courses to be recommended; and
selecting, from the courses to be recommended, target courses that satisfy a fine-grained screening condition and pushing them to the target user.
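For illustration only (not part of the claims), a minimal sketch of the distance-based ranking above, assuming NumPy arrays for the course vectors; the names and the Euclidean metric are placeholders:

```python
import numpy as np

def courses_near_center(center, course_matrix, course_ids, top_n):
    """Rank the courses of one candidate category's cluster by distance
    to its cluster center and keep the top_n closest as courses to be
    recommended (the coarse screening before fine-grained filtering)."""
    dists = np.linalg.norm(course_matrix - center, axis=1)
    order = np.argsort(dists)[:top_n]
    return [course_ids[i] for i in order]
```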
Further, the training steps of the target reinforcement learning network model include:
recording two reinforcement learning network models with the same network structure but different network parameters as a real-time training network model and an initial reinforcement learning network model, respectively;
constructing a training sample set for model training according to the course categories identified by the cluster center vectors and the second historical course browsing data of the selected users, where each training sample in the set includes a first state sequence of the user's current state, a target cluster center vector, an instantaneous reward value, and a second state sequence of the next state; and
fitting a loss function according to the outputs of each training sample under the real-time training network model and the initial reinforcement learning network model, and obtaining the target reinforcement learning network model through backward learning of the fitted loss function.
Further, fitting the loss function according to the outputs of each training sample under the real-time training network model and the initial reinforcement learning network model, and obtaining the target reinforcement learning network model through backward learning of the fitted loss function, includes:
for each training sample, determining the current cumulative reward value of each action space vector output by the real-time training network model for the included first state sequence, and determining the maximum current cumulative reward value;
determining the standard cumulative reward value of the first state sequence relative to the target cluster center vector under the initial reinforcement learning network model;
fitting the loss function according to the maximum current cumulative reward value and the standard cumulative reward value corresponding to each training sample;
updating the network parameters of the real-time training network model according to the fitted loss function, and, when the number of updates reaches a parameter replacement period, replacing the network parameters of the initial reinforcement learning network model with those of the real-time training network model; and
determining the initial reinforcement learning network model after parameter replacement as the target reinforcement learning network model.
In a second aspect, an embodiment of the present application provides a course recommendation apparatus, including:
an information determination module, configured to determine the current state of a target user according to first historical course browsing data of the target user;
a candidate determination module, configured to obtain candidate course categories that satisfy a screening condition according to the current state and a pre-trained target reinforcement learning network model, where the target reinforcement learning network model takes the course categories to which online courses belong as its action space, and the number of action space outputs equals the total number of course categories; and
a target recommendation module, configured to select a set number of target courses from the online courses corresponding to each candidate course category and push them to the target user.
In a third aspect, an embodiment of the present application further provides a computer device, including a memory and one or more processors;
the memory is configured to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the course recommendation method described in the first aspect.
In a fourth aspect, an embodiment of the present application further provides a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, perform the course recommendation method described in the first aspect.
In the course recommendation method, apparatus, computer device and storage medium provided above, the method first determines the current state of the target user according to the target user's first historical course browsing data, then obtains candidate course categories that satisfy the screening condition according to the current state and the pre-trained target reinforcement learning network model, where the action space output in reinforcement learning is identified by the course categories to which online courses belong and the number of action space outputs equals the total number of course categories; finally, a certain number of target courses are selected from the online courses contained in the candidate course categories and pushed to the target user. This technical solution mainly adopts a reinforcement learning network model so that the online courses recommended to the target user can bring long-term benefits to the online education platform; at the same time, by reducing the dimensionality of the action space output by the reinforcement learning network model, that is, by ensuring that the number of action space outputs is only equal to the number of course categories, it solves the problem that reinforcement learning cannot handle large-scale data, thereby realizing effective recommendation of online courses to the user and improving the user experience.
Brief Description of the Drawings
Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a schematic flowchart of a course recommendation method provided in Embodiment 1 of the present application;
FIG. 2 is a schematic flowchart of a course recommendation method provided in Embodiment 2 of the present application;
FIG. 3 is a structural block diagram of a course recommendation apparatus provided in Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as recited in the appended claims.
In the description of the present application, it should be understood that the terms "first", "second", "third", etc. are only used to distinguish similar objects, are not necessarily used to describe a specific order or sequence, and should not be construed as indicating or implying relative importance. Those of ordinary skill in the art can understand the specific meanings of these terms in the present application according to the specific situation. In addition, in the description of the present application, unless otherwise specified, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects.
Embodiment 1
FIG. 1 is a schematic flowchart of a course recommendation method provided in Embodiment 1 of the present application. The method is suitable for recommending online courses on an online teaching platform to users. The method may be performed by a course recommendation apparatus, which may be implemented in hardware and/or software and is generally integrated in a computer device.
It should be noted that, when online teaching is taken as the application scenario of this embodiment, the computer device integrating the method provided in this embodiment can be regarded as the platform server of the online teaching platform. Generally, an online teaching platform hosts thousands of online courses for users to learn, i.e., a large-scale course inventory. Traditional reinforcement learning alone can address the long-term benefit problem; however, since reinforcement learning outputs a candidate set used for recommendation, when the original data from which the candidate set is determined is large in scale, its effective execution in information recommendation cannot be guaranteed.
The course recommendation method provided in this embodiment can effectively solve the problem that large-scale online courses cannot be recommended through reinforcement learning.
As shown in FIG. 1, the course recommendation method provided in Embodiment 1 specifically includes the following steps.
S101. Determine the current state of the target user according to the first historical course browsing data of the target user.
In this embodiment, a user with learning needs enters the online teaching platform through registration and login operations, and the platform side records the user information of each registered user. Generally, the platform can recommend online courses to every online user; in this embodiment, each user who enters the online teaching interface through a login operation is regarded as a target user, and course recommendation for each target user can be implemented through the method provided in this embodiment.
In this embodiment, historical course browsing data can be understood as the data generated when a user browses the displayed interfaces within a period of time (e.g., a day, a week, or even a month) after entering the platform. Specifically, it may be the data generated when browsing courses, such as which courses were browsed and how many times a given course was browsed within a period, where different courses are distinguished by course identification numbers. For ease of distinction, this embodiment records the historical course browsing data obtained for the target user as the first historical course browsing data.
In this embodiment, the current state can be understood as the operating state the target user is in with respect to online course learning before performing the next action; for example, having finished browsing course A corresponds to one state of the user.
Specifically, this step determines the current state by analyzing the first historical course browsing data of the target user. The first historical course browsing data includes the target user's operation information on online courses within a period of time. This step may treat the operation information as a sentence, extract from it the target user's keyword information within that period, such as the IDs of the operated online courses, and then process and aggregate the keyword information to form vector information that characterizes the current state.
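As a minimal sketch of this step (the names are placeholders, and the averaging of embedding vectors follows the description above and in Embodiment 2):

```python
import numpy as np

def current_state(browsed_course_ids, course_vectors):
    """Build the user's current state as the mean of the embedding
    vectors of the courses browsed in the set time period.
    course_vectors: dict mapping course ID -> np.ndarray embedding."""
    vecs = [course_vectors[cid] for cid in browsed_course_ids]
    return np.mean(vecs, axis=0)
```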
S102. Obtain candidate course categories that satisfy the screening condition according to the current state and the pre-trained target reinforcement learning network model.
In this embodiment, this step corresponds to determining, through the reinforcement learning network model, the candidate courses required for course recommendation. Reinforcement learning can be understood as follows: in an application environment, actions to be performed are continuously selected according to the current state of the environment in the form of intelligent learning, so as to maximize the reward obtained after performing the selected actions. Selecting an action in each state while maximizing the final reward can be implemented through a reinforcement learning network model.
The reinforcement learning network model preferably adopted in this embodiment is a Deep Q-Learning (DQN) network model, and the reinforcement learning network model formed after learning from the set training samples is recorded as the target reinforcement learning network model. Generally, the input of a reinforcement learning network model is a current state value, and the output is action space data related to the application environment, where each action space represents the data of one candidate action. In principle, the number of action space outputs of the target reinforcement learning network model equals the number of executable actions in the application environment, but this mode is better suited to scenarios with a small set of executable actions.
In this embodiment, every online course on the platform is in principle a candidate action that a user can perform. Because the number of online courses is large, each online course cannot directly serve as an executable candidate action output by the target reinforcement learning network model. This embodiment therefore divides the online courses into course categories and takes each course category as one action space output of the model. That is, the target reinforcement learning network model takes the course categories to which online courses belong as its action space, and the number of action space outputs equals the total number of course categories.
Specifically, this step takes the current state of the target user as input to the target reinforcement learning network model and outputs multiple action space data items, each of which characterizes one course category on the platform. Each output course category can then be regarded as a candidate, from which the categories satisfying the condition are screened out. The screening condition can be set through the cumulative reward values corresponding to the action spaces output by the model. Commonly, a greedy strategy would determine the action space with the optimal cumulative reward value as the result of this coarse-grained screening; considering that the optimum determined by a greedy strategy cannot guarantee diversity of the screening results, this embodiment preferably adopts a non-greedy strategy to ensure diversity, e.g., taking the several course categories with the highest cumulative reward values as candidate course categories.
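An illustrative sketch of this coarse-grained, non-greedy screening, assuming a PyTorch Q-network whose output layer has one unit per course category (the names are placeholders, not the patent's own code):

```python
import torch

def top_k_categories(q_net, state, k):
    """Score every course category with the trained Q-network and
    keep the k categories with the highest cumulative reward values
    (non-greedy top-k instead of a single greedy argmax)."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return torch.topk(q_values, k).indices.tolist()  # category indices
```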
It should be noted that this embodiment may cluster the online courses according to users' course browsing habits, and the cluster center vector of each resulting cluster can be regarded as the data of one course category.
S103. Select a set number of target courses from the online courses corresponding to each candidate course category and push them to the target user.
In this embodiment, each candidate course category contains one or more online courses, all of which can serve as candidate courses for recommendation. Exemplarily, this step may use a screening strategy to select a certain number of target courses from the candidate course set formed by each candidate course category and push them to the target user.
Specifically, the screening of target courses may analyze the candidate courses contained in a candidate course category and take the online courses with high importance or weight as target courses. This step may perform the above screening for every candidate course category, and the selected target courses are pushed to the target user in no particular order, so that the relevant information of each target course is displayed on the target user's client.
The course recommendation method provided in Embodiment 1 mainly adopts a reinforcement learning network model so that the online courses recommended to the target user can bring long-term benefits to the online education platform; at the same time, by reducing the dimensionality of the action space output by the reinforcement learning network model, that is, by ensuring that the number of action space outputs is only equal to the number of course categories, it solves the problem that reinforcement learning cannot handle large-scale data, thereby realizing effective recommendation of online courses to the user and improving the user experience.
As an optional embodiment of the present application, on the basis of the above embodiment, the step of dividing online courses into course categories may include the following.
It can be understood that a premise of determining candidate course categories through the target reinforcement learning network model in this embodiment is that the course categories of the online courses that platform users currently tend to browse have been determined in advance. Therefore, this optional embodiment provides a concrete implementation of course category division based on users' course browsing behavior data.
a1) Obtain the second historical course browsing data of each selected user from a message queue, and form a course browsing sequence for each user.
Here, the message queue can be understood as a cache queue that buffers the user behavior data generated when users interact with the backend. This step obtains, from the configured message queue, the historical course browsing data of each registered user, or of each user selected to participate in course category division, within a certain time period; this embodiment records it as the second historical course browsing data.
This step analyzes and extracts each user's second historical course browsing data to form sequence information containing only the course IDs of the online courses the user has browsed, recorded as that user's course browsing sequence; each user's course browsing sequence corresponds to a sentence that can undergo word segmentation.
b1) Treat each course browsing sequence as a sentence to be processed, obtain the course vector of each online course through a word vector model, and form a course vector set.
Here, the word vector model is preferably word2vec, a model for producing word vectors. This step treats each course browsing sequence as a sentence to be processed and feeds it to the word vector model as input, obtaining the vector set output by the model; each vector in the set represents the course vector of one online course, and the set containing all course vectors is recorded as the course vector set.
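A minimal sketch of step b1), assuming the gensim implementation of word2vec and treating each course ID as a token (the hyperparameter values shown are placeholders, not values from the patent):

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's course browsing sequence of course IDs.
browsing_sequences = [["c101", "c205", "c307"], ["c205", "c412"]]

w2v = Word2Vec(sentences=browsing_sequences,
               vector_size=64,   # dimension of each course vector
               window=5, min_count=1, sg=1)

# Course vector set: one embedding per online course ID.
course_vectors = {cid: w2v.wv[cid] for cid in w2v.wv.index_to_key}
```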
c1) Cluster the course vector set to obtain the output number of clusters, and determine the cluster center vector of each cluster as one course category.
In this embodiment, the course vector set may be clustered by the K-means algorithm. The number K of clusters formed may be determined by the inflection point (elbow) method, and the determined K is taken as the number of action space outputs of the target reinforcement learning network model in this embodiment. The process of determining K by the elbow method can be described as follows:
search all possible values of K within a certain range; for each candidate value, cluster with that value and obtain the corresponding clustering result; then compute the sum of squared errors (SSE) of the clustering result under each K. The SSE formula is expressed as:
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$$
where k is the currently selected number of clusters (this embodiment tries candidate values of K starting from 1), $C_i$ is the i-th cluster of the clustering result under the selected K, each p represents a point (course vector) in that cluster, and $m_i$ is the cluster center of the i-th cluster.
Through the above formula, one SSE value is obtained for each candidate K; connecting these SSE values into a curve, the K at which the slope of the curve changes most sharply is taken as the optimal K.
After the optimal K is determined, it is taken as the output number of this embodiment, and the corresponding K clusters are obtained; the cluster center vector of each cluster can then be regarded as the vector information characterizing one course category. This completes the course category division required by this embodiment.
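A minimal sketch of the elbow selection above, assuming scikit-learn's KMeans (whose `inertia_` attribute is exactly the SSE in the formula); the candidate range and the second-difference heuristic for "largest slope change" are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k_by_elbow(X, k_min=1, k_max=30):
    """Fit K-means for each candidate K and pick the K where the
    slope of the SSE curve changes the most (the elbow)."""
    ks = list(range(k_min, k_max + 1))
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in ks]
    # The second difference of the SSE curve approximates its change in slope.
    elbow_index = int(np.argmax(np.diff(sse, 2))) + 1
    return ks[elbow_index]
```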
It should be noted that another premise of determining candidate course categories through the target reinforcement learning network model in this embodiment is that the adopted model must be a pre-trained network model. This embodiment therefore provides a further optional embodiment to implement the training of the reinforcement learning network model.
Specifically, as another optional embodiment of the present application, on the basis of the above optional embodiment, the training steps of the target reinforcement learning network model can be described as follows.
a2) Record two reinforcement learning network models with the same network structure but different network parameters as a real-time training network model and an initial reinforcement learning network model, respectively.
From the analysis of reinforcement learning, it is known that two neural network models with the same structure need to be provided during training, but their network parameters differ: one is recorded as the real-time training network model that is trained in real time, and the other as the initial reinforcement learning network model, which has already undergone some learning and can be used in actual application scenarios but still needs to be continuously updated.
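A sketch of step a2) in PyTorch (the layer sizes are placeholders; the two networks share one structure, and the initial model starts as a copy whose parameters then diverge from those of the real-time training model):

```python
import copy
import torch.nn as nn

def build_q_net(state_dim, num_categories):
    """Q-network: maps a user state vector to one cumulative reward
    value per course category (the dimension-reduced action space)."""
    return nn.Sequential(
        nn.Linear(state_dim, 128), nn.ReLU(),
        nn.Linear(128, num_categories),
    )

online_net = build_q_net(state_dim=64, num_categories=20)  # real-time training model
target_net = copy.deepcopy(online_net)                     # initial RL network model
```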
b2) Construct a training sample set for model training according to the course categories identified by the cluster center vectors and the second historical course browsing data of the selected users.
In this implementation, to ensure that the reinforcement learning network model obtained after training matches the course recommendation scenario of this embodiment, the training sample set required for model training must be set based on that scenario. In this step, the cluster center vectors identifying each course category from the above category division can be obtained, as can the second historical course browsing data used in that division.
Training samples can be constructed by analyzing the second historical course browsing data of each user, where each training sample in the set includes: a first state sequence of the user's current state, a target cluster center vector, an instantaneous reward value, and a second state sequence of the next state.
Specifically, to form a training sample, this embodiment determines what a sample should contain from the perspective of the parameters required in a reinforcement learning scenario. Generally, the main parameters are: the current state of the environment, the action the user can perform (e.g., which course to browse), the next state of the environment after the user performs the action, and the instantaneous reward produced after the action. From the second historical course browsing data, one can know which courses the user browsed during the historical period, form the state sequence the user had in the application environment, and then determine each parameter of a training sample.
Exemplarily, assume that analysis of the second historical course browsing data determines that a user browsed four courses, A, B, C and D, in sequence within a period of time. This embodiment can then construct the user's current state data based on the first three courses and the user's next state data based on all four courses.
Following the above description, each course in this embodiment has corresponding course vector information, so a first state sequence characterizing the user's current state is formed from the course vectors of the first three courses; for example, the average of the course vectors of the first three courses can be taken as the first state sequence. Likewise, a second state sequence characterizing the user's next state can be formed by aggregating the course vectors of all four courses in the same way.
Once the fourth course (course D) is known, the action space to be executed when the user transitions from the current state to the next state is known. This embodiment does not directly take the vector of course D as that action space; instead, it first determines the cluster to which course D belongs, and then takes the cluster center vector of that cluster as the action space information executed when the user transitions from the current state to the next state, i.e., the target cluster center vector in the training sample required by this embodiment.
Likewise, after the user transitions from the current state to the next state, an instantaneous reward value can be fed back immediately according to the action performed by the user; this instantaneous reward value is also one parameter of a training sample.
In this step, one corresponding training sample can be determined for each user in the manner described above.
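A minimal sketch of the sample construction in b2); the reward value is left as an input because the patent does not specify the reward function, and the names and the course-to-cluster mapping are assumptions:

```python
import numpy as np

def build_transition(browsed, course_vectors, course_cluster, reward):
    """Turn one user's ordered browsing history (e.g. [A, B, C, D]) into
    a training sample (s, a, r, s'):
      s  = mean vector of all browsed courses except the last one,
      a  = index of the cluster the last course belongs to (identifying
           the target cluster center vector),
      s' = mean vector of all browsed courses."""
    vecs = np.stack([course_vectors[c] for c in browsed])
    state = vecs[:-1].mean(axis=0)        # first state sequence
    next_state = vecs.mean(axis=0)        # second state sequence
    action = course_cluster[browsed[-1]]  # target cluster (category) index
    return state, action, reward, next_state
```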
c2) Fit a loss function according to the outputs of each training sample under the real-time training network model and the initial reinforcement learning network model, and obtain the target reinforcement learning network model through backward learning of the fitted loss function.
The training of a network model amounts to inputting the input data of a training sample into the model to be trained, comparing the obtained actual output with the standard output contained in the sample in some way, and adjusting the network parameters through backward learning; the comparison between the actual output and the standard output is mainly realized by fitting a loss function.
Based on this, this step may take the current state in the training sample (i.e., the first state sequence) as the input of the real-time training network model, take the action space with the largest cumulative reward value among the output action spaces as the actual output, and fit the loss function in combination with the target cluster center vector given in the training sample as the standard output.
In the loss function fitting of this embodiment, the loss is set mainly based on the maximum cumulative reward value corresponding to the actual output and the cumulative reward value of the standard output under the initial reinforcement learning network model. In each round of training, the network parameters of the real-time training network model can be adjusted through the fitted loss function; the network parameters of the initial reinforcement learning network model can also be adjusted once the condition for its parameter adjustment is satisfied; finally, the initial reinforcement learning network model after parameter adjustment can be taken as the currently available target reinforcement learning network model.
Further, step c2) above can be implemented through the following steps.
It should be noted that the determination of the target reinforcement learning network model requires the participation of every training sample described above, and each of the following steps of this optional embodiment needs to be performed for every training sample.
c21) For each training sample, determine the current cumulative reward value of each action space vector output by the real-time training network model for the included first state sequence, and determine the maximum current cumulative reward value.
The implementation of this step mainly includes: first, inputting the first state sequence of the training sample into the real-time training network model, and obtaining the action space vectors output by the model (corresponding to the cluster center vectors of the course categories formed by clustering).
Afterwards, the current cumulative reward value of each action space vector can be obtained, and the maximum current cumulative reward value is determined by comparing them. Each current cumulative reward value can be computed through a known reward value determination function; the maximum obtained here supplies the term needed by the Bellman equation in step c23) below to determine the target value required for loss function fitting.
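A sketch of step c21) in PyTorch, continuing the assumed names from the earlier snippets:

```python
import torch

def action_values_and_max(net, state):
    """Score every course-category action for one state sequence and
    return the per-action cumulative reward values and their maximum."""
    with torch.no_grad():
        q = net(torch.as_tensor(state, dtype=torch.float32))
    return q, q.max().item()
```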
c22) Determine the standard cumulative reward value of the first state sequence relative to the target cluster center vector under the initial reinforcement learning network model.
The implementation of this step includes: inputting the first state sequence as input data into the initial reinforcement learning network model, locating the target cluster center vector of the training sample among the output action space vectors, and thereby obtaining its corresponding standard cumulative reward value.
c23) Fit the loss function according to the maximum current cumulative reward value and the standard cumulative reward value corresponding to each training sample.
In this embodiment, the Bellman equation can be used to determine the target value required for fitting the loss function.
The Bellman equation can be expressed as:
$$Y = R_{t+1} + \gamma\, Q\!\left(S_{t+1},\ \arg\max_{a} Q(S_{t+1}, a;\, \theta_t);\ \theta_t'\right)$$
where the equation is evaluated per training sample: $Y$ is the target value required for loss function fitting; $R_{t+1}$ is the instantaneous reward value in the training sample; $\gamma$ is a preset parameter; $Q(S_{t+1}, a; \theta_t)$ is the current cumulative reward value of action space $a$ for the transition from the current state $S_t$ to the next state $S_{t+1}$, obtained under the real-time training network model with network parameters $\theta_t$, and the maximum over the current cumulative reward values determines the target action space; the outer term is then the standard cumulative reward value of that target action space, determined under the initial reinforcement learning network model with network parameters $\theta_t'$. The target action space corresponds to the action space of the maximum cumulative reward value, which is typically the target cluster center vector in the training sample.
A concrete loss function value can then be determined from the given loss function, which can be expressed as:

L(θ_t) = (1/n) Σ (Y − Q(S_t, a, θ_t))², with the sum running over the n training samples.
Here Q(S_t, a, θ_t) denotes the cumulative reward value obtained, under the real-time training network model with network parameters θ_t, when the target action space takes the previous state to the current state S_t; Y is the target value determined above, and n is the number of training samples. The expression fits the mean square error between the real-time training network model's estimate and the target value Y.
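To make the fitting loop concrete, the following is a minimal PyTorch sketch of steps c21) to c23) under the definitions above; the networks q_online (parameters θ_t) and q_target (parameters θ_t'), the optimizer, and the batch layout are assumptions of this sketch rather than details fixed by the embodiment.

```python
import torch
import torch.nn.functional as F

def fit_step(q_online, q_target, batch, gamma, optimizer):
    """One loss-fitting step over a batch of (s, a_idx, r, s_next) samples.

    q_online: real-time training network model (parameters theta_t)
    q_target: initial reinforcement learning network model (parameters theta_t')
    Each network maps a state to K values, one per course category.
    """
    s, a_idx, r, s_next = batch  # a_idx is a long tensor of action indices

    with torch.no_grad():
        # c21) current cumulative reward of every action space vector under
        # the online network; the argmax picks the target action space.
        best_a = q_online(s_next).argmax(dim=1)
        # c22) standard cumulative reward: evaluate that action space under
        # the initial network with parameters theta_t'.
        q_std = q_target(s_next).gather(1, best_a.unsqueeze(1)).squeeze(1)
        y = r + gamma * q_std  # Bellman target Y

    # c23) mean square error between the online estimate and the target Y.
    q_pred = q_online(s).gather(1, a_idx.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_pred, y)

    optimizer.zero_grad()
    loss.backward()  # backward training of the real-time network (see c24)
    optimizer.step()
    return loss.item()
```

Selecting the action with the real-time network but evaluating it with the initial network is the double-estimator reading of the Bellman target reconstructed above.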
c24) Update the network parameters of the real-time training network model according to the fitted loss function, and, when the number of updates satisfies the parameter replacement period, replace the network parameters of the initial reinforcement learning network model with those of the real-time training network model.

Specifically, the mean square error drives backward training of the real-time training network model, adjusting its network parameters so as to update the model. Besides updating these parameters in real time, this step also keeps a running count of the updates; once the accumulated count reaches the parameter replacement period, the network parameters of the real-time training network model at that moment replace those of the initial reinforcement learning network model, thereby updating the initial reinforcement learning network model.

In this embodiment, the parameter replacement period is preferably reached when the update count accumulates from 0 to a set value, which can be chosen from historical experience, for example 50 updates.
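Continuing the sketch above, the parameter replacement period might be enforced as follows; replay_batches, opt and the γ value of 0.9 are hypothetical stand-ins.

```python
REPLACE_PERIOD = 50  # set from historical experience, e.g. 50 updates

update_count = 0
for batch in replay_batches:  # hypothetical source of training batches
    fit_step(q_online, q_target, batch, gamma=0.9, optimizer=opt)
    update_count += 1
    if update_count == REPLACE_PERIOD:
        # replace the initial model's parameters with the real-time model's
        q_target.load_state_dict(q_online.state_dict())
        update_count = 0  # the count accumulates from 0 again
```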
c25) Determine the initial reinforcement learning network model after parameter replacement as the target reinforcement learning network model.

The optional embodiments above detail how online courses are divided into course categories, and how a reinforcement learning network model whose action space dimensions are the resulting course categories is trained. Dividing courses into categories effectively shrinks the pool of online courses to be recommended, thereby lowering the action space dimension in reinforcement learning and ensuring that reinforcement learning can be applied effectively in large-scale data recommendation scenarios.
Embodiment 2
FIG. 2 is a schematic flowchart of a course recommendation method provided in Embodiment 2 of the present application. This embodiment builds on the embodiment above. In this embodiment, determining the current state of the target user according to the target user's historical course browsing data is specified as: performing word segmentation on the target user's first historical course browsing data within a set time period to determine the browsed-course vectors of the courses the target user has browsed; and determining the target user's current state as the average vector of these browsed-course vectors.

Likewise, this embodiment specifies obtaining candidate course categories that satisfy the screening conditions according to the current state and the pre-trained target reinforcement learning network model as: inputting the current state into the target reinforcement learning network model and outputting, through it, the output number of candidate vectors as the action space, each candidate vector identifying one course category; for each course category, determining the category's cumulative reward value through a given cumulative reward value model combined with the current state and the model's current network parameters; and ranking the course categories by cumulative reward value, taking those within the first set ranking as candidate course categories.
In addition, this embodiment specifies screening the target courses from the online courses of each candidate course category as: for each candidate course category, obtaining the candidate course category's cluster center vector; determining the distance between each course vector in the cluster associated with that cluster center vector and the cluster center vector; ranking the course vectors by distance and taking those within the second set ranking as courses to be recommended; and selecting, from the courses to be recommended, target courses that satisfy the fine-grained screening conditions and pushing them to each target user respectively.
As shown in FIG. 2, the course recommendation method provided in Embodiment 2 of the present application specifically includes the following operations:
S201. Perform word segmentation on the target user's first historical course browsing data within a set time period, and determine the browsed-course vectors of the courses the target user has browsed.

Exemplarily, the word-vector model word2vec can again be used for this analysis, yielding the course vectors of all online courses the user has browsed within the set time period; this step records them as browsed-course vectors.

S202. Determine the current state of the target user as the average vector of the browsed-course vectors.

This step amounts to one realization of the current state: averaging the browsed-course vectors yields an average vector characterizing the target user's current state.
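A minimal sketch of S201 and S202, assuming gensim's word2vec implementation (parameter names follow gensim 4.x) and hypothetical course IDs:

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical browsing sequences: course IDs are treated as the "words"
# of a sentence, one sentence per user.
sequences = [["course_12", "course_7", "course_33"],
             ["course_7", "course_90"]]

w2v = Word2Vec(sentences=sequences, vector_size=64, window=5, min_count=1)

def current_state(browsed_courses):
    """S201/S202: average the browsed-course vectors within the set time
    period into one vector characterizing the user's current state."""
    vecs = [w2v.wv[c] for c in browsed_courses if c in w2v.wv]
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.mean(vecs, axis=0)
```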
S203. Input the current state into the target reinforcement learning network model, and output, through the target reinforcement learning network model, the output number of candidate vectors as the action space.

This step is the concrete application of the target reinforcement learning network model: for the input current state it outputs multiple action spaces, each represented as a vector and recorded as a candidate vector, where each candidate vector identifies one course category; that is, each output action space identifies one course category. The number of outputs therefore equals the K value determined when the courses were divided into categories with the clustering algorithm.

S204. For each course category, determine the course category's cumulative reward value through a given cumulative reward value model, combined with the current state and the current network parameters of the target reinforcement learning network model.

As noted, the category division above yields each course category represented by a cluster center vector, and this step operates on each such category. The cumulative reward value here is preferably computed with the cumulative reward value function described above; the information it requires is the candidate vector (cluster center vector) of the course category, the current network parameters of the target reinforcement learning network model, and the next state resulting from this reinforcement learning step.

S205. Rank the course categories by cumulative reward value, and take those within the first set ranking as candidate course categories.

The steps above determine the cumulative reward value of each course category; this step ranks these values, so that several top-ranked course categories can be selected as candidate course categories, which preserves diversity in the coarse-grained screening result. The first set ranking is preferably 2, i.e. the two top-ranked course categories.
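A sketch of S203 to S205 under the simplifying assumption that the model's K outputs are read directly as the categories' cumulative reward values; q_online and the index-to-category mapping are illustrative names, not fixed by the embodiment.

```python
import torch

def candidate_categories(q_online, state, top_m=2):
    """S203-S205: score the K course categories for the current state
    (a 1-D tensor) and keep the top_m by cumulative reward value
    (the embodiment prefers m = 2)."""
    with torch.no_grad():
        q_values = q_online(state.unsqueeze(0)).squeeze(0)  # K values
    return torch.topk(q_values, k=top_m).indices.tolist()   # category ids
```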
S206 to S209 below give the specific implementation of the fine-grained screening of target courses.

S206. For each candidate course category, obtain the candidate course category's cluster center vector.

Note that this step, like S207 and S208 below, operates per candidate course category; a candidate course category also corresponds to the cluster center of one cluster, and this step retrieves that cluster center's vector, i.e. the candidate vector of the candidate course category relative to the target reinforcement learning network model.

S207. Determine the distance between each course vector in the cluster associated with the cluster center vector and the cluster center vector.

In this embodiment, a cluster contains at least one online course belonging to its cluster center, each represented by a corresponding course vector. This step computes the distance from each course vector to the cluster center vector.

S208. Rank the course vectors by distance, and take those within the second set ranking as courses to be recommended.

This step ranks the distance values obtained above, yielding several top-ranked course vectors; the top 20 are preferably taken as courses to be recommended, i.e. the second set ranking is preferably 20.

S209. Select, from the courses to be recommended, target courses that satisfy the fine-grained screening conditions and push them to each target user respectively.

This step aggregates the courses to be recommended of each candidate course category, and can re-rank them according to the given fine-grained screening conditions, from which an appropriate number of online courses are selected as target courses. The fine-grained screening conditions can be chosen for the specific application scenario; once chosen, they fix the reference dimensions used to rank the courses to be recommended.

In one implementation given in this embodiment, 4 online courses can be randomly selected from the courses to be recommended of each candidate course category as target courses and pushed to the target user. The pushed target courses can be displayed on the home page of the user's client for the user to browse.
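The fine-grained screening of S206 to S209 might look as follows; the course-to-vector mapping and the random selection of 4 courses as the fine-grained condition are illustrative choices taken from this embodiment's example.

```python
import random
import numpy as np

def recommend(courses, centers, candidate_ids, top_n=20, push_n=4):
    """S206-S209: within each candidate category, rank courses by distance
    to the cluster center vector, keep the nearest top_n (e.g. 20) as
    courses to be recommended, then push push_n of them (e.g. 4, randomly).

    courses: dict mapping category id -> {course_name: course_vector}
    centers: dict mapping category id -> cluster center vector
    """
    pushed = []
    for cid in candidate_ids:
        center = centers[cid]
        ranked = sorted(courses[cid].items(),
                        key=lambda kv: np.linalg.norm(kv[1] - center))
        shortlist = [name for name, _ in ranked[:top_n]]
        pushed.extend(random.sample(shortlist, min(push_n, len(shortlist))))
    return pushed
```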
The course recommendation method provided in Embodiment 2 of the present invention concretizes the determination of the target user's current state, the determination of candidate course categories, and the screening of target courses. The method mainly adopts a reinforcement learning network model so that the online courses recommended to target users can bring long-term benefits to the online education platform; at the same time, by reducing the dimensionality of the model's output action space, that is, by ensuring that the number of output action spaces equals only the number of course categories of the online courses, it solves the problem that reinforcement learning cannot cope with large-scale data processing, thereby achieving effective recommendation of online courses to the user side and improving user experience.

Embodiment 3
FIG. 3 is a structural block diagram of a course recommendation apparatus provided in Embodiment 3 of the present application; the apparatus is suited to recommending the online courses of an online teaching platform to users. The apparatus can be implemented in hardware and/or software, and is generally integrated in computer equipment. As shown in FIG. 3, the apparatus includes an information determination module 31, a candidate determination module 32 and a target recommendation module 33.

The information determination module 31 is configured to determine the current state of the target user according to the target user's first historical course browsing data.

The candidate determination module 32 is configured to obtain, according to the current state and a pre-trained target reinforcement learning network model, candidate course categories that satisfy the screening conditions, where the target reinforcement learning network model takes the course categories to which the online courses belong as its action space, and the number of output action spaces equals the total number of course categories.

The target recommendation module 33 is configured to screen a set number of target courses from the online courses of each candidate course category and push them to the target user.

The course recommendation apparatus of Embodiment 3 mainly adopts a reinforcement learning network model so that the online courses recommended to target users can bring long-term benefits to the online education platform; at the same time, by reducing the dimensionality of the output action space, that is, by ensuring that the number of output action spaces equals only the number of course categories, it solves the problem that reinforcement learning cannot cope with large-scale data processing, thereby achieving effective recommendation of online courses to the user side and improving user experience.

Further, the apparatus may also include a course category division module.

The course category division module can specifically be configured to:

obtain the second historical course browsing data of each selected user from a message queue, forming a course browsing sequence for each user;

treat each course browsing sequence as a sentence to be processed, and obtain the course vector of each online course through a word vector division model, forming a course vector set;

cluster the course vector set into the output number of clusters, and determine the cluster center vector of each cluster as one course category, as sketched below.
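A minimal sketch of this module, reusing the word2vec model from the earlier sketch and assuming scikit-learn's KMeans; the K value of 10 is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Course vector set: one vector per online course in the word2vec vocabulary.
course_ids = list(w2v.wv.index_to_key)
course_vectors = np.stack([w2v.wv[c] for c in course_ids])

K = 10  # output number, i.e. the total number of course categories
kmeans = KMeans(n_clusters=K, n_init=10).fit(course_vectors)

cluster_centers = kmeans.cluster_centers_  # one vector per course category
# map each course to the course category (cluster) it belongs to
categories = {c: int(lbl) for c, lbl in zip(course_ids, kmeans.labels_)}
```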
Further, the information determination module 31 can specifically be configured to:

perform word segmentation on the target user's first historical course browsing data within a set time period, and determine the browsed-course vectors of the courses the target user has browsed;

determine the current state of the target user as the average vector of the browsed-course vectors.

Further, the candidate determination module 32 can specifically be configured to:

input the current state into the target reinforcement learning network model, and output, through the target reinforcement learning network model, the output number of candidate vectors as the action space, each candidate vector identifying one course category;

for each course category, determine the course category's cumulative reward value through a given cumulative reward value model, combined with the current state and the current network parameters of the target reinforcement learning network model;

rank the course categories by cumulative reward value, and take those within the first set ranking as candidate course categories.

Further, the target recommendation module 33 can specifically be configured to:

for each candidate course category, obtain the candidate course category's cluster center vector;

determine the distance between each course vector in the cluster associated with the cluster center vector and the cluster center vector;

rank the course vectors by distance, and take those within the second set ranking as courses to be recommended;

select, from the courses to be recommended, target courses that satisfy the fine-grained screening conditions and push them to each target user respectively.

Further, the apparatus may also include a model training module, which can include:

an information initialization unit, configured to record two reinforcement learning network models with the same network structure but different network parameters as the real-time training network model and the initial reinforcement learning network model respectively;

a sample determination unit, configured to construct the training sample set for model training from the course categories identified by the cluster center vectors and the second historical course browsing data of each selected user, where each training sample in the set includes a first state sequence of the user's current state, a target cluster center vector, an instantaneous reward value, and a second state sequence of the next state (see the sketch after this list);

a target obtaining unit, configured to fit the loss function according to the outputs of each training sample under the real-time training network model and the initial reinforcement learning network model respectively, and to obtain the target reinforcement learning network model through backward learning of the fitted loss function.
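The four fields of a training sample could be carried in a simple record such as the following; the field names are illustrative.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """One transition built from a user's second historical browsing data."""
    state: np.ndarray        # first state sequence (user's current state)
    target_center: int       # index of the target cluster center vector
    reward: float            # instantaneous reward value
    next_state: np.ndarray   # second state sequence (next state)
```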
Further, the target obtaining unit can specifically be configured to:

for each training sample, determine the current cumulative reward value of each action space vector output under the real-time training network model for the sample's first state sequence, and determine the maximum current cumulative reward value;

determine the standard cumulative reward value of the first state sequence, under the initial reinforcement learning network model, relative to the target cluster center vector;

fit the loss function according to the maximum current cumulative reward value and the standard cumulative reward value corresponding to each training sample;

update the network parameters of the real-time training network model according to the fitted loss function, and, when the number of updates satisfies the parameter replacement period, replace the network parameters of the initial reinforcement learning network model with those of the real-time training network model;

determine the initial reinforcement learning network model after parameter replacement as the target reinforcement learning network model.
Embodiment 4

FIG. 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application. The computer device includes a processor 40, a memory 41, a display screen 42, an input apparatus 43 and an output apparatus 44. There may be one or more processors 40 in the computer device; one processor 40 is taken as the example in FIG. 4. There may likewise be one or more memories 41; one memory 41 is taken as the example in FIG. 4. The processor 40, memory 41, display screen 42, input apparatus 43 and output apparatus 44 of the computer device can be connected by a bus or in other ways; connection by a bus is taken as the example in FIG. 4. In the embodiments, the computer device may be a computer, a notebook, a smart tablet or the like.

As a computer-readable storage medium, the memory 41 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the computer device described in any embodiment of the present invention (for example, the information determination module 31, the candidate determination module 32 and the target recommendation module 33 of the course recommendation apparatus). The memory 41 can mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data created through use of the device, and so on. In addition, the memory 41 can include high-speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some instances, the memory 41 can further include memories remotely located relative to the processor 40, which can be connected to the device through a network. Instances of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

The display screen 42 may be a touch-enabled display screen, which can be a capacitive screen, an electromagnetic screen or an infrared screen. In general, the display screen 42 is used to display data according to the instructions of the processor 40, and is also used to receive touch operations applied to the display screen 42 and send the corresponding signals to the processor 40 or other apparatuses.

The input apparatus 43 can be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the display device; it can also be a camera for capturing images or a sound pickup device for capturing audio data. The output apparatus 44 can include audio devices such as a speaker. The specific composition of the input apparatus 43 and the output apparatus 44 can be set according to the actual situation.

The processor 40 executes the various functional applications and data processing of the device, i.e. implements the course recommendation method described above, by running the software programs, instructions and modules stored in the memory 41.

The computer device provided above can be used to execute the course recommendation method provided in any embodiment above, and has the corresponding functions and beneficial effects.
Embodiment 5

Embodiment 5 of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute a course recommendation method, including:

determining the current state of a target user according to the target user's first historical course browsing data;

obtaining, according to the current state and a pre-trained target reinforcement learning network model, candidate course categories that satisfy the screening conditions, where the target reinforcement learning network model takes the course categories to which the online courses belong as its action space, and the number of output action spaces equals the total number of course categories;

screening a set number of target courses from the online courses of each candidate course category and pushing them to the target user.

Of course, for the storage medium containing computer-executable instructions provided in this embodiment of the present invention, the computer-executable instructions are not limited to the operations of the course recommendation method described above, and can also execute the related operations in the course recommendation method provided in any embodiment of the present invention, with the corresponding functions and beneficial effects.
From the above description of the implementations, those skilled in the art can clearly understand that the present application can be implemented by means of software plus the necessary general-purpose hardware, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present application, in essence or in the parts contributing over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk or an optical disk, and includes several instructions for causing a computer device (which can be a robot, a personal computer, a server, a network device or the like) to execute the course recommendation method described in any embodiment of the present application.

It is worth noting that the units and modules included in the course recommendation apparatus above are divided only by functional logic, and the division is not limited to the above, as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for the convenience of distinguishing them from one another, and are not used to limit the protection scope of the present application.

It should be understood that the parts of the present application can be implemented in hardware, software, firmware or combinations thereof. In the implementations above, multiple steps or methods can be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another implementation, any one of the following technologies known in the art, or combinations thereof, can be used: discrete logic circuits with logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gate circuits, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and so on.

Note that the above are only the preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present application. Therefore, although the present application has been described in some detail through the above embodiments, it is not limited to those embodiments; without departing from the concept of the present application, it can also include more other equivalent embodiments, and its scope is determined by the scope of the appended claims.