CN116049557A - Educational resource recommendation method based on multi-mode pre-training model - Google Patents

Educational resource recommendation method based on multi-mode pre-training model

Info

Publication number
CN116049557A
Authority
CN
China
Prior art keywords
data
mode
text
training
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310097847.5A
Other languages
Chinese (zh)
Inventor
王海艳
唐瞻
骆健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310097847.5A priority Critical patent/CN116049557A/en
Publication of CN116049557A publication Critical patent/CN116049557A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an educational resource recommendation method based on a multimodal pre-training model, comprising the following steps: collecting the multimodal data generated by a user during online learning; preprocessing the data of each modality to obtain text data, picture data, audio data, and numeric data; feeding each kind of data into the corresponding single-modality receiving network of the multimodal pre-training model; and performing single-modality training and cross-modality training. Mask-prediction pre-training masks a certain proportion of the original input and predicts the masked part, while a cross-alignment training task predicts the features of one modality from the features of another. The pre-trained multimodal model can directly accept multimodal input; a long short-term memory (LSTM) network learns the multimodal representation vectors, which are projected through a fully connected network, and the output is the probability that the user needs a resource, used as the ranking basis for the recommendations displayed to the user. The method improves the comprehensiveness and accuracy of online educational resource recommendation.

Description

Educational resource recommendation method based on a multimodal pre-training model
Technical Field
The invention relates to an educational resource recommendation method based on a multimodal pre-training model, which learns a user's multimodal information to recommend suitable educational resources, and belongs to the technical field of educational recommendation.
Background
The rapid development of the internet has brought new data modalities and industrial forms, and all kinds of data and information are growing exponentially, making it difficult for a user to find the specific data that meets his or her needs. A recommendation system aims to mine user preferences and provide possibly interesting items or services by combining a user's profile information with his or her historical interaction records, helping the user retrieve relevant items from a vast content library. Recommendation systems have become a key technology for relieving online information overload and eliminating information silos, and an important way to improve the quality of information services.
With the rapid development of online course technology and internet technology, massive open online courses (MOOCs) have grown rapidly in recent years, attracting millions of online users. MOOCs still face many new challenges, and high dropout rates are among the most serious: statistics show that course completion rates on MOOC platforms are often below 5%. When a large number of learning resources and activities are presented on the internet at the same time, learners are inevitably overwhelmed by excessive information and find it difficult to quickly locate resources that suit them. How to reduce the dropout rate and realize personalized recommendation for users is therefore a central research problem in the field of online learning recommendation.
According to the inventors' investigation, researchers at home and abroad have carried out extensive studies on recommendation methods and obtained notable results. Classified by recommendation target, there are group recommendation and personalized recommendation; classified by technique, there are methods based on collaborative filtering (CF), on machine learning (ML), and on deep learning (DL). CF-based user recommendation matches, by similarity, resources resembling those in the user's historical interaction records, or finds similar users and recommends the resources those users consumed; it suffers from data sparsity and cold start. Some studies show that ML-based methods can produce effective recommendations even when user interaction data are sparse, but ML-based demand prediction requires extensive feature engineering before model training, which is time-consuming and labor-intensive. With the development and maturity of deep learning, researchers have also studied deep-learning-based recommendation and obtained certain results; however, existing deep-learning recommendation methods generally rely on single-modality data, and their prediction accuracy still needs improvement. Unlike e-commerce and multimedia recommendation, which have been developed for many years, the fusion of online education and recommendation systems is still at an early stage and faces many significant challenges: online course platforms usually have only implicit interaction behavior data (such as students' clicks, watch time, favorites, and comments on a course) and lack explicit ratings, so mainstream recommendation algorithms are difficult to apply directly.
A large body of research shows that pre-training a neural network model on massive unlabeled corpora can learn general language representations beneficial to downstream NLP tasks and avoid training a new model from scratch. Pre-training has come to be regarded as an efficient strategy for training deep neural network models.
Disclosure of Invention
In order to solve the above problems, the main purpose of the invention is to provide an educational resource recommendation method for online education based on a multimodal pre-training model which, by modeling different multimodal interaction behavior data, facilitates the subsequent learning and representation of user vectors and course vectors, thereby improving the service quality of online educational resource recommendation.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
In a first aspect, the present invention provides an educational resource recommendation method based on a multimodal pre-training model, comprising:
step 1, acquiring multimodal data in an online education system, wherein the multimodal data include numeric class data, text class data, non-text user interaction class data, video class data, picture class data, and audio class data;
step 2, preprocessing the multimodal data to obtain preprocessed text data, picture data, audio data, and numeric data;
step 3, providing a multimodal pre-training model comprising a text encoder, a picture encoder, an audio encoder, and a multimodal cross encoder based on co-attention Transformers;
inputting the preprocessed text data into the text encoder for mask pre-training to obtain text vectors; inputting the preprocessed picture data into the picture encoder for mask pre-training to obtain picture vectors; inputting the preprocessed audio data into the audio encoder for mask pre-training to obtain audio vectors;
inputting the text vectors, the picture vectors, and the audio vectors into the multimodal cross encoder for multimodal alignment pre-training to obtain a multimodal representation vector;
step 4, inputting the multimodal representation vector and the preprocessed numeric class data into a long short-term memory (LSTM) network for learning to obtain a user preference vector;
step 5, inputting the user preference vector into a fully connected network and projecting it to output an educational resource recommendation probability;
and step 6, determining an educational resource recommendation result according to the educational resource recommendation probability.
In some embodiments, in step 1, the multimodal data are a series of multimodal combinations generated while the user learns a learning resource, one multimodal combination comprising the user's operations, video, pictures, audio, text, and explicit numeric values at a given moment.
In some embodiments, the method for preprocessing numeric class data includes:
dividing the numeric data into discrete and continuous types. A discrete numeric score R, on a five-point, ten-point, or percentage scale, is kept directly. The user's learning duration T_u is mapped based on a threshold, with T_middle being the median learning duration:
T'_u = 1 if T_u ≥ T_middle, and T'_u = 0 otherwise.
A continuous value such as the user's points S_u is normalized, mapping it to the [0, 1] interval:
S'_u = S_u / S_max,
where S_max is the maximum points value over all users.
In some embodiments, the method for preprocessing text class data includes: splitting text into sentences with the open-source tool nltk, first reading all files into memory, storing them as lists divided by article, then splitting each article into sentences and writing the sentences back to files; and processing the corpus into tokens in two stages. First, a BasicTokenizer performs unicode normalization, punctuation splitting, lowercasing, character-level splitting of Chinese, and accent-mark removal, finally returning an array of words. Second, a WordpieceTokenizer further splits each word obtained in the previous step; WordPiece segments words using common sub-units such as prefixes and suffixes.
In some embodiments, the method for preprocessing non-text user interaction class data includes:
one-hot encoding each kind of behavior according to learning time and occurrence count, marking a behavior as 1 if it occurred and 0 otherwise; then dividing the timeline into periods and accumulating counts per period, combining them into triples of the form <time period, behavior, count>; and finally tokenizing the triples so they can be handled as text data.
In some embodiments, the preprocessing of video class data and picture class data is as follows:
video data are first processed into pictures; to extract visual features frame by frame, the video stream is treated as a series of images and one frame per second is extracted, converting the video into picture data for processing;
the preprocessing of picture data includes: scaling and regularizing each picture with a pre-trained ResNeXt-50 model, preparing each image as three RGB channels; the three channels are converted into a 2048-dimensional vector using weights pre-trained on the ImageNet dataset.
In some embodiments, the method for preprocessing audio class data includes:
preprocessing the audio data with Librosa, performing pre-emphasis, framing, and windowing before the speech signal is analyzed and processed.
Pre-emphasis of the speech signal s(n) is realized by a digital filter whose input-output relation is
s'(n) = s(n) - a·s(n-1),
where a is the pre-emphasis coefficient. The speech signal is framed with 30 ms per frame, and each frame is windowed with a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1,
where w(n) is the Hamming window function and N is the window length; the windowed frame is obtained by multiplying the frame by w(n). For each short-time analysis window, the corresponding spectrum is obtained via FFT; the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; and cepstral analysis on the Mel spectrum yields the Mel-frequency cepstral coefficients (MFCC), i.e., the features of that speech frame.
In some embodiments, step 6, determining the educational resource recommendation result according to the educational resource recommendation probability, includes:
sorting the educational resources in descending order of predicted probability and recommending a preset number of top-ranked resources to the user.
In a second aspect, the invention provides an educational resource recommendation device based on a multimodal pre-training model, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
In a third aspect, the present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
Beneficial effects: the educational resource recommendation method and device based on the multimodal pre-training model provided by the invention model different multimodal interaction behavior data, which facilitates the subsequent learning and representation of user vectors and course vectors and thereby improves the service quality of online educational resource recommendation.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the educational resource recommendation method based on a multimodal pre-training model of the present invention;
FIG. 2 is a flow chart of the whole recommendation process in a preferred embodiment of the educational resource recommendation method based on a multimodal pre-training model of the present invention;
FIG. 3 is a schematic diagram of a preferred embodiment of the educational resource recommendation method based on a multimodal pre-training model of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In the description of the present invention, the descriptions of the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Example 1
An educational resource recommendation method based on a multimodal pre-training model, comprising:
step 1, acquiring multimodal data in an online education system, wherein the multimodal data include numeric class data, text class data, non-text user interaction class data, video class data, picture class data, and audio class data;
step 2, preprocessing the multimodal data to obtain preprocessed text data, picture data, audio data, and numeric data;
step 3, providing a multimodal pre-training model comprising a text encoder, a picture encoder, an audio encoder, and a multimodal cross encoder based on co-attention Transformers;
inputting the preprocessed text data into the text encoder for mask pre-training to obtain text vectors; inputting the preprocessed picture data into the picture encoder for mask pre-training to obtain picture vectors; inputting the preprocessed audio data into the audio encoder for mask pre-training to obtain audio vectors;
inputting the text vectors, the picture vectors, and the audio vectors into the multimodal cross encoder for multimodal alignment pre-training to obtain a multimodal representation vector;
step 4, inputting the multimodal representation vector and the preprocessed numeric class data into a long short-term memory (LSTM) network for learning to obtain a user preference vector;
step 5, inputting the user preference vector into a fully connected network and projecting it to output the educational resource recommendation probability;
and step 6, determining the educational resource recommendation result according to the educational resource recommendation probability.
The educational resource recommendation method based on the multimodal pre-training model comprises the following basic steps:
Raw data acquisition: a user's operations, video, pictures, audio, text, and explicit numeric values form a multimodal combination at a given moment; while the user learns a learning resource, a series of such multimodal combinations, i.e., a multimodal combination queue, is generated.
Basic data preprocessing: the various multimodal data and interaction records in the raw input are normalized by different methods; the multimodal data include numeric class data, text class data, non-text interaction class data, picture class data, and audio class data.
Numeric class data are generally explicit scores, divided into discrete and continuous values. Discrete values are defined on a fixed domain and kept directly, while the learning duration of a specific course is mapped based on a threshold so as to reflect the user's preference for the course. Continuous values, such as accumulated learning points, are normalized and mapped to the [0, 1] interval.
Non-numeric text, such as course knowledge-point text, is more complex and needs finer encoding: all knowledge-point text data are collected and converted into vector-type data that the subsequent neural networks can accept as input; the text is first segmented into words and then fed into the text pre-training module of the multimodal pre-training model.
Interaction data must be converted into numeric form. For "user-course" interaction behaviors such as commenting and favoriting, an occurrence is recorded directly as 1 and a non-occurrence as 0. The timeline is then divided into periods and counts are accumulated per period to form triples of the form <time period, behavior, count>. The triples are then tokenized and fed into the interaction pre-training module of the multimodal pre-training model.
The video class is processed into picture frames and then converted into picture class data for processing.
Picture class data are first read and transformed; the transformations include changing brightness, contrast, and saturation, enlarging the target, and random cropping. The image is then resized, its data types are remapped, and it is finally normalized to suit neural network input. A pre-trained ResNeXt-50 model is used for pictures and video frames: each frame is scaled and regularized, and each image is prepared as three RGB channels. The pre-trained model converts the three channels into a 2048-dimensional vector using weights pre-trained on the ImageNet dataset.
Speech class data are processed with Librosa, mainly extracting acoustic features for audio assessment, including the zero-crossing rate, spectral centroid, spectral roll-off, Mel-frequency cepstral coefficients, and chroma frequencies.
Through the processing above, the raw input data are converted into a format the subsequent neural networks can accept, which facilitates training of the pre-training model, effectively removes noise from the raw data, reduces modality differences, and yields effective representations for model training.
The text-interaction class, the picture-video class, and the speech class are pre-trained separately: each is encoded by its own single-modality encoder and then fed into a cross-modal Transformer for multimodal feature fusion.
The multimodal data encoded by the encoders follow the masked language modeling task of standard BERT: about 15% of the words, image regions, and audio inputs are masked, and the model must reconstruct them given the remaining inputs. Cross pre-training then mainly uses modality alignment, predicting the content of one modality from another.
After pre-training, the output multimodal representation vector fuses the user's preferences for particular educational resources; it is learned with an LSTM, and the final output is classified through a fully connected network to obtain scores, which are displayed as results.
In some embodiments, as shown in FIG. 1 and FIG. 2, the educational resource recommendation method based on the multimodal pre-training model of a preferred embodiment of the present invention includes the following steps:
step 1: in an online educational system, a data collection module collects various modal data and divides the various single modal data, such as course shots, comments, barrages, course scores, notes, audio lectures, text material, and the like.
The user's learning record within one resource is shown in Table 1:
TABLE 1
2022-12-01 8:00 | Open video (lesson on limits) | video address
2022-12-01 8:10 | Pause | video frame
2022-12-01 8:22 | Exit | -
2022-12-01 8:25 | Open limit exercises | exercise text
2022-12-01 8:40 | Open exercise answers | answer text
2022-12-01 8:45 | Upload question | question picture
2022-12-01 8:47 | Forum comment | comment text
... | ... | ...
Other types of data formats are shown in Table 2:
TABLE 2
Video score | 4.5
Exercise difficulty | 3
User viewing duration | 6:20:00
User learning points | 65
After data collection, a training data set is constructed and data preprocessing is carried out for the data of each modality.
Step 2: for explicit numerical data, the data is divided into discrete and continuous types, and the discrete numerical score R, which is generally five-component, ten-component or one-percent, can be directly reserved. For a pair ofFor a duration T of user learning u Needs to be mapped based on a threshold value, T middle Is the median of the learning duration. Using the formula
Figure BDA0004072245390000101
Mapping is performed. Continuity value such as user integral S u Treatment by normalization, using the formula +.>
Figure BDA0004072245390000102
S max The maximum value is integrated for the entire user.
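A minimal sketch of these two mappings, assuming the inputs arrive as NumPy arrays and reading the threshold mapping as the median-based indicator given above (all names illustrative):

```python
import numpy as np

def preprocess_numeric(duration_s, points_s, scores_s):
    """Preprocess explicit numeric signals into model-ready values.

    duration_s, points_s, scores_s: 1-D arrays of learning durations
    (seconds), user points, and discrete scores.
    """
    # Discrete scores (5-/10-/100-point scales) are kept as-is.
    scores = scores_s.astype(np.float32)

    # Duration: threshold mapping against the median T_middle,
    # T'_u = 1 if T_u >= T_middle else 0.
    t_middle = np.median(duration_s)
    duration = (duration_s >= t_middle).astype(np.float32)

    # Continuous points: normalize to [0, 1] by the global maximum,
    # S'_u = S_u / S_max.
    points = points_s / points_s.max()

    return scores, duration, points
```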
Step 3: the text data such as exercises, answers, comments and barrages are characterized in that an open source tool nltk is utilized to divide texts, all files are read into a memory, list is used for dividing the texts and storing the texts, segment the sentences into the files according to the ratio of train to loss. The corpus is processed into tokens, firstly, the corpus is a basic token, the main operations are unicode conversion, punctuation mark segmentation, lowercase conversion, chinese character segmentation, accent symbol removal and other operations, and finally, the array of the words is returned; secondly, the wordbietokenizer divides each word obtained in the previous step by wordbiere, and wordbiere divides words by using more common word roots such as prefix and suffix of words, for example, the "lobida rule" is decomposed into [ "lobida", "rule" ] so that the size of a dictionary becomes acceptable.
Step 4: the user interaction data is shown in table 1, firstly, performing single-hot coding on various behaviors according to learning time and statistics frequency, then dividing a small learning time period according to total learning time length, taking the week as the small learning time length, and then accumulating to obtain sequences such as < first week, opening video, 40>, < first week, fast forwarding, 68>, < first week, pause, 20>, < first week, exit, 52> … … < eighth week, and checking example questions, 18>, wherein the sequences only need word segmentation according to triples to perform word segmentation, and the corresponding sequences are converted into token ids in a dictionary.
Video class data are first processed into pictures: to extract visual features frame by frame, the video stream is treated as a series of images and one frame per second is extracted. To preserve the validity and integrity of each frame, a pre-trained ResNeXt-50 model is used: each frame is scaled and regularized, and each image is prepared as three RGB channels. The pre-trained model converts the three channels into a 2048-dimensional vector using weights pre-trained on the ImageNet dataset. After extraction, the video frame sequence is transformed by the ResNeXt-50 model into a sequence of 2048-dimensional frame features, and the average of these feature vectors is used to represent the visual information of the whole video. In this way, videos with different numbers of frames are all converted into a single 2048-dimensional feature vector.
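A sketch of the frame-feature extraction with torchvision (assuming torchvision 0.13 or later for the weights enum; the one-frame-per-second sampling and mean pooling follow the description above):

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnext50_32x4d, ResNeXt50_32X4D_Weights

weights = ResNeXt50_32X4D_Weights.IMAGENET1K_V1
model = resnext50_32x4d(weights=weights)
model.fc = torch.nn.Identity()   # keep the 2048-d pooled features
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def video_feature(frames):
    """frames: list of PIL RGB images sampled at one frame per second.
    Returns a single 2048-d vector for the whole video."""
    batch = torch.stack([preprocess(f) for f in frames])
    feats = model(batch)          # (num_frames, 2048)
    return feats.mean(dim=0)      # average over frames
```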
Step 5: audio class data is pre-processed with Librosa for audio signals, which must be pre-emphasized, framed, windowed, etc., prior to analysis and processing of the audio signals. The purpose of these operations is to eliminate the impact on the quality of the speech signal due to aliasing, higher harmonic distortion, high frequencies, etc. caused by the human vocal organ itself and by the equipment that collects the speech signal. The method ensures that the signals obtained by the subsequent voice processing are more uniform and smoother as far as possible, provides high-quality parameters for signal parameter extraction, and improves the voice processing quality. The pre-emphasis of the voice signal s (n) is realized by a digital filter, and the input-output relation of the pre-emphasis network is that
Figure BDA0004072245390000111
Where a is the pre-emphasis coefficient, the method takes a=0.9375; taking 30ms as one frame for framing of the voice signal; the method is used for windowing by using a Hamming window immediately after the windowing, and the specific formula is as follows:
Figure BDA0004072245390000112
wherein w (N) is a voice signal subjected to Hamming window windowing, N is the order of a filter, and N represents the total length of a window function; for each short-time analysis window, obtaining a corresponding frequency spectrum through FFT; the above frequency spectrum is passed through a Mel filter bank to obtain Mel frequency spectrum; cepstrum analysis (taking the logarithm, making inverse transformation, the actual inverse transformation is generally realized by DCT discrete cosine transformation, taking the 2 nd to 13 th coefficients after DCT as MFCC coefficients) is carried out on the Mel frequency spectrum, and Mel frequency cepstrum coefficient MFCC is obtained, which is the characteristic of the frame of voice.
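This chain maps onto Librosa roughly as follows; the sketch assumes a = 0.9375 and 30 ms frames as above, with the hop length and file path as illustrative parameters:

```python
import librosa

def extract_mfcc(path, a=0.9375, frame_ms=30, hop_ms=10):
    """Return MFCC features (12 coefficients per 30 ms frame)."""
    y, sr = librosa.load(path, sr=None)
    y = librosa.effects.preemphasis(y, coef=a)   # s'(n) = s(n) - a*s(n-1)
    n_fft = int(sr * frame_ms / 1000)            # 30 ms frame length
    hop = int(sr * hop_ms / 1000)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
        hop_length=hop, window="hamming",
    )
    return mfcc[1:13]   # keep DCT coefficients 2..13, drop the energy term
```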
After each kind of single-modality data has been preprocessed, it is input to the multimodal pre-training model for pre-training.
Step 6: the pre-trained input consists of three parallel BERT style models, running on the image area, text segment and audio, respectively. Each stream is made up of a series of transducer blocks (TRMs) and a new common attention transducer layer (Co-TRMs) to enable information exchange between modalities. Given an image I represented as a set of region features v 1 ,v 2 ,...v τ Text input w 1 ,w 2 ,...w t And audio input s 1 ,s 2 ,...s n The model output is ultimately represented as h v1 ,h v2 ,...h ,h w1 ,h w2 ,...h wT And h s1 ,h s2 ,...h sn . The exchange between modality streams is limited to only between specific layers and the individual modality features interact with more processing before, i.e. the selected visual features are already quite advanced and require a comparatively limited context aggregation compared to words in sentences. Given intermediate vision, language and audio representations at the joint attention TRM layer
Figure BDA0004072245390000121
And->
Figure BDA0004072245390000122
The module calculates Q, K and V matrices as in a standard transducer block. However, K and V for each modality are passed as inputs to the multi-headed attention block for the other modality.
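A minimal sketch of this K/V exchange using PyTorch's MultiheadAttention, reduced to two modality streams for brevity (the three-stream case pairs each modality with the others in the same way; names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Co-attention TRM: each stream attends with the OTHER stream's K/V."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_b = nn.LayerNorm(d_model)

    def forward(self, h_a, h_b):
        # Queries come from one modality, keys/values from the other.
        out_a, _ = self.attn_a(query=h_a, key=h_b, value=h_b)
        out_b, _ = self.attn_b(query=h_b, key=h_a, value=h_a)
        return self.norm_a(h_a + out_a), self.norm_b(h_b + out_b)
```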
Step 7: the first pre-training task, the mask language modeling task, is performed after the pre-training model is entered, following the mask language modeling task in standard bert—masking 15% of the word and image region inputs and letting the model reconstruct them given the remaining inputs. In the whole process, 90% of the probability of the image features of the covered image area is blocked, and 10% of the probability remains unchanged. The masking text input is handled in the same way as BERT. The model does not directly regression mask feature values, but predicts semantic class distributions for the corresponding image regions. The output of the pre-trained object detection model is used as the real label. The aim is to minimize the KL divergence between these two distributions.
Step 8: the second pre-training task is a multi-modal alignment task, with the model presented as image-text-audio pairs: and it must be predicted whether the image, text and audio are aligned, i.e.: whether the text describes an image, whether the audio illustrates the content of the image. Will output h IMG 、h CLS And h AUD As an overall representation of visual, text, and language inputs. The global representation is calculated as h using a cross-transducer structure IMG 、h CLS And h AUD The elements in between are multiplied one by one and a fully connected layer is added to predict whether the image, text and audio are aligned because the dataset only includes aligned image-text-audio pairs, so to generate negative sample pairs, the image or text is randomly replaced with another, using cross entropy as a loss function.
Step 9: after pre-training, a multi-mode pre-training model is obtained, the characteristic representation of multi-mode data can be directly obtained, the multi-mode data is input into an LSTM network for learning, then a fully-connected neural network is input, and the score of the knowledge point resource is output. And (5) inputting the output of the pre-training model into a neural network for classification, and obtaining scores for recommendation.
Example 2
In a second aspect, the present embodiment provides an educational resource recommendation device based on a multimodal pre-training model, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is operative according to the instructions to perform the steps of the method according to embodiment 1.
Example 3
In a third aspect, the present embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method described in embodiment 1.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art may make various modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the invention.

Claims (10)

1. An educational resource recommendation method based on a multimodal pre-training model, characterized by comprising the following steps:
step 1, acquiring multimodal data in an online education system, wherein the multimodal data include numeric class data, text class data, non-text user interaction class data, video class data, picture class data, and audio class data;
step 2, preprocessing the multimodal data to obtain preprocessed text data, picture data, audio data, and numeric data;
step 3, providing a multimodal pre-training model comprising a text encoder, a picture encoder, an audio encoder, and a multimodal cross encoder based on co-attention Transformers;
inputting the preprocessed text data into the text encoder for mask pre-training to obtain text vectors; inputting the preprocessed picture data into the picture encoder for mask pre-training to obtain picture vectors; inputting the preprocessed audio data into the audio encoder for mask pre-training to obtain audio vectors;
inputting the text vectors, the picture vectors, and the audio vectors into the multimodal cross encoder for multimodal alignment pre-training to obtain a multimodal representation vector;
step 4, inputting the multimodal representation vector and the preprocessed numeric class data into a long short-term memory (LSTM) network for learning to obtain a user preference vector;
step 5, inputting the user preference vector into a fully connected network and projecting it to output an educational resource recommendation probability;
and step 6, determining an educational resource recommendation result according to the educational resource recommendation probability.
2. The method of claim 1, wherein in step 1 the multimodal data are a series of multimodal combinations generated while a user learns a learning resource, one multimodal combination comprising the user's operations, video, pictures, audio, text, and explicit numeric values at a given moment.
3. The method of claim 1, wherein the preprocessing method of the numeric class data comprises:
dividing the numeric class data into discrete and continuous types, wherein a discrete numeric score R, on a five-point, ten-point, or percentage scale, is kept directly; mapping the user's learning duration T_u based on a threshold, with T_middle being the median learning duration, as
T'_u = 1 if T_u ≥ T_middle, and T'_u = 0 otherwise;
and normalizing a continuous value such as the user's points S_u, mapping it to the [0, 1] interval as
S'_u = S_u / S_max,
where S_max is the maximum points value over all users.
4. The method of claim 1, wherein the preprocessing method of the text class data comprises: splitting the text into sentences with the open-source tool nltk, first reading all files into memory, storing them as lists divided by article, then splitting each article into sentences and writing the sentences to files; and processing the corpus into tokens in two stages: first, a BasicTokenizer performs unicode normalization, punctuation splitting, lowercasing, character-level splitting of Chinese, and accent-mark removal, finally returning an array of words; second, a WordpieceTokenizer splits each word obtained in the previous step, with WordPiece segmenting words using common sub-units such as prefixes and suffixes.
5. The method of claim 1, wherein the preprocessing method of the non-text user interaction class data comprises:
one-hot encoding each kind of behavior according to learning time and occurrence count, marking a behavior as 1 if it occurred and 0 otherwise; dividing the timeline into periods and accumulating counts per period into triples of the form <time period, behavior, count>; and tokenizing the triples to convert them into text data.
6. The method according to claim 1, wherein the preprocessing method of the video class data and the picture class data comprises:
processing the video data into pictures: to extract visual features frame by frame, treating the video stream as a series of images and extracting one frame per second, converting the video into picture data for processing;
and preprocessing the picture data by scaling and regularizing each picture with a pre-trained ResNeXt-50 model, preparing each image as three RGB channels, and converting the three channels into a 2048-dimensional vector using weights pre-trained on the ImageNet dataset.
7. The method of claim 1, wherein the preprocessing method of the audio class data comprises: preprocessing the audio class data with Librosa;
performing pre-emphasis, framing, and windowing before the speech signal is analyzed and processed:
pre-emphasis of the speech signal s(n) is realized by a digital filter whose input-output relation is
s'(n) = s(n) - a·s(n-1),
where a is the pre-emphasis coefficient; the speech signal is framed with 30 ms per frame, and each frame is windowed with a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1,
where w(n) is the Hamming window function and N is the window length; for each short-time analysis window, the corresponding spectrum is obtained via FFT; the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; and cepstral analysis on the Mel spectrum yields the Mel-frequency cepstral coefficients (MFCC), i.e., the features of that speech frame.
8. The method of claim 1, wherein step 6 of determining the educational resource recommendation result based on the educational resource recommendation probability comprises:
sorting the educational resources in descending order of predicted probability and recommending a preset number of top-ranked educational resources to the user.
9. An educational resource recommendation device based on a multimodal pre-training model, characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1 to 8.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1 to 8.
CN202310097847.5A 2023-02-10 2023-02-10 Educational resource recommendation method based on multi-mode pre-training model Pending CN116049557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310097847.5A CN116049557A (en) 2023-02-10 2023-02-10 Educational resource recommendation method based on multi-mode pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310097847.5A CN116049557A (en) 2023-02-10 2023-02-10 Educational resource recommendation method based on multi-mode pre-training model

Publications (1)

Publication Number Publication Date
CN116049557A true CN116049557A (en) 2023-05-02

Family

ID=86125456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310097847.5A Pending CN116049557A (en) 2023-02-10 2023-02-10 Educational resource recommendation method based on multi-mode pre-training model

Country Status (1)

Country Link
CN (1) CN116049557A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702094A (en) * 2023-08-01 2023-09-05 国家计算机网络与信息安全管理中心 Group application preference feature representation method
CN116702094B (en) * 2023-08-01 2023-12-22 国家计算机网络与信息安全管理中心 Group application preference feature representation method
CN117350409A (en) * 2023-12-04 2024-01-05 环球数科集团有限公司 Man-machine dialogue model training system based on machine learning
CN117350409B (en) * 2023-12-04 2024-03-01 环球数科集团有限公司 Man-machine dialogue model training system based on machine learning

Similar Documents

Publication Publication Date Title
Zhu et al. Uncovering the temporal context for video question answering
CN110020437B (en) Emotion analysis and visualization method combining video and barrage
Lippi et al. Argument mining from speech: Detecting claims in political debates
WO2022095380A1 (en) Ai-based virtual interaction model generation method and apparatus, computer device and storage medium
CN113435203B (en) Multi-modal named entity recognition method and device and electronic equipment
Kafle et al. Evaluating the usability of automatically generated captions for people who are deaf or hard of hearing
EP3408766A1 (en) Digital media content extraction natural language processing system
CN116049557A (en) Educational resource recommendation method based on multi-mode pre-training model
Ameisen Building Machine Learning Powered Applications: Going from Idea to Product
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
US11947582B2 (en) Enhanced knowledge delivery and attainment using a question answering system
CN110825867B (en) Similar text recommendation method and device, electronic equipment and storage medium
Dinkov et al. Predicting the leading political ideology of YouTube channels using acoustic, textual, and metadata information
US10083219B2 (en) Enhanced knowledge delivery and attainment using a question answering system
CN113380271B (en) Emotion recognition method, system, device and medium
CN114281948A (en) Summary determination method and related equipment thereof
CN117252259A (en) Deep learning-based natural language understanding method and AI teaching aid system
CN116980665A (en) Video processing method, device, computer equipment, medium and product
Tsujimura et al. Automatic Explanation Spot Estimation Method Targeted at Text and Figures in Lecture Slides.
CN112507115B (en) Method and device for classifying emotion words in barrage text and storage medium
CN111340329A (en) Actor assessment method and device and electronic equipment
CN114547435A (en) Content quality identification method, device, equipment and readable storage medium
CN112347786A (en) Artificial intelligence scoring training method and device
CN111081095A (en) Video and audio teaching platform, analysis subsystem and method, recommendation subsystem and method
Gala et al. Real-time cognitive evaluation of online learners through automatically generated questions

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination