CN118132772A - Server, display device and recommended media asset generation method


Info

Publication number: CN118132772A
Application number: CN202410095422.5A
Authority: CN (China)
Prior art keywords: user, media asset, media, data, model
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 陈昶旭 (Chen Changxu)
Current Assignee: Vidaa Netherlands International Holdings BV
Original Assignee: Vidaa Netherlands International Holdings BV
Application filed by Vidaa Netherlands International Holdings BV
Publication of CN118132772A

Classifications

  • Information Retrieval; Database Structures and File System Structures Therefor

Abstract

Some embodiments of the application provide a server, a display device, and a recommended media asset generation method, wherein the method comprises the following steps: acquiring user data, wherein the user data comprises at least one of a media asset viewing history, a media asset preference record, a media asset viewing time preference, and media asset comments of a user; inputting the user data into a target model to obtain a text description, wherein the text description is used for guiding the acquisition and splicing of media asset segments; and acquiring at least one target media asset segment based on the text description, and splicing the target media asset segments to obtain a recommended media asset. By analyzing the user's personal characteristics and interest preferences and using a deep learning algorithm, the embodiments of the application can automatically select and clip personalized recommended media assets matching the user's taste from a large media asset library. This not only provides a highly customized media content experience for the user but also saves the user's time and effort, so that the user can easily find and enjoy favorite content.

Description

Server, display device and recommended media asset generation method
Technical Field
The application relates to the technical field of recommended media asset generation, and in particular to a server, a display device, and a recommended media asset generation method.
Background
With the rapid growth of the internet and digital media, users' demands for media content are increasingly diversified and personalized. However, current media platforms often simply provide a large amount of content, and users must spend considerable time and effort filtering it to find content of interest.
To reduce the time users spend screening media assets, personalized recommendation systems are widely used. A personalized recommendation system uses the user's personal information and behavior data to provide personalized recommended content through machine learning and data mining algorithms. Such a system can filter and recommend content matching the user's taste based on the user's interests and preferences. However, it may simply generate clips from the user's search keywords or browsing history, lacking in-depth analysis and consideration of the user's personalized features. In addition, the media asset segments shown to different users when the same media asset is recommended are identical, so the recommended media asset may fail to interest the user, causing the user to miss content of interest and lowering the recommendation success rate.
Disclosure of Invention
Some embodiments of the present application provide a server, a display device, and a recommended media asset generation method, which, by analyzing the user's personal characteristics and interest preferences and using a deep learning algorithm, can automatically select and clip personalized segments matching the user's taste from a large media asset library. This not only provides a highly customized media content experience for the user but also saves the user's time and effort, so that the user can easily find and enjoy favorite content.
In a first aspect, some embodiments of the present application provide a server configured to:
acquiring user data, wherein the user data comprises at least one of a media asset viewing history, a media asset preference record, a media asset viewing time preference, and media asset comments of a user;
inputting the user data into a target model to obtain a text description, wherein the text description is a text for guiding the acquisition and splicing of media asset segments, the target model is trained based on fine-tuning training data after a pre-training model is acquired, the fine-tuning training data comprises sample input data and sample text descriptions corresponding to the sample input data, and the sample input data comprises sample data of at least one of a media asset viewing history, a media asset preference record, a media asset viewing time preference, and media asset comments;
and acquiring at least one target media asset segment based on the text description, and splicing the target media asset segments to obtain a recommended media asset.
In some embodiments, the controller performs the inputting of the user data into a target model resulting in a textual description, and is further configured to:
filling the user data into a prompt template, wherein the prompt template is used for guiding a target model to generate text description;
and inputting the prompt template into a target model to obtain text description.
In some embodiments, the controller performing the obtaining of the at least one target media asset segment based on the text description is further configured to:
determining a media asset recommendation list based on the user data, the media asset recommendation list comprising at least one item of media asset data;
and screening, from the media asset data, at least one target media asset segment matching the text description.
In some embodiments, the controller, in screening at least one target media asset segment matching the text description from the media asset data corresponding to the media asset recommendation list, is further configured to:
Determining characteristic elements corresponding to video frames in media asset data, wherein the characteristic elements are used for representing image contents of the video frames;
determining a target video frame of the media asset data, wherein characteristic elements corresponding to the target video frame are the same as characteristic elements corresponding to the text description;
And extracting a target media resource segment, wherein the target media resource segment comprises the target video frame and a target number of video frames before and after the target video frame.
In some embodiments, the controller performs acquiring user data, and is further configured to:
receiving user data sent by a display device in response to a user-input instruction to open a media asset recommendation page or open a push notification; or
receiving a user identifier sent by the display device in response to a user-input instruction to open a media asset recommendation page or open a push notification, and acquiring user data corresponding to the user identifier.
In some embodiments, the media asset recommendation page includes a recommended media asset display area, and after obtaining the recommended media asset, the controller is configured to:
and controlling the display to play the recommended media asset in the recommended media asset display area.
In some embodiments, the controller is configured to:
after receiving a search instruction input by a user, filling the search instruction into a prompt template;
and inputting the prompt template into a target model to obtain text description.
In a second aspect, some embodiments of the present application provide a display apparatus, including:
A display;
A controller configured to:
after receiving a user-input instruction to open a media asset recommendation page or open a push notification, sending user data or a user identifier to a server, so that the server acquires the user data based on the user identifier and inputs the user data into a target model to obtain a text description, wherein the user data comprises at least one of a media asset viewing history, a media asset preference record, a media asset viewing time preference, and media asset comments of the user, and the text description is a text for guiding the acquisition and splicing of media asset segments; the server acquires at least one target media asset segment based on the text description and splices the target media asset segments to obtain a recommended media asset;
and receiving the recommended media assets sent by the server and controlling the display to display the recommended media assets.
In a third aspect, some embodiments of the present application provide a recommended media asset generating method, which is applied to a server, and includes:
acquiring user data, wherein the user data comprises at least one of a media asset viewing history record, a media asset preference record, a media asset viewing time preference and a media asset comment of a user;
inputting the user data into a target model to obtain a text description, wherein the text description is a text for guiding the acquisition and splicing of media asset segments, the target model is trained based on fine-tuning training data after a pre-training model is acquired, the fine-tuning training data comprises sample input data and sample text descriptions corresponding to the sample input data, and the sample input data comprises sample data of at least one of a media asset viewing history, a media asset preference record, a media asset viewing time preference, and media asset comments;
and acquiring at least one target media resource segment based on the text description, and splicing the target media resource segment to obtain recommended media resources.
In a fourth aspect, some embodiments of the present application provide a recommended media asset generating method, which is applied to a display device, and includes:
after receiving a user-input instruction to open a media asset recommendation page or open a push notification, sending user data or a user identifier to a server, so that the server acquires the user data based on the user identifier and inputs the user data into a target model to obtain a text description, wherein the user data comprises at least one of a media asset viewing history, a media asset preference record, a media asset viewing time preference, and media asset comments of the user, and the text description is a text for guiding the acquisition and splicing of media asset segments; the server acquires at least one target media asset segment based on the text description and splices the target media asset segments to obtain a recommended media asset;
and receiving the recommended media assets sent by the server and controlling a display to display the recommended media assets.
Some embodiments of the application provide a server, a display device, and a recommended media asset generation method: acquiring user data, wherein the user data comprises at least one of a media asset viewing history, a media asset preference record, a media asset viewing time preference, and media asset comments of a user; inputting the user data into a target model to obtain a text description, wherein the text description is a text for guiding the acquisition and splicing of media asset segments, the target model is trained based on fine-tuning training data after a pre-training model is acquired, the fine-tuning training data comprises sample input data and sample text descriptions corresponding to the sample input data, and the sample input data comprises sample data of at least one of a media asset viewing history, a media asset preference record, a media asset viewing time preference, and media asset comments; and acquiring at least one target media asset segment based on the text description, and splicing the target media asset segments to obtain a recommended media asset. By analyzing the user's personal characteristics and interest preferences and using a deep learning algorithm, the embodiments of the application can automatically select and clip, from a large media asset library, personalized recommended media assets matching the user's taste. This not only provides a highly customized media content experience for the user but also saves the user's time and effort, so that the user can easily find and enjoy favorite content.
Drawings
FIG. 1 illustrates an operational scenario between a display device and a control apparatus according to some embodiments;
FIG. 2 illustrates a hardware configuration block diagram of a control device according to some embodiments;
FIG. 3 illustrates a hardware configuration block diagram of a display device according to some embodiments;
FIG. 4 illustrates a software configuration diagram in a display device according to some embodiments;
FIG. 5 illustrates a flow chart of a recommended media asset generation method provided in accordance with some embodiments;
FIG. 6 illustrates a schematic diagram of a media recommendation page provided in accordance with some embodiments;
FIG. 7 illustrates a schematic diagram of an application program interface provided in accordance with some embodiments;
FIG. 8 illustrates a flow chart of another recommended media asset generation method provided in accordance with some embodiments;
FIG. 9 illustrates a flow chart of a recommendation generation method provided in accordance with some embodiments;
FIG. 10 illustrates a schematic diagram of a media asset details page provided in accordance with some embodiments;
FIG. 11 illustrates a schematic diagram of a media asset page provided in accordance with some embodiments;
FIG. 12 illustrates a schematic diagram of a recommendation interface provided in accordance with some embodiments;
FIG. 13 illustrates a flow chart of a model fine-tuning provided in accordance with some embodiments;
FIG. 14 illustrates a schematic diagram of a model structure provided in accordance with some embodiments;
FIG. 15 illustrates a schematic diagram of a media search page provided in accordance with some embodiments.
Detailed Description
For the purposes of making the objects and embodiments of the present application more apparent, exemplary embodiments of the present application will be described in detail below with reference to the accompanying drawings, in which exemplary embodiments of the present application are illustrated. It is apparent that the described exemplary embodiments are only some, not all, of the embodiments of the present application.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first", "second", and the like in the description, the claims, and the above-described figures are used to distinguish between similar objects or entities and do not necessarily describe a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The display device provided by the embodiment of the application can have various implementation forms, for example, a television, an intelligent television, a laser projection device, a display (monitor), an electronic whiteboard (electronic bulletin board), an electronic desktop (electronic table) and the like. Fig. 1 and 2 are specific embodiments of a display device of the present application.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the communication between the remote controller and the display device includes infrared protocol communication or bluetooth protocol communication, and other short-range communication modes, and the display device 200 is controlled by a wireless or wired mode. The user may control the display device 200 by inputting user instructions through keys on a remote control, voice input, control panel input, etc.
In some embodiments, a smart device 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the display device may receive instructions not using the smart device or control device described above, but rather receive control of the user by touch or gesture, or the like.
In some embodiments, the display device 200 may also be controlled in ways other than through the control apparatus 100 and the smart device 300. For example, the user's voice commands may be received directly through a module configured inside the display device 200 for acquiring voice commands, or through a voice control device configured outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be permitted to make communication connections via a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be a cluster, or may be multiple clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 in accordance with an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction of a user and convert the operation instruction into an instruction recognizable and responsive to the display device 200, and function as an interaction between the user and the display device 200.
As shown in fig. 3, the display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface.
In some embodiments, the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to nth interfaces for input/output.
The display 260 includes a display screen component for presenting pictures and a driving component for driving image display; it receives image signals output from the controller and displays video content, image content, a menu manipulation interface, and a user manipulation UI interface.
The display 260 may be a liquid crystal display, an OLED display, a projection device, or a projection screen.
The display 260 further includes a touch screen, and the touch screen is used for receiving an action input control instruction such as sliding or clicking of a finger of a user on the touch screen.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control device 100 or the server 400 through the communicator 220.
A user interface, which may be used to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
The detector 230 is used to collect signals of the external environment or of interaction with the outside. For example, the detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; or the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, user attributes, or user interaction gestures; or the detector 230 includes a sound collector, such as a microphone, for receiving external sounds.
The external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, etc. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
The modem 210 receives broadcast television signals through a wired or wireless reception manner, and demodulates audio and video signals, such as EPG data signals, from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
The controller 250 controls the operation of the display device and responds to the user's operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the controller includes at least one of a central processing unit (Central Processing Unit, CPU), a video processor, an audio processor, a graphics processor (Graphics Processing Unit, GPU), RAM (Random Access Memory, RAM), ROM (Read-Only Memory, ROM), a first interface to an nth interface for input/output, a communication Bus (Bus), and the like.
The user may input a user command through a graphical user interface (GUI) displayed on the display 260, and the user input interface receives the command through the GUI. Alternatively, the user may input a command through a specific sound or gesture, and the user input interface recognizes the sound or gesture through a sensor to receive the command.
A "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user, which enables conversion between an internal form of information and a user-acceptable form. A commonly used presentation form of a user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a graphically displayed user interface that is related to computer operations. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
As shown in fig. 4, the system of the display device is divided into three layers, an application layer, a middleware layer, and a hardware layer, from top to bottom.
The application layer mainly comprises the common applications on the television and an application framework (Application Framework). The common applications are mainly applications developed based on a browser, such as HTML5 apps, and native applications (Native Apps).
The application framework (Application Framework) is a complete program model with all the basic functions required by standard application software, such as file access and data exchange, together with the interfaces for using these functions (toolbar, status bar, menu, dialog box).
Native applications (Native Apps) may support online or offline operation, message pushing, and local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises a HAL interface, hardware and a driver, wherein the HAL interface is a unified interface for all the television chips to be docked, and specific logic is realized by each chip. The driving mainly comprises: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, pressure sensor, etc.), and power supply drive, etc.
With the rapid growth of the internet and digital media, users' demands for media content are increasingly diversified and personalized. However, current media platforms often simply provide a large amount of content, and users must spend considerable time and effort filtering it to find content of interest.
To reduce the time users spend screening media assets, personalized recommendation systems are widely used. A personalized recommendation system uses the user's personal information and behavior data to provide personalized recommended content through machine learning and data mining algorithms. Such a system can filter and recommend content matching the user's taste based on the user's interests and preferences. However, it may simply generate clips from the user's search keywords or browsing history, lacking in-depth analysis and consideration of the user's personalized features. In addition, the media asset segments shown to different users when the same media asset is recommended are identical, so the recommended media asset may fail to interest the user, causing the user to miss content of interest and lowering the recommendation success rate.
In order to solve the above technical problems, an embodiment of the present application provides a server 400. As shown in fig. 5, the server 400 performs the steps of:
Step S501: acquiring user data;
The user data comprises at least one of a media asset viewing history record, a media asset preference record, a media asset viewing time preference and a media asset comment of a user. The asset viewing history refers to assets that the user has viewed. The media asset preference record refers to media assets collected and praised by the user or preferences of the user for different types of media content, such as comedy, action, scenario, or favorite actors, directors, etc. The media asset viewing time preference refers to the viewing preference of the user for the media asset at different time periods during the day.
In some embodiments, user data is stored in display device 200. The step of acquiring user data comprises the steps of:
User data sent by the display device in response to an instruction of opening a media asset recommendation page or opening a push notification input by a user is received.
After receiving a power-on instruction or an instruction to open an application, the display device 200 displays a homepage. The homepage includes a navigation bar with at least one navigation control, including a recommendation control. The homepage may select the recommendation control by default, or the user may move focus onto the recommendation control to select it. After the focus moves to the recommendation control, the media asset recommendation page corresponding to the control is displayed. The media asset recommendation page includes a recommended media asset display area for displaying recommended media assets. After receiving the user's instruction to move the focus to the recommendation control, the display device 200 sends a request for recommendation page data to the server 400. The recommendation page data includes page base data and the recommended media asset. Page base data refers to the page structure, pictures, names, and the like corresponding to the page controls. A recommended media asset is a collection of media asset segments matching the user's characteristics and interests: it may be a collection of multiple segments from a single media asset, such as a highlight clip of that asset, or a collection of one or more segments from several different media assets.
In some embodiments, upon receiving the request for recommendation page data, the server 400 sends the page base data together with the determined recommended media asset to the display device 200.
In some embodiments, after receiving the request for recommendation page data, the server 400 sends the page base data and the recommended media asset to the display device 200 asynchronously. That is, since the page base data is determined faster, it may be sent to the display device 200 first on its own, so that the display device 200 displays the page base data as early as possible and the user's waiting time is reduced.
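A minimal sketch of this asynchronous delivery, written in Python with asyncio; the function names (load_page_base_data, build_recommended_asset, send_to_device) are illustrative assumptions standing in for the real page service, clip pipeline, and push channel, none of which the patent names:

```python
import asyncio

def load_page_base_data() -> dict:
    # Hypothetical: page structure, pictures, and control names.
    return {"type": "page_base_data", "layout": "recommendation_page"}

async def build_recommended_asset(user_data: dict) -> dict:
    await asyncio.sleep(2)                       # stand-in for the slow clip pipeline
    return {"type": "recommended_asset", "clips": ["clip_001.mp4"]}

async def send_to_device(payload: dict) -> None:
    print("sent:", payload["type"])              # stand-in for the real push channel

async def send_asset_later(user_data: dict) -> None:
    await send_to_device(await build_recommended_asset(user_data))

async def handle_page_request(user_data: dict) -> None:
    # The page skeleton goes out immediately so the display device can render
    # first, while the recommended asset is built and pushed once it is ready.
    await send_to_device(load_page_base_data())
    asyncio.create_task(send_asset_later(user_data))
    await asyncio.sleep(3)                       # keep the loop alive for the demo

asyncio.run(handle_page_request({"history": ["Movie A"]}))
```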
Illustratively, as shown in FIG. 6, the media asset recommendation page includes a recommended media asset display area 61, a recommendation list area 62, a change-batch control 63, and a focus 64. The recommendation list area 62 includes a first recommended media asset control 621 and a second recommended media asset control 622. The recommended media asset display area 61 is configured to display recommended media assets issued by the server 400. The recommended media asset may be a collection of multiple segments of the media asset corresponding to the first recommended media asset control 621, or a media asset spliced from one or more media asset segments corresponding to the first recommended media asset control 621 and the second recommended media asset control 622. By selecting the first recommended media asset control 621 or the second recommended media asset control 622, the user can jump directly to the media asset detail page corresponding to that control. Another recommended media asset based on the user data may be retrieved by selecting the change-batch control 63.
It should be noted that controls are visual objects displayed in the display areas of the user interface of the display device to represent corresponding content, such as icons, thumbnails, video clips, and links. They may provide the user with various conventional program contents received through data broadcasting, as well as various application and service contents set by the content provider.
The presentation form of the control is typically diversified. For example, the controls may include text content and/or images for displaying thumbnails related to the text content, or video clips related to the text. As another example, the control may be text and/or an icon of an application.
The focus is used to indicate that one of the controls has been selected. On one hand, a control may be selected or controlled by moving the displayed focus object in the display device according to the user's input through the control apparatus 100; for example, the user may move the focus object between controls using the direction keys on the control apparatus 100 to select and control a control. On the other hand, the movement of the controls displayed in the display device may be controlled according to the user's input through the control apparatus 100 so that the focus object selects or controls a control; for example, the user may use the direction keys on the control apparatus 100 to move the controls left and right as a group while the position of the focus object stays fixed, allowing the focus object to select and control a control.
The focus is typically identified in various forms. For example, the position of the focus object may be indicated by magnifying the focused control or setting its background color, or by changing the border line, size, color, transparency, outline, and/or text or image font of the focused control.
During the user's use of the application, the server 400 periodically generates a push notification according to the user's media asset viewing history, media asset preference record, media asset viewing time preference, and media asset comments, and sends the push notification to the display device 200. The display device 200 displays a push notification reminder. After receiving the user-input instruction to open the push notification, the display device 200 transmits the user data to the server 400 to acquire the recommended media asset sent by the server 400.
Illustratively, as shown in FIG. 7, the application interface includes a push message control 71. After receiving an instruction to select the push message control 71, the display device sends the user data to the server, displays the media asset recommendation page, and, after receiving the recommended media asset, plays it in the recommended media asset display area. Alternatively, after receiving the user-input instruction to select the push message control 71, the display device sends the user data to the server, receives the recommended media asset sent by the server, and plays it in a playing window. A transparent target area with the same position and size as the playing window is drawn on the current application interface, so that the recommended media asset can be shown to the user.
In some embodiments, a user-dedicated channel is created. Illustratively, as shown in FIG. 7, a navigation control 72 for the dedicated channel is added to the navigation bar. After receiving a user-input instruction to select the dedicated channel control, a dedicated channel page is displayed; the dedicated channel page contains personalized content matching the user's interests, such as recommended media assets like media asset highlight clips.
It should be noted that, as usage time grows, the user's preferences may change. The user data therefore covers a recent window, for example: the user's recent media asset viewing history, media asset preference record, media asset viewing time preference, media asset comments, and so on. The update frequency of recommended media assets may also depend on how often the server 400 updates the model; the model may be retrained once per fixed update period.
The embodiments of the application continuously carry out the cycle of updating user characteristics, expanding training data, and iterating model training. That is, once the server is online, user data is collected continuously; rich and diverse data benefits model optimization, and an optimized model brings a more accurate, personalized clipping experience that better meets user expectations.
In some embodiments, user data is stored in the server 400 corresponding to a user identification that characterizes user identity information. The user identification includes a user account. The step of acquiring user data comprises the steps of:
and receiving a user identifier sent by the display equipment in response to an instruction of opening a media asset recommendation page or opening a push notification input by a user, and acquiring user data corresponding to the user identifier.
Unlike the above embodiment, the present application transmits the user identifier to the server 400 after receiving the instruction of the user to input the opening of the media asset recommendation page or the opening of the push notification, so that the server 400 obtains the user data according to the user identifier.
Step S502: inputting user data into a target model to obtain text description;
The text description is used for guiding acquisition and splicing of media asset fragments, the target model is trained based on fine tuning training data after the pre-training model is acquired, the fine tuning training data comprise sample input data and sample text description corresponding to the sample input data, and the sample input data comprise at least one of media asset viewing history records, media asset preference records, media asset viewing time preference and sample data of media asset comments.
In some embodiments, the step of training the target model comprises:
1) Selecting a pre-training model
Illustratively, GPT-2 (Generative Pre-trained Transformer 2) is selected as the pre-training model. GPT-2 is a powerful pre-trained model with deep understanding and generation capabilities for language.
2) Data preparation: a large-scale and diverse text data set is prepared that should cover areas of possible interest to the user, including movies, entertainment, literature, etc. This helps the model better understand the different topics and contexts.
Embodiments of the present application use deep learning algorithms, i.e., pre-trained large language models, to understand and generate text. The model is pre-trained on large-scale text data, and has powerful understanding and generating capabilities for languages.
3) Model initialization: using the selected pre-training model (e.g., GPT-2), load its pre-trained parameters and configuration. The model can thus utilize the language knowledge learned over large-scale data. Alternatively, an off-the-shelf pre-trained model may be selected directly, without performing steps 1) and 2).
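As an illustration of this initialization step, a minimal sketch that loads GPT-2 with its pre-trained parameters and configuration; the use of the Hugging Face transformers library is an assumption, since the patent does not name a framework:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained parameters and configuration of GPT-2 so the model
# starts from language knowledge learned on large-scale text data.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default
```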
4) Fine tuning:
a) Constructing a user feature and interest preference data set: the user's personal characteristics and interest preference data are put into an appropriate format for input into the deep learning model. This may include media asset viewing history, media asset preference records, media asset viewing time preferences, media asset comments, and the like.
The arrangement of the user's personal characteristics and interest preference data into an appropriate format is to facilitate model processing of this information. This process generally includes the steps of:
Feature selection: the personal characteristics and interest preferences of the user are analyzed, and data characterizing the characteristics and interest preferences of the user are selected as desired. The user characteristics and interest preference data include media asset viewing history, media asset preference records, media asset viewing time preferences, media asset reviews and scoring history, etc., which help the system better understand the user's tastes and needs.
Data normalization: for continuous type features, a normalization operation is performed, ensuring that they are on the same scale. For example, the scores of movies in the viewing history are normalized to a uniform range.
Category feature coding: for discrete category-type features, one-hot encoding or other encoding methods are used to transform them into a form the model can process. For example, movie genres (comedy, action, etc.) are encoded into binary vectors.
And (3) time characteristic processing: if there are time-dependent features, such as viewing time preferences, it may be necessary to convert them into a format that the model can understand, for example, to divide the time period into several discrete categories.
Formatting data: the collated features are organized in a suitable format, for example by placing them into a dataset or data frame. Each sample should contain a textual description of the user's characteristics and the related target variable, i.e., the generated recommended media asset.
Custom dataset class: it may be desirable to create a custom data set class (CustomDataset) to ensure that data can be efficiently loaded and processed by the deep learning model.
Dividing data: the consolidated data is divided into a training set, a validation set, and a test set for training and evaluating the model during the fine tuning process.
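Before moving to the dataset class, a minimal sketch of the normalization, category-encoding, and time-bucketing steps listed above; the field names, genre set, score range, and time boundaries are illustrative assumptions:

```python
import numpy as np

GENRES = ["comedy", "action", "drama", "sci-fi"]   # assumed label set

def normalize_score(score: float, low: float = 0.0, high: float = 10.0) -> float:
    # Continuous features (e.g., ratings in the viewing history) go to [0, 1].
    return (score - low) / (high - low)

def one_hot_genre(genre: str) -> np.ndarray:
    # Discrete category features become binary vectors (one-hot encoding).
    vec = np.zeros(len(GENRES))
    vec[GENRES.index(genre)] = 1.0
    return vec

def bucket_viewing_hour(hour: int) -> str:
    # Time-related features are mapped to a few discrete categories.
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 23:
        return "evening"
    return "night"

print(normalize_score(8.5), one_hot_genre("action"), bucket_viewing_hour(21))
```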
B) Custom dataset class: a custom data set class (CustomDataset) is required to prepare the data, ensuring that the class can efficiently process and provide user characteristic data.
By way of example, one collated fine-tuning training sample may include the user's media asset viewing history, media asset preference record, media asset viewing time preference, and the like as input features (sample input data), and the generated text description of the recommended media asset (sample text description) as the target output. The goal of the data arrangement is to provide clear, structured input to the model, enabling it to learn the user's specific interests and preferences efficiently and generate corresponding recommended media assets.
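A minimal sketch of the CustomDataset class described above, assuming PyTorch and the Hugging Face tokenizer from the earlier block; each sample pairs a serialized user-feature prompt (sample input data) with its target text description:

```python
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Pairs a serialized user-feature prompt with its target text description."""

    def __init__(self, samples, tokenizer, max_length=256):
        # samples: list of (user_feature_text, target_description) tuples
        self.samples = samples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        features, description = self.samples[idx]
        # Concatenate prompt and target so the language model learns to
        # continue the user-feature prompt with the recommended description.
        text = features + self.tokenizer.eos_token + description
        enc = self.tokenizer(text, truncation=True, max_length=self.max_length,
                             padding="max_length", return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        # For simplicity, padding positions are not masked out of the loss here.
        return {"input_ids": input_ids,
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": input_ids.clone()}
```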
C) Defining a fine tuning task: in the fine tuning stage, the task of defining the model is to generate personalized recommended media materials which meet the taste of the user. This involves adapting the model to the goals of a particular task.
5) Fine tuning optimization
A) Selecting a loss function: in the fine-tuning process, an appropriate loss function is selected to measure the consistency of the recommended media assets generated by the model with the user's characteristics and interests. For example, a cross-entropy loss function may be used.
In deep learning, a Cross entropy (Cross-Entropy) loss function is typically used to measure the difference between two probability distributions. Because the task of the embodiment of the application is to generate personalized recommended media, the cross entropy loss of the condition generation task is used. In particular, the generation task is to generate recommended assets that meet the user's tastes based on the user's personal characteristics and interests. Thus, the loss function may include two key parts:
Language model loss: this section focuses on the similarity between the generated text and the real sample (text description). For a generation task based on a language model such as GPT-2, cross entropy loss can be used to measure the difference between the probability distribution of the generated text and the distribution of the real text.
Personalized preference loss: since the task is to generate content that meets the user's taste, it may also be necessary to introduce a penalty term for personalized preferences. This term may be based on information of the user's characteristics and interests, by some metrics or mapping, to compare the generated recommended assets to the user's personalized needs.
B) Use optimizer: a suitable optimizer, such as AdamW, is employed to effectively adjust the model parameters to better adapt them to the personalized task.
Illustratively, the loss function takes the following form:
L_total(θ) = λ1 · L_LM(θ) + λ2 · L_pref(θ)
When a personalized recommended media asset is generated, it should not only conform to the grammar and authenticity of text but also match the user's personalized preferences. To achieve this, a loss function combining two factors is considered. Assume a generative model with parameters θ; by fine-tuning this model, the recommended media assets it generates are expected to satisfy two requirements:
Language model loss L_LM: the generated recommended media asset is linguistically accurate and fluent. This part of the loss measures the difference between the generated probability distribution P_gen(y|θ) and the probability distribution of the real clips P_true(y).
Personalized preference loss L_pref: the generated recommended media asset matches the user's personalized interests. This part of the loss correlates the user's personalized features (e.g., viewing history, interest preferences) with the model parameters through some metric or mapping, measuring how well the generated clips meet the user's personalized needs.
The total loss function is a linear combination of these two losses, where λ1 and λ2 are the weights of the two loss terms and balance their importance. By adjusting these weights, the trade-off of the generative model between linguistic accuracy and personalized preference can be flexibly controlled.
Optimizing the loss function, i.e., fine-tuning the model parameters θ, makes the generated recommended media assets better match the user's taste: they maintain linguistic quality while being more customized to personalized preferences. This ensures that the generated content follows general language rules while meeting the user's personalized needs.
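Continuing the sketches above (the GPT-2 model and a train_loader built from CustomDataset are assumed to be defined as in the earlier blocks), a minimal fine-tuning step with the combined loss L_total = λ1·L_LM + λ2·L_pref and the AdamW optimizer; the preference term is a placeholder, since the patent leaves its exact metric open:

```python
import torch
from torch.optim import AdamW

lambda1, lambda2 = 1.0, 0.5          # assumed weights for the two loss terms

def preference_loss(outputs, user_profile):
    # Placeholder: the patent only requires *some* metric comparing the
    # generated content with the user's personalized needs.
    return torch.tensor(0.0, device=outputs.logits.device)

optimizer = AdamW(model.parameters(), lr=5e-5)

for batch in train_loader:           # train_loader built from the CustomDataset sketch
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])   # .loss is the cross-entropy LM loss
    loss = lambda1 * outputs.loss + lambda2 * preference_loss(outputs, batch.get("profile"))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```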
6) Model preservation: and after the fine tuning stage is finished, the parameters and the configuration of the model are saved.
The step of inputting the user data into a target model to obtain text description comprises the following steps:
Filling the user data into a prompt template;
The prompt template may be a text string containing key information, such as at least one of the user's media asset viewing history, media asset preference record, media asset viewing time preference, and media asset comments. The prompt template is used to guide the target model to generate a text description of a recommended media asset matching the user's taste.
After the prompt template is received as input, a function or module processes its content (the prompt information). The content of the prompt template may be encoded so that it can be embedded into the model, for example using the model's tokenizer or another text embedding method.
And inputting the prompt template into a target model to obtain text description.
The text description describes the recommended media asset content generated by the model. It is the output of the model, i.e., the text generated by the fine-tuned model. The text description may be a passage of natural-language text describing a personalized recommended media asset matching the user's taste, or an abstract representation of how media asset segments should be spliced and edited to meet the user's preferences.
The text description may be converted into actual media asset segments in subsequent processing. The conversion includes interpreting the text description, i.e., using natural language processing techniques to translate the generated text into specific editing and splicing instructions. In general, the generated text description serves as a guide: it provides an abstract guideline for how segments are edited and spliced, and subsequent processing steps translate these abstract guidelines into an actual, user-viewable personalized recommended media asset. This approach meets user needs while retaining flexibility and personalized customization.
The generation function is then called with the fine-tuned model, and the processed prompt information is input into the model. The generation function may need further adjustment so that it accepts the prompt information and takes it into account during generation.
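A minimal sketch of filling the prompt template and calling the generation function on the fine-tuned model from the earlier blocks; the template wording and field names are illustrative assumptions:

```python
PROMPT_TEMPLATE = ("Viewing history: {history}. Preferred genres: {genres}. "
                   "Usual viewing time: {time}. Describe a personalized "
                   "recommended media asset for this user:")

def generate_description(user_data: dict) -> str:
    prompt = PROMPT_TEMPLATE.format(**user_data)
    # Encode the prompt so it can be embedded into the model (tokenizer step).
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=80,
                                do_sample=True, top_p=0.9,
                                pad_token_id=tokenizer.eos_token_id)
    # Return only the newly generated continuation, not the prompt itself.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

text_description = generate_description({"history": "Inception, The Matrix",
                                          "genres": "sci-fi, action",
                                          "time": "evening"})
```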
In some embodiments, the step of inputting the user data into a target model to obtain a text description includes:
analyzing the user data to obtain user interest labels, wherein the user interest labels are used for representing characteristics of user favorites and watching habits;
filling the user interest labels into a prompt template;
and inputting the prompt template into a target model to obtain text description.
The embodiments of the application can determine a user interest tag from the user data, for example whether the user is an action-film fan or a romance-film fan. The user interest tag then serves as the input for model training or inference, while the text description remains the model output; the fine-tuning training data includes sample user interest tags and their corresponding sample text descriptions. Illustratively, after the user data is received, analysis of the media asset viewing history, media asset preference record, media asset viewing time preference, media asset comments, and the like yields the user interest tag "action film fan". Inputting "action film fan" into the target model yields the text description.
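As an illustration of this tag-analysis step, a minimal sketch that derives a coarse interest tag from genre counts in the viewing history; the tag wording and the data layout are assumptions, not the patent's specification:

```python
from collections import Counter

def derive_interest_tag(viewing_history: list[dict]) -> str:
    # Count genres across the user's watched assets and map the dominant
    # genre to a coarse interest tag such as "action film fan".
    genre_counts = Counter(g for item in viewing_history for g in item["genres"])
    top_genre, _ = genre_counts.most_common(1)[0]
    return f"{top_genre} film fan"

tag = derive_interest_tag([{"genres": ["action", "sci-fi"]},
                           {"genres": ["action"]}])
print(tag)   # -> "action film fan"
```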
Step S503: and acquiring at least one target media resource segment based on the text description, and splicing the target media resource segment to obtain recommended media resources.
The step of obtaining at least one target media asset segment based on the text description comprises:
Determining a media asset recommendation list based on the user data, the media asset recommendation list comprising at least one media asset data;
In some embodiments, the target model may determine, according to the user data, the media assets that need to be recommended. There may be one or more recommended media assets; that is, the media asset recommendation list may contain one or more items of media asset data.
In some embodiments, when the media asset recommendation list includes a plurality of media asset data, a recommended media asset may be generated for each media asset data, that is, each recommended media asset is formed by splicing a plurality of media asset segments of one media asset data.
In some embodiments, when the media asset recommendation list includes a plurality of media asset data, a recommended media asset may be generated from the plurality of media asset data together; that is, the recommended media asset is spliced from one or more media asset segments of several different media asset data. In order to determine which media asset a given video frame belongs to, a media asset name identifier may be added to the video frame so that users can trace its source.
In some embodiments, the determination of the media recommendation list may be determined by other models or services. The media asset recommendation list may be entered into the target model. Other models or services may also determine media recommendation lists based on user data. The media asset viewing history, media asset preference record, media asset viewing time preference, and media asset comments help determine the media asset list.
and screening, from the media asset data, at least one target media asset segment matching the text description.
The step of screening at least one target media asset segment matching the text description from the media asset data comprises the following steps:
Determining characteristic elements corresponding to video frames in media asset data, wherein the characteristic elements are used for representing image contents of the video frames;
A step of media asset data analysis includes:
Reading media data: reading the media asset data into the computer's memory.
Frame extraction: extracting each video frame from the media asset data.
Object detection: performing object detection on each video frame, identifying objects in the picture through a target detection algorithm, and marking them.
Object tracking: for consecutive video frames, tracking the same object through an object tracking algorithm to obtain the object's motion trajectory in the video.
Feature extraction: extracting features of each frame, using models such as convolutional neural networks to extract key features such as color, texture, and shape from the picture.
Image recognition: through analysis and processing of the features, performing image recognition on each video frame to determine its feature elements, such as scene, emotion, and theme.
Media asset data analysis: analyzing the media asset data as a whole, and extracting information such as the theme, emotion, and plot of the video through processing and recognition of each video frame.
In the embodiments of the application, the media asset data can be analyzed in advance, and its feature elements stored in correspondence with the video frame information.
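A minimal sketch of this offline analysis pipeline, assuming OpenCV for frame reading; extract_feature_elements is a stand-in for the object-detection and image-recognition models the steps above describe:

```python
import cv2

def extract_feature_elements(frame) -> set[str]:
    # Stand-in for object detection / CNN feature extraction / image
    # recognition; a real system would return tags like {"explosion", "night"}.
    return {"placeholder"}

def analyze_media_asset(path: str, sample_every: int = 25) -> dict[int, set[str]]:
    """Map sampled frame indices to feature elements, stored for later lookup."""
    capture = cv2.VideoCapture(path)
    features_by_frame = {}
    index = 0
    while True:
        ok, frame = capture.read()        # read the media data frame by frame
        if not ok:
            break
        if index % sample_every == 0:     # sample to keep analysis tractable
            features_by_frame[index] = extract_feature_elements(frame)
        index += 1
    capture.release()
    return features_by_frame
```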
Determining a target video frame of the media asset data, wherein characteristic elements corresponding to the target video frame are the same as characteristic elements corresponding to the text description;
Video frames whose feature elements match the feature elements corresponding to the text description are screened from the media asset data of the media asset recommendation list; these video frames are the target video frames.
In some embodiments, if multiple video frames have feature elements matching the text description, one video frame may be selected at random, or the centered frame of a continuous run of matching video frames may be selected, so that more video frames matching the user's preference can be acquired.
In some embodiments, if multiple video frames of the same media asset data have feature elements matching the text description, multiple video frames may be selected, provided the frame spacing between two adjacent selected frames is greater than a preset value. The preset value may be set to the number of video frames in a media asset segment; it prevents the same video frames from being selected repeatedly, which would cause repeated playback.
For example, if the number of video frames of each media asset segment is 100 frames, the interval between two adjacent selected video frames must be greater than 100 frames.
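A minimal sketch of this selection rule: keep frames whose feature elements match the text description while enforcing a minimum gap (the preset value, e.g. the 100-frame segment length) between adjacent picks; the data layout follows the analysis sketch above:

```python
def select_target_frames(features_by_frame: dict[int, set[str]],
                         wanted: set[str], min_gap: int = 100) -> list[int]:
    targets = []
    for index in sorted(features_by_frame):
        if wanted & features_by_frame[index]:        # feature elements match
            if not targets or index - targets[-1] > min_gap:
                targets.append(index)                # far enough from last pick
    return targets

picks = select_target_frames({10: {"explosion"}, 50: {"explosion"}, 200: {"chase"}},
                             wanted={"explosion", "chase"})
print(picks)   # -> [10, 200]; frame 50 falls within the 100-frame gap of frame 10
```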
Extracting a target media asset segment, wherein the target media asset segment comprises the target video frame and a target number of video frames before and after the target video frame.
In some embodiments, after the target video frame is determined, the target video frame and a target number of video frames before or after it may be extracted as the target media asset segment.
In some embodiments, after the target video frame is determined, the target video frame and a target number of video frames both before and after it may be extracted as the target media asset segment.
The embodiments of the application ensure that the extracted video frames form a continuous segment that plays as coherent footage rather than a single still picture.
In some embodiments, the length of time (number of video frames) of the recommended media asset is within a certain range, and if the length is out of the range, the media asset can be added or deleted appropriately. If the total length of the screened media resource segments is lower than the minimum value, video frames of the non-text description corresponding characteristic elements can be properly increased, or the extraction number corresponding to the target video frames can be properly increased. If the total length of the screened media resource segments is higher than the highest value, the number of target video frames can be properly reduced, or the extraction number corresponding to the target video frames can be properly reduced.
After the target media asset segments are acquired, they are spliced to obtain the recommended media asset.
This step involves integration with the media asset library to select suitable video segments for clipping and splicing.
In some embodiments, the generated recommended media asset requires further post-processing and optimization to ensure its fluency and quality and its compliance with the user's expectations. This may include adjusting audio and images and applying transition effects, as in the sketch below.
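A minimal sketch of segment extraction and splicing, assuming the moviepy 1.x API; frame indices are converted to timestamps via the clip's frame rate, and the function and parameter names are illustrative.

from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice_segments(video_path, target_frames, half_window=50,
                    out_path="recommended.mp4"):
    clip = VideoFileClip(video_path)
    fps = clip.fps
    segments = []
    for f in target_frames:
        start = max(0, (f - half_window) / fps)
        end = min(clip.duration, (f + half_window) / fps)
        segments.append(clip.subclip(start, end))
    final = concatenate_videoclips(segments)  # transitions/audio could follow
    final.write_videofile(out_path)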
In some embodiments, the textual description may be presented in text form in the recommended media asset.
After determining the recommended asset, the server 400 transmits the recommended asset to the display device 200.
In some embodiments, the display device 200, upon receiving the recommended media asset, controls the display 260 to directly play the recommended media asset in the recommended media asset display area.
In some embodiments, after receiving the recommended media asset, the display device 200 controls the display 260 to display the first frame of the recommended media asset in the recommended media asset display area, and starts playing the recommended media asset after receiving a user instruction to play it.
In some embodiments, receiving a search instruction input by a user;
In some embodiments, a media asset search instruction is received that the user enters by pressing a voice key of the control device 100 or by waking the voice assistant with a far-field wake-up word; the user's voice instruction is, for example, "search for a science fiction and war movie".
In some embodiments, a media search instruction entered by a user through text entered in a search input box is received. Illustratively, "science fiction war" is entered in the search input box.
In some embodiments, display device 200 presents a media asset search interface. The media asset search interface includes a selectable category control. An instruction is received from a user to select at least one category control. Illustratively, the user selects a science fiction control and a war control.
Filling the search instruction into a prompt template;
and inputting the prompt template into a target model to obtain text description.
And acquiring at least one target media resource segment based on the text description, and splicing the target media resource segment to obtain recommended media resources.
Illustratively, the prompt template is filled with user data or the user's search instruction, which serves as the prompt information. The prompt information is converted into a numerical format the model can understand, typically by encoding or embedding each keyword; the understood format may be a vector, so that the model can compute in the space this vector represents. The processed prompt information is input into the fine-tuned model, and a generation function is called to produce a text description, for example: "In this XXX, technology is interwoven with adventure, taking you across an unknown universe. Fierce action scenes and thrilling suspense bring you an unparalleled audiovisual experience." Based on the generated text description, suitable video segments are selected, possibly including science fiction scenes, action sequences, and so on, then clipped and spliced to generate a complete personalized recommended media asset. The generated recommended media asset may be post-processed to ensure smooth pictures and suitable sound effects, and transition effects may be added. A minimal sketch of the generation step follows.
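A minimal sketch of this prompt-to-description step, assuming the Hugging Face Transformers GPT-2 pair stands in for the fine-tuned target model; the prompt text and sampling settings are illustrative.

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # in practice: fine-tuned weights

prompt = "User prefers: science fiction, war. Describe a recommended clip:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=80, do_sample=True, top_p=0.9,
                            pad_token_id=tokenizer.eos_token_id)
text_description = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# text_description then guides segment selection and splicing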
In the embodiment of the application, the media asset segments displayed to different users for the same recommended media asset are not identical. For example, users A and B both like science fiction movies and are both recommended science fiction movie X. From user A's information, such as the media asset viewing history, media asset preference record, media asset viewing time preference, and media asset comments, it is determined that user A prefers science fiction and war; from user B's corresponding information, it is determined that user B prefers science fiction and nature. The recommended media assets clipped for the two users will therefore differ: user A's recommended media asset focuses more on the war scenes in the film, while user B's focuses on the planets' natural scenery.
The embodiment of the application provides a clipping algorithm based on user characteristics. The server collects data such as the user's personal information, browsing history, and interest preferences, and performs deep analysis to learn the user's preferences and habits. With this data, the system can accurately grasp the user's tastes and provide personalized clipping results that match them. The server then selects material segments related to the user's interests from the media asset library according to the user's characteristics; the segments may come from different film and television works, musical works, advertisements, and so on. Using clipping techniques, the server splices these segments into a personalized recommended media asset.
The generation of recommended media assets includes data preprocessing, model training, and inference. As shown in fig. 8, a pre-training model is initialized and loaded, fine-tuning parameters are defined, an optimizer and a loss function are set, and the pre-training model is trained with fine-tuning training data to obtain a fine-tuned model, which is saved. The fine-tuned model is then loaded, an input prompt template is received, and the recommended media asset is generated.
The embodiment of the application uses the PyTorch and Hugging Face Transformers libraries to pre-train and fine-tune a large language model (such as GPT) to generate personalized media asset video segments. The steps are as follows (a minimal sketch follows the list):
1) A dataset class CustomDataset is defined, including the initialization method __init__ and the two special methods __len__ and __getitem__.
2) The data is prepared: a CustomDataset instance is created with two text lists as parameters, and then a DataLoader instance is created with the dataset as a parameter.
3) The pre-training model is initialized and loaded: a GPT2Config instance is created with the pre-training model name as a parameter, and a GPT2LMHeadModel instance is created with the pre-training model name and the configuration object as parameters.
4) The fine-tuning parameters are defined: an AdamW optimizer instance and a cross-entropy loss function instance are created.
5) The model is fine-tuned: the model is moved onto the GPU (if one is available); then, for each batch of the DataLoader, the batch's inputs and labels are moved onto the GPU, the model's gradients are zeroed, the inputs are passed to the model, the loss is computed and back-propagated, and finally the model's parameters are updated.
6) The fine-tuned model is saved to disk.
7) Personalized recommended media assets are generated with the fine-tuned model: a function generate_video_clip is created that receives a text prompt as a parameter, generates a text description/sequence with the fine-tuned model, and returns it.
8) The text description guides the selection, splicing, and clipping of media asset segments to generate the recommended media asset.
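A minimal sketch of steps 1) through 8), assuming PyTorch and Hugging Face Transformers; the sample texts, hyperparameters, and save path are illustrative, and the cross-entropy loss is the one computed inside GPT2LMHeadModel when labels are supplied.

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

# 1) dataset class with __init__, __len__, __getitem__
class CustomDataset(Dataset):
    def __init__(self, prompts, descriptions, tokenizer, max_len=64):
        self.examples = [
            tokenizer.encode(p + " " + d, truncation=True,
                             max_length=max_len, padding="max_length")
            for p, d in zip(prompts, descriptions)
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        ids = torch.tensor(self.examples[i])
        return ids, ids  # LM fine-tuning: labels mirror the inputs

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# 2) prepare data: illustrative prompt/description pairs
prompts = ["User likes science fiction and war."]
targets = ["A fast-cut montage of space battle scenes."]
loader = DataLoader(CustomDataset(prompts, targets, tokenizer), batch_size=1)

# 3) initialize and load the pre-training model
config = GPT2Config.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)

# 4) fine-tuning parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# 5) fine-tune: move to GPU if available, iterate batches, update parameters
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()
for inputs, labels in loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = model(input_ids=inputs, labels=labels).loss  # cross-entropy
    loss.backward()
    optimizer.step()

# 6) save the fine-tuned model to disk
model.save_pretrained("finetuned-gpt2")

# 7) generate a text description from a prompt with the fine-tuned model
def generate_video_clip(prompt):
    ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    out = model.generate(ids, max_length=60,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)
# 8) the returned description then guides segment selection and splicing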
Example 1:
If the user is interested in an actor or director, the server may create an album clip based on these preferences, including the actor's or director's classic scenes or excerpts from their works. This helps recommend movies more in line with the user's interests. The embodiment of the application provides a basic framework and key ideas for implementing the user-characteristic-based personalized clipping algorithm for recommended media assets, comprising the following steps:
Data preparation: data containing actor and director information and the corresponding video segments are prepared. The data may be a database containing media asset information, actors, directors, and so on. The media asset data is analyzed to recognize the actor or director, determining the correspondence between actors and video frames and between classic scenes and video frames.
Pre-training a large language model: pre-training is performed using pre-trained Transformers models such as BERT or GPT. Pre-training on data from the entertainment domain may be selected to better fit this application.
For example: a GPT2Tokenizer instance and a GPT2LMHeadModel instance are created, both initialized from the pre-trained model "gpt2".
Fine-tuning the model: the pre-trained model is fine-tuned on the actor or director album-clipping task. An appropriate loss function is defined that takes into account the user characteristics, the actor/director information, and the quality of the generated video segments.
Generating the personalized recommended media asset: the fine-tuned model generates a text description from the user characteristics and the favored actor or director information; media asset clip segments containing the actor's video frames or classic-scene video frames are acquired according to the text description, and the personalized recommended media asset is generated.
A function generate_personalized_clip is created that accepts the user characteristics and actor/director information as parameters and generates a personalized recommended media asset from them. First, an input text containing the user characteristics and actor/director information is created; the input text is then encoded into a sequence of input ids, a text description is generated by the model, and finally the generated text description is decoded and the corresponding media asset segments are acquired, as sketched below. In this embodiment, "Action movie lover" is used as the user characteristic and "XXX" as the actor/director information to generate the personalized recommended media asset.
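A minimal sketch of generate_personalized_clip, assuming a fine-tuned GPT-2 model and tokenizer are passed in; fetch_segments_for is a hypothetical stand-in for the media-library lookup that maps a description to matching segments.

def fetch_segments_for(description):
    # hypothetical lookup: map the description's feature elements to the
    # indexed actor/classic-scene video frames in the media asset library
    raise NotImplementedError

def generate_personalized_clip(user_features, actor_director, model, tokenizer):
    input_text = f"User profile: {user_features}. Focus on: {actor_director}."
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    output_ids = model.generate(input_ids, max_length=80,
                                pad_token_id=tokenizer.eos_token_id)
    description = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return fetch_segments_for(description)

# e.g. generate_personalized_clip("Action movie lover", "XXX", model, tokenizer)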
Example 2:
If the user prefers certain emotional elements, such as suspense, romance, or humor, the server 400 can provide more personalized recommendations by analyzing movie content and clipping and splicing the segments that highlight those particular emotional elements. The embodiment of the application designs a user-characteristic-based personalized clipping algorithm for recommended media assets: with the help of a large language model and the Transformers library, a large language model focused on the recommendation vertical is pre-trained and fine-tuned to generate personalized recommended media assets.
The pre-trained GPT-2 model and tokenizer are initialized from the pre-trained model "gpt2".
The user characteristics include a preferred emotional element.
A function generate_personalized_clips is defined that accepts the user characteristics, model, tokenizer, and other parameters and generates personalized media asset segments. First, key information is extracted from the user characteristics; then text descriptions are generated, producing multiple segments.
Personalized video segments are generated by calling the generate_personalized_clips function with the user characteristics, model, tokenizer, and other parameters; multiple personalized media asset segments are generated and spliced to obtain the recommended media asset, as sketched below. For example, with the preferred emotional element "suspense", five personalized suspense video segments are generated and then spliced together.
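A minimal sketch of generate_personalized_clips under the same assumptions; the prompt format and the dictionary key preferred_emotion are illustrative.

def generate_personalized_clips(user_features, model, tokenizer, num_clips=5):
    emotion = user_features.get("preferred_emotion", "suspense")
    descriptions = []
    for i in range(num_clips):
        prompt = f"Describe a {emotion} movie segment, variation {i + 1}:"
        ids = tokenizer.encode(prompt, return_tensors="pt")
        out = model.generate(ids, max_length=60, do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)
        descriptions.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return descriptions  # each description guides one segment extraction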
Example 3:
Based on the user's viewing history and comments, the system may generate a personalized user-comment video. By clipping and splicing the segments the user commented on positively, a recommended video can be produced.
Model initialization: the tokenizer and the GPT2LMHeadModel are initialized from the pre-trained model "gpt2".
The user features are replaced by actual user comments and viewing history data, stored in two separate lists.
Data preprocessing: the user comments and viewing history data are spliced into one text segment, which is encoded into a sequence of input ids with the tokenizer.
A video segment is generated: GPT-2 generates a text description, which is then decoded and returned.
A personalized user video is generated by calling the generate_video_clip function with the model and the input id sequence, producing a personalized video segment, as sketched below. For example, a final personalized recommended media asset is generated from four user comments and three movie-viewing history records.
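A minimal sketch of this example, assuming the Transformers GPT-2 pair stands in for the fine-tuned model; the comment and history strings are illustrative.

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # stands in for fine-tuned model

comments = ["Great pacing", "Loved the ending", "A bit slow", "Stunning visuals"]
history = ["Movie A", "Movie B", "Movie C"]

# splice comments and history into one text segment and encode it
input_text = "Comments: " + " | ".join(comments) + ". History: " + ", ".join(history)
input_ids = tokenizer.encode(input_text, return_tensors="pt")

def generate_video_clip(model, input_ids):
    out = model.generate(input_ids, max_length=120,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

description = generate_video_clip(model, input_ids)  # guides clip splicing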
In the embodiment of the application:
1) By analyzing the user's personal characteristics and interest preferences, the system can learn the user's tastes and generate personalized content. Such personalized recommendation applies not only to movies but also to other media content, such as advertisements and educational videos.
2) Using a pre-trained deep learning model, such as GPT-2, gives the server the ability to understand language and media content. Such models are pre-trained on large-scale data and can better capture context and user interests.
3) Through clipping and splicing techniques, the server can organically combine the segments of media content that match the user's interests, creating content closer to the user's needs. This helps improve user experience and satisfaction.
4) For the user's emotional preferences in movies, the server can emphasize specific emotional elements through clipping and splicing, making it easier for the user to find movie content that meets those emotional needs.
5) Because the scheme of the application is based on a deep learning model and general clipping and splicing techniques, it is not limited to the movie domain; it also applies to other fields, such as advertising, education, and social media, providing users with varied personalized experiences.
6) User engagement improves: by generating personalized content, the system can increase the user's interest in and engagement with recommended content, which is important for improving user retention and platform stickiness.
In the embodiment of the application, the server considers factors such as the user's personal preferences, emotional tendencies, and viewing habits during clipping. By analyzing the user's preference and behavior data, the user's tastes can be grasped accurately, so that personalized clipping results matching them can be provided. For example, for a user who likes suspense, the system may select and clip suspenseful episodes; for a user who likes romance, the system may select and clip romantic segments. Through personalized clipping, the server can satisfy users' differing interests and preferences and provide media content that better fits their needs. In addition, the system can continuously optimize the clipping algorithm according to user feedback and evaluation, providing a more accurate and personalized clipping experience that meets user expectations. Users can score and comment on the clipping results, and the system can optimize and improve the algorithm from this feedback data. By constantly learning the user's preferences and feedback, the system can gradually increase the accuracy of personalized clips and user satisfaction.
The embodiments of the present application provide a highly customized media content experience. The user does not need to spend a great deal of time searching for content of interest: personalized clips matching the user's taste can be produced automatically according to the user's characteristics and preferences. This not only saves the user time and effort but also provides a better user experience, increasing the user's stickiness to the media platform.
Some embodiments of the present application provide a recommended media asset generation method applicable to a server, the server being configured to: acquire user data, where the user data includes at least one of the user's media asset viewing history, media asset preference record, media asset viewing time preference, and media asset comments; input the user data into a target model to obtain a text description, where the text description is used to guide the acquisition and splicing of media asset segments, the target model is obtained by training a pre-training model on fine-tuning training data, the fine-tuning training data includes sample input data and the sample text descriptions corresponding to the sample input data, and the sample input data includes sample data of at least one of media asset viewing histories, media asset preference records, media asset viewing time preferences, and media asset comments; and acquire at least one target media asset segment based on the text description and splice the target media asset segments to obtain the recommended media asset. By analyzing the user's personal characteristics and interest preferences and using a deep learning algorithm, the embodiment of the application can automatically select and clip, from a large media asset library, personalized recommended media assets that match the user's taste. This not only provides users with a highly customized media content experience but also saves them time and energy, so they can easily find and enjoy the content they love.
With the rapid growth of the internet and digital media, people face the selection and acquisition of vast amounts of information. In this era of information explosion, recommendation systems have become an important tool for helping users screen personalized content from massive information. A recommendation system can provide personalized recommended content according to the user's interests, preferences, and behavior data, improving user experience, increasing user stickiness, and promoting consumption. Recommendation systems have been widely used in fields such as e-commerce, social media, news, music, and video. However, existing recommendation systems often have problems in generating recommendation texts. First, conventional recommendation systems usually focus only on users' behavior data and lack deep mining of users' preference characteristics; as a result, the recommendation results lack personalization and fail to meet users' personalized needs. Second, existing recommendation systems often neglect the media asset features. The media asset features include key information of the media content, such as titles, labels, and descriptions, which strongly affects the accuracy of the recommendation results and the user experience. Current recommendation systems rarely mine and use the media asset features in depth, so the generated recommendation texts are poor.
In some embodiments, the most common recommendation algorithm is a collaborative filtering based recommendation algorithm. Collaborative filtering is a recommendation method based on user behavior data, and the user's historical behavior data and similar user behavior data are analyzed to find out the interests and hobbies of similar users so as to conduct recommendation. However, collaborative filtering algorithms have a cold start problem in that there is insufficient data for new users or new assets to make accurate recommendations. In addition, collaborative filtering algorithms cannot deeply mine media features, resulting in lack of personalization of recommendation results. In addition, the application of deep learning in recommendation systems has also made some progress. Deep learning can learn the interests of users and the characteristics of media assets from large-scale data by establishing a deep neural network model, so that the recommendation accuracy is improved.
The popularity of large language models such as GPT also brings opportunities for personalized recommendation. In recent years, the amount of data on the internet has exploded while computing resources have become more powerful and readily available, providing the conditions for training and applying large language models and allowing them to spread into the recommendation field. The development of pre-training and fine-tuning techniques has made the training and application of large language models more efficient and effective: by pre-training a large language model, the statistical rules of language can be learned from a large amount of data, and fine-tuning then adapts the model to specific tasks, making it more specialized and personalized. Large language models therefore have broad application prospects in generating personalized recommendation texts. A large language model can generate personalized recommendation texts based on the user's historical behavior, preferences, and context information. These recommendation texts can be used in various recommendation systems, such as merchandise, news, and music recommendation, to provide a more accurate and attractive recommendation experience.
In order to solve the above technical problems, an embodiment of the present application provides a server 400. As shown in fig. 9, the server 400 performs the steps of:
Step S901: receiving a recommendation request sent by display equipment, wherein the recommendation request comprises a user identifier;
In some embodiments, the display device 200 displays a recommended media asset page that includes at least one recommended media asset control. Upon receiving a user instruction selecting a recommended media asset control, the display device 200 sends a media asset detail interface data request to the server 400 and, after receiving the media asset detail interface data, controls the display 260 to display the media asset detail page corresponding to the selected control; the media asset detail page includes a recommendation text display area for displaying the recommendation text.
In some embodiments, the media asset detail interface data request includes a recommendation request, which includes a user identifier and a recommended media asset identifier.
In some embodiments, upon receiving a user instruction selecting a recommended media asset control, the display device 200 sends a media asset detail interface data request and a recommendation request to the server 400.
Illustratively, in the media asset recommendation page shown in fig. 6, a user instruction selecting the first recommended media asset control 621 is received, and in response, a media asset detail interface data request and a recommendation request are sent to the server 400. The recommendation request includes a user identifier and the media asset identifier corresponding to the first recommended media asset control 621. The display device 200 receives the media asset detail interface data sent by the server 400 and controls the display 260 to display the media asset detail page, as shown in fig. 10. The media asset detail page includes a media asset display area 101, a recommendation text display area 102, a full screen control 103, and a focus 104. The media asset display area 101 plays the first recommended media asset, the recommendation text display area 102 displays the recommendation text, and the full screen control 103 enlarges the media asset display area to full screen.
In some embodiments, the application is not limited to recommended media asset controls: when the display device 200 displays any page that includes a media asset control and receives a user instruction selecting that control, the display device 200 sends a recommendation request to the server 400.
In some embodiments, the display device 200 displays a media asset page whose interface includes a recommendation control; upon receiving a user instruction selecting the recommendation control, it sends an interface data request and a recommendation request to the server 400, where the recommendation request includes a user identifier.
Illustratively, as shown in FIG. 11, the media asset interface includes a recommendation control 111. Upon receiving a user instruction selecting the recommendation control 111, an interface data request and a recommendation request are sent to the server 400. The page data sent by the server 400 is received, and the recommendation interface is displayed, as shown in fig. 12. The recommendation interface includes a media asset display area 121, a recommendation text display area 122, a recommendation list 123, a full screen control 124, and a focus 125. The recommendation list includes a first recommended media asset control 1231 and a second recommended media asset control 1232. The recommendation text display area 122 displays the recommendation text corresponding to the recommended media asset control selected by the focus 125. The media asset display area 121 displays the recommended media asset, i.e., the personalized clip, corresponding to the recommended media asset control selected by the focus 125.
Step S902: acquiring user data based on the user identification;
in some embodiments, the user data is stored in the server 400 corresponding to the user identification, and the server 400 may obtain the user data according to the user identification.
In some embodiments, the user data is stored in the display device 200, and the display device 200 sends a recommendation request including the user data to the server 400.
The user data includes at least one of a user's media asset viewing history, media asset preference records, media asset search records, and media asset reviews.
Step S903: determining recommended media assets and acquiring media asset characteristics corresponding to the recommended media assets, wherein the media asset characteristics are used for representing information of the recommended media assets;
In some embodiments, the recommendation request further includes a recommended media asset identifier, and the step of determining the recommended media asset includes:
determining the recommended media asset based on the recommended media asset identifier.
When it is known that the user has opened the detail page of a recommended media asset, the identifier of that recommended media asset is sent to the server 400 together with the user identifier, and the server 400 can determine the recommended media asset from the media asset identifier.
In some embodiments, the step of determining the recommended media asset includes:
determining the recommended media asset based on the user data.
When the user has not opened the detail page of a specific recommended media asset, only the user identifier is sent to the server 400. The server 400 may determine the recommended media asset from the user data corresponding to the user identifier, that is, from the user's media asset viewing history, media asset preference record, media asset search record, and media asset comments.
Media asset characteristics are typically attributes or information describing media content.
In some embodiments, the media metadata includes information about the media content, such as title, description, release time, author, etc. This information can be used directly as a media asset feature or processed for modeling.
The step of obtaining the media asset characteristics corresponding to the recommended media asset comprises the following steps:
acquiring the media asset characteristics corresponding to the recommended media asset from the media metadata, where the media asset characteristics include the title, description, release time, and author.
In some embodiments, text descriptions of media content are processed, and information such as keywords, topics, emotions and the like are extracted as media asset features. Text information may be converted into a feature representation that may be used for the model using natural language processing techniques.
The step of obtaining the media asset characteristics corresponding to the recommended media asset comprises the following steps:
extracting keywords, topic information, and emotion information from the text description of the recommended media asset, and determining them as the media asset characteristics.
In some embodiments, for media content that contains images or video, visual features may be extracted using computer vision techniques. This may include color histograms of images, texture features, object recognition results, etc.
The step of obtaining the media asset characteristics corresponding to the recommended media asset comprises the following steps:
extracting visual features of the recommended media asset using computer vision techniques, and determining the visual features as the media asset characteristics, where the visual features include the color histograms, texture features, and object recognition results of the images.
In some embodiments, for media content containing audio, audio features such as audio spectrum, cadence patterns, etc. may be extracted. This is useful for content such as music, audio files, and the like.
The step of obtaining the media asset characteristics corresponding to the recommended media asset comprises the following steps:
extracting audio features of the recommended media asset, and determining the audio features as the media asset characteristics, where the audio features include the audio spectrum and rhythm patterns.
In some embodiments, user comments, ratings, and interactive behavior may also be used as a media asset feature. Such feedback information reflects the user's opinion and preference for media content.
The step of obtaining the media asset characteristics corresponding to the recommended media asset comprises the following steps:
obtaining the user's comments, ratings, and interaction behavior, and determining them as the media asset characteristics.
In some embodiments, media content is mapped to a high-dimensional feature space using a pre-trained deep learning model, such as an image classification model, text embedding model, or the like, and these features are used as media asset features.
The method of acquiring media asset characteristics may vary with the type of media content and the data available. Using multiple feature acquisition methods together helps represent the media content more fully, as in the sketch below.
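A minimal sketch of the last method above, mapping textual media metadata into a feature vector with a pre-trained text-embedding model; BERT and mean pooling are illustrative choices.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def media_text_feature(title, description):
    inputs = tok(title + ". " + description, return_tensors="pt",
                 truncation=True, max_length=128)
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)  # mean-pooled sentence embedding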
In some embodiments, the media asset characteristics of the recommended media asset may be obtained in advance, and the recommended media asset and its media asset characteristics stored at a predetermined location. When the media asset characteristics of the recommended media asset are needed, they are fetched directly from the predetermined location without re-running the methods above. This shortens the generation time of the recommendation text.
Step S904: inputting the media asset characteristics and the user data into a recommendation generation model to obtain the recommendation text.
The recommendation generation model is obtained by training a pre-training model on fine-tuning training data. The fine-tuning training data includes sample input data and the sample recommendation texts corresponding to the sample input data, and the sample input data includes media asset characteristics together with at least one of media asset viewing histories, media asset preference records, media asset search records, and media asset comments.
The generation step of the recommendation generation model comprises the following steps:
1) Preparing the dataset: a large-scale text dataset related to the recommendation domain is collected. This may include user historical behavior, commodity descriptions, comments, and so on. The dataset should contain sufficient diversity for the model to learn broad recommendation-domain knowledge.
The text data set includes large-scale text data related to the recommendation field, and may include information of historical behaviors of users, commodity descriptions, comments and the like. The text data may exist in plain text form or may contain structured information, and the specific form may be a series of text documents or records. For training, these text data will be used as training samples for the model. The model is pre-trained on these data to learn statistical rules of language and domain specific knowledge.
The user historical behavior takes the form of the user's click records, browsing history, search records, and so on, in the text dataset. These behavior records help the model learn the user's preferences and interests and thus generate more personalized recommendation texts. Each user's behavior sequence may be organized into samples comprising inputs (the user's previous behavior) and targets (the user's next possible behavior or preference). Historical behavior may include the media assets the user watched, the links clicked, the keywords searched, the media assets liked, and so on. The user historical behavior includes the media asset viewing history, the media asset preference record, and the media asset search record.
The commodity description refers to the description information of a media asset: its introduction, characteristics, genre, director, actors, and other related information. The media asset characteristics may include the asset's genre, style, director, actors, and so on. This information can be converted into a form the model can understand and used to learn the user's preferences over items, so as to better understand the relationship between media assets and users and improve the accuracy of personalized recommendation.
The media asset comments refer to the user's evaluation or opinion of a certain media asset. In a text data set, comments may exist as part of the text data associated with information related to a particular media asset. The user history behavior may correspond to comments, for example, a comment may be left after the user views a piece of media. User comments on the assets can also be used as part of the training data to learn the user's opinion and emotion. This may need to be associated with the identity of the asset to ensure that the comment corresponds to the correct asset.
Other features: other characteristics, such as user demographics, date of release of the media asset, etc., may also be considered depending on the particular task. These features may be used to more fully understand the context of the user and the merchandise.
2) Selecting the pre-training model: a large language model suitable for the task, i.e., a pre-trained model such as GPT-3 or BERT (Bidirectional Encoder Representations from Transformers, a language representation model), is selected. The model's size and resource requirements are taken into account in the selection.
3) Preprocessing the data: the data of the dataset is preprocessed, including word segmentation, tokenization, stop-word removal, and so on, ensuring that the data format meets the input requirements of the selected model.
Segmentation is the process of dividing text into words or tokens. In this step, the text is broken down into lexical units, typically words. Chinese segmentation is a process of cutting chinese text into words, while english segmentation is cutting english text into words. Word segmentation aims at converting text into the most basic language units that the model can understand and process. This helps the model to better understand the semantics and structure of the text, improving the performance of the model on the text data.
Tokenization is the process of converting text into a sequence of tokens. In natural language processing, tokens may be words, subwords, or characters. This typically involves dividing the text into discrete tokens using a predefined tokenizer, such as the Tokenizer of GPT-2. The purpose of tokenization is to convert text into sequence data the model can process for subsequent training or inference. Tokenization also helps build the text's vocabulary, which is important for the model to understand linguistic context.
Stop-word removal refers to removing common, low-information words in text, such as "is" and "at". Stop words are typically filtered out in natural language processing tasks because they often carry no useful semantic information. Removing stop words reduces the dimensionality of the text data and strips out noise and redundant information that do not contribute to the model's task. This improves the model's efficiency, focuses it on capturing the key information, and reduces the demand for computing resources.
The format meeting the model input requirements is typically a tensor or sequence data format acceptable to the model, with the particular format being dependent on the large language model selected. Generally, the input of a large language model is pre-processed text data, which is converted into a digital representation that the model can understand.
For most language models, including models like GPT-3 and BERT, it is often necessary to convert text data into a labeled sequence. This sequence may be a sequence of words, subwords or characters, depending on the pre-training mode of the model. Each element of these sequences corresponds to a tag or index in a vocabulary.
After tokenization, the tokens usually need to be converted into a tensor format the model can process, i.e., input in numeric form. This may be a sequence of integers, each corresponding to a token in the vocabulary. Some models also require additional information, such as the position encodings of the embedding layer.
In summary, a format meeting the model's input requirements is a tokenized text sequence in which each token is mapped into the model's vocabulary and finally converted into a numeric tensor the model can process. A concrete implementation may use the model's tokenizer and related tools, as in the sketch below.
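A minimal sketch of this preprocessing chain, assuming the GPT-2 tokenizer from Transformers; the stop-word list is an illustrative stand-in for a task-specific list.

from transformers import GPT2Tokenizer

STOP_WORDS = {"the", "is", "at", "of", "a"}  # illustrative, task-specific

def preprocess(text, tokenizer):
    # word segmentation + stop-word removal, then tokenization to tensors
    kept = " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
    return tokenizer(kept, return_tensors="pt", truncation=True, max_length=128)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
batch = preprocess("The user watched a lot of science fiction.", tokenizer)
# batch["input_ids"] is the numeric tensor fed to the model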
4) And (3) constructing a model: a language model is constructed for pre-training. This typically involves a pre-training phase in which the model learns over a large scale of text to capture the statistical rules of the language.
5) Fine-tuning the model: the pre-trained model is fine-tuned with the preprocessed data. The purpose of fine-tuning is to make the model more specialized and adapted to the specific recommendation task, i.e., the recommendation text generation task. The fine-tuned model is the recommendation generation model.
It should be noted that, the training data sources for constructing the model are generally extracted from a large-scale text corpus or a domain-related dataset. They are used to train language models during a pre-training phase, allowing the models to learn generic language representations and grammar structures. In building training data for a model, the goal is to let the model learn a broad range of linguistic knowledge, without being specific to a certain domain or task. The goal of this training phase is to provide a strong language understanding basis for the model.
The training data for the fine tuning model is typically derived from a data set of a particular domain or task, such as user behavior data in a recommendation system, commodity descriptions, and the like. These data are collected in order to make the model more specialized, adapting to specific recommended tasks. In the fine tuning stage, the model has some language understanding capability through pre-training, but needs to be further adapted to the specific task. The goal of the fine-tuning is to allow the model to better understand and predict information related to a particular recommended task, such as generating personalized recommendations.
Training data for building models is general, widely covering data of linguistic knowledge, while fine-tuned data is data for making models more specialized, adapted to specific tasks. The fine-tuning is performed on the basis of a generic language model to improve the performance of the model on a specific task.
A pre-training model generally refers to a model that is trained on large-scale data, where the input is raw data, i.e., user features, media features, etc., the output is a corresponding vector representation or feature representation, and does not require a specific prediction result as in a fine-tuning model, because the goal of the pre-training model is to learn advanced features of the data, rather than to make specific predictions for a particular task. Pre-training models are typically trained on large-scale data, learning advanced features of the data, such as texture, shape, color, etc. of pictures, semantics, grammar, emotion, etc. of text, and frequency, cadence, etc. of sounds. These features learned by the pre-trained model can be used for many different tasks such as image classification, object detection, text classification, emotion analysis, etc., as these tasks all require high-level features of the extracted data. Thus, the output of the pre-training model is a generic vector representation or feature representation, rather than a specific prediction result. In practical application, the output of the pre-training model can be used as the input of other models, or fine-tuning training can be performed on the basis of the output of the pre-training model so as to adapt to different tasks and data. The convergence rate of the model can be increased, the performance of the model can be improved, and the data volume and the demand of calculation resources can be reduced by pre-training the advanced features learned by the model.
The fine-tuned model is a model that is fine-tuned, on the basis of the pre-training model, for the specific task of generating personalized recommendation texts; its input is the task-related data and labels, and its output is the corresponding recommendation text. The goal of the fine-tuned model is to optimize performance on the specific task, which often requires adding or modifying several neural network layers on top of the pre-trained model to accommodate different inputs and outputs.
Illustratively, the steps of generating the recommendation generation model include (a minimal sketch follows the list):
Data preparation: the dataset includes user historical behavior, commodity descriptions, and other information, in a format that meets the model's input requirements.
Model selection and loading: a pre-trained GPT-2 language model is selected and loaded using the Hugging Face Transformers library.
Data preprocessing: a data-preprocessing class RecommenderDataset is defined, containing the initialization, length, and sample-retrieval methods. The text data is segmented and tokenized with the GPT-2 tokenizer.
Model initialization: a GPT-2 model object is created for the subsequent generation task.
Recommendation generation function: a function generate_recommendation is defined that accepts the user historical behavior, the model object, and the tokenizer as parameters and generates the personalized recommendation text.
Input generation: in the generate_recommendation function, the user historical behavior is concatenated with a prefix text to form the input text.
Encoding and generation: the input_text is encoded with the tokenizer and converted into the input tensor input_ids acceptable to the model; the model's generate method produces the recommendation output.
Decoding and output: the generated text is decoded with the tokenizer to obtain the final recommendation text.
Result return: the generated recommendation text is returned as the function's output.
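A minimal sketch of the generate_recommendation flow listed above, assuming a fine-tuned GPT-2 model and tokenizer are passed in; the prefix text and history format are illustrative.

def generate_recommendation(user_history, model, tokenizer):
    # input generation: concatenate the history with a prefix text
    input_text = "User history: " + "; ".join(user_history) + ". Recommendation:"
    # encoding and generation
    input_ids = tokenizer.encode(input_text, return_tensors="pt")
    output = model.generate(input_ids, max_length=100, num_beams=4,
                            pad_token_id=tokenizer.eos_token_id)
    # decoding, output, and result return
    return tokenizer.decode(output[0], skip_special_tokens=True)

# e.g. generate_recommendation(["watched Movie X", "liked sci-fi"], model, tokenizer)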
As shown in fig. 13, the step of fine-tuning the model includes:
1) Loading a pre-trained model
A pre-trained language model suitable for the task, such as GPT-2 or BERT, is selected, and the model's pre-trained weights are loaded using the corresponding library (e.g., the Hugging Face Transformers library). Using a pre-trained model allows the rich knowledge the model learned in the general context to benefit the specific task quickly, improving the model's performance on that task.
There are several key reasons for selecting a pre-trained language model and fine-tuning it to generate personalized recommendation texts:
1. Transfer learning: the pre-trained language model is trained on large-scale general language data and learns rich language representations. Such models perform excellently in general contexts and can therefore serve as a powerful starting point for fine-tuning on a specific task. Through transfer learning, the model adapts to recommendation-system tasks faster, because it already has a deep understanding of language.
2. Semantic understanding and generation capabilities: the pre-trained language model has learned rich semantic representations in its original tasks (e.g., language modeling, masked language modeling) and can understand and generate natural language. This capability is very important for generating personalized recommendation texts, which typically require a deep understanding of user interests and media asset characteristics.
3. Training time and resource consumption are reduced: training of pre-trained models typically requires a significant amount of computational resources and time, while fine tuning is a relatively more cost-effective approach. By fine tuning on the basis of the pre-trained model, better performance can be achieved on smaller data sets, reducing reliance on large-scale data and computing resources.
4. Commonality and generalization: the pre-training language model is trained on the general language data, so that the pre-training language model has better generalization. This means that the pre-trained model can provide good initial performance and perform well on small amounts of fine-tuning data even if there is no large scale of specific data sets in the field of recommendation systems.
5. Model interpretability: pre-trained language models generally have better interpretability because their learning process is based on a large amount of text data. This makes it easier to understand the decision making process of the model during the fine tuning phase, helping to debug and optimize the recommendation system.
When selecting a pre-trained language model to fine-tune for generating personalized recommendation texts, the following key considerations apply:
1. Task relevance: ensure the selected pre-trained language model matches the recommendation system's task well. Different tasks may require different context understanding and generation capabilities, so selecting a pre-trained model relevant to the recommendation domain is crucial.
2. Model performance: evaluate the pre-trained model's performance on general tasks, which may be measured on standard benchmark tasks, such as a language model's perplexity or a BERT model's masked-language-modeling performance.
3. Model interpretability: considering the interpretability of the model, the logic to generate the recommendation needs to be understood, especially in a recommendation system. The easier the model is to interpret, the more conducive to debugging and optimization in practical applications.
4. Architecture and flexibility of model: whether the architecture of the model is flexible or not can be easily modified to accommodate the needs of the recommendation system. Some pre-trained models may require more adjustments to adapt to a particular generation task.
5. Resource and computational requirements: consider whether the selected pre-trained model fits the available computing resources and budget. Some models may require more computing resources, while others perform well in lower-resource environments.
6. Coverage of the dataset: it is ensured that the training data of the pre-training model covers the context associated with the recommendation system. The versatility of the model in the recommendation field depends largely on the coverage of its training data.
7. Community support and update frequency: check whether the selected model has strong community support, so that help can be obtained and problems solved during use. Also consider whether the model is continuously updated, so as to benefit from the latest improvements and performance gains.
8. Legal and ethical factors: note whether the pre-training model selected meets legal and ethical requirements, including data privacy and copyright issues that may be involved in model training.
2) Modifying model structures
The model structure is modified to fit the task. In the recommendation scenario, some additional layers are needed to process the user's historical information and generate personalized recommendation texts. The additional layers include at least one of: a sequence modeling layer, an attention layer, a generation module, a condition generation module, a fusion layer, a fine-tuning-related layer, and a feed-forward neural network.
In some embodiments, as shown in FIG. 14, the modified model structure includes a sequence modeling layer, an attention mechanism (attention layer), a generation module, a fusion layer, and a fine-tuning correlation layer.
Modifying the model structure to accommodate the corresponding task generally takes into account:
a) Processing user history information:
Sequence modeling layer: if the recommendation system focuses on the user's historical behavior sequence, a sequence modeling layer is added to process it. The sequence modeling layer may be a recurrent neural network (RNN) or a long short-term memory network (LSTM) for capturing the user's behavior-sequence information.
Attention mechanism: introducing an attention mechanism helps the model focus on the key parts of the user's historical behavior sequence, raising the weight given to important behaviors.
b) Generating personalized recommendation texts:
Generation module: to generate personalized recommendation texts, a generation module is added, which may employ a generative adversarial network (GAN), a variational autoencoder (VAE), or a similar generative model.
Condition generation: the user's historical behavior and the media asset characteristics are introduced into the generation module as conditions, ensuring the generated recommendation text is relevant to the user's interests and the recommended content.
c) Combining the user portrait with the media asset characteristics:
Fusion layer: a fusion layer fuses the user's historical behavior (or user portrait) with the media asset characteristics so that the model can consider both at the same time. Fusion can be achieved by simple concatenation, weighted summation, and so on.
Multimodal fusion: if the media asset feature contains multimodal information such as text, images, etc., it is contemplated to use a multimodal fusion strategy, such as using a multimodal attention mechanism or a multimodal fusion network.
d) Model fine-tuning related layer: the additional fine-tuning layers typically include fully connected layers or other tunable-parameter layers suited to the custom task.
The above extra layers are added to better accommodate the recommended system tasks, with adjustments made based on understanding of the tasks and knowledge of the data. In practical application, parameters and structures of the layers can be adjusted according to specific conditions so as to meet the requirements of tasks.
e) Model output adaptation: the output of the GPT-2 model is the hidden state of the entire sequence, but in the recommendation-system task only the last hidden state of the sequence may matter, so an additional layer is needed to adapt the output.
f) Task specificity: the added layers must match the specific task of the recommendation system; for example, the task of generating personalized recommendation texts may require a layer that captures user interests and content characteristics.
In practice, the basis for adding these additional layers is often derived from a deep understanding of the recommendation system tasks, as well as knowledge of the user's historical behavioral information and the role of the media asset features in generating recommendations, thereby improving model performance. In particular implementations, the layers may be defined using a deep learning framework (e.g., tensorFlow, pyTorch) and integrated into the overall model.
Modification of the model is typically based on the specificity of the task and the type of data to be accommodated. Different tasks and data may require different model structures and adjustments.
Task differences: different personalized recommendation tasks may have different requirements. For example, one system may focus on the semantic similarity of recommendations, while another may care more about recommendation diversity. These differences may call for different adjustments to the model structure.
Data diversity: a recommendation system involves various types of data, including user behavior data, media asset features, user portraits, etc. Different recommendation systems may use different types of data, so the model must be modified according to this diversity to better capture the relationships among the data.
Domain differences: recommendation systems are applied in different domains, such as movies, music, and news. User behavior and content characteristics may differ across domains, so the model needs to be adjusted to better fit the recommendation task in a specific domain.
User behavior patterns: user behavior patterns may vary across application scenarios. Some systems may focus more on the user's long-term interests, while others focus on the user's immediate behavior. Modifying the model may involve different ways of modeling these behavior patterns.
Recommended content types: the recommended content may include text, images, video, and other types. Different content types may require different processing, such as multimodal fusion strategies, so model modification may involve handling multiple content types efficiently.
Performance optimization: the model is modified to improve performance, for example by adding attention mechanisms or adjusting layer parameters. The goal of performance optimization is to make the model more efficient at capturing user interests, improving recommendation accuracy and appeal.
Illustratively, the steps for modifying the model structure include:
A) Loading a pre-trained model:
A pre-trained GPT-2 model is loaded as the base model.
B) Defining an additional feed forward neural network:
An additional feed-forward neural network (Feedforward Neural Network) is added, including one or more fully connected layers (linear layers), activation functions, and the like. For example, the feed-forward neural network includes a linear layer, a ReLU activation function, and another linear layer. This feed-forward neural network is added to better fit the recommendation system task.
C) Model initialization:
A new model class is created that inherits from nn.Module (the base class of neural network modules), containing the pre-trained GPT-2 model and the additional feed-forward neural network.
D) Forward propagation:
In the forward propagation method, the GPT-2 model performs forward propagation first, and the output hidden state is then passed to the additional feed-forward neural network to obtain the final recommendation output.
E) Fine-tuning training: the entire model is trained using fine-tuning data so that it better adapts to the recommendation task.
The embodiment of the application enables the model to better understand and handle the recommendation system task by introducing an additional feed-forward neural network. By adding the layers required by a specific task on top of the pre-trained model, the model's performance on domain-specific tasks is improved; a minimal sketch of such a modified model is given below.
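Illustratively, a minimal sketch of the modified model, written with PyTorch and the Transformers library, is given below; the class name, hidden size, and candidate-set size are illustrative assumptions, and the sketch only mirrors steps A)-E) above rather than a definitive implementation.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class RecommenderGPT2(nn.Module):
    """GPT-2 backbone plus an additional feed-forward head (steps A)-D))."""
    def __init__(self, num_items, hidden_dim=256):
        super().__init__()
        # Step A): load the pre-trained GPT-2 as the base model.
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        d_model = self.gpt2.config.n_embd  # 768 for the base GPT-2
        # Step B): linear layer -> ReLU -> linear layer; num_items is a
        # placeholder for the size of the recommendation candidate set.
        self.head = nn.Sequential(
            nn.Linear(d_model, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_items),
        )

    def forward(self, input_ids, attention_mask=None):
        # Step D): forward-propagate through GPT-2, then, as in step E),
        # keep only the last hidden state of the sequence before the head.
        out = self.gpt2(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = out.last_hidden_state[:, -1, :]
        return self.head(last_hidden)
```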
3) Defining a loss function
When defining the loss function, the task objective and the output form of the model need to be considered. For the personalized recommendation generation task, a common loss function may be chosen, or a specific loss function may be designed based on the characteristics of the task.
Cross-entropy loss (Cross-Entropy Loss): cross-entropy loss may be used if the target task is a classification task; for example, the task of generating a recommendation based on user history information and media asset features can be treated as a classification task, i.e., the model needs to select the correct recommendation from a fixed set of categories. This loss applies to classification problems, measuring the difference between the probability distribution generated by the model and the probability distribution of the actual label.
Mean squared error loss (Mean Squared Error Loss): if generating the recommendation is a regression problem, i.e., the model needs to produce continuous-valued recommendations, a mean squared error loss can be used. It measures the difference between the continuous values generated by the model and the actual labels.
Custom loss function: depending on the particularities of the task, a custom loss function may also be defined. For example, if the semantic similarity of the recommendation matters, a loss function may be designed that takes into account the similarity between the semantic representation of the generated sentence and that of the actual label.
Multi-task learning: if the target task involves multiple subtasks, multi-task learning can be used to design a loss function that considers several tasks simultaneously. For example, besides generating recommendations, other tasks such as click prediction or viewing-duration prediction may be of interest.
Adversarial training loss: if more diverse and innovative recommendations are desired, an adversarial training loss can be introduced, encouraging the model to generate recommendations that differ more from the training data.
The choice of loss function depends mainly on the nature and goal of the task. Consider whether the loss function can effectively guide the model to generate recommendations that meet the user's interests and needs, and whether it achieves good performance on the validation or test set. Experiment with and adjust the loss functions, observing model performance under each, to find the one best suited to the task; a brief sketch of these options is given below.
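For illustration only, the loss options above might be set up as follows in PyTorch; the semantic-similarity loss is a hypothetical example of the custom design described, not a prescribed formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Classification-style recommendation: compare the predicted distribution
# over candidate items with the actual item label.
ce_criterion = nn.CrossEntropyLoss()

# Regression-style recommendation (continuous-valued output, e.g., a score).
mse_criterion = nn.MSELoss()

def semantic_similarity_loss(gen_emb, label_emb):
    """Hypothetical custom loss: penalize low cosine similarity between the
    semantic embedding of the generated sentence and that of the label."""
    return 1.0 - F.cosine_similarity(gen_emb, label_emb, dim=-1).mean()

# Example with made-up shapes: 4 samples, 10 candidate items.
logits = torch.randn(4, 10)
labels = torch.tensor([1, 3, 0, 7])
print(ce_criterion(logits, labels))
print(semantic_similarity_loss(torch.randn(4, 768), torch.randn(4, 768)))
```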
4) Preparing fine-tuning data
The fine-tuning data uses a dataset from the recommendation domain.
When fine-tuning the model to adapt it to the personalized recommendation generation task, the adjustment of model parameters is mainly based on the following aspects:
1. Task objective and data characteristics: consider the specific goals of the recommender system task, such as generating personalized recommendations, and the features of the dataset. Different recommendation tasks may require different model structures and parameter settings to better capture the user's interests and generate appropriate recommendations.
2. Model performance and metrics: define the model's performance metrics, such as accuracy on the validation set and diversity of the generated recommendations. Based on these metrics, evaluate the model's performance on the task and adjust the model parameters to improve it.
3. Hyperparameter tuning: adjust the model's hyperparameters, such as learning rate, batch size, and hidden layer size. Select the optimal hyperparameter configuration through experiments and validation-set performance to improve the model's generalization and training results.
4. Loss function adjustment: consider the selection and adjustment of the loss function during fine-tuning. Different loss functions influence the training direction and speed differently, so an appropriate loss function should be selected according to the task objective, with parameters such as weights adjusted accordingly.
5. Regularization strategy: consider whether a regularization strategy, such as L1 or L2 regularization, needs to be introduced to prevent overfitting. In the recommendation system, since data can be sparse, overfitting is a common problem, and proper regularization helps to improve the generalization performance of the model.
6. Iterative training: an appropriate iterative training strategy is used. Data in the recommendation system may change over time, so after the model is online, iterative training may be performed periodically to capture new user interests and behavior patterns.
7. Real-time requirements: consider the real-time requirements of a recommendation system. Some recommendation systems need to respond to the user's behavior in real time, so the model needs to have efficient reasoning speed while guaranteeing performance. The model parameters should also be selected in consideration of real-time requirements.
8. User feedback and A/B testing: adjust using user feedback and A/B test results. Actual user feedback may differ from laboratory performance, so deploy the model, run A/B tests, and use feedback from user behavior to further adjust the model parameters.
By considering these aspects together, the model parameters can be adjusted to better suit the personalized recommendation generation task, improving the model's effectiveness and practicality.
5) Fine tuning model
The model is trained using the fine-tuning data, and the model parameters are adjusted to fit the recommendation task.
Illustratively, GPT-2 is used for fine tuning to build the recommendation generation model.
A) Pre-training model selection: GPT-2, a generic language model, has been pre-trained on large-scale Internet text. The model parameters of GPT-2 contain a broad understanding of language.
B) Fine-tuning purpose: the task is to generate personalized video recommendations in an online video recommendation system. The user's historical behavior data include videos watched, videos liked, and keywords searched. This information is used to generate more attractive recommendations for the user.
C) Fine tuning:
Loading the pre-trained model: the pre-trained weights of GPT-2 are obtained from open-source resources, and the model is loaded.
Modifying the model structure: to accommodate the video recommendation task, a new attention layer is added for processing the user's historical viewing record and linking it to the backbone structure of GPT-2.
Defining a loss function: a loss function is selected based on the semantic similarity between the generated recommendation and the actual user preference label.
Preparing fine-tuning data: user behavior data from the video recommendation system are used, including viewing history, like records, and the like.
Preparing the fine-tuning dataset typically involves user behavior data and media asset features. The user behavior data include the user's viewing history, like records, etc.; the media asset features are features of the assets (such as videos) in the recommendation system, for example: textual information such as video titles and descriptions, numerical information such as video durations and release times, categorical information such as video tags and categories, and images or other media characteristics of the video.
Fine tuning:
The model is fine-tuned using the data of the fine-tuning dataset, and the model parameters are adjusted so that the model is better suited to the video recommendation task. During fine-tuning, by learning the user's viewing history and preference characteristics, the model becomes more focused on tasks in the video recommendation domain.
Illustratively, the fine-tuning procedure for building the recommendation generation model with GPT-2 is as follows:
1. Import the required libraries: Transformers, torch.
2. Define the class RecommenderDataset for the fine-tuning dataset, implementing the __init__, __len__, and __getitem__ methods.
3. Load the pre-trained model, creating and initializing the model object model and the tokenizer object tokenizer.
4. Modify the model structure by adding a linear layer after the GPT-2 model for the recommendation task.
5. Define the loss function, creating a loss function object criterion initialized with CrossEntropyLoss.
6. Prepare the fine-tuning dataset, creating a dataset object dataset initialized with the RecommenderDataset class, passing data_path and tokenizer as parameters. Create a data loader object dataloader initialized with torch.utils.data.DataLoader, passing the dataset, batch_size, and shuffle parameters.
7. Define the optimizer, creating an optimizer object optimizer initialized with torch.optim.AdamW, passing model.parameters() and the learning rate lr as parameters.
8. Fine-tune the model, training multiple epochs in a loop.
In each epoch, each batch in dataloader is traversed.
Input data inputs and tags are acquired.
The data is input into the model to obtain the output of the model.
Loss is calculated, and cross entropy loss between the prediction result and the label is calculated using criterion.
The optimizer's gradients are cleared, back propagation and optimization are performed, and the model parameters are updated.
The loss value of the current epoch is printed.
9. The fine-tuned model is saved to a specified path. A minimal code sketch of these steps is given below.
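A minimal end-to-end sketch of steps 1-9 is given below, assuming a hypothetical tab-separated fine-tuning file in which each line pairs a serialized user-behavior text with the index of the item the user actually preferred; the file format, path, candidate-set size, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Model, GPT2Tokenizer

# Step 2: the fine-tuning dataset; each sample is "text<TAB>label".
class RecommenderDataset(Dataset):
    def __init__(self, data_path, tokenizer, max_len=128):
        self.samples, self.tokenizer, self.max_len = [], tokenizer, max_len
        with open(data_path, encoding="utf-8") as f:
            for line in f:
                text, label = line.rstrip("\n").split("\t")
                self.samples.append((text, int(label)))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        text, label = self.samples[idx]
        enc = self.tokenizer(text, truncation=True, max_length=self.max_len,
                             padding="max_length", return_tensors="pt")
        return enc["input_ids"].squeeze(0), enc["attention_mask"].squeeze(0), label

# Steps 3-4: load the pre-trained model and add a linear recommendation head.
NUM_ITEMS = 1000  # placeholder size of the candidate set
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
backbone = GPT2Model.from_pretrained("gpt2")
head = nn.Linear(backbone.config.n_embd, NUM_ITEMS)

# Steps 5-7: loss function, data loader, optimizer.
criterion = nn.CrossEntropyLoss()
dataset = RecommenderDataset("finetune_data.tsv", tokenizer)  # hypothetical path
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()), lr=5e-5)

# Step 8: fine-tune over multiple epochs.
for epoch in range(3):
    total = 0.0
    for input_ids, attention_mask, labels in dataloader:
        hidden = backbone(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        logits = head(hidden[:, -1, :])      # last hidden state -> head
        loss = criterion(logits, labels)     # cross-entropy vs. the label
        optimizer.zero_grad()                # clear gradients
        loss.backward()                      # back propagation
        optimizer.step()                     # update parameters
        total += loss.item()
    print(f"epoch {epoch}: loss {total / len(dataloader):.4f}")

# Step 9: save the fine-tuned weights to a specified path.
torch.save({"backbone": backbone.state_dict(), "head": head.state_dict()},
           "recommender_gpt2.pt")
```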
In some embodiments, the step of inputting the media asset feature and the user data into a recommendation generation model to obtain a recommendation text includes:
And inputting the media asset characteristics, the user data and the context information into a recommendation generation model to obtain a recommendation text, wherein the context information is used for representing time, equipment information and position information.
The embodiment of the application can take the context information, the media asset features, and the user data as input for model training or application, with the recommendation text still produced as the model output. The fine-tuning training data include context information, media asset features, and user data, together with their corresponding recommendation texts. Context information generally refers to the user's current environment or state and may include the current time, device information, location information, and so on. In a recommendation system, context information helps to better understand the user's needs in a particular situation.
In some embodiments, the step of inputting the media asset feature and the user data into a recommendation generation model to obtain a recommendation text includes:
determining a user portrait based on the user data, the user portrait being used to characterize user media asset preference features;
In some embodiments, the step of determining a user portrait based on the user data comprises:
The media asset types and content characteristics in the user's media asset viewing history and media asset preference records are analyzed to determine the user portrait.
The embodiment of the application uses the user's viewing history and like history to analyze the user's behavior patterns, specifically the types, categories, and content characteristics of the watched and liked media assets. For example, if a user frequently watches and likes science fiction movies, it can be inferred that the user has a strong interest in science fiction movies and is a science fiction movie fan.
In some embodiments, the step of determining a user portrait based on the user data comprises:
The user's media asset viewing history and media asset preference records are input into a classification model to obtain a user preference label, and the user portrait is determined to include the user preference label.
The embodiment of the application can introduce a preference label for each user representing the user's likely fields of interest, such as 'science fiction movie fan' or 'action movie lover'. These labels can be inferred from the user's historical behavior.
The preference labels may be determined using machine learning methods, such as a classification model that predicts the user's interest labels. The mapping is learned by training a model with the viewing history and like history as input features and the preference label as the output label. The model can then predict a user's likely fields of interest from their behavior patterns; a minimal sketch of such a classifier is given below.
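A minimal sketch of such a preference-label classifier is given below; the per-genre behavior counts, the label set, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical features: per-user counts of watched/liked assets in each of
# 8 genres; hypothetical targets: 4 preference labels (0 = "sci-fi fan", ...).
GENRES, NUM_LABELS = 8, 4
classifier = nn.Sequential(
    nn.Linear(GENRES, 32), nn.ReLU(), nn.Linear(32, NUM_LABELS))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

# Toy training batch: 16 users' behavior features and their labels.
features = torch.rand(16, GENRES)
labels = torch.randint(0, NUM_LABELS, (16,))
for _ in range(100):
    loss = criterion(classifier(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Predict the most likely interest label for a new user.
pred = classifier(torch.rand(1, GENRES)).argmax(dim=-1)
```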
In some embodiments, the step of determining a user portrait based on the user data comprises:
The media asset search records and media asset comments are analyzed to determine the user portrait.
Considering that the user's interests may change over time, the user's preference labels may be updated periodically. This can be achieved by monitoring the latest viewing history and like history, ensuring that the recommender system reflects the user's current interests.
And inputting the user portrait and the media asset characteristics into a recommendation generation model to obtain a recommendation text.
Embodiments of the present application may determine a user portrait based on user data, for example, that the user is an action movie fan or a romance movie fan. The user portrait and the media asset features serve as input for model training or application, with the recommendation text still produced as the model output. The fine-tuning training data comprise sample user portraits and media asset features and the corresponding sample recommendation texts. Illustratively, the user portrait is determined from the user data to be 'action movie fan'. Inputting 'action movie fan' into the recommendation generation model yields the recommendation text.
Illustratively, both user A and user B are recommended the movie XXX, but user A's preference is 'nature, Earth' while user B's preference is 'war, machinery'. The movie is then presented to user A with the recommendation: 'A visual feast that brings you into a colorful, star-filled world full of life, letting you feel the magical charm of nature as if you were there.' And to user B with the recommendation: 'A showcase of high-tech weaponry such as fighters, warships, and aircraft, with battles between humans and an alien race that dazzle the eye.'
In some embodiments, the step of inputting the media asset feature and the user data into a recommendation generation model to obtain a recommendation text includes:
The media asset features, the user portrait, and the context information are input into a recommendation generation model to obtain a recommendation text.
The embodiment of the application can take the context information, the media asset features, and the user portrait together as input for model training or application, with the recommendation text still produced as the model output. The fine-tuning training data include context information, media asset features, and user portraits, together with their corresponding recommendation texts.
Example one: Movie summary generation
The recommendation includes a movie summary. A movie summary is generated according to the user's interests and historical viewing records, helping the user understand movie content more quickly and improving the appeal of the recommendation. The movie summary generation task can be achieved by fine-tuning a large language model so that it can understand the user's historical viewing records and generate a personalized summary related to the movie. In practice, more complex model structures and larger datasets may be required. Furthermore, the choice of pre-trained model and the fine-tuning strategy may need to be adjusted for the actual application. During fine-tuning, ensure that the dataset contains movie summaries and related information so the model learns to generate movie-related content.
The steps for building the movie summary generation model are as follows:
1) Import the required libraries: Transformers, torch.
2) Define the class MovieSummaryDataset for the movie summary fine-tuning dataset, implementing the __init__, __len__, and __getitem__ methods.
3) Load the pre-trained model, creating model object model and tokenizer object tokenizer, initialized using GPT2LMHeadModel.from_pretrained and GPT2Tokenizer.from_pretrained.
4) Modify the model structure: the token embedding size of the model is adjusted using model.resize_token_embeddings, and a linear layer is added after the model.
5) Define the loss function, creating a loss function object criterion initialized with CrossEntropyLoss.
6) Prepare the fine-tuning dataset, creating a dataset object dataset initialized with the MovieSummaryDataset class, passing data_path and tokenizer as parameters. Create a data loader object dataloader initialized with torch.utils.data.DataLoader, passing the dataset, batch_size, and shuffle parameters.
The fine-tuning dataset comprises the user's historical viewing records, like records, media asset features, and the corresponding movie summaries.
7) Define the optimizer, creating an optimizer object optimizer initialized with torch.optim.AdamW, passing model.parameters() and the learning rate lr as parameters.
8) Fine-tune the model, training multiple epochs in a loop.
In each epoch, each batch in dataloader is traversed.
Input data inputs and tags are acquired.
The data is input into the model to obtain the output of the model.
Loss is calculated, and cross entropy loss between the prediction result and the label is calculated using criterion.
The optimizer's gradients are cleared, back propagation and optimization are performed, and the model parameters are updated.
The loss value of the current epoch is printed.
9) The fine-tuned model is saved to a specified path.
After receiving a user instruction to display a movie detail page or a movie summary, the user's historical viewing records, like records, and the movie's media asset features are input into the movie summary model to obtain a personalized movie summary.
The embodiment of the application thus generates a personalized movie summary based on the user's historical viewing records and the movie's media asset features; a minimal inference sketch is given below.
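A minimal inference sketch is given below, assuming a summary model fine-tuned as above has been saved under a hypothetical local path; the prompt format and field names are illustrative only.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical path to the fine-tuned movie summary model saved in step 9).
tokenizer = GPT2Tokenizer.from_pretrained("movie_summary_model")
model = GPT2LMHeadModel.from_pretrained("movie_summary_model")
model.eval()

# Serialize viewing history, like records, and movie features into one prompt.
prompt = ("viewing_history: Interstellar, The Martian | likes: sci-fi | "
          "movie: Gravity | tags: space, survival | summary:")
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=60, do_sample=True,
                         top_p=0.9, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens after the prompt.
summary = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
print(summary)
```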
Example two: user interface interaction cues
During user interface interactions, the user provides input that may include a user's behavioral records, preferences, or contextual information. According to the input of the user, the system calls a personalized recommendation generation model, generates a movie recommendation related to the current situation of the user in real time, and displays the movie recommendation to the user through an interface.
The data of the user-interface-interaction fine-tuning dataset include the user's input history, the system-generated personalized recommendations, and user feedback. The user input may be text, commands, or other forms of instructions representing the user's real-time interactions on the interface. Using such a dataset, the system can learn the user's preferences in different situations and provide recommendation prompts more closely matched to the user's needs.
The training process of the user interface interaction recommendation model is as follows:
1) Loading a pre-training model and a classifier.
2) Print welcome information and prompt the user how to end the dialog.
3) Initializing a user history record: an empty list user_history is created for storing the user's input history.
A loop is entered awaiting user input.
4) The user input user_input is obtained.
5) Check whether the user wants to exit the dialog; if the user inputs 'exit', print the ending message and break out of the loop.
6) The user input is added to user_history.
7) A personalized recommendation prompt recommendation_prompt is generated from the user history by concatenating the contents of user_history.
8) The generate_recommendation function is called, passing recommendation_sample, model, and tokenizer, to generate the recommendation recommendation.
9) The system prints the generated recommendation; a minimal sketch of this interaction loop is given below.
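A minimal sketch of this interaction loop is given below; generate_recommendation is assumed to wrap the fine-tuned model's generation call, and the prompt format is illustrative.

```python
def run_interactive_session(model, tokenizer, generate_recommendation):
    """Sketch of steps 1)-9); generate_recommendation(prompt, model,
    tokenizer) is assumed to return a recommendation string."""
    # 2) Welcome message and instructions for ending the dialog.
    print("Welcome! Describe what you feel like watching. Type 'exit' to quit.")
    user_history = []                        # 3) empty input-history list
    while True:                              # loop awaiting user input
        user_input = input("> ")             # 4) obtain the user input
        if user_input.strip() == "exit":     # 5) exit check
            print("Goodbye!")
            break
        user_history.append(user_input)      # 6) record the input
        # 7) build the personalized prompt by concatenating the history
        recommendation_prompt = " ; ".join(user_history)
        # 8) generate the recommendation from the prompt
        recommendation = generate_recommendation(
            recommendation_prompt, model, tokenizer)
        print(recommendation)                # 9) show it to the user
```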
Illustratively, the user directly inputs a voice instruction 'recommend a comedy movie' through the control device 100, or, as shown in fig. 15, the user types 'recommend a comedy movie' in the search box of the user interface. The display device 200 transmits the user's input to the server 400. The server 400 invokes the recommendation generation module to generate a recommendation such as 'Like a happy, relaxed atmosphere? Why not try AAAA.' The server 400 transmits the recommendation to the display device 200. The display device 200 displays the recommendation on the user interface, together with controls corresponding to the media assets included in the recommendation and personalized clips of the recommended media assets. The user can decide whether to watch the movie according to the prompt, or switch to another recommended movie as needed.
Example three: movie tag generation
The recommendation includes a movie tag. Movie tags may be displayed on movie detail pages. Tags are automatically generated for movies according to users' historical behaviors and preferences, describing movie features more accurately and thereby improving the accuracy of the recommendation system. Creating a movie tag generation application with a large language model and the Transformers library involves several steps:
Data preprocessing: movie data and user history behavior data including movie information and user preference data are prepared. The data should contain a textual description of the movie, and the user's historical behavior, such as movies that have been watched, scores, etc.
Building a model: a generative model, such as GPT-3.5, is built using the Transformers library as a label generator. This model will learn the ability to generate tags from movie descriptions and user history.
Model training: the model is trained using the prepared data. In the training process, the movie description and the historical behavior of the user are used as input, and the model is enabled to generate corresponding labels. And optimizing model parameters to improve the matching degree of the generated label and the actual label.
User recommendation: the user's historical behavior (e.g., searched movies, viewing histories) and movie descriptions are passed into the trained model, generating personalized movie tags. These tags can be used in the recommendation system to improve understanding of the user's interests and thereby recommend movies more accurately; a minimal sketch of this flow is given below.
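A minimal sketch of this tag-generation flow is given below, with a fine-tuned GPT-2 standing in for the large language model named above (GPT-3.5 is not distributed through the Transformers library); the model path, prompt format, and decoding settings are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical path to a tag generator fine-tuned on
# "movie description + user history -> tags" pairs.
tokenizer = GPT2Tokenizer.from_pretrained("movie_tag_model")
model = GPT2LMHeadModel.from_pretrained("movie_tag_model")
model.eval()

def generate_tags(movie_description, user_history, max_new_tokens=20):
    prompt = f"description: {movie_description} | history: {user_history} | tags:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             num_beams=4, pad_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    # Assume the model emits a comma-separated tag list.
    return [t.strip() for t in text.split(",") if t.strip()]

tags = generate_tags("A stranded astronaut fights to survive on Mars.",
                     "watched: Interstellar, Apollo 13")
```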
The personalized generation algorithm based on a large language model has the following advantages and characteristics:
1. Learning user behavior and preferences: by pre-training on large-scale text data and fine-tuning the model parameters, large language models learn rich language representations from user historical behavior and preference data. This enables the model to capture the user's personalized features more accurately, improving understanding of the user's interests.
2. Context awareness: in the recommendation system, the large language model uses context-awareness techniques to understand user needs more accurately by considering contextual information such as the time, place, and device of user behavior and preferences. This context awareness helps generate recommended content that better fits the user's current situation.
3. Diverse content generation: large language models can generate text, producing diverse and creative content. The recommendation system can generate simple recommendation phrases as well as more complex content such as comments and summaries, providing users with more comprehensive recommendation information.
4. Personalized adaptability: the pre-training and fine-tuning process enables the model to adapt to the personalized requirements of different domains. By fine-tuning on domain-specific data, the model can become more specialized and better meet that domain's recommendation task requirements.
5. Real-time and dynamic behavior: large language models can generate content in real time, dynamically producing recommended content according to the user's real-time behavior and context information. The recommendation system can thus adapt more quickly to changes in users and their needs, improving the system's responsiveness.
Some embodiments of the present application provide a recommendation generation method, the method being applicable to a server configured to: receiving a recommendation request sent by display equipment, wherein the recommendation request comprises a user identifier; acquiring user data based on the user identifier, wherein the user data comprises at least one of a media asset viewing history record, a media asset preference record, a media asset search record and a media asset comment of a user; determining recommended media assets and acquiring media asset characteristics corresponding to the recommended media assets, wherein the media asset characteristics are used for representing information of the recommended media assets; and inputting the media asset characteristics and the user data into a recommendation generation model to obtain a recommendation text, wherein the recommendation generation model is trained based on fine tuning training data after a pre-training model is acquired, the fine tuning training data comprise sample input data and sample recommendation corresponding to the sample input data, and the sample input data comprise at least one of media asset viewing history records, media asset preference records, media asset search records and media asset comments, and the media asset characteristics. According to the embodiment of the application, the user data and the media resource characteristics are combined, and the machine learning algorithm is utilized to construct the efficient recommendation generation model, so that the recommendation generation model can accurately generate the recommendation meeting the user interests and requirements according to the personalized requirements and the media resource characteristics of the user.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. The illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A server, configured to:
acquiring user data, wherein the user data comprises at least one of a media asset viewing history record, a media asset preference record, a media asset viewing time preference and a media asset comment of a user;
Inputting the user data into a target model to obtain text description, wherein the text description is used for guiding the acquisition and splicing of the text of the media asset fragment, the target model is trained based on fine tuning training data after a pre-training model is acquired, the fine tuning training data comprise sample input data and sample text description corresponding to the sample input data, and the sample input data comprise at least one of media asset viewing history records, media asset preference records, media asset viewing time preference and sample data of media asset comments;
and acquiring at least one target media resource segment based on the text description, and splicing the target media resource segment to obtain recommended media resources.
2. The server of claim 1, wherein the server performs the inputting of the user data into a target model resulting in a textual description, further configured to:
filling the user data into a prompt template, wherein the prompt template is used for guiding a target model to generate text description;
and inputting the prompt template into a target model to obtain text description.
3. The server of claim 1, wherein the server performs the obtaining of at least one target media asset segment based on the textual description, and is further configured to:
Determining a media asset recommendation list based on the user data, the media asset recommendation list comprising at least one media asset data;
And screening at least one target media resource segment which is consistent with the text description from the media resource data.
4. The server of claim 3, wherein the server performs screening of at least one target asset segment corresponding to the text description from asset data corresponding to the asset recommendation list, and is further configured to:
Determining characteristic elements corresponding to video frames in media asset data, wherein the characteristic elements are used for representing image contents of the video frames;
determining a target video frame of the media asset data, wherein characteristic elements corresponding to the target video frame are the same as characteristic elements corresponding to the text description;
And extracting a target media resource segment, wherein the target media resource segment comprises the target video frame and a target number of video frames before and after the target video frame.
5. The server of claim 1, wherein the server performs the obtaining of user data, and is further configured to:
receiving user data sent by a display device in response to an instruction of opening a media recommendation page or opening a push notification input by a user; or alternatively
And receiving a user identifier sent by the display equipment in response to an instruction of opening a media asset recommendation page or opening a push notification input by a user, and acquiring user data corresponding to the user identifier.
6. The server of claim 5, wherein the media asset recommendation page includes a recommended media asset display area, the server being configured to, upon obtaining recommended media assets:
and controlling the display to play the recommended media asset in the recommended media asset display area.
7. The server of claim 1, wherein the server is configured to:
after receiving a search instruction input by a user, filling the search instruction into a prompt template;
and inputting the prompt template into a target model to obtain text description.
8. A display device, characterized by comprising:
A display;
A controller configured to:
after receiving an instruction of opening a media asset recommendation page or opening a push notification input by a user, transmitting user data or a user identifier to a server, so that the server acquires user data based on the user identifier, inputting the user data into a target model to obtain text description, wherein the user data comprises at least one of media asset viewing history records, media asset preference records, media asset viewing time preference and media asset comments of the user, the text description is a text for guiding acquisition and splicing of media asset fragments, acquires at least one target media asset fragment based on the text description, splices the target media asset fragment, and obtains recommended media assets;
and receiving the recommended media assets sent by the server and controlling the display to display the recommended media assets.
9. A recommended media resource generating method is applied to a server and is characterized by comprising the following steps:
acquiring user data, wherein the user data comprises at least one of a media asset viewing history record, a media asset preference record, a media asset viewing time preference and a media asset comment of a user;
Inputting the user data into a target model to obtain text description, wherein the text description is used for guiding the acquisition and splicing of the text of the media asset fragment, the target model is trained based on fine tuning training data after a pre-training model is acquired, the fine tuning training data comprise sample input data and sample text description corresponding to the sample input data, and the sample input data comprise at least one of media asset viewing history records, media asset preference records, media asset viewing time preference and sample data of media asset comments;
and acquiring at least one target media resource segment based on the text description, and splicing the target media resource segment to obtain recommended media resources.
10. A recommended media asset generation method applied to a display device, comprising:
after receiving an instruction of opening a media asset recommendation page or opening a push notification input by a user, transmitting user data or a user identifier to a server, so that the server acquires user data based on the user identifier, inputting the user data into a target model to obtain text description, wherein the user data comprises at least one of media asset viewing history records, media asset preference records, media asset viewing time preference and media asset comments of the user, the text description is a text for guiding acquisition and splicing of media asset fragments, acquires at least one target media asset fragment based on the text description, splices the target media asset fragment, and obtains recommended media assets;
and receiving the recommended media assets sent by the server and controlling a display to display the recommended media assets.