WO2024113641A1 - Video recommendation method, device, electronic device, computer-readable storage medium, and computer program product - Google Patents

Video recommendation method, device, electronic device, computer-readable storage medium, and computer program product

Info

Publication number
WO2024113641A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
vector
loss
recommended
Prior art date
Application number
PCT/CN2023/088886
Other languages
English (en)
French (fr)
Inventor
祖传龙
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024113641A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Description

  • the embodiments of the present application relate to the Internet field, and relate to but are not limited to a video recommendation method, device, electronic device, computer-readable storage medium, and computer program product.
  • In the related art, video recall structures include the dual-tower structure and the dual-enhanced dual-tower structure.
  • As a classic recall structure, the dual-tower structure has been widely used in recommendation scenarios due to its convenient offline training and fast online retrieval.
  • the most obvious feature of the dual-tower structure is “dual-tower independence", that is, the target vectors of massive content can be calculated in batches offline, and there is no need to repeat the calculation online. Online, the user target vector only needs to be calculated once, and then the nearest neighbor algorithm is used to quickly retrieve similar content.
  • However, "dual-tower independence" also limits the model's effectiveness.
  • Specifically, the dual-tower structure lacks the opportunity to cross-learn user features and content features, even though cross-features and cross-learning can significantly improve the model effect.
  • the dual-enhanced dual-tower structure generates vectors for fitting the information of the other tower at the input layer of the user tower and the content tower.
  • the vector is called an "enhanced vector".
  • the enhanced vector is continuously updated through the target vector of the other tower and participates in the calculation process of the target vector.
  • the parameter scale of the enhanced vector of the user tower in the dual-enhanced dual-tower structure is too large, and the tower structure does not support multiple targets, so the enhanced vector cannot fit multiple target vectors at the same time.
  • It can be seen that the video recall structures in the related art have an excessively large parameter scale when performing recall calculation, resulting in a large amount of computation and a high calculation delay in the recall stage of video recommendation, which reduces the efficiency of video recommendation; and because they cannot fit target vectors of multiple dimensions at the same time, the accuracy of the recall calculation is low.
  • the embodiments of the present application provide a video recommendation method, device, electronic device, computer-readable storage medium, and computer program product, which can at least be applied to the fields of artificial intelligence and video recommendation, and can improve the efficiency and accuracy of video recall.
  • An embodiment of the present application provides a video recommendation method, including: obtaining an object feature vector of a target object, a historical playback sequence of the target object within a preset historical time period, and a video multi-target vector index of each video to be recommended in a video library to be recommended; performing vectorization processing on the historical playback sequence to obtain an object enhancement vector of the target object; performing vector splicing processing and multi-target feature learning on the object feature vector and the object enhancement vector in sequence to obtain an object multi-target vector of the target object; determining a target recommended video corresponding to the target object from the video library to be recommended based on the object multi-target vector and the video multi-target vector index of each video to be recommended; and performing video recommendation for the target object based on the target recommended video.
  • An embodiment of the present application provides a video recommendation device, which includes: an acquisition module, configured to acquire an object feature vector of a target object, a historical playback sequence of the target object within a preset historical time period, and a video multi-target vector index of each video to be recommended in a video library to be recommended; a vectorization processing module, configured to perform vectorization processing on the historical playback sequence to obtain an object enhancement vector of the target object; a multi-target processing module, configured to perform vector splicing processing and multi-target feature learning on the object feature vector and the object enhancement vector in sequence to obtain an object multi-target vector of the target object; a determination module, configured to determine a target recommended video corresponding to the target object from the video library to be recommended based on the object multi-target vector and the video multi-target vector index of each video to be recommended; and a video recommendation module, configured to perform video recommendation for the target object based on the target recommended video.
  • An embodiment of the present application provides an electronic device, including: a memory configured to store executable instructions; and a processor configured to implement the above-mentioned video recommendation method when executing the executable instructions stored in the memory.
  • An embodiment of the present application provides a computer program product or a computer program, wherein the computer program product or the computer program includes executable instructions, and the executable instructions are stored in a computer-readable storage medium; when a processor of an electronic device reads the executable instructions from the computer-readable storage medium and executes the executable instructions, the above-mentioned video recommendation method is implemented.
  • An embodiment of the present application provides a computer-readable storage medium storing executable instructions for causing a processor to execute the executable instructions to implement the above-mentioned video recommendation method.
  • the embodiment of the present application has the following beneficial effects: the historical playback sequence of the target object is vectorized to obtain the object enhancement vector of the target object, thereby determining the object multi-target vector of the target object based on the object enhancement vector of the target object.
  • the object multi-target vector is an object feature vector that incorporates the object enhancement vector.
  • In this way, when determining the target recommended video from the video library to be recommended, the target object can be accurately analyzed in combination with information of the target object in multiple dimensions, thereby accurately recalling videos; and, since the object enhancement vector is generated based on the historical playback sequence of the target object, which includes the target object's playback records for videos, and the number of playback records is much smaller than the number of target objects in the video application, the process of determining the object multi-target vector based on the object enhancement vector and then determining the target recommended video involves a small amount of data calculation, thereby greatly improving the efficiency of video recommendation.
  • FIG1 is a schematic structural diagram of a dual-tower structure in the related art.
  • FIG2 is a schematic structural diagram of a dual-enhanced dual-tower structure in the related art.
  • FIG3 is a schematic diagram of an optional architecture of a video recommendation system provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of an optional flow chart of a video recommendation method provided in an embodiment of the present application.
  • FIG6 is another optional flowchart of the video recommendation method provided in an embodiment of the present application.
  • FIG7 is a schematic flowchart of a method for creating a video multi-target vector index provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of a process of vectorizing a historical playback sequence provided by an embodiment of the present application.
  • FIG9 is a schematic diagram of a process of vector concatenation processing and multi-objective feature learning provided in an embodiment of the present application.
  • FIG10 is a schematic diagram of a process for determining a target recommended video according to an embodiment of the present application.
  • FIG11 is a flow chart of a method for training a video recall model according to an embodiment of the present application.
  • FIG12 is a schematic diagram of a process for obtaining sample data provided in an embodiment of the present application.
  • FIG13 is a schematic diagram of a process for determining a target loss result provided in an embodiment of the present application.
  • FIG14 is a schematic diagram of another process for determining a target loss result provided in an embodiment of the present application.
  • FIG15 is a schematic diagram of a process for determining object enhancement loss and video enhancement loss based on a multi-objective network, provided in an embodiment of the present application.
  • FIG16 is an interface diagram of a video application home page provided in an embodiment of the present application.
  • FIG17 is an interface diagram of a material card provided in an embodiment of the present application.
  • FIG18 is a schematic diagram of a calculation flow provided in an embodiment of the present application.
  • FIG19 is a schematic diagram of a process for generating a feature vector according to an embodiment of the present application.
  • FIG20 is a schematic diagram of a process for generating a user enhancement vector according to an embodiment of the present application.
  • FIG21 is a schematic diagram of the structure of a PLE network provided in an embodiment of the present application.
  • FIG22 is a schematic diagram of a process for determining a similarity score according to an embodiment of the present application.
  • FIG23 is a schematic diagram of a training process provided in an embodiment of the present application.
  • FIG24 is a schematic diagram of the implementation process of screening positive and negative examples provided in an embodiment of the present application.
  • FIG25 is a schematic diagram of the implementation process of the association feature provided in an embodiment of the present application.
  • FIG26 is a schematic diagram of the structure of the MMOE network provided in an embodiment of the present application.
  • FIG1 is a schematic diagram of the structure of the dual-tower structure in the related art.
  • the dual-tower structure is composed of a user tower 11 for generating a user target vector and a content tower 12 for generating a content target vector.
  • the user's object feature 111 is first input into the user tower 11, and the content feature 121 of the content to be recommended is input into the content tower 12.
  • the inner product between the user target vector 112 output by the user tower 11 and the content target vector 122 output by the content tower 12 is calculated, and similarity calculation is performed based on the inner product to obtain the similarity result between the user and the content to be recommended, and the similarity result is used as the prediction value of the dual-tower structure; finally, by reducing the loss function of the corresponding scene target, the model parameters of the dual-tower structure are continuously updated.
  • a dual-enhanced dual-tower structure is also proposed in the related art. As shown in FIG2 , the dual-enhanced dual-tower structure generates vectors for fitting the information of another tower at the input layer of the user tower 21 and the content tower 22, respectively.
  • the vector is called an "enhanced vector", namely the user enhanced vector 211 and the content enhanced vector 221.
  • the enhanced vector is continuously updated through the target vector of the other tower and participates in the calculation process of the target vector.
  • Although the dual-enhanced dual-tower structure can solve the problem of insufficient intersection between users and the content to be recommended to a certain extent, it introduces the following problems: the parameter scale of the user enhancement vector is too large; the tower structure of the dual-enhanced dual-tower structure does not support the prediction of multiple targets; and the enhancement vector cannot fit multiple target vectors at the same time.
  • an embodiment of the present application provides a video recommendation method.
  • In the video recommendation method provided by the embodiments of the present application, first, an object feature vector of a target object, a historical playback sequence of the target object within a preset historical time period, and a video multi-target vector index of each video to be recommended in a video library to be recommended are obtained; then, the historical playback sequence is vectorized to obtain an object enhancement vector of the target object, and vector splicing processing and multi-target feature learning are performed in turn on the object feature vector and the object enhancement vector to obtain an object multi-target vector of the target object; finally, the target recommended video corresponding to the target object is determined from the video library to be recommended based on the object multi-target vector and the video multi-target vector index of each video to be recommended.
  • the target object can be accurately analyzed by combining the information of the target object in multiple dimensions, so as to accurately recall the video; and because the object enhancement vector is generated based on the historical playback sequence of the target object, the historical playback sequence at least includes the playback record of the target object for the video, and the number of playback records is significantly reduced compared to the number of target objects in the video application, therefore, the amount of data calculation during video recall can be greatly reduced, thereby greatly improving the efficiency of video recommendation.
  • The following describes the video recommendation device of the embodiments of the present application, which is an electronic device for implementing the video recommendation method.
  • The video recommendation device (i.e., the electronic device) can be implemented as a terminal or a server.
  • the video recommendation device provided in the embodiment of the present application can be implemented as a laptop, a tablet computer, a desktop computer, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable gaming device, an intelligent robot, an intelligent home appliance, and an intelligent vehicle-mounted device, and any terminal with a video data processing function; in another implementation, the video recommendation device provided in the embodiment of the present application can also be implemented as a server, wherein the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content distribution networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms.
  • the terminal and the server can be directly or indirectly connected by wired or wireless communication, which is not limited in the embodiment of the present application.
  • FIG. 3 is an optional architecture diagram of a video recommendation system provided in an embodiment of the present application.
  • the embodiment of the present application is described by taking the application of a video recommendation method to any video application as an example.
  • The video application includes multiple featured pages and vertical channels. After the user scrolls down the featured pages and each vertical channel, the user can see a non-target area.
  • the video recommendation method of the embodiment of the present application can be applied to the recommendation of the video displayed in the non-target area.
  • The video recommendation system includes at least a terminal 100, a network 200, and a server 300.
  • the server 300 can be a server of a video application.
  • the server 300 can constitute a video recommendation device of the embodiment of the present application.
  • the terminal 100 is connected to the server 300 via the network 200, and the network 200 can be a wide area network or a local area network, or a combination of the two.
  • When making a video recommendation, the terminal 100 receives a user's browsing operation (for example, the browsing operation may be a pull-down operation on any vertical channel) through the client of the video application, and in response to the browsing operation, obtains the user's object features and historical play sequence and encapsulates the object features and historical play sequence into a video recommendation request; the terminal 100 then sends the video recommendation request to the server 300 through the network 200.
  • After receiving the video recommendation request, the server 300 responds to the video recommendation request, obtains the user's object features and historical play sequence, obtains the user's object feature vector based on the object features, and obtains the video multi-target vector index of each video to be recommended in the video library to be recommended; then, the historical play sequence is vectorized to obtain the object enhancement vector of the target object; then, the object feature vector and the object enhancement vector are sequentially subjected to vector splicing processing and multi-target feature learning to obtain the object multi-target vector of the target object; then, based on the object multi-target vector and the video multi-target vector index of each video to be recommended, the target recommended video is determined from the video library to be recommended. After obtaining the target recommended video, the server 300 sends the target recommended video to the terminal 100, so that the terminal 100 displays the target recommended video to the user in the non-target area of the current interface.
  • the video recommendation device can also be implemented as a terminal, that is, the video recommendation method of the embodiment of the present application is implemented with the terminal as the execution subject.
  • the terminal obtains the user's browsing operation through the client of the video application, and in response to the browsing operation, obtains the user's object feature vector, the historical playback sequence within a preset historical time period, and the video multi-target vector index of each video to be recommended in the video library to be recommended; then, the terminal uses the video recommendation method of the embodiment of the present application to recall the target recommended video, and after obtaining the target recommended video, displays the target recommended video to the user in the non-target area of the current interface.
  • the video recommendation method provided in the embodiment of the present application can also be implemented based on a cloud platform and through cloud technology.
  • the server 300 can be a cloud server.
  • the cloud server vectorizes the historical playback sequence, or sequentially performs vector splicing and multi-target feature learning on the object feature vector and the object enhancement vector, and determines the target recommended video from the video library to be recommended based on the object multi-target vector and the video multi-target vector index of each video to be recommended.
  • a cloud storage may be provided, and the video library to be recommended and the video multi-target vector index of each video to be recommended may be stored in the cloud storage, or the user's object feature vector and the historical play sequence within a preset historical time period may be stored in the cloud storage, or the target recommended video may be stored in the cloud storage.
  • the corresponding information may be obtained from the cloud storage to recall the target recommended video, thereby improving the efficiency of recalling the target recommended video, and further improving the efficiency of video recommendation.
  • cloud technology refers to a hosting technology that unifies hardware, software, network and other resources within a wide area network or local area network to achieve data computing, storage, processing and sharing.
  • Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model; it can form a resource pool that is used on demand and is flexible and convenient, and cloud computing technology will become an important support.
  • The background services of technical network systems, such as video websites, picture websites, and other portal websites, require a large amount of computing and storage resources.
  • Each item may have its own identification mark, which needs to be transmitted to the background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong system support, which can be achieved through cloud computing.
  • FIG4 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present application.
  • the electronic device shown in FIG4 may be a video recommendation device, which includes: at least one processor 310, a memory 350, at least one network interface 320, and a user interface 330.
  • the various components in the video recommendation device are coupled together through a bus system 340.
  • the bus system 340 is used to achieve connection and communication between these components.
  • the bus system 340 also includes a power bus, a control bus, and a status signal bus.
  • various buses are labeled as bus system 340 in FIG4.
  • the processor 310 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., wherein the general-purpose processor can be a microprocessor or any conventional processor, etc.
  • the user interface 330 includes one or more output devices 331 that enable presentation of media content, and one or more input devices 332 .
  • the memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, and optical disk drives, etc.
  • the memory 350 may optionally include one or more storage devices that are physically remote from the processor 310.
  • the memory 350 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM).
  • the memory 350 described in the embodiments of the present application is intended to include any suitable type of memory. In some embodiments, the memory 350 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • The operating system 351 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks; a network communication module 352, for reaching other computing devices via one or more (wired or wireless) network interfaces 320, where exemplary network interfaces 320 include Bluetooth, wireless compatibility certification (WiFi), and Universal Serial Bus (USB); and an input processing module 353, for detecting one or more user inputs or interactions from the one or more input devices 332 and translating the detected inputs or interactions.
  • FIG. 4 shows a video recommendation device 354 stored in a memory 350.
  • the video recommendation device 354 can be a video recommendation device in an electronic device. It can be software in the form of a program or a plug-in, and includes the following software modules: an acquisition module 3541, a vectorization processing module 3542, a multi-target processing module 3543, a determination module 3544, and a video recommendation module 3545. These modules are logical, and therefore can be arbitrarily combined or further split according to the functions implemented. The functions of each module will be described below.
  • the device provided in the embodiments of the present application can be implemented in hardware.
  • the device provided in the embodiments of the present application can be a processor in the form of a hardware decoding processor, which is programmed to execute the video recommendation method provided in the embodiments of the present application.
  • the processor in the form of a hardware decoding processor can adopt one or more application specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field programmable gate arrays (FPGAs), or other electronic components.
  • the video recommendation method provided in each embodiment of the present application can be executed by an electronic device, wherein the electronic device can be a server or a terminal, that is, the video recommendation method in each embodiment of the present application can be executed by a server, can be executed by a terminal, or can be executed through interaction between a server and a terminal.
  • FIG5 is an optional flow chart of a video recommendation method provided in an embodiment of the present application. The following will be described in conjunction with the steps shown in FIG5. It should be noted that the video recommendation method in FIG5 is described by taking a server as an execution subject as an example. As shown in FIG5, the method includes the following steps S101 to S105:
  • Step S101 obtaining an object feature vector of a target object, a historical play sequence within a preset historical time period, and a video multi-target vector index of each video to be recommended in a video library to be recommended.
  • the object feature vector is obtained by vectorizing the object features of the target object.
  • the object features of the target object include but are not limited to at least one of the following: the age, gender, education, tags, video browsing history and interests of the target object.
  • the vectorization of object features can be achieved through feature extraction.
  • Vectorization processing refers to searching a preset feature vector table, and retrieving a feature vector corresponding to each object feature of the target object from the preset feature vector table.
  • the feature vector corresponding to each object feature can be pre-stored in the preset feature vector table.
  • the corresponding feature vector can be queried from the preset feature vector table using multiple object features as retrieval indexes.
  • a preset feature vector table can be queried, and the feature vector corresponding to each object feature can be queried in the preset feature vector table to obtain the object feature vector of the target object.
  • the feature vector corresponding to each feature information can be queried from the preset feature vector table, and then the feature vectors corresponding to all the feature information can be spliced to form a multi-dimensional object feature vector.
  • the preset feature vector table may include two dimensions, the first dimension is the feature identifier, and the second dimension is the vector corresponding to each feature representation, and the vector representations of different features are independent of each other. That is, corresponding to the target object (such as a user), there may be an object feature vector table; corresponding to the video, there may be a video feature vector table. Based on the object feature vector table and the video feature vector table, the object feature vector of the target object and the video feature vector of the video to be recommended may be queried respectively.
  • For discrete features, the feature vector table can be directly queried to obtain the feature vector of the discrete feature; for continuous features, the continuous features can first be discretized to obtain discrete features, and then the feature vector table can be queried to obtain the feature vector corresponding to the discretized features.
  • the discretization processing can be to perform equal-frequency division of the continuous features using a specific equal-frequency division interval to obtain multiple discrete features.
  • a feature vector table can be pre-constructed for querying feature vectors, and the pre-constructed feature vector table can be stored in a preset storage unit.
  • the feature vector table is obtained from the preset storage unit for feature vector query.
  • the feature vector table can also be updated according to the update of the video recall model and the update of feature information. For example, when there is new feature information, the feature vector of the feature information is obtained, and the feature vector is updated to the feature vector table.
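  • For illustration only, the table lookup and splicing described above can be sketched as follows; the table contents, feature names and fixed bin boundaries in this sketch are assumptions (the embodiment uses a learned preset feature vector table and equal-frequency discretization), not part of the claimed method:

```python
import numpy as np

# Hypothetical preset feature vector table: feature identifier -> feature vector.
# In the embodiment this table is learned with the recall model and stored in advance.
feature_vector_table = {
    "gender=female": np.array([0.12, -0.40, 0.05, 0.33]),
    "age_bucket=18-24": np.array([0.51, 0.02, -0.17, 0.09]),
    "interest=sports": np.array([-0.08, 0.27, 0.44, -0.21]),
}

def discretize_age(age: float) -> str:
    """Discretize a continuous feature before the table lookup.
    Fixed boundaries are used purely for illustration; the embodiment
    describes equal-frequency division instead."""
    bins = [(0, 18, "0-17"), (18, 25, "18-24"), (25, 35, "25-34"), (35, 200, "35+")]
    for low, high, name in bins:
        if low <= age < high:
            return f"age_bucket={name}"
    return "age_bucket=unknown"

def object_feature_vector(gender: str, age: float, interest: str) -> np.ndarray:
    """Query the table for each object feature and splice (concatenate) the results."""
    keys = [f"gender={gender}", discretize_age(age), f"interest={interest}"]
    vectors = [feature_vector_table[key] for key in keys]
    return np.concatenate(vectors)  # dimension = sum of the per-feature dimensions

print(object_feature_vector("female", 21, "sports").shape)  # (12,)
```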
  • the historical play sequence refers to a video sequence played by a target object (eg, a user) within a preset historical time period.
  • the historical play sequence includes historical video identifiers of historically played videos and the historical play duration of each historically played video.
  • the video library to be recommended includes multiple videos to be recommended, including videos that the user may be interested in and videos that the user is not interested in.
  • the video library to be recommended may include a large number of candidate videos (that is, the number of candidate videos in the video library to be recommended is greater than the video number threshold).
  • the video recommendation method of the embodiment of the present application is to accurately select videos that the user is interested in from the large number of candidate videos, and recommend the videos that the user is interested in as target recommended videos to the user.
  • Each video to be recommended has a video multi-target vector index
  • the video multi-target vector index is index information for querying the video multi-target vector of the video to be recommended.
  • the video multi-target vector of each video to be recommended can be generated in advance. After the video multi-target vector is generated, the video multi-target vectors of all videos to be recommended can be stored in a preset video multi-target vector storage unit.
  • index information corresponding to each video multi-target vector can also be generated. The index information is used to index the storage location of the video multi-target vector, so that the video multi-target vector can be obtained based on the video multi-target vector index.
  • a video multi-target vector of each video to be recommended can be generated before video recommendation, or a video multi-target vector of the video to be recommended can be generated simultaneously when the video to be recommended is generated, and a video multi-target vector index of the video multi-target vector is created.
  • the video multi-target vector index can be used to query the video multi-target vector of the video to be recommended online in real time without generating a video multi-target vector every time a video is recommended, that is, there is no need to repeatedly generate the video multi-target vector, thereby greatly reducing the amount of data calculation during video recommendation and improving the efficiency of video recommendation.
  • Step S102 vectorize the historical playback sequence to obtain an object enhancement vector of the target object.
  • a query can be performed through a preset feature vector table, and a feature vector corresponding to each sequence information in the historical playback sequence can be queried in the preset feature vector table to obtain an object enhancement vector of the target object.
  • each sequence information includes the video identifier of the historical playback video and the playback time of each historical playback video, the feature vector corresponding to the video identifier and the feature vector corresponding to the playback time can be queried.
  • When the historical playback sequence is vectorized, this can be realized in the following manner: first, a preset feature vector table is retrieved for each historical video identifier in the historical playback sequence, that is, based on each historical video identifier, a search is performed in the preset feature vector table to obtain a historical video vector set, where the number of historical video vectors in the historical video vector set is consistent with the number of historical video identifiers in the historical playback sequence.
  • Next, the total of the historical playback durations in the historical playback sequence is counted to obtain the total historical playback duration, that is, the historical playback durations in all sequence information are summed; then, each historical playback duration in the historical playback sequence is divided by the total historical playback duration to obtain the normalized playback duration corresponding to each historical playback duration, and the normalized playback duration is used as a video vector weight, where the number of video vector weights is also consistent with the number of historical video identifiers in the historical playback sequence.
  • Finally, each historical video vector in the historical video vector set is multiplied by the corresponding video vector weight to obtain a video weighted vector set, and all video weighted vectors in the video weighted vector set are merged to obtain the object enhancement vector of the target object, that is, the user enhancement vector.
  • merging all video weighted vectors in the video weighted vector set may refer to performing splicing processing on all video weighted vectors in the video weighted vector set to obtain a multi-dimensional user enhancement vector.
  • the dimension of the user enhancement vector is equal to the sum of the dimensions of all video weighted vectors.
  • Step S103 performing vector concatenation processing and multi-target feature learning on the object feature vector and the object enhancement vector in sequence to obtain an object multi-target vector of the target object.
  • the object feature vector and the object enhancement vector are sequentially subjected to vector splicing processing and multi-target feature learning.
  • the object feature vector and the object enhancement vector are first subjected to vector splicing processing to obtain the object splicing vector; and then, the object splicing vector is subjected to multi-target feature learning.
  • the object feature vector and the object enhancement vector are subjected to vector concatenation processing to obtain an object concatenation vector, which refers to a concatenation vector that combines the object feature vector and the object enhancement vector of the target object.
  • the dimension of the object concatenation vector is equal to the sum of the dimensions of the object feature vector and the object enhancement vector.
  • multi-target feature learning refers to learning the object target vector of the object splicing vector under different target dimensions through a pre-trained multi-target neural network.
  • the different target dimensions include but are not limited to: a click dimension related to the user's click behavior, and a duration dimension related to the user's browsing time.
  • the click target vector of the target object under the click dimension and the duration target vector under the duration dimension can be learned through a multi-target neural network, thereby obtaining the object multi-target vector of the target object.
  • The multi-target neural network can be implemented as a PLE network, which mainly includes: expert networks for learning the multiple targets, a shared network for learning the information common to the different expert networks, and gate networks for calculating the weight corresponding to each vector when the output vectors of the multiple networks are fused. For example, if two targets, click and duration, need to be learned, two groups of expert networks are required; regardless of the number of targets to be learned, there is only one shared network; and the length of the output vector of the last layer of each gate network is the same as the number of weights to be determined.
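  • As an illustrative aid only, the following untrained numpy sketch mirrors the PLE-style structure described above (one expert group per target, one shared network, and one gate per target whose output length equals the number of vectors to be fused); the layer sizes, initialization and activation choices are assumptions and do not reflect the actual model parameters of the embodiment:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_expert, d_out = 16, 8, 4   # assumed sizes for this sketch only

def dense(x, w, b):
    return np.maximum(x @ w + b, 0.0)   # simple ReLU layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One expert group per target (click, duration) plus one shared network.
experts = {name: (rng.normal(0.0, 0.1, (d_in, d_expert)), np.zeros(d_expert))
           for name in ("click", "duration", "shared")}
# One gate per target; its output length equals the number of vectors to fuse (own expert + shared).
gates = {name: rng.normal(0.0, 0.1, (d_in, 2)) for name in ("click", "duration")}
# One output tower per target, producing that target dimension's vector.
towers = {name: rng.normal(0.0, 0.1, (d_expert, d_out)) for name in ("click", "duration")}

def multi_target_vectors(splicing_vector):
    """Return one target vector per target dimension for an object (or video) splicing vector."""
    shared = dense(splicing_vector, *experts["shared"])
    outputs = {}
    for target in ("click", "duration"):
        own = dense(splicing_vector, *experts[target])
        fuse_weights = softmax(splicing_vector @ gates[target])    # gate network weights
        fused = fuse_weights[0] * own + fuse_weights[1] * shared   # weighted fusion of expert outputs
        outputs[target] = fused @ towers[target]                   # target vector for this dimension
    return outputs

splicing_vector = rng.normal(size=d_in)   # object feature vector spliced with the enhancement vector
per_target = multi_target_vectors(splicing_vector)
object_multi_target = np.concatenate([per_target["click"], per_target["duration"]])
```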
  • Step S104 determining a target recommended video from the to-be-recommended video library based on the object multi-target vector and the video multi-target vector index of each to-be-recommended video.
  • the video multi-target vector of the video to be recommended can be obtained based on the video multi-target vector index of the video to be recommended, and then the inner product between the object multi-target vector and the video multi-target vector is calculated, and the calculated inner product is determined as the similarity score between the target object and the corresponding video to be recommended. Then, the target recommended video can be determined from the video library to be recommended based on the similarity score.
  • In one implementation, a video to be recommended whose similarity score is greater than a score threshold can be selected as the target recommended video; in another implementation, the videos to be recommended in the video library to be recommended can be sorted based on the similarity scores to form a sequence of videos to be recommended, and the first N videos to be recommended are then selected from the sequence of videos to be recommended as the target recommended videos, where N is an integer greater than or equal to 1.
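  • A minimal sketch of this similarity scoring and selection step is shown below for illustration; the threshold and N are placeholders, and the candidate vectors are assumed to have already been fetched through the video multi-target vector index:

```python
import numpy as np

def recall_top_videos(object_multi_target, video_multi_targets, video_ids, n=3, score_threshold=None):
    """video_multi_targets: one row per candidate video, fetched via the video multi-target vector index.
    The similarity score is the inner product; either threshold filtering or top-N selection can be used."""
    scores = video_multi_targets @ object_multi_target      # one inner product per video to be recommended
    if score_threshold is not None:
        selected = np.where(scores > score_threshold)[0]    # videos whose score exceeds the threshold
    else:
        selected = np.argsort(-scores)[:n]                  # first N videos after sorting by score
    return [(video_ids[i], float(scores[i])) for i in selected]
```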
  • Step S105 recommending videos to the target object based on the target recommended video.
  • After the server obtains the target recommended video, it can send the target recommended video to the terminal, and recommend the target recommended video to the target object through the terminal.
  • the target recommended video can be displayed on the terminal of the target object, for example, the target recommended video can be displayed on the client of the video application.
  • the video recommendation method provided in the embodiment of the present application performs vectorization processing on the historical playback sequence of the target object to obtain the object enhancement vector of the target object, thereby obtaining the object multi-target vector of the target object based on the object enhancement vector of the target object, and the object multi-target vector is an object feature vector that integrates the object enhancement vector.
  • In this way, when determining the target recommended video from the video library to be recommended, the target object can be accurately analyzed in combination with information of the target object in multiple dimensions, so that videos are accurately recalled; and, because the object enhancement vector is generated based on the historical playback sequence of the target object, which is the target object's playback record for videos, and the number of playback records is significantly smaller than the number of target objects in the video application, the amount of data calculation during video recall can be greatly reduced, thereby greatly improving the efficiency of video recommendation.
  • the video recommendation system includes at least a terminal and a server, and a video application is installed on the terminal.
  • the method of the embodiment of the present application can be used to recall the target recommended video, and the target recommended video can be displayed on the current interface of the video application to achieve video recommendation for the user.
  • FIG. 6 is another optional flow chart of the video recommendation method provided in an embodiment of the present application. As shown in FIG. 6 , the method includes the following steps S201 to S210:
  • Step S201 the terminal receives a browsing operation of a target object.
  • the browsing operation may be a pull-down operation on any vertical channel of the video application.
  • the client of the video application may receive the pull-down operation of the user and obtain the browsing operation.
  • Step S202 the terminal obtains object features and historical playback sequences of the target object in response to the browsing operation.
  • The object features and historical playback sequence of the target object are obtained, wherein the object features may be information entered by the user when registering in the video application and during use; the server of the video application stores this information and uses it as the object features of the target object.
  • the server of the video application collects the user's playback records, and each playback record records the video ID, playback time, and playback duration of the played video.
  • the playback records within the preset historical time period can be selected based on the playback time to form a historical playback sequence.
  • the preset historical time period may be a specific time period before the current time, for example, a preset historical time period such as one month or half a year before the current time.
  • Step S203 The terminal encapsulates the object features and the historical play sequence into a video recommendation request.
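  • Purely for illustration, an encapsulated video recommendation request might carry fields such as the following; the field names and values are hypothetical and are not defined by the embodiment:

```python
# Hypothetical structure of the encapsulated request; field names are illustrative only.
video_recommendation_request = {
    "object_features": {"gender": "female", "age": 21, "interests": ["sports"]},
    "historical_play_sequence": [
        {"video_id": "vid_001", "play_time": "2023-04-01T20:15:00", "play_duration": 120},
        {"video_id": "vid_003", "play_time": "2023-04-02T21:40:00", "play_duration": 450},
    ],
}
```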
  • Step S204 the terminal sends a video recommendation request to the server.
  • Step S205 the server responds to the video recommendation request, obtains an object feature vector based on the object feature, and obtains a historical play sequence within a preset historical time period and a video multi-target vector index of each video to be recommended in the video library to be recommended.
  • a video multi-target vector of each video to be recommended in the video library to be recommended can be generated in advance, and a video multi-target vector index of each video to be recommended can be created, and the video multi-target vector index can be stored.
  • the video multi-target vector can be queried based on the video multi-target vector index without having to generate the video multi-target vector of the video to be recommended again, thereby greatly improving the efficiency of recalling the target recommended video.
  • an embodiment of the present application provides a method for creating a video multi-target vector index.
  • FIG7 is a flow chart of the method for creating a video multi-target vector index provided by an embodiment of the present application. As shown in FIG7, the method for creating a video multi-target vector index can be executed by a server, and the method includes the following steps S301 to S304:
  • Step S301 retrieve a preset feature vector table based on the video identifier of each video to be recommended, and obtain a corresponding video enhancement vector for each video to be recommended.
  • a preset feature vector table may be retrieved through the video identifier of each video to be recommended to obtain a video enhancement vector for each video to be recommended.
  • the preset feature vector table includes two dimensions, the first dimension is the feature identifier, and the second dimension is the vector representation corresponding to each feature, and the vector representations of different features are independent of each other. That is, corresponding to the target object (such as a user), there may be an object feature vector table; corresponding to the video, there may be a video feature vector table. Based on the object feature vector table and the video feature vector table, the object feature vector of the target object and the video feature vector of the video to be recommended can be queried respectively.
  • Step S302 performing vector splicing processing on the video feature vector and the video enhancement vector of each video to be recommended, and obtaining a corresponding video splicing vector of each video to be recommended.
  • the method for generating the video feature vector is the same as the method for generating the object feature vector of the target object mentioned above, and a query can be performed based on the video feature vector table to obtain the video feature vector of the video to be recommended.
  • the video feature vector and the video enhancement vector can be processed by vector splicing.
  • The video feature vector and the video enhancement vector can be connected into a vector with a higher dimension, that is, the video splicing vector.
  • the dimension of the video splicing vector is equal to the sum of the dimensions of the video feature vector and the video enhancement vector.
  • Step S303 performing multi-objective feature learning on the video splicing vector of each video to be recommended, and obtaining a corresponding video multi-objective vector of each video to be recommended.
  • multi-target feature learning when multi-target feature learning is performed on the video splicing vector of each video to be recommended, it can be achieved in the following way: first, for each video to be recommended, multi-target feature learning is performed on the video splicing vector of the video to be recommended through a multi-target neural network to obtain the video target vector of the video to be recommended in multiple target dimensions; then, the target weight under each target dimension is obtained; using the target weight, weighted calculation is performed on the video target vector under each target dimension respectively to obtain a weighted video target vector; finally, the weighted video target vectors under multiple target dimensions are spliced to obtain a video multi-target vector of the video to be recommended.
  • a video click target vector of the video to be recommended under the click dimension and a video duration target vector under the duration dimension (wherein the video click target vector and the video duration target vector constitute the video target vector of the video to be recommended) can be output through a multi-target neural network, wherein, for the video to be recommended, there are click target weights under the click dimension and duration target weights under the duration dimension, respectively.
  • the product between the click target weight and the video click target vector, and the product between the duration target weight and the video duration target vector are calculated respectively, and then the two products are added to obtain a video multi-target vector of the video to be recommended. That is to say, the video click target vector and the video duration target vector are weighted and summed by the click target weight and the duration target weight to obtain a video multi-target vector.
  • both the click target weight and the duration target weight are parameters in the video recall model.
  • the update method of the click target weight and the duration target weight will be explained below.
  • Step S304 Create a video multi-objective vector index corresponding to the video multi-objective vector of each video to be recommended.
  • the video multi-objective vector index is the index information for querying the video multi-objective vector of the video to be recommended.
  • the video multi-objective vectors of all the videos to be recommended can be stored in a preset video multi-objective vector storage unit.
  • index information corresponding to each video multi-objective vector can also be generated, and the index information is used to index the storage location of the video multi-objective vector, so that the video multi-objective vector can be obtained based on the video multi-objective vector index.
  • the video multi-target vector of the video to be recommended can be queried online in real time through the video multi-target vector index, without the need to generate the video multi-target vector every time the video is recommended, that is, there is no need to repeatedly generate the video multi-target vector, thereby greatly reducing the amount of data calculation during video recommendation and improving the efficiency of video recommendation.
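  • The offline index-building pass of steps S301 to S304 can be sketched as follows for illustration; the target weights, the helper functions feature_fn, enhancement_fn and multi_target_fn, and the row-offset index layout are assumptions standing in for the learned tables, the multi-target network and the preset storage unit of the embodiment:

```python
import numpy as np

# Placeholder values for the learned click / duration target weights (model parameters).
click_target_weight, duration_target_weight = 0.6, 0.4

def build_video_multi_target_index(video_library, feature_fn, enhancement_fn, multi_target_fn):
    """Offline pass over the video library to be recommended (steps S301 to S304).
    feature_fn / enhancement_fn stand in for the preset feature vector table lookups,
    and multi_target_fn returns one vector per target dimension (e.g. the sketch above)."""
    stored_vectors, index = [], {}
    for row, video_id in enumerate(video_library):
        splicing = np.concatenate([feature_fn(video_id), enhancement_fn(video_id)])  # S302: splicing
        per_target = multi_target_fn(splicing)                                       # S303: multi-target learning
        multi_target = (click_target_weight * per_target["click"]
                        + duration_target_weight * per_target["duration"])           # weighted combination
        stored_vectors.append(multi_target)
        index[video_id] = row                                                        # S304: index into the store
    return np.vstack(stored_vectors), index
```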
  • Step S206 The server performs vectorization processing on the historical playback sequence to obtain an object enhancement vector of the target object.
  • As shown in FIG8, step S206 can be implemented through the following steps S2061 to S2066:
  • Step S2061 obtaining the historical video identifier and historical playback duration of each historical playback video in the historical playback sequence.
  • Step S2062 Based on each historical video identifier, a search is performed in a preset feature vector table to obtain a historical video vector set.
  • the number of historical video vectors in the historical video vector set is the same as the number of historical video identifiers in the historical playback sequence.
  • Step S2063 performing total duration statistics on the historical playback duration in the historical playback sequence to obtain the total historical playback duration.
  • the sum of the historical playback durations of all historical playback videos in the historical playback sequence may be calculated to obtain the total historical playback duration.
  • Step S2064 based on the total historical playback duration, each historical playback duration is normalized to obtain the normalized playback duration of each historical playback video, and the normalized playback duration is determined as the video vector weight of the corresponding historical playback video.
  • The normalization process can be to normalize each historical playback duration by the total historical playback duration.
  • Each historical playback duration can be divided by the total historical playback duration, and the calculated quotient can be determined as the normalized playback duration of the corresponding historical playback video.
  • The normalized playback duration is determined as the video vector weight of that historical playback video.
  • Step S2065 Based on the video vector weight, weighted processing is performed on each historical video vector in the historical video vector set to obtain a video weighted vector set.
  • the historical video vector set includes historical video vectors corresponding to all historically played videos. Each historical video vector in the historical video vector set is multiplied by a corresponding video vector weight to obtain a plurality of video weighted vectors, which constitute a video weighted vector set.
  • Step S2066 merge the video weighted vectors in the video weighted vector set to obtain an object enhancement vector of the target object.
  • the merging process refers to splicing all the video weighted vectors in the video weighted vector set into a vector with a higher dimension, and the higher dimensional vector is the object enhancement vector.
  • the dimension of the object enhancement vector is equal to the sum of the dimensions of all the video weighted vectors.
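  • For illustration, steps S2061 to S2066 can be sketched as follows; the video identifiers, vector values and durations are made up, and video_vector_table stands in for the preset feature vector table of the embodiment:

```python
import numpy as np

# Hypothetical lookup table mapping a video identifier to its feature vector.
video_vector_table = {
    "vid_001": np.array([0.3, -0.1, 0.7, 0.2]),
    "vid_002": np.array([-0.5, 0.4, 0.1, 0.0]),
    "vid_003": np.array([0.2, 0.2, -0.3, 0.6]),
}

def object_enhancement_vector(history):
    """history: list of (historical video identifier, historical playback duration).
    Mirrors steps S2061 to S2066."""
    video_ids = [vid for vid, _ in history]                              # S2061
    durations = np.array([duration for _, duration in history], float)  # S2061
    historical_vectors = [video_vector_table[vid] for vid in video_ids] # S2062: table lookup
    total_duration = durations.sum()                                     # S2063: total duration
    weights = durations / total_duration                                 # S2064: normalized playback durations
    weighted = [w * v for w, v in zip(weights, historical_vectors)]      # S2065: weighting
    return np.concatenate(weighted)                                      # S2066: splice into one vector

history = [("vid_001", 120.0), ("vid_002", 30.0), ("vid_003", 450.0)]
print(object_enhancement_vector(history).shape)  # (12,) -- the sum of the weighted vector dimensions
```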
  • Step S207 the server performs vector concatenation processing and multi-target feature learning on the object feature vector and the object enhancement vector in sequence to obtain an object multi-target vector of the target object.
  • As shown in FIG9, step S207 can be implemented through the following steps S2071 to S2073:
  • Step S2071 performing vector splicing processing on the object feature vector and the object enhancement vector to obtain an object splicing vector.
  • the object concatenation vector refers to the concatenation vector of the target object that combines the object feature vector and the object enhancement vector.
  • the dimension of the object concatenation vector is equal to the sum of the dimensions of the object feature vector and the object enhancement vector.
  • Step S2072 performing multi-objective feature learning on the object splicing vector through a multi-objective neural network to obtain the object target vector of the target object in multiple target dimensions.
  • Multi-objective feature learning refers to learning the object target vector of the object splicing vector under different target dimensions through a pre-trained multi-objective neural network.
  • the different target dimensions include but are not limited to: the click dimension related to the user's click behavior and the duration dimension related to the user's browsing time.
  • Step S2073 concatenating the object target vectors under multiple target dimensions to obtain a multi-target object vector of the target object.
  • a click target vector and a duration target vector of a target object may be learned through a multi-target neural network, and then the click target vector and the duration target vector are concatenated to obtain a multi-target vector of the target object.
  • Step S208 The server determines a target recommended video from the to-be-recommended video library based on the object multi-target vector and the video multi-target vector index of each to-be-recommended video.
  • FIG. 10 shows that step S208 can be implemented by following steps S2081 to S2084:
  • Step S2081 obtaining the video multi-objective vector of each video to be recommended based on the video multi-objective vector index.
  • the storage location of the video multi-objective vector of each video to be recommended may be determined based on the video multi-objective vector index, and then the stored video multi-objective vector may be obtained from the storage location.
  • Step S2082 determining the inner product between the object multi-target vector and the video multi-target vector of each video to be recommended, and determining the inner product as the similarity score between the target object and the corresponding video to be recommended.
  • the inner product between the object multi-target vector and the video multi-target vector of each video to be recommended in the video library to be recommended can be calculated to obtain the similarity score between the target object and each video to be recommended.
  • Step S2083 Select a specific number of videos to be recommended from the video library to be recommended according to the similarity scores.
  • Step S2084 Determine the selected specific number of videos to be recommended as target recommended videos corresponding to the target object.
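• A minimal sketch of steps S2081 to S2084, assuming the object multi-target vector and the video multi-target vectors have already been obtained through the index; the brute-force inner product below stands in for the nearest neighbor retrieval mentioned elsewhere in the text, and all names are illustrative.

import numpy as np

def select_target_recommended_videos(object_vector, video_vectors, video_ids, top_k):
    """object_vector: shape (d,); video_vectors: shape (N, d); returns the top_k most similar video identifiers."""
    # Step S2082: the inner product is used as the similarity score between the target object and each video.
    scores = video_vectors @ object_vector
    # Steps S2083-S2084: select a specific number of videos with the highest scores as target recommended videos.
    top_indices = np.argsort(-scores)[:top_k]
    return [video_ids[i] for i in top_indices]

# Hypothetical usage with 1000 candidate videos and 16-dimensional multi-target vectors.
rng = np.random.default_rng(0)
object_vec = rng.random(16, dtype=np.float32)
candidate_vecs = rng.random((1000, 16), dtype=np.float32)
candidate_ids = [f"video_{i}" for i in range(1000)]
print(select_target_recommended_videos(object_vec, candidate_vecs, candidate_ids, top_k=5))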
  • Step S209 the server recommends the target recommended video to the terminal.
  • Step S210 The terminal displays the target recommended video on the current interface.
• Since the video recommendation method provided in the embodiment of the present application determines the target recommended video based on the object multi-target vector and the video multi-target vector index of each video to be recommended, it can accurately analyze the target object in combination with the target object's information in multiple dimensions, thereby accurately recalling videos; and, since the object enhancement vector is generated based on the historical playback sequence of the target object, which is the target object's playback record of videos, and the number of playback records is significantly smaller than the number of target objects in the video application, the amount of data calculation during video recall can be greatly reduced, thereby greatly improving the efficiency of video recommendation.
  • the above-mentioned video recommendation method can be implemented by a video recall model; the video recall model includes an object tower and a video tower.
  • the object tower refers to a neural network structure for determining an object multi-target vector (i.e., a user multi-target vector), and the video tower refers to a neural network structure for determining a video multi-target vector.
  • FIG 11 is a flow chart of the video recall model training method provided in an embodiment of the present application.
  • the video recall model training method can be performed by a model training device.
  • the model training device can be a device in a video recommendation device (i.e., an electronic device), that is, the model training device can be a server or a terminal; or, the model training device can also be another device independent of the video recommendation device, that is, the model training device is other electronic devices other than the server and terminal used to implement the video recommendation method.
  • the video recall model training method includes the following steps S401 to S405:
  • Step S401 The model training device obtains sample data.
  • sample data includes: sample object features, sample video features, and target parameters under multiple target dimensions.
• the sample object features include but are not limited to user IDs and request IDs corresponding to user video recommendation requests, the sample video features include but are not limited to video IDs, and the target parameters under multiple target dimensions include but are not limited to: whether the user clicks on the sample video and the playback duration of the sample video.
  • FIG. 12 shows that step S401 can be implemented by following steps S4011 to S4014:
  • Step S4011 obtaining original sample data.
  • the original sample data includes multiple real positive samples and multiple real negative samples.
  • the real positive samples refer to the sample data corresponding to the "real exposure and playback behavior data”
  • the real negative samples refer to the sample data corresponding to the "real exposure but not playback behavior data”.
  • Step S4012 construct random negative samples based on multiple real positive samples, and delete a certain number of real negative samples from the multiple real negative samples.
• for each real positive sample, the user identifier and request identifier in the positive sample can be extracted, and then n video identifiers are randomly selected from the entire video pool and spliced with them to form negative samples, that is, random negative samples are obtained.
  • Deleting some of the true negative samples means reducing the number of true negative samples.
• some of the true negative samples may be randomly deleted; in other words, a portion of the true negative samples may be randomly sampled from all the true negative samples and retained.
• the preset proportional relationship among the true positive samples, the retained true negative samples and the random negative samples can be determined according to the model parameters of the video recall model.
  • the preset ratio may be 1:1:4, that is, the number of true negative examples after random sampling is the same as the number of true positive examples, and 4 videos are randomly selected for each positive sample as random negative samples.
  • Step S4013 determine the true positive samples as positive samples, and determine the reduced true negative samples and random negative samples as negative samples.
  • the true positive samples are the positive samples for model training, and the true negative samples after deleting some of their numbers and the random negative samples together constitute the negative samples for model training.
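• As a sketch of the sample construction described above (true positives, down-sampled true negatives and random negatives in a ratio such as 1:1:4), the following Python fragment is illustrative only; the record layout [user ID, video ID, whether clicked, playback duration, request ID] follows the description of positive and negative examples given later, and the field and pool names are hypothetical.

import random

def build_training_samples(true_positives, true_negatives, video_pool, neg_per_pos=4):
    """true_positives / true_negatives: lists of [user_id, video_id, clicked, duration, request_id];
    video_pool: list of candidate video IDs."""
    # Down-sample the true negatives so that their number matches the number of true positives (ratio 1:1).
    kept_negatives = random.sample(true_negatives, k=min(len(true_positives), len(true_negatives)))
    # For each positive sample, keep its user ID and request ID and randomly pick videos
    # from the whole candidate pool as random negatives (ratio 1:neg_per_pos).
    random_negatives = []
    for user_id, _, _, _, request_id in true_positives:
        for video_id in random.sample(video_pool, k=neg_per_pos):
            random_negatives.append([user_id, video_id, 0, 0.0, request_id])
    return true_positives, kept_negatives + random_negatives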
  • Step S4014 Based on the object identifier and the video identifier, feature association is performed on the positive sample and the negative sample to obtain sample data.
  • the object features are associated through the user identifier, and the video features are associated through the video identifier, so as to obtain the sample object features and sample video features, and in the subsequent model training process, the sample object features are input into the object tower (i.e., the user tower), and the sample video features are input into the video tower.
  • the constructed sample object features include both positive samples and negative samples; the constructed sample video features include both positive samples and negative samples.
  • step S402 the model training device inputs the sample object features into the object tower of the video recall model, and predicts the sample object target vector of the sample object under multiple target dimensions through the object tower.
  • the object tower can vectorize the sample object features to obtain a sample object feature vector.
  • the object tower can also generate a sample object enhancement vector, and perform vector concatenation and multi-target feature learning on the sample object feature vector and the sample object enhancement vector in sequence to obtain a sample object target vector of the sample object under multiple target dimensions. In this way, by concatenating the sample object target vectors of the sample object under multiple target dimensions, a sample object multi-target vector of the sample object can be obtained.
  • step S403 the model training device inputs the sample video features into the video tower of the video recall model, and predicts the sample video target vector of the sample video under multiple target dimensions through the video tower.
  • the video tower can vectorize the sample video features to obtain a sample video feature vector.
  • the video tower can also generate a sample video enhancement vector, and perform vector splicing processing and multi-target feature learning on the sample video feature vector and the sample video enhancement vector in turn to obtain a sample video target vector of the sample video under multiple target dimensions. In this way, by splicing the sample video target vectors of the sample video under multiple target dimensions, the sample video multi-target vector of the sample video can be obtained.
  • the sample similarity score between the sample object and the sample video can be determined.
  • step S404 the model training device inputs the sample object target vector, the sample video target vector and the target parameters into the target loss model, performs loss calculation through the target loss model, and obtains the target loss result.
  • the sample object target vector includes an object click target vector; the sample video target vector includes a video click target vector; and the target parameter includes a click target value.
  • FIG. 13 shows that step S404 can be implemented by following steps S4041a to S4044a:
  • Step S4041a determining the vector inner product between the object click target vector and the video click target vector through the target loss model.
  • Step S4042a based on the vector inner product and the preset activation function, determine the predicted value in the click dimension.
  • the vector inner product can be input into the preset activation function, and the vector inner product can be calculated by the preset activation function to obtain the predicted value in the click dimension.
  • the preset activation function can map the vector inner product to any value between 0 and 1, that is, the preset activation function can obtain a mapping value between 0 and 1 according to the input vector inner product, and the mapping value constitutes the predicted value in the click dimension.
  • the preset activation function may be a sigmoid activation function, and thus, the vector inner product may be input into the sigmoid activation function, and the predicted value in the click dimension may be calculated by the sigmoid activation function.
• Step S4043a determining, by using the logarithmic loss function, the logarithmic loss between the predicted value and the click target value under the click dimension.
  • Step S4044a determining the logarithmic loss as the target loss result.
  • the sample object target vector includes an object duration target vector; the sample video target vector includes a video duration target vector; and the target parameter includes a duration target value.
  • FIG. 14 shows that step S404 can be implemented by following the steps S4041b to S4047b:
  • Step S4041b truncating the duration target value according to the preset number of truncation intervals to obtain a duration truncation value having the number of truncation intervals.
• for example, when the preset number of truncation intervals is 100, the equifrequency distribution of the duration target values can be divided into 100 intervals, thereby obtaining 100 duration truncation values, one for each interval.
  • Step S4042b determining a target truncation value based on the duration truncation value of the number of truncation intervals.
• the duration truncation value of the last interval, that is, the minimum value of the 100th interval, may be determined as the target truncation value.
  • Step S4043b based on the target cutoff value, normalize each duration cutoff value to obtain a normalized duration cutoff value.
• the MinMax function may be used to normalize each duration cutoff value.
  • Each duration cutoff value may be divided by the target cutoff value to obtain a normalized duration cutoff value corresponding to the duration cutoff value.
  • Step S4044b determining the vector inner product between the object duration target vector and the video duration target vector.
  • Step S4045b based on the vector inner product and the preset activation function, determine the predicted value in the duration dimension.
  • the vector inner product between the object duration target vector and the video duration target vector can be input into a preset activation function, and the vector inner product is calculated by the preset activation function to obtain a predicted value in the duration dimension.
  • the preset activation function here can also be a sigmoid activation function, so that the vector inner product can be input into the sigmoid activation function, and the predicted value in the duration dimension is calculated by the sigmoid activation function.
  • Step S4046b determining the mean square error loss between the predicted value and the normalized duration cutoff value in the duration dimension through the mean square error loss function.
  • Step S4047b determining the mean square error loss as the target loss result.
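• The following is a minimal sketch of steps S4041b to S4047b for a single sample, under the assumptions stated in the training process below: the target truncation value max_dur comes from the 100-interval equifrequency statistics, the MinMax normalization uses min = 0 and max = max_dur, and the predicted value is the sigmoid of the inner product of the two duration target vectors; the function and variable names are illustrative.

import numpy as np

def duration_loss(object_dur_vec, video_dur_vec, duration_target, max_dur):
    """Mean square error loss in the duration dimension for a single training sample."""
    # Truncate the duration target by the target truncation value max_dur.
    truncated = min(duration_target, max_dur)
    # MinMax normalization with min = 0 and max = max_dur, mapping the target value into [0, 1].
    target_value = truncated / max_dur
    # Predicted value: sigmoid of the inner product of the object and video duration target vectors.
    inner = float(np.dot(object_dur_vec, video_dur_vec))
    predicted = 1.0 / (1.0 + np.exp(-inner))
    # Mean square error between the predicted value and the normalized duration cutoff value.
    return (predicted - target_value) ** 2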
  • the video recall model also includes a multi-target network, and the object enhancement loss of the sample object and the video enhancement loss of the sample video can be calculated through the multi-target network.
  • FIG15 is a flow chart of determining the object enhancement loss and the video enhancement loss based on the multi-target network provided by an embodiment of the present application, as shown in FIG15, including the following steps S501 to S503:
  • Step S501 when the sample data is a positive sample, the model training device outputs a target enhancement vector under multiple target dimensions through a multi-target network, wherein the target enhancement vector under multiple target dimensions includes a target enhancement vector corresponding to the object tower and a target enhancement vector corresponding to the video tower.
  • Step S502 Under each target dimension, the model training device determines a first mean square error between a target enhancement vector corresponding to the object tower and a sample video target vector output by the video tower, or determines a second mean square error between a target enhancement vector corresponding to the video tower and a sample object target vector output by the object tower.
  • Step S503 The model training device determines the first mean square error and the second mean square error as the object enhancement loss of the sample object and the video enhancement loss of the sample video, respectively.
  • object enhancement loss and video enhancement loss constitute part of the loss results in the target loss result.
  • the target loss result includes the logarithmic loss in the click dimension, the mean square error loss in the duration dimension, the object enhancement loss in the click dimension, the video enhancement loss in the click dimension, the object enhancement loss in the duration dimension, and the video enhancement loss in the duration dimension.
  • these multiple losses can also be subjected to loss fusion processing, and the video recall model can be retrained and the model parameters can be corrected based on the fusion loss result after the loss fusion processing.
• In some embodiments, the loss weights respectively corresponding to the logarithmic loss in the click dimension, the mean square error loss in the duration dimension, the object enhancement loss in the click dimension, the video enhancement loss in the click dimension, the object enhancement loss in the duration dimension and the video enhancement loss in the duration dimension are obtained, and a preset regularization term is obtained; then, based on the loss weights and the regularization term, loss fusion processing is performed on these six losses to obtain the fusion loss result; finally, based on the fusion loss result, the parameters in the object tower and the video tower are corrected to obtain the trained video recall model.
  • Step S405 The model training device modifies the parameters in the object tower and the video tower based on the target loss result to obtain a trained video recall model.
  • the target weight under each target dimension is obtained, and the target weight is used to perform weighted calculation on the video target vector under each target dimension.
  • the sample object target vector and the sample video target vector can be input into the recommendation prediction layer of the video recall model, and the click parameters and video duration parameters of the sample object for the sample video are determined through the recommendation prediction layer; then, according to the click parameters, the performance index value of the video recall model is determined; according to the video duration parameters, the average head duration of the video recall model is determined; finally, based on the performance index value and the average head duration, multiple rounds of tests are performed to obtain the target weight in the click dimension and the target weight in the video duration dimension.
  • the performance index value is an index value used to measure the quality of the video recall model
  • the performance index value can be the AUC value of the video recall model.
• AUC (Area Under Curve) is a performance index for measuring the quality of the video recall model; it can be obtained by summing the areas of the parts under the ROC (receiver operating characteristic) curve.
  • the video recommendation method provided by the embodiment of the present application can optimize the initialization method of the user enhancement vector, and generate the user enhancement vector according to the user playback sequence.
• the user playback sequence is the user's playback record of content (such as videos), wherein the playback record includes information such as the content ID and the playback duration. Because the number of content IDs covered by user playback sequences is only in the millions, roughly a hundred times smaller than the number of users, generating the user enhancement vector from the user playback sequence can significantly reduce the model size; and because the contents and durations played by most users differ, different users can still be distinguished.
  • the embodiment of the present application also optimizes the structure of each tower, and uses a multi-objective neural network to output multiple vectors corresponding to multiple targets.
  • the embodiment of the present application also optimizes the way of fitting the target vector with the enhancement vector, and realizes the simultaneous fitting of multiple target vectors with one enhancement vector by adding a multi-gate mixture of experts (MMOE) network in the fitting process.
• The problems that the tower structure is only applicable to a single target and that the enhancement vector cannot fit multi-target information are solved in the multi-target recall model (i.e., video recall model) by initializing the user enhancement vector (i.e., object enhancement vector) according to the historical playback sequence, using the multi-target network as the tower structure, and dynamically extracting the multi-target information of the enhancement vector.
  • the embodiment of the present application optimizes the calculation method of each target loss by truncating and normalizing the target value and the predicted value in the training stage, and optimizes the fusion method of multiple target losses by adaptively learning the weights of each target loss; on the other hand, in the application stage, by defining an evaluation formula, the measurement score of each target similarity under different weight combinations is calculated, and the best combination that satisfies multiple targets at the same time is explored offline, thereby optimizing the fusion method of multiple target similarities.
  • the video recall model of the embodiment of the present application finally has the ability to fully increase the cross-opportunities between users and content (i.e., videos to be recommended) in the offline training phase in a recommendation scenario that takes into account objectives such as click-through rate and average playback time per person, and to quickly retrieve content that meets multiple objectives in a balanced manner in the online application phase.
• In the user tower and the content tower of the video recall model in the embodiment of the present application, in addition to the conventional feature vectorization, each tower splices in an enhancement vector carrying the multi-target information of the other tower. After the spliced vector is calculated by the multi-target tower, a total of four vectors are output, namely the user's click target vector, the content's click target vector, the user's duration target vector and the content's duration target vector.
  • the two target vectors of the user are directly spliced as the user's multi-target vector.
  • the two target vectors of the content are respectively multiplied by their respective weights and spliced together as the content multi-target vector.
  • the inner product of the user's multi-target vector and the content multi-target vector is calculated online in real time, and the inner product is used as the similarity score.
• the higher the similarity score, the more interested the user is in the corresponding content.
• the video recommendation system will merge and deduplicate the retrieved head-score results with the results of other recalls, then continue with logic such as fine sorting and mixed sorting, and finally recommend the videos to users.
  • the implementation process of the video recall and recommendation by the video recall model in the embodiment of the present application will be described in detail below.
  • the application scenario of the embodiment of the present application can be a recommended waterfall flow of material cards in the non-destination area of each channel on the homepage of any video application.
  • Figure 16 is an interface diagram of the homepage of a video application.
  • the homepage of a video application includes multiple channels, mainly a featured page and each vertical channel.
  • the featured page can comprehensively display various types of content such as type 1, type 2, type 3, type 4, etc. (for example, the various types here can be types such as TV series, movies, variety shows and animations), and each vertical channel will only display the corresponding type of content, such as a TV series channel will only display TV series. What the user sees after sliding down each channel is the non-destination area, as shown in Figure 17.
• the main content type is the material card 171, and the videos that the user is interested in are displayed in a personalized manner through the material card 171.
  • the technical difficulty of the embodiment of the present application is how to retrieve videos that the user is really interested in based on limited information, and improve the click-through rate of the scene and the average playback time per person.
  • the main purpose of the embodiments of the present application is to retrieve videos that users are interested in based on personalized information such as characteristics and historical behaviors of the target object (such as users), while considering both click and duration goals, and to display the videos of interest to users through recommendation services, thereby attracting user clicks and driving growth in business indicators.
  • the core technologies of the embodiments of the present application include: a calculation process based on a multi-target recall model (i.e., a process for calculating the similarity scores between users and videos), a training process (i.e., a process for updating model parameters using training data), and an application process (i.e., a process for online real-time retrieval of videos of interest to users).
  • a request for obtaining a video is sent to the recommendation service, which returns the video of interest to the user (i.e., the target recommended video) through recall, fine sorting, and mixed sorting logic.
  • the video recall model of the embodiment of the present application is located at the recall layer, and the video is retrieved through the calculation process shown in Figure 18.
  • the calculation process includes: step S181, inputting features and vectorizing; step S182, outputting multi-target vectors; step S183, calculating multi-target similarity.
• In step S181, the features input into the model cannot directly participate in the calculation and need to be vectorized first.
  • enhanced vectors used to increase the chances of user and video interaction are also generated in this stage.
• When the feature vector is generated, it is generated in the manner shown in FIG. 19.
  • the video recall model first inputs the object feature in the user tower and the video feature in the video tower, and then performs vectorization processing through the feature vector table 190 (i.e., the above-mentioned preset feature vector table).
  • the feature vector table has two dimensions, the first dimension is the feature ID, and the second dimension is the vector corresponding to each feature ID.
  • the vector tables of different features are independent. If it is a discrete feature 191, such as: discrete related information of the target object, the ID, category and label of the video, etc., the corresponding vector table can be directly retrieved according to the ID.
• If it is a continuous feature 192, discretization processing is required first; the discretization processing can be to determine the interval to which the continuous feature 192 belongs according to the feature equifrequency distribution table 193 that has been counted offline, and then retrieve the corresponding feature vector table 190 according to the interval ID.
  • the equifrequency distribution statistics of each continuous feature are also independent.
• a discrete feature vector 194 and a continuous feature vector 195 are obtained respectively. Then, the discrete feature vector 194 and the continuous feature vector 195 are concatenated to form the feature vector 196. After the object features and video features are vectorized, they are concatenated internally to generate an object feature vector and a video feature vector.
  • the feature vector table is part of the model parameters, and the update method of the model parameters will be described in the training process below.
• When the enhancement vector is generated, the enhancement vector includes a user enhancement vector and a video enhancement vector.
  • the video enhancement vector is obtained by retrieving the feature vector table by the video ID; the user enhancement vector is generated by the play sequence (i.e., the historical play sequence). As shown in Figure 20, the play sequence 20 includes the video ID and duration that the user has played in history.
  • the feature vector table 201 is retrieved for each video ID in the play sequence 20 to generate a video vector set 202, and the number of video vectors in the video vector set 202 is consistent with the number of video IDs in the play sequence; then, the total play duration 203 of the play sequence is counted, and each play duration in the play sequence is divided by the total play duration 203 to obtain the normalized duration as the video vector weight, and the number of video vector weights is also consistent with the number of video IDs in the play sequence; finally, each vector in the video vector set is multiplied by the corresponding video vector weight to obtain a video weighted vector set 204, and the vectors in the video weighted vector set 204 are combined to obtain the user enhancement vector 205.
  • the feature vector table at this stage is also part of the model parameters, and the update method of the model parameters will be described in the training process below.
  • the user enhancement vector is generated using the playback sequence. Since the videos and durations played by most users are different, and the embodiment of the present application also uses the playback duration to weight the vector, it is possible to achieve differentiated processing between different users without affecting the confidence of the user enhancement vector.
  • the user enhancement vector generation method of the embodiment of the present application has the following two advantages: First, although the playback sequences of most users are different, the order of magnitude of the video IDs covered by the user's playback sequence remains unchanged, and the number is reduced by a hundred times compared to the hundreds of millions of users accumulated in the application scenario, so the computing resources and storage space of the model can be significantly saved; second, assuming that the number of samples remains unchanged, because the number of video IDs in the playback sequence is much less than the number of users, each video ID can get more sufficient training opportunities, thereby obtaining a more accurate user enhancement vector.
• At the input layer of the user tower and the video tower, the model splices the generated feature vector and the enhancement vector together to obtain a splicing vector; the splicing vectors obtained by the user tower and the video tower are used as the user splicing vector and the video splicing vector, respectively, and participate in the subsequent process.
  • the structure of the user tower and the video tower in the embodiment of the present application are both multi-objective neural networks.
  • the multi-objective neural network can be a progressive layered extraction network (PLE).
  • the PLE network (as shown in FIG21) mainly includes: (1) an expert network for learning multiple targets. For example, if two targets, click and duration, need to be learned, two groups of expert networks are required; (2) a shared network for learning shared information between different expert networks. No matter how many targets need to be learned, the shared network is fixed to one; (3) a gate network for calculating the output vector of the fusion of multiple networks, each vector corresponds to a weight.
  • the length of the output vector of the last layer of the gate network is the same as the number of weights to be determined.
  • the expert network, the shared network and the gate network can be multi-layered, and each layer is a multi-layer neural network. Because the output of the last layer of the expert network is the target vector, there is no need to interact with the shared network and the gate network in the future, so the shared network and the gate network will have one less layer than the expert network.
  • the specific number of layers can be determined by offline testing.
  • the embodiment of the present application can use a three-layer expert network, a two-layer shared network and a two-layer gate network.
• the click expert network will output the click target vector, and the duration expert network will output the duration target vector; that is, the two towers each output two vectors.
  • the PLE network parameters are part of the model parameters, and the update method of the model parameters will be explained in the training process below.
• Regarding step S183, it should be noted that the PLE network was originally proposed to optimize the fine ranking model in recommendation: in the fine ranking model, the predicted click-through rate is multiplied by the predicted duration, which is equivalent to the user's duration expectation. This method requires multiple calculations and is only applicable to situations where the candidate set is small.
  • the recall model faces a massive candidate set and adopts a more efficient nearest neighbor algorithm.
  • the score is generally the inner product that only needs to be calculated once. The score calculation method of the original structure cannot be applied to the recall model and needs to be redesigned.
  • the most intuitive way to calculate the multi-target similarity based on the inner product is to first calculate the inner product of the user click target vector and the video click target vector, as well as the inner product of the user duration target vector and the video duration target vector, and then weighted sum the two inner products, but this method also requires multiple calculations.
• The method of the embodiment of the present application is to first splice the click target vector and the duration target vector inside each tower: in the user tower, the user click target vector 221 and the user duration target vector 222 are directly spliced, and the spliced vector is used as the user multi-target vector 223; in the video tower, each bit of the video click target vector 224 is multiplied by the click target weight α, each bit of the video duration target vector 225 is multiplied by the duration target weight β, and the two weighted vectors are then spliced, and the spliced vector is used as the video multi-target vector 226.
  • the inner product of the user multi-target vector 223 and the video multi-target vector 226 is calculated as the multi-target similarity 227 (i.e., similarity score) between the user and the video, as shown in Figure 22.
  • the embodiment of the present application is equivalent to realizing the weighted summation of the click target inner product and the duration target inner product through one calculation, thereby adapting the nearest neighbor algorithm.
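• The equivalence that makes a single inner product sufficient can be written out as follows (u_clk and u_dur denote the user click and duration target vectors, v_clk and v_dur the video ones, and [·;·] denotes concatenation; this notation is ours, not the patent's):

\langle\, [\,u_{clk};\,u_{dur}\,],\ [\,\alpha\,v_{clk};\,\beta\,v_{dur}\,] \,\rangle \;=\; \alpha\,\langle u_{clk}, v_{clk}\rangle \;+\; \beta\,\langle u_{dur}, v_{dur}\rangle

• Concatenating the weighted target vectors offline therefore lets the nearest neighbor search compute the weighted sum of the click similarity and the duration similarity in a single inner product.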
• α and β are part of the model parameters, and the updating method of the model parameters will be described in the training process below.
• The video recall model of the embodiment of the present application is used for online real-time retrieval of videos of interest to users.
  • the model can be updated regularly through the training process shown in FIG. 23 .
  • the training process includes: step S231 , constructing training samples; step S232 , updating model parameters; and step S233 , exploring multi-objective weights.
  • step S231 the training samples are used for offline training of the model, and the construction process is divided into two steps: screening positive and negative examples and associating features.
  • the positive and negative examples are represented as: [user ID, video ID, whether clicked, playback duration, request ID], which is composed of five fields.
  • "real exposure and playback behavior data” is used as a positive example
  • the positive example can be represented as: [user ID, video ID, 1, playback duration, request ID]
  • "real exposure and non-playing behavior data” is used as a negative example
  • the negative example can be represented as: [user ID, video ID, 0, 0, request ID].
• the positive example screening method can be reused in the video recall model, but the negative example screening method is not applicable, so the embodiment of the present application optimizes it, as shown in Figure 24, by adding random negative examples and reducing the number of real negative examples.
• When adding random negative examples: when the recall model searches for videos of interest to users online in real time, the video candidate set 241 it faces is a resource pool of millions of videos. Therefore, unlike the fine ranking samples, which only use real behavior data to screen positive and negative examples, the recall samples need to add randomly sampled negative examples (i.e., random negative examples 242). The candidate set faced by the fine ranking model has already been screened by recall and coarse ranking and is relatively matched with the user's interests, so its main purpose is to determine the more interesting parts from it; the video recall model, however, also needs the ability to identify videos that the user is completely uninterested in from the entire candidate set, that is, to separate the videos that the user is interested in from the videos that the user is not interested in to the greatest extent.
  • the candidate set of random negative examples 242 is the entire video candidate set 241.
  • the user ID and request ID are extracted, and n video IDs are randomly selected to form a negative example.
  • the negative example is represented as: [user ID, video ID, 0, 0, request ID].
  • the embodiment of the present application introduces random negative examples to simulate videos that the user is completely uninterested in, so that the video recall model can see more parts that should be filtered during the training stage, thereby strengthening the interest discrimination judgment of the candidate set.
• When reducing the true negative examples 244: the introduction of random negative examples will significantly increase the sample size, thereby lengthening the training time; in addition to requiring more computing and storage resources, this will also slow down the update of the model, resulting in insufficient learning of the user's behavioral characteristics, so further optimization is needed.
• the embodiment of the present application adopts a method of randomly sampling the true negative examples. True negative examples are behavior data of videos that were exposed to the user but not played.
• according to the selected positive and negative examples 246, object features can be associated through user IDs and video features can be associated through video IDs, as shown in FIG. 25. Because the feature values of some features, such as the user's play sequence and the city where the user is located, keep changing, and the feature values of some features, such as the video's play volume and click-through rate, also keep changing, the same user visiting at different times will correspond to different feature values for different requests. Therefore, object features and video features are generally stored in the recommendation service together with the request ID of the user's visit.
  • the training sample at this time can be expressed as: [object features, video features, request ID]; while in the offline case, the request ID of the positive and negative examples is associated with the request ID of the feature 251 stored online, so as to match the object features and video features when the request occurs, and obtain the final training sample 252, which can be expressed as: [object features, video features, whether to click, play duration].
  • the "object features" are input into the user tower
  • the "video features” are input into the video tower
  • “whether to click” is the click target
  • play duration is the duration target.
  • the process of updating the model is the process of updating its own parameters.
  • the model adjusts the model's own parameters by reducing the loss.
  • the loss of the model can be calculated by inputting the target value and the predicted value into the loss function.
  • the loss of the model is used to measure the difference between the target value and the predicted value. The smaller the loss, the smaller the difference.
  • the model continuously fits the target by continuously reducing the loss during the training phase.
• the embodiment of the present application also optimizes the fusion method that uses fixed loss weights, so that each predicted value is balanced and brought close to its target value.
  • the following is a description of several loss calculations involved in the embodiment of the present application.
  • the click target is divided into two types according to "whether it is clicked". Because the target is a discrete value and belongs to a binary classification prediction, the embodiment of the present application uses a logarithmic loss function, such as the following formula (1).
• The loss calculation process is: first, determine the target value, that is, y; if the video is exposed but not played, the target value is 0, and if it is exposed and played, the target value is 1. Then, calculate the predicted value, that is, σ(⟨p_u, p_v⟩): first calculate the inner product of the user click target vector p_u and the video click target vector p_v, and then use the output of the sigmoid activation function σ(·) as the predicted value. Finally, calculate the logarithmic loss: input the target value and the predicted value into the logarithmic loss function to obtain the logarithmic loss loss_clk.
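• Formula (1) itself is not reproduced in the text above; based on the described target value y, the predicted value σ(⟨p_u, p_v⟩) and the use of a logarithmic loss over the training samples, a plausible reconstruction is:

loss_{clk} = -\frac{1}{T}\sum_{i=1}^{T}\Big[\,y_i\,\log\sigma\big(\langle p_u, p_v\rangle_i\big) + (1-y_i)\,\log\big(1-\sigma(\langle p_u, p_v\rangle_i)\big)\Big] \qquad (1)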
  • T represents the number of training samples.
  • the duration target is the actual playback duration of the user. If it is not clicked, it is 0, and if it is clicked, it is a decimal greater than 0. Because the target is a continuous value and belongs to regression prediction, the embodiment of the present application uses the mean square error loss function, such as the following formula (2).
• The loss calculation process is: first, truncate the duration target, that is, compute min(dur, max_dur). Since some users may forget to exit the playback, or there may be problems with duration reporting, a very small number of abnormally large duration target values dur appear in the samples; in order to avoid interfering with the recall model's fitting of the duration target, the duration target needs to be truncated by a specified truncation value.
• The equifrequency distribution of the duration targets in randomly screened offline test samples can be divided into 100 intervals, and the minimum value of the 100th interval is used as the specified truncation value max_dur (that is, the above-mentioned target truncation value): if the duration target is less than or equal to the truncation value, the original value is retained, and if it is greater than the truncation value, it is replaced with the truncation value. Then, the duration target is normalized to obtain the normalized duration truncation value. Since the span of the truncated duration target is still very large, ranging from 0 seconds to tens of thousands of seconds, the parameters of the model would fluctuate drastically when learning different samples.
  • the MinMax function is used to normalize the duration truncation value.
• the min in the function is 0, and the max is the specified truncation value max_dur.
  • the output of the MinMax function is used as the target value, and the interval corresponding to the duration target is adjusted to [0, 1.0]. In this way, the span interval of the duration target is significantly reduced, thereby facilitating model fitting.
• then, the predicted value in the duration dimension, that is, σ(⟨d_u, d_v⟩), is calculated.
  • the predicted value needs to be consistent with the span interval of the target value, that is, the span interval of the predicted value is also [0, 1.0].
• the inner product of the user duration target vector d_u and the video duration target vector d_v is first calculated, and then the inner product is input into the sigmoid activation function σ(·), and the output result is used as the predicted value in the duration dimension.
• finally, the mean square error is calculated: the target value and the predicted value can be input into the mean square error loss function of formula (2) to obtain the mean square error loss loss_dur.
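• Formula (2) is likewise not reproduced above; with the normalized truncated duration min(dur, max_dur)/max_dur as the target value and σ(⟨d_u, d_v⟩) as the predicted value, a plausible reconstruction over T training samples is:

loss_{dur} = \frac{1}{T}\sum_{i=1}^{T}\Big(\sigma\big(\langle d_u, d_v\rangle_i\big) - \frac{\min(dur_i,\ max_{dur})}{max_{dur}}\Big)^{2} \qquad (2)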
  • the embodiment of the present application adds an MMOE network in the fitting process, as shown in Figure 26.
  • the MMOE network is a primary multi-target network.
  • the MMOE network can output multiple vectors corresponding to multiple targets.
• the MMOE network only has expert networks and gate networks and no shared network; when the number of layers is configured to be very small, the parameter scale is also small, which will not cause the model to over-learn this part of the structure.
  • the enhancement vector 261 passes through the click gate network 262, the click expert network 263, the duration expert network 264 and the duration gate network 265 in the MMOE network, and outputs the click target enhancement vector 266 and the duration target enhancement vector 267.
• the model calculates the mean square error loss_aug between the target enhancement vector of the current tower and the target vector of the other tower under each target dimension (see the following formula (3), where t_i represents each component of the target vector and a_i represents each component of the target enhancement vector), and obtains four losses in total, namely: the user enhancement loss under the click target, the video enhancement loss under the click target, the user enhancement loss under the duration target and the video enhancement loss under the duration target.
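• Formula (3) is described but not shown; taking t_i and a_i as the i-th components of the target vector and of the target enhancement vector, and n as the vector dimension, the mean square error can plausibly be reconstructed as:

loss_{aug} = \frac{1}{n}\sum_{i=1}^{n}\big(t_i - a_i\big)^{2} \qquad (3)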
  • the target vector requires the enhancement vector of another tower during the calculation process, and they are dependent on each other. If the model updates the enhancement vector, the target vector needs to be fixed, otherwise there will be a problem that the model parameters cannot be updated.
  • the model when the model faces multiple target losses, it is generally a weighted sum of each loss, and the weighted sum result is used as the fusion loss. After calculating the fusion loss, it is possible to reduce the fusion loss to achieve simultaneous reduction of each loss.
  • the weight of each loss can be determined by offline testing, but during the training process, when different training samples are input, due to differences between the data, fixed weights will inevitably lead to multiple goals being neglected, so the embodiment of the present application uses uncertainty weighting to adaptively adjust the weight of each loss to achieve a balanced close to the target value for each predicted value.
  • the targets faced by the model include: click target loss, duration target loss, user enhancement loss under click target, video enhancement loss under click target, user enhancement loss under duration target, and video enhancement loss under duration target, a total of six losses.
• The corresponding original uncertainty weighting formula is the following formula (4), where loss_clk, loss_dur, loss_u_clk_aug, loss_u_dur_aug, loss_v_clk_aug and loss_v_dur_aug correspond to the six target losses respectively, w_clk, w_dur, w_u_clk_aug, w_u_dur_aug, w_v_clk_aug and w_v_dur_aug are the weights of the six target losses, and log(w_clk·w_dur·w_u_clk_aug·w_u_dur_aug·w_v_clk_aug·w_v_dur_aug) is the regularization term.
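• Formula (4) is referenced but not reproduced; the standard uncertainty-weighting form consistent with the description above (each loss scaled by its learnable weight, plus a logarithmic regularization term whose possible negativity motivates formula (5)) would be, as a reconstruction:

loss_{fusion} = \sum_{k} \frac{loss_{k}}{2\,w_{k}^{2}} \;+\; \log\big(w_{clk}\,w_{dur}\,w_{u\_clk\_aug}\,w_{u\_dur\_aug}\,w_{v\_clk\_aug}\,w_{v\_dur\_aug}\big), \quad k \in \{clk,\ dur,\ u\_clk\_aug,\ u\_dur\_aug,\ v\_clk\_aug,\ v\_dur\_aug\} \qquad (4)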
• w_clk, w_dur, w_u_clk_aug, w_u_dur_aug, w_v_clk_aug and w_v_dur_aug may be negative, resulting in incorrect calculation of the fusion target loss and thus training failure. Therefore, the embodiment of the present application adjusts the regularization term to formula (5) to ensure that the fusion target loss is calculated correctly at every training step.
• norm = log(w_clk² · w_dur² · w_u_clk_aug² · w_v_clk_aug² · w_u_dur_aug² · w_v_dur_aug²)   (5)
• When the model of the embodiment of the present application is updated periodically, it continuously learns the latest user behavior characteristics by incrementally inputting samples, calculating losses and reducing losses, thereby ensuring that the videos retrieved online better match user interests.
  • step S233 since the online available model has been created and updated through the previous steps, after inputting the features of the user and the video, the multi-target vectors of the user and the video will be output.
• The click target weight and the duration target weight, that is, α and β used in the multi-target similarity calculation in the above calculation process, are determined through offline testing in the embodiment of the present application.
  • the click target belongs to the binary prediction.
• the embodiment of the present application uses AUC to determine α and β.
• the target value when calculating AUC is 0 or 1, and the predicted value is the similarity score under different α and β.
  • the duration target belongs to regression prediction.
• the average head duration of a single test set (AD@K, AvgDurPerBatch@TopK) can be used to determine α and β. See the following formula (6):
• the first step is to screen the test samples and calculate the similarity scores under different α and β.
• where n represents the number of test sets (batches), and n*m represents the total number of test samples.
  • the second step is to sort from large to small according to the similarity score in each test set (batch).
• the duration targets of the top K samples in each test set are summed, the sums of all test sets are accumulated, and the accumulated result is finally divided by n to obtain AD@K.
• the higher AD@K is, the more suitable α and β are.
• du_ij represents the duration target corresponding to the j-th sample in the i-th test set.
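• Formula (6) is described but not reproduced; with n test sets, TopK(i) denoting the K samples with the highest similarity scores in the i-th test set, and du_ij as defined above, a plausible reconstruction is:

AD@K = \frac{1}{n}\sum_{i=1}^{n}\ \sum_{j \in TopK(i)} du_{ij} \qquad (6)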
• The first step is to define the value ranges of α and β for the first round of testing.
• the value range in the embodiment of the present application is [1, 10] with a step size of 1; because there are two weights, 100 combinations will be explored. For each group of α and β, the corresponding ⟨AUC, AD@K⟩ is calculated. However, the orders of magnitude of AUC and AD@K are different, which is not convenient for direct comparison.
• the embodiment of the present application will therefore first use the MinMax function for normalization, and then compare the difference minus between the normalized AUC and the normalized AD@K of each group, as shown in the following formula (7); the closer the difference minus is to 0, the more appropriate α and β are. Finally, the α_1 and β_1 with the smallest difference in the first round are determined; in the subsequent exploration, α will be fixed to α_1, and only a more accurate β will be explored. The second step is to define the value range of β for the second round of testing; the value range of β in the embodiment of the present application is (β_1 - 1, β_1 + 1) with a step size of 0.1, 19 explorations are performed, and finally the β_2 with the smallest difference in the second round is determined.
• The third step is to define the value range of β for the third round of testing.
• the value range of β is (β_2 - 0.1, β_2 + 0.1) with a step size of 0.01, 19 explorations are performed, and finally the β_3 with the smallest difference in the third round is determined;
• The fourth step is to loop from the first step to the third step; the embodiment of the present application explores β for five rounds, and the final α_1 and β_5 are the target weights under the click and duration dimensions, respectively, which are used to calculate the similarity score between the user and the video online.
• min_AUC represents the minimum AUC value in this round of testing, and max_AUC represents the maximum AUC value in this round of testing; min_AD@K represents the minimum AD@K value in this round of testing, and max_AD@K represents the maximum AD@K value in this round of testing.
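• Formula (7) is referenced above but not reproduced; using the per-round MinMax normalization implied by the minimum and maximum values just defined, a plausible reconstruction of the difference minus is (the absolute value is our assumption):

minus = \left|\frac{AUC - min_{AUC}}{max_{AUC} - min_{AUC}} - \frac{AD@K - min_{AD@K}}{max_{AD@K} - min_{AD@K}}\right| \qquad (7)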
  • the features of the video candidate set are first input into the video tower in batches to generate the video click target vector and the video duration target vector of the corresponding video, and then each bit of the two target vectors is multiplied by their respective weights and spliced together to obtain a splicing vector, which is used as a video multi-target vector.
  • an index for online real-time query is created for the video multi-target vector, so that the video multi-target vector can be directly queried based on the index, ensuring that there is no need to repeatedly generate the video multi-target vector online.
  • the model is deployed online to ensure that the real-time generated user multi-target vector is consistent with the model version corresponding to the video multi-target vector.
  • the online real-time input of the requesting user's features into the user tower, the corresponding user click target vector and user duration target vector are generated through the user tower, and then the user click target vector and the user duration target vector are spliced into a user multi-target vector, and finally the nearest neighbor algorithm is used to query the head video with the highest similarity score and return it.
  • the queried videos and other recalled videos can be merged and deduplicated in turn, and then logical processing such as fine sorting and mixed sorting can be continued.
  • the screened target recommended videos can be recommended to users in the form of material cards in the targetless area.
  • the user enhancement vector of the embodiment of the present application is initialized using the playback sequence. While ensuring the differentiation of different users, it can not only significantly save computing and storage resources by reducing the scale of model parameters, but also improve the expression accuracy of the enhancement vector by increasing the training opportunities of each video ID in the playback sequence.
  • each tower in the embodiment of the present application adopts a PLE network to output multiple vectors after inputting features. These multiple vectors are used to fit multiple targets such as clicks and duration, and these multiple vectors can be adjusted according to business goals.
  • an MMOE network is added to achieve an enhancement vector that fits multiple target vectors at the same time.
  • the embodiment of the present application also redesigns the calculation method of the target value and the predicted value in the duration target loss function to obtain the duration target loss. In this way, after the duration target loss is added to other losses such as the click target loss and the enhancement target loss, multiple losses are fused based on the improved version of uncertainty weighting to achieve that each predicted value is balanced and close to the target value.
• when the embodiment of the present application calculates the multi-objective similarity scores between users and videos in real time, the multi-objective weights used can be determined by the exploration method designed in the embodiment of the present application, so that the most appropriate weights that satisfy multiple objectives simultaneously can be obtained offline.
• the user enhancement vector of the embodiment of the present application is initialized by the playback sequence. Because the playback sequence contains the playback order in addition to the video ID and playback duration, a sequence model such as a long short-term memory (LSTM) network, a transformer model or BERT can be used to introduce the sequential information into the user enhancement vector.
  • other structures can be used to replace the PLE network, such as ResNet or parallel dual tower structures to replace the PLE network, thereby enhancing the expression ability of the multi-target vector.
  • the video recall model of the embodiment of the present application can also be used to personalize the optimization target of the model according to the characteristics of the scenarios of different video applications.
  • where the content involves user information, such as object feature vectors, historical play sequences, and target recommended videos, or data related to user information or enterprise information, user permission or consent is required; the relevant data collection and processing should strictly comply with the requirements of relevant national laws and regulations, obtain the informed consent or separate consent of the personal information subject, and carry out subsequent data use and processing within the scope authorized by laws, regulations, and the personal information subject.
  • the video recommendation device 354 includes: an acquisition module 3541, configured to acquire an object feature vector of a target object, a historical playback sequence of the target object within a preset historical time period, and a video multi-target vector index of each video to be recommended in a video library to be recommended; a vectorization processing module 3542, configured to perform vectorization processing on the historical playback sequence to obtain an object enhancement vector of the target object; a multi-target processing module 3543, configured to perform vector splicing processing and multi-target feature learning on the object feature vector and the object enhancement vector in sequence to obtain an object multi-target vector of the target object; a determination module 3544, configured to determine a target recommended video corresponding to the target object from the video library to be recommended based on the object multi-target vector and the video multi-target vector index of each video to be recommended; and a video recommendation module 3545, configured to perform video recommendation to the target object based on the target recommended video.
  • the device also includes: a retrieval module, configured to search in a preset feature vector table based on the video identification of each video to be recommended, and obtain a corresponding video enhancement vector for each video to be recommended; a vector splicing module, configured to perform vector splicing processing on the video feature vector and the video enhancement vector of each video to be recommended, and obtain a corresponding video splicing vector for each video to be recommended; a multi-target feature learning module, configured to perform multi-target feature learning on the video splicing vector of each video to be recommended, and obtain a corresponding video multi-target vector for each video to be recommended; a creation module, configured to create a video multi-target vector index corresponding to the video multi-target vector of each video to be recommended.
  • the vectorization processing module is further configured to: obtain the historical video identifier and historical playback time of each historical playback video in the historical playback sequence; based on each of the historical video identifiers, search in a preset feature vector table to obtain a historical video vector set; the number of historical video vectors in the historical video vector set is the same as the number of historical video identifiers in the historical playback sequence; perform total duration statistics on the historical playback time in the historical playback sequence to obtain the total historical playback time; based on the total historical playback time, perform duration normalization processing on each of the historical playback times to obtain the normalized playback time of each historical playback video, and determine the normalized playback time as the video vector weight of the corresponding historical playback video; based on the video vector weight, perform weighted processing on each historical video vector in the historical video vector set to obtain a video weighted vector set; merge the video weighted vectors in the video weighted vector set to obtain the object enhancement vector of the target object.
  • the multi-target processing module is further configured to: perform vector splicing processing on the object feature vector and the object enhancement vector to obtain an object splicing vector; perform multi-target feature learning on the object splicing vector through a multi-target neural network to obtain an object target vector of the target object in multiple target dimensions; perform splicing processing on the object target vectors in the multiple target dimensions to obtain an object multi-target vector of the target object.
  • the multi-objective feature learning module is further configured as follows: for each of the videos to be recommended, multi-objective feature learning is performed on the video splicing vector of the video to be recommended through a multi-objective neural network to obtain the video target vector of the video to be recommended under multiple target dimensions; the target weight under each target dimension is obtained; using the target weight, weighted calculation is performed on the video target vector under each target dimension to obtain a weighted video target vector; the weighted video target vectors under multiple target dimensions are spliced to obtain the video multi-objective vector of the video to be recommended.
  • the determination module is further configured to: obtain a video multi-target vector of each video to be recommended based on the video multi-target vector index; determine the inner product between the object multi-target vector and the video multi-target vector of each video to be recommended, and determine the inner product as a similarity score between the target object and the corresponding video to be recommended; select a specific number of videos to be recommended from the video library to be recommended according to the similarity score; and determine the selected specific number of videos to be recommended as target recommended videos corresponding to the target object.
  • the video recommendation method is implemented by a video recall model; the video recommendation device also includes a model training device, and the model training device is configured to: obtain sample data, the sample data including: sample object features, sample video features and target parameters under multiple target dimensions; input the sample object features into the object tower of the video recall model, and predict the sample object target vector of the sample object under the multiple target dimensions based on the sample object features through the object tower; input the sample video features into the video tower of the video recall model, and predict the sample video target vector of the sample video under the multiple target dimensions based on the sample video features through the video tower; input the sample object target vector, the sample video target vector and the target parameters into a target loss model, and perform loss calculation through the target loss model to obtain a target loss result; based on the target loss result, correct the parameters in the object tower and the video tower to obtain a trained video recall model.
  • the model training device is further configured to: obtain original sample data; the original sample data includes multiple real positive samples and multiple real negative samples; construct random negative samples based on the multiple real positive samples, and delete a part of the real negative samples from the multiple real negative samples; wherein there is a preset proportional relationship between the number of real positive samples, the number of real negative samples after deleting a part, and the number of random negative samples; determine the real positive samples as positive samples, and determine the real negative samples after deleting a part and the random negative samples as negative samples; based on the object identification and the video identification, perform feature association on the positive samples and the negative samples to obtain the sample data.
  • the sample object target vector includes an object click target vector; the sample video target vector includes a video click target vector; the target parameter includes a click target value; the model training device is also configured to: determine the vector inner product between the object click target vector and the video click target vector through the target loss model; determine the predicted value in the click dimension based on the vector inner product and a preset activation function; determine the logarithmic loss between the predicted value in the click dimension and the click target value through a logarithmic loss function; and determine the logarithmic loss as the target loss result.
  • the sample object target vector includes an object duration target vector; the sample video target vector includes a video duration target vector; the target parameter includes a duration target value; the model training device is also configured to: truncate the duration target value according to a preset number of truncation intervals to obtain a duration truncation value having the number of truncation intervals; determine a target truncation value based on the duration truncation value having the number of truncation intervals; normalize each of the duration truncation values based on the target truncation value to obtain a normalized duration truncation value; determine the vector inner product between the object duration target vector and the video duration target vector; determine a predicted value in the duration dimension based on the vector inner product and a preset activation function; determine the mean square error loss between the predicted value in the duration dimension and the normalized duration truncation value through a mean square error loss function; and determine the mean square error loss as the target loss result.
  • the video recall model also includes a multi-target network; the model training device is also configured to: when the sample data is a positive sample, output a target enhancement vector corresponding to the object tower and the video tower in multiple target dimensions through the multi-target network; in each target dimension, determine a first mean square error between the target enhancement vector corresponding to the object tower and the sample video target vector output by the video tower, or determine a second mean square error between the target enhancement vector corresponding to the video tower and the sample object target vector output by the object tower; determine the first mean square error and the second mean square error as the object enhancement loss of the sample object and the video enhancement loss of the sample video, respectively; the object enhancement loss and the video enhancement loss constitute part of the loss result in the target loss result.
  • the target loss result includes the logarithmic loss in the click dimension, the mean square error loss in the duration dimension, the object enhancement loss in the click dimension, the video enhancement loss in the click dimension, the object enhancement loss in the duration dimension, and the video enhancement loss in the duration dimension;
  • the model training device is also configured to: obtain the loss weights respectively corresponding to the logarithmic loss in the click dimension, the mean square error loss in the duration dimension, the object enhancement loss in the click dimension, the video enhancement loss in the click dimension, the object enhancement loss in the duration dimension and the video enhancement loss in the duration dimension; obtain a preset regularization term; based on the loss weights and the regularization term, perform loss fusion processing on the logarithmic loss in the click dimension, the mean square error loss in the duration dimension, the object enhancement loss in the click dimension, the video enhancement loss in the click dimension, the object enhancement loss in the duration dimension and the video enhancement loss in the duration dimension to obtain a fusion loss result; based on the fusion loss result, correct the parameters in the object tower and the video tower to obtain the trained video recall model.
  • the model training device is further configured to: input the sample object target vector and the sample video target vector into the recommendation prediction layer of the video recall model, and determine the click parameters and video duration parameters of the sample object for the sample video through the recommendation prediction layer; determine the performance index value of the video recall model according to the click parameters; determine the average head duration of the video recall model according to the video duration parameters; and perform multiple rounds of testing based on the performance index value and the average head duration to obtain the target weight in the click dimension and the target weight in the video duration dimension.
  • the embodiment of the present application provides a computer program product or a computer program, which includes executable instructions; the executable instructions are stored in a computer-readable storage medium.
  • a processor of an electronic device reads the executable instructions from the computer-readable storage medium and the processor executes the executable instructions, the electronic device executes the video recommendation method described in the embodiment of the present application.
  • the embodiment of the present application provides a storage medium storing executable instructions, wherein the executable instructions are stored.
  • the processor will be caused to execute the video recommendation method provided by the embodiment of the present application, for example, the video recommendation method shown in FIG5.
  • the storage medium can be a computer-readable storage medium, for example, a ferroelectric memory (FRAM, Ferroelectric Random Access Memory), a read-only memory (ROM, Read Only Memory), a programmable read-only memory (PROM, Programmable Read Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read Only Memory), a flash memory, a magnetic surface memory, an optical disk, or a compact disk read-only memory (CD-ROM, Compact Disk-Read Only Memory) and other memories; it can also be various devices including one or any combination of the above memories.
  • executable instructions may be in the form of a program, software, software module, script or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
  • executable instructions may, but need not, correspond to a file in a file system, may be stored as part of a file storing other programs or data, such as one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subroutines, or code portions).
  • executable instructions may be deployed to be executed on one electronic device, or on multiple electronic devices located at one location, or on multiple electronic devices distributed at multiple locations and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请实施例提供一种视频推荐方法、装置、电子设备、计算机可读存储介质及计算机程序产品,至少应用于人工智能领域和视频推荐领域,其中方法包括:获取目标对象的对象特征向量、所述目标对象在预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引;对所述历史播放序列进行向量化处理,得到所述目标对象的对象增强向量;对所述对象特征向量和所述对象增强向量依次进行向量拼接处理和多目标特征学习,得到所述目标对象的对象多目标向量;基于所述对象多目标向量和每一所述待推荐视频的视频多目标向量索引,从所述待推荐视频库中确定对应于所述目标对象的目标推荐视频。

Description

视频推荐方法、装置、电子设备、计算机可读存储介质及计算机程序产品
相关申请的交叉引用
本申请基于申请号为202211526679.9、申请日为2022年11月30日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请实施例涉及互联网领域,涉及但不限于一种视频推荐方法、装置、电子设备、计算机可读存储介质及计算机程序产品。
背景技术
目前,在视频推荐领域,常用的视频召回结构包括双塔结构和双重增强双塔结构。
双塔结构作为经典召回结构,因离线训练方便,在线检索快速的特点,一直被广泛应用于推荐场景中。双塔结构最明显的特点为“双塔独立”,也就是离线可以批量计算海量内容的目标向量,线上不需要重复计算,在线只需要计算一次用户目标向量,然后使用最近邻算法快速检索相似内容。不过“双塔独立”同时也限制了模型效果,双塔结构缺少用户的特征和内容特征交叉学习的机会,但是交叉特征和交叉学习可以显著提升模型效果。而双重增强双塔结构在用户塔和内容塔的输入层,分别生成用于拟合另一个塔信息的向量,该向量称为“增强向量”,增强向量通过另一个塔的目标向量不断更新,并参与到目标向量的计算过程中。但是,双重增强双塔结构中用户塔的增强向量的参数规模过大,且塔结构不支持多目标,增强向量无法同时拟合多个目标向量。
由此可见,相关技术中的视频召回结构在进行召回计算时,参数规模过大从而导致计算量较大,使得视频推荐时的召回过程计算时延较高,降低了视频推荐的效率;且由于无法同时拟合多个维度的目标向量,从而使得召回计算的准确率较低。
发明内容
本申请实施例提供一种视频推荐方法、装置、电子设备、计算机可读存储介质及计算机程序产品,至少能够应用于人工智能领域和视频推荐领域,能够提高视频召回的效率和准确率。
本申请实施例的技术方案是这样实现的:
本申请实施例提供一种视频推荐方法,包括:获取目标对象的对象特征向量、所述目标对象在预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引;对所述历史播放序列进行向量化处理,得到所述目标对象的对象增强向量;对所述对象特征向量和所述对象增强向量依次进行向量拼接处理和多目标特征学习,得到所述目标对象的对象多目标向量;基于所述对象多目标向量和每一所述待推荐视频的视频多目标向量索引,从所述待推荐视频库中确定对应于所述目标对象的目标推荐视频;基于所述目标推荐视频对所述目标对象进行视频推荐。
本申请实施例提供一种视频推荐装置,所述装置包括:获取模块,配置为获取目 标对象的对象特征向量、所述目标对象在预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引;向量化处理模块,配置为对所述历史播放序列进行向量化处理,得到所述目标对象的对象增强向量;多目标处理模块,配置为对所述对象特征向量和所述对象增强向量依次进行向量拼接处理和多目标特征学习,得到所述目标对象的对象多目标向量;确定模块,配置为基于所述对象多目标向量和每一所述待推荐视频的视频多目标向量索引,从所述待推荐视频库中确定对应于所述目标对象的目标推荐视频;视频推荐模块,配置为基于所述目标推荐视频对所述目标对象进行视频推荐。
本申请实施例提供一种电子设备,包括:存储器,配置为存储可执行指令;处理器,配置为执行所述存储器中存储的可执行指令时,实现上述的视频推荐方法。
本申请实施例提供一种计算机程序产品或计算机程序,所述计算机程序产品或计算机程序包括可执行指令,所述可执行指令存储在计算机可读存储介质中;当电子设备的处理器从所述计算机可读存储介质读取所述可执行指令,并执行所述可执行指令时,实现上述的视频推荐方法。
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行所述可执行指令时,实现上述的视频推荐方法。
本申请实施例具有以下有益效果:对目标对象的历史播放序列进行向量化处理,得到目标对象的对象增强向量,从而基于目标对象的对象增强向量确定目标对象的对象多目标向量。这里,对象多目标向量是融合了对象增强向量的对象特征向量,如此,基于对象多目标向量和每一待推荐视频的视频多目标向量索引,从待推荐视频库中确定目标推荐视频时,能够结合目标对象在多个维度下的信息对目标对象进行准确的分析,从而准确的进行视频召回;并且,由于对象增强向量是基于目标对象的历史播放序列生成的,历史播放序列包括目标对象对视频的播放记录,而播放记录的数量远小于视频应用中的目标对象的数量,因此,基于对象增强向量确定目标对象的对象多目标向量进而确定目标推荐视频的过程,数据计算量较小,从而极大的提高视频推荐的效率。
附图说明
图1是相关技术中的双塔结构的结构示意图;
图2是相关技术中的双重增强双塔结构的结构示意图;
图3是本申请实施例提供的视频推荐系统的一个可选的架构示意图;
图4是本申请实施例提供的电子设备的结构示意图;
图5是本申请实施例提供的视频推荐方法的一个可选的流程示意图;
图6是本申请实施例提供的视频推荐方法的另一个可选的流程示意图;
图7是本申请实施例提供的视频多目标向量索引创建方法的流程示意图;
图8是本申请实施例提供的对历史播放序列进行向量化处理的流程示意图;
图9是本申请实施例提供的向量拼接处理和多目标特征学习的流程示意图;
图10是本申请实施例提供的确定目标推荐视频的流程示意图;
图11是本申请实施例提供的视频召回模型的训练方法的流程示意图;
图12是本申请实施例提供的获取样本数据的流程示意图;
图13是本申请实施例提供的一种确定目标损失结果的流程示意图;
图14是本申请实施例提供的另一种确定目标损失结果的流程示意图;
图15是本申请实施例提供的基于多目标网络确定对象增强损失和视频增强损失的 流程示意图;
图16是本申请实施例提供的视频应用首页的界面图;
图17是本申请实施例提供的素材卡片的界面图;
图18是本申请实施例提供的计算流程示意图;
图19是本申请实施例提供的特征向量的生成过程示意图;
图20是本申请实施例提供的用户增强向量的生成过程示意图;
图21是本申请实施例提供的PLE网络的结构示意图;
图22是本申请实施例提供的确定相似度分数的流程示意图;
图23是本申请实施例提供的训练流程示意图;
图24是本申请实施例提供的筛选正负例的实现过程示意图;
图25是本申请实施例提供的关联特征的实现过程示意图;
图26是本申请实施例提供的MMOE网络的结构示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。除非另有定义,本申请实施例所使用的所有的技术和科学术语与属于本申请实施例的技术领域的技术人员通常理解的含义相同。本申请实施例所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
在解释本申请实施例的视频推荐方法之前,首先对相关技术中的视频推荐方法和视频召回模型进行说明。
图1是相关技术中的双塔结构的结构示意图,如图1所示,双塔结构是由用于生成用户目标向量的用户塔11,以及用于生成内容目标向量的内容塔12构成。在离线训练时,首先将用户的对象特征111输入至用户塔11,待推荐内容的内容特征121输入至内容塔12,然后计算用户塔11输出的用户目标向量112与内容塔12输出的内容目标向量122之间的内积,并基于该内积进行相似度计算,得到用户与待推荐内容之间的相似度结果,将相似度结果作为双塔结构的预测值;最后通过减小对应场景目标的损失函数,实现持续更新双塔结构的模型参数。另外,相关技术中还提出了双重增强双塔结构,如图2所示,双重增强双塔结构在用户塔21和内容塔22的输入层,分别生成用于拟合另一个塔信息的向量,该向量称为“增强向量”,即用户增强向量211和内容增强向量221。增强向量通过另一个塔的目标向量不断更新,并参与到目标向量的计算过程中。
虽然双重增强双塔结构可以在一定程度上解决用户与待推荐内容之间交叉不足的问题,但却引入了以下问题:用户增强向量的参数规模过大;双重增强双塔结构的塔结构不支持多目标的预测;增强向量无法同时拟合多个目标向量。
基于相关技术中的双重增强双塔结构存在的上述至少一个问题,本申请实施例提供一种视频推荐方法,本申请实施例提供的视频推荐方法中,首先,获取目标对象的对象特征向量、目标对象在预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引;然后,对历史播放序列进行向量化处理,得到目标对象的对象增强向量;并对对象特征向量和对象增强向量依次进行向量拼接处理和多目标特征学习,得到目标对象的对象多目标向量;最后,基于对象多目标向量和每一待推荐视 频的视频多目标向量索引,从待推荐视频库中确定对应于所述目标对象的目标推荐视频。如此,能够结合目标对象在多个维度下的信息对目标对象进行准确的分析,从而准确的进行视频召回;并且,由于对象增强向量是基于目标对象的历史播放序列生成的,历史播放序列至少包括目标对象对视频的播放记录,而播放记录的数量相对于视频应用中的目标对象的数量会明显下降,因此,能够极大的降低视频召回时的数据计算量,从而极大的提高视频推荐的效率。
下面说明本申请实施例的视频推荐设备的示例性应用,该视频推荐设备是用于实现视频推荐方法的电子设备。在一种实现方式中,本申请实施例提供的视频推荐设备(即电子设备)可以实施为终端,也可以实施为服务器。在一种实现方式中,本申请实施例提供的视频推荐设备可以实施为笔记本电脑,平板电脑,台式计算机,移动电话,便携式音乐播放器,个人数字助理,专用消息设备,便携式游戏设备,智能机器人,智能家电和智能车载设备等任意的具备视频数据处理功能的终端;在另一种实现方式中,本申请实施例提供的视频推荐设备还可以实施为服务器,其中,服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(CDN,Content Delivery Network)、以及大数据和人工智能平台等基础云计算服务的云服务器。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例中不做限制。下面,将说明视频推荐设备实施为服务器时的示例性应用。
参见图3,图3是本申请实施例提供的视频推荐系统的一个可选的架构示意图,本申请实施例以视频推荐方法应用于任一视频应用为例进行说明。在视频应用中包括多个精选页和各垂直频道,用户在精选页和各个垂直频道下滑后可以看到无目标区,本申请实施例的视频推荐方法就可以应用于对无目的区所展示视频的推荐。本申请实施例中,视频推送系统中至少包括终端100、网络200和服务器300。其中,服务器300可以是视频应用的服务器。服务器300可以构成本申请实施例的视频推荐设备。终端100通过网络200连接服务器300,网络200可以是广域网或者局域网,又或者是二者的组合。
本申请实施例中,在进行视频推荐时,终端100通过视频应用的客户端接收用户的浏览操作(例如,该浏览操作可以是在任一垂直频道的下拉操作),并响应于浏览操作,获取用户的对象特征和历史播放序列,将对象特征和历史播放序列封装至视频推荐请求中,终端100通过网络200将视频推荐请求发送给服务器300。服务器300在接收到视频推荐请求之后,响应于视频推荐请求,获取用户的对象特征和历史播放序列,并基于对象特征获取用户的对象特征向量,以及获取待推荐视频库中的每一待推荐视频的视频多目标向量索引;然后,对历史播放序列进行向量化处理,得到目标对象的对象增强向量;再对对象特征向量和对象增强向量依次进行向量拼接处理和多目标特征学习,得到目标对象的对象多目标向量;之后,基于对象多目标向量和每一待推荐视频的视频多目标向量索引,从待推荐视频库中确定目标推荐视频。服务器300在得到目标推荐视频之后,将目标推荐视频发送给终端100,以使得终端100在当前界面的无目的区向用户展示目标推荐视频。
在另一些实施例中,视频推荐设备还可以实施为终端,也就是说,以终端为执行主体实现本申请实施例的视频推荐方法。在实现的过程中,终端通过视频应用的客户端获取用户的浏览操作,并响应于浏览操作,获取用户的对象特征向量、预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引;然后,终端采用本申请实施例的视频推荐方法进行目标推荐视频的召回,并在得到目标推荐视频之后,在当前界面的无目的区向用户展示目标推荐视频。
本申请实施例所提供的视频推荐方法还可以基于云平台并通过云技术来实现,例如,上述服务器300可以是云端服务器。通过云端服务器对历史播放序列进行向量化处理,或者,通过云端服务器对对象特征向量和对象增强向量依次进行向量拼接处理和多目标特征学习,以及,通过云端服务器基于对象多目标向量和每一待推荐视频的视频多目标向量索引,从待推荐视频库中确定目标推荐视频。
在一些实施例中,还可以具有云端存储器,可以将待推荐视频库以及每一待推荐视频的视频多目标向量索引存储至云端存储器中,或者,还可以将用户的对象特征向量和预设历史时间段内的历史播放序列存储至云端存储器中,或者,还可以将目标推荐视频存储至云端存储器中。这样,在接收到视频推荐请求时,可以从云端存储器中获取相应的信息进行目标推荐视频的召回,从而提高目标推荐视频召回的效率,进而提高视频推荐的效率。
这里需要说明的是,云技术(Cloud technology)是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来,实现数据的计算、储存、处理和共享的一种托管技术。云技术是基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、应用技术等的总称,可以组成资源池,按需所用,灵活便利。云计算技术将变成重要支撑。技术网络系统的后台服务需要大量的计算、存储资源,如视频网站、图片类网站和更多的门户网站。伴随着互联网行业的高度发展和应用,将来每个物品都有可能存在自己的识别标志,都需要传输到后台系统进行逻辑处理,不同程度级别的数据将会分开处理,各类行业数据皆需要强大的系统后盾支撑,其可以通过云计算来实现。
图4是本申请实施例提供的电子设备的结构示意图,图4所示的电子设备可以是视频推荐设备,视频推荐设备包括:至少一个处理器310、存储器350、至少一个网络接口320和用户接口330。视频推荐设备中的各个组件通过总线系统340耦合在一起。可理解,总线系统340用于实现这些组件之间的连接通信。总线系统340除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图4中将各种总线都标为总线系统340。
处理器310可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
用户接口330包括使得能够呈现媒体内容的一个或多个输出装置331,以及一个或多个输入装置332。
存储器350可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器、硬盘驱动器和光盘驱动器等。存储器350可选地包括在物理位置上远离处理器310的一个或多个存储设备。存储器350包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器350旨在包括任意适合类型的存储器。在一些实施例中,存储器350能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统351,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;网络通信模块352,用于经由一个或多个(有线或无线)网络接口320到达其他计算设备,示例性的网络接口320包括:蓝牙、无线相容性认证(WiFi)和通用串行总线(USB,Universal Serial Bus)等;输入处理模块353,用于对一个或多个来自一个或多个输入装 置332之一的一个或多个用户输入或互动进行检测以及翻译所检测的输入或互动。
在一些实施例中,本申请实施例提供的装置可采用软件方式实现,图4示出了存储在存储器350中的一种视频推荐装置354,该视频推荐装置354可以是电子设备中的视频推荐装置,其可以是程序和插件等形式的软件,包括以下软件模块:获取模块3541、向量化处理模块3542、多目标处理模块3543、确定模块3544和视频推荐模块3545,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分。将在下文中说明各个模块的功能。
在另一些实施例中,本申请实施例提供的装置可以采用硬件方式实现,作为示例,本申请实施例提供的装置可以是采用硬件译码处理器形式的处理器,其被编程以执行本申请实施例提供的视频推荐方法,例如,硬件译码处理器形式的处理器可以采用一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件。
本申请各实施例提供的视频推荐方法可以由电子设备来执行,其中,该电子设备可以是服务器也可以是终端,即本申请各实施例的视频推荐方法可以通过服务器来执行,也可以通过终端来执行,或者也可以通过服务器与终端之间交互执行。
图5是本申请实施例提供的视频推荐方法的一个可选的流程示意图,下面将结合图5示出的步骤进行说明,需要说明的是,图5中的视频推荐方法是以服务器作为执行主体为例来说明的,如图5所示,方法包括以下步骤S101至步骤S105:
步骤S101,获取目标对象的对象特征向量、预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引。
这里,对象特征向量是对目标对象的对象特征进行向量化处理之后得到的。目标对象的对象特征包括但不限于以下至少之一:目标对象的年龄、性别、学历、标签、视频浏览记录和兴趣等。可以通过特征提取实现对对象特征的向量化处理。
向量化处理是指通过对预设的特征向量表进行检索,从预设的特征向量表中检索到与目标对象的每一对象特征对应的特征向量。在该预设的特征向量表中可以预先存储有每一对象特征对应的特征向量。在确定出目标对象以及目标对象的多个对象特征之后,可以以多个对象特征为检索索引,从预设的特征向量表中查询到对应的特征向量。本申请实施例中,在进行向量化处理时,可以通过预设的特征向量表进行查询,在预设的特征向量表中查询与每一对象特征对应的特征向量,得到目标对象的对象特征向量。在实现的过程中,由于对象特征包括多个特征信息,可以从预设的特征向量表中查询每一特征信息对应的特征向量,然后将全部特征信息对应的特征向量进行拼接,形成多维度的对象特征向量。
预设的特征向量表可以包括两维,第一维是特征标识,第二维是每个特征表示对应的向量,不同特征的向量表示是相互独立的。也就是说,对应于目标对象(例如用户),可以具有对象特征向量表;对应于视频,可以具有视频特征向量表。可以分别基于对象特征向量表和视频特征向量表,查询目标对象的对象特征向量和待推荐视频的视频特征向量。
本申请实施例中,由于对象特征不仅包括离散特征还包括连续特征,因此,在进行向量化处理时,针对于离散特征,可以直接查询特征向量表得到离散特征的特征向量;对于连续特征,可以先对连续特征进行离散化处理,得到离散化特征,然后查询特征向量表得到离散化特征对应的特征向量。这里,离散化处理可以是采用特定的等频划分区间对连续特征进行等频划分,得到多个离散化特征。
本申请实施例中,可以预先构建特征向量表用于特征向量的查询,预先构建的特征向量表可以存储于预设存储单元中,在进行向量化处理时,从预设存储单元中获取特征向量表进行特征向量查询。在一些实施例中,还可以根据视频召回模型的更新以及特征信息的更新,对特征向量表进行更新,例如,当存在新的特征信息时,获取该特征信息的特征向量,并将特征向量更新至特征向量表中。
历史播放序列是指目标对象(例如用户)在预设历史时间段内播放过的视频序列,历史播放序列中包括历史播放视频的历史视频标识和每一历史播放视频的历史播放时长。
待推荐视频库中包括多个待推荐视频,待推荐视频包括用户可能感兴趣的视频,也包括用户不感兴趣的视频,在待推荐视频库中可以包括海量的候选视频(即待推荐视频库中的候选视频的数量大于视频数量阈值)。本申请实施例的视频推荐方法,就是在这海量的候选视频中准确的选择出用户感兴趣的视频,将用户感兴趣的视频作为目标推荐视频推荐给用户。
每一待推荐视频具有一视频多目标向量索引,视频多目标向量索引是用于查询该待推荐视频的视频多目标向量的索引信息。本申请实施例中,可以预先生成每一待推荐视频的视频多目标向量,在生成视频多目标向量之后,可以将全部待推荐视频的视频多目标向量存储至预设的视频多目标向量存储单元中。并且,在存储视频多目标向量时,还可以生成每一视频多目标向量对应的索引信息,该索引信息用于索引视频多目标向量的存储位置,从而基于该视频多目标向量索引即可获取到视频多目标向量。
本申请实施例中,可以在视频推荐之前生成每一待推荐视频的视频多目标向量,或者在生成待推荐视频时同时生成该待推荐视频的视频多目标向量,并创建该视频多目标向量的视频多目标向量索引,通过该视频多目标向量索引可以在线实时查询待推荐视频的视频多目标向量,而无需在每次视频推荐时均进行视频多目标向量的生成,即无需重复生成视频多目标向量,从而极大的降低了视频推荐时的数据计算量,提高视频推荐效率。
步骤S102,对历史播放序列进行向量化处理,得到目标对象的对象增强向量。
这里,对历史播放序列进行向量化处理时,可以过预设的特征向量表进行查询,在预设的特征向量表中查询与历史播放序列中的每一序列信息对应的特征向量,得到目标对象的对象增强向量。在实现的过程中,由于历史播放序列中包括多个序列信息,每一序列信息包括历史播放视频的视频标识和每一历史播放视频的播放时长,因此可以查询与视频标识对应的特征向量和与播放时长对应的特征向量。
本申请实施例中,在对历史播放序列进行向量化处理时,可以通过以下方式实现:首先,对历史播放序列中每个历史视频标识检索预设的特征向量表,也就是说,基于每一历史视频标识,在预设的特征向量表中进行检索,得到历史视频向量集合,历史视频向量集合中的历史视频向量的数量与历史播放序列中的历史视频标识的数量一致。然后,对历史播放序列中的历史播放时长进行总时长统计,得到历史播放总时长,也就是说,统计历史播放序列的总时长,可以将全部序列信息中的历史播放时长求和得到历史播放总时长,再对历史播放序列中每个历史播放时长除以该历史播放总时长,得到每一历史播放时长对应的归一化播放时长,将归一化播放时长作为视频向量权重,其中,视频向量权重的数量也与历史播放序列中历史视频标识的数量一致。最后,对历史视频向量集合中每个历史视频向量乘以对应视频向量权重,得到视频加权向量集合,同时,合并视频加权向量集合中的全部视频加权向量,得到的就是目标对象的对象增强向量,即得到用户增强向量。这里,合并视频加权向量集合中的全部视频加权向量,可以是指将视频加权向量集合中的全部视频加权向量进行拼接处理,得到一个多维度的用户增强向量, 其中,用户增强向量的维度等于全部视频加权向量的维度之和。
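A minimal sketch of the duration-weighted aggregation described in this paragraph, assuming the feature vector table is available as a plain Python dictionary; the names `embedding_table` and `play_sequence` are illustrative and not taken from the original:

```python
import numpy as np

def build_object_enhancement_vector(play_sequence, embedding_table):
    """play_sequence: list of (video_id, play_duration) pairs from the preset
    historical time period; embedding_table: dict video_id -> np.ndarray."""
    # One historical video vector per video id in the playback sequence.
    video_vectors = [embedding_table[vid] for vid, _ in play_sequence]
    # Total playback duration over the whole sequence.
    total_duration = float(sum(dur for _, dur in play_sequence))
    # The normalized duration of each video is used as its vector weight.
    weights = [dur / total_duration for _, dur in play_sequence]
    # Weight each historical video vector, then splice the weighted vectors
    # into one higher-dimensional object enhancement vector.
    weighted = [w * v for w, v in zip(weights, video_vectors)]
    return np.concatenate(weighted)
```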
步骤S103,对对象特征向量和对象增强向量依次进行向量拼接处理和多目标特征学习,得到目标对象的对象多目标向量。
这里,对对象特征向量和对象增强向量依次进行向量拼接处理和多目标特征学习,可以是对对象特征向量和对象增强向量先进行向量拼接处理,得到对象拼接向量;然后,对对象拼接向量再进行多目标特征学习。
本申请实施例中,对对象特征向量和对象增强向量进行向量拼接处理,可以得到对象拼接向量,对象拼接向量是指融合了目标对象的对象特征向量和对象增强向量的拼接向量。对象拼接向量的维度等于对象特征向量和对象增强向量的维度之和。
在进行向量拼接处理之后,对对象拼接向量进行多目标特征学习。这里,多目标特征学习是指通过预先训练的多目标神经网络,学习对象拼接向量在不同目标维度下的对象目标向量。其中,不同目标维度包括但不限于:与用户点击行为相关的点击维度、与用户的浏览时长相关的时长维度。本申请实施例中,通过多目标神经网络可以学习目标对象在点击维度下的点击目标向量和在时长维度下的时长目标向量,从而得到目标对象的对象多目标向量。
在一些实施例中,多目标神经网络可以实现为PLE网络,PLE网络主要包括:用于学习多个目标的专家网络、用于学习不同专家网络间的共用信息的共享网络和用于计算融合多个网络的输出向量时,每个向量对应权重的门网络。举例来说,如果需要学习点击和时长两个目标的话,就需要两组专家网络;而不管需要学习的目标有多少,共享网络固定为一个;门网络的最后一层输出向量的长度与待确定权重的数量相同。
步骤S104,基于对象多目标向量和每一待推荐视频的视频多目标向量索引,从待推荐视频库中确定目标推荐视频。
这里,可以基于待推荐视频的视频多目标向量索引获取待推荐视频的视频多目标向量,再计算对象多目标向量与视频多目标向量之间的内积,将计算得到的内积确定为目标对象与相应的待推荐视频之间的相似度分值。然后,可以基于相似度分值从待推荐视频库中确定目标推荐视频。
在确定目标推荐视频时,一种实现方式中,可以选择相似度分值大于分值阈值的待推荐视频为目标推荐视频;另一种实现方式中,可以基于相似度分值对待推荐视频库中的待推荐视频进行排序,形成待推荐视频序列,之后,从待推荐视频序列中选择前N个待推荐视频作为目标推荐视频,N为大于或等于1的整数。
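A brief sketch of the inner-product scoring and top-N selection described in the two paragraphs above, using a brute-force NumPy computation in place of a real nearest-neighbor index; the array names and `n` are illustrative:

```python
import numpy as np

def recall_top_n(object_multi_target_vector, video_multi_target_vectors, n):
    """object_multi_target_vector: shape (d,);
    video_multi_target_vectors: shape (num_videos, d), fetched via the index.
    Returns the indices of the n highest-scoring candidates (assumes
    n < number of candidate videos)."""
    # The inner product is used directly as the similarity score.
    scores = video_multi_target_vectors @ object_multi_target_vector
    # Keep the n highest-scoring candidates as the target recommended videos.
    top = np.argpartition(-scores, n)[:n]
    return top[np.argsort(-scores[top])]
```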
步骤S105,基于目标推荐视频对目标对象进行视频推荐。
在服务器得到目标推荐视频之后,可以将目标推荐视频发送给终端,通过终端将目标推荐视频推荐给目标对象。本申请实施例中,可以在目标对象的终端上显示目标推荐视频,例如,可以在视频应用的客户端上显示目标推荐视频。
本申请实施例提供的视频推荐方法,对目标对象的历史播放序列进行向量化处理,得到目标对象的对象增强向量,从而基于目标对象的对象增强向量得到目标对象的对象多目标向量,对象多目标向量是融合了对象增强向量的对象特征向量。如此,基于对象多目标向量和每一待推荐视频的视频多目标向量索引,从待推荐视频库中确定目标推荐视频时,能够结合目标对象在多个维度下的信息对目标对象进行准确的分析,从而准确的进行视频召回;并且,由于对象增强向量是基于目标对象的历史播放序列生成的,历史播放序列为目标对象对视频的播放记录,播放记录的数量相对于视频应用中的目标对象的数量会明显下降,因此,能够极大的降低视频召回时的数据计算量,从而极大的提高视频推荐的效率。
在一些实施例中,视频推荐系统中至少包括终端和服务器,终端上安装有视频应用, 当用户在视频应用中进行下拉刷新和页面下拉操作时,可以采用本申请实施例的方法召回目标推荐视频,并将目标推荐视频显示在视频应用的当前界面上,实现对用户的视频推荐。
图6是本申请实施例提供的视频推荐方法的另一个可选的流程示意图,如图6所示,方法包括以下步骤S201至步骤S210:
步骤S201,终端接收目标对象的浏览操作。
这里,浏览操作可以是在视频应用的任一垂直频道的下拉操作。视频应用的客户端可以接收用户的下拉操作,得到浏览操作。
步骤S202,终端响应于浏览操作,获取目标对象的对象特征和历史播放序列。
这里,当客户端获取到浏览操作时,获取目标对象的对象特征和历史播放序列,其中,对象特征可以是用户在视频应用中进行注册时和使用过程中输入的信息,视频应用的服务器存储这些信息,并将这些作为目标对象的对象特征。在用户使用视频应用的过程中,视频应用的服务器采集用户的播放记录,每一条播放记录中记录有播放视频的视频标识、播放时间和播放时长等信息。可以基于播放时间选择出在预设历史时间段内的播放记录,形成历史播放序列。
预设历史时间段可以是在当前时间之前的特定时间段,例如,当前时间之前的一个月或半年等预设历史时间段。
步骤S203,终端将对象特征和历史播放序列封装至视频推荐请求中。
步骤S204,终端将视频推荐请求发送给服务器。
步骤S205,服务器响应于视频推荐请求,基于对象特征获取对象特征向量,获取预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引。
在一些实施例中,在步骤S205获取待推荐视频库中的每一待推荐视频的视频多目标向量索引之前,还可以预先生成待推荐视频库中的每一待推荐视频的视频多目标向量,并创建每一待推荐视频的视频多目标向量索引,存储该视频多目标向量索引。这样,在后续对目标推荐视频进行召回时,可以基于视频多目标向量索引查询视频多目标向量,而无需再次生成待推荐视频的视频多目标向量,从而能够极大的提高目标推荐视频召回的效率。基于此,本申请实施例提供一种视频多目标向量索引的创建方法,图7是本申请实施例提供的视频多目标向量索引创建方法的流程示意图,如图7所示,视频多目标向量索引创建方法可以由服务器来执行,方法包括以下步骤S301至步骤S304:
步骤S301,基于每一待推荐视频的视频标识检索预设的特征向量表,对应得到每一待推荐视频的视频增强向量。
这里,可以通过每一待推荐视频的视频标识检索预设的特征向量表,得到每一待推荐视频的视频增强向量。
预设的特征向量表包括两维,第一维是特征标识,第二维是每个特征对应的向量表示,不同特征的向量表示相互独立。也就是说,对应于目标对象(例如用户),可以具有对象特征向量表;对应于视频,可以具有视频特征向量表。可以分别基于对象特征向量表和视频特征向量表,查询目标对象的对象特征向量和待推荐视频的视频特征向量。
步骤S302,对每一待推荐视频的视频特征向量和视频增强向量进行向量拼接处理,对应得到每一待推荐视频的视频拼接向量。
本申请实施例中,视频特征向量的生成方式与上述目标对象的对象特征向量的生成方式相同,可以基于视频特征向量表进行查询,得到待推荐视频的视频特征向量。
这里,可以将视频特征向量和视频增强向量进行向量拼接处理,在进行向量拼接处理时,可以是将视频特征向量和视频增强向量连接成一个具有更高维度的向量,即视频 拼接向量。其中,视频拼接向量的维度等于视频特征向量和视频增强向量的维度之和。
步骤S303,对每一待推荐视频的视频拼接向量进行多目标特征学习,对应得到每一待推荐视频的视频多目标向量。
在一些实施例中,在对每一待推荐视频的视频拼接向量进行多目标特征学习时,可以通过以下方式实现:首先,针对每一待推荐视频,通过多目标神经网络对待推荐视频的视频拼接向量进行多目标特征学习,得到该待推荐视频在多个目标维度下的视频目标向量;然后,获取每个目标维度下的目标权重;采用目标权重,对每个目标维度下的视频目标向量分别进行加权计算,得到加权视频目标向量;最后,对多个目标维度下的加权视频目标向量进行拼接处理,得到待推荐视频的视频多目标向量。
这里,以上述多个目标维度包括点击维度和时长维度为例进行说明。本申请实施例中,通过多目标神经网络可以输出待推荐视频在点击维度下的视频点击目标向量,以及,在时长维度下的视频时长目标向量(其中,视频点击目标向量和视频时长目标向量构成待推荐视频的视频目标向量),其中,对于待推荐视频,分别具有点击维度下的点击目标权重和时长维度下的时长目标权重。在进行拼接处理时,是分别计算点击目标权重与视频点击目标向量之间的乘积、时长目标权重与视频时长目标向量之间的乘积,然后将两个乘积加和,得到待推荐视频的视频多目标向量。也就是说,通过点击目标权重和时长目标权重,对视频点击目标向量和视频时长目标向量进行加权求和,得到视频多目标向量。
需要说明的是,点击目标权重和时长目标权重均属于视频召回模型中的参数,将在下文中对点击目标权重和时长目标权重中的更新方式进行说明。
步骤S304,创建与每一所述待推荐视频的视频多目标向量对应的视频多目标向量索引。
本申请实施例中,视频多目标向量索引是用于查询该待推荐视频的视频多目标向量的索引信息。在得到每一待推荐视频的视频多目标向量之后,可以将全部待推荐视频的视频多目标向量存储至预设的视频多目标向量存储单元中。在存储视频多目标向量时,还可以生成每一视频多目标向量对应的索引信息,该索引信息用于索引视频多目标向量的存储位置,从而基于该视频多目标向量索引即可获取到视频多目标向量。
本申请实施例中,通过视频多目标向量索引可以在线实时查询待推荐视频的视频多目标向量,而无需在每次视频推荐时均进行视频多目标向量的生成,即无需重复生成视频多目标向量,从而极大的降低了视频推荐时的数据计算量,提高视频推荐效率。
步骤S206,服务器对历史播放序列进行向量化处理,得到目标对象的对象增强向量。
在一些实施例中,参见图8,图8示出了步骤S206可以通过以下步骤S2061至步骤S2066实现:
步骤S2061,获取历史播放序列中的每一历史播放视频的历史视频标识和历史播放时长。
步骤S2062,基于每一历史视频标识,在预设的特征向量表中进行检索,得到历史视频向量集合。
历史视频向量集合中的历史视频向量的数量与历史播放序列中的历史视频标识的数量相同。
步骤S2063,对历史播放序列中的历史播放时长进行总时长统计,得到历史播放总时长。
这里,可以计算历史播放序列中全部历史播放视频的历史播放时长之和,得到历史播放总时长。
步骤S2064,基于历史播放总时长,对每一历史播放时长进行时长归一化处理,得到每一历史播放视频的归一化播放时长,并将归一化播放时长确定为相应历史播放视频的视频向量权重。
这里,归一化处理可以是通过历史播放总时长,对每一历史播放时长进行归一化处理。在实现的过程中,可以用每一历史播放时长除以历史播放总时长,将计算得到的商值,确定为每一历史播放视频的归一化播放时长。在得到每一历史播放视频对应的归一化播放时长之后,将该归一化播放时长确定为该历史播放视频的视频向量权重。
步骤S2065,基于视频向量权重,对历史视频向量集合中的每一历史视频向量进行加权处理,得到视频加权向量集合。
历史视频向量集合中包括全部历史播放视频对应的历史视频向量。对历史视频向量集合中的每一历史视频向量乘以对应的视频向量权重,得到多个视频加权向量,该多个视频加权向量构成视频加权向量集合。
步骤S2066,对视频加权向量集合中的视频加权向量进行合并处理,得到目标对象的对象增强向量。
这里,合并处理是指将视频加权向量集合中全部视频加权向量拼接成一个具有更高维度的向量,该更高维度的向量即对象增强向量。其中,对象增强向量的维度等于全部视频加权向量的维度之和。
步骤S207,服务器对对象特征向量和对象增强向量依次进行向量拼接处理和多目标特征学习,得到目标对象的对象多目标向量。
在一些实施例中,参见图9,图9示出了步骤S207可以通过以下步骤S2071至步骤S2073实现:
步骤S2071,对对象特征向量和对象增强向量进行向量拼接处理,得到对象拼接向量。
对象拼接向量是指目标对象融合了对象特征向量和对象增强向量的拼接向量。对象拼接向量的维度等于对象特征向量和对象增强向量的维度之和。
步骤S2072,通过多目标神经网络对对象拼接向量进行多目标特征学习,得到目标对象在多个目标维度下的对象目标向量。
多目标特征学习是指通过预先训练的多目标神经网络学习对象拼接向量在不同目标维度下的对象目标向量。其中不同目标维度包括但不限于:与用户点击行为相关的点击维度、与用户的浏览时长相关的时长维度。
步骤S2073,对多个目标维度下的对象目标向量进行拼接处理,得到目标对象的对象多目标向量。
本申请实施例中,通过多目标神经网络可以学习目标对象的点击目标向量和时长目标向量,然后,将点击目标向量和时长目标向量进行拼接,即得到目标对象的对象多目标向量。
步骤S208,服务器基于对象多目标向量和每一待推荐视频的视频多目标向量索引,从待推荐视频库中确定目标推荐视频。
在一些实施例中,参见图10,图10示出了步骤S208可以通过以下步骤S2081至步骤S2084实现:
步骤S2081,基于视频多目标向量索引获取每一待推荐视频的视频多目标向量。
这里,可以基于视频多目标向量索引确定出每一待推荐视频的视频多目标向量的存储位置,然后从该存储位置获取存储的视频多目标向量。
步骤S2082,确定对象多目标向量与每一待推荐视频的视频多目标向量之间的内积,并将内积确定为目标对象与相应待推荐视频之间的相似度分值。
本申请实施例中,可以计算对象多目标向量与待推荐视频库中的每一待推荐视频的视频多目标向量之间的内积,从而得到目标对象与每一待推荐视频之间的相似度分值。
步骤S2083,根据相似度分值,从待推荐视频库中选择特定数量的待推荐视频。
步骤S2084,将选择出的特定数量的待推荐视频,确定为对应于目标对象的目标推荐视频。
步骤S209,服务器将目标推荐视频推荐给终端。
步骤S210,终端在当前界面上显示目标推荐视频。
本申请实施例提供的视频推荐方法,基于对象多目标向量和每一待推荐视频的视频多目标向量索引,从待推荐视频库中确定目标推荐视频时,能够结合目标对象在多个维度下的信息对目标对象进行准确的分析,从而准确的进行视频召回;并且,由于对象增强向量是基于目标对象的历史播放序列生成的,历史播放序列为目标对象对视频的播放记录,播放记录的数量相对于视频应用中的目标对象的数量会明显下降,因此,能够极大的降低视频召回时的数据计算量,从而极大的提高视频推荐的效率。
在一些实施例中,上述视频推荐方法可以通过视频召回模型来实现;视频召回模型包括对象塔和视频塔。其中,对象塔是指用于确定对象多目标向量(即用户多目标向量)的神经网络结构,视频塔是指用于确定视频多目标向量的神经网络结构。
图11是本申请实施例提供的视频召回模型的训练方法的流程示意图,视频召回模型的训练方法可以通过模型训练装置来执行。其中,模型训练装置可以是视频推荐设备(即电子设备)中的装置,即模型训练装置可以是服务器也可以是终端;或者,模型训练装置也可以是独立于视频推荐设备的另一设备,即模型训练装置是区别于上述用于实现视频推荐方法的服务器和终端之外的其他电子设备。如图11所示,视频召回模型的训练方法包括以下步骤S401至步骤S405:
步骤S401,模型训练装置获取样本数据。
这里,样本数据包括:样本对象特征、样本视频特征和多个目标维度下的目标参数。其中,样本对象特征和样本视频特征包括但不限于用户标识、用户的视频推荐请求对应的请求标识,样本视频特征包括但不限于视频标识,多个目标维度下的目标参数包括但不限于:用户是否点击样本视频和样本视频的播放时长等。
在一些实施例中,参见图12,图12示出了步骤S401可以通过以下步骤S4011至步骤S4014实现:
步骤S4011,获取原始样本数据。
这里,原始样本数据包括多个真实正例样本和多个真实负例样本。真实正例样本是指“真实曝光且播放行为数据”对应的样本数据,真实负例样本是指“真实曝光未播放行为数据”对应的样本数据。
步骤S4012,基于多个真实正例样本构建随机负例样本,并从多个真实负例样本中删减部分数量的真实负例样本。
这里,在构建随机负例样本时,可以从整个视频池中,对于每个正例样本,提取正例样本中的用户标识和请求标识,再随机筛选n个视频标识,对该n个视频标识进行拼接,形成负例样本,即得到随负例样本。
删减部分数量的真实负例样本即减少真实负例样本的数量,在减少真实负例样本的数量时,可以随机删减部分数量的真实负例样本,或者,从全部真实负例样本中,随机挑选出部分真实负例样本。
本申请实施例中,在构建随机负例样本和减少真实负例样本的数量之后,需要保证真实正例样本的数量、删减部分数量后的真实负例样本的数量与所述随机负例样本的数量之间具有预设比例关系。这里预设比例关系可以根据视频召回模型的模型参数确定, 例如,预设比例关系可以是1:1:4,也就是说,经过随机采样后的真实负例数量与真实正例数量相同,并对每条正例样本随机筛选4条视频作为随机负例样本。
步骤S4013,将真实正例样本确定为正例样本,将减少数量后的真实负例样本和随机负例样本确定为负例样本。
这里,真实正例样本为模型训练的正例样本,删减部分数量后的真实负例样本和随机负例样本共同构成模型训练的负例样本。
步骤S4014,基于对象标识和视频标识,对正例样本和负例样本进行特征关联,得到样本数据。
这里,根据筛选的正例样本和负例样本,将对象特征通过用户标识关联,视频特征通过视频标识关联,从而得到样本对象特征和样本视频特征,并在后续的模型训练过程中,将样本对象特征输入至对象塔(即用户塔)中,将样本视频特征输入至视频塔中。需要说明的是,构建的样本对象特征中即有正例样本,也有负例样本;构建的样本视频特征中既有正例样本,也有负例样本。
步骤S402,模型训练装置将样本对象特征输入至视频召回模型的对象塔中,通过对象塔预测样本对象在多个目标维度下的样本对象目标向量。
这里,对象塔可以对样本对象特征进行向量化处理,得到样本对象特征向量。对象塔还可以生成样本对象增强向量,并对样本对象特征向量和样本对象增强向量依次进行向量拼接处理和多目标特征学习,得到样本对象在多个目标维度下的样本对象目标向量。这样,通过对样本对象在多个目标维度下的样本对象目标向量进行拼接处理,即可得到样本对象的样本对象多目标向量。
步骤S403,模型训练装置将样本视频特征输入至视频召回模型的视频塔中,通过视频塔预测样本视频在多个目标维度下的样本视频目标向量。
这里,视频塔可以对样本视频特征进行向量化处理,得到样本视频特征向量。视频塔还可以生成样本视频增强向量,并对样本视频特征向量和样本视频增强向量依次进行向量拼接处理和多目标特征学习,得到样本视频在多个目标维度下的样本视频目标向量。这样,通过对样本视频在多个目标维度下的样本视频目标向量进行拼接处理,即可得到样本视频的样本视频多目标向量。
之后,通过计算样本对象多目标向量与样本视频多目标向量之间的内积,即可确定出样本对象与样本视频之间的样本相似度分值。
步骤S404,模型训练装置将样本对象目标向量、样本视频目标向量和目标参数输入至目标损失模型中,通过目标损失模型进行损失计算,得到目标损失结果。
在一种实现方式中,样本对象目标向量包括对象点击目标向量;样本视频目标向量包括视频点击目标向量;目标参数包括点击目标值。参见图13,图13示出了步骤S404可以通过以下步骤S4041a至步骤S4044a实现:
步骤S4041a,通过目标损失模型确定对象点击目标向量与视频点击目标向量之间的向量内积。
步骤S4042a,基于向量内积和预设激活函数,确定在点击维度下的预测值。
这里,可以将向量内积输入至预设激活函数中,通过预设激活函数对向量内积进行计算,得到在点击维度下的预测值。预设激活函数可以将向量内积映射成0到1之间的任意值,即预设激活函数能够根据输入的向量内积得到一个取值在0到1之间的映射值,该映射值构成点击维度下的预测值。
在一种实现方式,预设激活函数可以是sigmoid激活函数,这样,可以将向量内积输入至sigmoid激活函数中,通过sigmoid激活函数计算在点击维度下的预测值。
步骤S4043a,通过对数损失函数,确定点击维度下的预测值与点击目标值之间的对 数损失。
步骤S4044a,将对数损失确定为目标损失结果。
在另一种实现方式中,样本对象目标向量包括对象时长目标向量;样本视频目标向量包括视频时长目标向量;目标参数包括时长目标值。参见图14,图14示出了步骤S404可以通过以下步骤S4041b至步骤S4047b实现:
步骤S4041b,按照预设的截断区间数量,对时长目标值进行截断处理,得到具有截断区间数量的时长截断值。
举例来说,预设截断区间数量为100,时长目标值为t,则可以将t除以100,得到100个时长截断值。
步骤S4042b,基于截断区间数量的时长截断值,确定目标截断值。
这里,可以在100个时长截断值中,将具有最小时长的时长截断值,确定为目标截断值。
步骤S4043b,基于目标截断值,对每一时长截断值进行归一化处理,得到归一化时长截断值。
这里,可以采用MinMax函数对每一时长截断值进行归一化处理。可以将每一时长截断值除以目标截断值,得到该时长截断值对应的归一化时长截断值。
步骤S4044b,确定对象时长目标向量与视频时长目标向量之间的向量内积。
步骤S4045b,基于向量内积和预设激活函数,确定在时长维度下的预测值。
这里,可以将对象时长目标向量与视频时长目标向量之间的向量内积,输入至预设激活函数中,通过预设激活函数对该向量内积进行计算,得到在时长维度下的预测值。在一种实现方式,这里的预设激活函数也可以是sigmoid激活函数,这样,可以将向量内积输入至sigmoid激活函数中,通过sigmoid激活函数计算在时长维度下的预测值。
步骤S4046b,通过均方差损失函数,确定时长维度下的预测值与归一化时长截断值之间的均方差损失。
步骤S4047b,将均方差损失确定为目标损失结果。
在一些实施例中,视频召回模型还包括多目标网络,可以通过多目标网络实现对样本对象的对象增强损失和样本视频的视频增强损失进行计算。图15是本申请实施例提供的基于多目标网络确定对象增强损失和视频增强损失的流程示意图,如图15所示,包括以下步骤S501至步骤S503:
步骤S501,当样本数据为正样本时,模型训练装置通过多目标网络输出在多个目标维度下的目标增强向量,其中,所述在多个目标维度下的目标增强向量包括对应于所述对象塔的目标增强向量和对应于所述视频塔的目标增强向量。
步骤S502,在每一目标维度下,模型训练装置确定对应于对象塔的目标增强向量与视频塔输出的样本视频目标向量之间的第一均方差,或者,确定对应于视频塔的目标增强向量与对象塔输出的样本对象目标向量之间的第二均方差。
步骤S503,模型训练装置将第一均方差和第二均方差,分别确定为样本对象的对象增强损失和样本视频的视频增强损失。
这里,对象增强损失和视频增强损失构成目标损失结果中的部分损失结果。
本申请实施例中,目标损失结果包括点击维度下的对数损失、时长维度下的均方差损失、点击维度下的对象增强损失、点击维度下的视频增强损失、时长维度下的对象增强损失和时长维度下的视频增强损失。在一些实施例中,还可以对这多个损失进行损失融合处理,并基于损失融合处理后的融合损失结果对视频召回模型进行再次训练和模型参数修正。
在实现的过程中,可以获取点击维度下的对数损失、时长维度下的均方差损失、点 击维度下的对象增强损失、点击维度下的视频增强损失、时长维度下的对象增强损失和时长维度下的视频增强损失分别对应的损失权重;并获取预设的正则项;然后,基于损失权重和正则项,对点击维度下的对数损失、时长维度下的均方差损失、点击维度下的对象增强损失、点击维度下的视频增强损失、时长维度下的对象增强损失和时长维度下的视频增强损失进行损失融合处理,得到融合损失结果;最后,基于融合损失结果对对象塔和视频塔中的参数进行修正,得到训练后的视频召回模型。
步骤S405,模型训练装置基于目标损失结果,对对象塔和视频塔中的参数进行修正,得到训练后的视频召回模型。
在一些实施例中,在对每一待推荐视频的视频拼接向量进行多目标特征学习时,会获取每一目标维度下的目标权重,采用目标权重,对每个目标维度下的视频目标向量分别进行加权计算。
这里,对上述目标权重的获取过程进行说明:可以将样本对象目标向量和样本视频目标向量输入至视频召回模型的推荐预测层中,通过推荐预测层确定样本对象对样本视频的点击参数和视频时长参数;然后,根据点击参数,确定视频召回模型的性能指标值;根据视频时长参数,确定视频召回模型的平均头部时长;最后,基于性能指标值和平均头部时长进行循环多轮测试,得到在点击维度下的目标权重和在视频时长维度下的目标权重。
本申请实施例中,性能指标值是用于衡量视频召回模型优劣的指标值,性能指标值可以是视频召回模型的AUC值。这里,AUC(Area Under Curve)被定义为受试者工作特征曲线(ROC,Receiver Operating Characteristic curve)下的面积。AUC是衡量视频召回模型优劣的一种性能指标。AUC可通过对ROC曲线下各部分的面积求和而得。
下面,将说明本申请实施例在一个实际的应用场景中的示例性应用。
本申请实施例提供的视频推荐方法,一方面,本申请实施例能够优化用户增强向量的初始化方式,根据用户播放序列生成用户增强向量,用户播放序列为用户对内容(例如视频)的播放记录,其中播放记录包括内容ID和时长等信息,因为用户播放序列中的内容ID只有百万量级,比用户量级小百倍,从而基于用户播放序列生成用户增强向量的过程能够显著降低模型大小;而且绝大部分用户播放过的内容和时长都不相同,也能够实现不同用户的区分。另一方面,本申请实施例还优化了每个塔的结构,使用多目标神经网络从而输出对应多个目标的多个向量。再一方面,本申请实施例还优化了增强向量拟合目标向量的方式,通过在拟合过程中增加多门混合专家算法(MMOE,Multi-gate Mixture of Experts)网络,实现使用一个增强向量同时拟合多个目标向量。
本申请实施例的视频推荐方法中,重新设计出一种基于双重增强双塔结构的多目标召回模型(即视频召回模型),先后通过根据历史播放序列初始化用户增强向量(即对象增强向量)、使用多目标网络作为塔结构、动态提取增强向量多目标信息的方式,分别解决了原始结构中用户增强向量的参数规模过大、塔结构只适用于单目标、增强向量无法拟合多目标信息的问题。为了进一步同时学习和预测多个目标,本申请实施例一方面在训练阶段,通过对目标值和预测值做截断和归一化等处理,优化了各个目标损失的计算方式,以及通过自适应学习各个目标损失的权重,优化了多个目标损失的融合方式;另一方面在应用阶段,通过定义评估公式,计算各个目标相似度在不同权重组合下的衡量分数,并以此离线探索同时满足多个目标的最佳组合,优化了多个目标相似度的融合方式。经过一系列优化,本申请实施例的视频召回模型最终拥有在同时考虑点击率和人均播放时长等目标的推荐场景中,离线训练阶段充分增加用户和内容(即待推荐视频)的交叉机会,在线应用阶段快速检索均衡满足多目标内容的能力。
本申请实施例视频召回模型的用户塔和内容塔,除了分别对对象特征和内容特征正 常向量化之外,还各自拼接了携带另一个塔的多目标信息的增强向量,拼接向量经过多目标塔的计算后,一共输出四个向量,分别是用户的点击目标向量、内容的点击目标向量、用户的时长目标向量和内容的时长目标向量,用户的两个目标向量直接拼接后作为用户多目标向量,内容的两个目标向量分别按位乘以各自权重后,拼接在一起作为内容多目标向量,然后,在线实时计算用户多目标向量与内容多目标向量的内积,将该内积作为相似度分数。相似度分数越高,用户越感兴趣。视频推荐系统将检索到的头部分数结果,与其他召回的结果合并和去重后,继续进行精排、混排等逻辑,最终推荐给用户。下面将对本申请实施例通过视频召回模型进行视频召回和推荐的实现过程进行详细说明。
本申请实施例的应用场景可以是任一视频应用首页各频道无目的区中的素材卡片的推荐瀑布流。图16是视频应用首页的界面图,视频应用首页包括多个频道,主要为精选页和各垂直频道,精选页可以综合展示类型1、类型2、类型3、类型4等多种类型的内容(例如,这里的多种类型可以是电视剧、电影、综艺和动漫等类型),各垂直频道只会展示对应类型的内容,如电视剧频道只会展示电视剧。用户在各个频道下滑后看到的就是无目的区,如图17所示,主要内容类型为素材卡片171,通过素材卡片171个性化展示用户感兴趣的视频。在本申请实施例的场景中,用户最终是否点击会受多种因素影响,比如:当前时间、长期兴趣和近期热点等,所以本申请实施例的技术难点为如何根据有限信息检索用户真正感兴趣的视频,提升场景的点击率和人均播放时长等。本申请实施例的主要目的是根据目标对象(例如用户)的特征和历史行为等个性化信息,同时考虑点击和时长两个目标,检索用户感兴趣的视频,通过推荐服务将用户感兴趣的视频展示给用户,吸引用户点击,从而带动业务指标增长。
本申请实施例的核心技术包括:基于多目标召回模型的计算流程(即计算用户和视频相似度分数的流程)、训练流程(即使用训练数据更新模型参数的流程)、应用流程(即在线实时检索用户感兴趣视频的流程)。下面对三个流程分别进行说明。
下面,将对基于多目标召回模型的计算流程进行说明。
当用户访问视频应用各个频道的无目的区时,会发送获取视频的请求(即视频推荐请求)到推荐服务,推荐服务先后通过召回、精排和混排等逻辑,返回用户感兴趣的视频(即目标推荐视频)。本申请实施例的视频召回模型位于召回层,通过图18所示的计算流程检索视频,计算流程包括:步骤S181,输入特征并向量化;步骤S182,输出多目标向量;步骤S183,计算多目标相似度。
在步骤S181中,输入模型的特征无法直接参与计算,需要经过向量化处理,除了生成特征向量外,用于增加用户和视频交互机会的增强向量,也会在本阶段生成。
本申请实施例中,在特征向量生成时,特征向量的生成采用图19所示的方式生成,视频召回模型首先在用户塔输入对象特征,在视频塔输入视频特征,然后通过特征向量表190(即上述预设的特征向量表)进行向量化处理。特征向量表有两维,第一维是特征ID,第二维是每个特征ID对应的向量,不同特征的向量表是独立的。如果是离散特征191的话,比如:目标对象的离散型相关信息等,视频的ID、品类和标签等,可以直接根据ID检索对应向量表。如果是连续特征192的话,比如:用户的上次登录时间距今的小时数、最近30天播放的视频数量、最后一次的播放时长等,视频的时长、开播距今天数、点击率等,就必须要进行离散化处理。例如,离散化处理可以是根据离线已经统计好的特征等频分布表193判断连续特征192所属的区间,再根据区间ID检索对应特征向量表190,每个连续特征的等频分布统计也都是独立的。本申请实施例中,对于离散特征191和连续特征192查询特征向量表190之后,分别得到离散特征向量194和连续特征向量195,之后,对离散特征向量194和连续特征向量195进行拼接,形成 特征向量196。对象特征和视频特征经过向量化处理后,各自内部拼接在一起,生成对象特征向量和视频特征向量。特征向量表属于模型参数的一部分,模型参数的更新方式会在下文的训练流程中说明。
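A hedged sketch of the lookup-plus-discretization flow in this paragraph; the equal-frequency bin edges are assumed to have been computed offline, the vector tables are assumed to contain an entry for every possible interval id, and all container names (`feature_tables`, `bin_edges`) are illustrative:

```python
import numpy as np

def vectorize_features(discrete_features, continuous_features, feature_tables, bin_edges):
    """discrete_features: dict name -> feature id;
    continuous_features: dict name -> float value;
    feature_tables: dict name -> dict id -> np.ndarray (per-feature vector table);
    bin_edges: dict name -> ascending equal-frequency boundaries (offline statistics)."""
    parts = []
    for name, fid in discrete_features.items():
        # Discrete features are looked up in their vector table directly by id.
        parts.append(feature_tables[name][fid])
    for name, value in continuous_features.items():
        # Continuous features are first discretized into an interval id, and the
        # interval id is then looked up in the corresponding vector table.
        interval_id = int(np.searchsorted(bin_edges[name], value))
        parts.append(feature_tables[name][interval_id])
    # All per-feature vectors are spliced into one feature vector.
    return np.concatenate(parts)
```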
本申请实施例中,在增强向量生成时,增强向量包括用户增强向量和视频增强向量。其中,视频增强向量通过视频ID检索特征向量表得到;用户增强向量通过播放序列(即历史播放序列)生成。如图20所示,播放序列20包括用户历史播放过的视频ID和时长。这里,首先,对播放序列20中每个视频ID检索特征向量表201,生成视频向量集合202,视频向量集合202中的视频向量的数量与播放序列中视频ID的数量一致;然后,统计播放序列的总播放时长203,对播放序列中每个播放时长除以总播放时长203,得到归一化时长,作为视频向量权重,视频向量权重的数量也与播放序列中视频ID的数量一致;最后,对视频向量集合中每个向量乘以对应视频向量权重,得到视频加权向量集合204,并合并视频加权向量集合204中的向量得到的就是用户增强向量205。本阶段的特征向量表也属于模型参数的一部分,模型参数的更新方式会在下文的训练流程中说明。
本申请实施例中,使用播放序列生成用户增强向量,由于绝大部分用户播放过的视频和时长都不相同,而且本申请实施例还使用了播放时长对向量加权,所以能够做到不同用户间的差异化处理,不影响用户增强向量的置信度。本申请实施例的用户增强向量生成方式,具有如下两个优点:第一,虽然绝大部分用户的播放序列不同,但用户的播放序列所涵盖的视频ID数量级不变,与应用场景积累的几亿用户相比,数量减少了百倍,所以能够明显节省模型的计算资源和存储空间;第二,假设在样本数量不变的前提下,因为播放序列中的视频ID数量比用户数少得多,所以每个视频ID能够得到更加充分的训练机会,从而得到更加准确的用户增强向量。
本申请实施例中,在进行向量拼接时,模型在用户塔和视频塔的输入层,对于生成的特征向量和增强向量均各自拼接在一起后得到拼接向量,其中用户塔和视频塔得到的拼接向量分别作为用户拼接向量和视频拼接向量,并继续后续的流程。
在步骤S182中,与原始双重增强双塔结构相比,本申请实施例将用户塔和视频塔的结构,均为多目标神经网络。例如,多目标神经网络可以是渐进分层提取网络(PLE,Progressive Layered Extraction)。PLE网络(如图21所示)主要包括:(1)用于学习多个目标的专家网络,比如需要学习点击和时长两个目标的话,就需要两组专家网络;(2)用于学习不同专家网络间共用信息的共享网络,不管需要学习的目标有多少,共享网络固定为一个;(3)用于计算融合多个网络的输出向量时,每个向量对应权重的门网络,门网络最后一层输出向量的长度,与待确定权重的数量相同。专家网络、共享网络和门网络可以为多层,每层都为多层神经网络,因为最后一层专家网络的输出就是目标向量,后续不需要再与共享网络和门网络交互,所以共享网络和门网络会比专家网络少一层,具体层数可以通过离线测试确定。本申请实施例可以采用三层专家网络、两层共享网络和两层门网络。上一步的用户拼接向量和视频拼接向量,在经过各自的PLE网络后,会由点击专家网络输出点击目标向量,时长专家网络输出时长目标向量,也就是两个塔各输出两个向量。PLE网络参数属于模型参数的一部分,模型参数的更新方式会在下文的训练流程中说明。
在步骤S183中,PLE网络的提出是为了优化推荐中的精排模型,在计算分数的时候,使用的是点击率预测值乘时长预测值,相当于用户的时长期望,这种方式需要多次计算,可以适用于候选集数量不多的情况。但召回模型面对的是海量候选集,采用的方式是检索效率更高的最近邻算法,分数一般为只需要一次计算的内积,原始结构的分数计算方式不能适用召回模型,需要重新设计。
计算基于内积的多目标相似度,最直观的方式为首先计算用户点击目标向量和视频点击目标向量的内积,以及用户时长目标向量和视频时长目标向量的内积,再对两个内积加权求和,但这种方式同样需要多次计算。本申请实施例的方式为首先在每个塔内部,将点击目标向量和时长目标向量拼接,不同的是在用户塔中将用户点击目标向量221和用户时长目标向量222直接拼接,将拼接后得到的拼接向量作为用户多目标向量223,在视频塔中会对视频点击目标向量224的每一位乘以点击目标权重α,视频时长目标向量225的每一位乘以点击目标权重β,然后再拼接,将拼接后得到的拼接向量作为视频多目标向量226。最后计算用户多目标向量223和视频多目标向量226的内积,作为用户与视频之间的多目标相似度227(即相似度分数),如图22所示。相似度分数越大,用户越感兴趣。本申请实施例相当于通过一次计算,实现对点击目标内积和时长目标内积的加权求和,从而适配最近邻算法。α、β属于模型参数的一部分,模型参数的更新方式会在下文的训练流程中说明。
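The single-inner-product trick in this paragraph can be written compactly; in the LaTeX sketch below, $p_u, p_v$ are the click target vectors, $d_u, d_v$ the duration target vectors, $\alpha, \beta$ the click and duration target weights, and $\Vert$ denotes splicing (concatenation):

```latex
\[
\big\langle\, [\,p_u \,\Vert\, d_u\,],\;
              [\,\alpha\, p_v \,\Vert\, \beta\, d_v\,] \,\big\rangle
  \;=\; \alpha\,\langle p_u, p_v\rangle \;+\; \beta\,\langle d_u, d_v\rangle
\]
```

A single nearest-neighbor query over the spliced vectors therefore reproduces the weighted sum of the per-target inner products.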
下面,将对训练流程进行说明。
本申请实施例用于线上实时检索用户感兴趣视频的视频召回模型,离线可以通过图23所示的训练流程定时更新,训练流程包括:步骤S231,构造训练样本;步骤S232,更新模型参数;步骤S233,探索多目标权重。
在步骤S231中,训练样本用于离线训练模型,构造的过程分为两步:筛选正负例和关联特征。
在筛选正负例时,正负例表示为:【用户ID,视频ID,是否点击,播放时长,请求ID】,由五个字段构成。比如在精排模型的训练样本中,使用“真实曝光且播放行为数据”作为正例,正例可以表示为:【用户ID,视频ID,1,播放时长,请求ID】;“真实曝光未播放行为数据”作为负例,负例可以表示为:【用户ID,视频ID,0,0,请求ID】。其中的正例筛选方式可以复用于视频召回模型,但负例筛选方式不适用,所以本申请实施例会进行优化,如图24所示,包括增加随机负例和减少真实负例的数量。
在增加随机负例时,召回模型在线上实时检索用户感兴趣视频的时候,面临的视频候选集241是百万量级资源池,所以与精排样本只使用真实行为数据作为正负例的筛选方式不同,召回样本需要另外添加随机采样的负例(即随机负例242),因为精排模型面临的候选集已经是经过召回和粗排筛选后,与用户兴趣相对匹配的视频,主要目的是从中判断更感兴趣的部分,但视频召回模型还需要有能在整个候选集中辨别用户完全不感兴趣视频的能力,也就是把用户感兴趣视频与不感兴趣视频最大程度隔绝开。
随机负例242的候选集为整个视频候选集241,对于每个正例,提取其中的用户ID、请求ID,随机筛选n个视频ID拼接成负例,负例表示为:【用户ID,视频ID,0,0,请求ID】,负例共有n个,n可以通过离线测试确定。本申请实施例通过引入随机负例,模拟用户完全不感兴趣的视频,让视频召回模型在训练阶段能够见识更多应该被过滤的部分,从而加强对候选集的兴趣区分度判断。
在减少真实负例244时,由于随机负例的引入会显著增加样本量,从而拉长训练时间,在需要更多计算和存储资源的同时,还会影响模型的更新速度,导致对用户的行为特征的学习不够及时,所以需要进一步优化。为减少样本数量,本申请实施例采用的方式为对真实负例随机采样。真实负例是用户有曝光但未播放的行为数据,作为用户感兴趣视频与不感兴趣视频的过度数据,如果在训练样本中过量存在,会误导模型过度倾向学习这部分模糊行为,从而干扰模型对更加重要的目标,也就是最大程度隔绝感兴趣视频与不感兴趣视频的能力。经过离线测试,本申请实施例中,真实正例243、真实负例244与随机负例242之间的比值可以为1:1:4,即真实正例:真实负例:随机负例=1:1:4,相当于经过随机采样后的随机真实负例245的数量与真实正例数量相同,对每条 正例随机筛选四条视频作为随机负例。
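A minimal sketch of the positive/negative example construction summarized above (real positives : downsampled real negatives : random negatives = 1:1:4); each tuple follows the five-field form 【user ID, video ID, clicked, play duration, request ID】 described earlier, and `video_pool` and the helper name are illustrative:

```python
import random

def build_examples(real_positives, real_negatives, video_pool, rand_negs_per_pos=4):
    """Assumes len(real_negatives) >= len(real_positives) and that video_pool
    holds all candidate video ids."""
    positives = list(real_positives)
    # Randomly downsample real negatives to the same count as the real positives.
    kept_real_negatives = random.sample(real_negatives, k=len(positives))
    random_negatives = []
    for user_id, _, _, _, request_id in positives:
        # For every positive, randomly draw several videos from the whole pool
        # as random negatives (clicked = 0, play duration = 0).
        for vid in random.sample(video_pool, k=rand_negs_per_pos):
            random_negatives.append((user_id, vid, 0, 0, request_id))
    return positives, kept_real_negatives + random_negatives
```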
在关联特征时,可以根据筛选的正负例246,对象特征通过用户ID关联,视频特征通过视频ID关联,如图25所示。因为用户的播放序列和所在城市等部分特征对应的特征值处于变化中,且视频的播放量和点击率等部分特征对应的特征值也处于变化中,如果同一个用户的访问时间不同,对应的请求时间不同,这些特征也会有区别。所以对象特征和视频特征一般会在推荐服务中,跟随用户访问时的请求ID绑定存储,此时的训练样本可以表示为:【对象特征,视频特征,请求ID】;而在离线情况下,则根据正负例的请求ID与线上存储的特征251的请求ID相关联,从而匹配到请求发生时的对象特征和视频特征,得到最后的训练样本252,该训练样本可以表示为:【对象特征,视频特征,是否点击,播放时长】。在离线训练时,将“对象特征”输入至用户塔,将“视频特征”输入至视频塔,“是否点击”为点击目标,“播放时长”为时长目标。
在步骤S232中,更新模型的过程就是更新自身参数的过程,构造完成的样本输入到模型后,需要计算对应损失,模型通过减小损失实现调整模型自身参数。模型的损失可以通过将目标值和预测值输入到损失函数后计算得到,模型的损失用来衡量目标值与预测值之间的差异,损失越小,差异越小。模型在训练阶段通过不断减小损失,从而不断拟合目标。本申请实施例中,除了需要使用预测值拟合点击目标和时长目标以外,还需要使用当前塔的增强向量拟合另一个塔输出的两个目标向量,每个拟合过程都需要各自的损失函数,为了能够在训练过程中将多个损失自适应融合到一起,本申请实施例还优化了采用固定损失权重的融合方式,实现每个预测值都均衡贴近目标值。下面对本申请实施例涉及的几种损失计算进行说明。
(1)点击目标损失。
点击目标根据“是否点击”分为两种,因为目标为离散值,属于二分类预测,所以本申请实施例使用对数损失函数,如以下公式(1)。损失的计算过程为:首先,确定目标值,也就是y,如果曝光未播放,则目标值为0,如果曝光且播放,则目标值为1;然后,计算预测值,也就是σ(<pu,pv>),首先计算用户点击目标向量pu、视频点击目标向量pv的内积,然后经过sigmoid激活函数(σ)的输出,作为预测值;最后,计算对数损失,将目标值、预测值输入对数损失函数中,最后得到对数损失lossclk。其中T表示训练样本的数量。
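The image for formula (1) did not survive extraction; a standard binary log-loss consistent with the computation steps described above would read as follows (the averaging over the $T$ training samples follows the text, the rest is a reconstruction, not the original rendering):

```latex
\[
loss_{clk} \;=\; -\frac{1}{T}\sum_{t=1}^{T}
  \Big[\, y_t \log \sigma\!\big(\langle p_u, p_v\rangle_t\big)
        \;+\; (1-y_t)\,\log\!\big(1-\sigma(\langle p_u, p_v\rangle_t)\big) \Big]
\qquad (1)
\]
```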
(2)时长目标损失。
时长目标是用户的真实播放时长,未点击的话为0,点击的话为大于0的小数。因为目标为连续值,属于回归预测,所以本申请实施例使用均方差损失函数,如以下公式(2)。损失的计算过程为:首先,截断时长目标min(dur,maxdur),由于有用户可能忘记退出播放,或者时长上报有问题,样本中的时长目标dur存在极少数的异常过大值,为避免干扰召回模型拟合时长目标,需要按规定的截断值截断。本申请实施例中,离线统计随机筛选的测试样本中时长目标的等频分布,可以划定为100份区间,以第100个区间的最小值作为规定的截断值maxdur(即上述目标截断值),时长目标小于等于截断值的话保留原始值,大于截断值的话替换为截断值;然后,归一化时长目标,得到归一化时长截断值由于截断后的时长目标跨度依很大,从0秒到几万秒不等,导致模型学习不同样本时的参数波动剧烈,因为PLE网络有共享参数的存在,甚至还会干扰对点击目标的学习,所以需要调整跨度区间,为了不改变样本中时长目标的分 布,本申请实施例使用MinMax函数对时长截断值进行归一化处理,函数中的min为0,max为规定的最大截断值maxdur,时长目标经过MinMax函数的输出后,将MinMax函数的输出作为目标值,并将时长目标对应的区间调整为[0,1.0],如此,时长目标的跨度区间显著降低,从而方便模型拟合;再然后,计算在时长维度下的预测值,也就是σ(<du,dv>),预测值为了拟合目标值,需要与目标值的跨度区间一致,即预测值的跨度区间也是[0,1.0]。本申请实施例首先计算用户时长目标向量du与视频时长目标向量dv的内积,然后将内积输入至sigmoid激活函数(σ)中得到输出结果,将该输出结果作为在时长维度下的预测值;最后,计算均方差,可以将目标值和预测值输入至均方差损失函数(2)中,得到均方差损失lossdur
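Similarly, the image for formula (2) is missing; combining the truncation and MinMax normalization steps described above, a plausible reconstruction is (the $1/T$ averaging is an assumption):

```latex
\[
\widehat{dur}_t \;=\; \frac{\min(dur_t,\; maxdur)}{maxdur}, \qquad
loss_{dur} \;=\; \frac{1}{T}\sum_{t=1}^{T}
  \Big( \sigma\!\big(\langle d_u, d_v\rangle_t\big) \;-\; \widehat{dur}_t \Big)^{2}
\qquad (2)
\]
```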
(3)增强向量与目标向量损失。
为了解决原始双重增强双塔结构的增强向量无法同时拟合多个目标向量的问题,本申请实施例在拟合过程中增加了MMOE网络,如图26所示。MMOE网络是一种初级的多目标网络,MMOE网络可以输出对应多个目标的多个向量,与PLE网络相比,MMOE只有专家网络和门网络,没有共享网络,当层数配置的很少时,参数规模也较小,不会导致模型过度学习这部分结构。当输入样本为正例时,增强向量261经过MMOE网络中的点击门网络262、点击专家网络263、时长专家网络264和时长门网络265,输出点击的目标增强向量266和时长目标增强向量267,模型分别计算每个目标维度下,当前塔的目标增强向量与另一个塔的目标向量的均方差lossaug(参见以下公式(3),ti表示目标向量的每一位数,ai表示目标增强向量的每一位数),共得到四个损失,分别为:点击目标下的用户增强损失、点击目标下的视频增强损失、时长目标下的用户增强损失和时长目标下的视频增强损失。
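The image for formula (3) is likewise missing; with $t_i$ and $a_i$ the $i$-th components of the target vector and the target enhancement vector as defined above, and $L$ their common length, the mean square error would read (the $1/L$ averaging is an assumption):

```latex
\[
loss_{aug} \;=\; \frac{1}{L}\sum_{i=1}^{L} \big(t_i - a_i\big)^{2}
\qquad (3)
\]
```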
需要注意的是,因为增强向量是当前塔的输入,被拟合的目标向量是另一个塔的输出,目标向量在计算过程中需要另一个塔输入的增强向量,彼此存在着依赖,模型更新增强向量的话,需要固定目标向量,否则会有模型参数无法更新的问题。
(4)融合损失。
在训练过程中,当模型面对多个目标损失时,一般是对各个损失加权求和,将加权求和结果作为融合损失。在计算出融合损失之后,可以通过减小融合损失,实现同时减小各个损失。每个损失的权重可以通过离线测试确定,但在训练过程中,当输入不同训练样本时,由于数据间存在差异,固定的权重难免会出现多个目标顾此失彼的问题,所以本申请实施例使用不确定性加权的方式,自适应调整每个损失的权重,实现每个预测值都均衡贴近目标值。
本申请实施例中,模型面对的目标包括:点击目标损失、时长目标损失、点击目标下的用户增强损失、点击目标下的视频增强损失,以及时长目标下的用户增强损失、时长目标下的视频增强损失,一共六个损失。对应的原始不确定性加权的公式为以下公式(4),其中lossclk、lossdur、lossu_clk_aug、lossu_dur_aug、lossv_clk_aug、lossv_dur_aug分别对应这六个目标损失,wclk、wdur、wu_clk_aug、wu_dur_aug、wv_clk_aug、wv_dur_aug分别是对应六个目标损失中的每一目标损失的权重,log(wclkwdurwu_clk_augwu_dur_augwv_clk_augwv_dur_aug) 作为正则项,用来防止权重学习的过大。因为权重在训练过程中是动态变化的,所以wclkwdurwu_clk_augwu_dur_augwv_clk_augwv_dur_aug可能为负数,导致融合目标损失计算不正确,从而训练失败,所以本申请实施例将正则项调整为公式(5),以保证每一步训练的融合目标损失都计算正确。
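The image for formula (4) is missing as well; a reconstruction in the canonical uncertainty-weighting style, matching the six losses and the regular term named in this paragraph, could look like the following (the $\tfrac{1}{2w^{2}}$ scaling is an assumption taken from the standard form of uncertainty weighting, not from the original):

```latex
\[
loss \;=\; \frac{loss_{clk}}{2w_{clk}^{2}} + \frac{loss_{dur}}{2w_{dur}^{2}}
   + \frac{loss_{u\_clk\_aug}}{2w_{u\_clk\_aug}^{2}} + \frac{loss_{u\_dur\_aug}}{2w_{u\_dur\_aug}^{2}}
   + \frac{loss_{v\_clk\_aug}}{2w_{v\_clk\_aug}^{2}} + \frac{loss_{v\_dur\_aug}}{2w_{v\_dur\_aug}^{2}}
   + \log\!\big(w_{clk}\,w_{dur}\,w_{u\_clk\_aug}\,w_{u\_dur\_aug}\,w_{v\_clk\_aug}\,w_{v\_dur\_aug}\big)
\qquad (4)
\]
```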

$norm=\log\big(w_{clk}^{2}\,w_{dur}^{2}\,w_{u\_clk\_aug}^{2}\,w_{v\_clk\_aug}^{2}\,w_{u\_dur\_aug}^{2}\,w_{v\_dur\_aug}^{2}\big)$      (5)。
本申请实施例模型定时更新时,通过增量输入样本、计算损失和降低损失,实现持续学习最新的用户的行为特征,保证在线检索到的视频更符合用户兴趣。
在步骤S233中,由于经过前面步骤,在线可用的模型已经实现创建、更新,输入用户、视频的特征后,就会输出用户、视频的多目标向量,但当计算用户、视频的相似度分数时,还需要确定点击目标权重、时长目标权重,也就是上述计算流程的计算多目标相似度中的α和β,本申请实施例通过离线测试确定。
首先,探索点击目标权重。点击目标属于二分类预测,本申请实施例采用AUC确定α和β,计算AUC时的目标值为0或1,预测值为不同α和β下的相似度分数,AUC越高,α和β越合适。然后,探索时长目标权重。时长目标属于回归预测,可以采用单测试集的平均头部时长(AD@K,AvgDurPerBatch@TopK)确定α和β,参见以下公式(6):第一步,筛选测试样本,计算不同α和β下的相似度分数,可以随机划定为n份测试集(batch),每份的样本数为m,那么n*m为测试样本的总数;第二步,在每份测试集(batch)内,按相似度分数从大到小排序,首先对前k个样本的时长求和,得到头部时长,然后对所有测试集(batch)的头部时长求和,将求和结果最后除以n,得到AD@K,AD@K越高,则α和β越合适。其中,durij表示第i份测试集中的第j个样本对应的时长目标。
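The image for formula (6) is missing; with $dur_{ij}$ the duration target of the $j$-th sample of the $i$-th test batch after each batch is sorted by similarity score in descending order, $n$ batches and top-$k$ heads, the description corresponds to:

```latex
\[
AD@K \;=\; \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k} dur_{ij}
\qquad (6)
\]
```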
最后,探索多目标权重。因为AUC最高时对应的α和β,与AD@K最高时对应的α和β一般不会相同,所以需要确定同时满足两个目标的最合适α和β。本申请实施例中,可以采用多轮探索来确定最终的α和β:第一步,划定第一轮测试的α和β的取值范围,本申请实施例的取值范围为[1,10],步长为1,因为有两个权重,所以会探索100次,对于每组α和β,计算对应的<AUC,AD@K>,不过AUC和AD@K的数量级不同,不方便直接比较,本申请实施例会首先采用MinMax函数归一化,然后再比较每组下α和β两者的差值minus,如以下公式(7),差值minus越接近0,α和β越合适,最后确定第一轮差异最小的α1和β1,之后的探索会固定α为α1,只探索更精确的β;第二步,划定第二轮测试的β取值范围,本申请实施例对β的取值范围是(β1-1,β1+1),步长为0.1,探索19次,最后确定第二轮差异最小的β2;第三步,划定第三轮测试的β值范围,本申请实施例对β的取值范围是(β2-0.1,β2+0.1),步长为0.01,探索19 次,最后确定第三轮差异最小的β3;第四步,按第一步至第三步的方式循环,本申请实施例会对y探索五轮,最后得到的α1和β5分别就是点击和时长下的目标权重,用于在线计算用户与视频的相似度分数。在公式(7)中,minAUC表示本轮测试时最小的AUC值,maxAUC表示本轮测试测试时最大的AUC值;minAD@K表示本轮测试时最小的AD@K值,maxAD@K表示本轮测试时最大的AD@K值。
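The image for formula (7) is missing; with $minAUC/maxAUC$ and $minAD@K/maxAD@K$ the per-round minima and maxima defined at the end of this paragraph, a MinMax-normalized difference consistent with the description is (taking the absolute value is an assumption, since the text only asks for the difference closest to 0):

```latex
\[
minus \;=\; \left|\,
  \frac{AUC - minAUC}{maxAUC - minAUC}
  \;-\;
  \frac{AD@K - minAD@K}{maxAD@K - minAD@K}
\,\right|
\qquad (7)
\]
```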
下面,将对本申请实施例视频推荐方法的应用流程进行说明。
本申请实施例中,视频召回模型每次完成更新后,首先将视频候选集的特征批量输入到视频塔,生成对应视频的视频点击目标向量和视频时长目标向量,然后将两个目标向量的每一位分别乘以各自权重后拼接在一起得到拼接向量,该拼接向量作为视频多目标向量,最后为该视频多目标向量创建用于在线实时查询的索引,从而可以基于索引直接查询视频多目标向量,保证在线无需重复生成视频多目标向量。每次视频多目标向量的索引完成创建后,再将模型部署到线上,以保证实时生成的用户多目标向量与视频多目标向量对应的模型版本一致。当用户在视频应用的各个频道下滑浏览或者刷新无目的区时,线上实时将请求用户的特征输入到用户塔,通过用户塔生成对应的用户点击目标向量和用户时长目标向量,再将用户点击目标向量和用户时长目标向量拼接成用户多目标向量,最后使用最近邻算法查询相似度分数最高的头部视频并返回。在实现的过程中,可以将查询到的视频与其他召回的视频依次进行合并和去重后,继续进行精排和混排等逻辑处理,最终将筛选出的目标推荐视频在无目的区以素材卡片的形式推荐给用户。
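A compact sketch of the offline indexing and online retrieval flow in this paragraph; `video_tower` and `user_tower` stand for the trained tower networks (assumed to return the click and duration target vectors), and a brute-force NumPy matrix takes the place of a real online nearest-neighbor index:

```python
import numpy as np

def build_video_index(video_tower, video_feature_batches, alpha, beta):
    """Offline: run the video tower in batches, scale each element of the two
    target vectors by its weight, splice them, and stack the results into an index."""
    rows = []
    for features in video_feature_batches:
        click_vec, dur_vec = video_tower(features)
        rows.append(np.concatenate([alpha * click_vec, beta * dur_vec]))
    return np.stack(rows)

def serve_request(user_tower, user_features, video_index, top_k):
    """Online: one user-tower pass, one splice, one nearest-neighbor query."""
    user_click_vec, user_dur_vec = user_tower(user_features)
    user_multi_target_vec = np.concatenate([user_click_vec, user_dur_vec])
    scores = video_index @ user_multi_target_vec
    # Return the head videos with the highest similarity scores.
    return np.argsort(-scores)[:top_k]
```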
本申请实施例的用户增强向量使用播放序列初始化,在保证了不同用户存在差异化的同时,既能通过减小模型参数规模,显著节省计算和存储资源,又能通过增加播放序列中每个视频ID的训练机会,提升增强向量的表达准确度。并且,本申请实施例中的每个塔采用PLE网络,实现输入特征后能够输出多个向量,这多个向量用于拟合点击和时长等多个目标,并可以根据业务目标自行调整这多个向量。本申请实施例中,当前塔输入的增强向量拟合另一个塔输出的目标向量时,通过增加MMOE网络,实现一个增强向量同时拟合多个目标向量。另外,本申请实施例还重新设计了时长目标损失函数中目标值与预测值的计算方式,得到时长目标损失,这样,时长目标损失再加上点击目标损失和增强目标损失等其他损失后,通过基于改良版不确定性加权来融合多个损失,实现每个预测值都均衡贴近目标值。除此之外,本申请实施例在实时计算用户与视频之间的多目标相似度分数时,使用的多目标权重可以通过本申请实施例设计的探索方式确定,如此,能够离线得到同时满足多个目标的最合适权重。
需要说明的是,本申请实施例的用户增强向量通过播放序列初始化,因为播放序列中除了视频ID和播放时长,还包含着播放先后顺序,所以可以采用序列模型实现在用户增强向量中引入顺序信息,比如长短期记忆网络(LSTM,Long Short-Term Memory)、transformer模型与BERT等模型。本申请实施例中每个塔的多目标网络,可以使用其他结构替换PLE网络,比如使用ResNet或者并联双塔等结构替换PLE网络,从而加强多目标向量的表达能力。本申请实施例的视频召回模型除了用于任意一种视频应用无目的区素材卡片召回之外,还可以根据不同视频应用的场景的特点,个性化调整模型的优化目标。
可以理解的是,在本申请实施例中,涉及到用户信息的内容,例如,对象特征向量、历史播放序列和目标推荐视频等信息,如果涉及与用户信息或企业信息相关的数据,当本申请实施例运用到具体产品或技术中时,需要获得用户许可或者同意,且相关数据收 集处理在实例应用时应该严格根据相关国家法律法规的要求,获取个人信息主体的知情同意或单独同意,并在法律法规及个人信息主体的授权范围内,开展后续数据使用及处理行为。
下面继续说明本申请实施例提供的视频推荐装置354实施为软件模块的示例性结构,在一些实施例中,如图4所示,视频推荐装置354包括:获取模块3541,配置为获取目标对象的对象特征向量、所述目标对象在预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引;向量化处理模块3542,配置为对所述历史播放序列进行向量化处理,得到所述目标对象的对象增强向量;多目标处理模块3543,配置为对所述对象特征向量和所述对象增强向量依次进行向量拼接处理和多目标特征学习,得到所述目标对象的对象多目标向量;确定模块3544,配置为基于所述对象多目标向量和每一所述待推荐视频的视频多目标向量索引,从所述待推荐视频库中确定对应于所述目标对象的目标推荐视频;视频推荐模块3545,配置为基于所述目标推荐视频对所述目标对象进行视频推荐。
在一些实施例中,所述装置还包括:检索模块,配置为基于每一所述待推荐视频的视频标识,在预设的特征向量表中进行检索,对应得到每一所述待推荐视频的视频增强向量;向量拼接模块,配置为对每一所述待推荐视频的视频特征向量和所述视频增强向量进行向量拼接处理,对应得到每一所述待推荐视频的视频拼接向量;多目标特征学习模块,配置为对每一所述待推荐视频的视频拼接向量进行多目标特征学习,对应得到每一所述待推荐视频的视频多目标向量;创建模块,配置为创建与每一所述待推荐视频的视频多目标向量对应的视频多目标向量索引。
在一些实施例中,所述向量化处理模块还配置为:获取所述历史播放序列中的每一历史播放视频的历史视频标识和历史播放时长;基于每一所述历史视频标识,在预设的特征向量表中进行检索,得到历史视频向量集合;所述历史视频向量集合中的历史视频向量的数量与所述历史播放序列中的历史视频标识的数量相同;对所述历史播放序列中的所述历史播放时长进行总时长统计,得到历史播放总时长;基于所述历史播放总时长,对每一所述历史播放时长进行时长归一化处理,得到每一历史播放视频的归一化播放时长,并将所述归一化播放时长确定为相应历史播放视频的视频向量权重;基于所述视频向量权重,对所述历史视频向量集合中的每一历史视频向量进行加权处理,得到视频加权向量集合;对所述视频加权向量集合中的视频加权向量进行合并处理,得到所述目标对象的对象增强向量。
在一些实施例中,所述多目标处理模块还配置为:对所述对象特征向量和所述对象增强向量进行向量拼接处理,得到对象拼接向量;通过多目标神经网络对所述对象拼接向量进行多目标特征学习,得到所述目标对象在多个目标维度下的对象目标向量;对所述多个目标维度下的对象目标向量进行拼接处理,得到所述目标对象的对象多目标向量。
在一些实施例中,所述多目标特征学习模块还配置为:针对每一所述待推荐视频,通过多目标神经网络对所述待推荐视频的视频拼接向量进行多目标特征学习,得到所述待推荐视频在多个目标维度下的视频目标向量;获取每个目标维度下的目标权重;采用所述目标权重,对每个目标维度下的视频目标向量分别进行加权计算,得到加权视频目标向量;对多个目标维度下的加权视频目标向量进行拼接处理,得到所述待推荐视频的视频多目标向量。
在一些实施例中,所述确定模块还配置为:基于所述视频多目标向量索引,获取每一待推荐视频的视频多目标向量;确定所述对象多目标向量与每一待推荐视频的视频多目标向量之间的内积,并将所述内积确定为所述目标对象与相应待推荐视频之间的相似度分值;根据所述相似度分值,从所述待推荐视频库中选择特定数量的待推荐视频;将 选择出的所述特定数量的待推荐视频,确定为对应于所述目标对象的目标推荐视频。
在一些实施例中,所述视频推荐方法通过视频召回模型来实现;所述视频推荐装置还包括模型训练装置,所述模型训练装置配置为:获取样本数据,所述样本数据包括:样本对象特征、样本视频特征和多个目标维度下的目标参数;将所述样本对象特征输入至所述视频召回模型的对象塔中,通过所述对象塔基于所述样本对象特征,预测样本对象在所述多个目标维度下的样本对象目标向量;将所述样本视频特征输入至所述视频召回模型的视频塔中,通过所述视频塔基于所述样本视频特征,预测样本视频在所述多个目标维度下的样本视频目标向量;将所述样本对象目标向量、所述样本视频目标向量和所述目标参数输入至目标损失模型中,通过所述目标损失模型进行损失计算,得到目标损失结果;基于所述目标损失结果对所述对象塔和所述视频塔中的参数进行修正,得到训练后的视频召回模型。
在一些实施例中,所述模型训练装置还配置为:获取原始样本数据;所述原始样本数据包括多个真实正例样本和多个真实负例样本;基于所述多个真实正例样本构建随机负例样本,并从所述多个真实负例样本中删减部分数量的真实负例样本;其中,所述真实正例样本的数量、删减部分数量后的真实负例样本的数量与所述随机负例样本的数量之间具有预设比例关系;将所述真实正例样本确定为正例样本,将所述删减部分数量后的真实负例样本和所述随机负例样本确定为负例样本;基于对象标识和视频标识,对所述正例样本和所述负例样本进行特征关联,得到所述样本数据。
在一些实施例中,所述样本对象目标向量包括对象点击目标向量;所述样本视频目标向量包括视频点击目标向量;所述目标参数包括点击目标值;所述模型训练装置还配置为:通过所述目标损失模型确定所述对象点击目标向量与所述视频点击目标向量之间的向量内积;基于所述向量内积和预设激活函数,确定在点击维度下的预测值;通过对数损失函数,确定所述点击维度下的预测值与所述点击目标值之间的对数损失;将所述对数损失确定为所述目标损失结果。
在一些实施例中,所述样本对象目标向量包括对象时长目标向量;所述样本视频目标向量包括视频时长目标向量;所述目标参数包括时长目标值;所述模型训练装置还配置为:按照预设的截断区间数量,对所述时长目标值进行截断处理,得到具有所述截断区间数量的时长截断值;基于具有所述截断区间数量的时长截断值,确定目标截断值;基于所述目标截断值,对每一所述时长截断值进行归一化处理,得到归一化时长截断值;确定所述对象时长目标向量与所述视频时长目标向量之间的向量内积;基于所述向量内积和预设激活函数,确定在时长维度下的预测值;通过均方差损失函数,确定所述时长维度下的预测值与所述归一化时长截断值之间的均方差损失;将所述均方差损失确定为所述目标损失结果。
在一些实施例中,所述视频召回模型还包括多目标网络;所述模型训练装置还配置为:当所述样本数据为正样本时,通过所述多目标网络输出对应于所述对象塔和所述视频塔在多个目标维度下的目标增强向量;在每一目标维度下,确定对应于所述对象塔的目标增强向量与所述视频塔输出的样本视频目标向量之间的第一均方差,或者,确定对应于所述视频塔的目标增强向量与所述对象塔输出的样本对象目标向量之间的第二均方差;将所述第一均方差和所述第二均方差,分别确定为所述样本对象的对象增强损失和所述样本视频的视频增强损失;所述对象增强损失和所述视频增强损失构成所述目标损失结果中的部分损失结果。
在一些实施例中,所述目标损失结果包括点击维度下的对数损失、时长维度下的均方差损失、点击维度下的对象增强损失、点击维度下的视频增强损失、时长维度下的对象增强损失和时长维度下的视频增强损失;所述模型训练装置还配置为:获取所述点击 维度下的对数损失、所述时长维度下的均方差损失、所述点击维度下的对象增强损失、所述点击维度下的视频增强损失、所述时长维度下的对象增强损失和所述时长维度下的视频增强损失分别对应的损失权重;获取预设的正则项;基于所述损失权重和所述正则项,对所述点击维度下的对数损失、时长维度下的均方差损失、点击维度下的对象增强损失、点击维度下的视频增强损失、时长维度下的对象增强损失和时长维度下的视频增强损失进行损失融合处理,得到融合损失结果;基于所述融合损失结果,对所述对象塔和所述视频塔中的参数进行修正,得到所述训练后的视频召回模型。
在一些实施例中,所述模型训练装置还配置为:将所述样本对象目标向量和所述样本视频目标向量输入至所述视频召回模型的推荐预测层中,通过所述推荐预测层确定所述样本对象对所述样本视频的点击参数和视频时长参数;根据所述点击参数,确定所述视频召回模型的性能指标值;根据所述视频时长参数,确定所述视频召回模型的平均头部时长;基于所述性能指标值和所述平均头部时长进行循环多轮测试,得到在点击维度下的目标权重和在视频时长维度下的目标权重。
需要说明的是,本申请实施例装置的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果,因此不做赘述。对于本装置实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括可执行指令;该可执行指令存储在计算机可读存储介质中。当电子设备的处理器从计算机可读存储介质读取该可执行指令,处理器执行该可执行指令时,使得该电子设备执行本申请实施例上述的视频推荐方法。
本申请实施例提供一种存储有可执行指令的存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的视频推荐方法,例如,如图5示出的视频推荐方法。在一些实施例中,存储介质可以是计算机可读存储介质,例如,铁电存储器(FRAM,Ferromagnetic Random Access Memory)、只读存储器(ROM,Read Only Memory)、可编程只读存储器(PROM,Programmable Read Only Memory)、可擦除可编程只读存储器(EPROM,Erasable Programmable Read Only Memory)、带电可擦可编程只读存储器(EEPROM,Electrically Erasable Programmable Read Only Memory)、闪存、磁表面存储器、光盘、或光盘只读存储器(CD-ROM,Compact Disk-Read Only Memory)等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以可被存储在保存其它程序或数据的文件的一部分,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。作为示例,可执行指令可被部署为在一个电子设备上执行,或者在位于一个地点的多个电子设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个电子设备上执行。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (17)

  1. 一种视频推荐方法,所述方法由电子设备执行,所述方法包括:
    获取目标对象的对象特征向量、所述目标对象在预设历史时间段内的历史播放序列和待推荐视频库中的每一待推荐视频的视频多目标向量索引;
    对所述历史播放序列进行向量化处理,得到所述目标对象的对象增强向量;
    对所述对象特征向量和所述对象增强向量依次进行向量拼接处理和多目标特征学习,得到所述目标对象的对象多目标向量;
    基于所述对象多目标向量和每一所述待推荐视频的视频多目标向量索引,从所述待推荐视频库中确定对应于所述目标对象的目标推荐视频;
    基于所述目标推荐视频对所述目标对象进行视频推荐。
  2. 根据权利要求1所述的方法,其中,所述方法还包括:
    基于每一所述待推荐视频的视频标识,在预设的特征向量表中进行检索,对应得到每一所述待推荐视频的视频增强向量;
    对每一所述待推荐视频的视频特征向量和所述视频增强向量进行向量拼接处理,对应得到每一所述待推荐视频的视频拼接向量;
    对每一所述待推荐视频的视频拼接向量进行多目标特征学习,对应得到每一所述待推荐视频的视频多目标向量;
    创建与每一所述待推荐视频的视频多目标向量对应的视频多目标向量索引。
  3. 根据权利要求2所述的方法,其中,所述对每一所述待推荐视频的视频拼接向量进行多目标特征学习,对应得到每一所述待推荐视频的视频多目标向量,包括:
    针对每一所述待推荐视频,通过多目标神经网络对所述待推荐视频的视频拼接向量进行多目标特征学习,得到所述待推荐视频在多个目标维度下的视频目标向量;
    获取每个目标维度下的目标权重;
    采用所述目标权重,对每个目标维度下的视频目标向量分别进行加权计算,得到加权视频目标向量;
    对多个目标维度下的加权视频目标向量进行拼接处理,得到所述待推荐视频的视频多目标向量。
  4. 根据权利要求1所述的方法,其中,所述对所述历史播放序列进行向量化处理,得到所述目标对象的对象增强向量,包括:
    获取所述历史播放序列中的每一历史播放视频的历史视频标识和历史播放时长;
    基于每一所述历史视频标识,在预设的特征向量表中进行检索,得到历史视频向量集合;所述历史视频向量集合中的历史视频向量的数量与所述历史播放序列中的历史视频标识的数量相同;
    对所述历史播放序列中的所述历史播放时长进行总时长统计,得到历史播放总时长;
    基于所述历史播放总时长,对每一所述历史播放时长进行时长归一化处理,得到每一历史播放视频的归一化播放时长,并将所述归一化播放时长确定为相应历史播放视频的视频向量权重;
    基于所述视频向量权重,对所述历史视频向量集合中的每一历史视频向量进行加权处理,得到视频加权向量集合;
    对所述视频加权向量集合中的视频加权向量进行合并处理,得到所述目标对象的对象增强向量。
  5. 根据权利要求1所述的方法,其中,所述对所述对象特征向量和所述对象增强向量依次进行向量拼接处理和多目标特征学习,得到所述目标对象的对象多目标向量, 包括:
    对所述对象特征向量和所述对象增强向量进行向量拼接处理,得到对象拼接向量;
    通过多目标神经网络对所述对象拼接向量进行多目标特征学习,得到所述目标对象在多个目标维度下的对象目标向量;
    对所述多个目标维度下的对象目标向量进行拼接处理,得到所述目标对象的对象多目标向量。
  6. 根据权利要求1所述的方法,其中,所述基于所述对象多目标向量和每一所述待推荐视频的视频多目标向量索引,从所述待推荐视频库中确定对应于所述目标对象的目标推荐视频,包括:
    基于所述视频多目标向量索引,获取每一待推荐视频的视频多目标向量;
    确定所述对象多目标向量与每一待推荐视频的视频多目标向量之间的内积,并将所述内积确定为所述目标对象与相应待推荐视频之间的相似度分值;
    根据所述相似度分值,从所述待推荐视频库中选择特定数量的待推荐视频;
    将选择出的所述特定数量的待推荐视频,确定为对应于所述目标对象的目标推荐视频。
  7. 根据权利要求1至6任一项所述的方法,其中,所述视频推荐方法通过视频召回模型来实现;其中,
    所述方法还包括:通过以下方式训练所述视频召回模型:
    获取样本数据,所述样本数据包括:样本对象特征、样本视频特征和多个目标维度下的目标参数;
    将所述样本对象特征输入至所述视频召回模型的对象塔中,通过所述对象塔基于所述样本对象特征,预测样本对象在所述多个目标维度下的样本对象目标向量;
    将所述样本视频特征输入至所述视频召回模型的视频塔中,通过所述视频塔基于所述样本视频特征,预测样本视频在所述多个目标维度下的样本视频目标向量;
    将所述样本对象目标向量、所述样本视频目标向量和所述目标参数输入至目标损失模型中,通过所述目标损失模型进行损失计算,得到目标损失结果;
    基于所述目标损失结果对所述对象塔和所述视频塔中的参数进行修正,得到训练后的视频召回模型。
  8. The method according to claim 7, wherein the obtaining sample data comprises:
    obtaining original sample data, the original sample data comprising a plurality of real positive samples and a plurality of real negative samples;
    constructing random negative samples based on the plurality of real positive samples, and deleting a partial number of real negative samples from the plurality of real negative samples, wherein there is a preset ratio relationship among the number of the real positive samples, the number of the real negative samples after the partial deletion, and the number of the random negative samples;
    determining the real positive samples as positive samples, and determining the real negative samples after the partial deletion and the random negative samples as negative samples;
    performing feature association on the positive samples and the negative samples based on object identifiers and video identifiers to obtain the sample data.
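For illustration of the sample construction in claim 8 only: a sketch that keeps real negatives and draws random negatives under an assumed 1:2:1 preset ratio; the ratio, the pairing of random negatives with positive-sample objects, and the variable names are all assumptions rather than the disclosed scheme:

    import random

    def build_sample_data(real_positives, real_negatives, video_pool, ratio=(1, 2, 1)):
        # real_positives / real_negatives: lists of (object identifier, video identifier);
        # ratio: positives : kept real negatives : random negatives (assumed preset).
        n_pos = len(real_positives)
        kept_negatives = random.sample(real_negatives,
                                       min(len(real_negatives), n_pos * ratio[1] // ratio[0]))
        random_negatives = [(obj, random.choice(video_pool))
                            for obj, _ in real_positives[:n_pos * ratio[2] // ratio[0]]]
        return ([(pair, 1) for pair in real_positives] +
                [(pair, 0) for pair in kept_negatives + random_negatives])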
  9. The method according to claim 7, wherein the sample object target vectors comprise an object click target vector, the sample video target vectors comprise a video click target vector, and the target parameters comprise a click target value; and
    the inputting the sample object target vectors, the sample video target vectors, and the target parameters into the target loss model, and performing loss calculation through the target loss model to obtain the target loss result comprises:
    determining, through the target loss model, a vector inner product between the object click target vector and the video click target vector;
    determining a predicted value in the click dimension based on the vector inner product and a preset activation function;
    determining, through a logarithmic loss function, a logarithmic loss between the predicted value in the click dimension and the click target value;
    determining the logarithmic loss as the target loss result.
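A minimal sketch of the click-dimension loss in claim 9, assuming the preset activation function is a sigmoid (the claim leaves the activation unspecified):

    import numpy as np

    def click_log_loss(object_click_vector, video_click_vector, click_target_value):
        # sigmoid over the vector inner product gives the predicted value in the click dimension
        predicted = 1.0 / (1.0 + np.exp(-np.dot(object_click_vector, video_click_vector)))
        eps = 1e-7  # numerical stability
        return -(click_target_value * np.log(predicted + eps)
                 + (1.0 - click_target_value) * np.log(1.0 - predicted + eps))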
  10. The method according to claim 7, wherein the sample object target vectors comprise an object duration target vector, the sample video target vectors comprise a video duration target vector, and the target parameters comprise a duration target value; and
    the inputting the sample object target vectors, the sample video target vectors, and the target parameters into the target loss model, and performing loss calculation through the target loss model to obtain the target loss result comprises:
    performing truncation processing on the duration target value according to a preset number of truncation intervals to obtain duration truncation values having the number of truncation intervals;
    determining a target truncation value based on the duration truncation values having the number of truncation intervals;
    performing normalization processing on each duration truncation value based on the target truncation value to obtain normalized duration truncation values;
    determining a vector inner product between the object duration target vector and the video duration target vector;
    determining a predicted value in the duration dimension based on the vector inner product and a preset activation function;
    determining, through a mean square error loss function, a mean square error loss between the predicted value in the duration dimension and the normalized duration truncation value;
    determining the mean square error loss as the target loss result.
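One possible reading of the duration-dimension loss in claim 10, sketched for illustration only: the duration is truncated at each of several assumed interval boundaries, the largest truncation value is taken as the target truncation value used for normalization, and a sigmoid is assumed as the preset activation; the boundaries and the averaging over truncation values are assumptions, not the disclosed scheme:

    import numpy as np

    def duration_mse_loss(object_duration_vector, video_duration_vector,
                          duration_target_value, boundaries=(30, 60, 120, 300)):
        # one duration truncation value per preset interval boundary
        truncation_values = [min(duration_target_value, b) for b in boundaries]
        target_truncation = max(truncation_values) or 1.0
        normalized = [t / target_truncation for t in truncation_values]
        predicted = 1.0 / (1.0 + np.exp(-np.dot(object_duration_vector, video_duration_vector)))
        return float(np.mean([(predicted - n) ** 2 for n in normalized]))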
  11. The method according to claim 7, wherein the video recall model further comprises a multi-target network, and the method further comprises:
    when the sample data is a positive sample, outputting, through the multi-target network, target enhancement vectors in the plurality of target dimensions, wherein the target enhancement vectors in the plurality of target dimensions comprise a target enhancement vector corresponding to the object tower and a target enhancement vector corresponding to the video tower;
    in each target dimension, determining a first mean square error between the target enhancement vector corresponding to the object tower and the sample video target vector output by the video tower, or determining a second mean square error between the target enhancement vector corresponding to the video tower and the sample object target vector output by the object tower;
    determining the first mean square error and the second mean square error as an object enhancement loss of the sample object and a video enhancement loss of the sample video respectively, the object enhancement loss and the video enhancement loss constituting part of the target loss result.
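A sketch, per target dimension, of the two enhancement losses in claim 11; the mean over vector components is an assumed reduction:

    import numpy as np

    def enhancement_losses(object_aug_vector, video_aug_vector,
                           object_target_vector, video_target_vector):
        # Computed only for positive samples: each tower's target enhancement vector is
        # pulled towards the other tower's output via a mean squared error.
        object_enhancement_loss = float(np.mean((object_aug_vector - video_target_vector) ** 2))
        video_enhancement_loss = float(np.mean((video_aug_vector - object_target_vector) ** 2))
        return object_enhancement_loss, video_enhancement_loss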
  12. The method according to claim 7, wherein the target loss result comprises a logarithmic loss in the click dimension, a mean square error loss in the duration dimension, an object enhancement loss in the click dimension, a video enhancement loss in the click dimension, an object enhancement loss in the duration dimension, and a video enhancement loss in the duration dimension; and
    the method further comprises:
    obtaining loss weights respectively corresponding to the logarithmic loss in the click dimension, the mean square error loss in the duration dimension, the object enhancement loss in the click dimension, the video enhancement loss in the click dimension, the object enhancement loss in the duration dimension, and the video enhancement loss in the duration dimension;
    obtaining a preset regularization term;
    performing loss fusion processing on the logarithmic loss in the click dimension, the mean square error loss in the duration dimension, the object enhancement loss in the click dimension, the video enhancement loss in the click dimension, the object enhancement loss in the duration dimension, and the video enhancement loss in the duration dimension based on the loss weights and the regularization term, to obtain a fused loss result;
    correcting the parameters in the object tower and the video tower based on the fused loss result to obtain the trained video recall model.
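A minimal sketch of the loss fusion in claim 12 as a weighted sum of the six losses plus the preset regularization term; the key names are illustrative only:

    def fuse_losses(losses, loss_weights, regularization_term):
        # losses / loss_weights: dicts keyed by the six loss names, e.g. "click_log",
        # "duration_mse", "click_object_aug", "click_video_aug", "duration_object_aug",
        # "duration_video_aug" (names are illustrative).
        return sum(loss_weights[name] * value
                   for name, value in losses.items()) + regularization_term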
  13. The method according to claim 7, wherein the method further comprises:
    inputting the sample object target vectors and the sample video target vectors into a recommendation prediction layer of the video recall model, and determining, through the recommendation prediction layer, a click parameter and a video duration parameter of the sample object with respect to the sample video;
    determining a performance indicator value of the video recall model according to the click parameter;
    determining an average head duration of the video recall model according to the video duration parameter;
    performing multiple rounds of cyclic testing based on the performance indicator value and the average head duration, to obtain a target weight in the click dimension and a target weight in the video duration dimension.
  14. A video recommendation apparatus, comprising:
    an obtaining module, configured to obtain an object feature vector of a target object, a historical playback sequence of the target object within a preset historical time period, and a video multi-target vector index of each to-be-recommended video in a to-be-recommended video library;
    a vectorization processing module, configured to perform vectorization processing on the historical playback sequence to obtain an object enhancement vector of the target object;
    a multi-target processing module, configured to perform vector concatenation processing and multi-target feature learning in sequence on the object feature vector and the object enhancement vector to obtain an object multi-target vector of the target object;
    a determining module, configured to determine, based on the object multi-target vector and the video multi-target vector index of each to-be-recommended video, a target recommended video corresponding to the target object from the to-be-recommended video library;
    a video recommendation module, configured to perform video recommendation to the target object based on the target recommended video.
  15. An electronic device, comprising:
    a memory, configured to store executable instructions; and a processor, configured to implement the video recommendation method according to any one of claims 1 to 13 when executing the executable instructions stored in the memory.
  16. A computer-readable storage medium storing executable instructions, the executable instructions being configured to implement the video recommendation method according to any one of claims 1 to 13 when executed by a processor.
  17. A computer program product or computer program, comprising executable instructions stored in a computer-readable storage medium;
    wherein, when a processor of an electronic device reads the executable instructions from the computer-readable storage medium and executes the executable instructions, the video recommendation method according to any one of claims 1 to 13 is implemented.
PCT/CN2023/088886 2022-11-30 2023-04-18 视频推荐方法、装置、电子设备、计算机可读存储介质及计算机程序产品 WO2024113641A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211526679.9 2022-11-30
CN202211526679.9A CN117009575A (zh) 2022-11-30 2022-11-30 一种视频推荐方法、装置及电子设备

Publications (1)

Publication Number Publication Date
WO2024113641A1 true WO2024113641A1 (zh) 2024-06-06

Family

ID=88566118

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/088886 WO2024113641A1 (zh) 2022-11-30 2023-04-18 视频推荐方法、装置、电子设备、计算机可读存储介质及计算机程序产品

Country Status (2)

Country Link
CN (1) CN117009575A (zh)
WO (1) WO2024113641A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117213A (zh) * 2021-11-12 2022-03-01 杭州网易云音乐科技有限公司 一种推荐模型训练、推荐方法、装置、介质和设备
CN114298182A (zh) * 2021-12-17 2022-04-08 北京达佳互联信息技术有限公司 资源召回方法、装置、设备及存储介质
US20220215032A1 (en) * 2020-02-13 2022-07-07 Tencent Technology (Shenzhen) Company Limited Ai-based recommendation method and apparatus, electronic device, and storage medium
CN115114478A (zh) * 2022-07-11 2022-09-27 未来电视有限公司 视频召回方法、装置、电子设备以及存储介质
CN115114461A (zh) * 2022-04-21 2022-09-27 腾讯科技(深圳)有限公司 多媒体数据的推荐方法、设备以及计算机可读存储介质


Also Published As

Publication number Publication date
CN117009575A (zh) 2023-11-07

Similar Documents

Publication Publication Date Title
CN111241311B (zh) 媒体信息推荐方法、装置、电子设备及存储介质
CN111444428B (zh) 基于人工智能的信息推荐方法、装置、电子设备及存储介质
CN111291266B (zh) 基于人工智能的推荐方法、装置、电子设备及存储介质
CN111966914B (zh) 基于人工智能的内容推荐方法、装置和计算机设备
CN111242310B (zh) 特征有效性评估方法、装置、电子设备及存储介质
CN109993627B (zh) 推荐方法、推荐模型的训练方法、装置和存储介质
CN112052387B (zh) 一种内容推荐方法、装置和计算机可读存储介质
CN111429161B (zh) 特征提取方法、特征提取装置、存储介质及电子设备
CN111159563A (zh) 用户兴趣点信息的确定方法、装置、设备及存储介质
CN111695084A (zh) 模型生成方法、信用评分生成方法、装置、设备及存储介质
CN114417058A (zh) 一种视频素材的筛选方法、装置、计算机设备和存储介质
WO2022133178A1 (en) Systems and methods for knowledge distillation using artificial intelligence
CN114817692A (zh) 确定推荐对象的方法、装置和设备及计算机存储介质
CN112330442A (zh) 基于超长行为序列的建模方法及装置、终端、存储介质
CN117573961A (zh) 信息推荐方法、装置、电子设备、存储介质及程序产品
Gupta et al. Machine learning enabled models for YouTube ranking mechanism and views prediction
CN116932906A (zh) 一种搜索词推送方法、装置、设备及存储介质
CN116662527A (zh) 用于生成学习资源的方法及相关产品
Gao et al. [Retracted] Construction of Digital Marketing Recommendation Model Based on Random Forest Algorithm
Hao et al. Deep collaborative online learning resource recommendation based on attention mechanism
WO2024113641A1 (zh) 视频推荐方法、装置、电子设备、计算机可读存储介质及计算机程序产品
CN114595323A (zh) 画像构建、推荐、模型训练方法、装置、设备及存储介质
CN113094584A (zh) 推荐学习资源的确定方法和装置
Dong et al. A hierarchical network with user memory matrix for long sequence recommendation
CN111651643A (zh) 候选内容的处理方法及相关设备