Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 shows a video pair mining process provided in an embodiment of the present application, which specifically includes the following steps:
S101: Acquire sample data of each video pair.
In the present application, in order to be able to mine potential video pairs, sample data of each video pair needs to be acquired first.
It should be noted that the acquired video pair sample data is data on video pairs that have appeared historically. A historical video pair consists of a video that a user asked to watch in a request sent to the server (i.e., the request video) and a video that the user actually watched from among the videos recommended to the user according to the request video (i.e., the click video). One video pair includes exactly one request video and one click video.
The video pair data consists of the video identifier of the request video, the video identifier of the click video, and the click rate of the click video under the condition that the user watches the request video. The click rate is specific to a given video pair: it is the ratio of the number of times the click video in the video pair is watched by users to the number of times that click video is recommended to users according to the request video in the video pair.
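As an illustration of the click rate defined above, the following minimal Python sketch computes the ratio for one video pair. The function name `click_rate`, the counts 20 and 100, and the choice to return 0 when the click video was never recommended are assumptions made for this example and are not taken from the application.

```python
def click_rate(times_watched: int, times_recommended: int) -> float:
    """Ratio of watches of the click video to recommendations of it
    made according to the request video of the same video pair."""
    if times_recommended == 0:
        # Assumption: with no recommendation history the rate is treated as 0.
        return 0.0
    return times_watched / times_recommended


# One sample record: (request video identifier, click video identifier, click rate).
sample = (1, 9, click_rate(times_watched=20, times_recommended=100))
```

With these hypothetical counts the record corresponds to a request video 1, a click video 9, and a click rate of 0.2.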
S102: Determine a probability matrix according to the video identifier of the request video, the video identifier of the click video, and the click rate of the click video.
In the present application, the video pairs to be mined are determined mainly through a transition matrix, and a probability matrix must be established before the transition matrix can be obtained. Therefore, after the sample data of each video pair is acquired, the probability matrix is established according to that sample data.
Further, the present application provides a method for establishing the probability matrix. Specifically, the video identifier of each request video is taken as a row of the probability matrix, the video identifier of each click video is taken as a column of the probability matrix, and the click rate of the click video in the sample data of each video pair is taken as the corresponding element value. For example, assume there are three pieces of video pair data: video pair A, with request video identifier 1, click video identifier 9 and a click rate of 0.2; video pair B, with request video identifier 3, click video identifier 10 and a click rate of 0.5; and video pair C, with request video identifier 5, click video identifier 6 and a click rate of 0.7. The probability matrix determined from these three video pairs in the manner described above is shown in Fig. 2.
It should be noted that if no video pair corresponds to an element position in the probability matrix, that position may be filled with 0 directly.
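The construction described above can be sketched in Python as follows, using the three example video pairs A, B and C. This is only an illustrative sketch: the function name `build_probability_matrix` is hypothetical, and the sketch assumes that a video identifier is used directly as a zero-based index, so row 0 and column 0 stay unused.

```python
# (request video identifier, click video identifier, click rate)
samples = [(1, 9, 0.2), (3, 10, 0.5), (5, 6, 0.7)]


def build_probability_matrix(samples):
    """Rows are request video identifiers, columns are click video identifiers."""
    n_rows = max(request_id for request_id, _, _ in samples) + 1
    n_cols = max(click_id for _, click_id, _ in samples) + 1
    # Positions with no video pair are filled with 0.
    matrix = [[0.0] * n_cols for _ in range(n_rows)]
    for request_id, click_id, rate in samples:
        matrix[request_id][click_id] = rate
    return matrix


matrix = build_probability_matrix(samples)
```

Even for three video pairs the matrix already has 6 rows and 11 columns, almost all of them zero, which illustrates why indexing by raw identifier scales poorly.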
In general, the identifiers of the request videos and of the click videos may reach tens or hundreds of millions, whereas for a given request video the click videos that form video pairs with it may only number in the tens or hundreds. If the probability matrix is established by the method given above, the number of rows and columns of the whole matrix may therefore reach tens or hundreds of millions, which places a heavy computational load on the computer in subsequent processing. For this reason, in the present application, the non-repeated video identifiers of the request videos and of the click videos may be extracted from the sample data of each video pair (that is, no two of the extracted identifiers are the same), and the extracted identifiers are numbered. A correspondence between the video identifier of each request video and its number and a correspondence between the video identifier of each click video and its number are generated, and the probability matrix is determined according to the number corresponding to the video identifier of each request video, the number corresponding to the video identifier of each click video, and the click rate of the click video. In this way, the number of rows and columns of the probability matrix can be effectively reduced, which in turn reduces the computational load on the computer when it subsequently processes the probability matrix. For example, continuing the example above, the non-repeated identifiers extracted from the three video pairs are 1 (request video), 9 (click video), 3 (request video), 10 (click video), 5 (request video) and 6 (click video). These identifiers are then numbered: 1 (the number for request video identifier 1), 2 (the number for click video identifier 9), 3 (the number for request video identifier 3), 4 (the number for click video identifier 10), 5 (the number for request video identifier 5) and 6 (the number for click video identifier 6).
It should be noted that, when numbering the identifiers of the request videos and the click videos, the numbers may be consecutive without starting from 1 (e.g., 2, 3, 4, 5) or non-consecutive (e.g., 2, 4, 6, 8), but neither scheme minimizes the number of rows and columns of the probability matrix. The optimal scheme numbers the video identifiers of all request videos and click videos consecutively starting from 1, so that the established probability matrix has the fewest rows and columns, which effectively reduces the subsequent computational load on the computer. It should also be noted that, although numbering starts from 1 in theory, in actual computer processing numbering starts from 0; the number 0 denotes the first row or first column of the probability matrix and is equivalent to the number 1 in theory.
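The numbering step can be sketched in Python as follows. The sketch assumes a single shared numbering over request and click video identifiers in order of first appearance, which matches the example numbers given above shifted to start from 0; the function name `number_identifiers` is hypothetical.

```python
samples = [(1, 9, 0.2), (3, 10, 0.5), (5, 6, 0.7)]


def number_identifiers(samples):
    """Assign consecutive numbers, starting from 0, to non-repeated identifiers."""
    id_to_number = {}
    for request_id, click_id, _rate in samples:
        for video_id in (request_id, click_id):
            if video_id not in id_to_number:
                # The next free number equals the count of identifiers seen so far.
                id_to_number[video_id] = len(id_to_number)
    # Reverse correspondence, needed later to restore identifiers from numbers.
    number_to_id = {number: video_id for video_id, number in id_to_number.items()}
    return id_to_number, number_to_id


id_to_number, number_to_id = number_identifiers(samples)
```

Keeping both directions of the correspondence is a design convenience: the forward map is used when the matrix is built, and the reverse map when video pairs are restored from the transition matrix.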
Further, the present application provides an embodiment for determining the probability matrix from the numbers. Specifically, the number corresponding to the video identifier of each request video is taken as a row of the probability matrix, the number corresponding to the video identifier of each click video is taken as a column of the probability matrix, and the click rate of the click video in the sample data of each video pair is taken as the corresponding element value.
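A minimal Python sketch of this embodiment is given below. The correspondence `id_to_number` is written out by hand for the three example video pairs (zero-based, shared between request and click videos), and the function name `build_numbered_matrix` is hypothetical; both are assumptions for illustration.

```python
samples = [(1, 9, 0.2), (3, 10, 0.5), (5, 6, 0.7)]
# Assumed correspondence between video identifiers and zero-based numbers.
id_to_number = {1: 0, 9: 1, 3: 2, 10: 3, 5: 4, 6: 5}


def build_numbered_matrix(samples, id_to_number):
    """Rows are request video numbers, columns are click video numbers."""
    rows = [id_to_number[request_id] for request_id, _, _ in samples]
    cols = [id_to_number[click_id] for _, click_id, _ in samples]
    matrix = [[0.0] * (max(cols) + 1) for _ in range(max(rows) + 1)]
    for request_id, click_id, rate in samples:
        matrix[id_to_number[request_id]][id_to_number[click_id]] = rate
    return matrix


numbered_matrix = build_numbered_matrix(samples, id_to_number)
```

For the example data the result has 5 rows and 6 columns, so it is much smaller than a matrix indexed by raw identifiers but is not yet square.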
S103: Calculate the N-step transition matrix corresponding to the probability matrix.
After the probability matrix is established, the N-step transition matrix corresponding to the probability matrix, that is, the product of N copies of the probability matrix, needs to be calculated, where N is an integer greater than 1.
However, the number of valid rows of the probability matrix established in step S102 usually differs from its number of valid columns (here, the largest row index among the non-zero elements is the number of valid rows, and the largest column index among the non-zero elements is the number of valid columns). When the two differ, the N-step transition matrix cannot be calculated directly. Therefore, in the present application, before the N-step transition matrix is calculated, the probability matrix is expanded by rows or columns: if the number of rows is greater than the number of columns, columns are added until the total number of columns equals the total number of rows; if the number of columns is greater than the number of rows, rows are added until the total number of rows equals the total number of columns. All elements in the added rows or columns are filled with 0. In actual computer storage, however, to reduce the additional storage overhead, the 0 elements of the probability matrix are not stored, and only the non-zero elements are stored.
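The expansion to a square matrix and the storage of only non-zero elements can be sketched in Python as follows. The function names `pad_to_square` and `to_sparse` and the dictionary-based sparse layout are assumptions chosen for illustration; the application does not prescribe a particular storage format.

```python
def pad_to_square(matrix):
    """Add zero-filled rows or columns until rows == columns."""
    n_rows = len(matrix)
    n_cols = max((len(row) for row in matrix), default=0)
    size = max(n_rows, n_cols)
    # Extend every existing row with zeros, then append zero rows if needed.
    padded = [list(row) + [0.0] * (size - len(row)) for row in matrix]
    padded += [[0.0] * size for _ in range(size - n_rows)]
    return padded


def to_sparse(matrix):
    """Keep only non-zero elements, keyed by (row, column)."""
    return {(i, j): v
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0}


wide = [[0.0] * 6 for _ in range(5)]   # 5 rows, 6 columns
wide[0][1], wide[2][3], wide[4][5] = 0.2, 0.5, 0.7
square = pad_to_square(wide)           # 6 rows, 6 columns
stored = to_sparse(square)             # only the three non-zero elements
```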
It should be noted that N may be set according to actual conditions. In general, N may be set to 2 if one layer of indirect request-click relationship is to be mined. For example, request video A and click video B form a video pair, and request video B and click video C form a video pair, where request video B and click video B are the same video; request video A and click video C then have one layer of indirect request-click relationship. More generally, if n layers of indirect request-click relationship are to be mined, N may be set to n + 1.
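The A, B, C example above can be checked with a small Python sketch. Videos A, B and C are assumed to be numbered 0, 1 and 2, and the click rates 0.4 and 0.5 are hypothetical values chosen for the illustration.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


# Row = request video, column = click video; A=0, B=1, C=2.
M = [
    [0.0, 0.4, 0.0],   # A -> B with click rate 0.4
    [0.0, 0.0, 0.5],   # B -> C with click rate 0.5
    [0.0, 0.0, 0.0],
]

two_step = matmul(M, M)   # N = 2: one layer of indirect relationship
```

Here `two_step[0][2]` is non-zero (0.4 times 0.5), which is the indirect pair of request video A and click video C that does not appear in `M` itself.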
Further, multiplying N probability matrices in series requires N-1 multiplication operations. To reduce the number of multiplication operations, the N probability matrices may be grouped so that at least two of the groups contain the same number of probability matrices. Among groups containing the same number of probability matrices, only one group is retained for calculation and the repeated groups are removed; for each retained group, all probability matrices in the group are multiplied together, and the resulting product is reused for the removed groups of the same size. The products corresponding to all of the groups are then multiplied together, and the result is taken as the N-step transition matrix corresponding to the probability matrix. For example, assume that the 7-step transition matrix needs to be calculated and M denotes the probability matrix. The 7 probability matrices can be divided into four groups: the first, second and third groups each contain two probability matrices, and the fourth group contains one. Because the first three groups contain the same number of probability matrices, any two of them are removed, and only the probability matrices in the remaining one are multiplied; the resulting product is used directly as the product of the other two groups, so their matrices need not be multiplied again. The fourth group contains a single matrix and needs no multiplication. Finally, the products corresponding to the four groups are multiplied together, that is, the product of the first group is multiplied by the product of the second group, then by the product of the third group, then by the matrix of the fourth group, and the result is taken as the 7-step transition matrix corresponding to the probability matrix. The 7-step transition matrix is thus obtained with only four multiplication operations instead of six, which reduces the number of operations.
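The grouping idea can be sketched in Python as follows, with a counter that records how many matrix multiplications are performed. The function names `power_chain` and `grouped_power`, the fixed group size of 2, and the small integer test matrix are assumptions for illustration only.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


def power_chain(matrix, k):
    """matrix multiplied k times in series; returns (product, multiplications)."""
    result = matrix
    for _ in range(k - 1):
        result = matmul(result, matrix)
    return result, k - 1


def grouped_power(matrix, n, group_size=2):
    """N-step product that computes the product of equal-sized groups only once."""
    if n < 1 or group_size < 1:
        raise ValueError("n and group_size must be positive")
    full_groups, remainder = divmod(n, group_size)
    factors, count = [], 0
    if full_groups:
        group_product, used = power_chain(matrix, group_size)  # computed once
        count += used
        factors += [group_product] * full_groups               # reused for repeats
    if remainder:
        rest, used = power_chain(matrix, remainder)
        count += used
        factors.append(rest)
    result = factors[0]
    for factor in factors[1:]:
        result = matmul(result, factor)
        count += 1
    return result, count


M = [[0, 1], [1, 1]]
seven_step, multiplications = grouped_power(M, 7)   # groups of 2, 2, 2 and 1
naive, naive_multiplications = power_chain(M, 7)
```

For N = 7 the grouped version performs four multiplications where the serial version performs six, while both return the same matrix.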
In addition, the present application also provides a section of computer code with which the computer multiplies the probability matrix by itself once, as follows:
// read data
val coorMatrix = MTUtils.loadCoordinateMatrix(sc, args(0))
// conversion format CoordinateMatrix → DenseVecMatrix
val denseVecMatrix = coorMatrixTODenseVecMatrix(coorMatrix, row, cols)
// conversion format DenseVecMatrix → SparseVecMatrix
val sparseVecMatrix = denseVecMatrix.toSparseVecMatrix
val leftMatrix = sparseVecMatrix
val rightMatrix = leftMatrix
// matrix multiplication
val multiplyResult = leftMatrix.multiplySparse(rightMatrix)
In the computer code, coorMatrixTODenseVecMatrix is a self-defined function that forcibly specifies the number of rows and columns of the converted matrix. This avoids dimension mismatches during subsequent multiplication, which would otherwise arise when the original conversion produces fewer rows or columns than expected because the actual data contains no valid data at the boundary positions. The corresponding operations are as follows:
S104: Mine video pairs according to the calculated N-step transition matrix and the acquired sample data of each video pair.
After the N-step transition matrix is obtained, video pair data needs to be restored from it, that is, video pair data is established according to the N-step transition matrix.
Further, if in step S102 the probability matrix was determined by taking the video identifier of each request video as a row, the video identifier of each click video as a column, and the click rate of the click video in the sample data of each video pair as the element value, then video pair data is established from the N-step transition matrix as follows. For each non-zero element in the N-step transition matrix, the video identifier of the request video corresponding to the element is determined according to the row of the probability matrix assigned to each request video identifier; the video identifier of the click video corresponding to the element is determined according to the column assigned to each click video identifier; and the value of the element is taken as the click rate of that click video. Video pair data is established from the determined request video identifier, click video identifier and click rate, and video pairs are mined according to the established video pair data and the acquired video pair sample data.
If in step S102 the probability matrix was instead determined by taking the number corresponding to the video identifier of each request video as a row, the number corresponding to the video identifier of each click video as a column, and the click rate of the click video in the sample data of each video pair as the element value, then video pair data is established from the N-step transition matrix as follows. For each non-zero element in the N-step transition matrix, the number of the request video corresponding to the element is determined according to the row number of the element, the number of the click video corresponding to the element is determined according to the column number of the element, and the click rate of that click video is determined according to the value of the element. The video identifier of the request video is then determined from its number according to the generated correspondence between request video identifiers and numbers, and the video identifier of the click video is determined from its number according to the generated correspondence between click video identifiers and numbers. Video pair data is established from the determined video identifier of the request video, the video identifier of the click video and the click rate of the click video, and video pairs are mined according to the established video pair data and the acquired video pair sample data.
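Restoring video pair data from a numbered transition matrix can be sketched in Python as follows. The identifiers 101, 102 and 103, the function name `restore_video_pairs`, and the use of one shared correspondence `number_to_id` for both rows and columns are assumptions made for this example.

```python
def restore_video_pairs(transition, number_to_id):
    """Turn each non-zero element into (request id, click id, click rate)."""
    pairs = []
    for row_number, row in enumerate(transition):
        for col_number, value in enumerate(row):
            if value != 0:
                request_id = number_to_id[row_number]   # row number -> request video
                click_id = number_to_id[col_number]     # column number -> click video
                pairs.append((request_id, click_id, value))
    return pairs


number_to_id = {0: 101, 1: 102, 2: 103}
transition = [
    [0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
established = restore_video_pairs(transition, number_to_id)
```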
Further, after the video pair data is established according to the N-step transition matrix, video pairs need to be mined according to the established video pair data and the acquired video pair sample data. Specifically, the established video pair data is matched against the acquired video pair sample data, and the video pair data that is inconsistent with the acquired video pair sample data is identified among the established video pair data.
There are two main cases of video pair data that is inconsistent with the acquired video pair sample data. In the first case, the click rate of the click video in the established video pair data differs from the click rate of the click video in the acquired video pair sample data. Therefore, the video pair data whose click rate differs from that in the acquired video pair sample data is determined. Subsequently, according to the video identifier of the request video and the video identifier of the click video in the determined video pair data, the corresponding video pair sample data is searched for in the video recommendation system containing the video pair sample data, and that sample data is replaced with the determined video pair data.
It should be noted that, when the video pair sample data is replaced with the determined video pair data, either the video identifier of the request video, the video identifier of the click video and the click rate of the click video in the sample data may all be replaced with those in the determined video pair data, or only the click rate of the click video in the sample data may be replaced with the click rate of the click video in the determined video pair data.
In the second case, at least one of the video identifier of the request video and the video identifier of the click video in the established video pair data differs from those in the acquired video pair sample data; that is, the established video pair does not appear in the sample data. Therefore, such video pair data is determined among the established video pair data and may then be added to the video recommendation system.
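The two cases can be sketched together in Python as follows. The sketch models the video recommendation system as a dictionary keyed by (request identifier, click identifier), which is an assumption made for illustration; the function name `mine_video_pairs` and the sample values are likewise hypothetical.

```python
def mine_video_pairs(established, system):
    """Update `system` in place; return (pairs with replaced rate, pairs added)."""
    updated, added = [], []
    for request_id, click_id, rate in established:
        key = (request_id, click_id)
        if key not in system:
            # Second case: the pair is not in the sample data, so it is added.
            system[key] = rate
            added.append(key)
        elif system[key] != rate:
            # First case: same pair, different click rate, so the rate is replaced.
            system[key] = rate
            updated.append(key)
    return updated, added


system = {(1, 9): 0.2, (3, 10): 0.5}
established = [(1, 9, 0.3), (3, 10, 0.5), (1, 10, 0.1)]
updated, added = mine_video_pairs(established, system)
```

In this sketch a pair that matches the sample data exactly, such as (3, 10, 0.5), is left untouched.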
By the method, potential video pairs can be effectively mined, namely, the video pairs with indirect request click relation are effectively mined, so that the number of the video pairs in the video recommendation system is enriched, and the recommendation accuracy of the video recommendation system can be improved.
In addition, in practical application, enriching and expanding the video pairs in the video recommendation system can be used not only to improve the accuracy of the video recommendation system but also to increase the number of videos that can be recommended when a user watches a request video, because the click videos corresponding to a given request video may increase after the video pairs are mined. For example, assume that when a user watches a certain request video, the website is to recommend ten click videos related to that request video. Before the video pairs are mined, only five click videos in the video recommendation system are related to the request video, so the system recommends those five and must also find five videos unrelated to the request video to fill the list. After the video pairs are mined, more than ten click videos may be related to the request video, so the system can recommend ten related click videos to the user. This avoids the situation in which the prediction model used by the video recommendation system yields too few recommended videos.
Based on the same idea as the video pair mining method provided in the embodiments above, an embodiment of the present application further provides a video pair mining device, as shown in Fig. 3.
Fig. 3 is a schematic structural diagram of a video pair mining device according to an embodiment of the present application, where the video pair mining device includes:
the obtaining module 201 is configured to obtain sample data of each video pair, where the sample data of the video pair includes a video identifier of a requested video, a video identifier of a clicked video, and a click rate of the clicked video when a user watches the requested video;
a determining module 202, configured to determine a probability matrix according to the video identifier of the request video, the video identifier of the clicked video, and the click rate of the clicked video;
a calculating module 203, configured to calculate an N-step transition matrix corresponding to the probability matrix, where N is an integer greater than 1;
and the mining module 204 is configured to mine video pairs according to the calculated N-step transition matrix and the obtained sample data of each video pair, where a video pair is composed of a request video and a click video, and the click video is a video watched by the user from among the videos recommended to the user according to the request video.
The determining module 202 is specifically configured to determine the probability matrix by taking the video identifier of each request video as a row of the probability matrix, taking the video identifier of each click video as a column of the probability matrix, and taking the click rate of the click video in the sample data of each video pair as an element value of the probability matrix.
The determining module 202 is specifically configured to extract the non-repeated video identifiers of the request videos and of the click videos from the sample data of each video pair, number the extracted video identifiers of the request videos and of the click videos, generate a correspondence between the video identifier of each request video and its number and a correspondence between the video identifier of each click video and its number, and determine the probability matrix according to the number corresponding to the video identifier of each request video, the number corresponding to the video identifier of each click video, and the click rate of the click video.
The determining module 202 is specifically configured to determine the probability matrix by taking the number corresponding to the video identifier of each request video as a row of the probability matrix, taking the number corresponding to the video identifier of each click video as a column of the probability matrix, and taking the click rate of the click video in the sample data of each video pair as an element value of the probability matrix.
The calculating module 203 is specifically configured to group the N probability matrices, where at least two of the groups include the same number of probability matrices; remove the repeated groups among those including the same number of probability matrices; for each remaining group, multiply all probability matrices included in the group; multiply together the products corresponding to all of the groups; and use the result as the N-step transition matrix corresponding to the probability matrix.
The mining module 204 is specifically configured to, for each non-zero element in the N-step transition matrix, determine the video identifier of the request video corresponding to the element according to the row of the probability matrix assigned to each request video identifier, determine the video identifier of the click video corresponding to the element according to the column assigned to each click video identifier, take the value of the element as the click rate of the click video corresponding to the element, establish video pair data according to the determined video identifier of the request video, the video identifier of the click video and the click rate of the click video, and mine video pairs according to the established video pair data and the acquired video pair sample data.
The mining module 204 is specifically configured to, for each non-zero element in the N-step transition matrix, determine the number of the request video corresponding to the element according to the row number of the element, determine the number of the click video corresponding to the element according to the column number of the element, determine the click rate of the click video corresponding to the element according to the value of the element, determine the video identifier of the request video from its number according to the generated correspondence between request video identifiers and numbers, determine the video identifier of the click video from its number according to the generated correspondence between click video identifiers and numbers, establish video pair data according to the determined video identifier of the request video, the video identifier of the click video and the click rate of the click video, and mine video pairs according to the established video pair data and the acquired video pair sample data.
The mining module 204 is specifically configured to match the established video pair data against the acquired video pair sample data, and determine, among the established video pair data, the video pair data that is inconsistent with the acquired video pair sample data.
The mining module 204 is specifically configured to determine the video pair data in which the click rate of the click video in the established video pair data differs from the click rate of the click video in the acquired video pair sample data, search, according to the video identifier of the request video and the video identifier of the click video in the determined video pair data, for the corresponding video pair sample data in a video recommendation system containing the video pair sample data, and replace that video pair sample data with the determined video pair data.
The mining module 204 is specifically configured to determine the video pair data in which at least one of the video identifier of the request video and the video identifier of the click video in the established video pair data differs from the video identifier of the request video and the video identifier of the click video in the acquired video pair sample data, and add the determined video pair data to the video recommendation system.
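The division of the device into four modules can be sketched as a small Python class, with one method per module. This is only a sketch under stated assumptions: the class name `VideoPairMiningDevice`, the in-memory list used as the data source, the shared zero-based numbering, and the serial multiplication in `calculate` are all illustrative choices rather than the structure of Fig. 3.

```python
class VideoPairMiningDevice:
    def __init__(self, n=2):
        if n < 2:
            raise ValueError("N must be an integer greater than 1")
        self.n = n

    def obtain(self, source):
        """Obtaining module 201: collect (request id, click id, click rate)."""
        return list(source)

    def determine(self, samples):
        """Determining module 202: number identifiers, build a square matrix."""
        ids = {}
        for request_id, click_id, _ in samples:
            ids.setdefault(request_id, len(ids))
            ids.setdefault(click_id, len(ids))
        matrix = [[0.0] * len(ids) for _ in ids]
        for request_id, click_id, rate in samples:
            matrix[ids[request_id]][ids[click_id]] = rate
        return matrix, {number: video_id for video_id, number in ids.items()}

    def calculate(self, matrix):
        """Calculating module 203: N-step transition matrix by serial products."""
        size, result = len(matrix), matrix
        for _ in range(self.n - 1):
            result = [[sum(result[i][k] * matrix[k][j] for k in range(size))
                       for j in range(size)] for i in range(size)]
        return result

    def mine(self, transition, number_to_id, samples):
        """Mining module 204: keep restored pairs that differ from the samples."""
        known = {(r, c): rate for r, c, rate in samples}
        mined = []
        for i, row in enumerate(transition):
            for j, value in enumerate(row):
                pair = (number_to_id[i], number_to_id[j])
                if value != 0 and known.get(pair) != value:
                    mined.append((pair[0], pair[1], value))
        return mined

    def run(self, source):
        samples = self.obtain(source)
        matrix, number_to_id = self.determine(samples)
        return self.mine(self.calculate(matrix), number_to_id, samples)


mined = VideoPairMiningDevice(n=2).run([(1, 2, 0.4), (2, 3, 0.5)])
```

With the two hypothetical sample pairs, the only mined pair is request video 1 with click video 3, obtained through the shared video 2.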
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store data that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.