CN115052193B - Video recommendation method, system, device and storage medium

Video recommendation method, system, device and storage medium

Info

Publication number
CN115052193B
CN115052193B (application CN202210575753.XA)
Authority
CN
China
Prior art keywords
classification result
information
emotion
emotion classification
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210575753.XA
Other languages
Chinese (zh)
Other versions
CN115052193A (en)
Inventor
郝德禄
肖冠正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202210575753.XA priority Critical patent/CN115052193B/en
Publication of CN115052193A publication Critical patent/CN115052193A/en
Application granted granted Critical
Publication of CN115052193B publication Critical patent/CN115052193B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • H04N21/4665Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms involving classification methods, e.g. Decision trees
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4668Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Social Psychology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video recommendation method, system, device and storage medium. The method performs matching between eyeball information and preset rules to obtain a first emotion classification result; because this result is determined from eyeball information, its accuracy is high. Face information is extracted from the context information, and the face information and the context information are input into a network model to obtain a second emotion classification result; the emotion type is then obtained by analyzing the first and second emotion classification results together, so that the result based on eyeball information and the result based on context information complement each other and the accuracy of the emotion type is further improved. Finally, a target video is determined and recommended according to the emotion type, providing targeted video recommendation that relieves and improves the user's current emotion. The method has strong applicability and can be widely applied in the technical field of artificial intelligence.

Description

Video recommendation method, system, device and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a video recommendation method, a video recommendation system, a video recommendation device and a storage medium.
Background
Existing video pushing does not classify by user state; videos are pushed in large batches. The video categories clicked by a user are analyzed to infer the user's interests, and once a background processor has learned these interests, similar videos are pushed at a certain frequency. For example, if a user clicks on movie-related short videos many times, the back-end data side detects this and pushes related movie-related short videos.
However, this pushing method cannot adapt to specific situations. In particular, it cannot recommend according to the user's current emotion, so the following may occur: when the user feels bored, boring knowledge-based videos are recommended, making the user even more bored; when the user is agitated, exciting videos are recommended, making the user even more agitated; when the user's mood is low, negative or frightening videos are pushed, which further worsens the user's mood. A solution is therefore needed.
Disclosure of Invention
In view of the above, the present invention aims to provide a video recommendation method, system, device and storage medium for performing targeted video recommendation.
The technical scheme adopted by the embodiment of the invention is as follows:
the video recommendation method comprises the following steps:
acquiring user data; the user data comprises eyeball information and context information, wherein the context information consists of background and face information;
matching processing is carried out according to the eyeball information and a preset rule, and a first emotion classification result is obtained;
extracting the face information from the context information, and inputting the face information and the context information into a network model to obtain a second emotion classification result;
analyzing according to the first emotion classification result and the second emotion classification result to obtain emotion types;
and determining the target video according to the emotion type and recommending the target video.
Further, the matching processing is performed according to the eyeball information and a preset rule to obtain a first emotion classification result, including:
generating a scanned image of the eyeball information;
analyzing the scanned image to obtain pupil states;
matching the pupil state with a preset rule to obtain a first emotion classification result;
the preset rules comprise that pupil constriction represents negative emotion, pupil enlargement represents excitement or fear, and pupil unchanged represents boring.
Further, before the inputting the face information and the context information into the network model, the method further includes:
performing first preprocessing on the face information to obtain first preprocessed face information;
performing second preprocessing on the context information to obtain context information after the second preprocessing;
the first preprocessing and the second preprocessing include cropping and scaling, and the size of the face information is smaller than the size of the context information.
Further, the extracting the face information from the context information, inputting the face information and the context information into a network model, and obtaining a second emotion classification result includes:
inputting the context information into a context RNN to obtain a context feature;
inputting the face information to a face RNN, and obtaining a second emotion classification result through the face RNN according to the face information and the context characteristics;
the network model comprises the context RNN and the face RNN, and the context RNN have a cascade relation with an attention mechanism.
Further, the face RNN includes a plurality of CNN units and LSTM units; the obtaining, by the face RNN, a second emotion classification result according to the face information and the contextual feature includes:
encoding the facial information through the CNN unit, and obtaining an LSTM context vector from the encoding result and an attention-mechanism-based attention operation over the context features;
and outputting a second emotion classification result through the LSTM unit according to the context vector.
Further, the analyzing according to the first emotion classification result and the second emotion classification result to obtain emotion types includes:
when the first emotion classification result is the same as the second emotion classification result, taking the first emotion classification result or the second emotion classification result as an emotion type;
or, alternatively,
and when the first emotion classification result is different from the second emotion classification result, determining a target emotion classification result with higher priority from the first emotion classification result and the second emotion classification result according to the preset priority as an emotion type.
Further, the determining the target video according to the emotion type and recommending the target video comprises the following steps:
acquiring video resources;
classifying the video resources to obtain videos of different video types;
and determining the video of the video type opposite to the emotion type as a target video and recommending the target video.
The embodiment of the invention also provides a video recommendation system, which comprises:
the acquisition module is used for acquiring user data; the user data comprises eyeball information and context information, wherein the context information consists of background and face information;
the processing module is used for carrying out matching processing according to the eyeball information and a preset rule to obtain a first emotion classification result;
the classification module is used for extracting the face information from the context information, inputting the face information and the context information into a network model, and obtaining a second emotion classification result;
the analysis module is used for analyzing according to the first emotion classification result and the second emotion classification result to obtain emotion types;
and the recommending module is used for determining the target video according to the emotion type and recommending the target video.
The embodiment of the invention also provides a video recommending device, which comprises a processor and a memory, wherein at least one instruction, at least one section of program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to realize the method.
Embodiments of the present invention also provide a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the method.
The beneficial effects of the invention are as follows: the user data comprises eyeball information and context information, and the context information consists of background and face information; the eyeball information is matched against preset rules to obtain a first emotion classification result, and determining this result from eyeball information gives high accuracy; the face information is extracted from the context information, and the face information and the context information are input into a network model to obtain a second emotion classification result; the emotion type is obtained by analyzing the first and second emotion classification results together, so that the result based on eyeball information and the result based on context information complement each other and the accuracy of the emotion type is further improved; the target video is determined and recommended according to the emotion type, providing targeted recommendation that relieves and improves the user's current emotion, with strong applicability.
Drawings
FIG. 1 is a flowchart illustrating steps of a video recommendation method according to the present invention;
FIG. 2 is a schematic diagram of a video recommendation method according to an embodiment of the invention.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
As shown in fig. 1, an embodiment of the present invention provides a video recommendation method, including steps S100 to S500:
s100, acquiring user data.
In the embodiment of the invention, the user data comprises eyeball information and context information, and the context information consists of background and face information. Research shows that environmental information, including the surrounding environment and the human body, provides additional cues for identifying emotion more accurately, so the embodiment of the invention collects context information. When user data is acquired, the user's permission is first requested; after permission is obtained, the camera of the user terminal is turned on and one or more pictures, or a video clip of a certain length, is captured, thereby obtaining the user data. In an exemplary embodiment of the invention, taking a captured video clip as an example, a picture sequence formed by a plurality of pictures can be obtained for subsequent processing.
And S200, performing matching processing according to eyeball information and preset rules to obtain a first emotion classification result.
Optionally, step S200 includes steps S210-S230.
S210, generating a scanning image of eyeball information.
Optionally, real-time scanning and capture are performed on the user data collected in real time (or just collected), and a scanned image corresponding to the eyeball information is generated.
S220, analyzing the scanned image to obtain pupil states.
Optionally, the scanned image may be analyzed by a pre-built eyeball analysis model or by expression analysis software to determine the pupil state. The pupil state includes, but is not limited to, pupil movement to the left, pupil movement to the right, pupil constriction, pupil dilation, or no pupil change.
And S230, matching the pupil state with a preset rule to obtain a first emotion classification result.
Optionally, the preset rules include, but are not limited to: pupil constriction characterizes negative emotion (e.g., aversion, dislike, low spirits, sadness, distress, frustration, etc.); pupil dilation characterizes excitement (e.g., pleasure, fondness, excitement, etc.) or fear (when encountering a pleasurable stimulus the pupil enlarges automatically, and when the subject is panicked or agitated the pupil can enlarge to four times its usual size); an unchanged pupil characterizes boredom (e.g., indifference, absence of thought, etc.); pupil movement to the left characterizes recalling; and pupil movement to the right characterizes thinking. In some embodiments, recalling and thinking may each be categorized as boredom, negative emotion, excitement, or fear, and may be set according to different circumstances. It should be noted that these preset rules apply to roughly ninety percent of people, so the analysis accuracy is high.
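As an illustration only, a minimal sketch of the rule matching in step S230 might look as follows; the pupil-state and emotion labels are hypothetical names chosen for the example, not identifiers from the patent.

```python
# Minimal sketch of step S230: map a detected pupil state to the first emotion
# classification result. Labels are illustrative, not taken from the patent.
PRESET_RULES = {
    "constricted": "negative",          # pupil constriction -> negative emotion
    "dilated": "excited_or_fear",       # pupil dilation -> excitement or fear
    "unchanged": "boring",              # unchanged pupil -> boredom
    "moved_left": "recalling",          # pupil moved to the left -> recalling
    "moved_right": "thinking",          # pupil moved to the right -> thinking
}

def first_emotion_classification(pupil_state: str) -> str:
    """Return the first emotion classification result for a detected pupil state."""
    return PRESET_RULES.get(pupil_state, "boring")  # fall back to a neutral label
```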
Optionally, steps S240 to S250 are further included after step S200 and before step S300:
s240, performing first preprocessing on the face information to obtain the face information after the first preprocessing.
In the embodiment of the present invention, the face information is a face stream, which contains the face information detected in each original frame of the video clip, that is, a sequence of face information. The first preprocessing specifically crops the face information and scales it to a first size, giving the face information after the first preprocessing; the first size is 128×128 here, and other sizes may be used in other embodiments.
S250, performing second preprocessing on the context information to obtain the context information after the second preprocessing.
In the embodiment of the present invention, the context information serves as a context stream, which contains each original frame of the video clip. The second preprocessing specifically center-crops the context information and scales it to a second size, giving the context information after the second preprocessing; the second size is 224×224 here, and other sizes may be used in other embodiments.
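For illustration, the two preprocessing steps could be sketched as below; the use of OpenCV and the bounding-box form of the detected face are assumptions of the sketch, not requirements of the patent.

```python
import cv2  # assumed dependency; any image library with crop/resize would work

FACE_SIZE = (128, 128)     # first size, used for the face stream
CONTEXT_SIZE = (224, 224)  # second size, used for the context stream

def preprocess_face(frame, face_box):
    """First preprocessing: crop the detected face region and scale it to 128x128."""
    x, y, w, h = face_box
    face = frame[y:y + h, x:x + w]
    return cv2.resize(face, FACE_SIZE)

def preprocess_context(frame):
    """Second preprocessing: center-crop the full frame and scale it to 224x224."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return cv2.resize(frame[top:top + side, left:left + side], CONTEXT_SIZE)
```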
S300, extracting face information from the context information, and inputting the face information and the context information into the network model to obtain a second emotion classification result.
Optionally, the face information is extracted separately from the context information. As shown in Fig. 2, the network model optionally adopts a CACA-RNN, i.e., a context-aware, cascade attention-based RNN. The model uses a cascade structure and is composed of two neural networks, a face RNN and a context RNN; the face RNN and the context RNN are cascaded through an attention mechanism, and the attention mechanism can locate the relevant context information in the context RNN.
Step S300 includes steps S310-S320:
s310, inputting the context information into the context RNN to obtain the context characteristics.
S320, inputting the face information into a face RNN, and obtaining a second emotion classification result through the face RNN according to the face information and the context characteristics.
Optionally, the context RNN has the same structure as the face RNN, each comprising a plurality of CNN units and LSTM units. Specifically, the context information is input into the context RNN: the CNN units in the context RNN process the input context information, i.e., the images of the different frames (t = 1 to 4), and feed the processing results to the LSTM units in the context RNN, each of which outputs a corresponding context feature. The face information, i.e., the images containing face information that correspond to the different frames (t = 1 to 4), is input to the face RNN; the face information is encoded by the CNN units of the face RNN, the LSTM context vector fed to the LSTM units of the face RNN is obtained from the encoding result and an attention operation over the context features based on the attention mechanism, and the LSTM units then output the second emotion classification result according to the context vector. The smiling-face output shown in Fig. 2 indicates that the second emotion classification result is excitement (e.g., pleasure, fondness, excitement, etc.).
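As a rough illustration of this cascade, a PyTorch-style sketch is given below. The CNN backbone, layer sizes, attention score form, and four-class output are assumptions made for the example; the patent does not fix these details, and the feedback of the previous prediction y_{i-1} described in the formulas below is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class CascadeEmotionRNN(nn.Module):
    """Hypothetical sketch of the cascaded context RNN / face RNN with attention."""

    def __init__(self, hidden=256, num_classes=4):
        super().__init__()
        self.face_cnn = models.resnet18(weights=None)      # CNN unit per face frame
        self.face_cnn.fc = nn.Identity()
        self.context_cnn = models.resnet18(weights=None)   # CNN unit per context frame
        self.context_cnn.fc = nn.Identity()
        self.context_rnn = nn.LSTM(512, hidden, batch_first=True)          # context RNN
        self.face_rnn = nn.LSTM(512 + hidden, hidden, batch_first=True)    # face RNN
        self.attn = nn.Linear(hidden, hidden)               # score() of the attention
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, context_frames, face_frames):
        # context_frames: (B, T, 3, 224, 224); face_frames: (B, T, 3, 128, 128)
        B, T = context_frames.shape[:2]
        ctx_feats = self.context_cnn(context_frames.flatten(0, 1)).view(B, T, -1)
        ctx_hidden, _ = self.context_rnn(ctx_feats)         # s^c_t for t = 1..T

        face_feats = self.face_cnn(face_frames.flatten(0, 1)).view(B, T, -1)
        h = ctx_hidden.new_zeros(B, ctx_hidden.size(-1))
        state = None
        for i in range(T):
            # attention: score the previous face-RNN state against every context state
            scores = torch.bmm(ctx_hidden, self.attn(h).unsqueeze(-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)                               # (B, T)
            c_i = torch.bmm(alpha.unsqueeze(1), ctx_hidden).squeeze(1)      # context vector
            step_in = torch.cat([face_feats[:, i], c_i], dim=-1).unsqueeze(1)
            out, state = self.face_rnn(step_in, state)
            h = out[:, -1]
        return self.classifier(h)    # logits of the second emotion classification result
```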
In the embodiment of the invention, the face RNN uses the conditional probability

$$p(y_i \mid y_1, \dots, y_{i-1}, x^f_{1:i}) = h(y_{i-1}, s^f_i, c_i)$$

where $y_i$ is the prediction generated at output time $i$, $h(\cdot)$ is a nonlinear function, $x^f_{1:i}$ is the sequence of facial information read by the face RNN in time steps 1 to $i$, and $s^f_i$ is the hidden state of the face RNN, expressed as:

$$s^f_i = f(s^f_{i-1}, y_{i-1}, c_i)$$

where $f(\cdot)$ is a nonlinear function, $s^f_{i-1}$ is the hidden state of the face RNN at output time $i-1$, and $c_i$ is the LSTM context vector, expressed as:

$$c_i = \sum_{t=1}^{T} \operatorname{softmax}\big(\operatorname{score}(s^f_{i-1}, s^c_t)\big)\, s^c_t$$

where $T$ is the total number of time steps, $\operatorname{score}(\cdot)$ is the score function, and $s^c_t$ is the hidden state underlying the LSTM context vector at time $t$, expressed as:

$$s^c_t = f(s^c_{t-1}, x^c_t)$$

where $s^c_{t-1}$ is that hidden state at time $t-1$ and $x^c_t$ is the context feature at time $t$; the context RNN reads the context feature sequence $x^c_{1:T}$, so the LSTM context vector attends to the context feature extracted at each time $t$.
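To make the context-vector formula concrete, a small numeric sketch with a dot-product score function follows; the choice of score function is an assumption, since the description leaves score(·) unspecified.

```python
import numpy as np

def lstm_context_vector(s_f_prev, s_c):
    """Compute c_i = sum_t softmax(score(s^f_{i-1}, s^c_t)) * s^c_t with a dot-product score."""
    scores = s_c @ s_f_prev                  # score(s^f_{i-1}, s^c_t) for t = 1..T
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # softmax over the T time steps
    return alpha @ s_c                       # attention-weighted sum of context hidden states

# Toy usage: T = 3 context hidden states of dimension 4.
s_c = np.random.randn(3, 4)
print(lstm_context_vector(np.random.randn(4), s_c).shape)   # (4,)
```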
S400, analyzing according to the first emotion classification result and the second emotion classification result to obtain emotion types.
Optionally, step S400 includes step S410 or S420:
s410, when the first emotion classification result is the same as the second emotion classification result, taking the first emotion classification result or the second emotion classification result as the emotion type.
Specifically, when the first emotion classification result is the same as the second emotion classification result, for example both are negative emotions (such as aversion, dislike, low spirits, sadness, or depression), one of the two results is taken as the emotion type, and the emotion type obtained is a negative emotion.
S420, when the first emotion classification result is different from the second emotion classification result, determining a target emotion classification result with higher priority from the first emotion classification result and the second emotion classification result according to the preset priority as an emotion type.
Specifically, when the first emotion classification result is different from the second emotion classification result, the target emotion classification result may be determined as the emotion type according to a preset priority. For example, the preset priority may be one of the following (a small fusion sketch follows these examples):
1) The priority of the first emotion classification result is higher than that of the second emotion classification result, and the first emotion classification result is determined to be a target emotion classification result and is used as an emotion type;
2) The priority of the second emotion classification result is higher than that of the first emotion classification result, and the second emotion classification result is determined to be a target emotion classification result and is used as an emotion type;
3) Setting the priority from high to low as: negative emotion, excitement or fear, boredom. For example, if the first emotion classification result is boredom and the second emotion classification result is a negative emotion, the second emotion classification result, i.e., the negative emotion, is taken as the emotion type.
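For illustration, the fusion of steps S410 and S420 could be sketched as follows; the labels and the particular ordering follow example 3) above and are otherwise assumptions of the sketch.

```python
# Priority from high to low, as in example 3): negative emotion, excitement/fear, boredom.
PRIORITY = ["negative", "excited_or_fear", "boring"]

def fuse_emotion(first_result: str, second_result: str) -> str:
    """Steps S410/S420: fuse the two classification results into the emotion type."""
    if first_result == second_result:
        return first_result                       # S410: identical results
    # S420: otherwise take the result with the higher preset priority (smaller index)
    return min(first_result, second_result, key=PRIORITY.index)
```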
S500, determining a target video according to the emotion type and recommending the target video.
Optionally, step S500 includes steps S510-S530:
s510, acquiring video resources.
Specifically, the user's terminal, or the software, applet, or web page used by the user, may acquire video resources from the network.
S520, classifying the video resources to obtain videos of different video types.
Specifically, the videos in the video resources can be classified by identifying their labels, or the video resources can be processed by an artificial intelligence algorithm to identify each video's type, thereby obtaining videos of different video types.
And S530, determining the video of the video type opposite to the emotion type as a target video and recommending the target video.
It should be noted that the video type opposite to the emotion type refers to a video type that can help the user improve the current emotion type; the video of that opposite video type is determined as the target video and recommended to the user, and after the user clicks on it, the user's emotion is improved and relaxed. For example (an illustrative mapping sketch follows this list):
when the emotion type is boredom, the corresponding recommended video types are novel and interesting, so that the user no longer feels bored and can relax;
when the emotion type is excitement, the corresponding recommended video types are calm and soothing, so that the user can calm down;
when the emotion type is panic, the corresponding recommended video types are comforting, warm, and positive, soothing the user's panic;
when the emotion type is sadness or distress, the corresponding recommended video types are pleasant, relaxing, and funny, so that the negative emotion is relieved after the user watches the recommended video;
when the emotion type is frustration, the corresponding recommended video types are pleasant, relaxing, funny, and inspiring, so that after watching the recommended video the user can relax, rebuild confidence, regain vitality, and face life positively.
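A minimal illustration of this mapping and the recommendation in step S530 is sketched below; the type labels and the video-record structure are hypothetical choices made for the example.

```python
# Hypothetical mapping from detected emotion type to the "opposite" recommended video types.
OPPOSITE_VIDEO_TYPES = {
    "boring": {"novel", "interesting"},
    "excited": {"calm", "soothing"},
    "panic": {"comforting", "warm", "positive"},
    "sad": {"pleasant", "relaxing", "funny"},
    "frustrated": {"pleasant", "relaxing", "funny", "inspiring"},
}

def recommend_target_videos(videos, emotion_type, top_k=10):
    """Step S530: pick videos whose type opposes the detected emotion type."""
    wanted = OPPOSITE_VIDEO_TYPES.get(emotion_type, set())
    return [v for v in videos if v.get("type") in wanted][:top_k]
```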
The embodiment of the invention also provides a video recommendation system, which comprises:
the acquisition module is used for acquiring user data; the user data comprises eyeball information and context information, wherein the context information comprises background and face information;
the processing module is used for carrying out matching processing according to eyeball information and preset rules to obtain a first emotion classification result;
the classification module is used for extracting facial information from the context information, inputting the facial information and the context information into the network model and obtaining a second emotion classification result;
the analysis module is used for analyzing according to the first emotion classification result and the second emotion classification result to obtain emotion types;
and the recommending module is used for determining the target video according to the emotion type and recommending the target video.
The content of the method embodiment is applicable to the system embodiment; the functions specifically realized by the system embodiment, and the beneficial effects achieved, are the same as those of the method embodiment.
The embodiment of the invention also provides a video recommending device, which comprises a processor and a memory, wherein at least one instruction, at least one section of program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to realize the video recommending method of the previous embodiment. The video recommendation device of the embodiment of the invention comprises, but is not limited to, any intelligent terminal such as a mobile phone, a tablet personal computer, a vehicle-mounted computer and the like.
The content of the method embodiment is applicable to the device embodiment; the functions specifically realized by the device embodiment, and the beneficial effects achieved, are the same as those of the method embodiment.
The embodiment of the invention also provides a computer readable storage medium, in which at least one instruction, at least one section of program, code set or instruction set is stored, and the at least one instruction, the at least one section of program, code set or instruction set is loaded and executed by a processor to implement the video recommendation method of the foregoing embodiment.
Embodiments of the present invention also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the video recommendation method of the foregoing embodiment.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form. The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (8)

1. The video recommendation method is characterized by comprising the following steps:
acquiring user data; the user data comprises eyeball information and context information, wherein the context information consists of background and face information;
matching processing is carried out according to the eyeball information and a preset rule, and a first emotion classification result is obtained;
extracting the face information from the context information, and inputting the face information and the context information into a CACA-RNN model to obtain a second emotion classification result; the CACA-RNN model is a context-aware, cascade attention-based RNN, and before the face information and the context information are input into the CACA-RNN model, the method further comprises:
performing first preprocessing on the face information to obtain first preprocessed face information;
performing second preprocessing on the context information to obtain context information after the second preprocessing;
the first preprocessing and the second preprocessing comprise clipping and scaling, and the size of the face information is smaller than that of the context information;
analyzing according to the first emotion classification result and the second emotion classification result to obtain emotion types;
determining and recommending the target video according to the emotion type, wherein the method comprises the following steps:
acquiring video resources;
classifying the video resources to obtain videos of different video types;
and determining the video of the video type opposite to the emotion type as a target video and recommending the target video.
2. The video recommendation method of claim 1, wherein: the matching processing is performed according to the eyeball information and a preset rule to obtain a first emotion classification result, which comprises the following steps:
generating a scanned image of the eyeball information;
analyzing the scanned image to obtain pupil states;
matching the pupil state with a preset rule to obtain a first emotion classification result;
the preset rules comprise that pupil constriction represents negative emotion, pupil dilation represents excitement or fear, and an unchanged pupil represents boredom.
3. The video recommendation method according to claim 1 or 2, wherein: the extracting the face information from the context information, inputting the face information and the context information into a network model, and obtaining a second emotion classification result, including:
inputting the context information into a context RNN to obtain a context feature;
inputting the face information to a face RNN, and obtaining a second emotion classification result through the face RNN according to the face information and the context characteristics;
the network model comprises the context RNN and the face RNN, and the face RNN and the context RNN are in a cascade relationship with an attention mechanism.
4. The video recommendation method of claim 3, wherein: the face RNN comprises a plurality of CNN units and LSTM units; the obtaining, by the face RNN, a second emotion classification result according to the face information and the contextual feature includes:
encoding the facial information through the CNN unit, and obtaining an LSTM context vector from the encoding result and an attention-mechanism-based attention operation over the context features;
and outputting a second emotion classification result through the LSTM unit according to the context vector.
5. The video recommendation method of claim 1, wherein: analyzing according to the first emotion classification result and the second emotion classification result to obtain emotion types, including:
when the first emotion classification result is the same as the second emotion classification result, taking the first emotion classification result or the second emotion classification result as an emotion type;
or, alternatively,
and when the first emotion classification result is different from the second emotion classification result, determining a target emotion classification result with higher priority from the first emotion classification result and the second emotion classification result according to the preset priority as an emotion type.
6. A video recommendation system, characterized in that the video recommendation method according to any one of claims 1-5 is applied, comprising:
the acquisition module is used for acquiring user data; the user data comprises eyeball information and context information, wherein the context information consists of background and face information;
the processing module is used for carrying out matching processing according to the eyeball information and a preset rule to obtain a first emotion classification result;
the classification module is used for extracting the face information from the context information, inputting the face information and the context information into a CACA-RNN model, and obtaining a second emotion classification result; the CACA-RNN model is a context-aware, cascade attention-based RNN;
the analysis module is used for analyzing according to the first emotion classification result and the second emotion classification result to obtain emotion types;
and the recommending module is used for determining the target video according to the emotion type and recommending the target video.
7. A video recommendation device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, at least one program, code set or instruction set being loaded and executed by the processor to implement the method of any of claims 1-5.
8. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of any of claims 1-5.
CN202210575753.XA 2022-05-25 2022-05-25 Video recommendation method, system, device and storage medium Active CN115052193B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210575753.XA CN115052193B (en) 2022-05-25 2022-05-25 Video recommendation method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210575753.XA CN115052193B (en) 2022-05-25 2022-05-25 Video recommendation method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN115052193A CN115052193A (en) 2022-09-13
CN115052193B true CN115052193B (en) 2023-07-18

Family

ID=83159023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210575753.XA Active CN115052193B (en) 2022-05-25 2022-05-25 Video recommendation method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN115052193B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021217912A1 (en) * 2020-04-28 2021-11-04 深圳壹账通智能科技有限公司 Facial recognition-based information generation method and apparatus, computer device and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423707A (en) * 2017-07-25 2017-12-01 深圳帕罗人工智能科技有限公司 A kind of face Emotion identification method based under complex environment
CN109376621A (en) * 2018-09-30 2019-02-22 北京七鑫易维信息技术有限公司 A kind of sample data generation method, device and robot
CN109376304A (en) * 2018-11-30 2019-02-22 维沃移动通信有限公司 A kind of information recommendation method and device
CN111339847B (en) * 2020-02-14 2023-04-14 福建帝视信息科技有限公司 Face emotion recognition method based on graph convolution neural network
KR102504722B1 (en) * 2020-06-24 2023-02-28 영남대학교 산학협력단 Learning apparatus and method for creating emotion expression video and apparatus and method for emotion expression video creation
CN113723359A (en) * 2021-09-16 2021-11-30 未鲲(上海)科技服务有限公司 User emotion recognition method and device, computer equipment and readable storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021217912A1 (en) * 2020-04-28 2021-11-04 深圳壹账通智能科技有限公司 Facial recognition-based information generation method and apparatus, computer device and storage medium

Also Published As

Publication number Publication date
CN115052193A (en) 2022-09-13

Similar Documents

Publication Publication Date Title
US11393133B2 (en) Emoji manipulation using machine learning
Hussain et al. Automatic understanding of image and video advertisements
US20170098122A1 (en) Analysis of image content with associated manipulation of expression presentation
US11073899B2 (en) Multidevice multimodal emotion services monitoring
CN108197592B (en) Information acquisition method and device
US11430561B2 (en) Remote computing analysis for cognitive state data metrics
CN110719525A (en) Bullet screen expression package generation method, electronic equipment and readable storage medium
US20220101146A1 (en) Neural network training with bias mitigation
CN110351580B (en) Television program topic recommendation method and system based on non-negative matrix factorization
CN111339420A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114567693B (en) Video generation method and device and electronic equipment
CN113900522A (en) Interaction method and device of virtual image
CN115052193B (en) Video recommendation method, system, device and storage medium
CN113573128A (en) Audio processing method, device, terminal and storage medium
CN113761281B (en) Virtual resource processing method, device, medium and electronic equipment
Pijani et al. Inferring attributes with picture metadata embeddings
WO2022002865A1 (en) A system and a method for personalized content presentation
Balfaqih A Hybrid Movies Recommendation System Based on Demographics and Facial Expression Analysis using Machine Learning.
CN111859165A (en) Real-time personalized information flow recommendation method based on user behaviors
CN113407772A (en) Video recommendation model generation method, video recommendation method and device
Gunes et al. 16 automatic analysis of social emotions
CN115878835B (en) Cartoon background music matching method, device and storage medium
CN117873976A (en) Video tag generation method and device, electronic equipment and nonvolatile storage medium
Lee Efficient Deep Learning-Driven Systems for Real-Time Video Expression Recognition
CN113515636A (en) Text data processing method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant