WO2023185257A1 - Data processing method, and device and computer-readable storage medium


Info

Publication number
WO2023185257A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
text
quality
shared
candidate
Application number
PCT/CN2023/074763
Other languages
French (fr)
Chinese (zh)
Inventor
陈小帅
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023185257A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/176: Support for shared access to files; File sharing support
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of Internet technology, and in particular, to a data processing method, equipment and computer-readable storage medium.
  • Computer vision technology (Computer Vision, CV) is a science that studies how to make machines "see". More specifically, it uses cameras and computers in place of human eyes to identify and measure targets and perform other machine vision tasks, and further processes the resulting graphics so that they become images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (Optical Character Recognition, OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, intelligent transportation and other technologies, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
  • Video sharing means that a browsing object, while browsing a video in a video application, shares the video with other browsing objects. Video sharing is a main way for browsing objects to communicate, and it has a large impact on the object activity and playback of the video application.
  • Embodiments of the present application provide a data processing method, equipment, and computer-readable storage media, which can save network transmission resources and processing resources of shared data receiving devices on the premise of improving video sharing efficiency and sharing effects.
  • Embodiments of the present application provide a data processing method, executed by a computer device, which includes:
  • obtaining at least two video clips in a video, determining the first sharing quality corresponding to each of the at least two video clips, and selecting at least one video clip from the at least two video clips as a candidate video clip according to the first sharing quality;
  • obtaining an object tag text sequence associated with the video, where the object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share; the object tag text of the browsing object is used to characterize the interest of the browsing object, and the object tag text of the shared object is used to characterize the interest of the shared object;
  • determining, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and selecting at least one candidate video clip from the candidate video clips as a candidate shared video clip according to the second sharing quality corresponding to each candidate video clip; the second sharing quality is used to characterize the correlation between the candidate video clip and the object tag text of the shared object;
  • determining, according to the object tag text sequence and the candidate shared video clips, the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip; the third sharing quality is used to characterize the degree to which the auxiliary description information matches the candidate shared video clip and the object tag text of the shared object;
  • determining the shared video clip from the candidate shared video clips according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, and determining the shared video clip and the auxiliary description information corresponding to the shared video clip as the shared data to be sent to the shared object.
  • Embodiments of the present application also provide a data processing device, including:
  • a first acquisition module, configured to acquire at least two video clips in a video, determine the first sharing quality corresponding to each of the at least two video clips, and select at least one video clip from the at least two video clips as a candidate video clip according to the first sharing quality;
  • a second acquisition module, configured to obtain an object tag text sequence associated with the video, the object tag text sequence including the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share, where the object tag text of the browsing object is used to characterize the interest of the browsing object and the object tag text of the shared object is used to characterize the interest of the shared object; and further configured to determine, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and to select at least one candidate video clip from the candidate video clips as a candidate shared video clip according to the second sharing quality corresponding to each candidate video clip, where the second sharing quality is used to characterize the correlation between the candidate video clip and the object tag text of the shared object;
  • a first determination module, configured to determine, according to the object tag text sequence and the candidate shared video clips, the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip, where the third sharing quality is used to characterize the degree to which the auxiliary description information matches the candidate shared video clip and the object tag text of the shared object;
  • a second determination module, configured to determine the shared video clip from the candidate shared video clips according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, and to determine the shared video clip and the auxiliary description information corresponding to the shared video clip as the shared data to be sent to the shared object.
  • An embodiment of the present application also provides a computer device, including: a processor, a memory, and a network interface;
  • The above-mentioned processor is connected to the above-mentioned memory and the above-mentioned network interface, wherein the above-mentioned network interface is used to provide data communication functions, the above-mentioned memory is used to store a computer program, and the above-mentioned processor is used to call the computer program so as to cause the computer device to execute the method in the embodiments of the present application.
  • embodiments of the present application provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program is suitable for being loaded by a processor and executing the method in the embodiment of the present application.
  • Embodiments of the present application also provide a computer program product.
  • The computer program product includes a computer program, and the computer program is stored in a computer-readable storage medium; the processor of the computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device executes the method in the embodiments of the present application.
  • Figure 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of a data processing scenario provided by an embodiment of the present application.
  • Figure 3 is a schematic flow chart of a data processing method provided by an embodiment of the present application.
  • Figure 4 is a schematic model structure diagram of a first video recognition sub-model provided by an embodiment of the present application.
  • Figure 5 is a schematic model structure diagram of a second video recognition sub-model provided by an embodiment of the present application.
  • Figure 6 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • Figure 7 is a schematic model structure diagram of a fourth video recognition sub-model provided by an embodiment of the present application.
  • Figure 8 is a schematic model structure diagram of a fifth video recognition sub-model provided by an embodiment of the present application.
  • Figure 9 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • Figure 11 is another structural schematic diagram of a data processing device provided by an embodiment of the present application.
  • Figure 12 is another structural schematic diagram of a data processing device provided by an embodiment of the present application.
  • Figure 13 is another structural schematic diagram of a data processing device provided by an embodiment of the present application.
  • Figure 14 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • In the existing video sharing process, the entire video is shared with friends, and the auxiliary description information carried with it is built in advance by the operating platform corresponding to the video application. Sharing the entire video occupies too many network resources and therefore reduces the sharing efficiency of the video; and because the same auxiliary description information is shared with different objects, the sharing display style is too uniform and the sharing effect is reduced.
  • In the embodiments of the present application, the computer device determines the first sharing quality corresponding to at least two video clips in the video, so candidate video clips can be determined from the at least two video clips according to the first sharing quality. It can be understood that a candidate video clip belongs to the video and its sharing value (quality) is better than that of the video as a whole. Further, the computer device obtains the object tag text sequence associated with the video and, according to the object tag text sequence and the candidate video clips, determines the second sharing quality corresponding to each candidate video clip, so candidate shared video clips can be determined from the candidate video clips according to the second sharing quality.
  • A candidate shared video clip is determined not only based on the video content of the candidate video clip but also based on the object tag text sequence, so its sharing value (quality) is better than that of the candidate video clip. Further, the computer device determines the third sharing quality corresponding to each candidate shared video clip based on the object tag text sequence and the candidate shared video clips, and determines the auxiliary description information corresponding to each candidate shared video clip according to the third sharing quality.
  • The auxiliary description information is therefore associated not only with the candidate shared video clip but also with the object tag text sequence. Further, the computer device determines the shared video clip from the candidate shared video clips according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, and determines the shared video clip and the auxiliary description information corresponding to the shared video clip as the shared data to be sent to the shared object.
  • In summary, the shared data in this application is determined based on sharing qualities of different dimensions; it is associated not only with the video content of the shared video clip itself but also with the object tag text sequence, so sharing this data can improve the sharing efficiency and the sharing effect of the video. Moreover, since a video clip rather than the entire video is shared, network transmission resources and the processing resources of the device receiving the shared data can be saved.
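  • The selection flow summarized above can be read as a staged filtering pipeline. The minimal Python sketch below illustrates that flow; the scoring callables, the threshold values and the equal weighting of the three sharing qualities are illustrative assumptions, not values fixed by this application.

```python
# Illustrative sketch of the staged selection pipeline described above.
# The three scoring functions are passed in as callables; they stand in for the
# first, second and third video recognition sub-models described later.

def select_shared_data(video_clips, tag_text_sequence,
                       first_quality, second_quality, third_quality_and_description,
                       first_threshold=0.8, second_threshold=0.85,
                       weights=(1.0, 1.0, 1.0)):
    # Stage 1: first sharing quality, computed from the clip content alone.
    candidates = [c for c in video_clips if first_quality(c) >= first_threshold]

    # Stage 2: second sharing quality, relevance to the object tag text sequence.
    shared_candidates = [c for c in candidates
                         if second_quality(c, tag_text_sequence) >= second_threshold]

    # Stage 3: third sharing quality and auxiliary description information.
    scored = []
    for clip in shared_candidates:
        q3, aux = third_quality_and_description(clip, tag_text_sequence)
        total = (weights[0] * first_quality(clip)
                 + weights[1] * second_quality(clip, tag_text_sequence)
                 + weights[2] * q3)
        scored.append((total, clip, aux))

    # Stage 4: the clip with the maximum total sharing quality becomes the shared
    # video clip and is returned with its auxiliary description information.
    best_total, best_clip, best_aux = max(scored, key=lambda item: item[0])
    return best_clip, best_aux
```

  • In this sketch, first_quality could for example return a predicted interaction rate and third_quality_and_description could return a (match score, copywriting) pair, mirroring the sub-models described below.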
  • Figure 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the system may include a business server 100 and a terminal device cluster.
  • The terminal device cluster may include: terminal device 200a, terminal device 200b, terminal device 200c, ..., terminal device 200n. It can be understood that the above system can include one or more terminal devices, and this application does not limit the number of terminal devices.
  • any terminal device in the terminal device cluster may have a communication connection with the service server 100.
  • The above-mentioned communication connection is not limited to a particular connection method; it may be established directly or indirectly through wired communication, directly or indirectly through wireless communication, or through other methods, which is not limited in this application.
  • each terminal device in the terminal device cluster as shown in Figure 1 can be installed with an application client.
  • When the application client runs in each terminal device, it can exchange data with the business server 100 shown in Figure 1 through the above communication connection.
  • the application client can be a video application, a live broadcast application, a social networking application, an instant messaging application, a game application, a music application, a shopping application, a novel application, a browser, and other application clients with a video loading function.
  • The application client can be an independent client, or it can be an embedded sub-client integrated in a certain client (for example, a social client, an education client, a multimedia client, etc.), which is not limited here.
  • The business server 100 can be a collection of multiple servers, including a background server and a data processing server corresponding to the video application, so each terminal device can perform data transmission with the business server 100 through the application client corresponding to the video application; for example, each terminal device can upload its local video to the business server 100 through the application client of the video application, and the business server 100 can then deliver the video to other terminal devices or transmit it to a cloud server.
  • one terminal device can be selected as the target terminal device in the terminal device cluster shown in FIG. 1 , for example, terminal device 200a is used as the target terminal device.
  • The terminal device 200a may send the video identification, the browsing object identification, and the shared object identification as the data to be identified to the service server 100.
  • the embodiment of the present application refers to the user using the terminal device 200a as a browsing object, and the users (such as friend users) who are associated with the browsing object are called shared objects.
  • The embodiment of the present application does not limit the browsing object identification (the browsing object has given authorization); it includes but is not limited to the mobile phone number and identification number bound to the browsing object in the application client, and can be set according to the actual application scenario.
  • Similarly, the shared object identification can be any information that can be used to identify the shared object in the application client, and the video identification can be any information that can be used to identify the video in the application client.
  • the service server 100 can obtain the video according to the video identification, and obtain the object tag text sequence according to the browsing object identification and the shared object identification.
  • the object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share; the object tag text of the browsing object is used to represent the interest of the browsing object, and the sharing The object tag text of an object is used to characterize the interest of the shared object.
  • the service server 100 obtains at least two video clips in the video, and the service server 100 obtains a trained video recognition model.
  • The video recognition model may include a first video recognition sub-model, a second video recognition sub-model and a third video recognition sub-model. Through the first video recognition sub-model, the service server 100 can determine the first sharing quality corresponding to each of the at least two video clips, and according to the first sharing quality, candidate video clips can be determined from the at least two video clips. Further, in the second video recognition sub-model, according to the object tag text sequence and the candidate video clips, the service server 100 can determine the second sharing quality corresponding to each candidate video clip, and according to the second sharing quality corresponding to each candidate video clip, candidate shared video clips can be determined from the candidate video clips. Further, in the third video recognition sub-model, according to the object tag text sequence and the candidate shared video clips, the service server 100 can determine the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip. Finally, according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, the service server 100 can determine the shared video clip from the candidate shared video clips, and determine the shared video clip and the auxiliary description information corresponding to the shared video clip as the shared data to be sent to the shared object.
  • the business server 100 sends the shared data to the terminal device 200a.
  • After receiving the shared data sent by the business server 100, the terminal device 200a can display the shared data on its screen. Furthermore, the terminal device 200a can send the shared data, carrying the video identification, to the terminal device corresponding to the shared object (for example, the terminal device 200b in Figure 1).
  • After the terminal device 200b obtains the shared data carrying the video identification, it can display the shared data on its screen. Furthermore, the shared object can view the complete video based on the video identification carried by the shared data.
  • Alternatively, the service server 100 can send the shared data directly to the terminal device corresponding to the shared object (the terminal device 200b in Figure 1); for the subsequent process, please refer to the above description, which will not be repeated here.
  • In another implementation, the service server 100 generates a sharing identifier for the shared video clip and sends the sharing identifier and the auxiliary description information to the terminal device 200a. After obtaining the sharing identifier, the terminal device 200a can generate sharing information for the video that carries the sharing identifier and the auxiliary description information, and then send the sharing information to the terminal device 200b corresponding to the shared object. When the terminal device 200b obtains the sharing information, it can play the shared video clip in the video according to the sharing identifier.
  • If the browsing object authorizes the service server 100 with sharing permission, then after generating the sharing identifier, the service server 100 can send the sharing identifier and the auxiliary description information directly to the terminal device 200b.
  • In another implementation, the terminal device 200a itself can use the video recognition model to determine the first sharing quality corresponding to each of at least two video clips in the video, and thereby determine the candidate video clips from the at least two video clips. According to the object tag text sequence and the candidate video clips, the terminal device 200a can determine the second sharing quality corresponding to each candidate video clip, and then determine the candidate shared video clips from the candidate video clips. According to the object tag text sequence and the candidate shared video clips, the terminal device 200a can determine the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip. According to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, the terminal device 200a can determine the shared video clip from the candidate shared video clips, so the shared video clip and the auxiliary description information corresponding to the shared video clip can be determined as the shared data to be sent to the shared object.
  • The local video recognition model of the terminal device 200a can be sent to the terminal device 200a by the service server 100 after training is completed.
  • The shared data in the embodiment of the present application is automatically constructed based on the video and the object tag text sequence and has a high sharing value. The shared video clip can intuitively reflect the highlight content of the video and at the same time matches the interest tags of the browsing object and the shared object, so the sharing efficiency and the sharing effect of the video can be improved.
  • Optionally, the business server 100 and the terminal device 200a can both be blockchain nodes in a blockchain network, and the data described throughout this text (such as the object tag text sequence and the shared data) can be stored on the blockchain.
  • the storage method can be that the blockchain node generates blocks based on the data and adds the blocks to the blockchain for storage.
  • Blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. It is mainly used to organize data in chronological order and encrypt it into a ledger so that it cannot be tampered with or forged, while data verification, storage and updating can still be performed.
  • Blockchain is essentially a decentralized database. Each node in the database stores an identical blockchain.
  • The blockchain network can distinguish nodes into core nodes, data nodes and light nodes; core nodes, data nodes and light nodes are all blockchain nodes in the blockchain network.
  • the core node is responsible for the consensus of the entire blockchain network, which means that the core node is the consensus node in the blockchain network.
  • The process by which transaction data in the blockchain network is written into the ledger can be as follows: a data node or light node in the blockchain network obtains the transaction data and passes it along the blockchain network (that is, the nodes pass it on like a baton) until a consensus node receives it; the consensus node then packages the transaction data into a block, performs consensus on the block, and writes the transaction data into the ledger after the consensus is completed.
  • Here, the object tag text sequence and the shared data are taken as examples of transaction data. After the consensus on the transaction data is reached, the business server 100 (as a blockchain node) generates a block based on the transaction data and stores the block in the blockchain network; to read the transaction data (i.e., the object tag text sequence and the shared data), a blockchain node can obtain the block containing the transaction data from the blockchain network and further obtain the transaction data from the block.
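  • As a toy illustration of the ledger-writing flow described above (not part of the claimed data processing method), the sketch below shows how a consensus node might package transaction data, such as the object tag text sequence and the shared data, into a hash-linked block; the block fields and the SHA-256 hashing scheme are illustrative assumptions.

```python
import hashlib
import json
import time

def package_block(transaction_data, prev_block_hash):
    """Toy example: package transaction data into a block that is linked to the
    previous block by its hash (illustrative only, not a real consensus protocol)."""
    block = {
        "timestamp": time.time(),
        "transactions": transaction_data,  # e.g. object tag text sequence and shared data
        "prev_hash": prev_block_hash,
    }
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return block

# After consensus succeeds, the block would be appended to each node's copy of the
# ledger, e.g. ledger.append(package_block(tx, ledger[-1]["hash"])).
```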
  • the methods provided by the embodiments of the present application can be executed by computer equipment, including but not limited to terminal equipment or business servers.
  • The business server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud databases, cloud services, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms.
  • Terminal devices include but are not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, etc.
  • the terminal device and the service server may be connected directly or indirectly through wired or wireless methods, and the embodiments of the present application are not limited here.
  • FIG. 2 is a schematic diagram of a data processing scenario provided by an embodiment of the present application.
  • Embodiments of this application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, smart transportation, assisted driving, etc.
  • the embodiments of the present application can be applied to business scenarios such as video clip recommendation scenarios, video clip distribution scenarios, and video clip search scenarios. Specific business scenarios will not be listed here.
  • the implementation process of this data processing scenario can be carried out in the business server or in the terminal device. It can also be carried out interactively in the terminal device and the business server. There is no restriction here.
  • the embodiments of the present application are described by taking the interaction between a terminal device and a service server as an example.
  • The terminal device can be any terminal device in the terminal device cluster in the embodiment corresponding to Figure 1.
  • the service server may be the service server 100 in the embodiment corresponding to the above-mentioned FIG. 1 .
  • the browsing object 20b has a binding relationship with the terminal device 200a.
  • the terminal device 200a can display the basic information of the video 201a on the playback interface, such as the video duration (Fig. 2 example is 6 minutes), video cover (example in Figure 2 is a cat image 205a), video copy (example copy in Figure 2 is "kittens fighting for food" 206a).
  • the terminal device 200a can also display controls for the video 201a on the playback interface, such as the playback control 207a and the sharing control 202a illustrated in FIG. 2 .
  • When the browsing object 20b triggers the sharing control 202a, the terminal device 200a responds to the trigger operation on the sharing control 202a and displays the friend list of the browsing object 20b.
  • The example friend list in Figure 2 includes three friends, namely friend "aa", friend "bb" and friend "cc". If the browsing object 20b triggers the selection control 203a corresponding to the friend "cc", the terminal device 200a can display a prompt sub-page, and the prompt sub-page can display a "cancel" control and a "share" control 204a.
  • The terminal device 200a determines the friend "cc" as the shared object.
  • The terminal device 200a can obtain the video identification corresponding to the video 201a, the browsing object identification corresponding to the browsing object 20b, and the shared object identification corresponding to the shared object, and then send the video identification, the browsing object identification and the shared object identification to the service server 100, so that the business server 100 can obtain the video 201a through the video identification and determine the object tag text sequence through the browsing object identification and the shared object identification.
  • the object tag text sequence includes object tag text for the browse object 20b and object tag text for the shared object.
  • the object tag text of the browsing object 20b is used to characterize the interest of the browsing object 20b; the object tag text of the shared object is used to characterize the interest of the shared object.
  • the embodiment of the present application does not limit the way in which the service server 100 obtains the video 201a and the object label text sequence.
  • the video 201a and the object label text sequence can be obtained as described above, or the terminal device 200a can obtain the video 201a and the object label text sequence.
  • the business server 100 can also determine the video 201a and the object label text sequence through other methods. The specific settings should be based on the actual scenario.
  • the service server 100 can segment the video 201a through a time window to obtain at least two video clips 20d.
  • the length of the time window in the embodiment of this application is 1 minute.
  • the number of at least two video clips 20d is 6, such as the video clips 201d, 202d, 203d, 204d, 205d and 206d as shown in FIG. 2 .
  • the service server 100 obtains the trained video recognition model 20c.
  • the video recognition model 20c may include a first video recognition sub-model 20e, a second video recognition sub-model 20f and a third video recognition sub-model 20g.
  • the service server 100 inputs at least two video clips 20d to the first video recognition sub-model 20e respectively, and determines the first sharing quality corresponding to the at least two video clips 20d through the first video recognition sub-model 20e.
  • the first sharing quality is used to characterize the sharing value of the video clip.
  • the first sharing quality may be the interaction rate of the video clip.
  • the first shared quality of the video clip 201d is 0.8, the first shared quality of the video clip 202d is 0.85, the first shared quality of the video clip 203d is 0.89, and the first shared quality of the video clip 204d is 0.7, The first shared quality of the video clip 205d is 0.75, and the first shared quality of the video clip 206d is 0.9.
  • The specific process by which the service server 100 determines the first sharing quality corresponding to a video clip will not be described here; please refer to the description of step S101 in the embodiment corresponding to Figure 3 below.
  • the service server 100 obtains the first sharing quality threshold. It can be understood that the first sharing quality threshold can be adjusted according to the actual application scenario. An example in the embodiment of this application is 0.8.
  • The service server 100 compares the first sharing quality of each video clip with the first sharing quality threshold, and determines the video clips whose first sharing quality is equal to or greater than the first sharing quality threshold as the candidate video clips 201e. As shown in Figure 2, the candidate video clips 201e include the video clips 201d, 202d, 203d and 206d. Further, the business server 100 inputs both the object tag text sequence and the candidate video clips 201e to the second video recognition sub-model 20f.
  • Through the second video recognition sub-model 20f, the second sharing quality corresponding to each candidate video clip 201e can be determined.
  • the second sharing quality is used to characterize the correlation between the candidate video clip and the object label text of the shared object.
  • the second shared quality of the video clip 201d is 0.74
  • the second shared quality of the video clip 202d is 0.86
  • the second shared quality of the video clip 203d is 0.8
  • the second shared quality of the video clip 206d is 0.9;
  • the specific process of the service server 100 determining the second sharing quality corresponding to the candidate video clip will not be described here. Please refer to the description of step S102 in the embodiment corresponding to FIG. 3 below.
  • The service server 100 obtains the second sharing quality threshold. It can be understood that the second sharing quality threshold can be adjusted according to the actual application scenario; the example in the embodiment of this application is 0.85.
  • The service server 100 compares the four second sharing qualities with the second sharing quality threshold, and determines the candidate video clips whose second sharing quality is greater than the second sharing quality threshold as the candidate shared video clips 201f. As shown in Figure 2, the candidate shared video clips 201f include the video clips 202d and 206d. Further, the business server 100 inputs both the object tag text sequence and the candidate shared video clips 201f to the third video recognition sub-model 20g.
  • Through the third video recognition sub-model 20g, the third sharing quality corresponding to each candidate shared video clip 201f can be determined. As shown in the example of Figure 2, the third sharing quality of the video clip 202d is 0.82 and the third sharing quality of the video clip 206d is 0.87. The specific process by which the service server 100 determines the third sharing quality corresponding to a candidate shared video clip will not be described here; please refer to the description of step S103 in the embodiment corresponding to Figure 3 below.
  • In addition, the service server 100 can determine the auxiliary description information corresponding to each candidate shared video clip. As shown in Figure 2, the service server 100 determines the auxiliary description information 202g of the video clip 202d and the auxiliary description information 206g of the video clip 206d. The specific process by which the service server 100 determines the auxiliary description information corresponding to a candidate shared video clip will not be described here; please refer to the description of step S103 in the embodiment corresponding to Figure 3 below.
  • The service server 100 performs a weighted summation on the first sharing quality (0.85 in the example of Figure 2), the second sharing quality (0.86 in the example of Figure 2) and the third sharing quality (0.82 in the example of Figure 2) corresponding to the video clip 202d to obtain the total sharing quality corresponding to the video clip 202d.
  • Likewise, the service server 100 can obtain the total sharing quality corresponding to the video clip 206d by a weighted summation of its first sharing quality, second sharing quality and third sharing quality (0.87 in the example of Figure 2). Further, the service server 100 compares the total sharing quality corresponding to the video clip 202d with the total sharing quality corresponding to the video clip 206d, and takes the maximum of the two total sharing qualities.
  • Since the video clip 206d has the maximum total sharing quality, the service server 100 can determine the video clip 206d as the shared video clip. Further, the shared video clip (i.e., the video clip 206d) and the auxiliary description information corresponding to the shared video clip (the auxiliary description information 206g shown in Figure 2) can be determined as the shared data 20h.
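  • Using the example values from Figure 2, the final selection can be reproduced in a few lines; the equal weights used in the weighted summation are an assumption, since the embodiment does not specify the weighting coefficients.

```python
# Figure 2 example values: (first, second, third) sharing quality per candidate shared clip.
qualities = {
    "202d": (0.85, 0.86, 0.82),
    "206d": (0.90, 0.90, 0.87),
}
weights = (1.0, 1.0, 1.0)  # assumed equal weights; not specified by the embodiment

totals = {clip: sum(w * q for w, q in zip(weights, qs)) for clip, qs in qualities.items()}
shared_clip = max(totals, key=totals.get)  # -> "206d", the maximum total sharing quality
```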
  • The business server 100 can synchronize the shared data 20h to the terminal device 200a, so the terminal device 200a can send the shared data 20h to the shared object (the friend "cc" as shown in Figure 2).
  • this application can construct multiple video clips with high sharing value through deep modeling of videos.
  • auxiliary description information that is strongly related to the browsing objects and shared objects can be generated to achieve video sharing.
  • Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • This data processing method can be executed by a business server (for example, the business server 100 shown in Figure 1 above), by a terminal device (for example, the terminal device 200a shown in Figure 1 above), or through interaction between a business server and a terminal device.
  • the embodiment of this application takes the method being executed by the service server as an example for description.
  • the data processing method may at least include the following steps S101 to S104.
  • Step S101: Obtain at least two video clips in the video, determine the first sharing quality corresponding to each of the at least two video clips, and select at least one video clip from the at least two video clips as a candidate video clip based on the first sharing quality.
  • the video can be segmented according to the time window to obtain at least two video clips corresponding to the video; the first sharing quality is used to characterize the popularity of the video clips, such as the interaction rate.
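  • A minimal sketch of the time-window segmentation described above, assuming the video is represented only by its duration in seconds and the window length (60 seconds in the Figure 2 example):

```python
def segment_by_time_window(video_duration_s, window_s=60):
    """Split a video into consecutive (start, end) windows, in seconds.
    A 6-minute video with a 1-minute window yields 6 segments, as in Figure 2."""
    segments = []
    start = 0
    while start < video_duration_s:
        end = min(start + window_s, video_duration_s)
        segments.append((start, end))
        start = end
    return segments

# segment_by_time_window(360) -> [(0, 60), (60, 120), (120, 180), (180, 240), (240, 300), (300, 360)]
```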
  • the popularity can be determined by the characteristics of the video clip in multiple dimensions, such as image characteristics, audio characteristics and text characteristics. For each video clip in the at least two video clips, perform the following operations to determine the first sharing quality corresponding to the video clip:
  • Specifically, for a video clip, K video frames in the video clip and the audio frames corresponding to the K video frames can be obtained, where K is a positive integer; the audio recognition text, the video description text and the object comment text corresponding to the video clip are also obtained, and a multi-dimensional fusion feature corresponding to the video clip is generated based on the K video frames, the K audio frames and these texts.
  • Based on the multi-dimensional fusion feature, the first sharing quality corresponding to the video clip is determined.
  • The video clip can be subjected to audio recognition processing to obtain the audio recognition text, such as the voice dialogue text obtained through ASR recognition; the video clip can be subjected to text recognition processing, such as OCR processing, to obtain the video description text (for example, subtitle text); and the bullet-screen (barrage) comment text corresponding to the video clip can be obtained as the object comment text.
  • The specific process of generating the multi-dimensional fusion feature corresponding to a video clip may include: obtaining a video recognition model, where the video recognition model includes a first video recognition sub-model, and the first video recognition sub-model includes a video fusion network layer, an audio fusion network layer, a text fusion network layer and a multi-dimensional fusion network layer; inputting the K video frames to the video fusion network layer, performing feature extraction on the K video frames through the video fusion network layer to obtain the video features to be fused corresponding to the K video frames, and performing feature fusion on the K video features to be fused to obtain the video feature corresponding to the video clip; inputting the K audio frames to the audio fusion network layer, performing feature extraction on the K audio frames through the audio fusion network layer to obtain the audio features to be fused corresponding to the K audio frames, and performing feature fusion on the K audio features to be fused to obtain the audio feature corresponding to the video clip; determining the audio recognition text, the video description text and the object comment text as the content text corresponding to the video clip, inputting the content text into the text fusion network layer, extracting the key text in the content text through the text fusion network layer, and performing feature extraction on the key text to obtain the text feature corresponding to the key text; and inputting the video feature, the audio feature and the text feature to the multi-dimensional fusion network layer, and performing feature fusion on the video feature, the audio feature and the text feature through the multi-dimensional fusion network layer to obtain the multi-dimensional fusion feature corresponding to the video clip.
  • the first video recognition sub-model further includes a first fully connected network layer.
  • The specific process of determining the first sharing quality corresponding to the at least two video clips may include: for each video clip, inputting the multi-dimensional fusion feature corresponding to the video clip into the first fully connected network layer, performing feature transformation on the multi-dimensional fusion feature corresponding to the video clip through the first fully connected network layer, and obtaining the first sharing quality corresponding to the video clip.
  • The specific process of selecting at least one video clip from the at least two video clips as a candidate video clip may include: determining, among the at least two video clips, the video clips whose first sharing quality is equal to or greater than the first sharing quality threshold as the candidate video clips.
  • Specifically, the business server can segment the video through the time window to obtain at least two video clips of the video, where the time window can be set according to the actual application scenario. It can be understood that the process by which the service server determines the first sharing quality of each video clip is the same; therefore, the embodiment of the present application takes determining the first sharing quality corresponding to video clip A1 as an example for description, and the process of determining the first sharing quality corresponding to the remaining video clips among the at least two video clips can refer to this description. Please also refer to Figure 4.
  • Figure 4 is a schematic model structure diagram of a first video recognition sub-model provided by an embodiment of the present application. As shown in Figure 4, the service server obtains K video frames from the video segment A 1 and the audio frames corresponding to the K video frames.
  • The K video frames can be selected randomly or periodically (for example, one frame per second).
  • The embodiment of the present application does not limit the method of obtaining video frames, which can be set according to the actual application scenario. The business server performs audio recognition processing on the video clip A1, for example through ASR technology, to obtain the audio recognition text; it extracts the video description text in the video clip A1, for example through OCR technology, and extracts the object comment text, where the video description text may include subtitle text and the object comment text may include bullet-screen (barrage) text. Further, the business server determines the audio recognition text, the video description text and the object comment text as the content text E1 corresponding to the video clip A1.
  • the business server obtains the first video recognition sub-model in the video recognition model.
  • The first video recognition sub-model includes a video fusion network layer 40a, an audio fusion network layer 40b, a text fusion network layer 40c, a multi-dimensional fusion network layer 40e and a first fully connected network layer 40f.
  • The service server inputs the K video frames to the video fusion network layer 40a. Assuming that the K video frames include a first video frame and a second video frame, then through the video fusion network layer 40a, feature extraction is performed on the first video frame to obtain the video feature to be fused corresponding to the first video frame, and feature extraction is performed on the second video frame to obtain the video feature to be fused corresponding to the second video frame. In this way, the business server can obtain the video features to be fused corresponding to each of the K video frames; by performing feature fusion on the K video features 401a to be fused, the service server can obtain the video feature 401d corresponding to the video clip A1.
  • the video fusion network layer 40a can be regarded as a network for extracting deep features of K video frames.
  • The embodiment of the present application does not limit the network type of the video fusion network layer 40a; it can consist of any one or more neural networks, such as Convolutional Neural Networks (CNN), Residual Networks (ResNet), High-Resolution Net (HRNet), EfficientNet, and so on.
  • Similarly, the service server inputs the K audio frames to the audio fusion network layer 40b. Assuming that the K audio frames include a first audio frame corresponding to the first video frame and a second audio frame corresponding to the second video frame, then through the audio fusion network layer 40b, feature extraction is performed on the first audio frame to obtain the audio feature to be fused corresponding to the first audio frame, and feature extraction is performed on the second audio frame to obtain the audio feature to be fused corresponding to the second audio frame. In this way, the business server can obtain the audio features to be fused corresponding to the K audio frames, perform feature fusion on the K audio features 401b to be fused, and obtain the audio feature 402d corresponding to the video clip A1.
  • the audio fusion network layer 40b can be regarded as a network used to extract deep features of K audio frames.
  • The embodiment of the present application does not limit the network type of the audio fusion network layer 40b; it can consist of any one or more neural networks, such as the convolutional time-domain audio separation network (Conv-TasNet), the bidirectional long short-term memory time-domain audio separation network (BiLSTM-TasNet), the TensorFlow-based VGGish model, and so on.
  • the business server inputs the content text E 1 into the text fusion network layer 40c, extracts the key text in the content text E 1 through the text fusion network layer 40c, performs feature extraction on the key text, and obtains text features corresponding to the key text.
  • The embodiment of the present application does not limit the network type of the text fusion network layer 40c; it can be any natural language processing network, such as Transformer (a deep self-attention model widely used in the fields of natural language translation and image processing), Word2Vec (a model used to generate word vectors), BERT (Bidirectional Encoder Representations from Transformers), and so on.
  • Further, the business server inputs the video feature 401d, the audio feature 402d and the text feature 403d to the multi-dimensional fusion network layer 40e; through the multi-dimensional fusion network layer 40e, feature fusion is performed on the video feature 401d, the audio feature 402d and the text feature 403d to obtain the multi-dimensional fusion feature 401e corresponding to the video clip A1.
  • the business server inputs the multi-dimensional fusion feature 401e to the first fully connected network layer 40f, and performs feature transformation on the multi-dimensional fused feature 401e through the first fully connected network layer 40f to obtain the first shared quality corresponding to the video segment A1 .
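  • The following PyTorch-style sketch illustrates the structure just described: per-modality fusion of the K frame-level features, a multi-dimensional fusion layer over the three modal features, and a first fully connected layer that maps the fused feature to a scalar first sharing quality. The linear backbones, feature dimensions and mean-pooling fusion are illustrative assumptions; the embodiment allows any of the network types listed above for each layer.

```python
import torch
import torch.nn as nn

class FirstVideoRecognitionSubModel(nn.Module):
    """Illustrative sketch of the first video recognition sub-model: video/audio/text
    fusion layers -> multi-dimensional fusion layer -> first fully connected layer."""
    def __init__(self, frame_dim=512, audio_dim=128, text_dim=256, hidden_dim=256):
        super().__init__()
        # Stand-ins for the video, audio and text fusion network layers (40a / 40b / 40c).
        self.video_encoder = nn.Linear(frame_dim, hidden_dim)
        self.audio_encoder = nn.Linear(audio_dim, hidden_dim)
        self.text_encoder = nn.Linear(text_dim, hidden_dim)
        # Multi-dimensional fusion network layer (40e).
        self.fusion = nn.Linear(3 * hidden_dim, hidden_dim)
        # First fully connected network layer (40f).
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, video_frames, audio_frames, content_text_feat):
        # video_frames: (K, frame_dim); audio_frames: (K, audio_dim);
        # content_text_feat: (text_dim,) pre-computed encoding of the content text (assumption).
        video_feat = self.video_encoder(video_frames).mean(dim=0)  # fuse the K frame features
        audio_feat = self.audio_encoder(audio_frames).mean(dim=0)  # fuse the K audio features
        text_feat = self.text_encoder(content_text_feat)
        fused = self.fusion(torch.cat([video_feat, audio_feat, text_feat], dim=-1))
        first_quality = torch.sigmoid(self.fc(fused))              # first sharing quality in [0, 1]
        return first_quality, fused  # fused plays the role of the multi-dimensional fusion feature

# Example with random inputs and K = 8 sampled frames:
# model = FirstVideoRecognitionSubModel()
# q1, fused = model(torch.randn(8, 512), torch.randn(8, 128), torch.randn(256))
```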
  • Regarding the specific process by which the service server determines the candidate video clips from the at least two video clips, please refer to the description of Figure 2 above, which will not be repeated here.
  • Step S102: Obtain the object tag text sequence associated with the video, determine, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and select at least one candidate video clip from the candidate video clips as a candidate shared video clip according to the second sharing quality corresponding to each candidate video clip.
  • the second sharing quality is used to characterize the correlation between the candidate video clip and the object label text of the shared object.
  • the object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share; the object tag text of the browsing object is used to represent the interest of the browsing object, and the sharing The object tag text of an object is used to characterize the interest of the shared object.
  • Specifically, the process may include: obtaining the object tag text of the browsing object associated with the video and obtaining the object tag text of the shared object associated with the browsing object; generating the object tag text sequence based on the object tag text of the browsing object and the object tag text of the shared object; obtaining the video recognition model and inputting the object tag text sequence and the candidate video clips to the video recognition model, where the video recognition model includes a second video recognition sub-model and the second video recognition sub-model includes a first text encoding network layer; performing text encoding on each object tag text in the object tag text sequence through the first text encoding network layer to obtain the first object tag feature corresponding to the object tag text sequence; obtaining the multi-dimensional fusion feature corresponding to each candidate video clip; and determining the second sharing quality corresponding to each candidate video clip based on the first object tag feature and the multi-dimensional fusion feature corresponding to each candidate video clip.
  • the second video recognition sub-model also includes a first splicing network layer and a second fully connected network layer;
  • The specific process of determining the second sharing quality corresponding to a candidate video clip may include: for each candidate video clip, inputting the first object tag feature and the multi-dimensional fusion feature corresponding to the candidate video clip to the first splicing network layer; performing feature splicing on the first object tag feature and the multi-dimensional fusion feature corresponding to the candidate video clip through the first splicing network layer to obtain the first multi-dimensional splicing feature corresponding to the candidate video clip; and inputting the first multi-dimensional splicing feature to the second fully connected network layer, performing feature transformation on the first multi-dimensional splicing feature through the second fully connected network layer, and obtaining the second sharing quality corresponding to the candidate video clip.
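  • A corresponding PyTorch-style sketch of the second video recognition sub-model described above: the object tag text sequence is encoded into a first object tag feature, spliced (concatenated) with the candidate clip's multi-dimensional fusion feature, and transformed by a second fully connected layer into the second sharing quality. The embedding-bag tag encoder and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SecondVideoRecognitionSubModel(nn.Module):
    """Illustrative sketch: first text encoding layer (40g) + first splicing
    layer (40h, a concatenation) + second fully connected layer (40i)."""
    def __init__(self, vocab_size=10000, tag_dim=128, fusion_dim=256):
        super().__init__()
        # First text encoding network layer: encodes the object tag text sequence.
        self.tag_encoder = nn.EmbeddingBag(vocab_size, tag_dim, mode="mean")
        # Second fully connected network layer: maps the spliced feature to a score.
        self.fc = nn.Linear(tag_dim + fusion_dim, 1)

    def forward(self, tag_token_ids, clip_fusion_feature):
        # tag_token_ids: (num_tokens,) token ids of the object tag text sequence.
        # clip_fusion_feature: (fusion_dim,) multi-dimensional fusion feature of one candidate clip.
        tag_feature = self.tag_encoder(tag_token_ids.unsqueeze(0)).squeeze(0)  # first object tag feature
        spliced = torch.cat([tag_feature, clip_fusion_feature], dim=-1)        # first splicing layer
        return torch.sigmoid(self.fc(spliced))                                 # second sharing quality

# Example:
# model = SecondVideoRecognitionSubModel()
# q2 = model(torch.tensor([3, 17, 42]), torch.randn(256))
```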
  • Step S101 constructs candidate video clips with a high interaction rate and a high sharing value. This step further constrains the candidate video clips by their relevance to the objects' interests, so that the constructed candidate shared video clips are more consistent with the objects' interests, which can further improve the playback conversion of video sharing.
  • the business server obtains the object tag text of the browsing object (abbreviated as browsing object tag text).
  • the browsing object tag text can represent the browsing object's interest.
  • For example, the tag text (cat, animation, pet) represents a browsing object interested in cat, animation and pet videos.
  • the shared object tag text can represent the interest of the shared object.
  • For example, the tag text (cat, cartoon, children) represents a shared object interested in cat, cartoon and children's videos.
  • The business server obtains the object tag text sequence, for example, by combining the tag text (cat, animation, pet) and the tag text (cat, cartoon, children) to obtain the tag text sequence (cat, animation, pet, cartoon, children).
  • the object label text sequence is generated using the obtained object label text.
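  • A minimal sketch of how the object tag text sequence could be assembled from the two tag lists in this example; de-duplicating repeated tags while preserving their order is an assumption suggested by the combined sequence shown above.

```python
def build_tag_text_sequence(browsing_object_tags, shared_object_tags):
    """Concatenate the two tag lists, keeping only the first occurrence of each tag."""
    seen, sequence = set(), []
    for tag in browsing_object_tags + shared_object_tags:
        if tag not in seen:
            seen.add(tag)
            sequence.append(tag)
    return sequence

# build_tag_text_sequence(["cat", "animation", "pet"], ["cat", "cartoon", "children"])
# -> ["cat", "animation", "pet", "cartoon", "children"]
```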
  • Embodiments of the present application can provide two ways of obtaining multi-dimensional fusion features corresponding to candidate video clips.
  • The first way: step S101 has already produced the multi-dimensional fusion features corresponding to the at least two video clips (including the multi-dimensional fusion feature 401e in Figure 4), and the candidate video clips belong to the at least two video clips, so the business server can obtain the multi-dimensional fusion features corresponding to the candidate video clips from the multi-dimensional fusion features corresponding to the at least two video clips output by the first video recognition sub-model. Please refer to Figure 2 again.
  • Through the first video recognition sub-model 20e, the business server can obtain the multi-dimensional fusion features corresponding to the video clips 201d, 202d, 203d, 204d, 205d and 206d respectively. Since the business server determines the video clips 201d, 202d, 203d and 206d as the candidate video clips, it can directly determine the multi-dimensional fusion features of the video clips 201d, 202d, 203d and 206d output by the first video recognition sub-model 20e as the multi-dimensional fusion features corresponding to the candidate video clips.
  • The second way: please also refer to Figure 5, which is a schematic model structure diagram of a second video recognition sub-model provided by an embodiment of the present application.
  • As shown in Figure 5, the model structure in the dotted-line area is the same as the model structure in the first video recognition sub-model of Figure 4, but the model parameters of the two are not identical, because when training the second video recognition sub-model, the business server uses the model parameters of the video fusion network layer 40a, the audio fusion network layer 40b, the text fusion network layer 40c and the multi-dimensional fusion network layer 40e in the trained first video recognition sub-model as the initial model parameters of the dotted-line area in Figure 5. The process by which the service server obtains the multi-dimensional fusion feature 402e corresponding to a candidate video clip through the dotted-line area in Figure 5 is consistent with the process, described in step S101, of obtaining the multi-dimensional fusion features corresponding to the at least two video clips through the first video recognition sub-model, so please refer to the description of step S101 above, which will not be repeated here. Since the model parameters in the dotted-line area of Figure 5 are further optimized relative to the model parameters in Figure 4, the multi-dimensional fusion feature 402e is better than the multi-dimensional fusion features obtained in step S101.
  • the embodiment of this application jointly models the personalized interests of the object and the content of the video clips at the same time.
  • the second video recognition sub-model may include the first text encoding network layer 40g, the first splicing network layer 40h and a second fully connected network layer 40i.
• Through the first text encoding network layer 40g, the business server performs text encoding on each object label text in the object label text sequence to obtain the first object label feature 401g corresponding to the object label text sequence; the business server then inputs the first object label feature 401g and the multi-dimensional fusion feature corresponding to the candidate video clip (such as the multi-dimensional fusion feature 402e in Figure 5) respectively to the first splicing network layer 40h. Through the first splicing network layer 40h, feature splicing is performed on the first object label feature 401g and the multi-dimensional fusion feature 402e to obtain the first multi-dimensional splicing feature 401h corresponding to the candidate video clip; further, the business server inputs the first multi-dimensional splicing feature 401h to the second fully connected network layer 40i, and through the second fully connected network layer 40i, the second sharing quality corresponding to the candidate video clip can be obtained.
  • the embodiment of the present application does not limit the network type of the first text encoding network layer 40g, and it can be any natural language processing network.
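• To make the data flow concrete, here is a hedged PyTorch sketch of a second video recognition sub-model head with this structure (text encoding, splicing, fully connected scoring). The dimensions and the simple embedding-average text encoder are illustrative assumptions, not the original network.

```python
import torch
import torch.nn as nn

class SecondSubModelHead(nn.Module):
    """Text encoding layer + splicing layer + fully connected layer mapping
    (object label text sequence, multi-dimensional fusion feature) to a
    second sharing quality score."""

    def __init__(self, vocab_size=10000, text_dim=128, fusion_dim=256):
        super().__init__()
        # First text encoding network layer (any NLP encoder could be used;
        # an embedding average is only a placeholder here).
        self.tag_embedding = nn.EmbeddingBag(vocab_size, text_dim, mode="mean")
        # Second fully connected network layer applied to the spliced feature.
        self.fc = nn.Linear(text_dim + fusion_dim, 1)

    def forward(self, tag_token_ids, fusion_feature):
        # First object label feature (401g in the figure's terms).
        tag_feature = self.tag_embedding(tag_token_ids)
        # First splicing network layer: concatenate tag and fusion features.
        spliced = torch.cat([tag_feature, fusion_feature], dim=-1)
        # Second sharing quality as a score in (0, 1).
        return torch.sigmoid(self.fc(spliced)).squeeze(-1)

# Toy usage: one candidate clip, five tag tokens, a 256-d fusion feature.
head = SecondSubModelHead()
tags = torch.randint(0, 10000, (1, 5))
fusion = torch.randn(1, 256)
print(head(tags, fusion))  # tensor with one quality score
```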
  • the process in which the service server selects at least one candidate video segment from the candidate video segments as a candidate shared video segment according to the second sharing quality corresponding to the candidate video segment may be referred to the description in Figure 2 above, and will not be described again here.
  • Step S103 Determine the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip based on the object label text sequence and the candidate shared video clips.
• the auxiliary description information refers to the description information used to assist the video clip, including but not limited to the following modal information or a combination of multiple modal information: the copywriting (text modality), cover (image modality) and voice introduction (audio modality) of the video clip, etc., which can be set according to the actual application scenario.
  • the third sharing quality is used to characterize the matching degree of the auxiliary description information with the video clip and the object tag text of the shared object.
  • the service server determines the third sharing quality corresponding to the candidate shared video clip through the third video recognition sub-model in the video recognition model, and then determines the auxiliary description information.
• When the auxiliary description information includes a description image, the above-mentioned third video recognition sub-model includes a fourth video recognition sub-model; when the auxiliary description information includes copywriting (i.e., description text), the above-mentioned third video recognition sub-model includes a fifth video recognition sub-model; when the auxiliary description information includes both a description image and description text, the third video recognition sub-model may include a fourth video recognition sub-model and a fifth video recognition sub-model. For the fourth video recognition sub-model and the fifth video recognition sub-model, please refer to the description in the embodiment corresponding to FIG. 6 below, which will not be described here.
• Step S104 Determine the shared video segment from the candidate shared video segments based on the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video segment, and determine the shared video segment and the auxiliary description information corresponding to the shared video segment as shared data to be sent to the shared object.
• the first sharing quality, the second sharing quality and the third sharing quality corresponding to the candidate shared video segment are weighted and summed to obtain the total sharing quality corresponding to the candidate shared video segment;
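• For example, the weighted combination could look like the following small sketch; the weights 0.3/0.3/0.4 and the clip names are arbitrary illustrative values, and the actual weights would be set per application scenario.

```python
def total_sharing_quality(q1, q2, q3, w1=0.3, w2=0.3, w3=0.4):
    """Weighted sum of the first, second and third sharing qualities."""
    return w1 * q1 + w2 * q2 + w3 * q3

# Pick the candidate shared video clip with the largest total sharing quality.
candidates = {"clip_201d": (0.8, 0.6, 0.7), "clip_206d": (0.5, 0.9, 0.8)}
best = max(candidates, key=lambda c: total_sharing_quality(*candidates[c]))
print(best)
```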
  • the embodiment of this application proposes a method for realizing intelligent sharing of videos.
• this method can automatically mine multiple video clips with high sharing value in the video. Based on the mining of the object's interests, high-value shared clips that are more consistent with the object's personalized interests can be selected, and corresponding personalized shared cover images and shared copywriting can be generated, making video sharing more intelligent and the presentation more intuitive. Because the shared data produced by this method is more consistent with the object's personalized interests, the video sharing effect can be further improved. On the premise of improving the video sharing effect, only video clips are shared instead of the entire video, which saves network transmission resources and the processing resources of the receiving device of the shared data.
  • FIG. 6 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
• This method may be executed by a business server (for example, the business server 100 shown in Figure 1 above), or by a terminal device (for example, the terminal device 200a shown in Figure 1 above), or executed by the business server and the terminal device interactively.
  • the embodiment of this application takes the method being executed by the service server as an example for description. As shown in Figure 6, the method may include at least the following steps.
• Step S201 Obtain at least two video segments in the video, determine the first sharing quality corresponding to the at least two video segments, and select at least one video segment from the at least two video segments as a candidate video segment based on the first sharing quality.
• Step S202 Obtain the object label text sequence associated with the video, determine the second sharing quality corresponding to each candidate video clip according to the object label text sequence and the candidate video clips, and select at least one candidate video clip from the candidate video clips as a candidate shared video clip according to the second sharing quality corresponding to each candidate video clip.
• For step S201 to step S202, please refer to step S101 to step S102 in the embodiment corresponding to FIG. 3 above, which will not be described again here.
• the auxiliary description information corresponding to the candidate shared video clip includes a description image corresponding to the candidate shared video clip and a description text corresponding to the candidate shared video clip; the third sharing quality corresponding to the candidate shared video clip includes the image sharing quality corresponding to the description image and the text sharing quality corresponding to the description text.
• For each candidate shared video segment determined in step S202, the following steps S203 to S206 are performed to determine the third sharing quality and auxiliary description information of each candidate shared video segment.
• Step S203 Obtain at least two video frames in the candidate shared video clip, and determine the image sharing quality corresponding to each video frame in the at least two video frames.
• image sampling is performed on the candidate shared video clip according to the image sampling period, and at least two video frames in the candidate shared video clip are obtained; for each video frame, the video frame is input to the video recognition model, and feature extraction is performed on the video frame through the image recognition network layer of the video recognition model to obtain the shared image feature corresponding to the video frame; the video recognition model includes a fourth video recognition sub-model, and the fourth video recognition sub-model includes an image recognition network layer and a second splicing network layer.
  • the business server can obtain at least two video frames from the candidate shared video clips through the image sampling cycle (for example, sampling one picture per second), and the at least two video frames are used as candidate description images.
• the business server needs to determine the image sharing quality corresponding to each of the at least two video frames.
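• A small sketch of such periodic frame sampling, using OpenCV and assuming a one-second sampling period, is given below; the file path and period are placeholders rather than values from the original disclosure.

```python
import cv2

def sample_frames(video_path, period_seconds=1.0):
    """Sample one frame per `period_seconds` from a video clip; the sampled
    frames serve as candidate description images."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps * period_seconds)), 1)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```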
• Figure 7 is a schematic model structure diagram of a fourth video recognition sub-model provided by an embodiment of the present application. It can be understood that the process by which the business server obtains the image sharing quality corresponding to each video frame through the fourth video recognition sub-model is the same for every frame. Therefore, the embodiment of the present application takes obtaining the image sharing quality corresponding to the video frame F1 as an example for description; for the processing of the remaining video frames among the at least two video frames, please refer to the description below.
  • the business server inputs the video frame F1 to the image recognition network layer 70a in the fourth video recognition sub-model, and performs feature extraction on the video frame F1 through the image recognition network layer 70a to obtain the shared image feature 701a corresponding to the video frame F1.
• The business server obtains the multi-dimensional fusion feature corresponding to the candidate shared video clip and the second object label feature corresponding to the object label text sequence; the shared image feature 701a corresponding to the video frame F1, the multi-dimensional fusion feature corresponding to the candidate shared video clip and the second object label feature are respectively input to the second splicing network layer; through the second splicing network layer, the shared image feature corresponding to the video frame F1, the multi-dimensional fusion feature corresponding to the candidate shared video clip and the second object label feature are spliced to obtain the second multi-dimensional splicing feature corresponding to the video frame F1; the image sharing quality corresponding to the video frame F1 is determined according to the second multi-dimensional splicing feature corresponding to the video frame F1.
• the fourth video recognition sub-model also includes a third fully connected network layer; for each video frame, the second multi-dimensional splicing feature corresponding to the video frame is input to the third fully connected network layer, and through the third fully connected network layer, feature transformation is performed on the second multi-dimensional splicing feature corresponding to the video frame to obtain the image sharing quality corresponding to the video frame.
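• The following PyTorch-style sketch mirrors this structure for a single video frame: an image recognition layer produces the shared image feature, which is spliced with the clip's multi-dimensional fusion feature and the second object label feature and passed through a fully connected layer to obtain the image sharing quality. All layer sizes and the tiny CNN backbone are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class FourthSubModelHead(nn.Module):
    """Image recognition layer + second splicing layer + third fully
    connected layer producing an image sharing quality per video frame."""

    def __init__(self, image_dim=128, fusion_dim=256, tag_dim=128):
        super().__init__()
        # Image recognition network layer (a tiny CNN stands in for it here).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, image_dim),
        )
        # Third fully connected network layer on the spliced feature.
        self.fc = nn.Linear(image_dim + fusion_dim + tag_dim, 1)

    def forward(self, frame, fusion_feature, tag_feature):
        image_feature = self.image_encoder(frame)            # shared image feature
        spliced = torch.cat([image_feature, fusion_feature, tag_feature], dim=-1)
        return torch.sigmoid(self.fc(spliced)).squeeze(-1)   # image sharing quality

# Toy usage for one frame F1 (batch of 1, 3x224x224).
model = FourthSubModelHead()
quality = model(torch.randn(1, 3, 224, 224), torch.randn(1, 256), torch.randn(1, 128))
print(quality)
```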
  • Step S204 Determine the image sharing quality corresponding to the candidate shared video segment based on the image sharing quality corresponding to each video frame, and select one video frame from the at least two video frames as the description image corresponding to the candidate shared video segment.
• the maximum image sharing quality is obtained from the image sharing qualities corresponding to the at least two video frames, and the maximum image sharing quality is determined as the image sharing quality corresponding to the candidate shared video clip; among the at least two video frames, the video frame corresponding to the maximum image sharing quality is determined as the description image corresponding to the candidate shared video clip.
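• A one-line illustration of this selection (the frame identifiers and scores below are made up):

```python
# Image sharing quality per sampled frame of one candidate shared clip.
frame_qualities = {"F1": 0.82, "F2": 0.74, "F3": 0.91}

description_image = max(frame_qualities, key=frame_qualities.get)  # frame with max quality
clip_image_quality = frame_qualities[description_image]            # quality of the clip
print(description_image, clip_image_quality)  # F3 0.91
```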
  • the embodiments of this application can provide three different ways of obtaining multi-dimensional fusion features corresponding to candidate shared video clips.
• For the first obtaining method, please refer to the description of obtaining the multi-dimensional fusion features corresponding to the candidate video clips in step S102 of the embodiment corresponding to Figure 3 above; the principle is the same.
  • the second acquisition method is similar to the first acquisition method.
• Step S102 in Figure 3 has already provided the multi-dimensional fusion features corresponding to the candidate video clips (including the multi-dimensional fusion feature 402e in Figure 5), and the candidate shared video clip belongs to the candidate video clips, so the business server can obtain the multi-dimensional fusion feature corresponding to the candidate shared video clip from the multi-dimensional fusion features, corresponding to the candidate video clips, output by the second video recognition sub-model.
  • Both of the above acquisition methods can reduce the computing time and cost of the video recognition model.
• Figure 7 is a schematic diagram of the model structure of a fourth video recognition sub-model provided by an embodiment of the present application.
• the model structure in the dotted area is the same as the model structure of the second video recognition sub-model in Figure 5, but the model parameters of the two are inconsistent, because when training the fourth video recognition sub-model, the business server uses the model parameters in the trained second video recognition sub-model as the initialization model parameters in the dotted area of Figure 7, and fine-tunes the initialization model parameters based on the third training sample set (including multiple sample videos, object label sample text sequences, the sample description image corresponding to each sample video and the description image quality label corresponding to each sample video).
• The process in which the business server obtains the multi-dimensional fusion features corresponding to the candidate shared video clips through the dotted area in Figure 7 is consistent with the process in which the multi-dimensional fusion feature 402e is obtained through the second video recognition sub-model, so please refer to the description of step S101 above, which will not be repeated here. Since the model parameters in the dotted area of Figure 7 are better than the model parameters in Figure 5, the multi-dimensional fusion features corresponding to the candidate shared video clips output in Figure 7 are better than the multi-dimensional fusion feature 402e in Figure 5.
• In the first acquisition method, the first object label feature 401g output in Figure 5 is determined as the second object label feature. In the second acquisition method, as shown in Figure 7, the object label text sequence is input to the fourth video recognition sub-model; the process in which the business server obtains the second object label feature through the dotted area in Figure 7 is the same as the process in which the first object label feature 401g is obtained through the first text encoding network layer 40g in Figure 5, so please refer to the description of step S102 above, which will not be repeated here.
• the business server inputs the shared image feature 701a corresponding to the video frame F1, the multi-dimensional fusion feature corresponding to the candidate shared video clip and the second object label feature respectively to the second splicing network layer 70b; through the second splicing network layer 70b, feature splicing can be performed on the shared image feature 701a, the multi-dimensional fusion feature corresponding to the candidate shared video clip and the second object label feature, so the second multi-dimensional splicing feature 701b corresponding to the video frame F1 can be obtained; further, the business server inputs the second multi-dimensional splicing feature 701b to the third fully connected network layer 70c.
  • the service server can obtain the image sharing quality corresponding to at least two video frames.
  • Step S205 Based on the object tag text sequence and the content text corresponding to the candidate shared video clip, determine the text sharing quality corresponding to the candidate shared video clip and the description text corresponding to the candidate shared video clip.
• The description text is composed of N shared words. A video recognition model is obtained; the video recognition model includes the fifth video recognition sub-model, and the fifth video recognition sub-model includes a second text encoding network layer, a third text encoding network layer, an attention network layer and a text decoding network layer. The content text corresponding to the candidate shared video clip is input into the second text encoding network layer, and text encoding is performed on it through the second text encoding network layer to obtain the content text features; the object label text sequence is input into the third text encoding network layer, and text encoding is performed on it through the third text encoding network layer to obtain the third object label feature. The content text features, the text feature to be decoded S_i corresponding to the candidate shared video clip and the third object label feature are respectively input to the attention network layer; through the attention network layer, feature fusion is performed on the content text features, the text feature to be decoded S_i and the third object label feature to obtain the attention weights corresponding to the content text features, where i is a non-negative integer less than N. According to the attention weights corresponding to the content text features, the text feature to be decoded S_(i+1) corresponding to the candidate shared video clip is determined; the shared word indicated by the text feature to be decoded S_i is the previous shared word of the shared word indicated by the text feature to be decoded S_(i+1). When i+1 is equal to N, the N text features to be decoded are respectively input to the text decoding network layer; the N shared words respectively indicated by the N text features to be decoded are obtained, and the N shared words form the description text corresponding to the candidate shared video clip; based on the N text features to be decoded, the text sharing quality corresponding to the candidate shared video clip is generated.
• For the definition of the content text corresponding to the candidate shared video clip, please refer to the definition of the content text E1 in Figure 3 above. For the definitions of the second text encoding network layer and the third text encoding network layer, please refer to the definition of the first text encoding network layer in Figure 3 above; the attention network layer is an Attention network.
  • FIG. 8 is a schematic model structure diagram of a fifth video recognition sub-model provided by an embodiment of the present application.
• The business server performs basic processing on the content text corresponding to the candidate shared video clip, including word segmentation and tokenization, and queries the initial word vector corresponding to each word (word 1, word 2, ..., word n as shown in Figure 8) through a vocabulary table (such as a lookup table). Each initial word vector is used as the input of the second text encoding network layer to understand the content text corresponding to the candidate shared video clip and obtain the content text features, that is, the representation corresponding to each word (the word 1 representation, word 2 representation, ..., word n representation shown in the figure).
• For the process of the business server obtaining the third object label feature (i.e., the object representation in Figure 8), please refer to the generation process of the second object label feature above, which will not be described again here.
  • the business server uses the content text features (word 1 representation, word 2 representation,..., word n representation), the third object label feature (object representation) and the shared word representation generated in the previous step as input to the attention network layer.
• and generates, step by step, the sharing copy (i.e., description text) corresponding to the candidate shared video clip.
• The business server multiplies the maximum probabilities of each generation step together, as the text sharing quality of the description text generated for the candidate shared video clip.
• the symbol "<S>" in Figure 8 indicates the start.
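• As a hedged sketch of this generation scheme, the following PyTorch code greedily decodes N shared words, with an attention step over the content text features at every position, and accumulates the product of per-step maximum probabilities as the text sharing quality. The network sizes, the GRU cell and the greedy strategy are illustrative assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FifthSubModelSketch(nn.Module):
    def __init__(self, vocab_size=8000, dim=128, max_len=10):
        super().__init__()
        self.max_len = max_len
        self.word_embed = nn.Embedding(vocab_size, dim)                  # content / shared words
        self.tag_embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")   # object label texts
        self.attn = nn.Linear(2 * dim, 1)                                # attention network layer
        self.decoder_cell = nn.GRUCell(2 * dim, dim)                     # produces S_(i+1)
        self.out = nn.Linear(dim, vocab_size)                            # text decoding layer

    def forward(self, content_ids, tag_ids, start_id=1):
        content = self.word_embed(content_ids)        # (1, n, dim) content text features
        tag_feature = self.tag_embed(tag_ids)         # (1, dim) third object label feature
        state = tag_feature                           # initial text feature to be decoded
        prev_word = torch.tensor([start_id])          # "<S>" start symbol
        words, quality = [], 1.0
        for _ in range(self.max_len):
            # Attention weights over the content text features, conditioned on S_i.
            query = state.unsqueeze(1).expand_as(content)
            weights = F.softmax(self.attn(torch.cat([content, query], dim=-1)), dim=1)
            context = (weights * content).sum(dim=1)
            # Fuse the content context and the previous shared word to get S_(i+1).
            step_input = torch.cat([context, self.word_embed(prev_word)], dim=-1)
            state = self.decoder_cell(step_input, state)
            probs = F.softmax(self.out(state), dim=-1)
            prob, prev_word = probs.max(dim=-1)       # greedy choice of the next shared word
            words.append(int(prev_word))
            quality *= float(prob)                    # product of max probabilities
        return words, quality                          # description text ids, text sharing quality

sketch = FifthSubModelSketch()
ids, q = sketch(torch.randint(2, 8000, (1, 12)), torch.randint(2, 8000, (1, 5)))
print(ids, q)
```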
  • Step S206 Determine the third sharing quality corresponding to the candidate shared video clip according to the image sharing quality corresponding to the candidate shared video clip and the text sharing quality corresponding to the candidate shared video clip; according to the description image corresponding to the candidate shared video clip and the candidate sharing The description text corresponding to the video clip determines the auxiliary description information corresponding to the candidate shared video clip.
  • the image sharing quality and text sharing quality of the candidate shared video segment may be used as the third sharing quality of the candidate shared video segment.
  • the description image can be used as the video cover of the candidate shared video clip
  • the description text can be used as the video copy of the candidate shared video clip.
• The embodiment of the present application takes the case where the auxiliary description information includes the description image and the description text as an example. In other embodiments, the auxiliary description information may only include description text, or only include description images, or may include audio content, etc. The embodiments of the present application do not limit the content of the auxiliary description information, which can be set according to actual application scenarios.
• Step S207 Determine the shared video segment from the candidate shared video segments based on the first sharing quality, the second sharing quality and the third sharing quality corresponding to the candidate shared video segments, and determine the shared video segment and the auxiliary description information corresponding to the shared video segment as shared data to be sent to the shared object.
  • the first sharing quality, the second sharing quality, the image sharing quality and the text sharing quality corresponding to the candidate shared video clips are weighted and summed to obtain the total sharing quality corresponding to the candidate shared video clips.
• For the subsequent process, please refer to the description of step S104 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
  • the embodiment of this application proposes a method for implementing intelligent video sharing.
• Based on the mining of the object's interest (i.e., the object tag text sequence), this method shares the shared video clips that match the shared object and constructs a personalized description image (which can be used as the cover of the shared video clip) and description text (which can be used as the copywriting of the shared video clip) suited to the shared object, so it can attract the shared object to watch the shared video clip, thereby improving the sharing conversion of the video platform and improving the overall playback situation of the video platform.
  • FIG. 9 is another schematic flowchart of a data processing method provided by an embodiment of the present application.
• This method may be executed by a business server (for example, the business server 100 shown in Figure 1 above), or by a terminal device (for example, the terminal device 200a shown in Figure 1 above), or executed by the business server and the terminal device interactively. The embodiment of this application takes the method being executed by the business server as an example for description. As shown in Figure 9, the method may include at least the following steps.
• Step S301 Obtain a training sample set; the training sample set includes a plurality of sample videos, an object label sample text sequence of the browsing sample object associated with each sample video, and a first quality label, a second quality label and a third quality label corresponding to each sample video.
• For each sample video, the following operations are performed to obtain the first quality label corresponding to the sample video:
• If the candidate first quality label corresponding to the sample video is less than the first quality label threshold, the candidate first quality label corresponding to the sample video is determined as the first quality label corresponding to the sample video; if the candidate first quality label corresponding to the sample video is equal to or greater than the first quality label threshold, the first quality label threshold is determined as the first quality label corresponding to the sample video.
• For each sample video, perform the following operations to obtain the second quality label corresponding to the sample video: obtain the first playback completion degree of the browsing sample object for the sample video; if the first playback completion degree is greater than the first playback completion threshold, it is determined that there is a first positive correlation between the object label sample text and the sample video, and the first positive correlation is determined as the second quality label of the sample video; if the first playback completion degree is less than or equal to the first playback completion threshold, it is determined that there is a first reverse correlation between the object label sample text and the sample video, and the first reverse correlation is determined as the second quality label of the sample video.
• the training sample set also includes a sample description image corresponding to each sample video; the third quality label includes a description image quality label; for each sample video: obtain the second playback completion degree of the browsing sample object for the sample video; if the second playback completion degree is greater than the second playback completion degree threshold, it is determined that there is a second positive correlation between the sample description image, the object label sample text and the sample video, and the second positive correlation is determined as the description image quality label of the sample video; if the second playback completion degree is less than or equal to the second playback completion degree threshold, it is determined that there is a second reverse correlation between the sample description image, the object label sample text and the sample video, and the second reverse correlation is determined as the description image quality label of the sample video.
• The third quality label includes a description text quality label; the method further includes: for each sample video, obtain the third playback completion degree of the browsing sample object for the sample video; if the third playback completion degree is greater than the third playback completion degree threshold, obtain the sample content text corresponding to the sample video, and add the sample content text to the training sample set; it is determined that there is a third positive correlation between the object label sample text sequence and the sample content text, and the third positive correlation is determined as the description text quality label of the sample video.
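• A minimal sketch of these labeling rules is shown below; the threshold values are illustrative assumptions, and positive and reverse correlations are encoded as 1 and 0 for convenience.

```python
def first_quality_label(candidate_label, threshold=0.9):
    """Clip the candidate first quality label at the first quality label threshold."""
    return candidate_label if candidate_label < threshold else threshold

def second_quality_label(play_completion, threshold=0.5):
    """1 = first positive correlation between object label text and video, else 0."""
    return 1 if play_completion > threshold else 0

def description_image_quality_label(play_completion, threshold=0.5):
    """1 = second positive correlation between image, label text and video, else 0."""
    return 1 if play_completion > threshold else 0

print(first_quality_label(0.95), second_quality_label(0.7), description_image_quality_label(0.3))
```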
• The training sample set may include a first training sample set for training the first video recognition sub-model, a second training sample set for training the second video recognition sub-model, and a third training sample set for training the third video recognition sub-model. When the auxiliary description information only includes description images, the third video recognition sub-model includes the fourth video recognition sub-model, and the third training sample set is the fourth training sample set; when the auxiliary description information only includes description text, the third video recognition sub-model includes the fifth video recognition sub-model, and the third training sample set is the fifth training sample set; when the auxiliary description information includes description images and description text, the third video recognition sub-model includes the fourth video recognition sub-model and the fifth video recognition sub-model, and the third training sample set includes the fourth training sample set and the fifth training sample set.
  • the first training sample set includes a plurality of sample videos and the first quality label corresponding to each sample video;
• the fifth training sample set includes a plurality of sample videos, an object label sample text sequence of the browsing sample object associated with each sample video, and the description text quality label corresponding to each sample video.
• The sample videos included in the above five training sample sets can be the same or different.
  • the main difference is that the labels and uses are different.
  • the video platform has a lot of short videos, so the short videos can be determined as sample videos.
• The duration of a short video is relatively short; for example, the duration corresponding to the short video is equal to the duration corresponding to a video clip.
  • the first quality label threshold, the first playback completion threshold, the second playback completion threshold, and the third playback completion threshold can all be adjusted according to actual application scenarios.
• the embodiments of this application do not place restrictions on the above four thresholds.
  • Step S302 Input the training sample set to the video recognition model, and determine the first prediction quality corresponding to each sample video through the video recognition model.
• The business server can input the first training sample set in step S301 to the first video recognition sub-model in the video recognition model. The process by which the business server obtains the first prediction quality corresponding to each sample video through the first video recognition sub-model is consistent with the process of obtaining the first sharing quality corresponding to a video clip through the first video recognition sub-model; therefore, please refer to the description of step S101 in the embodiment corresponding to Figure 3 above, and no further details will be given here.
  • Step S303 Determine the second prediction quality and the third prediction quality corresponding to each sample video according to the object label sample text sequence and each sample video.
• The business server can input the second training sample set in step S301 to the second video recognition sub-model in the video recognition model. The process by which the business server obtains the second prediction quality corresponding to each sample video through the second video recognition sub-model is consistent with the process of obtaining the second sharing quality corresponding to a video clip through the second video recognition sub-model; therefore, please refer to the description of step S102 in the embodiment corresponding to Figure 3 above, and no further details will be given here.
  • the business server may input the third training sample set in step S301 to the third video recognition sub-model in the video recognition model, where the business server obtains the third prediction quality corresponding to each sample video through the third video recognition sub-model.
• The processing process is consistent with the processing process of obtaining the third sharing quality corresponding to a video clip through the third video recognition sub-model; therefore, please refer to the description of step S103 in the embodiment corresponding to Figure 3 above, which will not be repeated here.
• Step S304 Adjust the parameters in the video recognition model according to the first quality label, the second quality label, the third quality label, the first prediction quality, the second prediction quality and the third prediction quality to obtain the trained video recognition model; the trained video recognition model is used to determine the shared data of the video; the shared data includes shared video segments in the video and auxiliary description information corresponding to the shared video segments.
  • the video recognition model includes a first video recognition sub-model used to determine the first prediction quality, a second video recognition sub-model used to determine the second prediction quality, and a third video recognition sub-model used to determine the third prediction quality.
• the parameters in the video recognition model include parameters in the first video recognition sub-model, parameters in the second video recognition sub-model and parameters in the third video recognition sub-model; determine the first quality loss value between the first quality label and the first prediction quality, and adjust the parameters in the first video recognition sub-model according to the first quality loss value to obtain the trained first video recognition sub-model; determine the second quality loss value between the second quality label and the second prediction quality, and adjust the parameters in the second video recognition sub-model according to the second quality loss value to obtain the trained second video recognition sub-model; determine the third quality loss value between the third quality label and the third prediction quality, and adjust the parameters in the third video recognition sub-model according to the third quality loss value to obtain the trained third video recognition sub-model; when the first video recognition sub-model, the second video recognition sub-model and the third video recognition sub-model all meet the model convergence conditions, the trained video recognition model is generated from the trained first video recognition sub-model, the trained second video recognition sub-model and the trained third video recognition sub-model.
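• A hedged sketch of one update step of this joint training procedure is given below; the choice of binary cross-entropy for the quality loss, the optimizer and the stand-in sub-model are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

def train_step(sub_model, optimizer, inputs, quality_label):
    """One parameter update of a single video recognition sub-model using the
    quality loss between its predicted quality and the quality label."""
    optimizer.zero_grad()
    predicted_quality = sub_model(*inputs)
    loss = nn.functional.binary_cross_entropy(predicted_quality, quality_label)
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy usage with a stand-in sub-model mapping a feature vector to a quality score.
toy_model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid(), nn.Flatten(start_dim=0))
opt = torch.optim.Adam(toy_model.parameters(), lr=1e-3)
features, labels = torch.randn(4, 8), torch.randint(0, 2, (4,)).float()
print(train_step(toy_model, opt, (features,), labels))

# Each sub-model is trained with its own sample set and quality label until all
# three sub-models meet the convergence condition; the trained video recognition
# model is then assembled from the three trained sub-models.
```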
• The embodiment of the present application performs in-depth modeling of the first video recognition sub-model through the first training sample set, so that the first video recognition sub-model can determine candidate video clips with high sharing value among multiple video clips; performs in-depth modeling of the second video recognition sub-model through the second training sample set, so that the second video recognition sub-model can determine candidate shared video clips with high sharing value among the candidate video clips; and performs in-depth modeling of the third video recognition sub-model through the third training sample set, so that the third video recognition sub-model can determine the third sharing quality corresponding to the candidate shared video clips.
• The shared video clips and their corresponding auxiliary description information can be determined through the sharing qualities of different dimensions, and then the shared data can be generated. Because the shared data is not only associated with the video content of the shared video clip itself but also associated with the object tag text sequence, the shared data can improve the sharing efficiency and sharing effect of the video.
  • FIG. 10 is a schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the above-mentioned data processing device 1 can be used to execute corresponding steps in the method provided by the embodiments of the present application.
  • the data processing device 1 may include: a first acquisition module 110 , a second acquisition module 120 , a first determination module 130 and a second determination module 140 .
• the first acquisition module 110 is configured to acquire at least two video clips in the video, determine the first sharing quality corresponding to the at least two video clips, and select at least one video clip from the at least two video clips as a candidate video clip according to the first sharing quality;
  • the second acquisition module 120 is configured to obtain an object tag text sequence associated with the video, where the object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share;
  • the object tag text of the browsing object is used to characterize the interest of the browsing object, and the object tag text of the shared object is used to characterize the interest of the shared object;
• the second sharing quality corresponding to each candidate video segment is determined according to the object tag text sequence and the candidate video segments; according to the second sharing quality corresponding to each candidate video segment, at least one candidate video segment is selected from the candidate video segments as the candidate shared video segment; the second sharing quality is used to characterize the relevance of the candidate video segment to the object tag text of the shared object;
  • the first determination module 130 is configured to determine the third sharing quality corresponding to each candidate shared video segment and the auxiliary description information corresponding to each candidate shared video segment according to the object label text sequence and the candidate shared video segment; the third Sharing quality is used to characterize the matching degree of the auxiliary description information with the candidate shared video clip and the object tag text of the shared object;
• the second determination module 140 is configured to determine the shared video segment from the candidate shared video segments according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video segment, and determine the shared video segment and the auxiliary description information corresponding to the shared video segment as shared data to be sent to the shared object.
• For the specific functional implementation of the first acquisition module 110, the second acquisition module 120, the first determination module 130 and the second determination module 140, please refer to steps S101 to S104 in the embodiment corresponding to Figure 3 above, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.
  • FIG. 11 is another schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the above-mentioned data processing device 2 can be used to execute corresponding steps in the method provided by the embodiments of the present application.
• the data processing device 2 may include: a first acquisition module 11, a second acquisition module 12, a first determination module 13 and a second determination module 14.
• The first acquisition module 11 in Figure 11 has all or part of the functions of the first acquisition module 110 in Figure 10; the second acquisition module 12 in Figure 11 has all or part of the functions of the second acquisition module 120 in Figure 10; the first determination module 13 in Figure 11 has all or part of the functions of the first determination module 130 in Figure 10; the second determination module 14 in Figure 11 has all or part of the functions of the second determination module 140 in Figure 10.
  • the first acquisition module 11 may include: a first processing unit 111 and a first acquisition unit 112 .
  • the first processing unit 111 is used to obtain the video, segment the video according to the time window, and obtain at least two video segments corresponding to the video;
  • the first acquisition unit 112 is configured to perform the following operations for each video segment in the at least two video segments to determine the first sharing quality corresponding to the video segment:
• K is a positive integer; the video features corresponding to the K video frames are fused to obtain the video feature of the video clip;
  • the first sharing quality of each video clip is determined respectively.
• For the specific functional implementation of the first processing unit 111 and the first acquisition unit 112, please refer to step S101 in the embodiment corresponding to FIG. 3, which will not be described again here.
  • the second acquisition module 12 may include: a second acquisition unit 121 and a generation unit 122 .
  • the second obtaining unit 121 is used to obtain the object tag text of the browsing object associated with the video, and obtain the object tag text of the shared object associated with the browsing object;
  • the object tag text sequence is generated according to the object tag text of the browse object and the object tag text of the shared object.
  • the generation unit 122 is configured to perform the following operations for each candidate video segment to determine the second sharing quality corresponding to the candidate video segment:
  • the object label text sequence and the candidate video segment are respectively input to a video recognition model;
  • the video recognition model includes a second video recognition sub-model;
  • the second video recognition sub-model includes a first text encoding network layer;
  • text encoding is performed on each object label text in the object label text sequence to obtain the first object label feature corresponding to the object label text sequence;
  • Multi-dimensional fusion features corresponding to the candidate video segments are obtained, and second sharing quality corresponding to the candidate video segments is determined based on the first object label features and the multi-dimensional fusion features corresponding to the candidate video segments.
• For the specific functional implementation of the second acquisition unit 121 and the generation unit 122, please refer to step S102 in the embodiment corresponding to FIG. 3, which will not be described again here.
  • the auxiliary description information corresponding to the candidate shared video clip includes a description image corresponding to the candidate shared video clip, and a description text corresponding to the candidate shared video clip;
• the third sharing quality corresponding to the candidate shared video clip includes the image sharing quality corresponding to the description image, and the text sharing quality corresponding to the description text;
  • the first determination module 13 may include: a third acquisition unit 131, a second determination unit 132, and a third determination unit 133.
  • the third obtaining unit 131 is used to obtain at least two video frames in the candidate shared video clips
  • the second determining unit 132 is configured to determine the image sharing quality corresponding to each video frame in the at least two video frames, determine the image sharing quality of the candidate shared video segment according to the image sharing quality corresponding to each video frame, and determine the image sharing quality from Select one video frame from the at least two video frames as the description image corresponding to the candidate shared video segment;
  • the third determination unit 133 is configured to determine the text sharing quality corresponding to the candidate shared video clips and the description text corresponding to the candidate shared video clips based on the object tag text sequence and the content text corresponding to the candidate shared video clips.
  • the second determination module 14 may include: a quality summation unit 141 and a fourth determination unit 142 .
  • the quality summation unit 141 is configured to perform a weighted sum of the first shared quality, the second shared quality, and the third shared quality corresponding to each candidate shared video segment, respectively, to obtain the total shared quality corresponding to each candidate shared video segment;
  • the fourth determination unit 142 is configured to determine the candidate shared video segment with the largest total sharing quality among the at least two candidate shared video segments as the shared video segment;
  • the auxiliary description information corresponding to the shared video clip is obtained.
• For the specific functional implementation of the quality summation unit 141 and the fourth determination unit 142, please refer to step S104 in the embodiment corresponding to FIG. 3 above, which will not be described again here.
• The shared data in this application is determined based on the sharing qualities of different dimensions. It is not only associated with the video content of the shared video clip itself, but also associated with the object tag text sequence. Therefore, the shared data can improve the sharing efficiency and sharing effect of the video.
• Figure 12 is another schematic structural diagram of a data processing device provided by an embodiment of the present application. The above-mentioned data processing device 3 can be used to execute corresponding steps in the method provided by the embodiments of this application. As shown in FIG. 12, the data processing device 3 may include: a first acquisition module 210, a first determination module 220, a second determination module 230 and a parameter adjustment module 240.
  • the first acquisition module 210 is used to acquire a training sample set;
• the training sample set includes a plurality of sample videos, an object label sample text sequence of the browsing sample object associated with each sample video, and a first quality label, a second quality label and a third quality label corresponding to each sample video;
  • the first determination module 220 is used to input the training sample set to the video recognition model, and determine the first prediction quality corresponding to each sample video through the video recognition model;
  • the second determination module 230 is configured to determine the second prediction quality and the third prediction quality corresponding to each sample video according to the object label sample text sequence and the plurality of sample videos;
• the parameter adjustment module 240 is used to adjust the parameters in the video recognition model according to the first quality label, the second quality label, the third quality label, the first prediction quality, the second prediction quality and the third prediction quality to obtain the trained video recognition model; the trained video recognition model is used to determine the shared data of the video; the shared data includes the shared video clips in the video and the auxiliary description information corresponding to the shared video clips.
• The specific functional implementation of the first acquisition module 210, the first determination module 220, the second determination module 230 and the parameter adjustment module 240 can be referred to steps S301 to S304 in the embodiment corresponding to Figure 9 above, and will not be described again here.
  • the description of the beneficial effects of using the same method will not be described again.
  • FIG. 13 is another schematic structural diagram of a data processing device provided by an embodiment of the present application.
  • the above-mentioned data processing device 4 can be used to execute corresponding steps in the method provided by the embodiments of the present application.
  • the data processing device 4 may include: a first acquisition module 21 , a first determination module 22 , a second determination module 23 and a parameter adjustment module 24 .
• The first acquisition module 21 in Figure 13 has all or part of the functions of the first acquisition module 210 in Figure 12; the first determination module 22 in Figure 13 has all or part of the functions of the first determination module 220 in Figure 12; the second determination module 23 in Figure 13 has all or part of the functions of the second determination module 230 in Figure 12; the parameter adjustment module 24 in Figure 13 has all or part of the functions of the parameter adjustment module 240 in Figure 12.
• the data processing device 4 may also include: a first operation module 25, a second operation module 26, a second acquisition module 27, a third determination module 28, a proportion summation module 29, a first comparison module 30 and a fourth determination module 31.
  • the first operation module 25 is configured to perform a product operation for each sample video on the number of plays, duration and average play completion corresponding to the sample video to obtain the first sample parameter corresponding to the sample video;
• the second operation module 26 is used to, for each sample video, sum the number of object comment texts corresponding to the sample video and the number of interactions with the object comment texts to obtain the second sample parameter corresponding to the sample video;
• the second acquisition module 27 is configured to obtain the maximum value of the first sample parameter among the first sample parameters corresponding to at least two sample videos, and obtain the maximum value of the second sample parameter among the second sample parameters corresponding to the at least two sample videos;
• the third determination module 28 is used to determine the first ratio between the first sample parameter corresponding to each sample video and the maximum value of the first sample parameter, and determine the second ratio between the second sample parameter corresponding to each sample video and the maximum value of the second sample parameter;
  • the proportion summation module 29 is used to perform a weighted sum of the first proportion and the second proportion of each sample video to obtain the candidate first quality label corresponding to each sample video;
  • the first comparison module 30 is used to compare the first quality label candidate corresponding to each sample video with the first quality label threshold respectively;
• the fourth determination module 31 is configured to, for each sample video, determine the candidate first quality label corresponding to the sample video as the first quality label corresponding to the sample video if the candidate first quality label corresponding to the sample video is less than the first quality label threshold; the fourth determination module 31 is also configured to determine the first quality label threshold as the first quality label corresponding to the sample video if the candidate first quality label corresponding to the sample video is equal to or greater than the first quality label threshold.
• The specific functional implementation of the above modules can be referred to step S301 in the embodiment corresponding to FIG. 9, and will not be described again here.
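• For illustration, a minimal sketch of the computation carried out by these modules is shown below; the weights, threshold and field values are placeholders, not values from the original disclosure.

```python
def candidate_first_quality_labels(samples, w1=0.5, w2=0.5, label_threshold=0.9):
    """samples: dicts with plays, duration, avg_completion, comments, interactions."""
    p1 = [s["plays"] * s["duration"] * s["avg_completion"] for s in samples]  # first sample parameter
    p2 = [s["comments"] + s["interactions"] for s in samples]                 # second sample parameter
    max1, max2 = max(p1), max(p2)
    labels = []
    for a, b in zip(p1, p2):
        candidate = w1 * (a / max1) + w2 * (b / max2)   # weighted sum of the two ratios
        labels.append(min(candidate, label_threshold))   # clip at the first quality label threshold
    return labels

videos = [
    {"plays": 1200, "duration": 30, "avg_completion": 0.8, "comments": 40, "interactions": 300},
    {"plays": 300, "duration": 45, "avg_completion": 0.5, "comments": 10, "interactions": 50},
]
print(candidate_first_quality_labels(videos))
```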
  • the data processing device 4 may further include: a second comparison module 32 and a fifth determination module 33 .
  • the second comparison module 32 is used to obtain the first playback completion degree of the browse sample object for each sample video, and compare the first playback completion degree of each sample video with the first playback completion degree threshold respectively;
  • the fifth determination module 33 is configured to determine, for each sample video, that there is a first positive association between the object label sample text and the sample video if the first playback completion degree of the sample video is greater than the first playback completion degree threshold. relationship, determine the first positive relationship as the second quality label of the sample video;
  • the fifth determination module 33 is also configured to determine that there is a first reverse association between the object label sample text and the sample video if the first playback completion degree of the sample video is less than or equal to the first playback completion degree threshold, and the The first reverse correlation relationship is determined as the second quality label of the sample video.
• The specific functional implementation of the second comparison module 32 and the fifth determination module 33 can be referred to step S301 in the embodiment corresponding to FIG. 9, and will not be described again here.
  • the training sample set also includes the sample description image corresponding to the sample video;
• the third quality label includes a description image quality label;
  • the data processing device 4 may also include: a third comparison module 34 and a sixth determination module 35 .
  • the third comparison module 34 is used to obtain the second playback completion degree of the browse sample object for each sample video, and compare the second playback completion degree of each sample video with the second playback completion degree threshold respectively;
  • the sixth determination module 35 is used for each sample video, if the second playback completion degree of the sample video is greater than the second playback completion degree threshold, determine the sample description image, the object label sample text and the sample corresponding to the sample video. There is a second positive correlation between the videos, and the second positive correlation is determined as the descriptive image quality label of the sample video;
  • the sixth determination module 35 is also configured to determine the relationship between the sample description image corresponding to the sample video, the object label sample text and the sample video if the second playback completion degree of the sample video is less than or equal to the second playback completion degree threshold. There is a second reverse correlation relationship, and the second reverse correlation relationship is determined as a descriptive image quality label of the sample video.
• The specific functional implementation of the third comparison module 34 and the sixth determination module 35 can be referred to step S301 in the embodiment corresponding to FIG. 9, and will not be described again here.
  • the third quality label includes a description text quality label
  • the data processing device 4 may also include: a third acquisition module 36 , a fourth acquisition module 37 and a seventh determination module 38 .
  • the third acquisition module 36 is used to obtain the third playback completion degree of the browsed sample object for each sample video
  • the fourth acquisition module 37 is used for each sample video, if the third playback completion degree of the sample video is greater than the third playback completion degree threshold, obtain the sample content text corresponding to the sample video, and add the sample content text to the training sample set;
  • the seventh determination module 38 is used to determine that there is a third positive correlation relationship between the object label sample text sequence and the sample content text of the sample video, and determine the third positive correlation relationship as the description text quality label of the sample video.
• For the specific functional implementation of the third acquisition module 36, the fourth acquisition module 37 and the seventh determination module 38, please refer to step S301 in the embodiment corresponding to FIG. 9, which will not be described again here.
  • the video recognition model includes a first video recognition sub-model for determining the first prediction quality, a second video recognition sub-model for determining the second prediction quality, and a third video recognition sub-model for determining the third prediction quality.
  • the parameters in the video recognition model include parameters in the first video recognition sub-model, parameters in the second video recognition sub-model, and parameters in the third video recognition sub-model;
  • the parameter adjustment module 24 may include: a first adjustment unit 241, a second adjustment unit 242, a third adjustment unit 243, and a model generation unit 244.
• the first adjustment unit 241 is used to determine the first quality loss value between the first quality label and the first prediction quality, and adjust the parameters in the first video recognition sub-model according to the first quality loss value to obtain the trained first video recognition sub-model;
  • the second adjustment unit 242 is used to determine the second quality loss value between the second quality label and the second prediction quality, and adjust the parameters in the second video recognition sub-model according to the second quality loss value to obtain the trained The second video recognition sub-model;
  • the third adjustment unit 243 is used to determine the third quality loss value between the third quality label and the third prediction quality, and adjust the parameters in the third video recognition sub-model according to the third quality loss value to obtain the trained The third video recognition sub-model;
  • the model generation unit 244 is configured to generate, when the first video recognition sub-model, the second video recognition sub-model and the third video recognition sub-model all meet the model convergence conditions, the trained first video recognition sub-model, the trained The second video recognition sub-model and the trained video recognition model of the trained third video recognition sub-model.
  • step S304 for the specific functional implementation of the first adjustment unit 241, the second adjustment unit 242, the third adjustment unit 243 and the model generation unit 244, please refer to step S304 in the corresponding embodiment of FIG. 9, which will not be described again here.
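As a rough illustration of how the three adjustment units could update their respective sub-models independently, the sketch below assumes a PyTorch-style setup with one optimizer per sub-model and mean-squared-error quality losses; the module interfaces and the choice of loss are assumptions for the example only, not the method defined by this application.

```python
import torch.nn.functional as F

# Illustrative sketch: each sub-model is adjusted by its own quality loss value.
# sub_models and optimizers are assumed dicts keyed by "first", "second", "third".

def train_step(sub_models, optimizers, batch):
    predictions = (
        sub_models["first"](batch["clip_features"]),
        sub_models["second"](batch["clip_features"], batch["tag_text"]),
        sub_models["third"](batch["clip_features"], batch["tag_text"]),
    )
    labels = (batch["first_label"], batch["second_label"], batch["third_label"])

    losses = []
    for name, pred, label in zip(("first", "second", "third"), predictions, labels):
        loss = F.mse_loss(pred, label)   # quality loss value for this sub-model
        optimizers[name].zero_grad()
        loss.backward()                  # gradients flow only through this sub-model
        optimizers[name].step()
        losses.append(loss.item())
    return losses                        # training stops once all three sub-models converge
```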
  • the embodiment of the present application performs in-depth modeling on the first video recognition sub-model through the first training sample set, so that the first video recognition sub-model can determine candidate video clips with high sharing value among multiple video clips; performs in-depth modeling on the second video recognition sub-model through the second training sample set, so that the second video recognition sub-model can determine candidate shared video clips with high sharing value among the candidate video clips; and performs in-depth modeling on the third video recognition sub-model through the third training sample set, so that the third video recognition sub-model can determine the third sharing quality and the auxiliary description information corresponding to the candidate shared video clips. The shared video clip and its corresponding auxiliary description information can then be determined, and the shared data can be generated. Since the shared data is associated not only with the video content of the shared video clip itself but also with the object tag text sequence, the shared data can improve the sharing efficiency and sharing effect of the video.
  • the computer device 1000 may include: at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002.
  • the communication bus 1002 is used to realize connection communication between these components.
  • the user interface 1003 may include a display and a keyboard
  • the network interface 1004 may include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Figure 14, memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 can provide network communication functions; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the data processing method described in the above embodiments.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • when the computer program is executed by a processor, the data processing method or device described in the previous embodiments is implemented, which will not be repeated here.
  • the description of the beneficial effects of using the same method will likewise not be repeated.
  • the above-mentioned computer-readable storage medium may be an internal storage unit of the data processing apparatus provided in any of the foregoing embodiments or of the above-mentioned computer device, such as the hard disk or memory of the computer device.
  • the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, etc., equipped on the computer device.
  • the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • An embodiment of the present application also provides a computer program product.
  • the computer program product includes a computer program, and the computer program is stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device can execute the description of the data processing method or device in the previous embodiments, which will not be described again here.
  • the description of the beneficial effects of using the same method will not be repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A data processing method, and a device and a computer-readable storage medium. The method comprises: according to first sharing qualities respectively corresponding to at least two video clips, determining candidate video clips; according to an object label text sequence and the candidate video clips, determining second sharing qualities corresponding to the candidate video clips, and according to the second sharing qualities corresponding to the candidate video clips, determining candidate shared video clips; according to the object label text sequence and the candidate shared video clips, determining third sharing qualities corresponding to the candidate shared video clips and auxiliary description information corresponding to the candidate shared video clips; and according to the first sharing quality, the second sharing quality, the third sharing quality and the auxiliary description information, which respectively correspond to each candidate shared video clip, determining shared data.

Description

Data processing method, device and computer-readable storage medium
This application claims priority to the Chinese patent application No. 202210336414.6, filed with the China Patent Office on April 1, 2022 and entitled "A data processing method, device and computer-readable storage medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of Internet technology, and in particular, to a data processing method, a device and a computer-readable storage medium.
Background
Computer vision (CV) technology is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as identifying and measuring targets, and to further perform graphics processing so that the processed images become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can obtain information from images or multi-dimensional data. Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, smart transportation and other technologies, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Video sharing means that a browsing object of a video shares the video with other browsing objects while browsing the video in a video application. Video sharing is a main way for video browsing objects to communicate with each other, and has a great impact on the object activity and playback of the video application.
Technical content
Embodiments of the present application provide a data processing method, a device and a computer-readable storage medium, which can save network transmission resources and the processing resources of the device receiving shared data while improving the sharing efficiency and sharing effect of a video.
In one aspect, embodiments of the present application provide a data processing method, executed by a computer device, including:
obtaining at least two video clips in a video, determining the first sharing quality corresponding to each of the at least two video clips, and selecting, according to the first sharing quality, at least one video clip from the at least two video clips as a candidate video clip;
obtaining an object tag text sequence associated with the video, where the object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share; the object tag text of the browsing object is used to characterize the interest of the browsing object, and the object tag text of the shared object is used to characterize the interest of the shared object;
determining, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and selecting, according to the second sharing quality corresponding to each candidate video clip, at least one candidate video clip from the candidate video clips as a candidate shared video clip, where the second sharing quality is used to characterize the correlation between the candidate video clip and the object tag text of the shared object;
determining, according to the object tag text sequence and the candidate shared video clips, the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip, where the third sharing quality is used to characterize the matching degree between the auxiliary description information and the candidate shared video clip as well as the object tag text of the shared object;
determining, according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, a shared video clip from the candidate shared video clips, and determining the shared video clip and the auxiliary description information corresponding to the shared video clip as shared data to be sent to the shared object.
In one aspect, embodiments of the present application provide a data processing apparatus, including:
a first acquisition module, used to acquire at least two video clips in a video, determine the first sharing quality corresponding to each of the at least two video clips, and select, according to the first sharing quality, at least one video clip from the at least two video clips as a candidate video clip;
a second acquisition module, used to obtain an object tag text sequence associated with the video, where the object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share; the object tag text of the browsing object is used to characterize the interest of the browsing object, and the object tag text of the shared object is used to characterize the interest of the shared object; and to determine, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and select, according to the second sharing quality corresponding to each candidate video clip, at least one candidate video clip from the candidate video clips as a candidate shared video clip, where the second sharing quality is used to characterize the correlation between the candidate video clip and the object tag text of the shared object;
a first determination module, used to determine, according to the object tag text sequence and the candidate shared video clips, the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip, where the third sharing quality is used to characterize the matching degree between the auxiliary description information and the candidate shared video clip as well as the object tag text of the shared object;
a second determination module, used to determine, according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, a shared video clip from the candidate shared video clips, and determine the shared video clip and the auxiliary description information corresponding to the shared video clip as shared data to be sent to the shared object.
Embodiments of the present application further provide a computer device, including: a processor, a memory and a network interface;
the processor is connected to the memory and the network interface, where the network interface is used to provide data communication functions, the memory is used to store a computer program, and the processor is used to call the computer program so that the computer device executes the method in the embodiments of the present application.
In one aspect, embodiments of the present application provide a computer-readable storage medium. The computer-readable storage medium stores a computer program, and the computer program is suitable for being loaded by a processor to execute the method in the embodiments of the present application.
Embodiments of the present application further provide a computer program product. The computer program product includes a computer program, and the computer program is stored in a computer-readable storage medium; the processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device executes the method in the embodiments of the present application.
Brief description of the drawings
Figure 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
Figure 2 is a schematic diagram of a data processing scenario provided by an embodiment of the present application;
Figure 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
Figure 4 is a schematic diagram of the model structure of a first video recognition sub-model provided by an embodiment of the present application;
Figure 5 is a schematic diagram of the model structure of a second video recognition sub-model provided by an embodiment of the present application;
Figure 6 is another schematic flowchart of a data processing method provided by an embodiment of the present application;
Figure 7 is a schematic diagram of the model structure of a fourth video recognition sub-model provided by an embodiment of the present application;
Figure 8 is a schematic diagram of the model structure of a fifth video recognition sub-model provided by an embodiment of the present application;
Figure 9 is yet another schematic flowchart of a data processing method provided by an embodiment of the present application;
Figure 10 is a schematic structural diagram of a data processing device provided by an embodiment of the present application;
Figure 11 is another schematic structural diagram of a data processing device provided by an embodiment of the present application;
Figure 12 is another schematic structural diagram of a data processing device provided by an embodiment of the present application;
Figure 13 is another schematic structural diagram of a data processing device provided by an embodiment of the present application;
Figure 14 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the protection scope of this application.
In some solutions, the video sharing process shares the entire video content with friends, and the auxiliary description information carried with it is information built in advance by the operating platform corresponding to the video application. Obviously, sharing the entire video occupies excessive network resources and thus reduces the sharing efficiency of the video; and because the same auxiliary description information is shared with different objects, the sharing display form is too uniform, which reduces the sharing effect.
In the embodiments of the present application, the computer device determines the first sharing quality corresponding to each of at least two video clips in a video, so candidate video clips can be determined from the at least two video clips according to the first sharing quality; it can be understood that a candidate video clip belongs to the video and its sharing value (quality) is better than that of the video as a whole. Further, the computer device obtains the object tag text sequence associated with the video and determines, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, so candidate shared video clips can be determined from the candidate video clips according to the second sharing quality; it can be understood that a candidate shared video clip is determined based not only on the video content of the candidate video clip but also on the object tag text sequence, so its sharing value (quality) is better than that of the candidate video clips. Further, the computer device determines, according to the object tag text sequence and the candidate shared video clips, the third sharing quality corresponding to each candidate shared video clip, and determines, according to the third sharing quality, the auxiliary description information corresponding to each candidate shared video clip; it can be understood that the auxiliary description information is associated not only with the candidate shared video clip but also with the object tag text sequence. Further, the computer device determines, according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, a shared video clip from the candidate shared video clips, and determines the shared video clip and the auxiliary description information corresponding to the shared video clip as shared data to be sent to the shared object. As can be seen from the above, the shared data in this application is determined based on sharing qualities of different dimensions and is associated not only with the video content of the shared video clip itself but also with the object tag text sequence, so the shared data can improve the sharing efficiency and sharing effect of the video. Moreover, since video clips rather than the entire video are shared, network transmission resources and the processing resources of the device receiving the shared data can be saved.
Please refer to Figure 1, which is a schematic diagram of a system architecture provided by an embodiment of the present application. As shown in Figure 1, the system may include a business server 100 and a terminal device cluster. The terminal device cluster may include terminal device 200a, terminal device 200b, terminal device 200c, ..., terminal device 200n. It can be understood that the above system may include one or more terminal devices, and this application does not limit the number of terminal devices.
There may be communication connections between the terminal devices; for example, a communication connection exists between terminal device 200a and terminal device 200b, and between terminal device 200a and terminal device 200c. Meanwhile, any terminal device in the terminal device cluster may have a communication connection with the business server 100; for example, a communication connection exists between terminal device 200a and the business server 100. The above communication connection is not limited to a particular connection method: it may be a direct or indirect connection through wired communication, a direct or indirect connection through wireless communication, or other methods, which are not limited in this application.
It should be understood that each terminal device in the terminal device cluster shown in Figure 1 may be installed with an application client. When the application client runs on a terminal device, it can exchange data with the business server 100 shown in Figure 1 through the above communication connections. The application client may be a video application, a live broadcast application, a social application, an instant messaging application, a game application, a music application, a shopping application, a novel application, a browser, or another application client with a video loading function. The application client may be an independent client, or an embedded sub-client integrated in a certain client (for example, a social client, an education client, a multimedia client, etc.), which is not limited here. Taking a video application as an example, the business server 100 may be a collection of multiple servers including a background server and a data processing server corresponding to the video application. Therefore, each terminal device can perform data transmission with the business server 100 through the application client corresponding to the video application; for example, each terminal device can upload its local video to the business server 100 through the application client of the video application, and the business server 100 can then deliver the video to other terminal devices or transmit it to a cloud server.
It can be understood that the specific implementations of this application involve data related to user information (such as the object tag text sequence). When the embodiments of this application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the relevant data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
To facilitate subsequent understanding and explanation, in this embodiment of the present application one terminal device in the terminal device cluster shown in Figure 1 can be selected as the target terminal device, for example terminal device 200a. When a video is obtained and a video sharing instruction to share the video with a shared object associated with the browsing object is received, terminal device 200a can send the video identifier, the browsing object identifier and the shared object identifier as data to be identified to the business server 100. In this embodiment, the user using terminal device 200a is called the browsing object, and a user associated with the browsing object (for example, a friend user) is called the shared object. The embodiment of this application does not limit the browsing object identifier (the browsing object has given authorization), which includes but is not limited to the mobile phone number and identification number bound to the browsing object in the application client and can be set according to the actual application scenario; the same applies to the shared object identifier. The video identifier can be any information that can be used to identify the video in the application client.
Further, after receiving the data to be identified sent by terminal device 200a, the business server 100 can obtain the video according to the video identifier, and obtain the object tag text sequence according to the browsing object identifier and the shared object identifier. The object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share; the object tag text of the browsing object is used to characterize the interest of the browsing object, and the object tag text of the shared object is used to characterize the interest of the shared object. The business server 100 obtains at least two video clips in the video and obtains a trained video recognition model, which may include a first video recognition sub-model, a second video recognition sub-model and a third video recognition sub-model. Through the first video recognition sub-model, the business server 100 can determine the first sharing quality corresponding to each of the at least two video clips, and can determine candidate video clips from the at least two video clips according to the first sharing quality. Further, in the second video recognition sub-model, the business server 100 can determine, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and can determine candidate shared video clips from the candidate video clips according to the second sharing quality corresponding to each candidate video clip. Further, in the third video recognition sub-model, the business server 100 can determine, according to the object tag text sequence and the candidate shared video clips, the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip. Further, according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, the business server 100 can determine a shared video clip from the candidate shared video clips, and determine the shared video clip and the auxiliary description information corresponding to the shared video clip as shared data to be sent to the shared object.
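The three-stage selection flow performed by the business server can be outlined as a short sketch. This is schematic only: the predict interfaces and threshold values are illustrative assumptions rather than the concrete model behavior defined in the embodiments, and the final ranking step is shown further below.

```python
# Schematic sketch of the server-side three-stage selection flow described above.
# The `predict` interfaces and the threshold values are illustrative assumptions.

def score_candidate_shared_clips(video_clips, tag_text_sequence, models,
                                 first_threshold=0.8, second_threshold=0.85):
    # Stage 1: first sharing quality from the first sub-model, then filter.
    stage1 = [(clip, models["first"].predict(clip)) for clip in video_clips]
    candidates = [(clip, q1) for clip, q1 in stage1 if q1 >= first_threshold]

    # Stage 2: second sharing quality against the object tag text sequence.
    stage2 = [(clip, q1, models["second"].predict(clip, tag_text_sequence))
              for clip, q1 in candidates]
    candidate_shared = [item for item in stage2 if item[2] > second_threshold]

    # Stage 3: third sharing quality plus auxiliary description information.
    results = []
    for clip, q1, q2 in candidate_shared:
        q3, description = models["third"].predict(clip, tag_text_sequence)
        results.append({"clip": clip, "qualities": (q1, q2, q3),
                        "auxiliary_description": description})
    return results  # the final shared clip is chosen from these candidates
```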
Subsequently, the business server 100 sends the shared data to terminal device 200a. After receiving the shared data, terminal device 200a can display it on its screen and can further send the shared data carrying the video identifier to the terminal device corresponding to the shared object (for example, terminal device 200b in Figure 1). After obtaining the shared data carrying the video identifier, terminal device 200b can display it on its screen, and the shared object can then view the complete video according to the video identifier carried by the shared data. In some embodiments, if the browsing object authorizes the business server 100 to have sharing permission, the business server 100 can, after generating the shared data, send the shared data directly to the terminal device corresponding to the shared object (terminal device 200b in Figure 1); for the subsequent process, please refer to the above description, which will not be repeated here.
In some embodiments, the business server 100 generates a sharing identifier for the shared video clip and sends the sharing identifier and the auxiliary description information to terminal device 200a. After obtaining the sharing identifier, terminal device 200a can generate sharing information for the video that carries the sharing identifier and the auxiliary description information, and then send the sharing information to terminal device 200b corresponding to the shared object. When terminal device 200b obtains the sharing information, it can play the shared video clip in the video according to the sharing identifier. In some embodiments, if the browsing object authorizes the business server 100 to have sharing permission, the business server 100 can, after generating the sharing identifier, send the sharing identifier and the auxiliary description information to terminal device 200b; for the subsequent process, please refer to the above description, which will not be repeated here.
In some embodiments, if the above video recognition model is stored locally on terminal device 200a, terminal device 200a can determine, through the video recognition model, the first sharing quality corresponding to each of at least two video clips in the video, and thus determine candidate video clips from the at least two video clips; according to the object tag text sequence and the candidate video clips, terminal device 200a can determine the second sharing quality corresponding to each candidate video clip, and then determine candidate shared video clips from the candidate video clips; according to the object tag text sequence and the candidate shared video clips, terminal device 200a can determine the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip; according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, terminal device 200a can determine a shared video clip from the candidate shared video clips, and can therefore determine the shared video clip and the auxiliary description information corresponding to the shared video clip as shared data to be sent to the shared object.
Since training the video recognition model involves a large amount of offline computation, the local video recognition model on terminal device 200a may be sent to terminal device 200a by the business server 100 after training is completed.
It can be understood that the shared data in the embodiments of this application is automatically constructed based on the video and the object tag text sequence and has high sharing value. The shared video clip can therefore intuitively reflect the highlights of the video while matching the interest tags of the browsing object and the shared object, which can improve the efficiency and effect of video sharing.
It should be noted that the above business server 100, terminal device 200a, terminal device 200b, terminal device 200c, ..., terminal device 200n can all be blockchain nodes in a blockchain network. The data described throughout this document (such as the object tag text sequence and the shared data) can be stored, and the storage method can be that a blockchain node generates a block from the data and adds the block to the blockchain for storage.
Blockchain is a new application model of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. It is mainly used to organize data in chronological order and encrypt it into a ledger, so that the data cannot be tampered with or forged, while allowing data verification, storage and updating. A blockchain is essentially a decentralized database in which each node stores an identical copy of the chain. A blockchain network can divide nodes into core nodes, data nodes and light nodes, which together form the blockchain nodes. The core nodes are responsible for the consensus of the entire blockchain network, that is, the core nodes are the consensus nodes. The process by which transaction data is written into the ledger can be as follows: a data node or light node in the blockchain network obtains the transaction data and passes it along the network (that is, nodes pass it on like a relay baton) until a consensus node receives it; the consensus node then packages the transaction data into a block, performs consensus on the block, and writes the transaction data into the ledger after the consensus is completed. Taking the object tag text sequence and the shared data as example transaction data, the business server 100 (a blockchain node), after reaching consensus on the transaction data, generates a block from the transaction data and stores the block in the blockchain network; to read the transaction data (that is, the object tag text sequence and the shared data), a blockchain node can obtain the block containing the transaction data from the blockchain network and then obtain the transaction data from the block.
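As a minimal, simplified sketch of the block-append step mentioned above, the snippet below packages transaction data (for example, the object tag text sequence or the shared data) into a hash-linked block; the block fields and the omission of a real consensus protocol are simplifications for illustration only.

```python
import hashlib
import json
import time

# Simplified sketch: package transaction data into a block and append it to the chain.
# Assumes transaction_data is JSON-serializable; consensus is not modeled here.

def make_block(transaction_data, previous_hash):
    block = {
        "timestamp": time.time(),
        "transactions": transaction_data,
        "previous_hash": previous_hash,
    }
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def append_block(chain, transaction_data):
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    chain.append(make_block(transaction_data, previous_hash))
    return chain
```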
It can be understood that the method provided by the embodiments of this application can be executed by a computer device, which includes but is not limited to a terminal device or a business server. The business server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud services, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. Terminal devices include but are not limited to mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle-mounted terminals, aircraft, etc. The terminal device and the business server can be connected directly or indirectly through wired or wireless methods, which is not limited in the embodiments of this application.
Further, please refer to Figure 2, which is a schematic diagram of a data processing scenario provided by an embodiment of the present application. The embodiments of this application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, smart transportation and assisted driving, and are applicable to business scenarios such as video clip recommendation, video clip distribution and video clip search, which will not be listed one by one here. The data processing scenario can be implemented in the business server, in the terminal device, or interactively between the terminal device and the business server, which is not limited here. For ease of description and understanding, the embodiments of this application take the interactive execution between a terminal device and a business server as an example, where the terminal device can be any terminal device in the terminal device cluster of the embodiment corresponding to Figure 1 (Figure 2 takes terminal device 200a as an example), and the business server can be the business server 100 of the embodiment corresponding to Figure 1.
As shown in Figure 2, browsing object 20b is bound to terminal device 200a. When browsing object 20b browses video 201a through terminal device 200a, terminal device 200a can display the basic information of video 201a on the playback interface, such as the video duration (6 minutes in the example of Figure 2), the video cover (cat image 205a in the example of Figure 2) and the video copy (the example copy "kittens fighting for food" 206a). In addition, terminal device 200a can also display controls for video 201a on the playback interface, such as playback control 207a and sharing control 202a illustrated in Figure 2. When browsing object 20b triggers sharing control 202a, terminal device 200a responds to the trigger operation and displays the friend list of browsing object 20b; the example friend list in Figure 2 includes three friends, namely friend "aa", friend "bb" and friend "cc". If browsing object 20b triggers selection control 203a corresponding to friend "cc", terminal device 200a can display a prompt sub-page, which can display a "cancel" control and a "share" control 204a. When browsing object 20b triggers the "share" control 204a, terminal device 200a determines friend "cc" as the shared object.
It can be understood that the interfaces and controls shown in Figure 2 are only some representations for reference. In actual business scenarios, developers can carry out the relevant designs according to product requirements, and the embodiments of this application do not limit the specific forms of the interfaces and controls involved.
Terminal device 200a can obtain the video identifier corresponding to video 201a, the browsing object identifier corresponding to browsing object 20b, and the shared object identifier corresponding to the shared object, and then send all three to the business server 100, so that the business server 100 obtains video 201a through the video identifier and determines the object tag text sequence through the browsing object identifier and the shared object identifier. In some embodiments, the object tag text sequence includes the object tag text of browsing object 20b and the object tag text of the shared object. The object tag text of browsing object 20b is used to characterize the interest of browsing object 20b; the object tag text of the shared object is used to characterize the interest of the shared object. The embodiment of this application does not limit the way in which the business server 100 obtains video 201a and the object tag text sequence: they can be obtained as described above, terminal device 200a can send both video 201a and the object tag text sequence to the business server 100, or the business server 100 can determine them in other ways, which should be set according to the actual scenario.
Further, the business server 100 can segment video 201a through a time window to obtain at least two video clips 20d. In this example the length of the time window is 1 minute; combined with the duration of video 201a (6 minutes in the example of Figure 2), the number of video clips 20d is 6, namely video clips 201d, 202d, 203d, 204d, 205d and 206d shown in Figure 2. The business server 100 obtains the trained video recognition model 20c, which may include a first video recognition sub-model 20e, a second video recognition sub-model 20f and a third video recognition sub-model 20g.
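The time-window segmentation in this example can be expressed directly. The sketch below assumes durations measured in seconds and a fixed 1-minute window, matching the 6-minute video of Figure 2.

```python
# Simple sketch of time-window segmentation: a 6-minute video with a 1-minute
# window yields six (start, end) segments. Durations are in seconds.

def split_by_time_window(video_duration, window=60):
    segments = []
    start = 0
    while start < video_duration:
        end = min(start + window, video_duration)
        segments.append((start, end))
        start = end
    return segments

print(split_by_time_window(360))  # [(0, 60), (60, 120), ..., (300, 360)]
```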
The business server 100 inputs each of the at least two video clips 20d into the first video recognition sub-model 20e, and determines, through the first video recognition sub-model 20e, the first sharing quality corresponding to each video clip 20d. In some embodiments, the first sharing quality is used to characterize the sharing value of a video clip; for example, the first sharing quality may be the interaction rate of the video clip. As shown in Figure 2, the first sharing quality of video clip 201d is 0.8, that of video clip 202d is 0.85, that of video clip 203d is 0.89, that of video clip 204d is 0.7, that of video clip 205d is 0.75, and that of video clip 206d is 0.9. The specific process by which the business server 100 determines the first sharing quality corresponding to a video clip is not described here; please refer to the description of step S101 in the embodiment corresponding to Figure 3 below.
The business server 100 obtains the first sharing quality threshold. It can be understood that the first sharing quality threshold can be adjusted according to the actual application scenario; the example in this embodiment is 0.8. The business server 100 compares the first sharing quality of each video clip with the first sharing quality threshold, and determines the video clips whose first sharing quality is equal to or greater than the first sharing quality threshold as candidate video clips 201e. As shown in Figure 2, candidate video clips 201e include video clips 201d, 202d, 203d and 206d. Further, the business server 100 inputs both the object tag text sequence and the candidate video clips 201e into the second video recognition sub-model 20f, through which the second sharing quality corresponding to each candidate video clip 201e can be determined. In some embodiments, the second sharing quality is used to characterize the correlation between the candidate video clip and the object tag text of the shared object. As shown in Figure 2, the second sharing quality of video clip 201d is 0.74, that of video clip 202d is 0.86, that of video clip 203d is 0.8, and that of video clip 206d is 0.9. The specific process by which the business server 100 determines the second sharing quality corresponding to a candidate video clip is not described here; please refer to the description of step S102 in the embodiment corresponding to Figure 3 below.
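Using the example values above, the first-stage filtering amounts to a simple threshold comparison; note that clips whose first sharing quality equals the threshold are kept.

```python
# First-stage filtering with the example values from Figure 2:
# clips with first sharing quality >= 0.8 become candidate video clips.

first_quality = {"201d": 0.8, "202d": 0.85, "203d": 0.89,
                 "204d": 0.7, "205d": 0.75, "206d": 0.9}
first_threshold = 0.8
candidate_clips = [clip for clip, q in first_quality.items() if q >= first_threshold]
print(candidate_clips)  # ['201d', '202d', '203d', '206d']
```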
The business server 100 obtains the second sharing quality threshold. It can be understood that the second sharing quality threshold can be adjusted according to the actual application scenario; the example in this embodiment is 0.85. The business server 100 compares the four second sharing qualities with the second sharing quality threshold, and determines the candidate video clips whose second sharing quality is greater than the second sharing quality threshold as candidate shared video clips 201f. As shown in Figure 2, candidate shared video clips 201f include video clips 202d and 206d. Further, the business server 100 inputs both the object tag text sequence and the candidate shared video clips 201f into the third video recognition sub-model 20g, through which the third sharing quality corresponding to each candidate shared video clip 201f can be determined. As shown in Figure 2, the third sharing quality of video clip 202d is 0.82, and that of video clip 206d is 0.87. The specific process by which the business server 100 determines the third sharing quality corresponding to a candidate shared video clip is not described here; please refer to the description of step S103 in the embodiment corresponding to Figure 3 below.
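Continuing the same example, the second-stage filtering uses a strict comparison against the second sharing quality threshold.

```python
# Second-stage filtering with the example values from Figure 2:
# only clips with second sharing quality strictly greater than 0.85 remain.

second_quality = {"201d": 0.74, "202d": 0.86, "203d": 0.8, "206d": 0.9}
second_threshold = 0.85
candidate_shared = [clip for clip, q in second_quality.items() if q > second_threshold]
print(candidate_shared)  # ['202d', '206d']
```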
According to the third sharing quality corresponding to the candidate shared video clips, the business server 100 can determine the auxiliary description information corresponding to the candidate shared video clips. As shown in Figure 2, the business server 100 determines auxiliary description information 202g for video clip 202d and auxiliary description information 206g for video clip 206d. The specific process by which the business server 100 determines the auxiliary description information corresponding to a candidate shared video clip is not described here; please refer to the description of step S103 in the embodiment corresponding to Figure 3 below.
Further, the business server 100 performs a weighted summation of the first sharing quality (0.85 in the example of Figure 2), the second sharing quality (0.86) and the third sharing quality (0.82) corresponding to video clip 202d to obtain the total sharing quality corresponding to video clip 202d; similarly, by weighted summation of the first sharing quality (0.9), the second sharing quality (0.9) and the third sharing quality (0.87) corresponding to video clip 206d, the business server 100 obtains the total sharing quality corresponding to video clip 206d. Further, the business server 100 compares the total sharing quality of video clip 202d with that of video clip 206d and takes the larger of the two. In this embodiment, the total sharing quality of video clip 206d is the largest, so the business server 100 can determine video clip 206d as the shared video clip. Further, the shared video clip (that is, video clip 206d) and the auxiliary description information corresponding to the shared video clip (auxiliary description information 206g in the example of Figure 2) can be determined as shared data 20h. Subsequently, the business server 100 can synchronize shared data 20h to terminal device 200a, and terminal device 200a can send the shared data to the shared object (friend "cc" in the example of Figure 2).
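The final selection can be sketched as a weighted sum followed by taking the maximum. Equal weights are used here purely for illustration; the actual weighting scheme is not fixed by the example above.

```python
# Final selection with the example values from Figure 2: weighted sum of the three
# sharing qualities per candidate, then pick the clip with the largest total.
# Equal weights are an illustrative assumption.

qualities = {
    "202d": (0.85, 0.86, 0.82),
    "206d": (0.90, 0.90, 0.87),
}
weights = (1 / 3, 1 / 3, 1 / 3)

total = {clip: sum(w * q for w, q in zip(weights, qs)) for clip, qs in qualities.items()}
shared_clip = max(total, key=total.get)
print(shared_clip)  # '206d'
```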
As can be seen from the above, by performing deep modeling on a video, this application can construct multiple video clips with high sharing value. Combined with the object tag text sequence, auxiliary description information that is strongly related to the browsing object and the sharing object can be generated, thereby achieving personalized and diversified video sharing, enriching the video sharing function, and improving the user experience of video sharing. In addition, since only a video clip rather than the entire video is shared, network resources and the processing resources of the device of the sharing object that receives the shared content are saved.
Further, please refer to Figure 3, which is a first schematic flowchart of a data processing method provided by an embodiment of the present application. The data processing method may be executed by a service server (for example, the service server 100 shown in Figure 1 above), by a terminal device (for example, the terminal device 200a shown in Figure 1 above), or by the service server and the terminal device interacting with each other. For ease of understanding, this embodiment of the present application is described by taking the method being executed by the service server as an example. As shown in Figure 3, the data processing method may include at least the following steps S101 to S104.
Step S101: obtain at least two video clips in a video, determine the first sharing quality corresponding to each of the at least two video clips, and select, according to the first sharing qualities, at least one video clip from the at least two video clips as a candidate video clip.
In some embodiments, the video may be segmented according to a time window to obtain at least two video clips corresponding to the video. The first sharing quality is used to characterize the popularity of a video clip, for example its interaction rate. The popularity may be determined from features of the video clip in multiple dimensions, such as image features, audio features, and text features. For each of the at least two video clips, the following operations are performed to determine the first sharing quality corresponding to that video clip:
obtain K video frames from the video clip, and the audio frames respectively corresponding to the K video frames, where K is a positive integer;
fuse the video features respectively corresponding to the K video frames to obtain the video feature of the video clip;
fuse the audio features respectively corresponding to the K audio frames to obtain the audio feature of the video clip;
obtain the text feature corresponding to the video clip according to the audio recognition text, the video description text, and the object comment text of the video clip;
fuse the video feature, the audio feature, and the text feature of the video clip to obtain the multi-dimensional fusion feature corresponding to the video clip;
determine, according to the multi-dimensional fusion feature, the first sharing quality corresponding to the video clip.
The video clip may be subjected to audio recognition processing to obtain the audio recognition text, for example the spoken dialogue text obtained through ASR; the video clip may be subjected to character recognition processing, for example OCR, to obtain the video description text (for example, subtitle text); and the bullet-screen comment (barrage) text corresponding to the video clip may be obtained as the object comment text.
The specific process of generating the multi-dimensional fusion feature corresponding to a video clip may include: obtaining a video recognition model, where the video recognition model includes a first video recognition sub-model, and the first video recognition sub-model includes a video fusion network layer, an audio fusion network layer, a text fusion network layer, and a multi-dimensional fusion network layer; inputting the K video frames into the video fusion network layer respectively, performing feature extraction on the K video frames through the video fusion network layer to obtain the to-be-fused video features respectively corresponding to the K video frames, and performing feature fusion on the K to-be-fused video features to obtain the video feature corresponding to the video clip; inputting the K audio frames into the audio fusion network layer respectively, performing feature extraction on the K audio frames through the audio fusion network layer to obtain the to-be-fused audio features respectively corresponding to the K audio frames, and performing feature fusion on the K to-be-fused audio features to obtain the audio feature corresponding to the video clip; determining the audio recognition text, the video description text, and the object comment text as the content text corresponding to the video clip, inputting the content text into the text fusion network layer, extracting the key text in the content text through the text fusion network layer, and performing feature extraction on the key text to obtain the text feature corresponding to the key text; and inputting the video feature, the audio feature, and the text feature into the multi-dimensional fusion network layer respectively, and performing feature fusion on the video feature, the audio feature, and the text feature through the multi-dimensional fusion network layer to obtain the multi-dimensional fusion feature corresponding to the video clip.
The first video recognition sub-model further includes a first fully connected network layer. The specific process of determining, according to the multi-dimensional fusion features, the first sharing quality respectively corresponding to the at least two video clips may include: for each video clip, inputting the multi-dimensional fusion feature corresponding to the video clip into the first fully connected network layer, and performing feature transformation on the multi-dimensional fusion feature corresponding to the video clip through the first fully connected network layer to obtain the first sharing quality corresponding to the video clip.
The specific process of selecting at least one video clip from the at least two video clips as a candidate video clip may include: determining, among the at least two video clips, the video clips whose first sharing quality is equal to or greater than a first sharing quality threshold as the candidate video clips.
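A minimal sketch of this threshold filter is shown below; the clip identifiers, quality values, and threshold are illustrative only, since the threshold is adjustable per application scenario.

```python
# Sketch of candidate selection by the first sharing quality threshold.
first_quality = {"201d": 0.80, "202d": 0.85, "203d": 0.78,
                 "204d": 0.60, "205d": 0.55, "206d": 0.90}  # illustrative values
FIRST_THRESHOLD = 0.75                                      # illustrative threshold

candidate_clips = [clip_id for clip_id, q in first_quality.items()
                   if q >= FIRST_THRESHOLD]
```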
The service server may segment the video through a time window to obtain at least two video clips of the video, where the time window can be set according to the actual application scenario. It can be understood that the process by which the service server determines the first sharing quality corresponding to each video clip is the same, so this embodiment of the present application is described by taking the determination of the first sharing quality corresponding to the video clip A1 as an example; for the remaining video clips among the at least two video clips, please refer to the following description. Please also refer to Figure 4, which is a schematic diagram of the model structure of a first video recognition sub-model provided by an embodiment of the present application. As shown in Figure 4, the service server obtains K video frames from the video clip A1, as well as the audio frames respectively corresponding to the K video frames; the K video frames may be extracted at random or periodically (for example, one frame per second). This embodiment of the present application does not limit the manner of obtaining the video frames, which can be set according to the actual application scenario. The service server performs audio recognition processing on the video clip A1, for example through ASR technology, to obtain the audio recognition text; it extracts, for example through OCR technology, the video description text in the video clip A1, and extracts the object comment text, where the video description text may include subtitle text and the object comment text may include bullet-screen comment text. Further, the service server determines the audio recognition text, the video description text, and the object comment text as the content text E1 corresponding to the video clip A1.
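The time-window segmentation mentioned at the start of this paragraph could look like the following sketch; the 30-second window length is an illustrative choice, not a value fixed by the embodiment.

```python
# Sketch of segmenting a video into clips by a fixed time window (seconds).
def split_into_clips(duration_s, window_s=30):
    return [(start, min(start + window_s, duration_s))
            for start in range(0, int(duration_s), window_s)]

split_into_clips(95)   # -> [(0, 30), (30, 60), (60, 90), (90, 95)]
```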
Referring again to Figure 4, the service server obtains the first video recognition sub-model in the video recognition model, where the first video recognition sub-model includes a video fusion network layer 40a, an audio fusion network layer 40b, a text fusion network layer 40c, a multi-dimensional fusion network layer 40e, and a first fully connected network layer 40f. The service server inputs the K video frames into the video fusion network layer 40a respectively. Assuming that the K video frames include a first video frame and a second video frame, feature extraction is performed on the first video frame through the video fusion network layer 40a to obtain the first to-be-fused video feature corresponding to the first video frame, and feature extraction is performed on the second video frame to obtain the second to-be-fused video feature corresponding to the second video frame; in this way, the service server can obtain the to-be-fused video features respectively corresponding to the K video frames. By performing feature fusion on the K to-be-fused video features 401a, the service server can obtain the video feature 401d corresponding to the video clip A1. It can be understood that the video fusion network layer 40a can be regarded as a network for extracting the deep features of the K video frames. This embodiment of the present application does not limit the network type of the video fusion network layer 40a, which may be composed of any one or more kinds of neural networks, such as a Convolutional Neural Network (CNN), a Residual Network (ResNet), a High-Resolution Network (HRNet), or EfficientNet (a scaled convolutional network).
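As one possible instantiation of such a frame encoder, the following sketch (assuming a recent torchvision; the backbone choice, feature dimension, and mean pooling are illustrative, and pretrained weights could be loaded where the embodiment trains this layer end to end) extracts one deep feature per frame and fuses them:

```python
import torch
from torchvision import models

# Sketch of a ResNet-based video fusion network layer: one deep feature per frame,
# then a simple fusion (mean pooling) into the clip's video feature.
backbone = models.resnet18(weights=None)   # pretrained weights could be loaded instead
backbone.fc = torch.nn.Identity()          # keep the 512-d pooled feature per frame
backbone.eval()

with torch.no_grad():
    frames = torch.rand(8, 3, 224, 224)    # K = 8 sampled frames, illustrative tensors
    frame_feats = backbone(frames)         # (8, 512) to-be-fused video features
    video_feat = frame_feats.mean(dim=0)   # fused video feature of the clip
```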
In addition, the service server inputs the K audio frames into the audio fusion network layer 40b respectively. Assuming that the K audio frames include a first audio frame corresponding to the first video frame and a second audio frame corresponding to the second video frame, feature extraction is performed on the first audio frame through the audio fusion network layer 40b to obtain the first to-be-fused audio feature corresponding to the first audio frame, and feature extraction is performed on the second audio frame to obtain the second to-be-fused audio feature corresponding to the second audio frame; in this way, the service server can obtain the to-be-fused audio features respectively corresponding to the K audio frames, and by performing feature fusion on the K to-be-fused audio features 401b, obtains the audio feature 402d corresponding to the video clip A1. It can be understood that the audio fusion network layer 40b can be regarded as a network for extracting the deep features of the K audio frames. This embodiment of the present application does not limit the network type of the audio fusion network layer 40b, which may be composed of any one or more kinds of neural networks, such as a convolutional time-domain audio separation network (Conv-TasNet), a bidirectional long short-term memory time-domain audio separation network (BiLSTM-TasNet), or the tensorflow-based VGGish model (Visual Geometry Group Network).
The service server inputs the content text E1 into the text fusion network layer 40c, extracts the key text in the content text E1 through the text fusion network layer 40c, and performs feature extraction on the key text to obtain the text feature corresponding to the key text. This embodiment of the present application does not limit the network type of the text fusion network layer 40c, which may be any natural language processing network, for example a Transformer (a deep self-attention network widely used in natural language translation and image processing), Word2Vec (a model for producing word vectors), or BERT (Bidirectional Encoder Representations from Transformers).
Further, the service server inputs the video feature 401d, the audio feature 402d, and the text feature 403d into the multi-dimensional fusion network layer 40e respectively, and performs feature fusion on the video feature 401d, the audio feature 402d, and the text feature 403d through the multi-dimensional fusion network layer 40e to obtain the multi-dimensional fusion feature 401e corresponding to the video clip A1. The service server inputs the multi-dimensional fusion feature 401e into the first fully connected network layer 40f, and performs feature transformation on the multi-dimensional fusion feature 401e through the first fully connected network layer 40f to obtain the first sharing quality corresponding to the video clip A1. For the specific process by which the service server determines the candidate video clips from the at least two video clips according to the first sharing qualities, please refer to the description of Figure 2 above, which is not repeated here.
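As a rough illustration of this structure, the following PyTorch-style sketch composes the per-modality features and maps them to a first sharing quality; the feature dimensions, mean pooling, and sigmoid output are assumptions, and the concrete backbones are left open as described above.

```python
import torch
import torch.nn as nn

# Rough sketch of the first video recognition sub-model's forward pass.
class FirstVideoRecognitionSubModel(nn.Module):
    def __init__(self, video_dim=512, audio_dim=128, text_dim=256, fused_dim=256):
        super().__init__()
        self.fusion = nn.Linear(video_dim + audio_dim + text_dim, fused_dim)  # multi-dimensional fusion layer
        self.fc = nn.Linear(fused_dim, 1)                                     # first fully connected layer

    def forward(self, frame_feats, audio_feats, text_feat):
        # frame_feats: (K, video_dim); audio_feats: (K, audio_dim); text_feat: (text_dim,)
        video_feat = frame_feats.mean(dim=0)   # fuse the K to-be-fused video features
        audio_feat = audio_feats.mean(dim=0)   # fuse the K to-be-fused audio features
        fused = torch.relu(self.fusion(torch.cat([video_feat, audio_feat, text_feat])))
        return torch.sigmoid(self.fc(fused)), fused  # first sharing quality, multi-dimensional fusion feature
```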
Step S102: obtain an object tag text sequence associated with the video, determine, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and select, according to the second sharing quality corresponding to each candidate video clip, at least one candidate video clip from the candidate video clips as a candidate shared video clip.
Specifically, the second sharing quality is used to characterize the correlation between a candidate video clip and the object tag text of the sharing object. The object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the sharing object that receives the share; the object tag text of the browsing object is used to characterize the interests of the browsing object, and the object tag text of the sharing object is used to characterize the interests of the sharing object. The object tag text of the browsing object associated with the video is obtained, and the object tag text of the sharing object associated with the browsing object is obtained; the object tag text sequence is generated according to the object tag text of the browsing object and the object tag text of the sharing object. The video recognition model is obtained, and the object tag text sequence and the candidate video clips are respectively input into the video recognition model, where the video recognition model includes a second video recognition sub-model and the second video recognition sub-model includes a first text encoding network layer. Text encoding is performed on each object tag text in the object tag text sequence through the first text encoding network layer to obtain the first object tag feature corresponding to the object tag text sequence; the multi-dimensional fusion feature corresponding to each candidate video clip is obtained, and the second sharing quality corresponding to each candidate video clip is respectively determined according to the first object tag feature and the multi-dimensional fusion feature corresponding to each candidate video clip.
The second video recognition sub-model further includes a first splicing network layer and a second fully connected network layer. The specific process of determining the second sharing quality corresponding to a candidate video clip may include: for each candidate video clip, inputting the first object tag feature and the multi-dimensional fusion feature corresponding to the candidate video clip into the first splicing network layer respectively; performing feature splicing on the first object tag feature and the multi-dimensional fusion feature corresponding to the candidate video clip through the first splicing network layer to obtain the first multi-dimensional splicing feature corresponding to the candidate video clip; and inputting the first multi-dimensional splicing feature into the second fully connected network layer, and performing feature transformation on the first multi-dimensional splicing feature through the second fully connected network layer to obtain the second sharing quality corresponding to the candidate video clip.
The number of candidate video clips is at least two. The specific process of selecting at least one candidate video clip from the candidate video clips as a candidate shared video clip may include: determining, among the at least two candidate video clips, the candidate video clips whose second sharing quality is greater than the second sharing quality threshold as the candidate shared video clips.
Step S101 constructs candidate video clips with a high interaction rate and high sharing value; this step further constrains the candidate video clips by their relevance to the objects' interests, so that the constructed candidate video clips fit the objects' interests better, which can further improve the playback conversion of video sharing. The service server obtains the object tag text of the browsing object (abbreviated as the browsing object tag text), which can characterize the interests of the browsing object; for example, the tag text (cat, anime, pet) indicates that the browsing object is interested in videos of the cat, anime, and pet types. Similarly, the service server obtains the object tag text of the sharing object (abbreviated as the sharing object tag text), which can characterize the interests of the sharing object; for example, the tag text (cat, cartoon, children) indicates that the sharing object is interested in videos of the cat, cartoon, and children types. Further, by combining the browsing object tag text and the sharing object tag text, the service server obtains the object tag text sequence; for example, combining the tag text (cat, anime, pet) with the tag text (cat, cartoon, children) yields the tag text sequence (cat, anime, pet, cartoon, children). If, when constructing the object tag text sequence, the object tag text of only one object (for example, the browsing object or the sharing object) can be obtained, the object tag text sequence is generated from the object tag text that was obtained.
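A minimal sketch of this sequence construction is given below; keeping the first occurrence of each tag is an assumption consistent with the example above, since the embodiment only describes combining the two tag lists.

```python
# Sketch of object tag text sequence construction from the two tag lists.
def build_tag_sequence(browsing_tags=None, sharing_tags=None):
    combined = (browsing_tags or []) + (sharing_tags or [])
    seen, sequence = set(), []
    for tag in combined:            # keep the first occurrence of each tag
        if tag not in seen:
            seen.add(tag)
            sequence.append(tag)
    return sequence

build_tag_sequence(["cat", "anime", "pet"], ["cat", "cartoon", "children"])
# -> ["cat", "anime", "pet", "cartoon", "children"]
```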
This embodiment of the present application can provide two ways of obtaining the multi-dimensional fusion features corresponding to the candidate video clips. In the first way, step S101 has already produced the multi-dimensional fusion features respectively corresponding to the at least two video clips (including the multi-dimensional fusion feature 401e in Figure 4), and the candidate video clips belong to the at least two video clips, so the service server can obtain the multi-dimensional fusion features corresponding to the candidate video clips from the multi-dimensional fusion features, respectively corresponding to the at least two video clips, output by the first video recognition sub-model. Referring again to Figure 2, the service server can obtain, through the first video recognition sub-model 20e, the multi-dimensional fusion features respectively corresponding to the video clip 201d, the video clip 202d, the video clip 203d, the video clip 204d, the video clip 205d, and the video clip 206d. Since the service server determines the video clip 201d, the video clip 202d, the video clip 203d, and the video clip 206d as the candidate video clips, it can directly determine the multi-dimensional fusion features output by the first video recognition sub-model 20e for the video clip 201d, the video clip 202d, the video clip 203d, and the video clip 206d as the multi-dimensional fusion features corresponding to the candidate video clips. This first way of obtaining the multi-dimensional fusion features corresponding to the candidate video clips can reduce the computation time and computation cost of the video recognition model.
In order to improve the accuracy of the multi-dimensional fusion features corresponding to the candidate video clips, the service server can adopt the second way. Please also refer to Figure 5, which is a schematic diagram of the model structure of a second video recognition sub-model provided by an embodiment of the present application. The model structure in the dashed region shown in Figure 5 is the same as the model structure in the first video recognition sub-model of Figure 4, but the model parameters of the two are not identical, because when training the second video recognition sub-model, the service server uses the model parameters respectively corresponding to the video fusion network layer 40a, the audio fusion network layer 40b, the text fusion network layer 40c, and the multi-dimensional fusion network layer 40e in the trained first video recognition sub-model as the initialized model parameters in the dashed region of Figure 5, and fine-tunes the initialized model parameters based on a second training sample set (including multiple sample videos, object tag sample text sequences, and the second quality label corresponding to each sample video). It can be understood that the process by which the service server obtains the multi-dimensional fusion feature 402e corresponding to a candidate video clip through the dashed region of Figure 5 is the same as the process of obtaining the multi-dimensional fusion features respectively corresponding to the at least two video clips through the first video recognition sub-model, so please refer to the description of step S101 above, which is not repeated here. Since the model parameters in the dashed region of Figure 5 are better than the model parameters in Figure 4, the multi-dimensional fusion feature 402e is better than the at least two multi-dimensional fusion features in step S101.
Through Figure 5, this embodiment of the present application jointly models the personalized interests of the objects and the content of the video clip at the same time. As shown in Figure 5, the second video recognition sub-model may include a first text encoding network layer 40g, a first splicing network layer 40h, and a second fully connected network layer 40i. Through the first text encoding network layer 40g, the service server performs text encoding on each object tag text in the object tag text sequence to obtain the first object tag feature 401g corresponding to the object tag text sequence. The service server inputs the first object tag feature 401g and the multi-dimensional fusion feature corresponding to the candidate video clip (for example, the multi-dimensional fusion feature 402e in Figure 5) into the first splicing network layer 40h respectively, and performs feature splicing on the first object tag feature 401g and the multi-dimensional fusion feature 402e through the first splicing network layer 40h to obtain the first multi-dimensional splicing feature 401h corresponding to the candidate video clip. Further, the service server inputs the first multi-dimensional splicing feature 401h into the second fully connected network layer 40i, and through the second fully connected network layer 40i, can obtain the second sharing quality corresponding to the candidate video clip.
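As a rough illustration, the splicing and scoring head of this sub-model could look like the following sketch; the feature dimensions and sigmoid output are assumptions, and the tag encoder is abstracted away.

```python
import torch
import torch.nn as nn

# Sketch of the second video recognition sub-model's scoring head: splice the first
# object tag feature with a clip's multi-dimensional fusion feature, then transform
# the spliced feature into a second sharing quality. Dimensions are illustrative.
class SecondSharingQualityHead(nn.Module):
    def __init__(self, tag_dim=128, fusion_dim=256):
        super().__init__()
        self.fc = nn.Linear(tag_dim + fusion_dim, 1)                 # second fully connected layer

    def forward(self, tag_feature, fusion_feature):
        spliced = torch.cat([tag_feature, fusion_feature], dim=-1)   # first splicing layer
        return torch.sigmoid(self.fc(spliced))                       # second sharing quality
```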
This embodiment of the present application does not limit the network type of the first text encoding network layer 40g, which may be any natural language processing network.
For the process by which the service server selects, according to the second sharing quality corresponding to each candidate video clip, at least one candidate video clip from the candidate video clips as a candidate shared video clip, please refer to the description of Figure 2 above, which is not repeated here.
Step S103: determine, according to the object tag text sequence and the candidate shared video clips, the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip.
Specifically, the auxiliary description information refers to description information used to assist the video clip, and includes but is not limited to one of the following kinds of modal information or a combination of several of them: copywriting of the video clip (text modality), a cover (image modality), a voice introduction (audio modality), and so on, which can be set according to the actual application scenario. The third sharing quality is used to characterize the degree of matching between the auxiliary description information, on the one hand, and the video clip and the object tag text of the sharing object, on the other.
The service server determines the third sharing quality corresponding to the candidate shared video clips through the third video recognition sub-model in the video recognition model, and then determines the auxiliary description information; for this process, please refer to the description of Figure 2 above. If the auxiliary description information includes a cover, the above third video recognition sub-model includes a fourth video recognition sub-model; if the auxiliary description information includes copywriting, the above third video recognition sub-model includes a fifth video recognition sub-model; if the auxiliary description information includes both copywriting and a cover, the third video recognition sub-model may include both the fourth video recognition sub-model and the fifth video recognition sub-model. For the relevant description of the fourth video recognition sub-model and the fifth video recognition sub-model, please refer to the description in the embodiment corresponding to Figure 6 below, which is not expanded upon here.
Step S104: determine, according to the first sharing quality, the second sharing quality, and the third sharing quality corresponding to each candidate shared video clip, a shared video clip from the candidate shared video clips, and determine the shared video clip and the auxiliary description information corresponding to the shared video clip as the shared data to be sent to the sharing object.
Specifically, for each candidate shared video clip, a weighted summation is performed on the first sharing quality, the second sharing quality, and the third sharing quality corresponding to the candidate shared video clip to obtain the total sharing quality corresponding to the candidate shared video clip; among the candidate shared video clips, the candidate shared video clip with the largest total sharing quality is determined as the shared video clip; and the auxiliary description information corresponding to the shared video clip is obtained from the auxiliary description information respectively corresponding to the at least two candidate shared video clips.
This embodiment of the present application proposes a method for implementing intelligent video sharing. By deeply understanding the video content in multiple dimensions and combining interaction data such as bullet-screen comments, the method can automatically mine, from a video, multiple video clips with high sharing value; based on the mining of the objects' interests, it selects high-value shared clips that better match the objects' personalized interests, and can generate the corresponding personalized shared cover image and shared copywriting, making video sharing more intelligent. While more intuitively presenting the more valuable highlights of the video, the method better matches the objects' personalization, and can therefore further improve the video sharing effect. On the premise of improving the video sharing effect, only a video clip rather than the entire video is shared, which saves network transmission resources and the processing resources of the device that receives the shared data.
Please refer to Figure 6, which is another schematic flowchart of a data processing method provided by an embodiment of the present application. The method may be executed by a service server (for example, the service server 100 shown in Figure 1 above), by a terminal device (for example, the terminal device 200a shown in Figure 1 above), or by the service server and the terminal device interacting with each other. For ease of understanding, this embodiment of the present application is described by taking the method being executed by the service server as an example. As shown in Figure 6, the method may include at least the following steps.
Step S201: obtain at least two video clips in a video, determine the first sharing quality corresponding to each of the at least two video clips, and select, according to the first sharing qualities, at least one video clip from the at least two video clips as a candidate video clip.
Step S202: obtain an object tag text sequence associated with the video, determine, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and select, according to the second sharing quality corresponding to each candidate video clip, at least one candidate video clip from the candidate video clips as a candidate shared video clip.
For the specific implementation process of step S201 and step S202, please refer to step S101 and step S102 in the embodiment corresponding to Figure 3 above, which are not repeated here.
In some embodiments, the auxiliary description information corresponding to a candidate shared video clip includes a description image corresponding to the candidate shared video clip and a description text corresponding to the candidate shared video clip; the third sharing quality corresponding to the candidate shared video clip includes the image sharing quality corresponding to the description image and the text sharing quality corresponding to the description text.
For each candidate shared video clip determined in step S202, the following steps S203 to S206 are performed to determine the third sharing quality and the auxiliary description information of each candidate shared video clip.
Step S203: obtain at least two video frames in the candidate shared video clip, and determine the image sharing quality corresponding to each of the at least two video frames.
Specifically, image sampling is performed on the candidate shared video clip according to an image sampling period to obtain at least two video frames in the candidate shared video clip. For each video frame, the video frame is input into the video recognition model, and feature extraction is performed on the video frame through the image recognition network layer of the video recognition model to obtain the shared image feature corresponding to the video frame. The video recognition model includes a fourth video recognition sub-model, and the fourth video recognition sub-model includes the image recognition network layer and a second splicing network layer.
The service server can obtain at least two video frames from the candidate shared video clip according to an image sampling period (for example, sampling one picture per second), and each of the at least two video frames serves as a candidate description image. The service server needs to determine the image sharing quality respectively corresponding to the at least two video frames, and then determine the image sharing quality corresponding to the candidate shared video clip. Please also refer to Figure 7, which is a schematic diagram of the model structure of a fourth video recognition sub-model provided by an embodiment of the present application. It can be understood that the process by which the service server obtains the image sharing quality corresponding to each video frame through the fourth video recognition sub-model is the same, so this embodiment of the present application is described by taking the obtaining of the image sharing quality corresponding to the video frame F1 as an example; for the processing of the remaining video frames among the at least two video frames, please refer to the following description.
The service server inputs the video frame F1 into the image recognition network layer 70a of the fourth video recognition sub-model, and performs feature extraction on the video frame F1 through the image recognition network layer 70a to obtain the shared image feature 701a corresponding to the video frame F1.
The multi-dimensional fusion feature corresponding to the candidate shared video clip is obtained, and the second object tag feature corresponding to the object tag text sequence is obtained. The shared image feature 701a corresponding to the video frame F1, the multi-dimensional fusion feature corresponding to the candidate shared video clip, and the second object tag feature are respectively input into the second splicing network layer; through the second splicing network layer, feature splicing is performed on the shared image feature corresponding to the video frame F1, the multi-dimensional fusion feature corresponding to the candidate shared video clip, and the second object tag feature to obtain the second multi-dimensional splicing feature corresponding to the video frame F1; and the image sharing quality corresponding to the video frame F1 is determined according to the second multi-dimensional splicing feature corresponding to the video frame F1.
The fourth video recognition sub-model further includes a third fully connected network layer. For each video frame, the second multi-dimensional splicing feature corresponding to the video frame is input into the third fully connected network layer, and feature transformation is performed on the second multi-dimensional splicing feature corresponding to the video frame through the third fully connected network layer to obtain the image sharing quality corresponding to the video frame.
Step S204: determine, according to the image sharing quality corresponding to each video frame, the image sharing quality corresponding to the candidate shared video clip, and select one video frame from the at least two video frames as the description image corresponding to the candidate shared video clip.
Specifically, the largest image sharing quality is obtained from the image sharing qualities respectively corresponding to the at least two video frames, and the largest image sharing quality is determined as the image sharing quality corresponding to the candidate shared video clip; among the at least two video frames, the video frame corresponding to the largest image sharing quality is determined as the description image corresponding to the candidate shared video clip.
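The following sketch outlines steps S203 and S204 together: each sampled frame is scored by splicing its shared image feature with the clip's multi-dimensional fusion feature and the second object tag feature, and the frame with the largest image sharing quality becomes the description image. The dimensions, the untrained linear layer standing in for the third fully connected layer, and the sigmoid output are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch of per-frame image sharing quality and description-image selection.
image_dim, fusion_dim, tag_dim = 512, 256, 128
third_fc = nn.Linear(image_dim + fusion_dim + tag_dim, 1)  # stands in for the third fully connected layer

def pick_description_image(frame_features, fusion_feature, tag_feature):
    qualities = []
    for image_feature in frame_features:  # one shared image feature per sampled frame
        spliced = torch.cat([image_feature, fusion_feature, tag_feature])  # second splicing layer
        qualities.append(torch.sigmoid(third_fc(spliced)).item())
    best = max(range(len(qualities)), key=qualities.__getitem__)
    return best, qualities[best]  # index of the description image, image sharing quality of the clip
```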
This embodiment of the present application can provide three different ways of obtaining the multi-dimensional fusion feature corresponding to a candidate shared video clip. For the first way, reference may be made to the description, in step S102 of the embodiment corresponding to Figure 3 above, of obtaining the multi-dimensional fusion features corresponding to the candidate video clips; the two follow the same principle. The second way is similar to the first way: step S102 of Figure 3 has already produced the multi-dimensional fusion features corresponding to the candidate video clips (including the multi-dimensional fusion feature 402e in Figure 5), and the candidate shared video clip belongs to the candidate video clips, so the service server can obtain the multi-dimensional fusion feature corresponding to the candidate shared video clip from the multi-dimensional fusion features, corresponding to the candidate video clips, output by the second video recognition sub-model. Both of the above ways can reduce the computation time and computation cost of the video recognition model.
In order to improve the accuracy of the multi-dimensional fusion feature corresponding to the candidate shared video clip, the service server can adopt the third way; please refer again to Figure 7, the schematic diagram of the model structure of the fourth video recognition sub-model provided by an embodiment of the present application. The model structure in the dashed region shown in Figure 7 is the same as the model structure in the second video recognition sub-model of Figure 5, but the model parameters of the two are not identical, because when training the fourth video recognition sub-model, the service server uses the model parameters of the trained second video recognition sub-model as the initialized model parameters in the dashed region of Figure 7, and fine-tunes the initialized model parameters based on a third training sample set (including multiple sample videos, object tag sample text sequences, the sample description image corresponding to each sample video, and the description image quality label corresponding to each sample video). It can be understood that the process by which the service server obtains the multi-dimensional fusion feature corresponding to the candidate shared video clip through the dashed region of Figure 7 is the same as the process of obtaining the multi-dimensional fusion feature 402e through the second video recognition sub-model, so please refer to the description of step S101 above, which is not repeated here. Since the model parameters in the dashed region of Figure 7 are better than the model parameters in Figure 5, the multi-dimensional fusion feature corresponding to the candidate shared video clip output in Figure 7 is better than the multi-dimensional fusion feature 402e in Figure 5.
Based on the same principle, this embodiment of the present application can provide two ways of obtaining the second object tag feature. In the first way, the first object tag feature 401g output in Figure 5 is determined as the second object tag feature. In the second way, as shown in Figure 7, the object tag text sequence is input into the fourth video recognition sub-model; the process by which the service server obtains the second object tag feature through the dashed region of Figure 7 is the same as the process of obtaining the first object tag feature 401g through the first text encoding network layer 40g in Figure 5, so please refer to the description of step S102 above, which is not repeated here.
Referring again to Figure 7, the service server inputs the shared image feature 701a corresponding to the video frame F1, the multi-dimensional fusion feature corresponding to the candidate shared video clip, and the second object tag feature into the second splicing network layer 70b respectively; through the second splicing network layer 70b, feature splicing can be performed on the shared image feature 701a, the multi-dimensional fusion feature corresponding to the candidate shared video clip, and the second object tag feature, so that the second multi-dimensional splicing feature 701b corresponding to the video frame F1 can be obtained. Further, the service server inputs the second multi-dimensional splicing feature 701b into the third fully connected network layer 70c, and through the third fully connected network layer 70c, feature transformation can be performed on the second multi-dimensional splicing feature 701b, so that the image sharing quality corresponding to the video frame F1 can be obtained. Following the above description, the service server can obtain the image sharing quality respectively corresponding to the at least two video frames.
Step S205: determine, according to the object tag text sequence and the content text corresponding to the candidate shared video clip, the text sharing quality corresponding to the candidate shared video clip and the description text corresponding to the candidate shared video clip.
Specifically, the description text is composed of N shared words. The video recognition model is obtained; the video recognition model includes a fifth video recognition sub-model, and the fifth video recognition sub-model includes a second text encoding network layer, a third text encoding network layer, an attention network layer, and a text decoding network layer. The content text corresponding to the candidate shared video clip is input into the second text encoding network layer, and text encoding is performed on the content text corresponding to the candidate shared video clip through the second text encoding network layer to obtain the content text feature. The object tag text sequence is input into the third text encoding network layer, and text encoding is performed on the object tag text sequence through the third text encoding network layer to obtain the third object tag feature. The content text feature, the to-be-decoded text feature Si corresponding to the candidate shared video clip, and the third object tag feature are respectively input into the attention network layer, and feature fusion is performed on the content text feature, the to-be-decoded text feature Si, and the third object tag feature through the attention network layer to obtain the attention weight corresponding to the content text feature, where i is a non-negative integer less than N. The to-be-decoded text feature Si+1 corresponding to the candidate shared video clip is determined according to the attention weight corresponding to the content text feature; the shared word indicated by the to-be-decoded text feature Si is the shared word immediately preceding the shared word indicated by the to-be-decoded text feature Si+1. When i+1 equals N, the N to-be-decoded text features are respectively input into the text decoding network layer, the shared words respectively indicated by the N to-be-decoded text features are generated through the text decoding network layer, and the N shared words are combined into the description text corresponding to the candidate shared video clip; the text sharing quality corresponding to the candidate shared video clip is generated according to the N to-be-decoded text features.
For the definition of the content text corresponding to the candidate shared video clip, please refer to the definition of the content text E1 in the embodiment corresponding to Figure 3 above; for the definitions of the second text encoding network layer and the third text encoding network layer, please refer to the definition of the first text encoding network layer in the embodiment corresponding to Figure 3 above; the attention network layer is an Attention network.
Please also refer to Figure 8, which is a schematic diagram of the model structure of a fifth video recognition sub-model provided by an embodiment of the present application. As shown in Figure 8, the service server performs basic processing on the content text corresponding to the candidate shared video clip, including word segmentation and tokenization, and queries, through a vocabulary (for example, a lookup table), the initial word vector respectively corresponding to each word (word 1, word 2, ..., word n as shown in Figure 8). Each initial word vector is used as an input to the second text encoding network layer, so as to understand the content text corresponding to the candidate shared video clip and obtain the content text features, that is, the word vector respectively corresponding to each word, such as the word 1 representation, word 2 representation, ..., word n representation shown in the figure. For the process by which the service server obtains the third object tag feature (that is, the object representation in Figure 8), reference may be made to the generation process of the second object tag feature above, which is not repeated here.
Further, the service server takes the content text features (the word 1 representation, word 2 representation, ..., word n representation), the third object tag feature (the object representation), and the shared word representation generated in the previous step as inputs to the attention network layer, and generates the shared copywriting (that is, the description text) corresponding to the candidate shared video clip step by step. When generating the shared word at each step, it is determined, based on the Attention mechanism, whether to copy a word from the content text or to select a word from the vocabulary for generation. Finally, the service server multiplies together the maximum probabilities from each generation step and uses the product as the text sharing quality of the description text generated for the candidate shared video clip. The symbol "<S>" in Figure 8 marks the start of generation.
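A much simplified sketch of this decoding loop is given below. It assumes a hypothetical `decoder_step` callable standing in for the attention network layer plus the text decoding network layer and returning a probability distribution over the vocabulary; the copy-versus-generate decision and the end-of-sequence token are simplifications, and the highest-probability word is kept greedily at each step while the product of those probabilities accumulates into the text sharing quality.

```python
import numpy as np

# Simplified sketch of description-text generation and its text sharing quality.
def generate_description(decoder_step, content_feats, tag_feat, vocab, max_len=20, start="<S>"):
    words, quality, prev = [], 1.0, start
    for _ in range(max_len):
        probs = decoder_step(content_feats, tag_feat, prev)  # distribution over the vocabulary
        idx = int(np.argmax(probs))
        words.append(vocab[idx])
        quality *= float(probs[idx])       # product of per-step maximum probabilities
        prev = vocab[idx]
        if vocab[idx] == "</S>":           # assumed end-of-sequence token
            break
    return " ".join(words), quality        # description text, text sharing quality
```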
Step S206: determine, according to the image sharing quality corresponding to the candidate shared video clip and the text sharing quality corresponding to the candidate shared video clip, the third sharing quality corresponding to the candidate shared video clip; and determine, according to the description image corresponding to the candidate shared video clip and the description text corresponding to the candidate shared video clip, the auxiliary description information corresponding to the candidate shared video clip.
在一些实施例中,可以将候选共享视频片段的图像共享质量和文本共享质量,作为所述候选共享视频片段的第三共享质量。In some embodiments, the image sharing quality and text sharing quality of the candidate shared video segment may be used as the third sharing quality of the candidate shared video segment.
The description image can be used as the video cover of the candidate shared video clip, and the description text can be used as the video copy of the candidate shared video clip. The embodiments of this application are described by taking the case where the auxiliary description information includes both a description image and a description text as an example. In some embodiments, the auxiliary description information includes only a description text, or includes only a description image, or includes audio content and the like. The embodiments of this application do not limit the content of the auxiliary description information, which can be set according to the actual application scenario.
Step S207: determine a shared video clip from the candidate shared video clips according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to the candidate shared video clips, and determine the shared video clip and the auxiliary description information corresponding to the shared video clip as the shared data to be sent to the shared object.
Specifically, the first sharing quality, the second sharing quality, the image sharing quality and the text sharing quality corresponding to each candidate shared video clip are weighted and summed to obtain the total sharing quality corresponding to that candidate shared video clip. For the subsequent process, refer to the description of step S104 in the embodiment corresponding to Figure 3 above, which is not repeated here.
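As a purely illustrative sketch of the weighted summation mentioned in this step, the total sharing quality of a candidate shared video clip could be computed as follows; the weight values and the example numbers are assumptions and are not fixed by this application.

```python
def total_sharing_quality(q_first, q_second, q_image, q_text,
                          weights=(0.4, 0.3, 0.15, 0.15)):
    """Weighted sum of the first sharing quality, the second sharing quality,
    the image sharing quality and the text sharing quality (illustrative weights)."""
    w1, w2, w3, w4 = weights
    return w1 * q_first + w2 * q_second + w3 * q_image + w4 * q_text

# Example: choose the candidate shared video clip with the largest total sharing quality.
candidates = {
    "clip_a": (0.8, 0.6, 0.7, 0.5),
    "clip_b": (0.6, 0.9, 0.8, 0.7),
}
shared_clip = max(candidates, key=lambda name: total_sharing_quality(*candidates[name]))
```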
The embodiments of this application provide a method for implementing intelligent video sharing. Through in-depth mining of video content and video interaction data, multiple candidate shared video clips with high sharing value are automatically constructed; a shared video clip that matches the shared object is selected for sharing based on object interests (that is, the object tag text sequence); and a description image (which can be used as the cover of the shared video clip) and a description text (which can be used as the copy of the shared video clip) that are personalized for the shared object are constructed. This can attract the shared object to watch the shared video clip, thereby improving the sharing conversion of the video platform and the overall playback of the video platform.
Please refer to Figure 9, which is another schematic flowchart of a data processing method provided by an embodiment of this application. The method may be executed by a business server (for example, the business server 100 shown in Figure 1 above), by a terminal device (for example, the terminal device 200a shown in Figure 1 above), or by the business server and the terminal device interacting with each other. For ease of understanding, the embodiments of this application are described by taking the method being executed by the business server as an example. As shown in Figure 9, the method may include at least the following steps.
Step S301: obtain a training sample set; the training sample set includes multiple sample videos, an object label sample text sequence of the browsing sample object associated with each sample video, and a first quality label, a second quality label and a third quality label corresponding to each sample video.
在一些实施例中,针对所述多个样本视频中的每个样本视频,执行以下操作,以获取该样本视频对应的第一质量标签:In some embodiments, for each sample video in the plurality of sample videos, the following operations are performed to obtain the first quality label corresponding to the sample video:
对该样本视频对应的播放次数、时长以及平均播放完成度进行乘积运算,得到该样本视频对应的第一样本参数;Perform a product operation on the number of plays, duration and average play completion corresponding to the sample video to obtain the first sample parameter corresponding to the sample video;
对该样本视频对应的对象评论文本数量以及对象评论文本互动数量进行求和运算,得到该样本视频对应的第二样本参数;Perform a summation operation on the number of object comment texts corresponding to the sample video and the number of object comment text interactions to obtain the second sample parameter corresponding to the sample video;
确定该样本视频对应的第一样本参数以及第一样本参数最大值之间的第一比例,确定该样本视频对应的第二样本参数以及第二样本参数最大值之间的第二比例;Determine a first ratio between the first sample parameter corresponding to the sample video and the maximum value of the first sample parameter, and determine a second ratio between the second sample parameter corresponding to the sample video and the maximum value of the second sample parameter;
对第一比例以及第二比例进行加权求和,得到该样本视频对应的候选第一质量标签;Perform a weighted sum of the first ratio and the second ratio to obtain the candidate first quality label corresponding to the sample video;
If the candidate first quality label corresponding to the sample video is less than the first quality label threshold, the candidate first quality label corresponding to the sample video is determined as the first quality label corresponding to the sample video; if the candidate first quality label corresponding to the sample video is equal to or greater than the first quality label threshold, the first quality label threshold is determined as the first quality label corresponding to the sample video.
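For illustration only, the label construction described in the preceding paragraphs could be written as the following Python sketch; the weights, the threshold and the field names of the sample video are hypothetical and can be set according to the actual application scenario.

```python
def first_quality_label(video, all_videos, threshold=1.0, w1=0.5, w2=0.5):
    """Candidate first quality label of one sample video, clipped at the
    first quality label threshold (all constants are illustrative)."""
    def param1(v):   # first sample parameter: plays x duration x average completion
        return v["plays"] * v["duration"] * v["avg_completion"]
    def param2(v):   # second sample parameter: comment count + comment interactions
        return v["comments"] + v["comment_interactions"]
    ratio1 = param1(video) / max(param1(v) for v in all_videos)
    ratio2 = param2(video) / max(param2(v) for v in all_videos)
    candidate = w1 * ratio1 + w2 * ratio2
    # Candidate labels at or above the threshold are replaced by the threshold itself.
    return candidate if candidate < threshold else threshold
```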
In some embodiments, for each sample video, the following operations are performed to obtain the second quality label corresponding to the sample video: obtain the first playback completion degree of the browsing sample object for the sample video; if the first playback completion degree is greater than the first playback completion degree threshold, determine that there is a first positive correlation between the object label sample text and the sample video, and determine the first positive correlation as the second quality label of the sample video; if the first playback completion degree is less than or equal to the first playback completion degree threshold, determine that there is a first reverse correlation between the object label sample text and the sample video, and determine the first reverse correlation as the second quality label of the sample video.
In some embodiments, the training sample set further includes a sample description image corresponding to each sample video, and the third quality label includes a description image quality label. For each sample video: obtain the second playback completion degree of the browsing sample object for the sample video; if the second playback completion degree is greater than the second playback completion degree threshold, determine that there is a second positive correlation among the sample description image, the object label sample text and the sample video, and determine the second positive correlation as the description image quality label of the sample video; if the second playback completion degree is less than or equal to the second playback completion degree threshold, determine that there is a second reverse correlation among the sample description image, the object label sample text and the sample video, and determine the second reverse correlation as the description image quality label of the sample video.
In some embodiments, the third quality label includes a description text quality label, and the method further includes the following operations for each sample video:
Obtain the third playback completion degree of the browsing sample object for the sample video; if the third playback completion degree is greater than the third playback completion degree threshold, obtain the sample content text corresponding to the sample video, and add the sample content text to the training sample set; determine that there is a third positive correlation between the object label sample text sequence and the sample content text, and determine the third positive correlation as the description text quality label of the sample video.
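A minimal sketch of the completion-degree based labelling rules above is given below; the threshold values and the numeric encoding of the positive and reverse correlations (1 and 0) are assumptions made only for this illustration.

```python
def completion_label(completion_degree, threshold):
    """1 denotes a positive correlation (completion above the threshold),
    0 denotes a reverse correlation."""
    return 1 if completion_degree > threshold else 0

first_completion, second_completion, third_completion = 0.85, 0.40, 0.92

second_quality_label = completion_label(first_completion, 0.7)              # -> 1, positive
description_image_quality_label = completion_label(second_completion, 0.7)  # -> 0, reverse
if third_completion > 0.7:
    # Only in this case is the sample content text added to the training
    # sample set, together with a positive description text quality label.
    description_text_quality_label = 1
```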
The training sample set may include a first training sample set for training the first video recognition sub-model, a second training sample set for training the second video recognition sub-model, and a third training sample set for training the third video recognition sub-model. When the auxiliary description information includes only a description image, the third video recognition sub-model includes a fourth video recognition sub-model, and the third training sample set is a fourth training sample set; when the auxiliary description information includes only a description text, the third video recognition sub-model includes a fifth video recognition sub-model, and the third training sample set is a fifth training sample set; when the auxiliary description information includes both a description image and a description text, the third video recognition sub-model includes the fourth video recognition sub-model and the fifth video recognition sub-model, and the third training sample set includes the fourth training sample set and the fifth training sample set. The first training sample set includes multiple sample videos and the first quality label corresponding to each sample video; the fifth training sample set includes multiple sample videos, the object label sample text sequence of the browsing sample object associated with each sample video, and the description text quality label corresponding to each sample video.
It can be understood that the sample videos included in the above five training sample sets may be the same or different; the main differences lie in their labels and their uses. It can also be understood that a video platform has a large number of short videos, so short videos can be determined as the sample videos. Compared with the duration of the video in the embodiment corresponding to Figure 3, the duration of a short video is shorter; for example, the duration of a short video is equal to the duration of a video clip.
It can be understood that the first quality label threshold, the first playback completion degree threshold, the second playback completion degree threshold and the third playback completion degree threshold can all be adjusted according to the actual application scenario, and the embodiments of this application do not limit these four thresholds.
步骤S302,将训练样本集输入至视频识别模型,通过视频识别模型,分别确定各样本视频对应的第一预测质量。Step S302: Input the training sample set to the video recognition model, and determine the first prediction quality corresponding to each sample video through the video recognition model.
Specifically, the business server may input the first training sample set in step S301 to the first video recognition sub-model in the video recognition model. The process in which the business server obtains the first prediction quality corresponding to each sample video through the first video recognition sub-model is consistent with the process of obtaining the first sharing quality corresponding to a video clip through the first video recognition sub-model; therefore, refer to the description of step S101 in the embodiment corresponding to Figure 3 above, which is not repeated here.
步骤S303,根据对象标签样本文本序列以及各样本视频,分别确定各样本视频对应的第二预测质量以及第三预测质量。 Step S303: Determine the second prediction quality and the third prediction quality corresponding to each sample video according to the object label sample text sequence and each sample video.
Specifically, the business server may input the second training sample set in step S301 to the second video recognition sub-model in the video recognition model. The process in which the business server obtains the second prediction quality corresponding to each sample video through the second video recognition sub-model is consistent with the process of obtaining the second sharing quality corresponding to a video clip through the second video recognition sub-model; therefore, refer to the description of step S102 in the embodiment corresponding to Figure 3 above, which is not repeated here.
The business server may input the third training sample set in step S301 to the third video recognition sub-model in the video recognition model. The process in which the business server obtains the third prediction quality corresponding to each sample video through the third video recognition sub-model is consistent with the process of obtaining the third sharing quality corresponding to a video clip through the third video recognition sub-model; therefore, refer to the description of step S103 in the embodiment corresponding to Figure 3 above, which is not repeated here.
Step S304: adjust the parameters in the video recognition model according to the first quality label, the second quality label, the third quality label, the first prediction quality, the second prediction quality and the third prediction quality, to obtain a trained video recognition model; the trained video recognition model is used to determine shared data of a video, and the shared data includes a shared video clip in the video and auxiliary description information corresponding to the shared video clip.
Specifically, the video recognition model includes a first video recognition sub-model used to determine the first prediction quality, a second video recognition sub-model used to determine the second prediction quality, and a third video recognition sub-model used to determine the third prediction quality; the parameters in the video recognition model include parameters in the first video recognition sub-model, parameters in the second video recognition sub-model, and parameters in the third video recognition sub-model. A first quality loss value between the first quality label and the first prediction quality is determined, and the parameters in the first video recognition sub-model are adjusted according to the first quality loss value to obtain a trained first video recognition sub-model; a second quality loss value between the second quality label and the second prediction quality is determined, and the parameters in the second video recognition sub-model are adjusted according to the second quality loss value to obtain a trained second video recognition sub-model; a third quality loss value between the third quality label and the third prediction quality is determined, and the parameters in the third video recognition sub-model are adjusted according to the third quality loss value to obtain a trained third video recognition sub-model. When the first video recognition sub-model, the second video recognition sub-model and the third video recognition sub-model all satisfy the model convergence condition, a trained video recognition model including the trained first video recognition sub-model, the trained second video recognition sub-model and the trained third video recognition sub-model is generated.
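For readability, the following plain-Python sketch shows how three quality losses could drive the adjustment of three sub-models until a convergence condition is met. The tiny linear scorer, the squared-error loss, the learning rate and the convergence threshold are stand-ins chosen for this illustration and do not represent the actual structure of the video recognition model.

```python
def predict(params, features):
    """Stand-in for a sub-model: a tiny linear scorer."""
    return sum(p * f for p, f in zip(params, features))

def sgd_step(params, features, label, lr=0.05):
    """One gradient step on the squared-error quality loss; returns the
    updated parameters and the loss value before the update."""
    error = predict(params, features) - label
    new_params = [p - lr * 2.0 * error * f for p, f in zip(params, features)]
    return new_params, error ** 2

# Three sub-models with hypothetical parameters, input features and quality labels.
submodels = {
    "first":  {"params": [0.1, 0.1], "features": [0.8, 0.5], "label": 0.9},
    "second": {"params": [0.2, 0.0], "features": [0.6, 0.7], "label": 1.0},
    "third":  {"params": [0.0, 0.3], "features": [0.4, 0.9], "label": 1.0},
}

for epoch in range(200):
    losses = {}
    for name, m in submodels.items():
        m["params"], losses[name] = sgd_step(m["params"], m["features"], m["label"])
    if all(loss < 1e-4 for loss in losses.values()):
        break   # every sub-model satisfies its convergence condition
```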
In the embodiments of this application, the first video recognition sub-model is deeply modeled through the first training sample set, so that the first video recognition sub-model can determine, among multiple video clips, candidate video clips with high sharing value; the second video recognition sub-model is deeply modeled through the second training sample set, so that the second video recognition sub-model can determine, among the candidate video clips, candidate shared video clips with high sharing value; and the third video recognition sub-model is deeply modeled through the third training sample set, so that the third video recognition sub-model can determine the third sharing quality and the auxiliary description information corresponding to a candidate shared video clip. The shared video clip and its corresponding auxiliary description information can then be determined through sharing qualities of different dimensions, and shared data can be generated. Since the shared data is associated not only with the video content of the shared video clip itself but also with the object tag text sequence, sharing the data can improve the sharing efficiency and the sharing effect of the video.
进一步地,请参见图10,图10是本申请实施例提供的一种数据处理装置的结构示意图一。上述数据处理装置1可以用于执行本申请实施例提供的方法中的相应步骤。如图10所示,该数据处理装置1可以包括:第一获取模块110、第二获取模块120、第一确定模块130以及第二确定模块140。Further, please refer to FIG. 10 , which is a schematic structural diagram of a data processing device provided by an embodiment of the present application. The above-mentioned data processing device 1 can be used to execute corresponding steps in the method provided by the embodiments of the present application. As shown in FIG. 10 , the data processing device 1 may include: a first acquisition module 110 , a second acquisition module 120 , a first determination module 130 and a second determination module 140 .
第一获取模块110,用于获取视频中的至少两个视频片段,确定至少两个视频片段分别对应的第一共享质量,根据所述第一共享质量,从至少两个视频片段中选择至少一个视频片段作为候选视频片段;The first acquisition module 110 is configured to acquire at least two video clips in the video, determine the first sharing quality corresponding to the at least two video clips, and select at least one video clip from the at least two video clips according to the first sharing quality. Video clips as candidate video clips;
The second acquisition module 120 is configured to: obtain an object tag text sequence associated with the video, where the object tag text sequence includes the object tag text of the browsing object that shares the video and the object tag text of the shared object that receives the share, the object tag text of the browsing object is used to characterize the interest of the browsing object, and the object tag text of the shared object is used to characterize the interest of the shared object; and determine, according to the object tag text sequence and the candidate video clips, the second sharing quality corresponding to each candidate video clip, and select, according to the second sharing quality corresponding to each candidate video clip, at least one candidate video clip from the candidate video clips as a candidate shared video clip, where the second sharing quality is used to characterize the relevance between a candidate video clip and the object tag text of the shared object;
The first determination module 130 is configured to determine, according to the object tag text sequence and the candidate shared video clips, the third sharing quality corresponding to each candidate shared video clip and the auxiliary description information corresponding to each candidate shared video clip, where the third sharing quality is used to characterize the degree of matching between the auxiliary description information and both the candidate shared video clip and the object tag text of the shared object;
The second determination module 140 is configured to determine a shared video clip from the candidate shared video clips according to the first sharing quality, the second sharing quality and the third sharing quality corresponding to each candidate shared video clip, and determine the shared video clip and the auxiliary description information corresponding to the shared video clip as the shared data to be sent to the shared object.
For the specific functional implementations of the first acquisition module 110, the second acquisition module 120, the first determination module 130 and the second determination module 140, refer to steps S101 to S104 in the embodiment corresponding to Figure 3 above, which are not repeated here. The description of the beneficial effects of using the same method is likewise not repeated.
进一步地,请参见图11,图11是本申请实施例提供的一种数据处理装置的另一结构示意图。上述数据处理装置2可以用于执行本申请实施例提供的方法中的相应步骤。如图11所示,该数据处理装置2可以包括:第一获取模块11、第二获取模块12、第一确定模 块13以及第二确定模块14。Further, please refer to FIG. 11 , which is another schematic structural diagram of a data processing device provided by an embodiment of the present application. The above-mentioned data processing device 2 can be used to execute corresponding steps in the method provided by the embodiments of the present application. As shown in Figure 11, the data processing device 2 may include: a first acquisition module 11, a second acquisition module 12, a first determination module Block 13 and the second determination module 14.
需要说明的是,图11中的第一获取模块11具有图10中的第一获取模块110的全部或部分功能,图11中的第二获取模块12具有图10中的第二获取模块120的全部或部分功能,图11中的第一确定模块13具有图10中的第一确定模块130的全部或部分功能,图11中的第二确定模块14具有图10中的第二确定模块140的全部或部分功能。It should be noted that the first acquisition module 11 in Figure 11 has all or part of the functions of the first acquisition module 110 in Figure 10 , and the second acquisition module 12 in Figure 11 has the functions of the second acquisition module 120 in Figure 10 All or part of the functions, the first determination module 13 in Figure 11 has all or part of the functions of the first determination module 130 in Figure 10 , and the second determination module 14 in Figure 11 has the functions of the second determination module 140 in Figure 10 All or part of the functionality.
再请参见图11,第一获取模块11可以包括:第一处理单元111、第一获取单元112。Referring again to FIG. 11 , the first acquisition module 11 may include: a first processing unit 111 and a first acquisition unit 112 .
第一处理单元111,用于获取视频,根据时间窗口对视频进行切分处理,得到视频对应的至少两个视频片段;The first processing unit 111 is used to obtain the video, segment the video according to the time window, and obtain at least two video segments corresponding to the video;
第一获取单元112,用于针对所述至少两个视频片段中的每个视频片段,执行以下操作,以确定该视频片段对应的第一共享质量:The first acquisition unit 112 is configured to perform the following operations for each video segment in the at least two video segments to determine the first sharing quality corresponding to the video segment:
Obtain K video frames from the video clip and the audio frames respectively corresponding to the K video frames, where K is a positive integer; fuse the video features respectively corresponding to the K video frames to obtain the video feature of the video clip; fuse the audio features respectively corresponding to the K audio frames to obtain the audio feature of the video clip; obtain the text feature corresponding to the video clip according to the audio recognition text, the video description text and the object comment text of the video clip; fuse the video feature, the audio feature and the text feature of the video clip to obtain the multi-dimensional fusion feature corresponding to the video clip; and determine, according to the multi-dimensional fusion feature, the first sharing quality corresponding to the video clip;
根据各视频片段对应的多维度融合特征,分别确定各视频片段分别的第一共享质量。According to the multi-dimensional fusion features corresponding to each video clip, the first sharing quality of each video clip is determined respectively.
其中,第一处理单元111和第一获取单元112的具体功能实现方式可以参见上述图3对应实施例中的步骤S101,这里不再进行赘述。For the specific functional implementation of the first processing unit 111 and the first acquisition unit 112, please refer to step S101 in the corresponding embodiment of FIG. 3, which will not be described again here.
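The multi-dimensional fusion handled by the first processing unit 111 and the first acquisition unit 112 can be pictured with the following non-normative Python sketch, in which per-frame features are fused by averaging, the three modality features are fused by concatenation, and a simple weighted score stands in for the first sharing quality; all of these concrete choices are assumptions of the illustration.

```python
def mean_pool(frame_features):
    """Fuse K per-frame feature vectors (video or audio) by averaging."""
    k = len(frame_features)
    return [sum(column) / k for column in zip(*frame_features)]

def fuse(video_feature, audio_feature, text_feature):
    """Concatenate the three modality features into a multi-dimensional fusion
    feature; a real model could instead use a learned fusion network."""
    return video_feature + audio_feature + text_feature

def first_sharing_quality(fusion_feature, weights=None):
    """Stand-in scorer mapping the fusion feature to a quality in [0, 1]."""
    weights = weights or [1.0 / len(fusion_feature)] * len(fusion_feature)
    score = sum(w * x for w, x in zip(weights, fusion_feature))
    return max(0.0, min(1.0, score))

video_frames = [[0.2, 0.8], [0.4, 0.6]]   # features of K sampled video frames
audio_frames = [[0.1, 0.3], [0.2, 0.5]]   # features of the K corresponding audio frames
text_feature = [0.7, 0.9]                 # from ASR text, description text and comments

fusion = fuse(mean_pool(video_frames), mean_pool(audio_frames), text_feature)
quality = first_sharing_quality(fusion)
```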
再请参见图11,第二获取模块12可以包括:第二获取单元121、生成单元122。Referring again to FIG. 11 , the second acquisition module 12 may include: a second acquisition unit 121 and a generation unit 122 .
第二获取单元121,用于获取与所述视频相关联的浏览对象的对象标签文本,获取与所述浏览对象相关联的所述共享对象的对象标签文本;The second obtaining unit 121 is used to obtain the object tag text of the browsing object associated with the video, and obtain the object tag text of the shared object associated with the browsing object;
根据所述浏览对象的对象标签文本以及所述共享对象的对象标签文本,生成所述对象标签文本序列。The object tag text sequence is generated according to the object tag text of the browse object and the object tag text of the shared object.
生成单元122,用于针对每个候选视频片段,执行以下操作,以确定该候选视频片段对应的第二共享质量:The generation unit 122 is configured to perform the following operations for each candidate video segment to determine the second sharing quality corresponding to the candidate video segment:
将所述对象标签文本序列以及所述候选视频片段分别输入至视频识别模型;所述视频识别模型包括第二视频识别子模型;所述第二视频识别子模型包括第一文本编码网络层;The object label text sequence and the candidate video segment are respectively input to a video recognition model; the video recognition model includes a second video recognition sub-model; the second video recognition sub-model includes a first text encoding network layer;
通过所述第一文本编码网络层,对所述对象标签文本序列中的每个对象标签文本进行文本编码,得到所述对象标签文本序列对应的第一对象标签特征; Through the first text encoding network layer, text encoding is performed on each object label text in the object label text sequence to obtain the first object label feature corresponding to the object label text sequence;
获取所述候选视频片段对应的多维度融合特征,根据所述第一对象标签特征以及所述候选视频片段对应的多维度融合特征,确定所述候选视频片段对应的第二共享质量。Multi-dimensional fusion features corresponding to the candidate video segments are obtained, and second sharing quality corresponding to the candidate video segments is determined based on the first object label features and the multi-dimensional fusion features corresponding to the candidate video segments.
其中,第二获取单元121和生成单元122的具体功能实现方式可以参见上述图3对应实施例中的步骤S102,这里不再进行赘述。For the specific functional implementation of the second acquisition unit 121 and the generation unit 122, please refer to step S102 in the corresponding embodiment of FIG. 3, which will not be described again here.
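The relevance computation performed by the generation unit 122 can be illustrated, in a simplified and non-normative way, by the following sketch, where a hashing bag-of-features encoder stands in for the first text encoding network layer and a cosine similarity stands in for the second sharing quality; these substitutions are assumptions of the illustration, not the disclosed network.

```python
import math

def encode_tags(tag_texts, dim):
    """Stand-in for the first text encoding network layer: hashes each object
    tag text into a fixed-size bag-of-features vector."""
    vector = [0.0] * dim
    for tag in tag_texts:
        vector[hash(tag) % dim] += 1.0
    return vector

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def second_sharing_quality(object_tag_texts, fusion_feature):
    """Relevance between the object tag text sequence and a candidate video clip,
    approximated here by a cosine similarity between the two features."""
    tag_feature = encode_tags(object_tag_texts, dim=len(fusion_feature))
    return cosine(tag_feature, fusion_feature)

quality = second_sharing_quality(["animation", "suspense", "food"], [0.3, 0.1, 0.8, 0.4])
```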
Referring again to Figure 11, the auxiliary description information corresponding to a candidate shared video clip includes a description image corresponding to the candidate shared video clip and a description text corresponding to the candidate shared video clip; the third sharing quality corresponding to the candidate shared video clip includes the image sharing quality corresponding to the description image and the text sharing quality corresponding to the description text.
第一确定模块13可以包括:第三获取单元131、第二确定单元132以及第三确定单元133。The first determination module 13 may include: a third acquisition unit 131, a second determination unit 132, and a third determination unit 133.
针对每个候选共享视频片段:Share video clips for each candidate:
第三获取单元131,用于获取候选共享视频片段中的至少两个视频帧;The third obtaining unit 131 is used to obtain at least two video frames in the candidate shared video clips;
The second determination unit 132 is configured to determine the image sharing quality corresponding to each of the at least two video frames, determine the image sharing quality of the candidate shared video clip according to the image sharing quality corresponding to each video frame, and select one video frame from the at least two video frames as the description image corresponding to the candidate shared video clip;
第三确定单元133,用于根据对象标签文本序列以及候选共享视频片段对应的内容文本,确定候选共享视频片段对应的文本共享质量,以及候选共享视频片段对应的描述文本。The third determination unit 133 is configured to determine the text sharing quality corresponding to the candidate shared video clips and the description text corresponding to the candidate shared video clips based on the object tag text sequence and the content text corresponding to the candidate shared video clips.
其中,第三获取单元131、第二确定单元132以及第三确定单元133的具体功能实现方式可以参见上述图6对应实施例中的步骤S203-步骤S206,这里不再进行赘述。For the specific functional implementation of the third obtaining unit 131, the second determining unit 132 and the third determining unit 133, please refer to steps S203 to S206 in the corresponding embodiment of FIG. 6, which will not be described again here.
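As a simple illustration of how the second determination unit 132 could pick the description image, the sketch below scores every frame, uses the largest frame score as the image sharing quality of the clip (one possible aggregation) and returns that frame as the cover; the frame names and the toy scoring function are hypothetical.

```python
def select_description_image(frames, frame_quality):
    """frames: candidate video frames of one candidate shared video clip;
    frame_quality: callable returning the image sharing quality of one frame
    (stand-in for the image-quality part of the video recognition model)."""
    qualities = [frame_quality(frame) for frame in frames]
    clip_image_quality = max(qualities)              # one possible aggregation
    cover = frames[qualities.index(clip_image_quality)]
    return cover, clip_image_quality

frames = ["frame_012.jpg", "frame_048.jpg", "frame_096.jpg"]
cover, image_quality = select_description_image(frames,
                                                frame_quality=lambda f: (len(f) % 7) / 7)
```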
再请参见图11,第二确定模块14可以包括:质量求和单元141、第四确定单元142。Referring again to FIG. 11 , the second determination module 14 may include: a quality summation unit 141 and a fourth determination unit 142 .
质量求和单元141,用于对每个候选共享视频片段对应的第一共享质量、第二共享质量,以及第三共享质量分别进行加权求和,得到各候选共享视频片段对应的总共享质量;The quality summation unit 141 is configured to perform a weighted sum of the first shared quality, the second shared quality, and the third shared quality corresponding to each candidate shared video segment, respectively, to obtain the total shared quality corresponding to each candidate shared video segment;
第四确定单元142,用于将至少两个候选共享视频片段中,总共享质量最大的候选共享视频片段确定为共享视频片段;The fourth determination unit 142 is configured to determine the candidate shared video segment with the largest total sharing quality among the at least two candidate shared video segments as the shared video segment;
在至少两个候选共享视频片段分别对应的辅助描述信息中,获取共享视频片段对应的辅助描述信息。From the auxiliary description information corresponding to at least two candidate shared video clips, the auxiliary description information corresponding to the shared video clip is obtained.
其中,质量求和单元141和第四确定单元142的具体功能实现方式可以参见上述图3对应实施例中的步骤S104,这里不再进行赘述。The specific functional implementation of the quality summation unit 141 and the fourth determination unit 142 can be referred to step S104 in the corresponding embodiment of FIG. 3 above, and will not be described again here.
The shared data in this application is determined based on sharing qualities of different dimensions and is associated not only with the video content of the shared video clip itself but also with the object tag text sequence; therefore, the shared data can improve the sharing efficiency and the sharing effect of the video.
Further, please refer to Figure 12, which is another schematic structural diagram of a data processing device provided by an embodiment of this application. The data processing device 3 can be used to execute the corresponding steps in the method provided by the embodiments of this application. As shown in Figure 12, the data processing device 3 may include: a first acquisition module 210, a first determination module 220, a second determination module 230 and a parameter adjustment module 240.
第一获取模块210,用于获取训练样本集;训练样本集包括多个样本视频、与各样本视频相关联的浏览样本对象的对象标签样本文本序列、各样本视频对应的第一质量标签、第二质量标签以及第三质量标签;The first acquisition module 210 is used to acquire a training sample set; the training sample set includes a plurality of sample videos, a sample text sequence of object tags of browsing sample objects associated with each sample video, a first quality label corresponding to each sample video, a third Second quality label and third quality label;
第一确定模块220,用于将训练样本集输入至视频识别模型,通过视频识别模型,确定各样本视频对应的第一预测质量;The first determination module 220 is used to input the training sample set to the video recognition model, and determine the first prediction quality corresponding to each sample video through the video recognition model;
第二确定模块230,用于根据对象标签样本文本序列以及所述多个样本视频,分别确定各样本视频对应的第二预测质量以及第三预测质量;The second determination module 230 is configured to determine the second prediction quality and the third prediction quality corresponding to each sample video according to the object label sample text sequence and the plurality of sample videos;
参数调整模块240,用于根据第一质量标签、第二质量标签、第三质量标签、第一预测质量、第二预测质量以及第三预测质量,对视频识别模型中的参数进行调整,得到训练后的视频识别模型;训练后的视频识别模型用于确定视频的共享数据;共享数据包括视频中的共享视频片段以及共享视频片段对应的辅助描述信息。The parameter adjustment module 240 is used to adjust the parameters in the video recognition model according to the first quality label, the second quality label, the third quality label, the first prediction quality, the second prediction quality and the third prediction quality to obtain training. The video recognition model after training is used to determine the shared data of the video; the shared data includes the shared video clips in the video and the auxiliary description information corresponding to the shared video clips.
其中,第一获取模块210、第一确定模块220、第二确定模块230以及参数调整模块240的具体功能实现方式可以参见上述图9对应实施例中的步骤S301-步骤S304,这里不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。Among them, the specific functional implementation of the first acquisition module 210, the first determination module 220, the second determination module 230 and the parameter adjustment module 240 can be referred to steps S301 to S304 in the corresponding embodiment of Figure 9 above, and will not be described again here. . In addition, the description of the beneficial effects of using the same method will not be described again.
进一步地,请参见图13,图13是本申请实施例提供的一种数据处理装置的另一结构示意图。上述数据处理装置4可以用于执行本申请实施例提供的方法中的相应步骤。如图13所示,该数据处理装置4可以包括:第一获取模块21、第一确定模块22、第二确定模块23以及参数调整模块24。Further, please refer to FIG. 13 , which is another schematic structural diagram of a data processing device provided by an embodiment of the present application. The above-mentioned data processing device 4 can be used to execute corresponding steps in the method provided by the embodiments of the present application. As shown in FIG. 13 , the data processing device 4 may include: a first acquisition module 21 , a first determination module 22 , a second determination module 23 and a parameter adjustment module 24 .
需要说明的是,图13中的第一获取模块21具有图12中的第一获取模块210的全部或部分功能,图13中的第一确定模块22具有图12中的第一确定模块220的全部或部分功能,图13中的第二确定模块23具有图12中的第二确定模块230的全部或部分功能,图13中的参数调整模块24具有图12中的参数调整模块240的全部或部分功能。It should be noted that the first acquisition module 21 in Figure 13 has all or part of the functions of the first acquisition module 210 in Figure 12 , and the first determination module 22 in Figure 13 has the functions of the first determination module 220 in Figure 12 All or part of the functions, the second determination module 23 in Figure 13 has all or part of the functions of the second determination module 230 in Figure 12 , and the parameter adjustment module 24 in Figure 13 has all or part of the parameter adjustment module 240 in Figure 12 Some functions.
请再参见图13,数据处理装置4还可以包括:第一运算模块25、第二运算模块26、第二获取模块27、第三确定模块28、比例求和模块29、第一对比模块30以及第四确定模块31。Please refer to Figure 13 again, the data processing device 4 may also include: a first operation module 25, a second operation module 26, a second acquisition module 27, a third determination module 28, a proportion summation module 29, a first comparison module 30 and The fourth determination module 31.
第一运算模块25,用于针对每个样本视频,对该样本视频对应的播放次数、时长以及平均播放完成度进行乘积运算,得到该样本视频对应的第一样本参数;The first operation module 25 is configured to perform a product operation for each sample video on the number of plays, duration and average play completion corresponding to the sample video to obtain the first sample parameter corresponding to the sample video;
The second operation module 26 is configured to perform, for each sample video, a summation operation on the number of object comment texts and the number of object comment text interactions corresponding to the sample video, to obtain the second sample parameter corresponding to the sample video;
The second acquisition module 27 is configured to obtain the maximum value of the first sample parameter among the first sample parameters respectively corresponding to the at least two sample videos, and obtain the maximum value of the second sample parameter among the second sample parameters respectively corresponding to the at least two sample videos;
The third determination module 28 is configured to determine the first ratio between the first sample parameter corresponding to each sample video and the maximum value of the first sample parameter, and determine the second ratio between the second sample parameter corresponding to each sample video and the maximum value of the second sample parameter;
比例求和模块29,用于对各样本视频的第一比例以及第二比例分别进行加权求和,得到各样本视频对应的候选第一质量标签;The proportion summation module 29 is used to perform a weighted sum of the first proportion and the second proportion of each sample video to obtain the candidate first quality label corresponding to each sample video;
第一对比模块30,用于将各样本视频对应的候选第一质量标签与第一质量标签阈值分别进行对比;The first comparison module 30 is used to compare the first quality label candidate corresponding to each sample video with the first quality label threshold respectively;
The fourth determination module 31 is configured to, for each sample video, determine the candidate first quality label corresponding to the sample video as the first quality label corresponding to the sample video if the candidate first quality label corresponding to the sample video is less than the first quality label threshold;
The fourth determination module 31 is further configured to determine the first quality label threshold as the first quality label corresponding to the sample video if the candidate first quality label corresponding to the sample video is equal to or greater than the first quality label threshold.
其中,第一运算模块25、第二运算模块26、第二获取模块27、第三确定模块28、比例求和模块29、第一对比模块30以及第四确定模块31的具体功能实现方式可以参见上述图9对应实施例中的步骤S301,这里不再进行赘述。Among them, the specific functional implementation of the first operation module 25, the second operation module 26, the second acquisition module 27, the third determination module 28, the proportion summation module 29, the first comparison module 30 and the fourth determination module 31 can be found in The above-mentioned FIG. 9 corresponds to step S301 in the embodiment, and will not be described again here.
再请参见图13,数据处理装置4还可以包括:第二对比模块32以及第五确定模块33。Referring again to FIG. 13 , the data processing device 4 may further include: a second comparison module 32 and a fifth determination module 33 .
第二对比模块32,用于获取浏览样本对象针对各样本视频的第一播放完成度,将各样本视频的第一播放完成度与第一播放完成度阈值分别进行对比;The second comparison module 32 is used to obtain the first playback completion degree of the browse sample object for each sample video, and compare the first playback completion degree of each sample video with the first playback completion degree threshold respectively;
The fifth determination module 33 is configured to, for each sample video, determine that there is a first positive correlation between the object label sample text and the sample video if the first playback completion degree of the sample video is greater than the first playback completion degree threshold, and determine the first positive correlation as the second quality label of the sample video;
The fifth determination module 33 is further configured to determine that there is a first reverse correlation between the object label sample text and the sample video if the first playback completion degree of the sample video is less than or equal to the first playback completion degree threshold, and determine the first reverse correlation as the second quality label of the sample video.
其中,第二对比模块32以及第五确定模块33的具体功能实现方式可以参见上述图9对应实施例中的步骤S301,这里不再进行赘述。The specific functional implementation of the second comparison module 32 and the fifth determination module 33 can be referred to step S301 in the corresponding embodiment of FIG. 9 , and will not be described again here.
Referring again to Figure 13, the training sample set further includes a sample description image corresponding to each sample video, and the third quality label includes a description image quality label.
数据处理装置4还可以包括:第三对比模块34以及第六确定模块35。The data processing device 4 may also include: a third comparison module 34 and a sixth determination module 35 .
第三对比模块34,用于获取浏览样本对象针对各样本视频的第二播放完成度,将各样本视频的第二播放完成度与第二播放完成度阈值分别进行对比;The third comparison module 34 is used to obtain the second playback completion degree of the browse sample object for each sample video, and compare the second playback completion degree of each sample video with the second playback completion degree threshold respectively;
The sixth determination module 35 is configured to, for each sample video, determine that there is a second positive correlation among the sample description image corresponding to the sample video, the object label sample text and the sample video if the second playback completion degree of the sample video is greater than the second playback completion degree threshold, and determine the second positive correlation as the description image quality label of the sample video;
The sixth determination module 35 is further configured to determine that there is a second reverse correlation among the sample description image corresponding to the sample video, the object label sample text and the sample video if the second playback completion degree of the sample video is less than or equal to the second playback completion degree threshold, and determine the second reverse correlation as the description image quality label of the sample video.
其中,第三对比模块34以及第六确定模块35的具体功能实现方式可以参见上述图9对应实施例中的步骤S301,这里不再进行赘述。The specific functional implementation of the third comparison module 34 and the sixth determination module 35 can be referred to step S301 in the corresponding embodiment of FIG. 9 , and will not be described again here.
再请参见图13,第三质量标签包括描述文本质量标签;Referring again to Figure 13, the third quality label includes a description text quality label;
数据处理装置4还可以包括:第三获取模块36、第四获取模块37以及第七确定模块38。The data processing device 4 may also include: a third acquisition module 36 , a fourth acquisition module 37 and a seventh determination module 38 .
第三获取模块36,用于获取浏览样本对象针对各样本视频的第三播放完成度;The third acquisition module 36 is used to obtain the third playback completion degree of the browsed sample object for each sample video;
第四获取模块37,用于针对每个样本视频,若该样本视频的第三播放完成度大于第三播放完成度阈值,则获取该样本视频对应的样本内容文本,将样本内容文本添加至训练样本集;The fourth acquisition module 37 is used for each sample video, if the third playback completion degree of the sample video is greater than the third playback completion degree threshold, obtain the sample content text corresponding to the sample video, and add the sample content text to the training sample set;
第七确定模块38,用于确定对象标签样本文本序列以及该样本视频的样本内容文本之间存在第三正向关联关系,将第三正向关联关系确定为该样本视频的描述文本质量标签。The seventh determination module 38 is used to determine that there is a third positive correlation relationship between the object label sample text sequence and the sample content text of the sample video, and determine the third positive correlation relationship as the description text quality label of the sample video.
其中,第三获取模块36、第四获取模块37以及第七确定模块38的具体功能实现方式可以参见上述图9对应实施例中的步骤S301,这里不再进行赘述。For the specific functional implementation of the third acquisition module 36, the fourth acquisition module 37 and the seventh determination module 38, please refer to step S301 in the corresponding embodiment of FIG. 9, and will not be described again here.
Referring again to Figure 13, the video recognition model includes a first video recognition sub-model used to determine the first prediction quality, a second video recognition sub-model used to determine the second prediction quality, and a third video recognition sub-model used to determine the third prediction quality; the parameters in the video recognition model include parameters in the first video recognition sub-model, parameters in the second video recognition sub-model, and parameters in the third video recognition sub-model.
参数调整模块24可以包括:第一调整单元241、第二调整单元242、第三调整单元243以及模型生成单元244。The parameter adjustment module 24 may include: a first adjustment unit 241, a second adjustment unit 242, a third adjustment unit 243, and a model generation unit 244.
第一调整单元241,用于确定第一质量标签以及第一预测质量之间的第一质量损失值, 根据第一质量损失值,对第一视频识别子模型中的参数进行调整,得到训练后的第一视频识别子模型;The first adjustment unit 241 is used to determine the first quality loss value between the first quality label and the first predicted quality, Adjust the parameters in the first video recognition sub-model according to the first quality loss value to obtain the trained first video recognition sub-model;
第二调整单元242,用于确定第二质量标签以及第二预测质量之间的第二质量损失值,根据第二质量损失值,对第二视频识别子模型中的参数进行调整,得到训练后的第二视频识别子模型;The second adjustment unit 242 is used to determine the second quality loss value between the second quality label and the second prediction quality, and adjust the parameters in the second video recognition sub-model according to the second quality loss value to obtain the trained The second video recognition sub-model;
第三调整单元243,用于确定第三质量标签以及第三预测质量之间的第三质量损失值,根据第三质量损失值,对第三视频识别子模型中的参数进行调整,得到训练后的第三视频识别子模型;The third adjustment unit 243 is used to determine the third quality loss value between the third quality label and the third prediction quality, and adjust the parameters in the third video recognition sub-model according to the third quality loss value to obtain the trained The third video recognition sub-model;
The model generation unit 244 is configured to generate, when the first video recognition sub-model, the second video recognition sub-model and the third video recognition sub-model all satisfy the model convergence condition, a trained video recognition model including the trained first video recognition sub-model, the trained second video recognition sub-model and the trained third video recognition sub-model.
其中,第一调整单元241、第二调整单元242、第三调整单元243以及模型生成单元244的具体功能实现方式可以参见上述图9对应实施例中的步骤S304,这里不再进行赘述。For the specific functional implementation of the first adjustment unit 241, the second adjustment unit 242, the third adjustment unit 243 and the model generation unit 244, please refer to step S304 in the corresponding embodiment of FIG. 9, which will not be described again here.
In the embodiments of this application, the first video recognition sub-model is deeply modeled through the first training sample set, so that the first video recognition sub-model can determine, among multiple video clips, candidate video clips with high sharing value; the second video recognition sub-model is deeply modeled through the second training sample set, so that the second video recognition sub-model can determine, among the candidate video clips, candidate shared video clips with high sharing value; and the third video recognition sub-model is deeply modeled through the third training sample set, so that the third video recognition sub-model can determine the third sharing quality and the auxiliary description information corresponding to a candidate shared video clip. The shared video clip and its corresponding auxiliary description information can then be determined through sharing qualities of different dimensions, and shared data can be generated. Since the shared data is associated not only with the video content of the shared video clip itself but also with the object tag text sequence, sharing the data can improve the sharing efficiency and the sharing effect of the video.
进一步地,请参见图14,图14是本申请实施例提供的一种计算机设备的结构示意图。如图14所示,该计算机设备1000可以包括:至少一个处理器1001,例如CPU,至少一个网络接口1004,用户接口1003,存储器1005,至少一个通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。其中,在一些实施例中,用户接口1003可以包括显示屏(Display)、键盘(Keyboard),网络接口1004可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器。存储器1005还可以是至少一个位于远离前述处理器1001的存储装置。如图14所示,作为一种计算机存储介质的存储器1005可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。 Further, please refer to FIG. 14 , which is a schematic structural diagram of a computer device provided by an embodiment of the present application. As shown in Figure 14, the computer device 1000 may include: at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. Among them, the communication bus 1002 is used to realize connection communication between these components. In some embodiments, the user interface 1003 may include a display and a keyboard, and the network interface 1004 may include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Figure 14, memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in Figure 14, the network interface 1004 can provide a network communication function, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 can be used to invoke the device control application program stored in the memory 1005, so as to implement the video processing method described in the above embodiments.
应当理解,本申请实施例中所描述的计算机设备1000可执行前文各实施例中对数据处理方法或装置的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。It should be understood that the computer device 1000 described in the embodiments of the present application can execute the data processing methods or devices described in the previous embodiments, which will not be described again here. In addition, the description of the beneficial effects of using the same method will not be described again.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the data processing method or device described in the foregoing embodiments is implemented, which is not repeated here. The description of the beneficial effects of using the same method is likewise not repeated.
上述计算机可读存储介质可以是前述任一实施例提供的数据处理装置或者上述计算机设备的内部存储单元,例如计算机设备的硬盘或内存。该计算机可读存储介质也可以是该计算机设备的外部存储设备,例如该计算机设备上配备的插接式硬盘,智能存储卡(smart media card,SMC),安全数字(secure digital,SD)卡,闪存卡(flash card)等。进一步地,该计算机可读存储介质还可以既包括该计算机设备的内部存储单元也包括外部存储设备。该计算机可读存储介质用于存储该计算机程序以及该计算机设备所需的其他程序和数据。该计算机可读存储介质还可以用于暂时地存储已经输出或者将要输出的数据。The above-mentioned computer-readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments or the internal storage unit of the above-mentioned computer equipment, such as the hard disk or memory of the computer equipment. The computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card equipped on the computer device, Flash card, etc. Further, the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
本申请实施例还提供了一种计算机程序产品,该计算机程序产品包括计算机程序,该计算机程序存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机程序,处理器执行该计算机程序,使得该计算机设备可执行前文各实施例中对数据处理方法或装置的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。An embodiment of the present application also provides a computer program product. The computer program product includes a computer program, and the computer program is stored in a computer-readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device can execute the description of the data processing method or device in the previous embodiments, which will not be described again here. In addition, the description of the beneficial effects of using the same method will not be described again.
The terms "first", "second", and the like in the description, claims, and drawings of the embodiments of this application are used to distinguish different objects rather than to describe a specific order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, product, or device that includes a series of steps or units is not limited to the listed steps or modules, but may further include steps or modules that are not listed, or may further include other steps or units inherent to the process, method, apparatus, product, or device.
A person of ordinary skill in the art can appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
What is disclosed above is merely a preferred embodiment of this application and certainly cannot be used to limit the scope of the claims of this application. Therefore, equivalent changes made according to the claims of this application still fall within the scope covered by this application.

Claims (20)

  1. A data processing method, performed by a computer device, comprising:
    obtaining at least two video segments in a video, determining a first sharing quality corresponding to each of the at least two video segments, and selecting, according to the first sharing quality, at least one video segment from the at least two video segments as a candidate video segment;
    obtaining an object tag text sequence associated with the video, the object tag text sequence comprising an object tag text of a browsing object that shares the video and an object tag text of a shared object that receives the share; the object tag text of the browsing object being used to characterize an interest of the browsing object, and the object tag text of the shared object being used to characterize an interest of the shared object;
    determining, according to the object tag text sequence and the candidate video segments, a second sharing quality corresponding to each candidate video segment, and selecting, according to the second sharing quality corresponding to each candidate video segment, at least one candidate video segment from the candidate video segments as a candidate shared video segment; the second sharing quality being used to characterize a correlation between the candidate video segment and the object tag text of the shared object;
    determining, according to the object tag text sequence and the candidate shared video segments, a third sharing quality corresponding to each candidate shared video segment and auxiliary description information corresponding to each candidate shared video segment; the third sharing quality being used to characterize a degree of matching between the auxiliary description information and both the candidate shared video segment and the object tag text of the shared object; and
    determining, according to the first sharing quality, the second sharing quality, and the third sharing quality corresponding to each candidate shared video segment, a shared video segment from the candidate shared video segments, and determining the shared video segment and the auxiliary description information corresponding to the shared video segment as shared data to be sent to the shared object.
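The following is a minimal, illustrative Python sketch of the three-stage selection pipeline of claim 1. The scoring callables, the two thresholds, and the weights are hypothetical placeholders, not part of the claimed method; they stand in for the model components detailed in the dependent claims.

```python
# Illustrative sketch only; the scoring callables, thresholds and weights are hypothetical stand-ins.
def select_shared_data(segments, tag_sequence, first_quality, second_quality,
                       third_quality_and_description, q1_thresh=0.5, q2_thresh=0.5,
                       weights=(0.4, 0.3, 0.3)):
    # Stage 1: first sharing quality per segment; keep segments at or above the threshold.
    candidates = []
    for seg in segments:
        q1 = first_quality(seg)
        if q1 >= q1_thresh:
            candidates.append((seg, q1))

    # Stage 2: second sharing quality (relevance to the shared object's tag texts).
    candidate_shared = []
    for seg, q1 in candidates:
        q2 = second_quality(seg, tag_sequence)
        if q2 > q2_thresh:
            candidate_shared.append((seg, q1, q2))

    # Stage 3: third sharing quality plus auxiliary description, then a weighted total.
    best, best_total = None, float("-inf")
    for seg, q1, q2 in candidate_shared:
        q3, description = third_quality_and_description(seg, tag_sequence)
        total = weights[0] * q1 + weights[1] * q2 + weights[2] * q3
        if total > best_total:
            best, best_total = (seg, description), total
    return best  # (shared video segment, auxiliary description) = shared data for the shared object
```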
  2. The method according to claim 1, wherein determining the first sharing quality corresponding to each of the at least two video segments comprises:
    performing, for each of the at least two video segments, the following operations to determine the first sharing quality corresponding to the video segment:
    obtaining K video frames and audio frames corresponding to the K video frames from the video segment, K being a positive integer;
    fusing video features corresponding to the K video frames to obtain a video feature of the video segment;
    fusing audio features corresponding to the K audio frames to obtain an audio feature of the video segment;
    obtaining, according to an audio recognition text, a video description text, and an object comment text of the video segment, a text feature corresponding to the video segment;
    fusing the video feature, the audio feature, and the text feature of the video segment to obtain a multi-dimensional fusion feature corresponding to the video segment; and
    determining, according to the multi-dimensional fusion feature, the first sharing quality corresponding to the video segment.
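As a rough illustration of the feature fusion in claim 2, the PyTorch sketch below averages per-frame video and audio features, projects the content-text feature, fuses the three modalities, and maps the result to a scalar quality. All layer sizes, the mean-pooling choice, and the activation functions are assumptions for illustration; the claim only requires fusion of the three modalities followed by a quality score.

```python
# Illustrative sketch; layer sizes and pooling strategy are assumptions, not the claimed model.
import torch
import torch.nn as nn

class FirstQualityModel(nn.Module):
    def __init__(self, frame_dim=512, audio_dim=128, text_dim=256, fused_dim=256):
        super().__init__()
        self.video_fuse = nn.Linear(frame_dim, fused_dim)   # fuses K frame features
        self.audio_fuse = nn.Linear(audio_dim, fused_dim)   # fuses K audio-frame features
        self.text_fuse = nn.Linear(text_dim, fused_dim)     # projects the content-text feature
        self.multi_fuse = nn.Linear(3 * fused_dim, fused_dim)
        self.score = nn.Linear(fused_dim, 1)                # fully connected scoring layer

    def forward(self, frame_feats, audio_feats, text_feat):
        # frame_feats: (K, frame_dim), audio_feats: (K, audio_dim), text_feat: (text_dim,)
        v = self.video_fuse(frame_feats.mean(dim=0))        # simple mean fusion over K frames
        a = self.audio_fuse(audio_feats.mean(dim=0))
        t = self.text_fuse(text_feat)
        fused = torch.relu(self.multi_fuse(torch.cat([v, a, t], dim=-1)))
        return torch.sigmoid(self.score(fused)), fused      # (first sharing quality, fusion feature)
```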
  3. The method according to claim 2, wherein:
    fusing the video features corresponding to the K video frames to obtain the video feature of the video segment comprises: inputting the K video frames into a video recognition model, performing feature extraction on the K video frames through a video fusion network layer of the video recognition model to obtain to-be-fused video features corresponding to the K video frames, and performing feature fusion on the K to-be-fused video features to obtain the video feature corresponding to the video segment; the video recognition model comprising a first video recognition sub-model, and the first video recognition sub-model comprising the video fusion network layer, an audio fusion network layer, a text fusion network layer, and a multi-dimensional fusion network layer;
    fusing the audio features corresponding to the K audio frames to obtain the audio feature of the video segment comprises: inputting the K audio frames into the audio fusion network layer, performing feature extraction on the K audio frames through the audio fusion network layer to obtain to-be-fused audio features corresponding to the K audio frames, and performing feature fusion on the K to-be-fused audio features to obtain the audio feature corresponding to the video segment;
    obtaining the text feature corresponding to the video segment according to the audio recognition text, the video description text, and the object comment text comprises: determining the audio recognition text, the video description text, and the object comment text as a content text corresponding to the video segment, inputting the content text into the text fusion network layer, extracting a key text from the content text through the text fusion network layer, and performing feature extraction on the key text to obtain a text feature corresponding to the key text; and
    fusing the video feature, the audio feature, and the text feature of the video segment to obtain the multi-dimensional fusion feature corresponding to the video segment comprises: inputting the video feature, the audio feature, and the text feature into the multi-dimensional fusion network layer, and performing feature fusion on the video feature, the audio feature, and the text feature through the multi-dimensional fusion network layer to obtain the multi-dimensional fusion feature corresponding to the video segment.
  4. The method according to claim 2, wherein determining the first sharing quality of the video segment according to the multi-dimensional fusion feature comprises:
    inputting the multi-dimensional fusion feature corresponding to the video segment into a video recognition model, and performing feature transformation on the multi-dimensional fusion feature corresponding to the video segment through a first fully connected network layer of the video recognition model to obtain the first sharing quality corresponding to the video segment; the video recognition model comprising a first video recognition sub-model, and the first video recognition sub-model comprising the first fully connected network layer; and
    selecting, according to the first sharing quality, at least one video segment from the at least two video segments as a candidate video segment comprises:
    determining, among the at least two video segments, a video segment whose first sharing quality is equal to or greater than a first sharing quality threshold as the candidate video segment.
  5. The method according to claim 1, wherein obtaining the object tag text sequence associated with the video comprises:
    obtaining an object tag text of a browsing object associated with the video, and obtaining an object tag text of the shared object associated with the browsing object; and
    generating the object tag text sequence according to the object tag text of the browsing object and the object tag text of the shared object;
    determining, according to the object tag text sequence and the candidate video segments, the second sharing quality corresponding to each candidate video segment comprises:
    performing, for each candidate video segment, the following operations to determine the second sharing quality corresponding to the candidate video segment:
    inputting the object tag text sequence and the candidate video segment into a video recognition model; the video recognition model comprising a second video recognition sub-model, and the second video recognition sub-model comprising a first text encoding network layer;
    performing, through the first text encoding network layer, text encoding on each object tag text in the object tag text sequence to obtain a first object tag feature corresponding to the object tag text sequence; and
    obtaining the multi-dimensional fusion feature corresponding to the candidate video segment, and determining, according to the first object tag feature and the multi-dimensional fusion feature corresponding to the candidate video segment, the second sharing quality corresponding to the candidate video segment.
  6. The method according to claim 5, wherein the second video recognition sub-model further comprises a first splicing network layer and a second fully connected network layer;
    determining, according to the first object tag feature and the multi-dimensional fusion feature corresponding to the candidate video segment, the second sharing quality corresponding to the candidate video segment comprises:
    inputting the first object tag feature and the multi-dimensional fusion feature corresponding to the candidate video segment into the first splicing network layer;
    performing, through the first splicing network layer, feature splicing on the first object tag feature and the multi-dimensional fusion feature corresponding to the candidate video segment to obtain a first multi-dimensional splicing feature corresponding to the candidate video segment; and
    inputting the first multi-dimensional splicing feature into the second fully connected network layer, and performing feature transformation on the first multi-dimensional splicing feature through the second fully connected network layer to obtain the second sharing quality corresponding to the candidate video segment;
    wherein the number of candidate video segments is at least two; and
    selecting, according to the second sharing quality corresponding to each candidate video segment, at least one candidate video segment from the candidate video segments as a candidate shared video segment comprises:
    determining, among the at least two candidate video segments, a candidate video segment whose second sharing quality is greater than a second sharing quality threshold as a candidate shared video segment.
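A condensed sketch of the tag-aware scoring described in claims 5 and 6 follows. The mean-pooled tag embedding and the particular dimensions are illustrative assumptions; the claims only require a text encoding layer, a splicing (concatenation) layer, and a fully connected scoring layer.

```python
# Illustrative sketch; tag pooling and dimensions are assumptions.
import torch
import torch.nn as nn

class SecondQualityModel(nn.Module):
    def __init__(self, vocab_size=30000, tag_dim=128, fused_dim=256):
        super().__init__()
        self.tag_encoder = nn.EmbeddingBag(vocab_size, tag_dim, mode="mean")  # text encoding layer
        self.score = nn.Linear(tag_dim + fused_dim, 1)                        # fully connected layer

    def forward(self, tag_token_ids, fusion_feature):
        # tag_token_ids: (num_tag_tokens,) LongTensor for the browsing/shared objects' tag texts
        tag_feature = self.tag_encoder(tag_token_ids.unsqueeze(0)).squeeze(0)
        spliced = torch.cat([tag_feature, fusion_feature], dim=-1)            # splicing layer
        return torch.sigmoid(self.score(spliced))                             # second sharing quality
```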
  7. The method according to claim 1, wherein the auxiliary description information corresponding to the candidate shared video segment comprises a description image corresponding to the candidate shared video segment and a description text corresponding to the candidate shared video segment; the third sharing quality corresponding to the candidate shared video segment comprises an image sharing quality corresponding to the description image and a text sharing quality corresponding to the description text;
    determining, according to the object tag text sequence and the candidate shared video segments, the third sharing quality corresponding to each candidate shared video segment and the auxiliary description information corresponding to each candidate shared video segment comprises:
    for each candidate shared video segment:
    obtaining at least two video frames in the candidate shared video segment, determining an image sharing quality corresponding to each of the at least two video frames, determining, according to the image sharing quality corresponding to each video frame, the image sharing quality corresponding to the candidate shared video segment, and selecting one video frame from the at least two video frames as the description image corresponding to the candidate shared video segment; and
    determining, according to the object tag text sequence and a content text corresponding to the candidate shared video segment, the text sharing quality corresponding to the candidate shared video segment and the description text corresponding to the candidate shared video segment.
  8. The method according to claim 7, wherein obtaining the at least two video frames in the candidate shared video segment and determining the image sharing quality corresponding to each of the at least two video frames comprises:
    performing image sampling on the candidate shared video segment according to an image sampling period to obtain the at least two video frames in the candidate shared video segment; and
    for each of the at least two video frames:
    inputting the video frame into a third video recognition sub-model, and performing feature extraction on the video frame through an image recognition network layer of the third video recognition sub-model to obtain a shared image feature corresponding to the video frame; the third video recognition sub-model comprising a fourth video recognition sub-model, and the fourth video recognition sub-model comprising the image recognition network layer and a second splicing network layer;
    obtaining the multi-dimensional fusion feature corresponding to the candidate shared video segment, and obtaining a second object tag feature corresponding to the object tag text sequence, the second object tag feature being obtained by performing text encoding on the object tag text sequence;
    inputting the shared image feature corresponding to the video frame, the multi-dimensional fusion feature corresponding to the candidate shared video segment, and the second object tag feature into the second splicing network layer;
    performing, through the second splicing network layer, feature splicing on the shared image feature corresponding to the video frame, the multi-dimensional fusion feature corresponding to the candidate shared video segment, and the second object tag feature to obtain a second multi-dimensional splicing feature corresponding to the video frame; and
    determining, according to the second multi-dimensional splicing feature corresponding to the video frame, the image sharing quality corresponding to the video frame.
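To make the frame-level scoring of claims 7 and 8 concrete, a small sketch follows that samples frames at a fixed period, scores each sampled frame, and keeps the highest-scoring frame as the description image. The frame scorer passed in as `score_frame` and the averaging of frame scores into a segment-level image sharing quality are illustrative assumptions.

```python
# Illustrative sketch; score_frame is a hypothetical stand-in for the fourth sub-model.
def choose_description_image(segment_frames, fusion_feature, tag_feature, score_frame,
                             sampling_period=10):
    sampled = segment_frames[::sampling_period]            # image sampling by period
    scores = [score_frame(f, fusion_feature, tag_feature)  # per-frame image sharing quality
              for f in sampled]
    segment_image_quality = sum(scores) / len(scores)      # assumed aggregation over frames
    best_frame = sampled[scores.index(max(scores))]        # description image for the share card
    return best_frame, segment_image_quality
```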
  9. The method according to claim 7, wherein the description text is composed of N shared words;
    determining, according to the object tag text sequence and the content text corresponding to the candidate shared video segment, the text sharing quality corresponding to the candidate shared video segment and the description text corresponding to the candidate shared video segment comprises:
    inputting the content text corresponding to the candidate shared video segment into a third video recognition sub-model, and performing text encoding on the content text corresponding to the candidate shared video segment through a second text encoding network layer of the third video recognition sub-model to obtain a content text feature; the third video recognition sub-model comprising a fifth video recognition sub-model, and the fifth video recognition sub-model comprising the second text encoding network layer, a third text encoding network layer, an attention network layer, and a text decoding network layer;
    inputting the object tag text sequence into the third text encoding network layer, and performing text encoding on the object tag text sequence through the third text encoding network layer to obtain a third object tag feature;
    inputting the content text feature, a to-be-decoded text feature Si corresponding to the candidate shared video segment, and the third object tag feature into the attention network layer, and performing feature fusion on the content text feature, the to-be-decoded text feature Si, and the third object tag feature through the attention network layer to obtain an attention weight corresponding to the content text feature, i being a non-negative integer less than N;
    determining, according to the attention weight corresponding to the content text feature, a to-be-decoded text feature Si+1 corresponding to the candidate shared video segment; the shared word indicated by the to-be-decoded text feature Si being the shared word preceding the shared word indicated by the to-be-decoded text feature Si+1;
    when i+1 is equal to N, inputting the N to-be-decoded text features into the text decoding network layer, generating, through the text decoding network layer, the shared words respectively indicated by the N to-be-decoded text features, and composing the N shared words into the description text corresponding to the candidate shared video segment; and
    generating, according to the N to-be-decoded text features, the text sharing quality corresponding to the candidate shared video segment.
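The word-by-word generation of claim 9 is essentially an attention-based sequence decoder conditioned on the content text and the object tags. The sketch below shows only the loop structure; the attention layer, decoder step, word decoder, and the way text sharing quality is derived from the decoded features (here, a mean of per-step confidences) are illustrative assumptions passed in as callables.

```python
# Illustrative decoding loop; attention, decoder_step, decode_word and the quality aggregation are assumptions.
def generate_description(content_feature, tag_feature, start_feature,
                         attention, decoder_step, decode_word, n_words):
    words, confidences = [], []
    s_i = start_feature                                               # to-be-decoded text feature S0
    for _ in range(n_words):
        attn_weight = attention(content_feature, s_i, tag_feature)    # attention network layer
        s_next = decoder_step(attn_weight, content_feature, s_i)      # to-be-decoded feature S(i+1)
        word, conf = decode_word(s_next)                              # text decoding network layer
        words.append(word)
        confidences.append(conf)
        s_i = s_next
    description_text = " ".join(words)                                # N shared words form the description text
    text_sharing_quality = sum(confidences) / n_words                 # assumed aggregation
    return description_text, text_sharing_quality
```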
  10. The method according to claim 1, wherein
    determining, according to the first sharing quality, the second sharing quality, and the third sharing quality corresponding to the candidate shared video segments, the shared video segment from the candidate shared video segments comprises:
    for each candidate shared video segment, performing weighted summation on the first sharing quality, the second sharing quality, and the third sharing quality corresponding to the candidate shared video segment to obtain a total sharing quality corresponding to the candidate shared video segment; and
    determining, among the candidate shared video segments, the candidate shared video segment with the largest total sharing quality as the shared video segment.
  11. The method according to claim 1, further comprising:
    obtaining a training sample set, the training sample set comprising a plurality of sample videos, an object tag sample text sequence of a browsing sample object associated with each sample video, and a first quality label, a second quality label, and a third quality label corresponding to each sample video;
    inputting the training sample set into a video recognition model, and determining, through the video recognition model, a first prediction quality corresponding to each sample video;
    determining, according to the object tag sample text sequence and the plurality of sample videos, a second prediction quality and a third prediction quality corresponding to each sample video; and
    adjusting parameters in the video recognition model according to the first quality label, the second quality label, the third quality label, the first prediction quality, the second prediction quality, and the third prediction quality to obtain a trained video recognition model, the trained video recognition model being used to determine the shared data of the video.
  12. The method according to claim 11, further comprising:
    performing, for each sample video in the plurality of sample videos, the following operations to determine the first quality label corresponding to the sample video:
    performing a product operation on the number of plays, the duration, and the average play completion degree corresponding to the sample video to obtain a first sample parameter corresponding to the sample video;
    performing a summation operation on the number of object comment texts and the number of object comment text interactions corresponding to the sample video to obtain a second sample parameter corresponding to the sample video;
    determining a first ratio between the first sample parameter corresponding to the sample video and a maximum first sample parameter, and determining a second ratio between the second sample parameter corresponding to the sample video and a maximum second sample parameter; the maximum first sample parameter being the largest of the first sample parameters corresponding to the plurality of sample videos, and the maximum second sample parameter being the largest of the second sample parameters corresponding to the plurality of sample videos;
    performing weighted summation on the first ratio and the second ratio to obtain a candidate first quality label corresponding to the sample video;
    if the candidate first quality label corresponding to the sample video is less than a first quality label threshold, determining the candidate first quality label corresponding to the sample video as the first quality label corresponding to the sample video; and
    if the candidate first quality label corresponding to the sample video is equal to or greater than the first quality label threshold, determining the first quality label threshold as the first quality label corresponding to the sample video.
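The label construction in claim 12 is plain arithmetic over engagement statistics, so a direct sketch is possible. The 0.7/0.3 weights and the 0.9 threshold below are assumed values; the claim only fixes the product, the sum, the max-normalization, the weighted sum, and the clipping at the threshold.

```python
# Illustrative sketch; the 0.7/0.3 weights and the 0.9 threshold are assumed values.
def first_quality_labels(samples, w1=0.7, w2=0.3, label_threshold=0.9):
    # samples: list of dicts with play_count, duration, avg_completion,
    #          comment_count, comment_interaction_count
    p1 = [s["play_count"] * s["duration"] * s["avg_completion"] for s in samples]
    p2 = [s["comment_count"] + s["comment_interaction_count"] for s in samples]
    p1_max, p2_max = max(p1), max(p2)
    labels = []
    for a, b in zip(p1, p2):
        candidate = w1 * (a / p1_max) + w2 * (b / p2_max)   # weighted sum of the two ratios
        labels.append(min(candidate, label_threshold))       # clip at the first quality label threshold
    return labels
```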
  13. The method according to claim 11, further comprising:
    for each sample video:
    obtaining a first play completion degree of the browsing sample object for the sample video;
    if the first play completion degree is greater than a first play completion threshold, determining that a first positive association exists between the object tag sample text and the sample video, and determining the first positive association as the second quality label of the sample video; and
    if the first play completion degree is less than or equal to the first play completion threshold, determining that a first reverse association exists between the object tag sample text and the sample video, and determining the first reverse association as the second quality label of the sample video.
  14. The method according to claim 11, wherein the training sample set further comprises a sample description image corresponding to each sample video, and the third quality label comprises a description image quality label;
    the method further comprising:
    for each sample video:
    obtaining a second play completion degree of the browsing sample object for the sample video;
    if the second play completion degree is greater than a second play completion threshold, determining that a second positive association exists among the sample description image corresponding to the sample video, the object tag sample text, and the sample video, and determining the second positive association as the description image quality label of the sample video; and
    if the second play completion degree is less than or equal to the second play completion threshold, determining that a second reverse association exists among the sample description image corresponding to the sample video, the object tag sample text, and the sample video, and determining the second reverse association as the description image quality label of the sample video.
  15. The method according to claim 11, wherein the third quality label comprises a description text quality label;
    the method further comprising:
    for each sample video:
    obtaining a third play completion degree of the browsing sample object for the sample video;
    if the third play completion degree is greater than a third play completion threshold, obtaining a sample content text corresponding to the sample video, and adding the sample content text to the training sample set; and
    determining that a third positive association exists between the object tag sample text sequence and the sample content text, and determining the third positive association as the description text quality label of the sample video.
  16. The method according to claim 11, wherein the video recognition model comprises a first video recognition sub-model for determining the first prediction quality, a second video recognition sub-model for determining the second prediction quality, and a third video recognition sub-model for determining the third prediction quality; the parameters in the video recognition model comprising parameters in the first video recognition sub-model, parameters in the second video recognition sub-model, and parameters in the third video recognition sub-model;
    adjusting the parameters in the video recognition model according to the first quality label, the second quality label, the third quality label, the first prediction quality, the second prediction quality, and the third prediction quality to obtain the trained video recognition model comprises:
    determining a first quality loss value between the first quality label and the first prediction quality, and adjusting the parameters in the first video recognition sub-model according to the first quality loss value to obtain a trained first video recognition sub-model;
    determining a second quality loss value between the second quality label and the second prediction quality, and adjusting the parameters in the second video recognition sub-model according to the second quality loss value to obtain a trained second video recognition sub-model;
    determining a third quality loss value between the third quality label and the third prediction quality, and adjusting the parameters in the third video recognition sub-model according to the third quality loss value to obtain a trained third video recognition sub-model; and
    when the first video recognition sub-model, the second video recognition sub-model, and the third video recognition sub-model all satisfy a model convergence condition, generating a trained video recognition model comprising the trained first video recognition sub-model, the trained second video recognition sub-model, and the trained third video recognition sub-model.
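Claim 16 amounts to training the three sub-models with three separate losses until each converges. The PyTorch-style sketch below uses mean squared error for all three losses and a fixed epoch count as the convergence condition; both are illustrative assumptions, since the claim does not fix the loss functions or the convergence test.

```python
# Illustrative training sketch; MSE losses and the epoch-count stopping rule are assumptions.
import torch.nn as nn

def train_sub_models(sub_models, optimizers, dataloader, epochs=10):
    # sub_models / optimizers: (first, second, third) video recognition sub-models and their optimizers
    criterion = nn.MSELoss()
    for _ in range(epochs):                                  # stand-in for the convergence condition
        for batch in dataloader:
            labels = (batch["q1_label"], batch["q2_label"], batch["q3_label"])
            for model, optimizer, label in zip(sub_models, optimizers, labels):
                prediction = model(batch)                    # first/second/third prediction quality
                loss = criterion(prediction, label)          # first/second/third quality loss value
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return sub_models                                        # together they form the trained model
```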
  17. A data processing apparatus, comprising:
    a first obtaining module, configured to obtain at least two video segments in a video, determine a first sharing quality corresponding to each of the at least two video segments, and select, according to the first sharing quality, at least one video segment from the at least two video segments as a candidate video segment;
    a second obtaining module, configured to obtain an object tag text sequence associated with the video, the object tag text sequence comprising an object tag text of a browsing object that shares the video and an object tag text of a shared object that receives the share, the object tag text of the browsing object being used to characterize an interest of the browsing object and the object tag text of the shared object being used to characterize an interest of the shared object; and configured to determine, according to the object tag text sequence and the candidate video segments, a second sharing quality corresponding to each candidate video segment, and select, according to the second sharing quality corresponding to each candidate video segment, at least one candidate video segment from the candidate video segments as a candidate shared video segment, the second sharing quality being used to characterize a correlation between the candidate video segment and the object tag text of the shared object;
    a first determining module, configured to determine, according to the object tag text sequence and the candidate shared video segments, a third sharing quality corresponding to each candidate shared video segment, and determine, according to the third sharing quality corresponding to each candidate shared video segment, auxiliary description information corresponding to each candidate shared video segment, the third sharing quality being used to characterize a degree of matching between the auxiliary description information and both the candidate shared video segment and the object tag text of the shared object; and
    a second determining module, configured to determine, according to the first sharing quality, the second sharing quality, and the third sharing quality corresponding to each candidate shared video segment, a shared video segment from the candidate shared video segments, and determine the shared video segment and the auxiliary description information corresponding to the shared video segment as shared data to be sent to the shared object.
  18. A computer device, comprising a processor, a memory, and a network interface;
    the processor being connected to the memory and the network interface, the network interface being configured to provide data communication functions, the memory being configured to store a computer program, and the processor being configured to invoke the computer program, so that the computer device performs the method according to any one of claims 1 to 16.
  19. A computer-readable storage medium, storing a computer program, the computer program being adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method according to any one of claims 1 to 16.
  20. A computer program product, comprising computer instructions, the computer instructions being stored in a computer-readable storage medium, wherein when the computer instructions are executed, the method according to any one of claims 1 to 16 is implemented.
PCT/CN2023/074763 2022-04-01 2023-02-07 Data processing method, and device and computer-readable storage medium WO2023185257A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210336414.6A CN114419527B (en) 2022-04-01 2022-04-01 Data processing method, equipment and computer readable storage medium
CN202210336414.6 2022-04-01

Publications (1)

Publication Number Publication Date
WO2023185257A1 true WO2023185257A1 (en) 2023-10-05

Family

ID=81263299

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/074763 WO2023185257A1 (en) 2022-04-01 2023-02-07 Data processing method, and device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114419527B (en)
WO (1) WO2023185257A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419527B (en) * 2022-04-01 2022-06-14 腾讯科技(深圳)有限公司 Data processing method, equipment and computer readable storage medium
CN116777914B (en) * 2023-08-22 2023-11-07 腾讯科技(深圳)有限公司 Data processing method, device, equipment and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140341026A1 (en) * 2013-05-16 2014-11-20 Cisco Technology, Inc. Enhancing performance of rapid channel changes and other playback positioning changes in adaptive streaming
CN109862397A (en) * 2019-02-02 2019-06-07 广州虎牙信息科技有限公司 A kind of video analysis method, apparatus, equipment and storage medium
CN111581510A (en) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 Shared content processing method and device, computer equipment and storage medium
CN111866607A (en) * 2020-07-30 2020-10-30 腾讯科技(深圳)有限公司 Video clip positioning method and device, computer equipment and storage medium
CN114419527A (en) * 2022-04-01 2022-04-29 腾讯科技(深圳)有限公司 Data processing method, data processing equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10509825B2 (en) * 2017-07-21 2019-12-17 Fuji Xerox Co., Ltd. Systems and methods for topic guidance in video content using sequence mining
CN107888988A (en) * 2017-11-17 2018-04-06 广东小天才科技有限公司 A kind of video clipping method and electronic equipment
CN110888854A (en) * 2019-11-29 2020-03-17 维沃移动通信有限公司 Content sharing method and electronic equipment
CN113515997B (en) * 2020-12-28 2024-01-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN113766299B (en) * 2021-05-06 2024-04-19 腾讯科技(深圳)有限公司 Video data playing method, device, equipment and medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140341026A1 (en) * 2013-05-16 2014-11-20 Cisco Technology, Inc. Enhancing performance of rapid channel changes and other playback positioning changes in adaptive streaming
CN109862397A (en) * 2019-02-02 2019-06-07 广州虎牙信息科技有限公司 A kind of video analysis method, apparatus, equipment and storage medium
CN111581510A (en) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 Shared content processing method and device, computer equipment and storage medium
CN111866607A (en) * 2020-07-30 2020-10-30 腾讯科技(深圳)有限公司 Video clip positioning method and device, computer equipment and storage medium
CN114419527A (en) * 2022-04-01 2022-04-29 腾讯科技(深圳)有限公司 Data processing method, data processing equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN114419527A (en) 2022-04-29
CN114419527B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
WO2023185257A1 (en) Data processing method, and device and computer-readable storage medium
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN104735468B (en) A kind of method and system that image is synthesized to new video based on semantic analysis
US20210365749A1 (en) Image data processing method and apparatus, electronic device, and storage medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN109871736B (en) Method and device for generating natural language description information
CN112929253B (en) Virtual image interaction method and device
US20180143741A1 (en) Intelligent graphical feature generation for user content
CN111428025A (en) Text summarization method and device, electronic equipment and storage medium
WO2024046189A1 (en) Text generation method and apparatus
CN106937127B (en) Display method and system for intelligent search preparation
CN116977457A (en) Data processing method, device and computer readable storage medium
CN117009577A (en) Video data processing method, device, equipment and readable storage medium
CN116775815A (en) Dialogue data processing method and device, electronic equipment and storage medium
CN113821677A (en) Method, device and equipment for generating cover image and storage medium
CN116740540B (en) Data processing method, device, equipment and computer readable storage medium
WO2023207463A1 (en) Voting information generation method and apparatus, and voting information display method and apparatus
CN114782590B (en) Multi-object content combined image generation method and system
US20230359832A1 (en) Context sharing between physical and digital worlds
CN112434677B (en) Contract auditing method, device, equipment and storage medium
US20230328012A1 (en) Virtual-figure-based data processing method and apparatus, computer device, and storage medium
CN116974439A (en) Data processing method, device, equipment and computer readable storage medium
CN115424266A (en) Expression symbol prediction method, device, equipment and storage medium
CN116975330A (en) Content display method and device, electronic equipment and storage medium
CN116700546A (en) Electronic resource package processing method and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23777636

Country of ref document: EP

Kind code of ref document: A1