CN114501132A - Resource processing method and device, electronic equipment and storage medium - Google Patents

Resource processing method and device, electronic equipment and storage medium

Info

Publication number
CN114501132A
Authority
CN
China
Prior art keywords
segment
video
description
initial video
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111599274.3A
Other languages
Chinese (zh)
Other versions
CN114501132B (en)
Inventor
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111599274.3A priority Critical patent/CN114501132B/en
Publication of CN114501132A publication Critical patent/CN114501132A/en
Application granted granted Critical
Publication of CN114501132B publication Critical patent/CN114501132B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure relates to a resource processing method and apparatus, an electronic device, and a storage medium. The method includes: segmenting an initial video, whose duration is greater than a first preset duration, to obtain a plurality of initial video segments and fusion feature information of each initial video segment; classifying the plurality of initial video segments based on the fusion feature information of each initial video segment to obtain description segments of an object and non-description segments of the object; clipping the non-description segments to obtain clipped non-description segments; and integrating the description segments of the object with the clipped non-description segments to obtain a target video whose duration is less than or equal to the first preset duration. The first preset duration is thus met while the video quality and the number of content frames are preserved as far as possible, which in turn reduces the waste of the service resources used to promote the information.

Description

Resource processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a resource processing method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of the current mobile internet, information dissemination based on mobile terminals has become more and more mature. Generally, the information can be embedded in various browsers or applications on the terminal for playing. The applications may generally include social applications, video playback applications, gaming applications, and the like.
Generally, each piece of information has a predetermined playing duration so that it can be played in the browser or application within that duration. However, the original information provided by the information provider often exceeds the preset playing duration; as a result, it cannot be played in full, and some important parts may even have to be discarded during playback, which wastes the service resources used to promote the information.
Disclosure of Invention
The present disclosure provides a resource processing method and apparatus, an electronic device, and a storage medium. The technical solution of the present disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a resource processing method, including:
segmenting an initial video to obtain a plurality of initial video segments and fusion characteristic information of each initial video segment; the duration of the initial video is greater than a first preset duration;
classifying the plurality of initial video clips based on the fusion characteristic information of each initial video clip to obtain a description clip of the object and a non-description clip of the object;
cutting the non-description fragment to obtain a cut non-description fragment;
integrating the description fragment and the cut non-description fragment of the object to obtain a target video; the duration of the target video is less than or equal to a first preset duration.
In some possible embodiments, the integrating the description segment and the clipped non-description segment of the object to obtain the target video includes:
integrating the description fragment of the object and the cut non-description fragment to obtain integration duration;
if the integration duration is longer than the first preset duration, accelerating the description fragment and/or the clipped non-description fragment to obtain the target video, where the playback speed of each video fragment in the target video is kept consistent.
In some possible embodiments, the method further comprises:
determining a video clip to be processed from a plurality of initial video clips based on the fusion characteristic information of each initial video clip;
and determining a to-be-processed video sub-segment from the to-be-processed video segment based on a second preset duration, the integrity of the voice information, and the definition (sharpness) of the start frame.
In some possible embodiments, each initial video segment carries a segment sequence number, and if the integration duration is longer than a first preset duration, accelerating the description segment and/or the clipped non-description segment to obtain the target video comprises:
determining the sequence number of the subsection of the video sub-fragment to be processed;
determining the segment sequence number of the description segment and the segment sequence number of the clipped non-description segment based on the segment sequence number of each initial video segment;
splicing the video sub-segment to be processed, the description segment and the cut non-description segment based on the segmentation serial number of the video sub-segment to be processed, the segmentation serial number of the description segment and the segmentation serial number of the cut non-description segment to obtain a transition video;
determining the duration of the transition video as the splicing duration;
and if the splicing duration is longer than the first preset duration, accelerating the to-be-processed video sub-segment, the description segment, and/or the clipped non-description segment to obtain the target video.
In some possible embodiments, if the video sub-segment to be processed, the description segment, and the clipped non-description segment all contain voice text information, accelerating the video sub-segment to be processed, the description segment, and/or the clipped non-description segment to obtain the target video includes:
determining a first speech rate corresponding to voice text information in a video sub-segment to be processed;
determining a second speech rate corresponding to the voice text information in the description fragment;
determining a third speech rate corresponding to the cut voice text information of the non-description fragment;
if the first speech rate, the second speech rate, and the third speech rate are the same, uniformly accelerating the to-be-processed video sub-segment, the description segment, and/or the clipped non-description segment based on a preset speech rate to obtain the target video; otherwise, respectively accelerating the to-be-processed video sub-segment, the description segment, and/or the clipped non-description segment based on the preset speech rate to obtain the target video.
In some possible embodiments, segmenting the initial video to obtain a plurality of initial video segments and the fusion characteristic information of each initial video segment includes:
calculating the content difference degree of adjacent frames in the initial video;
segmenting the initial video based on the content difference degree to obtain a plurality of initial video segments;
fusion feature information for each of a plurality of initial video segments is obtained.
In some possible embodiments, obtaining the fusion feature information of each of the plurality of initial video segments comprises:
determining a plurality of video frames from each initial video segment;
determining visual characteristic information of each initial video segment based on a plurality of video frames;
acquiring picture text characteristic information of each initial video clip based on text information on a video frame of each initial video clip;
acquiring voice text characteristic information of each initial video clip based on the voice text information of each initial video clip;
acquiring voice attribute characteristic information of each initial video clip based on the voice attribute information of each initial video clip;
and splicing the visual characteristic information, the picture text characteristic information, the voice text characteristic information and the voice attribute characteristic information corresponding to each initial video clip to obtain the fusion characteristic information of each initial video clip.
In some possible embodiments, the obtaining the voice attribute feature information of each initial video segment based on the voice attribute information of each initial video segment comprises:
acquiring voice category information, voice emotion information, volume information and frequency information of each initial video clip;
determining voice attribute information based on the voice category information, the voice emotion information, the volume information, and the frequency information;
and acquiring voice attribute characteristic information of each initial video clip based on the voice attribute information.
In some possible embodiments, clipping the non-description fragment to obtain the clipped non-description fragment includes:
if the number of the non-description fragments is more than one and the non-description fragments are continuous fragments, splicing the non-description fragments according to the segmentation serial numbers of the non-description fragments to obtain non-description integration fragments;
and partially cutting the non-description integration segment according to the integrity of the voice information and a preset cutting sequence to obtain the cut non-description integration segment.
In some possible embodiments, if the integration duration is longer than the first preset duration, accelerating the description segment and/or the clipped non-description segment to obtain the target video includes:
integrating the description fragment of the object and the cut non-description integration fragment to obtain the current integration duration;
and if the current integration time length is longer than the first preset time length, accelerating the description fragment and/or the cut non-description integration fragment to obtain the target video.
According to a second aspect of the embodiments of the present disclosure, there is provided a resource processing apparatus including:
the characteristic information acquisition module is configured to segment the initial video to obtain a plurality of initial video segments and fusion characteristic information of each initial video segment; the duration of the initial video is greater than a first preset duration;
the segment classification module is configured to classify the plurality of initial video segments based on the fusion characteristic information of each initial video segment to obtain a description segment of the object and a non-description segment of the object;
the cutting module is configured to cut the non-description segment to obtain a cut non-description segment;
the integration module is configured to perform integration processing on the description segment and the cut non-description segment of the object to obtain a target video; the duration of the target video is less than or equal to a first preset duration.
In some possible embodiments, the integration module is configured to perform:
integrating the description fragment of the object and the cut non-description fragment to obtain integration duration;
if the integration duration is longer than the first preset duration, accelerating the description fragment and/or the clipped non-description fragment to obtain the target video, where the playback speed of each video fragment in the target video is kept consistent.
In some possible embodiments, the apparatus further comprises:
a to-be-processed video segment determination module configured to perform determining a to-be-processed video segment from a plurality of initial video segments based on the fusion feature information of each initial video segment;
and the to-be-processed video sub-segment determining module is configured to determine the to-be-processed video sub-segment from the to-be-processed video segment based on the second preset duration, the integrity of the voice information, and the definition (sharpness) of the start frame.
In some possible embodiments, each initial video segment carries a segment sequence number, and the integration module is configured to perform:
determining the segment sequence number of the video sub-segment to be processed;
determining the segment sequence number of the description segment and the segment sequence number of the clipped non-description segment based on the segment sequence number of each initial video segment;
splicing the video sub-segment to be processed, the description segment and the cut non-description segment based on the segmentation serial number of the video sub-segment to be processed, the segmentation serial number of the description segment and the segmentation serial number of the cut non-description segment to obtain a transition video;
determining the duration of the transition video as the splicing duration;
and if the splicing duration is longer than the first preset duration, accelerating the to-be-processed video sub-segment, the description segment, and/or the clipped non-description segment to obtain the target video.
In some possible embodiments, if the to-be-processed video sub-segment, the description segment, and the clipped non-description segment all contain speech text information, the integration module is configured to perform:
determining a first speech rate corresponding to voice text information in a video sub-segment to be processed;
determining a second speech rate corresponding to the voice text information in the description fragment;
determining a third speech rate corresponding to the cut voice text information of the non-description fragment;
if the first speech rate, the second speech rate, and the third speech rate are the same, uniformly accelerating the to-be-processed video sub-segment, the description segment, and/or the clipped non-description segment based on a preset speech rate to obtain the target video; otherwise, respectively accelerating the to-be-processed video sub-segment, the description segment, and/or the clipped non-description segment based on the preset speech rate to obtain the target video.
In some possible embodiments, the feature information obtaining module is configured to perform:
calculating the content difference degree of adjacent frames in the initial video;
segmenting the initial video based on the content difference degree to obtain a plurality of initial video segments;
fusion feature information for each of a plurality of initial video segments is obtained.
In some possible embodiments, the feature information obtaining module is configured to perform:
determining a plurality of video frames from each initial video segment;
determining visual characteristic information of each initial video segment based on a plurality of video frames;
acquiring picture text characteristic information of each initial video clip based on text information on a video frame of each initial video clip;
acquiring voice text characteristic information of each initial video clip based on the voice text information of each initial video clip;
acquiring voice attribute characteristic information of each initial video clip based on the voice attribute information of each initial video clip;
and splicing the visual characteristic information, the picture text characteristic information, the voice text characteristic information and the voice attribute characteristic information corresponding to each initial video clip to obtain the fusion characteristic information of each initial video clip.
In some possible embodiments, the feature information obtaining module is configured to perform:
acquiring voice category information, voice emotion information, volume information and frequency information of each initial video clip;
determining voice attribute information based on the voice category information, the voice emotion information, the volume information, and the frequency information;
and acquiring voice attribute characteristic information of each initial video clip based on the voice attribute information.
In some possible embodiments, the clipping module is configured to perform:
if the number of the non-description fragments is more than one and the non-description fragments are continuous fragments, splicing the non-description fragments according to the segmentation serial numbers of the non-description fragments to obtain non-description integration fragments;
and partially cutting the non-description integration segment according to the integrity of the voice information and a preset cutting sequence to obtain the cut non-description integration segment.
In some possible embodiments, the integration module is configured to perform:
integrating the description fragment of the object and the cut non-description integration fragment to obtain the current integration duration;
and if the current integration time length is longer than a first preset time length, accelerating the description fragment and/or the cut non-description integration fragment to obtain the target video.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of any one of the first aspect as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of the first aspects of the embodiments of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program, the computer program being stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the computer device to perform the method of any one of the first aspects of embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps of segmenting an initial video to obtain a plurality of initial video segments and fusion characteristic information of each initial video segment, wherein the time length of the initial video is longer than a first preset time length, classifying the plurality of initial video segments based on the fusion characteristic information of each initial video segment to obtain a description segment of an object and a non-description segment of the object, cutting the non-description segment to obtain a cut non-description segment, integrating the description segment of the object and the cut non-description segment to obtain a target video, and the time length of the target video is less than or equal to the first preset time length. In the embodiment of the application, the non-description fragment is cut to remove part of the non-important video frames, so that the first preset time length is met, and further, the waste of service resources for popularizing the information is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating an application environment in accordance with an illustrative embodiment;
FIG. 2 is a flow diagram illustrating a resource handling method in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of obtaining fused feature information in accordance with an exemplary embodiment;
FIG. 4 is a flow diagram illustrating a method of obtaining voice attribute feature information in accordance with an exemplary embodiment;
FIG. 5 is a flow diagram illustrating a method of obtaining a sub-segment of a video to be processed in accordance with an exemplary embodiment;
FIG. 6 is a flow diagram illustrating a method of obtaining a target video in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating a resource processing apparatus in accordance with an exemplary embodiment;
FIG. 8 is a block diagram illustrating an electronic device for resource processing in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
All data about a user in the present application are data authorized by the user.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment of a resource processing method according to an exemplary embodiment, and as shown in fig. 1, the application environment may include a server 01 and a client 02.
In some possible embodiments, the server 01 may receive an initial video sent by the client 02, segment the initial video to obtain a plurality of initial video segments and fusion feature information of each initial video segment, where the duration of the initial video is greater than a first preset duration, classify the plurality of initial video segments based on the fusion feature information of each initial video segment to obtain a description segment of an object and a non-description segment of the object, clip the non-description segment to obtain a clipped non-description segment, and perform integration processing on the description segment of the object and the clipped non-description segment to obtain a target video, where the duration of the target video is less than or equal to the first preset duration.
In some possible embodiments, the server 01 may include an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms. The operating system running on the server may include, but is not limited to, Android, iOS, Linux, Windows, Unix, and the like.
In some possible embodiments, the client 02 may include, but is not limited to, a smartphone, a desktop computer, a tablet computer, a laptop computer, a smart speaker, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, a smart wearable device, and the like. The software running on the client may be an application program, an applet, or the like. Alternatively, the operating system running on the client may include, but is not limited to, Android, iOS, Linux, Windows, Unix, and the like.
In addition, it should be noted that fig. 1 shows only one application environment of the resource processing method provided by the present disclosure, and in practical applications, other application environments may also be included.
Fig. 2 is a flowchart illustrating a resource processing method according to an exemplary embodiment, and as shown in fig. 2, the resource processing method may be applied to a server and may also be applied to other node devices, and includes the following steps:
in step S201, segmenting an initial video to obtain a plurality of initial video segments and fusion feature information of each initial video segment; the duration of the initial video is greater than a first preset duration.
In some possible embodiments, the server may receive an initial video transmitted by another device. Wherein the other devices may include devices of the provider of the initial video.
In some possible embodiments, based on various factors such as playing resources and the investment of manpower and materials, the platform that provides the playing service for the initial video sets a playing duration. In this embodiment of the present application, this playing duration is the first preset duration, for example, 30 seconds.
Optionally, the duration of the initial video is greater than a first preset duration.
The initial video may be segmented in order to distinguish description segments from non-description segments, which facilitates the subsequent clipping of the non-description segments, or in order to obtain a highlight segment from the initial video to serve as the beginning of the subsequent target video.
In some alternative embodiments, the initial video may be segmented evenly according to its duration. For example, assuming that the duration of the initial video is 50 seconds, the initial video can be divided evenly into 5 initial video segments, each with a duration of 10 seconds.
In other alternative embodiments, in order to make the correlation and continuity of each initial video segment obtained by segmentation stronger, the server may segment the initial video based on the content difference of the video frames to obtain a plurality of initial video segments.
In an embodiment in which the initial video is segmented based on the difference degree of the video frames, the initial video received by the server may carry difference-degree markers. Taking a 50-second initial video as an example, assuming that difference-degree markers are placed on the video frame corresponding to the 10th second, the video frame corresponding to the 20th second, the video frame corresponding to the 30th second, and the video frame corresponding to the 40th second, the server may divide the initial video into 5 initial video segments based on the 4 markers.
Optionally, when the 10 th second, the 20 th second, the 30 th second and the 40 th second are all located right between two video frames, the video frame corresponding to the 10 th second may be a previous video frame in the two video frames corresponding to the 10 th second, the video frame corresponding to the 20 th second may be a previous video frame in the two video frames corresponding to the 20 th second, the video frame corresponding to the 30 th second may be a previous video frame in the two video frames corresponding to the 30 th second, and the video frame corresponding to the 40 th second may be a previous video frame in the two video frames corresponding to the 40 th second. For example, assuming that the frame rate of the initial video is 25 frames per second, the video frame corresponding to 10 seconds may be the 250 th video frame of the initial video, the video frame corresponding to 20 seconds may be the 500 th video frame of the initial video, the video frame corresponding to 30 seconds may be the 750 th video frame of the initial video, and the video frame corresponding to 40 seconds may be the 1000 th video frame of the initial video.
Optionally, when the 10 th second, the 20 th second, the 30 th second and the 40 th second are all located right between two video frames, the video frame corresponding to the 10 th second may be a subsequent video frame of the two video frames corresponding to the 10 th second, the video frame corresponding to the 20 th second may be a subsequent video frame of the two video frames corresponding to the 20 th second, the video frame corresponding to the 30 th second may be a subsequent video frame of the two video frames corresponding to the 30 th second, and the video frame corresponding to the 40 th second may be a subsequent video frame of the two video frames corresponding to the 40 th second. For example, assuming that the frame rate of the initial video is 25 frames per second, the video frame corresponding to 10 seconds may be the 251 th video frame, the video frame corresponding to 20 seconds may be the 501 th video frame, the video frame corresponding to 30 seconds may be the 751 th video frame, and the video frame corresponding to 40 seconds may be the 1001 st video frame of the initial video.
Optionally, the server may divide the initial video into 5 initial video segments according to the difference degree identification on the video frames, wherein the first initial video segment includes the 1 st to 250 th video frames, the second initial video segment includes the 251 st to 500 th video frames, the third initial video segment includes the 501 st to 750 th video frames, the fourth initial video segment includes the 751 st to 1000 th video frames, and the fifth initial video segment includes the 1001 st to 1250 th video frames.
Alternatively, the disparity indicator on the initial video may be manually marked, and the disparity indicator may be provided by the provider of the initial video.
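To make the frame-index arithmetic concrete, the following Python sketch (not part of the original disclosure; function and variable names are illustrative) converts second-level difference-degree markers into frame boundaries and splits the frame sequence accordingly, assuming the "previous frame" convention described above.

```python
def split_by_markers(frames: list, marker_seconds: list[float], fps: int = 25) -> list[list]:
    """Split decoded video frames at second-level difference-degree markers.

    A marker at t seconds maps to frame index t * fps, so a 50-second video at
    25 fps with markers at 10/20/30/40 s yields five 250-frame segments.
    """
    boundaries = sorted(int(t * fps) for t in marker_seconds)
    segments, start = [], 0
    for b in boundaries:
        segments.append(frames[start:b])
        start = b
    segments.append(frames[start:])  # tail segment after the last marker
    return segments


# Example: 1250 frames with markers at the 10th, 20th, 30th and 40th second.
frames = list(range(1250))
parts = split_by_markers(frames, [10, 20, 30, 40], fps=25)
assert [len(p) for p in parts] == [250, 250, 250, 250, 250]
```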
In another embodiment, in which the initial video is segmented based on the difference between the video frames, the server may calculate the content difference between adjacent frames in the initial video, and segment the initial video based on the content difference to obtain a plurality of initial video segments.
Alternatively, the server may determine the number of pairs of adjacent frames present in the initial video; for example, assuming the initial video includes 1250 frames, there are 1249 pairs of adjacent frames. The server may then obtain the content difference degree between each pair of adjacent frames; in general, the content difference degree may be represented by the content difference value between the two frames of the pair. The server may then segment the initial video based on the content difference degrees to obtain a plurality of initial video segments. Optionally, the server may sort the content difference degrees, determine the pairs of adjacent frames whose content difference degrees rank highest as target adjacent frames, and segment the initial video with the target adjacent frames as boundaries. Assuming the 4 pairs of adjacent frames corresponding to the top 4 content difference degrees are taken as target adjacent frames, the initial video can be divided into 5 initial video segments with these 4 pairs as boundaries. Optionally, the server may obtain a difference-degree threshold, determine the pairs of adjacent frames whose content difference degree is greater than the threshold as target adjacent frames, and segment the initial video with the target adjacent frames as boundaries. Assuming the content difference degree of 4 pairs of adjacent frames is greater than the threshold, these 4 pairs are determined as target adjacent frames, and the initial video can be divided into 5 initial video segments with them as boundaries.
In one possible embodiment, the server may obtain the content disparity between each pair of adjacent frames as follows. For each pair of adjacent frames, since each video frame of the same video contains the same number of pixels (for example, 1600 pixels) and layout, the server may determine the feature data of each pixel in the previous video frame and the feature data of each pixel in the next video frame in the adjacent frames, determine the corresponding pixel in the next video frame of each pixel in the previous video frame, form 1600 pixel pairs, determine the pixel difference value of each pixel pair based on the difference between the feature data of two pixels in each pixel pair, and determine the content difference value between the adjacent frames based on the 1600 pixel difference values, that is, the content difference degree.
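A minimal sketch of the adjacent-frame comparison described above, assuming the frames are available as equally sized grayscale NumPy arrays; the mean absolute pixel difference stands in for the content difference value, and the top-k boundary selection mirrors the "top 4 differences" example. All names here are illustrative assumptions.

```python
import numpy as np


def content_differences(frames: list[np.ndarray]) -> np.ndarray:
    """Mean absolute per-pixel difference between each pair of adjacent frames."""
    diffs = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Pair every pixel of the previous frame with the pixel at the same
        # position in the next frame and average the per-pixel differences.
        diffs.append(np.abs(nxt.astype(np.float32) - prev.astype(np.float32)).mean())
    return np.asarray(diffs)


def split_at_top_k(frames: list[np.ndarray], k: int = 4) -> list[list[np.ndarray]]:
    """Cut the video at the k adjacent-frame pairs with the largest differences."""
    diffs = content_differences(frames)
    # diffs[i] measures frames[i] vs. frames[i + 1], so cutting between them
    # means a boundary at index i + 1.
    boundaries = sorted(int(i) + 1 for i in np.argsort(diffs)[-k:])
    segments, start = [], 0
    for b in boundaries:
        segments.append(frames[start:b])
        start = b
    segments.append(frames[start:])
    return segments
```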
Segmenting the initial video based on the difference degree of the video frames therefore makes the picture correlation, stability, and content continuity of each resulting initial video segment stronger, and reduces the occurrence of drastic picture changes within the same initial video segment, which lays a good foundation for subsequently distinguishing the description segments from the non-description segments and for obtaining the highlight segment. In the embodiment of the present application, there are many ways to obtain the fusion feature information of each initial video segment; some alternative embodiments are described below.
In some possible embodiments, the server may determine the fusion characteristic information from each video frame in each initial video segment. Specifically, the server may input each initial video segment into the image feature extraction model to obtain image feature information of each video frame of each initial video segment, and then perform fusion processing on the image feature information of each video frame to obtain fusion feature information of each initial video segment. Optionally, the image feature extraction model may extract features of pixels in each video frame.
In other possible embodiments, the server may perform feature extraction on all information involved in the initial video segment, and fuse all extracted information to obtain fused feature information.
Fig. 3 is a flowchart illustrating a method of obtaining fused feature information according to an exemplary embodiment, as shown in fig. 3, including the steps of:
in step S301, a plurality of video frames are determined from each initial video segment.
In step S302, visual characteristic information of each initial video segment is determined based on a plurality of video frames.
In some possible embodiments, the server may determine a plurality of video frames from each initial video segment and determine visual characteristic information for each initial video segment based on the plurality of video frames. The plurality of video frames may be all video frames or partial video frames. When the plurality of video frames are partial video frames, the server may obtain the plurality of video frames in an average decimation manner.
Alternatively, the server may determine the visual characteristic information of each initial video segment based on a MoCo model. The server inputs each initial video segment into the trained MoCo model, the MoCo model determines a plurality of video frames from the segment, and the visual characteristic information of each initial video segment is determined based on the plurality of video frames.
In step S303, picture text feature information of each initial video clip is acquired based on text information on a video frame of each initial video clip.
In this embodiment, the server may recognize the text information on each video frame in each initial video segment, where the text information may include characters (Chinese, English, etc.), symbols, emoticons, and the like appearing on the video frame. The picture text feature information of each initial video segment is then acquired based on these characters, symbols, emoticons, and the like.
Optionally, the server may identify text information on a video frame of each initial video segment based on an Optical Character Recognition (OCR) model, and perform feature extraction on the identified information by using a Bert model to obtain picture text feature information.
Optionally, to save computing power, the server extracts video frames from each initial video segment, for example, one video frame per second. Then, an OCR model is used to recognize the text information on the extracted video frames, and a Bert model is used to perform feature extraction on the recognized information to obtain the picture text feature information.
In step S304, the speech text feature information of each initial video segment is acquired based on the speech text information of each initial video segment.
In this embodiment of the application, the server may use Automatic Speech Recognition (ASR) technology to recognize the speech in each initial video segment to obtain the voice text information, and obtain the voice text feature information of each initial video segment from the voice text information based on a Bert model. The voice text information refers to the characters obtained by recognizing the specific content spoken by people in each initial video segment.
In step S305, voice attribute feature information of each initial video segment is acquired based on the voice attribute information of each initial video segment.
Fig. 4 is a flowchart illustrating a method of obtaining voice attribute feature information according to an exemplary embodiment, as shown in fig. 4, including the following steps:
in step S401, voice genre information, voice emotion information, volume information, and frequency information of each initial video segment are acquired.
In the embodiment of the application, a vggish model can be built in the server, and the server can obtain the voice category information, the voice emotion information, the volume information and the frequency information of each initial video clip by using the vggish model.
The voice category information may include sounds made by different things, sounds made by people, and music. The speech emotion information refers to emotion information carried in the above different sounds.
In step S402, voice attribute information is determined based on the voice category information, the voice emotion information, the volume information, and the frequency information.
The voice attribute information is different from the voice text information: it does not concern the specific content of the speech but focuses on distinguishing the categories of the audio, such as human speech (without distinguishing its specific content), music, animal sounds, noise, and the like.
In step S403, voice attribute feature information of each initial video segment is acquired based on the voice attribute information.
In step S306, the visual feature information, the picture text feature information, the voice text feature information, and the voice attribute feature information corresponding to each initial video segment are spliced to obtain the fusion feature information of each initial video segment.
Therefore, the characteristic information of each initial video segment is obtained through various aspects, so that the spliced fusion characteristic information can be more comprehensive and express each initial video segment in more detail, and preparation is made for distinguishing the description segment and the non-description segment in the follow-up process.
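The splicing of the four kinds of feature information amounts to a simple concatenation. The sketch below is a hypothetical illustration: the four extractors are stubs standing in for the MoCo-style visual model, OCR + Bert, ASR + Bert, and vggish-style audio model mentioned above, and the feature dimensions are arbitrary.

```python
import numpy as np


# Hypothetical stand-ins for the four per-modality extractors; each stub returns
# a fixed-size dummy vector so the splicing (concatenation) step can be shown.
def visual_features(segment) -> np.ndarray:
    return np.random.rand(128)


def picture_text_features(segment) -> np.ndarray:
    return np.random.rand(64)


def speech_text_features(segment) -> np.ndarray:
    return np.random.rand(64)


def voice_attribute_features(segment) -> np.ndarray:
    return np.random.rand(32)


def fuse_segment_features(segment) -> np.ndarray:
    """Splice (concatenate) the four feature vectors into one fused feature vector."""
    return np.concatenate([
        visual_features(segment),
        picture_text_features(segment),
        speech_text_features(segment),
        voice_attribute_features(segment),
    ])


fused = fuse_segment_features(segment=None)  # placeholder segment object
assert fused.shape == (128 + 64 + 64 + 32,)
```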
In step S202, a plurality of initial video segments are classified based on the fusion feature information of each initial video segment, and a description segment of the object and a non-description segment of the object are obtained.
In the embodiment of the application, a trained first segment recognition model capable of distinguishing the description segment from the non-description segment can be built in the server. The server can input the fusion characteristic information of each initial video segment into the first segment identification model to obtain a result corresponding to each initial video segment, wherein the result comprises a description segment and a non-description segment.
Assuming that the initial video is a shared video of an item, a segment describing the item, i.e., a description segment, and a segment not describing the item, i.e., a non-description segment, may be included in the initial video. Wherein the non-description segments may be scenario segments.
The first segment recognition model is a machine learning model. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to endow computers with intelligence, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal teaching learning. Machine learning can be divided into supervised machine learning, unsupervised machine learning, and semi-supervised machine learning. The first segment recognition model may be constructed based on a convolutional neural network.
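As an illustration only (the disclosure does not specify the architecture of the first segment recognition model), a small feed-forward classifier over the fused feature vector could look like the following PyTorch sketch; the 288-dimensional input matches the arbitrary dimensions used in the fusion sketch above.

```python
import torch
from torch import nn


class SegmentTypeClassifier(nn.Module):
    """Toy stand-in for the first segment recognition model: it maps a fused
    feature vector to non-description (class 0) or description (class 1)."""

    def __init__(self, feature_dim: int = 288):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2),  # two classes: non-description, description
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.net(fused)


model = SegmentTypeClassifier()
fused = torch.rand(5, 288)            # fused features of 5 initial video segments
labels = model(fused).argmax(dim=-1)  # 0 = non-description, 1 = description
```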
In step S203, the non-description segment is clipped to obtain a clipped non-description segment.
Since it has been stated above that the duration of the initial video is greater than the first preset duration, in order to enable the initial video to be played on the platform, the initial video may be clipped.
In some possible embodiments, each of the initial video segments may be cropped, for example, each of the 5 initial video segments is cropped for 4 seconds, so that the remaining initial video segments can be spliced into 30 seconds of video.
In some possible embodiments, the importance of each initial video segment is determined, and the cropping degree of each initial video segment may be determined according to the importance of the initial video segment. Optionally, when the importance of an initial video segment is higher, the cut-out portion is smaller.
In some possible embodiments, whether an initial video segment needs to be clipped may be determined based on the type of the video segment. As mentioned above, in the embodiment of the present application the initial video segments may be description segments that mainly introduce the item and non-description segments that mainly show the scenario. For an item-sharing video, the importance of the description segments is clearly greater than that of the non-description segments. Therefore, in the embodiment of the present application, the description segments may be left unclipped and the non-description segments may be clipped.
For example, assume that 5 initial video segments (each 10 seconds) are identified by the first segment identification model, and the following results are obtained: fragment 1, non-descriptive fragment; fragment 2, non-descriptive fragment; segment 3, description segment; segment 4, non-descriptive segment; fragment 5, describe fragment. The server may clip non-description segments, here segment 1, segment 2 and segment 4.
Optionally, the server may clip the non-description segment according to the integrity of the voice information and a preset clipping order to obtain the clipped non-description segment, so that each sentence in the clipped non-description segment is complete. The preset clipping order may be from front to back, from back to front, or from the middle toward both sides. Typically, a portion describing the item may also appear in a non-description segment, and that portion usually appears in the second half of the non-description segment. Therefore, the server may clip the non-description segment in a back-to-front order according to the integrity of the voice information, so that more of the latter part is retained and each sentence in the clipped non-description segment is complete. Optionally, the server may ensure that the opening speech of the clipped non-description segment is complete based on the ASR technique.
Optionally, the server may further cut the non-description segment according to the cutting proportion, the integrity of the voice information, and a preset cutting order to obtain the cut non-description segment, so that each sentence in the cut non-description segment is complete. Optionally, the clipping ratio may be a ratio of the clipped non-description segment to the original non-description segment. Assuming a cropping ratio of 0.5, each non-descriptive segment can be cropped off for 5 seconds. The above-mentioned clipping ratio is only an embodiment and does not limit the present application.
In some possible embodiments, there may be at least 2 non-description segments that are consecutive, such as segment 1 and segment 2. Since, as explained above, the non-description segments may be scenario segments, segment 1 and segment 2 may be 2 small scenario segments split from one overall scenario. In order to keep the clipped scenario relatively complete, the non-description segments may first be spliced to obtain a non-description integration segment, and the non-description integration segment may then be clipped according to the integrity of the voice information and the preset clipping order to obtain the clipped non-description integration segment. The preset clipping order may be from front to back, from back to front, or from the middle toward both sides. Typically, a portion describing the item may also appear in the non-description integration segment, and that portion usually appears in its second half. Therefore, the server may clip the non-description integration segment in a back-to-front order according to the integrity of the voice information, so that more of the latter part is retained and each sentence in the clipped non-description integration segment is complete. Optionally, the server may ensure that the opening speech of the clipped non-description integration segment is complete based on the ASR technique.
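The back-to-front clipping with sentence completeness can be pictured as follows. The sketch assumes ASR already provides sentence-level (start, end) timestamps within the segment, which is an assumption for illustration rather than part of the disclosure.

```python
def clip_keep_tail(sentences: list[tuple[float, float]], segment_end: float,
                   target_duration: float) -> float:
    """Choose a new start time for a non-description (integration) segment so
    that roughly `target_duration` seconds are kept, the latter part is
    preserved, and no sentence is cut in half.

    `sentences` are (start, end) times of ASR-recognised sentences within the
    segment; the return value is the start time of the clipped segment.
    """
    earliest_start = segment_end - target_duration
    keep_start = segment_end
    # Walk the sentences from back to front and keep whole sentences while they fit.
    for start, _end in reversed(sentences):
        if start < earliest_start:
            break
        keep_start = start
    return keep_start


# A 10-second segment with sentences at 0-3 s, 3-6.5 s and 6.5-10 s, clipped so
# that about half of it is retained from the tail.
new_start = clip_keep_tail([(0, 3), (3, 6.5), (6.5, 10)], segment_end=10, target_duration=5)
assert new_start == 6.5  # keeping from 6.5 s preserves the last sentence intact
```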
In step S204, integrating the description segment of the object and the clipped non-description segment to obtain a target video; the duration of the target video is less than or equal to a first preset duration.
Continuing with the above example, after clipping the non-description segments, a 5-second segment 1, a 5-second segment 2, a 10-second segment 3, a 5-second segment 4, and a 10-second segment 5 are obtained. The integration duration, i.e., the total duration of all segments, is 35 seconds, which is longer than the first preset duration of 30 seconds.
In the embodiment of the present application, the server may integrate the description segment of the object with the clipped non-description segment to obtain the integration duration. If the server determines that the integration duration is longer than the first preset duration, the description segment and/or the clipped non-description segment can be accelerated to obtain the target video, so that the duration of the target video is less than or equal to the first preset duration.
Optionally, the server may accelerate the description segments and the non-description segments (non-description integration segment) uniformly, that is, accelerate the above-mentioned 35 seconds of segments to within 30 seconds. Optionally, only the description segments may be accelerated so that the 35 seconds of segments are shortened to within 30 seconds. Optionally, only the non-description segments (non-description integration segment) may be accelerated so that the 35 seconds of segments are shortened to within 30 seconds; in each case, the speech rate of each video segment in the target video is kept consistent.
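Under the assumption that acceleration is applied uniformly, the required speed-up is just the ratio of the integration duration to the first preset duration, as in this small sketch:

```python
def speedup_factor(integration_duration: float, first_preset_duration: float) -> float:
    """Playback-rate multiplier needed so the integrated video fits the limit."""
    if integration_duration <= first_preset_duration:
        return 1.0  # already short enough, no acceleration needed
    return integration_duration / first_preset_duration


# 35 seconds of integrated segments into a 30-second slot:
factor = speedup_factor(35.0, 30.0)   # ~1.17x
new_duration = 35.0 / factor          # 30.0 seconds
```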
In some possible embodiments, in order to make the beginning of the final target video more appealing to the viewer, a highlight segment may be determined from the initial video as the beginning of the target video.
Fig. 5 is a flowchart illustrating a method of obtaining a sub-segment of a video to be processed according to an exemplary embodiment, as shown in fig. 5, including the following steps:
in step S501, a video segment to be processed is determined from a plurality of initial video segments based on the fusion feature information of each initial video segment.
In the embodiment of the present application, a trained second segment recognition model capable of distinguishing highlight segments from non-highlight segments may be built into the server. The server can input the fusion feature information of each initial video segment into the second segment recognition model to obtain a highlight-degree result corresponding to each initial video segment, where the result indicates a highlight segment or a non-highlight segment, and the to-be-processed video segment is determined from the plurality of initial video segments based on the highlight-degree results.
In an alternative embodiment, when the server inputs a plurality of initial video segments into the second segment recognition model, the scores of the highlights of each initial video segment can be obtained respectively, and the scores can be between 0 and 1. The server may take the initial video segment with the highest score as the video segment to be processed.
The second segment recognition model is a machine learning model. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to endow computers with intelligence, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal teaching learning. Machine learning can be divided into supervised machine learning, unsupervised machine learning, and semi-supervised machine learning. The second segment recognition model may be constructed based on a convolutional neural network.
In step S502, a video sub-segment to be processed is determined from the video segments to be processed based on the second preset duration, the integrity of the voice information, and the definition of the start frame.
Assuming that the second preset duration is 3 seconds, the server may determine a plurality of sub-segments with a duration of 3 seconds from the video segment to be processed. Whether the voice information in each sub-segment is complete can be determined based on an ASR technique; if a sub-segment with complete voice information exists, whether the definition of its start frame meets the requirement can be determined based on a definition algorithm such as the Laplacian operator, and if so, that sub-segment is determined as the to-be-processed video sub-segment. If a plurality of sub-segments satisfy both conditions, one of them can be selected at random as the to-be-processed video sub-segment. The to-be-processed video sub-segment, i.e. the highlight segment, can then be used as the beginning of the final target video.
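The sketch below illustrates one way such a check could be wired together, assuming the 3-second candidate sub-segments have already been cut. `speech_is_complete` is a placeholder for the ASR-based completeness test, which the patent does not detail, and the Laplacian-variance threshold is an arbitrary assumption.

```python
import random
import cv2

def start_frame_is_sharp(frame_bgr, threshold=100.0):
    """Variance of the Laplacian as a simple definition (sharpness) measure for the start frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() >= threshold

def pick_sub_segment(candidates, speech_is_complete, threshold=100.0):
    """candidates: list of (start_frame_bgr, audio_clip) pairs, each 3 seconds long."""
    qualified = [c for c in candidates
                 if speech_is_complete(c[1]) and start_frame_is_sharp(c[0], threshold)]
    # randomly pick one sub-segment when several satisfy both conditions
    return random.choice(qualified) if qualified else None
```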
After the highlight segment is obtained and the non-description segment is clipped, the total duration of all the segments may still be greater than the first preset duration. At this point the server needs to accelerate the to-be-processed video sub-segment, the description segment and/or the clipped non-description segment to obtain a target video that satisfies the first preset duration.
Fig. 6 is a flowchart illustrating a method of acquiring a target video according to an exemplary embodiment, as shown in fig. 6, including the steps of:
in step S601, the segment sequence number of the to-be-processed video sub-segment is determined.
Since it has been mentioned above that the to-be-processed video sub-segment is placed at the beginning of the target video, it can be determined that its segment number is 0.
In step S602, the segment sequence number of the description segment and the segment sequence number of the clipped non-description segment are determined based on the segment sequence number of each initial video segment.
In an alternative embodiment, each initial video segment carries a segment sequence number, for example segment sequence number 1 for segment 1, 2 for segment 2, 3 for segment 3, 4 for segment 4 and 5 for segment 5. Cutting the non-description segments does not change any sequence numbers: after cutting, segments 1 to 5 still carry sequence numbers 1 to 5 respectively.
In step S603, the to-be-processed video sub-segment, the description segment and the clipped non-description segment are spliced based on the segment sequence number of the to-be-processed video sub-segment, the segment sequence number of the description segment and the segment sequence number of the clipped non-description segment, so as to obtain the transition video.
The server can splice the to-be-processed video sub-segment, the description segment and the clipped non-description segment based on the segment sequence numbers: the smaller the sequence number, the earlier the segment is placed. In this way a transition video comprising the to-be-processed video sub-segment, the description segment and the clipped non-description segment is obtained.
In step S604, the duration of the transition video is determined as the splicing duration.
Based on the duration of each segment in the above example, the total duration of the transition video, i.e. the splicing duration, has reached 38 seconds.
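As a sketch of the ordering and duration bookkeeping in steps S603 and S604 (frame-level concatenation itself is left to whatever editing backend is used), the durations below reproduce the 38-second example: a 3-second highlight sub-segment with sequence number 0 plus 35 seconds of illustratively split description and clipped non-description segments. The `Segment` class and its fields are assumptions for illustration, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    sequence_number: int   # 0 for the to-be-processed video sub-segment
    duration: float        # seconds
    label: str             # "highlight", "description" or "clipped non-description"

def build_transition_video(segments):
    """Order segments by sequence number; the smaller the number, the earlier it plays."""
    ordered = sorted(segments, key=lambda s: s.sequence_number)
    splicing_duration = sum(s.duration for s in ordered)
    return ordered, splicing_duration

segments = [Segment(0, 3, "highlight"),
            Segment(1, 6, "description"), Segment(2, 9, "description"),
            Segment(3, 8, "clipped non-description"), Segment(4, 5, "clipped non-description"),
            Segment(5, 7, "clipped non-description")]
ordered, splicing_duration = build_transition_video(segments)   # splicing_duration == 38 seconds
```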
In step S605, if the splicing duration is greater than the first preset duration, the to-be-processed video sub-segment, the description segment and/or the clipped non-description segment are accelerated to obtain the target video.
Obviously, in this case the splicing duration is longer than the first preset duration, and the server may accelerate the to-be-processed video sub-segment, the description segment and/or the clipped non-description segment to obtain the target video. The duration of the target video is less than or equal to the first preset duration, and the overall speech rate in the target video is kept consistent.
In the embodiment of the present application, the purpose of the acceleration is to reduce the duration of the target video while keeping the speech rate consistent across the segments. Therefore, before accelerating, the server can first determine which of the segments (the to-be-processed video sub-segment, the description segment and the clipped non-description segment) contain speech text information, in other words, which segments contain human speech.
In some possible embodiments, if the to-be-processed video sub-segment, the description segment and the clipped non-description segment all contain speech text information, that is, every segment contains speech text information, the server may determine a first speech rate corresponding to the speech text information in the to-be-processed video sub-segment, a second speech rate corresponding to the speech text information in the description segment, and a third speech rate corresponding to the speech text information in the clipped non-description segment. If the first speech rate, the second speech rate and the third speech rate are the same, the to-be-processed video sub-segment, the description segment and the clipped non-description segment are uniformly accelerated based on a preset speech rate to obtain the target video; otherwise, the to-be-processed video sub-segment, the description segment and the clipped non-description segment are each accelerated based on the preset speech rate to obtain the target video. In either case the speech rate in the target video is kept consistent, and the duration of the target video is less than or equal to the first preset duration.
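A rough sketch of this decision is given below, under the assumption that a segment's speech rate can be estimated as ASR words per second; neither this measure nor the value of the preset speech rate is fixed by the patent, so both are placeholders.

```python
def speech_rate(asr_text, duration_seconds):
    """Estimate speech rate as ASR words per second (an assumed measure)."""
    return len(asr_text.split()) / duration_seconds

def acceleration_factors(segments, preset_rate, max_duration):
    """segments: list of dicts with 'text' and 'duration'. Returns one speed-up factor per segment."""
    rates = [speech_rate(s["text"], s["duration"]) for s in segments]
    if max(rates) - min(rates) < 1e-6:
        # all segments already share a speech rate: apply one uniform factor
        return [max(1.0, sum(s["duration"] for s in segments) / max_duration)] * len(segments)
    # otherwise speed each segment up toward the preset speech rate so that the
    # speech rate across the target video ends up consistent; a further uniform
    # pass may still be needed if the total duration remains above max_duration
    return [max(1.0, preset_rate / r) for r in rates]
```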
In some possible embodiments, only some of the to-be-processed video sub-segment, the description segment and the clipped non-description segment contain speech text information; for example, only the to-be-processed video sub-segment and the description segment contain speech text information. The server can determine a first speech rate corresponding to the speech text information in the to-be-processed video sub-segment and a second speech rate corresponding to the speech text information in the description segment. If the first speech rate and the second speech rate are the same, the to-be-processed video sub-segment and the description segment can be uniformly accelerated based on the preset speech rate and then combined with the clipped non-description segment to obtain the target video; otherwise, the to-be-processed video sub-segment and the description segment are each accelerated based on the preset speech rate and then combined with the clipped non-description segment to obtain the target video. The speech rate in the target video is kept consistent, and the duration of the target video is less than or equal to the first preset duration.
In the embodiment of the application, unimportant video frames are cut out of the non-description segment, and at the same time the number of video frames played per second is increased by accelerating the speech rate. The first preset duration is thus satisfied while preserving the video quality and the number of content frames as far as possible, which in turn reduces the waste of the service resources used to promote the information.
FIG. 7 is a block diagram illustrating a resource processing apparatus according to an example embodiment. Referring to fig. 7, the apparatus includes a feature information acquisition module 701, a segment classification module 702, a cropping module 703, and an integration module 704.
A feature information obtaining module 701 configured to perform segmentation on an initial video to obtain a plurality of initial video segments and fusion feature information of each initial video segment; the duration of the initial video is greater than a first preset duration;
a segment classification module 702 configured to perform classification of the plurality of initial video segments based on the fusion feature information of each initial video segment, so as to obtain a description segment of the object and a non-description segment of the object;
a clipping module 703 configured to perform clipping on the non-description segment to obtain a clipped non-description segment;
an integrating module 704, configured to perform integration processing on the description segment and the clipped non-description segment of the object, to obtain a target video; the duration of the target video is less than or equal to a first preset duration.
In some possible embodiments, the integration module is configured to perform:
integrating the description fragment of the object and the cut non-description fragment to obtain integration duration;
if the integration duration is longer than the first preset duration, the description segment and/or the clipped non-description segment are accelerated to obtain the target video, and the speech rate of each video segment in the target video is kept consistent.
In some possible embodiments, the apparatus further comprises:
a to-be-processed video segment determination module configured to perform determining a to-be-processed video segment from a plurality of initial video segments based on the fusion feature information of each initial video segment;
and the to-be-processed video sub-segment determining module is configured to determine the to-be-processed video sub-segment from the to-be-processed video segment based on the second preset time length, the voice information integrity and the starting frame definition.
In some possible embodiments, each initial video segment carries a segment sequence number, and the integration module is configured to perform:
determining the segment sequence number of the video sub-segment to be processed;
determining the segment sequence number of the description segment and the segment sequence number of the clipped non-description segment based on the segment sequence number of each initial video segment;
splicing the video sub-segment to be processed, the description segment and the cut non-description segment based on the segmentation serial number of the video sub-segment to be processed, the segmentation serial number of the description segment and the segmentation serial number of the cut non-description segment to obtain a transition video;
determining the duration of the transition video as the splicing duration;
and if the splicing duration is longer than the first preset duration, accelerating the to-be-processed video sub-segment, the description segment and/or the clipped non-description segment to obtain the target video.
In some possible embodiments, if the to-be-processed video sub-segment, the description segment, and the clipped non-description segment all contain speech text information, the integration module is configured to perform:
determining a first speech rate corresponding to voice text information in a video sub-segment to be processed;
determining a second speech rate corresponding to the voice text information in the description fragment;
determining a third speech rate corresponding to the cut voice text information of the non-description fragment;
if the first speech rate, the second speech rate and the third speech rate are the same, uniformly accelerating the to-be-processed video sub-segment, the description segment and/or the clipped non-description segment based on the preset speech rate to obtain the target video; otherwise, accelerating the to-be-processed video sub-segment, the description segment and/or the clipped non-description segment based on the preset speech rate to obtain the target video.
In some possible embodiments, the feature information obtaining module is configured to perform:
calculating the content difference degree of adjacent frames in the initial video (see the sketch after this list);
segmenting the initial video based on the content difference degree to obtain a plurality of initial video segments;
fusion feature information for each of a plurality of initial video segments is obtained.
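The patent does not prescribe how the content difference degree of adjacent frames is computed; the sketch below uses an HSV-histogram distance with an assumed threshold purely as one common way such a segmentation step could be realized.

```python
import cv2

def frame_histogram(frame_bgr):
    """Normalized 2-D hue/saturation histogram of one frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist)

def segment_boundaries(video_path, diff_threshold=0.4):
    """Return frame indices at which a new initial video segment starts."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, index = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = frame_histogram(frame)
        if prev_hist is not None:
            # Bhattacharyya distance as the "content difference degree" of adjacent frames
            diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if diff > diff_threshold:
                boundaries.append(index)
        prev_hist, index = hist, index + 1
    cap.release()
    return boundaries
```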
In some possible embodiments, the feature information obtaining module is configured to perform:
determining a plurality of video frames from each initial video segment;
determining visual characteristic information of each initial video segment based on a plurality of video frames;
acquiring picture text characteristic information of each initial video clip based on text information on a video frame of each initial video clip;
acquiring voice text characteristic information of each initial video clip based on the voice text information of each initial video clip;
acquiring voice attribute characteristic information of each initial video clip based on the voice attribute information of each initial video clip;
and splicing the visual characteristic information, the picture text characteristic information, the voice text characteristic information and the voice attribute characteristic information corresponding to each initial video clip to obtain the fusion characteristic information of each initial video clip.
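Assuming each modality has already been encoded into a fixed-length vector (the dimensions below are arbitrary), the splicing of the four kinds of feature information can be sketched as a simple concatenation:

```python
import numpy as np

def fuse_segment_features(visual, frame_text, speech_text, speech_attr):
    """Concatenate the per-segment modality features into one fusion feature vector."""
    return np.concatenate([visual, frame_text, speech_text, speech_attr], axis=0)

fusion = fuse_segment_features(np.random.randn(512),   # visual feature of the segment
                               np.random.randn(128),   # on-frame (picture text) feature
                               np.random.randn(128),   # speech (ASR) text feature
                               np.random.randn(32))    # speech attribute feature
# fusion.shape == (800,) -- the fusion feature information of one initial video segment
```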
In some possible embodiments, the feature information obtaining module is configured to perform:
acquiring voice category information, voice emotion information, volume information and frequency information of each initial video clip;
determining voice attribute information based on the voice category information, the voice emotion information, the volume information, and the frequency information;
and acquiring voice attribute characteristic information of each initial video clip based on the voice attribute information.
In some possible embodiments, the clipping module is configured to perform:
if the number of the non-description fragments is more than one and the non-description fragments are continuous fragments, splicing the non-description fragments according to the segmentation serial numbers of the non-description fragments to obtain non-description integration fragments;
and partially cutting the non-description integration segment according to the integrity of the voice information and a preset cutting sequence to obtain the cut non-description integration segment.
In some possible embodiments, the integration module is configured to perform:
integrating the description fragment of the object and the cut non-description integration fragment to obtain the current integration duration;
and if the current integration time length is longer than the first preset time length, accelerating the description fragment and/or the cut non-description integration fragment to obtain the target video.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 8 is a block diagram illustrating an apparatus 800 for resource processing in accordance with an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect the open/closed state of the apparatus 800 and the relative positioning of components such as the display and keypad of the apparatus 800; it may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the apparatus 800 to perform the above-described method is also provided. Alternatively, the storage medium may be a non-transitory computer-readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Claims (10)

1. A method for processing resources, comprising:
segmenting an initial video to obtain a plurality of initial video segments and fusion characteristic information of each initial video segment; the duration of the initial video is greater than a first preset duration;
classifying the plurality of initial video clips based on the fusion characteristic information of each initial video clip to obtain a description clip of an object and a non-description clip of the object;
cutting the non-description fragment to obtain a cut non-description fragment;
integrating the description fragment of the object and the cut non-description fragment to obtain a target video; the duration of the target video is less than or equal to the first preset duration.
2. The method of claim 1, wherein the integrating the description segment of the object and the clipped non-description segment to obtain the target video comprises:
integrating the description fragment of the object and the cut non-description fragment to obtain integration duration;
if the integration duration is longer than the first preset duration, accelerating the description fragment and/or the clipped non-description fragment to obtain the target video, wherein the speech speed of each video fragment in the target video is kept consistent.
3. The resource handling method of claim 2, wherein the method further comprises:
determining a video clip to be processed from the plurality of initial video clips based on the fusion characteristic information of each initial video clip;
and determining a video sub-segment to be processed from the video segment to be processed based on a second preset time length, the integrity of the voice information and the definition of the starting frame.
4. The resource processing method according to claim 3, wherein each of the initial video segments carries a segment sequence number, and if the integration duration is longer than the first preset duration, accelerating the description segment and/or the clipped non-description segment to obtain a target video comprises:
determining the segment sequence number of the video sub-segment to be processed;
determining the segment sequence number of the description segment and the segment sequence number of the clipped non-description segment based on the segment sequence number of each initial video segment;
splicing the video sub-segment to be processed, the description segment and the cut non-description segment based on the segmentation serial number of the video sub-segment to be processed, the segmentation serial number of the description segment and the segmentation serial number of the cut non-description segment to obtain a transition video;
determining the duration of the transition video as the splicing duration;
and if the splicing time length is longer than the first preset time length, accelerating the video sub-segment to be processed, the description segment and/or the cut non-description segment to obtain the target video.
5. The resource processing method according to claim 4, wherein if the to-be-processed video sub-segment, the description segment, and the clipped non-description segment all contain speech text information, the accelerating the to-be-processed video sub-segment, the description segment, and/or the clipped non-description segment to obtain the target video comprises:
determining a first speech rate corresponding to the voice text information in the video sub-segment to be processed;
determining a second speech rate corresponding to the voice text information in the description fragment;
determining a third speech rate corresponding to the cut voice text information of the non-description fragment;
if the first speech rate, the second speech rate and the third speech rate are the same, uniformly accelerating the to-be-processed video sub-segment, the description segment and/or the clipped non-description segment based on a preset speech rate to obtain the target video; otherwise, accelerating the to-be-processed video sub-segment, the description segment and/or the clipped non-description segment based on a preset speech rate to obtain the target video.
6. The resource processing method according to any one of claims 1 to 5, wherein the segmenting the initial video to obtain a plurality of initial video segments and fusion characteristic information of each of the initial video segments comprises:
calculating the content difference degree of adjacent frames in the initial video;
segmenting the initial video based on the content difference degree to obtain a plurality of initial video segments;
acquiring fusion characteristic information of each initial video clip in the plurality of initial video clips.
7. A resource processing apparatus, comprising:
the system comprises a characteristic information acquisition module, a video segmentation module and a video fusion module, wherein the characteristic information acquisition module is configured to segment an initial video to obtain a plurality of initial video segments and fusion characteristic information of each initial video segment; the duration of the initial video is greater than a first preset duration;
a segment classification module configured to perform classification on the plurality of initial video segments based on the fusion feature information of each initial video segment, so as to obtain a description segment of an object and a non-description segment of the object;
the cutting module is configured to perform cutting on the non-description fragment to obtain a cut non-description fragment;
the integration module is configured to perform integration processing on the description segment of the object and the cut non-description segment to obtain a target video; the duration of the target video is less than or equal to the first preset duration.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the resource handling method of any of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the resource processing method of any of claims 1 to 6.
10. A computer program product, characterized in that the computer program product comprises a computer program, which is stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the computer device to perform the resource processing method according to any of claims 1 to 6.
CN202111599274.3A 2021-12-24 2021-12-24 Resource processing method and device, electronic equipment and storage medium Active CN114501132B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111599274.3A CN114501132B (en) 2021-12-24 2021-12-24 Resource processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111599274.3A CN114501132B (en) 2021-12-24 2021-12-24 Resource processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114501132A true CN114501132A (en) 2022-05-13
CN114501132B CN114501132B (en) 2024-03-12

Family

ID=81496394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111599274.3A Active CN114501132B (en) 2021-12-24 2021-12-24 Resource processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114501132B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107517406A (en) * 2017-09-05 2017-12-26 语联网(武汉)信息技术有限公司 A kind of video clipping and the method for translation
CN109618184A (en) * 2018-12-29 2019-04-12 北京市商汤科技开发有限公司 Method for processing video frequency and device, electronic equipment and storage medium
CN111918122A (en) * 2020-07-28 2020-11-10 北京大米科技有限公司 Video processing method and device, electronic equipment and readable storage medium
CN112153462A (en) * 2019-06-26 2020-12-29 腾讯科技(深圳)有限公司 Video processing method, device, terminal and storage medium
CN112532897A (en) * 2020-11-25 2021-03-19 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and computer readable storage medium
CN112738557A (en) * 2020-12-22 2021-04-30 上海哔哩哔哩科技有限公司 Video processing method and device


Also Published As

Publication number Publication date
CN114501132B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN111783756B (en) Text recognition method and device, electronic equipment and storage medium
CN111988638B (en) Method and device for acquiring spliced video, electronic equipment and storage medium
US11394675B2 (en) Method and device for commenting on multimedia resource
CN111553864B (en) Image restoration method and device, electronic equipment and storage medium
CN108038102B (en) Method and device for recommending expression image, terminal and storage medium
CN110458218B (en) Image classification method and device and classification network training method and device
CN110633700B (en) Video processing method and device, electronic equipment and storage medium
CN111539443A (en) Image recognition model training method and device and storage medium
CN106791535B (en) Video recording method and device
CN112954450B (en) Video processing method and device, electronic equipment and storage medium
CN111556352B (en) Multimedia resource sharing method and device, electronic equipment and storage medium
CN106547850B (en) Expression annotation method and device
CN106777016B (en) Method and device for information recommendation based on instant messaging
CN109344703B (en) Object detection method and device, electronic equipment and storage medium
CN112152901A (en) Virtual image control method and device and electronic equipment
CN112464031A (en) Interaction method, interaction device, electronic equipment and storage medium
KR20220026470A (en) Method for extracting video clip, apparatus for extracting video clip, and storage medium
CN113573128B (en) Audio processing method, device, terminal and storage medium
CN114501058A (en) Video generation method and device, electronic equipment and storage medium
CN110830845A (en) Video generation method and device and terminal equipment
CN113032627A (en) Video classification method and device, storage medium and terminal equipment
CN112381091A (en) Video content identification method and device, electronic equipment and storage medium
CN114302231B (en) Video processing method and device, electronic equipment and storage medium
CN114501132B (en) Resource processing method and device, electronic equipment and storage medium
CN110636363A (en) Multimedia information playing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant