CN114999530A - Audio and video editing method and device

Audio and video editing method and device

Info

Publication number
CN114999530A
Authority
CN
China
Prior art keywords
audio, text, target, video, semantic
Legal status
Granted
Application number
CN202210542292.6A
Other languages
Chinese (zh)
Other versions
CN114999530B (en)
Inventor
高强
李旭
刘杨
李强
Current Assignee
Beijing Feixiang Xingxing Technology Co ltd
Original Assignee
Beijing Feixiang Xingxing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Feixiang Xingxing Technology Co ltd
Priority to CN202210542292.6A
Publication of CN114999530A
Application granted
Publication of CN114999530B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use, for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use, for comparison or discrimination, for processing of video signals
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present specification provides an audio/video clipping method and apparatus, wherein the audio/video clipping method includes: acquiring an audio and video to be clipped, and determining an audio file associated with the audio and video to be clipped; converting the audio file into at least one audio text based on the target semantic meaning associated with the audio and video to be edited, and determining a text time interval corresponding to each audio text; determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text; and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video.

Description

Audio and video editing method and device
Technical Field
The specification relates to the technical field of computers, in particular to an audio and video editing method. The present specification also relates to an audio-video editing apparatus, a computing device, and a computer-readable storage medium.
Background
Currently, in order to improve the efficiency with which users watch videos, videos need to be clipped to delete duplicate content or content that users are not interested in.
However, audio and video are currently clipped by manually deleting the video segments that do not meet the requirements, which limits the clipping efficiency of the audio and video.
Therefore, an audio and video clipping method is urgently needed to improve the clipping efficiency of the audio and video.
Disclosure of Invention
In view of this, the embodiments of the present specification provide an audio and video editing method. The present specification also relates to an audio/video editing apparatus, a computing device, and a computer-readable storage medium to solve the technical drawbacks of the prior art.
According to a first aspect of embodiments of the present specification, there is provided an audio-video clipping method, including:
acquiring an audio and video to be clipped, and determining an audio file associated with the audio and video to be clipped;
converting the audio file into at least one audio text based on the target semantic meaning associated with the audio and video to be edited, and determining a text time interval corresponding to each audio text;
determining a target audio text in the at least one audio text according to the target semantics, and determining a target text time interval corresponding to the target audio text;
and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video.
According to a second aspect of embodiments of the present specification, there is provided an audio-visual clip device including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is configured to acquire an audio and video to be clipped and determine an audio file associated with the audio and video to be clipped;
the conversion module is configured to convert the audio file into at least one audio text based on the target semantics related to the audio and video to be edited and determine a text time interval corresponding to each audio text;
the determining module is configured to determine a target audio text in the at least one audio text according to the target semantic and determine a target text time interval corresponding to the target audio text;
and the clipping module is configured to clip the audio and video to be clipped according to the target text time interval to obtain a target audio and video.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
acquiring an audio and video to be clipped, and determining an audio file associated with the audio and video to be clipped;
converting the audio file into at least one audio text based on the target semantic meaning associated with the audio and video to be edited, and determining a text time interval corresponding to each audio text;
determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text;
and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the audio-video clipping method.
The audio and video editing method provided by the specification acquires an audio and video to be edited and determines an audio file associated with the audio and video to be edited; converting the audio file into at least one audio text based on the target semantic meaning associated with the audio and video to be edited, and determining a text time interval corresponding to each audio text; determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text; and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video.
The embodiment of the specification realizes the audio file conversion based on the target semantic associated with the audio and video to be edited, and improves the accuracy of the obtained audio text in the subsequent semantic analysis; and the audio and video to be clipped is clipped according to the target text time interval corresponding to the target audio text, so that the clipping efficiency of the audio and video to be clipped is improved.
Drawings
Fig. 1 is a flowchart of an audio/video editing method provided in an embodiment of the present specification;
FIG. 2 is a schematic diagram of an element restoration model provided in an embodiment of the present specification;
fig. 3 is a processing flow chart of an audio/video clipping method applied to an interview audio/video to be clipped according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an audio/video clip device provided in an embodiment of the present specification;
fig. 5 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. However, this specification can be implemented in many other forms than those described herein, and those skilled in the art can make similar modifications without departing from the spirit and scope of this specification; the specification is therefore not limited to the specific embodiments disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of the present description, a first can also be referred to as a second and, similarly, a second can also be referred to as a first. Depending on the context, the word "if," as used herein, may be interpreted as "when" or "upon" or "in response to determining."
First, the noun terms referred to in one or more embodiments of the present application are explained.
RNN-Transducer: a speech recognition model based on a recurrent neural network.
CTC: connectionist Temporal Classification, a method of automatically aligning two unequal length sequences.
BERT: bidirectional Encoder responses from transforms, a bi-directional encoding technique used to learn text Representations.
In the present specification, an audio/video editing method is provided, and the present specification relates to an audio/video editing apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Fig. 1 shows a flowchart of an audio/video clipping method provided in an embodiment of the present specification, which specifically includes the following steps:
step 102: acquiring an audio and video to be clipped, and determining an audio file associated with the audio and video to be clipped.
The audio and video to be clipped is a video file which has clipping requirements and contains an audio stream and a video stream, for example, a video H contains multiple sections of repeated content, or a batch of videos contain similar content, such as a video set composed of videos all containing self-introduction content; each video in the video set can be used as an audio/video to be clipped; the audio file refers to an audio file obtained from an audio/video to be edited; in practical application, after the audio and video to be clipped is determined, the audio stream of the audio and video to be clipped is collected, and the collected audio stream is stored to obtain an audio file associated with the audio and video to be clipped.
For example, a teaching video G is obtained, where the teaching video G is a video including a teaching video stream and a corresponding teaching audio stream; and after the teaching video G with the editing requirement is determined, acquiring a teaching audio stream in the teaching video G, and storing the teaching audio stream to obtain a teaching audio file.
By acquiring the audio and video to be clipped and determining the audio file corresponding to the audio and video to be clipped, the video segments needing to be clipped in the audio and video to be clipped can be determined conveniently on the basis of the audio file.
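For illustration, the audio stream can be collected and stored with the ffmpeg command-line tool as sketched below; the tool choice, the file names, and the 16 kHz mono WAV format are assumptions made for the example, since the embodiment does not prescribe how the audio file is obtained.

```python
import subprocess
from pathlib import Path

def extract_audio(video_path: str, audio_path: str, sample_rate: int = 16000) -> str:
    """Collect the audio stream of the audio/video to be clipped and store it as a WAV file."""
    Path(audio_path).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,        # audio/video to be clipped
            "-vn",                   # drop the video stream
            "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM
            "-ar", str(sample_rate), # resample for speech recognition
            "-ac", "1",              # mono channel
            audio_path,
        ],
        check=True,
    )
    return audio_path

# Example (file names hypothetical):
# audio_file = extract_audio("teaching_video_G.mp4", "teaching_audio_G.wav")
```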
Step 104: and converting the audio file into at least one audio text based on the target semantic meaning associated with the audio and video to be clipped, and determining a text time interval corresponding to each audio text.
The target semantics refer to semantic data determined according to the clipping requirements; for example, the target semantics may be a self-introduction, a course introduction, and the like. The audio text refers to text data obtained by converting the audio file. The text time interval refers to the time interval corresponding to the audio text within the audio file. For example, if the audio in the audio file is "Hello everyone, I am Xiaoming", then based on the time information at which the audio "Hello everyone" appears in the audio file, the time interval corresponding to the audio text "Hello everyone" is determined to be the 0th to the 3rd second; that is, during the 0th to the 3rd second of the audio file, the correspondingly played audio text is "Hello everyone".
In practical application, an audio text output by the voice conversion model can be obtained by inputting an audio file into the voice conversion model; the voice conversion model refers to a model capable of converting audio into corresponding text; in practical application, the conversion from audio to text can be realized through a CTC technology, an RNN-Transducer technology, and the like, and the application is not particularly limited.
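Purely as an illustration of this audio-to-text step (this is not the speech conversion model of the embodiment, and it does not apply the semantic-word-list biasing described below), the open-source Whisper model can produce the text together with segment-level time information:

```python
# pip install openai-whisper   (also requires ffmpeg on the PATH)
import whisper

def transcribe_with_timestamps(audio_path: str, model_size: str = "base"):
    """Convert an audio file into timed text segments using Whisper (illustrative only)."""
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    # Each segment carries the text and its time interval within the audio file.
    return [(seg["text"].strip(), seg["start"], seg["end"]) for seg in result["segments"]]

# Example:
# for text, start, end in transcribe_with_timestamps("teaching_audio_G.wav"):
#     print(f"[{start:.1f}s - {end:.1f}s] {text}")
```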
Further, in order to identify the target semantics in the audio file more accurately and further perform audio/video clipping more accurately, the audio file may be converted into an audio text based on the target semantics associated with the audio/video to be clipped, and specifically, the method for converting the audio file into at least one audio text based on the target semantics associated with the audio/video to be clipped may include:
inputting the audio file to a speech conversion model;
processing the audio file through a feature extraction unit in the voice conversion model to obtain audio features;
processing the audio features through an audio feature processing unit in the voice conversion model to obtain audio features to be decoded;
determining a target semantic word associated with the audio file in a preset semantic word list through a decoding unit in the voice conversion model;
and decoding the audio features to be decoded by the decoding unit according to the target semantics corresponding to the target semantic words to obtain at least one audio text and output the voice conversion model.
The voice conversion model is a model which can convert an audio file to obtain a corresponding audio text; the voice conversion model can comprise a feature extraction unit, an audio feature processing unit, a decoding unit and the like; the characteristic extraction unit is used for extracting the audio characteristics of the audio file input into the voice conversion model; the audio characteristic processing unit is used for classifying the audio characteristics to obtain audio characteristics to be decoded; the decoding unit is used for decoding the audio features to be decoded according to the semantic words in the preset semantic word list to obtain an audio text.
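The three units can be pictured as the following interface sketch; all class and method names here are hypothetical and not taken from the embodiment, and the actual units would be neural components such as an RNN-Transducer or CTC encoder plus a decoder biased by the semantic word list.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

@dataclass
class AudioText:
    text: str
    start: float      # text time interval start, in seconds of the audio file
    end: float        # text time interval end
    weight: float = 1.0

class SpeechConversionModel(ABC):
    """Interface sketch of the three units described above (hypothetical names)."""

    def __init__(self, semantic_word_list: List[str]):
        # Preset semantic word list derived from the target semantics.
        self.semantic_word_list = semantic_word_list

    @abstractmethod
    def extract_features(self, audio_path: str):
        """Feature extraction unit: audio file -> frame-level audio features."""

    @abstractmethod
    def process_features(self, audio_features):
        """Audio feature processing unit: audio features -> features to be decoded."""

    @abstractmethod
    def decode(self, features_to_decode) -> List[AudioText]:
        """Decoding unit: decode while biasing toward the target semantic words."""

    def convert(self, audio_path: str) -> List[AudioText]:
        """End-to-end conversion: audio file -> audio texts with time intervals."""
        return self.decode(self.process_features(self.extract_features(audio_path)))
```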
In practical application, the preset semantic word list may be obtained according to the target semantics, specifically: semantic words in the target semantics are identified, and the semantic word list is generated based on those semantic words. For example, if the target semantics are a self-introduction such as "Hello everyone, I am ...", the semantic words identified in the target semantics according to the preset identification rules are "Hello everyone", "I am", and so on.
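A minimal sketch of generating such a word list is shown below; the splitting rule stands in for the unspecified "preset identification rules" and is an assumption made for illustration.

```python
import re
from typing import List

def build_semantic_word_list(target_semantics: List[str]) -> List[str]:
    """Identify semantic words in the target semantics and generate a semantic word list.

    Splitting on punctuation is only a placeholder for the embodiment's
    unspecified "preset identification rules".
    """
    words: List[str] = []
    for phrase in target_semantics:
        for token in re.split(r"[,;.，。；]+", phrase):
            token = token.strip()
            if token and token not in words:
                words.append(token)
    return words

# Example:
# build_semantic_word_list(["Hello everyone, I am"])  -> ["Hello everyone", "I am"]
```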
The decoding unit in the speech conversion model determines a target semantic word in the preset semantic word list according to the audio features to be decoded. For example, if the audio features to be decoded include audio features corresponding to the text "Hello everyone", and the semantic word list includes the semantic word "Hello everyone", the audio features to be decoded can be decoded according to the semantics corresponding to the semantic word "Hello everyone".
For example, an audio file J obtained according to the audio and video to be clipped is input into the voice conversion model; extracting audio features in the audio file J by a feature extraction unit in the voice conversion model; inputting the audio features into an audio feature processing unit, and classifying the audio features to obtain audio features to be decoded; and determining a semantic word list preset in the decoding unit, decoding the audio features to be decoded based on the semantic word list to obtain an audio text, and outputting the audio text by the voice conversion model.
In practical application, each audio file has a corresponding text weight, and the text weight can be adjusted based on the target semantics corresponding to the target semantic words to obtain an audio text.
Specifically, the method for decoding, by the decoding unit according to the target semantic corresponding to the target semantic word, the audio feature to be decoded to obtain at least one audio text and output the speech conversion model may include:
processing the audio features to be decoded by the decoding unit according to the target semantics corresponding to the target semantic words to obtain at least one initial audio text, wherein each initial audio text carries a text weight;
and adjusting the text weight of the initial audio text associated with the target semantic word in the at least one initial audio text to obtain at least one audio text carrying the target text weight.
Wherein the text weight refers to the weight corresponding to the text content of the audio text; the target text weight refers to a text weight obtained by adjusting the text weight based on the target semantics of the target semantic word; the initial audio text refers to the audio text which is not adjusted by the target semantics.
Specifically, the decoding unit compares the target semantics corresponding to the target semantic words with the audio text, and generates initial audio texts according to the comparison results, where each initial audio text carries a text weight; one or more initial audio texts associated with the target semantic word are then determined among the initial audio texts, and the text weight of each such initial audio text is adjusted based on a preset weight adjustment value to obtain audio texts carrying the target text weight. The preset weight adjustment value is a preset numerical value for adjusting the text weight; for example, if the preset weight adjustment value is 0.2, the text weight of the initial audio text is increased or decreased by 0.2.
For example, the decoding unit in the speech conversion model compares the semantics corresponding to the semantic word A in the semantic word list with the features to be decoded, determines whether each feature to be decoded corresponds to the semantic word A, and obtains a comparison result; at least one initial audio text carrying a text weight is obtained based on the comparison result, where the initial audio texts include audio texts corresponding to the semantic word A and audio texts not corresponding to the semantic word A; the initial audio texts associated with the semantic word A are then determined among the initial audio texts, and the text weight of each determined initial audio text associated with the semantic word A is increased by the preset weight adjustment value of 0.1.
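The weight adjustment can be sketched as follows; the dictionary representation of an initial audio text and the substring test used to decide whether it is "associated with" a target semantic word are illustrative assumptions.

```python
from typing import Dict, List

def adjust_text_weights(initial_texts: List[Dict],
                        target_semantic_words: List[str],
                        adjustment: float = 0.1) -> List[Dict]:
    """Increase the text weight of every initial audio text associated with a target semantic word.

    Each initial audio text is represented here as a dict with "text" and "weight" keys.
    """
    for audio_text in initial_texts:
        if any(word in audio_text["text"] for word in target_semantic_words):
            audio_text["weight"] += adjustment  # preset weight adjustment value, e.g. 0.1
    return initial_texts

# Example:
# texts = [{"text": "Hello everyone", "weight": 1.0},
#          {"text": "the weather is nice", "weight": 1.0}]
# adjust_text_weights(texts, ["Hello everyone"])  # first entry's weight becomes 1.1
```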
In practical application, in order to ensure semantic accuracy of a text obtained by audio conversion, a segmentation element corresponding to the text may be determined, and the audio text may be processed in a manner of segmenting the text based on the segmentation element.
Specifically, the method for obtaining at least one initial audio text by processing the audio feature to be decoded by the decoding unit according to the target semantic corresponding to the target semantic word may include:
processing the audio features to be decoded by the decoding unit according to the target semantics corresponding to the target semantic words to obtain at least one text segment;
splicing each text segment into a text segment sequence, and identifying a segmentation element corresponding to the text segment sequence;
and segmenting the text segment sequence according to the segmentation elements to obtain at least one initial audio text.
The text segment is a text which is obtained by identifying the audio content in the audio text and corresponds to the audio content, and is a part of text content corresponding to the audio file; in the actual voice conversion process, part of the audio can be converted based on the semantics to obtain corresponding audio texts, so that a plurality of audio texts corresponding to the audio files are obtained; the text segment sequence is a sequence obtained by splicing the audio texts according to the sequence of the obtained audio texts; a segmentation element refers to an element used to segment a sequence of text segments based on semantics.
Specifically, the decoding unit decodes the audio features to be decoded according to the target semantics corresponding to the target semantic words to obtain text segments; for example, the audio features to be decoded corresponding to the audio "Hello everyone, I am Xiaoming" are decoded to obtain the text segments "Hello everyone", "I am" and "Xiaoming". The text segments are spliced in the order in which they are obtained to form a text segment sequence; continuing the above example, the text segment sequence "Hello everyone I am Xiaoming" is obtained. The segmentation elements corresponding to the text segment sequence and the position information of the segmentation elements in the text segment sequence are then identified; continuing the above example, the segmentation element (a comma) corresponding to the text segment sequence "Hello everyone I am Xiaoming" and its position information are determined. Finally, the text segment sequence is segmented based on the segmentation elements to obtain at least one initial audio text; continuing the above example, "Hello everyone I am Xiaoming" is segmented according to the segmentation element and its position information to obtain "Hello everyone, I am Xiaoming", that is, the initial audio texts are determined to be "Hello everyone" and "I am Xiaoming", respectively.
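A minimal sketch of this splice-and-segment step, assuming the segmentation elements are given as (element, character position) pairs over the spliced sequence:

```python
from typing import List, Tuple

def split_by_segmentation_elements(text_segments: List[str],
                                   elements: List[Tuple[str, int]]) -> List[str]:
    """Splice text segments into one sequence and segment it at the recovered elements.

    `elements` holds (segmentation element, character position) pairs, e.g. the
    output of an element (punctuation) recovery model; positions are assumed to
    index into the spliced sequence.
    """
    sequence = " ".join(text_segments)  # splice in the order the segments were produced
    initial_audio_texts, last = [], 0
    for _element, position in sorted(elements, key=lambda e: e[1]):
        piece = sequence[last:position].strip()
        if piece:
            initial_audio_texts.append(piece)
        last = position
    tail = sequence[last:].strip()
    if tail:
        initial_audio_texts.append(tail)
    return initial_audio_texts

# Example:
# split_by_segmentation_elements(["Hello everyone", "I am", "Xiaoming"], [(",", 14)])
# -> ["Hello everyone", "I am Xiaoming"]
```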
In practical application, an initial text corresponding to the text segment sequence can be determined based on the element recovery model; specifically, the method for identifying the segment element corresponding to the audio text segment sequence may include:
inputting the sequence of audio text segments to an element restoration model;
and acquiring the segmentation elements output by the element recovery model and the position information of each segmentation element in the audio text segment sequence.
The element recovery model is a model which is trained in advance and can obtain segmentation elements and segmentation element position information in a text according to an input text.
Specifically, the audio segmentation sequence is input into an element recovery model, and segmentation elements and segmentation element position information output by the element recovery model are received.
For example, a text segment sequence "Hello everyone I am Xiaoming I will give today's lecture" is input into the RNN punctuation recovery model trained in advance, and as shown in fig. 2, the punctuation marks corresponding to the text segment sequence and the position information of the punctuation marks are output by the RNN punctuation recovery model.
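The embodiment's element recovery model is a trained RNN; the wrapper below only illustrates its calling convention (a spliced sequence in, segmentation elements plus positions out) with a toy rule standing in for the network, so every name and the cue-word heuristic here are assumptions.

```python
from typing import List, Tuple

class ElementRecoveryModel:
    """Hypothetical wrapper around a pre-trained punctuation (element) recovery model.

    The real model predicts, for each position in the sequence, whether a
    segmentation element (e.g. a comma) should be inserted there; the rule below
    merely stands in for that network so the calling convention is runnable.
    """

    CUE_WORDS = ("I am", "I will")  # toy cues marking a likely clause boundary

    def predict(self, text_segment_sequence: str) -> List[Tuple[str, int]]:
        elements = []
        for cue in self.CUE_WORDS:
            index = text_segment_sequence.find(" " + cue + " ")
            if index != -1:
                elements.append((",", index))  # a segmentation element belongs here
        return sorted(elements, key=lambda e: e[1])

# Example:
# ElementRecoveryModel().predict("Hello everyone I am Xiaoming I will give today's lecture")
# -> [(",", 14), (",", 28)]
```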
Further, a text time interval corresponding to the audio text is determined, so that the video to be clipped is conveniently processed based on the text time interval.
Specifically, the method for decoding, by the decoding unit, the audio feature to be decoded according to the target semantic corresponding to the target semantic word to obtain at least one audio text and output the speech conversion model may include:
decoding the audio features to be decoded by the decoding unit according to the target semantics corresponding to the target semantic words to obtain a plurality of decoding vectors;
converting each decoding vector through an output unit of the voice conversion model to obtain a word unit corresponding to each decoding vector;
aligning the word unit corresponding to each decoding vector with the audio file through the output unit to obtain the at least one audio text carrying time information and output the voice conversion model;
correspondingly, a specific method for determining a text time interval corresponding to each audio text may include:
and determining a text time interval corresponding to each audio text according to the time information carried in the at least one audio text.
Specifically, after the decoding unit decodes the audio features to be decoded, corresponding decoding vectors can be obtained, and the word unit corresponding to each decoding vector is determined. For example, if the decoding vectors corresponding to the two characters that make up the word "hello" are determined to be vector a and vector b, and the word unit corresponding to vector a and vector b is determined to be g, then vector a and vector b are subsequently combined to obtain the corresponding audio text.
The voice conversion model comprises an output unit, and the output unit aligns the word unit corresponding to each decoding vector with the audio file to obtain the audio file containing time information; for example, after receiving a decoding vector k corresponding to a text "good", determining whether the decoding vector k is an end vector in a word unit g, if so, determining time information corresponding to the decoding vector k in an audio file, and determining a time interval corresponding to the word unit g based on the time information corresponding to the decoding vector k; and if not, further receiving the decoding vector until determining that the end vector of the word unit is received, determining time information corresponding to the end vector, and generating a text time interval of the audio file based on the time information corresponding to the start vector of the word unit and the time information corresponding to the end vector.
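One way to picture this alignment, assuming each decoded token already carries the time at which it is spoken and an identifier of the audio text (word-unit group) it belongs to, is the sketch below; the dictionary layout is an assumption for illustration.

```python
from typing import Dict, List, Tuple

def text_time_intervals(aligned_tokens: List[Dict]) -> List[Tuple[str, float, float]]:
    """Derive a text time interval for each audio text from token-level time information.

    Each entry of `aligned_tokens` looks like {"token": "Hello", "text_id": 0, "time": 0.4},
    where "time" is the second at which the token is spoken in the audio file and
    "text_id" groups the tokens that belong to the same audio text.
    """
    times: Dict[int, List[float]] = {}
    texts: Dict[int, List[str]] = {}
    for tok in aligned_tokens:
        times.setdefault(tok["text_id"], []).append(tok["time"])
        texts.setdefault(tok["text_id"], []).append(tok["token"])
    return [
        (" ".join(texts[i]), min(ts), max(ts))  # (audio text, interval start, interval end)
        for i, ts in sorted(times.items())
    ]

# Example:
# text_time_intervals([
#     {"token": "Hello", "text_id": 0, "time": 0.4},
#     {"token": "everyone", "text_id": 0, "time": 2.6},
#     {"token": "I", "text_id": 1, "time": 3.1},
#     {"token": "am", "text_id": 1, "time": 3.4},
#     {"token": "Xiaoming", "text_id": 1, "time": 4.0},
# ])
# -> [("Hello everyone", 0.4, 2.6), ("I am Xiaoming", 3.1, 4.0)]
```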
Step 106: and determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text.
After the at least one audio text is determined, a target audio text containing target semantics is determined in the at least one audio text, and a target file time interval corresponding to the target audio text is determined.
For example, the audio texts corresponding to an audio file form an audio text set Q = {audio text 1, audio text 2, ..., audio text n}; if, according to the target semantics P, it is determined that audio text 2, audio text 3, and audio text 8 in the set contain the target semantics P, the text time intervals corresponding to audio text 2, audio text 3, and audio text 8 are respectively determined.
Further, it may be determined whether the audio text includes a target semantic through a semantic analysis model, and in particular, the method for determining the target audio text in the at least one audio text according to the target semantic may include:
inputting each audio text into a semantic analysis model for processing to obtain a probability value of each audio text containing target semantics;
and comparing the probability value corresponding to each audio text with a preset probability threshold value, and screening a target audio text from at least one audio text according to the comparison result.
The semantic analysis model refers to a model which can output the probability that the audio text contains the target semantics based on the input audio text; the semantic analysis model may be derived by training a generic model, such as a BERT model.
In practical application, the trained semantic analysis model can be obtained in the following way; specifically, before inputting each audio text into the semantic analysis model for processing, the method may further include:
acquiring a sample data set containing target semantics;
training the semantic analysis model based on the sample data in the sample data set.
The sample data set refers to a sample data set generated based on target semantics, for example, a sample audio file is converted to obtain a sample audio text; labeling a sample audio text according to the target semantics, labeling a label of 'including the target semantics' when the sample audio text includes the target semantics, and labeling a label of 'not including the target semantics' when the sample audio text does not include the target semantics; generating a sample data set based on the sample audio text containing the label; and fine-tuning the semantic analysis model based on the sample data set to obtain the trained semantic analysis model.
For example, the BERT model is pre-trained using a preset sample set; acquiring a sample audio file S, and converting the audio file S into a sample audio text; determining that the target semantics are course introduction, and adding a tag of 'including the target semantics' and a tag of 'not including the target semantics' to the sample audio text based on the target semantics; generating a sample data set based on the sample audio text containing the label; and fine-tuning the pre-trained BERT model based on the sample data set to obtain a trained semantic analysis model.
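A sketch of assembling such a labeled sample data set is given below. The keyword rule used to attach the "contains the target semantics" label is only a placeholder (in practice the labels would usually be assigned manually), and the fine-tuning itself would then be an ordinary sequence-classification training run on this data.

```python
from typing import Dict, List

def build_sample_dataset(sample_audio_texts: List[str],
                         target_semantic_keywords: List[str]) -> List[Dict]:
    """Label sample audio texts for fine-tuning the semantic analysis model.

    label 1 = "contains the target semantics", label 0 = "does not contain the
    target semantics"; the keyword containment rule is an illustrative stand-in
    for manual or rule-based labeling.
    """
    dataset = []
    for text in sample_audio_texts:
        contains = any(keyword in text for keyword in target_semantic_keywords)
        dataset.append({"text": text, "label": 1 if contains else 0})
    return dataset

# Example (target semantics: course introduction; strings hypothetical):
# build_sample_dataset(
#     ["This lesson introduces linear equations", "Please open your textbook"],
#     ["lesson introduces", "this course covers"],
# )
# -> [{"text": ..., "label": 1}, {"text": ..., "label": 0}]
```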
After the semantic analysis model is obtained, each audio text is input into the semantic analysis model for processing, and the probability value, output by the semantic analysis model, that each audio text contains the target semantics is received; the probability value refers to the probability that the audio text contains the target semantics, for example, the probability value that the audio text H contains the target semantics is 85%. The probability value corresponding to each audio text is then compared with a preset probability threshold, where the preset probability threshold is the minimum probability value an audio text must exceed to be selected; the audio texts exceeding the preset probability threshold are taken as target audio texts, and the text time intervals corresponding to the target audio texts are determined.
For example, determining audio text 1, audio text 2 and audio text 3 corresponding to the audio file a, and the probability value of 30%, 95%, 90% corresponding to each audio text; if the preset probability threshold is determined to be 85%, it may be determined that the audio text 3 and the audio text 2 are target audio texts, and text time intervals corresponding to the audio text 3 and the audio text 2 are respectively determined.
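The threshold screening can be sketched as follows, reusing the 85% threshold from the example above; the dictionary layout of an audio text is an assumption for illustration.

```python
from typing import Dict, List, Tuple

def select_target_audio_texts(audio_texts: List[Dict],
                              probabilities: List[float],
                              threshold: float = 0.85) -> List[Tuple[str, float, float]]:
    """Keep the audio texts whose probability of containing the target semantics exceeds the threshold.

    `audio_texts` are dicts with "text", "start" and "end" keys; `probabilities`
    are the per-text outputs of the semantic analysis model.
    """
    targets = []
    for audio_text, prob in zip(audio_texts, probabilities):
        if prob > threshold:  # preset probability threshold, e.g. 85%
            targets.append((audio_text["text"], audio_text["start"], audio_text["end"]))
    return targets

# Example mirroring the text above:
# texts = [{"text": "audio text 1", "start": 0, "end": 3},
#          {"text": "audio text 2", "start": 3, "end": 6},
#          {"text": "audio text 3", "start": 6, "end": 9}]
# select_target_audio_texts(texts, [0.30, 0.95, 0.90])  # keeps audio texts 2 and 3
```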
And determining a target audio text containing target semantics and determining a text time interval corresponding to the target audio text so as to clip the audio and video to be clipped based on the text time interval.
Step 108: and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video.
And after the target text time interval is determined, determining a video segment based on the target text time interval, and editing the audio and video to be edited according to the video segment.
In practical application, the method for clipping the audio/video to be clipped according to the target text time interval to obtain the target audio/video may include:
determining audio and video clips to be processed in the video to be clipped based on the target text time interval;
and deleting the audio and video clips to be processed in the audio and video to be clipped to obtain a target audio and video.
The audio/video clips to be processed refer to audio/video clips determined based on the target text time interval.
Specifically, the time interval of the audio/video containing the target semantics can be determined according to the target text time interval; for example, according to the text time interval, the 3rd to 5th seconds of the audio/video to be clipped contain the target semantics. The audio/video segments to be processed are then determined in the audio/video to be clipped based on that time interval, and each determined audio/video segment to be processed is deleted from the audio/video to be clipped to obtain the clipped target audio/video.
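A sketch of the deletion step is shown below: the complement of the to-be-deleted intervals is kept and re-joined, here with the ffmpeg command-line tool (a tool choice assumed for illustration; stream copying cuts on keyframes, so a re-encode would be needed for frame-accurate clipping).

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List, Tuple

def clip_video(video_path: str,
               segments_to_delete: List[Tuple[float, float]],
               duration: float,
               output_path: str) -> str:
    """Delete the audio/video segments to be processed and keep the rest of the video."""
    # 1. The complement of the deletion intervals gives the intervals to keep.
    keep, cursor = [], 0.0
    for start, end in sorted(segments_to_delete):
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        keep.append((cursor, duration))

    # 2. Cut each kept interval, then concatenate the parts with the concat demuxer.
    with tempfile.TemporaryDirectory() as tmp:
        parts = []
        for i, (start, end) in enumerate(keep):
            part = str(Path(tmp) / f"part_{i}.mp4")
            subprocess.run(["ffmpeg", "-y", "-i", video_path,
                            "-ss", str(start), "-to", str(end),
                            "-c", "copy", part], check=True)  # copy cuts on keyframes
            parts.append(part)
        list_file = Path(tmp) / "parts.txt"
        list_file.write_text("".join(f"file '{p}'\n" for p in parts))
        subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                        "-i", str(list_file), "-c", "copy", output_path], check=True)
    return output_path

# Example (file names and duration hypothetical):
# clip_video("interview.mp4", [(3.0, 5.0)], duration=60.0, output_path="interview_clipped.mp4")
```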
The audio and video editing method comprises the steps of obtaining an audio and video to be edited and determining an audio file related to the audio and video to be edited so as to obtain a corresponding audio text based on the audio file; converting the audio file into at least one audio text based on the target semantics related to the audio/video to be edited, and determining a text time interval corresponding to each audio text, so that the accuracy of the audio text is improved, and the target semantics can be conveniently identified in the audio text subsequently; determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text so as to efficiently determine the audio text containing the target semantic; and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video, so that the video to be edited is edited according to the text time interval corresponding to the audio text containing the target semantics, and the editing efficiency is improved.
The following further describes the audio/video clipping method provided by the present specification by taking its application to an interview audio/video to be clipped as an example, with reference to fig. 3. Fig. 3 shows a processing flow chart of an audio/video clipping method applied to an interview audio/video to be clipped, provided by an embodiment of the present specification, which specifically includes the following steps:
step 302: and acquiring the interview audio/video to be edited, and determining the audio file associated with the interview audio/video to be edited.
Specifically, the audio stream in the interview audio/video to be edited is collected to obtain an audio file.
Step 304: and inputting the audio file into the voice conversion model, and processing the audio file based on a feature extraction unit in the voice conversion model to obtain audio features.
Step 306: and processing the audio features based on an audio feature processing unit in the voice conversion model to obtain the audio features to be decoded.
Step 308: and determining a target semantic word of the associated audio file in a preset semantic word list based on a decoding unit in the voice conversion model.
Step 310: and decoding the audio features to be decoded according to the target semantics corresponding to the target semantic words to obtain the audio text and the corresponding text time interval.
Specifically, the decoding unit processes the audio features to be decoded according to the target semantics corresponding to the target semantic word, and obtains at least one text segment. And splicing each text segment into a text segment sequence, and identifying a segment element corresponding to the text segment sequence. And segmenting the text segment sequence according to the segmentation elements to obtain at least one initial audio text. Determining a text weight of an initial audio text associated with a target semantic word in at least one initial audio text, and adjusting to obtain at least one audio text carrying the target text weight; and determining a text time interval corresponding to each audio text.
Step 312: and inputting each audio text into a semantic analysis model for processing to obtain a probability value of each audio text containing target semantics.
Step 314: and comparing the probability value corresponding to each audio text with a preset probability threshold, screening a target audio text from at least one audio text according to the comparison result, and determining a text time interval corresponding to the target audio text.
Step 316: and determining audio and video fragments to be processed in the interview audio and video to be edited based on the target text time interval, and deleting the audio and video fragments to be processed to obtain the target interview audio and video.
The audio and video editing method comprises the steps of obtaining an audio and video to be edited and determining an audio file related to the audio and video to be edited so as to obtain a corresponding audio text based on the audio file; converting the audio file into at least one audio text based on the target semantics related to the audio and video to be edited, and determining a text time interval corresponding to each audio text, so that the accuracy of the audio text is improved, and the target semantics can be conveniently identified in the audio text subsequently; determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text so as to efficiently determine the audio text containing the target semantic; and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video, so that the video to be edited is edited according to the text time interval corresponding to the audio text containing the target semantics, and the editing efficiency is improved.
Corresponding to the above method embodiment, the present specification further provides an audio/video clip device embodiment, and fig. 4 shows a schematic structural diagram of an audio/video clip device provided in an embodiment of the present specification. As shown in fig. 4, the apparatus includes:
an obtaining module 402, configured to obtain an audio/video to be clipped, and determine an audio file associated with the audio/video to be clipped;
a conversion module 404 configured to convert the audio file into at least one audio text based on the target semantic associated with the audio/video to be edited, and determine a text time interval corresponding to each audio text;
a determining module 406, configured to determine a target audio text in the at least one audio text according to the target semantic, and determine a target text time interval corresponding to the target audio text;
and the clipping module 408 is configured to clip the audio/video to be clipped according to the target text time interval to obtain a target audio/video.
Optionally, the conversion module 404 is further configured to:
inputting the audio file to a speech conversion model;
processing the audio file through a feature extraction unit in the voice conversion model to obtain audio features;
processing the audio features through an audio feature processing unit in the voice conversion model to obtain audio features to be decoded;
determining a target semantic word associated with the audio file in a preset semantic word list through a decoding unit in the voice conversion model;
and decoding the audio features to be decoded by the decoding unit according to the target semantics corresponding to the target semantic words to obtain at least one audio text and output the voice conversion model.
Optionally, the conversion module 404 is further configured to:
processing the audio features to be decoded by the decoding unit according to the target semantics corresponding to the target semantic words to obtain at least one initial audio text, wherein each initial audio text carries a text weight;
and adjusting the text weight of the initial audio text associated with the target semantic word in the at least one initial audio text to obtain at least one audio text carrying the target text weight.
Optionally, the conversion module 404 is further configured to:
decoding the audio features to be decoded by the decoding unit according to the target semantics corresponding to the target semantic words to obtain a plurality of decoding vectors;
converting each decoding vector through an output unit of the voice conversion model to obtain a word unit corresponding to each decoding vector;
aligning the word unit corresponding to each decoding vector with the audio file through the output unit to obtain the at least one audio text carrying time information and output the voice conversion model;
correspondingly, determining a text time interval corresponding to each audio text includes:
and determining a text time interval corresponding to each audio text according to the time information carried in the at least one audio text.
Optionally, the conversion module 404 is further configured to:
processing the audio features to be decoded by the decoding unit according to the target semantics corresponding to the target semantic words to obtain at least one text segment;
splicing each text fragment into a text fragment sequence, and identifying a segmentation element corresponding to the text fragment sequence;
and segmenting the text segment sequence according to the segmentation elements to obtain at least one initial audio text.
Optionally, the determining module 406 is further configured to:
inputting each audio text into a semantic analysis model for processing to obtain a probability value of each audio text containing target semantics;
and comparing the probability value corresponding to each audio text with a preset probability threshold value, and screening a target audio text from at least one audio text according to the comparison result.
Optionally, the apparatus further comprises a training module configured to:
acquiring a sample data set containing target semantics;
training the semantic analysis model based on the sample data in the sample data set.
Optionally, the clipping module 408 is further configured to:
determining audio and video clips to be processed in the video to be clipped based on the target text time interval;
and deleting the audio/video segment to be processed in the audio/video to be clipped to obtain a target audio/video.
Optionally, the conversion module 404 is further configured to:
inputting the sequence of audio text segments to an element restoration model;
and acquiring the segmentation elements output by the element recovery model and the position information of each segmentation element in the audio text segment sequence.
The audio and video editing device comprises an acquisition module, a processing module and a display module, wherein the acquisition module is configured to acquire an audio and video to be edited and determine an audio file associated with the audio and video to be edited; the conversion module is configured to convert the audio file into at least one audio text based on the target semantics related to the audio and video to be edited and determine a text time interval corresponding to each audio text; the determining module is configured to determine a target audio text in the at least one audio text according to the target semantics and determine a target text time interval corresponding to the target audio text; and the clipping module is configured to clip the audio and video to be clipped according to the target text time interval to obtain a target audio and video.
The embodiment of the specification realizes the conversion of the audio file based on the target semantic associated with the audio and video to be edited, and improves the accuracy of the obtained audio text in the subsequent semantic analysis; and the audio and video to be clipped is clipped according to the target text time interval corresponding to the target audio text, so that the clipping efficiency of the audio and video to be clipped is improved.
The above is a schematic scheme of an audio-video clip device of the present embodiment. It should be noted that the technical solution of the audio/video editing device and the technical solution of the audio/video editing method belong to the same concept, and details of the technical solution of the audio/video editing device, which are not described in detail, can be referred to the description of the technical solution of the audio/video editing method.
Fig. 5 illustrates a block diagram of a computing device 500 provided according to an embodiment of the present description. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes access device 540, access device 540 enabling computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 540 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 5 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
Wherein, the processor 520 is configured to execute the following computer-executable instructions:
acquiring an audio and video to be clipped, and determining an audio file associated with the audio and video to be clipped;
converting the audio file into at least one audio text based on the target semantic meaning associated with the audio and video to be edited, and determining a text time interval corresponding to each audio text;
determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text;
and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned audio/video clipping method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above-mentioned audio/video clipping method.
An embodiment of the present specification also provides a computer readable storage medium storing computer instructions that, when executed by a processor, are operable to:
acquiring an audio and video to be clipped, and determining an audio file associated with the audio and video to be clipped;
converting the audio file into at least one audio text based on the target semantic meaning associated with the audio and video to be edited, and determining a text time interval corresponding to each audio text;
determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text;
and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above-mentioned audio/video editing method belong to the same concept, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the above-mentioned audio/video editing method.
The foregoing description of specific embodiments has been presented for purposes of illustration and description. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals or telecommunication signals, in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present disclosure is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present disclosure. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for this description.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, to thereby enable others skilled in the art to best understand the specification and its practical application. The specification is limited only by the claims and their full scope and equivalents.

Claims (12)

1. An audio-video clipping method, comprising:
acquiring an audio/video to be clipped, and determining an audio file associated with the audio/video to be clipped;
converting the audio file into at least one audio text based on the target semantics related to the audio and video to be clipped, and determining a text time interval corresponding to each audio text;
determining a target audio text in the at least one audio text according to the target semantic, and determining a target text time interval corresponding to the target audio text;
and editing the audio/video to be edited according to the target text time interval to obtain a target audio/video.
2. The method of claim 1, wherein converting the audio file into at least one audio text based on the target semantics of the to-be-clipped audio-video association comprises:
inputting the audio file into a speech conversion model;
processing the audio file through a feature extraction unit in the speech conversion model to obtain audio features;
processing the audio features through an audio feature processing unit in the speech conversion model to obtain audio features to be decoded;
determining a target semantic word associated with the audio file in a preset semantic word list through a decoding unit in the speech conversion model;
and decoding, by the decoding unit, the audio features to be decoded according to the target semantics corresponding to the target semantic word to obtain at least one audio text output by the speech conversion model.
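Claim 2 splits the speech conversion model into a feature extraction unit, an audio feature processing unit, and a decoding unit that consults a preset semantic word list. The dataflow can be sketched as below, with every unit left as a hypothetical callable; a real system might plug in, for example, log-Mel features, an acoustic encoder, and a decoder biased toward the listed words, but none of that is specified here.

from typing import Callable, List, Sequence

def speech_to_audio_texts(audio: bytes,
                          extract_features: Callable[[bytes], Sequence],
                          process_features: Callable[[Sequence], Sequence],
                          decode_with_bias: Callable[[Sequence, List[str]], List[str]],
                          semantic_word_list: List[str]) -> List[str]:
    features = extract_features(audio)      # feature extraction unit
    to_decode = process_features(features)  # audio feature processing unit
    # decoding unit: decoding is steered by the target semantic words found in the
    # preset semantic word list
    return decode_with_bias(to_decode, semantic_word_list)
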
3. The method of claim 2, wherein decoding, by the decoding unit, the audio features to be decoded according to the target semantics corresponding to the target semantic word to obtain at least one audio text output by the speech conversion model comprises:
processing, by the decoding unit, the audio features to be decoded according to the target semantics corresponding to the target semantic word to obtain at least one initial audio text, wherein each initial audio text carries a text weight;
and adjusting, in the at least one initial audio text, the text weight of the initial audio text associated with the target semantic word to obtain at least one audio text carrying a target text weight.
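The weight adjustment in claim 3 can be read as re-ranking recognition hypotheses: an initial audio text that contains a target semantic word has its text weight raised before the best text is chosen. A toy illustration follows, in which the 1.5x boost factor is an arbitrary assumption made only for the example.

from typing import Dict, List

def adjust_text_weights(initial_texts: Dict[str, float],
                        target_words: List[str],
                        boost: float = 1.5) -> Dict[str, float]:
    adjusted = {}
    for text, weight in initial_texts.items():
        if any(word in text for word in target_words):
            weight *= boost  # raise the weight of texts associated with a target semantic word
        adjusted[text] = weight
    return adjusted

# The hypothesis containing the target semantic word "discount" overtakes the
# acoustically similar alternative after adjustment.
ranked = adjust_text_weights({"today we offer a discount": 0.40,
                              "today we off for this count": 0.45},
                             target_words=["discount"])
best = max(ranked, key=ranked.get)  # "today we offer a discount"
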
4. The method of claim 2, wherein decoding, by the decoding unit, the audio features to be decoded according to the target semantics corresponding to the target semantic word to obtain at least one audio text output by the speech conversion model comprises:
decoding, by the decoding unit, the audio features to be decoded according to the target semantics corresponding to the target semantic word to obtain a plurality of decoding vectors;
converting each decoding vector through an output unit of the speech conversion model to obtain a word unit corresponding to each decoding vector;
aligning, through the output unit, the word unit corresponding to each decoding vector with the audio file to obtain the at least one audio text carrying time information output by the speech conversion model;
correspondingly, determining a text time interval corresponding to each audio text includes:
and determining a text time interval corresponding to each audio text according to the time information carried in the at least one audio text.
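Claim 4 obtains the text time interval from the time information attached when word units are aligned with the audio file. Assuming, for illustration only, that the alignment yields per-word start and end times, the interval of an audio text can be derived as spanned by its first and last word; the timestamps below are invented for the example.

from typing import List, Tuple

def text_time_interval(word_times: List[Tuple[str, float, float]]) -> Tuple[float, float]:
    # The interval of an audio text is spanned by its first and last aligned word unit.
    starts = [start for _, start, _ in word_times]
    ends = [end for _, _, end in word_times]
    return min(starts), max(ends)

# Made-up word-level alignment for one audio text:
interval = text_time_interval([("hello", 1.20, 1.45),
                               ("and", 1.50, 1.60),
                               ("welcome", 1.62, 2.05)])
# interval == (1.2, 2.05)
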
5. The method of claim 3, wherein processing, by the decoding unit, the audio features to be decoded according to the target semantics corresponding to the target semantic word to obtain at least one initial audio text comprises:
processing, by the decoding unit, the audio features to be decoded according to the target semantics corresponding to the target semantic word to obtain at least one text segment;
splicing the text segments into a text segment sequence, and identifying a segmentation element corresponding to the text segment sequence;
and segmenting the text segment sequence according to the segmentation elements to obtain at least one initial audio text.
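Claim 5 splices the text segments into one sequence and then cuts it back apart at the identified segmentation elements. Assuming, purely for illustration, that the segmentation elements are sentence-ending punctuation marks, the splitting step could look like this sketch.

import re
from typing import List

def split_by_segmentation_elements(text_segments: List[str],
                                   elements: str = ".!?") -> List[str]:
    sequence = " ".join(text_segments)            # splice the segments into one sequence
    pattern = "[" + re.escape(elements) + "]"
    parts = re.split(pattern, sequence)           # segment at every identified element
    return [part.strip() for part in parts if part.strip()]

initial_texts = split_by_segmentation_elements(
    ["welcome everyone.", "today we talk", "about editing!"])
# initial_texts == ["welcome everyone", "today we talk about editing"]
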
6. The method of claim 1, wherein determining a target audio text among the at least one audio text according to the target semantics comprises:
inputting each audio text into a semantic analysis model for processing to obtain a probability value that each audio text contains the target semantics;
and comparing the probability value corresponding to each audio text with a preset probability threshold, and screening a target audio text from the at least one audio text according to the comparison result.
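In claim 6 the semantic analysis model outputs, per audio text, a probability of containing the target semantics, and a preset threshold decides which texts survive. A sketch of the screening step follows; target_probability stands in for the model, and 0.5 is an assumed threshold, not one specified by the claim.

from typing import Callable, List

def screen_target_texts(audio_texts: List[str],
                        target_probability: Callable[[str], float],
                        threshold: float = 0.5) -> List[str]:
    # Keep every audio text whose probability of containing the target semantics
    # reaches the preset probability threshold.
    return [text for text in audio_texts if target_probability(text) >= threshold]

# Example with a dummy scorer in place of the semantic analysis model:
targets = screen_target_texts(["big sale today", "checking the mic"],
                              lambda s: 0.9 if "sale" in s else 0.1)
# targets == ["big sale today"]
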
7. The method of claim 6, wherein, before inputting each audio text into the semantic analysis model for processing, the method further comprises:
acquiring a sample data set containing target semantics;
training the semantic analysis model based on the sample data in the sample data set.
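Claim 7 only requires that the semantic analysis model be trained on a sample data set containing the target semantics; the model itself is not specified. As a stand-in, a simple TF-IDF plus logistic regression classifier (scikit-learn) trained on labelled sample texts illustrates the idea; the sample data below is invented.

# Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Sample data set: texts labelled 1 when they contain the target semantics.
sample_texts = ["huge discount this weekend only",
                "limited time offer, buy one get one",
                "let me adjust the microphone",
                "we will resume after a short break"]
labels = [1, 1, 0, 0]

semantic_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
semantic_model.fit(sample_texts, labels)

# Probability that a new audio text contains the target semantics.
p = semantic_model.predict_proba(["a special discount for early birds"])[0, 1]
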
8. The method of claim 1, wherein clipping the audio/video to be clipped according to the target text time interval to obtain a target audio/video comprises:
determining an audio/video segment to be processed in the audio/video to be clipped based on the target text time interval;
and deleting the audio/video segment to be processed from the audio/video to be clipped to obtain a target audio/video.
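Claim 8 deletes the to-be-processed segments, so what remains is the complement of those segments over the original duration; the kept pieces can then be re-joined into the target audio/video. Below is a sketch of the interval arithmetic, plus one common (assumed, not claimed) way to extract a kept piece with ffmpeg.

import subprocess
from typing import List, Tuple

def kept_intervals(duration: float,
                   to_delete: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    # The segments that remain are the complement of the to-be-deleted segments
    # over the full duration of the audio/video.
    keep, cursor = [], 0.0
    for start, end in sorted(to_delete):
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        keep.append((cursor, duration))
    return keep

def extract_segment(src: str, dst: str, start: float, end: float) -> None:
    # Stream-copy cut with ffmpeg; frame accuracy depends on keyframe placement.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end),
                    "-c", "copy", dst], check=True)

# kept_intervals(60.0, [(10.0, 15.0), (40.0, 45.0)])
# == [(0.0, 10.0), (15.0, 40.0), (45.0, 60.0)]
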
9. The method of claim 5, wherein identifying the segmentation element corresponding to the text segment sequence comprises:
inputting the text segment sequence into an element restoration model;
and acquiring the segmentation elements output by the element restoration model and the position information of each segmentation element in the text segment sequence.
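Claim 9 delegates the restoration of segmentation elements to an element restoration model that returns each element together with its position in the text segment sequence. The output contract can be illustrated with a deliberately naive rule (a period inserted before a capitalised sentence start); a real element restoration model would be a trained sequence tagger, which is not shown here.

import re
from typing import List, Tuple

def restore_segmentation_elements(sequence: str) -> List[Tuple[str, int]]:
    # Returns each segmentation element together with its position in the sequence.
    # The rule below is only a toy stand-in for a trained element restoration model.
    return [(".", match.start()) for match in re.finditer(r"\s(?=[A-Z])", sequence)]

restore_segmentation_elements("welcome everyone Today we talk about editing")
# -> [(".", 16)]
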
10. An audio-video clipping device, comprising:
an acquisition module configured to acquire an audio/video to be clipped and determine an audio file associated with the audio/video to be clipped;
a conversion module configured to convert the audio file into at least one audio text based on target semantics associated with the audio/video to be clipped, and determine a text time interval corresponding to each audio text;
a determining module configured to determine a target audio text in the at least one audio text according to the target semantics, and determine a target text time interval corresponding to the target audio text;
and a clipping module configured to clip the audio/video to be clipped according to the target text time interval to obtain a target audio/video.
11. A computing device, comprising a memory and a processor, wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the audio-video clipping method of any one of claims 1 to 9.
12. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the audio-video clipping method of any one of claims 1 to 9.
CN202210542292.6A 2022-05-18 2022-05-18 Audio and video editing method and device Active CN114999530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210542292.6A CN114999530B (en) 2022-05-18 2022-05-18 Audio and video editing method and device

Publications (2)

Publication Number Publication Date
CN114999530A true CN114999530A (en) 2022-09-02
CN114999530B CN114999530B (en) 2024-07-16

Family

ID=83028012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210542292.6A Active CN114999530B (en) 2022-05-18 2022-05-18 Audio and video editing method and device

Country Status (1)

Country Link
CN (1) CN114999530B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024108981A1 (en) * 2022-11-22 2024-05-30 上海幻电信息科技有限公司 Video editing method and apparatus
WO2024152802A1 (en) * 2023-01-19 2024-07-25 北京字跳网络技术有限公司 Video editing method and apparatus, electronic device, and storage medium
US12051445B1 (en) 2023-01-19 2024-07-30 Beijing Zitiao Network Technology Co., Ltd. Method and apparatus of video editing, and electronic device and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190182436A1 (en) * 2017-12-08 2019-06-13 Canon Kabushiki Kaisha System and method of generating a composite frame
CN108259990A (en) * 2018-01-26 2018-07-06 腾讯科技(深圳)有限公司 A kind of method and device of video clipping
CN108391064A (en) * 2018-02-11 2018-08-10 北京秀眼科技有限公司 A kind of video clipping method and device
CN110909185A (en) * 2018-09-17 2020-03-24 国家新闻出版广电总局广播科学研究院 Intelligent broadcast television program production method and device
CN110164463A (en) * 2019-05-23 2019-08-23 北京达佳互联信息技术有限公司 A kind of phonetics transfer method, device, electronic equipment and storage medium
US20200118544A1 (en) * 2019-07-17 2020-04-16 Lg Electronics Inc. Intelligent voice recognizing method, apparatus, and intelligent computing device
CN110675433A (en) * 2019-10-31 2020-01-10 北京达佳互联信息技术有限公司 Video processing method and device, electronic equipment and storage medium
WO2021244110A1 (en) * 2020-06-04 2021-12-09 华人运通(上海)新能源驱动技术有限公司 Work generation and edition methods and apparatuses, terminals, server and systems
CN113286173A (en) * 2021-05-19 2021-08-20 北京沃东天骏信息技术有限公司 Video editing method and device
CN114339451A (en) * 2021-12-31 2022-04-12 上海爱奇艺新媒体科技有限公司 Video editing method and device, computing equipment and storage medium
CN114449310A (en) * 2022-02-15 2022-05-06 平安科技(深圳)有限公司 Video editing method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHUNYAO WANG, ET AL.: "Video Episode Boundary Detection with Joint Episode-Topic Model", 2020 25th International Conference on Pattern Recognition (ICPR), 5 May 2021 (2021-05-05) *
SHUAI SHIHUI: "Research on Automatic Editing Methods for Product Display Videos", China Master's Theses Full-text Database (Information Science and Technology), no. 8, 15 August 2020 (2020-08-15)

Also Published As

Publication number Publication date
CN114999530B (en) 2024-07-16

Similar Documents

Publication Publication Date Title
CN114999530B (en) Audio and video editing method and device
CN109117777B (en) Method and device for generating information
CN111488489B (en) Video file classification method, device, medium and electronic equipment
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN110991165A (en) Method and device for extracting character relation in text, computer equipment and storage medium
CN113590850A (en) Multimedia data searching method, device, equipment and storage medium
CN112668559A (en) Multi-mode information fusion short video emotion judgment device and method
CN114936565A (en) Method and device for extracting subject information
CN113127679A (en) Video searching method and device and index construction method and device
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN114245203A (en) Script-based video editing method, device, equipment and medium
CN111767740A (en) Sound effect adding method and device, storage medium and electronic equipment
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN114495904B (en) Speech recognition method and device
CN111625649A (en) Text processing method and device, electronic equipment and medium
CN115269836A (en) Intention identification method and device
CN114139545A (en) Information extraction method and device
CN117497009A (en) Voice emotion recognition method and device, electronic equipment and storage medium
CN114491010A (en) Training method and device of information extraction model
CN114510564A (en) Video knowledge graph generation method and device
KR102464156B1 (en) Call center service providing apparatus, method, and program for matching a user and an agent vasded on the user`s status and the agent`s status
CN115827831A (en) Intention recognition model training method and device
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN113111855A (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
CN111782762A (en) Method and device for determining similar questions in question answering application and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant