CN109756751B - Multimedia data processing method and device, electronic equipment and storage medium

Multimedia data processing method and device, electronic equipment and storage medium

Info

Publication number
CN109756751B
CN109756751B CN201711084918.9A
Authority
CN
China
Prior art keywords
text
multimedia data
segment
multimedia
data segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711084918.9A
Other languages
Chinese (zh)
Other versions
CN109756751A (en)
Inventor
熊章俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201711084918.9A
Publication of CN109756751A
Application granted
Publication of CN109756751B

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a multimedia data processing method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises: processing text material to obtain text segments and the text metadata corresponding to each text segment; identifying, from multimedia data segments with marked content, a multimedia data segment whose marked content matches the text metadata, the identified segment serving as the target multimedia segment converted from the text segment; and generating, from the target multimedia segments, the multimedia file converted from the text material. This scheme achieves conversion from text to audio and video, and saves substantial manpower and material resources because audio and video segments no longer need to be selected manually.

Description

Multimedia data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of multimedia application technologies, and in particular, to a multimedia data processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Text material and video material are related yet distinct. As to their relation, film and television directors often adapt literary works into movies and series; conversely, some film and television works give rise to derivative literary works. Their difference lies in the fact that text material and video material never match completely, so converting one into the other is an act of re-creation that demands large amounts of manpower, material resources, and other resources.
In the text-to-video direction, professional adaptation is performed by directors and actors; in the online world, some users also intercept video segments from massive video material and combine them, according to the text content, to express their understanding of that content.
Obviously, manually analyzing the text content of massive video material involves a heavy workload, and combining the intercepted video segments according to an understanding of that content consumes a long time.
Disclosure of Invention
To solve the problem that manually analyzing the text content of massive videos and then combining different video segments according to an understanding of that content consumes a long time, a multimedia data processing method is provided.
In one aspect, the present disclosure provides a multimedia data processing method, including:
processing a text material to obtain a text fragment and text metadata corresponding to the text fragment;
identifying a multimedia data segment with marked content matched with the text metadata from the multimedia data segments with marked content, wherein the multimedia data segment is used as a target multimedia segment converted from the text segment;
and generating a multimedia file converted from the text material through the target multimedia segment.
In another aspect, the present disclosure also provides a multimedia data processing apparatus, including:
the text processing module is used for processing the text material to obtain text segments and text metadata corresponding to the text segments;
the data matching module is used for identifying a multimedia data segment with the marked content matched with the text metadata from the multimedia data segments with the marked content, wherein the multimedia data segment is used as a target multimedia segment converted from the text segment;
and the file generation module is used for generating a multimedia file converted from the text material through the target multimedia fragment.
In addition, the present disclosure also provides an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the above multimedia data processing method.
Further, the present disclosure provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executable by a processor to perform the above multimedia data processing method.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
According to the above technical solution, a multimedia data segment whose marked content matches the text metadata of a text segment is identified from the multimedia data segments with marked content, so that it can serve as the target multimedia segment converted from the text segment; the multimedia file converted from the text material is then generated from the target multimedia segments. Because audio and video segments no longer need to be selected manually, conversion from text to audio and video is achieved while saving substantial manpower and material resources.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic illustration of an implementation environment according to the present disclosure;
FIG. 2 is a block diagram illustrating a server in accordance with an exemplary embodiment;
FIG. 3 is a flow chart illustrating a method of multimedia data processing according to an exemplary embodiment;
FIG. 4 is a flowchart of details of step 310 of the corresponding embodiment of FIG. 3;
FIG. 5 is a schematic diagram of details of step 330 of the corresponding embodiment of FIG. 3;
FIG. 6 is a flow chart of a multimedia data processing method according to the corresponding embodiment of FIG. 5;
FIG. 7 is a process diagram of the historical text segment being processed by the text information processing module;
FIG. 8 is a schematic diagram illustrating model training and matching by the text matching module in accordance with an exemplary embodiment;
FIG. 9 is a flowchart of a multimedia data processing method according to the corresponding embodiment of FIG. 3;
FIG. 10 is a diagram illustrating a process of processing a multimedia data segment by a multimedia information processing module to obtain a marked content of the multimedia data segment;
FIG. 11 is a functional block diagram of a server configuration according to an exemplary embodiment of the present disclosure;
FIG. 12 is a functional diagram of an edit authoring module shown in accordance with an exemplary embodiment;
FIG. 13 is a block diagram illustrating a multimedia data processing apparatus in accordance with an exemplary embodiment;
FIG. 14 is a block diagram of the details of a text processing module in the corresponding embodiment of FIG. 13;
FIG. 15 is a detailed block diagram of the data matching module in the corresponding embodiment of FIG. 13.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with certain aspects of the invention, as recited in the appended claims.
FIG. 1 is a schematic illustration of an implementation environment according to the present disclosure. The implementation environment includes: a plurality of mobile terminals 110 and at least one server 120.
The association between the mobile terminal 110 and the server 120 includes a network connection and/or a hardware protocol, together with the data interaction between them. The mobile terminal 110 provides existing text material to the server 120 and requests the server 120 to convert that text material into a multimedia file. The server 120 processes the text material to obtain text segments and the text metadata corresponding to each text segment. The server 120 may store multimedia data segments in its own database, each multimedia data segment carrying a tag that marks its content. According to the marked content of the multimedia data segments and the text metadata of each text segment, the server 120 searches for the matching multimedia data segment for each text segment and then generates a multimedia file from the found multimedia data segments, thereby realizing conversion from text to multimedia audio and video.
As required, the multimedia data processing method provided by the present disclosure can also be applied to intelligent display equipment, such as a smart television or a smart set-top box. The intelligent display equipment can process text material input by a user in an offline state, obtain text segments and the text metadata corresponding to each text segment, find the multimedia data segments matching the text metadata among the multimedia data segments stored in a local database, and generate from the found segments a multimedia file converted from the text material, realizing conversion from text to multimedia audio and video.
Referring to fig. 2, fig. 2 is a schematic diagram of a server structure according to an embodiment of the present invention. The server 200, whose composition may vary significantly with configuration or performance, may include one or more central processing units (CPUs) 222 (e.g., one or more processors), memory 232, and one or more storage media 230 (e.g., one or more mass storage devices) storing applications 242 or data 244. The memory 232 and the storage medium 230 may be transient or persistent storage. The program stored in the storage medium 230 may include one or more modules (not shown), each of which may include a series of instruction operations for the server 200. Still further, the central processor 222 may be configured to communicate with the storage medium 230 and execute, on the server 200, the series of instruction operations stored in the storage medium 230. The server 200 may also include one or more power supplies 226, one or more wired or wireless network interfaces 250, one or more input-output interfaces 258, and/or one or more operating systems 241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth. The steps performed by the server in the embodiments of fig. 3-6 and 8 below may be based on the server structure shown in fig. 2.
It will be understood by those skilled in the art that all or part of the steps for implementing the following embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Fig. 3 is a flow chart illustrating a multimedia data processing method according to an exemplary embodiment. The multimedia data processing method is applicable to, and executed by, for example, the server 120 or the intelligent display equipment of the implementation environment shown in fig. 1. As shown in fig. 3, the multimedia data processing method may include the following steps.
In step 310, the text material is processed to obtain a text segment and text metadata corresponding to the text segment.
The text material is a work containing textual content, such as a novel. Depending on its length, the text material may include one or more text segments. A text segment may be a natural paragraph or a chapter.
Accordingly, processing the text material includes extracting text segments from it and extracting the text metadata corresponding to each text segment. The text metadata is obtained by extracting key information from the textual content while processing the text material. The text metadata, i.e., the main content of the text segment, may include, for example, time, place, person, and action. In a specific implementation of an exemplary embodiment, the text metadata is obtained through a constructed contextual long short-term memory model (CLSTM). Each text segment has uniquely corresponding text metadata. For a given text material, the text metadata corresponding to all of its text segments constitutes the key information of the text material and describes its main content.
It should be noted that a document exhibits a sequential structure at various levels of abstraction (e.g., sentence, paragraph, and chapter). These levels of abstraction constitute a natural, content-characterizing hierarchy that can be used to make meaningful inferences about words or larger text segments. CLSTM is a model for inferring the meaning of natural-language documents: inputting a text segment into the CLSTM model yields structured text metadata as output.
Optionally, as shown in fig. 4, the step 310 specifically includes:
in step 311, the text material is segmented into a plurality of text segments.
When the text material is long, the text information processing module configured on the server 120 can divide the text material into paragraphs through natural language processing, obtaining a plurality of text segments. Each text segment may include one or more natural paragraphs.
Generally, the paragraphs of the text material can be delimited by the two-character first-line indentation or by periods. Each period marks the end of one sentence and the beginning of the next, yielding several sentences, each of which may be treated as a text segment. Alternatively, using the two-character first-line indentation, each natural paragraph is treated as a text segment.
Each natural paragraph carries a particular meaning, and several of the natural paragraphs may share the same meaning tendency. Whether natural paragraphs share a meaning tendency can be determined by preset feature keywords or other rules that characterize that tendency. Using the meaning tendency of each natural paragraph, paragraphs with similar tendencies can be grouped together as one text segment, as in the sketch below.
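As a concrete illustration, the following minimal Python sketch segments plain-text material along the lines just described. It is only a sketch under stated assumptions: natural paragraphs are assumed to be separated by newlines, and a preset keyword table stands in for the unspecified meaning-tendency rules; all keyword lists are hypothetical.
```python
import re

def split_into_segments(material: str, topic_keywords: dict[str, set[str]]) -> list[str]:
    # Natural paragraphs: assume they are separated by newlines (or an
    # indented first line that follows a newline).
    paragraphs = [p.strip() for p in re.split(r"\n\s*", material) if p.strip()]

    def topic_of(paragraph: str) -> str:
        # Assign the first topic whose preset keywords appear in the paragraph.
        for topic, words in topic_keywords.items():
            if any(w in paragraph for w in words):
                return topic
        return "other"

    # Merge consecutive paragraphs sharing a meaning tendency into one segment.
    segments, current, current_topic = [], [], None
    for p in paragraphs:
        t = topic_of(p)
        if current and t != current_topic:
            segments.append("\n".join(current))
            current = []
        current.append(p)
        current_topic = t
    if current:
        segments.append("\n".join(current))
    return segments

# Hypothetical keyword table characterizing two meaning tendencies.
print(split_into_segments("The knight rode out.\nA duel began.\nThe ship left the harbor.",
                          {"battle": {"knight", "duel"}, "voyage": {"ship", "harbor"}}))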
In step 312, extracting key information of the text segment through the context long-short term memory model, and obtaining text metadata corresponding to the text segment according to the extracted key information.
Through the CLSTM model, the main content (i.e., key information) of each text segment can be extracted, for example its theme, character relationships, and character actions; the text metadata corresponding to each text segment is this extracted key information. Inputting a text segment into the CLSTM model outputs the segment's main content, and that output is the text metadata corresponding to the segment. The sketch below illustrates the interface.
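The disclosure does not specify an API for the CLSTM model, so the sketch below fixes only the interface (text segment in, structured time/place/person/action metadata out) and uses a trivial keyword lookup as a stand-in for the trained model; the lexicon and all example words are hypothetical.
```python
from dataclasses import dataclass, field

@dataclass
class TextMetadata:
    times: list = field(default_factory=list)
    places: list = field(default_factory=list)
    persons: list = field(default_factory=list)
    actions: list = field(default_factory=list)

def extract_metadata(segment: str) -> TextMetadata:
    """Stand-in for CLSTM key-information extraction: a real system would
    feed the segment to the trained model instead of a keyword lexicon."""
    lexicon = {
        "times": ["midnight", "dawn", "winter"],
        "places": ["castle", "harbor", "forest"],
        "persons": ["knight", "captain"],
        "actions": ["duel", "escape", "chase"],
    }
    meta = TextMetadata()
    for fieldname, words in lexicon.items():
        getattr(meta, fieldname).extend(w for w in words if w in segment.lower())
    return meta

print(extract_metadata("At dawn the knight fled the castle in a desperate escape."))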
In step 330, a multimedia data segment with tagged content matching the text metadata is identified from the multimedia data segments with tagged content, and the multimedia data segment serves as a target multimedia segment converted from the text segment.
A multimedia data segment may be an audio segment, a video segment, or an audio-video segment with a certain playing duration. A multimedia data segment with marked content is one that has been marked according to its main content, so that it carries marked content; the marked content is the main content of the multimedia data segment. In the following description, the multimedia data segment is taken to be a video segment; the cases of audio segments and audio-video segments can be implemented by analogy.
It should be noted that several video segments may be stored in the storage medium of the server 120, the main content of each video segment having been marked with a tag. For any text segment, the server 120 computes, from the text metadata of that text segment and the marked content of each video segment, the matching degree between the two; it then selects the marked content with the highest matching degree to the text metadata, and takes the video segment carrying that marked content as the video segment matching the text segment. The video segment matching the text segment becomes the target video segment converted from the text segment.
When there are multiple text segments, the above process is applied to each: according to the text metadata of each text segment, a matching video segment is identified for it, one by one, among the video segments with marked content, yielding the target video segment converted from each text segment.
In step 350, a multimedia file converted from the text material is generated through the target multimedia segment.
The multimedia file may be a video file, an audio file, or an audio-video file. A video file can be generated from the video segments found above that match the text segments; likewise, an audio file can be generated from matching audio segments, and an audio-video file from matching audio-video segments.
Text material consists of written words, and at present conversion from text to video is performed mainly by professional directors and actors.
In general, a text material may include a plurality of text segments. For text material containing only one text segment, the video segment matching that text segment serves as the target video segment, and the target video segment can, as needed, be used directly as the video file converted from the text material; alternatively, the target video segment may be edited and modified, and the modified segment stored as the video file. For a plurality of video segments corresponding one-to-one to a plurality of text segments, the video segments can be spliced and edited in chronological order to obtain the video file.
For a text material containing a plurality of text segments, the step 350 specifically includes:
and splicing the target multimedia fragments corresponding to each text fragment according to the sequence of the text fragments in the text material to obtain the multimedia file converted from the text material.
When there is more than one text segment, the target multimedia segments corresponding to the text segments can be spliced according to the order of the text segments in the text material. Specifically, when the text material is segmented into a plurality of text segments, each text segment is numbered in sequence, and the order of appearance of each text segment in the text material is given by its number.
For example, suppose the order of the text segments in the text material is: text segment 1, text segment 2, text segment 3. If the video segment matching text segment 1 is video segment X, the one matching text segment 2 is video segment Y, and the one matching text segment 3 is video segment Z, then segments X, Y, and Z can be spliced in sequence, following the order of text segments 1, 2, and 3, to obtain the video file converted from the text material, as in the sketch below.
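A minimal splicing sketch using the moviepy library (1.x import path) is shown below, assuming the three matched target segments are stored as local files X.mp4, Y.mp4, and Z.mp4 (hypothetical names):
```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Paths follow the order of text segments 1, 2, 3 in the text material.
ordered_clip_paths = ["X.mp4", "Y.mp4", "Z.mp4"]
clips = [VideoFileClip(path) for path in ordered_clip_paths]

# "compose" tolerates clips of differing sizes by padding to a common frame.
final = concatenate_videoclips(clips, method="compose")
final.write_videofile("converted_from_text_material.mp4")

for clip in clips:
    clip.close()  # release file handles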
According to the technical solution provided by the above exemplary embodiment of the present disclosure, a multimedia data segment whose marked content matches the text metadata of a text segment is identified from the multimedia data segments with marked content, so that it can serve as the target multimedia segment converted from the text segment; the multimedia file converted from the text material can then be obtained from the target multimedia segments.
Further, as shown in fig. 5, the step 330 specifically includes:
in step 331, according to the multimedia data segments with marked content, a matching degree between itself and the marked content of each multimedia data segment is obtained for the text metadata through a text matching model.
And for any text segment, combining the text metadata of the text segment with the marked content of each video segment respectively according to the marked content of each video segment, inputting the combined text metadata into a text matching model, and outputting a matching value between the text metadata of the text segment and the marked content of each video segment. For example, suppose that the text metadata of the text segments are Ax and Ax are respectively combined with b1, b2 \8230; (b represents the mark content of the video segments and n video segments exist) one by one, then each combination is input into a text matching model, and the probability value of each combination, namely the matching degree, is output.
In step 332, according to the matching degree between the text metadata and the tagged content of each multimedia data segment, a multimedia data segment whose tagged content matches the text metadata is obtained.
Specifically, according to the matching degree between the text metadata and the mark content of each video clip, the text metadata and the mark content combination (Ax, bx) of the video clip with the highest matching degree are selected, and the video clip with the mark content matched with the text metadata of the text clip is obtained.
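The selection step can be sketched as a plain argmax over the matching degrees; `match_score` below is a placeholder for the trained text matching model, and the toy word-overlap scorer is for illustration only:
```python
def best_matching_segment(text_metadata: str,
                          marked_contents: list[str],
                          match_score) -> tuple[int, float]:
    """Return (index of the best video segment, its matching degree)."""
    scores = [match_score(text_metadata, b) for b in marked_contents]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

def toy_score(a: str, b: str) -> float:
    # Jaccard overlap of word sets: a stand-in for the text matching model.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

idx, degree = best_matching_segment(
    "knight castle duel",
    ["harbor storm", "knight duel castle", "forest chase"],
    toy_score)
print(idx, degree)  # segment 1 has the highest overlap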
Before the step 331, as shown in fig. 6, the multimedia data processing method provided by the present disclosure may further include the following steps:
in step 601, a multimedia material is processed to obtain a multimedia data segment output by the multimedia material and a markup content corresponding to the multimedia data segment.
The multimedia material may be video material, audio material or audio-video material. For example, the video material may be divided into video segments at preset time intervals according to the preset time intervals. For example, 5 hours of video material is divided into 5 video clips, and the playing time of each video clip is 1 hour. The marked content of the video clip, that is, the main content of the video clip, can be obtained by extracting the main content of the subtitle file corresponding to each video clip by using a CLSTM model according to the subtitle file corresponding to the video clip. The multimedia information processing module configured in the server 120 itself may be configured to process the video material, and obtain a video segment output by the video material and a mark content corresponding to the video segment.
In step 602, the marked contents of the historical text metadata and the multimedia data segment known to match each other are obtained.
Here, historical text metadata is a relative concept: it is text metadata that already existed before the text metadata of the text segment was obtained in step 310, and it refers to the main content of a historical text segment. Likewise, a historical text segment is a text segment that already existed before the text segment was obtained in step 310.
Optionally, the text information processing module configured on the server 120 may process historical text material in advance to obtain the historical text metadata of the historical text segments. Pairs of historical text metadata and multimedia data segment marked content that match each other can then be screened out by manual matching.
Further, before step 602, the following steps may also be included:
extracting key information of the obtained historical text segments through a context long-term and short-term memory model to obtain historical text key information corresponding to the historical text segments;
and correcting the key information of the historical text corresponding to the historical text fragment to obtain the metadata of the historical text corresponding to the historical text fragment.
Fig. 7 is a schematic diagram of the process by which the text information processing module processes historical text segments to obtain historical text metadata. As shown in fig. 7, the historical text segments may be recorded in an original library configured on the server 120, and key information extraction is performed on them using a natural language processing technique such as the CLSTM model, yielding the historical text key information corresponding to each historical text segment. The historical text segments and their corresponding historical text key information are stored in an intelligent processing library configured on the server 120. In addition, the historical text key information in the intelligent processing library can be corrected by manual editing, and the correction records are stored in an editing intervention library configured on the server 120; the editing intervention library stores the before-and-after mapping between each corrected historical text segment and its historical text key information.
The historical text segments and corresponding key information stored in the intelligent processing library are then fitted overall with the historical text segments and corrected key information stored in the editing intervention library, yielding the historical text segments and their corresponding historical text metadata, which are stored in a finished-product library configured on the server 120. The historical text metadata in the finished-product library may be corrected again; those correction records are stored in the editing intervention library, and the data in the intelligent processing library and the editing intervention library are fitted together once more, producing continuously updated and refined historical text segments with corresponding historical text metadata, the final versions of which are stored in the finished-product library. The finished-product library has the same data structure as the intelligent processing library: it consists of a number of time-ordered historical text segments, and the historical text metadata corresponding to each segment mainly comprises the event scene, event type, emotional atmosphere, and the like.
In step 603, the known history text metadata and the labeled content of the multimedia data segment which are matched with each other are used as a sample training set, the sample training set is input into a document theme generating model, and the optimal parameters of the document theme generating model are obtained through learning, so that the text matching model is obtained.
It should be noted that, before the text matching model is used to obtain the multimedia data segment matching a text segment, a process of establishing the text matching model may be included. Taking video segments as the multimedia data segments, as shown in fig. 8, historical text metadata and video segment marked content known to match each other are imported as a training sample set into an LDA model (latent Dirichlet allocation, a document topic generation model) for parameter training; after learning from massive data, the LDA model acquires matching capability, yielding the text matching model. Then, given any text metadata, LDA matching can be performed to find, among the video segments with marked content, the segment whose marked content matches the text metadata. A training sketch follows.
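A hedged sketch of this training step using gensim's LDA implementation follows; the training pairs are hypothetical, and taking the matching degree as the cosine similarity between LDA topic distributions is one plausible realization of the LDA matching described above, not necessarily the patented one:
```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.matutils import cossim

# Hypothetical training pairs: (historical text metadata, matched marked content),
# each already tokenized into word lists.
pairs = [
    (["knight", "castle", "duel"], ["sword", "fight", "castle", "knight"]),
    (["harbor", "storm", "escape"], ["ship", "storm", "harbor", "flee"]),
]
docs = [m + c for m, c in pairs]  # learn topics over the joint vocabulary

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=20, random_state=0)

def matching_degree(metadata_words, marked_words):
    # Cosine similarity of the two LDA topic distributions.
    va = lda[dictionary.doc2bow(metadata_words)]
    vb = lda[dictionary.doc2bow(marked_words)]
    return cossim(va, vb)

print(matching_degree(["knight", "duel"], ["sword", "fight", "castle"]))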
As shown in fig. 9, before the step 330, the multimedia data processing method provided by the present disclosure may further include the following steps:
in step 901, segmenting the acquired multimedia material to obtain a plurality of multimedia data segments;
for example, the video material may be divided into video segments at preset time intervals according to the preset time intervals. The method for segmenting the audio material or the audio and video material can be realized by referring to the method for segmenting the video material.
In step 902, performing subtitle recognition processing on each multimedia data segment to obtain subtitle data corresponding to each multimedia data segment;
the caption identification processing is to identify caption content attached to each frame of image of the multimedia data segment, or obtain caption information of the multimedia data segment through audio identification according to audio information contained in the multimedia data segment, or obtain a caption file generated in cooperation with the multimedia data segment. The subtitle content in the image, the subtitle information of the audio, and the subtitle file can be regarded as subtitle data corresponding to the multimedia data segment.
Specifically, step 902 may include the following processes:
extracting image caption information from each multimedia data segment by adopting a picture character recognition technology and extracting audio caption information from each multimedia data segment by adopting an audio recognition technology;
and the image subtitle information and the audio subtitle information corresponding to each multimedia data segment and the subtitle file generated by matching with the multimedia data segment form the subtitle data of the multimedia data segment.
Among them, the picture character recognition technology may be OCR (optical character recognition). OCR refers to the process by which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text through character recognition. Subtitle recognition using OCR addresses the case where the subtitles in the multimedia material are not an independent subtitle file but are burned into the video images; in this case, each frame image of the multimedia data segment is processed with OCR to extract the image subtitle information.
As shown in fig. 10, the subtitle data obtained by subtitle recognition on a multimedia data segment may come by three routes: OCR recognition, audio recognition, and acquisition of the accompanying subtitles. Audio recognition handles the case where a multimedia data segment has no subtitles: the audio information in the segment is recognized by audio recognition technology and converted into audio subtitle information. The original subtitles are the subtitle file produced to accompany the multimedia data segment, usually edited by professional staff according to the segment's video content and therefore of high quality. The image subtitle information (e.g., OCR subtitles), the audio subtitle information, and the subtitle file (i.e., the original subtitles) constitute the subtitle data of the multimedia data segment; an OCR sketch follows.
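The OCR route can be sketched with OpenCV and pytesseract as below, assuming subtitles are burned into the bottom fifth of each frame and that sampling one frame per second is sufficient; the file path and OCR language are assumptions:
```python
import cv2
import pytesseract

def extract_image_subtitles(path: str, lang: str = "chi_sim") -> list[str]:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    lines, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % int(fps) == 0:            # sample ~1 frame per second
            h = frame.shape[0]
            strip = frame[int(h * 0.8):, :]      # assumed subtitle band at bottom
            gray = cv2.cvtColor(strip, cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(gray, lang=lang).strip()
            if text and (not lines or text != lines[-1]):
                lines.append(text)               # drop consecutive duplicates
        frame_idx += 1
    cap.release()
    return lines

print(extract_image_subtitles("clip_000.mp4"))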
In step 903, extracting the content information of the caption data corresponding to each multimedia data segment through the context long-short term memory model;
it should be noted that, a processing object of the multimedia information processing module configured by the server 120 is a multimedia data segment, and since the processing of the multimedia data segment is relatively complex, the processing of the relatively complex multimedia data segment can be converted into the processing of text information by analyzing subtitles in the multimedia data segment. Therefore, the subtitle data of the multimedia data fragments can be analyzed by adopting the natural language processing technology such as the CLSTM model and the like, and the main content of each multimedia data fragment is extracted.
The content information of the subtitle data refers to the main content of the subtitle data. The subtitle data of the OCR subtitle, the audio subtitle and the original subtitle of each multimedia data segment obtained by the three methods can be input into a CLSTM model by using a natural language processing technology to extract key information and output the content information of the multimedia data segment. The content information of the multimedia data segment is used as a smart tag.
In step 904, for each multimedia data segment, the corresponding content information and the input tag information are fitted together, obtaining the marked content corresponding to each multimedia data segment.
As shown in fig. 10, the tag information input for each multimedia data segment may consist of operation tags and authoring tags. Operation tags are high-quality, manually entered tag information for multimedia data segments; tagging is open to anyone, including professional audio and video producers, system maintainers, and amateur audio and video enthusiasts, who can summarize the topics of classic segments in the multimedia material and enter the corresponding content information as operation tags. Authoring tags are the corrections that creators make to unreasonable marked content while producing audio and video clips, and their quality is likewise high.
For the intelligent tag, operation tags, and authoring tags of each multimedia data segment, a CLSTM model can be used to perform overall fitting, obtaining the marked content of each multimedia data segment, i.e., its comprehensive tag. As the operation tags or authoring tags change, the comprehensive tag is continuously updated and refined.
The historical text metadata of historical text segments obtained by the text information processing module and the marked content of multimedia data segments obtained by the multimedia information processing module can be matched and grouped manually, yielding the pairs of historical text metadata and multimedia data segment marked content known to match each other.
Fig. 11 illustrates the functional modules configured on the server 120 according to an exemplary embodiment of the present disclosure. As shown in fig. 11, the text information processing module processes the text material to obtain the text metadata of each text segment. The multimedia information processing module processes the multimedia material to obtain the marked content of each multimedia data segment. The text matching module establishes a text matching model from the historical text metadata and multimedia data segment marked content that match each other, and uses that model to identify the multimedia data segments whose marked content matches the text metadata, so that the multimedia data segments matched to the text segments can be spliced into the multimedia file converted from the text material.
As shown in fig. 11, the server 120 may further include an edit authoring module. The editing creation module can perform some amendments and add some personalized contents on the basis of the multimedia file processed by the text matching module to obtain a finished product.
Taking video segments as the multimedia data segments, as shown in fig. 12, the basic operation of the editing and authoring module is as follows: when the obtained text metadata of a text segment or the marked content of a video segment is not ideal, the module modifies the text metadata, the marked content, or the association between the text segment and the video segment; these modifications can in turn serve as learning material to optimize the text matching model.
Furthermore, when the matching degree between a text segment and a video segment is low or the expressive effect is poor, a more suitable video segment can be searched for in the video segment library through the search function of the editing and authoring module, using the given information.
In addition, peripheral decorations, such as personalized subtitles, headwear for the characters in the video, and eyebrow masks, can be added to the video through the personalization tool of the editing and authoring module.
Through the processing of the editing and authoring module, video files of higher quality and with stronger continuity between segments can be obtained, and more polished finished video clips can be output after processing and creation with the personalization tool.
The following is an embodiment of the apparatus of the present disclosure, which can be used to execute an embodiment of the multimedia data processing method executed by the server 120 of the present disclosure. For details not disclosed in the embodiments of the disclosed apparatus, please refer to the embodiments of the disclosed multimedia data processing method.
Fig. 13 is a block diagram illustrating a multimedia data processing apparatus according to an exemplary embodiment, which may be used in the server 120 of the implementation environment shown in fig. 1 to perform all or part of the steps of the multimedia data processing method shown in any one of fig. 3-6 and 9. As shown in fig. 13, the multimedia data processing apparatus includes, but is not limited to: a text processing module 1310, a data matching module 1330, and a file generation module 1350.
The text processing module 1310 is configured to process a text material to obtain a text segment and text metadata corresponding to the text segment;
a data matching module 1330, configured to identify, from the multimedia data segments with marked content, a multimedia data segment with marked content matching the text metadata, where the multimedia data segment serves as a target multimedia segment converted from the text segment;
a file generating module 1350, configured to generate a multimedia file converted from the text material through the target multimedia segment.
The implementation process of the functions and actions of each module in the device is specifically detailed in the implementation process of the corresponding step in the multimedia data processing method, and is not described herein again.
The text processing module 1310 may be, for example, one of the physical structure central processors 222 of fig. 2.
The data matching module 1330 and the file generating module 1350 may also be functional modules for performing the corresponding steps of the multimedia data processing method described above. It is understood that these modules may be implemented in hardware, software, or a combination of both. When implemented in hardware, these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits. When implemented in software, the modules may be implemented as one or more computer programs executing on one or more processors, such as programs stored in memory 232 for execution by central processor 222 of FIG. 2.
Optionally, as shown in fig. 14, the text processing module 1310 includes but is not limited to:
a segment segmentation unit 1311, configured to perform segment division on the text material to obtain a plurality of text segments;
a data extracting unit 1312, configured to extract key information of the text segment through a context long-short term memory model, and obtain text metadata corresponding to the text segment according to the extracted key information.
Optionally, the file generating module 1350 includes but is not limited to:
and the segment splicing unit is used for splicing the target multimedia segments corresponding to each text segment according to the sequence of the text segments in the text material to obtain the multimedia file converted from the text material.
Optionally, as shown in fig. 15, the data matching module 1330 includes but is not limited to:
a data matching unit 1331, configured to obtain, according to the multimedia data segment with the marked content, a matching degree between itself and the marked content of each multimedia data segment for the text metadata through a text matching model;
a segment obtaining unit 1332, configured to obtain, according to the matching degree between the text metadata and the tagged content of each multimedia data segment, a multimedia data segment whose tagged content matches the text metadata.
Optionally, the data matching module 1330 may further include, but is not limited to:
the multimedia processing unit is used for processing a multimedia material to obtain a multimedia data segment output by the multimedia material and mark contents corresponding to the multimedia data segment;
the sample acquisition unit is used for acquiring known history text metadata and mark contents of the multimedia data fragments which are matched with each other;
and the sample training unit is used for taking the known history text metadata and the labeled content of the multimedia data fragment which are matched with each other as a sample training set, inputting the sample training set into a document theme generating model, and obtaining the optimal parameters of the document theme generating model through learning so as to obtain the text matching model.
Optionally, the present disclosure further provides an electronic device, which may be used in the server 120 in the implementation environment shown in fig. 1 to execute all or part of the steps of the multimedia data processing method shown in any one of fig. 3 to 6 and 9. The electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the multimedia data processing method according to the above exemplary embodiment.
The specific manner in which the processor of the electronic device in this embodiment performs operations has been described in detail in the embodiment related to the multimedia data processing method, and will not be elaborated herein.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium including instructions. The storage medium stores a computer program executable by the central processor 222 of the server 200 to perform the above multimedia data processing method.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (11)

1. A method for processing multimedia data, comprising:
processing a text material to obtain a text fragment and text metadata corresponding to the text fragment;
segmenting the obtained multimedia material to obtain a plurality of multimedia data segments; performing caption identification processing on each multimedia data segment to obtain caption data corresponding to each multimedia data segment;
extracting content information of caption data corresponding to each multimedia data segment; fitting the content information corresponding to each multimedia data segment and the input label information to obtain a comprehensive label; the input tag information comprises an operation tag and an authoring tag, the operation tag is multimedia data segment tag information which is manually operated and is opened to anyone, the authoring tag is modification information of unreasonable marked contents in the process of making audio and video clips by an author, the comprehensive tag is marked contents corresponding to each multimedia data segment, and the comprehensive tag is updated along with the change of the operation tag or the authoring tag;
identifying a multimedia data segment with the marked content matched with the text metadata from the multimedia data segments with the marked content, wherein the multimedia data segment serves as a target multimedia segment converted from the text segment; generating a multimedia file converted from the text material through the target multimedia segment;
performing caption identification processing on each multimedia data segment to obtain caption data corresponding to each multimedia data segment, including:
extracting image subtitle information from each multimedia data segment by adopting a picture character recognition technology, extracting audio subtitle information from each multimedia data segment by adopting an audio recognition technology, and acquiring a subtitle file generated by matching with the multimedia data segment; the image caption information and the audio caption information corresponding to each multimedia data segment and the caption file generated by matching with the multimedia data segment form the caption data of the multimedia data segment;
the identifying, from the multimedia data segments of the marked content, a multimedia data segment of which the marked content matches the text metadata, the multimedia data segment serving as a target multimedia segment converted from the text segment, includes:
according to the multimedia data fragments marked with the contents, obtaining the matching degree between the text metadata and the marked contents of each multimedia data fragment through a text matching model; and obtaining the multimedia data fragments with the mark contents matched with the text metadata according to the matching degree between the text metadata and the mark contents of each multimedia data fragment.
2. The method of claim 1, wherein the processing of the text material to obtain text segments and text metadata corresponding to the text segments comprises:
paragraph division is carried out on the text material to obtain a plurality of text segments;
extracting key information of the text segments through a context long-term and short-term memory model, and obtaining text metadata corresponding to the text segments according to the extracted key information.
3. The method of claim 1, wherein generating a multimedia file converted from the text material by the target multimedia segment comprises:
and splicing the target multimedia fragments corresponding to each text fragment according to the sequence of the text fragments in the text material to obtain the multimedia file converted from the text material.
4. The method of claim 1, wherein before obtaining, through a text matching model, the matching degree between the text metadata and the labeled content of each multimedia data segment according to the multimedia data segments with labeled content, the method further comprises:
processing a multimedia material to obtain a multimedia data segment output by the multimedia material and mark contents corresponding to the multimedia data segment;
acquiring known history text metadata and labeled content of the multimedia data segments which are matched with each other;
and taking the known history text metadata and the labeled content of the multimedia data fragments which are matched with each other as a sample training set, inputting the sample training set into a document theme generating model, and obtaining the optimal parameters of the document theme generating model through learning so as to obtain the text matching model.
5. The method of claim 4, wherein prior to obtaining the tagged content of the known matching historical textual metadata and multimedia data segments, the method further comprises:
extracting key information of the obtained historical text fragments through a context long-term and short-term memory model to obtain the key information of the historical text corresponding to the historical text fragments;
and correcting the historical text key information corresponding to the historical text fragment to obtain historical text metadata corresponding to the historical text fragment.
6. A multimedia data processing apparatus, characterized in that the apparatus comprises:
the text processing module is used for processing the text material to obtain text segments and text metadata corresponding to the text segments;
the text processing module is also used for segmenting the acquired multimedia material to obtain a plurality of multimedia data segments;
the text processing module is also used for carrying out caption identification processing on each multimedia data segment to obtain caption data corresponding to each multimedia data segment;
the text processing module is also used for extracting the content information of the caption data corresponding to each multimedia data segment;
the text processing module is also used for fitting the content information corresponding to each multimedia data segment with the input label information to obtain a comprehensive label; the input label information comprises an operation label and an creation label, the operation label is multimedia data segment label information which is manually operated and is opened to anyone, the creation label is modification information of an author on unreasonable marked content in the process of making an audio/video clip, the comprehensive label is marked content corresponding to each multimedia data segment, and the comprehensive label is updated along with the change of the operation label or the creation label;
the data matching module is used for identifying a multimedia data segment with the marked content matched with the text metadata from the multimedia data segment with the marked content, wherein the multimedia data segment is used as a target multimedia segment converted from the text segment, and the marked content is generated by fitting according to the content information of the caption data of the multimedia data segment and the input label information;
the file generation module is used for generating a multimedia file converted from the text material through the target multimedia fragment;
the text processing module is further configured to:
extracting image subtitle information from each multimedia data segment by adopting a picture character recognition technology, extracting audio subtitle information from each multimedia data segment by adopting an audio recognition technology, and acquiring a subtitle file generated by matching with the multimedia data segment; the image caption information and the audio caption information corresponding to each multimedia data segment and the caption file generated by matching with the multimedia data segment form the caption data of the multimedia data segment;
the data matching module comprises:
the data matching unit is used for obtaining the matching degree between the data matching unit and the marked content of each multimedia data fragment for the text metadata through a text matching model according to the multimedia data fragments with the marked content;
and the segment obtaining unit is used for obtaining the multimedia data segments with the labeled contents matched with the text metadata according to the matching degree between the text metadata and the labeled contents of each multimedia data segment.
7. The apparatus of claim 6, wherein the text processing module comprises:
the segment segmentation unit is used for carrying out paragraph segmentation on the text material to obtain a plurality of text segments;
and the data extraction unit is used for extracting the key information of the text segments through a context long-term and short-term memory model and obtaining the text metadata corresponding to the text segments according to the extracted key information.
8. The apparatus of claim 6, wherein the file generation module comprises:
and the segment splicing unit is used for splicing the target multimedia segments corresponding to each text segment according to the sequence of the text segments in the text material to obtain the multimedia file converted from the text material.
9. The apparatus of claim 6, wherein the data matching module further comprises:
the multimedia processing unit is used for processing a multimedia material and obtaining a multimedia data segment output by the multimedia material and a mark content corresponding to the multimedia data segment;
the sample acquisition unit is used for acquiring known history text metadata and mark contents of the multimedia data fragments which are matched with each other;
and the sample training unit is used for taking the known history text metadata and the labeled content of the multimedia data fragment which are matched with each other as a sample training set, inputting the sample training set into a document theme generating model, and obtaining the optimal parameters of the document theme generating model through learning so as to obtain the text matching model.
10. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the multimedia data processing method of any of claims 1-5.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program executable by a processor to perform the method of processing multimedia data according to any one of claims 1 to 5.
CN201711084918.9A 2017-11-07 2017-11-07 Multimedia data processing method and device, electronic equipment and storage medium Active CN109756751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711084918.9A CN109756751B (en) 2017-11-07 2017-11-07 Multimedia data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109756751A CN109756751A (en) 2019-05-14
CN109756751B CN109756751B (en) 2023-02-03

Family

ID=66401039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711084918.9A Active CN109756751B (en) 2017-11-07 2017-11-07 Multimedia data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109756751B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11036996B2 (en) * 2019-07-02 2021-06-15 Baidu Usa Llc Method and apparatus for determining (raw) video materials for news
CN110324709A * 2019-07-24 2019-10-11 新华智云科技有限公司 Video generation processing method, apparatus, terminal device and storage medium
CN112312189B (en) * 2019-08-02 2023-07-14 百度在线网络技术(北京)有限公司 Video generation method and video generation system
CN110532426A * 2019-08-27 2019-12-03 新华智云科技有限公司 Method and system for extracting multimedia material based on a template to generate video
CN110996017B (en) * 2019-10-08 2020-12-15 清华大学 Method and device for generating clip video
CN111711855A (en) * 2020-05-27 2020-09-25 北京奇艺世纪科技有限公司 Video generation method and device
CN114598893B (en) * 2020-11-19 2024-04-30 京东方科技集团股份有限公司 Text video realization method and system, electronic equipment and storage medium
CN112579826A (en) * 2020-12-07 2021-03-30 北京字节跳动网络技术有限公司 Video display and processing method, device, system, equipment and medium
CN112423023A (en) * 2020-12-09 2021-02-26 珠海九松科技有限公司 Intelligent automatic video mixed-cutting method
CN115811632A (en) * 2021-09-15 2023-03-17 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium
CN114707019A (en) * 2022-03-29 2022-07-05 北京拥抱在线科技有限公司 Information processing method and device for reading
CN117082292A (en) * 2022-05-10 2023-11-17 北京字跳网络技术有限公司 Video generation method, apparatus, device, storage medium, and program product
CN115190356B (en) * 2022-06-10 2023-12-19 北京达佳互联信息技术有限公司 Multimedia data processing method and device, electronic equipment and storage medium
CN115457557B (en) * 2022-09-21 2024-03-05 惠州市学之友电子有限公司 Scanning translation pen control method and device
CN116017043B (en) * 2022-12-12 2024-10-01 维沃移动通信有限公司 Video generation method, device, electronic equipment and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9407963B2 (en) * 2004-02-27 2016-08-02 Yahoo! Inc. Method and system for managing digital content including streaming media
US20170185690A1 (en) * 2005-10-26 2017-06-29 Cortica, Ltd. System and method for providing content recommendations based on personalized multimedia content element clusters
US8731339B2 (en) * 2012-01-20 2014-05-20 Elwha Llc Autogenerating video from text
US20140115622A1 (en) * 2012-10-18 2014-04-24 Chi-Hsiang Chang Interactive Video/Image-relevant Information Embedding Technology
CN103324760B * 2013-07-11 2016-08-17 中国农业大学 Method and system for automatically generating nutrition and health education videos from commentary documents
CN103559214B (en) * 2013-10-11 2017-02-08 中国农业大学 Method and device for automatically generating video
CN104731959B * 2015-04-03 2017-10-17 北京威扬科技有限公司 Method, apparatus and system for generating video summaries from text-based web page content
CN105389326B * 2015-09-16 2018-08-31 中国科学院计算技术研究所 Image annotation method based on weakly-matched probabilistic canonical correlation models
CN105868176A * 2016-03-02 2016-08-17 北京同尘世纪科技有限公司 Text-based video synthesis method and system
CN106528588A (en) * 2016-09-14 2017-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resources for text information
CN107071542B (en) * 2017-04-18 2020-07-28 百度在线网络技术(北京)有限公司 Video clip playing method and device
CN107027060A * 2017-04-18 2017-08-08 腾讯科技(深圳)有限公司 Video segment determination method and apparatus

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183739A (en) * 2014-04-04 2015-12-23 卡姆芬德公司 Image Processing Server
CN106899879A * 2015-12-18 2017-06-27 北京奇虎科技有限公司 Multimedia data processing method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"网络视频字幕提取识别系统的设计与实现";刁月华;《中国优秀硕士学位论文全文数据库》;20150915;全文 *

Also Published As

Publication number Publication date
CN109756751A (en) 2019-05-14

Similar Documents

Publication Publication Date Title
CN109756751B (en) Multimedia data processing method and device, electronic equipment and storage medium
US11776267B2 (en) Intelligent cataloging method for all-media news based on multi-modal information fusion understanding
CN110446063B (en) Video cover generation method and device and electronic equipment
CN103052953B Information processing device and information processing method
US20180213289A1 (en) Method of authorizing video scene and metadata
CN104735468B Method and system for synthesizing images into a new video based on semantic analysis
US8930308B1 (en) Methods and systems of associating metadata with media
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN113010703A (en) Information recommendation method and device, electronic equipment and storage medium
CN109408672B (en) Article generation method, article generation device, server and storage medium
CN111314732A (en) Method for determining video label, server and storage medium
US10595098B2 (en) Derivative media content systems and methods
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
WO2024188044A1 (en) Video tag generation method and apparatus, electronic device, and storage medium
US10499121B2 (en) Derivative media content systems and methods
CN110516654A Entity recognition method and apparatus for video scenes, electronic device, and medium
CN117851639A (en) Video processing method, device, electronic equipment and storage medium
CN117764115A (en) Multi-mode model multi-task pre-training method, multi-mode recognition method and equipment
JP6603929B1 (en) Movie editing server and program
CN113407775A (en) Video searching method and device and electronic equipment
CN116389849A (en) Video generation method, device, equipment and storage medium
CN113312516B (en) Video processing method and related device
KR20220079029A (en) Method for providing automatic document-based multimedia content creation service
KR20220079042A (en) Program recorded medium for providing service
CN115917647A (en) Automatic non-linear editing style transfer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant