CN115080770A - Multimedia data processing method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN115080770A
Authority
CN
China
Prior art keywords: data, modality, modal, fragments, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210554099.4A
Other languages
Chinese (zh)
Inventor
唐鑫
王冠皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2022-05-19
Filing date: 2022-05-19
Publication date: 2022-09-20
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210554099.4A
Publication of CN115080770A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40: Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements using pattern recognition or machine learning
    • G06V 10/764: Arrangements using classification, e.g. of video objects
    • G06V 10/82: Arrangements using neural networks


Abstract

The present disclosure provides a multimedia data processing method and apparatus, an electronic device and a readable storage medium, relating to the technical fields of data processing and image processing, and in particular to artificial intelligence fields such as deep learning and speech technology. The specific implementation scheme is as follows: acquiring at least two modality data of multimedia data to be processed, the at least two modality data comprising at least two of text modality data, audio modality data and image modality data; segmenting the at least two modality data to obtain data segments of the at least two modality data; and fusing the data segments of the at least two modality data to obtain at least two multimedia segments of the multimedia data.

Description

Multimedia data processing method and device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the fields of data processing and image processing, and more particularly to artificial intelligence fields such as deep learning and speech technology.
Background
With the rapid development of science and technology, in application scenarios such as classification, storage, search, recommendation and the like of multimedia data, content segmentation of the multimedia data is very important.
Conventionally, an operator must watch the multimedia data in full and then manually segment it according to its content.
Disclosure of Invention
The disclosure provides a multimedia data processing method, a multimedia data processing device, an electronic device and a readable storage medium.
According to an aspect of the present disclosure, there is provided a multimedia data processing method including:
acquiring at least two modal data of multimedia data to be processed; the at least two modality data comprises at least two of text modality data, audio modality data, and image modality data;
performing segmentation processing on the at least two modal data to obtain data fragments of the at least two modal data;
and performing fusion processing on the data segments of the at least two modal data to obtain at least two multimedia segments of the multimedia data.
According to another aspect of the present disclosure, there is provided a multimedia data processing apparatus including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring at least two modal data of multimedia data to be processed; the at least two modality data comprises at least two of text modality data, audio modality data, and image modality data;
the segmentation unit is used for carrying out segmentation processing on the at least two modal data to obtain data fragments of the at least two modal data;
and the fusion unit is used for carrying out fusion processing on the data segments of the at least two modal data to obtain at least two multimedia segments of the multimedia data.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of the aspects and any possible implementation described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the above-described aspect and any possible implementation.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the aspect and any possible implementation as described above.
According to this technical solution, embodiments of the present disclosure can analyze the content of multimedia data from multiple angles by fusing its multiple modality data, achieving correct segmentation of the multimedia data and thereby improving the efficiency and reliability of multimedia data segmentation.
In addition, by adopting the technical scheme provided by the disclosure, the user experience can be effectively improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily apparent from the following description.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present disclosure, and those skilled in the art can derive other drawings from them without inventive effort. The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
fig. 3 is a block diagram of an electronic device for implementing a multimedia data processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It is to be understood that the described embodiments are only some, and not all, of the disclosed embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments disclosed herein without inventive effort shall fall within the protection scope of the present disclosure.
It should be noted that the terminal device involved in the embodiments of the present disclosure may include, but is not limited to, a mobile phone, a Personal Digital Assistant (PDA), a wireless handheld device, a Tablet Computer (Tablet Computer), and other smart devices; the display device may include, but is not limited to, a personal computer, a television, or the like having a display function.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in this document generally indicates that the preceding and following associated objects are in an "or" relationship.
With the rapid development of science and technology, the content segmentation of multimedia data is very important in application scenarios such as classification, storage, search, recommendation and the like of multimedia data.
Conventionally, an operator must watch the multimedia data in full and then manually segment it according to its content.
Taking video news as an example, a video news program is generally composed of a plurality of news segments; different news segments are relatively independent and cover different topic content. News editors at broadcast media, television stations and the like need to extract these news segments so that they can subsequently be archived, retrieved, or clipped and reused.
In the traditional process, the party requiring the segments must watch the video in full and then segment it manually according to the content of the video news, which is time-consuming, labor-intensive and error-prone. An automated segmentation process for video news is therefore highly desirable.
Similarly, the same requirements exist for other multimedia data.
Fig. 1 is a schematic diagram of a first embodiment of the present disclosure.
101. At least two modality data of the multimedia data to be processed are acquired.
The at least two modality data may include, but are not limited to, at least two of text modality data, audio modality data, and image modality data, which is not particularly limited in this embodiment.
102. And performing segmentation processing on the at least two modal data to obtain data fragments of the at least two modal data.
103. And fusing the data segments of the at least two modal data to obtain at least two multimedia segments of the multimedia data.
In this way, a plurality of multimedia segments of the multimedia data are obtained and automatic segmentation of the multimedia data is achieved, which effectively meets the need for multimedia segment extraction and allows the segments to be subsequently archived, retrieved, or clipped and reused.
It should be noted that part or all of the execution subjects 101 to 103 may be an application located at the local terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) set in the application located at the local terminal, or may also be a processing engine located in a server on the network side, or may also be a distributed system located on the network side, for example, a processing engine or a distributed system in a multimedia data processing platform on the network side, which is not particularly limited in this embodiment.
It is to be understood that the application may be a native application (native app) installed on the local terminal, or may also be a web page program (webApp) of a browser on the local terminal, which is not limited in this embodiment.
Therefore, at least two modal data of the multimedia data to be processed are obtained, and then the at least two modal data are segmented to obtain the data segments of the at least two modal data, so that the data segments of the at least two modal data are fused to obtain the at least two multimedia segments of the multimedia data.
In the present disclosure, data comes from many sources and takes many forms, and each source or form of data may be regarded as a modality of that data. For video data, for example, the forms include images, audio and text, and each form may be called a modality of the video data: an image modality, an audio modality and a text modality. Each piece of video data may thus be decomposed into multiple modality data, such as audio modality data, image modality data and text modality data.
Video data thus contains multiple modality data, and different modality data have different characteristics and application ranges: based on image modality data, visually continuous scene segments can be obtained; based on audio modality data, acoustically coherent segments can be obtained; and based on text modality data, segments with similar textual meaning can be obtained. By fusing the multiple modality data of video data, its content can therefore be analyzed from multiple angles, so that the video data can be segmented correctly, improving the efficiency and reliability of video data segmentation.
Optionally, in a possible implementation manner of this embodiment, for video data, 101 may specifically parse the video data to be processed to obtain image frames (i.e., image modality data) and audio frames (i.e., audio modality data). The obtained image frames may further undergo character recognition, such as Optical Character Recognition (OCR), to obtain text modality data.
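As a rough illustration only (the disclosure itself contains no code), this parsing and OCR step could be sketched with off-the-shelf tools such as ffmpeg, OpenCV and Tesseract; the sampling rate, file paths and language setting below are assumptions, not anything the patent specifies:

```python
import subprocess

import cv2
import pytesseract

def demux_video(video_path: str, audio_path: str = "audio.wav", sample_fps: float = 1.0):
    """Split a video into sampled image frames (image modality) and an audio
    track (audio modality), then OCR each sampled frame for text modality data."""
    # Extract the audio track with ffmpeg (assumed to be installed).
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", audio_path],
                   check=True)

    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(native_fps / sample_fps), 1)  # keep ~sample_fps frames per second

    frames, ocr_texts = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
            # Timestamped OCR output lets later stages align the modalities.
            ocr_texts.append((idx / native_fps,
                              pytesseract.image_to_string(frame, lang="chi_sim")))
        idx += 1
    cap.release()
    return frames, audio_path, ocr_texts
```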
Optionally, in a possible implementation manner of this embodiment, 102 may specifically determine, according to the at least two modality data, the positions at which specific feature objects appear in each of the at least two modality data. The at least two modality data may then be segmented according to those positions, obtaining at least two data segments for each of the at least two modality data.
In this implementation manner, the specific feature objects determined from the different modality data capture different angles, and the modality data corresponding to each specific feature object is segmented accordingly, yielding data segments of the modality data from different angles, so that the content of the multimedia data can be analyzed from multiple angles.
In a specific implementation process, for the text modal data, text information of a title type may be specifically identified as the specific feature object. Then, a location of the particular feature object appearing in the acquired text modality data may be determined.
In the implementation process, by identifying the text information of the title type, the interference of the text information of other types on the segmentation of the text modal data can be effectively filtered, and a more accurate segmentation result of the text modal data is obtained.
Specifically, the obtained text modality data may be input into a text detection model, for example a You Only Look Once (YOLO) model, to obtain the text types present in the text modality data, such as title, subtitle and column types, and the identified title-type text information is then used as the specific feature object. According to the obtained text modality data, the time period in which the title-type text information appears is determined and used as the segmentation information according to which the text modality data is segmented.
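A minimal sketch of how such title detections might be turned into segmentation information follows; the (timestamp, title) input format and the helper names are illustrative assumptions:

```python
from itertools import groupby

def title_spans(title_detections):
    """Group per-frame title detections into (start, end, title) time spans.

    title_detections: list of (timestamp_sec, title_text) pairs, e.g. the
    title-type regions kept from a YOLO-style text detector plus OCR.
    """
    spans = []
    for title, group in groupby(title_detections, key=lambda d: d[1].strip()):
        group = list(group)
        if title:  # skip frames with no detected title
            spans.append((group[0][0], group[-1][0], title))
    return spans

def cut_points(spans):
    """The start of each title span is a candidate cut point for the text modality."""
    return [start for start, _end, _title in spans]
```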
In another specific implementation, text information under different titles may express related content; in particular, adjacent titles may be worded differently yet still be related. To handle this, data segments with similar semantics among the at least two data segments of the text modality data may be merged, based on the semantic features of those data segments.
In the implementation process, the semantic features of the data fragments of the text modal data are utilized, so that the similarity among the data fragments can be considered from the semantic perspective, and the data fragments with similar semantics are combined.
For text modality data, two data segments may be regarded as semantically similar when the semantic similarity between them satisfies a preset condition.
For example, specifically, based on the obtained semantic features of at least two data segments of the text modal data, the semantic similarity between two adjacent data segments is calculated, and then the two adjacent data segments with the semantic similarity within a preset range are merged.
Or, for another example, the data segments of the obtained text modal data with similar semantics may be combined by using a pre-constructed text semantic similarity model based on the semantic features of the obtained at least two data segments of the text modal data.
Taking video news as an example: after the obtained text modality data of the video news is segmented based on news titles to obtain a plurality of text data segments, news titles with different wording may still describe related content; in particular, adjacent news titles may differ in wording yet carry associated meanings. For example, news title 1 of data segment 1 is "Focus on the rainfall situation in region A" and news title 2 of data segment 2 is "Region A saw heavy rain to rainstorms yesterday". In this case, the adjacent text data segments corresponding to the semantically similar news titles 1 and 2 can be merged by using the text semantic similarity model.
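The disclosure does not name a concrete similarity model, so the sketch below merges adjacent segments by cosine similarity over a toy character-bigram embedding; the embed function and the 0.8 threshold are placeholders for a real text semantic similarity model:

```python
import zlib

import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a sentence encoder: hash character bigrams into a
    fixed-size count vector. A real system would use a learned text embedding."""
    v = np.zeros(dim)
    for a, b in zip(text, text[1:]):
        v[zlib.crc32((a + b).encode("utf-8")) % dim] += 1.0
    return v

def merge_similar_adjacent(segments, threshold: float = 0.8):
    """Merge adjacent (start, end, title) text segments whose titles are
    semantically similar, as in the rainfall example above."""
    merged = [segments[0]]
    for start, end, title in segments[1:]:
        p_start, p_end, p_title = merged[-1]
        a, b = embed(p_title), embed(title)
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        if cos >= threshold:  # similar meaning: treat as the same story
            merged[-1] = (p_start, end, p_title)
        else:
            merged.append((start, end, title))
    return merged
```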
In addition to title-type text information, text information of other specific text types may be identified as the specific feature object, for example channel-identification text (e.g., "XX television station") or presenter-identity text (e.g., "Guest: XXX"). The position at which the specific feature object appears in the acquired text modality data may then be determined.
In another specific implementation, for the audio modality data, a voiceprint feature may be specifically identified as the specific feature object. Then, a position of occurrence of the particular feature object in the acquired audio modality data may be determined.
In the implementation process, by identifying the voiceprint characteristics, continuous speech of a person within a certain time can be effectively identified, so that the part of audio modal data is determined to be continuous in content, and a more accurate segmentation result of the audio modal data is obtained.
Specifically, the obtained audio modality data may be input into a deep neural network, so as to obtain a plurality of voiceprint features in the audio modality data. Furthermore, according to the obtained audio mode data, a time period in which the audio information with the same voiceprint characteristics appears is determined, and the time period is used as segmentation information according to which the segmentation processing is carried out on the audio mode data.
Taking video news as an example, the deep neural network may be implemented with a voiceprint model based on a deep convolutional neural network or a Long Short-Term Memory (LSTM) network, identifying the time period of each speaker's voice in the audio modality data of the video news. If the same person speaks continuously over a period of time, that part of the audio modality data, and the corresponding video news, can be considered continuous in content to a certain extent.
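A sketch of the grouping logic follows; speaker_id stands in for the CNN or LSTM voiceprint model mentioned above, and the windowed input format is an assumption:

```python
from itertools import groupby

def diarize(windows, speaker_id):
    """Group contiguous audio windows spoken by the same voice into segments.

    windows: list of (start_sec, end_sec, features) for fixed-length audio windows.
    speaker_id: callable mapping window features to a speaker label; this is
    where the voiceprint model described in the text would plug in.
    """
    labeled = [(start, end, speaker_id(feats)) for start, end, feats in windows]
    segments = []
    for speaker, group in groupby(labeled, key=lambda w: w[2]):
        group = list(group)
        # One continuous stretch of the same voice gives one audio data segment.
        segments.append((group[0][0], group[-1][1], speaker))
    return segments
```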
In another specific implementation, at least one of shot cut information, scene information, and face information may be specifically identified as the specific feature object for the image modality data. Then, a location at which the particular feature object appears in the acquired image modality data may be determined.
In the implementation process, by identifying at least one of the segmentation information related to the image, namely the shot switching information, the scene information and the face information, a continuous scene can be effectively identified, so that the image modal data is determined to be continuous in content, and a more accurate segmentation result of the image modal data is obtained.
For example, a deep neural network may be trained to learn the shot switching points between different data segments in image modality data, and then used to directly predict shot switching points in the obtained image modality data. These serve as the segmentation information according to which the image modality data is segmented, yielding visually coherent scene segments.
Because the deep neural network is trained on large-scale data, it can identify shot switching points better: instead of relying only on simple preceding and following frames or low-level image features, it can comprehensively learn the relationships among all frames near a switching point and thus find the correct shot switching points.
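The patent favors a trained deep network for exactly the reason above; purely as a baseline for comparison, a classical histogram-difference detector (a deliberately simpler technique than the one described, with an assumed threshold) looks like this:

```python
import cv2

def shot_cuts(frames, threshold: float = 0.5):
    """Naive per-pair shot-cut detector over sampled BGR frames.

    Compares adjacent frames by color-histogram correlation; a sharp drop
    suggests a cut. Unlike the deep network described above, this only sees
    two frames at a time and will miss gradual transitions.
    """
    cuts = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            corr = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if corr < threshold:
                cuts.append(i)  # frame index where a new shot likely starts
        prev_hist = hist
    return cuts
```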
Or, for another example, the obtained image modality data may be input into a classification neural network to obtain scene information for the image modality data, for example whether it shows a studio, a transition shot, or opening/closing credits. This scene information is used as the segmentation information according to which the image modality data is segmented, yielding video segments in specific scenes. The segmentation information can be used both to extract image modality data in valid scenes, such as specific target scenes, and to filter out image modality data in invalid scenes.
By adding the classification neural network, image modality data in invalid scenes such as transition shots and opening/closing credits can be filtered out, retaining only the image modality data in valid scenes.
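The filtering step reduces to a few lines once a scene classifier is available; classify_scene and the label names below are assumptions standing in for the classification neural network:

```python
def keep_valid_scenes(segments, classify_scene,
                      invalid=("transition", "opening_closing_credits")):
    """Drop image-modality segments whose predicted scene class is invalid.

    segments: list of (start_sec, end_sec, frames) tuples.
    classify_scene: callable mapping a segment's frames to a scene label
    such as 'studio', 'transition' or 'opening_closing_credits'.
    """
    return [(s, e, f) for s, e, f in segments if classify_scene(f) not in invalid]
```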
Or, for another example, the obtained image modality data may be input into a face recognition model to obtain face information in the image modality data. The face information is used as the segmentation information according to which the image modality data is segmented, yielding video segments corresponding to different anchors.
In another specific implementation, image information from different shots may express related content; in particular, adjacent shots may look different yet still be correlated. To handle this, data segments with similar semantics among the at least two data segments of the image modality data may be merged, based on the semantic features of those data segments.
In the implementation process, by utilizing the semantic features of the data segments of the image modality data, the similarity between the data segments can be considered from the semantic perspective, and then the data segments with similar semantics are combined.
For image modality data, two data segments may likewise be regarded as semantically similar when the semantic similarity between them satisfies a preset condition.
For example, specifically, the similarity between two adjacent data segments may be calculated based on the obtained semantic features of at least two data segments of the image modality data, and then, the two adjacent data segments with the similarity within a preset range may be merged.
Or, for another example, specifically, the obtained data segments of the image modality data with similar semantics may be merged by using an image semantic similarity model based on the semantic features of the obtained at least two data segments of the image modality data.
Taking video news as an example: after the obtained image modality data of the video news is segmented based on shot switching points to obtain a plurality of image data segments, image information from different shots may still express related content; in particular, adjacent shots may differ in picture content yet carry associated meanings. For example, in image modality data showing a telephone conversation between A and B, the shots alternate between the two parties, i.e., shot 1 shows A on the call and shot 2 shows B on the call, but shots 1 and 2 actually belong to the same scene. In this case, the adjacent image data segments corresponding to the semantically similar shots 1 and 2 can be merged by using the image semantic similarity model.
In other words, segmenting the image modality data alone may produce overly fragmented data segments; the semantic features of the data segments can be further used to refine the segmentation, noticeably improving the accuracy of the resulting data segments.
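A sketch of the shot-merging refinement follows; the color-histogram signature is a toy stand-in for the learned semantic features the text calls for (sufficiently semantic features would be needed to reunite the alternating phone-call shots above), and the threshold is an assumption:

```python
import numpy as np

def frame_signature(frame) -> np.ndarray:
    """Toy visual feature: a normalized 3D color histogram of a shot's key frame.
    A real image semantic similarity model would use CNN embeddings instead."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3), bins=(8, 8, 8),
                             range=((0, 256),) * 3)
    return hist.flatten() / max(hist.sum(), 1.0)

def merge_similar_shots(shots, threshold: float = 0.85):
    """Merge adjacent (start, end, key_frame) shots whose key frames are similar."""
    merged = [shots[0]]
    for start, end, key_frame in shots[1:]:
        p_start, p_end, p_key = merged[-1]
        a, b = frame_signature(p_key), frame_signature(key_frame)
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        if cos >= threshold:
            merged[-1] = (p_start, end, p_key)
        else:
            merged.append((start, end, key_frame))
    return merged
```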
In the present disclosure, the data segments of different angles obtained based on the different modality data may not only complement each other, but also correct each other.
Optionally, in a possible implementation manner of this embodiment, after 102, at least two data segments of different modality data may be further utilized to perform a calibration process on the at least two data segments of each modality data, so as to adjust the at least two data segments of each modality data.
In the implementation mode, the content of the multimedia data can be better analyzed by fusing the plurality of modal data of the multimedia data, and the respective segmentation results are further fused, so that the correct segmentation processing result is obtained, and the efficiency and the reliability of the segmentation of the multimedia data are improved.
In a specific implementation process, the data segments of the image modality data may be taken as the reference, and the data segments of the text modality data and of the audio modality data used to correct them, thereby implementing the calibration process.
Taking video news as an example, the segmentation result based on news titles and the segmentation result based on voiceprint features can be used to correct start and end points, refine the segmentation, and aggregate the segmentation result based on the image-related segmentation information. The corresponding recognized news titles are then attached to the resulting segments to obtain the final news segments.
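The boundary-calibration idea can be sketched as snapping candidate cut points from one modality to the reference cut points of another; the two-second tolerance below is an assumption:

```python
def calibrate(candidate_cuts, reference_cuts, tolerance_sec: float = 2.0):
    """Snap cut times from one modality (e.g. titles or voiceprints) to the
    nearest cut of the reference modality (here: image shot boundaries).

    Cuts with no reference boundary within the tolerance are kept as-is,
    so one modality can also refine segments the other modality missed.
    """
    if not reference_cuts:
        return sorted(set(candidate_cuts))
    calibrated = []
    for t in candidate_cuts:
        nearest = min(reference_cuts, key=lambda r: abs(r - t))
        calibrated.append(nearest if abs(nearest - t) <= tolerance_sec else t)
    return sorted(set(calibrated))
```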
According to the technical scheme provided by the disclosure, the content of the multimedia data can be better analyzed by fusing the information of the plurality of modal data of the multimedia data, so that the multimedia data can be correctly segmented, and the data segments of different modal data can be mutually supplemented and corrected.
In this embodiment, at least two pieces of modal data of the multimedia data to be processed are obtained, and then the at least two pieces of modal data are segmented to obtain data fragments of the at least two pieces of modal data, so that the data fragments of the at least two pieces of modal data are fused to obtain the at least two pieces of multimedia fragments of the multimedia data.
In addition, with the technical solution provided by the present disclosure, no manual operation is needed; the operation is simple and less error-prone, which can further improve the efficiency and reliability of multimedia data segmentation.
In addition, by adopting the technical scheme provided by the disclosure, the user experience can be effectively improved.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
Fig. 2 is a schematic diagram of a second embodiment of the present disclosure. The multimedia data processing apparatus 200 of this embodiment may include an acquisition unit 201, a segmentation unit 202, and a fusion unit 203. The acquisition unit 201 is configured to acquire at least two modality data of multimedia data to be processed, the at least two modality data comprising at least two of text modality data, audio modality data, and image modality data; the segmentation unit 202 is configured to segment the at least two modality data to obtain data segments of the at least two modality data; and the fusion unit 203 is configured to fuse the data segments of the at least two modality data to obtain at least two multimedia segments of the multimedia data.
It should be noted that, part or all of the multimedia data processing apparatus of this embodiment may be an application located at the local terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) set in the application located at the local terminal, or may also be a processing engine located in a server on the network side, or may also be a distributed system located on the network side, for example, a processing engine or a distributed system in a processing platform of multimedia data on the network side, and this embodiment is not particularly limited in this respect.
It is to be understood that the application may be a native application (native app) installed on the local terminal, or may also be a web page program (webApp) of a browser on the local terminal, which is not limited in this embodiment.
Optionally, in a possible implementation manner of this embodiment, the segmenting unit 202 may be specifically configured to determine, according to the at least two modality data, positions of specific feature objects appearing in the at least two modality data respectively; and according to the position, performing segmentation processing on the at least two modal data to obtain at least two data fragments of each modal data in the at least two modal data.
In a specific implementation process, the segmentation unit 202 may be specifically configured to identify, for text modality data, text information of a specific text type as the specific feature object; and to determine the location at which the particular feature object appears in the acquired text modality data.
In another specific implementation process, the segmenting unit 202 may be further configured to perform merging processing on data segments with similar semantics in at least two data segments of the text modality data based on semantic features of the at least two data segments of the text modality data.
In another specific implementation process, the segmentation unit 202 may be specifically configured to identify, for audio modality data, a voiceprint feature as the specific feature object; and to determine the location at which the particular feature object appears in the acquired audio modality data.
In another specific implementation process, the segmentation unit 202 may be specifically configured to identify, for image modality data, at least one of shot-cut information, scene information, and face information as the specific feature object; and to determine the location at which the particular feature object appears in the acquired image modality data.
In another specific implementation process, the segmenting unit 202 may be further configured to perform merging processing on data segments with similar semantics in at least two data segments of the image modality data based on semantic features of the at least two data segments of the image modality data.
Optionally, in a possible implementation manner of this embodiment, the segmenting unit 202 may be further configured to perform calibration processing on the at least two data segments of each modal data by using the at least two data segments of different modal data, so as to adjust the at least two data segments of each modal data.
It should be noted that the method in the embodiment corresponding to fig. 1 may be implemented by the multimedia data processing apparatus provided in this embodiment. For a detailed description, reference may be made to relevant contents in the embodiment corresponding to fig. 1, and details are not described here.
In this embodiment, at least two pieces of modal data of the multimedia data to be processed are obtained by the obtaining unit, and then the segmenting unit performs segmentation processing on the at least two pieces of modal data to obtain data fragments of the at least two pieces of modal data, so that the fusing unit can perform fusion processing on the data fragments of the at least two pieces of modal data to obtain the at least two pieces of multimedia fragments of the multimedia data.
In addition, with the technical solution provided by the present disclosure, no manual operation is needed; the operation is simple and less error-prone, which can further improve the efficiency and reliability of multimedia data segmentation.
In addition, by adopting the technical scheme provided by the disclosure, the user experience can be effectively improved.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 3 illustrates a schematic block diagram of an example electronic device 300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 3, the electronic device 300 includes a computing unit 301 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 302 or a computer program loaded from a storage unit 308 into a random access memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic device 300 can also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
A number of components in the electronic device 300 are connected to the I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, and the like. The communication unit 309 allows the electronic device 300 to exchange information/data with other devices through a computer network such as an internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, computing units running various machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 301 performs the respective methods and processes described above, such as a multimedia data processing method. For example, in some embodiments, the multimedia data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 300 via the ROM 302 and/or the communication unit 309. When the computer program is loaded into the RAM 303 and executed by the computing unit 301, one or more steps of the multimedia data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the multimedia data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable multimedia data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server can be a cloud Server, also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A multimedia data processing method, comprising:
acquiring at least two modal data of multimedia data to be processed; the at least two modality data comprises at least two of text modality data, audio modality data, and image modality data;
performing segmentation processing on the at least two modal data to obtain data fragments of the at least two modal data;
and performing fusion processing on the data segments of the at least two modal data to obtain at least two multimedia segments of the multimedia data.
2. The method according to claim 1, wherein the slicing the at least two modality data to obtain the data segments of the at least two modality data comprises:
according to the at least two modal data, respectively determining the positions of the specific characteristic objects appearing in the at least two modal data;
and according to the position, performing segmentation processing on the at least two modal data to obtain at least two data fragments of each modal data in the at least two modal data.
3. The method according to claim 2, wherein said determining, from the at least two modality data, a location of occurrence of a particular feature object in the at least two modality data, respectively, comprises:
for text modal data, identifying text information of a specific text type as the specific feature object;
determining a location where the particular feature object appears in the acquired text modality data.
4. The method according to claim 3, wherein the slicing the at least two modality data to obtain data segments of the at least two modality data further comprises:
and merging the data fragments with similar semantics in the at least two data fragments of the text modal data based on the semantic features of the at least two data fragments of the text modal data.
5. The method according to claim 2, wherein said determining, from the at least two modality data, a location of occurrence of a particular feature object in the at least two modality data, respectively, comprises:
for audio modal data, identifying voiceprint features as the specific feature objects;
determining a location where the particular feature object appears in the acquired audio modality data.
6. The method according to claim 2, wherein said determining, from the at least two modality data, a location of occurrence of a particular feature object in the at least two modality data, respectively, comprises:
identifying at least one of shot cut information, scene information, and face information as the specific feature object for the image modality data;
determining a location where the particular feature object appears in the acquired image modality data.
7. The method according to claim 6, wherein the slicing the at least two modality data to obtain data segments of the at least two modality data further comprises:
and merging the data fragments with similar semantics in the at least two data fragments of the image modality data based on the semantic features of the at least two data fragments of the image modality data.
8. The method according to any one of claims 2 to 7, wherein the slicing the at least two modality data according to the position to obtain at least two data segments of each modality data of the at least two modality data further comprises:
and utilizing at least two data fragments of different modal data to perform calibration processing on the at least two data fragments of each modal data so as to adjust the at least two data fragments of each modal data.
9. A multimedia data processing apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring at least two modal data of multimedia data to be processed; the at least two modality data comprises at least two of text modality data, audio modality data, and image modality data;
the segmentation unit is used for carrying out segmentation processing on the at least two modal data to obtain data fragments of the at least two modal data;
and the fusion unit is used for carrying out fusion processing on the data segments of the at least two modal data to obtain at least two multimedia segments of the multimedia data.
10. The apparatus of claim 9, wherein the segmentation unit is specifically configured to
determine, according to the at least two modal data, the positions of the specific feature objects appearing in the at least two modal data respectively; and
and according to the position, performing segmentation processing on the at least two modal data to obtain at least two data fragments of each modal data in the at least two modal data.
11. The apparatus of claim 10, wherein the segmentation unit is specifically configured to
identify, for text modal data, text information of a specific text type as the specific feature object; and
determine a location where the particular feature object appears in the acquired text modality data.
12. The apparatus of claim 11, wherein the segmentation unit is further configured to
merge the data fragments with similar semantics in the at least two data fragments of the text modal data based on the semantic features of the at least two data fragments of the text modal data.
13. The apparatus of claim 10, wherein the segmentation unit is specifically configured to
identify, for audio modal data, voiceprint features as the specific feature objects; and
determine a location where the particular feature object appears in the acquired audio modality data.
14. The apparatus of claim 10, wherein the segmentation unit is specifically configured to
identify at least one of shot cut information, scene information, and face information as the specific feature object for image modality data; and
a location of occurrence of the particular feature object in the acquired image modality data is determined.
15. The apparatus of claim 14, wherein the segmentation unit is further configured to
merge the data fragments with similar semantics in the at least two data fragments of the image modality data based on the semantic features of the at least two data fragments of the image modality data.
16. The apparatus according to any of claims 10-15, wherein the segmentation unit is further configured to
use at least two data fragments of different modal data to perform calibration processing on the at least two data fragments of each modal data, so as to adjust the at least two data fragments of each modal data.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202210554099.4A 2022-05-19 2022-05-19 Multimedia data processing method and device, electronic equipment and readable storage medium Pending CN115080770A (en)

Priority Applications (1)

Application Number: CN202210554099.4A; Priority Date: 2022-05-19; Filing Date: 2022-05-19; Title: Multimedia data processing method and device, electronic equipment and readable storage medium


Publications (1)

Publication Number: CN115080770A; Publication Date: 2022-09-20

Family ID: 83249555

Family Applications (1)

Application Number: CN202210554099.4A; Publication: CN115080770A (en); Title: Multimedia data processing method and device, electronic equipment and readable storage medium; Priority/Filing Date: 2022-05-19

Country Status (1)

Country: CN; Publication: CN115080770A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116959018A (en) * 2023-06-05 2023-10-27 简单汇信息科技(广州)有限公司 OCR-based intelligent checking method, system and equipment
CN116959018B (en) * 2023-06-05 2024-02-23 简单汇信息科技(广州)有限公司 OCR-based intelligent checking method, system and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination