WO2023287360A2 - Multimedia processing method, device, electronic device and storage medium - Google Patents
Multimedia processing method, device, electronic device and storage medium
- Publication number
- WO2023287360A2 (PCT/SG2022/050494)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text content
- multimedia resource
- invalid
- content
- voice data
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/432—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
- G06F16/432—Query formulation
- G06F16/433—Query formulation using audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/483—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
- H04N21/47217—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for controlling playback functions for recorded or on-demand content, e.g. using progress bars, mode or play-point indicators or bookmarks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Definitions
- Multimedia processing method, device, electronic device, and storage medium
- Cross-reference to related applications: This disclosure claims priority to Chinese patent application No. 202110802038.0, entitled "Multimedia processing method, device, electronic device, and storage medium", filed on July 15, 2021, the entire content of which is incorporated herein by reference.
- Technical Field: Embodiments of the present disclosure relate to the field of computer technology, and in particular to a multimedia processing method, device, electronic device, and storage medium.
- Background: The technical difficulty and threshold for the public to produce multimedia resources have been greatly reduced, bringing multimedia-based content creation and sharing into a stage of popularization.
- In a first aspect, an embodiment of the present disclosure provides a multimedia processing method, including: acquiring a first multimedia resource; performing speech recognition on audio data of the first multimedia resource to determine initial text content corresponding to the first multimedia resource, where the audio data of the first multimedia resource includes speech data of the initial text content; determining invalid text content in the initial text content, where the invalid text content is text content that serves no semantic information-expression function; determining a first playback position, in the first multimedia resource, of the speech data of the invalid text content; and clipping the first multimedia resource based on the first playback position to obtain a second multimedia resource, where the audio data of the second multimedia resource includes the speech data of the target text content and does not include the speech data of the invalid text content, the target text content being the other text content in the initial text content besides the invalid text content.
- In a second aspect, an embodiment of the present disclosure provides a multimedia processing device, including: a voice recognition module, configured to acquire a first multimedia resource, perform voice recognition on audio data of the first multimedia resource, and determine the initial text content corresponding to the first multimedia resource, where the audio data of the first multimedia resource includes voice data of the initial text content; a first confirmation module, configured to determine invalid text content in the initial text content, where the invalid text content is text content that serves no semantic information-expression function; a second confirmation module, configured to determine a first playback position of the voice data of the invalid text content in the first multimedia resource; and a generating module, configured to clip the first multimedia resource based on the first playback position to obtain a second multimedia resource, where the audio data of the second multimedia resource includes the voice data of the target text content and does not include the voice data of the invalid text content, the target text content being the other text content in the initial text content besides the invalid text content.
- In a third aspect, an embodiment of the present disclosure provides an electronic device, including at least one processor and a memory; the memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the multimedia processing method described in the first aspect and its various possible designs.
- In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the multimedia processing method described in the first aspect and its various possible designs is implemented.
- In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer program which, when executed by a processor, implements the multimedia processing method described in the first aspect and its various possible designs.
- In a sixth aspect, an embodiment of the present disclosure provides a computer program which, when executed by a processor, implements the multimedia processing method described in the first aspect and its various possible designs.
- The multimedia processing method, device, electronic device, and storage medium provided by the embodiments of the present disclosure acquire a first multimedia resource; perform speech recognition on the audio data of the first multimedia resource to determine the initial text content corresponding to the first multimedia resource, where the audio data of the first multimedia resource includes the voice data of the initial text content; determine invalid text content in the initial text content, where the invalid text content is text content that serves no semantic information-expression function; determine a first playback position of the voice data of the invalid text content in the first multimedia resource; and, based on the first playback position, clip the first multimedia resource to obtain a second multimedia resource, where the audio data of the second multimedia resource includes the voice data of the target text content and does not include the voice data of the invalid text content, the target text content being the other text content in the initial text content besides the invalid text content.
- FIG. 1 is an application scenario diagram of a multimedia processing method provided by an embodiment of the present disclosure
- FIG. 2 is another application scenario diagram of the multimedia processing method provided by an embodiment of the present disclosure
- FIG. 3 is a first schematic flowchart of the multimedia processing method provided by an embodiment of the present disclosure
- FIG. 5 is a second schematic flowchart of a multimedia processing method provided by an embodiment of the present disclosure
- FIG. 6 is a flowchart of an implementation of step S203 in the embodiment shown in FIG. 5
- FIG. 7 is a flowchart of an implementation of step S204 in the embodiment shown in FIG. 5
- FIG. 8 is a schematic diagram of an interactive interface provided by an embodiment of the present disclosure
- FIG. 10 is a structural block diagram of a multimedia processing device provided by an embodiment of the present disclosure
- FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure
- FIG. 12 is a schematic diagram of this disclosure
- FIG. 1 is an application scenario diagram of the multimedia processing method provided by the embodiments of the present disclosure.
- The multimedia processing method provided by this embodiment can be applied to the post-processing of recorded multimedia video resources.
- The method provided in this embodiment can be applied to a terminal device, as shown in FIG. 1.
- The terminal device 11 processes the initial multimedia video by executing the multimedia processing method provided in this embodiment, automatically removing the video clips and audio clips corresponding to meaningless content, such as slips of the tongue, pauses, and filler words, that appear in the initial multimedia video, so as to generate the clipped multimedia video. Since the meaningless content has been removed, the clipped multimedia video is more coherent and fluent in content than the initial multimedia video, and the content quality of the multimedia video resource is higher.
- The terminal device 11 sends the clipped multimedia video to the video platform server 12, and the video platform server 12 publishes it as a multimedia video resource on the video platform; other terminal devices 13 can watch the multimedia video resource by communicating with the video platform server 12.
- FIG. 2 is another application scenario diagram of the multimedia processing method provided by the embodiment of the present disclosure.
- The method provided by this embodiment can also be applied to a video platform server. That is, the user records a video through a terminal device, and the initial multimedia video is sent to the video platform server; the video platform server processes the initial multimedia video by executing the multimedia processing method provided in this embodiment to generate the trimmed multimedia video, as in FIG. 1. The trimmed multimedia video is published on the video platform as a multimedia video resource, and other terminal devices can watch the multimedia video resource by communicating with the video platform server.
- FIG. 3 is a first schematic flowchart of a multimedia processing method provided by an embodiment of the present disclosure.
- The method of this embodiment can be applied to a server or a terminal device; the following description exemplarily takes the terminal device as the execution subject of the method of this embodiment.
- The multimedia processing method includes:
- S101: Acquire a first multimedia resource.
- Multimedia generally refers to a combination of multiple media, typically including media forms such as text, sound, and images; that is, multimedia is a human-computer interactive medium for information exchange and dissemination that combines two or more media.
- The first multimedia resource may be an audio-visual resource with an audio track; more specifically, the first multimedia resource may be a multimedia resource or file that contains both video data and audio data.
- The first multimedia resource may be a video with sound recorded by the user through a terminal device, for example via the recording function of a smartphone. A video with sound is a video containing human voices, such as an explanation video, a teaching video, or a product introduction video.
- The first multimedia resource can also be obtained by the terminal device by receiving data transmitted by other electronic devices; no further examples are given here.
- S102 Perform voice recognition on the audio data of the first multimedia resource, and determine the initial text content corresponding to the first multimedia resource, where the audio data of the first multimedia resource includes voice data of the initial text content.
- The first multimedia resource includes at least audio data. Speech recognition is performed on the audio data of the first multimedia resource according to a preset algorithm to determine the initial text content corresponding to the first multimedia resource.
- The audio data may include only voice data, or both voice data and non-voice data. The voice data is the audio data corresponding to the voices of the people recorded in the video; correspondingly, the non-voice data is the audio data in the video other than the human voices.
- For example, if the first multimedia resource is a product introduction video, the audio data corresponding to the voice of the person introducing the product is voice data, and the text obtained through speech recognition is the initial text content, that is, the transcript of the speech introducing the product in the video.
- S103: Determine invalid text content in the initial text content, where the invalid text content is text content that serves no semantic information-expression function. For example, after the initial text content is obtained, it is analyzed based on its characters, phrases, sentences, and paragraphs, from which the invalid text content and the target text content can be determined. More specifically, the invalid text content consists of characters, words, and phrases such as abnormal pauses, repetitions, and redundant modal particles produced while speaking, which convey no information but affect the fluency of expression.
- Exemplarily, the invalid text content in the initial text content may be determined based on a preset invalid text content library. Specifically, the invalid text content library includes preset elements such as characters, words, and phrases corresponding to invalid text content; based on these entries, it is detected whether the initial text content contains the characters, words, phrases, or short sentences in the invalid text content library, thereby determining the invalid text content in the initial text content.
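The library-based matching described above can be sketched as follows. The library entries, function name, and greedy longest-match strategy are illustrative assumptions of this sketch, not the patent's implementation:

```python
# Sketch of matching recognized text against a preset invalid-text library.
# The library entries below are illustrative filler-word examples.
INVALID_TEXT_LIBRARY = {"um", "uh", "you know"}

def find_invalid_spans(words):
    """Return (start, end) word-index spans whose text matches a library entry.

    `words` is the recognized word sequence; multi-word library entries are
    matched greedily, longest entry first.
    """
    entries = sorted((e.split() for e in INVALID_TEXT_LIBRARY), key=len, reverse=True)
    spans = []
    i = 0
    while i < len(words):
        for entry in entries:
            if words[i:i + len(entry)] == entry:
                spans.append((i, i + len(entry)))
                i += len(entry)
                break
        else:
            i += 1
    return spans
```

For example, `find_invalid_spans(["um", "this", "is", "you", "know", "great"])` would flag the spans covering "um" and "you know" while leaving the informative words untouched.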
- semantic information corresponding to the initial text content may be obtained by performing semantic analysis on the initial text content; and then invalid text content in the initial text content may be determined according to the semantic information.
- Through semantic analysis, the semantic meaning of each character and word element in the initial text content can be determined, and invalid text content is itself a semantic category; therefore, performing semantic analysis on the initial text content can determine the invalid text content within it.
- The semantic analysis of the initial text content can be realized through a pre-trained language processing model; the use and training of language processing models for semantic analysis is prior art known to those skilled in the art and will not be repeated here.
- S104 Determine a first playback position of voice data of invalid text content in the first multimedia resource.
- Each word or phrase in the initial text content corresponds to an audio clip, that is, a piece of audio data, and the audio data carries a playback timestamp. After the invalid text content in the initial text content is determined, the audio data corresponding to each character and word contained in the invalid text content is determined and its playback timestamp is obtained; according to these playback timestamps, the first playback position of the voice data of the invalid text content in the first multimedia resource can be determined.
- If the invalid text content in the first multimedia resource corresponds to one piece of continuous audio data, the first playback position may include only one set of playback timestamps consisting of a start point and an end point.
- If the invalid text content in the first multimedia resource corresponds to multiple pieces of discontinuous audio data, the first playback position may include multiple sets of playback timestamps, each consisting of a start point and an end point.
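The mapping from per-word timestamps to the sets of start/end playback positions described above can be sketched as follows. The data shapes (index spans, `(start, end)` second pairs) and the contiguity-merging rule are assumptions of this sketch:

```python
def playback_positions(word_times, invalid_spans):
    """Map word-index spans of invalid text to (start_sec, end_sec) pairs,
    merging spans whose audio is contiguous into one continuous segment.

    `word_times` is a list of (start_sec, end_sec) per recognized word, as a
    timestamped speech-recognition pass might produce (illustrative shape).
    """
    segments = []
    for i, j in invalid_spans:
        start, end = word_times[i][0], word_times[j - 1][1]
        if segments and abs(segments[-1][1] - start) < 1e-6:
            # Adjacent invalid words share a boundary: extend the last segment.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

Two invalid words that abut in time thus collapse into one set of start/end timestamps, while separated ones yield multiple sets, matching the continuous and discontinuous cases above.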
- S105 Based on the first playback position, clip the first multimedia resource to obtain a second multimedia resource, where the audio data of the second multimedia resource contains the voice data of the target text content and does not contain the voice data of the invalid text content; the target text content is the other text content in the initial text content besides the invalid text content.
- Once the first playback position is determined, the voice data corresponding to the invalid text content has been identified and located. The voice data corresponding to the invalid text content is then deleted from the audio data of the first multimedia resource while the voice data corresponding to the target text content is retained, thereby reducing the impact on expression of content that serves no information-expression function, such as pauses, repetitions, and redundant modal particles.
- Specifically, based on the first playback position's description of where the voice data corresponding to the invalid text content is played in the first multimedia resource, the start point and end point of that voice data are determined; the data between the start point and the end point is deleted, and the speech data of the target text content before the start point and after the end point is spliced to generate the clipped audio data.
- The first multimedia resource also includes video data; similarly, based on the first playback position, corresponding processing is performed on the video data of the first multimedia resource to clip out the portion corresponding to the invalid text content.
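The splice computation described above, deleting the span between each start and end point and joining the remaining ranges, can be sketched as follows. The function name and time-range representation are assumptions of this sketch; the same keep-list would drive the trimming of both the audio and video tracks:

```python
def clip_resource(duration, invalid_segments):
    """Return the (start, end) time ranges to keep after removing the
    invalid segments, i.e. the splice list for the audio and video tracks.

    `invalid_segments` is assumed sorted and non-overlapping in this sketch.
    """
    keep, cursor = [], 0.0
    for start, end in invalid_segments:
        if start > cursor:
            keep.append((cursor, start))  # retained target-content range
        cursor = max(cursor, end)         # skip past the deleted range
    if cursor < duration:
        keep.append((cursor, duration))
    return keep
```

For a 10-second resource with invalid segments at 2.0–3.0 s and 5.0–6.5 s, the keep-list is the three remaining ranges, which a media toolchain would then concatenate into the second multimedia resource.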
- FIG. 4 is a schematic diagram of a process of obtaining a second multimedia resource from a first multimedia resource, provided by an embodiment of the present disclosure.
- The first multimedia resource includes audio data and video data, and the audio data includes voice data. The voice data includes first voice data and second voice data, where the first voice data is the voice data corresponding to the invalid text content, and the second voice data is the voice data corresponding to the target text content.
- In this embodiment, the first multimedia resource is acquired; speech recognition is performed on the audio data of the first multimedia resource to determine the initial text content corresponding to the first multimedia resource, where the audio data of the first multimedia resource includes the voice data of the initial text content; invalid text content, that is, text content that serves no semantic information-expression function, is determined in the initial text content; the first playback position of the voice data of the invalid text content in the first multimedia resource is determined; and, based on the first playback position, the first multimedia resource is clipped to obtain a second multimedia resource, where the audio data of the second multimedia resource contains the voice data of the target text content and does not contain the voice data of the invalid text content, the target text content being the other text content in the initial text content besides the invalid text content.
- FIG. 5 is a second schematic flowchart of a multimedia processing method provided by an embodiment of the present disclosure. In this embodiment, on the basis of the embodiment shown in FIG. 3, steps S102-S105 are further refined.
- the multimedia processing method includes:
- Using a voice endpoint detection algorithm, identify voice data and non-voice data in the audio data of the first multimedia resource.
- The voice endpoint detection (Voice Activity Detection, VAD) algorithm, also known as voice activity detection, aims to identify silent periods (i.e., non-human-voice signals) in the sound signal stream. The VAD algorithm processes the audio data of the first multimedia resource and can identify the voice data corresponding to human voices and the non-voice data corresponding to non-human sounds, so that subsequent processing can be based on the voice data.
- the specific implementation method of the speech endpoint detection algorithm is an existing technology known to those skilled in the art, and will not be repeated here.
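As a rough illustration of the idea only: production VAD algorithms (such as the one in WebRTC) use trained statistical models, but the core notion of separating speech frames from silent frames can be sketched with a simple energy gate. The frame size and threshold values here are arbitrary assumptions:

```python
import numpy as np

def simple_vad(samples, rate, frame_ms=20, threshold=0.02):
    """Energy-based voice activity detection: one boolean per frame.

    Splits `samples` (a float array) into frames of `frame_ms` milliseconds
    and marks a frame as speech when its RMS energy exceeds `threshold`.
    """
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame
    frames = samples[:n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold
```

Frames marked `False` correspond to the silent periods that VAD is meant to exclude, leaving only the voice data for the subsequent recognition step.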
- S203 Perform voice recognition on the voice data in the audio data of the first multimedia resource, and determine the initial text content corresponding to the first multimedia resource, where the initial text content includes a plurality of fragment contents.
- Exemplarily, as shown in FIG. 6, S203 includes two specific implementation steps: S2031 and S2032.
- S2031 Using automatic speech recognition technology, perform speech recognition on the audio data of the first multimedia resource to obtain a plurality of phonetic words and a timestamp corresponding to each phonetic word, where the timestamp indicates the playback position, within the first multimedia resource, of the audio data corresponding to that phonetic word.
- S2032 Generate the initial text content according to the multiple phonetic words.
- Automatic Speech Recognition (ASR) is a technology that converts human speech into text, drawing on multiple disciplines such as acoustics, phonetics, linguistics, and computer science. Preprocessing, feature extraction, feature recognition, and postprocessing are performed on the audio signal to realize the conversion from speech to text. The specific algorithms involved in each processing stage can be configured according to specific needs and are not exemplified here.
- Through ASR, a word-level recognition result, that is, a phonetic word, can be obtained; each phonetic word corresponds to a segment of audio data, that is, an audio segment. The phonetic words corresponding to the audio segments are arranged in order, and the generated text containing the plurality of phonetic words is the initial text content. At the same time, the timestamp corresponding to each phonetic word is determined, which represents the playback position of the audio data (that is, the above-mentioned audio segment) corresponding to that phonetic word.
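Arranging the timestamped word-level results into the initial text content can be sketched as follows. The dictionary shape of the ASR output is an assumption of this sketch, and the space-joined text is a Western-language simplification (Chinese characters would be joined without separators):

```python
def build_initial_text(asr_words):
    """Arrange word-level ASR results into the initial text content.

    `asr_words` is a list of dicts with 'word', 'start', and 'end' keys,
    an illustrative shape for a timestamped recognition result.
    Returns the text plus a per-word (word, start, end) index used later
    to locate each word's audio segment.
    """
    ordered = sorted(asr_words, key=lambda w: w["start"])
    text = " ".join(w["word"] for w in ordered)
    index = [(w["word"], w["start"], w["end"]) for w in ordered]
    return text, index
```

The retained index is what makes step S104 possible: every word in the initial text content can be traced back to the playback position of its audio segment.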
- One or more phonetic words may constitute a piece of content used to express specific semantics, for example a word composed of two phonetic characters, or an idiom composed of four phonetic characters. The invalid text content in the initial text content may appear as a single phonetic word, such as "um", or as two phonetic words, such as "that". The above examples only cover semantically non-informative content that often appears in Chinese speech; other languages are similar, so no further examples are given here.
- S204 Determine at least one invalid segment content from the multiple segment contents of the initial text content. Exemplarily, as shown in FIG. 7, S204 includes five specific implementation steps: S2041, S2042, S2043, S2044, and S2045.
- S2041 Based on the preset invalid text content library, determine invalid segment content in the initial text content.
- S2042 If no invalid fragment content is found based on the invalid text content library, perform semantic analysis on the initial text content to obtain semantic information corresponding to each fragment content of the initial text content.
- The invalid text content library presets a plurality of phonetic words and/or combinations of phonetic words that have no semantic information expression function. By performing consistency detection on the initial text content against these preset phonetic words and/or combinations, the matching phonetic words and/or combinations in the initial text content, that is, the invalid fragment content, can be determined.
- step S204 can be implemented.
- When invalid fragment content is determined through the invalid text content library, the semantic analysis step is omitted, so the efficiency is higher and fewer computing resources are occupied, which improves the efficiency of locating and clipping invalid text content.
- If none of the phonetic words and/or phonetic word combinations preset in the invalid text content library are detected in the initial text content, or the number of detected preset phonetic words and/or combinations is less than a preset value, semantic analysis is performed on the initial text content. The semantic information corresponding to each segment content of the initial text content is determined through the semantic analysis, and in a subsequent step the invalid segment content is determined through that semantic information.
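A minimal sketch of the library-matching path (S2041) and the fallback decision (S2042) might look like the following; the library entries, the preset value, and all names are illustrative assumptions, not values given by the disclosure.

```python
# Preset library of semantically empty words / word combinations (illustrative).
INVALID_LIBRARY = {("um",), ("uh",), ("you", "know")}
MIN_MATCHES = 1  # preset value below which semantic analysis is triggered

def match_invalid_segments(tokens, library):
    """Consistency detection: return (start, end_exclusive) token spans that
    exactly match a library entry, preferring longer combinations."""
    max_n = max(len(entry) for entry in library)
    spans = []
    for i in range(len(tokens)):
        for n in range(max_n, 0, -1):  # try longest combination first
            if tuple(tokens[i:i + n]) in library:
                spans.append((i, i + n))
                break
    return spans

tokens = "um hello you know everyone".split()
spans = match_invalid_segments(tokens, INVALID_LIBRARY)
# Fall back to semantic analysis when the library finds too little.
needs_semantic_analysis = len(spans) < MIN_MATCHES
print(spans)  # -> [(0, 1), (2, 4)]
```

Because the library lookup is a set membership test over short n-grams, it is far cheaper than running a semantic model, which matches the efficiency argument above.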
- S2043 According to the semantic information corresponding to each fragment content of the initial text content, determine a credibility coefficient of at least one fragment content in the initial text content, and the credibility coefficient is used to represent the credibility of the fragment content as invalid text content.
- S2044 Determine at least one invalid segment content from at least one segment content according to the credibility coefficient of the segment content and a preset reliability threshold.
- The output semantic information includes the confidence of the semantic type corresponding to the segment content; the confidence represents the semantic analysis model's evaluation of the credibility of its semantic classification result for that segment content, that is, the confidence is the credibility coefficient. The higher the credibility coefficient, the more credible the semantic type corresponding to the segment content.
- "Invalid content" likewise corresponds to a semantic classification.
- Segment content classified as "invalid content" whose credibility coefficient is greater than the credibility threshold is determined to be invalid segment content.
- Determining invalid segment content via the credibility coefficient of each segment content improves the identification accuracy of invalid segment content and reduces misjudgment.
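The thresholding of S2043/S2044 reduces to a simple filter once a semantic model has labeled each segment. In this sketch the classifier outputs and the threshold value are stand-ins; the disclosure does not fix a particular model or number.

```python
# Preset credibility threshold (illustrative value).
CREDIBILITY_THRESHOLD = 0.8

# Hypothetical semantic-analysis output: one record per segment content.
segments = [
    {"text": "um",             "label": "invalid content", "confidence": 0.95},
    {"text": "hello everyone", "label": "greeting",        "confidence": 0.90},
    {"text": "that",           "label": "invalid content", "confidence": 0.60},
]

# Keep only segments classified as "invalid content" whose credibility
# coefficient (confidence) exceeds the preset threshold.
invalid = [
    s for s in segments
    if s["label"] == "invalid content" and s["confidence"] > CREDIBILITY_THRESHOLD
]
print([s["text"] for s in invalid])  # -> ['um']
```

The low-confidence "that" is deliberately not removed, which illustrates how the threshold reduces misjudgment of words that are sometimes meaningful.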
- S2045 Add the invalid fragment content determined based on the semantic information to the invalid text content library. For example, after invalid segment content is determined through semantic information, since the invalid text content library does not yet include the word or word combination corresponding to that invalid segment content, the invalid segment content is added to the library to expand it.
- Expanding the library improves the accuracy and validity of invalid text content judgments made with the library, which in turn improves the efficiency of locating and clipping invalid text content.
- the invalid text content may include one or more invalid segment contents, and after each invalid segment content in the initial text content is determined, the corresponding invalid text content can be determined.
- the invalid segment content includes at least one phonetic word
- Determining the invalid text content in the initial text content includes: acquiring the playback duration of each phonetic word according to the time stamp corresponding to the phonetic word; and, by comparing the playback duration of the phonetic word with its standard duration, determining a phonetic word whose playback duration exceeds a first threshold of the standard duration, or whose playback duration falls below a second threshold of the standard duration, to be invalid text content in the initial text content.
- The phonetic words in the invalid segment content are generated by converting the voice data, and the voice data corresponds to the human voice. In actual application, however, the pronunciation duration of the same phonetic word may differ between utterances. Different pronunciation durations can express different semantics, which in turn determines whether a phonetic word expresses semantic information.
- each phonetic word has a preset standard duration, for example, 0.2 seconds.
- If the playback duration of a phonetic word is much longer or much shorter than the standard duration, the phonetic word is likely a mood or pause word that does not express specific semantics, so it can be determined to be invalid text content.
- By comparing the standard duration of a phonetic word with its playback duration, it is detected whether a phonetic word in the invalid segment content is an invalid word generated by, for example, a tonal pause, thereby reducing misjudgment of phonetic words with different meanings and improving the recognition accuracy of invalid text content.
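The duration comparison above can be sketched as a single predicate. The standard durations and the two threshold factors are illustrative assumptions; the disclosure only gives 0.2 seconds as an example standard duration.

```python
# Preset per-word standard durations in seconds (illustrative values).
STANDARD_DURATION = {"um": 0.2, "hello": 0.3}

def is_duration_invalid(word, start, end,
                        first_threshold=2.0, second_threshold=0.5):
    """Flag a word whose playback duration exceeds first_threshold x its
    standard duration, or falls below second_threshold x its standard duration."""
    duration = end - start  # playback duration from the word's timestamps
    standard = STANDARD_DURATION.get(word)
    if standard is None:
        return False  # no reference duration available, make no judgment
    return (duration > standard * first_threshold
            or duration < standard * second_threshold)

print(is_duration_invalid("um", 0.0, 0.8))     # 0.8 s vs 0.2 s standard -> True
print(is_duration_invalid("hello", 1.0, 1.3))  # 0.3 s vs 0.3 s standard -> False
```

The stretched "um" (four times its standard duration) is flagged as a likely pause word, while a normally paced word passes.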
- S206 Determine the start point and end point of the voice data of each invalid segment content in the invalid text content in the audio data of the first multimedia resource.
- S207 Determine the first playback position of the voice data of the invalid text content in the first multimedia resource according to the start point and the end point corresponding to each invalid segment content in the invalid text content.
- The invalid segment content includes at least one phonetic word; according to the time stamp corresponding to each phonetic word in the invalid text content, the first playback position of the voice data of the invalid text content in the first multimedia resource is determined.
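Steps S206/S207 amount to taking, for each invalid segment, the earliest start and latest end among its word timestamps. A minimal sketch, with `(text, start, end)` tuples standing in for timestamped phonetic words:

```python
def playback_positions(invalid_segments):
    """Map each invalid segment (a list of timestamped words) to its
    (start, end) playback interval in the first multimedia resource."""
    positions = []
    for words in invalid_segments:
        start = min(w[1] for w in words)  # start point of the first word
        end = max(w[2] for w in words)    # end point of the last word
        positions.append((start, end))
    return positions

segments = [
    [("um", 0.0, 0.4)],                       # one-word invalid segment
    [("you", 2.1, 2.3), ("know", 2.3, 2.6)],  # two-word invalid segment
]
print(playback_positions(segments))  # -> [(0.0, 0.4), (2.1, 2.6)]
```

The resulting intervals are the "first playback position" used later when the resource is clipped.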
- S208 Display invalid text content in the initial text content.
- S209 Play an audio clip corresponding to the invalid text content in response to the operation instruction for the invalid text content.
- the terminal device to which the method provided in this embodiment is applied has a touchable display screen, and an application (Application, APP) for editing the first multimedia resource runs in the terminal device.
- the touch screen displays the interactive interface of the APP.
- FIG. 8 is a schematic diagram of an interactive interface provided by an embodiment of the present disclosure. With reference to FIG. 8, the initial text content is displayed on the interactive interface of the terminal device.
- S210 Determine a second playback position of the non-voice data in the first multimedia resource according to the start point and the end point of the non-voice data.
- the non-speech data is the audio data corresponding to the non-speech part in the first multimedia resource, such as the blank part before the introduction and the blank part after the introduction in the product introduction video.
- the non-speech data is obtained through the speech endpoint detection algorithm in step S202, and will not be repeated here.
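The disclosure leaves the endpoint detection algorithm open, so the following is only an energy-based stand-in for S202: frames whose RMS energy stays below a threshold are marked non-speech. The frame size and threshold are illustrative; real VAD algorithms are considerably more elaborate.

```python
import math

def detect_nonspeech(samples, sample_rate, frame_ms=20, rms_threshold=0.02):
    """Return (start_sec, end_sec) intervals of consecutive low-energy frames."""
    frame_len = sample_rate * frame_ms // 1000
    intervals, run_start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        t = i / sample_rate
        if rms < rms_threshold:                 # quiet frame: extend the run
            run_start = t if run_start is None else run_start
        elif run_start is not None:             # loud frame: close the run
            intervals.append((run_start, t))
            run_start = None
    if run_start is not None:                   # run reaches end of audio
        intervals.append((run_start, len(samples) // frame_len * frame_len / sample_rate))
    return intervals

rate = 1000
silence = [0.0] * rate  # 1 s of silence, e.g. the blank part before an introduction
tone = [math.sin(2 * math.pi * 50 * n / rate) for n in range(rate)]  # 1 s of speech-like signal
print(detect_nonspeech(silence + tone, rate))  # -> [(0.0, 1.0)]
```

The returned intervals correspond directly to the "second playback position" of the non-speech data described in S210.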
- The corresponding playback position of the non-speech data in the first multimedia resource, that is, the second playback position, can thus be obtained.
- Once the non-voice data is located, the first multimedia resource is clipped based on the first playback position and the second playback position: the voice data corresponding to the invalid segment content and the non-voice data are removed from the first multimedia resource, leaving the audio data corresponding to the other text content, that is, the audio data corresponding to the target segment content.
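The clipping step reduces to computing the complement of the removed intervals over the resource's duration. A sketch, with all interval values illustrative:

```python
def keep_intervals(total_duration, remove):
    """Complement the removed intervals over [0, total_duration]; the result
    is the set of intervals corresponding to the target segment content."""
    kept, cursor = [], 0.0
    for start, end in sorted(remove):
        if start > cursor:
            kept.append((cursor, start))  # audio between removals is kept
        cursor = max(cursor, end)
    if cursor < total_duration:
        kept.append((cursor, total_duration))
    return kept

first_positions = [(2.0, 2.5)]                 # invalid-segment speech (first playback position)
second_positions = [(0.0, 0.8), (9.0, 10.0)]   # leading/trailing non-speech (second playback position)
print(keep_intervals(10.0, first_positions + second_positions))
# -> [(0.8, 2.0), (2.5, 9.0)]
```

Because both the invalid speech and the non-speech intervals feed the same complement computation, audio and video can be cut with the one resulting keep-list.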
- S212 Add a fade-in effect at the start point of the voice data corresponding to at least one target segment content, and/or add a fade-out effect at the end point of the voice data corresponding to at least one target segment content, to generate transitional voice data corresponding to the target segment content.
- S213 Splice the transitional voice data according to the first playback position and the second playback position to generate the second multimedia resource. Further, since the remaining other text content includes at least one target segment content, a fade-in/fade-out effect is added to the voice data corresponding to the target segment content in order to improve the playback fluency of the trimmed audio. Specifically, for example, a fade-in effect is added at the start point of the voice data corresponding to at least one target segment content, and a fade-out effect is added at the end point of that voice data.
- The fade-in and fade-out effects refer to windowing the voice data in the time domain at its start point and end point, so that when the voice data corresponding to a target segment content starts playing, the volume gradually increases from small to large (fade in), and when it ends, the volume gradually decreases from large to small (fade out), reducing the abruptness of the audio clipping.
- The specific method of adding fade-in and fade-out effects to an audio clip is known in the art and is not repeated here.
- Transitional voice data is thus generated. Then, according to the first playback position and the second playback position, the pieces of transitional voice data are spliced to generate the target audio data; similarly, according to the first playback position and the second playback position, the target video data corresponding to the target audio data is acquired, and the second multimedia resource is then generated.
- The second multimedia resource includes only the multimedia video corresponding to the target text content composed of the target segment content; it includes neither the multimedia video corresponding to the invalid text content nor the multimedia video corresponding to the non-voice data.
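The windowing and splicing of S212/S213 can be sketched with linear fade ramps over raw sample lists; the ramp length and sample data are illustrative, and a real implementation would operate on decoded PCM from the clipped intervals.

```python
def add_fades(clip, sample_rate, fade_ms=10):
    """Time-domain windowing: ramp the first fade_ms up from 0 (fade-in) and
    the last fade_ms down to 0 (fade-out)."""
    n = min(sample_rate * fade_ms // 1000, len(clip) // 2)
    out = list(clip)
    for i in range(n):
        gain = i / n
        out[i] *= gain        # fade-in at the start point
        out[-1 - i] *= gain   # fade-out at the end point
    return out

def splice(clips, sample_rate):
    """Fade each transitional clip, then join them in playback order."""
    spliced = []
    for clip in clips:
        spliced.extend(add_fades(clip, sample_rate))
    return spliced

rate = 1000
clips = [[1.0] * 100, [0.5] * 100]  # two kept voice clips, 0.1 s each (illustrative)
result = splice(clips, rate)
print(len(result), result[0], result[50])  # -> 200 0.0 1.0
```

Each clip now starts and ends at zero amplitude, which is what removes the audible click where clipped segments meet.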
- FIG. 9 is a schematic diagram of another process for obtaining a second multimedia resource from a first multimedia resource according to an embodiment of the present disclosure. As shown in FIG. 9, the speech data and non-speech data in the audio data are determined, and after processing steps such as speech recognition and semantic analysis are performed on the speech data, a plurality of invalid segment contents are determined, where the invalid segment contents correspond to first speech data and the target segment contents other than the invalid segment contents correspond to second speech data.
- According to the first playback position corresponding to the first speech data of the invalid segment contents and the second playback position corresponding to the non-speech data, the audio data is trimmed to remove the first speech data and the non-speech data, generating the target audio data corresponding to the target text content.
- Similarly, the video data is clipped according to the first playback position and the second playback position to generate the target video data.
- FIG. 10 is a structural block diagram of a multimedia processing device provided in an embodiment of the present disclosure. For ease of description, only parts related to the embodiments of the present disclosure are shown. Referring to FIG. 10:
- The multimedia processing device 3 includes: a voice recognition module 31, configured to acquire a first multimedia resource, perform voice recognition on the audio data of the first multimedia resource, and determine the initial text content, where the audio data of the first multimedia resource includes the voice data of the initial text content; a first confirmation module 32, configured to determine invalid text content in the initial text content, where the invalid text content is text content that has no semantic information expression function; a second confirmation module 33, configured to determine the first playback position of the voice data of the invalid text content in the first multimedia resource; and a generation module 34, configured to clip the first multimedia resource based on the first playback position to obtain a second multimedia resource, where the audio data of the second multimedia resource contains the voice data of the target text content and does not contain the voice data of the invalid text content, the target text content being the text content in the initial text content other than the invalid text content.
- the first confirmation module 32 is specifically configured to: perform semantic analysis on the initial text content to obtain semantic information corresponding to the initial text content; determine invalid text content in the initial text content according to the semantic information.
- The initial text content includes a plurality of segment contents, and when the first confirmation module 32 determines invalid text content in the initial text content according to the semantic information, it is specifically configured to: determine, according to the semantic information corresponding to the initial text content, a credibility coefficient of at least one segment content in the initial text content, where the credibility coefficient is used to represent the credibility of the segment content being invalid text content; determine at least one invalid segment content from the at least one segment content according to the credibility coefficient of the segment content and a preset credibility threshold; and determine invalid text content in the initial text content according to the at least one invalid segment content.
- The second confirmation module 33 is specifically configured to: determine the start point and end point of the voice data of each invalid segment content in the audio data of the first multimedia resource; and, according to the start point and end point corresponding to each invalid segment content, determine the first playback position of the voice data of the invalid text content in the first multimedia resource.
- The generation module 34 is specifically configured to: obtain, based on the first playback position, the other text content in the initial text content except the invalid segment content, where the other text content includes at least one target segment content; add a fade-in effect at the start point of the voice data corresponding to at least one target segment content, and/or add a fade-out effect at the end point of the voice data corresponding to at least one target segment content, to generate transitional voice data corresponding to the target segment content; and splice the transitional voice data according to the first playback position to generate the second multimedia resource.
- Before performing semantic analysis on the initial text content to obtain the semantic information corresponding to the initial text content, the first confirmation module 32 is further configured to: determine invalid text content in the initial text content based on a preset invalid text content library. When performing semantic analysis on the initial text content to obtain the semantic information corresponding to the initial text content, the first confirmation module 32 is specifically configured to: if no invalid text content is found via the invalid text content library, perform semantic analysis on the initial text content to obtain the semantic information corresponding to the initial text content. After determining the invalid text content in the initial text content according to the semantic information, the first confirmation module 32 is further configured to: add the invalid text content determined based on the semantic information to the invalid text content library.
- Before clipping the first multimedia resource based on the first playback position to obtain the second multimedia resource, the generation module 34 is further configured to: display the invalid text content in the initial text content; and, in response to an operation instruction for the invalid text content, play the audio segment corresponding to the invalid text content.
- When performing speech recognition on the audio data of the first multimedia resource and determining the initial text content corresponding to the first multimedia resource, the speech recognition module 31 is specifically configured to: identify the speech data and non-speech data in the audio data of the first multimedia resource through a voice activity detection (VAD) algorithm; and perform speech recognition on the speech data in the audio data of the first multimedia resource to determine the initial text content corresponding to the first multimedia resource.
- The second confirmation module 33 is further configured to: determine the second playback position of the non-voice data in the first multimedia resource according to the start point and end point of the non-voice data. The generation module 34 is specifically configured to: clip the first multimedia resource based on the first playback position and the second playback position to obtain the second multimedia resource, where the second multimedia resource does not include the non-voice data.
- When the speech recognition module 31 performs speech recognition on the audio data of the first multimedia resource and determines the initial text content corresponding to the first multimedia resource, it is specifically configured to:
- perform speech recognition on the audio data of the first multimedia resource using ASR technology to obtain a plurality of phonetic words and a time stamp corresponding to each phonetic word, where the time stamp represents the playback position in the first multimedia resource of the audio data corresponding to the phonetic word; and generate the initial text content from the plurality of phonetic words.
- The second confirmation module 33 is specifically configured to: determine, according to the time stamp corresponding to each phonetic word in the invalid text content, the first playback position of the voice data of the invalid text content in the first multimedia resource.
- The first confirmation module 32 is specifically configured to: acquire the playback duration of each phonetic word according to the time stamp corresponding to the phonetic word; and determine a phonetic word whose playback duration is longer than a first threshold of its standard duration, or shorter than a second threshold of its standard duration, to be invalid text content in the initial text content.
- the first multimedia resource further includes video data
- The generation module 34 is specifically configured to: clip the audio data and video data of the first multimedia resource based on the first playback position to obtain the second multimedia resource.
- FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
- The electronic device 4 includes at least one processor 41 and a memory 42. The memory 42 stores computer-executable instructions; the at least one processor 41 executes the computer-executable instructions stored in the memory 42, enabling the at least one processor 41 to execute the multimedia processing method in the embodiments shown in FIGS. 2-7.
- the processor 41 and the memory 42 are connected through a bus 43 .
- Relevant descriptions can be understood by referring to the corresponding descriptions and effects of the steps in the embodiments of FIG. 2 to FIG. 7, and are not repeated here.
- FIG. 12 shows a schematic structural diagram of an electronic device 900 suitable for implementing the embodiments of the present disclosure.
- the electronic device 900 may be a terminal device or a server.
- the terminal equipment may include but not limited to mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (Personal Digital Assistant, PDA for short), tablet computers (Portable Android Device, PAD for short), portable multimedia players (Portable Media Player, PMP for short), mobile terminals such as vehicle-mounted terminals (eg, vehicle-mounted navigation terminals), and fixed terminals such as digital television (Television, TV), desktop computers, and the like.
- An electronic device 900 may include a processing device (such as a central processing unit or a graphics processing unit) 901, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data necessary for the operation of the electronic device 900.
- the processing device 901 , ROM 902 and RAM 903 are connected to each other through a bus 904 .
- An input/output (Input/Output, I/O for short) interface 905 is also connected to the bus 904 .
- These include: an input device 906 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 907 including, for example, a liquid crystal display (LCD), a speaker, or a vibrator; a storage device 908 such as a magnetic tape or a hard disk; and a communication device 909.
- The communication device 909 may allow the electronic device 900 to perform wireless or wired communication with other devices to exchange data. Although FIG. 12 shows an electronic device 900 having various devices, it should be understood that not all of the illustrated devices are required; more or fewer devices may alternatively be implemented or provided.
- the processes described above with reference to the flowcharts can be implemented as computer software programs.
- the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, where the computer program includes program code for executing the method shown in the flowchart.
- the computer program may be downloaded and installed from a network via communication means 909 , or from storage means 908 , or from ROM 902 .
- the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
- a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof.
- Computer-readable storage media may include, but are not limited to: an electrical connection with one or more conductors, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium containing or storing a program, and the program may be used by or in combination with an instruction execution system, device, or device.
- a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program codes are carried.
- the propagated data signal may take various forms, including but not limited to electromagnetic signal, optical signal, or any suitable combination of the above.
- the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and the computer-readable signal medium may send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device .
- the program code contained on the computer readable medium may be transmitted by any appropriate medium, including but not limited to: electric wire, optical cable, radio frequency (Radio Frequency, RF for short), etc., or any suitable combination of the above.
- the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or it may exist independently without being assembled into the electronic device.
- the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device is made to execute the methods shown in the above-mentioned embodiments.
- Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
- Each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
- Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
- the units involved in the embodiments described in the present disclosure may be implemented by means of software or by means of hardware.
- the name of the unit does not constitute a limitation on the unit itself under certain circumstances, for example, the first obtaining unit may also be described as "a unit that obtains at least two Internet Protocol addresses".
- the functions described herein above may be performed at least in part by one or more hardware logic components.
- Exemplary types of hardware logic components include: field-programmable gate arrays (Field Programmable Gate Array, FPGA for short), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), application-specific standard products (Application Specific Standard Parts, ASSP for short), systems on a chip (SOC for short), complex programmable logic devices (CPLD for short), and the like.
- a machine-readable medium may be a tangible medium, which may contain or store a program for use by or in combination with an instruction execution system, device, or device.
- a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read Only Memory (EPROM or flash memory), optical fiber, compact disk read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
- A multimedia processing method, including: acquiring a first multimedia resource; performing speech recognition on the audio data of the first multimedia resource and determining initial text content corresponding to the first multimedia resource, where the audio data of the first multimedia resource includes the voice data of the initial text content; determining invalid text content in the initial text content, where the invalid text content is semantically meaningless text content; determining a first playback position of the voice data of the invalid text content in the first multimedia resource; and, based on the first playback position, clipping the first multimedia resource to obtain a second multimedia resource, where the audio data of the second multimedia resource contains the voice data of the target text content and does not contain the voice data of the invalid text content, the target text content being the text content in the initial text content other than the invalid text content.
- determining invalid text content in the initial text content includes: performing semantic analysis on the initial text content to obtain semantic information corresponding to the initial text content; according to the semantic information to determine invalid text content in the initial text content.
- the initial text content includes a plurality of fragment content, and according to the semantic information, determining invalid text content in the initial text content includes: according to the corresponding Semantic information, determining a credibility coefficient of at least one segment content in the initial text content, where the credibility coefficient is used to characterize the credibility of the segment content as the invalid text content; according to the segment content determining at least one invalid segment content from the at least one segment content; determining invalid text content in the initial text content according to the at least one invalid segment content.
- determining the first playback position of the voice data of the invalid text content in the first multimedia resource includes: determining a start point and an end point of the voice data of each invalid segment content in the audio data of the first multimedia resource; and determining, according to the start point and the end point corresponding to each invalid segment content, the first playback position of the voice data of the invalid text content in the first multimedia resource.
- clipping the first multimedia resource to obtain a second multimedia resource includes: obtaining, based on the first playback position, the other text content in the initial text content except the invalid segment content, wherein the other text content includes at least one target segment content; adding a fade-in effect at the start point of the voice data corresponding to the at least one target segment content, and/or adding a fade-out effect at the end point of the voice data corresponding to the at least one target segment content, to generate transition voice data corresponding to the target segment content; and splicing the transition voice data according to the first playback position to generate the second multimedia resource.
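The fade-in/fade-out and splicing step can be illustrated with a short linear ramp applied to the ends of each kept segment before concatenation. This is only a sketch: audio is represented as plain sample lists, and the ramp length is an arbitrary illustrative choice, not a value from the disclosure.

```python
# Linear fade-in/fade-out over a short ramp, then splicing, applied to
# audio represented as plain Python sample lists.
def fade(samples, ramp, fade_in=True, fade_out=True):
    out = list(samples)
    n = len(out)
    r = min(ramp, n)
    if fade_in:
        for i in range(r):
            out[i] *= i / r          # ramp up from 0 at the start point
    if fade_out:
        for i in range(r):
            out[n - 1 - i] *= i / r  # ramp down to 0 at the end point
    return out

def splice(chunks, ramp=2):
    # Concatenate the transition voice data of each target segment.
    spliced = []
    for chunk in chunks:
        spliced.extend(fade(chunk, ramp))
    return spliced

print(fade([1.0, 1.0, 1.0, 1.0], 2))  # both ends taper toward zero
```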
- before performing semantic analysis on the initial text content to obtain semantic information corresponding to the initial text content, the method further includes: determining invalid text content in the initial text content based on a preset invalid text content library; performing semantic analysis on the initial text content to obtain the semantic information includes: if the invalid text content does not exist in the invalid text content library, performing semantic analysis on the initial text content to obtain the semantic information corresponding to the initial text content; and after determining the invalid text content in the initial text content according to the semantic information, the method further includes: adding the invalid text content determined from the semantic information to the invalid text content library.
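This library-first lookup with semantic analysis as a fallback is essentially a write-back cache. A minimal sketch, in which `semantic_analysis` is a hypothetical stand-in for the real analyzer and the seed library contents are invented:

```python
# Library-first lookup; semantic analysis runs only on misses, and
# newly detected invalid phrases are written back to the library.
invalid_library = {"um", "uh"}  # hypothetical preset library

def semantic_analysis(phrase):
    # Stand-in: treat a couple of hedging phrases as meaningless.
    return phrase in {"you know", "sort of"}

def is_invalid(phrase):
    if phrase in invalid_library:    # fast path: preset library
        return True
    if semantic_analysis(phrase):    # fallback: semantic analysis
        invalid_library.add(phrase)  # write back for future lookups
        return True
    return False

print(is_invalid("you know"))          # detected and cached
print("you know" in invalid_library)   # now in the library
```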
- before clipping the first multimedia resource to obtain the second multimedia resource based on the first playback position, the method further includes: displaying the invalid text content in the initial text content; and in response to an operation instruction for the invalid text content, playing an audio segment corresponding to the invalid text content.
- performing speech recognition on the audio data of the first multimedia resource and determining the initial text content corresponding to the first multimedia resource includes: recognizing, by using a voice endpoint detection algorithm, voice data and non-voice data in the audio data of the first multimedia resource; and performing voice recognition on the voice data in the audio data of the first multimedia resource to determine the initial text content corresponding to the first multimedia resource.
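Voice endpoint detection can be illustrated with a toy energy-based detector: frames whose mean absolute amplitude exceeds a threshold are labeled voice. Production VAD algorithms (e.g. the WebRTC VAD) are far more robust; the frame length and threshold here are arbitrary illustrative values.

```python
# Toy energy-based voice endpoint detection: label each fixed-length
# frame as voice (True) or non-voice (False) by its mean amplitude.
def vad(samples, frame_len=4, threshold=0.1):
    labels = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / len(frame)
        labels.append(energy > threshold)
    return labels

quiet = [0.01, -0.02, 0.0, 0.01]
loud = [0.5, -0.6, 0.4, -0.5]
print(vad(quiet + loud + quiet))  # one voiced frame between silences
```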
- the method further includes: determining a second playback position of the non-voice data in the first multimedia resource according to a start point and an end point of the non-voice data; and clipping the first multimedia resource based on the first playback position to obtain the second multimedia resource includes: clipping the first multimedia resource based on the first playback position and the second playback position to obtain the second multimedia resource, wherein the second multimedia resource does not include the non-voice data.
- performing speech recognition on the audio data of the first multimedia resource and determining the initial text content corresponding to the first multimedia resource includes: performing, according to an automatic speech recognition technology, speech recognition on the audio data of the first multimedia resource to obtain a plurality of speech words and a time stamp corresponding to each speech word, the time stamp indicating the playback position, in the first multimedia resource, of the audio data corresponding to the speech word; and generating the initial text content according to the plurality of speech words; determining the first playback position of the voice data of the invalid text content in the first multimedia resource includes: determining the first playback position according to the time stamp corresponding to each speech word in the invalid text content.
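Mapping invalid speech words back to playback positions via their timestamps can be sketched as collecting the per-word intervals and merging adjacent ones into a single contiguous cut span. The gap tolerance below is a hypothetical parameter, not a value from the disclosure:

```python
# Collect (start, end) timestamps of invalid words and merge spans
# separated by at most `gap` seconds into one cut interval.
def invalid_intervals(words, invalid, gap=0.05):
    spans = [(s, e) for w, s, e in words if w in invalid]
    merged = []
    for s, e in sorted(spans):
        if merged and s - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], e)  # extend the previous span
        else:
            merged.append((s, e))
    return merged

words = [("well", 0.0, 0.3), ("um", 0.3, 0.6), ("uh", 0.6, 0.9),
         ("go", 0.9, 1.2)]
print(invalid_intervals(words, {"um", "uh"}))  # adjacent words merge
```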
- determining the invalid text content in the initial text content includes: acquiring a playback duration of each speech word according to the time stamp corresponding to the speech word; and determining, according to a preset standard duration and the playback duration of the speech word, a speech word whose playback duration is longer than a first duration threshold of the standard duration, or a speech word whose playback duration is shorter than a second duration threshold of the standard duration, as invalid text content in the initial text content.
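The duration test above flags words dragged out far beyond the standard duration, or clipped far below it. A minimal sketch, where the standard duration and the two threshold multipliers are invented illustrative values:

```python
# Duration-based detection: flag words whose playback duration is
# beyond the long threshold or below the short threshold, both
# derived from a preset standard duration (hypothetical values).
def flag_by_duration(words, standard=0.3, long_factor=2.0, short_factor=0.3):
    flagged = []
    for w, start, end in words:
        duration = end - start
        if duration > standard * long_factor or duration < standard * short_factor:
            flagged.append(w)
    return flagged

words = [("soooo", 0.0, 1.0),   # stretched filler, too long
         ("let's", 1.0, 1.3),   # normal duration
         ("a", 1.3, 1.35)]      # clipped fragment, too short
print(flag_by_duration(words))
```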
- the first multimedia resource further includes video data, and clipping the first multimedia resource based on the first playback position to obtain the second multimedia resource includes: clipping the audio data and the video data of the first multimedia resource based on the first playback position to obtain the second multimedia resource.
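Cutting audio and video with the same positions keeps the two streams in sync; given the playback positions to cut, the segments to keep are their complement over the resource duration. A minimal sketch:

```python
# Compute the keep intervals (complement of the cut spans over
# [0, total]); applying the same list to both audio and video
# keeps them aligned.
def keep_intervals(cut_spans, total):
    keep, cursor = [], 0.0
    for start, end in sorted(cut_spans):
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        keep.append((cursor, total))
    return keep

print(keep_intervals([(0.4, 0.9), (1.5, 1.8)], 2.3))
```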
- a multimedia processing device including: a speech recognition module, configured to acquire a first multimedia resource, perform speech recognition on the audio data of the first multimedia resource, and determine initial text content corresponding to the first multimedia resource, wherein the audio data of the first multimedia resource includes voice data of the initial text content; a first confirmation module, configured to determine invalid text content in the initial text content, wherein the invalid text content is text content that has no information expression function semantically; a second confirmation module, configured to determine a first playback position of the voice data of the invalid text content in the first multimedia resource; and a generating module, configured to clip the first multimedia resource based on the first playback position to obtain a second multimedia resource, wherein the audio data of the second multimedia resource contains the voice data of target text content and does not include the voice data of the invalid text content, the target text content being the text content in the initial text content other than the invalid text content.
- the first confirmation module is specifically configured to: perform semantic analysis on the initial text content to obtain semantic information corresponding to the initial text content; and determine the invalid text content in the initial text content according to the semantic information.
- the initial text content includes a plurality of segment contents, and when determining invalid text content in the initial text content according to the semantic information, the first confirmation module is specifically configured to: determine a credibility coefficient of at least one segment content in the initial text content according to the semantic information corresponding to the initial text content, where the credibility coefficient characterizes the credibility that the segment content is the invalid text content; determine at least one invalid segment content from the at least one segment content according to the credibility coefficient of the segment content and a preset credibility threshold; and determine the invalid text content in the initial text content according to the at least one invalid segment content.
- the second confirmation module is specifically configured to: determine a start point and an end point of the voice data of each invalid segment content in the audio data of the first multimedia resource; and determine, according to the start point and the end point corresponding to each invalid segment content, the first playback position of the voice data of the invalid text content in the first multimedia resource.
- the generating module is specifically configured to: obtain, based on the first playback position, the other text content in the initial text content except the invalid segment content, wherein the other text content includes at least one target segment content; add a fade-in effect at the start point of the voice data corresponding to the at least one target segment content, and/or add a fade-out effect at the end point of the voice data corresponding to the at least one target segment content, to generate transition voice data corresponding to the target segment content; and splice the transition voice data according to the first playback position to generate the second multimedia resource.
- before performing semantic analysis on the initial text content to obtain semantic information corresponding to the initial text content, the first confirmation module is further configured to: determine invalid text content in the initial text content based on a preset invalid text content library; when performing semantic analysis on the initial text content to obtain the semantic information, the first confirmation module is specifically configured to: if the invalid text content does not exist in the invalid text content library, perform semantic analysis on the initial text content to obtain the semantic information corresponding to the initial text content; and after determining the invalid text content in the initial text content according to the semantic information, the first confirmation module is further configured to: add the invalid text content determined from the semantic information to the invalid text content library.
- before clipping the first multimedia resource based on the first playback position to obtain the second multimedia resource, the generating module is further configured to: display the invalid text content in the initial text content; and play an audio segment corresponding to the invalid text content in response to an operation instruction for the invalid text content.
- when performing voice recognition on the audio data of the first multimedia resource and determining the initial text content corresponding to the first multimedia resource, the voice recognition module is specifically configured to: identify voice data and non-voice data in the audio data of the first multimedia resource through a voice endpoint detection algorithm; and perform voice recognition on the voice data in the audio data of the first multimedia resource to determine the initial text content corresponding to the first multimedia resource.
- the second confirmation module is further configured to: determine a second playback position of the non-voice data in the first multimedia resource according to a start point and an end point of the non-voice data; and the generating module is specifically configured to: clip the first multimedia resource based on the first playback position and the second playback position to obtain the second multimedia resource, wherein the second multimedia resource does not include the non-voice data.
- when performing speech recognition on the audio data of the first multimedia resource and determining the initial text content corresponding to the first multimedia resource, the speech recognition module is specifically configured to: perform, according to an automatic speech recognition technology, speech recognition on the audio data of the first multimedia resource to obtain a plurality of speech words and a time stamp corresponding to each speech word, the time stamp indicating the playback position, in the first multimedia resource, of the audio data corresponding to the speech word; and generate the initial text content according to the plurality of speech words; the second confirmation module is specifically configured to: determine the first playback position of the voice data of the invalid text content in the first multimedia resource according to the time stamp corresponding to each speech word in the invalid text content.
- the first confirmation module is specifically configured to: acquire a playback duration of each speech word according to the time stamp corresponding to the speech word; and determine, according to a preset standard duration and the playback duration of the speech word, a speech word whose playback duration is longer than a first duration threshold of the standard duration, or a speech word whose playback duration is shorter than a second duration threshold of the standard duration, as invalid text content in the initial text content.
- the first multimedia resource further includes video data, and the generating module is specifically configured to: clip the audio data and the video data of the first multimedia resource based on the first playback position to obtain the second multimedia resource.
- an electronic device including: at least one processor and a memory; the memory stores computer-executable instructions; and the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the multimedia processing method described in the above first aspect and the various possible designs of the first aspect.
- a computer-readable storage medium is provided, storing computer-executable instructions which, when executed by a processor, implement the multimedia processing method described in the above first aspect and the various possible designs of the first aspect.
- an embodiment of the present disclosure provides a computer program product including a computer program; when the computer program is executed by a processor, the multimedia processing method described in the above first aspect and the various possible designs of the first aspect is implemented.
- an embodiment of the present disclosure provides a computer program; when the computer program is executed by a processor, the multimedia processing method described in the above first aspect and the various possible designs of the first aspect is implemented.
- the above description is only a description of the preferred embodiments of the present disclosure and the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to the technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Databases & Information Systems (AREA)
- Acoustics & Sound (AREA)
- Artificial Intelligence (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Security & Cryptography (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management Or Editing Of Information On Record Carriers (AREA)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
BR112023026041A BR112023026041A2 (pt) | 2021-07-15 | 2022-07-14 | Método de processamento de multimídia, dispositivos, equipamentos eletrônicos e mídias de armazenamento |
JP2023576228A JP2024527483A (ja) | 2021-07-15 | 2022-07-14 | マルチメディア処理方法、装置、電子機器および記憶媒体 |
EP22842574.0A EP4336854A4 (en) | 2021-07-15 | 2022-07-14 | MULTIMEDIA PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM |
US18/535,891 US20240105234A1 (en) | 2021-07-15 | 2023-12-11 | Multimedia processing method and apparatus, electronic device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110802038.0 | 2021-07-15 | ||
CN202110802038.0A CN115623279A (zh) | 2021-07-15 | 2021-07-15 | 多媒体处理方法、装置、电子设备及存储介质 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/535,891 Continuation US20240105234A1 (en) | 2021-07-15 | 2023-12-11 | Multimedia processing method and apparatus, electronic device, and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2023287360A2 true WO2023287360A2 (zh) | 2023-01-19 |
WO2023287360A3 WO2023287360A3 (zh) | 2023-04-13 |
Family
ID=84855225
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SG2022/050494 WO2023287360A2 (zh) | 2021-07-15 | 2022-07-14 | 多媒体处理方法、装置、电子设备及存储介质 |
Country Status (6)
Country | Link |
---|---|
US (1) | US20240105234A1 (zh) |
EP (1) | EP4336854A4 (zh) |
JP (1) | JP2024527483A (zh) |
CN (1) | CN115623279A (zh) |
BR (1) | BR112023026041A2 (zh) |
WO (1) | WO2023287360A2 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP4429256A1 (en) | 2023-01-19 | 2024-09-11 | Beijing Zitiao Network Technology Co., Ltd. | Video editing method and apparatus, electronic device, and storage medium |
CN118368478A (zh) * | 2023-01-19 | 2024-07-19 | 北京字跳网络技术有限公司 | 视频的编辑方法、装置、电子设备和存储介质 |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070219778A1 (en) * | 2006-03-17 | 2007-09-20 | University Of Sheffield | Speech processing system |
CN202502737U (zh) * | 2012-03-12 | 2012-10-24 | 中国人民解放军济南军区司令部第二部 | 一种视音频信息的智能编辑系统 |
US9020817B2 (en) * | 2013-01-18 | 2015-04-28 | Ramp Holdings, Inc. | Using speech to text for detecting commercials and aligning edited episodes with transcripts |
US11861906B2 (en) * | 2014-02-28 | 2024-01-02 | Genius Sports Ss, Llc | Data processing systems and methods for enhanced augmentation of interactive video content |
US20190215464A1 (en) * | 2018-01-11 | 2019-07-11 | Blue Jeans Network, Inc. | Systems and methods for decomposing a video stream into face streams |
CN108259965B (zh) * | 2018-03-31 | 2020-05-12 | 湖南广播电视台广播传媒中心 | 一种视频剪辑方法和剪辑系统 |
CN109241332B (zh) * | 2018-10-19 | 2021-09-24 | 广东小天才科技有限公司 | 一种通过语音确定语义的方法及系统 |
CN110297907B (zh) * | 2019-06-28 | 2022-03-08 | 谭浩 | 生成访谈报告的方法、计算机可读存储介质和终端设备 |
CN110853621B (zh) * | 2019-10-09 | 2024-02-13 | 科大讯飞股份有限公司 | 语音顺滑方法、装置、电子设备及计算机存储介质 |
US11875781B2 (en) * | 2020-08-31 | 2024-01-16 | Adobe Inc. | Audio-based media edit point selection |
CN112733654B (zh) * | 2020-12-31 | 2022-05-24 | 蚂蚁胜信(上海)信息技术有限公司 | 一种视频拆条的方法和装置 |
- 2021
- 2021-07-15 CN CN202110802038.0A patent/CN115623279A/zh active Pending
- 2022
- 2022-07-14 BR BR112023026041A patent/BR112023026041A2/pt unknown
- 2022-07-14 JP JP2023576228A patent/JP2024527483A/ja active Pending
- 2022-07-14 WO PCT/SG2022/050494 patent/WO2023287360A2/zh active Application Filing
- 2022-07-14 EP EP22842574.0A patent/EP4336854A4/en active Pending
- 2023
- 2023-12-11 US US18/535,891 patent/US20240105234A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN115623279A (zh) | 2023-01-17 |
EP4336854A2 (en) | 2024-03-13 |
EP4336854A4 (en) | 2024-09-25 |
WO2023287360A3 (zh) | 2023-04-13 |
BR112023026041A2 (pt) | 2024-03-05 |
US20240105234A1 (en) | 2024-03-28 |
JP2024527483A (ja) | 2024-07-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10614803B2 (en) | Wake-on-voice method, terminal and storage medium | |
WO2020098115A1 (zh) | 字幕添加方法、装置、电子设备及计算机可读存储介质 | |
CN108831437B (zh) | 一种歌声生成方法、装置、终端和存储介质 | |
CN112786006B (zh) | 语音合成方法、合成模型训练方法、装置、介质及设备 | |
US20240021202A1 (en) | Method and apparatus for recognizing voice, electronic device and medium | |
WO2023287360A2 (zh) | 多媒体处理方法、装置、电子设备及存储介质 | |
CN112927674B (zh) | 语音风格的迁移方法、装置、可读介质和电子设备 | |
CN112908292B (zh) | 文本的语音合成方法、装置、电子设备及存储介质 | |
CN111369971A (zh) | 语音合成方法、装置、存储介质和电子设备 | |
WO2019007308A1 (zh) | 语音播报方法及装置 | |
CN111489735B (zh) | 语音识别模型训练方法及装置 | |
CN111782576B (zh) | 背景音乐的生成方法、装置、可读介质、电子设备 | |
CN113257218B (zh) | 语音合成方法、装置、电子设备和存储介质 | |
JP6974421B2 (ja) | 音声認識方法及び装置 | |
CN111986655B (zh) | 音频内容识别方法、装置、设备和计算机可读介质 | |
WO2022037388A1 (zh) | 语音生成方法、装置、设备和计算机可读介质 | |
WO2023051246A1 (zh) | 视频录制方法、装置、设备及存储介质 | |
CN111477210A (zh) | 语音合成方法和装置 | |
CN112071287A (zh) | 用于生成歌谱的方法、装置、电子设备和计算机可读介质 | |
WO2022169417A1 (zh) | 语音相似度确定方法及设备、程序产品 | |
CN116189652A (zh) | 语音合成方法、装置、可读介质及电子设备 | |
CN113761865A (zh) | 声文重对齐及信息呈现方法、装置、电子设备和存储介质 | |
WO2021170094A1 (zh) | 用于信息交互的方法和装置 | |
CN111259181B (zh) | 用于展示信息、提供信息的方法和设备 | |
CN111445925A (zh) | 用于生成差异信息的方法和装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 2022842574 Country of ref document: EP Ref document number: 22842574.0 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 11202309454P Country of ref document: SG |
|
ENP | Entry into the national phase |
Ref document number: 2023576228 Country of ref document: JP Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 2022842574 Country of ref document: EP Effective date: 20231207 |
|
REG | Reference to national code |
Ref country code: BR Ref legal event code: B01A Ref document number: 112023026041 Country of ref document: BR |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22842574 Country of ref document: EP Kind code of ref document: A2 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 112023026041 Country of ref document: BR Kind code of ref document: A2 Effective date: 20231211 |