CN112487247A - Video processing method and video processing device - Google Patents


Info

Publication number: CN112487247A
Application number: CN202011381440.8A
Authority: CN (China)
Prior art keywords: target, video, voice, time interval, target video
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 曹岱 (Cao Dai)
Original and current assignee: Vivo Mobile Communication Shenzhen Co Ltd
Filing date: 2020-11-30
Publication date: 2021-03-12
Priority: CN202011381440.8A, priority date 2020-11-30

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features

Abstract

The application discloses a video processing method and a video processing apparatus, and belongs to the technical field of video processing. The video processing method comprises the following steps: acquiring a target video and a first target voice, wherein the target video comprises video frames and audio data corresponding to the video frames, and the first target voice comprises a target person's voice; determining a second target voice in the audio data through voiceprint recognition, wherein the second target voice comprises a sound matching the target person's voice; and processing the target video according to the second target voice to generate a target file. With the method and apparatus, the target person's voice can be determined quickly and accurately in the target video without manual analysis and judgment, and a target file associated with that voice is then obtained, which reduces the time consumed by the whole process and avoids omissions caused by human oversight.

Description

Video processing method and video processing device
Technical Field
The present application belongs to the field of video processing technologies, and in particular, relates to a video processing method and a video processing apparatus.
Background
Video editing is a technology for processing video files, and generally refers to editing operations such as cutting, splicing, adding text, adding pictures, and adding sound effects to a video file. Video editing techniques are now widely used in many aspects of daily life. For example, existing videos are edited into short, entertaining clips that are published on social websites for people to enjoy.
When a video file is processed with video editing techniques, an operator analyzes and judges the video content of the original video and then processes the video according to the results of that analysis, so as to obtain a video file that meets the operator's requirements.
However, manually analyzing and judging the video content of the original video is not only prone to omissions but also makes the whole process take too long.
Disclosure of Invention
An object of the embodiments of the present application is to provide a video processing method and a video processing apparatus, which can solve the prior-art problems that manually analyzing and judging video content during video processing is prone to omissions and is time-consuming.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a video processing method, where the video processing method includes:
acquiring a target video and a first target voice; wherein the target video comprises: a video frame and audio data corresponding to the video frame; the first target voice includes: a target person's voice;
determining a second target voice in the audio data through voiceprint recognition; wherein the second target voice includes a sound matching the target person's voice;
and processing the target video according to the second target voice to generate a target file.
In a second aspect, an embodiment of the present application provides a video processing apparatus, including:
the acquisition module is used for acquiring a target video and a first target voice; wherein the target video comprises: a video frame and audio data corresponding to the video frame; the first target voice includes: a target person's voice;
the voiceprint recognition module is used for determining a second target voice in the audio data through voiceprint recognition; wherein the second target voice includes a sound matching the target person's voice;
and the file generation module is used for processing the target video according to the second target voice to generate a target file.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the video processing method according to the first aspect.
In a fourth aspect, the present application provides a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the video processing method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the video processing method according to the first aspect.
In the embodiment of the application, a second target voice matching the first target voice can be determined in the target video through voiceprint recognition. Here, the sound included in the second target voice and the sound included in the first target voice both belong to the target person. The target video is then processed according to the second target voice to obtain a target file related to the target person's voice in the target video. The whole process requires no manual analysis and judgment: the target person's voice can be determined quickly and accurately in the target video, and a target file associated with that voice is then obtained, which reduces the time consumed by the whole process and avoids omissions caused by human oversight.
Drawings
Fig. 1 is a flowchart illustrating the steps of a video processing method according to an embodiment of the present application;
Fig. 2 is a schematic diagram illustrating a process of acquiring a target video according to an embodiment of the present application;
Fig. 3 is a schematic diagram illustrating a process of acquiring a first target voice according to an embodiment of the present application;
Fig. 4 is a display diagram of an interface for acquiring a first target voice from a video according to an embodiment of the present application;
Fig. 5 is a block diagram of a video processing apparatus according to an embodiment of the present application;
Fig. 6 is a first schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application;
Fig. 7 is a second schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the application can be practiced in orders other than those illustrated or described herein. Moreover, objects distinguished by "first", "second", and the like are usually of one type, and the number of such objects is not limited; for example, there may be one or more than one first object. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The video processing method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
As shown in fig. 1, a video processing method provided in an embodiment of the present application includes:
step 101, a target video and a first target voice are obtained.
In this step, the target video includes: video frames and audio data corresponding to the video frames. That is, when the target video is played, the corresponding audio data is played while the video frames are displayed; in other words, it is a video with both sound and pictures. When the target video is acquired, various modes can be provided for the user to select. As shown in fig. 2, a first interface 21 displayed by the display unit of the electronic device includes a select-video control, and after the user operates the select-video control, the display unit displays a second interface 22 including a local import control, a third-party import control and a video link control. Depending on which control the user operates in the second interface 22, the target video is obtained in a different way. For example, if the user operates the local import control, the target video is obtained locally from the electronic device; if the user operates the third-party import control, the target video is obtained from third-party software; and if the user operates the video link control, the user may enter the address of a video website, and the target video is obtained from the website indicated by the address. The manner of acquiring the target video is not limited to the three manners shown in fig. 2; any one or two of these manners may be provided instead.
The first target voice includes a target person's voice, which can be obtained by recording while the target person speaks. Of course, it is also possible to select an existing voice file or video file and intercept the first target voice from it. Likewise, when the first target voice is obtained, various ways may be provided for the user to select. As shown in fig. 3, a first interface 31 displayed by the display unit of the electronic device includes a select-sound-source control, and after the user operates it, the display unit displays a second interface 32 including a record-sound-source control and a select-from-video control. Depending on which control the user operates in the second interface 32, the first target voice is obtained in a different way. For example, if the user operates the record-sound-source control, the recording module of the electronic device is started, and the sound it records is used as the first target voice; preferably, the recording duration is greater than 30 seconds. If the user operates the select-from-video control, a plurality of videos are provided for the user to select from, and after the user selects a video, a start time and an end time entered by the user are received; the voice between the start time and the end time in the selected video is regarded as the first target voice, as shown in fig. 4. The number of first target voices may be one or at least two. The manner of acquiring the first target voice is not limited to the two manners shown in fig. 3; either one of them may be provided alone.
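As an illustrative sketch of this interception step (not part of the original disclosure), the audio between the user-entered start time and end time can be cut out of the selected video as follows, assuming pydub with ffmpeg available for decoding; the file names and times are hypothetical.

```python
# Minimal sketch: intercept the first target voice from a selected video
# between a user-entered start time and end time. Names/times are hypothetical.
from pydub import AudioSegment  # decoding assumes ffmpeg is on the PATH

def intercept_first_target_voice(video_path, start_s, end_s,
                                 out_path="first_target_voice.wav"):
    audio = AudioSegment.from_file(video_path)           # decode the audio track
    clip = audio[int(start_s * 1000):int(end_s * 1000)]  # pydub slices in milliseconds
    clip.export(out_path, format="wav")
    return out_path

# e.g. the user selected 0:15 to 0:50 in "movie.mp4":
# intercept_first_target_voice("movie.mp4", 15, 50)
```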
Step 102, determining a second target voice in the audio data through voiceprint recognition.
In this step, the second target voice includes a sound matching the target person's voice. Voiceprint recognition has two discrimination modes: speaker identification and speaker verification. The former judges which of several persons uttered a given piece of speech, while the latter confirms whether a given piece of speech was uttered by a particular person. The voiceprint recognition in the present application adopts the former mode. The second target speech determined here by voiceprint recognition includes a sound belonging to the same person as the target person's voice.
And 103, processing the target video according to the second target voice to generate a target file.
In this step, the target file may be a voice file or a video file; the video file may include only video frames, or may include both video frames and voice.
In the embodiment of the application, a second target voice matching the first target voice can be determined in the target video through voiceprint recognition. Here, the sound included in the second target voice and the sound included in the first target voice both belong to the target person. The target video is then processed according to the second target voice to obtain a target file related to the target person's voice in the target video. The whole process requires no manual analysis and judgment: the target person's voice can be determined quickly and accurately in the target video, and a target file associated with that voice is then obtained, which reduces the time consumed by the whole process and avoids omissions caused by human oversight.
Optionally, determining the second target speech in the audio data by voiceprint recognition comprises:
and determining a first voiceprint characteristic of the target character sound and a second voiceprint characteristic corresponding to different sounds in the audio data.
In this step, the voiceprint recognition technology can be used to obtain the voiceprint characteristics of the character voice, and the first voiceprint characteristics corresponding to the target character voice and the second voiceprint characteristics corresponding to different voices in the audio data are obtained.
Determining the sound corresponding to the target voiceprint feature in the audio data as a second target voice; wherein the target voiceprint features include a second voiceprint feature that matches the first voiceprint feature.
In this step, when the voiceprint features corresponding to the two voices are matched, it is indicated that the two voices are the voice of the same person. Thus, here the second target speech includes a sound belonging to the same person as the target person sound.
In the embodiment of the application, by using the voiceprint features of human voices, the target person's voice can be determined quickly and accurately in the audio data included in the target video. Here, the target person is the person indicated by the target person's voice.
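For illustration, the matching of voiceprint features can be sketched as comparing fixed-length speaker embeddings by cosine similarity. This is only a sketch under stated assumptions: extract_voiceprint() stands in for an unspecified voiceprint model, and the 0.75 threshold is an arbitrary illustrative value.

```python
# Sketch of voiceprint matching: embed the target person's voice and each
# candidate sound, then compare the embeddings by cosine similarity.
import numpy as np

def extract_voiceprint(wav, sample_rate):
    """Hypothetical: returns a fixed-length speaker embedding for the audio."""
    raise NotImplementedError

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_second_target_voice(first_target_wav, segments, sample_rate, thr=0.75):
    """segments: list of (start_s, end_s, wav) pieces of the audio data,
    one per detected sound. Returns the intervals whose voiceprint matches."""
    first_vp = extract_voiceprint(first_target_wav, sample_rate)  # first voiceprint feature
    matches = []
    for start_s, end_s, wav in segments:
        second_vp = extract_voiceprint(wav, sample_rate)          # second voiceprint feature
        if cosine(first_vp, second_vp) >= thr:                    # same speaker if similar enough
            matches.append((start_s, end_s))
    return matches
```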
Optionally, processing the target video according to the second target voice to generate a target file, including:
and determining the time interval of the second target voice in the target video.
In this step, the second target voice may be one continuous segment of speech or at least two discontinuous segments. The time interval here is the interval between the start time and the end time of each segment. For example, if the second target speech starts at 3 minutes 15 seconds and ends at 5 minutes 12 seconds of the target video, the time interval runs from 3 minutes 15 seconds to 5 minutes 12 seconds of the target video.
And intercepting the video frame of the target video according to the time interval to obtain a target video segment.
In this step, each time interval corresponds to one target video clip, and each target video clip is a video clip composed of video frames in the time interval corresponding to the target video clip in the target video.
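As an illustrative sketch, each time interval can be cut out of the target video with ffmpeg's stream copy; this is fast because nothing is re-encoded, though cuts then land on keyframes. The output names are hypothetical.

```python
# Sketch: intercept one target video clip per time interval with ffmpeg.
import subprocess

def cut_clips(video_path, intervals):
    clips = []
    for i, (start_s, end_s) in enumerate(intervals):
        out = f"target_clip_{i:03d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start_s), "-i", video_path,
             "-t", str(end_s - start_s),        # clip duration
             "-c", "copy", out],                # stream copy, no re-encode
            check=True)
        clips.append(out)
    return clips

# cut_clips("target_video.mp4", [(195.0, 312.0)])  # 3:15 to 5:12
```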
And removing the target element in the target video segment.
In this step, the target elements include at least one of: persons other than the first target person, and the target video background. The first target person includes the persons contained in the target video segments. Generally, there are multiple target video segments, and when every target video segment contains a certain person, that person is considered to be the person indicated by the sound included in the second target voice. The target video background is the background picture in the target video. After the persons other than the first target person and the target video background are removed, only the image of the first target person remains in the target video segment.
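The removal can be pictured per frame with a person-segmentation model; since the disclosure does not specify a segmentation method, segment_persons() and the person IDs below are hypothetical placeholders.

```python
# Sketch: keep only the first target person's pixels in each frame.
import numpy as np

def segment_persons(frame):
    """Hypothetical: returns a list of (person_id, boolean HxW mask) pairs,
    one per person detected in the frame."""
    raise NotImplementedError

def keep_only_first_target(frame, first_target_id):
    out = np.zeros_like(frame)               # removed persons/background stay black
    for person_id, mask in segment_persons(frame):
        if person_id == first_target_id:     # retain only the first target person
            out[mask] = frame[mask]
    return out
```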
And determining the target video clip with the target elements removed as a target file.
In this step, all the obtained target video segments can be displayed to the user for the user to select. After receiving an input of a certain target video segment from a user, the target video segment corresponding to the input is saved. Of course, all the target video segments may be spliced into one continuous video file, and the video file obtained after splicing is stored.
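Splicing the target video segments into one continuous file can be sketched with ffmpeg's concat demuxer, assuming the clips share codecs and parameters (which holds here, since they are stream-copied from one source video); file names are hypothetical.

```python
# Sketch: splice target video clips into one continuous target file.
import pathlib
import subprocess

def splice(clips, out="target_file.mp4"):
    listing = pathlib.Path("clips.txt")
    # concat demuxer input: one "file '<path>'" line per clip, in order
    listing.write_text("".join(f"file '{c}'\n" for c in clips))
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", str(listing), "-c", "copy", out], check=True)
    return out
```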
In the embodiment of the present application, the person indicated by the sound included in the second target voice is the same as the person indicated by the sound included in the first target voice; both are the target person. Through the time interval of the second target voice in the target video, a video clip containing the target person can be intercepted from the target video and used as video material by the user. By further removing at least one of the persons other than the first target person and the target video background from the target video segment, the objects retained in the final target video segment can be freely chosen.
Optionally, processing the target video according to the second target voice to generate a target file, including:
and determining the time interval of the second target voice in the target video.
In this step, the second target voice may be one continuous segment of speech or at least two discontinuous segments. The time interval here is the interval between the start time and the end time of each segment. For example, if the second target speech starts at 3 minutes 15 seconds and ends at 5 minutes 12 seconds of the target video, the time interval runs from 3 minutes 15 seconds to 5 minutes 12 seconds of the target video.
And intercepting the video frame of the target video according to the time interval to obtain a first target video segment.
In this step, each time interval corresponds to a first target video segment, and each first target video segment is a video segment composed of video frames in the time interval corresponding to the first target video segment in the target video.
A first target person included in each of the first target video segments is determined.
In this step, the number of the first target video segments is usually large, and when all the first target video segments include a certain person, the person is considered to be the person indicated by the sound included in the second target voice, that is, the first target person.
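A minimal sketch of this determination: the first target person is whoever appears in every first target video segment, i.e. the intersection of the per-segment person sets; persons_in_segment() is a hypothetical detector.

```python
# Sketch: the first target person appears in all first target video segments.
def persons_in_segment(clip_path):
    """Hypothetical: returns the set of person IDs detected in the clip."""
    raise NotImplementedError

def first_target_persons(clip_paths):
    ids = [persons_in_segment(p) for p in clip_paths]
    return set.intersection(*ids) if ids else set()
```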
And intercepting the video frames containing the first target person in the residual video to obtain a second target video segment.
In this step, the remaining video includes the portions of the target video other than the first target video segments. There may be scenes in the target video in which the first target person appears but does not speak; the second target video segments are the video segments corresponding to such scenes.
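Locating these segments can be sketched as grouping consecutive detections of the first target person in the remaining video into intervals; the per-frame detection results and the 0.5-second grouping gap are assumed for illustration.

```python
# Sketch: group consecutive frames containing the first target person into
# second target video segments (intervals in seconds).
def second_target_intervals(frame_times, frame_contains, max_gap_s=0.5):
    """frame_times[i]: timestamp of frame i in the remaining video;
    frame_contains[i]: whether the first target person was detected in it."""
    intervals, start, last = [], None, None
    for t, hit in zip(frame_times, frame_contains):
        if hit:
            if start is None:
                start = t                      # open a new segment
            last = t
        elif start is not None and t - last > max_gap_s:
            intervals.append((start, last))    # close the segment after a gap
            start = None
    if start is not None:
        intervals.append((start, last))
    return intervals
```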
And determining the first target video clip and the second target video clip as target files.
In this step, all the obtained first target video segments and second target video segments can be displayed to the user for selection. After receiving the user's input on a certain first or second target video segment, the segment corresponding to the input is saved. Of course, all the first target video segments and second target video segments may instead be spliced into one continuous video file, and the spliced video file is stored.
In the embodiment of the present application, the person indicated by the sound included in the second target voice is the same as the person indicated by the target person's voice included in the first target voice; both are the target person. Through the time interval of the second target voice in the target video, a first target video segment containing the target person can be intercepted from the target video; through the persons contained in the first target video segments, a second target video segment that also contains the target person can be intercepted from the remaining video. The first target video segment and the second target video segment are used as video material by the user.
Optionally, processing the target video according to the second target voice to generate a target file, including:
a first time interval of a second target voice in the target video is determined.
In this step, the second target voice may be one continuous segment of speech or at least two discontinuous segments. The first time interval here is the interval between the start time and the end time of each segment. For example, if the second target speech starts at 3 minutes 15 seconds and ends at 5 minutes 12 seconds of the target video, the first time interval runs from 3 minutes 15 seconds to 5 minutes 12 seconds of the target video.
A second time interval of the third target voice in the target video is determined.
In this step, the time interval between the third target voice and the second target voice in the target video is smaller than a preset time threshold. When this condition is satisfied, the third target voice and the second target voice can be considered to be part of an interaction, for example a conversation. The third target voice may be one segment of speech or a plurality of discontinuous segments. Preferably, each segment of speech contains the voice of only one person.
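The proximity condition can be sketched as an interval-gap check: a candidate voice segment qualifies as third target voice when its gap to some segment of the second target voice is below the preset threshold; the 2-second default below is an assumed value.

```python
# Sketch: select third-target-voice segments by their gap to the second target voice.
def gap(a, b):
    """Gap in seconds between two (start, end) intervals; 0 if they overlap."""
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))

def third_target_intervals(candidates, second_target, threshold_s=2.0):
    return [c for c in candidates
            if any(gap(c, s) < threshold_s for s in second_target)]
```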
And intercepting the video frame of the target video according to the first time interval and the second time interval to obtain a target video clip.
In this step, each first time interval and each second time interval respectively correspond to one target video clip. That is, each target video segment is a video segment composed of video frames in a time interval corresponding to the target video segment in the target video. The time interval here may be the first time interval or the second time interval.
And determining the target video clip as a target file.
In the embodiment of the present application, the person indicated by the sound included in the second target voice is the same as the person indicated by the target person's voice included in the first target voice; both are the target person. Through the first time interval of the second target voice in the target video, video clips containing the target person can be intercepted from the target video. Since the time interval between the third target voice and the second target voice in the target video is smaller than the preset time threshold, video clips of the persons interacting with the target person can be intercepted through the second time interval of the third target voice in the target video. The intercepted video clips are used as video material by the user.
Optionally, after the step of capturing the video frame of the target video according to the first time interval and the second time interval to obtain the target video clip, the video processing method further includes:
and removing the target element in the target video segment.
In this step, the target elements include persons other than the first target person and the second target person. The first target person includes the persons contained in the video frames intercepted according to the first time interval, and the second target person includes the persons contained in the video frames intercepted according to the second time interval. Of course, the target elements may also include the target video background.
Determining a target video clip as a target file, comprising:
and determining the target video clip with the target elements removed as a target file.
In the embodiment of the application, people except the first target person and the second target person in the target video segment can be removed, so that objects reserved in the final target video segment can be freely selected.
Optionally, generating the target file according to the audio data associated with the second target voice includes:
and intercepting a second target voice in the target video to generate a target file.
Here, the target file retains only the target person's voice from the target video. The person indicated by the sound included in the second target voice is the same as the person indicated by the target person's voice included in the first target voice; both are the target person. Preferably, before the target file is generated, a third target voice may also be intercepted from the target video, where the time interval between the third target voice and the second target voice in the target video is smaller than a preset time threshold. When this condition is satisfied, the third target voice and the second target voice can be considered to be part of an interaction. The third target voice may be one segment of speech or a plurality of discontinuous segments; preferably, each segment contains the voice of only one person. The target file is then generated according to the second target voice and the third target voice.
In the embodiment of the present application, the person indicated by the sound included in the second target voice is the same as the person indicated by the target person's voice included in the first target voice; both are the target person. By intercepting the second target voice, a target file containing only the target person's voice can be generated and used as material by the user.
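Generating the audio-only target file can be sketched as one ffmpeg extraction per interval of the second target voice, with -vn dropping the video stream; output names are hypothetical, and the resulting pieces could then be joined the same way as the video clips above.

```python
# Sketch: intercept the second target voice as audio-only files.
import subprocess

def extract_voice(video_path, intervals, prefix="second_target_voice"):
    outs = []
    for i, (start_s, end_s) in enumerate(intervals):
        out = f"{prefix}_{i:03d}.wav"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start_s), "-i", video_path,
             "-t", str(end_s - start_s),
             "-vn", out],                      # -vn: drop the video stream
            check=True)
        outs.append(out)
    return outs
```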
It should be noted that, in the video processing method provided in the embodiment of the present application, the execution subject may be a video processing apparatus, or a control module in the video processing apparatus for executing the video processing method. In the embodiment of the present application, a video processing apparatus executing a video processing method is taken as an example, and the video processing apparatus provided in the embodiment of the present application is described.
As shown in fig. 5, an embodiment of the present application further provides a video processing apparatus, including:
an obtaining module 51, configured to obtain a target video and a first target voice; wherein the target video includes: a video frame and audio data corresponding to the video frame; the first target voice includes: a target person's voice;
a voiceprint recognition module 52, configured to determine a second target voice in the audio data through voiceprint recognition; wherein the second target voice comprises a sound matching the target person's voice;
and the file generating module 53 is configured to process the target video according to the second target voice to generate a target file.
Optionally, the voiceprint recognition module 52 includes:
the voice print unit is used for determining a first voice print characteristic of the target character sound and a second voice print characteristic corresponding to different sounds in the audio data;
the recognition unit is used for determining the sound corresponding to the target voiceprint feature in the audio data as a second target voice; wherein the target voiceprint features include a second voiceprint feature that matches the first voiceprint feature.
Optionally, the file generating module 53 includes:
the first determining unit is used for determining a time interval of the second target voice in the target video;
the intercepting unit is used for intercepting video frames of the target video according to time intervals to obtain target video clips;
the removing module is used for removing the target elements in the target video clip; the target elements include: at least one of persons other than the first target person and the target video background, the first target person including: persons contained in the target video clips;
and the second determining unit is specifically configured to determine the target video segment with the target element removed as the target file.
Optionally, the file generating module 53 includes:
the first determining unit is used for determining a time interval of the second target voice in the target video;
the first intercepting unit is used for intercepting video frames of a target video according to a time interval to obtain a first target video segment;
the second determining unit is used for determining first target persons contained in the first target video clips;
the second intercepting unit is used for intercepting video frames containing the first target person in the remaining video to obtain a second target video segment; wherein the remaining video includes the video in the target video other than the first target video segment;
and the third determining unit is used for determining the first target video clip and the second target video clip as target files.
Optionally, the file generating module 53 includes:
the first determining unit is used for determining a first time interval of the second target voice in the target video;
a second determining unit, configured to determine a second time interval of a third target voice in the target video; the time interval between the third target voice and the second target voice in the target video is smaller than a preset time threshold;
the intercepting unit is used for intercepting video frames of the target video according to a first time interval and a second time interval to obtain a target video clip;
and a third determining unit, configured to determine the target video segment as the target file.
Optionally, the video processing apparatus further comprises:
the removing module is used for removing the target elements in the target video clip; the target elements include: persons other than the first target person and the second target person, the first target person including: persons contained in the video frames intercepted according to the first time interval, and the second target person including: persons contained in the video frames intercepted according to the second time interval;
and the third determining unit is specifically configured to determine the target video segment with the target element removed as the target file.
Optionally, the file generating module 53 is specifically configured to intercept the second target voice in the target video, and generate the target file.
In the embodiment of the application, a second target voice matching the first target voice can be determined in the target video through voiceprint recognition. Here, the sound included in the second target voice and the sound included in the first target voice both belong to the target person. The target video is then processed according to the second target voice to obtain a target file related to the target person's voice in the target video. The whole process requires no manual analysis and judgment: the target person's voice can be determined quickly and accurately in the target video, and a target file associated with that voice is then obtained, which reduces the time consumed by the whole process and avoids omissions caused by human oversight.
The video processing apparatus in the embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The video processing apparatus in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system (Android), an iOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.
The video processing apparatus provided in the embodiment of the present application can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Optionally, as shown in fig. 6, an electronic device 600 is further provided in this embodiment of the present application, and includes a processor 601, a memory 602, and a program or an instruction stored in the memory 602 and executable on the processor 601, where the program or the instruction is executed by the processor 601 to implement each process of the above-mentioned video processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 7 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 700 includes, but is not limited to: a radio frequency unit 701, a network module 702, an audio output unit 703, an input unit 704, a sensor 705, a display unit 706, a user input unit 707, an interface unit 708, a memory 709, and a processor 710.
Those skilled in the art will appreciate that the electronic device 700 may also include a power supply (e.g., a battery) for powering the various components, and the power supply may be logically coupled to the processor 710 via a power management system, such that the functions of managing charging, discharging, and power consumption may be performed via the power management system. The electronic device structure shown in fig. 7 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
A processor 710 for obtaining a target video and a first target voice; wherein the target video includes: a video frame and audio data corresponding to the video frame; the first target voice includes: a target person's voice;
a processor 710 further configured to determine a second target voice in the audio data through voiceprint recognition; wherein the second target voice comprises a sound matching the target person's voice;
and the processor 710 is further configured to process the target video according to the second target voice to generate a target file.
In the embodiment of the application, a second target voice matching the first target voice can be determined in the target video through voiceprint recognition. Here, the sound included in the second target voice and the sound included in the first target voice both belong to the target person. The target video is then processed according to the second target voice to obtain a target file related to the target person's voice in the target video. The whole process requires no manual analysis and judgment: the target person's voice can be determined quickly and accurately in the target video, and a target file associated with that voice is then obtained, which reduces the time consumed by the whole process and avoids omissions caused by human oversight.
It should be understood that in the embodiment of the present application, the input Unit 704 may include a Graphics Processing Unit (GPU) 7041 and a microphone 7042, and the Graphics Processing Unit 7041 processes image data of still pictures or videos obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The display unit 706 may include a display panel 7061, and the display panel 7061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 707 includes a touch panel 7071 and other input devices 7072. The touch panel 7071 is also referred to as a touch screen. The touch panel 7071 may include two parts of a touch detection device and a touch controller. Other input devices 7072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. Memory 709 may be used to store software programs as well as various data, including but not limited to applications and operating systems. Processor 710 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 710.
The embodiments of the present application further provide a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the video processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the above video processing method embodiment, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved, e.g., the methods described may be performed in an order different than that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A video processing method, characterized in that the video processing method comprises:
acquiring a target video and a first target voice; wherein the target video comprises: a video frame and audio data corresponding to the video frame; the first target voice includes: a target person's voice;
determining a second target voice in the audio data through voiceprint recognition; wherein the second target voice includes a sound matching the target person's voice;
and processing the target video according to the second target voice to generate a target file.
2. The video processing method according to claim 1, wherein processing the target video according to the second target voice to generate a target file comprises:
determining a time interval of the second target voice in the target video;
intercepting the video frame of the target video according to the time interval to obtain a target video segment;
removing target elements in the target video segment; the target elements include: at least one of a person other than a first target person and a target video background, the first target person comprising: persons contained in the target video segments;
determining the target video clip with the target element removed as the target file.
3. The video processing method according to claim 1, wherein processing the target video according to the second target voice to generate a target file comprises:
determining a time interval of the second target voice in the target video;
intercepting the video frame of the target video according to the time interval to obtain a first target video segment;
determining first target persons contained in the first target video clips;
intercepting video frames containing the first target person in the remaining video to obtain a second target video segment; wherein the remaining video comprises video of the target video other than the first target video segment;
determining the first target video clip and the second target video clip as target files.
4. The video processing method according to claim 1, wherein processing the target video according to the second target voice to generate a target file comprises:
determining a first time interval of the second target voice in the target video;
determining a second time interval of a third target voice in the target video; wherein the time interval between the third target voice and the second target voice in the target video is smaller than a preset time threshold;
intercepting the video frame of the target video according to the first time interval and the second time interval to obtain a target video clip;
and determining the target video clip as the target file.
5. The video processing method according to claim 4, wherein after the step of capturing the video frames of the target video in the first time interval and the second time interval to obtain the target video segment, the video processing method further comprises:
removing target elements in the target video segment; the target elements include: persons other than the first target person and the second target person, the first target person including: persons contained in the video frames intercepted according to the first time interval, and the second target person including: persons contained in the video frames intercepted according to the second time interval;
the determining the target video segment as the target file comprises:
determining the target video clip with the target element removed as the target file.
6. A video processing apparatus, characterized in that the video processing apparatus comprises:
the acquisition module is used for acquiring a target video and a first target voice; wherein the target video comprises: a video frame and audio data corresponding to the video frame; the first target voice includes: a target person's voice;
the voiceprint recognition module is used for determining a second target voice in the audio data through voiceprint recognition; wherein the second target voice includes a sound matching the target person's voice;
and the file generation module is used for processing the target video according to the second target voice to generate a target file.
7. The video processing apparatus according to claim 6, wherein the file generation module comprises:
a first determining unit, configured to determine a time interval of the second target voice in the target video;
the intercepting unit is used for intercepting the video frame of the target video according to the time interval to obtain a target video segment;
a removing module, configured to remove a target element in the target video segment; the target elements include: at least one of a person other than a first target person and a target video background, the first target person comprising: persons contained in the target video segments;
a second determining unit, configured to determine the target video segment with the target element removed as the target file.
8. The video processing apparatus according to claim 6, wherein the file generation module comprises:
a first determining unit, configured to determine a time interval of the second target voice in the target video;
the first intercepting unit is used for intercepting the video frame of the target video according to the time interval to obtain a first target video segment;
a second determining unit, configured to determine first target persons included in the first target video segments;
the second intercepting unit is used for intercepting video frames containing the first target person in the remaining video to obtain a second target video segment; wherein the remaining video comprises video of the target video other than the first target video segment;
a third determining unit, configured to determine the first target video segment and the second target video segment as target files.
9. The video processing apparatus according to claim 6, wherein the file generation module comprises:
a first determining unit, configured to determine a first time interval of the second target voice in the target video;
a second determining unit, configured to determine a second time interval of a third target voice in the target video; wherein the time interval between the third target voice and the second target voice in the target video is smaller than a preset time threshold;
the intercepting unit is used for intercepting the video frames of the target video according to the first time interval and the second time interval to obtain a target video clip;
a third determining unit, configured to determine the target video segment as the target file.
10. The video processing device according to claim 9, wherein the video processing device further comprises:
a removing module, configured to remove a target element in the target video segment; the target elements include: persons other than the first target person and the second target person, the first target person including: persons contained in the video frames intercepted according to the first time interval, and the second target person including: persons contained in the video frames intercepted according to the second time interval;
the third determining unit is specifically configured to determine the target video segment with the target element removed as the target file.
CN202011381440.8A, filed 2020-11-30 (priority date 2020-11-30): Video processing method and video processing device. Status: Pending. Publication: CN112487247A.

Priority Applications (1)

CN202011381440.8A, priority date 2020-11-30, filing date 2020-11-30: Video processing method and video processing device


Publications (1)

CN112487247A (en), published 2021-03-12

Family

ID=74938406

Family Applications (1)

CN202011381440.8A (status: Pending; publication: CN112487247A (en)), priority date 2020-11-30, filing date 2020-11-30: Video processing method and video processing device

Country Status (1)

Country: CN; publication: CN112487247A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
WO2016107154A1 * (中兴通讯股份有限公司), priority date 2014-12-29, published 2016-07-07: Method and apparatus for generating health report
CN109905764A * (广州国音智能科技有限公司), priority date 2019-03-21, published 2019-06-18: Target person voice intercept method and device in a kind of video
CN111785279A * (北京奇艺世纪科技有限公司), priority date 2020-05-18, published 2020-10-16: Video speaker identification method and device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
CN113596572A * (Oppo广东移动通信有限公司), priority date 2021-07-28, published 2021-11-02: Voice recognition method and device, storage medium and electronic equipment


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination