CN114299944B - Video processing method, system, device and storage medium

Video processing method, system, device and storage medium

Info

Publication number
CN114299944B
CN114299944B
Authority
CN
China
Prior art keywords
video
determining
processed
frame
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111494011.6A
Other languages
Chinese (zh)
Other versions
CN114299944A (en)
Inventor
郝德禄
肖冠正
甘心
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202111494011.6A priority Critical patent/CN114299944B/en
Publication of CN114299944A publication Critical patent/CN114299944A/en
Application granted granted Critical
Publication of CN114299944B publication Critical patent/CN114299944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application realizes portrait segmentation of the target person who speaks most frequently in a video to be processed, determines the target person by means of voice recognition, face target detection, lip shape recognition, and the like, re-identifies or tracks the target person according to the similarity between two adjacent frames, improves the accuracy of target person recognition, and finally obtains a target video containing only the target person. The embodiment of the application can be widely applied to scenes such as portrait cutout and beautification, photo/video background replacement, certificate photo production, and privacy protection.

Description

Video processing method, system, device and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a video processing method, system, apparatus, and storage medium.
Background
With the continuous development of multimedia information technology, more and more information is presented with video as the carrier. In order to obtain specified information in a video (such as portrait information), a dynamic video needs to be subjected to portrait segmentation; for example, in a street-view interview video containing multiple people, the host needs to be segmented out and other passers-by ignored. Because of shot switching, scene jumps, and the like, the related art has difficulty accurately capturing the designated person in the current frame, and therefore has difficulty completing accurate portrait segmentation of the video.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art. Therefore, the application provides a video processing method, a system, a device and a storage medium.
In a first aspect, an embodiment of the present application provides a video processing method, including: determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of a target character, and the target character is the character with the largest speaking frequency in the video to be processed; determining a plurality of first face areas according to the video to be processed; determining a plurality of second texts according to the plurality of first face regions; the second text is determined after lip language recognition is carried out on the first face area; determining a plurality of second face regions according to the first text, the second text and the first face region, wherein the second face regions are face regions corresponding to the target person; extracting features of each frame in the video to be processed to obtain a feature matrix corresponding to each frame of image; determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame; determining a target character frame in the video to be processed according to the similarity of the second face area and each frame; the target character frame is a position frame corresponding to the target character; and carrying out portrait segmentation processing on the target character frame to determine a target video.
Optionally, the determining a first text according to the acquired video to be processed includes: determining the audio to be processed corresponding to the video to be processed; performing voice recognition on the audio to be processed to determine a recognition text; extracting the voice spectrum characteristics of the audio to be processed, and determining voice spectrum information; classifying the audio to be processed according to the speech spectrum information, and determining a target audio; wherein the target audio is the audio corresponding to the target character; and determining the first text according to the target audio and the recognition text.
Optionally, the determining, according to the first text, the second text, and the first face region, a plurality of second face regions includes: calculating an edit distance between the first text and the second text; and when the editing distance is lower than or equal to a first threshold value, determining that the first face area corresponding to the current first text is the second face area.
Optionally, the performing feature extraction on each frame in the video to be processed to obtain a feature matrix corresponding to each frame of image includes: extracting a characteristic diagram corresponding to each frame of image in the video to be processed through a skeleton network; determining a coding matrix according to the characteristic diagram; and carrying out global cross-channel fusion on the coding matrix to determine the feature matrix.
Optionally, the determining, according to the feature matrix of each frame, a similarity between any two adjacent frames in the video to be processed includes: and calculating the similarity of the two characteristic matrixes corresponding to any two adjacent frames through a twin network.
Optionally, the determining a target person frame in the video to be processed according to the similarity between the second face region and each frame includes: when the similarity is higher than a second threshold value, performing target tracking on the target person according to the second face area, and determining the target person frame; and when the similarity is lower than or equal to the second threshold, re-identifying the target person according to the first face area and the second face area, and determining the target person frame.
In a second aspect, an embodiment of the present application provides a video processing system, which includes a first module, a second module, a third module, a fourth module, a fifth module, a sixth module, a seventh module, and an eighth module; the first module is used for determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of a target person, and the target person is the person with the largest speaking frequency in the video to be processed; the second module is used for determining a plurality of first face areas according to the video to be processed; the third module is used for determining a plurality of second texts according to the plurality of first face areas; the second text is determined after lip language recognition is carried out on the first face area; the fourth module is configured to determine a plurality of second face regions according to the first text, the second text, and the first face region, where the second face regions are face regions corresponding to the target person; the fifth module is used for extracting features of each frame in the video to be processed and acquiring a feature matrix corresponding to each frame of image; the sixth module is configured to determine a similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame; the seventh module is used for determining a target character frame in the video to be processed according to the second face area and the similarity of each frame; the target character frame is a position frame corresponding to the target character; the eighth module is configured to perform portrait segmentation processing on the target person frame to determine a target video.
In a third aspect, an embodiment of the present application provides an apparatus, including: at least one processor; at least one memory for storing at least one program; when executed by the at least one processor, cause the at least one processor to implement the video processing method of the first aspect.
In a fourth aspect, the present application provides a computer storage medium, in which a program executable by a processor is stored, and when the program executable by the processor is executed by the processor, the program is used for implementing the video processing method according to the first aspect.
The beneficial effects of the embodiment of the application are as follows: determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of the target person, and the target person is the person with the largest speaking frequency in the video to be processed; determining a plurality of first face areas according to a video to be processed; determining a plurality of second texts according to the plurality of first face regions; the second text is determined after lip language recognition is carried out on the first face area; determining a plurality of second face areas according to the first text, the second text and the first face area, wherein the second face areas are face areas corresponding to the target person; extracting features of each frame in a video to be processed to obtain a feature matrix corresponding to each frame of image; determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame; determining a target character frame in the video to be processed according to the similarity of the second face area and each frame; the target character frame is a position frame corresponding to the target character; and performing portrait segmentation processing on the target character frame to determine a target video. According to the embodiment of the application, the target person with the largest speaking frequency is segmented in the video to be processed, the target person is determined through modes such as voice recognition, face target detection and lip recognition, re-recognition or target tracking is carried out on two adjacent frames according to the similarity, the accuracy of target person recognition is improved, and finally the target video only containing the target person is obtained. The embodiment of the application can be widely applied to scenes such as portrait cutout beautification, photo/video background replacement, certificate photo production, privacy protection and the like.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
Fig. 1 is a flowchart illustrating steps of a video processing method according to an embodiment of the present application;
fig. 2 is a flowchart of steps provided by an embodiment of the present application for obtaining a first text from a video to be processed;
fig. 3 is a flowchart of a step of obtaining a feature matrix corresponding to each frame of image according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a video processing system provided by an embodiment of the present application;
fig. 5 is a schematic diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional block divisions are provided in the system drawings and logical orders are shown in the flowcharts, in some cases, the steps shown and described may be performed in different orders than the block divisions in the systems or in the flowcharts. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The embodiments of the present application will be further explained with reference to the drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a video processing method according to an embodiment of the present application, including, but not limited to, steps S100 to S170:
s100, determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of the target person, and the target person is the person with the largest speaking frequency in the video to be processed;
specifically, taking a video to be processed as a video for street view interview as an example, in this type of video, the main role is a host, and since the host generally refers to people with a relatively large number of times of speaking, if the portrait of the host is to be segmented in the video, the host to be segmented is called a target person, and the target person is a person with the largest number of times of speaking in the video to be processed. The text corresponding to all the speech contents of the target person in the video to be processed is referred to as the first text in the embodiment of the present application. The step of obtaining the first text from the video to be processed will be explained in the following.
Referring to fig. 2, fig. 2 is a flowchart of steps provided by an embodiment of the present application to obtain a first text from a video to be processed, where the steps include, but are not limited to, steps S101 to S105:
s101, determining to-be-processed audio corresponding to-be-processed video;
specifically, audio separation is performed on the video to be processed to obtain the audio to be processed corresponding to the video to be processed. In the above, the video to be processed may be a street view interview video, and then the audio to be processed may be subjected to preprocessing such as denoising and background sound elimination according to actual needs, so as to obtain a clearer audio to be processed.
S102, performing voice recognition on the audio to be processed, and determining a recognition text;
specifically, the speech recognition is performed on the audio to be processed, and specifically, the speech recognition may be performed by using an HMM method, a neural network method, and the like in the related art, so as to obtain a recognition text corresponding to the audio to be processed.
S103, extracting voice spectrum characteristics of the audio to be processed, and determining voice spectrum information;
Specifically, sound has three characteristics: loudness, pitch, and timbre. These characteristics can be extracted from the spectral features of the sound wave's amplitude, frequency, and waveform. Therefore, in this step, speech spectrum features need to be extracted from the audio to be processed, for example: the audio to be processed is divided into a plurality of segments according to the pauses between sentences, and the speech spectrum features of each segment are extracted to obtain multiple segments of speech spectrum information.
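As an illustrative sketch only (assuming the librosa and NumPy libraries; the 16 kHz sampling rate, the pause threshold, and the MFCC settings are assumptions, not values from the patent), the pause-based segmentation and spectral feature extraction described above could look as follows:

    # Sketch: split the separated audio on pauses and compute one spectral
    # (MFCC) vector per segment; settings are illustrative assumptions.
    import librosa
    import numpy as np

    def extract_segment_features(audio_path, n_mfcc=20, top_db=30):
        y, sr = librosa.load(audio_path, sr=16000)
        intervals = librosa.effects.split(y, top_db=top_db)   # pause-based segmentation
        segments, features = [], []
        for start, end in intervals:
            mfcc = librosa.feature.mfcc(y=y[start:end], sr=sr, n_mfcc=n_mfcc)
            segments.append((start / sr, end / sr))            # segment start/end in seconds
            features.append(mfcc.mean(axis=1))                 # one feature vector per segment
        return segments, np.stack(features)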
S104, classifying the audio to be processed according to the speech spectrum information, and determining a target audio; the target audio is the audio of the corresponding target character;
Specifically, speech spectrum information with similar characteristics is grouped by a clustering method such as k-means; speech spectrum information in the same class can be regarded as representing the same speaker. As described in step S100, the target person is the person who speaks most frequently, so the target audio corresponding to the target person can be determined according to the number of pieces of speech spectrum information in each class.
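A minimal sketch of this classification step, assuming scikit-learn's k-means and an assumed number of speakers (the patent does not fix either choice):

    # Sketch: cluster the per-segment spectral vectors and treat the largest
    # cluster as the target (most frequently speaking) person.
    from sklearn.cluster import KMeans
    import numpy as np

    def find_target_segments(segments, features, n_speakers=3):
        labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(features)
        target_label = np.bincount(labels).argmax()            # class with the most segments
        return [seg for seg, lab in zip(segments, labels) if lab == target_label]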
S105, determining a first text according to the target audio and the identification text;
Specifically, once the target audio corresponding to the target person has been determined in step S104, the first text can be determined from the recognition text according to the target audio. For example, the recognition text at the positions corresponding to the start and end time points of the target audio within the audio to be processed may be determined as the first text. According to step S100, the first text is the text corresponding to the speech content of the target person.
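A hedged sketch of this alignment step, assuming the speech recognizer returns (start, end, text) triples with timestamps, which the patent does not specify:

    # Sketch: keep the recognized-text pieces whose timestamps overlap the audio
    # segments attributed to the target person; the data layout is an assumption.
    def select_first_text(recognized, target_segments):
        """recognized: [(start, end, text)]; target_segments: [(start, end)] in seconds."""
        picked = [text for start, end, text in recognized
                  if any(start < e and end > s for s, e in target_segments)]
        return " ".join(picked)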
Through steps S101 to S105, the embodiment of the present application provides a step of acquiring a first text corresponding to a target video through a video to be processed.
Step S100 has been described through steps S101-S105, and step S110 is described.
S110, determining a plurality of first face areas according to a video to be processed;
Specifically, target detection is performed on the faces in the video to be processed. For example, each frame of the video to be processed may be processed by a face detection algorithm such as S3FD to determine the areas in which faces are located; these areas are referred to as first face areas.
It is understood that since there may be a plurality of faces in a picture, there may be a plurality of first face regions in each frame image.
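As a rough illustration only — the patent names S3FD, but this sketch substitutes OpenCV's stock Haar-cascade detector as a stand-in — per-frame detection of the first face areas could look like:

    # Sketch: detect face boxes in one frame; the Haar cascade is a stand-in
    # for the S3FD detector named in the description.
    import cv2

    _FACE_CASCADE = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_first_face_regions(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        # returns (x, y, w, h) boxes; there may be several faces per frame
        return list(_FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))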
S120, determining a plurality of second texts according to the first face areas; the second text is determined after lip language recognition is carried out on the first face area;
Specifically, lip language recognition is performed on the first face areas of each frame image, and the text content corresponding to each first face area is determined from the lip shapes. For example, a lip language recognition network with an LSTM as its core is constructed: the images of a first face area are fed into a convolutional layer to obtain lip feature information, the feature information is fed into an LSTM layer to obtain the time-domain information of the sequence, and the time-domain information is fed into a multi-layer perceptron; finally, the text contents corresponding to the first face areas are obtained through a softmax classifier, and these text contents are determined to be the second texts.
It should be noted that, in this step, lip language recognition is performed on all the first face areas of each frame, so that the obtained second text also corresponds to all the first face areas, and the second text includes text contents corresponding to the speech contents of the target person and text contents corresponding to the speech contents of other persons in the video to be processed except the target person.
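A minimal PyTorch sketch of the convolution → LSTM → multi-layer perceptron → softmax structure described above; the layer sizes and vocabulary size are assumptions, not values from the patent:

    # Sketch: per-frame convolutional lip features, an LSTM over time, and an
    # MLP + softmax classifier producing token probabilities per time step.
    import torch
    import torch.nn as nn

    class LipReader(nn.Module):
        def __init__(self, vocab_size=1000, hidden=256):
            super().__init__()
            self.conv = nn.Sequential(                          # per-frame lip features
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.lstm = nn.LSTM(64, hidden, batch_first=True)   # time-domain information
            self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, vocab_size))

        def forward(self, clips):                               # clips: (B, T, 3, H, W)
            b, t = clips.shape[:2]
            feats = self.conv(clips.flatten(0, 1)).flatten(1)   # (B*T, 64)
            seq, _ = self.lstm(feats.view(b, t, -1))            # (B, T, hidden)
            return self.mlp(seq).softmax(dim=-1)                # (B, T, vocab_size)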
S130, determining a plurality of second face areas according to the first text, the second text and the first face area, wherein the second face areas are face areas corresponding to the target person;
Specifically, the second face regions can be determined from the first face regions by means of the first text and the second text. For example, an edit distance between the first text and the second text is first calculated; the edit distance is an index measuring how similar two sequences are. By calculating the edit distance between the first text obtained by speech recognition and the second text obtained by lip recognition, the content of the second text closest to the first text can be found: the first text and the second text are divided into a plurality of paragraphs, and the edit distance between corresponding paragraphs is calculated. When the edit distance is lower than or equal to a preset first threshold, the current paragraph of the first text is very similar to the current paragraph of the second text. Since the second text has a correspondence with the first face regions, the face regions matching the first text can thus be determined among the first face regions, and these face regions are determined to be the second face regions. Because the first text is the text content spoken by the target person, the first text corresponds to the second face regions; in other words, the second face regions are the face regions representing the target person.
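A small sketch of the paragraph-level edit-distance comparison, with the first threshold chosen arbitrarily for illustration:

    # Sketch: Levenshtein edit distance between a paragraph of the first text and
    # a paragraph of the second text; the threshold value is an assumption.
    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                     # deletion
                               cur[j - 1] + 1,                  # insertion
                               prev[j - 1] + (ca != cb)))       # substitution
            prev = cur
        return prev[-1]

    def matches_target(first_para: str, second_para: str, first_threshold: int = 5) -> bool:
        return edit_distance(first_para, second_para) <= first_threshold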
S140, extracting features of each frame in the video to be processed to obtain a feature matrix corresponding to each frame of image;
Specifically, as mentioned above, one reason why it is difficult in the related art to segment the portrait of a designated target person throughout a video is that the video may contain scene changes, shot switches, and similar factors; when these occur, the characteristics of the video change, and the position and size of the person may change as well. In this step, features therefore need to be extracted from every frame of the video to be processed in order to locate the points at which scene changes and shot switches occur.
Referring to fig. 3, fig. 3 is a flowchart of steps of acquiring a feature matrix corresponding to each frame of image according to the embodiment of the present application, where the method includes, but is not limited to, steps S141 to S143:
s141, extracting a characteristic diagram corresponding to each frame of image in the video to be processed through a skeleton network;
Specifically, each frame of the video to be processed is input into a skeleton network, features are extracted from the image, and the extracted features are encoded into a high-dimensional representation. Efficient feature encoding can be achieved by using residual structures, pyramid pooling, and attention mechanisms. In the embodiment of the present application, the skeleton network may be that of a classical neural network such as VGG, ResNet, DenseNet, or a Transformer. To increase the convergence rate of the skeleton network on a small-scale data set, the skeleton network may first be initialized on an existing, relatively large-scale data set. After training, each frame of image is input into the skeleton network, which outputs a corresponding feature map of shape (C, H, W), where C is the number of channels and H and W are the height and width of the feature map, respectively.
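A minimal sketch of producing such a (C, H, W) feature map, with a torchvision ResNet-50 trunk standing in for the skeleton network (any of the backbones named above could be substituted; the input size is an assumption):

    # Sketch: strip the pooling/classification head off ResNet-50 and use the
    # remaining trunk to produce a (C, H, W) feature map for one frame.
    import torch
    import torchvision.models as models

    backbone = torch.nn.Sequential(*list(models.resnet50().children())[:-2])
    backbone.eval()

    with torch.no_grad():
        frame = torch.randn(1, 3, 480, 854)                     # one frame as (B, 3, H, W)
        feature_map = backbone(frame)[0]                        # (C, H', W') = (2048, 15, 27)
    print(feature_map.shape)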
S142, determining a coding matrix according to the characteristic diagram;
Specifically, the feature map generated in step S141 is subjected to multi-scale parallel convolution: different convolution kernels are set, for example 7 × 7 or 5 × 5, the feature map is sent through parallel convolution layers with these different kernel sizes to generate feature maps of corresponding sizes, and zero padding is applied where the edges fall short. Finally, the feature maps produced by the different convolution kernels are concatenated along the channel dimension; for example, if the output of one layer of the multi-scale parallel convolution is (C1, H, W) and the output of another layer is (C2, H, W), the result of the channel-wise concatenation is (C1 + C2, H, W). Several consecutive standard convolutions are then applied to this result to obtain the final coding matrix.
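A hedged PyTorch sketch of this multi-scale parallel convolution and channel-wise concatenation; the branch and output channel counts are assumptions:

    # Sketch: 7x7 and 5x5 parallel branches (zero-padded to keep the spatial
    # size), channel-wise concatenation, then repeated standard convolutions.
    import torch
    import torch.nn as nn

    class MultiScaleEncoder(nn.Module):
        def __init__(self, in_ch=2048, branch_ch=256, out_ch=512):
            super().__init__()
            self.branch7 = nn.Conv2d(in_ch, branch_ch, 7, padding=3)
            self.branch5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
            self.refine = nn.Sequential(
                nn.Conv2d(2 * branch_ch, out_ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())

        def forward(self, fmap):                                # fmap: (B, C, H, W)
            fused = torch.cat([self.branch7(fmap), self.branch5(fmap)], dim=1)  # (B, C1+C2, H, W)
            return self.refine(fused)                           # coding matrix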
S143, carrying out global cross-channel fusion on the coding matrix, and determining a feature matrix;
Specifically, global cross-channel fusion is performed on the coding matrix generated in step S142 as follows. The input coding matrix is first reshaped to simplify the subsequent operations; for example, a coding matrix of shape (C, H, W) is adjusted to a matrix of shape (1, C, H × W). After this reshaping, the coding matrix is fed into several dilated convolution layers with different dilation coefficients, which extract and fuse the information between channels. Three dilated convolution layers are generally chosen, the dilation coefficients are set to [8, 12, 16] respectively, and the number of dilated convolution kernels is generally set to 4. After the fusion is finished, the fused result is converted back to the original shape: the outputs of the different dilated convolution layers are stacked along the channel dimension to obtain a matrix of shape (n, C, H × W), where n is the product of the number of dilated convolution kernels and the number of dilated convolution layers. This matrix is passed through a standard convolution with a (1, 1) kernel to obtain a matrix of shape (1, C, H × W), which is then reshaped into the feature matrix of shape (C, H, W).
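One possible reading of this fusion step, sketched in PyTorch under the shapes stated above; the exact layer layout is an assumption:

    # Sketch: reshape (C, H, W) to (1, C, HxW), apply three dilated convolutions
    # with dilation coefficients 8, 12 and 16 (4 kernels each), stack the branch
    # outputs along the channel dimension, reduce with a 1x1 convolution and
    # reshape back to (C, H, W).
    import torch
    import torch.nn as nn

    class CrossChannelFusion(nn.Module):
        def __init__(self, dilations=(8, 12, 16), n_kernels=4):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv2d(1, n_kernels, 3, padding=d, dilation=d) for d in dilations])
            self.reduce = nn.Conv2d(n_kernels * len(dilations), 1, kernel_size=1)

        def forward(self, coding):                              # coding: (C, H, W)
            c, h, w = coding.shape
            x = coding.reshape(1, 1, c, h * w)                  # (1, 1, C, HxW)
            fused = torch.cat([b(x) for b in self.branches], dim=1)  # (1, n, C, HxW)
            return self.reduce(fused).reshape(c, h, w)          # feature matrix (C, H, W)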
Through the steps S141 to S143, the embodiment of the present application provides a method for generating a feature matrix corresponding to each frame of image, and through the feature image, it can be determined whether scene change or lens change occurs in two frames before and after the corresponding video to be processed.
The present step S140 has been described through steps S141-S143, and the description of step S150 is started below.
S150, determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame;
Specifically, in the embodiment of the present application, a twin network is used to process the two feature matrices corresponding to any two adjacent frames of the video to be processed. A twin network is a two-branch neural network in which the two branches have the same structure and share weights. The two feature matrices are fed into the two branches respectively, each branch produces the feature vector corresponding to its input, and the two feature vectors are then compared to obtain the similarity between the two adjacent frames.
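A minimal sketch of such a weight-sharing comparison; the shared projection head and the use of cosine similarity are assumptions:

    # Sketch: one shared head embeds both feature matrices; the similarity of the
    # two adjacent frames is the cosine similarity of the two embeddings.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwinSimilarity(nn.Module):
        def __init__(self, in_ch=2048, dim=256):
            super().__init__()
            self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(in_ch, dim))    # shared by both branches

        def forward(self, feat_a, feat_b):                      # each (B, C, H, W)
            va, vb = self.head(feat_a), self.head(feat_b)       # same weights on both inputs
            return F.cosine_similarity(va, vb, dim=1)           # similarity in [-1, 1]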
S160, determining a target character frame in the video to be processed according to the similarity of the second face area and each frame; the target character frame is a position frame corresponding to the target character;
Specifically, according to step S150, all similarities between any two adjacent frames in the video to be processed are determined. Each similarity is compared with a preset second threshold. When the similarity is higher than the second threshold, the features of the two adjacent frames are similar: it can be considered that no major scene transition or shot switch has occurred and that the target person has moved relatively little between the two frames. The target person is therefore tracked according to the second face areas of the two adjacent frames, and the target person frame in which the target person is located in the two adjacent frames is determined, i.e. the position frame corresponding to the target person, including but not limited to the face, limbs, and torso.
In some embodiments, target tracking is implemented as follows: first, the feature matrices corresponding to the two adjacent frames are obtained; features at different scales are then extracted from the feature matrices through pyramid pooling; candidate frames are generated from these features and their confidences are obtained through a softmax function; a candidate frame with high confidence is output as the result to obtain position information; finally, a matching algorithm decides whether the candidate frame and the corresponding second face region lie in the same area, thereby realizing target tracking.
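For illustration only, the final matching test between a high-confidence candidate frame and the second face region could be a simple intersection-over-union check; the 0.5 threshold is an assumption:

    # Sketch: IoU between two boxes given as (x1, y1, x2, y2), and a test for
    # whether a candidate frame and a second face region share the same area.
    def iou(box_a, box_b):
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-6)

    def same_region(candidate_box, second_face_region, thresh=0.5):
        return iou(candidate_box, second_face_region) >= thresh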
Through the above, in the case that the similarity between two adjacent frames is high, the feature tracking is performed on the target person to determine the position of the target person in the two adjacent frames.
When the similarity is lower than or equal to the second threshold, the two adjacent frames are relatively dissimilar: a major scene switch or shot switch has probably occurred, or the target person has moved over a relatively large range. To ensure the accuracy of identifying the target person in the next frame, the target person is therefore re-identified in the next frame, and the target person frame in which the target person is located in the next frame is determined anew.
In some embodiments, target re-identification is implemented as follows: first, the feature matrices corresponding to the two adjacent frames are obtained; the feature matrices are then mapped into a high-dimensional space, and metric learning is performed so that identical targets lie closer together and different targets lie farther apart in that space. Finally, the second face area in the previous frame is matched with the first face areas in the next frame through feature matching, thereby determining the second face area in the next frame and the target person frame corresponding to it.
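A hedged sketch of the final matching step, assuming embedding vectors produced by the metric-learning stage described above:

    # Sketch: match the previous frame's second face region to the closest first
    # face region of the next frame by cosine similarity of their embeddings.
    import torch
    import torch.nn.functional as F

    def reidentify(prev_embedding, next_embeddings):
        """prev_embedding: (D,); next_embeddings: (N, D) -> index of the best match."""
        sims = F.cosine_similarity(prev_embedding.unsqueeze(0), next_embeddings, dim=1)
        return int(sims.argmax())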
By comparing the similarity of two adjacent frames, it is decided whether to apply target tracking or target re-identification to the video frames, so that the areas in which the target person appears throughout the video to be processed, i.e. all target person frames, are determined.
And S170, performing portrait segmentation processing on the target character frame and determining a target video.
Specifically, portrait segmentation is performed on all the target person frames determined in step S160: in each frame of image, all regions other than the target person frame are treated as background, the target person frame is segmented from the background, and the portrait segmentation is thereby completed. The frames retaining only the target person frame are then combined to obtain the target video containing only the target person.
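As a sketch only, masking each frame outside the target person frame and re-encoding the retained frames might look as follows (the OpenCV writer settings and the (x1, y1, x2, y2) box layout are assumptions):

    # Sketch: zero out everything outside the target person frame in each image
    # and write the retained frames out as the target video.
    import cv2
    import numpy as np

    def segment_frames(frames, person_boxes, out_path="target.mp4", fps=25):
        h, w = frames[0].shape[:2]
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        for frame, (x1, y1, x2, y2) in zip(frames, person_boxes):
            mask = np.zeros((h, w), dtype=np.uint8)
            mask[y1:y2, x1:x2] = 1                              # keep only the target person frame
            writer.write(frame * mask[:, :, None])              # background set to black
        writer.release()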
Through steps S100 to S170, the embodiment of the present application implements portrait segmentation on a target person with the largest number of speaking times in a video to be processed, determines the target person through voice recognition, face target detection, lip shape recognition, and the like, re-identifies or tracks the target person according to the similarity between two adjacent frames, improves the accuracy of target person recognition, and finally obtains a target video only including the target person. The embodiment of the application can be widely applied to scenes such as portrait cutout beautification, photo/video background replacement, certificate photo production, privacy protection and the like.
Referring to fig. 4, fig. 4 is a schematic diagram of a video processing system provided in an embodiment of the present application, where the system 400 includes a first module 410, a second module 420, a third module 430, a fourth module 440, a fifth module 450, a sixth module 460, a seventh module 470, and an eighth module 480; the first module is used for determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of the target person, and the target person is the person with the largest speaking frequency in the video to be processed; the second module is used for determining a plurality of first face areas according to the video to be processed; the third module is used for determining a plurality of second texts according to the plurality of first face areas; the second text is determined after lip language recognition is carried out on the first face area; the fourth module is used for determining a plurality of second face areas according to the first text, the second text and the first face area, wherein the second face areas are face areas corresponding to the target person; the fifth module is used for extracting features of each frame in the video to be processed and acquiring a feature matrix corresponding to each frame of image; the sixth module is used for determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame; the seventh module is used for determining a target character frame in the video to be processed according to the similarity between the second face area and each frame; the target character frame is a position frame corresponding to the target character; and the eighth module is used for carrying out portrait segmentation processing on the target character frame and determining a target video.
Referring to fig. 5, fig. 5 is a schematic diagram of an apparatus 500 provided in an embodiment of the present application, where the apparatus 500 includes at least one processor 510 and at least one memory 520 for storing at least one program; in fig. 5, a processor and a memory are taken as an example.
The processor and memory may be connected by a bus or other means, such as by a bus in FIG. 5.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Another embodiment of the present application also provides an apparatus that may be used to perform the control method as in any of the embodiments above, for example, performing the method steps of fig. 1 described above.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
The embodiment of the application also discloses a computer storage medium, wherein a program executable by a processor is stored, and the program executable by the processor is used for realizing the video processing method provided by the application when being executed by the processor.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, as is well known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as known to those skilled in the art.
While the preferred embodiments of the present invention have been described, the present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions are included in the scope of the present invention defined by the claims.

Claims (8)

1. A video processing method, comprising:
determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of a target character, and the target character is the character with the largest speaking frequency in the video to be processed;
determining a plurality of first face areas according to the video to be processed;
determining a plurality of second texts according to the plurality of first face regions; the second text is determined after lip language recognition is carried out on the first face area;
determining a plurality of second face regions according to the first text, the second text and the first face region, wherein the second face regions are face regions corresponding to the target person;
extracting features of each frame in the video to be processed to obtain a feature matrix corresponding to each frame of image;
determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame;
determining a target character frame in the video to be processed according to the similarity between the second face area and any two adjacent frames; the target character frame is a position frame corresponding to the target character;
determining a target person frame in the video to be processed according to the similarity between the second face area and the any two adjacent frames, including:
when the similarity between any two adjacent frames is higher than a second threshold value, performing target tracking on the target person according to the second face area, and determining the target person frame;
when the similarity between any two adjacent frames is lower than or equal to the second threshold, re-identifying the target person according to the first face area and the second face area, and determining the target person frame; and carrying out portrait segmentation processing on the target character frame to determine a target video.
2. The video processing method according to claim 1, wherein the determining a first text according to the acquired video to be processed includes:
determining the audio to be processed corresponding to the video to be processed;
performing voice recognition on the audio to be processed to determine a recognition text;
extracting the voice spectrum characteristics of the audio to be processed, and determining voice spectrum information;
classifying the audio to be processed according to the speech spectrum information, and determining a target audio; wherein the target audio is the audio corresponding to the target character;
and determining the first text according to the target audio and the recognition text.
3. The method of claim 1, wherein determining second face regions from the first text, the second text, and the first face region comprises:
calculating an edit distance between the first text and the second text;
and when the editing distance is lower than or equal to a first threshold value, determining that the first face area corresponding to the current first text is the second face area.
4. The video processing method according to claim 1, wherein said extracting features of each frame in the video to be processed to obtain a feature matrix corresponding to each frame image comprises:
extracting a characteristic diagram corresponding to each frame of image in the video to be processed through a skeleton network;
determining a coding matrix according to the characteristic diagram;
and carrying out global cross-channel fusion on the coding matrix, and determining the feature matrix.
5. The method according to claim 1, wherein said determining a similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame comprises:
and calculating the similarity of the two characteristic matrixes corresponding to any two adjacent frames through a twin network.
6. A video processing system is characterized by comprising a first module, a second module, a third module, a fourth module, a fifth module, a sixth module, a seventh module and an eighth module;
the first module is used for determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of a target character, and the target character is the character with the largest speaking frequency in the video to be processed;
the second module is used for determining a plurality of first face areas according to the video to be processed;
the third module is used for determining a plurality of second texts according to the plurality of first face areas; the second text is determined after lip language recognition is carried out on the first face area;
the fourth module is used for determining a plurality of second face areas according to the first text, the second text and the first face area, wherein the second face areas are face areas corresponding to the target person;
the fifth module is used for extracting features of each frame in the video to be processed and acquiring a feature matrix corresponding to each frame of image;
the sixth module is configured to determine a similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame;
the seventh module is used for determining a target character frame in the video to be processed according to the similarity between the second face area and any two adjacent frames; the target character frame is a position frame corresponding to the target character;
the seventh module is configured to determine a target person frame in the video to be processed according to the similarity between the second face region and the any two adjacent frames, and includes:
when the similarity between any two adjacent frames is higher than a second threshold value, performing target tracking on the target person according to the second face area, and determining the target person frame;
when the similarity between any two adjacent frames is lower than or equal to the second threshold, re-identifying the target person according to the first face area and the second face area, and determining the target person frame;
the eighth module is configured to perform portrait segmentation processing on the target person frame to determine a target video.
7. A video processing apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
when executed by the at least one processor, cause the at least one processor to implement the video processing method of any of claims 1-5.
8. A computer storage medium in which a processor-executable program is stored, wherein the processor-executable program, when executed by the processor, is for implementing a video processing method according to any one of claims 1-5.
CN202111494011.6A 2021-12-08 2021-12-08 Video processing method, system, device and storage medium Active CN114299944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111494011.6A CN114299944B (en) 2021-12-08 2021-12-08 Video processing method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111494011.6A CN114299944B (en) 2021-12-08 2021-12-08 Video processing method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN114299944A CN114299944A (en) 2022-04-08
CN114299944B true CN114299944B (en) 2023-03-24

Family

ID=80965513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111494011.6A Active CN114299944B (en) 2021-12-08 2021-12-08 Video processing method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114299944B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100409B (en) * 2022-06-30 2024-04-26 温州大学 Video portrait segmentation algorithm based on twin network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590439A (en) * 2017-08-18 2018-01-16 湖南文理学院 Target person identification method for tracing and device based on monitor video
WO2021120190A1 (en) * 2019-12-20 2021-06-24 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN113343831A (en) * 2021-06-01 2021-09-03 北京字跳网络技术有限公司 Method and device for classifying speakers in video, electronic equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4975960A (en) * 1985-06-03 1990-12-04 Petajan Eric D Electronic facial tracking and detection system and method and apparatus for automated speech recognition
JP3307354B2 (en) * 1999-01-29 2002-07-24 日本電気株式会社 Personal identification method and apparatus and recording medium recording personal identification program
US20030154084A1 (en) * 2002-02-14 2003-08-14 Koninklijke Philips Electronics N.V. Method and system for person identification using video-speech matching
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
KR20210048441A (en) * 2018-05-24 2021-05-03 워너 브로스. 엔터테인먼트 인크. Matching mouth shape and movement in digital video to alternative audio
CN110276259B (en) * 2019-05-21 2024-04-02 平安科技(深圳)有限公司 Lip language identification method, device, computer equipment and storage medium
CN110648667B (en) * 2019-09-26 2022-04-08 云南电网有限责任公司电力科学研究院 Multi-person scene human voice matching method
TWI714318B (en) * 2019-10-25 2020-12-21 緯創資通股份有限公司 Face recognition method and face recognition apparatus
CN112565885B (en) * 2020-11-30 2023-01-06 清华珠三角研究院 Video segmentation method, system, device and storage medium
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN113486760A (en) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 Object speaking detection method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590439A (en) * 2017-08-18 2018-01-16 湖南文理学院 Target person identification method for tracing and device based on monitor video
WO2021120190A1 (en) * 2019-12-20 2021-06-24 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN113343831A (en) * 2021-06-01 2021-09-03 北京字跳网络技术有限公司 Method and device for classifying speakers in video, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114299944A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN106599836B (en) Multi-face tracking method and tracking system
US8135221B2 (en) Video concept classification using audio-visual atoms
Ding et al. Audio and face video emotion recognition in the wild using deep neural networks and small datasets
CN112004111A (en) News video information extraction method for global deep learning
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
US9305359B2 (en) Image processing method, image processing apparatus, and computer program product
KR20160032533A (en) Feature extracting method of input image based on example pyramid and apparatus of face recognition
CN111950389B (en) Depth binary feature facial expression recognition method based on lightweight network
Radha Video retrieval using speech and text in video
CN112200041A (en) Video motion recognition method and device, storage medium and electronic equipment
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
WO2020043296A1 (en) Device and method for separating a picture into foreground and background using deep learning
CN112861970A (en) Fine-grained image classification method based on feature fusion
CN114299944B (en) Video processing method, system, device and storage medium
Feng et al. Self-supervised video forensics by audio-visual anomaly detection
CN116129310A (en) Video target segmentation system, method, electronic equipment and medium
CN113450387A (en) Target tracking method and device, electronic equipment and computer readable storage medium
CN115083435A (en) Audio data processing method and device, computer equipment and storage medium
Sarhan et al. HLR-net: a hybrid lip-reading model based on deep convolutional neural networks
Mo et al. A unified audio-visual learning framework for localization, separation, and recognition
Shankar et al. Spoken Keyword Detection Using Joint DTW-CNN.
CN116364064B (en) Audio splicing method, electronic equipment and storage medium
CN114925239B (en) Intelligent education target video big data retrieval method and system based on artificial intelligence
CN114566160A (en) Voice processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant