CN114299944B - Video processing method, system, device and storage medium

Video processing method, system, device and storage medium

Info

Publication number
CN114299944B
CN114299944B
Authority
CN
China
Prior art keywords
video
determining
processed
frame
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111494011.6A
Other languages
Chinese (zh)
Other versions
CN114299944A (en)
Inventor
郝德禄
肖冠正
甘心
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202111494011.6A priority Critical patent/CN114299944B/en
Publication of CN114299944A publication Critical patent/CN114299944A/en
Application granted granted Critical
Publication of CN114299944B publication Critical patent/CN114299944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application realizes portrait segmentation of the target person who speaks most frequently in a video to be processed, determines the target person by means of voice recognition, face target detection, lip shape recognition, and the like, re-identifies or tracks the target person according to the similarity between two adjacent frames, improves the accuracy of target person recognition, and finally obtains a target video containing only the target person. The embodiment of the application can be widely applied to scenes such as portrait cutout and beautification, photo/video background replacement, certificate photo production, and privacy protection.

Description

Video processing method, system, device and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a video processing method, system, apparatus, and storage medium.
Background
With the continuous development of multimedia information technology, more and more information is presented with video as the carrier. In order to obtain specified information in a video (such as portrait information), a dynamic video needs to be subjected to portrait segmentation; for example, in a street-view interview video containing multiple people, the host needs to be segmented out and other passers-by ignored. Because of shot switching, scene jumps, and the like, the related art has difficulty accurately capturing the designated person in the current frame, and therefore has difficulty completing accurate portrait segmentation of the video.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art. Therefore, the application provides a video processing method, a system, a device and a storage medium.
In a first aspect, an embodiment of the present application provides a video processing method, including: determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of a target character, and the target character is the character with the largest speaking frequency in the video to be processed; determining a plurality of first face areas according to the video to be processed; determining a plurality of second texts according to the plurality of first face regions; the second text is determined after lip language recognition is carried out on the first face area; determining a plurality of second face regions according to the first text, the second text and the first face region, wherein the second face regions are face regions corresponding to the target person; extracting features of each frame in the video to be processed to obtain a feature matrix corresponding to each frame of image; determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame; determining a target character frame in the video to be processed according to the similarity of the second face area and each frame; the target character frame is a position frame corresponding to the target character; and carrying out portrait segmentation processing on the target character frame to determine a target video.
Optionally, the determining a first text according to the acquired video to be processed includes: determining the audio to be processed corresponding to the video to be processed; performing voice recognition on the audio to be processed to determine a recognition text; extracting the voice spectrum characteristics of the audio to be processed, and determining voice spectrum information; classifying the audio to be processed according to the speech spectrum information, and determining a target audio; wherein the target audio is the audio corresponding to the target character; and determining the first text according to the target audio and the recognition text.
Optionally, the determining, according to the first text, the second text, and the first face region, a plurality of second face regions includes: calculating an edit distance between the first text and the second text; and when the editing distance is lower than or equal to a first threshold value, determining that the first face area corresponding to the current first text is the second face area.
Optionally, the performing feature extraction on each frame in the video to be processed to obtain a feature matrix corresponding to each frame of image includes: extracting a characteristic diagram corresponding to each frame of image in the video to be processed through a skeleton network; determining a coding matrix according to the characteristic diagram; and carrying out global cross-channel fusion on the coding matrix to determine the feature matrix.
Optionally, the determining, according to the feature matrix of each frame, a similarity between any two adjacent frames in the video to be processed includes: and calculating the similarity of the two characteristic matrixes corresponding to any two adjacent frames through a twin network.
Optionally, the determining a target person frame in the video to be processed according to the similarity between the second face region and each frame includes: when the similarity is higher than a second threshold value, performing target tracking on the target person according to the second face area, and determining the target person frame; and when the similarity is lower than or equal to the second threshold, re-identifying the target person according to the first face area and the second face area, and determining the target person frame.
In a second aspect, an embodiment of the present application provides a video processing system, which includes a first module, a second module, a third module, a fourth module, a fifth module, a sixth module, a seventh module, and an eighth module; the first module is used for determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of a target person, and the target person is the person with the largest speaking frequency in the video to be processed; the second module is used for determining a plurality of first face areas according to the video to be processed; the third module is used for determining a plurality of second texts according to the plurality of first face areas; the second text is determined after lip language recognition is carried out on the first face area; the fourth module is configured to determine a plurality of second face regions according to the first text, the second text, and the first face region, where the second face regions are face regions corresponding to the target person; the fifth module is used for extracting features of each frame in the video to be processed and acquiring a feature matrix corresponding to each frame of image; the sixth module is configured to determine a similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame; the seventh module is used for determining a target character frame in the video to be processed according to the second face area and the similarity of each frame; the target character frame is a position frame corresponding to the target character; the eighth module is configured to perform portrait segmentation processing on the target person frame to determine a target video.
In a third aspect, an embodiment of the present application provides an apparatus, including: at least one processor; at least one memory for storing at least one program; when executed by the at least one processor, cause the at least one processor to implement the video processing method of the first aspect.
In a fourth aspect, the present application provides a computer storage medium, in which a program executable by a processor is stored, and when the program executable by the processor is executed by the processor, the program is used for implementing the video processing method according to the first aspect.
The beneficial effects of the embodiment of the application are as follows: determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of the target person, and the target person is the person with the largest speaking frequency in the video to be processed; determining a plurality of first face areas according to a video to be processed; determining a plurality of second texts according to the plurality of first face regions; the second text is determined after lip language recognition is carried out on the first face area; determining a plurality of second face areas according to the first text, the second text and the first face area, wherein the second face areas are face areas corresponding to the target person; extracting features of each frame in a video to be processed to obtain a feature matrix corresponding to each frame of image; determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame; determining a target character frame in the video to be processed according to the similarity of the second face area and each frame; the target character frame is a position frame corresponding to the target character; and performing portrait segmentation processing on the target character frame to determine a target video. According to the embodiment of the application, the target person with the largest speaking frequency is segmented in the video to be processed, the target person is determined through modes such as voice recognition, face target detection and lip recognition, re-recognition or target tracking is carried out on two adjacent frames according to the similarity, the accuracy of target person recognition is improved, and finally the target video only containing the target person is obtained. The embodiment of the application can be widely applied to scenes such as portrait cutout beautification, photo/video background replacement, certificate photo production, privacy protection and the like.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
Fig. 1 is a flowchart illustrating steps of a video processing method according to an embodiment of the present application;
fig. 2 is a flowchart of steps provided by an embodiment of the present application for obtaining a first text from a video to be processed;
fig. 3 is a flowchart of a step of obtaining a feature matrix corresponding to each frame of image according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a video processing system provided by an embodiment of the present application;
fig. 5 is a schematic diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional block divisions are provided in the system drawings and logical orders are shown in the flowcharts, in some cases, the steps shown and described may be performed in different orders than the block divisions in the systems or in the flowcharts. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The embodiments of the present application will be further explained with reference to the drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a video processing method according to an embodiment of the present application, including, but not limited to, steps S100 to S170:
s100, determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of the target person, and the target person is the person with the largest speaking frequency in the video to be processed;
specifically, taking a video to be processed as a video for street view interview as an example, in this type of video, the main role is a host, and since the host generally refers to people with a relatively large number of times of speaking, if the portrait of the host is to be segmented in the video, the host to be segmented is called a target person, and the target person is a person with the largest number of times of speaking in the video to be processed. The text corresponding to all the speech contents of the target person in the video to be processed is referred to as the first text in the embodiment of the present application. The step of obtaining the first text from the video to be processed will be explained in the following.
Referring to fig. 2, fig. 2 is a flowchart of steps provided by an embodiment of the present application to obtain a first text from a video to be processed, where the steps include, but are not limited to, steps S101 to S105:
s101, determining to-be-processed audio corresponding to-be-processed video;
specifically, audio separation is performed on the video to be processed to obtain the audio to be processed corresponding to the video to be processed. In the above, the video to be processed may be a street view interview video, and then the audio to be processed may be subjected to preprocessing such as denoising and background sound elimination according to actual needs, so as to obtain a clearer audio to be processed.
S102, performing voice recognition on the audio to be processed, and determining a recognition text;
specifically, the speech recognition is performed on the audio to be processed, and specifically, the speech recognition may be performed by using an HMM method, a neural network method, and the like in the related art, so as to obtain a recognition text corresponding to the audio to be processed.
S103, extracting voice spectrum characteristics of the audio to be processed, and determining voice spectrum information;
Specifically, sound has three characteristics: loudness, pitch, and timbre. These characteristics can be extracted from the spectral features of the sound wave's amplitude, frequency, and waveform. Therefore, in this step, speech spectrum features need to be extracted from the audio to be processed, for example: the audio to be processed is divided into a plurality of segments according to the pauses between sentences, and the speech spectrum features of each segment are extracted to obtain multiple segments of speech spectrum information.
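As an illustrative sketch only (assuming the librosa and NumPy libraries; the 16 kHz sampling rate, the pause threshold, and the MFCC settings are assumptions, not values from the patent), the pause-based segmentation and spectral feature extraction described above could look as follows:

    # Sketch: split the separated audio on pauses and compute one spectral
    # (MFCC) vector per segment; settings are illustrative assumptions.
    import librosa
    import numpy as np

    def extract_segment_features(audio_path, n_mfcc=20, top_db=30):
        y, sr = librosa.load(audio_path, sr=16000)
        intervals = librosa.effects.split(y, top_db=top_db)   # pause-based segmentation
        segments, features = [], []
        for start, end in intervals:
            mfcc = librosa.feature.mfcc(y=y[start:end], sr=sr, n_mfcc=n_mfcc)
            segments.append((start / sr, end / sr))            # segment start/end in seconds
            features.append(mfcc.mean(axis=1))                 # one feature vector per segment
        return segments, np.stack(features)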
S104, classifying the audio to be processed according to the speech spectrum information, and determining a target audio; the target audio is the audio of the corresponding target character;
Specifically, speech spectrum information with similar characteristics is grouped by a clustering method such as k-means; speech spectrum information in the same class can be regarded as representing the same speaker. As described in step S100, the target person is the person who speaks most frequently, so the target audio corresponding to the target person can be determined according to the number of pieces of speech spectrum information in each class.
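A minimal sketch of this classification step, assuming scikit-learn's k-means and an assumed number of speakers (the patent does not fix either choice):

    # Sketch: cluster the per-segment spectral vectors and treat the largest
    # cluster as the target (most frequently speaking) person.
    from sklearn.cluster import KMeans
    import numpy as np

    def find_target_segments(segments, features, n_speakers=3):
        labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(features)
        target_label = np.bincount(labels).argmax()            # class with the most segments
        return [seg for seg, lab in zip(segments, labels) if lab == target_label]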
S105, determining a first text according to the target audio and the identification text;
Specifically, once the target audio corresponding to the target person has been determined in step S104, the first text can be determined from the recognition text according to the target audio. For example, the recognition text at the positions corresponding to the start and end time points of the target audio within the audio to be processed may be determined as the first text. According to step S100, the first text is the text corresponding to the speech content of the target person.
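A hedged sketch of this alignment step, assuming the speech recognizer returns (start, end, text) triples with timestamps, which the patent does not specify:

    # Sketch: keep the recognized-text pieces whose timestamps overlap the audio
    # segments attributed to the target person; the data layout is an assumption.
    def select_first_text(recognized, target_segments):
        """recognized: [(start, end, text)]; target_segments: [(start, end)] in seconds."""
        picked = [text for start, end, text in recognized
                  if any(start < e and end > s for s, e in target_segments)]
        return " ".join(picked)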
Through steps S101 to S105, the embodiment of the present application provides a step of acquiring a first text corresponding to a target video through a video to be processed.
Step S100 has been described through steps S101-S105, and step S110 is described.
S110, determining a plurality of first face areas according to a video to be processed;
Specifically, target detection is performed on the faces in the video to be processed. For example, each frame of the video to be processed may be processed by a face detection algorithm such as S3FD to determine the areas in which faces are located; these areas are referred to as first face areas.
It is understood that since there may be a plurality of faces in a picture, there may be a plurality of first face regions in each frame image.
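As a rough illustration only — the patent names S3FD, but this sketch substitutes OpenCV's stock Haar-cascade detector as a stand-in — per-frame detection of the first face areas could look like:

    # Sketch: detect face boxes in one frame; the Haar cascade is a stand-in
    # for the S3FD detector named in the description.
    import cv2

    _FACE_CASCADE = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_first_face_regions(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        # returns (x, y, w, h) boxes; there may be several faces per frame
        return list(_FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))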
S120, determining a plurality of second texts according to the first face areas; the second text is determined after lip language recognition is carried out on the first face area;
Specifically, lip language recognition is performed on the first face areas of each frame image, and the text content corresponding to each first face area is determined from the lip shapes. For example, a lip language recognition network with an LSTM as its core is constructed: the images of a first face area are fed into a convolutional layer to obtain lip feature information, the feature information is fed into an LSTM layer to obtain the time-domain information of the sequence, and the time-domain information is fed into a multi-layer perceptron; finally, the text contents corresponding to the first face areas are obtained through a softmax classifier, and these text contents are determined to be the second texts.
It should be noted that, in this step, lip language recognition is performed on all the first face areas of each frame, so that the obtained second text also corresponds to all the first face areas, and the second text includes text contents corresponding to the speech contents of the target person and text contents corresponding to the speech contents of other persons in the video to be processed except the target person.
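A minimal PyTorch sketch of the convolution → LSTM → multi-layer perceptron → softmax structure described above; the layer sizes and vocabulary size are assumptions, not values from the patent:

    # Sketch: per-frame convolutional lip features, an LSTM over time, and an
    # MLP + softmax classifier producing token probabilities per time step.
    import torch
    import torch.nn as nn

    class LipReader(nn.Module):
        def __init__(self, vocab_size=1000, hidden=256):
            super().__init__()
            self.conv = nn.Sequential(                          # per-frame lip features
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.lstm = nn.LSTM(64, hidden, batch_first=True)   # time-domain information
            self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, vocab_size))

        def forward(self, clips):                               # clips: (B, T, 3, H, W)
            b, t = clips.shape[:2]
            feats = self.conv(clips.flatten(0, 1)).flatten(1)   # (B*T, 64)
            seq, _ = self.lstm(feats.view(b, t, -1))            # (B, T, hidden)
            return self.mlp(seq).softmax(dim=-1)                # (B, T, vocab_size)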
S130, determining a plurality of second face areas according to the first text, the second text and the first face area, wherein the second face areas are face areas corresponding to the target person;
Specifically, the second face regions can be determined from the first face regions by means of the first text and the second text. For example, an edit distance between the first text and the second text is first calculated; the edit distance is an index measuring how similar two sequences are. By calculating the edit distance between the first text obtained by speech recognition and the second text obtained by lip recognition, the content of the second text closest to the first text can be found: the first text and the second text are divided into a plurality of paragraphs, and the edit distance between corresponding paragraphs is calculated. When the edit distance is lower than or equal to a preset first threshold, the current paragraph of the first text is very similar to the current paragraph of the second text. Since the second text has a correspondence with the first face regions, the face regions matching the first text can thus be determined among the first face regions, and these face regions are determined to be the second face regions. Because the first text is the text content spoken by the target person, the first text corresponds to the second face regions; in other words, the second face regions are the face regions representing the target person.
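A small sketch of the paragraph-level edit-distance comparison, with the first threshold chosen arbitrarily for illustration:

    # Sketch: Levenshtein edit distance between a paragraph of the first text and
    # a paragraph of the second text; the threshold value is an assumption.
    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                     # deletion
                               cur[j - 1] + 1,                  # insertion
                               prev[j - 1] + (ca != cb)))       # substitution
            prev = cur
        return prev[-1]

    def matches_target(first_para: str, second_para: str, first_threshold: int = 5) -> bool:
        return edit_distance(first_para, second_para) <= first_threshold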
S140, extracting features of each frame in the video to be processed to obtain a feature matrix corresponding to each frame of image;
Specifically, as mentioned above, one reason why it is difficult in the related art to segment the portrait of a designated target person throughout a video is that the video may contain scene changes, shot switches, and similar factors; when these occur, the characteristics of the video change, and the position and size of the person may change as well. In this step, features therefore need to be extracted from every frame of the video to be processed in order to locate the points at which scene changes and shot switches occur.
Referring to fig. 3, fig. 3 is a flowchart of steps of acquiring a feature matrix corresponding to each frame of image according to the embodiment of the present application, where the method includes, but is not limited to, steps S141 to S143:
s141, extracting a characteristic diagram corresponding to each frame of image in the video to be processed through a skeleton network;
Specifically, each frame of the video to be processed is input into a skeleton network, features are extracted from the image, and the extracted features are encoded into a high-dimensional representation. Efficient feature encoding can be achieved by using residual structures, pyramid pooling, and attention mechanisms. In the embodiment of the present application, the skeleton network may be that of a classical neural network such as VGG, ResNet, DenseNet, or a Transformer. To increase the convergence rate of the skeleton network on a small-scale data set, the skeleton network may first be initialized on an existing, relatively large-scale data set. After training, each frame of image is input into the skeleton network, which outputs a corresponding feature map of shape (C, H, W), where C is the number of channels and H and W are the height and width of the feature map, respectively.
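A minimal sketch of producing such a (C, H, W) feature map, with a torchvision ResNet-50 trunk standing in for the skeleton network (any of the backbones named above could be substituted; the input size is an assumption):

    # Sketch: strip the pooling/classification head off ResNet-50 and use the
    # remaining trunk to produce a (C, H, W) feature map for one frame.
    import torch
    import torchvision.models as models

    backbone = torch.nn.Sequential(*list(models.resnet50().children())[:-2])
    backbone.eval()

    with torch.no_grad():
        frame = torch.randn(1, 3, 480, 854)                     # one frame as (B, 3, H, W)
        feature_map = backbone(frame)[0]                        # (C, H', W') = (2048, 15, 27)
    print(feature_map.shape)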
S142, determining a coding matrix according to the characteristic diagram;
Specifically, the feature map generated in step S141 is subjected to multi-scale parallel convolution: different convolution kernels are set, for example 7 × 7 or 5 × 5, the feature map is sent through parallel convolution layers with these different kernel sizes to generate feature maps of corresponding sizes, and zero padding is applied where the edges fall short. Finally, the feature maps produced by the different convolution kernels are concatenated along the channel dimension; for example, if the output of one layer of the multi-scale parallel convolution is (C1, H, W) and the output of another layer is (C2, H, W), the result of the channel-wise concatenation is (C1 + C2, H, W). Several consecutive standard convolutions are then applied to this result to obtain the final coding matrix.
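A hedged PyTorch sketch of this multi-scale parallel convolution and channel-wise concatenation; the branch and output channel counts are assumptions:

    # Sketch: 7x7 and 5x5 parallel branches (zero-padded to keep the spatial
    # size), channel-wise concatenation, then repeated standard convolutions.
    import torch
    import torch.nn as nn

    class MultiScaleEncoder(nn.Module):
        def __init__(self, in_ch=2048, branch_ch=256, out_ch=512):
            super().__init__()
            self.branch7 = nn.Conv2d(in_ch, branch_ch, 7, padding=3)
            self.branch5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
            self.refine = nn.Sequential(
                nn.Conv2d(2 * branch_ch, out_ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())

        def forward(self, fmap):                                # fmap: (B, C, H, W)
            fused = torch.cat([self.branch7(fmap), self.branch5(fmap)], dim=1)  # (B, C1+C2, H, W)
            return self.refine(fused)                           # coding matrix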
S143, carrying out global cross-channel fusion on the coding matrix, and determining a feature matrix;
Specifically, global cross-channel fusion is performed on the coding matrix generated in step S142 as follows. The input coding matrix is first reshaped to simplify the subsequent operations; for example, a coding matrix of shape (C, H, W) is adjusted to a matrix of shape (1, C, H × W). After this reshaping, the coding matrix is fed into several dilated convolution layers with different dilation coefficients, which extract and fuse the information between channels. Three dilated convolution layers are generally chosen, the dilation coefficients are set to [8, 12, 16] respectively, and the number of dilated convolution kernels is generally set to 4. After the fusion is finished, the fused result is converted back to the original shape: the outputs of the different dilated convolution layers are stacked along the channel dimension to obtain a matrix of shape (n, C, H × W), where n is the product of the number of dilated convolution kernels and the number of dilated convolution layers. This matrix is passed through a standard convolution with a (1, 1) kernel to obtain a matrix of shape (1, C, H × W), which is then reshaped into the feature matrix of shape (C, H, W).
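One possible reading of this fusion step, sketched in PyTorch under the shapes stated above; the exact layer layout is an assumption:

    # Sketch: reshape (C, H, W) to (1, C, HxW), apply three dilated convolutions
    # with dilation coefficients 8, 12 and 16 (4 kernels each), stack the branch
    # outputs along the channel dimension, reduce with a 1x1 convolution and
    # reshape back to (C, H, W).
    import torch
    import torch.nn as nn

    class CrossChannelFusion(nn.Module):
        def __init__(self, dilations=(8, 12, 16), n_kernels=4):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv2d(1, n_kernels, 3, padding=d, dilation=d) for d in dilations])
            self.reduce = nn.Conv2d(n_kernels * len(dilations), 1, kernel_size=1)

        def forward(self, coding):                              # coding: (C, H, W)
            c, h, w = coding.shape
            x = coding.reshape(1, 1, c, h * w)                  # (1, 1, C, HxW)
            fused = torch.cat([b(x) for b in self.branches], dim=1)  # (1, n, C, HxW)
            return self.reduce(fused).reshape(c, h, w)          # feature matrix (C, H, W)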
Through the steps S141 to S143, the embodiment of the present application provides a method for generating a feature matrix corresponding to each frame of image, and through the feature image, it can be determined whether scene change or lens change occurs in two frames before and after the corresponding video to be processed.
The present step S140 has been described through steps S141-S143, and the description of step S150 is started below.
S150, determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame;
Specifically, in the embodiment of the present application, a twin network is used to process the two feature matrices corresponding to any two adjacent frames of the video to be processed. A twin network is a two-branch neural network in which the two branches have the same structure and share weights. The two feature matrices are fed into the two branches respectively, each branch produces the feature vector corresponding to its input, and the two feature vectors are then compared to obtain the similarity between the two adjacent frames.
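A minimal sketch of such a weight-sharing comparison; the shared projection head and the use of cosine similarity are assumptions:

    # Sketch: one shared head embeds both feature matrices; the similarity of the
    # two adjacent frames is the cosine similarity of the two embeddings.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwinSimilarity(nn.Module):
        def __init__(self, in_ch=2048, dim=256):
            super().__init__()
            self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(in_ch, dim))    # shared by both branches

        def forward(self, feat_a, feat_b):                      # each (B, C, H, W)
            va, vb = self.head(feat_a), self.head(feat_b)       # same weights on both inputs
            return F.cosine_similarity(va, vb, dim=1)           # similarity in [-1, 1]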
S160, determining a target character frame in the video to be processed according to the similarity of the second face area and each frame; the target character frame is a position frame corresponding to the target character;
Specifically, according to step S150, all similarities between any two adjacent frames in the video to be processed are determined. Each similarity is compared with a preset second threshold. When the similarity is higher than the second threshold, the features of the two adjacent frames are similar: it can be considered that no major scene transition or shot switch has occurred and that the target person has moved relatively little between the two frames. The target person is therefore tracked according to the second face areas of the two adjacent frames, and the target person frame in which the target person is located in the two adjacent frames is determined, i.e. the position frame corresponding to the target person, including but not limited to the face, limbs, and torso.
In some embodiments, target tracking is implemented as follows: first, the feature matrices corresponding to the two adjacent frames are obtained; features at different scales are then extracted from the feature matrices through pyramid pooling; candidate frames are generated from these features and their confidences are obtained through a softmax function; a candidate frame with high confidence is output as the result to obtain position information; finally, a matching algorithm decides whether the candidate frame and the corresponding second face region lie in the same area, thereby realizing target tracking.
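For illustration only, the final matching test between a high-confidence candidate frame and the second face region could be a simple intersection-over-union check; the 0.5 threshold is an assumption:

    # Sketch: IoU between two boxes given as (x1, y1, x2, y2), and a test for
    # whether a candidate frame and a second face region share the same area.
    def iou(box_a, box_b):
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-6)

    def same_region(candidate_box, second_face_region, thresh=0.5):
        return iou(candidate_box, second_face_region) >= thresh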
Through the above, in the case that the similarity between two adjacent frames is high, the feature tracking is performed on the target person to determine the position of the target person in the two adjacent frames.
When the similarity is lower than or equal to the second threshold, the two adjacent frames are relatively dissimilar: a major scene switch or shot switch has probably occurred, or the target person has moved over a relatively large range. To ensure the accuracy of identifying the target person in the next frame, the target person is therefore re-identified in the next frame, and the target person frame in which the target person is located in the next frame is determined anew.
In some embodiments, target re-identification is implemented as follows: first, the feature matrices corresponding to the two adjacent frames are obtained; the feature matrices are then mapped into a high-dimensional space, and metric learning is performed so that identical targets lie closer together and different targets lie farther apart in that space. Finally, the second face area in the previous frame is matched with the first face areas in the next frame through feature matching, thereby determining the second face area in the next frame and the target person frame corresponding to it.
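A hedged sketch of the final matching step, assuming embedding vectors produced by the metric-learning stage described above:

    # Sketch: match the previous frame's second face region to the closest first
    # face region of the next frame by cosine similarity of their embeddings.
    import torch
    import torch.nn.functional as F

    def reidentify(prev_embedding, next_embeddings):
        """prev_embedding: (D,); next_embeddings: (N, D) -> index of the best match."""
        sims = F.cosine_similarity(prev_embedding.unsqueeze(0), next_embeddings, dim=1)
        return int(sims.argmax())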
By comparing the similarity of two adjacent frames, it is decided whether to apply target tracking or target re-identification to the video frames, so that the areas in which the target person appears throughout the video to be processed, i.e. all target person frames, are determined.
And S170, performing portrait segmentation processing on the target character frame and determining a target video.
Specifically, portrait segmentation is performed on all the target person frames determined in step S160: in each frame of image, all regions other than the target person frame are treated as background, the target person frame is segmented from the background, and the portrait segmentation is thereby completed. The frames retaining only the target person frame are then combined to obtain the target video containing only the target person.
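As a sketch only, masking each frame outside the target person frame and re-encoding the retained frames might look as follows (the OpenCV writer settings and the (x1, y1, x2, y2) box layout are assumptions):

    # Sketch: zero out everything outside the target person frame in each image
    # and write the retained frames out as the target video.
    import cv2
    import numpy as np

    def segment_frames(frames, person_boxes, out_path="target.mp4", fps=25):
        h, w = frames[0].shape[:2]
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        for frame, (x1, y1, x2, y2) in zip(frames, person_boxes):
            mask = np.zeros((h, w), dtype=np.uint8)
            mask[y1:y2, x1:x2] = 1                              # keep only the target person frame
            writer.write(frame * mask[:, :, None])              # background set to black
        writer.release()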
Through steps S100 to S170, the embodiment of the present application implements portrait segmentation on a target person with the largest number of speaking times in a video to be processed, determines the target person through voice recognition, face target detection, lip shape recognition, and the like, re-identifies or tracks the target person according to the similarity between two adjacent frames, improves the accuracy of target person recognition, and finally obtains a target video only including the target person. The embodiment of the application can be widely applied to scenes such as portrait cutout beautification, photo/video background replacement, certificate photo production, privacy protection and the like.
Referring to fig. 4, fig. 4 is a schematic diagram of a video processing system provided in an embodiment of the present application, where the system 400 includes a first module 410, a second module 420, a third module 430, a fourth module 440, a fifth module 450, a sixth module 460, a seventh module 470, and an eighth module 480; the first module is used for determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of the target person, and the target person is the person with the largest speaking frequency in the video to be processed; the second module is used for determining a plurality of first face areas according to the video to be processed; the third module is used for determining a plurality of second texts according to the plurality of first face areas; the second text is determined after lip language recognition is carried out on the first face area; the fourth module is used for determining a plurality of second face areas according to the first text, the second text and the first face area, wherein the second face areas are face areas corresponding to the target person; the fifth module is used for extracting features of each frame in the video to be processed and acquiring a feature matrix corresponding to each frame of image; the sixth module is used for determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame; the seventh module is used for determining a target character frame in the video to be processed according to the similarity between the second face area and each frame; the target character frame is a position frame corresponding to the target character; and the eighth module is used for carrying out portrait segmentation processing on the target character frame and determining a target video.
Referring to fig. 5, fig. 5 is a schematic diagram of an apparatus 500 provided in an embodiment of the present application, where the apparatus 500 includes at least one processor 510 and at least one memory 520 for storing at least one program; in fig. 5, a processor and a memory are taken as an example.
The processor and memory may be connected by a bus or other means, such as by a bus in FIG. 5.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Another embodiment of the present application also provides an apparatus that may be used to perform the control method as in any of the embodiments above, for example, performing the method steps of fig. 1 described above.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
The embodiment of the application also discloses a computer storage medium, wherein a program executable by a processor is stored, and the program executable by the processor is used for realizing the video processing method provided by the application when being executed by the processor.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, as is well known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as known to those skilled in the art.
While the preferred embodiments of the present invention have been described, the present invention is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions are included in the scope of the present invention defined by the claims.

Claims (8)

1. A video processing method, comprising:
determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of a target character, and the target character is the character with the largest speaking frequency in the video to be processed;
determining a plurality of first face areas according to the video to be processed;
determining a plurality of second texts according to the plurality of first face regions; the second text is determined after lip language recognition is carried out on the first face area;
determining a plurality of second face regions according to the first text, the second text and the first face region, wherein the second face regions are face regions corresponding to the target person;
extracting features of each frame in the video to be processed to obtain a feature matrix corresponding to each frame of image;
determining the similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame;
determining a target character frame in the video to be processed according to the similarity between the second face area and any two adjacent frames; the target character frame is a position frame corresponding to the target character;
determining a target person frame in the video to be processed according to the similarity between the second face area and the any two adjacent frames, including:
when the similarity between any two adjacent frames is higher than a second threshold value, performing target tracking on the target person according to the second face area, and determining the target person frame;
when the similarity between any two adjacent frames is lower than or equal to the second threshold, re-identifying the target person according to the first face area and the second face area, and determining the target person frame; and carrying out portrait segmentation processing on the target character frame to determine a target video.
2. The video processing method according to claim 1, wherein the determining a first text according to the acquired video to be processed includes:
determining the audio to be processed corresponding to the video to be processed;
performing voice recognition on the audio to be processed to determine a recognition text;
extracting the voice spectrum characteristics of the audio to be processed, and determining voice spectrum information;
classifying the audio to be processed according to the speech spectrum information, and determining a target audio; wherein the target audio is the audio corresponding to the target character;
and determining the first text according to the target audio and the recognition text.
3. The method of claim 1, wherein determining second face regions from the first text, the second text, and the first face region comprises:
calculating an edit distance between the first text and the second text;
and when the editing distance is lower than or equal to a first threshold value, determining that the first face area corresponding to the current first text is the second face area.
4. The video processing method according to claim 1, wherein said extracting features of each frame in the video to be processed to obtain a feature matrix corresponding to each frame image comprises:
extracting a characteristic diagram corresponding to each frame of image in the video to be processed through a skeleton network;
determining a coding matrix according to the characteristic diagram;
and carrying out global cross-channel fusion on the coding matrix, and determining the feature matrix.
5. The method according to claim 1, wherein said determining a similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame comprises:
and calculating the similarity of the two characteristic matrixes corresponding to any two adjacent frames through a twin network.
6. A video processing system is characterized by comprising a first module, a second module, a third module, a fourth module, a fifth module, a sixth module, a seventh module and an eighth module;
the first module is used for determining a first text according to the acquired video to be processed; the first text is the text content corresponding to the voice of a target character, and the target character is the character with the largest speaking frequency in the video to be processed;
the second module is used for determining a plurality of first face areas according to the video to be processed;
the third module is used for determining a plurality of second texts according to the plurality of first face areas; the second text is determined after lip language recognition is carried out on the first face area;
the fourth module is used for determining a plurality of second face areas according to the first text, the second text and the first face area, wherein the second face areas are face areas corresponding to the target person;
the fifth module is used for extracting features of each frame in the video to be processed and acquiring a feature matrix corresponding to each frame of image;
the sixth module is configured to determine a similarity between any two adjacent frames in the video to be processed according to the feature matrix of each frame;
the seventh module is used for determining a target character frame in the video to be processed according to the similarity between the second face area and any two adjacent frames; the target character frame is a position frame corresponding to the target character;
the seventh module is configured to determine a target person frame in the video to be processed according to the similarity between the second face region and the any two adjacent frames, and includes:
when the similarity between any two adjacent frames is higher than a second threshold value, performing target tracking on the target person according to the second face area, and determining the target person frame;
when the similarity between any two adjacent frames is lower than or equal to the second threshold, re-identifying the target person according to the first face area and the second face area, and determining the target person frame;
the eighth module is configured to perform portrait segmentation processing on the target person frame to determine a target video.
7. A video processing apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
when executed by the at least one processor, cause the at least one processor to implement the video processing method of any of claims 1-5.
8. A computer storage medium in which a processor-executable program is stored, wherein the processor-executable program, when executed by the processor, is for implementing a video processing method according to any one of claims 1-5.
CN202111494011.6A 2021-12-08 2021-12-08 Video processing method, system, device and storage medium Active CN114299944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111494011.6A CN114299944B (en) 2021-12-08 2021-12-08 Video processing method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111494011.6A CN114299944B (en) 2021-12-08 2021-12-08 Video processing method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN114299944A CN114299944A (en) 2022-04-08
CN114299944B true CN114299944B (en) 2023-03-24

Family

ID=80965513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111494011.6A Active CN114299944B (en) 2021-12-08 2021-12-08 Video processing method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN114299944B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100409B (en) * 2022-06-30 2024-04-26 温州大学 Video portrait segmentation algorithm based on twin network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590439A (en) * 2017-08-18 2018-01-16 湖南文理学院 Target person identification method for tracing and device based on monitor video
WO2021120190A1 (en) * 2019-12-20 2021-06-24 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN113343831A (en) * 2021-06-01 2021-09-03 北京字跳网络技术有限公司 Method and device for classifying speakers in video, electronic equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4975960A (en) * 1985-06-03 1990-12-04 Petajan Eric D Electronic facial tracking and detection system and method and apparatus for automated speech recognition
JP3307354B2 (en) * 1999-01-29 2002-07-24 日本電気株式会社 Personal identification method and apparatus and recording medium recording personal identification program
US20030154084A1 (en) * 2002-02-14 2003-08-14 Koninklijke Philips Electronics N.V. Method and system for person identification using video-speech matching
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
KR20210048441A (en) * 2018-05-24 2021-05-03 워너 브로스. 엔터테인먼트 인크. Matching mouth shape and movement in digital video to alternative audio
CN110276259B (en) * 2019-05-21 2024-04-02 平安科技(深圳)有限公司 Lip language identification method, device, computer equipment and storage medium
CN110648667B (en) * 2019-09-26 2022-04-08 云南电网有限责任公司电力科学研究院 Multi-person scene human voice matching method
TWI714318B (en) * 2019-10-25 2020-12-21 緯創資通股份有限公司 Face recognition method and face recognition apparatus
CN112565885B (en) * 2020-11-30 2023-01-06 清华珠三角研究院 Video segmentation method, system, device and storage medium
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN113486760A (en) * 2021-06-30 2021-10-08 上海商汤临港智能科技有限公司 Object speaking detection method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590439A (en) * 2017-08-18 2018-01-16 湖南文理学院 Target person identification method for tracing and device based on monitor video
WO2021120190A1 (en) * 2019-12-20 2021-06-24 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN113343831A (en) * 2021-06-01 2021-09-03 北京字跳网络技术有限公司 Method and device for classifying speakers in video, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114299944A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN106599836B (en) Multi-face tracking method and tracking system
US8135221B2 (en) Video concept classification using audio-visual atoms
Ding et al. Audio and face video emotion recognition in the wild using deep neural networks and small datasets
CN112004111A (en) News video information extraction method for global deep learning
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
US9305359B2 (en) Image processing method, image processing apparatus, and computer program product
KR20160032533A (en) Feature extracting method of input image based on example pyramid and apparatus of face recognition
CN111950389B (en) Depth binary feature facial expression recognition method based on lightweight network
Radha Video retrieval using speech and text in video
CN112200041A (en) Video motion recognition method and device, storage medium and electronic equipment
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
WO2020043296A1 (en) Device and method for separating a picture into foreground and background using deep learning
CN112861970A (en) Fine-grained image classification method based on feature fusion
CN114299944B (en) Video processing method, system, device and storage medium
Feng et al. Self-supervised video forensics by audio-visual anomaly detection
CN116129310A (en) Video target segmentation system, method, electronic equipment and medium
CN113450387A (en) Target tracking method and device, electronic equipment and computer readable storage medium
CN115083435A (en) Audio data processing method and device, computer equipment and storage medium
Sarhan et al. HLR-net: a hybrid lip-reading model based on deep convolutional neural networks
Mo et al. A unified audio-visual learning framework for localization, separation, and recognition
Shankar et al. Spoken Keyword Detection Using Joint DTW-CNN.
CN116364064B (en) Audio splicing method, electronic equipment and storage medium
CN114925239B (en) Intelligent education target video big data retrieval method and system based on artificial intelligence
CN114566160A (en) Voice processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant