CN113077470B - Method, system, device and medium for cutting horizontal and vertical screen conversion picture - Google Patents

Method, system, device and medium for cutting horizontal and vertical screen conversion picture

Info

Publication number
CN113077470B
CN113077470B (application CN202110324841.8A)
Authority
CN
China
Prior art keywords
picture
video
determining
key
dynamic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110324841.8A
Other languages
Chinese (zh)
Other versions
CN113077470A (en)
Inventor
曾荣
徐蕾
吴三阳
王伟
陆赞信
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd
Priority to CN202110324841.8A
Publication of CN113077470A
Application granted
Publication of CN113077470B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Abstract

The invention provides a method, a system, a device and a storage medium for cutting a horizontal and vertical screen conversion picture, wherein the method comprises the following steps: acquiring a landscape video file, and separating it to obtain a video picture and an audio file; acquiring face positions in the video picture, determining face images, and segmenting the face images to obtain dynamic information; matching the dynamic information with the audio file, and determining the key character according to the matching result; and cropping the video picture according to the key character to obtain a portrait picture. By separating the video picture from the audio file, segmenting the dynamic information of the faces in the video picture, and feature-matching the segmented dynamic information against the audio file, the method locates the speaker in the video and distinguishes key characters from non-key characters in a section of video containing multi-person pictures, so that the picture of the key character is retained as much as possible while the pictures of non-key characters are cropped away, producing a better cropping effect. The method can be widely applied in the technical field of video processing.

Description

Method, system, device and medium for cutting horizontal and vertical screen conversion picture
Technical Field
The invention relates to the technical field of video processing, in particular to a method, a system, a device and a storage medium for cutting a horizontal and vertical screen conversion picture.
Background
In video products there are two complete video forms: landscape video and portrait video. Generally, landscape video is richer in content and covers more content types; given its longer duration, ordinary users are used to holding the phone horizontally to get a better viewing experience. Portrait video is more casual, shorter, and more focused and concise in plot; users generally prefer to watch it in portrait orientation during more fragmented times and scenes. Without any processing, directly scaling the video picture to display portrait video on a landscape screen, or landscape video on a portrait screen, produces large black borders.
At present, automatically converting between landscape and portrait video through video processing technology is a cost-effective approach: it strikes a good balance between cost and viewing experience, massive amounts of video can be converted automatically at extremely low cost, the complete information of the video is retained as far as possible, and users are given a better viewing experience.
But some cases are more difficult: for example, multiple faces appear in a picture of a TV series or movie while a long conversation is conducted. If the key area of the video is determined only by face position during cropping, confusion arises: the person in the cropped picture is not speaking while the person who is speaking does not appear in the cropped picture; or the area occupied by key elements in the video is too large, so the cropping algorithm compromises between several key elements and produces a result of worse quality.
Disclosure of Invention
In view of the above, to at least partially solve one of the above technical problems, an embodiment of the present invention provides a method for cropping a horizontal-vertical screen conversion picture that can retain the picture of the key character as much as possible and crop away the pictures of non-key characters to obtain a better cropping effect; the embodiments also provide a system, a device and a computer-readable storage medium that correspondingly implement the method.
In a first aspect, a technical solution of the present application provides a method for clipping a horizontal-vertical screen conversion picture, which includes the steps of:
acquiring a horizontal screen video file, and separating the horizontal screen video file to obtain a video picture and an audio file;
acquiring the face position in the video picture, determining a face image, and segmenting the face image to obtain dynamic information;
matching the dynamic information with the audio file, and determining key figures according to matching results;
and cutting the video picture according to the key character to obtain a vertical screen picture.
In a feasible embodiment of the present application, the obtaining a face position in the video frame, determining a face image, and segmenting the face image to obtain dynamic information includes:
determining the face image according to the face position, generating a gray level image of the face image, and extracting a feature map of the gray level image;
predicting to obtain key points according to the gray level image, generating a similarity transformation matrix of the key points, and determining a key point hot spot diagram according to the similarity transformation matrix;
and iterating and determining the dynamic information of the human face through a feedforward neural network according to the gray level image, the feature map and the key point hot spot map.
In a possible embodiment of the present disclosure, the matching the dynamic information with the audio file and determining a key person according to a matching result include:
coding the dynamic information to obtain a first feature vector;
coding the audio file to obtain a second feature vector;
splicing the first feature vector and the second feature vector, and outputting a matching probability through a convolutional neural network;
and determining the face position with the highest score in the matching probability as the key figure.
In a possible embodiment of the present disclosure, the encoding according to the dynamic information to obtain a first feature vector includes:
determining an average value of pixels in the dynamic information through principal component analysis;
calculating a covariance matrix of the average value of the pixels, and determining a pixel eigenvector according to the covariance matrix;
and projecting the dynamic information according to the pixel characteristic vector to obtain the first characteristic vector.
In a possible embodiment of the present disclosure, the encoding the audio file to obtain a second feature vector includes:
increasing the high-frequency part of the audio file to obtain a pre-emphasis signal, and framing the pre-emphasis signal to obtain a single-frame signal;
windowing the single-frame signal, and obtaining a frequency domain signal through fast Fourier transform;
and determining the energy of the frequency domain signal through a Mel filter bank, determining a Mel frequency cepstrum coefficient, and obtaining a second eigenvector through the Mel frequency cepstrum coefficient.
In a feasible embodiment of the present disclosure, the cropping the video picture according to the key character to obtain a portrait picture includes:
determining that the key character does not exist in the horizontal screen video file, keeping the picture height in the horizontal screen video file unchanged, determining the picture width according to a preset picture proportion, and determining a cutting area according to the picture height and the picture width;
and clipping according to the clipping area to obtain the vertical screen picture.
In a feasible embodiment of the present disclosure, the cropping the video picture according to the key character to obtain a portrait picture includes:
determining that the key character exists in the horizontal screen video file, and determining the cropping area;
and controlling the cutting area to slide in the picture of the horizontal screen video file, determining that the face of the key figure is positioned in the cutting area, and cutting the picture to obtain the vertical screen picture.
In a second aspect, the technical solution of the present invention further provides a software system for clipping a horizontal and vertical screen transition picture, including: a sound and picture separation module, used for acquiring a landscape video file and separating it to obtain a video picture and an audio file;
the characteristic cutting module is used for acquiring the face position in the video picture, determining a face image and segmenting the face image to obtain dynamic information;
the character matching module is used for matching the dynamic information with the audio file and determining key characters according to a matching result;
and the picture cutting module is used for cutting the video picture according to the key character to obtain a vertical screen picture.
In a third aspect, the present invention further provides a clipping device for horizontally and vertically switching pictures, including:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor executes the cropping method of the landscape and portrait screen conversion picture in the first aspect.
In a fourth aspect, the present invention also provides a storage medium, in which a processor-executable program is stored, and the processor-executable program is used for executing the method in the first aspect when being executed by a processor.
Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
according to the technical scheme, the video picture and the audio file are separated, the dynamic information of the face is segmented in the video picture, the dynamic information obtained by segmentation is subjected to feature matching with the audio file, the speaker in the video is positioned, and the key character and the non-key character are distinguished on a section of video containing a multi-person picture, so that the picture of the key character can be kept as far as possible, the picture of the non-key character is cut, and a better cutting effect is obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart illustrating steps of a method for cropping a horizontal/vertical screen transition picture according to an embodiment of the present disclosure;
FIG. 2 is a schematic view of a landscape screen before cropping in an embodiment of the present application;
FIG. 3 is a schematic diagram of a clipped portrait screen in an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the embodiments of the present application, a landscape video file or video picture refers to a picture whose width is greater than its height, for example with a ratio of 21:9, 16:9, etc.; a portrait video file or video picture refers to a picture whose width is less than its height, such as 9:16, 3:4, etc.
In a first aspect, as shown in fig. 1, the technical solution of the present application provides an embodiment of a method for clipping a horizontal-vertical screen conversion picture, where the method includes steps S100-S400:
s100, acquiring the horizontal screen video file, and separating the horizontal screen video file to obtain a video picture and an audio file.
Specifically, an audio file is exported from the acquired landscape video through an existing video editing method or video editing tool, and the original video file is used directly as the source of video pictures for subsequent processing. Typically, the exported audio file is in MP3 format.
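For illustration only (this sketch is not part of the claimed embodiment), the separation step can be performed with the ffmpeg command-line tool, one possible existing video editing tool; the file names are placeholders:

```python
import subprocess

def separate_audio(video_path: str, audio_path: str = "audio.mp3") -> None:
    """Export the audio track of a landscape video as an MP3 file.

    The original video file is left untouched and is used directly as the
    source of video pictures in the subsequent steps.
    """
    subprocess.run(
        ["ffmpeg", "-y",           # overwrite the output if it exists
         "-i", video_path,         # input landscape video file
         "-vn",                    # drop the video stream
         "-acodec", "libmp3lame",  # encode the audio track as MP3
         audio_path],
        check=True,
    )

separate_audio("landscape_input.mp4")
```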
S200, acquiring a face position in a video picture, determining a face image, and segmenting the face image to obtain dynamic information;
the dynamic information refers to a part of the human face, such as lips, cheeks, and chin, which is obviously different from the beginning and the end of speaking. Specifically, the face position of each frame in the video is obtained through a face recognition algorithm. In the embodiment, YOLOv3(You Only Look one) is adopted to extract the face position, YOLOv3 is an object detection algorithm based on a deep convolutional neural network, and a high-precision face detection model is obtained by public face data training; in practice WIDER FACE (a face detection reference data set) is used for training. Carrying out frame-by-frame identification on the face in the video picture by training the obtained face detection model; after the face position in the picture is determined, extracting a face image at the position, namely the face image contains face information, registering face key points of the face image through a DAN (deep Alignment network) algorithm, connecting key points of a chin, a lip and two cheek parts through a key point rule to obtain a face key point combination in each frame of picture, and collecting the key point combinations of each frame of video picture to obtain dynamic information. The embodiment determines the character of the person speaking in each frame of the video through the dynamic information of the face. In addition, besides the dynamic information obtained by dividing the face image, the remaining key points are used as static information of the face image, that is, the static information is partial features of the face with little difference between the start and stop of speaking, such as head contour, ears, nose, forehead, eyebrows, etc.; when the character of the person speaking in each frame of the video is determined through the dynamic information of the face, the face information can also be corrected through the static information.
S300, matching the dynamic information with the audio file, and determining key characters according to a matching result;
Specifically, the dynamic information of each face in each frame of image and the corresponding audio are input into a discrimination model, which outputs the result of matching the dynamic information with the audio; for example, when the matching result is a score, the face whose dynamic information matches the audio with the highest score is determined to be the key character, i.e., the speaker, in the current frame of image.
S400, clipping the video picture according to the key characters to obtain a vertical screen picture.
Specifically, after the key character in each frame of the video picture is determined, the original video picture is cropped according to a preset crop size; the cropping process is repeated frame by frame, the cropped frames are stitched to generate the converted portrait video, and the original audio file is added back according to the time axis of the video picture. It will be appreciated that the crop size should satisfy the picture aspect ratio of a portrait screen.
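A minimal sketch of the reassembly, assuming the cropped frames are already available as arrays: OpenCV writes the silent portrait video, then ffmpeg muxes the audio exported in step S100 back in along the time axis (file names are illustrative):

```python
import subprocess
import cv2

def assemble_portrait_video(frames, fps, out_silent="portrait_silent.mp4",
                            audio_path="audio.mp3", out_final="portrait.mp4"):
    """Write cropped frames to a portrait video, then mux the original audio
    back in along the same time axis."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_silent,
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    # Re-attach the audio file exported in step S100.
    subprocess.run(["ffmpeg", "-y", "-i", out_silent, "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_final],
                   check=True)
```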
In some possible embodiments, the step S200 of acquiring a face position in a video frame, determining a face image, and segmenting the face image to obtain dynamic information may further include subdivided steps S210-S230:
s210, determining a face image according to the face position, generating a gray level image of the face image, and extracting a feature map of the gray level image;
specifically, gray level processing is carried out on a face image determined according to the face position to obtain a radian image, and a characteristic diagram of the gray level image is extracted according to a Feed Forward NN (Feed Forward NN) in a DAN network architecture; the calculation process of the feature map comprises the following steps: the feature matrix of 1 × 3136 is output, a matrix transformation is performed to obtain a 56 × 56 matrix, and then up-sampling is performed to obtain a matrix of the same size as the input grayscale image, for example, 112 × 112.
S220, obtaining key points according to the gray level image prediction, generating a similarity transformation matrix of the key points, and determining a key point hot spot diagram according to the similarity transformation matrix;
Specifically, the positions of a first batch of key points are predicted from the standard key-point template provided by the DAN network architecture, and the similarity transformation matrix from this first batch of key points to the standard template is computed by the Connection Layers of the DAN architecture. The input grayscale image can be corrected through the similarity transformation matrix, and the key points are transformed in the same way to obtain the key-point heat map; the corrected image, together with the feature map extracted by the next Feed Forward NN layer, is then used in the following iteration. The key-point heat map is computed with center decay: the value is largest at a key point and becomes smaller the farther away from the key point.
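A small sketch of the center-decay heat map, assuming the common DAN-style form value = 1 / (1 + distance to the nearest key point); the exact decay function of the embodiment is not specified, so this form is an assumption:

```python
import numpy as np

def keypoint_heatmap(points, size=112):
    """Center-decay heat map: maximal at each key point, decaying with
    distance from it (decay form assumed)."""
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size), dtype=np.float32)
    for (px, py) in points:
        dist = np.sqrt((xs - px) ** 2 + (ys - py) ** 2)
        heat = np.maximum(heat, 1.0 / (1.0 + dist))  # 1 at the point, decaying
    return heat

hm = keypoint_heatmap([(30, 40), (80, 60)])
```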
And S230, iterating through a feedforward neural network according to the gray level image, the feature map and the key point hot spot map to determine dynamic information of the face.
Specifically, the grayscale image, the feature map and the key-point heat map are input to the Feed Forward NN for iteration, producing a new key-point heat map; when the iteration finishes, the facial key points in the grayscale image are obtained, and combining the key points across each frame of the video picture yields the dynamic information of the face.
In some optional embodiments, step S300 of matching the dynamic information with the audio file and determining the key character according to the matching result includes subdivided steps S310-S330:
s310, coding the dynamic information to obtain a first feature vector;
Specifically, in this embodiment the dynamic information of the face image is input to an encoder for facial dynamic information, which encodes it into a real-valued vector of length 1024, namely the first feature vector.
S320, coding the audio file to obtain a second feature vector;
Specifically, the audio file is divided into frames along the time axis, features are extracted from each audio frame using the MFCC (Mel-Frequency Cepstral Coefficient) algorithm, and the features output by the MFCC are encoded by an encoder into a real-valued vector of length 1024, namely the second feature vector.
S330, splicing the first feature vector and the second feature vector, and outputting the matching probability through a convolutional neural network; determining the face position with the highest score in the matching probability as a key figure;
Specifically, the feature vectors obtained in steps S310 and S320 are spliced into a vector of length 2048, and the trained convolutional neural network outputs the probability that the pair matches.
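The splicing step can be sketched as follows; the classifier head below is a stand-in with assumed layer sizes (a small fully-connected network rather than the embodiment's trained convolutional network):

```python
import torch
import torch.nn as nn

# First (visual) and second (audio) feature vectors, each of length 1024.
v_visual = torch.randn(1, 1024)
v_audio = torch.randn(1, 1024)
pair = torch.cat([v_visual, v_audio], dim=1)  # spliced vector of length 2048

# Stand-in classifier head (assumed layer sizes, not the patent's network):
head = nn.Sequential(
    nn.Linear(2048, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),  # matching probability in [0, 1]
)
match_prob = head(pair)  # the highest-scoring face becomes the key character
```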
In some possible embodiments, the step S310 of encoding the dynamic information to obtain the first feature vector may include more detailed steps S311-S313:
s311, determining the average value of the pixels in the dynamic information through principal component analysis;
Specifically, in this embodiment, through Principal Component Analysis (PCA), the average value of each pixel over all the dynamic information is first calculated to obtain a mean matrix with 1 row and M columns, where M is the image size of the dynamic information.
S312, calculating a covariance matrix of the average value of the pixels, and determining a pixel feature vector according to the covariance matrix;
Specifically, the rows of the matrix obtained in step S311 are replicated so that the number of rows matches the number N of dynamic-information samples, and the pixels of the dynamic information are arranged into an original data matrix of N rows and M columns, where M is the image size of the dynamic information. The covariance matrix is computed from the original data matrix and the mean matrix. The eigenvalue matrix and eigenvector matrix are then calculated from the covariance matrix, and the eigenvalues are sorted from large to small. The eigenvalue matrix is N × 1 and the eigenvector matrix is N × N, with each row representing the eigenvector corresponding to one eigenvalue. Finally, each row of the eigenvector matrix is normalized.
S313, projecting the dynamic information according to the pixel characteristic vector to obtain a first characteristic vector;
Specifically, according to the normalized eigenvector matrix, the dynamic-information pixels of each face image are projected into the PCA space, and the projection vector of a single face image is obtained, namely the first feature vector.
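Steps S311-S313 can be sketched in NumPy as follows; the N × N covariance follows the description above (cheap when the number of frames N is much smaller than the pixel count M), and the number of retained components k is an assumed value:

```python
import numpy as np

def pca_project(frames: np.ndarray, k: int = 64) -> np.ndarray:
    """Project flattened face images onto a PCA space (steps S311-S313).

    frames: (N, M) array, one flattened dynamic-information image per row.
    """
    mean = frames.mean(axis=0, keepdims=True)       # S311: 1 x M pixel means
    centered = frames - mean                        # broadcast the mean row
    cov = centered @ centered.T / frames.shape[0]   # S312: N x N covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]           # sort large -> small
    basis = centered.T @ eigvecs[:, order]          # M x k image-space axes
    basis /= np.linalg.norm(basis, axis=0, keepdims=True)  # normalize axes
    return centered @ basis                         # S313: N x k projections
```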
In some possible embodiments, the step S320 of encoding the audio file to obtain the second feature vector may include subdivided steps S321-S323:
s321, improving the high-frequency part of the audio file to obtain a pre-emphasis signal, and framing the pre-emphasis signal to obtain a single-frame signal;
Specifically, the speech signal in the audio file is preprocessed: it is pre-emphasized through a high-pass filter. Pre-emphasis boosts the high-frequency part to flatten the signal's spectrum, so that the spectrum can be computed with the same signal-to-noise ratio over the whole band from low to high frequency; it also removes the effects of the vocal cords and lips during speech production, compensating the high-frequency part of the speech signal suppressed by the articulation system and highlighting the high-frequency formants. The signal is then framed: n sampling points (n a positive integer) are grouped into one observation unit, called a frame. In this embodiment n is 256 or 512, covering about 20-30 ms. To avoid excessive variation between two adjacent frames, an overlap region of m sampling points is kept between adjacent frames, with m typically about 1/2 to 1/3 of n.
S322, windowing the single-frame signal, and obtaining a frequency domain signal through fast Fourier transform;
Specifically, since speech keeps changing over a long range and cannot be processed without fixed characteristics, each frame of the signal is substituted into a window function and the values outside the window are set to 0, to reduce the signal discontinuity that may arise at the two ends of each frame. Commonly used window functions include the rectangular window, the Hamming window and the Hanning window; this embodiment adopts the Hamming window on account of its frequency-domain characteristics. Each frame is multiplied by the Hamming window to increase the continuity of its left and right ends. Because the characteristics of a signal are usually hard to see in the time domain, the signal is transformed into an energy distribution in the frequency domain for observation, where different energy distributions represent the characteristics of different speech sounds. After multiplication by the Hamming window, each frame undergoes a fast Fourier transform to obtain its energy distribution over the spectrum, i.e., the spectrum of each framed and windowed frame signal; the power spectrum of the speech signal is obtained by taking the squared modulus of its spectrum.
S323, determining the energy of the frequency domain signal through a Mel filter bank, determining a Mel frequency cepstrum coefficient, and obtaining a second eigenvector through the Mel frequency cepstrum coefficient.
Specifically, an energy spectrum is calculated from the power spectrum and passed through a bank of Mel-scale triangular filters; a filter bank with M filters is defined, where the number of filters is close to the number of critical bands, and in this embodiment triangular filters are adopted. The filtering smooths the spectrum, eliminates the effect of harmonics, and highlights the formants of the original speech, so a speech recognition system based on MFCC features is not affected by differences in the pitch of the input speech. The logarithmic energy output by each filter bank is then calculated, the MFCC coefficients are obtained through a discrete cosine transform, and the second feature vector is obtained by encoding them with an encoder.
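Steps S321-S323 combined into one illustrative NumPy implementation; the sampling rate, filter count and coefficient count below are assumed values, not taken from the embodiment:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n=512, m=None, n_mels=26, n_coeffs=13):
    """MFCC features following steps S321-S323 (parameter values assumed).

    n: samples per frame; m: overlap between adjacent frames, ~n/2 default.
    """
    m = m or n // 2
    # S321: pre-emphasis boosts the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # S321: framing into overlapping observation units of n samples.
    step = n - m
    n_frames = 1 + max(0, (len(emphasized) - n) // step)
    frames = np.stack([emphasized[i * step : i * step + n]
                       for i in range(n_frames)])
    # S322: Hamming window, then FFT -> power spectrum.
    frames *= np.hamming(n)
    power = np.abs(np.fft.rfft(frames, n)) ** 2 / n
    # S323: triangular mel filter bank -> log energy -> DCT -> MFCC.
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = np.linspace(0, mel_max, n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_coeffs]
```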
In some possible embodiments, the step S400 of cropping the video frame according to the key character to obtain the portrait frame may include the steps S410 or S420:
s410, determining that key characters do not exist in the horizontal screen video file, keeping the picture height in the horizontal screen video file unchanged, determining the picture width according to a preset picture proportion, determining a cutting area according to the picture height and the picture width, and cutting according to the cutting area to obtain a vertical screen picture;
or,
s420, determining key characters in the horizontal screen video file, determining a cutting area, controlling the cutting area to slide in the picture of the horizontal screen video file, determining the face of the key character to be located in the cutting area, and cutting the picture to obtain a vertical screen picture.
Specifically, the size of the crop area is determined first, by the original video resolution and the target crop ratio; the cropping principle is to keep as much of the original video picture as possible. For example, when the original video resolution is 1920 × 1080 (width × height) and the target crop ratio is 9:16 (width:height), the height is not cropped in order to keep as much of the original picture as possible, and the width is cropped to 1080 ÷ 16 × 9 = 607.5, so the size of the crop area is 607.5 × 1080 (width × height);
Fig. 2 is a schematic diagram of the landscape picture before cropping. Cropping is performed according to the identified key speaker: in this embodiment, a 607.5:1080 crop box is slid from left to right over the original 1920:1080 picture, and when the face of the key speaker is in the middle of the crop box, the position of the crop box is the required cropping position. Fig. 3 is a schematic diagram of the portrait picture obtained after cropping.
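A minimal sketch of the crop-box computation and face-centered sliding; the clamping at the frame edges is an added assumption for faces close to the border:

```python
import numpy as np

def crop_portrait(frame, face_center_x, target_ratio=9 / 16):
    """Crop one landscape frame to a portrait ratio, keeping the key
    speaker's face horizontally centered in the crop box (step S420).

    For a 1920 x 1080 frame and a 9:16 (width:height) target, the height
    is kept and the crop width becomes 1080 * 9 / 16 = 607.5 -> 608 px.
    """
    h, w = frame.shape[:2]
    crop_w = int(round(h * target_ratio))      # keep full height, trim width
    # Slide the crop box so the face sits in its middle, clamped to the frame.
    left = int(round(face_center_x - crop_w / 2))
    left = max(0, min(left, w - crop_w))
    return frame[:, left : left + crop_w]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # dummy landscape frame
portrait = crop_portrait(frame, face_center_x=1200)
print(portrait.shape)  # (1080, 608, 3)
```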
In a second aspect, the present application provides a cropping system for a landscape-portrait screen transformation picture, used to perform the method of the first aspect, which includes:
the sound and picture separation module is used for acquiring the horizontal screen video and audio file and separating the horizontal screen video and audio file to obtain a video picture and an audio file;
the characteristic cutting module is used for acquiring the face position in the video picture, determining a face image and segmenting the face image to obtain dynamic information;
the character matching module is used for matching the dynamic information with the audio file and determining key characters according to a matching result;
and the picture cutting module is used for cutting the video picture according to the key character to obtain a vertical screen picture.
In a third aspect, the present application further provides a clipping device for horizontal and vertical screen conversion pictures, which includes at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor executes a cropping method of a landscape-portrait screen converted picture as in the first aspect.
An embodiment of the present invention further provides a storage medium storing a program, where the program is executed by a processor to implement the method in the first aspect.
From the above specific implementation process, it can be concluded that the technical solution provided by the present invention has the following advantages or beneficial effects compared to the prior art:
the invention provides a technical scheme for clipping a horizontal and vertical screen conversion picture based on speaker positioning, and aims to distinguish key characters from non-key characters on a section of video containing a multi-person picture by positioning speakers in the video, so that the picture of the key characters can be kept as far as possible and the picture of the non-key characters can be clipped, and a better clipping effect can be obtained.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A method for cutting a horizontal and vertical screen conversion picture is characterized by comprising the following steps:
acquiring a horizontal screen video file, and separating the horizontal screen video file to obtain a video picture and an audio file;
acquiring the face position in the video picture, determining a face image, and segmenting the face image to obtain dynamic information;
matching the dynamic information with the audio file, and determining key figures according to matching results;
clipping the video picture according to the key character to obtain a vertical screen picture;
the acquiring of the face position in the video picture, determining the face image, and segmenting the face image to obtain dynamic information includes:
determining the face image according to the face position, generating a gray level image of the face image, and extracting a feature map of the gray level image;
predicting to obtain key points according to the gray level image, generating a similarity transformation matrix of the key points, and determining a key point hot spot diagram according to the similarity transformation matrix;
and iterating and determining the dynamic information of the human face through a feedforward neural network according to the gray level image, the feature map and the key point hot spot map.
2. The method for clipping horizontal and vertical screen transition picture according to claim 1, wherein the matching with the audio file according to the dynamic information and the determination of key characters according to the matching result comprise:
coding the dynamic information to obtain a first feature vector;
coding the audio file to obtain a second feature vector;
splicing the first feature vector and the second feature vector, and outputting a matching probability through a convolutional neural network; and determining the face position with the highest score in the matching probability as the key figure.
3. The method of claim 2, wherein the encoding according to the motion information to obtain a first feature vector comprises:
determining an average value of pixels in the dynamic information through principal component analysis;
calculating a covariance matrix of the average value of the pixels, and determining a pixel eigenvector according to the covariance matrix;
and projecting the dynamic information according to the pixel characteristic vector to obtain the first characteristic vector.
4. The method of claim 2, wherein the encoding the audio file to obtain the second feature vector comprises:
increasing the high-frequency part of the audio file to obtain a pre-emphasis signal, and framing the pre-emphasis signal to obtain a single-frame signal;
windowing the single-frame signal, and obtaining a frequency domain signal through fast Fourier transform;
and determining the energy of the frequency domain signal through a Mel filter bank, determining a Mel frequency cepstrum coefficient, and obtaining a second eigenvector through the Mel frequency cepstrum coefficient.
5. The method for cropping a horizontal/vertical screen converted picture according to any one of claims 1-4, wherein the cropping the video picture according to the key character to obtain a vertical screen picture comprises:
determining that the key character does not exist in the transverse screen video file, and keeping the picture height in the transverse screen video file unchanged;
determining the width of a picture according to a preset picture proportion, and determining a cutting area according to the height of the picture and the width of the picture;
and clipping according to the clipping area to obtain the vertical screen picture.
6. The method for cropping a horizontal-vertical screen converted picture according to claim 5, wherein the cropping the video picture according to the key character to obtain a vertical screen picture comprises:
determining that the key character exists in the horizontal screen video file, and determining the cropping area;
and controlling the cutting area to slide in the picture of the horizontal screen video file, determining that the face of the key figure is positioned in the cutting area, and cutting the picture to obtain the vertical screen picture.
7. A system for cutting a horizontal and vertical screen conversion picture, characterized by comprising:
a sound and picture separation module, used for acquiring a landscape video file and separating it to obtain a video picture and an audio file;
the characteristic cutting module is used for acquiring the face position in the video picture, determining a face image and segmenting the face image to obtain dynamic information;
the character matching module is used for matching the dynamic information with the audio file and determining key characters according to a matching result;
the picture cutting module is used for cutting the video picture according to the key character to obtain a vertical screen picture;
the acquiring the face position in the video picture, determining a face image, and segmenting the face image to obtain dynamic information includes:
determining the face image according to the face position, generating a gray level image of the face image, and extracting a feature map of the gray level image;
predicting to obtain key points according to the gray level image, generating a similarity transformation matrix of the key points, and determining a key point hot spot diagram according to the similarity transformation matrix;
and iterating and determining the dynamic information of the human face through a feedforward neural network according to the gray level image, the feature map and the key point hot spot map.
8. A cutting device for horizontal and vertical screen conversion pictures is characterized by comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to perform the method for cutting a horizontal and vertical screen conversion picture according to any one of claims 1 to 6.
9. A storage medium having stored therein a processor-executable program, wherein the processor-executable program, when executed by a processor, is configured to perform the method for cutting a horizontal and vertical screen conversion picture according to any one of claims 1 to 6.
CN202110324841.8A 2021-03-26 2021-03-26 Method, system, device and medium for cutting horizontal and vertical screen conversion picture Active CN113077470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110324841.8A CN113077470B (en) 2021-03-26 2021-03-26 Method, system, device and medium for cutting horizontal and vertical screen conversion picture


Publications (2)

Publication Number Publication Date
CN113077470A CN113077470A (en) 2021-07-06
CN113077470B 2022-01-18

Family

ID=76610693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110324841.8A Active CN113077470B (en) 2021-03-26 2021-03-26 Method, system, device and medium for cutting horizontal and vertical screen conversion picture

Country Status (1)

Country Link
CN (1) CN113077470B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113645483A (en) * 2021-08-20 2021-11-12 珠海九松科技有限公司 Cross-platform automatic video editing method
CN114257762A (en) * 2021-12-20 2022-03-29 咪咕音乐有限公司 Video conversion method, device, equipment and storage medium
CN114419716B (en) * 2022-01-26 2024-03-15 北方工业大学 Calibration method for face image face key point calibration

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412647A (en) * 2013-08-13 2013-11-27 广东欧珀移动通信有限公司 Face recognition page display control method and mobile terminal
CN108234763A (en) * 2017-12-29 2018-06-29 广州优视网络科技有限公司 Horizontal/vertical screen switching method, device and terminal
CN109309808A (en) * 2017-07-26 2019-02-05 梭维智能科技(深圳)有限公司 A kind of monitoring system and method based on recognition of face
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI518675B (en) * 2013-08-15 2016-01-21 中華電信股份有限公司 A method for segmenting videos and audios into clips using speaker recognition
WO2018106213A1 (en) * 2016-12-05 2018-06-14 Google Llc Method for converting landscape video to portrait mobile layout
CN107483843B (en) * 2017-08-16 2019-11-15 成都品果科技有限公司 Audio-video matches clipping method and device
CN108650542B (en) * 2018-05-09 2022-02-01 腾讯科技(深圳)有限公司 Method for generating vertical screen video stream and processing image, electronic equipment and video system
CN110544481B (en) * 2019-08-27 2022-09-20 华中师范大学 S-T classification method and device based on voiceprint recognition and equipment terminal
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN111212245B (en) * 2020-01-15 2022-03-25 北京猿力未来科技有限公司 Method and device for synthesizing video
CN111556254B (en) * 2020-04-10 2021-04-02 早安科技(广州)有限公司 Method, system, medium and intelligent device for video cutting by using video content
CN111914811B (en) * 2020-08-20 2021-09-28 腾讯科技(深圳)有限公司 Image data processing method, image data processing device, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412647A (en) * 2013-08-13 2013-11-27 广东欧珀移动通信有限公司 Face recognition page display control method and mobile terminal
CN109309808A (en) * 2017-07-26 2019-02-05 梭维智能科技(深圳)有限公司 A kind of monitoring system and method based on recognition of face
CN108234763A (en) * 2017-12-29 2018-06-29 广州优视网络科技有限公司 Horizontal/vertical screen switching method, device and terminal
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113077470A (en) 2021-07-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant