CN111556335A - Video sticker processing method and device

Info

Publication number
CN111556335A
Authority
CN
China
Prior art keywords
target
sticker
video
text
voice recognition
Prior art date
Legal status
Pending
Application number
CN202010297623.5A
Other languages
Chinese (zh)
Inventor
林倩雅
夏天
何雷米一阳
陈斯
黄子汕
刘荣潺
Current Assignee
Good Morning Technology Guangzhou Co ltd
Original Assignee
Good Morning Technology Guangzhou Co ltd
Priority date
Filing date
Publication date
Application filed by Good Morning Technology Guangzhou Co ltd
Priority to CN202010297623.5A (CN111556335A)
Priority to US16/935,167 (US11218648B2)
Publication of CN111556335A

Classifications

    • H04N 5/265 Mixing
    • H04N 5/278 Subtitling
    • G11B 27/036 Insert-editing
    • G11B 27/28 Indexing; addressing; timing or synchronising by using information signals recorded by the same method as the main recording
    • H04N 21/233 Processing of audio elementary streams
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/47205 End-user interface for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • H04N 21/4884 Data services, e.g. news ticker, for displaying subtitles
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06V 20/40 Scenes; scene-specific elements in video content
    • G06V 40/172 Human faces: classification, e.g. identification
    • G10L 15/26 Speech to text systems
    • G10L 25/57 Speech or voice analysis techniques specially adapted for comparison or discrimination, for processing of video signals
    • G06T 2207/10016 Video; image sequence
    • G06T 2207/30201 Face
    • G06V 2201/09 Recognition of logos


Abstract

The invention discloses a video sticker processing method and device. The method comprises the following steps: performing face recognition and voice recognition on a video to be processed, so as to obtain face position data when the face recognition succeeds and a voice recognition text when the voice recognition succeeds; matching the voice recognition text with the description text of each sticker in a sticker library to obtain a target sticker, and obtaining a target video frame according to the voice recognition text; and adding the target sticker at a default position or a target position of the target video frame, wherein the target position is calculated from the face position data. The invention can automatically determine the target sticker and its adding position from the face recognition result and the voice recognition result of the video to be processed, realizing intelligent selection and placement of the target sticker and improving video sticker processing efficiency.

Description

Video sticker processing method and device
Technical Field
The invention relates to the technical field of video processing, in particular to a method and a device for processing video stickers.
Background
As video-based social networking has emerged as a new form of internet interaction, a variety of video editing applications have appeared. To make videos more entertaining, users often add stickers to them with video editing software. In practice, the user manually selects a target sticker from a sticker library according to personal preference, manually selects a target video frame from the frames of the video, and manually adjusts the placement of the target sticker after it is added to the target video frame, so that the sticker is rendered in the target video frame during playback. Because the prior art relies on these manual operations, video sticker processing is time-consuming and inefficient.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a video sticker processing method and device that can automatically determine a target sticker and its adding position from the face recognition result and the voice recognition result of a video to be processed, realizing intelligent selection and placement of the target sticker and improving video sticker processing efficiency.
In order to solve the above technical problem, in a first aspect, an embodiment of the present invention provides a video sticker processing method, including:
respectively carrying out face recognition and voice recognition on a video to be processed so as to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful;
matching the voice recognition text with the description text of each sticker in a sticker library to obtain a target sticker, and acquiring a target video frame according to the voice recognition text;
adding the target sticker at a default position or a target position of the target video frame; wherein the target position is calculated from the face position data.
Further, the face recognition and the voice recognition are respectively performed on the video to be processed to obtain face position data when the face recognition is successful, and obtain a voice recognition text when the voice recognition is successful, specifically:
sequentially carrying out face recognition on the video frames of the video to be processed, and obtaining the face position data of the corresponding video frame when the face recognition of one video frame is successful;
and performing voice recognition on the video to be processed, and converting the recognized voice data into text data when the voice recognition is successful to obtain the voice recognition text.
Further, matching the voice recognition text with the description text of each sticker in a sticker library to obtain a target sticker, and acquiring a target video frame according to the voice recognition text, specifically:
matching text words obtained by word segmentation processing of the voice recognition text with description texts of each sticker in the sticker library to obtain the target sticker;
and acquiring the appearance time of the voice recognition text in the video to be processed, and taking the video frame with the playing time corresponding to the appearance time as the target video frame.
Further, the adding of the target sticker at the default position or the target position of the target video frame further includes:
removing the target sticker when its appearance duration at the default position or the target position reaches a preset threshold.
Further, after the performing face recognition and voice recognition on the video to be processed respectively to obtain face position data when the face recognition is successful and obtaining a voice recognition text when the voice recognition is successful, the method further includes:
and adding the voice recognition text at the subtitle position of the target video frame.
In a second aspect, an embodiment of the present invention provides a video sticker processing apparatus, including:
the face and voice recognition module is used for respectively carrying out face recognition and voice recognition on the video to be processed so as to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful;
the target sticker acquisition module is used for matching the voice recognition text with the description text of each sticker in the sticker library to obtain a target sticker, and for acquiring a target video frame according to the voice recognition text;
the target sticker adding module is used for adding the target sticker at the default position or the target position of the target video frame, wherein the target position is calculated from the face position data.
Further, the face recognition and the voice recognition are respectively performed on the video to be processed to obtain face position data when the face recognition is successful, and obtain a voice recognition text when the voice recognition is successful, specifically:
sequentially carrying out face recognition on the video frames of the video to be processed, and obtaining the face position data of the corresponding video frame when the face recognition of one video frame is successful;
and performing voice recognition on the video to be processed, and converting the recognized voice data into text data when the voice recognition is successful to obtain the voice recognition text.
Further, matching the voice recognition text with the description text of each sticker in a sticker library to obtain a target sticker, and acquiring a target video frame according to the voice recognition text, specifically:
matching text words obtained by word segmentation processing of the voice recognition text with description texts of each sticker in the sticker library to obtain the target sticker;
and acquiring the appearance time of the voice recognition text in the video to be processed, and taking the video frame with the playing time corresponding to the appearance time as the target video frame.
Further, the target sticker adding module is further configured to remove the target sticker when the appearance duration of the target sticker at the default position or the target position reaches a preset threshold.
Furthermore, the video sticker processing device further comprises a voice recognition text adding module, which is used for adding the voice recognition text at the subtitle position of the target video frame after the face recognition and voice recognition have been performed on the video to be processed, that is, after the face position data is obtained when the face recognition is successful and the voice recognition text is obtained when the voice recognition is successful.
The embodiment of the invention has the following beneficial effects:
the method comprises the steps of respectively carrying out face recognition and voice recognition on videos to be processed to obtain face position data when the face recognition is successful, obtaining voice recognition texts when the voice recognition is successful, further matching the voice recognition texts with description texts of all stickers in a sticker library to obtain target stickers, obtaining target video frames according to the voice recognition texts, adding the target stickers at default positions of the target video frames or target positions obtained by calculation according to the face position data, and finishing video sticker processing. Compared with the prior art, the embodiment of the invention carries out face recognition and voice recognition on the video to be processed, so that when the voice recognition is successful, the voice recognition text is matched with the description text of each sticker in the sticker library to obtain the target sticker, the target video frame is obtained according to the voice recognition text, when the face recognition is failed, the target sticker is added at the default position of the target video frame according to the default position preset aiming at the target sticker, when the face recognition is successful, the target position is calculated according to the face position data, and the target sticker is added at the target position of the target video frame. The embodiment of the invention can automatically determine the target paster and the adding position thereof according to the face recognition result and the voice recognition result of the video to be processed, realize intelligent selection and placement of the target paster and improve the processing efficiency of the video paster.
Drawings
FIG. 1 is a flowchart illustrating a video sticker processing method according to a first embodiment of the present invention;
FIG. 2 is another flow chart illustrating a video sticker processing method according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of a video sticker processing apparatus according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a preferred embodiment in the second embodiment of the present invention.
Detailed Description
The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, the step numbers in the text are only for convenience of explanation of the specific embodiments, and do not serve to limit the execution sequence of the steps. The method provided by the embodiment can be executed by the relevant server, and the server is taken as an example for explanation below.
Please refer to fig. 1-2.
As shown in fig. 1-2, the first embodiment provides a video sticker processing method, including steps S1-S3:
and S1, respectively carrying out face recognition and voice recognition on the video to be processed so as to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful.
And S2, matching the voice recognition text with the description text of each sticker in the sticker library to obtain a target sticker, and acquiring a target video frame according to the voice recognition text.
S3, adding the target sticker at a default position or a target position of the target video frame; wherein the target position is calculated from the face position data.
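For orientation only, steps S1-S3 can be sketched in Python as follows; every helper function in the sketch (run_face_recognition, run_speech_recognition, match_sticker, occurrence_time, frame_at, compute_target_position_from, compute_default_position_for, add_sticker) is a hypothetical placeholder rather than part of the disclosed implementation.

```python
def process_video_stickers(video, sticker_library):
    """End-to-end sketch of steps S1-S3; all helpers are placeholders."""
    # S1: face recognition and voice recognition on the video to be processed.
    face_position = run_face_recognition(video)   # None when face recognition fails
    speech_text = run_speech_recognition(video)   # None when voice recognition fails
    if speech_text is None:
        return video                              # no text: exit sticker processing

    # S2: match the voice recognition text against each sticker's description
    # text, and take the frame whose playing time corresponds to the text's
    # occurrence time as the target video frame.
    target_sticker = match_sticker(speech_text, sticker_library)
    target_frame = frame_at(video, occurrence_time(video, speech_text))

    # S3: add the sticker at the target position (face recognized) or at a
    # default position (face recognition failed).
    if face_position is not None:
        position = compute_target_position_from(face_position)
    else:
        position = compute_default_position_for(video)
    add_sticker(target_frame, target_sticker, position)
    return video
```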
As an example, a user uploads a video to be processed through a user terminal, and when the server receives the video, it performs face recognition and voice recognition on it respectively. If the face recognition succeeds, face position data is obtained; if the voice recognition succeeds, a voice recognition text is obtained. The user terminal includes mobile phones, computers, tablets, and other communication devices that can connect to the server.
In a preferred embodiment of this embodiment, after obtaining the speech recognition text, the server may issue the speech recognition text to the user terminal, so that the user may confirm the speech recognition text through the user terminal.
When the voice recognition succeeds, the voice recognition text is matched with the description text of each sticker in the sticker library; the sticker whose description text successfully matches the voice recognition text is the target sticker. Meanwhile, the target video frame is obtained according to the voice recognition text.
In a preferred embodiment of this embodiment, after obtaining the target sticker, the server may issue the target sticker to the user terminal, so that the user may confirm the target sticker through the user terminal. After the target video frame is obtained, the server can issue the target video frame to the user terminal, so that the user can confirm the target video frame through the user terminal.
After the target sticker and the target video frame are obtained, the adding position of the target sticker is determined in combination with the face recognition result: when the face recognition fails, the target sticker is added at a default position preset for the target sticker in the target video frame; when the face recognition succeeds, the target position is calculated from the face position data and the target sticker is added at the target position of the target video frame.
The setting process of the default position may proceed as follows: when face recognition of the video to be processed fails, that is, no face can be recognized or the width of the face rectangle is smaller than 30% of the width of the phone screen, a default 300 x 380 rectangle is first placed at the center of the phone screen, its inscribed ellipse is then drawn, the points on the inscribed ellipse are taken as default effective points, and finally one default effective point is randomly selected from all the default effective points as the default position.
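As an illustration only, this default-position rule can be sketched as follows; the eight-point sampling of the ellipse and the pixel coordinate system are assumptions of the sketch, since the disclosure does not fix them.

```python
import math
import random

def compute_default_position(screen_w, screen_h, num_points=8):
    """Random default effective point on the inscribed ellipse of a 300 x 380
    rectangle centered on the phone screen (point count is an assumption)."""
    cx, cy = screen_w / 2, screen_h / 2   # center of the default rectangle
    a, b = 300 / 2, 380 / 2               # semi-axes of the inscribed ellipse
    default_effective_points = [
        (cx + a * math.cos(2 * math.pi * k / num_points),
         cy + b * math.sin(2 * math.pi * k / num_points))
        for k in range(num_points)
    ]
    return random.choice(default_effective_points)
```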
The calculation process of the target position may proceed as follows: when face recognition of the video to be processed succeeds, that is, the width of the face rectangle is greater than 30% of the width of the phone screen, the face rectangle is widened by 40%, its upper half is extended by 60%, and its lower half is extended by 30%, so that the width of the resulting rectangle is not less than 65% of the width of the phone screen. The inscribed ellipse of this rectangle is drawn, and the points on the ellipse (equally divided into 8-10 points) are spare points for the target sticker; spare points outside the phone screen are unavailable points, and spare points inside the phone screen are available points. A default sticker (whose width is greater than 45% of the width of the face rectangle) is then placed at each available point; if the placement area of the default sticker exceeds 20% of the phone screen, the corresponding available point is an invalid point, otherwise it is an effective point. Finally, one effective point is randomly selected from all the effective points as the target position. When the number of effective points is less than 3, a safe-area rectangle centered on the screen, with width equal to 80% of the screen width and height equal to 70% of the screen height, is used instead; it is then judged whether the height above or below the center point of this rectangle exceeds 5% of the height of the phone screen, and if so, the effective point on the opposite side is taken as the target position.
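A minimal sketch of this calculation follows, under the percentages stated above; sticker_area_at stands in for the 20%-of-screen placement test, and the fewer-than-three-effective-points fallback is omitted for brevity.

```python
import math
import random

def compute_target_position(face_rect, screen_w, screen_h,
                            sticker_area_at, num_points=8):
    """Sketch of the target-position rule; face_rect is (x, y, w, h) with a
    top-left origin, and sticker_area_at is a caller-supplied placeholder."""
    x, y, w, h = face_rect
    new_w = max(w * 1.4, 0.65 * screen_w)  # widen by 40%, floor at 65% of screen
    upper = (h / 2) * 1.6                  # upper half extended by 60%
    lower = (h / 2) * 1.3                  # lower half extended by 30%
    cx = x + w / 2                         # horizontal center of the face
    cy = (y + h / 2) - upper + (upper + lower) / 2
    a, b = new_w / 2, (upper + lower) / 2  # semi-axes of the inscribed ellipse
    spare_points = [
        (cx + a * math.cos(2 * math.pi * k / num_points),
         cy + b * math.sin(2 * math.pi * k / num_points))
        for k in range(num_points)
    ]
    # Spare points outside the phone screen are unavailable.
    available = [(px, py) for px, py in spare_points
                 if 0 <= px <= screen_w and 0 <= py <= screen_h]
    # An available point is effective if a default sticker placed there
    # covers no more than 20% of the screen area.
    effective = [p for p in available
                 if sticker_area_at(p, new_w) <= 0.2 * screen_w * screen_h]
    return random.choice(effective) if effective else (cx, cy)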
The selection of the target sticker's rotation angle may proceed as follows: if the target sticker is added on the left side of the phone screen, the rotation angle is a random clockwise angle of 0-45 degrees; if it is added on the right side, the rotation angle is a random counterclockwise angle of 0-45 degrees.
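As a sketch (taking positive angles as clockwise, which is an assumed sign convention):

```python
import random

def sticker_rotation_angle(position_x, screen_w):
    """Random 0-45 degree rotation: clockwise on the left half of the
    screen, counterclockwise on the right half."""
    angle = random.uniform(0, 45)
    return angle if position_x < screen_w / 2 else -angle
```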
In a preferred embodiment of this embodiment, after adding the target sticker at the default position or the target position of the target video frame, the server may issue the target video frame with the target sticker added to the user terminal, so that the user may confirm the video sticker processing through the user terminal.
In this embodiment, face recognition and voice recognition are respectively performed on the video to be processed, so that the face position data is obtained when the face recognition succeeds and the voice recognition text is obtained when the voice recognition succeeds; the voice recognition text is matched with the description text of each sticker in the sticker library to obtain the target sticker, and the target video frame is obtained according to the voice recognition text; the target sticker is then added at the default position of the target video frame or at the target position calculated from the face position data, completing the video sticker processing.
In this embodiment, face recognition and voice recognition are performed on the video to be processed, so that when the voice recognition succeeds, the voice recognition text is matched with the description text of each sticker in the sticker library to obtain the target sticker, and the target video frame is obtained according to the voice recognition text; when the face recognition fails, the target sticker is added at a default position preset for the target sticker in the target video frame, and when the face recognition succeeds, the target position is calculated from the face position data and the target sticker is added at the target position of the target video frame. This embodiment can automatically determine the target sticker and its adding position from the face recognition result and the voice recognition result of the video to be processed, realizing intelligent selection and placement of the target sticker and improving video sticker processing efficiency.
In an embodiment, the performing face recognition and voice recognition on the video to be processed respectively to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful specifically includes: sequentially carrying out face recognition on video frames of a video to be processed, and obtaining face position data corresponding to the video frames when the face recognition of one video frame is successful; and performing voice recognition on the video to be processed, and converting the recognized voice data into text data when the voice recognition is successful to obtain a voice recognition text.
As an example, a user records a video to be processed through a user terminal and uploads its video frames as they are captured. On receiving them, the server performs face recognition on the frames in order of arrival; if face recognition succeeds on some frame, face recognition of the video is deemed successful and the face position data of that frame is obtained, and if it fails on every frame, face recognition of the video is deemed failed. When the user finishes recording and the last video frame is uploaded, the server performs voice recognition on the video; if the voice recognition succeeds, the recognized voice data is converted into text data to obtain the voice recognition text, and if it fails, the video sticker processing is exited.
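The early-stopping scan just described can be sketched as follows; detect_face is a hypothetical detector passed in by the caller, not a disclosed component.

```python
def face_position_from_frames(frames, detect_face):
    """Scan frames in arrival order and stop at the first frame whose face
    recognition succeeds, so the remaining frames need no recognition;
    detect_face returns a face rectangle, or None on failure."""
    for frame in frames:
        face_rect = detect_face(frame)
        if face_rect is not None:
            return face_rect          # face position data of this frame
    return None                       # face recognition failed for the video
```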
Because this embodiment performs face recognition on the video frames sequentially and obtains the face position data as soon as recognition succeeds on one frame, recognition can run on frames as they arrive while the user is still recording, and no recognition is needed on the remaining frames once the face position data is obtained. This greatly shortens the face recognition time for the video to be processed and improves video sticker processing efficiency.
In a preferred embodiment, the matching the speech recognition text with the description text of each sticker in the sticker library to obtain a target sticker, and obtaining a target video frame according to the speech recognition text specifically includes: matching text words obtained by performing word segmentation processing on the voice recognition text with description texts of each sticker in a sticker library to obtain target stickers; acquiring the appearance time of the voice recognition text in the video to be processed, and taking the video frame with the playing time corresponding to the appearance time as a target video frame.
As an example, after obtaining the voice recognition text, the server performs word segmentation on it to obtain a set of text words and matches each text word against the description text of each sticker in the sticker library. If the description text of some sticker matches a text word, one sticker is randomly selected from the matched stickers as the target sticker; if no sticker's description text matches any text word, the video sticker processing is exited.
For example, segmenting the voice recognition text "好开心" ("so happy") from front to back yields the text word set {"好", "开", "心", "好开", "开心", "好开心"}; each of these text words is matched against the description text of each sticker in the sticker library. If the description text of some sticker matches a text word, one sticker is randomly selected from the matched stickers as the target sticker; if no sticker's description text matches any text word, the video sticker processing is exited.
In a preferred implementation of this embodiment, a sticker is randomly selected from the matching results of the text word with the longest text length as the target sticker.
For example, one sticker is randomly selected from the matching results of "好开心", the text word with the longest text length, as the target sticker.
As an example, after obtaining the voice recognition text, the server performs word segmentation on it to obtain a set of text words and matches the text words against the description text of each sticker in the sticker library one by one, in order of text length from longest to shortest. If the description text of some sticker matches the current text word, one sticker is randomly selected from the matched stickers as the target sticker; if no sticker's description text matches any text word, the video sticker processing is exited.
For example, segmenting the voice recognition text "好开心" ("so happy") from front to back yields the text word set {"好开心", "开心", "好", "开", "心"}; these are matched in turn against the description text of each sticker in the sticker library, and as soon as the description text of some sticker matches the current text word, one sticker is randomly selected from the matched stickers as the target sticker. If no sticker's description text matches any text word, the video sticker processing is exited.
By segmenting the voice recognition text and matching the resulting text words against the description text of each sticker in the sticker library to obtain the target sticker, this embodiment effectively increases the sticker matching success rate and thus improves video sticker processing efficiency.
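As an illustration, the longest-match-first variant above can be sketched as follows; the segment function, the description_text field, and substring matching are assumptions of the sketch, not the disclosed implementation.

```python
import random

def match_target_sticker(speech_text, sticker_library, segment):
    """Longest-match-first sticker matching; segment is a hypothetical
    word-segmentation function returning candidate text words."""
    text_words = segment(speech_text)
    for word in sorted(set(text_words), key=len, reverse=True):
        matched = [s for s in sticker_library
                   if word in s["description_text"]]
        if matched:
            return random.choice(matched)  # random pick among matched stickers
    return None                            # no match: exit sticker processing
```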
In a preferred embodiment of this embodiment, after obtaining the text word set, the server may issue the text word set to the user terminal, so that the user may confirm the text word set through the user terminal.
The data structure of the issued text word set may take the form { (text word 1, startTime, endTime), (text word 2, startTime, endTime), … }, where startTime denotes the start time of the corresponding text word and endTime denotes its end time.
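For example, an issued text word set might look like the following; the words and times are purely illustrative.

```python
# Illustrative payload for the issued text word set; times are in seconds.
text_word_set = [
    ("好开心", 1.20, 1.85),  # (text word, startTime, endTime)
    ("开心", 1.45, 1.85),
    ("好", 1.20, 1.40),
]
```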
In a preferred embodiment of this embodiment, after obtaining the matching sticker, the server may issue the matching sticker to the user terminal, so that the user may confirm the matching sticker through the user terminal.
The data structure of the issued matching stickers may take the form { (text word 1: matching sticker 1), (text word 2: matching sticker 2), … }.
In a preferred embodiment, adding the target sticker at the default position or the target position of the target video frame further comprises: removing the target sticker when its appearance duration at the default position or the target position reaches a preset threshold.
For example, after the target sticker is added at the default position or the target position of the target video frame, its appearance duration at that position is monitored, and once the duration reaches a preset threshold the target sticker is removed from the target video frame. The preset threshold is set in advance according to actual needs, for example 2 seconds.
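A minimal sketch of this check follows; the added_at field and the remove_sticker call are assumptions of the sketch.

```python
def maybe_remove_target_sticker(sticker, current_time, threshold_s=2.0):
    """Remove the target sticker once its appearance duration at the default
    or target position reaches the preset threshold (2 seconds, as in the
    example above)."""
    if current_time - sticker.added_at >= threshold_s:
        remove_sticker(sticker)  # hypothetical removal call
```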
By removing the target sticker once its appearance duration at the default position or the target position reaches the preset threshold, this embodiment prevents the sticker from lingering too long and blocking the video content.
In a preferred embodiment, after the performing face recognition and voice recognition on the video to be processed respectively to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful, the method further includes: and adding voice recognition text at the position of the subtitle of the target video frame.
According to the embodiment, the voice recognition text is added to the subtitle position of the target video frame, so that the adding position of the subtitle can be automatically determined according to the voice recognition text, and the video editing processing efficiency is improved.
Please refer to fig. 3-4.
As shown in fig. 3, a second embodiment provides a video-sticker processing apparatus comprising: the face and voice recognition module 21 is used for performing face recognition and voice recognition on the video to be processed respectively so as to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful; the target sticker acquiring module 22 is configured to match the voice recognition text with the description text of each sticker in the sticker library to obtain a target sticker, and acquire a target video frame according to the voice recognition text; a target sticker adding module 23, configured to add a target sticker at a default position or a target position of a target video frame; wherein the target position is obtained by calculation according to the face position data.
As an example, a user uploads a video to be processed through a user terminal, and when the video is received, the face and voice recognition module 21 performs face recognition and voice recognition on it respectively. If the face recognition succeeds, face position data is obtained; if the voice recognition succeeds, a voice recognition text is obtained. The user terminal includes mobile phones, computers, tablets, and other communication devices that can connect to the server.
In a preferred embodiment of this embodiment, after obtaining the speech recognition text, the face and speech recognition module 21 may issue the speech recognition text to the user terminal, so that the user may confirm the speech recognition text through the user terminal.
When the voice recognition succeeds, the target sticker acquisition module 22 matches the voice recognition text with the description text of each sticker in the sticker library; the sticker whose description text successfully matches the voice recognition text is the target sticker. Meanwhile, the target sticker acquisition module 22 acquires the target video frame according to the voice recognition text.
In a preferred embodiment of this embodiment, after the target sticker is obtained, the target sticker acquisition module 22 may issue it to the user terminal, so that the user may confirm the target sticker through the user terminal. After the target video frame is obtained, the target sticker acquisition module 22 may issue it to the user terminal, so that the user may confirm the target video frame through the user terminal.
After the target sticker and the target video frame are obtained, the target sticker adding module 23 determines the adding position of the target sticker in combination with the face recognition result: when the face recognition fails, the target sticker is added at a default position preset for the target sticker in the target video frame; when the face recognition succeeds, the target position is calculated from the face position data and the target sticker is added at the target position of the target video frame.
The setting process of the default position may proceed as follows: when face recognition of the video to be processed fails, that is, no face can be recognized or the width of the face rectangle is smaller than 30% of the width of the phone screen, a default 300 x 380 rectangle is first placed at the center of the phone screen, its inscribed ellipse is then drawn, the points on the inscribed ellipse are taken as default effective points, and finally one default effective point is randomly selected from all the default effective points as the default position.
The calculation process of the target position may proceed as follows: when face recognition of the video to be processed succeeds, that is, the width of the face rectangle is greater than 30% of the width of the phone screen, the face rectangle is widened by 40%, its upper half is extended by 60%, and its lower half is extended by 30%, so that the width of the resulting rectangle is not less than 65% of the width of the phone screen. The inscribed ellipse of this rectangle is drawn, and the points on the ellipse (equally divided into 8-10 points) are spare points for the target sticker; spare points outside the phone screen are unavailable points, and spare points inside the phone screen are available points. A default sticker (whose width is greater than 45% of the width of the face rectangle) is then placed at each available point; if the placement area of the default sticker exceeds 20% of the phone screen, the corresponding available point is an invalid point, otherwise it is an effective point. Finally, one effective point is randomly selected from all the effective points as the target position. When the number of effective points is less than 3, a safe-area rectangle centered on the screen, with width equal to 80% of the screen width and height equal to 70% of the screen height, is used instead; it is then judged whether the height above or below the center point of this rectangle exceeds 5% of the height of the phone screen, and if so, the effective point on the opposite side is taken as the target position.
The selection of the target sticker's rotation angle may proceed as follows: if the target sticker is added on the left side of the phone screen, the rotation angle is a random clockwise angle of 0-45 degrees; if it is added on the right side, the rotation angle is a random counterclockwise angle of 0-45 degrees.
In a preferred embodiment of this embodiment, after the target sticker is added at the default position or the target position of the target video frame, the target sticker adding module 23 may issue the target video frame with the target sticker added to the user terminal, so that the user may confirm the video sticker processing through the user terminal.
In this embodiment, the face and voice recognition module 21 performs face recognition and voice recognition on the video to be processed, so that face position data is obtained when the face recognition succeeds and the voice recognition text is obtained when the voice recognition succeeds; the target sticker acquisition module 22 then matches the voice recognition text with the description text of each sticker in the sticker library to obtain the target sticker and acquires the target video frame according to the voice recognition text; finally, the target sticker adding module 23 adds the target sticker at the default position of the target video frame or at the target position calculated from the face position data, completing the video sticker processing.
In this embodiment, face recognition and voice recognition are performed on the video to be processed, so that when the voice recognition succeeds, the voice recognition text is matched with the description text of each sticker in the sticker library to obtain the target sticker, and the target video frame is obtained according to the voice recognition text; when the face recognition fails, the target sticker is added at a default position preset for the target sticker in the target video frame, and when the face recognition succeeds, the target position is calculated from the face position data and the target sticker is added at the target position of the target video frame. This embodiment can automatically determine the target sticker and its adding position from the face recognition result and the voice recognition result of the video to be processed, realizing intelligent selection and placement of the target sticker and improving video sticker processing efficiency.
In an embodiment, the performing face recognition and voice recognition on the video to be processed respectively to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful specifically includes: sequentially carrying out face recognition on video frames of a video to be processed, and obtaining face position data corresponding to the video frames when the face recognition of one video frame is successful; and performing voice recognition on the video to be processed, and converting the recognized voice data into text data when the voice recognition is successful to obtain a voice recognition text.
As an example, a user records a video to be processed through a user terminal and uploads its video frames as they are captured. On receiving them, the face and voice recognition module 21 performs face recognition on the frames in order of arrival; if face recognition succeeds on some frame, face recognition of the video is deemed successful and the face position data of that frame is obtained, and if it fails on every frame, face recognition of the video is deemed failed. When the user finishes recording and the last video frame is uploaded, the face and voice recognition module 21 performs voice recognition on the video; if the voice recognition succeeds, the recognized voice data is converted into text data to obtain the voice recognition text, and if it fails, the video sticker processing is exited.
Because this embodiment uses the face and voice recognition module 21 to perform face recognition on the video frames sequentially and obtains the face position data as soon as recognition succeeds on one frame, recognition can run on frames as they arrive while the user is still recording, and no recognition is needed on the remaining frames once the face position data is obtained. This greatly shortens the face recognition time for the video to be processed and improves video sticker processing efficiency.
In a preferred embodiment, the matching the speech recognition text with the description text of each sticker in the sticker library to obtain a target sticker, and obtaining a target video frame according to the speech recognition text specifically includes: matching text words obtained by performing word segmentation processing on the voice recognition text with description texts of each sticker in a sticker library to obtain target stickers; acquiring the appearance time of the voice recognition text in the video to be processed, and taking the video frame with the playing time corresponding to the appearance time as a target video frame.
As an example, after obtaining the voice recognition text, the target sticker acquisition module 22 performs word segmentation on it to obtain a set of text words and matches each text word against the description text of each sticker in the sticker library. If the description text of some sticker matches a text word, one sticker is randomly selected from the matched stickers as the target sticker; if no sticker's description text matches any text word, the video sticker processing is exited.
For example, segmenting the voice recognition text "好开心" ("so happy") from front to back yields the text word set {"好", "开", "心", "好开", "开心", "好开心"}; each of these text words is matched against the description text of each sticker in the sticker library. If the description text of some sticker matches a text word, one sticker is randomly selected from the matched stickers as the target sticker; if no sticker's description text matches any text word, the video sticker processing is exited.
In a preferred implementation of this embodiment, a sticker is randomly selected from the matching results of the text word with the longest text length as the target sticker.
For example, one sticker is randomly selected from the matching results of "好开心", the text word with the longest text length, as the target sticker.
As an example, after obtaining the voice recognition text, the target sticker acquisition module 22 performs word segmentation on it to obtain a set of text words and matches the text words against the description text of each sticker in the sticker library one by one, in order of text length from longest to shortest. If the description text of some sticker matches the current text word, one sticker is randomly selected from the matched stickers as the target sticker; if no sticker's description text matches any text word, the video sticker processing is exited.
For example, segmenting the voice recognition text "好开心" ("so happy") from front to back yields the text word set {"好开心", "开心", "好", "开", "心"}; these are matched in turn against the description text of each sticker in the sticker library, and as soon as the description text of some sticker matches the current text word, one sticker is randomly selected from the matched stickers as the target sticker. If no sticker's description text matches any text word, the video sticker processing is exited.
In this embodiment, the target sticker acquisition module 22 segments the voice recognition text and matches the resulting text words against the description text of each sticker in the sticker library to obtain the target sticker, which effectively increases the sticker matching success rate and thus improves video sticker processing efficiency.
In a preferred embodiment of this embodiment, after obtaining the text word set, the target sticker acquiring module 22 may issue the text word set to the user terminal, so that the user may confirm the text word set through the user terminal.
The data structure of the issued text word set may take the form { (text word 1, startTime, endTime), (text word 2, startTime, endTime), … }, where startTime denotes the start time of the corresponding text word and endTime denotes its end time.
In a preferred implementation of this embodiment, after the matching stickers are obtained, the target sticker obtaining module 22 may issue the matching stickers to the user terminal, so that the user can confirm them through the user terminal.
The data structure of the issued matching stickers may be of the form { (text word 1: matching sticker 1), (text word 2: matching sticker 2), … }.
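A corresponding minimal sketch of the matching-sticker payload, with the sticker identifiers invented purely for illustration:

    # Hypothetical mapping from confirmed text words to matched stickers.
    matching_stickers = {
        "good happy": "sticker_0042",
        "good": "sticker_0007",
    }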
In a preferred embodiment, the target sticker adding module 23 is further configured to remove the target sticker when the duration of its appearance at the default position or the target position reaches a preset threshold.
For example, after the target sticker is added at the default position or the target position of the target video frame, the duration of its appearance at that position is monitored; if the duration reaches a preset threshold, the target sticker is removed from the target video frame. The preset threshold is set in advance according to actual needs, for example 2 seconds.
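A minimal sketch of this removal logic, assuming the appearance duration is tracked as a frame count at a known frame rate (the function name and signature are illustrative):

    def frames_with_sticker(start_frame: int, fps: float,
                            threshold_s: float = 2.0) -> range:
        # Convert the preset duration threshold (e.g. 2 seconds) into a
        # frame count; the sticker is removed once these frames have played.
        n_frames = int(threshold_s * fps)
        return range(start_frame, start_frame + n_frames)

At 30 frames per second with the 2-second threshold, the target sticker would occupy 60 frames and be removed thereafter.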
In this embodiment, the target sticker adding module 23 removes the target sticker once its appearance duration at the default position or the target position reaches the preset threshold, which prevents the target sticker from staying at that position too long and blocking the video content.
In a preferred embodiment, as shown in fig. 4, the video sticker processing apparatus further includes a speech recognition text adding module 24, configured to perform face recognition and speech recognition on the video to be processed respectively, so as to obtain face position data when the face recognition is successful, and to add the speech recognition text at the subtitle position of the target video frame once the speech recognition text is obtained upon successful speech recognition.
In this embodiment, the speech recognition text adding module 24 adds the speech recognition text at the subtitle position of the target video frame, so that the adding position of the subtitle is determined automatically from the speech recognition text, improving the efficiency of video editing.
In summary, the embodiment of the present invention has the following advantages:
the method performs face recognition and voice recognition on the video to be processed respectively, so as to obtain face position data when the face recognition is successful and a voice recognition text when the voice recognition is successful; the voice recognition text is then matched with the description text of each sticker in a sticker library to obtain a target sticker, and a target video frame is obtained according to the voice recognition text; the target sticker is added at the default position of the target video frame, or at a target position calculated from the face position data, completing the video sticker processing. Specifically, when the face recognition fails, the target sticker is added at the default position preset for the target sticker; when the face recognition succeeds, the target position is calculated from the face position data and the target sticker is added at the target position of the target video frame. The embodiment of the invention can thus automatically determine the target sticker and its adding position according to the face recognition result and the voice recognition result of the video to be processed, realizing intelligent selection and placement of the target sticker and improving the processing efficiency of the video sticker.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
It will be understood by those skilled in the art that all or part of the processes of the above embodiments may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

Claims (10)

1. A video sticker processing method is characterized by comprising the following steps:
respectively carrying out face recognition and voice recognition on a video to be processed so as to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful;
matching the voice recognition text with the description text of each sticker in a sticker library to obtain a target sticker, and acquiring a target video frame according to the voice recognition text;
adding the target sticker at the default position or the target position of the target video frame; wherein the target position is calculated from the face position data.
2. The video sticker processing method of claim 1, wherein the face recognition and the voice recognition are performed on the video to be processed, respectively, to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful, specifically:
sequentially carrying out face recognition on the video frames of the video to be processed, and obtaining the face position data of the corresponding video frame when the face recognition of one video frame is successful;
and performing voice recognition on the video to be processed, and converting the recognized voice data into text data when the voice recognition is successful to obtain the voice recognition text.
3. The video sticker processing method according to claim 1, wherein the matching of the voice recognition text with the description text of each sticker in a sticker library to obtain a target sticker, and obtaining a target video frame according to the voice recognition text specifically comprises:
matching text words obtained by word segmentation processing of the voice recognition text with description texts of each sticker in the sticker library to obtain the target sticker;
and acquiring the appearance time of the voice recognition text in the video to be processed, and taking the video frame with the playing time corresponding to the appearance time as the target video frame.
4. The video sticker processing method of claim 1, wherein said adding the target sticker at a default position or a target position of the target video frame further comprises:
and when the appearance time of the target sticker at the default position or the target position reaches a preset threshold value, removing the target sticker.
5. The video sticker processing method of claim 1, wherein after the performing face recognition and voice recognition on the video to be processed respectively to obtain face position data when the face recognition is successful and obtain the voice recognition text when the voice recognition is successful, further comprising:
and adding the voice recognition text at the subtitle position of the target video frame.
6. A video sticker processing apparatus, comprising:
the face and voice recognition module is used for respectively carrying out face recognition and voice recognition on the video to be processed so as to obtain face position data when the face recognition is successful and obtain a voice recognition text when the voice recognition is successful;
the target sticker obtaining module is used for matching the voice recognition text with the description text of each sticker in the sticker library to obtain a target sticker and obtaining a target video frame according to the voice recognition text;
the target sticker adding module is used for adding the target sticker at the default position or the target position of the target video frame; wherein the target position is calculated from the face position data.
7. The video sticker processing apparatus according to claim 6, wherein the video to be processed is subjected to face recognition and voice recognition respectively, so as to obtain face position data when the face recognition is successful and obtain voice recognition text when the voice recognition is successful, specifically:
sequentially carrying out face recognition on the video frames of the video to be processed, and obtaining the face position data of the corresponding video frame when the face recognition of one video frame is successful;
and performing voice recognition on the video to be processed, and converting the recognized voice data into text data when the voice recognition is successful to obtain the voice recognition text.
8. The video sticker processing apparatus of claim 6, wherein the matching of the speech recognition text with the description text of each sticker in a sticker library to obtain a target sticker, and obtaining a target video frame according to the speech recognition text specifically comprises:
matching text words obtained by word segmentation processing of the voice recognition text with description texts of each sticker in the sticker library to obtain the target sticker;
and acquiring the appearance time of the voice recognition text in the video to be processed, and taking the video frame with the playing time corresponding to the appearance time as the target video frame.
9. The video sticker processing apparatus of claim 6, wherein the target sticker adding module is further configured to remove the target sticker when the duration of appearance of the target sticker at the default position or the target position reaches a preset threshold.
10. The video sticker processing apparatus of claim 6, further comprising a speech recognition text adding module, configured to perform face recognition and speech recognition on the video to be processed respectively to obtain face position data when the face recognition is successful, and add the speech recognition text at the subtitle position of the target video frame after obtaining the speech recognition text when the speech recognition is successful.
CN202010297623.5A 2020-04-15 2020-04-15 Video sticker processing method and device Pending CN111556335A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010297623.5A CN111556335A (en) 2020-04-15 2020-04-15 Video sticker processing method and device
US16/935,167 US11218648B2 (en) 2020-04-15 2020-07-21 Video sticker processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010297623.5A CN111556335A (en) 2020-04-15 2020-04-15 Video sticker processing method and device

Publications (1)

Publication Number Publication Date
CN111556335A true CN111556335A (en) 2020-08-18

Family

ID=72004362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010297623.5A Pending CN111556335A (en) 2020-04-15 2020-04-15 Video sticker processing method and device

Country Status (2)

Country Link
US (1) US11218648B2 (en)
CN (1) CN111556335A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114125485B (en) * 2021-11-30 2024-04-30 北京字跳网络技术有限公司 Image processing method, device, equipment and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4310916B2 (en) * 2000-11-08 2009-08-12 コニカミノルタホールディングス株式会社 Video display device
JP2014085796A (en) * 2012-10-23 2014-05-12 Sony Corp Information processing device and program
KR102108893B1 (en) * 2013-07-11 2020-05-11 엘지전자 주식회사 Mobile terminal
US10446189B2 (en) * 2016-12-29 2019-10-15 Google Llc Video manipulation with face replacement

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105791692A (en) * 2016-03-14 2016-07-20 腾讯科技(深圳)有限公司 Information processing method and terminal
CN106210545A (en) * 2016-08-22 2016-12-07 北京金山安全软件有限公司 Video shooting method and device and electronic equipment
CN109660855A (en) * 2018-12-19 2019-04-19 北京达佳互联信息技术有限公司 Paster display methods, device, terminal and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11705120B2 (en) * 2019-02-08 2023-07-18 Samsung Electronics Co., Ltd. Electronic device for providing graphic data based on voice and operating method thereof
CN113613067A (en) * 2021-08-03 2021-11-05 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium
WO2023011146A1 (en) * 2021-08-03 2023-02-09 北京字跳网络技术有限公司 Video processing method and apparatus, device, and storage medium
CN113613067B (en) * 2021-08-03 2023-08-22 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium
WO2023160515A1 (en) * 2022-02-25 2023-08-31 北京字跳网络技术有限公司 Video processing method and apparatus, device and medium

Also Published As

Publication number Publication date
US11218648B2 (en) 2022-01-04
US20210329176A1 (en) 2021-10-21

Similar Documents

Publication Publication Date Title
CN111556335A (en) Video sticker processing method and device
CN110446115B (en) Live broadcast interaction method and device, electronic equipment and storage medium
CN109473123B (en) Voice activity detection method and device
EP3993434A1 (en) Video processing method, apparatus and device
CN104618803B (en) Information-pushing method, device, terminal and server
JP6968908B2 (en) Context acquisition method and context acquisition device
CN111785279A (en) Video speaker identification method and device, computer equipment and storage medium
CN108920640B (en) Context obtaining method and device based on voice interaction
CN106982344B (en) Video information processing method and device
CN111401238A (en) Method and device for detecting character close-up segments in video
WO2023151424A1 (en) Method and apparatus for adjusting playback rate of audio picture of video
US20220201357A1 (en) Limited-level picture detection method, device, display device and readable storage medium
CN113705300A (en) Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium
CN105100647A (en) Subtitle correction method and terminal
CN112235632A (en) Video processing method and device and server
CN114120969A (en) Method and system for testing voice recognition function of intelligent terminal and electronic equipment
US11889127B2 (en) Live video interaction method and apparatus, and computer device
US20160142456A1 (en) Method and Device for Acquiring Media File
CN115225962B (en) Video generation method, system, terminal equipment and medium
CN115150660B (en) Video editing method based on subtitles and related equipment
KR20230106170A (en) Data processing method and apparatus, device, and medium
CN111128190B (en) Expression matching method and system
CN113613070A (en) Face video processing method and device, electronic equipment and storage medium
CN112487247A (en) Video processing method and video processing device
CN111013138A (en) Voice control method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200818