CN111464862A - Video screenshot method based on voice recognition and image processing - Google Patents
- Publication number
- CN111464862A (application number CN202010330355.2A)
- Authority
- CN
- China
- Prior art keywords
- video
- screenshot
- target
- face image
- instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/433—Content storage operation, e.g. storage operation in response to a pause request, caching operations
- H04N21/4334—Recording operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/44—Event detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
The invention relates to a video screenshot method based on voice recognition and image processing. The method comprises: receiving a video screenshot voice instruction and performing voice recognition on it to obtain screenshot instruction text data; if the screenshot instruction text data is valid text data, playing a video file and extracting each video image frame; extracting the face images in the video image frames; determining, according to a preset target face image database, the video image frames that contain a target face image; capturing the video file according to the determined video image frames to obtain target screenshots; establishing a target screenshot database from the target screenshots; and finally outputting the target screenshot database. The method requires no manual screenshot by operators, which reduces personnel cost and operator workload. Because the face images are compared automatically, images cannot be missed through human error, and screenshot accuracy is greatly improved.
Description
Technical Field
The invention relates to a video screenshot method based on voice recognition and image processing.
Background
At present, video processing technology is applied ever more widely. In many cases a video file must be processed to obtain the relevant data it contains, and images carrying that information must be captured from the file for subsequent use. The conventional video screenshot approach is manual: an operator watches the video file and takes a screenshot whenever a frame contains the relevant information. This requires the operator to sit at a computer dedicated to watching the file and to stay highly concentrated, which demands considerable effort; frames are easily missed through negligence, so the screenshot accuracy is low.
Disclosure of Invention
The invention aims to provide a video screenshot method based on voice recognition and image processing, in order to solve the problems of the manual screenshot approach: it demands considerable operator effort, images are easily missed through negligence, and the screenshot accuracy is low.
In order to solve the problems, the invention adopts the following technical scheme:
a video screenshot method based on voice recognition and image processing comprises the following steps:
receiving a video screenshot voice instruction;
carrying out voice recognition on the video screenshot voice instruction to obtain screenshot instruction text data;
inputting the screenshot instruction text data into a preset video screenshot instruction special dictionary for comparison, and if at least one word in the video screenshot instruction special dictionary exists in the screenshot instruction text data, judging the screenshot instruction text data to be valid text data;
converting the valid text data into a video screenshot control instruction;
playing a preset video file according to the video screenshot control instruction;
extracting each video image frame of the video file in the video file playing process;
extracting a face image contained in each video image frame according to each video image frame;
inputting the face image contained in each video image frame into a preset target face image database for comparison, determining the video image frame containing at least one target face image in the target face image database, and obtaining a target video image frame; wherein the target face image database comprises at least one target face image;
capturing the video file according to the obtained target video image frame to obtain a target screenshot corresponding to the target video image frame;
establishing a target screenshot database according to the obtained target screenshot;
and outputting the target screenshot database.
Preferably, the inputting the face image included in each video image frame into a preset target face image database for comparison includes:
for any face image in any video image frame, marking feature coordinates of each key feature in the face image and each target face image in the target face image database based on a preset face key feature list;
calculating a feature distance value between the feature coordinate of each key feature of the face image and the feature coordinate of each key feature in the target face image for any target face image in the target face image database; calculating to obtain a target average value of the characteristic distance values; obtaining the matching degree corresponding to the target average value according to the corresponding relation between the preset average value and the matching degree; the preset corresponding relation between the average value and the matching degree comprises at least two average value intervals and the matching degree corresponding to each average value interval, and the average value intervals and the matching degree are in an anti-correlation relation;
obtaining the matching degree of the face image corresponding to each target face image in the target face image database;
and if at least one matching degree which is greater than or equal to a preset matching degree threshold value exists, the arbitrary video image frame is the target video image frame.
Preferably, the step of inputting the screenshot instruction text data into a preset video screenshot instruction special dictionary for comparison includes:
and comparing each word in the video screenshot instruction special dictionary with the screenshot instruction text data to obtain whether the word in the video screenshot instruction special dictionary exists in the screenshot instruction text data or not.
Preferably, the words in the video screenshot instruction special dictionary include "screenshot".
Preferably, the words in the video screenshot instruction special dictionary further include words associated with "screenshot".
The invention has the following beneficial effects. When a video file needs to be captured, the operator speaks a video screenshot voice instruction, which is voice-recognized into screenshot instruction text data. The text data is then checked against a preset video screenshot instruction special dictionary; if at least one word from the dictionary appears in it, the text data is judged to be valid text data. The valid text data is converted into a video screenshot control instruction, and a preset video file is played according to that instruction. Because playback and screenshot are started through voice recognition rather than by clicking the video file and capturing frames by hand, the degree of intelligence is greatly improved, no manual operation is needed, and control convenience is improved. During playback, each video image frame of the video file is extracted, the face images contained in each frame are extracted, and those face images are compared against a preset target face image database. The video image frames containing at least one target face image from the database are the required target video image frames. The video file is captured according to these target frames to obtain the corresponding target screenshots, a target screenshot database containing every target screenshot is established, and finally the database is output.
Therefore, the video screenshot method provided by the invention is an automatic screenshot method, automatic screenshot is carried out according to the comparison result of the face images to obtain the required screenshot, manual screenshot of operators is not needed, the personnel input cost is reduced, the labor intensity of the operators is reduced, and moreover, the situation that partial images are omitted due to human factors is avoided due to automatic comparison of the face images, so that the screenshot accuracy is greatly improved.
Drawings
In order to illustrate the technical solution of the embodiment of the present invention more clearly, the drawings used in the embodiment are briefly described as follows:
Fig. 1 is a flow diagram of the video screenshot method based on voice recognition and image processing.
Detailed Description
The embodiment provides a video screenshot method based on voice recognition and image processing; the execution main body of the method can be a desktop computer, a notebook computer, an intelligent mobile terminal, or the like. Because a voice signal must be acquired, the execution main body needs a voice acquisition device such as a microphone, for example the built-in microphone of a notebook computer or intelligent mobile terminal. Because video playback must be controlled, the execution main body is provided with a video playing application, such as one of the current mainstream video players; if several are installed, one of them is designated as the default player for the video file, and that application is the one started in the subsequent control.
As shown in Fig. 1, the video screenshot method includes the following steps:
receiving a video screenshot voice instruction:
the execution main body stores a preset video file, namely a video file needing screenshot. When the video file needs to be subjected to screenshot, an operator speaks a video screenshot voice instruction. And the microphone of the execution main body or the microphone provided by the execution main body acquires the video screenshot voice instruction of the operator.
Carrying out voice recognition on the video screenshot voice instruction to obtain screenshot instruction text data:
the execution main body is internally provided with the existing voice recognition algorithm, and the acquired video screenshot voice instruction is subjected to voice recognition according to the voice recognition algorithm to obtain screenshot instruction text data.
Inputting the screenshot instruction text data into a preset video screenshot instruction special dictionary for comparison, and if at least one word in the video screenshot instruction special dictionary exists in the screenshot instruction text data, judging that the screenshot instruction text data is valid text data:
the execution main body is preset with a special dictionary for the video screenshot instruction, the special dictionary for the video screenshot instruction comprises at least one word, each word in the special dictionary for the video screenshot instruction is a related word of a control instruction of a video screenshot, and as a specific implementation mode, the words in the special dictionary for the video screenshot instruction comprise a screenshot, and further comprise words related to the screenshot, such as the screenshot, the screen capture and the like.
Inputting the screenshot instruction text data into the video screenshot instruction special dictionary for comparison means comparing each word in the dictionary with the screenshot instruction text data: for any word in the dictionary, it is judged whether that word occurs in the text data. This finally establishes whether any word from the dictionary exists in the screenshot instruction text data.
If at least one word from the video screenshot instruction special dictionary exists in the screenshot instruction text data, the text data is judged to be valid text data. For example, if the recognized text is "video screenshot", the dictionary word "screenshot" occurs in it, so the screenshot instruction text data is judged valid.
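The comparison just described reduces to a substring check of each dictionary word against the recognized text. A minimal sketch follows; the function name and the word list are illustrative assumptions, not taken from the patent:

```python
# Hedged sketch of the dictionary comparison: the recognized text is
# judged valid if any dictionary word occurs in it as a substring.
# The word list is illustrative, not specified by the patent.

SCREENSHOT_DICTIONARY = ("screenshot", "screen capture")

def is_valid_instruction(text, dictionary=SCREENSHOT_DICTIONARY):
    """Return True if at least one dictionary word appears in the text."""
    text = text.lower()
    return any(word in text for word in dictionary)

print(is_valid_instruction("video screenshot"))    # True
print(is_valid_instruction("turn up the volume"))  # False
```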
Converting the valid text data into a video screenshot control instruction:
The obtained valid text data is converted into a video screenshot control instruction; as a specific implementation, the video screenshot control instruction can be a specific data string.
And playing a preset video file according to the video screenshot control instruction:
and controlling to start the installed or default video playing application according to the obtained video screenshot control instruction, and playing a preset video file after the video playing application is started.
In the video file playing process, extracting each video image frame of the video file:
in the process of playing the video file, each video image frame included in the video file is read, and each video playing frame is sequentially output at a preset video playing frame rate based on the frame number of each video image frame, for example, the video playing frame rate may be 60dps, that is, 60 video image frames are output per second.
According to each video image frame, extracting a face image contained in each video image frame:
the execution main body is internally provided with the existing face recognition algorithm, the face recognition algorithm can analyze and process each video image frame, and the face image contained in each video image frame is extracted and obtained. It should be understood that, for a certain video image frame, there may be only one person or a plurality of persons in the video image frame, and therefore, for any one video image frame, there may be only one face image or a plurality of (i.e., at least two) face images.
Inputting the face image contained in each video image frame into a preset target face image database for comparison, determining the video image frame containing at least one target face image in the target face image database, and obtaining a target video image frame; wherein the target face image database comprises at least one target face image:
a target face image database is preset in the execution main body, the target face image database comprises at least one target face image, and the specific setting number is set according to actual needs. Each target face image is a screenshot standard of a video screenshot, and for a certain video image frame, if the face image contained in the video image frame has a target face image in a target face image database, that is, at least one face image in the video image frame is a target face image in a target face image database, the video image frame is a required video image frame, and screenshot needs to be performed according to the video image frame.
The determination process of whether the face image of each video image frame has the target face image in the target face image database is the same, so that the following description will be given by taking any one of the video image frames as an example, and the determination process of other video image frames is the same.
A video image frame may contain one face image or at least two; the following takes a frame containing a single face image as an example. If a frame contains at least two face images, each is processed in the same way, one pass per face image. For the face image in the frame, each key feature of the face image and of every target face image in the target face image database is marked based on a preset face key feature list, yielding the feature coordinates of each key feature within its image. The face key feature list may include the four features eyes, ears, mouth and nose, and may further include eyebrows, forehead and so on; the features actually included are chosen according to actual needs.
For any target face image in the target face image database, a feature distance value is calculated between the feature coordinate of each key feature of the face image and the feature coordinate of the corresponding key feature of the target face image; the distance between two feature coordinates can be computed with a coordinate distance formula such as the Euclidean distance. The average of these feature distance values is then calculated; this average is the target average value. A correspondence between average values and matching degrees is preset: it contains at least two average value intervals and the matching degree for each interval, with intervals and matching degrees inversely correlated. A lower average value interval means the key features of the two images are closer together and the images are more similar, so the corresponding matching degree is higher. For example, the correspondence may contain two intervals [x1, x2] and (x2, x3], with [x1, x2] corresponding to matching degree y1 and (x2, x3] corresponding to y2, where x1 < x2 < x3 and y1 > y2; the specific interval boundaries and matching degree values are set according to actual requirements.
The above process yields the matching degree between the face image and one target face image; the matching degree against each other target face image is obtained in the same way. Thus the matching degree between the face image and every target face image in the target face image database is obtained; if the database contains N target face images, N matching degrees result.
If at least one of the N matching degrees is greater than or equal to a preset matching degree threshold, the face image is highly similar to at least one target face image in the target face image database, and the video image frame is judged to be a target video image frame.
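The distance-averaging and interval-mapping steps above can be sketched as follows. The coordinates, interval boundaries, and threshold are all illustrative values chosen for the example, not taken from the patent:

```python
import math

# Sketch of the matching-degree computation: mean Euclidean distance
# over key-feature coordinates, mapped through preset average-value
# intervals to a matching degree (lower distance -> higher degree).

def matching_degree(face_coords, target_coords, intervals):
    """face_coords / target_coords: dicts of feature name -> (x, y).

    intervals: list of (upper_bound, degree) pairs sorted by ascending
    bound; the mean distance falls into the first interval it fits.
    """
    distances = [math.dist(face_coords[k], target_coords[k]) for k in face_coords]
    mean = sum(distances) / len(distances)
    for upper, degree in intervals:
        if mean <= upper:
            return degree
    return 0.0  # mean distance beyond all intervals: no meaningful match

def is_target_frame(degrees, threshold):
    """A frame is a target frame if any matching degree reaches the threshold."""
    return any(d >= threshold for d in degrees)

face = {"eye": (10.0, 10.0), "nose": (12.0, 15.0)}
target = {"eye": (10.0, 11.0), "nose": (12.0, 16.0)}
deg = matching_degree(face, target, [(2.0, 0.9), (5.0, 0.5)])
print(deg)                          # 0.9 (mean distance 1.0 falls in the first interval)
print(is_target_frame([deg], 0.8))  # True
```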
The above is the judgment process for one video image frame; every other video image frame is judged in the same way, so that finally all video image frames containing at least one target face image from the target face image database are determined. These are the target video image frames.
The above describes one specific face image comparison process; it should be understood that the application is not limited to it, and other existing face image comparison processes may be adopted in other embodiments.
Capturing the video file according to the obtained target video image frame to obtain a target screenshot corresponding to the target video image frame:
and obtaining target video image frames according to the process, and then, capturing the video file according to the obtained target video image frames to obtain target screenshots corresponding to the target video image frames. For a certain target video image frame, the progress of the target video image frame in a video file can be determined according to the target video image frame, and then the video file is subjected to screenshot according to the progress to obtain a target screenshot corresponding to the target video image frame. As the video screenshot process belongs to the conventional technical means, the description is not repeated.
Establishing a target screenshot database according to the obtained target screenshot:
establishing a target screenshot database according to the obtained target screenshot, and giving a specific implementation process as follows: firstly, establishing a blank initial screenshot database, then adding each obtained target screenshot into the initial screenshot database, and finally updating the initial screenshot database to obtain a target screenshot database.
Outputting the target screenshot database:
The established target screenshot database is output, for example by wired or wireless transmission, to external equipment, so that the external equipment or relevant personnel can perform subsequent processing on it.
The above-mentioned embodiments merely illustrate the technical solution of the present invention in one specific implementation; any equivalent substitution or partial modification that does not depart from the spirit and scope of the invention shall be covered by the claims of the present invention.
Claims (5)
1. A video screenshot method based on voice recognition and image processing is characterized by comprising the following steps:
receiving a video screenshot voice instruction;
carrying out voice recognition on the video screenshot voice instruction to obtain screenshot instruction text data;
inputting the screenshot instruction text data into a preset video screenshot instruction special dictionary for comparison, and if at least one word in the video screenshot instruction special dictionary exists in the screenshot instruction text data, judging the screenshot instruction text data to be valid text data;
converting the valid text data into a video screenshot control instruction;
playing a preset video file according to the video screenshot control instruction;
extracting each video image frame of the video file in the video file playing process;
extracting a face image contained in each video image frame according to each video image frame;
inputting the face image contained in each video image frame into a preset target face image database for comparison, determining the video image frame containing at least one target face image in the target face image database, and obtaining a target video image frame; wherein the target face image database comprises at least one target face image;
capturing the video file according to the obtained target video image frame to obtain a target screenshot corresponding to the target video image frame;
establishing a target screenshot database according to the obtained target screenshot;
and outputting the target screenshot database.
2. The video screenshot method based on voice recognition and image processing as claimed in claim 1, wherein the step of inputting the face image contained in each video image frame into a preset target face image database for comparison comprises:
for any face image in any video image frame, marking feature coordinates of each key feature in the face image and each target face image in the target face image database based on a preset face key feature list;
calculating a feature distance value between the feature coordinate of each key feature of the face image and the feature coordinate of each key feature in the target face image for any target face image in the target face image database; calculating to obtain a target average value of the characteristic distance values; obtaining the matching degree corresponding to the target average value according to the corresponding relation between the preset average value and the matching degree; the preset corresponding relation between the average value and the matching degree comprises at least two average value intervals and the matching degree corresponding to each average value interval, and the average value intervals and the matching degree are in an anti-correlation relation;
obtaining the matching degree of the face image corresponding to each target face image in the target face image database;
and if at least one matching degree which is greater than or equal to a preset matching degree threshold value exists, the arbitrary video image frame is the target video image frame.
3. The video screenshot method based on voice recognition and image processing as claimed in claim 1, wherein said inputting the screenshot instruction text data into a preset video screenshot instruction special dictionary for comparison comprises:
and comparing each word in the video screenshot instruction special dictionary with the screenshot instruction text data to obtain whether the word in the video screenshot instruction special dictionary exists in the screenshot instruction text data or not.
4. The method of claim 1, wherein the words in the video screenshot instruction special dictionary comprise "screenshot".
5. The method of claim 4, wherein the words in the video screenshot instruction special dictionary further comprise words related to "screenshot".
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010330355.2A CN111464862A (en) | 2020-04-24 | 2020-04-24 | Video screenshot method based on voice recognition and image processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111464862A | 2020-07-28
Family
ID=71682606
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010330355.2A Withdrawn CN111464862A (en) | 2020-04-24 | 2020-04-24 | Video screenshot method based on voice recognition and image processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111464862A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104540004A (en) * | 2015-01-27 | 2015-04-22 | 深圳市中兴移动通信有限公司 | Video screenshot method and video screenshot device |
CN106610772A (en) * | 2015-10-21 | 2017-05-03 | 中兴通讯股份有限公司 | Screen capture method and apparatus, and intelligent terminal |
CN106650577A (en) * | 2016-09-22 | 2017-05-10 | 江苏理工学院 | Fast retrieval method and fast retrieval system for target person in monitoring video data file |
US20170237896A1 (en) * | 2015-05-08 | 2017-08-17 | Albert Tsai | System and Method for Preserving Video Clips from a Handheld Device |
CN108985176A (en) * | 2018-06-20 | 2018-12-11 | 北京优酷科技有限公司 | image generating method and device |
CN109584864A (en) * | 2017-09-29 | 2019-04-05 | 上海寒武纪信息科技有限公司 | Image processing apparatus and method |
CN109598223A (en) * | 2018-11-26 | 2019-04-09 | 北京洛必达科技有限公司 | Method and apparatus based on video acquisition target person |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112203036A (en) * | 2020-09-14 | 2021-01-08 | 北京神州泰岳智能数据技术有限公司 | Method and device for generating text document based on video content |
CN112203036B (en) * | 2020-09-14 | 2023-05-26 | 北京神州泰岳智能数据技术有限公司 | Method and device for generating text document based on video content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20200728 |