CN115474076A - Video stream image output method and device and camera equipment - Google Patents

Video stream image output method and device and camera equipment

Info

Publication number
CN115474076A
CN115474076A CN202210976194.3A
Authority
CN
China
Prior art keywords
video stream
close
image
target object
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210976194.3A
Other languages
Chinese (zh)
Inventor
肖兵
陈瑞斌
李春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Shixi Technology Co Ltd
Original Assignee
Zhuhai Shixi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Shixi Technology Co Ltd filed Critical Zhuhai Shixi Technology Co Ltd
Priority to CN202210976194.3A priority Critical patent/CN115474076A/en
Publication of CN115474076A publication Critical patent/CN115474076A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234345Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234381Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440254Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering signal-to-noise parameters, e.g. requantization
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4858End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows

Abstract

The invention provides a video stream image output method and device, and a camera device. The method includes: acquiring a video stream original image and performing object recognition on it to determine the recognition objects it contains, where a recognition object may be an object and/or a human body; selecting at least one target object from the recognition objects and acquiring a video stream close-up image of the target object; and outputting the video stream close-up image to a display terminal for close-up display of the target object. By acquiring the video stream original image, recognizing the object or human body recognition objects it contains, and automatically tracking and recognizing them, the method enables close-up display of a specific recognition object and free switching between the video stream original image and the video stream close-up image, helping users meet the requirements of different scenarios such as live streaming and conferences.

Description

Video stream image output method and device and camera equipment
Technical Field
The invention relates to the technical field of image processing, and in particular to a video stream image output method and device and a camera device.
Background
Acquiring and outputting a video stream is an effective way of presenting a scene. Typically, one or more cameras directly capture video data of the scene and output it to a screen or terminal for display. When a particular person or object in the scene needs to be shown enlarged, the object to be enlarged must be determined manually, and switching is inconvenient. This approach not only consumes human resources but is also cumbersome to operate manually, which degrades the user experience.
Disclosure of Invention
In view of the above problems, the present invention proposes a video stream image output method, apparatus, and camera device that overcome or at least partially solve them.
According to an aspect of the present invention, there is provided an output method of a video stream image, including:
acquiring a video stream original image, and performing object identification on the video stream original image to determine an identification object contained in the video stream original image; the identification object comprises an object and/or a human body object;
selecting at least one target object from the recognition objects, and acquiring a video stream close-up image of the target object;
and outputting the video stream close-up image to a display terminal to perform close-up display on the target object.
Optionally, the selecting at least one target object from the identification objects comprises:
selecting at least one target object from the recognition objects in response to a voice instruction of a user; and/or
collecting voice information of a user, predicting user intention according to the voice information, and selecting at least one target object from the recognition objects according to the predicted user intention; and/or
analyzing the gesture motion of each recognition object based on the video stream original image, and selecting at least one target object from the recognition objects according to the gesture motion of each recognition object; and/or
and selecting at least one target object according to the motion trail of each identified object in the video stream original image.
Optionally, the analyzing the gesture motion of each of the recognition objects based on the video stream original image, and the selecting at least one target object from the recognition objects according to the gesture motion of each of the recognition objects includes:
detecting key points of a human body to obtain a human body skeleton of each identification object, and establishing an association relation with each identification object;
tracking hand features in human body skeletons corresponding to the recognition objects, and when judging that any recognition object performs a preset gesture action based on the hand features, taking the recognition object performing the preset gesture action as a target object.
Optionally, the selecting at least one target object according to the motion trajectory of each identified object in the original image of the video stream includes:
tracking the motion track of any identification object in the images of the continuous video stream;
and when the motion trail meets a preset motion rule, taking the recognition object as a target object.
Optionally, after acquiring the original image of the video stream, the method further includes: outputting the video stream original image to an operation end;
the selecting at least one target object from the identification objects comprises: and responding to a selection instruction triggered by the user based on the operating end, and selecting at least one target object from the identification objects.
Optionally, the acquiring the video stream close-up image of the target object comprises:
cropping and scaling the video stream original image to generate a video stream close-up image of the target object; or
and acquiring a video stream close-up image of the target object through a close-up camera.
Optionally, the acquiring, by the close-up camera, the video stream close-up image of the target object comprises:
acquiring the distance between an original camera for acquiring the original image of the video stream and a target object;
if the distance is judged to be within the range of the set threshold value, acquiring a video stream close-up image of the target object by using the original camera;
and if the distance exceeds the set threshold range, acquiring a video stream close-up image of the target object by using a close-up camera.
Optionally, after outputting the video stream close-up image to a display terminal for close-up display of the target object, the method further comprises:
receiving a close-up release instruction for stopping close-up display of the target object;
identifying a close-up release object in the target objects that corresponds to the close-up release instruction, determining a matched close-up release mode according to the number and/or types of the close-up release objects, and stopping close-up display of the close-up release objects according to that mode; or
and outputting the video stream original image to the display terminal for displaying.
According to another aspect of the present invention, there is provided an output apparatus for video stream images, the apparatus comprising:
the original image acquisition module is used for acquiring an original image of a video stream and performing object recognition on the original image of the video stream to determine a recognition object contained in the original image of the video stream; the identification object comprises an object and/or a human body object;
the close-up module is used for selecting at least one target object from the recognition objects and acquiring a video stream close-up image of the target object;
and the display module is used for outputting the video stream close-up image to a display terminal so as to display the target object in close-up.
According to still another aspect of the present invention, there is provided an output system of video stream images, the system comprising at least one camera, at least one display terminal, and an output device of video stream images connected to the camera and the display terminal.
According to yet another aspect of the present invention, there is also provided a computer-readable storage medium for storing program code for performing the method of any one of the above.
According to still another aspect of the present invention, there is also provided an image pickup apparatus including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform any of the methods described above according to instructions in the program code.
The invention provides a video stream image output method and device, and a camera device.
The above description is only an overview of the technical solutions of the present invention. It is provided so that the technical means of the invention may be more clearly understood and implemented in accordance with the content of this description. The above and other objects, advantages, and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flow chart illustrating an output method of video stream images according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for outputting video stream images according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of an output device for video stream images according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating an output system of video stream images according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
An embodiment of the present invention provides an output method of a video stream image, and as shown in fig. 1, the output method of the video stream image provided by the embodiment of the present invention may include at least the following steps S101 to S103.
S101, acquiring a video stream original image, and performing object identification on the video stream original image to determine an identification object contained in the video stream original image; the recognition object includes an object and/or a human object.
In this embodiment, the video stream original image to be output may be captured in real time by a depth camera or an ordinary camera, or it may be a pre-stored, uncropped segment of video stream images. After the video stream original image is acquired, it can be detected and analyzed to determine the recognition objects it contains; a recognition object may be an object or a human body. Specifically, each human body in the video stream original image can be recognized through face detection, and objects such as complete commodities or articles can be recognized through feature extraction, feature selection, and matching. Taking a meeting room as an example, the recognition objects in the video stream original image may be the participants of the meeting; in a live shopping broadcast, the recognition objects may be the anchor, recommended commodities, co-hosts, and so on. The number and types of recognition objects differ across scenes. The video stream original image is the full-frame image captured by the camera, and its resolution and picture size can be set freely for different display terminals.
Target detection in this embodiment yields the categories and target frames (bounding boxes) of different objects. During subsequent target tracking, each detected target is tracked and assigned a unique identifier, from which information such as its trajectory can be obtained. Since the target frame of each recognition object contains the size and position of the object, a specified object can be enlarged for close-up quickly and accurately.
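The tracking step described above, which assigns a unique identifier to each detected target, can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the IoU-matching strategy, the (x, y, w, h) box format, and the 0.3 threshold are all assumptions.

```python
# Minimal per-frame target tracking by IoU matching (illustrative only).
# Boxes are (x, y, w, h) in pixels.

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

class SimpleTracker:
    """Assigns a persistent integer ID to each detected target frame."""
    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}      # id -> last known box
        self.next_id = 0

    def update(self, detections):
        """Match this frame's detections to existing tracks; return id -> box."""
        assigned = {}
        unmatched = list(detections)
        for tid, prev in list(self.tracks.items()):
            best = max(unmatched, key=lambda d: iou(prev, d), default=None)
            if best is not None and iou(prev, best) >= self.iou_threshold:
                assigned[tid] = best
                unmatched.remove(best)
        for det in unmatched:          # a new object entering the scene
            assigned[self.next_id] = det
            self.next_id += 1
        self.tracks = assigned
        return assigned
```

Because each ID persists while the object stays in frame, the trajectory of a recognition object can be read off as the sequence of boxes returned for its ID.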
S102, selecting at least one target object from the recognition objects, and acquiring a video stream close-up image of the target object.
A video stream original image may contain several candidate target objects, and different application scenes may call for close-ups of different numbers and types of objects. Therefore, at least one target object is selected from the video stream original image and a video stream close-up image of it is acquired. One target object, or several (two or more), can then be displayed as a region of interest in the main area (for example, centered), and a local part of a target object can be enlarged so that its details are shown. A close-up in this embodiment can be understood as an enlarged display of the target object.
In this embodiment, the video stream close-up image of the target object can be acquired in either of the following two modes:
in the first mode, the video stream original image is cut and scaled to generate a video stream close-up image of the target object. If the selected target object is single, identifying a corresponding display area of the target object in the original image of the video stream; and performing cutting and scaling processing on the original video stream image based on the display area to generate a video stream close-up image of the target object. When the target object is in close-up, a plurality of objects can be enlarged and close-up, a single object can also be enlarged and close-up, and the setting can be specifically carried out according to different scene requirements.
If several target objects are selected, the display area of each target object in the video stream original image is identified, and the boundary area enclosing those display areas is determined; the original image is then cropped and scaled based on that boundary area to generate the video stream close-up image of the target objects.
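The crop-and-scale step for one or more target objects can be sketched as below. This is an illustrative sketch only: the padding margin, the nearest-neighbour resampling, and the default output size are assumptions not taken from the patent.

```python
import numpy as np

def close_up(frame, boxes, out_w=1280, out_h=720, pad=0.1):
    """Crop the union (boundary) region of the target boxes, with padding,
    from the original frame and scale it to the output size by
    nearest-neighbour resampling. Boxes are (x, y, w, h); frame is H x W x 3."""
    h, w = frame.shape[:2]
    # Boundary region enclosing all selected targets.
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[0] + b[2] for b in boxes)
    y2 = max(b[1] + b[3] for b in boxes)
    # Expand by a padding margin and clamp to the frame.
    px, py = int((x2 - x1) * pad), int((y2 - y1) * pad)
    x1, y1 = max(0, x1 - px), max(0, y1 - py)
    x2, y2 = min(w, x2 + px), min(h, y2 + py)
    crop = frame[y1:y2, x1:x2]
    # Nearest-neighbour scaling to the requested close-up resolution.
    ys = np.arange(out_h) * crop.shape[0] // out_h
    xs = np.arange(out_w) * crop.shape[1] // out_w
    return crop[ys][:, xs]
```

With a single box the same function handles the single-target case described in the first mode; with several boxes it crops the boundary area of all of them.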
In the second mode, a video stream close-up image of the target object is acquired through the close-up camera.
In this embodiment, the video stream original image and the video stream close-up image are acquired by independent cameras. After the target object is determined from the video stream original image, its target position is located first, and the close-up camera acquires the video stream close-up image according to that position. The close-up camera does not need the full picture: it locates the target position, crops a region around the target at a 16:9 aspect ratio, and then scales the cropped image to 2K resolution for output.
Optionally, several close-up cameras are arranged in the scene. When acquiring the video stream close-up image, the distance between the currently used camera and the target object is judged first: if the distance is within a set threshold range, the current camera locates and close-ups the target; if it exceeds the range, a close-up camera is used instead. This embodiment takes segmented lens switching as an example. Target distance < 5.0 m: camera A locates and close-ups the target by itself. Target distance > 5.0 m: after camera A locates the target, the corresponding camera B is switched in according to the target's left or right position and performs the close-up; the close-up area can output a 1080P or 720P image captured at a 16:9 ratio. By switching lenses in segments, targets at different positions can all be given close-ups, so the resulting video stream close-up images remain clear and smooth.
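The segmented switching rule above can be sketched as a small selection function. The 5.0 m threshold comes from the embodiment; the left/right split at the frame centre and the camera names are assumptions for illustration.

```python
# Sketch of the segmented lens-switching rule (illustrative assumptions:
# frame-centre left/right split, camera labels "A", "B_left", "B_right").

def select_camera(target_distance_m, target_x, frame_width, threshold_m=5.0):
    """Pick which camera should produce the close-up of the target."""
    if target_distance_m < threshold_m:
        return "A"                     # camera A locates and close-ups alone
    # Beyond the threshold, camera A only locates the target; the close-up
    # switches to the camera B on the same side as the target.
    return "B_left" if target_x < frame_width / 2 else "B_right"
```

The returned label would drive whichever camera-control API the device actually exposes.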
S103, outputting the video stream close-up image to a display terminal to perform close-up display on the target object.
After the video stream close-up image of the target object is acquired, it can be output to the display terminal, which displays it in close-up. In practice, the display terminal may be showing the video stream original image before the close-up arrives, in which case the displayed image is updated from the original image to the close-up image. The display terminal may be a mobile phone, a tablet computer, a display screen, or a similar terminal.
In this embodiment, when the video stream close-up image is output to the display terminal in step S103, the interface parameters of the terminal's display interface can be obtained first; the size of the output image is then adjusted according to those parameters before output, so that the close-up image fits the terminal's screen or interface. For example, if the display terminal is a mobile phone, the close-up image can be adjusted to the screen size of the phone in portrait orientation, so the user can watch it comfortably.
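The size adjustment against the interface parameters can be sketched as follows. The patent only says the output size is adjusted according to the interface parameters; fitting inside the screen while preserving aspect ratio is an assumption about how that adjustment is done.

```python
# Sketch of adjusting the close-up image size to the display terminal's
# interface parameters (step S103); aspect-preserving fit is assumed.

def fit_to_interface(img_w, img_h, screen_w, screen_h):
    """Return the output size that fits the screen, keeping aspect ratio."""
    scale = min(screen_w / img_w, screen_h / img_h)
    return int(img_w * scale), int(img_h * scale)
```

For a 1920x1080 close-up shown on a phone held in portrait (1080x1920), the image is scaled down to the screen width and letterboxed vertically.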
The embodiment of the invention provides a video stream image output method: a video stream original image is acquired, the object or human body recognition objects it contains are recognized, and after a target object is selected, a video stream close-up image of it is generated. This realizes an enlarged close-up of the target object, meets the video playback requirements of scenes such as conferences and live streaming, highlights the important object or person in each scene, and satisfies users' personalized playback needs.
In step S102 above, at least one target object needs to be selected from the recognition objects. In an alternative embodiment of the invention, the target object may be selected in several different ways, described in detail below.
1. Voice instruction mode
At least one target object is selected from the recognition objects in response to a voice instruction of a user.
The user can specify a target object by voice. When a target object is selected according to a voice instruction, the user's voice information is acquired and parsed; when preset instructions such as "enlarge" or "close-up" are recognized in it, the target object named in the voice information is further identified. For example, suppose the user says "enlarge article A": article A is then determined to be the target object, and its position in the video stream image is found through processes such as feature comparison and given a close-up.
In practical application, only some people may have the authority to issue voice instructions. Therefore, when a voice instruction is detected, voiceprint recognition can be performed first: if the user issuing the instruction is judged to have authority, the target object is shown in close-up; otherwise the instruction is ignored.
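The voice-instruction path (authority check, trigger word, named object) can be sketched as below. This is purely illustrative: the speaker label stands in for the result of voiceprint recognition, and the trigger words and authorised set are assumptions.

```python
# Sketch of the voice-instruction selection path. The speaker identity is
# assumed to come from voiceprint recognition (not sketched here); trigger
# words and the authorised-speaker set are illustrative assumptions.

TRIGGERS = ("enlarge", "close-up")

def handle_voice(text, speaker, authorised=frozenset({"host"})):
    """Return the named target object, or None if unauthorised / no trigger."""
    if speaker not in authorised:
        return None                        # instruction ignored
    words = text.lower().split()
    for trigger in TRIGGERS:
        if trigger in words:
            rest = words[words.index(trigger) + 1:]
            return " ".join(rest) or None  # object named after the trigger
    return None
```

The returned object name would then be matched against the recognition objects by feature comparison, as described above.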
2. Intention prediction mode
Voice information of the user is collected, the user intention is predicted from it, and at least one target object is selected from the recognition objects according to the predicted intention.
For example, in a live-streaming scene, suppose the anchor is currently explaining a commodity. By collecting the anchor's voice information, when it is predicted that the anchor is about to demonstrate the details of the commodity, the commodity can automatically be taken as the target object and shown in close-up. In this embodiment, the predictive analysis of user intention from voice information can be realized by a model built and trained with a neural network.
3. Gesture recognition mode
The gesture motion of each recognition object is analyzed based on the video stream original image, and at least one target object is selected from the recognition objects according to those gesture motions.
In this embodiment, a preset gesture for triggering a close-up with one action can be configured in advance. After the video stream original image is acquired, image recognition is performed on each object in it; when the preset gesture motion of any object is detected, that object is taken as a target object for close-up. Of course, if several objects perform the preset gesture motion within a certain time range, all of them can be taken as close-up objects and shown in close-up.
In an alternative embodiment of the invention, it is also possible to respond only to the gesture motions of fixed objects. When a gesture motion is detected, it is first judged whether the recognition object is a preset fixed object for gesture recognition. If it is, the gesture motion is responded to, the recognition object is taken as a target object, and its video stream close-up image is acquired. If it is not, the gesture motion is ignored and the gesture motions of the other recognition objects in the video stream original image continue to be monitored. By recognizing only the gestures of fixed objects, the system's computational load can be reduced and misrecognition of target objects, which would harm the user experience, is avoided.
Optionally, when the target object is selected by analyzing the gesture motion of each recognition object based on the video stream original image, the following steps A1 to A2 may be specifically included.
A1, obtaining a human body skeleton of each recognition object through human body key point detection, and establishing an association relation with each recognition object. As described above, each recognition object in the original image may have a unique corresponding identifier, and both the recognized face image and the recognized human skeleton may be associated with the corresponding recognition object.
A2, tracking hand features in the human body skeleton corresponding to each recognition object; when it is judged, based on the hand features, that any recognition object performs the preset gesture action, that recognition object is taken as the target object. In this embodiment, a hand tracker may track the hand of each recognition object to obtain its hand features, while a portrait tracker tracks the portrait of each recognition object at the same time. When judging that a recognition object makes the preset gesture motion, besides verifying the type of the gesture, its duration (for example, at least 1 second) and the stability of the recognition object are also checked; only when both the head and the gesture of the recognition object are judged stable is it determined as the target object, which prevents a brief misoperation from being misread as user intention. Optionally, when judging gesture stability, the hand features may be input into a pre-created classification model; when the classification model determines that the user has performed the preset gesture motion a cumulative number of times, the gesture is deemed stable, the object is determined as the target object, and close-up is performed; if the gesture is unstable, counting restarts. Optionally, when performing gesture recognition, a gesture recognition candidate region of the recognition object may be determined first, so that the user's gesture is tracked and recognized only within that region.
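The stability check described above can be sketched as a simple per-object debounce counter. This is an illustrative assumption, not the patent's implementation: the name `GestureDebouncer` and the frame threshold are hypothetical, and a real system would feed it the per-frame outputs of the gesture classifier and the head tracker.

```python
# Hypothetical sketch of the gesture-stability check: a per-object counter
# accumulates consecutive frames in which the preset gesture is detected
# while the head is stable, and resets to zero otherwise.
class GestureDebouncer:
    def __init__(self, required_frames=30):  # e.g. ~1 second at 30 fps
        self.required_frames = required_frames
        self.counts = {}  # object id -> consecutive stable-gesture frames

    def update(self, obj_id, gesture_detected, head_stable):
        """Return True once obj_id has held the gesture long enough."""
        if gesture_detected and head_stable:
            self.counts[obj_id] = self.counts.get(obj_id, 0) + 1
        else:
            self.counts[obj_id] = 0  # unstable: start counting again
        return self.counts[obj_id] >= self.required_frames
```

At 30 fps, `required_frames=30` approximates the 1-second duration mentioned above; any frame in which the gesture or the head is unstable restarts the count, so a brief misoperation never triggers a close-up.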
The gesture candidate region may be bounded horizontally by the user's two shoulders, and vertically from roughly the eyebrows or forehead down to no lower than the upper half of the user's body. Setting a gesture candidate region improves the reliability of gesture control and reduces misjudgment, and it greatly lowers computational power consumption compared with conventional gesture control schemes, facilitating implementation on low- and mid-range platforms.
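The candidate region above can be derived from the detected human body skeleton. A minimal sketch, assuming keypoints are given as pixel coordinates with y growing downward; the keypoint names (`l_shoulder`, `brow`, `waist`) are illustrative, not fixed by the patent:

```python
# Derive the gesture-recognition candidate region from body keypoints:
# horizontally bounded by the two shoulders, vertically from roughly the
# eyebrows down to the waist (upper half of the body).
def gesture_candidate_region(keypoints):
    """keypoints: dict name -> (x, y) in image coordinates (y grows downward)."""
    left = min(keypoints["l_shoulder"][0], keypoints["r_shoulder"][0])
    right = max(keypoints["l_shoulder"][0], keypoints["r_shoulder"][0])
    top = keypoints["brow"][1]       # not above the eyebrows
    bottom = keypoints["waist"][1]   # not below the upper half of the body
    return (left, top, right, bottom)

def in_candidate_region(hand_xy, region):
    """Only hands inside the region are passed to gesture classification."""
    x, y = hand_xy
    l, t, r, b = region
    return l <= x <= r and t <= y <= b
```

Hands outside the region are skipped entirely, which is where the computational savings over full-frame gesture search come from.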
Taking a conference scene as an example, suppose multiple people are present in a conference room. When one of them needs a close-up while speaking, the user can trigger the close-up display simply by making the preset gesture, highlighting the current speaker and attracting the participants' attention. Both portrait and hand tracking may be based on the SORT (Simple Online and Realtime Tracking) multi-target tracking algorithm. In practical applications, only specific people may be allowed to issue gesture commands, so gesture classification and gesture command recognition can be performed for only part of the recognition objects, further reducing computational power consumption.
4. Track prediction mode
At least one target object is selected according to the motion trajectory of each recognition object in the video stream original image. Optionally, this includes: tracking the motion trajectory of any recognition object across consecutive video stream images; and, when the motion trajectory is judged to satisfy a preset motion rule, taking that recognition object as the target object. Taking a live-streaming scene as an example, suppose the anchor is holding an article; when it is determined that the anchor moves the held article and pushes it toward the lens, it can be inferred that the anchor wants the article displayed prominently, and the article can then be taken as the target object and displayed in close-up.
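One way to approximate "pushed toward the lens" without a depth sensor is the same near-large/far-small principle the embodiment later uses for distance estimation: the tracked article's bounding box grows frame by frame as it approaches the camera. A hedged sketch with illustrative thresholds (the patent does not specify concrete values):

```python
# Treat a steadily growing bounding box over recent frames as a stand-in for
# "the article is being pushed toward the camera". Thresholds are assumptions.
def moving_toward_camera(box_areas, min_growth=1.3, min_frames=5):
    """box_areas: chronological list of the tracked object's box areas (px^2)."""
    if len(box_areas) < min_frames:
        return False  # not enough history to judge a stable motion
    recent = box_areas[-min_frames:]
    steadily_growing = all(b >= a for a, b in zip(recent, recent[1:]))
    return steadily_growing and recent[-1] >= min_growth * recent[0]
```

Requiring both monotonic growth and a minimum overall growth ratio filters out jitter from the detector, matching the "moves stably" condition in steps S204/S210 below.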
In addition to the above, the target object may also be selected through an operator's instruction. Specifically, after the video stream original image is obtained in step S101, it may be output to an operation end; when selecting at least one target object from the recognition objects, the selection can be made in response to a selection instruction triggered by a user at the operation end.
In practical applications, the video stream original image may be output to an operation end such as a video monitoring device. An operator at the operation end can not only view the camera's display interface but also draw a selection frame around the content of interest in it; the recognition object enclosed by the frame is then taken as the target object and displayed in close-up.
The above describes how a close-up of the target object is performed. In this embodiment, the video stream close-up image may also be switched back to the video stream original image. Optionally, after the video stream close-up image is output to the display terminal in step S103 for close-up display of the target object, a close-up release instruction for stopping the close-up display of the target object may be received; the close-up release object corresponding to the instruction is identified among the target objects; a matching close-up release mode is determined according to the number and/or type of the close-up release objects, and close-up display of those objects is stopped according to that mode; or the video stream original image is directly output to the display terminal for display.
The close-up release instruction in this embodiment may be a voice instruction or a gesture instruction. Similar to target selection, a single gesture may switch the display back to the panorama, or return it to the default mode. The release may apply to a single target object or to several, and can be configured according to different requirements. When the close-up is released for some of several target objects, the positions of the target objects that remain in close-up can be redetermined and the video stream close-up image adjusted by adjusting the camera. In addition, for a target object determined through its motion trajectory as introduced above, when it is further determined from the trajectory that the anchor moves the held article back away from the camera, the article can be released from close-up and the display switched back to the video stream original image.
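The choice between the two release behaviors can be sketched as a small dispatcher: releasing every active target falls back to the original image, while releasing only a subset re-frames the close-up around the targets that remain. The mode names here are hypothetical; the patent leaves the mapping between release objects and release modes configurable.

```python
# Illustrative dispatch of the close-up release mode by the number of
# close-up release objects relative to the currently active targets.
def release_mode(active_targets, release_objects):
    remaining = [t for t in active_targets if t not in release_objects]
    if not remaining:
        return "show_original"    # no target left in close-up: full frame
    return "reframe_close_up"     # re-aim/re-crop around remaining targets
```

A "reframe" result would then drive the camera adjustment described above, recomputing the close-up region from the remaining targets' positions.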
The following describes in detail the method for outputting video stream images according to an embodiment of the present invention.
S201, acquiring an original image of a video stream by using a camera A, and outputting the original image of the video stream to a display terminal, wherein the display terminal can be a computer or a mobile phone;
S202, acquiring the video stream original image, and performing object recognition on it to determine the recognition objects contained therein; in this embodiment, the recognition objects in the video stream image may include an object A, an object B and an object C, with an identification mark allocated to each; object A and object B may be the anchor and the assistant anchor in a live-streaming scene, and object C is an article they present;
S203, obtaining the human body skeletons of object A and object B through human body key point detection, and establishing association relations with their identification marks respectively; from this point, the position changes of objects A, B and C, and the hand features of objects A and B, may be tracked by trackers;
S204, when it is judged that object A holds object C and moves it stably toward the camera, object C is determined as the target object;
S205, determining the distance between camera A and object C; if the distance is less than 5.0 m, executing step S206; if the distance is greater than or equal to 5.0 m, executing step S207. If distance information is directly available (for example, the device is equipped with a distance sensor, or a TOF/structured-light depth camera is used), the comparison is performed on the measured distance. If distance information cannot be obtained directly, the target box size can be compared instead: based on the principle that near objects look large and far objects look small, the size of object C's face box is compared with a preset target size calibrated at the preset distance (for example, 5 m); a face box smaller than the preset size indicates that the object is beyond the preset distance.
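Step S205's branch can be sketched as follows. The 5.0 m threshold comes from the text; the reference face-box height standing in for the preset target size is an assumed calibration constant that would in practice be measured for the specific camera at the preset distance:

```python
# Decide whether the original camera A should take the close-up (near target)
# or the close-up camera B should (far target). Falls back to the face-box
# size proxy when no depth measurement is available.
def use_original_camera(distance_m=None, face_height_px=None,
                        threshold_m=5.0, ref_height_px=40):
    if distance_m is not None:          # distance sensor / TOF / structured light
        return distance_m < threshold_m
    if face_height_px is not None:      # "near-large far-small" fallback
        return face_height_px >= ref_height_px
    raise ValueError("need either a measured distance or a face box size")
```

Returning `True` corresponds to step S206 (camera A positions the target alone); `False` corresponds to step S207 (switch to camera B).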
S206, positioning the target with camera A alone and acquiring the video stream close-up image;
S207, after camera A has positioned the target, switching to the corresponding camera B according to the target's left/right position, and acquiring the video stream close-up image with camera B.
And S208, outputting the video stream close-up image to the display terminal to show the target object in close-up.
And S210, when it is judged that object A moves the held object C stably away from the camera, the close-up of object C is released, and the video stream original image is output to the display terminal.
Through automatic tracking and recognition of human actions, the scheme provided by this embodiment realizes free switching between the video stream original image and the video stream close-up image, assists users in meeting the various switching needs of scenes such as live streaming and conferences, achieves automatic identification of close-up objects without redundant manual operation, and promotes the intelligent management of such scenes.
The solution described in the above embodiment uses two cameras. An embodiment of the present invention further provides a video stream image output method applicable to a single-camera scenario; the output method of this embodiment may include:
s301, acquiring an original image of a video stream by using a camera X, and outputting the original image of the video stream to a display terminal, wherein the display terminal can be a computer, a mobile phone, a tablet, a conference large screen or a multimedia display screen;
S302, acquiring the video stream original image, and performing object recognition on it to determine the recognition objects contained therein; in this embodiment, the recognition objects in the video stream image may include an object A, an object B and an object C, with an identification mark allocated to each; object A and object B may be the anchor and the assistant anchor in a live-streaming scene;
S303, obtaining the human body skeletons of object A and object B through human body key point detection, and establishing association relations with their identification marks respectively; from this point, object A, object B and their hand features may be tracked by trackers. In this embodiment, a multi-class object detection neural network model may simultaneously detect the portrait objects and hand objects in the video stream original image and associate each portrait with its hands, so as to determine each object and its hand features.
S304, when it is detected that object A triggers gesture instruction 1 (a preset gesture for starting close-up, such as raising a hand), object A is determined as the target object; at this point, camera X positions object A and acquires a video stream close-up image of it. Specifically, the display area corresponding to object A in the video stream original image can be identified, and the video stream original image is cropped and scaled based on that display area to generate the video stream close-up image of object A.
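The crop-and-scale step can be sketched with plain array operations: expand the detected display area by a margin, clamp it to the frame, and resize the crop back to the output resolution. The margin value and the nearest-neighbour resize are illustrative stand-ins for a production scaler:

```python
import numpy as np

# Generate a close-up frame from the original frame and the target's display
# area. margin widens the box so the close-up is not cut too tightly.
def close_up(frame, box, margin=0.2):
    """frame: HxWxC image array; box: (x0, y0, x1, y1) display area."""
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = box
    mx, my = int((x1 - x0) * margin), int((y1 - y0) * margin)
    x0, y0 = max(0, x0 - mx), max(0, y0 - my)          # clamp to frame
    x1, y1 = min(w, x1 + mx), min(h, y1 + my)
    crop = frame[y0:y1, x0:x1]
    # nearest-neighbour resize back to the full output resolution
    ys = (np.arange(h) * crop.shape[0] // h).clip(0, crop.shape[0] - 1)
    xs = (np.arange(w) * crop.shape[1] // w).clip(0, crop.shape[1] - 1)
    return crop[ys][:, xs]
```

The output has the same resolution as the input frame, so the display terminal can show the close-up without any change to its pipeline.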
And S305, outputting the video stream close-up image to the display terminal to realize close-up display of object A; in this embodiment, the close-up image of object A may be displayed at the display terminal.
And S306, when it is detected that object A triggers gesture instruction 2 (a preset gesture for exiting close-up, such as making a fist), the close-up of object A is released, and the video stream original image is output to the display terminal.
The method of this embodiment can also realize close-up of multiple people, for example by cropping the video stream original image to mainly show the close-up images of several target objects. A user who needs a close-up can start and exit it simply by making a simple gesture, which is simpler, more convenient and easier than conventional manual close-up operation.
Based on the same inventive concept, an embodiment of the present invention further provides an output apparatus 300 for video stream images, as shown in fig. 3, the output apparatus 300 for video stream images of the present embodiment may include:
an original image obtaining module 310, configured to obtain a video stream original image, and perform object recognition on it to determine the recognition objects contained in the video stream original image, the recognition objects including object(s) and/or human body object(s);
a close-up recognition module 320, configured to select at least one target object from the recognition objects, and obtain a video stream close-up image of the target object;
and a display module 330, configured to output the video stream close-up image to a display terminal, so as to display the target object in close-up.
In an alternative embodiment of the present invention, the close-up recognition module 320 may be further operable to:
selecting at least one target object from the recognition objects in response to a voice instruction of a user; and/or,
collecting voice information of a user, predicting user intention according to the voice information, and selecting at least one target object from the recognition objects according to the predicted user intention; and/or,
analyzing the gesture action of each recognition object based on the video stream original image, and selecting at least one target object from the recognition objects according to the gesture action of each recognition object; and/or,
and selecting at least one target object according to the motion track of each identified object in the video stream original image.
In an alternative embodiment of the present invention, the close-up recognition module 320 may be further operable to: detecting key points of a human body to obtain a human body skeleton of each identification object, and establishing an association relation with each identification object;
and tracking hand features in the human body skeleton corresponding to each recognition object, and when judging that any recognition object performs a preset gesture action based on the hand features, taking the recognition object performing the preset gesture action as a target object.
In an alternative embodiment of the present invention, the close-up recognition module 320 may be further operable to: tracking the motion track of any identification object in the images of the continuous video stream;
and when the motion track meets the preset motion rule, taking the identified object as a target object.
In an optional embodiment of the present invention, the presentation module 330 is further configured to output the video stream original image to the operation end;
the close-up recognition module 320 may also be configured to: select at least one target object from the recognition objects in response to a selection instruction triggered by the user at the operation end.
In an alternative embodiment of the present invention, the close-up recognition module 320 may be further operable to: cutting and zooming the video stream original image to generate a video stream close-up image of a target object; or,
and acquiring a video stream close-up image of the target object through the close-up camera.
In an alternative embodiment of the present invention, the close-up recognition module 320 may be further operable to: acquiring the distance between an original camera for acquiring an original image of a video stream and a target object;
if the distance is within the range of the set threshold value, acquiring a video stream close-up image of the target object by using the original camera;
and if the distance exceeds the set threshold range, acquiring the video stream close-up image of the target object by using the close-up camera.
In an optional embodiment of the present invention, the presentation module 330 is further configured to receive a close-up release instruction to stop close-up presentation of the target object;
identifying a close-up release object corresponding to the close-up release instruction among the target objects; determining a matched close-up release mode according to the number and/or type of the close-up release objects, and stopping close-up display of those objects according to the close-up release mode; or, outputting the video stream original image to the display terminal for display.
Embodiments of the present invention further provide a computer-readable storage medium, which is used for storing a program code, and the program code is used for executing the method of the above embodiments.
As shown in fig. 4, the video stream image output system of this embodiment includes at least one camera 100, at least one display terminal 200, and the video stream image output apparatus 300 of the foregoing embodiment, in which the camera and the display terminal are each connected to the output apparatus 300.
An embodiment of the present invention further provides an image pickup apparatus, which includes a processor and a memory: the memory is used for storing program code and transmitting it to the processor; the processor is configured to execute the method of the above embodiments according to the instructions in the program code. Of course, besides the processor, the image pickup apparatus further includes optical components for image acquisition, such as a lens and an optical filter; it may additionally include other components that realize its image capture function and the functions described in this embodiment, such as a memory, a housing and fixing components, which are not described again here.
It is clear to those skilled in the art that the specific working processes of the above-described systems, devices, modules and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional unit may be implemented in the form of hardware, or may also be implemented in the form of software or firmware.
Those of ordinary skill in the art will understand that: the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention when the instructions are executed. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (such as a computing device, e.g., a personal computer, a server, or a network device) associated with program instructions, which may be stored in a computer-readable storage medium, and when the program instructions are executed by a processor of the computing device, the computing device executes all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be replaced with equivalents within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.

Claims (10)

1. A method for outputting video stream images, comprising:
acquiring an original video stream image, and performing object recognition on the original video stream image to determine recognition objects contained in the original video stream image; the identification object comprises an object and/or a human body object;
selecting at least one target object from the recognition objects, and acquiring a video stream close-up image of the target object;
and outputting the video stream close-up image to a display terminal to perform close-up display on the target image.
2. The method of claim 1, wherein the selecting at least one target object from the identified objects comprises:
selecting at least one target object from the recognition objects in response to a voice instruction of a user; and/or,
collecting voice information of a user, predicting user intention according to the voice information, and selecting at least one target object from the recognition objects according to the predicted user intention; and/or,
analyzing the gesture motion of each recognition object based on the video stream original image, and selecting at least one target object from the recognition objects according to the gesture motion of each recognition object; and/or,
and selecting at least one target object according to the motion trail of each identified object in the original image of the video stream.
3. The method of claim 2, wherein the parsing the gesture motion of each of the recognition objects based on the video stream original image, and the selecting at least one target object from the recognition objects according to the gesture motion of each of the recognition objects comprises:
detecting key points of a human body to obtain a human body skeleton of each identification object, and establishing an association relation with each identification object;
tracking hand features in human body skeletons corresponding to the recognition objects, and when judging that any recognition object performs a preset gesture action based on the hand features, taking the recognition object performing the preset gesture action as a target object.
4. The method of claim 2, wherein selecting at least one target object according to the motion trajectory of each identified object in the original image of the video stream comprises:
tracking the motion track of any identification object in the images of the continuous video stream;
and when the motion trail is judged to meet a preset motion rule, taking the identification object as a target object.
5. The method of claim 1, wherein after the obtaining of the video stream original image, the method further comprises: outputting the video stream original image to an operation end;
the selecting at least one target object from the recognition objects comprises: and responding to a selection instruction triggered by the user based on the operation end, and selecting at least one target object from the identification objects.
6. The method of claim 1, wherein the obtaining a video stream close-up image of the target object comprises:
cutting and zooming the video stream original image to generate a video stream close-up image of the target object; or,
and acquiring a video stream close-up image of the target object through a close-up camera.
7. The method of claim 6, wherein the acquiring, by the close-up camera, the video stream close-up image of the target object comprises:
acquiring the distance between an original camera for acquiring the original image of the video stream and a target object;
if the distance is judged to be within the range of the set threshold value, acquiring a video stream close-up image of the target object by using the original camera;
and if the distance is judged to exceed the set threshold range, acquiring a video stream close-up image of the target object by using a close-up camera.
8. The method according to any one of claims 1-7, wherein after outputting the video stream close-up image to a display terminal for close-up presentation of the target image, the method further comprises:
receiving a close-up release instruction for stopping close-up display of the target object;
identifying a close-up release object corresponding to the close-up release instruction in the target object; determining a matched close-up release mode according to the number and/or the type of the close-up release objects, and stopping close-up display of the close-up release objects according to the close-up release mode; or,
and outputting the video stream original image to the display terminal for displaying.
9. An apparatus for outputting video stream images, the apparatus comprising:
the original image acquisition module is used for acquiring an original image of a video stream, and performing object identification on the original image of the video stream to determine an identification object contained in the original image of the video stream; the identification object comprises an object and/or a human body object;
the close-up recognition module is used for selecting at least one target object from the recognition objects and acquiring a video stream close-up image of the target object;
and the display module is used for outputting the video stream close-up image to a display terminal so as to display the target image in a close-up manner.
10. An image pickup apparatus characterized by comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-8 according to instructions in the program code.
CN202210976194.3A 2022-08-15 2022-08-15 Video stream image output method and device and camera equipment Pending CN115474076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210976194.3A CN115474076A (en) 2022-08-15 2022-08-15 Video stream image output method and device and camera equipment


Publications (1)

Publication Number Publication Date
CN115474076A true CN115474076A (en) 2022-12-13

Family

ID=84366260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210976194.3A Pending CN115474076A (en) 2022-08-15 2022-08-15 Video stream image output method and device and camera equipment

Country Status (1)

Country Link
CN (1) CN115474076A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101068342A (en) * 2007-06-05 2007-11-07 西安理工大学 Video frequency motion target close-up trace monitoring method based on double-camera head linkage structure
CN102968802A (en) * 2012-11-28 2013-03-13 无锡港湾网络科技有限公司 Moving target analyzing and tracking method and system based on video monitoring
CN104125433A (en) * 2014-07-30 2014-10-29 西安冉科信息技术有限公司 Moving object video surveillance method based on multi-PTZ (pan-tilt-zoom)-camera linkage structure
CN110287891A (en) * 2019-06-26 2019-09-27 北京字节跳动网络技术有限公司 Gestural control method, device and electronic equipment based on human body key point
CN111314759A (en) * 2020-03-02 2020-06-19 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and storage medium
CN111757137A (en) * 2020-07-02 2020-10-09 广州博冠光电科技股份有限公司 Multi-channel close-up playing method and device based on single-shot live video
CN112308018A (en) * 2020-11-19 2021-02-02 安徽鸿程光电有限公司 Image identification method, system, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
US8314854B2 (en) Apparatus and method for image recognition of facial areas in photographic images from a digital camera
JP4274233B2 (en) Imaging apparatus, image processing apparatus, image processing method therefor, and program causing computer to execute the method
JP6134825B2 (en) How to automatically determine the probability of image capture by the terminal using context data
US8022982B2 (en) Camera system and method for operating a camera system
US8064656B2 (en) Image processing apparatus, imaging apparatus, image processing method, and computer program
KR101444103B1 (en) Media signal generating method and apparatus using state information
JP5662670B2 (en) Image processing apparatus, image processing method, and program
JP2008113262A (en) Image storage device, imaging apparatus, and image storing method and program
JP2011188297A (en) Electronic zoom apparatus, electronic zoom method, and program
JP2007074143A (en) Imaging device and imaging system
JP2014139681A (en) Method and device for adaptive video presentation
KR20110067716A (en) Apparatus and method for registering a plurlity of face image for face recognition
JP2007088803A (en) Information processor
CN102196176A (en) Information processing apparatus, information processing method, and program
CN109766473B (en) Information interaction method and device, electronic equipment and storage medium
CN112399239B (en) Video playing method and device
CN111596760A (en) Operation control method and device, electronic equipment and readable storage medium
CN114257824A (en) Live broadcast display method and device, storage medium and computer equipment
CN115474076A (en) Video stream image output method and device and camera equipment
JP2011211757A (en) Electronic zoom apparatus, electronic zoom method, and program
CN115334241B (en) Focusing control method, device, storage medium and image pickup apparatus
US20190171898A1 (en) Information processing apparatus and method
JP2008148262A (en) Imaging apparatus, its control method, program, and storage medium
CN115499580B (en) Multi-mode fusion intelligent view finding method and device and image pickup equipment
JP5242491B2 (en) Display control apparatus and operation control method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination