CN113542604A - Video focusing method and device - Google Patents

Video focusing method and device

Info

Publication number
CN113542604A
CN113542604A (application CN202110786821.2A)
Authority
CN
China
Prior art keywords
voiceprint
visual image
target object
target
video
Prior art date
Legal status
Pending
Application number
CN202110786821.2A
Other languages
Chinese (zh)
Inventor
盛娇麒
刘东婷
Current Assignee
Koubei Shanghai Information Technology Co Ltd
Original Assignee
Koubei Shanghai Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Koubei Shanghai Information Technology Co Ltd
Priority to CN202110786821.2A
Publication of CN113542604A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/67 Focus control based on electronic image sensor signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/61 Control of cameras or camera modules based on recognised objects
    • H04N23/611 Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/64 Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Studio Devices (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video focusing method, apparatus and device. The method comprises: acquiring voice data in a video scene and extracting voiceprint features of the voice data; matching the voiceprint features of the voice data against pre-stored voiceprint features and determining the matched target voiceprint feature among the pre-stored voiceprint features; and identifying a target object in the video scene that matches the target voiceprint feature, and focusing on or tracking the face of the target object. The method solves the problem that focus cannot follow the sound during video capture.

Description

Video focusing method and device
Technical Field
The application relates to the technical field of video processing, and in particular to a video focusing method, apparatus and device. The application also relates to a data processing method.
Background
In a video application system, one or more objects often need to be displayed clearly during video capture, so a target object needs to be brought into focus, for example in live streaming, video conferencing, video calls or video recording.
In the prior art, if manual focusing is adopted, the focus is fixed, which brings great inconvenience to video capture. If automatic focusing is adopted, focusing is usually performed according to environmental parameters such as the distance and lighting of each object in the scene to be shot, and it is difficult to adapt to scenes where the focus should follow the sound. For example, live-commerce streaming has gradually evolved from a single host to formats such as multi-host co-streaming and multiple hosts sharing one camera, but the focus cannot follow the sound during the live stream: it often happens that one host is speaking while the lens focuses on another host, giving viewers the impression that voice and picture are out of sync and breaking the user experience.
Therefore, providing a reasonable video focusing approach that avoids the situation where focus cannot follow the sound during video capture is a problem to be solved.
Disclosure of Invention
The video focusing method provided by the embodiments of the application solves the problem that focus cannot follow the sound during video capture.
The embodiments of the application provide a video focusing method, which comprises: acquiring voice data in a video scene, and extracting voiceprint features of the voice data; matching the voiceprint features of the voice data against pre-stored voiceprint features, and determining the matched target voiceprint feature among the pre-stored voiceprint features; and identifying a target object in the video scene that matches the target voiceprint feature, and controlling a camera to focus on or track the face of the target object.
Optionally, the acquiring of voice data in a video scene includes: collecting all sounds in the video scene and filtering the sounds to obtain the voice data.
Optionally, collecting all sounds in the video scene and filtering the sounds to obtain the voice data includes: filtering noise in the sounds according to a preset noise frequency range and extracting the voice data; or filtering noise in the sounds according to a preset noise intensity threshold and extracting the voice data.
Optionally, extracting the voiceprint features of the voice data includes: acquiring the frequency value of the current voice frame and/or the frequency value of an adjacent voice frame in the voice data, and determining the voiceprint features of the voice data according to the frequency value of the current voice frame and/or the frequency value of the adjacent voice frame.
Optionally, the method further includes: collecting sound data of the target object, and extracting voiceprint features of the target object from the sound data; collecting a visual image of the target object, and extracting visual image features of the visual image; and storing the voiceprint features in association with the visual image features; the voiceprint features of the target object are pre-stored voiceprint features.
Optionally, collecting the sound data of the target object and extracting the voiceprint features of the target object from the sound data includes: collecting sounds of different intensities made by the target object in various scenes, and filtering the collected sounds; and learning from the filtered sounds based on a neural network to obtain the voiceprint features of the target object.
Optionally, identifying a target object in the video scene that matches the target voiceprint feature includes: acquiring a target visual image feature associated with the target voiceprint feature; and identifying an object in the video scene that matches the target visual image feature as the target object matching the target voiceprint feature.
Optionally, the target visual image feature is a facial image feature of the target object; identifying the target object in the video scene that matches the target visual image feature comprises: extracting facial images of objects in the video scene; and calculating the similarity between a facial image and the target visual image feature, where if the similarity is greater than a similarity threshold, that facial image is the facial image of the target object.
Optionally, focusing on or tracking the face of the target object includes: determining the sharpness of the facial image of the target object in the video scene; and if the sharpness is lower than a sharpness threshold, adjusting the focus position of the image in the video scene until the sharpness is not lower than the sharpness threshold.
Optionally, focusing on or tracking the face of the target object includes: determining a focus position based on the facial image of the target object; marking the focus position, and moving the focus position as the spatial position of the facial image of the target object changes, so as to track the face of the target object.
Optionally, the method further includes: capturing position information and/or size information of the facial image of the target object in the video scene; and focusing on and/or tracking the face of the target object in real time according to the position information and/or the size information.
An embodiment of the present application further provides a data processing method, including: displaying a first page for establishing an association relationship between a voiceprint and a visual image, the first page displaying a voiceprint acquisition entry; in response to a trigger instruction of the voiceprint acquisition entry, displaying a voiceprint acquisition page; the voiceprint acquisition page is used for collecting the sound information of an object; the sound information is used for extracting the voiceprint features of the object; in response to an instruction indicating successful collection of the sound information of the object, displaying a visual image information acquisition page; the visual image information acquisition page is used for collecting visual image information of the object; the visual image information is used for extracting visual image features of the object; and establishing an association relationship between the voiceprint features and the visual image features.
Optionally, the method further includes: receiving input voiceprint identification information on the voiceprint acquisition page; and/or displaying, on the voiceprint acquisition page, a voiceprint recording control for starting a sound collection function, displaying a secondary voiceprint acquisition page after the voiceprint recording control is triggered, and guiding the object to record sound information; the sound information is voice data.
Optionally, the first page further displays a voiceprint management entry; the method further comprises: in response to the voiceprint management entry being triggered, displaying a voiceprint management page for managing voiceprints; receiving user behavior information on the voiceprint management page, and, according to the user behavior information, performing the following operations on the entered sound information and/or the visual image information associated with the entered sound information: deleting, updating, or adding.
Optionally, the first page further displays a sound-following focusing control; after being triggered, the sound-following focusing control is used for starting an audio and video capture function and for showing and/or testing the effect of focusing on the facial image of the target object in real time according to the voice data in the video scene.
Optionally, the first page is displayed with a visual image information acquisition entry; the method further comprises: in response to a trigger instruction of the visual image information acquisition entry, displaying a visual image information acquisition page for collecting visual image information of the object; the visual image information is used for extracting visual image features of the object; in response to an instruction indicating successful collection of the visual image information of the object, displaying the voiceprint acquisition page; and collecting the sound information of the object, and establishing an association relationship between the voiceprint features of the sound information and the visual image features.
An embodiment of the present application further provides a video focusing apparatus, including: a voiceprint acquisition unit, configured to acquire voice data in a video scene and extract voiceprint features of the voice data; a voiceprint matching unit, configured to match the voiceprint features of the voice data against pre-stored voiceprint features and determine the matched target voiceprint feature among the pre-stored voiceprint features; and a focusing unit, configured to identify a target object in the video scene that matches the target voiceprint feature and to focus on or track the face of the target object.
An embodiment of the present application further provides a data processing apparatus, including: a main interface unit, used for displaying a first page for establishing an association relationship between a voiceprint and a visual image, the first page displaying a voiceprint acquisition entry; a voiceprint acquisition unit, used for displaying a voiceprint acquisition page in response to a trigger instruction of the voiceprint acquisition entry; the voiceprint acquisition page is used for collecting the sound information of an object; the sound information is used for extracting the voiceprint features of the object; a visual image acquisition unit, used for displaying a visual image information acquisition page in response to an instruction indicating successful collection of the sound information of the object; the visual image information acquisition page is used for collecting visual image information of the object; the visual image information is used for extracting visual image features of the object; and a binding unit, used for establishing an association relationship between the voiceprint features and the visual image features.
An embodiment of the present application further provides an electronic device, including: a memory, and a processor; the memory is used for storing a computer program, and the computer program is executed by the processor to execute the method provided by the embodiment of the application.
The embodiment of the present application further provides a storage device, in which a computer program is stored, and the computer program is executed by the processor to perform the method provided in the embodiment of the present application.
Compared with the prior art, the application has the following advantages:
The video focusing method, apparatus and device acquire voice data in a video scene and extract voiceprint features of the voice data; match the voiceprint features of the voice data against pre-stored voiceprint features and determine the matched target voiceprint feature among the pre-stored voiceprint features; and identify a target object in the video scene that matches the target voiceprint feature and focus on or track the face of the target object. By identifying the voiceprint features of voice data in a video scene and focusing based on the target voiceprint feature that matches those features, the method realizes sound-following focusing and solves the problem that focus cannot follow the sound during video capture.
The data processing method, apparatus and device display a first page for establishing an association relationship between a voiceprint and a visual image, the first page displaying a voiceprint acquisition entry; in response to a trigger instruction of the voiceprint acquisition entry, display a voiceprint acquisition page; the voiceprint acquisition page is used for collecting the sound information of an object; the sound information is used for extracting the voiceprint features of the object; in response to an instruction indicating successful collection of the sound information of the object, display a visual image information acquisition page; the visual image information acquisition page is used for collecting visual image information of the object; the visual image information is used for extracting visual image features of the object; and establish an association relationship between the voiceprint features and the visual image features. The association relationship between the voiceprint features and the visual image features provides the data basis for realizing sound-following focusing, thereby solving the problem that focus cannot follow the sound during video capture.
Drawings
FIG. 1 is a schematic diagram of a deployment system environment for a method provided by an embodiment of the present application;
fig. 1A is a schematic view of an application scenario of a method provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a video focusing method according to a first embodiment of the present disclosure;
FIG. 3 is a flowchart of establishing a voiceprint facial association relationship according to a first embodiment of the present application;
fig. 4 is a flowchart of a video focusing method in a live scene according to a first embodiment of the present application;
FIG. 5 is a flowchart illustrating a data processing method according to a second embodiment of the present application;
FIG. 6 is a schematic diagram of an application interface for establishing a voiceprint facial association relationship according to a second embodiment of the present application;
FIG. 7 is a schematic view of a video focusing apparatus according to a third embodiment of the present application;
FIG. 8 is a diagram of a data processing apparatus according to a fourth embodiment of the present application;
fig. 9 is a schematic diagram of an electronic device provided in the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The embodiments of the application provide a video focusing method and apparatus, an electronic device and a storage device. The embodiments of the application also provide a data processing method and apparatus, an electronic device and a storage device. These are described one by one in the following embodiments.
For ease of understanding, an application scenario of the method provided by the embodiments of the application is first given. The method can be applied to, but is not limited to, video-related scenarios such as live streaming, video conferencing, video calls or video recording. The video application system involved in such a scenario, referring to fig. 1, generally comprises: an audio and video capture end 101, a server 102, and a playing end 103. The audio and video capture end is an electronic device with audio and video capture functions and is the source that generates the video stream; it can be, but is not limited to, an intelligent terminal such as a mobile phone or tablet. In a live-streaming scenario, the audio and video capture end is the electronic device used by the host and is the source that generates the video stream. The capture end collects audio and video data and may further apply watermarks, beautification, special-effect filters and other processing; the processed audio and video is encoded and compressed into a transmittable, watchable video stream and pushed to the server over the network. Specifically, audio and video data packets encapsulated in a certain format can be sent to the server using a streaming media protocol. The server receives the audio and video stream from the capture end and pushes it to all playing ends, and can also perform authentication, real-time transcoding, recording and storage, and so on. The playing end obtains the pull-stream address, pulls the audio and video stream encapsulated according to the streaming media protocol, parses the audio and video data, and decodes and plays it.
Referring again to fig. 1A, a video frame of a live-streaming scene is shown, in which the left, middle and right hosts are A, B and C respectively, and host A is speaking. The existing multi-host video frame 101a shows a situation frequently seen in existing live-streaming scenes: focusing is performed according to environmental parameters such as the distance and lighting of each object in the scene to be shot, so the lens focuses on the nearby host C and cannot automatically focus on host A, who is speaking. As a result, the image of host A in the video frame is blurred, the sound does not correspond to the portrait, the voice appears out of sync with the picture, and viewers experience a break in the viewing experience. In the multi-host video frame 102a, focusing follows the sound: when host A speaks, the voiceprint corresponding to the collected sound is matched against the recorded voiceprints; after a match, the facial object bound in advance to that voiceprint, such as the face image corresponding to host A, is retrieved, the facial object is focused on, and real-time face focus tracking is further performed, so that the video of host A is sharp and the sound matches the video.
It should be understood that the live-streaming scene given in fig. 1A is only an illustrative application scenario and does not limit the scenarios of the video focusing method provided by this embodiment. The video focusing method provided by this embodiment is not limited in its application scenario and may also be applied to video capture scenarios such as video conferences, video calls, and video recording and sharing.
The following describes a video focusing method according to a first embodiment of the present application with reference to fig. 2 to 4. The video focusing method shown in fig. 2 includes: step S201 to step S204.
Step S201, voice data in a video scene is obtained, and voiceprint features of the voice data are extracted.
In this step, voice data in a video scene is acquired. The voice data is collected by the audio and video capture device used to capture the video; this device can be, but is not limited to, the electronic device of a host in a live-streaming scene or a video conference terminal. Specifically, all sound signals in the video scene can be collected through a sound pickup of the audio/video capture device, such as a microphone, and the sound signals are preprocessed to extract the voice data. The preprocessing may be filtering of the sound signal. Specifically, acquiring the voice data in the video scene includes: collecting all sounds in the video scene and filtering the sounds to obtain the voice data. Further, noise in the sound signal may be filtered based on the frequency or intensity of the noise. Specifically, collecting all sounds in the video scene and filtering them to obtain the voice data includes: filtering noise in the sound according to a preset noise frequency range and extracting the voice data; or filtering noise in the sound according to a preset noise intensity threshold and extracting the voice data.
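As a rough illustration only (not the patent's own implementation), the Python sketch below shows the two filtering options named above: rejecting noise outside a preset speech frequency band, and muting stretches whose intensity falls below a preset threshold. The band limits, 20 ms window and threshold value are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_voice(sound: np.ndarray, sr: int,
                  speech_band=(300.0, 3400.0),
                  intensity_threshold=0.01) -> np.ndarray:
    """Keep only the preset speech band, then mute low-intensity stretches."""
    # Option 1: reject noise outside a preset frequency range with a band-pass filter.
    b, a = butter(4, speech_band, btype="bandpass", fs=sr)
    voice = filtfilt(b, a, sound.astype(float))

    # Option 2: zero out 20 ms stretches whose RMS energy falls below the preset
    # noise intensity threshold.
    frame = int(0.02 * sr)
    for start in range(0, len(voice) - frame + 1, frame):
        segment = voice[start:start + frame]
        if np.sqrt(np.mean(segment ** 2)) < intensity_threshold:
            voice[start:start + frame] = 0.0
    return voice
```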
In this embodiment, before the voiceprint features of the voice data are extracted, the voice data is also framed: the collected voice data is segmented according to a specified time interval or number of samples to obtain segments of a specific length, each segment being one frame of voice data. After framing, voiceprint feature parameters are extracted from each frame of voice data and used as the voiceprint features of the voice data. Because a speech signal is non-stationary, processing the data frame by frame both reflects the temporal characteristics of the signal and reduces the amount of computation compared with processing it point by point. Specifically, the voiceprint features of the framed voice data can be extracted as follows: acquiring the frequency value of the current voice frame and/or the frequency value of an adjacent voice frame in the voice data, and determining the voiceprint features of the voice data according to the frequency value of the current voice frame and/or the frequency value of the adjacent voice frame. Of course, other feature parameters that characterize voiceprints, such as MFCCs (Mel-frequency cepstral coefficients), may also be used.
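The framing and per-frame feature extraction described above could be sketched as follows; the 25 ms frame length is an assumption, and librosa is used here only as one convenient way to obtain MFCCs — the patent does not prescribe a particular library or feature set.

```python
import numpy as np
import librosa

def frame_signal(voice: np.ndarray, sr: int, frame_ms: float = 25.0) -> np.ndarray:
    """Cut the voice signal into fixed-length, non-overlapping frames."""
    frame_len = int(sr * frame_ms / 1000.0)
    n_frames = len(voice) // frame_len
    return voice[:n_frames * frame_len].reshape(n_frames, frame_len)

def voiceprint_features(voice: np.ndarray, sr: int) -> np.ndarray:
    frames = frame_signal(voice, sr)
    # Frequency value of each frame: the dominant frequency of its spectrum,
    # paired with the value of the preceding (adjacent) frame.
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    dominant = freqs[np.argmax(spectra, axis=1)]
    adjacent = np.roll(dominant, 1)
    # Alternative / additional parameters: MFCCs averaged over frames.
    mfcc = librosa.feature.mfcc(y=voice.astype(float), sr=sr, n_mfcc=13).mean(axis=1)
    return np.concatenate([[dominant.mean(), adjacent.mean()], mfcc])
```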
In this embodiment, extracting the voiceprint feature parameters of the voice data may be implemented by: collecting all sounds in the video scene; filtering the collected sound to obtain voice data; performing framing processing on the voice data; and identifying the voiceprint characteristic parameters of each frame of voice data as the voiceprint characteristics of the voice data. In the subsequent step, the matched voiceprint features can be determined according to the voiceprint features of each frame of voice data, for example, the voiceprint features matched with the voiceprint feature parameters in a voiceprint library in which the voiceprint features are prestored are determined.
Step S202, the voiceprint features of the voice data are matched with the prestored voiceprint features, and the matched target voiceprint features in the prestored voiceprint features are determined.
In this step, the target voiceprint feature matching the voice data is determined, so as to further determine the target object that uttered the voice.
In practice, Voiceprint Recognition (VPR) can be used to determine the voiceprint feature matching the voice data. Voiceprint features matching speech data can be identified in a number of ways, such as: the template matching method, the statistical probability model method, the artificial neural network method, the support vector machine method, and the sparse representation (SR) method. Artificial neural network methods include approaches such as the Time Delay Neural Network (TDNN) and the Decision Tree Neural Network (DTNN); sparse representation represents signal features as linear combinations of a few elementary atoms using dictionary learning. The specific voiceprint recognition method is not limited in this application.
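As a minimal illustration of step S202 under the template-matching option listed above, a sketch that scores the extracted voiceprint feature vector against each pre-stored voiceprint with cosine similarity might look as follows; the similarity threshold is an illustrative assumption.

```python
import numpy as np

def match_voiceprint(query: np.ndarray,
                     enrolled: dict[str, np.ndarray],
                     threshold: float = 0.85) -> str | None:
    """Return the id of the best-matching pre-stored voiceprint, or None."""
    best_id, best_score = None, threshold
    for voiceprint_id, stored in enrolled.items():
        score = float(np.dot(query, stored) /
                      (np.linalg.norm(query) * np.linalg.norm(stored) + 1e-9))
        if score > best_score:
            best_id, best_score = voiceprint_id, score
    return best_id
```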
Step S203, identifying a target object in the video scene matching the target voiceprint feature, and focusing or tracking the face of the target object.
In this embodiment, the object matching the target voiceprint feature may be understood as the object matching the voice data, that is, the object that uttered the sound containing the voice data; its face is focused on or tracked, so that sound-following focusing is realized and a video effect in which the sound is consistent with the image is achieved.
In this step, the identifying a target object in the video scene that matches the target voiceprint feature includes: acquiring a target visual image characteristic associated with the target voiceprint characteristic; and identifying an object matched with the target visual image characteristic in the video scene as the target object matched with the target voiceprint characteristic. The visual image feature stored in association with the target voiceprint feature is used as the visual image feature for identifying the target object which sends the voice data, and the target object which sends the voice containing the voice data is determined according to the visual image feature.
The method of this embodiment further comprises establishing, in advance, an association relationship between the voiceprint and the visual image of an object, where the object includes the target object. Specifically, this includes: collecting sound data of the target object, and extracting the voiceprint features of the target object from the sound data; collecting a visual image of the target object, and extracting visual image features of the visual image; and storing the voiceprint features in association with the visual image features; the voiceprint features of the target object are the pre-stored voiceprint features. Storing the voiceprint features in association with the visual image features comprises establishing an association relationship between the voiceprint features and the visual image features. Collecting the sound data of the target object and extracting the voiceprint features of the target object from the sound data includes: collecting sounds of different intensities made by the target object in various scenes and filtering the collected sounds; and learning from the filtered sounds with a neural network to obtain the voiceprint features of the target object. In practical applications, the face of an object, such as a face image, can be used as the visual image of the object. Establishing the association between the voiceprint features and the visual image features of an object in advance binds the voiceprint features and the facial features of the object, so that in this step the voiceprint features of the voice data in the video scene are extracted, the matched target voiceprint feature is identified, the facial feature of the object is determined from that target voiceprint feature, and the image focus position (or focus-following position) of the video scene is adjusted to the face matching that facial feature, realizing sound-following focusing.
Referring to fig. 3, a flow chart for establishing a voiceprint-face association relationship is shown, which includes: S301, create a new voiceprint ID. S302, record sound according to the prompt. S303, is the recorded sound compliant? If so, execute S304; if not, return and continue recording sound. S304, enter the face according to the prompt. S305, is the entered face compliant? If so, execute S306; if not, return and continue entering the face image. S306, bind and store the voiceprint and the facial features. S301 to S306 are repeated for every new voiceprint ID and/or face ID.
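The S301-S306 enrolment flow could be sketched as below. The helpers record_sound, capture_face, extract_voiceprint and extract_face_feature are hypothetical stand-ins for the device's capture and feature-extraction pipeline (a non-compliant capture is represented by returning None); this is only one possible arrangement, not the patent's prescribed implementation.

```python
from uuid import uuid4

voiceprint_store: dict[str, dict] = {}  # voiceprint id -> bound voiceprint / face features

def enrol(record_sound, capture_face, extract_voiceprint, extract_face_feature,
          max_retries: int = 3):
    voiceprint_id = str(uuid4())                 # S301: create a new voiceprint ID
    for _ in range(max_retries):
        sound = record_sound()                   # S302: record sound per the prompt
        if sound is not None:                    # S303: compliance check
            break
    else:
        return None
    for _ in range(max_retries):
        face = capture_face()                    # S304: enter the face per the prompt
        if face is not None:                     # S305: compliance check
            break
    else:
        return None
    voiceprint_store[voiceprint_id] = {          # S306: bind and store the features
        "voiceprint": extract_voiceprint(sound),
        "face": extract_face_feature(face),
    }
    return voiceprint_id
```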
Preferably, the visual image is the face of the subject, i.e., a human face. Accordingly, the visual image feature is a facial image feature, and the target visual image feature is a facial image feature of the target object; the target visual image feature associated with the target voiceprint feature is the pre-stored facial image feature of the target object. In this step, the target object among the objects in the video scene needs to be identified according to the facial image feature of the target object. In practical applications, face recognition technology may be used to recognize the face image of the target object. Face recognition is a biometric technology that identifies a person based on facial feature information; it compares the facial features to be recognized with a stored facial feature template and judges the identity according to the degree of similarity. Specifically, in this embodiment, identifying the target object in the video scene that matches the target visual image feature includes: extracting facial images of the objects in the video scene; and calculating the similarity between each facial image and the target visual image feature, where a facial image whose similarity is greater than a similarity threshold is the facial image of the target object. In this embodiment, the face image of the target object is focused on or tracked. The focusing or focus following can be optical focusing or focus following that controls the position and posture of the lens, or digital focusing or focus following. In one embodiment, focusing on or tracking the face of the target object includes: determining the sharpness of the facial image of the target object in the video scene; and if the sharpness is lower than a sharpness threshold, adjusting the focus position of the image in the video scene and/or the posture of the camera until the sharpness is not lower than the sharpness threshold.
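A rough sketch of the two operations just described, matching detected faces by similarity and measuring the sharpness of the matched face region, is given below. detect_faces and face_embedding are hypothetical helpers for the face-recognition pipeline, the thresholds are illustrative, and the variance of the Laplacian is used only as one common sharpness proxy (the patent does not prescribe a metric).

```python
import cv2
import numpy as np

def find_target_face(frame, target_feature, detect_faces, face_embedding,
                     similarity_threshold: float = 0.8):
    """Return the bounding box of the face matching the target feature, if any."""
    for (x, y, w, h) in detect_faces(frame):
        embedding = face_embedding(frame[y:y + h, x:x + w])
        similarity = float(np.dot(embedding, target_feature) /
                           (np.linalg.norm(embedding) * np.linalg.norm(target_feature) + 1e-9))
        if similarity > similarity_threshold:
            return (x, y, w, h)
    return None

def face_sharpness(frame, box) -> float:
    """Variance of the Laplacian of the face region (higher means sharper)."""
    x, y, w, h = box
    gray = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())
```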
In one embodiment, determining a focus position based on the facial image of the target object and focusing or following focus according to that position includes: determining the focus position based on the facial image of the target object; marking the focus position, and moving the focus position as the spatial position of the facial image of the target object changes, so as to track the face of the target object. This can include: controlling the camera to aim at the position of the target object, and adjusting the focus position to the facial image of the target object to realize focus following.
Further, the method also includes: capturing position information and/or size information of the facial image of the target object in the video scene, and focusing on and/or following the face of the target object in real time according to the position information and/or the size information.
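A minimal focus-following loop along these lines might look as follows; locate_face (for example, the find_target_face sketch above) and set_focus_point are hypothetical callbacks, since real devices expose their own focus or pan-tilt-zoom control APIs.

```python
def follow_focus(frames, locate_face, set_focus_point):
    """Per frame, move the focus point to the target face and pass its size as a distance hint."""
    last_box = None
    for frame in frames:
        box = locate_face(frame)        # e.g. the find_target_face sketch above
        if box is None:
            box = last_box              # keep the previous focus while the face is briefly lost
        if box is not None:
            x, y, w, h = box
            set_focus_point(cx=x + w // 2, cy=y + h // 2, size=max(w, h))
            last_box = box
```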
Referring to fig. 4, a flow chart of the video focusing method in a live-streaming scene is shown, which includes: S401, detecting that a host makes a sound. S402, matching against the recorded voiceprints. S403, focusing on the face bound to the matched voiceprint ID. S404, performing real-time face focus tracking on that face. S401 to S404 are repeated every time a sound is detected.
In the embodiment of the application, the server may be a streaming media server, and is configured to receive a video stream pushed by an audio and video acquisition device; wherein the video frames of the video stream comprise video frames focused based on a face image of a target object, and the audio data synchronized with the video frames comprise target voice data, and voiceprint features corresponding to the target voice data are matched with visual image features of the target object; and distributing the video stream to the audio and video playing equipment which requests the video stream.
In the embodiment of the application, the playing end is used for requesting the server end to acquire the video stream; wherein the video frames of the video stream comprise video frames focused based on a face image of a target object, and the audio data synchronized with the video frames comprise target voice data, and voiceprint features corresponding to the target voice data are matched with visual image features of the target object; and playing the video stream, wherein the video frame focused by the face image based on the target object is correspondingly displayed when the target voice data is played in the playing of the video stream.
It should be noted that, where there is no conflict, the features given in this embodiment and in other embodiments of the present application may be combined with each other, and step numbers such as S201 and S202 do not require the steps to be executed in that order.
The video focusing method provided by this embodiment has now been described. The method acquires voice data in a video scene and extracts voiceprint features of the voice data; matches the voiceprint features of the voice data against pre-stored voiceprint features and determines the matched target voiceprint feature among the pre-stored voiceprint features; and identifies a target object in the video scene that matches the target voiceprint feature and focuses on or tracks the face of the target object. By identifying the voiceprint features of the voice data in the video scene and focusing based on the target voiceprint feature matching those features, the method realizes sound-following focusing and solves the problem that focus cannot follow the sound during video capture.
Based on the above embodiments, a second embodiment of the present application further provides a data processing method, which is described below with reference to fig. 5 and 6. Referring to fig. 5, the data processing method shown in the figure includes: step S501 to step S504.
Step S501, displaying a first page for establishing the association relation between the voiceprint and the visual image, wherein the first page is displayed with a voiceprint acquisition entrance.
The method provided by this embodiment can be used to establish the association relationship between a voiceprint and a visual image: collecting voice data of a target object and extracting voiceprint features of the voice data; collecting a visual image of the object and extracting visual image features of the visual image; and establishing an association relationship between the voiceprint features and the visual image features. Focusing based on the visual image features associated with the voiceprint features provides the data basis for realizing sound-following focusing and solves the problem that focus cannot follow the sound during video capture. The visual image is preferably a facial image of the object, and the visual image feature is a facial image feature of the object. Specifically, the method provides an interactive interface through an electronic device with audio and video capture functions, to guide the user to enter sound information and the visual image information to be bound to it, extract the voiceprint features of the sound information and the visual image features of the visual image information, and thereby store the voiceprint features in association with the visual image features, binding the two.
The first page in this step provides an interface that includes a voiceprint information acquisition portal. The voiceprint information acquisition portal may be an interface element in the interface that can receive user input information and/or user triggers.
Step S502, responding to a trigger instruction of the voiceprint acquisition entrance, and displaying a voiceprint acquisition page; the voiceprint acquisition page is used for acquiring the sound information of the object; the sound information is used to extract voiceprint features of the object.
The voiceprint acquisition page displayed in the step provides a voiceprint acquisition interface, and the voiceprint acquisition page receives user behavior information and acquires sound information of the target object. The user behavior information includes but is not limited to behavior information such as inputting a voiceprint name, triggering voiceprint entry, and the like. The method specifically comprises the following steps: receiving input voiceprint identification information on the voiceprint acquisition page; and/or displaying a voiceprint recording control for starting a sound collecting function on the voiceprint collecting page, displaying a secondary voiceprint collecting page after the voiceprint recording control is triggered, and guiding the object to record sound information; the sound information is voice data. The voiceprint identification information may be a voiceprint name input by a user, or may be information such as a voiceprint name or a number generated by a system by default. The voiceprint entry control can be an interface element which is displayed with characters and/or graphic images, for example, a button which is displayed with a function of prompting a user to enter voice file information, or a button which is displayed with a microphone or a sound pickup graph.
In this embodiment, a management interface for managing the entered sound information and/or voiceprint features is also provided. Specifically, the first page further shows a voiceprint management entry; the method further comprises: in response to the voiceprint management entry being triggered, displaying a voiceprint management page for managing voiceprints; the voiceprint management page can specifically display the entered voiceprint information and/or the visual image, such as a facial image, associated with the entered voiceprint information; receiving user behavior information on the voiceprint management page, and, according to the user behavior information, performing the following operations on the entered voiceprint information and/or the facial image associated with it: deleting, updating, or adding.
In this embodiment, an interface for the voiceprint focusing display function is further provided, which may display and/or test the focusing and/or focusing effect based on the recorded voiceprint features and the visual image features associated with the voiceprint features, so as to facilitate re-recording and/or managing the recorded voiceprint features and the visual image features associated with the voiceprint features. In practical application, that is, a test function or a test tool for the sound following and focusing function is provided. Specifically, the first page further shows an acoustic focusing control, and the acoustic focusing control is used for starting an audio and video acquisition function after being triggered and showing and/or testing the effect of focusing the facial image of the target object in real time according to the voice data in the video scene.
Step S503, responding to the successful acquisition instruction of the sound information of the object, and displaying a visual image information acquisition page; the visual image information acquisition page is used for acquiring visual image information of the object; the visual image information is used for extracting visual image characteristics of the object.
After collecting the sound information, the step continues to guide the user to collect the visual image information which needs to establish the association relation with the sound information. Preferably, the visual representation is a face of the subject. Specifically, the visual image information acquisition page guides a user to lock the face and acquire visual image information.
Of course, in the process of establishing the association relationship between the voiceprint features and the visual image features, the sound information can be entered first and then the visual image information to be associated with it, or the visual image information can be entered first and then the sound information to be associated with it. Accordingly, in one embodiment, the first page is displayed with a visual image information acquisition entry; in response to a trigger instruction of the visual image information acquisition entry, a visual image information acquisition page for collecting the visual image information of the object is displayed; the visual image information is used for extracting the visual image features of the object; in response to an instruction indicating successful collection of the visual image information of the object, the voiceprint acquisition page is displayed; and the sound information of the object is collected, and the association relationship between the voiceprint features of the sound information and the visual image features is established.
Step S504, the incidence relation between the voiceprint characteristic and the visual image characteristic is established.
In this step, the extracted voiceprint features and visual image features are stored in association. For example, voiceprint features are extracted from the collected sound information and labelled with a voiceprint feature identifier; the visual image information, such as a facial image, collected together with that sound information is bound to the voiceprint feature identifier, thereby establishing the association relationship. The visual image features can likewise be labelled with a visual image feature identifier, the visual image features being those of the visual image information collected together with the voiceprint information, and a correspondence between the voiceprint feature identifier and the visual image feature identifier is established, thereby establishing the association relationship.
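One way to hold such an identifier-to-identifier binding, together with the delete / update / add operations mentioned for the voiceprint management page, is sketched below; the in-memory dictionary is an illustrative assumption, and a real system would persist the mapping.

```python
class VoiceprintFaceRegistry:
    """Maps a voiceprint feature identifier to its bound visual-image (facial) feature identifier."""

    def __init__(self):
        self._bindings: dict[str, str] = {}

    def add(self, voiceprint_id: str, face_feature_id: str) -> None:
        self._bindings[voiceprint_id] = face_feature_id

    def update(self, voiceprint_id: str, face_feature_id: str) -> None:
        self._bindings[voiceprint_id] = face_feature_id

    def delete(self, voiceprint_id: str) -> None:
        self._bindings.pop(voiceprint_id, None)

    def lookup(self, voiceprint_id: str):
        return self._bindings.get(voiceprint_id)
```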
Referring to fig. 6, a schematic diagram of an application interface for establishing a voiceprint facial association relationship is shown, which includes: the function panel 601 comprises an entrance 601-1 of a voice following focusing function, and the voice following focusing function is a function of guiding a user to establish an association relation between a voiceprint feature and a visual image feature through an interactive interface. And the incidence relation is used for focusing according to the recognized voiceprint, so that audio and video synchronous with the sound image are acquired. A main audio tracking and focusing interface 602, which is equivalent to the first page; the sound tracing focusing main interface comprises newly added voiceprint id entries 602-1 and/or 602-2, namely the voiceprint information acquisition entry. And displaying a newly added voiceprint id page 603 after the voiceprint information acquisition inlet is triggered, wherein the newly added voiceprint id page is the voiceprint acquisition page. The voiceprint acquisition page can receive the voiceprint identification, such as name information, of the voiceprint acquired this time, which is input by a user. And after the voiceprint acquisition page detects that the user confirms the triggering of starting to enter the voiceprint, if the user triggers a confirmation key or a sound pickup identifier, displaying an entry voiceprint page 604. After the voiceprint page is entered to confirm that the voiceprint entry is successful, the bound face page 606 is triggered and started as shown in a voiceprint entry success interface 605. The binding face page is equivalent to the visual image information acquisition page, and can acquire a face image for being used as visual image information for extracting visual image characteristics. After the face binding confirmation of the user is successfully detected, a face binding success interface 607 is displayed to show that the corresponding relation between a group of voiceprint features and visual image features is successfully established. The focus-following main interface 602 also shows a voiceprint id entry 602-3, i.e. the voiceprint management entry. The main audio tracking and focusing interface 602 in the figure further shows a function entry 602-4 for performing real-time face focusing and tracking on voiceprint id, that is, the function entry for displaying audio tracking and focusing.
It is to be understood that the interfaces in the figures are schematic, and the information on the shape, size, layout, etc. of the various graphical elements is not to be taken as a limitation to the implementation of the described method.
The method provided by this embodiment has now been described. The method displays a first page for establishing an association relationship between a voiceprint and a visual image, the first page displaying a voiceprint acquisition entry; in response to a trigger instruction of the voiceprint acquisition entry, displays a voiceprint acquisition page; the voiceprint acquisition page is used for collecting the sound information of an object; the sound information is used for extracting the voiceprint features of the object; in response to an instruction indicating successful collection of the sound information of the object, displays a visual image information acquisition page; the visual image information acquisition page is used for collecting the visual image information of the object; the visual image information is used for extracting the visual image features of the object; and establishes an association relationship between the voiceprint features and the visual image features. The association relationship between the voiceprint features and the visual image features provides the data basis for realizing sound-following focusing and is used to solve the problem that focus cannot follow the sound during video capture.
A third embodiment of the present application provides a video focusing apparatus corresponding to the first embodiment. The device is described below with reference to fig. 7. The video focusing apparatus shown in fig. 7 includes:
a voiceprint acquisition unit 701, configured to acquire voice data in a video scene and extract a voiceprint feature of the voice data;
a voiceprint matching unit 702, configured to match a voiceprint feature of the voice data with a voiceprint feature stored in advance, and determine a target voiceprint feature matched in the voiceprint feature stored in advance;
a focusing unit 703, configured to identify a target object in the video scene that matches the target voiceprint feature, and to focus on or track the face of the target object.
Optionally, the voiceprint acquisition unit 701 is specifically configured to: and collecting all sounds in the video scene, and filtering the sounds to obtain the voice data.
Optionally, the voiceprint acquisition unit 701 is specifically configured to: filtering noise in the sound according to a preset noise frequency range, and extracting the voice data; or filtering the noise in the sound according to a preset noise intensity threshold value, and extracting the voice data.
Optionally, the voiceprint acquisition unit 701 is specifically configured to: and acquiring the frequency value of the current voice frame and/or the frequency value of an adjacent voice frame in the voice data, and determining the voiceprint characteristics of the voice data according to the frequency value of the current voice frame and/or the frequency value of the adjacent voice frame.
Optionally, the apparatus further comprises a voiceprint and visual image association unit, configured to: collect sound data of the target object, and extract voiceprint features of the target object from the sound data; collect a visual image of the target object, and extract visual image features of the visual image; and store the voiceprint features in association with the visual image features; the voiceprint features of the target object are pre-stored voiceprint features.
Optionally, the voiceprint and visual image association unit is specifically configured to: collecting sounds with different sound intensities of the target object in various scenes, and filtering the collected sounds with different sound intensities; and learning the filtered sound based on a neural network to obtain the voiceprint characteristics of the target object.
Optionally, the focusing unit is specifically configured to: acquiring a target visual image characteristic associated with the target voiceprint characteristic; and identifying an object matched with the target visual image characteristic in the video scene as the target object matched with the target voiceprint characteristic.
Optionally, the target visual image feature is a facial image feature of the target object; the focusing unit 703 is specifically configured to: extracting a facial image of an object in the video scene; and calculating the similarity between the facial image and the target visual image characteristic, and if the similarity is greater than a similarity threshold value, the facial image is the facial image of the target object.
Optionally, the focusing unit 703 is specifically configured to: determining a sharpness of a facial image of a target object in the video scene; and if the definition is lower than a definition threshold, adjusting the focusing position of the image in the video scene until the definition is not lower than the definition threshold.
Optionally, the focusing unit 703 is specifically configured to: determining a focus position based on a facial image of the target object; marking the focus position, and moving the focus position according to the spatial position change of the face image of the target object to track the face of the target object.
Optionally, the focusing unit 703 is specifically configured to: capturing position information and/or size information of a facial image of the target object in the video scene; focusing and/or focusing the face of the target object in real time according to the position information and/or the size information.
A fourth embodiment of the present application provides a data processing apparatus corresponding to the second embodiment. The device is described below with reference to fig. 8. The data processing apparatus shown in fig. 8 includes:
the main interface unit 801 is used for displaying a first page for establishing an association relationship between a voiceprint and a visual image, the first page displaying a voiceprint acquisition entry;
a voiceprint acquisition unit 802, configured to respond to a trigger instruction of the voiceprint acquisition entry and display a voiceprint acquisition page; the voiceprint acquisition page is used for acquiring the sound information of the object; the sound information of the object is used for extracting the voiceprint characteristics of the object;
a visual image collecting unit 803, configured to display a visual image information collecting page in response to a successful collecting instruction of the sound information of the object; the visual image information acquisition page is used for acquiring visual image information of the object; the visual image information is used for extracting visual image characteristics of the object;
a binding unit 804, configured to establish an association relationship between the voiceprint feature and the visual image feature.
Optionally, the voiceprint acquisition unit 802 is specifically configured to: receiving input voiceprint identification information on the voiceprint acquisition page; and/or displaying a voiceprint recording control for starting a sound collecting function on the voiceprint collecting page, displaying a secondary voiceprint collecting page after the voiceprint recording control is triggered, and guiding the object to record sound information; the sound information is voice data.
Optionally, the first page further displays a voiceprint management entry;
the apparatus further comprises a voiceprint management unit configured to: respond to triggering of the voiceprint management entry, and display a voiceprint management page for managing voiceprints; and receive user behavior information on the voiceprint management page and, according to the user behavior information, perform one of the following operations on the entered sound information and/or the visual image information associated with the entered sound information: deletion, updating, or addition.
Optionally, the main interface unit is specifically configured to display a sound-tracking focusing control on the first page; after being triggered, the sound-tracking focusing control starts an audio and video acquisition function, and displays and/or tests the effect of focusing on the facial image of the target object in real time according to the voice data in the video scene.
Optionally, the first page displays a visual image information acquisition entry;
the visual image acquisition unit is specifically configured to: respond to a trigger instruction of the visual image information acquisition entry, and display a visual image information acquisition page for acquiring visual image information of an object, the visual image information being used for extracting the visual image feature of the object; respond to an instruction indicating successful acquisition of the visual image information of the object, and display the voiceprint acquisition page; and collect the sound information of the object, and establish the association relationship between the voiceprint feature of the sound information and the visual image feature.
Based on the above embodiments, a fifth embodiment of the present application provides an electronic device; for related parts, please refer to the corresponding description of the above embodiments. Referring to fig. 9, the electronic device shown in the figure includes a memory 901 and a processor 902; the memory stores a computer program which, when executed by the processor, performs the method provided by the embodiments of the present application.
Based on the foregoing embodiments, a sixth embodiment of the present application provides a storage device; for related parts, please refer to the corresponding description of the foregoing embodiments. The schematic diagram of the storage device is similar to fig. 9. The storage device stores a computer program which, when executed by a processor, performs the method provided by the embodiments of the present application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Although the present application has been described with reference to preferred embodiments, they are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application should be determined by the appended claims.

Claims (10)

1. A method of video focusing, comprising:
acquiring voice data in a video scene, and extracting voiceprint features of the voice data;
matching the voiceprint features of the voice data against pre-stored voiceprint features, and determining a matched target voiceprint feature among the pre-stored voiceprint features;
and identifying a target object in the video scene that matches the target voiceprint feature, and focusing on or focus-tracking the face of the target object.
2. The method of claim 1, wherein the acquiring voice data in a video scene comprises:
collecting all sounds in the video scene, and filtering the sounds to obtain the voice data.
3. The method of claim 2, wherein the collecting all sounds in the video scene and filtering the sounds to obtain the voice data comprises:
filtering noise in the sounds according to a preset noise frequency range, and extracting the voice data; or,
filtering noise in the sounds according to a preset noise intensity threshold, and extracting the voice data.
4. The method of claim 1, wherein the extracting the voiceprint features of the voice data comprises:
acquiring a frequency value of a current voice frame and/or a frequency value of an adjacent voice frame in the voice data, and determining the voiceprint features of the voice data according to the frequency value of the current voice frame and/or the frequency value of the adjacent voice frame.
5. The method of claim 1, further comprising:
collecting sound data of the target object, and extracting voiceprint features of the target object according to the sound data;
collecting a visual image of the target object, and extracting visual image features of the visual image;
and storing the voiceprint features and the visual image features in association with each other; the voiceprint features of the target object are the pre-stored voiceprint features.
6. The method of claim 5, wherein the collecting sound data of the target object and the extracting voiceprint features of the target object according to the sound data comprise:
collecting sounds of the target object at different sound intensities and in various scenes, and filtering the collected sounds;
and learning from the filtered sounds based on a neural network to obtain the voiceprint features of the target object.
7. The method of claim 1, wherein the identifying a target object in the video scene that matches the target voiceprint feature comprises:
acquiring a target visual image feature associated with the target voiceprint feature;
and identifying an object in the video scene that matches the target visual image feature as the target object matching the target voiceprint feature.
8. The method of claim 7, wherein the target visual image feature is a facial image feature of the target object;
the identifying of the target object in the video scene that matches the target visual image feature comprises:
extracting a facial image of an object in the video scene;
and calculating a similarity between the facial image and the target visual image feature; if the similarity is greater than a similarity threshold, the facial image is the facial image of the target object.
9. The method of claim 1, wherein the focusing on or focus-tracking the face of the target object comprises:
determining a sharpness of the facial image of the target object in the video scene;
and if the sharpness is lower than a sharpness threshold, adjusting the focusing position of the image in the video scene until the sharpness is not lower than the sharpness threshold.
10. A data processing method, comprising:
displaying a first page for establishing an association relationship between a voiceprint and a visual image, wherein the first page displays a voiceprint acquisition entry;
responding to a trigger instruction of the voiceprint acquisition entry, and displaying a voiceprint acquisition page; the voiceprint acquisition page is used for acquiring sound information of an object; the sound information is used for extracting a voiceprint feature of the object;
responding to an instruction indicating successful acquisition of the sound information of the object, and displaying a visual image information acquisition page; the visual image information acquisition page is used for acquiring visual image information of the object; the visual image information is used for extracting a visual image feature of the object;
and establishing an association relationship between the voiceprint feature and the visual image feature.
CN202110786821.2A 2021-07-12 2021-07-12 Video focusing method and device Pending CN113542604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110786821.2A CN113542604A (en) 2021-07-12 2021-07-12 Video focusing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110786821.2A CN113542604A (en) 2021-07-12 2021-07-12 Video focusing method and device

Publications (1)

Publication Number Publication Date
CN113542604A true CN113542604A (en) 2021-10-22

Family

ID=78127627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110786821.2A Pending CN113542604A (en) 2021-07-12 2021-07-12 Video focusing method and device

Country Status (1)

Country Link
CN (1) CN113542604A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024125793A1 (en) * 2022-12-15 2024-06-20 Telefonaktiebolaget Lm Ericsson (Publ) Focusing a camera capturing video data using directional data of audio

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110153362A1 (en) * 2009-12-17 2011-06-23 Valin David A Method and mechanism for identifying protecting, requesting, assisting and managing information
CN105355206A (en) * 2015-09-24 2016-02-24 深圳市车音网科技有限公司 Voiceprint feature extraction method and electronic equipment
CN107333090A (en) * 2016-04-29 2017-11-07 中国电信股份有限公司 Videoconference data processing method and platform
CN108496350A (en) * 2017-09-27 2018-09-04 深圳市大疆创新科技有限公司 A kind of focusing process method and apparatus
CN110505399A (en) * 2019-08-13 2019-11-26 聚好看科技股份有限公司 Control method, device and the acquisition terminal of Image Acquisition
CN112073639A (en) * 2020-09-11 2020-12-11 Oppo(重庆)智能科技有限公司 Shooting control method and device, computer readable medium and electronic equipment
CN112562692A (en) * 2020-10-23 2021-03-26 安徽孺牛科技有限公司 Information conversion method and device capable of realizing voice recognition


Similar Documents

Publication Publication Date Title
RU2743732C2 (en) Method and device for processing video and audio signals and a program
CN110139062B (en) Video conference record creating method and device and terminal equipment
CN110853646B (en) Conference speaking role distinguishing method, device, equipment and readable storage medium
CN112653902B (en) Speaker recognition method and device and electronic equipment
CN107333090B (en) Video conference data processing method and platform
JP2009510877A (en) Face annotation in streaming video using face detection
CN111479124A (en) Real-time playing method and device
US20220335949A1 (en) Conference Data Processing Method and Related Device
EP2503545A1 (en) Arrangement and method relating to audio recognition
CN113542604A (en) Video focusing method and device
CN114513622A (en) Speaker detection method, speaker detection apparatus, storage medium, and program product
CN116708055B (en) Intelligent multimedia audiovisual image processing method, system and storage medium
CN113627387A (en) Parallel identity authentication method, device, equipment and medium based on face recognition
CN113365109A (en) Method and device for generating video subtitles, electronic equipment and storage medium
WO2023231712A1 (en) Digital human driving method, digital human driving device and storage medium
CN113259734B (en) Intelligent broadcasting guide method, device, terminal and storage medium for interactive scene
CN104780341B (en) A kind of information processing method and information processing unit
CN113611308B (en) Voice recognition method, device, system, server and storage medium
CN112887659B (en) Conference recording method, device, equipment and storage medium
CN114546939A (en) Conference summary generation method and device, electronic equipment and readable storage medium
CN113643708A (en) Conference voice print recognition method and device, electronic equipment and storage medium
CN112584225A (en) Video recording processing method, video playing control method and electronic equipment
CN118075418B (en) Video conference content output optimization method, device, equipment and storage medium thereof
CN117854507A (en) Speech recognition method, device, electronic equipment and storage medium
CN113840152A (en) Live broadcast key point processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20211022)