CN111629164B - Video recording generation method and electronic equipment - Google Patents

Video recording generation method and electronic equipment

Info

Publication number
CN111629164B
CN111629164B (application number CN202010477437.XA)
Authority
CN
China
Prior art keywords
voice information
sound
audio
audio data
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010477437.XA
Other languages
Chinese (zh)
Other versions
CN111629164A (en)
Inventor
刘宝利
罗应文
许威
张学荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202010477437.XA
Publication of CN111629164A
Application granted
Publication of CN111629164B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/76 Television signal recording
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering

Abstract

The embodiments of the present application disclose a video recording generation method and an electronic device. The method includes: acquiring first audio-video data, wherein the first audio-video data includes first image data and first audio data, and the first audio data includes voice information of at least one object; obtaining a first sound feature of a first object of the at least one object; and acquiring voice information associated with the first sound feature from the first audio data, and generating second audio-video data based on the first image data and the voice information associated with the first sound feature. With this video recording generation method, noise unrelated to the first sound feature can be removed or silenced, so that the recorded video does not contain the noise, which significantly improves the user experience.

Description

Video recording generation method and electronic equipment
Technical Field
The present application relates to the field of electronic devices, and in particular, to a video recording generation method and an electronic device.
Background
With the rapid development of the mobile internet, short videos have gradually become a common way of communicating. When a user records a short video alone, the short object distance generally means that only the head or upper body can be captured without an auxiliary tool; to record a full-body video or a video in a large scene, the user usually needs another person to do the recording. The process is roughly as follows: the recorded user gets ready in a specific scene, the recording user holds the image acquisition device, opens it to start recording and at the same time calls out a start command, and the recorded user then begins speaking. Taking English learning as an example, recording a study video is a common scenario: in the traditional approach, a parent or a student holds a mobile phone, opens the camera to start recording while shouting a start command, and the student starts reading or reciting after hearing the command.
However, this approach has a problem: it is difficult for the recording user to press the camera start button and speak the start command at exactly the same moment, so the start command is spoken slightly later and is captured by the camera, and the recorded video therefore contains the start command. In addition, if other people speak in the surrounding scene during recording, their voices are also captured and become noise in the video. The noise produced by the start command or by other people's speech interferes with the recorded user's speech.
Summary of the application
In view of the above problems in the prior art, the embodiments of the present application adopt the following technical solutions:
an aspect of the present application provides a video recording generation method, including:
acquiring first audio-video data, wherein the first audio-video data comprises first image data and first audio data, and the first audio data comprises voice information of at least one object;
obtaining a first sound feature of a first object of the at least one object;
and acquiring voice information associated with the first sound characteristic from the first audio data, and generating second video and audio data based on the first image data and the voice information associated with the first sound characteristic.
In some embodiments, said obtaining a first sound characteristic of a first object of said at least one object comprises:
a first sound characteristic of the first object is determined from a library of preset sound characteristics.
In some embodiments, the determining the first sound characteristic of the first object from a preset sound characteristic library comprises:
acquiring a face image in the first image data;
determining the first object based on the facial image, determining a first sound feature of the first object from the preset sound feature library.
In some embodiments, the obtaining voice information associated with the first sound feature from the first audio data, and generating second audio-visual data based on the first image data and the voice information associated with the first sound feature comprises:
extracting first voice information from the first audio data based on the first sound feature;
generating the second audio-visual data based only on the first image data and the first voice information.
In some embodiments, the obtaining voice information associated with the first sound feature from the first audio data, and generating second audio-visual data based on the first image data and the voice information associated with the first sound feature comprises:
extracting first voice information from the first audio data based on the first sound feature;
extracting at least one piece of second voice information from the first audio data based on at least one second sound feature obtained from the first audio data, wherein the second sound feature is different from the first sound feature;
and determining second voice information having a semantic relation with the first voice information from the at least one second voice information, and generating second audio-video data based on the first image data, the first voice information and the second voice information having a semantic relation with the first voice information.
In some embodiments, the method further comprises:
identifying sound features of respective ones of the objects in the first audio data;
acquiring voice information of each object from the first audio data based on the sound characteristics of each object;
and determining first voice information of the first object from the voice information of each object based on a preset condition.
In some embodiments, the method further comprises:
and determining a first sound characteristic of the first object from sound characteristics of the objects based on the first voice information, and saving the first sound characteristic to a preset sound characteristic library.
Another aspect of the embodiments of the present application provides an electronic device, including:
a first obtaining module, configured to acquire first audio-video data, wherein the first audio-video data includes first image data and first audio data, and the first audio data includes voice information of at least one object;
a second obtaining module, configured to obtain a first sound feature of a first object of the at least one object;
and the generating module is used for acquiring the voice information associated with the first sound characteristic from the first audio data and generating second audio and video data based on the first image data and the voice information associated with the first sound characteristic.
In some embodiments, the second obtaining module is specifically configured to:
a first sound characteristic of the first object is determined from a library of preset sound characteristics.
In some embodiments, the second obtaining module is further configured to:
acquiring a face image in the first image data;
determining the first object based on the facial image, determining a first sound feature of the first object from the preset sound feature library.
In some embodiments, the generating module is specifically configured to:
extracting first voice information from the first audio data based on the first sound feature;
generating the second audio-visual data based only on the first image data and the first voice information.
In some embodiments, the generating module is specifically configured to:
extracting first voice information from the first audio data based on the first sound feature;
extracting at least one piece of second voice information from the first audio data based on at least one second sound feature obtained from the first audio data, wherein the second sound feature is different from the first sound feature;
and determining second voice information having a semantic relation with the first voice information from the at least one second voice information, and generating second audio-video data based on the first image data, the first voice information and the second voice information having a semantic relation with the first voice information.
In some embodiments, the electronic device further comprises:
the identification module is used for identifying the sound characteristics of each object in the first audio data;
a third obtaining module, configured to obtain, from the first audio data, voice information of each object based on a sound feature of each object;
and the determining module is used for determining the first voice information of the first object from the voice information of each object based on preset conditions.
In some embodiments, the electronic device further comprises:
and the storage module is used for determining the first sound characteristic of the first object from the sound characteristics of the objects based on the first voice information and storing the first sound characteristic to a preset sound characteristic library.
A third aspect of embodiments of the present application provides a storage medium storing a computer program, which when executed implements the following steps:
acquiring first audio-video data, wherein the first audio-video data comprises first image data and first audio data, and the first audio data comprises voice information of at least one object;
obtaining a first sound feature of a first object of the at least one object;
and acquiring voice information associated with the first sound characteristic from the first audio data, and generating second video and audio data based on the first image data and the voice information associated with the first sound characteristic.
A fourth aspect of the embodiments of the present application provides an electronic device, which at least includes a memory and a processor, where the memory stores an executable program, and the processor implements the following steps when executing the executable program on the memory:
acquiring first audio-video data, wherein the first audio-video data comprises first image data and first audio data, and the first audio data comprises voice information of at least one object;
obtaining a first sound feature of a first object of the at least one object;
and acquiring voice information associated with the first sound characteristic from the first audio data, and generating second video and audio data based on the first image data and the voice information associated with the first sound characteristic.
According to the video recording generation method, a first sound feature of a first object among at least one object is obtained, voice information associated with the first sound feature is then acquired from the first audio data based on the first sound feature, and second audio-video data is generated based on the first image data and the voice information associated with the first sound feature. Noise unrelated to the first sound feature can thus be removed or silenced, the recorded video does not contain that noise, and the user experience is significantly improved.
Drawings
Fig. 1 is a flowchart of a video recording generation method according to an embodiment of the present application;
fig. 2 is a flowchart of step S200 of a video recording generation method according to an embodiment of the present application;
fig. 3 is a flowchart illustrating an embodiment of step S300 of a video recording generation method according to an embodiment of the present application;
fig. 4 is a flowchart illustrating another embodiment of step S300 of a video recording generation method according to an embodiment of the present application;
fig. 5 is a block diagram of an embodiment of an electronic device according to an embodiment of the present application;
fig. 6 is a block diagram of another embodiment of an electronic device according to an embodiment of the present application.
Detailed Description
Various aspects and features of the present application are described herein with reference to the drawings.
It will be understood that various modifications may be made to the embodiments of the present application. Accordingly, the foregoing description should not be construed as limiting, but merely as exemplifications of embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the application.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the application and, together with a general description of the application given above and the detailed description of the embodiments given below, serve to explain the principles of the application.
These and other characteristics of the present application will become apparent from the following description of preferred forms of embodiment, given as non-limiting examples, with reference to the attached drawings.
It should also be understood that, although the present application has been described with reference to some specific examples, a person of skill in the art shall certainly be able to achieve many other equivalent forms of application, having the characteristics as set forth in the claims and hence all coming within the field of protection defined thereby.
The above and other aspects, features and advantages of the present application will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.
Specific embodiments of the present application are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely examples of the application, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail, to avoid obscuring the application with unnecessary detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present application in virtually any appropriately detailed structure.
The specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the application.
The embodiment of the application provides a video recording generation method, which can remove noise irrelevant to a target person by acquiring the sound characteristics of the recorded target person and processing the recorded video based on the sound characteristics, thereby improving the video recording experience.
Fig. 1 is a flowchart of a video recording generation method according to an embodiment of the present application, and referring to fig. 1, the video recording generation method according to the embodiment of the present application specifically includes the following steps:
s100, first video and audio data are obtained, wherein the first video and audio data comprise first image data and first audio data, and the first audio data comprise voice information of at least one object.
Depending on the device to which the video recording generation method is applied, the first audio-video data can be acquired in various ways. For example, when the method is applied to an electronic device that has a camera and an audio acquisition device, such as a smart phone, a tablet computer or a notebook computer, acquiring the first audio-video data may mean capturing it with the camera and the audio acquisition device, that is, capturing the first image data with the camera and the first audio data with the audio acquisition device. In another case, the electronic device may acquire the first audio-video data from a wearable device such as VR glasses or AR glasses; because the processing capability of a wearable device is limited, the first audio-video data can be transmitted to a mobile electronic device after being captured so that the noise-removal processing is performed there. The method can also be applied to an electronic device such as a server, in which case the first audio-video data is acquired from another electronic device such as a mobile electronic device or a wearable device. For example, the first audio-video data captured by VR glasses or AR glasses worn by a user is sent to a server through a specific application program, and the server performs the noise-removal processing.
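Purely for illustration, and not as part of the claimed method, the following sketch shows one possible way to separate the first image data and the first audio data from a recorded clip on a device that can run the ffmpeg command-line tool. The file paths and the 16 kHz mono format are assumptions made for the example.

```python
# Illustrative only: demux a recorded clip into a video-only file and a mono WAV
# track, assuming the ffmpeg command-line tool is installed on the device.
import subprocess

def split_audio_video(src_path: str, video_out: str, audio_out: str, sr: int = 16000) -> None:
    """Split a recording into first image data (video stream) and first audio data (WAV)."""
    # Copy the video stream without re-encoding and drop the audio (-an).
    subprocess.run(["ffmpeg", "-y", "-i", src_path, "-an", "-c:v", "copy", video_out],
                   check=True)
    # Extract the audio as 16 kHz mono PCM for downstream feature extraction (-vn).
    subprocess.run(["ffmpeg", "-y", "-i", src_path, "-vn", "-ac", "1", "-ar", str(sr),
                    audio_out], check=True)

# Example (paths are placeholders):
# split_audio_video("first_av.mp4", "first_image_data.mp4", "first_audio_data.wav")
```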
S200, acquiring a first sound characteristic of a first object in at least one object.
Here, the at least one object refers to the objects whose voice information is captured when the first audio data is collected. The first audio data may include voice information of one object or of several objects. If only one person's voice is captured, the first audio data may contain no noise at all. In view of this, as a preferred embodiment, before step S200 the method may further include: recognizing the sound features of the respective objects in the first audio data, determining the number of objects contained in the first audio data based on the recognized sound features, and performing step S200 when the number of objects is more than one.
The first object is the target person at the time the first audio-video data is collected, and there may be one or more target persons. For example, if a parent records a homework video for a student, the student is the target person, that is, the first object. The first sound feature may include sound parameters that characterize the target person, such as volume, timbre, pitch, energy and frequency, and may also include, for example, a voiceprint feature.
In a specific implementation, obtaining the first sound feature of the first object among the at least one object may include obtaining a preset first sound feature of the first object; for example, when a user captures the first audio-video data with his or her own mobile electronic device, the first sound feature of that user may already be prestored on the device. Alternatively, before the first audio-video data is collected, voice information of the first object may be collected in advance and the first sound feature of the first object extracted from it. The first sound feature may also be extracted from the first audio data after the first audio-video data has been acquired, so that nothing needs to be stored in advance, which saves storage space and allows real-time recognition.
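As a minimal, non-limiting sketch of deriving such a sound feature from a short enrollment recording, a mean/standard-deviation MFCC vector can stand in for a real voiceprint model. The librosa package, the file name and the parameter values are assumptions made only for this example.

```python
# Illustrative sketch: a simple "first sound feature" computed from enrollment audio.
# A fixed-length MFCC statistics vector is used as a stand-in for a voiceprint model.
import numpy as np
import librosa

def sound_feature(wav_path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # shape (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # fixed-length vector

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sound-feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# first_sound_feature = sound_feature("first_object_enrollment.wav")
```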
S300, acquiring voice information associated with the first sound characteristic from the first audio data, and generating second audio and video data based on the first image data and the voice information associated with the first sound characteristic.
The speech information associated with the first sound feature may include only speech information of the first object corresponding to the first sound feature. For example, in recording a video of a student reciting an article in English, the first object may include only the student, the corresponding first sound characteristic may also include the student's sound characteristic, and obtaining speech information associated with the first sound characteristic from the first audio data may include obtaining only the student's speech information.
The voice information associated with the first sound feature may also include the voice information of the first object corresponding to the first sound feature together with the voice information of at least one other object that is related to it. For example, when recording a video in which a parent and a student complete an English conversation together, besides the student as the first object, one or more parents may need to take part in the conversation; for example, the student says "Nice to meet you" and the parent replies "Nice to meet you, too" or "You too". In that case, if only the student's voice information were kept, the English conversation would be incomplete and its semantics discontinuous. Therefore, the voice information of the student and the voice information of the parent can both be acquired.
After the voice information associated with the first sound feature is acquired, the second audio-video data may be generated based on the first image data and that voice information. Specifically, the second audio-video data may be synthesized from the first image data and the voice information associated with the first sound feature, or the other voice information in the first audio data may be removed or silenced based on the acquired voice information associated with the first sound feature, so as to generate the second audio-video data.
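Purely as an illustration of the "remove or silence" option, the sketch below zeroes the samples that fall outside the segments attributed to the first sound feature before the second audio-video data is assembled. The segment boundaries are assumed to come from an earlier matching step.

```python
# Illustrative sketch: mute everything in the first audio data that was not
# attributed to the first sound feature.
import numpy as np

def silence_unmatched(audio: np.ndarray, sr: int,
                      keep_segments: list[tuple[float, float]]) -> np.ndarray:
    """Return a copy of `audio` in which everything outside `keep_segments`
    (start/end times in seconds) is silenced."""
    mask = np.zeros_like(audio, dtype=bool)
    for start, end in keep_segments:
        mask[int(start * sr):int(end * sr)] = True
    return np.where(mask, audio, 0.0)

# cleaned = silence_unmatched(first_audio, 16000, [(1.2, 5.8), (7.0, 12.4)])
```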
According to the video recording generation method, a first sound feature of a first object among at least one object is obtained, voice information associated with the first sound feature is then acquired from the first audio data based on the first sound feature, and second audio-video data is generated based on the first image data and the voice information associated with the first sound feature. Noise unrelated to the first sound feature can thus be removed or silenced, the recorded video does not contain that noise, and the user experience is significantly improved.
In some embodiments, obtaining a first sound characteristic of a first object of the at least one object comprises: a first sound characteristic of the first object is determined from a library of preset sound characteristics.
The preset sound feature library may include prestored object information and sound features, where each piece of object information is associated with a sound feature. The object information may include the object's name, a face image of the object, gender, age and other related information, and the sound features may be feature parameters such as volume, timbre, energy and frequency, and may also include voiceprint features.
The preset sound feature library may be located on the electronic device used by the user; for example, the sound features of the user and of frequent users, together with their personal information, can be prestored on the user's smart phone or tablet computer. When the user records a video with the device, a prompt can pop up asking the user to select object information, and after the selection is made, the first sound feature associated with the selected object information is determined from the preset sound feature library.
The preset sound feature library may also be located on a server side. In that case, when the user records a video with the electronic device, a prompt window can pop up asking the user to enter object information; after obtaining the entered object information, the electronic device sends an acquisition request, which contains at least the object information, to the server, and then receives from the server the first sound feature matched on the basis of that object information.
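For illustration only, the following sketch combines the two placements of the library: the device first looks up the selected object locally and, failing that, asks a server. The endpoint URL, request fields and response format are assumptions and not part of the described method.

```python
# Illustrative sketch: preset sound feature library lookup, device-side with a
# server-side fallback. The server API shown here is hypothetical.
import numpy as np
import requests

LOCAL_LIBRARY = {}  # object name -> sound-feature vector, populated on the device

def lookup_first_sound_feature(object_name: str,
                               server_url: str = "https://example.com/sound-features"):
    if object_name in LOCAL_LIBRARY:                         # device-side library
        return LOCAL_LIBRARY[object_name]
    resp = requests.post(server_url, json={"object": object_name}, timeout=5)
    resp.raise_for_status()
    feature = np.asarray(resp.json()["sound_feature"])       # server-side library
    LOCAL_LIBRARY[object_name] = feature                      # cache for later use
    return feature
```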
As shown in fig. 2, in some embodiments, determining the first sound characteristic of the first object from the preset sound characteristic library may include:
s210, a face image in the first image data is acquired. During the video recording process, the first object usually appears in the video frame, that is, the first image data usually contains or only contains the facial image of the first object. Therefore, after the first audio-visual data is acquired, a face image can be acquired based on image recognition performed on the first image data. When only one face image is recognized, the face image is taken as the face image to be acquired, when a plurality of face images are recognized, the first object corresponding to the face images can be determined respectively based on the face images, and prompt information can pop up to ask the user to determine the face image to be acquired from the face images.
S220, the first object is determined based on the face image, and the first sound feature of the first object is determined from the preset sound feature library. After the face image is acquired, facial feature information can be recognized from it; the prestored object information may include preset facial feature information, so the first object can be determined by matching the recognized facial feature information against the preset facial feature information, and the first sound feature of the first object can then be determined from the preset sound feature library.
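As one possible, non-limiting illustration of steps S210 and S220, the open-source face_recognition package can be used to match a detected face against prestored encodings and return the associated sound feature. The library choice, the stored files and the tolerance value are assumptions.

```python
# Illustrative sketch: pick the first sound feature by matching a face in the
# first image data against a preset library of (face encoding, sound feature) pairs.
import face_recognition
import numpy as np

# Hypothetical preset library: object name -> (face encoding, sound feature)
PRESET = {
    "student_a": (np.load("student_a_face.npy"), np.load("student_a_voice.npy")),
}

def first_sound_feature_from_frame(frame: np.ndarray, tolerance: float = 0.5):
    """Detect faces in one RGB frame and return the name and sound feature of the
    best-matching preset object, if any."""
    for encoding in face_recognition.face_encodings(frame):
        for name, (known_face, voice_feature) in PRESET.items():
            if face_recognition.face_distance([known_face], encoding)[0] < tolerance:
                return name, voice_feature
    return None, None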
In the case where there is no preset sound feature library, or the first sound feature of the first object is not recognized from the preset sound feature library, the first speech information of the first object may be determined based on the following method:
sound features of respective objects in the first audio data are identified. There are various methods for identifying the sound features of the respective objects, for example, a method based on a mathematical model may be used to describe the sound features of the respective objects in the first audio data, so as to obtain one or more sets of feature description vectors, which are the sound features of the respective objects. The sound characteristics of each object in the first audio data can be recognized based on a self-learning model such as a deep neural network, the model is trained based on a large amount of data after the model is built, the sound characteristics of the object can be accurately described based on the sound characteristics obtained by the self-learning model, and the effect is good.
Voice information of each object is then acquired from the first audio data based on the sound features of each object. After the sound features are obtained, the pieces of voice information in the first audio data that share the same sound feature can be grouped together, giving the voice information of each object. For example, after the feature description vectors are obtained, the voice information of each object may be extracted from the first audio data based on similarity calculations between feature description vectors: the voice information of person A is extracted based on the feature description vector of person A, the voice information of person B based on the feature description vector of person B, and so on until the voice information of all objects has been acquired. Of course, this way of acquiring an object's voice information is only an example, and other methods may be used.
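A minimal sketch of this grouping step is shown below: each speech segment is described by a feature vector (for example the MFCC statistics from the earlier sketch) and the segments are clustered so that each cluster approximates one object's voice information. Agglomerative clustering is only one possible choice and is an assumption of this example.

```python
# Illustrative sketch: group speech segments by speaker using feature-vector similarity.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_segments_by_speaker(segment_features: np.ndarray, n_objects: int) -> np.ndarray:
    """segment_features: array of shape (n_segments, feature_dim).
    Returns one cluster label per segment; each label approximates one object."""
    clustering = AgglomerativeClustering(n_clusters=n_objects, metric="cosine",
                                         linkage="average")  # metric= needs sklearn >= 1.2
    return clustering.fit_predict(segment_features)

# labels = group_segments_by_speaker(features, n_objects=3)
# The voice information of object k is the set of segments where labels == k.
```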
First voice information of the first object is determined from the voice information of the respective objects based on a preset condition. The voice information of the target person, i.e. the first object, is usually loud, clear and continuous in content, and it usually runs through the whole of the first audio data or sits in its middle portion. The preset condition can therefore be configured around these typical characteristics: for example, the voice information with the longest duration may be taken as the first voice information of the first object; or the voice information that is mainly distributed in the middle region of the first audio data and whose duration exceeds a first threshold proportion of the first audio data may be taken as the first voice information; or the voice information with continuous semantics may be taken as the first voice information. With such a method, the first voice information of the first object can be determined automatically when no preset sound feature library is set or the library does not contain a first sound feature matching the first object. Afterwards, it can be determined whether there is voice information of other objects related to the first voice information of the first object.
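The sketch below illustrates one possible preset condition of this kind: pick the speaker whose segments cover the largest share of the middle portion of the first audio data and exceed a minimum share. The 25%/75% window and the 0.3 threshold are assumptions chosen only for the example.

```python
# Illustrative sketch: select the first object's voice information by a simple
# "mostly in the middle of the recording, long enough" heuristic.
def pick_first_object(segments_by_speaker: dict[int, list[tuple[float, float]]],
                      total_duration: float, min_share: float = 0.3):
    """segments_by_speaker maps a speaker label to (start, end) times in seconds."""
    mid_start, mid_end = 0.25 * total_duration, 0.75 * total_duration
    best_label, best_share = None, 0.0
    for label, segments in segments_by_speaker.items():
        covered = sum(min(end, mid_end) - max(start, mid_start)
                      for start, end in segments
                      if end > mid_start and start < mid_end)
        share = covered / (mid_end - mid_start)
        if share > best_share:
            best_label, best_share = label, share
    return best_label if best_share >= min_share else None
```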
In a preferred embodiment, if the first sound feature of the first object is not matched in the preset sound feature library, after the first voice information of the first object is acquired, the method may further include:
and determining a first sound characteristic of the first object from the sound characteristics of the objects based on the first voice information, and saving the first sound characteristic to a preset sound characteristic library. For example, when the first audio data respectively includes the voice information of the person a, the person B, and the person C, the respective voice information is respectively obtained based on the voice characteristics of the respective persons, and then the voice information of the person a is determined to be the first voice information of the first object based on the foregoing steps, the voice characteristic of the person a may be reversely determined to be the first voice characteristic of the first object, and the first voice characteristic may be stored in the preset voice characteristic library, so that the voice characteristic of the person a may be queried from the preset voice characteristic library in the subsequent use process.
In some embodiments, obtaining the first sound characteristic of the first object of the at least one object may also include:
an object image is acquired based on first image data, and first position information is acquired based on the object image. In particular implementations, an object image of each object in the first image data may be acquired based on image recognition, and first location information of each object may be determined based on the acquired object image. For example, only person a may be included in the first image data, or person a and person B may be included, and the first position information of person a and person B may be acquired based on the image recognition.
Sound features of the respective objects in the first audio data are recognized, voice information of the respective objects is acquired from the first audio data based on those sound features, and second position information is acquired based on the voice information of the respective objects. For example, the voice information of person A and person C can be recognized from the first audio data, and the second position information of person A and person C can be determined from that voice information.
A first sound feature of the first object is determined based on the first position information and the second position information. By matching the first position information against the second position information (in this example, only the first position information of person A has matching second position information), the sound feature of person A can be determined to be the first sound feature of the first object, and the voice information of person A to be the first voice information of the first object. In this way, the voice information of objects whose images appear in the first audio-video data is retained, while the voice information of objects whose images are not captured is removed or silenced as noise.
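A very rough, non-limiting sketch of such position matching is shown below: with a stereo recording, the inter-channel delay indicates whether a voice comes from the left or the right of the device, and this is compared with the horizontal position of each detected face. The coarse left/right decision, the lag range and the sign convention are assumptions of this example only.

```python
# Illustrative sketch: coarse position matching between audio (stereo lag) and
# image (face position) to decide which voice belongs to an on-screen object.
import numpy as np

def voice_side(left: np.ndarray, right: np.ndarray, max_lag: int = 32) -> str:
    """Return 'left' or 'right' from the inter-channel lag that maximises correlation."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        corr = np.dot(left[max_lag:-max_lag], np.roll(right, lag)[max_lag:-max_lag])
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return "left" if best_lag > 0 else "right"  # sign convention depends on channel order

def face_side(face_center_x: float, frame_width: int) -> str:
    return "left" if face_center_x < frame_width / 2 else "right"

# A speaker whose voice_side matches the face_side of an object in the frame is kept;
# voices with no matching on-screen position are treated as noise.
```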
As shown in fig. 3, in some embodiments, acquiring the voice information associated with the first sound feature from the first audio data, and generating the second audio-visual data based on the first image data and the voice information associated with the first sound feature includes:
s311, extracting first voice information from first audio data based on the first sound characteristic;
s312, second audio-visual data is generated based on only the first image data and the first voice information.
After the first sound feature of the first object is obtained, the voice information in the first audio data that has the first sound feature may be determined to be the first voice information, for example by cluster analysis or similarity analysis. The first image data and the first voice information can then be mixed into audio and video on the basis of first time information of the first image data and second time information of the first voice information, thereby generating the second audio-video data. For example, the first time information of each frame in the first image data and the second time information of each speech segment of the first voice information may be determined, so that the first voice information is accurately aligned with the frames of the first image data; alternatively, the first voice information may carry only its start and end times. The second audio-video data obtained in this way contains only the first voice information of the first object and no voice information of other objects related to the first object; the noise removal is more thorough, and this approach is suitable for a single-person mode.
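As a minimal sketch of this single-person mixing step, and assuming the cleaned first voice information is already time-aligned with the first image data (both come from the same recording), the cleaned track can be written to a WAV file and remuxed with the original video stream. The soundfile package, ffmpeg and the file names are assumptions of the example.

```python
# Illustrative sketch: combine the first image data with the cleaned first voice
# information to produce the second audio-video data.
import subprocess
import soundfile as sf

def build_second_av(video_only: str, cleaned_audio, sr: int, out_path: str) -> None:
    sf.write("first_voice.wav", cleaned_audio, sr)                    # cleaned audio track
    subprocess.run(["ffmpeg", "-y", "-i", video_only, "-i", "first_voice.wav",
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
                   check=True)                                        # remux video + audio

# build_second_av("first_image_data.mp4", cleaned, 16000, "second_av.mp4")
```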
As shown in fig. 4, in some embodiments, acquiring the voice information associated with the first sound feature from the first audio data, and generating the second audio-visual data based on the first image data and the voice information associated with the first sound feature includes:
s321, extracting first voice information from the first audio data based on the first sound feature;
s322, extracting at least one second voice message from the first audio data based on at least one second voice feature obtained from the first audio data, wherein the second voice feature is different from the first voice feature;
s323, second voice information with a semantic relation with the first voice information is determined from at least one second voice information, and second video data are generated based on the first image data, the first voice information and the second voice information with the semantic relation with the first voice information.
After the first sound feature of the first object is obtained, the voice information in the first audio data that has the first sound feature may be determined to be the first voice information, for example by cluster analysis or similarity analysis. A second sound feature is a sound feature of an object other than the first object in the first audio data, and there may be one or more second sound features. Second voice information of the other objects may be extracted from the first audio data based on their sound features. For example, when the first audio data contains person A, person B and person C, and person A is determined to be the first object, the voice information of person A is the first voice information, the sound features of person B and person C are second sound features, and the voice information of person B and person C can be acquired on the basis of those second sound features and taken as second voice information.
The first voice information and each piece of second voice information can be converted into first text content and second text content, for example by speech recognition, and semantic analysis can then be performed on the first text content and the second text content to determine whether each piece of second voice information has a semantic relationship with the first voice information. For example, the first text content of person A is "Nice to meet you", the second text content of person B is "Nice to meet you, too" or "You too", and the second text content of person C is "start". Based on semantic analysis it can be determined that the second text content of person B has a semantic relationship with the first text content of person A, while the second text content of person C does not; the second voice information of person B is therefore determined to be second voice information having a semantic relationship with the first voice information, and the second audio-video data is generated based on the first image data, the first voice information and that second voice information. The generated second audio-video data thus contains not only the first voice information of the first object but also the second voice information of the related objects, which preserves semantic continuity and completeness, and this approach is suitable for a multi-person mode.
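Purely as an illustration, the sketch below uses a token-overlap score as a stand-in for a real semantic model; it assumes the first and second voice information have already been transcribed. The threshold and the scoring rule are assumptions made for the example.

```python
# Illustrative sketch: a crude semantic-relation check between the first text
# content and a piece of second text content.
def has_semantic_relation(first_text: str, second_text: str, threshold: float = 0.3) -> bool:
    first_tokens = set(first_text.lower().split())
    second_tokens = set(second_text.lower().split())
    if not first_tokens or not second_tokens:
        return False
    overlap = len(first_tokens & second_tokens) / len(first_tokens | second_tokens)
    return overlap >= threshold

# has_semantic_relation("Nice to meet you", "Nice to meet you, too")  -> True
# has_semantic_relation("Nice to meet you", "start")                  -> False
```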
In a specific implementation, whether to generate the second audio-video data based only on the first image data and the first voice information, or based on the first image data, the first voice information and the second voice information having a semantic relationship with the first voice information, may be decided by the user. For example, before recording, a prompt may pop up asking the user to choose a single-person mode or a double-person mode. Alternatively, when second voice information having a semantic relationship with the first voice information is found, the second audio-video data is generated from the first image data, the first voice information and that second voice information; when no such second voice information is found, the second audio-video data is generated only from the first image data and the first voice information.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device, which is shown in fig. 5 and includes:
a first obtaining module 10, configured to acquire first audio-video data, wherein the first audio-video data includes first image data and first audio data, and the first audio data includes voice information of at least one object;
a second obtaining module 20, configured to obtain a first sound characteristic of a first object of the at least one object;
and the generating module 30 is configured to acquire the voice information associated with the first sound feature from the first audio data, and generate the second audio-visual data based on the first image data and the voice information associated with the first sound feature.
In some embodiments, the second obtaining module 20 is specifically configured to:
a first sound characteristic of the first object is determined from a library of preset sound characteristics.
In some embodiments, the second obtaining module 20 is further configured to:
acquiring a face image in first image data;
a first object is determined based on the face image, and a first sound feature of the first object is determined from a preset sound feature library.
In some embodiments, the generating module 30 is specifically configured to:
extracting first voice information from the first audio data based on the first sound feature;
generating the second audio-visual data based only on the first image data and the first voice information.
In some embodiments, the generating module 30 is specifically configured to:
extracting first voice information from the first audio data based on the first sound feature;
extracting at least one piece of second voice information from the first audio data based on at least one second sound feature obtained from the first audio data, wherein the second sound feature is different from the first sound feature;
and determining second voice information having a semantic relation with the first voice information from the at least one second voice information, and generating second audio-video data based on the first image data, the first voice information and the second voice information having a semantic relation with the first voice information.
In some embodiments, the electronic device further comprises:
the identification module is used for identifying the sound characteristics of each object in the first audio data;
a third obtaining module, configured to obtain, from the first audio data, voice information of each object based on a sound feature of each object;
and the determining module is used for determining the first voice information of the first object from the voice information of each object based on preset conditions.
In some embodiments, the electronic device further comprises:
and the storage module is used for determining the first sound characteristic of the first object from the sound characteristics of the objects based on the first voice information and storing the first sound characteristic to a preset sound characteristic library.
Referring to fig. 6, an embodiment of the present application further provides an electronic device, which at least includes a memory 901 and a processor 902, where the memory 901 stores an executable program, and the processor 902, when executing the executable program on the memory 901, implements the following steps:
acquiring first audio-video data, wherein the first audio-video data comprises first image data and first audio data, and the first audio data comprises voice information of at least one object;
obtaining a first sound feature of a first object of the at least one object;
and acquiring voice information associated with the first sound characteristic from the first audio data, and generating second video and audio data based on the first image data and the voice information associated with the first sound characteristic.
When the processor 902 executes the executable program stored in the memory 901 for obtaining the first sound feature of the first object in the at least one object, the following steps are specifically implemented: a first sound characteristic of the first object is determined from a library of preset sound characteristics.
When the processor 902 executes the executable program stored in the memory 901 and determining the first sound feature of the first object from the preset sound feature library, the following steps are specifically implemented:
acquiring a face image in the first image data;
determining the first object based on the facial image, determining a first sound feature of the first object from the preset sound feature library.
When the processor 902 executes the executable program, which is stored in the memory 901, for acquiring the voice information associated with the first sound feature from the first audio data and generating the second audio-video data based on the first image data and the voice information associated with the first sound feature, the following steps are specifically implemented:
extracting first voice information from the first audio data based on the first sound feature;
generating the second audio-visual data based only on the first image data and the first voice information.
When the processor 902 executes the executable program, which is stored in the memory 901, for acquiring the voice information associated with the first sound feature from the first audio data and generating the second audio-video data based on the first image data and the voice information associated with the first sound feature, the following steps are specifically implemented:
extracting first voice information from the first audio data based on the first sound feature;
extracting at least one piece of second voice information from the first audio data based on at least one second sound feature obtained from the first audio data, wherein the second sound feature is different from the first sound feature;
and determining second voice information having a semantic relation with the first voice information from the at least one second voice information, and generating second audio-video data based on the first image data, the first voice information and the second voice information having a semantic relation with the first voice information.
The processor 902, when executing the executable program stored on the memory 901, is further configured to implement the steps of:
identifying sound features of respective ones of the objects in the first audio data;
acquiring voice information of each object from the first audio data based on the sound characteristics of each object;
and determining first voice information of the first object from the voice information of each object based on a preset condition.
The processor 902, when executing the executable program stored on the memory 901, is further configured to implement the steps of:
and determining a first sound characteristic of the first object from sound characteristics of the objects based on the first voice information, and saving the first sound characteristic to a preset sound characteristic library.
The embodiment of the present application further provides a storage medium, which stores a computer program, and when the computer program is executed, the video recording generation method provided by any of the above embodiments of the present application is implemented.
The above embodiments are only exemplary embodiments of the present application, and are not intended to limit the present application, and the protection scope of the present application is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present application and such modifications and equivalents should also be considered to be within the scope of the present application.

Claims (6)

1. A video recording generation method comprises the following steps:
acquiring first audio-video data, wherein the first audio-video data comprises first image data and first audio data, and the first audio data comprises voice information of at least one object;
obtaining a first sound feature of a first object of the at least one object;
acquiring voice information associated with the first sound characteristic from the first audio data, and generating second video and audio data based on the first image data and the voice information associated with the first sound characteristic;
wherein the acquiring the voice information associated with the first sound feature from the first audio data and generating second audio-visual data based on the first image data and the voice information associated with the first sound feature include:
extracting first voice information from the first audio data based on the first sound feature;
extracting at least one piece of second voice information from the first audio data based on at least one second sound feature obtained from the first audio data, wherein the second sound feature is different from the first sound feature;
and determining second voice information having a semantic relation with the first voice information from the at least one second voice information, and generating second audio-video data based on the first image data, the first voice information and the second voice information having a semantic relation with the first voice information.
2. The video recording generation method of claim 1, wherein said obtaining a first sound characteristic of a first object of the at least one object comprises:
a first sound characteristic of the first object is determined from a library of preset sound characteristics.
3. The video recording generation method of claim 2, wherein said determining a first sound characteristic of the first object from a library of preset sound characteristics comprises:
acquiring a face image in the first image data;
determining the first object based on the facial image, determining a first sound feature of the first object from the preset sound feature library.
4. The video recording generation method of claim 1, wherein the method further comprises:
identifying sound features of respective ones of the objects in the first audio data;
acquiring voice information of each object from the first audio data based on the sound characteristics of each object;
and determining first voice information of the first object from the voice information of each object based on a preset condition.
5. The video recording generation method of claim 4, wherein the method further comprises:
and determining a first sound characteristic of the first object from sound characteristics of the objects based on the first voice information, and saving the first sound characteristic to a preset sound characteristic library.
6. An electronic device, comprising:
a first obtaining module, configured to acquire first audio-video data, wherein the first audio-video data includes first image data and first audio data, and the first audio data includes voice information of at least one object;
a second obtaining module, configured to obtain a first sound feature of a first object of the at least one object;
the generating module is used for acquiring the voice information associated with the first sound characteristic from the first audio data and generating second audio and video data based on the first image data and the voice information associated with the first sound characteristic;
wherein the generation module is specifically configured to:
extracting first voice information from the first audio data based on the first sound feature;
extracting at least one piece of second voice information from the first audio data based on at least one second sound feature obtained from the first audio data, wherein the second sound feature is different from the first sound feature;
and determining second voice information having a semantic relation with the first voice information from the at least one second voice information, and generating second audio-video data based on the first image data, the first voice information and the second voice information having a semantic relation with the first voice information.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010477437.XA | 2020-05-29 | 2020-05-29 | Video recording generation method and electronic equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010477437.XA | 2020-05-29 | 2020-05-29 | Video recording generation method and electronic equipment

Publications (2)

Publication Number | Publication Date
CN111629164A (en) | 2020-09-04
CN111629164B (en) | 2021-09-14

Family

ID=72272321

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010477437.XA (Active) | Video recording generation method and electronic equipment | 2020-05-29 | 2020-05-29

Country Status (1)

Country Link
CN (1) CN111629164B (en)

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103915095B (en) * 2013-01-06 2017-05-31 华为技术有限公司 The method of speech recognition, interactive device, server and system
KR20140114238A (en) * 2013-03-18 2014-09-26 삼성전자주식회사 Method for generating and displaying image coupled audio
US9953637B1 (en) * 2014-03-25 2018-04-24 Amazon Technologies, Inc. Speech processing using skip lists
EP3480811A1 (en) * 2014-05-30 2019-05-08 Apple Inc. Multi-command single utterance input method
CN107301862A (en) * 2016-04-01 2017-10-27 北京搜狗科技发展有限公司 A kind of audio recognition method, identification model method for building up, device and electronic equipment
CN107331404A (en) * 2017-06-22 2017-11-07 深圳传音通讯有限公司 The sound processing method and device of audio frequency and video
CN107333071A (en) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 Video processing method and device, electronic equipment and storage medium
CN107360387A (en) * 2017-07-13 2017-11-17 广东小天才科技有限公司 The method, apparatus and terminal device of a kind of video record
US10680993B2 (en) * 2018-03-30 2020-06-09 Facebook, Inc. Sonic social network
CN108962256A (en) * 2018-07-10 2018-12-07 科大讯飞股份有限公司 A kind of Obj State detection method, device, equipment and storage medium
CN110348011A (en) * 2019-06-25 2019-10-18 武汉冠科智能科技有限公司 A kind of with no paper meeting shows that object determines method, apparatus and storage medium
CN110740259B (en) * 2019-10-21 2021-06-25 维沃移动通信有限公司 Video processing method and electronic equipment
CN110913073A (en) * 2019-11-27 2020-03-24 深圳传音控股股份有限公司 Voice processing method and related equipment

Also Published As

Publication number Publication date
CN111629164A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
CN109215632B (en) Voice evaluation method, device and equipment and readable storage medium
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN110517689B (en) Voice data processing method, device and storage medium
Zhou et al. A compact representation of visual speech data using latent variables
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
CN110853646B (en) Conference speaking role distinguishing method, device, equipment and readable storage medium
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
JP2019212288A (en) Method and device for outputting information
CN105575388A (en) Emotional speech processing
CN111401268B (en) Multi-mode emotion recognition method and device for open environment
CN113067953A (en) Customer service method, system, device, server and storage medium
CN113886641A (en) Digital human generation method, apparatus, device and medium
CN111326152A (en) Voice control method and device
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
CN108847246A (en) A kind of animation method, device, terminal and readable medium
CN110347869B (en) Video generation method and device, electronic equipment and storage medium
CN111629164B (en) Video recording generation method and electronic equipment
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN111145748B (en) Audio recognition confidence determining method, device, equipment and storage medium
CN113421573B (en) Identity recognition model training method, identity recognition method and device
Abel et al. Cognitively inspired audiovisual speech filtering: towards an intelligent, fuzzy based, multimodal, two-stage speech enhancement system
CN115988164A (en) Conference room multimedia control method, system and computer equipment
CN115565534A (en) Multi-modal speech recognition method, device, equipment and storage medium
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant