CN114449297A - Multimedia information processing method, computing equipment and storage medium

Publication number: CN114449297A
Application number: CN202011219009.3A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 褚晓璐
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Application filed by: Alibaba Group Holding Ltd
Legal status: Pending
Prior art keywords: information, determining, video, emotion, processing mode

Classifications

    • H04N21/2187: Live feed (source of audio or video content; selective content distribution, e.g. VOD)
    • G10L25/57: Speech or voice analysis specially adapted for comparison or discrimination, for processing of video signals
    • G10L25/63: Speech or voice analysis specially adapted for comparison or discrimination, for estimating an emotional state
    • H04N21/233: Processing of audio elementary streams
    • H04N21/234: Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/4312: Generation of visual interfaces for content selection or interaction involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N21/4758: End-user interface for inputting end-user data, e.g. personal identification number [PIN] or preference data, for providing answers, e.g. voting
    • H04N7/15: Conference systems (two-way working television systems)

Abstract

Embodiments of the present application provide a multimedia information processing method, a computing device, and a storage medium. In these embodiments, multimedia information is obtained, and emotion information representing the user's emotion in the multimedia information is determined; a corresponding processing mode of the multimedia information is determined according to the emotion information, so that the multimedia information can be processed, with the processing mode reflecting the emotion information. Because the corresponding processing mode can be determined by determining the user's emotion information in the multimedia information, the multimedia information is processed automatically, freeing up manual effort and improving processing efficiency.

Description

Multimedia information processing method, computing equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a multimedia information processing method, a video processing method, a live video processing method, a video conference processing method, an audio processing method, a computing device, and a storage medium.
Background
With the rapid development of the internet, a wide range of services can be provided to users, such as online video watching, online ticket booking, online shopping, and live broadcasting. For online videos in particular, users prefer post-processed content, such as edited variety-show videos. However, because such post-processing is done manually, it consumes a great deal of labor and time and is unsuitable for video processing with high real-time requirements.
Disclosure of Invention
Aspects of the present application provide a multimedia information processing method, a video processing method, a live video processing method, a video conference processing method, an audio processing method, a computing device, and a storage medium, so as to automatically process multimedia information and save labor cost.
The embodiment of the application provides a method for processing multimedia information, which comprises the following steps: acquiring multimedia information, and determining emotion information representing user emotion in the multimedia information; determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the multimedia information based on the processing mode.
The embodiment of the present application further provides a method for processing multimedia information, including: acquiring multimedia information, and determining emotion information representing user emotion in the multimedia information; determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the multimedia information based on the processing mode so as to directly display the processed multimedia information.
The embodiment of the application further provides a method for processing live video, which includes: acquiring a live video image, and determining emotion information representing the anchor's emotion in the live video image; determining a corresponding processing mode of the live video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the live video image based on the processing mode.
The embodiment of the present application further provides a method for processing a video conference, including: acquiring video images corresponding to a plurality of users in a video conference, and determining the video image to which the user speaking belongs; determining emotion information representing a current emotion of a speaking user in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the video image based on the processing mode.
The embodiment of the present application further provides a video processing method, including: acquiring a preset video, and determining emotion information representing the emotion of a person in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the video image based on the processing mode.
The embodiment of the present application further provides a video processing method, including: acquiring a video image in a video call, and determining emotion information representing user emotion in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the video image based on the processing mode.
The embodiment of the present application further provides an audio processing method, including: acquiring singing audio, and determining emotion information representing the emotion of the singing user; determining a corresponding processing mode according to the emotion information, wherein the processing mode reflects the emotion information; and processing the singing audio based on the processing mode.
An embodiment of the present application further provides a computing device, including: a memory, a processor, and a communication component; the memory for storing a computer program; the communication component is used for acquiring multimedia information; the processor to execute the computer program to: determining emotion information representing the emotion of the user in the multimedia information; determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the multimedia information based on the processing mode.
An embodiment of the present application further provides a computing device, including: a memory, a processor, and a communication component; the memory for storing a computer program; the communication component is used for acquiring multimedia information; the processor to execute the computer program to: determining emotion information representing the emotion of the user in the multimedia information; determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the multimedia information based on the processing mode so as to directly display the processed multimedia information.
An embodiment of the present application further provides a computing device, including: a memory, a processor, and a communication component; the memory for storing a computer program; the communication component is used for acquiring a live video image; the processor to execute the computer program to: determining emotion information representing the anchor's emotion in the live video image; determining a corresponding processing mode of the live video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the live video image based on the processing mode.
An embodiment of the present application further provides a computing device, including: a memory, a processor, and a communication component; the memory for storing a computer program; the communication assembly is used for acquiring video images corresponding to a plurality of users in a video conference; the processor to execute the computer program to: determining a video image to which a user speaking belongs; determining emotion information representing an emotion of a speaking user in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the video image based on the processing mode.
An embodiment of the present application further provides a computing device, including: a memory, a processor, and a communication component; the memory for storing a computer program; the communication component is used for acquiring a preset video; the processor to execute the computer program to: determining emotion information representing an emotion of a person in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the video image based on the processing mode.
An embodiment of the present application further provides a computing device, including: a memory, a processor, and a communication component; the memory for storing a computer program; the communication assembly is used for acquiring a video image in a video call; the processor to execute the computer program to: determining mood information representative of a user's mood in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the video image based on the processing mode.
An embodiment of the present application further provides a computing device, including: a memory, a processor, and a communication component; the memory for storing a computer program; the communication component is used for acquiring singing audio; the processor to execute the computer program to: determining emotion information representing the current emotion of the singing user; determining a corresponding processing mode according to the emotion information, wherein the processing mode reflects the emotion information; and processing the singing audio based on the processing mode.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, which when executed by one or more processors causes the one or more processors to implement the steps of the above-mentioned method.
In the embodiments of the application, multimedia information is obtained, and emotion information representing the user's emotion in the multimedia information is determined; a corresponding processing mode of the multimedia information is determined according to the emotion information, so that the multimedia information can be processed, with the processing mode reflecting the emotion information. Because the corresponding processing mode can be determined by determining the user's emotion information in the multimedia information, the multimedia information is processed automatically, freeing up manual effort and improving processing efficiency.
Accordingly, because the multimedia information can be processed automatically, the method is suitable for multimedia information with high real-time requirements, such as live video and video conferencing, while also improving the user experience and making the content more engaging.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1A is a block diagram of a system for processing multimedia information according to an exemplary embodiment of the present application;
FIG. 1B is a block diagram of a system for processing multimedia information according to an exemplary embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for processing multimedia information according to an exemplary embodiment of the present application;
FIG. 3A is a schematic view of a processed video image interface according to an exemplary embodiment of the present application;
FIG. 3B is a schematic view of a processed video image interface according to an exemplary embodiment of the present application;
FIG. 4 is a schematic view of a processed video image interface according to an exemplary embodiment of the present application;
FIG. 5 is a schematic interface diagram of a gesture in a video image according to an exemplary embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for processing multimedia information according to another exemplary embodiment of the present application;
FIG. 7 is a flowchart illustrating a video live broadcast processing method according to another exemplary embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for processing a video conference according to another exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of an apparatus for processing multimedia information according to an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram illustrating an exemplary embodiment of a device for processing multimedia information;
FIG. 11 is a schematic structural diagram of a processing apparatus for live video according to still another exemplary embodiment of the present application;
FIG. 12 is a schematic structural diagram of a processing apparatus for a video conference according to another exemplary embodiment of the present application;
FIG. 13 is a block diagram of a computing device provided in an exemplary embodiment of the present application;
FIG. 14 is a schematic block diagram of a computing device provided in an exemplary embodiment of the present application;
FIG. 15 is a block diagram of a computing device provided in an exemplary embodiment of the present application;
FIG. 16 is a schematic structural diagram of a computing device according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of them. All other embodiments derived by a person skilled in the art from the embodiments given herein without making any creative effort shall fall within the protection scope of the present application.
As described in the background above, more and more users prefer post-processed videos, such as variety-show videos. However, such post-processing is edited manually; it cannot be applied to videos with high real-time requirements, and manual editing consumes a large amount of labor and time.
Therefore, the method provided by the embodiments of the present application can process video automatically, is suitable for video with high real-time requirements, and saves labor.
Fig. 1A is a schematic structural diagram of a system for processing multimedia information according to an exemplary embodiment of the present application. As shown in fig. 1A, the system 100A may include: a first device 101, a second device 102 and a third device 103.
The first device 101 may be a device with a certain computing capability that can send data to and acquire data from the second device 102. Its basic structure may include at least one processor; the number of processors may depend on the configuration and type of the device. Such a device may also include memory, which may be volatile (such as RAM), non-volatile (such as Read-Only Memory (ROM) or flash memory), or both. The memory typically stores an Operating System (OS) and one or more application programs, and may also store program data and the like. In addition to the processing unit and the memory, the device also includes some basic components, such as a network card chip, an IO bus, a display component, and some peripheral devices, which may include, for example, a keyboard and a stylus; other peripheral devices are well known in the art and are not described in detail here. Optionally, the first device 101 may be a smart terminal, such as a mobile phone, a desktop computer, a notebook, or a tablet computer.
It should be noted that there may be multiple first devices 101.
The third device 103 refers to a device that can provide computing and processing services in a networked environment, for example a device that performs video and/or audio processing (e.g., live video) over a network. In physical implementation, the third device 103 may be any device capable of providing computing services, responding to service requests, and hosting live video, for example a cloud server, a cloud host, a virtual center, or a conventional server. The third device 103 mainly includes a processor, a hard disk, memory, a system bus, and the like, similar to a general computer architecture.
The second device 102 may be a device with certain computing capability, and may implement a function of sending data to the third device 103 and acquiring data from the third device 103. The specific implementation form is similar to that of the first device 101, and is not described here again.
In this embodiment of the application, the second device 102 may send a multimedia information request, such as a live video request, to the third device 103, and the third device 103 may respond to the request and start receiving multimedia information sent by the user corresponding to the second device 102, such as a live video sent by an anchor. The first device 101 may send a request for viewing multimedia information, such as a live video viewing request, to the third device 103, and the third device 103 responds by sending the multimedia information, such as the live video, to the first device 101, so that the user of the first device 101, i.e., the viewing user, can watch the multimedia information, such as the live video.
Based on this, the third device 103 acquires multimedia information, such as a video image of a live video, from the second device 102, and determines emotion information representing the emotion of the user in the multimedia information, such as the anchor's emotion in the video image; determines a corresponding processing mode of the multimedia information, such as the video image, according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processes the multimedia information, such as the video image, based on the processing mode. The third device 103 then sends the processed multimedia information, such as the video images, to the first device 101.
Specifically, for a video, the third device 103 determines the voice information of the user according to at least one frame of video image, and determines the emotion information according to the voice information.
Specifically, for a video, the third device 103 determines a target object to which the processing mode is oriented, and processes the video image of the target object based on the processing mode.
It should be noted that, in the system 100A, the first device 101 may also implement the function of the third device 103, and process multimedia information, such as video images, for direct display or playing. The specific processing procedure is similar to the processing manner of the third device 103, and is not described again. At this time, the third device 103 is responsible for sending multimedia information, such as live video images, directly to the first device 101.
Further, in addition to the system 100A, processing of multimedia information, such as processing of video images, may also be implemented by the following system 100B.
Fig. 1B is a schematic structural diagram of a system for processing multimedia information according to an exemplary embodiment of the present application. As shown in fig. 1B, the system 100B may include: a fourth device 104, a fifth device 105.
The fourth device 104 is similar to the first device 101 in the system 100A and is not described here again; there may be multiple fourth devices 104. The fifth device 105 is similar to the previously described implementation of the third device 103 of the system 100A and will not be described here. Only the following points are noted:
In the embodiment of the present application, the plurality of fourth devices 104 may send a video conference request, as a form of multimedia information request, to the fifth device 105, and the fifth device 105 may respond to the request and start receiving the video images, as multimedia information, of the respective users sent by the plurality of fourth devices 104. The fifth device 105 transmits the received video images to the fourth devices 104, respectively, so that the users of the fourth devices 104, i.e., the users participating in the video conference, can hold the video conference. Thus, each user can view the video images of the other users and communicate with them.
Based on this, the fifth device 105 acquires the multimedia information, such as the video image, of the corresponding user from the fourth device 104, and determines emotion information representing the emotion of the user in the multimedia information, such as the video image; determining a corresponding processing mode of multimedia information such as video images according to the emotion information, wherein the processing mode is used for reflecting the emotion information; multimedia information, such as video images, is processed based on the processing mode. The fifth device 105 sends the processed multimedia information, such as video images, to the fourth device 104.
Specifically, for a video in the multimedia information, the fifth device 105 determines the voice information of the user according to at least one frame of video image, and determines the emotion information according to the voice information.
Specifically, for a video in the multimedia information, the fifth device 105 determines a target object to which the processing mode is oriented, and processes the video image of the target object based on the processing mode.
Furthermore, for a video in the multimedia information, the fifth device 105 determines a target object to which the emotion information is directed, and takes the video image corresponding to the target object as the image to be processed.
Specifically, for a video in the multimedia information, the fifth device 105 determines, according to at least one frame of video image of the user who is speaking, the voice information of that user; determines the corresponding semantics according to the voice information; and acquires a target object identifier according to the semantics, determining the target object based on the target object identifier.
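By way of illustration only, the following minimal sketch shows how a target object identifier might be derived from the semantics of the speaking user's words; the phrase table, identifiers, and helper function are hypothetical and not details specified by this application.

```python
# Hypothetical mapping from phrases in the recognized speech to target object
# identifiers; in practice the semantics would come from ASR plus the semantic
# analysis described later in this document.
TARGET_PHRASES = {
    "everybody": "all_participants",
    "everyone": "all_participants",
    "have not bought": "viewers_without_purchase",
}

def target_object_from_transcript(transcript: str) -> str:
    """Return a target object identifier derived from the speaking user's words."""
    text = transcript.lower()
    for phrase, target_id in TARGET_PHRASES.items():
        if phrase in text:
            return target_id
    return "speaking_user"   # default: apply effects to the speaker's own image

print(target_object_from_transcript("Those who have not bought will get another chance"))
# -> "viewers_without_purchase"
```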
Furthermore, the fourth device 104 may also implement the function of the fifth device 105 for processing multimedia information, such as video images. In that case, the fourth device 104 transmits its multimedia information, such as video images, to the fifth device 105, then receives the multimedia information of the respective users from the fifth device 105, processes it, and directly displays the processed multimedia information, such as the processed video images.
It should be noted that, with the system 100B, voting in a video scenario, such as a video conference, can also be implemented.
For the system 100B, the fifth device 105 acquires the video images of the participating users in the video conference and identifies the gestures of the corresponding users in the video images; based on the recognized gestures, the opinion each participating user favors in the video conference is determined, and votes are cast accordingly. In addition to sending the processed video images, such as the processed user images in the video conference, to the fourth devices 104, the fifth device 105 may also send the voting result to the fourth devices 104.
The voting result may be determined by counting the number of users corresponding to each opinion.
Specifically, the opinion of each participating user is determined from the recognized gesture according to a preset correspondence between gestures and opinions.
In addition, the fifth device 105 may also determine the start time of each gesture and determine each user's score in a game interaction according to the start time; the score is stored and sent to the corresponding smart terminal for display.
Similarly, the fourth device 104 may also implement the fifth device 105 functionality, namely voting in the video conference, and scoring in the game interaction. Because of the similarities with the foregoing, further description is omitted here.
It should be noted that, in addition to video such as live video, video conferences, and video calls, the multimedia information may also include audio, such as live audio, audio conferences, audio calls, and voice, which can be processed according to the same processing method; the details are not repeated. It is only noted that, for audio, properties such as the audio frequency and audio noise may be adjusted.
In the live video application scenario of the embodiment of the application, a second device 102, such as the anchor's mobile phone, establishes a network connection with a third device 103, such as a live video server (hereinafter referred to as the server), and sends a live video request; the server responds to the request and receives the live video sent by the anchor's mobile phone. The server may also receive a live video viewing request sent by a first device 101 of a user, such as a mobile phone, where the viewing request may carry the ID of the anchor; the server responds to the viewing request and sends the anchor's live video to the user's mobile phone, which receives and plays it. The live video may be an online shopping live stream.
Based on this live video, multiple users can watch the stream through their respective mobile phones and can interact with the anchor and shop by sending information. A user can watch the live video through the APP (application) installed on the mobile phone, purchase the goods currently being introduced on the live video interface provided by the APP, and input interactive information on that interface, which is sent to the anchor's mobile phone through the server so that the anchor can see the user's interactive information.
In the application scenario, the second device 102 may be a computer or the like.
The anchor's mobile phone or computer can receive, through the installed live video client, a large amount of interactive information sent by many users, as well as the current sales status of the goods, such as how many items remain and how many have been sold. When the goods sell out, users may send interactive information such as "didn't manage to buy one". At this time, the anchor might say something like "sisters who haven't bought one, there will be another chance next time". After receiving the video images of the live video, the server performs speech recognition on them and recognizes the voice information in the multi-frame video images, such as the sentence above. The voice information is then converted into text, the semantics of the text are determined, and the emotion information is determined from the semantics as encouraging emotion information. At the same time, the target object can be determined from the semantics as the users watching the live video, or the users who watched and attempted to purchase but did not get the goods. Based on this, when video images are sent to these users, pictures, expressions, text, animations, and the like matching the encouraging emotion information may be added to the video images, such as a "head pat" expression. The server sends the processed video images to the mobile phones of the corresponding users so that they can watch them.
The above addition of pictures, expressions, text, animations, and the like matching the encouraging emotion information to the video image, such as adding the "head pat" expression, may also be implemented by the user's first device 101, such as the mobile phone; the details are not repeated.
In addition, the specific implementation for the system 100B is similar to the above and is not repeated; it is only noted that, for the system 100B, the application scenario may be a video conference. The fifth device 105, such as a server, receives the video images of the users from the fourth devices 104, such as mobile phones. Based on at least one frame of video image of the user who is currently speaking, and on the corresponding voice information and semantics, the emotion information and the target object are determined; for example, the target object is "everybody", i.e., all participants of the video conference, and the emotion information is "exuberant". Pictures, expressions, text, animations, and the like corresponding to this emotion, such as a "like" expression, may then be added to the video image of each user. The server sends the processed video images, which may be all of the processed video images, to the mobile phones of all users, so that the displayed video images of all participants in the video conference are the processed ones.
The fifth device 105 may also determine the semantics of the user who is speaking, since it can perform speech recognition after acquiring that user's video images. When it determines that the speaking user's speech carries the semantic of "voting", the automatic tallying of voting results can be triggered. For example, after the server recognizes the "voting" semantic, it starts recognizing the gesture of each user in the video images and determines each user's posture. For example, user A raises a hand, user B keeps both hands down, user C raises a hand, and user D crosses both arms. The server determines that users A and C are in favor, user B is against, and user D abstains. The final voting result is 2 in favor, 1 against, and 1 abstention. The server can send the voting result to each user's mobile phone for display.
Similarly, in addition to determining the voting result as described above, the fifth device 105 may also support game interaction: the users' gestures are recognized in the same manner, and the game score is then determined based on the start time of each gesture, e.g., according to which gesture has the earliest start time.
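Purely as an illustration of the voting and game-scoring logic described above, the sketch below tallies opinions from recognized gestures and ranks users by gesture start time; the gesture labels, opinion names, and scoring rule are assumptions, not details specified by this application.

```python
from typing import Dict

# Assumed correspondence between recognized gestures and opinions, mirroring the
# raised-hand / hands-down / arms-crossed example above.
GESTURE_TO_OPINION = {
    "hand_raised":  "for",
    "hands_down":   "against",
    "arms_crossed": "abstain",
}

def tally_votes(gestures: Dict[str, str]) -> Dict[str, int]:
    """gestures maps user id -> recognized gesture label; returns vote counts."""
    counts = {"for": 0, "against": 0, "abstain": 0}
    for gesture in gestures.values():
        counts[GESTURE_TO_OPINION.get(gesture, "abstain")] += 1
    return counts

def game_scores(gesture_start_times: Dict[str, float]) -> Dict[str, int]:
    """Rank users by gesture start time: the earlier the gesture, the higher the score."""
    ordered = sorted(gesture_start_times.items(), key=lambda item: item[1])
    return {user: len(ordered) - rank for rank, (user, _) in enumerate(ordered)}

# Matches the worked example: A and C raise a hand, B keeps hands down, D crosses arms.
print(tally_votes({"A": "hand_raised", "B": "hands_down",
                   "C": "hand_raised", "D": "arms_crossed"}))
# -> {'for': 2, 'against': 1, 'abstain': 1}
```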
In addition, the fourth device 104, such as a mobile phone, may also implement the above voting result and game interaction.
In the embodiments described above, the first device 101, the second device 102, and the third device 103 are connected over a network, and the fourth device 104 and the fifth device 105 are connected over a network, which may be a wireless connection. If the first device 101, the second device 102, and the third device 103 are communicatively connected, and the fourth device 104 and the fifth device 105 are communicatively connected, the network format of the mobile network may be any one of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), WiMax, 5G, and the like.
The following describes the processing procedure of multimedia information in detail with reference to the method embodiment.
Fig. 2 is a flowchart illustrating a method for processing multimedia information according to an exemplary embodiment of the present application. The method 200 provided by the embodiment of the present application is executed by a computing device, for example, a server, such as a cloud server. The method 200 comprises the steps of:
201: acquiring multimedia information, and determining emotion information representing user emotion in the multimedia information.
202: and determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
203: and processing the multimedia information based on the processing mode.
The above steps are detailed below:
201: acquiring multimedia information, and determining emotion information representing user emotion in the multimedia information.
Multimedia information refers to information that combines video, audio, images, and/or text; it may also be a single type of information, such as audio alone. The video may include audio and is composed of video images.
Video images refer to images that make up a video, such as video frame images. The video images are sourced differently for different scenes. For example, in a live video scene, the video image may be a video image of a video anchor. For a video conference, the video image may be a video image of a user participating in the video conference. For a preset video, such as an online variety program, the video image is a video image of the program, etc.
The audio may be audio conference, audio live, preset audio, voice call, voice conversation, and so on.
The emotion information refers to information representing the emotion of the user in the video image, such as praise, criticism, embarrassment and the like. The user may be different for different scenarios. For example, for a live video, the user may be a main cast. For a video conference, the user may be the user who is currently speaking, or may refer to other users. For a preset segment of video, the user may be a character in the video, such as a guest in a variety program.
Correspondingly, for different scenes, the manner of acquiring the video image may include the following:
1) Acquiring audio and/or video information of at least one participating user based on an audio and/or video conference.
For example, as described above, the server may specifically be a video conference server. After the video conference is created, that is, after network connections are established with the smart terminals, such as mobile phones, of the users participating in the conference, the server can receive the video images of the users sent by the video conference APP installed on each mobile phone, and thus receive video images of multiple users. A participating user can log in through an account of the video conference APP on the mobile phone, connect to the server over the network, and send a video conference request to the server, where the request may carry the IDs of the other participating users. The server responds to the request and sends video conference invitations to the video conference APPs associated with the accounts of the other participating users; once they agree, the video conference APPs of all participating users can exchange video images with the server, and the video conference is created. The server may then begin receiving video images.
2) Acquiring the audio and/or video information of the anchor based on an audio and/or video live broadcast.
For example, according to the foregoing, the server may specifically be a live video server, and after the live video is created, that is, after the live video is connected to a network of a live intelligent terminal, such as a mobile phone, the server may also be connected to a network of an intelligent terminal of a user watching the live video, such as a mobile phone. The server can receive video images of the anchor from a live video APP installed on the mobile phone of the anchor. The specific creation process of the live video is similar to that described above, and will not be described herein again.
3) Acquiring the audio and/or video information of a preset audio and/or preset video.
For example, according to the foregoing description, the server may acquire the preset video from another server, or receive the preset video from a smart terminal, such as a computer. The video may be a recorded variety-show video.
It should be noted that, the acquiring of the corresponding audio information is similar to the acquiring of the video information, and the details are not repeated here.
After the video image is acquired, the emotion information can be determined for the video image, so that the video image can be automatically processed.
Wherein, determining the emotion information representing the emotion of the user in the multimedia information comprises: determining voice information of a user according to at least one frame of video image; and determining emotion information according to the voice information.
For example, as described above, the server receives video images, such as the video images of the participating users of a video conference. At the same time, the server can recognize which user is speaking, i.e., which video image currently carries sound. It should be understood that, for a video conference, the server receives the video images of every user, so when the video image of the speaking user is identified, it can determine which video image corresponds to that user. The server can therefore acquire multiple frames of the speaking user's video images and extract their audio content, i.e., the voice information. The voice information is then input into a preset recognition model to obtain the emotion information, such as an emotion of praise. The same applies to other scenarios: for a live video, for example, the server receives only the anchor's video images, so it extracts the voice information from those images and recognizes the emotion information through the preset recognition model.
In the process of recognizing the emotion information, only strong emotions may be recognized, and image processing may be skipped for calmer emotions, or not triggered on the basis of a calm emotion. This improves the server's processing efficiency, saves server resources, and at the same time meets the timeliness requirements of real-time video.
To distinguish strong emotions from calm ones, the emotion information contained in the voice information can be scored, and when the score falls within a certain threshold range, the emotion is determined to be strong. The score may be produced by the preset recognition model. Alternatively, when the preset recognition model is trained, the training data can be limited to voice information with strong emotions, so that after training the model simply does not recognize calm emotions.
In addition, the preset recognition model is used to recognize the voice information and determine the corresponding emotion information. It can be trained on training data consisting of, for example, multiple segments of voice information (audio) with strong emotions, with the emotion information corresponding to each audio segment labeled; alternatively, different voice information corresponding to different emotion information may be used, again with each audio segment labeled. The model may be a neural network model, such as a CNN (Convolutional Neural Network) or RNN (Recurrent Neural Network); the weight parameters of the network are determined through training to obtain the trained model. The voice information can then be fed to the trained model to obtain the emotion information it contains. The training data can also be divided into two types: voice information expressing positive emotion information, such as praise, happiness, and liking, and voice information expressing negative emotion information, such as criticism, sadness, and annoyance.
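The following toy sketch illustrates the idea of scoring emotion intensity in recognized speech and only triggering processing for strong emotions; the keyword lists and the 0.6 threshold are invented stand-ins for the preset recognition model and threshold range mentioned above, not the model itself.

```python
from typing import Optional, Tuple

# Toy stand-in for the preset recognition model: keyword lists instead of a
# trained CNN/RNN, and an assumed intensity threshold of 0.6.
POSITIVE_WORDS = {"praise", "happy", "like", "great"}
NEGATIVE_WORDS = {"criticism", "sad", "annoyed", "terrible"}
STRONG_THRESHOLD = 0.6

def score_emotion(transcript: str) -> Tuple[str, float]:
    """Return (polarity, intensity in [0, 1]) for a recognized speech transcript."""
    words = transcript.lower().split()
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    if pos + neg == 0:
        return "neutral", 0.0
    polarity = "positive" if pos >= neg else "negative"
    intensity = min(1.0, (pos + neg) / 3.0)   # crude intensity proxy
    return polarity, intensity

def strong_emotion(transcript: str) -> Optional[str]:
    """Only strong emotions trigger image processing; calmer speech is skipped."""
    polarity, intensity = score_emotion(transcript)
    return polarity if intensity >= STRONG_THRESHOLD else None

print(strong_emotion("great great praise"))   # -> "positive"
print(strong_emotion("nothing special"))      # -> None
```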
In order to be able to determine the emotion of the user more accurately, the emotion of the user can be further determined by the expression of the user.
Specifically, the method 200 further includes: determining expression information of a user according to at least one frame of video image; and determining emotion information according to the expression information.
For example, as described above, the server receives the video images of the users participating in the video conference. After the video image of the user who is speaking is determined, multiple frames are extracted from the corresponding video images to determine the change of the user's expression. The user's facial expression information, such as happy, serious, or inattentive, can be recognized through a facial expression recognition model, which further helps to verify the accuracy of the emotion information determined from the voice information as described above.
In addition, the facial expression recognition model may also be a neural network model; the training manner is similar to that described above and is not repeated here, with only the following noted: the training data may be facial pictures with expressions, each picture labeled with its corresponding expression. After the model is trained, the video image of the speaking user can be input into it; the model first detects the face in the video image and then recognizes its expression to obtain the corresponding emotion information.
It should be noted that the emotion information may also be determined directly from the expression information, without relying on the emotion information determined from the voice information.
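A minimal sketch of per-frame expression recognition with aggregation over several frames is shown below; the face-detection and expression-classification functions are placeholders for the facial expression recognition model described above, not concrete implementations.

```python
from collections import Counter
from typing import List

def detect_face(frame: object) -> object:
    """Placeholder for a face detector; returns the cropped face region."""
    return frame

def classify_expression(face: object) -> str:
    """Placeholder for the facial-expression recognition model (e.g. a CNN)."""
    return "happy"   # stub result

def expression_emotion(frames: List[object]) -> str:
    """Aggregate per-frame expressions over several frames so that a single
    transient expression does not determine the emotion information."""
    votes = Counter(classify_expression(detect_face(f)) for f in frames)
    return votes.most_common(1)[0][0]

print(expression_emotion([object(), object(), object()]))   # -> "happy"
```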
In a video conference, the speaking user may change at any time. When that happens, the server can recognize the new speaking user and switch to that user's video image.
The determination of emotional information for video images of other scenes is similar and will not be described here.
Besides the determination of the emotion information directly according to the above model, the emotion information may also be determined according to semantics in the speech information.
Specifically, determining the emotion information according to the voice information includes: determining the semantics corresponding to the voice information, and determining the emotion information according to the semantics.
The semantics may be determined as follows: the text corresponding to each piece of voice information is segmented to obtain at least one word, and a word vector of each segmented word is determined according to a preset word vector model; the semantics can then be determined.
For example, as described above, after acquiring the voice information, the server may recognize the text corresponding to it through ASR (Automatic Speech Recognition), for example a sentence containing a word of praise. The text is then segmented, and stop words, including some adverbs, adjectives, and conjunctions, are removed. A word vector is then determined for each segmented word through the word vector model, and each word vector is matched against a preset emotion dictionary to find the segmented words with emotional meaning, such as the praise word.
It should be noted that the preset emotion dictionary collects the word vectors of a large number of emotion words, and each word vector corresponds to emotion information and an emotional tendency, such as positive or negative emotion information. By matching, it can be determined whether emotional words, and which emotional tendency, are present in the voice information.
When the voice information contains several emotional words of both positive and negative tendencies, the final emotional tendency of the voice information, positive or negative, can be determined through a weighting algorithm. Then the most frequent word among the emotional words of that tendency can be selected as the final emotion information, or all emotional words of that tendency can be used directly as the final emotion information, or one word can be selected at random from them.
The final emotional tendency of the voice information can also be determined from the number of emotional words of each tendency; for example, with 3 positive and 2 negative emotional words, the final emotion information of the voice information is positive, and the emotion words of that tendency are then turned into the final emotion information in the manner described above.
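As a rough illustration of the segmentation, emotion-dictionary matching, and weighted tendency decision described above, consider the following sketch; the miniature dictionary, stop-word list, and weights are assumptions for demonstration only, and whitespace splitting stands in for a real word segmenter.

```python
from collections import Counter

# Assumed miniature emotion dictionary; a real one would map the word vectors of
# many emotion words to their labels and polarities.
EMOTION_DICT = {
    "praise":    ("praise", "positive"),
    "happy":     ("happy", "positive"),
    "like":      ("like", "positive"),
    "criticism": ("criticism", "negative"),
    "sad":       ("sad", "negative"),
}
STOP_WORDS = {"the", "a", "an", "very", "and"}   # stand-in for adverbs, conjunctions, etc.

def text_emotion(text: str, pos_weight: float = 1.0, neg_weight: float = 1.0) -> str:
    # 1) segment and drop stop words (Chinese text would need a real segmenter)
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    # 2) match tokens against the emotion dictionary
    hits = [EMOTION_DICT[t] for t in tokens if t in EMOTION_DICT]
    if not hits:
        return "neutral"
    # 3) weighted polarity decision
    pos = sum(pos_weight for _, p in hits if p == "positive")
    neg = sum(neg_weight for _, p in hits if p == "negative")
    tendency = "positive" if pos >= neg else "negative"
    # 4) pick the most frequent emotion word of the winning tendency
    counts = Counter(label for label, p in hits if p == tendency)
    return counts.most_common(1)[0][0]

print(text_emotion("praise praise and a little criticism"))   # -> "praise"
```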
It should be noted that, for a video image, the emotion information may be determined from both the voice information and the expression information; that is, the two approaches described above can be combined, and the details are not repeated here.
In addition, this can also be implemented by a model. The training process follows the semantic approach to determining emotion information described above, and the training data may be different voice information, or the corresponding text, for different emotion information, with the emotion information for each audio segment labeled. After the model is trained, the parameters of each weight in the weighting algorithm are determined, so that the final emotional tendency of the text corresponding to the voice information can be determined.
Determining the emotion information according to the semantics includes: determining, according to the semantics, the label set that expresses the emotional tendency of the voice information; and selecting a corresponding emotion label from the label set as the emotion information according to the semantics.
Since this has been set forth above, it is not described in detail here; it is only noted that an emotional-tendency label refers to the specific content of positive or negative emotion information, such as happiness, sadness, criticism, or praise, and the label set of an emotional tendency refers to the set of labels of the positive or the negative emotion information, used to indicate the emotional tendency. Therefore, the emotional tendency ultimately expressed by the corresponding text can be determined from the semantics, and the corresponding emotion information within that tendency is then determined according to the tendency and the semantics.
The determination method of the emotion information is similar for any video image in any scene, and therefore, the description thereof is omitted here.
In addition, instead of determining the processing mode based on the emotion information, the processing mode may be determined from keywords extracted from the voice information. The keywords can be determined by a model in a manner similar to the determination of the emotion information. Creating the model may include obtaining training data, which may be a large number of sentences, marking the key information in each sentence, and using these as training data for a neural network model, such as a CNN or RNN. After the model is trained, the voice information can be converted into corresponding text and then input into the model to obtain the keywords.
In addition to determining the emotion according to the voice information and the like, the emotion of the user can be determined by the body posture of the user in the video image, such as the sitting posture, the standing posture and the like of the user.
For a sitting posture, for example, both arms crossed over the chest can be determined as a rejecting emotion, which can further be regarded as a disagreeing emotion; arms spread open can be determined as a welcoming emotion, which can further be regarded as an agreeing emotion; resting the chin on one hand can be determined as a bored emotion, and so on.
For a standing posture, for example, both hands on the hips with the feet apart can be determined as an angry emotion, and both hands in the pockets can be determined as a relaxed, happy emotion.
The body posture can be recognized by a neural network model. The training of the model is similar to that described above and will not be repeated here. After the model is obtained, the posture, and thus the emotion, can be determined from the video image and the model; the specific implementation is similar to the foregoing and will not be described again.
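For illustration only, the posture labels below are hypothetical outputs of such a posture recognition model, mapped to the emotions mentioned above; the label names are assumptions.

```python
# Hypothetical posture labels produced by a posture-recognition model,
# mapped to the emotions described above (illustrative correspondence only).
POSTURE_TO_EMOTION = {
    "arms_crossed_over_chest": "rejecting / disagreeing",
    "arms_open": "welcoming / agreeing",
    "one_hand_on_chin": "bored",
    "hands_on_hips_feet_apart": "angry",
    "hands_in_pockets": "relaxed / happy",
}

def emotion_from_posture(posture_label):
    return POSTURE_TO_EMOTION.get(posture_label, "unknown")

print(emotion_from_posture("arms_crossed_over_chest"))  # rejecting / disagreeing
```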
202: and determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
The processing mode refers to a video processing mode that describes or reflects the emotion information, and may be a video special effect mode, for example adding an expression effect, a text effect, an animation effect and the like. The mode may correspond to the emotion information, such as an expression that represents praise, an expression that represents embarrassment, an animation that represents encouragement, and the like. It should be understood that the processing mode may be understood as the special effect itself, such as an expression special effect, or as the processing operation, such as adding the expression special effect; both correspond to, reflect or describe the emotion information.
For example, following the foregoing, after determining the emotion information the server may determine that it is "happy". For the "happy" emotion information, a preset processing mode corresponding to "happy" can be determined, such as a happy expression, a happy animation or happy text.
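A minimal sketch of such a correspondence between emotion information and preset special effect processing modes might look as follows; the effect identifiers are placeholders introduced only for illustration.

```python
# Illustrative mapping from emotion information to a preset special-effect
# processing mode; the effect identifiers are invented placeholders.
EMOTION_TO_EFFECT = {
    "happy": {"type": "expression", "asset": "happy_face"},
    "praise": {"type": "expression", "asset": "thumbs_up"},
    "comfort": {"type": "expression", "asset": "head_pat"},
    "encourage": {"type": "animation", "asset": "confetti"},
}

def processing_mode_for(emotion):
    """Select the preset special-effect processing mode for the emotion information."""
    return EMOTION_TO_EFFECT.get(emotion)

print(processing_mode_for("happy"))  # {'type': 'expression', 'asset': 'happy_face'}
```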
In addition, a corresponding processing mode can be determined according to the object to which the emotion information is directed. For example, if the emotion information is "praise", the object of the "praise" can be determined, such as the other participating users of the video conference, and a "thumbs-up" special effect expression may be determined for those other participating users.
The determining of the corresponding processing mode according to the emotion information may specifically be: determining, according to the emotion information, the corresponding processing mode of the multimedia information includes: selecting, according to the emotion information, a special effect processing mode corresponding to the emotion information.
Since it has been set forth in the foregoing, it will not be described in detail here.
The corresponding processing mode that needs to be determined according to the object to which the emotion information is directed may specifically be: the method 200 further comprises: determining a target object to which the emotion information faces; and selecting a special effect processing mode corresponding to the emotion information according to the target object.
Since it has been set forth in the foregoing, it will not be described in detail here, except for the following: the target object refers to the object for which the emotion information is intended, such as the other participating users described above, or a user watching the live video in a video live broadcast.
The specific determination method of the target object may be as follows:
specifically, the determining a target object to which the emotion information is directed includes: determining corresponding semantics according to the voice information; and acquiring a target object identifier in the semantic database according to the semantics, and determining the target object based on the target object identifier.
The target object identifier may be a name, a position or title, a code name, and the like, and may differ for different scenarios. For example, for a video conference, the identifier may be the name, position or code name of a participating user; for a live video, it may be the account name of a viewing user or a general form of address used by the anchor, such as "buddies" or "babies".
The target object identifier may also be determined by a model. The training of the model is similar to that of the keyword model described above and is not repeated here, except for the following: the training data for the model that recognizes target object identifiers may also be a large number of sentences, with the target object identifier marked in each sentence. Through training, the model learns the positions at which target object identifiers appear in sentences and which parts of a sentence belong to a target object identifier. Text corresponding to the voice information is then input into the trained model to determine the target object identifier.
In addition, the target object identifier can also be determined according to the semantics of the voice information: word vectors of the segmented words are obtained in the semantic determination described above and matched against preset target object identifiers; when a match is found, the corresponding target object identifier is determined.
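As an illustration of this matching step, the toy vectors and identifiers below are invented; in practice the word vectors would come from the semantic analysis and the preset identifiers from the scene configuration, neither of which is prescribed here.

```python
import numpy as np

# Toy word vectors; in practice they would come from the semantic analysis step.
VECTORS = {
    "xiao wang": np.array([0.9, 0.1, 0.0]),
    "buddies":   np.array([0.1, 0.9, 0.2]),
}
PRESET_IDS = {
    "Xiao Wang (participant)": np.array([0.88, 0.12, 0.05]),
    "all viewers":             np.array([0.12, 0.85, 0.25]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_target_object(segment, threshold=0.9):
    """Match a segmented word against the preset target-object identifiers."""
    vec = VECTORS.get(segment)
    if vec is None:
        return None
    best_id, best_sim = max(
        ((tid, cosine(vec, tvec)) for tid, tvec in PRESET_IDS.items()),
        key=lambda x: x[1])
    return best_id if best_sim >= threshold else None

print(match_target_object("xiao wang"))  # Xiao Wang (participant)
```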
It should be noted that, the preset target object identifier may have different preset modes according to different scenes. For example, for a video conference, after the video conference is started, the server may obtain personal attribute information such as a name, a position, and the like of each participating user according to account information of the participating user, and use the personal attribute information as a preset target object identifier of the video conference to wait for matching. That is, each video conference can temporarily set a target object identifier according to the participating users. For live video, the target object id may be generic, and may be preset from the beginning, such as "buddies". Of course, for live video, a preset target object identifier may be created temporarily according to each viewing user.
For example, according to the foregoing, after determining the semantics corresponding to the voice information, the server may determine the target object identifier according to the semantics, such as the name "XX" of a video conference participant, or may determine it from the model. When the participant "XX" is determined, the corresponding participant is finally identified according to the account information of the participating users, so that the video image corresponding to that participant can be determined and processed accordingly.
Live video is similar: after the target object identifier is determined, it is determined whether it refers to all viewing users or to certain viewing users. It should be understood that some target object identifiers, such as "buddies", refer to all viewing users. For a specific user or certain users, such a generic identifier may not apply; instead, the identifier, such as an account name, may be matched against the account information of the current viewing users to determine the corresponding viewing user, and when the live video is sent to that viewing user, the video picture is processed and the processed picture is sent to that user. In addition, the viewing users may be grouped, and the image processing for each group determined according to the grouping. A group may be created temporarily or preset: a preset grouping may be based on the account level of the viewing users, such as ordinary users and VIP users, while a temporary grouping may be based on whether a user has bought the current goods in the live video. If the target object identified in the anchor's voice information is a certain group, video image processing can be performed for the users of that group. The identifier of a group may likewise be determined by a model or by matching, which is not repeated here.
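Purely as a sketch of such grouping, and assuming hypothetical account fields ("vip", "bought_current_item") that the disclosure does not prescribe:

```python
# Illustrative grouping of viewing users; group names and fields are assumptions.
viewers = [
    {"account": "user_a", "vip": True,  "bought_current_item": False},
    {"account": "user_b", "vip": False, "bought_current_item": True},
    {"account": "user_c", "vip": False, "bought_current_item": False},
]

def group_of(viewer):
    """Temporary grouping by purchase of the current item; a preset grouping could use VIP level."""
    return "buyers" if viewer["bought_current_item"] else "non_buyers"

def viewers_in_group(viewers, group):
    return [v["account"] for v in viewers if group_of(v) == group]

# If the anchor's speech targets the group that did not buy,
# only the frames sent to those viewers are processed.
print(viewers_in_group(viewers, "non_buyers"))  # ['user_a', 'user_c']
```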
It should be noted that, for a preset video such as a variety-show video, a set of target object identifiers may also be preset, for example the names or nicknames of all guests or persons appearing in the video. After a target object identifier is determined, the corresponding person, such as a guest, can be determined from this set. The guest can then be recognized in the video by image recognition, the position of the guest in the video determined, and image processing such as adding a special effect performed on the guest. It should be understood that, for guest recognition, images of each guest may be collected in advance and the guest in the video recognized based on those images, or the guest may be recognized by image matching.
203: and processing the multimedia information based on the processing mode.
For example, according to the foregoing description, the server determines a corresponding processing mode, such as adding a "happy" expression or animation. The "happy" expression may be added directly to the video image of the user currently speaking in the video conference, thereby obtaining the processed video image.
For multimedia information processing with a target object, the corresponding multimedia information needs to be processed according to the target object.
Specifically, the processing of the multimedia information based on the processing mode includes: determining a target object oriented to a processing mode; and processing the video image of the target object based on the processing mode.
For example, as described above, after the server determines the target object, such as the other participating users in a video conference, the processing mode for the emotion information is adding a "thumbs-up" ("praise") expression to the video images of the other participating users.
FIG. 3A shows a processed video image interface 300A. In this interface 300A there are video images of four users participating in the video conference: video image 301, video image 302, video image 303 and video image 304. Each video image contains the image of the corresponding participating user, such as the participating user image 306 in the video image 301. The user currently speaking is the user corresponding to the video image 301, and the current speech 307 is, for example, a remark praising the other participants. According to the foregoing, it can then be determined that the target objects are the other participating users and the emotion information is "praise". A "like" expression corresponding to the emotion information "praise" can then be added to the video images of the other participating users, such as the "like" expression 305 in the video image 302. After the video images are processed, they can be sent to the video conference APPs on the mobile phones of the participating users, so that each participating user sees an image such as that shown in interface 300A.
FIG. 3B shows a processed video image interface 300B. In this interface 300B there are likewise the video images of the four users participating in the video conference of FIG. 3A. The user currently speaking is the user corresponding to the video image 301, and the current speech 309 criticizes participating user B, for example a remark that B did not do well. According to the foregoing, it can then be determined that the target object is participating user B. Since the emotion information is "criticism", a "jiong" (embarrassed) expression corresponding to this emotion information can be added to the video image of participating user B, such as the "jiong" expression 308 in the video image 302. Similarly, the processed video images are sent to the video conference APPs on the mobile phones of the respective participating users, so that each participating user sees the processed video images, such as the image shown in interface 300B.
In addition, for some processing modes, image processing can be performed on target users in the video images.
Specifically, the method 200 further includes: determining a target object in a video image; and processing the target object based on the processing mode.
For example, according to the foregoing, after the server determines the target object, if the voice information of the user speaking in the video conference is "this problem is really difficult" and the emotion information is "troubled", the head of that user in his or her video image can be recognized, the position of the head in the image determined, and the head enlarged, resulting in a head-enlargement special effect. The processed video is then sent to the video conference APP on the mobile phone of each participating user for viewing.
In the case where the target object does not exist in the video image, the image processing may be performed according to the following manner: specifically, the method 200 further includes: determining a position in the video image where the target object can appear; and processing the position based on the processing mode.
For example, as described above, in the case of live video, the video image is usually only the anchor's video image, and the anchor cannot see the viewing users in the video image while interacting with them. The server may randomly determine a position in the video image as the position at which to add the special effect, such as an expression. Alternatively, the server may assume a position for a viewing user in the video image, such as the lower middle part of the picture, treat it as the position of each viewing user, and perform the image processing at this position.
FIG. 4 shows a processed video image interface 400. In this interface 400 there is an anchor image 401, and the anchor's current speech 402 is, for example, "buddies who didn't manage to grab it, there will be another chance next time". According to the foregoing, it can be determined that the target objects are the buddies who did not grab the item, that is, the viewing users who did not manage to purchase the goods, and that the emotion information is "comfort". A "comfort" expression corresponding to the emotion information, such as the "comfort" (for example, head-pat) expression 403 in interface 400, can then be added. Since the target objects are the viewing users who did not purchase the goods, the server can determine those users from the purchase data and add the "comfort" expression 403 when sending the live video to them, so that these viewing users see the image shown in interface 400.
After the video is processed, it can be displayed.
Specifically, the method 200 further includes: and providing the processed multimedia information to a corresponding intelligent terminal for displaying.
Since the foregoing has been set forth, further description is omitted herein. Based on emotion analysis, the embodiment of the application can add special effects to the video in real time, providing a personalized, per-user viewing and live-broadcast experience.
Besides, a repository can be established for the special effect processing modes; the repository can be a special effect library or the like. It should be understood that the special effects in the library may be organized and managed by emotion.
Specifically, the method 200 further includes: creating a storage library of special effect processing modes; receiving a request from a user, selecting a corresponding special effect processing mode from a storage library based on the type of a video image, and determining the type of the corresponding special effect processing mode; and selecting a special effect processing mode from the corresponding type according to the emotion information and the type of the acquired video image.
The type of the preset video image is the type of the video. For example, for a video conference the type is a conference type, such as a formal conference or a team-activity conference, and can be determined according to the conference subject. For live video, the type can be the category of the live broadcast, such as e-commerce live broadcast, education live broadcast or singing live broadcast. For a preset video, such as a variety-show video, the video type can be, for example, variety show, movie review, documentary, and the like.
For example, as described above, the server creates a corresponding repository for storing the special effect processing modes. Before the video conference starts, a user can select a special effect processing mode through an interface provided by the video conference APP installed on the user's mobile phone, with different special effect processing modes selected for different conference types. The APP sends a selection request to the server based on the selection operation, and based on the request the server can associate the corresponding special effect processing modes with the type, that is, the type of the video conference. Thus, for a given video conference type, a special effect processing mode matching the emotion information is selected from the special effect processing modes of that type.
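One way such a repository could be organized, given purely as a sketch with invented type names and effect identifiers, is a two-level lookup by video/conference type and by emotion information:

```python
# Hypothetical special-effect repository organized by video/conference type and emotion.
EFFECT_REPOSITORY = {
    "formal_meeting":  {"praise": "subtle_thumbs_up", "criticism": "neutral_frame"},
    "team_activity":   {"praise": "confetti_burst",   "criticism": "jiong_face"},
    "ecommerce_live":  {"comfort": "head_pat",        "praise": "thumbs_up"},
}

def select_effect(video_type, emotion):
    """Pick the special-effect processing mode for the given type and emotion information."""
    return EFFECT_REPOSITORY.get(video_type, {}).get(emotion)

print(select_effect("team_activity", "praise"))  # confetti_burst
```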
In addition, the server may also directly obtain the corresponding configured repository from other platforms, and the preset repository may have different types.
Alternatively, the user can directly and autonomously select and add a special effect processing mode in the video conference to complete the video processing, so that the server determines the type of the special effect processing mode based on the user's autonomous selection and the type of the current video conference.
It should be noted that, when the method 200 is executed by a terminal, the terminal may directly obtain the user's selection operation and perform the selection directly.
In addition, the method 200 further comprises: and if the preset video is determined not to meet the preset condition according to the emotion information, the voice information and/or the scene, providing prompt information to prompt video clips, voice information and/or people in the preset video to be processed.
By recognizing the above information, it can be determined whether the recognized emotion information, voice information and/or scene conflicts with the overall atmosphere requirements of the preset video, such as objects, persons, language or scenes that are not allowed to appear in the video. When there is a conflict, the user can be prompted to cut a certain video segment, to apply a mosaic to the video image, or to add noise to the voice, and so on. When a person or object that should not appear in the video image does appear, the user is prompted to cut it out or apply a mosaic. The prompting can be performed through prompt information, such as a prompt on the editing interface.
In order to add special effects more accurately when determining the special effect processing mode, the emotion information of the user can also be determined through the user's intonation, or the corresponding special effect processing mode can be determined directly from the intonation information, or from the intonation information together with the determined emotion information.
Specifically, the method 200 further includes: determining voice information of a user in a video image or voice call; recognizing intonation information in the voice information; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
The intonation information refers to the tone of voice when speaking, which can reflect the mood, attitude, emotion and the like of the user, such as a rising tone or a serious tone.
The intonation information can also be determined directly by a model. The model is created in a manner similar to that described above and will not be repeated here, except that the training data for this model is different and may be voice information labeled with intonation.
Based on this, the intonation information can be determined, so that the corresponding special effect processing mode can be selected directly as described above, or determined together with the emotion information described above; details are not repeated here.
In the case where the multimedia information is based on images in a video conference, a voting function can also be realized, so that vote statistics are produced automatically when multiple users are in an online video.
Specifically, the method 200 further includes: identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; based on the recognized gestures, opinions that the respective participating users have a tendency to have in the video conference are determined.
The gesture may be a body posture or a hand gesture of the user, such as a hand-raised posture, a hands-down posture or a crossed-hands posture. These postures may correspond to different opinions: for example, the hand-raised posture may correspond to agreement and be called a positive opinion, the hands-down posture may correspond to disagreement and be called a negative opinion, and the crossed-hands posture may correspond to abstention, and so on.
The user's posture may be recognized through a preset posture recognition model. The training of the model is similar to that of the models described above and is not repeated here, except for the following: the training data may be pictures containing the preset postures, each labeled with the posture name; a neural network model is trained on these pictures to obtain the final posture recognition model. A video image is then input into the model, and the user's posture in the image can be determined.
Wherein determining, from the recognized gestures, an opinion that each participating user has a tendency in the video conference comprises: and determining the opinions of the participating users based on the recognized gestures and according to the preset corresponding relation between the gestures and the opinions.
As can be seen from the foregoing, different postures may express different opinions. The opinion represented by the posture adopted by a user is determined according to the preset correspondence between postures and opinions.
The posture may also reflect the user's emotional, fatigue or mental state; as mentioned above, for example, resting the chin on one hand may be determined as a bored emotion, and crossing both arms over the chest may be determined as a rejecting emotion.
To save computational resources and cost, gesture recognition may be performed when gesture recognition is triggered.
Specifically, the method 200 further includes: the step of recognizing the gesture of the corresponding user in each video image is performed in response to a voting operation or in response to a game interaction operation.
The triggering operation may be a voting operation. The voting operation may be a voice voting operation: if a user says "vote" in the video conference, the server may trigger the gesture recognition after recognizing the semantics of "vote" from the voice. It may also be an instruction-triggered operation: for example, a user may initiate a voting operation through an interface provided by the video conference APP on the mobile phone, such as clicking a voting button; the video conference APP responds to the operation by sending a voting request to the server, and the server triggers the voting operation when it receives the request.
In addition, the server can also perform gesture recognition on the video image at all times, and after a preset gesture is recognized, a voting operation is triggered, and voting results can be counted.
It should be noted that the above speech recognition can also follow the earlier description: after determining the semantics of the speech, for example the word vectors, the word vectors are matched against "vote", thereby recognizing the "vote" instruction.
After a gesture is recognized, statistics may be made on the number of opinions represented by the gesture to determine voting results.
Specifically, the method 200 further includes: determining the quantity corresponding to each opinion, and storing the quantity corresponding to the opinions; and sending the quantity corresponding to each opinion to a corresponding intelligent terminal for displaying.
For example, as described above, the server recognizes the user's gesture in each video image, and then performs statistics on the number of different gestures according to the different gestures.
FIG. 5 illustrates an interface 500 showing gestures in the video images. In this interface 500, the user gesture in the video image 301 and the video image 302 is a hand-raised gesture, which is counted as 2 votes in favour. The user gesture in the video image 303 is a hands-down gesture, which is counted as 1 vote against. The user gesture in the video image 304 is a crossed-hands gesture, which is counted as 1 abstention. The server stores and sends the voting result so that the users can view it on their mobile phones; the users can also download the voting result from the server through the video conference APP on the mobile phone.
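A minimal sketch of this tallying step, using the correspondence between gestures and opinions described above, might look as follows; the gesture labels are assumed outputs of the posture recognition model.

```python
from collections import Counter

# Illustrative correspondence between recognized gestures and opinions.
GESTURE_TO_OPINION = {
    "hand_raised": "in favour",
    "hands_down": "against",
    "hands_crossed": "abstain",
}

def tally_votes(recognized_gestures):
    """Count the opinion represented by each participant's recognized gesture."""
    opinions = [GESTURE_TO_OPINION.get(g, "unknown") for g in recognized_gestures]
    return Counter(opinions)

# One gesture per participating user's video image (cf. the example of FIG. 5).
print(tally_votes(["hand_raised", "hand_raised", "hands_down", "hands_crossed"]))
# Counter({'in favour': 2, 'against': 1, 'abstain': 1})
```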
In order to recognize the user's posture more reliably, and to allow a user who votes by mistake or changes his or her mind to correct the vote, the user's opinion may be finalized by setting a time range.
Specifically, the method 200 further includes: the time at which the user remains in the recognized gesture is determined, and the opinion of each participating user is determined based on the time.
For example, as described above, after recognizing a user's gesture, the server may determine how long the current gesture is held, for example from the number of consecutive images in which the gesture appears. This hold time is compared with a time threshold: when it is greater than or equal to the threshold, for example 3 s, the user's posture is accepted directly; when it is less than the threshold, the current gesture is determined to be invalid, that is, the user is treated as not having made the gesture. For example, if a user raises a hand for only 1 s and then puts it down, the raise is invalid and the user is treated as holding the opposite opinion.
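A sketch of this hold-time check, assuming the gesture duration is estimated from consecutive frames at an assumed frame rate, could be:

```python
# Finalize a vote only if the recognized gesture is held long enough.
# The frame rate and the 3-second threshold are illustrative values.
def gesture_is_valid(consecutive_frames_with_gesture, frame_rate=30.0, min_hold_seconds=3.0):
    hold_time = consecutive_frames_with_gesture / frame_rate
    return hold_time >= min_hold_seconds

print(gesture_is_valid(100))  # True  (~3.3 s at 30 fps)
print(gesture_is_valid(30))   # False (~1 s: the raise is treated as invalid)
```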
Further, in addition to the voting function, game interaction may be supported, such as determining a score for a user based on the speed of a preset gesture made by multiple people.
Specifically, the method 200 further includes: determining the starting time of the gesture, and determining the score of each user in game interaction according to the starting time; and storing the score, and sending the score to the corresponding intelligent terminal for display.
For example, as described above, the server may support a game interaction session when multiple people are in a video. Whether a user scores is determined by how quickly the user makes a preset gesture. For example, in a game the hand-raised gesture is the scoring gesture: after the game starts, the users make the preset gesture, such as raising a hand, as quickly as possible. The server performs gesture recognition on each video image, and the user whose video image first shows the gesture scores, for example 1 point. The scores of all users in the game are then counted and a final result, such as a score ranking, is determined. The ranking result is sent to the video conference APP on the users' mobile phones so that the users can view it.
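As an illustration of this scoring rule, the sketch below awards the round to the user whose correct gesture appears earliest; the timestamps, round start time and score bookkeeping are assumptions made only for the example.

```python
# Illustrative scoring: the participant whose video image first shows the correct
# preset gesture after the round starts earns the point.
def score_round(gesture_start_times, round_start_time):
    """gesture_start_times: {user: time at which the correct gesture first appears}."""
    valid = {u: t for u, t in gesture_start_times.items() if t >= round_start_time}
    if not valid:
        return None
    return min(valid, key=valid.get)  # earliest correct gesture wins the round

ranking_scores = {}
winner = score_round({"user_a": 10.8, "user_b": 10.5, "user_c": 11.2}, round_start_time=10.0)
ranking_scores[winner] = ranking_scores.get(winner, 0) + 1
print(winner, ranking_scores)  # user_b {'user_b': 1}
```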
It should be noted that the triggering condition of the game may be similar to the voting operation condition described above, and will not be described herein again.
Furthermore, for audio, the following processing may be provided:
wherein, determining the emotion information representing the emotion of the user in the multimedia information comprises: mood information representing a user's mood in the audio information is determined.
The emotion of the speaker can be determined by the intonation information in the voice information, which is not repeated here. The audio information may come from a voice call established by the server for at least two people: the server acquires the voice information of each user, processes it, and sends it to the APP on the terminal of the corresponding user for that user to hear.
In addition, the method 200 further comprises: determining semantic information in the audio information; and determining a corresponding processing mode of the audio information according to the semantic information.
Since the foregoing has been set forth, further description is omitted herein. Only the description is as follows: the semantic information is the semantics of the preceding text. Besides determining the emotion according to intonation, the emotion can also be determined according to the content of the voice information, namely determining semantic information.
The processing mode of the voice information may include applying corresponding filtering to the audio, such as noise reduction or noise addition, and strengthening or weakening the sound, that is, adjusting the frequency. For example, the frequency is reduced for voice information with a high frequency or with an agitated emotion; conversely, for voice information with a low mood and a low frequency, the frequency is increased. Noise reduction, noise addition and the like may be applied similarly. The processed voice information is then sent to the receiving party. Meanwhile, the speaking party can be provided with a special effect processing mode for adjusting the emotion, such as a funny sound clip or a piece of soothing music. This audio processing mode is also applicable to the audio processing of a video conference.
Specifically, for audio information, determining a corresponding processing mode of multimedia information according to emotion information includes: determining the adjustment frequency of audio in the audio information according to the emotion information; determining the adjustment noise of the audio in the audio information according to the emotion information; and/or providing a preset characteristic processing mode according to the emotion information.
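A minimal sketch of choosing such an audio processing mode from the emotion information follows; the pitch factors, noise flags and clip names are illustrative placeholders, not prescribed values.

```python
# Choose an audio processing mode from the emotion information.
# The pitch factors, noise flags and clip names are invented for illustration.
def audio_processing_mode(emotion):
    if emotion in ("angry", "agitated"):
        # Lower the frequency and reduce noise to soften agitated speech,
        # and optionally play a soothing clip to the speaking party.
        return {"pitch_factor": 0.9, "denoise": True, "clip_for_speaker": "soothing_music"}
    if emotion in ("low", "sad"):
        # Raise the frequency slightly for flat, low-energy speech.
        return {"pitch_factor": 1.1, "denoise": True, "clip_for_speaker": "cheerful_clip"}
    return {"pitch_factor": 1.0, "denoise": False, "clip_for_speaker": None}

print(audio_processing_mode("angry"))
```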
Since the foregoing has been set forth, further description is omitted herein, except that the preset special effect processing mode can be, for example, a piece of pleasant music or a laughter clip.
Based on the similar inventive concept, fig. 6 is a flowchart illustrating a multimedia information processing method according to another exemplary embodiment of the present application. The method 600 provided by the embodiment of the application is executed by an intelligent terminal, such as a mobile phone or a computer, and the method 600 includes the following steps:
601: acquiring multimedia information, and determining emotion information representing user emotion in the multimedia information.
602: and determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
603: and processing the multimedia information based on the processing mode so as to directly display the processed multimedia information.
Since the detailed description of the embodiment of step 601-603 has been described above, it is not repeated here. Only to illustrate, since the method 600 is executed by the intelligent terminal, the multimedia information can be directly displayed through the intelligent terminal after being processed.
Specifically, the acquiring of the multimedia information includes: receiving the multimedia information sent by the server. It should be noted here that, for any scene, the intelligent terminal may obtain multimedia information from the server, such as a video image of a live video, a video image of a video conference, or a video image of a preset video.
Specifically, determining emotion information representing the emotion of the user in the multimedia information includes: determining voice information of a user according to at least one frame of video image; and determining emotion information according to the voice information.
Specifically, determining a corresponding processing mode of the multimedia information according to the emotion information includes: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
In the case where the multimedia information is based on images in a video conference, the method 600 further comprises: identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; based on the recognized gestures, opinions that the respective participating users have a tendency to have in the video conference are determined.
In addition, the method 600 further comprises: determining the starting time of the gesture, and determining and displaying the scores of the users in the game interaction according to the starting time; the score is stored.
Since the foregoing has described specific embodiments, further description is omitted.
In addition, reference may also be made to the above-mentioned steps of the method 200 for details which are not described in the method 600.
Based on the similar inventive concept, fig. 7 shows a flowchart of a video live broadcast processing method according to another exemplary embodiment of the present application. The method 700 provided by the embodiment of the present application is executed by a server, such as a cloud server, and the method 700 includes the following steps:
701: acquiring a video live broadcast image, and determining emotion information representing anchor emotion in the video live broadcast image.
702: and determining a corresponding processing mode of the live video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
703: and processing the live video images based on the processing mode.
Since the detailed description of the embodiments of steps 701-703 has been set forth above, it is not repeated here. It is only noted that the method 700 may also be performed with an intelligent terminal as the execution subject; the specific implementation is similar to that described above and will not be repeated here.
In addition, the method 700 further comprises: determining a target object to which the emotion information is directed; when the target object is a user watching live video, determining the position of the user in the live video image; wherein, based on the processing mode, the live video image is processed, including: and processing the position according to the processing mode.
In addition, the method 700 further includes: and sending the processed live video images to an intelligent terminal corresponding to the user for display.
Specifically, the determining a target object to which the emotion information is directed includes: determining the voice information of the anchor according to at least one frame of video live broadcast image; determining corresponding semantics according to the voice information; and acquiring a target object identifier according to the semantics, and determining the target object based on the target object identifier.
It should be noted that the live video may be an educational live video, such as an online classroom, a teacher as a main broadcasting, and a classmate as a watching user. Or live e-commerce, including introduction of the owner of the goods and users of shopping goods, etc.
Since the foregoing has described specific embodiments, further description is omitted.
In addition, reference may also be made to various steps in the method 200 described above, where the method 700 is not described in detail.
Based on the similar inventive concept, fig. 8 is a flowchart illustrating a processing method of a video conference according to another exemplary embodiment of the present application. The method 800 provided by the embodiment of the present application is executed by a server, such as a cloud server, and the method 800 includes the following steps:
801: the method comprises the steps of obtaining video images corresponding to a plurality of users in a video conference, and determining the video image to which the user speaking belongs.
802: mood information indicative of the mood of the speaking user in the video image is determined.
803: and determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
804: and processing the video image based on the processing mode.
Since the detailed description of the embodiments of steps 801-804 has been set forth above, it is not repeated here. It is only noted that the method 800 may also be performed with an intelligent terminal as the execution subject; the specific implementation is similar to that described above and will not be repeated here.
In addition, the method 800 further includes: and sending the processed video images to the intelligent terminals corresponding to the users for display.
In addition, the method 800 further includes: determining a target object to which the emotion information faces; and taking the video image corresponding to the target object as an image to be processed for processing.
Specifically, the determining a target object to which the emotion information is directed includes: determining voice information of a speaking user according to at least one frame of video image of the speaking user; determining corresponding semantics according to the voice information; and acquiring a target object identifier according to the semantics, and determining the target object based on the target object identifier.
In the case where the video image is based on an image in a video conference, the method 800 further comprises: identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; based on the recognized gestures, opinions that the respective participating users have a tendency to have in the video conference are determined.
In addition, the method 800 further includes: determining the quantity corresponding to each opinion, and storing the quantity corresponding to the opinions; and sending the quantity corresponding to each opinion to a corresponding intelligent terminal for displaying.
In addition, the method 800 further includes: determining the starting time of the gesture, and determining the score of each user in game interaction according to the starting time; and storing the score, and sending the score to the corresponding intelligent terminal for display.
Specifically, determining a corresponding processing mode of the video image according to the emotion information includes: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
In addition, the method 800 further includes: creating a storage library of special effect processing modes; receiving a request from a user, selecting a corresponding special effect processing mode from a storage library based on the type of a preset video conference, and determining the type of the corresponding special effect processing mode; and selecting a special effect processing mode from the corresponding type according to the emotion information and the type of the acquired video conference.
In addition, the method 800 further includes: determining voice information of a speaking user; recognizing intonation information in the voice information; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
In addition, the method 800 may refer to the steps of the method 200, which are not described in detail.
Fig. 9 is a schematic structural framework diagram of a device for processing multimedia information according to an exemplary embodiment of the present application. The apparatus 900 may be applied to a server, such as a cloud server. The apparatus 900 includes: an acquisition module 901, a determination module 902 and a processing module 903; the following detailed description of the functions of the respective modules:
an obtaining module 901, configured to obtain the multimedia information, and determine emotion information indicating a user emotion in the multimedia information.
A determining module 902, configured to determine, according to the emotion information, a corresponding processing manner of the multimedia information, where the processing manner is used to reflect the emotion information.
And the processing module 903 is configured to process the multimedia information based on the processing mode.
Specifically, the obtaining module 901 is configured to obtain audio and/or video information of at least one participating user based on an audio and/or video conference; acquiring the audio and/or video information of the main broadcast based on the audio and/or video live broadcast; or, based on the preset audio and/or the preset video, the video information of the preset audio and/or the preset video is obtained.
Specifically, the obtaining module 901 includes: a first determining unit, configured to determine the voice information and/or expression information of the user according to at least one frame of video image, and to determine the emotion information according to the voice information and/or the expression information.
Specifically, the first determining unit is configured to determine a semantic meaning corresponding to the voice information; the mood information is determined according to semantics.
Specifically, the first determining unit is configured to: determining a label set of the voice information capable of expressing emotional tendency according to semantics; and selecting a corresponding emotion label from the label set as emotion information according to the semantics.
Specifically, the determining module 902 is configured to select, according to the emotion information, a special effect processing mode corresponding to the emotion information.
Further, the determining module 902 is further configured to: determining a target object to which the emotion information faces; the apparatus 900 further comprises: and the selection module is used for selecting a special effect processing mode corresponding to the emotion information according to the target object.
Specifically, the processing module 903 includes: the second determining unit is used for determining a target object oriented to the processing mode; and the processing unit is used for processing the video image of the target object based on the processing mode.
Further, the determining module 902 is further configured to: determining a target object in a video image; a processing module 903, further configured to: and processing the target object based on the processing mode.
In the case that the target object does not exist in the video image, the determining module 902 is further configured to: determining a position in the video image where the target object can appear; the processing module 903 is further configured to: and processing the position based on the processing mode.
Specifically, the determining module 902 is configured to: determining corresponding semantics according to the voice information; and acquiring a target object identifier according to the semantics, and determining the target object based on the target object identifier.
In addition, the apparatus 900 further comprises: and the providing module is used for providing the processed video to the corresponding intelligent terminal for displaying.
The special effect processing modes include: expression special effects, text special effects and animation special effects.
In the case where the multimedia information is based on images in a video conference, the apparatus 900 further comprises: the identification module is used for identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; a determining module 902 configured to: based on the recognized gestures, opinions that the respective participating users have a tendency to have in the video conference are determined.
Wherein the step of identifying a gesture of a corresponding user in each of the video images is performed in response to a voting operation or in response to a game interaction operation.
Specifically, the determining module 902 is configured to: and determining the opinions of the participating users based on the recognized gestures and according to the preset corresponding relation between the gestures and the opinions.
Further, a determining module 902 is configured to: determining the quantity corresponding to each opinion, and storing the quantity corresponding to the opinions; the apparatus 900 further comprises: and the sending module is used for sending the quantity corresponding to each opinion to the corresponding intelligent terminal for displaying.
Further, a determining module 902 is configured to: the time at which the user remains in the recognized gesture is determined, and the opinion of each participating user is determined based on the time.
Further, a determining module 902 is configured to: determining the starting time of the gesture, and determining the score of each user in game interaction according to the starting time; and the sending module is used for storing the scores and sending the scores to the corresponding intelligent terminal for display.
In addition, the apparatus 900 further comprises: the creation module is used for creating a storage library of the special effect processing mode; the selection module is also used for receiving a request from a user, selecting a corresponding special effect processing mode from the storage library based on the type of the video image, and determining the type of the corresponding special effect processing mode; and the selection module is also used for selecting a special effect processing mode from the corresponding type according to the emotion information and the type of the acquired video image.
Further, the determining module 902 is further configured to: determining the voice information of a user in a video image or voice call; the recognition module is also used for recognizing intonation information in the voice information; and the selection module is also used for selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
Further, the providing module is configured to: if it is determined according to the emotion information, the voice information and/or the scene that the preset video does not meet a preset condition, provide prompt information to prompt that a video clip, voice information and/or a person in the preset video needs to be processed.
Wherein the multimedia information comprises audio and/or video information.
Specifically, the determining module 902 is configured to determine emotion information representing an emotion of the user in the audio information.
Further, the determining module 902 is further configured to: determining semantic information in the audio information; and determining a corresponding processing mode of the audio information according to the semantic information.
Specifically, for the audio information, the determining module 902 includes: a third determining unit, configured to determine the adjustment frequency of the audio in the audio information according to the emotion information; a fourth determining unit, configured to determine the adjustment noise of the audio in the audio information according to the emotion information; and a providing unit, configured to provide a preset special effect processing mode according to the emotion information.
Fig. 10 is a schematic diagram illustrating a structural framework of an apparatus for processing multimedia information according to another exemplary embodiment of the present application. The device 1000 can be applied to an intelligent terminal, such as a mobile phone or a computer. The apparatus 1000 comprises: the functions of the acquisition module 1001, the determination module 1002, and the processing module 1003 are described in detail below:
an obtaining module 1001 is configured to obtain multimedia information and determine emotion information indicating a user emotion in the multimedia information.
The determining module 1002 is configured to determine a corresponding processing manner of the multimedia information according to the emotion information, where the processing manner is used to reflect the emotion information.
The processing module 1003 is configured to process the multimedia information based on the processing manner, so that the processed multimedia information is directly displayed.
Specifically, the obtaining module 1001 is specifically configured to: and receiving the multimedia information sent by the server.
Specifically, the obtaining module 1001 includes: the determining unit is used for determining the voice information of the user according to at least one frame of video image; and determining emotion information according to the voice information.
Specifically, the determining module 1002 is configured to: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
In the case where the multimedia information is based on images in a video conference, the apparatus 1000 further comprises: the identification module is used for identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; a determining module 1002 configured to: based on the recognized gestures, opinions that the respective participating users have a tendency to have in the video conference are determined.
Further, determining module 1002 is configured to: determining the starting time of the gesture, and determining and displaying the scores of the users in the game interaction according to the starting time; the apparatus 1000 further comprises: and the storage module is used for storing the score.
For parts of the content that cannot be mentioned by the apparatus 1000, reference may be made to the content of the apparatus 900 described above.
Fig. 11 shows a schematic structural framework diagram of a processing apparatus for live video according to yet another exemplary embodiment of the present application. The apparatus 1100 may be applied to a server, such as a cloud server. The apparatus 1100 comprises: the functions of the acquisition module 1101, the determination module 1102 and the processing module 1103 are described in detail below:
the obtaining module 1101 is configured to obtain a live video image, and determine emotion information indicating a main emotion in the live video image.
The determining module 1102 is configured to determine, according to the emotion information, a corresponding processing mode of the live video image, where the processing mode is used to reflect the emotion information.
And the processing module 1103 is configured to process the live video image based on the processing mode.
Furthermore, the determining module 1102 is further configured to: determining a target object to which the emotion information is directed; when the target object is a user watching live video, determining the position of the user in the live video image; wherein, the processing module 1103 is configured to: and processing the position according to the processing mode.
In addition, the apparatus 1100 further comprises: and the sending module is used for sending the processed live video images to the intelligent terminal corresponding to the user for display.
Specifically, the determining module 1102 includes: the determining unit is used for determining the voice information of the anchor according to at least one frame of video live broadcast image; determining corresponding semantics according to the voice information; and the acquisition unit is used for acquiring the target object identification in the semantic meaning and determining the target object based on the target object identification.
For parts of the content that cannot be mentioned by the apparatus 1100, reference may be made to the content of the apparatus 900 described above.
Fig. 12 is a schematic structural framework diagram illustrating a processing apparatus for video conference according to still another exemplary embodiment of the present application. The apparatus 1200 may be applied to a server, such as a cloud server. The apparatus 1200 includes: the acquiring module 1201, the determining module 1202, and the processing module 1203, the functions of each module are described in detail as follows:
the obtaining module 1201 is configured to obtain video images corresponding to multiple users in a video conference, and determine a video image to which a user currently speaking belongs.
A determining module 1202 for determining emotion information representing an emotion of the speaking user in the video image.
A determining module 1202, configured to determine, according to the emotion information, a corresponding processing mode of the video image, where the processing mode is used to reflect the emotion information.
And a processing module 1203, configured to process the video image based on the processing manner.
In addition, the apparatus 1200 further comprises: and the sending module is used for sending the processed video images to the intelligent terminals corresponding to the users for display.
Furthermore, a determining module 1202 for determining a target object to which the emotional information is directed; and taking the video image corresponding to the target object as an image to be processed for processing.
Specifically, the determining module 1202 is configured to: determining voice information of a speaking user according to at least one frame of video image of the speaking user; determining corresponding semantics according to the voice information; and acquiring a target object identifier according to the semantics, and determining the target object based on the target object identifier.
In the case where the video image is based on an image in a video conference, the apparatus 1200 further includes: the identification module is used for identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; a determining module 1202 for: based on the recognized gestures, opinions that the respective participating users have a tendency to have in the video conference are determined.
Further, the determining module 1202 is configured to: determining the quantity corresponding to each opinion, and storing the quantity corresponding to the opinions; and the sending module is used for sending the quantity corresponding to each opinion to the corresponding intelligent terminal for displaying.
Further, the determining module 1202 is configured to: determining the starting time of the gesture, and determining the score of each user in game interaction according to the starting time; the apparatus 1200 further comprises: and the storage module is used for storing the scores and sending the scores to the corresponding intelligent terminal for display.
Specifically, the determining module 1202 is specifically configured to: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
In addition, the apparatus 1200 further comprises: the creation module is used for creating a storage library of the special effect processing mode; the selection module is used for receiving a request from a user, selecting a corresponding special effect processing mode from the storage library based on the type of the preset video conference, and determining the type of the corresponding special effect processing mode; and the selection module is used for selecting a special effect processing mode from the corresponding type according to the emotion information and the acquired type of the video conference.
Furthermore, the determining module 1202 is further configured to determine the voice information of the speaking user; the recognition module is further configured to recognize intonation information in the voice information; and the selection module is further configured to select a corresponding special effect processing mode based on the emotion information and the intonation information.
For details of the apparatus 1200 that are not described here, reference may be made to the description of the apparatus 900 above.
While the internal functions and structures of the apparatus 900 shown in FIG. 9 have been described above, in one possible design, the structures of the apparatus 900 shown in FIG. 9 may be implemented as a computing device, such as a server. As shown in fig. 13, the apparatus 1300 may include: memory 1301, processor 1302, and communications component 1303;
a memory 1301 for storing a computer program.
And a communication component 1303 for obtaining the multimedia information.
A processor 1302 for executing the computer program for: determining emotion information representing emotion of a user in the multimedia information; determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the multimedia information based on the processing mode.
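For illustration only, and not as part of the claimed implementation, the flow executed by the processor 1302 may be outlined in Python roughly as follows; the function names, keyword rules, and effect names are hypothetical placeholders standing in for trained models and a real effect library:

    # Illustrative sketch only: emotion information -> processing mode -> processing.
    # detect_emotion is a toy stand-in for a trained emotion recognition model.
    EMOTION_TO_MODE = {
        "happy": {"effect": "animation", "name": "confetti"},
        "sad": {"effect": "text", "name": "comforting_caption"},
        "angry": {"effect": "expression", "name": "calming_emoji"},
    }

    def detect_emotion(voice_text):
        text = voice_text.lower()
        if any(w in text for w in ("congratulations", "great")):
            return "happy"
        if any(w in text for w in ("sorry", "unfortunately")):
            return "sad"
        return "angry"

    def select_processing_mode(emotion):
        # The processing mode is chosen so that it reflects the emotion information.
        return EMOTION_TO_MODE.get(emotion, {"effect": "none", "name": "none"})

    def process_multimedia(voice_text, frames):
        mode = select_processing_mode(detect_emotion(voice_text))
        # A real implementation would overlay the selected effect on each frame here.
        return mode, frames

    print(process_multimedia("Congratulations on the launch!", frames=[])[0])
    # -> {'effect': 'animation', 'name': 'confetti'}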
Specifically, the processor 1302 is specifically configured to: acquire audio and/or video information of at least one participating user based on an audio and/or video conference; acquire the audio and/or video information of the anchor based on a live audio and/or video broadcast; or acquire the audio and/or video information of the preset audio and/or the preset video based on the preset audio and/or the preset video.
Specifically, the processor 1302 is specifically configured to: determining voice information and/or expression information of a user according to at least one frame of video image; and determining emotion information according to the voice information and/or the expression information.
Specifically, the processor 1302 is specifically configured to: determining semantics corresponding to the voice information; and determining the emotion information according to the semantics.
Specifically, the processor 1302 is specifically configured to: according to the semantics, determining a label set of the voice information, which can express emotional tendency; and selecting a corresponding emotion label from the label set as emotion information according to the semantics.
Specifically, the processor 1302 is specifically configured to: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
Further, processor 1302 is further configured to: determining a target object to which the emotion information faces; and selecting a special effect processing mode corresponding to the emotion information according to the target object.
Specifically, the processor 1302 is specifically configured to: determining a target object oriented to a processing mode; and processing the video image of the target object based on the processing mode.
Further, processor 1302 is further configured to: determining a target object in a video image; and processing the target object based on the processing mode.
In the case where the target object is not present in the video image, the processor 1302 is further configured to: determining a position in the video image where the target object can appear; and processing the position based on the processing mode.
Specifically, the processor 1302 is specifically configured to: determining corresponding semantics according to the voice information; acquiring a target object identifier in the semantics; and determining the target object based on the target object identifier.
Further, processor 1302 is further configured to: and providing the processed video to a corresponding intelligent terminal for displaying.
The special effect processing modes include: expression special effects, text special effects, and animation special effects.
Further, processor 1302 is further configured to: and if the preset video is determined not to meet the preset condition according to the emotion information, the voice information and/or the scene, providing prompt information to prompt video clips, voice information and/or people in the preset video to be processed.
Wherein the multimedia information comprises audio and/or video information.
Specifically, the processor 1302 is specifically configured to: determine emotion information representing the user's emotion in the audio information.
Further, processor 1302 is configured to: determining semantic information in the audio information; and determining a corresponding processing mode of the audio information according to the semantic information.
Specifically, for the audio information, the processor 1302 is specifically configured to: determine, according to the emotion information, a frequency adjustment for the audio in the audio information; determine, according to the emotion information, a noise adjustment for the audio in the audio information; and/or provide a preset special effect processing mode according to the emotion information.
In the case that the multimedia information is based on images in a video conference, the processor 1302 is further configured to: identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; and determining, based on the recognized gestures, the opinions toward which the respective participating users lean in the video conference.
Further, processor 1302 is further configured to: the step of recognizing the gesture of the corresponding user in each video image is performed in response to a voting operation or in response to a game interaction operation.
Specifically, the processor 1302 is specifically configured to: and determining the opinions of the participating users based on the recognized gestures and according to the preset corresponding relation between the gestures and the opinions.
Further, processor 1302 is further configured to: determining the quantity corresponding to each opinion, and storing the quantity corresponding to the opinions; and sending the quantity corresponding to each opinion to a corresponding intelligent terminal for displaying.
Further, processor 1302 is further configured to: the time at which the user remains in the recognized gesture is determined, and the opinion of each participating user is determined based on the time.
Further, processor 1302 is further configured to: determining the starting time of the gesture, and determining the score of each user in game interaction according to the starting time; and storing the score, and sending the score to the corresponding intelligent terminal for display.
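A minimal sketch of the gesture-based voting and game scoring described above, assuming a fixed gesture-to-opinion mapping and a toy scoring rule in which an earlier gesture start time scores higher; none of these names or rules come from the embodiment itself:

    from collections import Counter

    # Assumed mapping between recognized gestures and opinions.
    GESTURE_TO_OPINION = {"thumbs_up": "agree", "thumbs_down": "disagree"}

    def tally_opinions(gesture_by_user):
        # Count how many participating users lean toward each opinion.
        return Counter(GESTURE_TO_OPINION.get(g, "abstain") for g in gesture_by_user.values())

    def game_scores(gesture_start_time, base=100):
        # Toy rule: the earlier the gesture starts, the higher the score.
        ranked = sorted(gesture_start_time, key=gesture_start_time.get)
        return {user: base - 10 * rank for rank, user in enumerate(ranked)}

    print(tally_opinions({"alice": "thumbs_up", "bob": "thumbs_down", "carol": "thumbs_up"}))
    print(game_scores({"alice": 3.2, "bob": 1.8, "carol": 2.5}))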
Further, processor 1302 is further configured to: creating a storage library of special effect processing modes; receiving a request from a user, selecting a corresponding special effect processing mode from a storage library based on the type of a preset video image, and determining the type of the corresponding special effect processing mode; and selecting a special effect processing mode from the corresponding type according to the emotion information and the type of the acquired video image.
Further, processor 1302 is further configured to: determining the voice information of a user in a video image or voice call; recognizing intonation information in the voice information; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
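For illustration only, the repository of special effect processing modes, the selection by conference or video type and emotion, and the refinement by intonation might be sketched as follows; the repository contents and the intonation rule are assumptions:

    # Assumed data model: effects grouped by conference type, keyed by emotion.
    REPOSITORY = {
        "formal_meeting": {"happy": "subtle_applause_icon", "sad": "neutral_frame"},
        "team_party": {"happy": "confetti_animation", "sad": "comfort_sticker"},
    }

    def select_effect(conference_type, emotion, intonation="normal"):
        effects = REPOSITORY.get(conference_type, {})
        effect = effects.get(emotion, "none")
        if intonation == "excited" and effect != "none":
            effect += "_emphasized"  # intonation further refines the chosen effect
        return effect

    print(select_effect("team_party", "happy", intonation="excited"))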
In addition, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing multimedia information in the method embodiment of Fig. 2.
While the internal functions and structures of the apparatus 1000 shown in fig. 10 have been described above, in one possible design, the structures of the apparatus 1000 shown in fig. 10 may be implemented as a computing device, such as a cell phone or a computer. As shown in fig. 14, the apparatus 1400 may include: memory 1401, processor 1402, and communications component 1403;
a memory 1401 for storing the computer program.
A communication component 1403 for obtaining multimedia information.
A processor 1402 for executing the computer program for: determining emotion information representing emotion of a user in the multimedia information; determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the multimedia information based on the processing mode so as to directly display the processed multimedia information.
Specifically, the processor 1402 is specifically configured to: and receiving the multimedia information sent by the server.
Specifically, the processor 1402 is specifically configured to: determining voice information of a user according to at least one frame of video image; and determining emotion information according to the voice information.
Specifically, the processor 1402 is specifically configured to: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
In the case where the multimedia information is based on images in a video conference, the processor 1402 is further configured to: identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; and determining, based on the recognized gestures, the opinions toward which the respective participating users lean in the video conference.
Further, the processor 1402 is further configured to: determining the starting time of the gesture, and determining and displaying the scores of the users in the game interaction according to the starting time; the score is stored.
It should be noted that, for details of the apparatus 1400 that are not described here, reference may be made to the description of the apparatus 1300 above.
In addition, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing multimedia information in the method embodiment of Fig. 6.
While the internal functions and structures of the apparatus 1100 shown in FIG. 11 have been described above, in one possible design, the structures of the apparatus 1100 shown in FIG. 11 may be implemented as a computing device, such as a cell phone or a computer. As shown in fig. 15, the apparatus 1500 may include: memory 1501, processor 1502, and communications component 1503;
a memory 1501 is used for storing computer programs.
A communication component 1503 for obtaining the video live image.
A processor 1502 for executing the computer program for: determining emotion information representing the anchor's emotion in a live video image; determining a corresponding processing mode of the live video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the live video image based on the processing mode.
Further, the processor 1502 is further configured to: determining a target object to which the emotion information faces; when the target object is a user watching live video, determining the position of the user in the live video image; the processor 1502 is specifically configured to: and processing the position according to the processing mode.
Further, the processor 1502 is further configured to: and sending the processed live video images to an intelligent terminal corresponding to the user for display.
Specifically, the processor 1502 is specifically configured to: determining voice information of the anchor according to at least one frame of live video image; determining corresponding semantics according to the voice information; and acquiring a target object identifier according to the semantics, and determining the target object based on the target object identifier.
For details of the apparatus 1500 that are not described here, reference may be made to the description of the apparatus 1300 above.
In addition, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing live video in the method embodiment of Fig. 7.
While the internal functions and structures of the apparatus 1200 shown in FIG. 12 are described above, in one possible design, the structures of the apparatus 1200 shown in FIG. 12 may be implemented as a computing device, such as a cell phone or a computer. As shown in fig. 16, the apparatus 1600 may include: memory 1601, processor 1602, and communication component 1603;
a memory 1601 for storing a computer program.
A communication component 1603, configured to obtain video images corresponding to multiple users in the video conference.
A processor 1602 for executing a computer program for: determining a video image to which a user speaking belongs; determining emotion information representing an emotion of a speaking user in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the video image based on the processing mode.
Further, the processor 1602 is further configured to: and sending the processed video images to the intelligent terminals corresponding to the users for display.
Further, the processor 1602 is further configured to: determining a target object to which the emotion information faces; and taking the video image corresponding to the target object as an image to be processed for processing.
Specifically, the processor 1602 is specifically configured to: determining voice information of a speaking user according to at least one frame of video image of the speaking user; determining corresponding semantics according to the voice information; and acquiring a target object identifier according to the semantics, and determining the target object based on the target object identifier.
In the case that the video image is based on an image in a video conference, the processor 1602 is further configured to: identifying the gesture of the corresponding user in each video image based on the acquired video images of the participating users; and determining, based on the recognized gestures, the opinions toward which the respective participating users lean in the video conference.
Further, the processor 1602 is further configured to: determining the quantity corresponding to each opinion, and storing the quantity corresponding to the opinions; and sending the quantity corresponding to each opinion to a corresponding intelligent terminal for displaying.
Further, the processor 1602 is further configured to: determining the starting time of the gesture, and determining the score of each user in game interaction according to the starting time; and storing the score, and sending the score to a corresponding intelligent terminal for displaying.
Specifically, the processor 1602 is specifically configured to: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
Further, the processor 1602 is further configured to: creating a storage library of special effect processing modes; receiving a request from a user, selecting a corresponding special effect processing mode from a storage library based on the type of the video conference, and determining the type of the corresponding special effect processing mode; and selecting a special effect processing mode from the corresponding type according to the emotion information and the type of the acquired video conference.
Further, the processor 1602 is further configured to: determining voice information of a speaking user; recognizing intonation information in the voice information; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
It should be noted that, for details of the apparatus 1600 that are not described here, reference may be made to the description of the apparatus 1300 above.
In addition, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing a video conference in the method embodiment of Fig. 8.
Another exemplary embodiment of the present application further provides a flowchart of a video processing method. The method 1700 provided in the embodiment of the present application is executed by an intelligent terminal, such as a computer, and the method 1700 includes the following steps:
1701: and acquiring a preset video, and determining emotion information representing the emotion of a person in the video image.
1702: and determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
1703: and processing the video image based on the processing mode.
Since the embodiments of steps 1701-1703 have been described in detail above, they are not repeated here. For illustration only, the method 1700 is directed to a preset video, such as a variety show video, a movie commentary video, or a documentary. The video can be clipped automatically for the user on the clipping interface, or auxiliary operations can be provided for the user in the process of actively clipping the video.
Specifically, determining a corresponding processing mode of the video image according to the emotion information includes: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
In addition, the method 1700 further includes: determining voice information of people in the video image; recognizing intonation information in the voice information; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
Since the foregoing has described specific embodiments, further description is omitted.
In addition, the method 1700 further includes: determining a scene in the video image under the condition that no person exists in the video image; and determining a special effect processing mode corresponding to the scene according to the scene.
When there is no person in the video image, the video image can be recognized directly; for example, the scene to which the video image belongs, such as an outdoor scene or an indoor scene, can be identified, and can be further subdivided, such as an indoor library scene or an outdoor garden scene. The recognition can still be performed with a model, whose creation is similar to that described above and is not repeated here. After the recognition, a special effect processing mode corresponding to the scene can be selected; for example, for an outdoor park scene, a piece of text, birdsong and flower ambience, and the like can be added. Details are not described again here.
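A small sketch of this fallback, assuming a stubbed scene classifier and an invented scene-to-effect table; neither belongs to the embodiment itself:

    # When no person is detected, classify the scene and pick a scene-matched effect.
    SCENE_EFFECTS = {
        "outdoor_park": ["birdsong_audio", "floral_caption"],
        "indoor_library": ["quiet_caption"],
    }

    def classify_scene(frame):
        # Stand-in for the scene recognition model described above.
        return "outdoor_park"

    def effects_for_frame(frame, has_person):
        if has_person:
            return []  # the person branch is handled by the emotion pipeline
        return SCENE_EFFECTS.get(classify_scene(frame), [])

    print(effects_for_frame(frame=None, has_person=False))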
In addition, the method 1700 further includes: creating a storage library of special effect processing modes; receiving a request from a user, selecting a corresponding special effect processing mode from a storage library based on the type of the video, and determining the type of the corresponding special effect processing mode; and selecting a special effect processing mode from the corresponding types according to the emotion information and the types of the acquired preset videos.
In addition, the method 1700 further includes: based on the type of the corresponding special effect processing mode, creating a storage library of the corresponding type of special effect processing mode; and selecting a corresponding storage library according to the type of the preset video so as to select a special effect processing mode.
Since the foregoing has described specific embodiments, further description is omitted.
In addition, the method 1700 further includes: providing a repository; receiving a request from a user, and adding or deleting a corresponding special effect processing mode from a storage library; or receiving a request from a user, and adding or deleting the corresponding storage library.
For the repository described above, the repository may be provided to the user by way of an interface. Through the repository shown on the interface, the user can see the various special effect processing modes under it, and can accordingly delete or add special effect processing modes. This is realized based on the user's addition and deletion operations, the specific manner of which is not described again. In addition, the user may also add and delete repositories through addition and deletion operations.
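For illustration, the add and delete operations on effect repositories could be as simple as the following in-memory sketch; the dictionary-of-sets model and the example names are assumptions:

    # Toy in-memory repositories keyed by repository name.
    repositories = {"default": {"confetti", "laugh_track"}}

    def add_effect(repo, effect):
        repositories.setdefault(repo, set()).add(effect)

    def delete_effect(repo, effect):
        repositories.get(repo, set()).discard(effect)

    def delete_repository(repo):
        repositories.pop(repo, None)

    add_effect("variety_show", "applause")
    delete_effect("default", "laugh_track")
    print(repositories)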
In addition, the method 1700 further includes: and if the preset video is determined not to meet the preset condition according to the emotion information, the voice information and/or the scene, providing prompt information to prompt video clips, voice information and/or people in the preset video to be processed.
In addition, through recognition of the above information, the method 1700 may determine whether the recognized emotion information, voice information, and/or scene contradicts the overall atmosphere requirement of the preset video, for example, which objects, persons, languages, scenes, and the like are not allowed to appear in the video. When there is a contradiction, the computer can remind the user to cut a certain video segment, apply a mosaic to the video images, or add noise to the voice, and so on. When a person or an object that should not appear in the video image does appear, the user is reminded to cut it out or process it with a mosaic. The reminder can be provided through prompt information, such as a prompt on the clipping interface.
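For illustration only, such a check against preset conditions might look like the following sketch, where the lists of disallowed items and the prompt wording are invented:

    # Check the recognized emotion, speech tags and scene against disallowed items
    # and produce editing prompts for the clipping interface.
    DISALLOWED = {"scene": {"casino"}, "speech": {"profanity"}, "emotion": {"furious"}}

    def compliance_prompts(emotion, speech_tags, scene):
        prompts = []
        if scene in DISALLOWED["scene"]:
            prompts.append("Scene '%s' is not allowed: cut or mosaic the segment." % scene)
        if speech_tags & DISALLOWED["speech"]:
            prompts.append("Speech contains disallowed content: add noise or mute it.")
        if emotion in DISALLOWED["emotion"]:
            prompts.append("Emotion conflicts with the intended atmosphere: review the clip.")
        return prompts

    print(compliance_prompts("furious", {"profanity"}, "casino"))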
In addition, the method 1700 may refer to the steps of the method 200, which are not described in detail.
An exemplary embodiment of the present application provides a schematic structural framework diagram of a video processing apparatus. The apparatus 1800 may be applied to an intelligent terminal, such as a computer. The apparatus 1800 includes: an acquisition module 1801, a determination module 1802, and a processing module 1803; the following detailed description of the functions of the respective modules:
an obtaining module 1801, configured to obtain a preset video and determine emotion information indicating an emotion of a person in the video image.
A determining module 1802, configured to determine, according to the emotion information, a corresponding processing manner of the video image, where the processing manner is used to reflect the emotion information.
A processing module 1803, configured to process the video image based on the processing manner.
Specifically, the determining module 1802 is configured to select, according to the emotion information, a special effect processing mode corresponding to the emotion information.
In addition, the determining module 1802 is further configured to determine voice information of a person in the video image; further, the apparatus 1800 further comprises: the recognition module is used for recognizing intonation information in the voice information; and the selection module is used for selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
Further, the determining module 1802 is further configured to determine a scene in the video image if no person is present in the video image; and determining a special effect processing mode corresponding to the scene according to the scene.
Further, the apparatus 1800 further comprises: the creation module is used for creating a storage library of the special effect processing mode; the selection module is used for receiving a request from a user, selecting a corresponding special effect processing mode from the storage library based on the type of the video, and determining the type of the corresponding special effect processing mode; and the selection module is used for selecting a special effect processing mode from the corresponding type according to the emotion information and the type of the acquired preset video.
In addition, the creating module is further used for creating a storage library of the special effect processing modes of the corresponding types based on the types of the corresponding special effect processing modes; and the selection module is also used for selecting the corresponding storage library according to the type of the preset video so as to select a special effect processing mode.
Further, the apparatus 1800 further comprises: a providing module for providing a repository; the modification module is used for receiving a request from a user and adding or deleting a corresponding special effect processing mode from the storage library; or receiving a request from a user, and adding or deleting the corresponding storage library.
In addition, the determining module 1802 is further configured to determine that the preset video does not meet the preset condition according to the emotion information, the voice information, and/or the scene, and provide a prompt message to prompt a video clip, the voice information, and/or a person in the preset video to be processed.
For details of the apparatus 1800 that are not described here, reference may be made to the description of the apparatus 900 above.
While the internal functionality and structure of the apparatus 1800 are described above, in one possible design, the structure of the apparatus 1800 may be implemented as a computing device, such as a computer. The device 1900 may include: a memory 1901, a processor 1902, and a communications component 1903;
a memory 1901 for storing a computer program.
A communication component 1903 for obtaining the preset video.
A processor 1902 for executing the computer program for: determining emotion information representing the emotion of a person in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode reflects emotion information; and processing the video image based on the processing mode.
Specifically, the processor 1902 is specifically configured to: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
Further, the processor 1902 is further configured to: determining voice information of people in the video image; furthermore, tone information in the voice information is recognized; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
Further, the processor 1902 is further configured to: determining a scene in the video image under the condition that no person exists in the video image; and determining a special effect processing mode corresponding to the scene according to the scene.
Further, the processor 1902 is further configured to: creating a storage library of special effect processing modes; receiving a request from a user, selecting a corresponding special effect processing mode from the storage library based on the type of the video, and determining the type of the corresponding special effect processing mode; and selecting a special effect processing mode from the corresponding type according to the emotion information and the type of the acquired preset video.
Further, the processor 1902 is further configured to: creating a storage library of the special effect processing modes of the corresponding types based on the types of the corresponding special effect processing modes; and selecting a corresponding storage library according to the type of the preset video so as to select a special effect processing mode.
Further, the processor 1902 is further configured to: providing a repository; receiving a request from a user, and adding or deleting a corresponding special effect processing mode from a storage library; or receiving a request from a user, and adding or deleting the corresponding storage library.
Further, the processor 1902 is further configured to: and if the preset video is determined not to meet the preset condition according to the emotion information, the voice information and/or the scene, providing prompt information to prompt video clips, voice information and/or people in the preset video to be processed.
Note that, for details of the apparatus 1900 that are not described here, reference may be made to the description of the apparatus 1300 above.
Additionally, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing video in the method 1700 embodiment.
Another exemplary embodiment of the present application further provides a flowchart of an audio processing method. The method 2000 provided by the embodiment of the present application is executed by a server, such as a cloud server, and the method 2000 includes the following steps:
2001: and acquiring voice information in the voice call, and determining emotion information representing user emotion in the voice information.
2002: and determining a corresponding processing mode of the voice information according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
2003: and processing the voice information based on the processing mode.
Since the embodiments of steps 2001-2003 have been described in detail above, they are not repeated here. For example, the voice call may be a voice call in online instant messaging, and may be a multi-person call or a one-to-one call. The emotion of the speaker can be determined from intonation information in the voice information, which is not described in detail here. The call may be a voice call established by the server for at least two people; the server acquires the voice information of each user, processes it, and sends it to the APP on the terminal of the corresponding user for the user to obtain.
Further, the method 2000 further comprises: determining semantic information in the voice information; and determining a corresponding processing mode of the voice information according to the semantic information.
Since this has been set forth above, it is not described further here, except to note that the semantic information refers to the semantics mentioned above. Besides determining the emotion according to intonation, the emotion can also be determined according to the content of the voice information, that is, by determining the semantic information.
The processing mode of the voice information may include performing corresponding filtering on the audio, such as noise reduction or noise addition, and strengthening or weakening the sound, that is, adjusting the frequency. For example, the frequency is reduced for voice information with a high frequency or with an agitated emotion; conversely, for voice information with a low emotion and a low frequency, the frequency is increased. Correspondingly, noise reduction, noise addition, and the like may also be performed. The processed voice information is then sent to the receiving party. Meanwhile, the speaking party can be provided with a special effect processing mode for adjusting the emotion, such as a funny sound clip or a piece of lyrical music. This audio processing mode is also suitable for the audio processing of a video conference.
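A minimal sketch of this adjustment logic, assuming a crude pitch threshold and treating a relative pitch shift as the frequency adjustment; the thresholds, emotion labels, and return fields are invented for illustration:

    def audio_adjustment(emotion, pitch_hz):
        # Calm an agitated, high-pitched voice; lift a flat, low-pitched one.
        if emotion in ("excited", "angry") or pitch_hz > 250:
            return {"pitch_shift": -0.1, "noise": "reduce"}
        if emotion in ("sad", "flat") or pitch_hz < 120:
            return {"pitch_shift": 0.1, "noise": "none", "extra": "lyrical_music"}
        return {"pitch_shift": 0.0, "noise": "none"}

    print(audio_adjustment("excited", pitch_hz=300))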
Specifically, determining a corresponding processing mode of the voice information according to the emotion information includes: determining, according to the emotion information, a frequency adjustment for the audio in the voice information; determining, according to the emotion information, a noise adjustment for the audio in the voice information; and/or providing a preset special effect processing mode according to the emotion information.
Since the foregoing has been set forth, further description is omitted here, except to note that the preset special effect processing mode can be a piece of pleasant music, a laughter sound clip, and the like.
In addition, reference may also be made to the various steps in the method 200 described above for details of the method 2000 that are not described in detail.
An exemplary embodiment of the present application provides a schematic structural framework diagram of an audio processing apparatus. The apparatus 2100 may be applied to a server, such as a cloud server. The apparatus 2100 includes: an acquisition module 2101, a determination module 2102 and a processing module 2103; the following detailed description is directed to the functions of the various modules:
an obtaining module 2101 is configured to obtain the voice information in the voice call, and determine emotion information indicating a user emotion in the voice information.
A determining module 2102, configured to determine a corresponding processing manner of the voice information according to the emotion information, where the processing manner is used to reflect the emotion information.
A processing module 2103, configured to process the voice information based on the processing manner.
Furthermore, the determining module 2102 is further configured to determine semantic information in the voice information; and determining a corresponding processing mode of the voice information according to the semantic information.
Specifically, the determining module 2102 includes: an adjusting unit, configured to adjust the frequency of the audio in the voice information according to the emotion information, and to adjust the noise of the audio in the voice information according to the emotion information; and/or a providing unit, configured to provide a preset special effect processing mode according to the emotion information.
For details of the apparatus 2100 that are not described here, reference may be made to the description of the apparatus 900 above.
While the internal functions and structures of the apparatus 2100 are described above, in one possible design, the structures of the apparatus 2100 may be implemented as a computing device, such as a server. The apparatus 2200 may comprise: memory 2201, processor 2202, and communications component 2203;
a memory 2201 for storing a computer program.
A communication component 2203 for obtaining voice information in a voice call.
A processor 2202 for executing the computer program to: determining emotion information representing emotion of the user in the voice information; determining a corresponding processing mode of the voice information according to the emotion information, wherein the processing mode reflects emotion information; and processing the voice information based on the processing mode.
Further, the processor 2202 is further configured to: determining semantic information in the voice information; and determining a corresponding processing mode of the voice information according to the semantic information.
Specifically, the processor 2202 is specifically configured to: adjusting the frequency of the audio in the voice information according to the emotion information; adjusting the noise of the audio in the voice information according to the emotion information; and/or providing a preset special effect processing mode according to the emotion information.
It should be noted that, for details of the apparatus 2200 that are not described here, reference may be made to the description of the apparatus 1300 above.
Additionally, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing audio in the method 2000 embodiment.
Another exemplary embodiment of the present application further provides a flowchart of a video processing method. The method 2300 provided by the embodiment of the application is executed by a server, such as a cloud server, and the method 2300 includes the following steps:
2301: the method comprises the steps of obtaining a video image in a video call, and determining emotion information representing user emotion in the video image.
2302: and determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
2303: and processing the video image based on the processing mode.
Since the steps 2301-2303 have been described in detail above, the detailed description is omitted here. For illustration only, the video call may be one form of communication.
The video call includes a party video call in which multiple persons participate and an evening-party video call in which multiple persons participate.
In addition, reference may also be made to the above-mentioned steps of the method 200 for details of the method 2300 that are not described in detail.
An exemplary embodiment of the present application provides a schematic structural framework diagram of a video processing apparatus. The apparatus 2400 may be applied to a server, such as a cloud server. The apparatus 2400 includes: an acquisition module 2401, a determination module 2402 and a processing module 2403; the following detailed description is directed to the functions of the various modules:
an obtaining module 2401, configured to obtain a video image in a video call, and determine emotion information indicating a user emotion in the video image.
A determining module 2402, configured to determine, according to the emotion information, a corresponding processing manner of the video image, where the processing manner is used for reflecting the emotion information.
The processing module 2403 is configured to process the video image based on the processing mode.
The video call includes a party video call in which a plurality of persons participate and an evening-party video call in which multiple persons participate.
For details of the apparatus 2400 that are not described here, reference may be made to the description of the apparatus 900 above.
While the internal functions and structures of device 2400 have been described above, in one possible design, the structures of device 2400 may be implemented as a computing device, such as a computer. The apparatus 2500 may include: memory 2501, processor 2502, and communications component 2503;
the memory 2501 stores a computer program.
A communication component 2503, configured to obtain a video image in a video call.
A processor 2502 for executing the computer program for: acquiring a video image in a video call, and determining emotion information representing user emotion in the video image; determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the video image based on the processing mode.
The video call includes a party video call in which multiple persons participate and an evening-party video call in which multiple persons participate.
For details of the apparatus 2500 that are not described here, reference may be made to the description of the apparatus 1300 above.
In addition, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing video in the method 2300 embodiment.
Another exemplary embodiment of the present application further provides a flowchart of an audio processing method. The method 2600 provided by the embodiment of the present application is executed by a server, such as a cloud server, and the method 2600 includes the following steps:
2601: acquiring singing audio and determining emotion information representing emotion of a singing user.
2602: and determining a corresponding processing mode according to the emotion information, wherein the processing mode reflects the emotion information.
2603: processing is performed on the singing audio based on the processing mode.
Since the steps 2601-2603 have been described in detail above, the detailed description is omitted here. For illustration only, for step 2603, the processing may be performed on the singing interface.
Only the description is as follows: the method 2600 can be applied to on-line KTV. The server may obtain the singing audio of each user participating in the KTV in the room. The user can participate through APP of KTV installed on the mobile phone. And providing a user participation interface and displaying the characteristic processing mode provided by the server on the interface.
In addition, the method 2600 further comprises: scoring the singing audio; and determining a corresponding processing mode based on the scoring result.
The scoring can be performed based on a scoring model, and the model can be trained by comparing the singing audio with the original audio. The training process has been described above and is not repeated here.
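As a toy illustration of the comparison idea only (not the trained scoring model itself), per-frame pitch values of the singing could be compared with those of the original audio; the scoring formula and numbers are assumptions:

    def score_singing(sung_pitch, reference_pitch):
        # Smaller average pitch error against the original audio -> higher score.
        pairs = list(zip(sung_pitch, reference_pitch))
        if not pairs:
            return 0.0
        avg_error = sum(abs(s - r) for s, r in pairs) / len(pairs)
        return max(0.0, 100.0 - avg_error)

    print(round(score_singing([220, 230, 240], [220, 232, 245]), 1))  # -> 97.7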
In addition, reference may also be made to the steps of the method 200 described above for details of the method 2600.
An exemplary embodiment of the present application provides a schematic structural framework diagram of an audio processing apparatus. The device 2700 can be applied to a server, such as a cloud server. The apparatus 2700 includes: an acquisition module 2701, a determination module 2702, and a processing module 2703; the following detailed description is directed to the functions of the various modules:
an obtaining module 2701 is configured to obtain a singing audio and determine emotion information indicating an emotion of a singing user.
The determining module 2702 is configured to determine a corresponding processing manner according to the emotion information, where the processing manner reflects the emotion information.
A processing module 2703, configured to perform processing on the singing audio based on the processing manner.
Further, the apparatus 2700 further comprises: the scoring module is used for scoring the singing audio; a determining module 2702, configured to determine a corresponding processing manner based on the scoring result.
For details of the apparatus 2700 that are not described here, reference may be made to the description of the apparatus 900 above.
While the internal functions and structures of the apparatus 2700 are described above, in one possible design, the structure of the apparatus 2700 may be implemented as a computing device, such as a server. The device 2800 may include: a memory 2801, a processor 2802, and a communications component 2803;
a memory 2801 for storing a computer program.
A communication component 2803 for obtaining singing audio.
A processor 2802 to execute the computer program to: determining emotion information representing emotion of a singing user; determining a corresponding processing mode according to the emotion information, wherein the processing mode reflects the emotion information; processing is performed on the singing audio based on the processing mode.
Further, processor 2802 is further configured to: scoring the singing audio; and determining a corresponding processing mode based on the scoring result.
For details of the apparatus 2800 that are not described here, reference may be made to the description of the apparatus 1300 above.
Additionally, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing audio in the method 2600 embodiment.
Another exemplary embodiment of the present application further provides a flowchart of a video processing method. The method 2900 provided by the embodiment of the present application is executed by a server, such as a cloud server, and the method 2900 includes the following steps:
2901: and obtaining an AR video image, and determining emotion information representing the emotion of the user in the AR video image.
2902: and determining a corresponding processing mode of the AR video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
2903: and processing the AR video image based on the processing mode.
Since the embodiments of steps 2901-2903 have been described in detail above, they are not repeated here. It is only noted that the video is an AR (Augmented Reality) video, such as an AR video conference. Viewing may be done by wearing an AR device, for example by presenting the video conference images of the respective participating users within images of each user's real scene. The image of the real scene can be acquired by the user's mobile phone. Special effect processing can thus be performed on the AR video.
Further, the method 2900 includes: receiving a request from a user, and acquiring an action corresponding to the request; and executing the action, and displaying the action result in the AR video image.
In the AR video, the user can also perform AR video interaction, and the AR interaction is realized based on the user's interaction request. For example, for screen casting in an AR video conference, the user can trigger a screen casting operation, and based on this operation a screen casting request is sent to the server through the video conference APP on the mobile phone; the request may carry the image to be cast. After receiving the screen casting request, the server may send the cast image to the video conference APP on each receiver's mobile phone for display. This display mode also belongs to the AR display mode, which is similar to the foregoing and is not described again.
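The message flow can be pictured with the following sketch; the request field names and the fan-out format are assumptions for illustration, not a real conferencing API:

    import json

    def handle_cast_request(request_json, participants):
        request = json.loads(request_json)
        image = request["cast_image"]  # assumed field carried by the screen casting request
        # The server forwards the cast image to every receiver's conference client.
        return [{"to": user, "type": "ar_cast", "image": image} for user in participants]

    print(handle_cast_request('{"cast_image": "slide_01.png"}', ["alice", "bob"]))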
Specifically, determining a corresponding processing mode of the AR video image according to the emotion information includes: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
Further, the method 2900 includes: determining voice information of a user in a video image; recognizing intonation information in the voice information; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
Since the foregoing has been set forth, further description is omitted herein.
In addition, the method 2900 may refer to the steps of the method 200, which are not described in detail.
An exemplary embodiment of the present application provides a schematic structural framework diagram of a video processing apparatus. The device 3000 can be applied to an intelligent terminal such as a computer. The apparatus 3000 includes: an acquisition module 3001, a determination module 3002 and a processing module 3003; the following detailed description is directed to the functions of the various modules:
an obtaining module 3001, configured to obtain an AR video image, and determine emotion information indicating a user emotion in the AR video image.
A determining module 3002, configured to determine, according to the emotion information, a corresponding processing manner of the AR video image, where the processing manner is used to reflect the emotion information.
The processing module 3003 is configured to process the AR video image based on the processing mode.
In addition, the obtaining module 3001 is further configured to receive a request from a user, and obtain an action corresponding to the request; the apparatus 3000 further comprises: and the execution module is used for executing the action and displaying the action result in the AR video image.
Specifically, the determining module 3002 is configured to select, according to the emotion information, a special effect processing mode corresponding to the emotion information.
Further, the determining module 3002 is further configured to: determining voice information of a user in a video image; recognizing intonation information in the voice information; the apparatus 3000 further comprises: and the selection module is used for selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
For details of the apparatus 3000 that are not described here, reference may be made to the description of the apparatus 900 above.
While the internal functions and structures of the apparatus 3000 have been described above, in one possible design, the structure of the apparatus 3000 may be implemented as a computing device, such as a computer. The device 3100 may include: a memory 3101, a processor 3102, and a communication component 3103;
the memory 3101 stores a computer program.
A communication component 3103 for acquiring AR video images.
A processor 3102 for executing the computer program to: determining emotion information representing a user emotion in the AR video image; determining a corresponding processing mode of the AR video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information; and processing the AR video image based on the processing mode.
The communication module 3103 is further configured to receive a request from a user, and obtain an action corresponding to the request; processor 3102, which is further configured to: and executing the action, and displaying the action result in the AR video image.
Specifically, the processor 3102 is specifically configured to: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
Further, the processor 3102 is further configured to: determining voice information of a user in a video image; recognizing intonation information in the voice information; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
For details of the device 3100 that are not described here, reference may be made to the description of the apparatus 1300 above.
Additionally, an embodiment of the present invention provides a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the method for processing video in the method 2900 embodiment.
Another exemplary embodiment of the present application further provides a flowchart of a video processing method. The method 3200 provided by the embodiment of the present application is executed by a server, such as a cloud server, and the method 3200 includes the following steps:
3201: and acquiring a VR video image, and determining emotion information representing the emotion of the user in the VR video image.
3202: and determining a corresponding processing mode of the VR video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information.
3203: and processing the VR video image based on the processing mode.
Since steps 3201-3203 have been described in detail above, they are not repeated here. It is only noted that the video is a VR (Virtual Reality) video. Taking a video conference as an example, the server may create a virtual video conference, receive the video images of the users participating in the video conference, and place those video images in a virtual space, such as a 3D conference room, or a virtual venue such as a 3D gala scene. The VR video conference images are then sent to each receiver for viewing.
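For illustration only, placing participant video tiles into a virtual room before rendering could be sketched as follows; the seat coordinates and layout rule are made up:

    # Assign each participant's video tile to a seat (x, y, z) in the virtual room.
    SEATS = [(-1.0, 0.0, 2.0), (0.0, 0.0, 2.5), (1.0, 0.0, 2.0)]

    def layout_vr_room(user_ids):
        return {user: SEATS[i % len(SEATS)] for i, user in enumerate(user_ids)}

    print(layout_vr_room(["alice", "bob", "carol"]))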
Further, the method 3200 further comprises: receiving a request from a user, and acquiring an action corresponding to the request; and executing the action, and displaying the action result in the VR video image.
Specifically, according to the emotion information, determining a corresponding processing mode of the VR video image includes: and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
Further, the method 3200 further comprises: determining voice information of a user in a video image; recognizing intonation information in the voice information; and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
In addition, the method 3200 may refer to the steps of the method 200, which are not described in detail.
An exemplary embodiment of the present application provides a schematic structural framework diagram of a video processing apparatus. The apparatus 3300 may be applied to a server, such as a cloud server. The apparatus 3300 includes: an acquisition module 3301, a determination module 3302, and a processing module 3303; the following detailed description is directed to the functions of the various modules:
an obtaining module 3301, configured to obtain a VR video image, and determine emotion information indicating a user emotion in the VR video image.
The determining module 3302 is configured to determine, according to the emotion information, a corresponding processing mode of the VR video image, where the processing mode is used to reflect the emotion information.
And the processing module 3303 is configured to process the VR video image based on the processing mode.
Furthermore, the obtaining module 3301 is further configured to: receiving a request from a user, and acquiring an action corresponding to the request; and executing the action, and displaying the action result in the VR video image.
Specifically, the determining module 3302 is configured to select, according to the emotion information, a special effect processing mode corresponding to the emotion information.
In addition, the determining module 3302 is further configured to determine voice information of the user in the video image; the apparatus 3300 further includes: the recognition module is used for recognizing intonation information in the voice information; and the selection module is used for selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
For details of the apparatus 3300 that are not covered here, reference may be made to the description of the apparatus 900 above.
The internal functionality and structure of the apparatus 3300 are described above. In one possible design, the structure of the apparatus 3300 may be implemented as a computing device, such as a computer. The device 3400 may include: a memory 3401, a processor 3402, and a communication component 3403.
The memory 3401 is configured to store a computer program.
The communication component 3403 is configured to obtain VR video images.
The processor 3402 is configured to execute the computer program to: determine emotion information representing a user's emotion in the VR video image; determine a corresponding processing mode for the VR video image according to the emotion information, where the processing mode is used to reflect the emotion information; and process the VR video image based on the processing mode.
Further, the processor 3402 is further configured to: receive a request from a user and acquire an action corresponding to the request; and execute the action and display the result of the action in the VR video image.
Specifically, the processor 3402 is configured to select a special effect processing mode corresponding to the emotion information according to the emotion information.
Further, the processor 3402 is further configured to: determine voice information of a user in the video image; recognize intonation information in the voice information; and select a corresponding special effect processing mode based on the emotion information and the intonation information.
For details of the device 3400 that are not covered here, reference may be made to the description of the device 1300 above.
Additionally, embodiments of the present invention provide a computer storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the video processing method of the method 3200 embodiment.
In addition, some of the flows described in the above embodiments and drawings include a plurality of operations in a specific order, but it should be clearly understood that these operations may be executed out of the order presented herein or in parallel. Sequence numbers of operations, such as 201, 202, and 203, are merely used to distinguish different operations and do not by themselves represent any execution order. Additionally, the flows may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should also be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not represent a sequential order, nor do they require that the objects referred to as "first" and "second" be of different types.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and of course also by a combination of hardware and software. Based on this understanding, the technical solutions above, in essence or in the part that contributes over the prior art, may be embodied in the form of a computer program product, which may be carried on one or more computer-usable storage media (including, without limitation, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable multimedia data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable multimedia data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable multimedia data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable multimedia data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (37)

1. A method for processing multimedia information, comprising:
acquiring multimedia information, and determining emotion information representing user emotion in the multimedia information;
determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information;
and processing the multimedia information based on the processing mode.
2. The method of claim 1, wherein the obtaining multimedia information comprises:
acquiring audio and/or video information of at least one participant based on an audio and/or video conference;
acquiring audio and/or video information of an anchor based on an audio and/or video live broadcast; or
acquiring information of a preset audio and/or a preset video based on the preset audio and/or the preset video.
3. The method of claim 1, wherein determining emotional information indicative of a user's emotion in the multimedia information comprises:
determining voice information and/or expression information of a user according to at least one frame of video image;
and determining the emotion information according to the voice information and/or the expression information.
4. The method of claim 3, wherein the determining the emotion information from the speech information comprises:
determining semantics corresponding to the voice information;
determining the emotional information according to the semantics.
5. The method of claim 4, wherein the determining the emotional information according to the semantics comprises:
determining, according to the semantics, a label set of the voice information that is capable of expressing emotional tendency;
and selecting, according to the semantics, a corresponding emotion label from the label set as the emotion information.
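The label-selection step recited in claim 5 could, for example, look like the sketch below: a tag set is chosen from the recognized semantics and an emotion label is then picked from that set. The tag sets and keywords are hypothetical examples, not part of the claims.

TAG_SETS = {
    "praise":    ["happy", "excited", "proud"],
    "complaint": ["angry", "frustrated", "sad"],
}
KEYWORDS = {
    "great":     ("praise", "happy"),
    "well done": ("praise", "proud"),
    "terrible":  ("complaint", "angry"),
}

def emotion_from_semantics(semantics: str) -> str:
    # Map recognized semantics to a tag set and pick the matching emotion label.
    text = semantics.lower()
    for keyword, (tag_set, label) in KEYWORDS.items():
        if keyword in text and label in TAG_SETS[tag_set]:
            return label
    return "neutral"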
6. The method according to claim 1, wherein the determining the corresponding processing manner of the multimedia information according to the emotion information comprises:
and selecting a special effect processing mode corresponding to the emotion information according to the emotion information.
7. The method of claim 6, further comprising:
determining a target object to which the emotion information faces;
and selecting a special effect processing mode corresponding to the emotion information according to the target object.
8. The method of claim 1, wherein the processing the multimedia information based on the processing manner comprises:
determining a target object oriented to the processing mode;
and processing the video image of the target object based on the processing mode.
9. The method of claim 8, further comprising:
determining a target object in the video image;
and processing the target object based on the processing mode.
10. The method of claim 9, wherein when the target object is not present in the video image, the method further comprises:
determining a position in the video image where the target object may appear;
and processing the position based on the processing mode.
11. The method of claim 7, wherein the determining the target object to which the emotional information is directed comprises:
determining corresponding semantics according to the voice information;
and acquiring a target object identifier from a semantic database according to the semantics, and determining the target object based on the target object identifier.
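One possible reading of claim 11 is a lookup of a target object identifier keyed by the recognized semantics; in the sketch below the semantic database is reduced to a plain dictionary, and the entries are hypothetical.

from typing import Optional

SEMANTIC_DB = {
    "you did a great job": "user_alice",
    "thanks to the host":  "anchor",
}

def resolve_target(semantics: str) -> Optional[str]:
    # Return the target object identifier, which can then be mapped to the
    # participant or on-screen object it designates.
    return SEMANTIC_DB.get(semantics.lower())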
12. The method of claim 1, further comprising:
and providing the processed multimedia information to a corresponding intelligent terminal for displaying.
13. The method of claim 6, wherein the special effects processing mode comprises: expression special effects, character special effects and animation special effects.
14. The method of claim 2, further comprising:
and if it is determined, according to the emotion information, the voice information and/or the scene, that the preset video does not meet a preset condition, providing prompt information to prompt that a video clip, voice information and/or a person in the preset video is to be processed.
15. The method of claim 1, wherein the multimedia information comprises audio and/or video information.
16. The method of claim 1, wherein determining emotional information indicative of a user's emotion in the multimedia information comprises:
determining emotion information representing a user's emotion in the audio information.
17. The method of claim 16, further comprising:
determining semantic information in the audio information;
and determining a corresponding processing mode of the audio information according to the semantic information.
18. The method according to claim 1, wherein, for audio information, the determining a corresponding processing mode of the multimedia information according to the emotion information comprises:
determining a frequency adjustment for audio in the audio information according to the emotion information;
determining a noise adjustment for the audio in the audio information according to the emotion information; and/or
providing a preset characteristic processing mode according to the emotion information.
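The audio adjustments of claim 18 could be parameterized per emotion, for example as a frequency (pitch) scaling factor, a noise-reduction strength, and a preset processing mode; the values and preset names below are assumptions for illustration only.

AUDIO_PRESETS = {
    "excited": {"pitch_factor": 1.05, "denoise": 0.3, "preset": "bright"},
    "calm":    {"pitch_factor": 1.00, "denoise": 0.6, "preset": "soft"},
    "sad":     {"pitch_factor": 0.97, "denoise": 0.5, "preset": "warm"},
}

def audio_processing_for(emotion: str) -> dict:
    # Unknown emotions fall back to a neutral set of adjustments.
    return AUDIO_PRESETS.get(emotion, {"pitch_factor": 1.0, "denoise": 0.5, "preset": "default"})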
19. The method of claim 1, wherein in the case that the multimedia information is based on images in a video conference, the method further comprises:
identifying the gesture of a corresponding user in each video image based on the acquired video images of the participating users;
determining, based on the recognized gestures, the opinion tendency of each participating user in the video conference.
20. The method of claim 19, further comprising:
the step of identifying a gesture of a corresponding user in each of the video images is performed in response to a voting operation or in response to a game interaction operation.
21. The method of claim 19, wherein the determining, based on the recognized gestures, the opinion tendency of each participating user in the video conference comprises:
and determining the opinions of the participating users based on the recognized gestures and according to the preset corresponding relation between the gestures and the opinions.
22. The method of claim 19, further comprising:
determining the quantity corresponding to each opinion, and storing the quantity corresponding to the opinion;
and sending the quantity corresponding to each opinion to a corresponding intelligent terminal for displaying.
23. The method of claim 21, further comprising:
determining a duration for which the user holds the recognized gesture, and determining the opinions of the respective participating users based on the duration.
24. The method of claim 19, further comprising:
determining the starting time of the gesture, and determining the score of each user in game interaction according to the starting time;
and storing the score, and sending the score to a corresponding intelligent terminal for displaying.
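A minimal sketch of the gesture-driven voting and game scoring recited in claims 19-24 is given below; the gesture labels, the two-second hold threshold, and the scoring rule are illustrative assumptions rather than limitations of the claims.

from typing import Optional

GESTURE_TO_OPINION = {"thumbs_up": "agree", "thumbs_down": "disagree", "open_palm": "abstain"}
HOLD_SECONDS = 2.0

def opinion_from_gesture(gesture: str, held_for: float) -> Optional[str]:
    # Count a vote only when the gesture is held long enough (cf. claim 23).
    if held_for >= HOLD_SECONDS:
        return GESTURE_TO_OPINION.get(gesture)
    return None

def game_score(gesture_start: float, round_start: float) -> int:
    # Faster reactions earn more points (cf. claim 24).
    reaction = max(0.0, gesture_start - round_start)
    return max(0, 100 - int(reaction * 10))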
25. The method of claim 6, further comprising:
creating a storage library of special effect processing modes;
receiving a request from a user, selecting a corresponding special effect processing mode from the repository based on the type of the video image, and determining the type of the corresponding special effect processing mode;
and selecting the special effect processing mode from the corresponding type according to the emotion information and the type of the acquired video image.
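The repository lookup of claim 25 could be organized as a two-level mapping from video type to emotion to effect, as sketched below; the video types and effect names are assumptions for illustration.

from typing import Optional

EFFECT_REPOSITORY = {
    "video_conference": {"happy": "applause_sticker", "sad": "comfort_emoji"},
    "live_stream":      {"happy": "fireworks", "angry": "storm_filter"},
}

def pick_effect(video_type: str, emotion: str) -> Optional[str]:
    # Select the effect category by video type, then the concrete effect by emotion.
    return EFFECT_REPOSITORY.get(video_type, {}).get(emotion)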
26. The method of claim 1, further comprising:
determining the voice information of a user in a video image or voice call;
recognizing intonation information in the voice information;
and selecting a corresponding special effect processing mode based on the emotion information and the intonation information.
27. A method for processing multimedia information, comprising:
acquiring multimedia information, and determining emotion information representing user emotion in the multimedia information;
determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information;
and processing the multimedia information based on the processing mode so as to directly display the processed multimedia information.
28. A method for processing a live video, comprising:
acquiring a live video image, and determining emotion information representing an anchor's emotion in the live video image;
determining a corresponding processing mode of the live video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information;
and processing the live video image based on the processing mode.
29. The method of claim 28, further comprising:
determining a target object to which the emotion information faces;
when the target object is a user watching live video, determining the position of the user in the live video image;
wherein, based on the processing mode, processing the live video image comprises:
and processing the position according to the processing mode.
30. A method for processing a video conference, comprising:
acquiring video images corresponding to a plurality of users in a video conference, and determining the video image to which the speaking user belongs;
determining emotion information representing an emotion of a speaking user in the video image;
determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information;
and processing the video image based on the processing mode.
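Claim 30 does not prescribe how the speaking user is located; one simple assumption, sketched below, is to pick the participant whose audio energy is highest and above a threshold.

from typing import Dict, Optional

def speaking_user(audio_levels: Dict[str, float], threshold: float = 0.2) -> Optional[str]:
    # audio_levels maps each participant's user id to a normalized audio energy.
    if not audio_levels:
        return None
    user, level = max(audio_levels.items(), key=lambda item: item[1])
    return user if level >= threshold else None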
31. A method for processing video, comprising:
acquiring a preset video, and determining emotion information representing an emotion of a person in a video image of the preset video;
determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information;
and processing the video image based on the processing mode.
32. A method for processing video, comprising:
acquiring a video image in a video call, and determining emotion information representing user emotion in the video image;
determining a corresponding processing mode of the video image according to the emotion information, wherein the processing mode is used for reflecting the emotion information;
and processing the video image based on the processing mode.
33. A method for processing audio, comprising:
acquiring singing audio and determining emotion information representing emotion of a singing user;
determining a corresponding processing mode according to the emotion information, wherein the processing mode reflects the emotion information;
and processing the singing audio based on the processing mode.
34. The method of claim 33, further comprising:
scoring the singing audio;
and determining a corresponding processing mode based on the scoring result.
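The score-dependent processing of claim 34 might, for example, map score bands to effects as below; the bands and effect names are purely illustrative assumptions.

def effect_for_score(score: float) -> str:
    # Higher singing scores trigger more celebratory effects in this sketch.
    if score >= 90:
        return "golden_stage_lighting"
    if score >= 70:
        return "applause_overlay"
    return "encouragement_banner"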
35. A computing device, comprising: a memory, a processor, and a communication component;
the memory for storing a computer program;
the communication component is used for acquiring multimedia information;
the processor to execute the computer program to:
determining emotion information representing the emotion of the user in the multimedia information;
determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information;
and processing the multimedia information based on the processing mode.
36. A computing device, comprising: a memory, a processor, and a communication component;
the memory for storing a computer program;
the communication component is used for acquiring multimedia information;
the processor to execute the computer program to: determining emotion information representing the emotion of the user in the multimedia information;
determining a corresponding processing mode of the multimedia information according to the emotion information, wherein the processing mode is used for reflecting the emotion information;
and processing the multimedia information based on the processing mode so as to directly display the processed multimedia information.
37. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform the steps of the method of any one of claims 1-34.
CN202011219009.3A 2020-11-04 2020-11-04 Multimedia information processing method, computing equipment and storage medium Pending CN114449297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011219009.3A CN114449297A (en) 2020-11-04 2020-11-04 Multimedia information processing method, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114449297A (en) 2022-05-06


Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011219009.3A Pending CN114449297A (en) 2020-11-04 2020-11-04 Multimedia information processing method, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114449297A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination