WO2021012491A1 - 多媒体信息展示方法、装置、计算机设备及存储介质 - Google Patents

多媒体信息展示方法、装置、计算机设备及存储介质

Info

Publication number
WO2021012491A1
WO2021012491A1 (PCT/CN2019/116761, CN2019116761W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
video file
target object
image
editing
Prior art date
Application number
PCT/CN2019/116761
Other languages
English (en)
French (fr)
Inventor
欧阳碧云
吴欢
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021012491A1 publication Critical patent/WO2021012491A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/44Browsing; Visualisation therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally

Definitions

  • This application relates to the field of computer application technology. Specifically, this application relates to a multimedia information display method, device, computer equipment, and storage medium.
  • Smart terminals include computers, mobile phones, tablets, and the like. People use application software on smart terminals to perform various operations, such as browsing web pages, communicating by voice, text, and video, and watching videos.
  • The purpose of this application is to address at least one of the above technical defects by disclosing a multimedia information display method, device, computer equipment, and storage medium that can enhance human-computer interaction and entertainment.
  • In a first aspect, the present application discloses a multimedia information display method, including: acquiring an editing instruction input by a user for a target image of the current time axis in a played video file, where the editing instruction includes the coordinates to be edited and the editing type of the target image; locking the target object in the target image according to the coordinates to be edited; editing the target object according to the editing type; and displaying the edited target object in the images of the current and subsequent time axis of the video file.
  • In a second aspect, the present application discloses a multimedia information display device, including: an acquisition module, configured to acquire an editing instruction input by a user for a target image of the current time axis in a played video file, where the editing instruction includes the coordinates to be edited and the editing type of the target image; a locking module, configured to lock the target object in the target image according to the coordinates to be edited; an editing module, configured to edit the target object according to the editing type; and a display module, configured to display the edited target object in the images of the subsequent time axis of the video file.
  • In a third aspect, the present application discloses a computer device, including: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs being configured to perform the foregoing multimedia information display method.
  • In a fourth aspect, the present application discloses a storage medium storing computer-readable instructions; a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the foregoing multimedia information display method is implemented.
  • Figure 1 is a flowchart of the multimedia information display method of the application
  • FIG. 2 is a flowchart of an identity verification method according to an embodiment of the application.
  • FIG. 3 is a flowchart of a method for locking a target object in a target image in this application
  • Figure 4 is a flowchart of the training method of the convolutional neural network model of the application.
  • FIG. 5 is a schematic diagram of a video image according to an embodiment of the application.
  • Figure 6 is a schematic diagram of character decoration in this application
  • Figure 7 is a schematic diagram of the display of characters after decoration in the application.
  • FIG. 8 is a flowchart of a method for performing tone color conversion on a target object in this application.
  • FIG. 9 is a block diagram of the multimedia information display device of this application.
  • FIG. 10 is a block diagram of the basic structure of the computer equipment of this application.
  • Those skilled in the art will understand that the terms "terminal" and "terminal equipment" used herein cover both devices that have only a wireless signal receiver without transmitting capability and devices with receiving and transmitting hardware capable of two-way communication over a two-way communication link.
  • Such equipment may include: cellular or other communication equipment with a single-line display, a multi-line display, or no multi-line display at all; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax, and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar, and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio frequency receiver.
  • The "terminal" and "terminal equipment" used herein may be portable, transportable, installed in vehicles (air, sea, and/or land), or suitable and/or configured to operate locally and/or, in distributed form, at any location on the earth and/or in space.
  • The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, a MID (Mobile Internet Device), and/or a mobile phone with music/video playback functions, or a device such as a smart TV or set-top box.
  • Specifically, referring to FIG. 1, this application discloses a multimedia information display method including the following steps. S1000: Acquire an editing instruction input by the user for the target image of the current time axis in the played video file, where the editing instruction includes the coordinates to be edited and the editing type of the target image.
  • the video file is a video file obtained by the local server from the application server or stored in the local server.
  • a video file is a dynamic image composed of multiple static picture frames connected in series according to the time axis and combined with corresponding sound effects.
  • the editing instruction refers to the information selected by the user to edit the video file.
  • On the client where the user is watching the video an interface for the user to edit the video is provided.
  • the display of this editing interface can appear in any way.
  • In one embodiment, an edit box pops up in a pop-up window, triggered by a specific instruction, for the user to edit at will; in another embodiment, the edit box is overlaid on the current video file as a semi-transparent floating window, and after a trigger instruction from the user is received, the editing information is sent to the server for editing.
  • the trigger instruction here refers to a specific command entered by the user, or selected for editing through an existing editing option on the editing interface.
  • The existing editing options here are any operations that can edit the video, such as adjusting the color of the images in the video, adding filters, beautifying all or specified characters in the video, and applying voice-change processing to the sound in the video; the above editing operations are called editing types.
  • Since the video file is multiple static image frames connected in series along the time axis, editing first requires obtaining the frame that needs to be edited, called the target image. When editing the target image, the frame can be edited as a whole, or a specified object in the target image can be edited; therefore, in the process of editing the target image, the coordinates of the position to be edited also need to be obtained, and the edit corresponding to the editing type is performed at those coordinates.
  • S2000: Lock the target object in the target image according to the coordinates to be edited. The above editing instruction comes from the client on which the user is watching the video file: after the user selects the corresponding editing coordinates and editing type on the relevant operation interface of the client, the client generates the editing instruction and sends it to the server, and the server, after obtaining the editing instruction, edits according to the editing coordinates and editing type.
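  • A minimal sketch of what such an editing instruction might look like when the client posts it to the server is given below; the field names and values are illustrative assumptions, not a wire format defined by this application.
    import json

    edit_instruction = {
        "video_id": "example-video-001",        # hypothetical identifier of the played video file
        "frame_timestamp_ms": 15240,            # position on the time axis of the target image
        "coordinates": [{"x": 312, "y": 175}],  # coordinates to be edited, in pixels of the target image
        "edit_type": "add_decoration",          # e.g. add_filter, beautify, add_decoration, voice_change
        "user_id": "user-42",                   # identity information used later for permission checks
    }
    payload = json.dumps(edit_instruction)      # JSON string sent to the server's editing endpoint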
  • Since what is acquired in step S1000 is the coordinates to be edited of the target image, the coordinates to be edited here refer to a coordinate position relative to a coordinate origin, with some point in the target image serving as that origin. No matter where the coordinate origin lies, the coordinates to be edited in this application represent a specific point in the target image, and this point falls within a certain pixel of the target image. Since the target image is formed by splicing together many different pixels, and different pixels spliced together form the images of different objects, the target object in the target image can be locked through the coordinates to be edited.
  • the target object here may include a certain object, multiple objects, or the entire target image.
  • the specific number and range are determined according to the number of coordinates to be edited selected by the user.
  • the user can select all coordinate points in the entire target image by selecting all, or select one or more objects by selecting one or more points.
  • For example, if there are trees, flowers, and people in the target image and the user selects a point within the image of a tree, it can be considered that the user needs to edit that tree; if the user selects the flower and the person at the same time in the same way, it means that the user wants to lock and edit the selected "flower" and "person".
  • S3000: Edit the target object according to the editing type. Since the editing instruction includes the editing type, once the target object in the target image has been locked, the target object is edited according to the selected editing type.
  • The editing types here include, but are not limited to, adjusting the color of images in the video, adding filters, adding text or images, beautifying or decorating all or specified characters in the video, changing the size and shape of the target object, rendering the target object, changing the sound in the video, and so on.
  • the editing type further includes obtaining the original video file, and performing editing actions such as color correction, beautification, decoration, and voice change in the original video file.
  • S4000: Display the edited target object in the images of the current and subsequent time axis of the video file. After the target object has been edited according to steps S2000 and S3000, starting from the edited target image, the images played on the subsequent time axis are displayed in the style edited in the target image; for example, if a filter is added to the entire frame of the target image, the filter is added to all subsequent images of the video file, and if a character in the target image is beautified, that character keeps appearing in beautified form in the subsequent images.
  • Further, the display method for the images of the subsequent time axis also includes displaying the edited target object only in selected frames; that is, the edited effect can be shown in certain specified frames rather than in all of them.
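  • The frame-propagation behaviour described above can be sketched as follows; the apply_edit helper and the frame representation are assumptions used only for illustration.
    def apply_edit(frame, edit_params):
        # stand-in for the real rendering step (filter, decoration, beautification, ...)
        return frame

    def display_from_target_frame(frames, target_index, edit_params, selected_indices=None):
        """Yield frames for playback, re-applying the saved edit from the target frame onward.
        If selected_indices is given, only those frames show the edited effect, matching the
        option of displaying the edit in selected frames only."""
        for i, frame in enumerate(frames):
            if i >= target_index and (selected_indices is None or i in selected_indices):
                yield apply_edit(frame, edit_params)
            else:
                yield frame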
  • the editing type includes obtaining an original video file, where the original video file is original image information that has not undergone post-processing.
  • the original video file is an image taken through a mobile phone, a computer, or a camera, which has not undergone post-processing.
  • the post-processing here refers to the processing of the pictures or videos taken, such as adding filters and beautifying. If there is no post-processing, it means that the video file has not been added with filters or beautified.
  • the method of obtaining original image information in this application may be that when uploading image information, the original image is uploaded to the server at the same time, so the backend only needs to select the original image information from the server.
  • the user sends the original image and the processed image to the background server at the same time, but can choose which image is displayed on the client or the other party’s display terminal.
  • When the processed image is displayed on the display terminal, the unprocessed original image can be retrieved with the appropriate access permission.
  • the images taken by mobile phones or cameras and camcorders are all original image information, and an EXIF value will be generated when the file is formed after the shooting.
  • Exif is an image file format whose data storage is exactly the same as that of the JPEG format. In fact, the Exif format inserts digital image information into the JPEG header, including shooting conditions such as aperture, shutter, white balance, ISO, focal length, and date and time, as well as the camera brand and model, color coding, sound recorded at shooting time, GPS (Global Positioning System) data, thumbnails, and so on. When the original image information has been modified, the Exif information may be lost, or the image's actual aperture, shutter, ISO, white balance, and related parameters may no longer match the values recorded there; therefore, by obtaining the image parameters recorded in this information and comparing them, it can be determined whether the current image is the original image.
  • For example, the EXIF dictionary of a picture can be extracted as follows (Objective-C on iOS):
    // 1. Get the image file URL
    NSURL *fileUrl = [[NSBundle mainBundle] URLForResource:@"YourPic" withExtension:@""];
    // 2. Create an image source from the file URL
    CGImageSourceRef imageSource = CGImageSourceCreateWithURL((CFURLRef)fileUrl, NULL);
    // 3. Copy all of the image's properties
    CFDictionaryRef imageInfo = CGImageSourceCopyPropertiesAtIndex(imageSource, 0, NULL);
    // 4. Extract the EXIF dictionary from the properties
    NSDictionary *exifDic = (__bridge NSDictionary *)CFDictionaryGetValue(imageInfo, kCGImagePropertyExifDictionary);
  • After the original picture has been identified in the above manner, it is stored in the database for easy retrieval and subsequent processing.
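  • A rough sketch of the EXIF comparison idea, assuming a recent Pillow installation (this is an assumed implementation, not the mechanism claimed by this application): if the EXIF block or the shooting parameters it should carry are missing, the image is treated as possibly post-processed rather than original.
    from PIL import Image

    EXIF_IFD = 0x8769  # standard pointer tag to the Exif sub-IFD that holds the shooting parameters

    def looks_like_original(path):
        exif = Image.open(path).getexif()
        if not exif:
            return False                              # EXIF lost entirely -> likely edited or re-encoded
        shooting = exif.get_ifd(EXIF_IFD)             # aperture, exposure time, ISO, white balance, ...
        wanted = {0x829D, 0x829A, 0x8827, 0xA403}     # FNumber, ExposureTime, ISOSpeedRatings, WhiteBalance
        return bool(wanted & set(shooting))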
  • In one embodiment, referring to FIG. 2, the editing instruction further includes user identity information, and before acquiring the original video file the method further includes: obtaining, through the user identity information, the user's permission to acquire the original video file; and, when that permission meets a preset rule, acquiring the original video file from the database.
  • the editing type includes obtaining the original video file, and the original video file is a video file that is uploaded to the server at the same time. As long as there is a permission instruction for viewing, the original video file can be obtained by accessing the server.
  • the permission for viewing is obtained through user identity information. Therefore, when the editing instruction includes obtaining the original video file, the editing instruction should also include the user's identity information.
  • the user's identity information is usually the account information that the user logs in when performing related tasks, and the corresponding authority is matched through the account information.
  • the editing type also includes image editing in the original video file.
  • the type of image editing may include adding filters, changing light, and beautifying or decorating one or more designated objects.
  • Further, the video file or the original video file can be edited according to the user's authority. A specific way to do this is to set a corresponding permission for each editing type: when the user requests an editing type, the permission corresponding to the user identity information is queried; if the editing type is authorized, the selected target image is edited accordingly; if it is not authorized, the editing request sent by the user is not acted on and an error message is returned to prompt the user.
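  • A minimal sketch of the per-editing-type permission check described above; the permission names and the mapping are illustrative assumptions.
    EDIT_TYPE_PERMISSIONS = {
        "get_original_video": "original_video.read",    # viewing the unprocessed original file
        "edit_original_video": "original_video.edit",   # editing on the basis of the original file
        "add_filter": "video.edit",
    }

    def authorize(user_permissions, edit_type):
        required = EDIT_TYPE_PERMISSIONS.get(edit_type)
        if required is None or required in user_permissions:
            return True, None
        return False, f"editing type '{edit_type}' is not permitted for this account"

    ok, error = authorize({"video.edit"}, "get_original_video")   # -> (False, error message for the user)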
  • Further, referring to FIG. 3, the method of locking the target object in the target image according to the coordinates to be edited includes: S2100, inputting the target image into the first neural network model to identify the objects in the target image and the coordinate areas onto which those objects are mapped; and S2200, matching the coordinates to be edited against the coordinate areas to determine the target object to which they belong.
  • The neural network model here refers to an artificial neural network, which has a self-learning capability. For example, to implement image recognition, one only needs to feed many different image templates and the corresponding recognition results into the artificial neural network, and through self-learning the network will gradually learn to recognize similar images. In addition, it has an associative memory function, and this kind of association can be realized with a feedback artificial neural network. Neural networks also have the ability to find optimized solutions at high speed: finding an optimized solution to a complex problem often requires a large amount of computation, but a feedback artificial neural network designed for the problem, combined with the computer's high-speed computing capability, may find an optimized solution quickly. Based on the above advantages, this application uses a trained neural network model to identify the target object and the coordinate area onto which the target object is mapped.
  • Neural networks include deep neural networks, convolutional neural networks, recurrent neural networks, deep residual networks, etc.
  • This application takes convolutional neural networks as an example for illustration.
  • A convolutional neural network is a kind of feedforward neural network whose artificial neurons can respond to surrounding units, making it suitable for large-scale image processing.
  • Convolutional neural network includes convolutional layer and pooling layer.
  • the purpose of convolution in convolutional neural networks (CNN) is to extract certain features from the image.
  • The basic structure of a convolutional neural network includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and the local feature is extracted; once that local feature has been extracted, its positional relationship to other features is also determined. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, and each feature map is a plane.
  • the weights of all neurons on the plane are equal.
  • The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, so that the feature maps are shift-invariant. In addition, because the neurons on one mapping plane share weights, the number of free parameters of the network is reduced.
  • Each convolutional layer in the convolutional neural network is followed by a computing layer for local averaging and secondary extraction; this distinctive double feature-extraction structure reduces the feature resolution. Convolutional neural networks are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Since the feature detection layer of a convolutional neural network learns from training data, explicit feature extraction is avoided when the network is used, and features are learned implicitly from the training data; moreover, because the neurons on the same feature mapping plane have identical weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks in which neurons are all connected to one another.
  • the storage form of a color image in the computer is a three-dimensional matrix.
  • the three dimensions are the width, height and RGB (red, green and blue color value) values of the image
  • The storage form of a grayscale image in the computer is a two-dimensional matrix, whose two dimensions are the width and height of the image. Whether in the three-dimensional matrix of a color image or the two-dimensional matrix of a grayscale image, each element has a value range of [0, 255], but the meaning differs: the three-dimensional matrix of a color image can be split into three two-dimensional matrices, R, G, and B, whose elements represent the R, G, and B brightness at the corresponding positions of the image, while in the two-dimensional matrix of a grayscale image the elements represent the gray value at the corresponding positions. A binary image can be regarded as a simplification of a grayscale image: every value in the grayscale image above a certain threshold is converted to 1, and everything else to 0, so every element of a binary image matrix is either 0 or 1. A binary image is sufficient to describe the contours of the image, and an important function of the convolution operation is to find the edge contours of the image.
  • By converting the image into a binary image, the edge features of the objects in the image are obtained through filtering by the convolution kernel, and the dimensionality of the image is then reduced by pooling so that salient image features are obtained. Through model training, the image features in the image are recognized.
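  • The chain described above (binarize, filter with a convolution kernel, pool) can be illustrated with the small NumPy sketch below; it is not the model used by this application, only a toy demonstration of the operations.
    import numpy as np

    def to_binary(gray, threshold=128):
        return (gray > threshold).astype(np.float32)          # values above the threshold become 1, else 0

    def convolve2d(img, kernel):
        kh, kw = kernel.shape
        out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1), dtype=np.float32)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
        return out

    def max_pool(img, size=2):
        h, w = (img.shape[0] // size) * size, (img.shape[1] // size) * size
        return img[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

    gray = np.random.randint(0, 256, (8, 8))
    laplacian = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float32)   # simple edge kernel
    features = max_pool(convolve2d(to_binary(gray), laplacian))                  # reduced-resolution edge features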
  • In this application, an object, as a feature of the captured image, can be recognized through a neural network model trained as a convolutional neural network; however, models trained with other neural networks, such as a DNN (deep neural network) or an RNN (recurrent neural network), can also be used. Whatever kind of neural network is trained, the principle of using this machine-learning approach to recognize different objects is basically the same.
  • the training method of the convolutional neural network model is as follows:
  • Training sample data are the constituent units of the entire training set, which is made up of a number of such training samples.
  • the training sample data is composed of data of a variety of different objects and classification judgment information for marking various objects.
  • Classification judgment information refers to the human judgment made on training sample data, according to the training goal of the convolutional neural network model, using generally applicable judgment criteria and factual states; that is, it is the expected target for the output value of the convolutional neural network model. For example, if, in one training sample, the object in the image information is manually recognized as being the same as the object in pre-stored image information, the object's classification judgment information is labeled as identical to the pre-stored target object image.
  • The training sample set is input into the convolutional neural network model in sequence, and the model classification reference information output by the last fully connected layer of the convolutional neural network model is obtained.
  • Model classification reference information is the activation data output by the convolutional neural network model for an input object image. Before the model has been trained to convergence, the classification reference information is numerically highly scattered; after the model has been trained to convergence, the classification reference information is relatively stable data.
  • The loss function is a detection function used to check whether the model classification reference information of the convolutional neural network model is consistent with the expected classification judgment information.
  • When the output result of the convolutional neural network model is inconsistent with the expected result of the classification judgment information, the weights in the convolutional neural network model need to be corrected, iterating repeatedly until the output result of the model matches the expected result of the classification judgment information.
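  • The training loop described above can be sketched compactly as follows (PyTorch, with an illustrative stand-in architecture; it is not the network claimed by this application): labelled samples are fed through the model, the output is compared with the expected classification by a loss function, and the weights are corrected iteratively.
    import torch
    from torch import nn

    model = nn.Sequential(                          # stand-in for the first neural network model
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(32 * 16 * 16, 10),
    )
    loss_fn = nn.CrossEntropyLoss()                 # plays the role of the loss-function comparison
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    def train_step(images, labels):
        optimizer.zero_grad()
        logits = model(images)                      # model classification reference information
        loss = loss_fn(logits, labels)              # compare with the classification judgment information
        loss.backward()                             # correct the weights when the two disagree
        optimizer.step()
        return loss.item()

    loss = train_step(torch.randn(4, 3, 64, 64), torch.randint(0, 10, (4,)))   # one step on a dummy batch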
  • the first neural network model is trained so that it can recognize the object in the video file, the coverage area of the object, and the corresponding coordinate area.
  • After the first neural network model has identified the objects in the target image and the coordinate areas onto which they are mapped, the acquired coordinates to be edited determine the target object that the user has selected for editing.
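  • A minimal sketch of this locking step under stated assumptions: a detector (standing in for the trained first neural network model) returns the objects in the frame together with their coordinate areas, and the coordinate to be edited is matched against those areas; the hard-coded detections are purely illustrative.
    def lock_target_object(detections, x, y):
        """detections: list of dicts, each with a label and a bounding box (x1, y1, x2, y2)."""
        for det in detections:
            x1, y1, x2, y2 = det["box"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                return det                          # the coordinate to be edited falls inside this object's area
        return None                                 # no object covers the coordinate

    detections = [
        {"label": "tree",   "box": (10, 20, 120, 300)},
        {"label": "person", "box": (200, 50, 260, 310)},
    ]
    target = lock_target_object(detections, x=215, y=100)
    print(target["label"] if target else "nothing locked")   # -> person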
  • operations such as adding text or image, changing the size and shape of the target object, rendering the target object, adding filters, and beautifying the target object can be performed on the target object.
  • the user edits the video file on the current display terminal.
  • The types of editing include, but are not limited to, obtaining the original video file, adding text or images, changing the size and shape of the target object, and rendering the target object, for example beautification, virtual avatar replacement, background replacement, or doodling, so as to make viewing images or videos more interesting.
  • When the editing type is to obtain the original video file, or to edit further on the basis of the original video file, the user's permission to obtain the original video file is determined from the obtained user identity information. If the user has that permission, the original video file is provided to the user. Since the obtained original image information carries no beautification effects, after receiving it the user can beautify a designated person in the image according to his or her own preferences, including whitening skin, enlarging eyes, reddening lips, changing eyebrow shapes, and even adding small accessories. For example, in this embodiment, the editing type is adding a small accessory to a certain person in the image.
  • Referring to FIG. 5, the image includes multiple selectable characters. When the user clicks any position onto which one of the characters is mapped in the image, that character can be locked as the target object by the method disclosed above. As shown in FIG. 6, for the selected character, a suitable decoration is chosen by custom drawing or from the drop-down selection box of the edit box and added to the selected character; in this embodiment, a decoration is added to the head of the selected character.
  • After the decoration has been added, the editing parameters of the target character are saved; that is, according to these editing parameters, the character is locked in the video file and displayed in the locked style.
  • After the edited parameters have been saved, the character is automatically tracked in the subsequent video, the character's local features are automatically read, and the decoration is applied continuously to achieve continuous display. For example, when a person has been beautified, the subsequent video frames are automatically searched for that person; whenever the person appears, the saved editing parameters are automatically applied, without the user having to re-dress the person in every frame of the image. As shown in Figure 7, for example, when the character appears in another scene, the character's appearance remains unchanged.
  • the target object or person can be selected through the neural network model, and the person selected by the user is the reference person.
  • Each frame of the video file is transmitted to the neural network model to identify the reference person.
  • When the reference person is recognized, the saved parameters above are automatically applied to that person, and the image with the applied parameters is played on the front end.
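  • The per-frame tracking flow can be sketched as below; recognize and render_decoration are assumed helpers standing in for the recognition model and the renderer, and the frame representation is a toy one.
    def recognize(frame, reference_person):
        """Return the reference person's region in this frame, or None if the person is absent."""
        return frame.get("people", {}).get(reference_person)       # stand-in for the neural network model

    def render_decoration(frame, region, edit_params):
        frame = dict(frame)
        frame.setdefault("decorations", []).append({"at": region, **edit_params})
        return frame

    def play_video(frames, reference_person, edit_params):
        for frame in frames:
            region = recognize(frame, reference_person)
            if region is not None:
                frame = render_decoration(frame, region, edit_params)   # re-apply the saved parameters
            yield frame                                                 # played on the front end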
  • With this solution, users can customize the image according to their preferences. For example, when they dislike a certain character, they can lock that character's head and replace it with a "pig head", so that in the subsequent video the character is displayed with a pig's head. This makes watching images and videos more interesting for users and can also stimulate their creativity.
  • the editing type includes tone color conversion, which is to change the sound in the video file.
  • the timbre conversion here can be the conversion of all the sounds in the video file according to the specified timbre conversion parameters, or the timbre conversion of the sounds produced by one or more objects.
  • The objects referred to here include people, animals, and tools or plants producing sound under the action of external forces; the sound may also be background music added to the video.
  • the method of performing tone color conversion on the target object includes:
  • Timbre refers to the fact that sounds of different sources always show distinctive characteristics in the frequency and waveform of the sound. Different sounding bodies produce different timbres because of their different materials and structures; for example, a piano and a violin sound different from the human voice, and each individual person's voice is also different. Timbre is the character of a sound and, like people's faces the world over, is always distinctive. By timbre, even at the same pitch and the same sound intensity, we can distinguish sounds made by different instruments or people. Like an ever-changing palette of colors, timbre is also endlessly varied yet easy to grasp.
  • Because different objects produce different timbres, in order to simulate them the timbre is modeled numerically; the target timbre parameter here is the numerical value that simulates the timbre.
  • the target tone color parameters include user-defined parameters or designated parameters selected from a tone color database.
  • the method of adjusting the sound source information of the target object may be manual or automatic adjustment.
  • the automatic adjustment is performed by a neural network model.
  • In this embodiment, the sound source information is input into the second neural network model. The second neural network model, like the first neural network model disclosed above, has a self-learning capability; only the training samples differ, so the output results differ as well. After training, the second neural network model can recognize the sound of the target object and convert it into corresponding parameter values according to the timbre parameter conversion rules; at the same time, according to the timbre conversion parameters selected by the user, the recognized sound of the target object is converted. For example, the voice of a locked character can be transformed into the voice of an anime character to add interest.
  • The specific operation is that, by selecting a certain person or animal in the image, the user chooses from the sound database the target timbre to change to, and the selected person or animal then produces sound according to that target timbre.
  • For example, while a user is watching a video file containing character A, character B, and animal C, where character A is a boy, if character A is selected and matched with the speech parameters of Doraemon in the voice database, then in the subsequent video file whatever character A says is uttered in Doraemon's characteristic voice.
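  • This selection step could be sketched roughly as follows; the voice database contents, the convert_voice helper, and the segment representation are illustrative assumptions rather than an interface defined by this application.
    VOICE_DATABASE = {"doraemon": {"pitch_shift": 6, "formant_shift": 3}}   # toy target timbre parameters

    def convert_voice(audio_segment, timbre_params):
        # stand-in for the real converter (e.g. the second neural network model)
        return {"audio": audio_segment, "applied": timbre_params}

    def apply_target_timbre(segments, locked_character, target_timbre):
        params = VOICE_DATABASE[target_timbre]
        for seg in segments:
            if seg["speaker"] == locked_character:
                yield convert_voice(seg["audio"], params)   # locked character speaks with the target timbre
            else:
                yield seg["audio"]                          # other sounds pass through unchanged

    segments = [{"speaker": "character_A", "audio": b"..."}, {"speaker": "character_B", "audio": b"..."}]
    converted = list(apply_target_timbre(segments, "character_A", "doraemon"))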
  • The above is a specific application of timbre conversion; in this application, the timbre conversion adopts a neural network model.
  • The whole process of human vocal production has three stages, which can be represented by three basic modules: 1) an excitation module; 2) a vocal tract module; 3) a radiation module. Connecting these three modules in series gives a complete speech system, and the main parameters of this model include the fundamental frequency (pitch) period, the unvoiced/voiced decision, the gain, and the filter parameters. In this application, the original speech of the selected character is obtained, converted from analog to digital, and the corresponding feature vectors are extracted from the digital signal.
  • Voice timbre transformation generally includes two processes, training process and transformation process.
  • The training process generally includes the following steps: 1) analyze the source and target speakers' speech signals and extract effective acoustic features; 2) align the source and target speakers' acoustic features; 3) analyze the aligned features to obtain the mapping relationship between the source and target speakers in the acoustic vector space, that is, the transformation function/rule.
  • The extracted voice feature parameters of the source speaker are passed through the transformation function/rule obtained in training to produce the transformed voice feature parameters, and these transformed feature parameters are then used to synthesize and output speech, so that the output speech sounds like something said by the selected target speaker.
  • The general conversion process includes: 1) extracting feature parameters from the speech input by the source speaker; 2) using the transformation function/rule to compute new feature parameters; 3) synthesizing and outputting. During synthesis, a synchronization mechanism must be used to ensure real-time output; in this application, the Pitch Synchronous Overlap-Add (PSOLA) method can be used.
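  • As a heavily simplified numerical illustration of the two-stage idea (NumPy only, not PSOLA and not a production voice changer): "training" fits a linear mapping from aligned source feature frames to target feature frames, and "conversion" applies that mapping to new source frames; real systems operate on pitch, spectral envelope, and similar acoustic features.
    import numpy as np

    def train_mapping(source_feats, target_feats):
        # least-squares W such that source_feats @ W approximates target_feats (rows are frames)
        W, *_ = np.linalg.lstsq(source_feats, target_feats, rcond=None)
        return W

    def convert(frames, W):
        return frames @ W                      # transformed feature frames, later used to synthesize audio

    rng = np.random.default_rng(0)
    src = rng.standard_normal((200, 13))                  # e.g. 200 aligned frames of 13-dim features
    tgt = src @ (0.5 * rng.standard_normal((13, 13)))     # toy "target speaker" features
    W = train_mapping(src, tgt)
    converted = convert(rng.standard_normal((50, 13)), W)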
  • This application discloses a multimedia information display device, including:
  • The acquisition module 1000 is configured to acquire an editing instruction input by the user for the target image of the current time axis in the played video file, where the editing instruction includes the coordinates to be edited and the editing type of the target image; the locking module 2000 is configured to lock the target object in the target image according to the coordinates to be edited; the editing module 3000 is configured to edit the target object according to the editing type; and the display module 4000 is configured to display the edited target object in the images of the subsequent time axis of the video file.
  • the editing type includes obtaining an original video file, where the original video file is original image information that has not undergone post-processing.
  • the editing instruction includes user identity information
  • the editing module further includes:
  • the permission acquisition module configured to execute the acquisition permission of the user's original video file through the user identity information; when the acquisition permission meets a preset rule, the original video file is acquired from the database.
  • the locking module includes:
  • the first recognition module is configured to perform input of the target image into the first neural network model to recognize the object in the target image and the coordinate area mapped by the object;
  • Target matching module configured to perform matching of the coordinate to be edited in the coordinate area to determine the target object to which it belongs.
  • the editing type includes tone color conversion
  • the editing module further includes:
  • Tone acquisition module configured to execute the acquisition of the target tone parameter in the tone conversion instruction
  • Sound source recognition module configured to perform recognition of the sound source information mapped by the target object
  • the sound source processing module is configured to input the sound source information into the second neural network model to output target sound source information that meets the target tone color parameters.
  • the editing type further includes: adding text or images, changing the size and shape of the target object, and rendering the target object.
  • the target tone color parameters include user-defined parameters or designated parameters selected from a tone color database.
  • the multimedia information display device disclosed above is a one-to-one corresponding execution device of the multimedia information display method, and its working principle is the same as the above multimedia information display method, and will not be repeated here.
  • Please refer to FIG. 10 for a block diagram of the basic structure of the computer device provided by an embodiment of the present application.
  • the computer device includes a processor, a nonvolatile storage medium, a memory, and a network interface connected through a system bus.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • the database may store control information sequences.
  • When the computer-readable instructions are executed by the processor, the processor can implement a multimedia information display method.
  • the processor of the computer equipment is used to provide calculation and control capabilities, and supports the operation of the entire computer equipment.
  • Computer-readable instructions may be stored in the memory of the computer device, and when these instructions are executed by the processor, they may cause the processor to perform a multimedia information display method.
  • the network interface of the computer device is used to connect and communicate with the terminal.
  • Those skilled in the art will understand that the structure shown in FIG. 10 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • The present application also provides a storage medium storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the multimedia information display method described in any of the above embodiments. The storage medium in this embodiment is a volatile storage medium, but it may also be a non-volatile storage medium.
  • Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by instructing the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments.
  • The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

This application discloses a multimedia information display method and device, a computer device, and a storage medium. The method includes: acquiring an editing instruction input by a user for a target image of the current time axis in a played video file, where the editing instruction includes the coordinates to be edited and the editing type of the target image; locking the target object in the target image according to the coordinates to be edited; editing the target object according to the editing type; and displaying the edited target object in the images of the current and subsequent time axis of the video file. This application allows users to edit the images they are watching as they wish, improving entertainment and interactivity; in addition, it allows users to retrieve the original image and make their own modifications on that basis, increasing viewers' interactivity when watching images. Besides dressing up and beautifying designated characters, users can also change the speaking timbre of a character or animal, further enhancing the entertainment value.

Description

多媒体信息展示方法、装置、计算机设备及存储介质
本申请要求于2019年7月19日提交中国专利局、申请号为201910657196.4,发明名称为“多媒体信息展示方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机应用技术领域,具体而言,本申请涉及一种多媒体信息展示方法、装置、计算机设备及存储介质。
背景技术
随着科技的发展,智能终端得到了广泛的应用,智能终端包括电脑、手机、平板等,人们通过智能终端上的应用软件执行各种操作,比如浏览网页、语音、文字、视频交流,视频观看等。
现有技术中,在通过智能终端观看到的无论是图片还是视频,当他人在查看的时候,只能看到已经修改过的,比如经过美颜或者处理之后的,发明人意识到,观看者不能自己进行对图片中的人物或者事物进行修改,只能是被动地看,时间久了,容易产生审美疲劳,且互动性不强。
发明内容
本申请的目的旨在至少能解决上述的技术缺陷之一,公开一种通过能够增强人机互动性以及娱乐性的多媒体信息展示方法、装置、计算机设备及存储介质。
第一方面,本申请公开多媒体信息展示方法,包括:获取用户输入的针对所播放的视频文件中当前时间轴的目标图像的编辑指令,其中,所述编辑指令包括所述目标图像的待编辑坐标和编辑类型;根据所述待编辑坐标锁定所述目标图像中的目标物体;根据所述编辑类型对所述目标物体进行编辑;在所述视频文件的当前及后续时间轴的图像中展示编辑后的目标物体。
第二方面,本申请公开一种多媒体信息展示装置,包括:获取模块:被配置为执行获取用户输入的针对所播放的视频文件中当前时间轴的目标图像的编辑指令, 其中,所述编辑指令包括所述目标图像的待编辑坐标和编辑类型;锁定模块:被配置为执行根据所述待编辑坐标锁定所述目标图像中的目标物体;编辑模块:被配置为执行根据所述编辑类型对所述目标物体进行编辑;展示模块:被配置为执行在所述视频文件的后续时间轴的图像中展示编辑后的目标物体。
第三方面,本申请公开一种计算机设备,包括:一个或多个处理器;存储器;一个或多个计算机程序,其中所述一个或多个计算机程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个计算机程序配置用于执行上述一种多媒体信息展示方法。
第四方面,本申请公开一种存储有计算机可读指令的存储介质,所述计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现上述一种多媒体信息展示方法。
本申请附加的方面和优点将在下面的描述中部分给出,这些将从下面的描述中变得明显,或通过本申请的实践了解到。
附图说明
本申请上述的和/或附加的方面和优点从下面结合附图对实施例的描述中将变得明显和容易理解,其中:
图1为本申请多媒体信息展示方法流程图;
图2为本申请实施例身份验证方法流程图;
图3为本申请锁定目标图像中的目标物体的方法流程图;
图4为本申请卷积神经网络模型的训练方法流程图;
图5为本申请实施例视频图像示意图;
图6为本申请人物装饰示意图;
图7为本申请装饰后的人物展示示意图;
图8为本申请对目标物体进行音色转换的方法流程图;
图9为本申请多媒体信息展示装置框图;
图10为本申请计算机设备基本结构框图。
具体实施方式
本技术领域技术人员可以理解,这里所使用的“终端”、“终端设备”既包括无线信号接收器的设备,其仅具备无发射能力的无线信号接收器的设备,又包括接 收和发射硬件的设备,其具有能够在双向通信链路上,执行双向通信的接收和发射硬件的设备。这种设备可以包括:蜂窝或其他通信设备,其具有单线路显示器或多线路显示器或没有多线路显示器的蜂窝或其他通信设备;PCS(Personal Communications Service,个人通信系统),其可以组合语音、数据处理、传真和/或数据通信能力;PDA(Personal Digital Assistant,个人数字助理),其可以包括射频接收器、寻呼机、互联网/内联网访问、网络浏览器、记事本、日历和/或GPS(Global Positioning System,全球定位系统)接收器;常规膝上型和/或掌上型计算机或其他设备,其具有和/或包括射频接收器的常规膝上型和/或掌上型计算机或其他设备。这里所使用的“终端”、“终端设备”可以是便携式、可运输、安装在交通工具(航空、海运和/或陆地)中的,或者适合于和/或配置为在本地运行,和/或以分布形式,运行在地球和/或空间的任何其他位置运行。这里所使用的“终端”、“终端设备”还可以是通信终端、上网终端、音乐/视频播放终端,例如可以是PDA、MID(Mobile Internet Device,移动互联网设备)和/或具有音乐/视频播放功能的移动电话,也可以是智能电视、机顶盒等设备。
具体的,请参阅图1,本申请公开一种多媒体信息展示方法,包括:
S1000、获取用户输入的针对所播放的视频文件中当前时间轴的目标图像的编辑指令,其中,所述编辑指令包括所述目标图像的待编辑坐标和编辑类型;
视频文件为由本地服务器从应用服务器中获取的或者本地服务器中存储的视频文件。视频文件为多个静态图片帧按照时间轴串联在一起,并配上对应的音效组合而成的动态图像。编辑指令是指用户所选择的对视频文件进行编辑的信息,在用户进行视频观看的客户端上,提供有供用户对视频进行编辑的界面,这种编辑界面的显示可以以任意方式出现,在一实施例中,通过特定触发指令,以弹窗方式弹出编辑框,供用户任意编辑;在另一实施例中,该编辑框以半透明浮窗的方式覆盖在当前的视频文件上,在接收到用户的触发指令后,发送编辑信息至服务器以进行编辑处理。这里的触发指令是指用户输入的特定命令,或者通过编辑界面上已有的编辑选项,选择以进行编辑。这里的已有的编辑选项为任意可以对视频进行编辑的操作,比如对视频中的图像进行颜色调节、添加滤镜,对视频中的所有人物或者指定人物进行美颜、对视频中的声音进行变声处理等等,以上编辑的操作称之为编辑类型。
由于视频文件是多个静态图像帧按照时间轴串联在一起的,因此在进行编辑时,需要先获取得到需要进行编辑的那一帧图像,称之为目标图像,对于目标图像, 在进行编辑时,可整体对该帧图像进行编辑,也可以对目标图像画面中的某一个指定的物体进行编辑,因此,在进行目标图像编辑过程中还需要获取目标图像待编辑位置的坐标,根据待编辑位置的坐标进行对应编辑类型的编辑。
S2000、根据所述待编辑坐标锁定所述目标图像中的目标物体;
上述编辑指令来自于用户观看视频文件的客户端,当用户在客户端的相关操作界面选定对应的编辑坐标和编辑类型后,客户端生成编辑指令发送至服务器端,服务器端在获取了上述编辑指令后,则根据编辑坐标和编辑指令进行编辑。
由于在步骤S1000中获取的是目标图像的待编辑坐标,这里的待编辑坐标是指以目标图像中的某一个点作为坐标原点,而相对与这个坐标原点的坐标位置。无论这个坐标原点在哪个位置,本申请中的待编辑坐标表征的是目标图像中的某一个特定的点,这个点落在目标图像的某一个像素中。由于目标图像是多个不同的像素点拼接而成的,而不同的像素拼接起来组成不同物体的图像,因此通过待编辑坐标这一个点,即可锁定所述目标图像中的目标物体。
这里的目标物体可以包括某一个物体,也可以是多个物体,或者是整个目标图像,具体数量和范围根据用户所选择待编辑坐标的个数来确定。用户可以通过全选的方式,来选择整个目标图像中所有坐标点,也可以通过选中其中一个或多个点来分别选择一个或者多个物体,例如在目标图像中有树、花和人,用户选定了树的图像中的某一个点,因此可以认为用户需要编辑的是这棵树,当用户以同时选定的方式选择了花和人,则表征用户要进行编辑锁定的是所选择的“花”和“人”。
S3000、根据所述编辑类型对所述目标物体进行编辑;
由于在编辑指令中包括编辑类型,因此当锁定了目标图像中的目标物体后,则针对该目标物体按照所选择的编辑类型进行编辑。这里的编辑类型包括但不局限于对视频中的图像进行颜色调节、添加滤镜,添加文字或图像、对视频中的所有人物或者指定人物进行美颜或装饰、改变目标物体的大小和形状、对所述目标物体进行渲染、以及对视频中的声音进行变声处理等等。在一实施例中,编辑类型还包括获取原始视频文件,在原始视频文件中进行调色、美颜、装饰、变声等编辑动作。
S4000、在所述视频文件的当前及后续时间轴的图像中展示编辑后的目标物体。
当根据步骤S2000和步骤S3000对目标物体进行编辑后,从被进行编辑的目标图像开始,后续时间轴播放的图像都按照目标图像中所编辑的样式进行显示,例如在目标图像中对整个画面添加了滤镜,则视频文件后续的画面都添加了该滤镜,当目标图像中的某个人物进行美颜处理后,则后续图像中,该人物一直以美颜后的形 象出现。
进一步的,后续时间轴的图像的展示方法还包括在选定的帧画面中展示编辑后的目标物体,即可通过指定某些帧画面显示编辑后的效果画面,而不是全部都按照编辑后的效果进行显示。
在一实施例中,所述编辑类型包括获取原始视频文件,其中,所述原始视频文件为未经过后期处理的原始图像信息。
原始视频文件为通过手机端、电脑端或者摄像装置等拍摄的图像,其未经过后期处理。这里的后期处理是指对拍摄的图片或者视频进行画面的处理,比如进行了滤镜添加、美颜等操作。未经过后期处理则为未对视频文件进行滤镜添加、美颜等操作。
获取原始图像信息的方法在本申请中可以是,在上传图像信息的时候,同时上传原始状态的图片至服务器中,因此后端只需要在服务器中选取原始图像信息即可。用户在上传图像时将原始图像和经过处理后的图像同时发送至后台服务器,但是可以选择在客户端上或者对方显示终端上显示是哪一种图像。当显示终端上显示为处理后的图像时,可通过访问权限,调取未经处理的原始图像。
一般的手机端或者照相机、摄像机所拍摄的图像都是原始的图像信息,其拍摄完之后形成文件时会生成一个EXIF值,Exif是一种图像文件格式,它的数据存储与JPEG格式是完全相同的。实际上Exif格式就是在JPEG格式头部插入了数码图像的信息,包括拍摄时的光圈、快门、白平衡、ISO、焦距、日期时间等各种和拍摄条件以及相机品牌、型号、色彩编码、拍摄时录制的声音以及GPS全球定位系统数据、缩略图等。当原始图像信息被修改,可能导致Exif信息丢失,或者图像实际的光圈、快门、ISO和白平衡等相关参数与该信息中的不匹配,因此通过获取这一信息中的关于图像的参数信息,进行参数对比接口来判断当前的图像是否为原始图像。
例如:取出图片的exif的方法为
1.获取图像文件
NSURL *fileUrl=[[NSBundle mainBundle]URLForResource:@"YourPic"withExtension:@""];
2.创建CGImageSourceRef
CGImageSourceRef imageSource=CGImageSourceCreateWithURL((CFURLRef)fileUrl,NULL);
3.利用imageSource获取全部ExifData
CFDictionaryRef imageInfo=CGImageSourceCopyPropertiesAtIndex(imageSource,0,NULL);
4.从全部ExifData中取出EXIF文件
NSDictionary *exifDic=(__bridge NSDictionary*)CFDictionaryGetValue(imageInfo,kCGImagePropertyExifDictionary);
5.打印全部Exif信息及EXIF文件信息
NSLog(@"All Exif Info:%@",imageInfo);
NSLog(@"EXIF:%@",exifDic);
通过上述方式识别出原始图片后将原始图片存储在数据库中以便于调用及后续的编译。
在一实施例中,请参阅图2,所述编辑指令还包括用户身份信息,所述获取所述原始视频文件之前还包括:
S1100、通过所述用户身份信息获取所述用户原始视频文件的获取权限;
S1200、当所述获取权限符合预设规则,则从数据库中获取所述原始视频文件。
在本申请中,编辑类型包括获取原始视频文件,而原始视频文件为同时上传至服务器中的视频文件,只要有符合查看的权限指令,则可通过访问服务器来获取得到原始视频文件。
在本实施例中,符合查看的权限通过用户身份信息来获取,因此,当编辑指令包括获取原始视频文件时,在编辑指令中应当还包括用户的身份信息。用户的身份信息通常是用户执行相关任务时所登陆的账号信息,通过账号信息匹配对应的权限。当该用户具有获取原始视频文件的权限时,则当其请求获取原始视频文件时,从数据库中调取对应的原始视频文件,否则禁止获取原始视频文件。
进一步的,编辑类型还包括在原始视频文件中进行图像编辑,进行图像编辑的类型可以是添加滤镜、改变光线的,对指定的一个或多个物体进行美颜或装饰等。进一步的,可根据用户的权限,对视频文件或者原始视频文件进行编辑,具体操作方式可以为对于不同的编辑类型设置对应的权限,在用户请求上述编辑类型时,查询用户身份信息对应的权限,当有权限执行该编辑类型时,则对选取的目标图像进行对应权限的编辑,当没有权限执行该编辑类型时,则不响应用户发送的该编辑步骤,返回错误信息以提示用户。
进一步的,请参阅图3,所述根据所述待编辑坐标锁定所述目标图像中的目标 物体的方法包括:
S2100、将所述目标图像输入至第一神经网络模型中,以识别出所述目标图像中的物体以及所述物体所映射的坐标区域;
S2200、将所述待编辑坐标在所述坐标区域中匹配以确定所属的目标物体。
神经网络模型在这里是指人工神经网络,其具有自学习功能。例如实现图像识别时,只需要先把许多不同的图像样板和对应的应识别的结果输入人工神经网络,网络就会通过自学习功能,慢慢学会识别类似的图像。另外,其具有联想存储功能。用人工神经网络的反馈网络就可以实现这种联想。神经网络还具有高速寻找优化解的能力。寻找一个复杂问题的优化解,往往需要很大的计算量,利用一个针对某问题而设计的反馈型人工神经网络,发挥计算机的高速运算能力,可能很快找到优化解。基于以上有点,本申请采用训练好的神经网络模型来识别目标物体以及目标物体所映射的坐标区域。
神经网络包括深度神经网络、卷积神经网络、循环神经网络、深度残差网络等,本申请以卷积神经网络为例进行说明,卷积神经网络是一种前馈神经网络,人工神经元可以响应周围单元,可以进行大型图像处理。卷积神经网络包括卷积层和池化层。卷积神经网络(CNN)中卷积的目的在于将某些特征从图像中提取出来。卷积神经网络的基本结构包括两层,其一为特征提取层,每个神经元的输入与前一层的局部接受域相连,并提取该局部的特征。一旦该局部特征被提取后,它与其它特征间的位置关系也随之确定下来;其二是特征映射层,网络的每个计算层由多个特征映射组成,每个特征映射是一个平面,平面上所有神经元的权值相等。特征映射结构采用影响函数核小的sigmoid函数作为卷积网络的激活函数,使得特征映射具有位移不变性。此外,由于一个映射面上的神经元共享权值,因而减少了网络自由参数的个数。卷积神经网络中的每一个卷积层都紧跟着一个用来求局部平均与二次提取的计算层,这种特有的两次特征提取结构减小了特征分辨率。
卷积神经网络主要用来识别位移、缩放及其他形式扭曲不变性的二维图形。由于卷积神经网络的特征检测层通过训练数据进行学习,所以在使用卷积神经网络时,避免了显式的特征抽取,而隐式地从训练数据中进行学习;再者由于同一特征映射面上的神经元权值相同,所以网络可以并行学习,这也是卷积网络相对于神经元彼此相连网络的一大优势。
一幅彩色图像在计算机中的存储形式为一个三维的矩阵,三个维度分别是图像的宽、高和RGB(红绿蓝色彩值)值,而一幅灰度图像在计算机中的存储形式为一 个二维矩阵,两个维度分别是图像的宽、高。无论是彩色图片的三维矩阵还是灰度图像的二维矩阵,矩阵中的每个元素取值范围为[0,255],但是含义不同,彩色图像的三维矩阵可以拆分成R、G、B三个二维矩阵,矩阵中的元素分别代表图像相应位置的R、G、B亮度。灰度图像的二维矩阵中,元素则代表图像相应位置的灰度值。而二值图像可视为灰度图像的一个简化,它将灰度图像中所有高于某个阈值的原始转化为1,否则为0,故二值图像矩阵中的元素非0则1,二值图像足以描述图像的轮廓,二卷积操作的一个重要作用就是找到图像的边缘轮廓。
通过将图像转换成二值图像,再通过卷积核的过滤得到图像物体的边缘特征,再经过池化实现图像的降维以便于得到,明显的图像特征。通过模型训练,以识别出所述图像中图像特征。
本申请中,物体作为所拍摄的图像中的一个特征,可通过卷积神经网络训练得到的神经网络模型获得,但是,还可以使用其他的神经网络,比如DNN(深层神经网络)、RNN(循环神经网络)等网络模型训练而成。无论何种神经网络进行训练,采用这种机器学习的模式来识别不同的物体的方法的原理基本一致。
以卷积神经网络模型的训练方法为例,请参阅图4,卷积神经网络模型的训练方法如下:
S2111、获取标记有分类判断信息的训练样本数据;
训练样本数据是整个训练集的构成单位,训练集是由若干个训练样本训练数据组成的。训练样本数据是由多种不同物体的数据以及对各种不同物体进行标记的分类判断信息组成的。分类判断信息是指人们根据输入卷积神经网络模型的训练方向,通过普适性的判断标准和事实状态对训练样本数据做出的人为的判断,也就是人们对卷积神经网络模型输出数值的期望目标。如,在一个训练样本数据中,人工识别出该图像信息数据中的物体与预存储的图像信息中的物体为同一个,则标定该物体分类判断信息为与预存储的目标物体图像相同。
S2112、将所述训练样本数据输入卷积神经网络模型获取所述训练样本数据的模型分类参照信息;
将训练样本集依次输入到卷积神经网络模型中,并获得卷积神经网络模型倒数第一个全连接层输出的模型分类参照信息。
模型分类参照信息是卷积神经网络模型根据输入的物体图像而输出的激励数据,在卷积神经网络模型未被训练至收敛之前,分类参照信息为离散性较大的数值,当卷积神经网络模型未被训练至收敛之后,分类参照信息为相对稳定的数据。
S2113、通过止损函数比对所述训练样本数据内不同样本的模型分类参照信息与所述分类判断信息是否一致;
止损函数是用于检测卷积神经网络模型中模型分类参照信息,与期望的分类判断信息是否具有一致性的检测函数。当卷积神经网络模型的输出结果与分类判断信息的期望结果不一致时,需要对卷积神经网络模型中的权重进行校正,以使卷积神经网络模型的输出结果与分类判断信息的期望结果相同。
S2114、当所述模型分类参照信息与所述分类判断信息不一致时,反复循环迭代的更新所述卷积神经网络模型中的权重,至所述比对结果与所述分类判断信息一致时结束。
当卷积神经网络模型的输出结果与分类判断信息的期望结果不一致时,需要对卷积神经网络模型中的权重进行校正,以使卷积神经网络模型的输出结果与分类判断信息的期望结果相同。
在本申请中,对第一神经网络模型进行训练,使其可以识别出视频文件中的物体、该物体的覆盖面积、对应的坐标区域等。当第一神经网络模型识别出了目标图像中各个物体以及该物体所映射的坐标区域后,通过所获取的待编辑的坐标确定用户选定的需要编辑的目标物体。当确定了目标物体,则可对该目标物体执行添加文字或图像、改变所述目标物体的大小和形状、对所述目标物体进行渲染、添加滤镜、美颜等操作。
在一实施例中,举例说明本申请的上述技术方案,用户在当前的显示终端上针对视频文件进行编辑,编辑的类型包括但不局限于获取原始视频文件、添加文字或图像、改变所述目标物体的大小和形状、对所述目标物体进行渲染,比如美颜、虚拟头像替换、更换背景、或者进行涂鸦,以提高查看图像或视频时的趣味性。
当编辑类型为获取原始视频文件或者是在原始视频文件的基础上进行再次编辑时,根据获取的用户身份信息,识别其获取原始视频文件的权限,当该用户具有获取权限,则提供原始视频文件给用户,由于获取的原始图像信息是不带美颜效果的,用户在接收到原始图像信息后,可根据自己的喜好对图像中的指定人物进行美颜,包括肤色变白、眼睛变大、红唇、变眉形,甚至是添加小饰品等,例如,在本实施例中,编辑类型为针对图像中的某一个人添加小饰品,请参阅图5,图像中包括多个可选的人物,用户点击其中一个人物在图像上映射的任意位置,则可通过上述公开的方式锁定该人物为目标物体,如图6所示为根据选定的人物,通过自定义绘制的方式或者在编辑框的下拉选择框中选择合适的装饰品,并添加至选定的人物 上,本实施例中,在所选定的人物的头部添加了一个装饰物,添加之后保存该目标人物的编辑参数,即根据该编辑参数,在视频文件中进行锁定,并按照锁定的样式进行显示。
当保存了上述编辑后的参数后,在后续的视频中,自动跟踪该人物,并自动读取该人物的局部特征,持续进行装饰以达到持续显示的目的。比如当给某一人物进行了美颜,则在后续的视频帧文件中自动搜索匹配该人物,当出现该人物时,自动对其添加上述编辑好的参数,无需用户对每一帧图像中的人物都进行重新装扮,例如图7,当该人物在另外一个场景下时,其装扮不变。
在一实施例中,目标物体或人物的选择可通过神经网络模型来选择,用户选择的人物则为参考人物,视频文件的每一帧图像都传输至神经网络模型中,以识别此参考人物,当识别出参考人物,则自动对该参考人物添加上述保存的参数,将添加了参数之后的图像在前端进行播放。
采用该方案,可以让用户根据自己的喜好对图像进行自定义修改,比如当不喜欢某个人物时,可将该人物的头像锁定并替换成“猪头”,在后续的视频显示中,该人物的形象以猪头的方式展示;以提高用户观看图像和视频的趣味性,也能激发用户的创造性。
进一步的,所述编辑类型包括音色转换,音色转换为改变视频文件中的声音。需要说明的是,这里的音色转换,可以是将视频文件中的所有的声音都按照指定的音色转换参数进行转换,也可以是指定某一个或多个物体发出的声音的音色转换。这里所说的物体包括人、动物或者工具、植物在外力作用下发出的声音,还可以是视频中添加的背景音乐。
具体的,请参阅图8,对所述目标物体进行音色转换的方法包括:
S3100、获取音色转换指令中的目标音色参数;
音色(Timbre)是指不同的声音的频率表现在波形方面总是有与众不同的特性。不同的发声体由于其材料、结构不同,则发出的声音的音色也不同,例如钢琴和小提琴和人的声音不一样;每一个人一个人的声音也会不一样。音色是声音的特点,和全世界人们的相貌一样总是与众不同。根据不同的音色,即使在同一音高和同一声音强度的情况下,我们也能区分出是不同乐器或人发出的。如同千变万化的调色盘似的颜色一样,“音色”也会千变万化而容易理解。
基于不同物体的发出的不同音色,为了模拟这些物体的音色,会将音色以数值的方式进行模拟,这里的目标音色参数则为对音色进行模拟的数值。进一步的,目 标音色参数包括用户自定义的参数或者从音色数据库中选取的指定参数。
S3200、识别所述目标物体所映射的声源信息;
在上述步骤中获取了目标物体以及音色转换的参数后,还需要对目标物体所映射的声源信息进行获取,将获取的声源信息与音色转换的参数进行对比,以按照音色转换的参数调整目标物体的声源信息。
S3300、将所述声源信息输入第二神经网络模型中以输出符合所述目标音色参数的目标声源信息。
对目标物体的声源信息进行调整的方式可以通过手动方式,也可以通过自动调整方式,在一实施例中,自动调整的方式为通过神经网络模型来进行。
本实施例中,将所述声源信息输入第二神经网络模型中,第二神经网络模型与上述公开的第一神经网络模型一样,具有自学习功能,只是训练的样本不同,从而输出的结果也不同。在第二神经网络模型中,经过训练可以识别出目标物体的声音,并将目标物体按照音色参数转换规则转换成对应的参数值,同时,根据用户选定的音色转换的参数,对所识别的目标物体的声音进行转换。例如,将锁定的某个人物的声音变换成动漫人物的声音展示,以增加趣味性。具体操作为,用户通过选定图像中的某个人物或动物,在声音数据库中选择需要变更的目标音色,则被选定的人物或动物在发出声音的时候按照该目标音色发生。比如在用户观看某一视频文件时,视频中有人物A、人物B和动物C,人物A为男生,当选定人物A,并将该人物A匹配声音数据库中机器猫的说话参数,则在后续的视频文件中,该人物A所说的话按照机器猫的发声特定进行发声。
上述应用时音色转换的一个具体的应用,本申请中,音色转换采用了神经网络模型的方式。
人体发声的整个流程有三个阶段,可用三个基本模块来表示:1)激励模块、2)声道模块;3)辐射模块。将这三个模块系统串联起来即可得到完整语音系统,该模型中主要参数基频周期、清音/浊音的判断、增益及滤波器参数。本申请中,获取所选定的人物的原始发音,对其进行模数转换,通过数字信号,提取对应的特征向量。语音音色变换一般包括两个过程,训练过程和变换过程,训练过程一般包括以下步骤:1)分析源、目标说话人语音信号,提取有效声学特征;2)将其与源目标说话人的声学特征对齐;3)分析对齐后的特征,得到源、目标说话人在声学矢量空间上的映射关系,及变换函数/规则。将提取的源说话人的声音特征参数,通过训练得到的变换函数/规则得到变换后的声音特征参数,然后用这些变换后的特 征参数,合成并输出语音,使输出的语音听起来像所选定的目标说话人说出的话。一般变化过程包括:1)从源说话人输入的语音中提取特征参数,2)利用变换函数/规则计算出新的特征参数;3)合成并输出,在合成过程中,要用一个同步机制确保得到实时输出。本申请中,可采用基音同步重叠相加(PSOLA)的方法。
另一方面,请参阅图9,本申请公开一种多媒体信息展示装置,包括:
获取模块1000:被配置为执行获取用户输入的针对所播放的视频文件中当前时间轴的目标图像的编辑指令,其中,所述编辑指令包括所述目标图像的待编辑坐标和编辑类型;锁定模块2000:被配置为执行根据所述待编辑坐标锁定所述目标图像中的目标物体;编辑模块3000:被配置为执行根据所述编辑类型对所述目标物体进行编辑;展示模块4000:被配置为执行在所述视频文件的后续时间轴的图像中展示编辑后的目标物体。
可选的,所述编辑类型包括获取原始视频文件,其中,所述原始视频文件为未经过后期处理的原始图像信息。
可选的,所述编辑指令包括用户身份信息,所述编辑模块还包括:
权限获取模块:被配置为执行通过所述用户身份信息获取所述用户原始视频文件的获取权限;当所述获取权限符合预设规则,则从数据库中获取所述原始视频文件。
可选的,所述锁定模块包括:
第一识别模块:被配置为执行将所述目标图像输入至第一神经网络模型中,以识别出所述目标图像中的物体以及所述物体所映射的坐标区域;
目标匹配模块:被配置为执行将所述待编辑坐标在所述坐标区域中匹配以确定所属的目标物体。
可选的,所述编辑类型包括音色转换,所述编辑模块还包括:
音色获取模块:被配置为执行获取音色转换指令中的目标音色参数;
声源识别模块:被配置为执行识别所述目标物体所映射的声源信息;
声源处理模块:被配置为执行将所述声源信息输入第二神经网络模型中以输出符合所述目标音色参数的目标声源信息。
可选的,所述编辑类型还包括:添加文字或图像、改变所述目标物体的大小和形状、对所述目标物体进行渲染。
可选的,目标音色参数包括用户自定义的参数或者从音色数据库中选取的指定参数。
上述公开的一种多媒体信息展示装置是多媒体信息展示方法一一对应的执行装置,其工作原理与上述的多媒体信息展示方法一样,此处不再赘述。
本申请实施例提供计算机设备基本结构框图请参阅图10。
该计算机设备包括通过系统总线连接的处理器、非易失性存储介质、存储器和网络接口。其中,该计算机设备的非易失性存储介质存储有操作系统、数据库和计算机可读指令,数据库中可存储有控件信息序列,该计算机可读指令被处理器执行时,可使得处理器实现一种多媒体信息展示方法。该计算机设备的处理器用于提供计算和控制能力,支撑整个计算机设备的运行。该计算机设备的存储器中可存储有计算机可读指令,该计算机可读指令被处理器执行时,可使得处理器执行一种多媒体信息展示方法。该计算机设备的网络接口用于与终端连接通信。本领域技术人员可以理解,图10中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
本申请还提供一种存储有计算机可读指令的存储介质,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行上述任一实施例所述的多媒体信息展示方法。本实施方式中的存储介质是易失性存储介质,也可以是非易失性的存储介质。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,该计算机程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
以上所述仅是本申请的部分实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。

Claims (20)

  1. 一种多媒体信息展示方法,包括:
    获取用户输入的针对所播放的视频文件中当前时间轴的目标图像的编辑指令,其中,所述编辑指令包括所述目标图像的待编辑坐标和编辑类型;
    根据所述待编辑坐标锁定所述目标图像中的目标物体;
    根据所述编辑类型对所述目标物体进行编辑;
    在所述视频文件的当前及后续时间轴的图像中展示编辑后的目标物体。
  2. 根据权利要求1所述的多媒体信息展示方法,所述编辑类型包括获取原始视频文件,其中,所述原始视频文件为未经过后期处理的原始图像信息。
  3. 根据权利要求2所述的多媒体信息展示方法,所述编辑指令包括用户身份信息,所述获取所述原始视频文件之前还包括:
    通过所述用户身份信息获取所述用户原始视频文件的获取权限;
    当所述获取权限符合预设规则,则从数据库中获取所述原始视频文件。
  4. 根据权利要求1或2所述的多媒体信息展示方法,所述根据所述待编辑坐标锁定所述目标图像中的目标物体的方法包括:
    将所述目标图像输入至第一神经网络模型中,以识别出所述目标图像中的物体以及所述物体所映射的坐标区域;
    将所述待编辑坐标在所述坐标区域中匹配以确定所属的目标物体。
  5. 根据权利要求1或2所述的多媒体信息展示方法,所述编辑类型包括音色转换,对所述目标物体进行音色转换的方法包括:
    获取音色转换指令中的目标音色参数;
    识别所述目标物体所映射的声源信息;
    将所述声源信息输入第二神经网络模型中以输出符合所述目标音色参数的目标声源信息。
  6. 根据权利要求1或2所述的多媒体信息展示方法,所述编辑类型还包括:添加文字或图像、改变所述目标物体的大小和形状、对所述目标物体进行渲染。
  7. 根据权利要求5所述的多媒体信息展示方法,所述目标音色参数包括用户自定义的参数或者从音色数据库中选取的指定参数。
  8. 一种多媒体信息展示装置,包括:
    获取模块:被配置为执行获取用户输入的针对所播放的视频文件中当前时间轴 的目标图像的编辑指令,其中,所述编辑指令包括所述目标图像的待编辑坐标和编辑类型;
    锁定模块:被配置为执行根据所述待编辑坐标锁定所述目标图像中的目标物体;
    编辑模块:被配置为执行根据所述编辑类型对所述目标物体进行编辑;
    展示模块:被配置为执行在所述视频文件的后续时间轴的图像中展示编辑后的目标物体。
  9. 一种计算机设备,包括:
    一个或多个处理器;
    存储器;
    一个或多个计算机程序,其中所述一个或多个计算机程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个计算机程序配置用于执行一种多媒体信息展示方法,所述多媒体信息展示方法包括以下步骤:
    获取用户输入的针对所播放的视频文件中当前时间轴的目标图像的编辑指令,其中,所述编辑指令包括所述目标图像的待编辑坐标和编辑类型;
    根据所述待编辑坐标锁定所述目标图像中的目标物体;
    根据所述编辑类型对所述目标物体进行编辑;
    在所述视频文件的当前及后续时间轴的图像中展示编辑后的目标物体。
  10. 根据权利要求9所述的计算机设备,所述编辑类型包括获取原始视频文件,其中,所述原始视频文件为未经过后期处理的原始图像信息。
  11. 根据权利要求10所述的计算机设备,所述编辑指令包括用户身份信息,所述获取所述原始视频文件之前还包括:
    通过所述用户身份信息获取所述用户原始视频文件的获取权限;
    当所述获取权限符合预设规则,则从数据库中获取所述原始视频文件。
  12. 根据权利要求9或10所述的计算机设备,所述根据所述待编辑坐标锁定所述目标图像中的目标物体的方法包括:
    将所述目标图像输入至第一神经网络模型中,以识别出所述目标图像中的物体以及所述物体所映射的坐标区域;
    将所述待编辑坐标在所述坐标区域中匹配以确定所属的目标物体。
  13. 根据权利要求9或10所述的计算机设备,所述编辑类型包括音色转换,对所述目标物体进行音色转换的方法包括:
    获取音色转换指令中的目标音色参数;
    识别所述目标物体所映射的声源信息;
    将所述声源信息输入第二神经网络模型中以输出符合所述目标音色参数的目标声源信息。
  14. 根据权利要求9或10所述的计算机设备,所述编辑类型还包括:添加文字或图像、改变所述目标物体的大小和形状、对所述目标物体进行渲染。
  15. 根据权利要求13所述的计算机设备,所述目标音色参数包括用户自定义的参数或者从音色数据库中选取的指定参数。
  16. 一种存储有计算机可读指令的存储介质,所述计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现一种多媒体信息展示方法,所述多媒体信息展示方法包括以下步骤:
    获取用户输入的针对所播放的视频文件中当前时间轴的目标图像的编辑指令,其中,所述编辑指令包括所述目标图像的待编辑坐标和编辑类型;
    根据所述待编辑坐标锁定所述目标图像中的目标物体;
    根据所述编辑类型对所述目标物体进行编辑;
    在所述视频文件的当前及后续时间轴的图像中展示编辑后的目标物体。
  17. 根据权利要求16所述的存储有计算机可读指令的存储介质,所述编辑类型包括获取原始视频文件,其中,所述原始视频文件为未经过后期处理的原始图像信息。
  18. 根据权利要求17所述的存储有计算机可读指令的存储介质,所述编辑指令包括用户身份信息,所述获取所述原始视频文件之前还包括:
    通过所述用户身份信息获取所述用户原始视频文件的获取权限;
    当所述获取权限符合预设规则,则从数据库中获取所述原始视频文件。
  19. 根据权利要求16或17所述的存储有计算机可读指令的存储介质,所述根据所述待编辑坐标锁定所述目标图像中的目标物体的方法包括:
    将所述目标图像输入至第一神经网络模型中,以识别出所述目标图像中的物体以及所述物体所映射的坐标区域;
    将所述待编辑坐标在所述坐标区域中匹配以确定所属的目标物体。
  20. 根据权利要求16或17所述的存储有计算机可读指令的存储介质,所述编辑类型包括音色转换,对所述目标物体进行音色转换的方法包括:
    获取音色转换指令中的目标音色参数;
    识别所述目标物体所映射的声源信息;
    将所述声源信息输入第二神经网络模型中以输出符合所述目标音色参数的目标声源信息。
PCT/CN2019/116761 2019-07-19 2019-11-08 多媒体信息展示方法、装置、计算机设备及存储介质 WO2021012491A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910657196.4A CN110475157A (zh) 2019-07-19 2019-07-19 多媒体信息展示方法、装置、计算机设备及存储介质
CN201910657196.4 2019-07-19

Publications (1)

Publication Number Publication Date
WO2021012491A1 true WO2021012491A1 (zh) 2021-01-28

Family

ID=68508153

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116761 WO2021012491A1 (zh) 2019-07-19 2019-11-08 多媒体信息展示方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN110475157A (zh)
WO (1) WO2021012491A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460183B (zh) * 2020-03-30 2024-02-13 北京金堤科技有限公司 多媒体文件生成方法和装置、存储介质、电子设备
CN111862275B (zh) * 2020-07-24 2023-06-06 厦门真景科技有限公司 基于3d重建技术的视频编辑方法和装置以及设备
CN112312203B (zh) * 2020-08-25 2023-04-07 北京沃东天骏信息技术有限公司 视频播放方法、装置和存储介质
CN112561988A (zh) * 2020-12-22 2021-03-26 咪咕文化科技有限公司 多媒体资源的定位方法、电子设备及可读存储介质
CN113825018B (zh) * 2021-11-22 2022-02-08 环球数科集团有限公司 一种基于图像处理的视频处理管理平台
CN114359099A (zh) * 2021-12-31 2022-04-15 深圳市爱剪辑科技有限公司 一种多功能视效美化处理系统和应用

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007336106A (ja) * 2006-06-13 2007-12-27 Osaka Univ 映像編集支援装置
CN107959883A (zh) * 2017-11-30 2018-04-24 广州市百果园信息技术有限公司 视频编辑推送方法、系统及智能移动终端
CN108062760A (zh) * 2017-12-08 2018-05-22 广州市百果园信息技术有限公司 视频编辑方法、装置及智能移动终端
CN109168024A (zh) * 2018-09-26 2019-01-08 平安科技(深圳)有限公司 一种目标信息的识别方法及设备

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8819559B2 (en) * 2009-06-18 2014-08-26 Cyberlink Corp. Systems and methods for sharing multimedia editing projects
US9058757B2 (en) * 2012-08-13 2015-06-16 Xerox Corporation Systems and methods for image or video personalization with selectable effects
CN104780339A (zh) * 2015-04-16 2015-07-15 美国掌赢信息科技有限公司 一种即时视频中的表情特效动画加载方法和电子设备
CN108259788A (zh) * 2018-01-29 2018-07-06 努比亚技术有限公司 视频编辑方法、终端和计算机可读存储介质
CN109841225B (zh) * 2019-01-28 2021-04-30 北京易捷胜科技有限公司 声音替换方法、电子设备和存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007336106A (ja) * 2006-06-13 2007-12-27 Osaka Univ 映像編集支援装置
CN107959883A (zh) * 2017-11-30 2018-04-24 广州市百果园信息技术有限公司 视频编辑推送方法、系统及智能移动终端
CN108062760A (zh) * 2017-12-08 2018-05-22 广州市百果园信息技术有限公司 视频编辑方法、装置及智能移动终端
CN109168024A (zh) * 2018-09-26 2019-01-08 平安科技(深圳)有限公司 一种目标信息的识别方法及设备

Also Published As

Publication number Publication date
CN110475157A (zh) 2019-11-19

Similar Documents

Publication Publication Date Title
WO2021012491A1 (zh) 多媒体信息展示方法、装置、计算机设备及存储介质
US10867416B2 (en) Harmonizing composite images using deep learning
Lukac Computational photography: methods and applications
US10049477B1 (en) Computer-assisted text and visual styling for images
KR101887216B1 (ko) 이미지 재구성 서버 및 방법
US10922860B2 (en) Line drawing generation
CN100456804C (zh) 面部图像补偿设备和方法
US11024060B1 (en) Generating neutral-pose transformations of self-portrait images
CN112040273B (zh) 视频合成方法及装置
US11663467B2 (en) Methods and systems for geometry-aware image contrast adjustments via image-based ambient occlusion estimation
WO2023077742A1 (zh) 视频处理方法及装置、神经网络的训练方法及装置
KR20200065433A (ko) 스타일 변환 모델 및 포토 몽타주 기반 합성 이미지의 스타일 변환 장치
CN109685713A (zh) 化妆模拟控制方法、装置、计算机设备及存储介质
CN111860380A (zh) 人脸图像生成方法、装置、服务器及存储介质
KR20180074977A (ko) 영상 간의 특질 변환 시스템 및 그 방법
CN106101576B (zh) 一种增强现实照片的拍摄方法、装置及移动终端
CN112102157A (zh) 视频换脸方法、电子设备和计算机可读存储介质
KR102482262B1 (ko) 객체 분할과 배경 합성을 이용한 데이터 증강 장치 및 방법
US20240054732A1 (en) Intermediary emergent content
Zhou et al. Photomat: A material generator learned from single flash photos
KR102659290B1 (ko) 모자이크 생성 장치 및 방법
CN116824020A (zh) 图像生成方法和装置、设备、介质和程序
US11366981B1 (en) Data augmentation for local feature detector and descriptor learning using appearance transform
CN111696182A (zh) 一种虚拟主播生成系统、方法和存储介质
Bagwari et al. An edge filter based approach of neural style transfer to the image stylization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19938930

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19938930

Country of ref document: EP

Kind code of ref document: A1