WO2020087979A1 - Method and apparatus for generating a model - Google Patents
Method and apparatus for generating a model
- Publication number
- WO2020087979A1 (PCT application PCT/CN2019/095735, CN2019095735W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- soundtrack
- sample
- difference
- model
- Prior art date
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/55—Push-based network services
Definitions
- The embodiments of the present application relate to the field of computer technology, for example, to a method and apparatus for generating a model.
- With the development of computer technology, short video applications have emerged. Users can use short video applications to upload and publish videos. When uploading a video with a short video application, the user is usually required to select a piece of music as the soundtrack of the video.
- In the related art, the user usually selects the soundtrack manually from a soundtrack list.
- the embodiments of the present application provide a method and a device for generating a model.
- In a first aspect, an embodiment of the present application provides a method for generating a model. The method includes: acquiring a sample set, where a sample in the sample set includes a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations; and extracting a sample from the sample set and performing the following training process: inputting the frames of the videos in the extracted sample into an initial model to obtain feature information of each video in the sample; determining a loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and in response to determining that training of the initial model is completed, determining the trained initial model as a video feature extraction model.
- In a second aspect, an embodiment of the present application provides an apparatus for generating a model.
- The apparatus includes: an acquisition unit configured to acquire a sample set, where a sample in the sample set includes a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations;
- and a training unit configured to extract a sample from the sample set and perform the following training process: inputting the frames of the videos in the extracted sample into an initial model to obtain feature information of each video in the sample; determining the loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and in response to determining that training of the initial model is completed, determining the trained initial model as a video feature extraction model.
- In a third aspect, an embodiment of the present application provides a method for pushing information, including: in response to receiving a target video, inputting the frames of the target video into a video feature extraction model generated by the method described in any one of the embodiments of the first aspect, to obtain target feature information of the target video; calculating the similarity between the target feature information and the feature information of the videos in a preset video library, and selecting a preset number of videos from the video library as candidate videos in descending order of similarity; and acquiring the soundtrack information of the candidate videos and pushing the soundtrack information.
- In a fourth aspect, an embodiment of the present application provides an apparatus for pushing information, including: a receiving unit configured to, in response to receiving a target video, input the frames of the target video into a video feature extraction model generated by the method described in any one of the embodiments of the first aspect, to obtain target feature information of the target video;
- a selection unit configured to calculate the similarity between the target feature information and the feature information of the videos in a preset video library, and select a preset number of videos from the video library as candidate videos in descending order of similarity;
- and a pushing unit configured to acquire the soundtrack information of the candidate videos and push the soundtrack information.
- In a fifth aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a storage device on which at least one program is stored, where the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of the embodiments of the first aspect and the third aspect.
- In a sixth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the method of any one of the embodiments of the first aspect and the third aspect.
- FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application can be applied;
- FIG. 2 is a flowchart of an embodiment of a method for generating a model according to the present application
- FIG. 3 is a schematic diagram of an application scenario of the method for generating a model according to the present application;
- FIG. 4 is a flowchart of still another embodiment of the method for generating a model according to the present application.
- FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for generating a model according to the present application.
- FIG. 6 is a flowchart of an embodiment of a method for pushing information according to the present application.
- FIG. 7 is a schematic structural diagram of an embodiment of an apparatus for pushing information according to the present application.
- FIG. 8 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
- FIG. 1 shows an exemplary system architecture 100 to which the method or apparatus for generating a model of the present application can be applied.
- the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105.
- the network 104 is a medium used to provide a communication link between the terminal devices 101, 102, 103 and the server 105.
- the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
- the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages, and so on.
- Various communication client applications can be installed on the terminal devices 101, 102, and 103, such as video recording applications, video playback applications, voice interaction applications, search applications, instant messaging tools, email clients, and social platform software.
- the terminal devices 101, 102, and 103 may be hardware or software.
- the terminal devices 101, 102, and 103 may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and so on.
- When the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above. They can be implemented as multiple software or software modules (for example, to provide distributed services), or as a single software or software module. No limitation is made here.
- When the terminal devices 101, 102, and 103 are hardware, an image acquisition device may also be installed on them.
- the image acquisition device may be various devices capable of realizing image acquisition functions, such as cameras, sensors, and so on. Users can use the image acquisition devices on the terminal devices 101, 102, and 103 to collect video.
- the server 105 may be a server that provides various services, for example, a video processing server for storing, managing, or analyzing videos uploaded by the terminal devices 101, 102, and 103.
- the video processing server can obtain the sample set. A large number of samples can be included in the sample set.
- the samples in the foregoing sample set may include the first video, the second video, and the third video.
- the first and second videos have the same soundtrack and are marked with the same soundtrack style.
- the first and third videos have different soundtracks and are marked with different soundtrack styles.
- the video processing server can use the samples in the sample set to train the initial model, and can store the training results (such as the generated video feature extraction model). In this way, after the user uploads the video by using the terminal devices 101, 102, and 103, the server 105 can determine the feature information of the video uploaded by the user, and further, can select and push the soundtrack information for the video.
- the server 105 may be hardware or software.
- When the server is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server.
- When the server is software, it can be implemented as multiple software or software modules (for example, to provide distributed services), or as a single software or software module. No limitation is made here.
- the method for generating models provided by the embodiments of the present application is generally executed by the server 105, and accordingly, the device for generating models is generally provided in the server 105.
- terminal devices, networks, and servers in FIG. 1 are only schematic. According to the implementation needs, there can be any number of terminal devices, networks and servers.
- the method for generating a model includes steps 201 to 206.
- step 201 a sample set is obtained.
- the execution subject of the method of generating a model can obtain the sample set in various ways.
- the execution subject may obtain the sample set stored in another server (such as a database server) for storing samples through a wired connection or a wireless connection.
- a user may collect samples through a terminal device (such as the terminal devices 101, 102, and 103 shown in FIG. 1). In this way, the above-mentioned execution subject can receive the samples collected by the terminal and store these samples locally, thereby generating a sample set.
- The wireless connection methods may include, but are not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra wideband) connections, and other wireless connection methods now known or developed in the future.
- the sample set may include a large number of samples.
- the sample may include a first video, a second video, and a third video.
- the first video and the second video have the same soundtrack and are marked with the same soundtrack style.
- the first and third videos have different soundtracks and are marked with different soundtrack styles.
- the soundtrack style label may be information for indicating and distinguishing the style of the soundtrack.
- the soundtrack style can be pre-divided into multiple types, such as sadness, cheerfulness, and soothing.
- In some implementations of this embodiment, the samples in the above sample set may be generated by the following steps: first, a video may be randomly extracted from the preset video library as the first video, where the videos in the above video library carry soundtrack annotations and soundtrack style annotations.
- In practice, the soundtrack annotation may be used to indicate and distinguish the soundtrack.
- the soundtrack label may be the name of the soundtrack.
- a video with the same soundtrack label and the same soundtrack style label as the first video can be randomly extracted from the video library as the second video.
- a video with a different soundtrack label and a different soundtrack style label from the first video can be randomly selected from the above video library as the third video.
- the first video, the second video, and the third video can be aggregated into samples.
- The samples in the above sample set may also be generated in other ways, such as manual selection, which will not be repeated here.
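- The triplet construction described above can be sketched in a few lines of Python. This is an illustrative sketch only: the library is assumed to be a list of dictionaries with hypothetical keys `soundtrack` and `style` standing in for the soundtrack annotation and the soundtrack style annotation, and it assumes the library actually contains suitable positive and negative videos.

```python
import random
from typing import Dict, List, Tuple

def build_sample(video_library: List[Dict]) -> Tuple[Dict, Dict, Dict]:
    """Assemble one (first, second, third) video sample from the video library."""
    # Randomly extract a video from the preset video library as the first video.
    first = random.choice(video_library)

    # Second video: same soundtrack annotation and same soundtrack style annotation.
    positives = [v for v in video_library
                 if v is not first
                 and v["soundtrack"] == first["soundtrack"]
                 and v["style"] == first["style"]]
    second = random.choice(positives)

    # Third video: different soundtrack annotation and different soundtrack style annotation.
    negatives = [v for v in video_library
                 if v["soundtrack"] != first["soundtrack"]
                 and v["style"] != first["style"]]
    third = random.choice(negatives)

    # Aggregate the three videos into a sample (triplet).
    return first, second, third

# A sample set can then be built as, for example:
# sample_set = [build_sample(video_library) for _ in range(num_samples)]
```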
- step 202 samples are extracted from the sample set.
- the execution subject may extract samples from the sample set acquired in step 201, and perform the training process from step 203 to step 206.
- the extraction method and number of samples are not limited in this application.
- For example, at least one sample may be extracted at random, or samples whose videos have better definition (i.e., whose video frames have higher resolution) may be extracted.
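- As a hedged illustration of the extraction strategies just mentioned, the sketch below pulls a batch either at random or restricted to samples whose frames meet a minimum resolution; the field name `frame_height` is a hypothetical stand-in for whatever definition metadata is actually available.

```python
import random

def extract_samples(sample_set, batch_size, min_height=None):
    """Extract samples for one training pass, either at random or restricted to
    samples whose video frames meet a minimum resolution."""
    pool = (
        [s for s in sample_set if s["frame_height"] >= min_height]  # hypothetical field
        if min_height is not None
        else sample_set
    )
    return random.sample(pool, min(batch_size, len(pool)))
```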
- step 203 the frames in the video in the extracted sample are input to the initial model to obtain the feature information of each video in the sample.
- the above-mentioned execution subject may input the frames in the video in the sample extracted in step 202 to the initial model. Since the extracted samples include the first video, the second video, and the third video, the feature information of the first video, the feature information of the second video, and the feature information of the third video can be obtained respectively.
- the initial model can output the feature information of the video by analyzing the frames in the video.
- feature information can be expressed in the form of vectors or matrices.
- the frames in the input video may be one or more frames randomly selected; or may be multiple frames extracted from the video at specified time intervals (for example, 1s or 2s, etc.). No limitation here.
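- A minimal sketch of interval-based frame extraction, assuming OpenCV is available for decoding and that the interval (1 s or 2 s in the examples above) is expressed in seconds:

```python
import cv2  # OpenCV, assumed available for decoding video files

def sample_frames(video_path: str, interval_s: float = 1.0):
    """Extract frames from a video at a specified time interval (e.g. 1s or 2s)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS metadata is missing
    step = max(int(round(fps * interval_s)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)              # BGR image as a numpy array
        index += 1
    cap.release()
    return frames
```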
- the initial model may be various models with image feature extraction functions created based on machine learning techniques.
- the initial model can perform feature extraction on the frames in the video, and then perform fusion, analysis and other processing on the extracted multi-frame features, and finally output the feature information of the video.
- the initial model may be a convolutional neural network using structures in various related technologies (eg, DenseBox, VGGNet, ResNet, SegNet, etc.).
- In practice, a convolutional neural network (CNN) is a feed-forward neural network whose artificial neurons can respond to surrounding units within part of their coverage area and which performs well in image processing. Therefore, a convolutional neural network can be used to extract features of the frames in a video.
- the established neural network may include convolutional layers, pooling layers, feature fusion layers, fully connected layers, and so on.
- the convolution layer can be set to extract image features.
- the pooling layer can be set to downsample the input information.
- the feature fusion layer may be configured to fuse the obtained image features corresponding to multiple frames (for example, in the form of a feature matrix or a feature vector). For example, the feature values at the same position in the feature matrix corresponding to different frames may be averaged to perform feature fusion to generate a fused feature matrix.
- the fully connected layer can be set to process the fused features.
- the above initial model may also be another model with image feature extraction function, which is not limited to the above example, and the model structure is not limited here.
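- The following PyTorch sketch illustrates one possible shape of such an initial model: a small convolutional backbone with pooling applied to each frame, feature fusion by averaging the per-frame features, and a fully connected layer producing the video feature vector. The layer sizes and the use of PyTorch are assumptions for illustration, not details taken from the application.

```python
import torch
import torch.nn as nn

class VideoFeatureNet(nn.Module):
    """Illustrative initial model: per-frame CNN backbone, mean feature fusion,
    and a fully connected layer that outputs the video feature vector."""

    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(                       # convolutional + pooling layers
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                 # downsample the input information
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                         # one value per channel
        )
        self.fc = nn.Linear(64, feature_dim)                 # fully connected layer

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W) for a single video
        per_frame = self.backbone(frames).flatten(1)         # (num_frames, 64)
        fused = per_frame.mean(dim=0)                        # feature fusion by averaging
        return self.fc(fused)                                # video feature vector
```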
- step 204 the loss value of the sample is determined based on the feature information and the annotation of the soundtrack style in the sample.
- The goal of training the initial model is to make the difference between feature information extracted from frames of videos with the same soundtrack and the same soundtrack style annotation as small as possible, while making the difference between feature information extracted from frames of videos with different soundtracks and different soundtrack style annotations as large as possible.
- the difference in feature information extracted from the frame of a video with the same soundtrack and with the same soundtrack style annotation can be referred to as the first difference.
- the difference in feature information extracted from the frames of videos with different soundtracks and with different soundtrack style annotations may be referred to as a second difference.
- the above-mentioned execution subject may first determine the values of the first difference and the second difference according to the feature information.
- Since the first video and the second video in the extracted sample have the same soundtrack and the same soundtrack style annotation, the first difference may be the difference between the feature information of the first video and the feature information of the second video. Since the first video and the third video in the extracted sample have different soundtracks and different soundtrack style annotations, and the second video and the third video also have different soundtracks and different soundtrack style annotations, the second difference may be the difference between the feature information of the first video and the feature information of the third video in the extracted sample, or the difference between the feature information of the second video and the feature information of the third video.
- the difference in feature information can be determined by means of Euclidean distance, cosine similarity, and the like. The more similar the feature information, the smaller the difference. The more dissimilar the feature information, the greater the difference.
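- For concreteness, the two difference measures mentioned above can be computed as follows (a sketch assuming the feature information is held in 1-D PyTorch tensors):

```python
import torch
import torch.nn.functional as F

def feature_difference(a: torch.Tensor, b: torch.Tensor) -> dict:
    """Measure the difference between two feature vectors.

    A smaller Euclidean distance or a larger cosine similarity means the
    feature information is more similar, i.e. the difference is smaller.
    """
    return {
        "euclidean": torch.dist(a, b, p=2).item(),
        "cosine_similarity": F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item(),
    }
```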
- the above-mentioned executive body may input the first difference and the second difference into a pre-established loss function to determine the loss value of the sample.
- the loss function is a non-negative real value function. In general, the smaller the value of the loss function (loss value), the better the robustness of the model.
- the loss function can be set according to actual needs.
- the above-mentioned loss function may be a function for characterizing the degree of difference between the second difference and the first difference, such as a triplet loss function.
- the above-mentioned execution subject may determine the loss value of the extracted sample through the following steps:
- the first step is to determine the Euclidean distance of the feature information of the video marked with the same soundtrack style, and use this distance as the first Euclidean distance. And determine the Euclidean distance of the feature information of the video marked by different soundtrack styles, and use this distance as the second Euclidean distance.
- the difference between the second Euclidean distance and the first Euclidean distance is determined.
- the above difference is compared with the first preset value to determine the sample loss value, where the above first preset value is a positive number (for example, 0.2).
- the first preset numerical value may be a numerical value specified in advance by a technician based on a large number of data statistics and calculations.
- In an embodiment, in response to determining that the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the execution subject may determine a second preset value (for example, 0) as the loss value of the sample.
- It can be understood that when the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the numerical relationship between the two distances can be considered to meet expectations. In this case, the loss value can be set to a small number, so that the loss value of the sample has little effect on gradient descent or does not participate in gradient descent.
- Since the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the difference between that value and the first preset value (referred to here as the target value) is greater than 0; therefore, the second preset value may be set to a number smaller than the target value, for example, 0.
- In an embodiment, in response to determining that the difference is less than or equal to the first preset value, the execution subject may determine the difference between the first preset value and the difference as the loss value of the sample. It can be understood that when the difference between the second Euclidean distance and the first Euclidean distance is less than or equal to the first preset value, the numerical relationship between the second Euclidean distance and the first Euclidean distance can be considered not to meet expectations. In this case, since the difference between the first preset value and the difference is greater than zero, that value can be determined as the loss value of the sample and participate in gradient descent.
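- Putting the three steps together, a hedged sketch of the loss computation is given below; `margin` plays the role of the first preset value (0.2 in the example) and `floor` the second preset value (0 in the example). Using a clamp keeps the result differentiable, which matches the requirement that the sample barely participates in gradient descent when the distance relationship already meets expectations.

```python
import torch

def sample_loss(feat_first: torch.Tensor,
                feat_second: torch.Tensor,
                feat_third: torch.Tensor,
                margin: float = 0.2,   # first preset value
                floor: float = 0.0     # second preset value
                ) -> torch.Tensor:
    # First Euclidean distance: videos with the same soundtrack style annotation.
    d1 = torch.dist(feat_first, feat_second, p=2)
    # Second Euclidean distance: videos with different soundtrack style annotations.
    d2 = torch.dist(feat_first, feat_third, p=2)
    diff = d2 - d1
    # If diff > margin, the distance relationship meets expectations and the loss is
    # the second preset value; otherwise the loss is margin - diff, which is positive
    # and participates in gradient descent (the usual triplet-loss formulation).
    return torch.clamp(margin - diff, min=floor)
```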
- step 205 it is determined whether the initial model has been trained based on the loss value.
- the above-mentioned execution subject may determine whether the initial model is trained based on the loss value determined in step 204.
- the above-mentioned executive body may determine whether the loss value has converged. In the case where the loss value is determined to converge, it can be determined that the initial model at this time has been trained.
- As another example, the above-mentioned execution subject may first compare the loss value with the target value. In response to determining that the loss value is less than or equal to the target value, the execution subject may count, among the loss values determined in a preset number (for example, 100) of the most recently performed training processes, the proportion of loss values that are less than or equal to the target value. When this proportion is greater than a preset ratio (for example, 95%), it can be determined that training of the initial model is completed.
- the target value can generally be set to an ideal situation indicating the degree of inconsistency between the predicted value and the true value. That is, in the case where the loss value is less than or equal to the target value, the predicted value can be considered to be close to or approximate to the true value.
- the target value can be set according to actual needs.
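- One way to implement the convergence check described above is to keep the most recent loss values and report completion once the proportion at or below the target value exceeds a preset ratio; the sketch below uses the example numbers given in the text (window of 100, ratio of 95%).

```python
from collections import deque

class ConvergenceMonitor:
    """Track the most recent `window` loss values and report training as complete
    once the proportion of values at or below `target` exceeds `ratio`."""

    def __init__(self, target: float, window: int = 100, ratio: float = 0.95):
        self.target = target
        self.ratio = ratio
        self.recent = deque(maxlen=window)

    def update(self, loss_value: float) -> bool:
        self.recent.append(loss_value)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history yet to judge convergence
        below = sum(1 for v in self.recent if v <= self.target)
        return below / len(self.recent) > self.ratio
```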
- It should be noted that, in response to determining that the initial model has been trained, step 206 may be continued. In response to determining that training of the initial model is not completed,
- the parameters in the initial model can be updated based on the determined loss value, samples can be re-extracted from the above sample set, and the initial model with the updated parameters can be used as the initial model to continue the above training process.
- the gradient of the loss value relative to the model parameters can be obtained using a back propagation algorithm, and then the model parameters can be updated based on the gradient using a gradient descent algorithm.
- the above-mentioned back propagation algorithm, gradient descent algorithm and machine learning method are well-known technologies that have been widely researched and applied at present, and will not be repeated here.
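- A single training iteration of the kind described above might look as follows; `sample_loss` and `VideoFeatureNet` refer to the illustrative sketches earlier in this document, and the SGD optimizer and learning rate are assumptions.

```python
import torch

def train_step(model, optimizer, sample, loss_fn):
    """One training iteration: forward passes for the three videos in a sample,
    loss computation, back propagation, and a gradient-descent parameter update.

    `sample` is assumed to be a tuple of three frame tensors, one per video,
    each shaped (num_frames, 3, H, W).
    """
    frames_first, frames_second, frames_third = sample
    optimizer.zero_grad()
    f1 = model(frames_first)    # feature information of the first video
    f2 = model(frames_second)   # feature information of the second video
    f3 = model(frames_third)    # feature information of the third video
    loss = loss_fn(f1, f2, f3)
    loss.backward()             # back propagation: gradients of the loss w.r.t. the parameters
    optimizer.step()            # gradient descent update of the model parameters
    return loss.item()

# Usage sketch (all names hypothetical):
# model = VideoFeatureNet()
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# loss_value = train_step(model, optimizer, sample, sample_loss)
```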
- The extraction method here is also not limited in this application. For example, in the case where there are a large number of samples in the sample set, the execution subject can extract samples that have not yet been extracted.
- step 206 in response to determining that the initial model training is completed, the trained initial model is determined as the video feature extraction model.
- the above-mentioned execution subject may determine the trained initial model as the video feature extraction model.
- In some implementations of this embodiment, after the video feature extraction model is obtained through training, in response to receiving a target video, the frames of the target video are input into the video feature extraction model to obtain the target feature information of the target video. Then, the similarity between the target feature information and the feature information of the videos in a preset video library may be calculated, and a preset number of videos may be selected from the video library as candidate videos in descending order of similarity. Finally, the soundtrack information of the candidate videos can be obtained, and the selected soundtrack information can be pushed.
- FIG. 3 is a schematic diagram of an application scenario of the method for generating a model according to this embodiment.
- a model training application may be installed on the terminal device 301 used by the user. After the user opens the application and uploads the sample set or the storage path of the sample set, the server 302 that provides background support for the application can run a method for generating a model, including:
- the sample set can be obtained.
- The samples in the above sample set include a first video, a second video, and a third video; the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, while the first video and the third video have different soundtracks and carry different soundtrack style annotations.
- Samples can be extracted from the above sample set, and the following training process is performed: frames are extracted from the videos in the extracted sample.
- the extracted frame 303 in the first video, the frame 304 in the second video, and the frame 305 in the third video are input to the initial model 306 to obtain feature information of each video in the sample.
- Based on the feature information and the soundtrack style annotation 307 in the sample, the loss value 308 of the sample is determined. Based on the loss value, whether training of the initial model is completed is determined. In response to determining that training of the initial model is completed, the trained initial model is determined as the video feature extraction model 309.
- By acquiring the sample set, samples can be extracted from it for training of the initial model.
- the samples in the foregoing sample set may include the first video, the second video, and the third video.
- the first and second videos have the same soundtrack and are marked with the same soundtrack style.
- the first and third videos have different soundtracks and are marked with different soundtrack styles.
- In this way, by inputting the frames of the videos in the extracted sample into the initial model, the feature information of each video in the sample can be obtained.
- the loss value of the sample can be determined based on the feature information and the annotation of the soundtrack style in the sample.
- it can be determined whether the initial model is trained based on the determined loss value. If the initial model training is completed, the trained initial model can be determined as the video feature extraction model. Therefore, a model that can be used to extract video features can be obtained, and the video features extracted by the model are helpful for automatic selection of video soundtracks.
- FIG. 4 shows a flow 400 of yet another embodiment of a method of generating a model.
- the process 400 of the method for generating a model includes steps 401 to 408.
- step 401 a sample set is obtained.
- the execution subject of the method of generating the model can obtain the sample set.
- a large number of samples can be included in the sample set.
- the sample may include a first video, a second video, and a third video.
- the first video and the second video have the same soundtrack and are marked with the same soundtrack style.
- the first and third videos have different soundtracks and are marked with different soundtrack styles.
- the soundtrack style label may be information for indicating and distinguishing the style of the soundtrack.
- the soundtrack style can be pre-divided into multiple types, such as sadness, cheerfulness, and soothing.
- In this embodiment, the samples in the above-mentioned sample set can be generated by the following steps. First, a video can be randomly extracted from the preset video library as the first video, where the videos in the above-mentioned video library carry soundtrack annotations and soundtrack style annotations, and a soundtrack style annotation is set to indicate the style of the soundtrack. Then, a video with the same soundtrack annotation and the same soundtrack style annotation as the first video can be randomly extracted from the video library as the second video. Next, a video with a different soundtrack annotation and a different soundtrack style annotation from the first video can be randomly selected from the above video library as the third video. Finally, the first video, the second video, and the third video can be aggregated into a sample.
- step 402 samples are extracted from the sample set.
- the execution subject may extract samples from the sample set acquired in step 401, and perform the training process from step 403 to step 408.
- the extraction method and number of samples are not limited in this application.
- it may be a random extraction of at least one sample, or a sample with better definition of the video from which the sample is extracted (ie, the pixels of the frame of the video in the sample are higher).
- step 403 the frames in the video in the extracted sample are input to the initial model to obtain the feature information of each video in the sample.
- The above-mentioned execution subject may input the frames of the videos in the sample extracted in step 402 into the initial model to obtain the feature information of the first video, the feature information of the second video, and the feature information of the third video, respectively.
- feature information can be expressed in the form of vectors or matrices.
- the initial model may be a convolutional neural network created based on machine learning technology.
- the initial model can perform feature extraction on the frames in the video, and then perform fusion, analysis and other processing on the extracted multi-frame features, and finally output the feature information of the video.
- Here, the established convolutional neural network may include a convolutional layer, a pooling layer, a feature fusion layer, a fully connected layer, and so on.
- step 404 the first Euclidean distance and the second Euclidean distance are determined.
- the above-mentioned execution subject may determine the first Euclidean distance and the second Euclidean distance.
- the first Euclidean distance is the Euclidean distance of the feature information of the video marked with the same soundtrack style
- the second Euclidean distance is the Euclidean distance of the feature information of the video marked with a different soundtrack style.
- Here, since the first video and the second video in the extracted sample have the same soundtrack and the same soundtrack style annotation, the first difference may be the difference between the feature information of the first video and the feature information of the second video. Because the first video and the third video in the extracted sample have different soundtracks and different soundtrack style annotations, and the second video and the third video carry different soundtrack style annotations, the second difference may be the difference between the feature information of the first video and the feature information of the third video in the extracted sample, or the difference between the feature information of the second video and the feature information of the third video.
- step 405 the difference between the second Euclidean distance and the first Euclidean distance is determined.
- the execution subject may determine the difference between the second Euclidean distance and the first Euclidean distance.
- step 406 the above difference is compared with the first preset value to determine the loss value of the sample.
- the execution subject may compare the difference with the first preset value to determine the loss value of the sample.
- the first preset value is a positive number (for example, 0.2).
- the first preset numerical value may be a numerical value specified in advance by a technician based on a large number of data statistics and calculations.
- In an embodiment, in response to determining that the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the execution subject may determine a second preset value (for example, 0) as the loss value of the sample.
- It can be understood that when the difference is greater than the first preset value, the numerical relationship between the two distances can be considered to meet expectations. In this case, the loss value can be set to a small value, so that the loss value of the sample has little effect on gradient descent or does not participate in gradient descent.
- The second preset value may be set to a number smaller than the difference between the above difference and the first preset value, for example, 0.
- In response to determining that the difference is less than or equal to the first preset value, the execution subject may determine the difference between the first preset value and the difference as the loss value of the sample. It can be understood that in this case the numerical relationship between the second Euclidean distance and the first Euclidean distance does not meet expectations; since the difference between the first preset value and the difference is greater than zero, that value can be determined as the loss value of the sample and participate in gradient descent.
- step 407 it is determined whether the initial model has been trained based on the loss value.
- the above-mentioned execution subject may determine whether the initial model is trained based on the loss value determined in step 406.
- the above-mentioned executive body may determine whether the loss value has converged. In the case where the loss value is determined to converge, it can be determined that the initial model at this time has been trained.
- It should be noted that, in response to determining that the initial model has been trained, step 408 may be continued. In response to determining that training of the initial model is not completed,
- the parameters in the initial model can be updated based on the determined loss value, samples can be re-extracted from the above sample set, and the initial model with the updated parameters can be used as the initial model to continue the above training process.
- the gradient of the loss value relative to the model parameters can be obtained using a back propagation algorithm, and then the model parameters can be updated based on the gradient using a gradient descent algorithm.
- the above-mentioned back propagation algorithm, gradient descent algorithm and machine learning method are well-known technologies that have been widely researched and applied at present, and will not be repeated here.
- The extraction method here is also not limited in this application. For example, when there are a large number of samples in the sample set, the execution subject can extract samples that have not yet been extracted.
- step 408 in response to determining that the initial model training is completed, the trained initial model is determined as the video feature extraction model.
- the above-mentioned execution subject may determine the trained initial model as the video feature extraction model.
- Compared with the embodiment corresponding to FIG. 2, the flow 400 of the method for generating a model in this embodiment involves one way of determining the loss value of the extracted sample. Therefore, the solution described in this embodiment can make the difference between the feature information extracted by the model from frames of videos with the same soundtrack and the same soundtrack style annotation as small as possible, while making the difference between the feature information extracted from frames of videos with different soundtracks and different soundtrack style annotations as large as possible. Thus, a model that can extract video features can be obtained, and the video features extracted by the model are helpful for the automatic selection of video soundtracks.
- the present application provides an embodiment of a device for generating a model.
- This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
- The apparatus 500 for generating a model includes: an obtaining unit 501 configured to obtain a sample set, where a sample in the sample set includes a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations; and a training unit 502 configured to extract a sample from the above sample set and perform the following training process: inputting the frames of the videos in the extracted sample into the initial model to obtain the feature information of each video in the sample; determining the loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the above loss value, whether training of the initial model is completed; and in response to determining that training of the initial model is completed, determining the trained initial model as the video feature extraction model.
- The training unit 502 may be configured to: determine a first Euclidean distance and a second Euclidean distance, where the first Euclidean distance is the Euclidean distance between the feature information of the videos with the same soundtrack style annotation, and the second Euclidean distance is the Euclidean distance between the feature information of the videos with different soundtrack style annotations; determine the difference between the second Euclidean distance and the first Euclidean distance; and compare the difference with a first preset value to determine the loss value of the sample, where the first preset value is a positive number.
- The training unit 502 may be configured to: in response to determining that the difference is greater than the first preset value, determine a second preset value as the loss value of the sample, where the second preset value is smaller than the difference between the above difference and the above first preset value.
- The training unit 502 may be configured to: in response to determining that the difference is less than or equal to the first preset value, determine the difference between the first preset value and the difference as the loss value of the sample.
- The samples in the above-mentioned sample set are generated by the following steps: randomly extracting a video from the preset video library as the first video, where the videos in the above-mentioned video library carry soundtrack annotations and soundtrack style annotations, and a soundtrack style annotation is set to indicate the style of the soundtrack; randomly extracting, from the above video library, a video with the same soundtrack annotation and the same soundtrack style annotation as the first video as the second video; randomly selecting, from the above video library, a video with a different soundtrack annotation and a different soundtrack style annotation from the above first video as the third video; and aggregating the above first video, second video, and third video into a sample.
- the apparatus may further include an update unit (not shown in the figure).
- The above-mentioned updating unit may be configured to, in response to determining that training of the initial model is not completed, update the parameters in the initial model based on the above loss value, re-extract a sample from the above sample set, use the updated initial model as the initial model, and continue to perform the above training process.
- the device provided in the above embodiment of the present application obtains the sample set through the obtaining unit 501, and samples can be extracted therefrom for training of the initial model.
- the samples in the foregoing sample set may include the first video, the second video, and the third video.
- the first and second videos have the same soundtrack and are marked with the same soundtrack style.
- the first and third videos have different soundtracks and are marked with different soundtrack styles.
- the training unit 502 inputs the frames of the video in the extracted sample to the initial model, and the feature information of each video in the sample can be obtained.
- the loss value of the sample can be determined based on the feature information and the annotation of the soundtrack style in the sample.
- FIG. 6 shows a process 600 of an embodiment of a method for pushing information provided by the present application.
- the method for pushing information may include steps 601 to 603.
- step 601 in response to receiving the target video, the frames in the target video are input to the video feature extraction model to obtain target feature information of the target video.
- In this embodiment, in response to receiving the target video, the execution subject of the method for pushing information may input the frames of the target video into the video feature extraction model to obtain the target feature information of the target video.
- The target video may be a video uploaded by a terminal device for which a soundtrack has not yet been selected.
- the video feature extraction model may be generated by using the method described in the embodiment of FIG. 2 described above.
- For the generation process reference may be made to the related description in the embodiment of FIG. 2, and details are not described herein again.
- step 602 similarity calculation is performed on the target feature information and the feature information of the video in the preset video library, and a preset number of videos are selected from the video library as candidate videos in order of similarity from large to small.
- In this embodiment, the execution subject may calculate the similarity between the target feature information and the feature information of the videos in the preset video library, and select a preset number of videos (for example, 5 videos) from the video library as candidate videos in descending order of similarity.
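- A sketch of the similarity ranking and candidate selection is given below, assuming the feature information of the library videos has been pre-computed into a single tensor and that cosine similarity is used as the similarity measure (other measures such as Euclidean distance would also fit the description above).

```python
import torch
import torch.nn.functional as F

def select_candidates(target_feature: torch.Tensor,
                      library_features: torch.Tensor,
                      k: int = 5):
    """Rank library videos by similarity to the target feature information and
    keep the top k (the preset number; 5 is the example used above).

    library_features is assumed to be a (num_videos, feature_dim) tensor whose
    rows are pre-computed features of the videos in the preset video library.
    """
    sims = F.cosine_similarity(target_feature.unsqueeze(0), library_features, dim=1)
    topk = torch.topk(sims, k=min(k, library_features.shape[0]))
    # The returned indices identify the candidate videos; their soundtrack
    # information can then be looked up and pushed to the terminal device.
    return topk.indices.tolist(), topk.values.tolist()
```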
- step 603 the soundtrack information of the candidate video is acquired, and the selected soundtrack information is pushed.
- the execution subject may obtain the soundtrack information of the candidate video.
- the soundtrack information may include but is not limited to at least one of the following: an audio file of the soundtrack, a name of the soundtrack, and a style name of the soundtrack.
- the selected soundtrack information can be pushed, for example, to the above terminal device, for user selection.
- the method for pushing information in this embodiment can be used to test the video feature extraction models generated in the foregoing embodiments.
- the video feature extraction model can be continuously optimized according to the test results.
- This method may also be a practical application of the video feature extraction model generated in the above embodiments. By using the video feature extraction model generated in the above embodiments to extract the target feature information of the target video, and selecting soundtrack information based on the extracted target feature information, a soundtrack can be recommended for a video that has no soundtrack, achieving rich and targeted information pushing.
- the present application provides an embodiment of an apparatus for pushing information.
- the device embodiment corresponds to the method embodiment shown in FIG. 6, and the device can be applied to various electronic devices.
- The apparatus 700 for pushing information includes: a receiving unit 701 configured to, in response to receiving a target video, input the frames of the target video into the video feature extraction model generated by the method described in the embodiment of FIG. 2 to obtain the target feature information of the target video;
- a selection unit 702 configured to calculate the similarity between the target feature information and the feature information of the videos in the preset video library, and select a preset number of videos from the above video library as candidate videos in descending order of similarity;
- and a pushing unit 703 configured to acquire the soundtrack information of the above candidate videos and push the selected soundtrack information.
- FIG. 8 shows a schematic structural diagram of a computer system 800 suitable for implementing an electronic device according to an embodiment of the present application.
- the electronic device shown in FIG. 8 is only an example, and should not bring any limitation to the functions and use scope of the embodiments of the present application.
- The computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from the storage section 808 into a random access memory (RAM) 803.
- the CPU 801, ROM 802, and RAM 803 are connected to each other through a bus 804.
- An input / output (Input / Output, I / O) interface 805 is also connected to the bus 804.
- The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; and an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker.
- the communication section 809 performs communication processing via a network such as the Internet.
- The drive 810 is also connected to the I/O interface 805 as needed.
- a removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed on the drive 810 as necessary, so that a computer program read therefrom is installed into the storage portion 808 as necessary.
- the process described above with reference to the flowchart may be implemented as a computer software program.
- embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart.
- the computer program may be downloaded and installed from the network through the communication section 809, and / or installed from the removable medium 811.
- When the computer program is executed by the central processing unit (CPU) 801, the above-mentioned functions defined in the method of the present application are executed.
- the computer-readable medium described in this application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
- The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with at least one wire, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
- the computer-readable signal medium may include a data signal that is propagated in a baseband or as part of a carrier wave, in which a computer-readable program code is carried. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
- the program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
- Each block in the flowchart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains at least one executable instruction.
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks represented in succession may actually be executed in parallel, and they may sometimes be executed in reverse order, depending on the functions involved.
- Each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented with a dedicated hardware-based system that performs the specified functions or operations, or can be implemented with a combination of dedicated hardware and computer instructions.
- the units described in the embodiments of the present application may be implemented in software or hardware.
- the described unit may also be provided in the processor.
- For example, the processor may be described as a processor including an acquisition unit and a training unit.
- the names of these units do not constitute a limitation on the unit itself.
- the acquisition unit can also be described as a “unit for acquiring a sample set”.
- the present application also provides a computer-readable medium, which may be included in the device described in the foregoing embodiments; or may exist alone without being assembled into the device.
- The above computer-readable medium carries at least one program, and when the at least one program is executed by the device, the device is caused to: acquire a sample set; and extract a sample from the sample set and perform the following training process: inputting the frames of the videos in the extracted sample into the initial model to obtain the feature information of each video in the sample; determining the loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and in response to determining that training of the initial model is completed, determining the trained initial model as the video feature extraction model.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The embodiments of the present application disclose a method and apparatus for generating a model. An example implementation of the method includes: acquiring a sample set; extracting a sample from the sample set and performing the following training process: inputting frames of the videos in the extracted sample into an initial model to obtain feature information of each video in the sample; determining a loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and in response to determining that training of the initial model is completed, determining the trained initial model as a video feature extraction model.
Description
This application claims priority to the Chinese patent application No. 201811273701.7, filed with the Chinese Patent Office on October 30, 2018, the entire contents of which are incorporated herein by reference.
The embodiments of the present application relate to the field of computer technology, for example, to a method and apparatus for generating a model.
With the development of computer technology, short video applications have emerged. Users can use short video applications to upload and publish videos. When uploading a video with a short video application, the user is usually required to select a piece of music as the soundtrack of the video.
In the related art, the user usually selects the soundtrack manually from a soundtrack list.
SUMMARY
The embodiments of the present application provide a method and apparatus for generating a model.
In a first aspect, an embodiment of the present application provides a method for generating a model. The method includes: acquiring a sample set, where a sample in the sample set includes a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations; and extracting a sample from the sample set and performing the following training process: inputting frames of the videos in the extracted sample into an initial model to obtain feature information of each video in the sample; determining a loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and in response to determining that training of the initial model is completed, determining the trained initial model as a video feature extraction model.
In a second aspect, an embodiment of the present application provides an apparatus for generating a model. The apparatus includes: an acquisition unit configured to acquire a sample set, where a sample in the sample set includes a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations; and a training unit configured to extract a sample from the sample set and perform the following training process: inputting frames of the videos in the extracted sample into an initial model to obtain feature information of each video in the sample; determining a loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and in response to determining that training of the initial model is completed, determining the trained initial model as a video feature extraction model.
In a third aspect, an embodiment of the present application provides a method for pushing information, including: in response to receiving a target video, inputting frames of the target video into a video feature extraction model generated by the method described in any one of the embodiments of the first aspect, to obtain target feature information of the target video; calculating the similarity between the target feature information and the feature information of videos in a preset video library, and selecting a preset number of videos from the video library as candidate videos in descending order of similarity; and acquiring soundtrack information of the candidate videos and pushing the soundtrack information.
In a fourth aspect, an embodiment of the present application provides an apparatus for pushing information, including: a receiving unit configured to, in response to receiving a target video, input frames of the target video into a video feature extraction model generated by the method described in any one of the embodiments of the first aspect, to obtain target feature information of the target video; a selection unit configured to calculate the similarity between the target feature information and the feature information of videos in a preset video library, and select a preset number of videos from the video library as candidate videos in descending order of similarity; and a pushing unit configured to acquire soundtrack information of the candidate videos and push the soundtrack information.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a storage device on which at least one program is stored, where the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of the embodiments of the first aspect and the third aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the method of any one of the embodiments of the first aspect and the third aspect.
FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application can be applied;
FIG. 2 is a flowchart of an embodiment of a method for generating a model according to the present application;
FIG. 3 is a schematic diagram of an application scenario of the method for generating a model according to the present application;
FIG. 4 is a flowchart of still another embodiment of the method for generating a model according to the present application;
FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for generating a model according to the present application;
FIG. 6 is a flowchart of an embodiment of a method for pushing information according to the present application;
FIG. 7 is a schematic structural diagram of an embodiment of an apparatus for pushing information according to the present application;
FIG. 8 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
The present application is described in detail below with reference to the accompanying drawings and embodiments. It can be understood that the embodiments described here are only used to explain the related invention, rather than to limit the application. It should also be noted that, for ease of description, only the parts related to the relevant application are shown in the drawings.
It should be noted that the embodiments in the present application and the features in the embodiments can be combined with each other without conflict. The present application will be described in detail below with reference to the drawings and in combination with the embodiments.
FIG. 1 shows an exemplary system architecture 100 to which the method for generating a model or the apparatus for generating a model of the present application can be applied.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104, and a server 105. The network 104 is a medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
Users may use the terminal devices 101, 102 and 103 to interact with the server 105 through the network 104 to receive or send messages, etc. Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as video recording applications, video playback applications, voice interaction applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102 and 103 may be hardware or software. When the terminal devices 101, 102 and 103 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, laptop portable computers and desktop computers. When the terminal devices 101, 102 and 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No limitation is made here.
When the terminal devices 101, 102 and 103 are hardware, an image acquisition device may also be installed on them. The image acquisition device may be any device capable of acquiring images, such as a camera or a sensor. Users may use the image acquisition devices on the terminal devices 101, 102 and 103 to capture video.
The server 105 may be a server that provides various services, for example, a video processing server for storing, managing or analyzing the videos uploaded by the terminal devices 101, 102 and 103. The video processing server can acquire a sample set, and the sample set may contain a large number of samples. A sample in the sample set may include a first video, a second video and a third video. The first video and the second video have the same soundtrack and carry the same soundtrack style annotation, while the first video and the third video have different soundtracks and carry different soundtrack style annotations. In addition, the video processing server may use the samples in the sample set to train an initial model, and may store the training result (such as the generated video feature extraction model). In this way, after a user uploads a video using the terminal devices 101, 102 and 103, the server 105 can determine the feature information of the uploaded video, and can then select and push soundtrack information for the video.
It should be noted that the server 105 may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No limitation is made here.
It should be noted that the method for generating a model provided by the embodiments of the present application is generally executed by the server 105, and accordingly, the apparatus for generating a model is generally provided in the server 105.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are only illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Referring to FIG. 2, a flow 200 of an embodiment of a method for generating a model according to the present application is shown. The method for generating a model includes steps 201 to 206.
In step 201, a sample set is acquired.
In this embodiment, the execution subject of the method for generating a model (for example, the server 105 shown in FIG. 1) can acquire the sample set in a variety of ways. For example, the execution subject may acquire, through a wired or wireless connection, a sample set stored in another server used to store samples (for example, a database server). As another example, a user may collect samples through terminal devices (for example, the terminal devices 101, 102 and 103 shown in FIG. 1). In this way, the execution subject can receive the samples collected by the terminals and store them locally, thereby generating the sample set. It should be pointed out that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connections now known or developed in the future.
Here, the sample set may include a large number of samples. A sample may include a first video, a second video and a third video. It should be noted that the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, while the first video and the third video have different soundtracks and carry different soundtrack style annotations. A soundtrack style annotation may be information used to indicate and distinguish the style of a soundtrack. Soundtrack styles may be divided into multiple categories in advance, such as sad, cheerful and soothing.
In some implementations of this embodiment, the samples in the sample set may be generated through the following steps. First, a video may be randomly extracted from a preset video library as the first video, where the videos in the video library carry soundtrack annotations and soundtrack style annotations. In practice, a soundtrack annotation may be set to indicate and distinguish the soundtrack; for example, the soundtrack annotation may be the name of the soundtrack. Then, a video with the same soundtrack annotation and the same soundtrack style annotation as the first video may be randomly extracted from the video library as the second video. Next, a video with a different soundtrack annotation and a different soundtrack style annotation from the first video may be randomly selected from the video library as the third video. Finally, the first video, the second video and the third video may be aggregated into a sample.
It should be noted that the samples in the sample set may also be generated in other ways, such as manual selection, which will not be repeated here.
In step 202, a sample is extracted from the sample set.
In this embodiment, the execution subject may extract a sample from the sample set acquired in step 201, and perform the training process of steps 203 to 206. The extraction method and the number of extracted samples are not limited in this application. For example, at least one sample may be extracted at random, or samples whose videos have better definition (that is, whose video frames have higher resolution) may be extracted.
In step 203, the frames of the videos in the extracted sample are input into an initial model to obtain feature information of each video in the sample.
In this embodiment, the execution subject may input the frames of the videos in the sample extracted in step 202 into the initial model. Since the extracted sample includes a first video, a second video and a third video, the feature information of the first video, the feature information of the second video and the feature information of the third video can be obtained respectively. The initial model can output the feature information of a video by analyzing the frames of the video. In practice, feature information may be expressed in the form of a vector or a matrix.
It should be noted that the input frames of a video may be one or more randomly extracted frames, or may be multiple frames extracted from the video at a specified time interval (for example, 1s or 2s). No limitation is made here.
In this embodiment, the initial model may be any of various models with an image feature extraction function created based on machine learning technology. The initial model can extract features from the frames of a video, then fuse and analyze the extracted multi-frame features, and finally output the feature information of the video.
As an example, the initial model may be a convolutional neural network that uses a structure from various related technologies (for example, DenseBox, VGGNet, ResNet or SegNet). In practice, a convolutional neural network (CNN) is a feed-forward neural network whose artificial neurons can respond to surrounding units within part of their coverage area and which performs well in image processing. Therefore, a convolutional neural network can be used to extract the features of the frames in a video.
In this example, the established convolutional neural network may include a convolutional layer, a pooling layer, a feature fusion layer, a fully connected layer, and so on. The convolutional layer may be set to extract image features. The pooling layer may be set to downsample the input information. The feature fusion layer may be set to fuse the obtained image features corresponding to the multiple frames (for example, in the form of a feature matrix or a feature vector); for example, the feature values at the same position in the feature matrices corresponding to different frames may be averaged to perform feature fusion and generate a fused feature matrix. The fully connected layer may be set to process the fused features.
It should be noted that the initial model may also be another model with an image feature extraction function and is not limited to the above example; the model structure is not limited here.
In step 204, the loss value of the sample is determined based on the feature information and the soundtrack style annotations in the sample.
In this embodiment, the goal of training the initial model is to make the difference between feature information extracted from frames of videos with the same soundtrack and the same soundtrack style annotation as small as possible, while making the difference between feature information extracted from frames of videos with different soundtracks and different soundtrack style annotations as large as possible. Accordingly, the difference between feature information extracted from frames of videos with the same soundtrack and the same soundtrack style annotation may be referred to as the first difference, and the difference between feature information extracted from frames of videos with different soundtracks and different soundtrack style annotations may be referred to as the second difference. The execution subject may first determine the values of the first difference and the second difference based on the feature information. In practice, since the first video and the second video in the extracted sample have the same soundtrack and the same soundtrack style annotation, the first difference may be the difference between the feature information of the first video and the feature information of the second video in the extracted sample. Since the first video and the third video in the extracted sample have different soundtracks and different soundtrack style annotations, and the second video and the third video also have different soundtracks and different soundtrack style annotations, the second difference may be the difference between the feature information of the first video and the feature information of the third video in the extracted sample, or the difference between the feature information of the second video and the feature information of the third video. Here, the difference between feature information can be determined using the Euclidean distance, the cosine similarity, and the like. The more similar the feature information, the smaller the difference; the less similar the feature information, the larger the difference.
After obtaining the first difference and the second difference, the execution subject may input the first difference and the second difference into a pre-established loss function to determine the loss value of the sample. A loss function is a non-negative real-valued function. In general, the smaller the value of the loss function (the loss value), the better the robustness of the model. The loss function can be set according to actual needs. As an example, the loss function may be a function used to characterize the degree of difference between the second difference and the first difference, such as a triplet loss function.
In some implementations of this embodiment, the execution subject may determine the loss value of the extracted sample through the following steps:
In a first step, the Euclidean distance between the feature information of the videos with the same soundtrack style annotation is determined and taken as the first Euclidean distance, and the Euclidean distance between the feature information of the videos with different soundtrack style annotations is determined and taken as the second Euclidean distance.
In a second step, the difference between the second Euclidean distance and the first Euclidean distance is determined.
In a third step, the difference is compared with a first preset value to determine the loss value of the sample, where the first preset value is a positive number (for example, 0.2). Here, the first preset value may be a value specified in advance by a technician based on a large amount of data statistics and calculation.
In an embodiment, in response to determining that the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the execution subject may determine a second preset value (for example, 0) as the loss value of the sample. It can be understood that when the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the numerical relationship between the second Euclidean distance and the first Euclidean distance can be considered to meet expectations. In this case, the loss value can be set to a small number, so that the loss value of the sample has little effect on gradient descent or does not participate in gradient descent. Since the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the difference between that value and the first preset value (referred to here as the target value) is greater than 0; therefore, the second preset value may be set to a number smaller than the target value, for example, 0.
In an embodiment, in response to determining that the difference between the second Euclidean distance and the first Euclidean distance is less than or equal to the first preset value, the execution subject may determine the difference between the first preset value and the difference as the loss value of the sample. It can be understood that when the difference between the second Euclidean distance and the first Euclidean distance is less than or equal to the first preset value, the numerical relationship between the second Euclidean distance and the first Euclidean distance can be considered not to meet expectations. In this case, since the difference between the first preset value and the difference is greater than zero, the difference between the first preset value and the difference can be determined as the loss value of the sample and participate in gradient descent.
在步骤205中,基于损失值确定初始模型是否训练完成。
在本实施例中,上述执行主体可以基于步骤204所确定的损失值,确定初始模型是否训练完成。作为示例,上述执行主体可以确定损失值是否已收敛。在确定损失值收敛的情况下,则可以确定此时的初始模型已训练完成。作为又一示例,上述执行主体可以首先将损失值与目标值进行比较。响应于确定损失值小于或等于目标值,可以统计最近执行的预设数量(例如100)次训练过程所确定的损失值中,小于或等于上述目标值的损失值的数量所占比例。在该比例大于预设比例(例如95%)时,可以确定初始模型训练完成。需要说明的是,目标值一般可以设置为表示预测值与真实值之间的不一致程度的理想情况。也就是说,在损失值小于或等于目标值的情况下,可以认为预测值接近或近似真实值。目标值可以根据实际需求来设置。
需要说明的是,响应于确定初始模型已训练完成,则可以继续执行步骤206。响应于确定初始模型未训练完成,可以基于所确定的损失值,更新初始模型中的参数,从上述样本集中重新提取样本,使用更新参数后的初始模型作为初始模型,继续执行上述训练过程。此处,可以利用反向传播算法求得损失值相对于模型参数的梯度,而后利用梯度下降算法基于梯度更新模型参数。需要说明的是,上述反向传播算法、梯度下降算法以及机器学习方法是目前广泛研究和应用的公知技术,在此不再赘述。需要指出的是,这里的提取方式在本申请中也不限制。例如在样本集中有大量样本的情况下,执行主体可以从中提取未被提取过的样本。
In step 206, in response to determining that training of the initial model is completed, the trained initial model is determined as the video feature extraction model.
In this embodiment, in response to determining that training of the initial model is completed, the execution body may determine the trained initial model as the video feature extraction model.
In some implementations of this embodiment, after the video feature extraction model is obtained through training, in response to receiving a target video, the frames of the target video are input into the video feature extraction model to obtain the target feature information of the target video; then, the similarity between the target feature information and the feature information of the videos in a preset video library may be calculated, and a preset number of videos may be selected from the video library as candidate videos in descending order of similarity; finally, the soundtrack information of the candidate videos may be obtained and pushed.
Referring to FIG. 3, FIG. 3 is a schematic diagram of an application scenario of the method for generating a model according to this embodiment. In the application scenario of FIG. 3, a model training application may be installed on the terminal device 301 used by the user. After the user opens the application and uploads a sample set or the storage path of a sample set, the server 302 that provides background support for the application may run the method for generating a model, which includes the following:
First, a sample set may be obtained, where a sample in the sample set includes a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations. Then, samples may be extracted from the sample set and the following training process may be executed: frames are extracted from the videos in the extracted sample; the extracted frames 303 of the first video, frames 304 of the second video, and frames 305 of the third video are input into the initial model 306 to obtain the feature information of each video in the sample; the loss value 308 of the sample is determined based on the feature information and the soundtrack style annotations 307 in the sample; whether training of the initial model is completed is determined based on the loss value; and, in response to determining that training of the initial model is completed, the trained initial model is determined as the video feature extraction model 309.
By obtaining a sample set, samples can be extracted from it to train the initial model. The samples in the sample set may include a first video, a second video, and a third video, where the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations. In this way, by inputting the frames of the videos in the extracted sample into the initial model, the feature information of each video in the sample can be obtained. Then, the loss value of the sample can be determined based on the feature information and the soundtrack style annotations in the sample. Finally, whether training of the initial model is completed can be determined based on the determined loss value. If training of the initial model is completed, the trained initial model can be determined as the video feature extraction model. A model that can be used to extract video features is thus obtained, and the video features extracted by this model facilitate the automatic selection of video soundtracks.
Referring to FIG. 4, a flow 400 of yet another embodiment of the method for generating a model is shown. The flow 400 of the method for generating a model includes steps 401 to 408.
In step 401, a sample set is obtained.
In this embodiment, the execution body of the method for generating a model (for example, the server 105 shown in FIG. 1) may obtain a sample set. The sample set may include a large number of samples, where a sample may include a first video, a second video, and a third video. It should be noted that the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, while the first video and the third video have different soundtracks and carry different soundtrack style annotations. A soundtrack style annotation may be information used to indicate and distinguish the style of a soundtrack. Soundtrack styles may be divided into multiple categories in advance, such as sad, cheerful, and soothing.
In this embodiment, the samples in the sample set may be generated through the following steps. First, a video may be randomly extracted from a preset video library as the first video, where the videos in the video library carry soundtrack annotations and soundtrack style annotations, and a soundtrack style annotation is set to indicate the style of the soundtrack. Next, a video that has the same soundtrack annotation and the same soundtrack style annotation as the first video may be randomly extracted from the video library as the second video. Then, a video that has a different soundtrack annotation and a different soundtrack style annotation from the first video may be randomly selected from the video library as the third video. Finally, the first video, the second video, and the third video may be aggregated into a sample.
In step 402, samples are extracted from the sample set.
In this embodiment, the execution body may extract samples from the sample set obtained in step 401 and perform the training process of steps 403 to 408. The manner of extraction and the number of extracted samples are not limited in this application. For example, at least one sample may be extracted at random, or samples whose videos have better definition (that is, whose video frames have higher resolution) may be selected.
In step 403, the frames of the videos in the extracted sample are input into the initial model to obtain the feature information of each video in the sample.
In this embodiment, the execution body may input the frames of the videos in the sample extracted in step 402 into the initial model to respectively obtain the feature information of the first video, the feature information of the second video, and the feature information of the third video. In practice, the feature information may be represented in the form of a vector or a matrix.
In this embodiment, the initial model may be a convolutional neural network created on the basis of machine learning techniques. The initial model may perform feature extraction on the frames of a video, then fuse and analyze the features of the extracted frames, and finally output the feature information of the video. Here, the established convolutional neural network may include a convolutional layer, a pooling layer, a feature fusion layer, a fully connected layer, and the like.
In step 404, a first Euclidean distance and a second Euclidean distance are determined.
In this embodiment, the execution body may determine a first Euclidean distance and a second Euclidean distance, where the first Euclidean distance is the Euclidean distance between the feature information of the videos carrying the same soundtrack style annotation, and the second Euclidean distance is the Euclidean distance between the feature information of the videos carrying different soundtrack style annotations.
Here, since the first video and the second video in the extracted sample have the same soundtrack and carry the same soundtrack style annotation, the first difference may be the difference between the feature information of the first video and the feature information of the second video in the extracted sample. Since the first video and the third video in the extracted sample have different soundtracks and carry different soundtrack style annotations, and the second video and the third video carry different soundtrack style annotations, the second difference may be the difference between the feature information of the first video and the feature information of the third video, or the difference between the feature information of the second video and the feature information of the third video.
In step 405, the difference between the second Euclidean distance and the first Euclidean distance is determined.
In this embodiment, the execution body may determine the difference between the second Euclidean distance and the first Euclidean distance.
In step 406, this difference is compared with a first preset value to determine the loss value of the sample.
In this embodiment, the execution body may compare the difference with a first preset value to determine the loss value of the sample, where the first preset value is a positive number (for example, 0.2). Here, the first preset value may be a value specified in advance by a technician based on statistics and calculation over a large amount of data.
In an embodiment, in response to determining that the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the execution body may determine a second preset value (for example, 0) as the loss value of the sample. It can be understood that, when the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the numerical relationship between the second Euclidean distance and the first Euclidean distance may be considered to meet expectations. In this case, the loss value may be set to a small value, so that the loss value of this sample has little or no influence on gradient descent. Since the difference between the second Euclidean distance and the first Euclidean distance is greater than the first preset value, the difference between that difference and the first preset value (which may be called the target value here) is greater than 0; therefore, the second preset value may be set to a number smaller than the target value, for example, 0.
In response to determining that the difference between the second Euclidean distance and the first Euclidean distance is less than or equal to the first preset value, the execution body may determine the difference between the first preset value and that difference as the loss value of the sample. It can be understood that, when the difference between the second Euclidean distance and the first Euclidean distance is less than or equal to the first preset value, the numerical relationship between the second Euclidean distance and the first Euclidean distance may be considered not to meet expectations. In this case, since the difference between the first preset value and that difference is greater than zero, it may be determined as the loss value of the sample and participate in gradient descent.
In step 407, whether training of the initial model is completed is determined based on the loss value.
In this embodiment, the execution body may determine, based on the loss value determined in step 406, whether training of the initial model is completed. As an example, the execution body may determine whether the loss value has converged; when it is determined that the loss value has converged, it may be determined that training of the initial model is completed.
It should be noted that, in response to determining that training of the initial model is completed, step 408 may continue to be executed. In response to determining that training of the initial model is not completed, the parameters of the initial model may be updated based on the determined loss value, samples may be re-extracted from the sample set, and the training process may be continued using the initial model with updated parameters as the initial model. Here, the back-propagation algorithm may be used to obtain the gradient of the loss value with respect to the model parameters, and the gradient descent algorithm may then be used to update the model parameters based on the gradient. It should be noted that the back-propagation algorithm, the gradient descent algorithm, and machine learning methods are well-known technologies that are widely studied and applied at present, and will not be repeated here. It should also be pointed out that the manner of extraction here is not limited in this application; for example, when there are a large number of samples in the sample set, the execution body may extract samples that have not been extracted before.
In step 408, in response to determining that training of the initial model is completed, the trained initial model is determined as the video feature extraction model.
In this embodiment, in response to determining that training of the initial model is completed, the execution body may determine the trained initial model as the video feature extraction model.
As can be seen from FIG. 4, compared with the embodiment corresponding to FIG. 2, the flow 400 of the method for generating a model in this embodiment involves one way of determining the loss value of the extracted sample. Thus, the scheme described in this embodiment can make the difference between the feature information extracted by the model from the frames of videos with the same soundtrack and the same soundtrack style annotation as small as possible, while making the difference between the feature information extracted from the frames of videos with different soundtracks and different soundtrack style annotations as large as possible. A model that can extract video features is thereby obtained, and the video features extracted by this model facilitate the automatic selection of video soundtracks.
Referring to FIG. 5, as an implementation of the methods shown in the above figures, this application provides an embodiment of an apparatus for generating a model. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be applied to various electronic devices.
As shown in FIG. 5, the apparatus 500 for generating a model in this embodiment includes: an acquisition unit 501 configured to obtain a sample set, where the samples in the sample set include a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations; and a training unit 502 configured to extract samples from the sample set and execute the following training process: inputting the frames of the videos in the extracted sample into an initial model to obtain the feature information of each video in the sample; determining the loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and, in response to determining that training of the initial model is completed, determining the trained initial model as the video feature extraction model.
In some implementations of this embodiment, the training unit 502 may be configured to: determine a first Euclidean distance and a second Euclidean distance, where the first Euclidean distance is the Euclidean distance between the feature information of the videos carrying the same soundtrack style annotation, and the second Euclidean distance is the Euclidean distance between the feature information of the videos carrying different soundtrack style annotations; determine the difference between the second Euclidean distance and the first Euclidean distance; and compare the difference with a first preset value to determine the loss value of the sample, where the first preset value is a positive number.
In some implementations of this embodiment, the training unit 502 may be configured to: in response to determining that the difference is greater than the first preset value, determine a second preset value as the loss value of the sample, where the second preset value is smaller than the difference between that difference and the first preset value.
In some implementations of this embodiment, the training unit 502 may be configured to: in response to determining that the difference is less than or equal to the first preset value, determine the difference between the first preset value and that difference as the loss value of the sample.
In some implementations of this embodiment, the samples in the sample set are generated through the following steps: randomly extracting a video from a preset video library as the first video, where the videos in the video library carry soundtrack annotations and soundtrack style annotations, and a soundtrack style annotation is set to indicate the style of the soundtrack; randomly extracting, from the video library, a video that has the same soundtrack annotation and the same soundtrack style annotation as the first video as the second video; randomly selecting, from the video library, a video that has a different soundtrack annotation and a different soundtrack style annotation from the first video as the third video; and aggregating the first video, the second video, and the third video into a sample.
In some implementations of this embodiment, the apparatus may further include an updating unit (not shown in the figure). The updating unit may be configured to, in response to determining that training of the initial model is not completed, update the parameters of the initial model based on the loss value, re-extract samples from the sample set, and continue to execute the training process using the initial model with updated parameters as the initial model.
In the apparatus provided by the above embodiment of this application, the acquisition unit 501 obtains a sample set, from which samples can be extracted to train the initial model. The samples in the sample set may include a first video, a second video, and a third video, where the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations. In this way, the training unit 502 inputs the frames of the videos in the extracted sample into the initial model and obtains the feature information of each video in the sample. Then, the loss value of the sample can be determined based on the feature information and the soundtrack style annotations in the sample. Finally, whether training of the initial model is completed can be determined based on the determined loss value. If training of the initial model is completed, the trained initial model can be determined as the video feature extraction model. A model that can extract video features is thereby obtained, and the video features extracted by this model facilitate the automatic selection of video soundtracks.
Referring to FIG. 6, a flow 600 of an embodiment of the method for pushing information provided by this application is shown. The method for pushing information may include steps 601 to 603.
In step 601, in response to receiving a target video, the frames of the target video are input into the video feature extraction model to obtain the target feature information of the target video.
In this embodiment, in response to receiving a target video, the execution body of the method for pushing information (for example, the server 105 shown in FIG. 1, or another server storing the video feature extraction model) may input the frames of the target video into the video feature extraction model to obtain the target feature information of the target video. Here, the target video may be a video uploaded by a terminal device that has not yet been given a soundtrack.
In this embodiment, the video feature extraction model may be generated by the method described in the embodiment of FIG. 2 above. For the generation process, reference may be made to the related description of the embodiment of FIG. 2, which will not be repeated here.
In step 602, the similarity between the target feature information and the feature information of the videos in a preset video library is calculated, and a preset number of videos are selected from the video library as candidate videos in descending order of similarity.
In this embodiment, the execution body may calculate the similarity between the target feature information and the feature information of the videos in the preset video library, and select a preset number (for example, 5) of videos from the video library as candidate videos in descending order of similarity.
In step 603, the soundtrack information of the candidate videos is obtained and pushed.
In this embodiment, the execution body may obtain the soundtrack information of the candidate videos. Here, the soundtrack information may include, but is not limited to, at least one of the following: the audio file of the soundtrack, the name of the soundtrack, and the style name of the soundtrack. Finally, the selected soundtrack information may be pushed, for example, to the above terminal device for the user to choose from.
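As an illustrative sketch only, candidate selection and collection of the soundtrack information to be pushed might look as follows; the cosine similarity measure, the library record fields, and the top-k value of 5 are assumptions of this example (the disclosure leaves the similarity measure and the preset number open).

```python
import numpy as np

def recommend_soundtracks(target_feature, library, top_k=5):
    """Select candidate videos by similarity and collect their soundtrack info.

    `library` is assumed to be a list of dicts with precomputed 'feature',
    'soundtrack_name' and 'soundtrack_style' fields (illustrative only).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scored = sorted(library,
                    key=lambda v: cosine(target_feature, v["feature"]),
                    reverse=True)             # descending order of similarity
    candidates = scored[:top_k]               # preset number of candidate videos
    return [{"soundtrack_name": v["soundtrack_name"],
             "soundtrack_style": v["soundtrack_style"]}
            for v in candidates]              # soundtrack information to push
```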
It should be noted that the method for pushing information in this embodiment may be used to test the video feature extraction model generated by the above embodiments, and the video feature extraction model may then be continuously optimized according to the test results. The method may also be a practical application of the video feature extraction model generated by the above embodiments. By using the video feature extraction model generated by the above embodiments to extract the target feature information of the target video, and selecting soundtrack information based on the extracted feature information, soundtrack recommendation can be performed for videos without a soundtrack, achieving targeted information pushing.
Referring to FIG. 7, as an implementation of the method shown in FIG. 6 above, this application provides an embodiment of an apparatus for pushing information. This apparatus embodiment corresponds to the method embodiment shown in FIG. 6, and the apparatus may be applied to various electronic devices.
As shown in FIG. 7, the apparatus 700 for pushing information in this embodiment includes: a receiving unit 701 configured to, in response to receiving a target video, input the frames of the target video into the video feature extraction model generated by the method described in the embodiment of FIG. 2 above to obtain the target feature information of the target video; a selection unit 702 configured to calculate the similarity between the target feature information and the feature information of the videos in a preset video library and select a preset number of videos from the video library as candidate videos in descending order of similarity; and a pushing unit 703 configured to obtain the soundtrack information of the candidate videos and push the selected soundtrack information.
It can be understood that the units recorded in the apparatus 700 correspond to the steps in the method described with reference to FIG. 6. Therefore, the operations and features described above for the method are also applicable to the apparatus 700 and the units contained therein, and will not be repeated here.
Referring now to FIG. 8, a schematic structural diagram of a computer system 800 of an electronic device suitable for implementing the embodiments of this application is shown. The electronic device shown in FIG. 8 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of this application.
As shown in FIG. 8, the computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage portion 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the system 800. The CPU 801, the ROM 802, and the RAM 803 are connected to one another through a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 808 including a hard disk and the like; and a communication portion 809 including a network interface card such as a local area network (LAN) card or a modem. The communication portion 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage portion 808 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 809, and/or installed from the removable medium 811. When the computer program is executed by the central processing unit (CPU) 801, the above functions defined in the method of this application are executed. It should be noted that the computer-readable medium described in this application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to, an electrical connection with at least one wire, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this application, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In this application, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to wireless, wire, optical cable, radio frequency (RF), or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains at least one executable instruction for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments described in this application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquisition unit and a training unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit for obtaining a sample set".
As another aspect, this application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries at least one program, and when the at least one program is executed by the apparatus, the apparatus is caused to: obtain a sample set; extract samples from the sample set and execute the following training process: input the frames of the videos in the extracted sample into an initial model to obtain the feature information of each video in the sample; determine the loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determine, based on the loss value, whether training of the initial model is completed; and, in response to determining that training of the initial model is completed, determine the trained initial model as the video feature extraction model.
Claims (16)
- A method for generating a model, comprising: obtaining a sample set, wherein samples in the sample set comprise a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations; and extracting samples from the sample set and executing the following training process: inputting frames of the videos in an extracted sample into an initial model to obtain feature information of each video in the sample; determining a loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and, in response to determining that training of the initial model is completed, determining the trained initial model as a video feature extraction model.
- The method for generating a model according to claim 1, wherein the determining the loss value of the sample based on the feature information and the soundtrack style annotations in the sample comprises: determining a first Euclidean distance and a second Euclidean distance, wherein the first Euclidean distance is a Euclidean distance between feature information of videos carrying the same soundtrack style annotation, and the second Euclidean distance is a Euclidean distance between feature information of videos carrying different soundtrack style annotations; determining a difference between the second Euclidean distance and the first Euclidean distance; and comparing the difference with a first preset value to determine the loss value of the sample, wherein the first preset value is a positive number.
- The method for generating a model according to claim 2, wherein the comparing the difference with the first preset value to determine the loss value of the sample comprises: in response to determining that the difference is greater than the first preset value, determining a second preset value as the loss value of the sample, wherein the second preset value is smaller than a difference between the difference and the first preset value.
- The method for generating a model according to claim 2, wherein the comparing the difference with the first preset value to determine the loss value of the sample comprises: in response to determining that the difference is less than or equal to the first preset value, determining a difference between the first preset value and the difference as the loss value of the sample.
- The method for generating a model according to claim 1, wherein the samples in the sample set are generated through the following steps: randomly extracting a video from a preset video library as the first video, wherein videos in the video library carry soundtrack annotations and soundtrack style annotations; randomly extracting, from the video library, a video having the same soundtrack annotation and the same soundtrack style annotation as the first video as the second video; randomly selecting, from the video library, a video having a different soundtrack annotation and a different soundtrack style annotation from the first video as the third video; and aggregating the first video, the second video, and the third video into a sample.
- The method for generating a model according to claim 1, further comprising: in response to determining that training of the initial model is not completed, updating parameters of the initial model based on the loss value, re-extracting samples from the sample set, and continuing to execute the training process using the initial model with updated parameters as the initial model.
- An apparatus for generating a model, comprising: an acquisition unit configured to obtain a sample set, wherein samples in the sample set comprise a first video, a second video, and a third video, the first video and the second video have the same soundtrack and carry the same soundtrack style annotation, and the first video and the third video have different soundtracks and carry different soundtrack style annotations; and a training unit configured to extract samples from the sample set and execute the following training process: inputting frames of the videos in an extracted sample into an initial model to obtain feature information of each video in the sample; determining a loss value of the sample based on the feature information and the soundtrack style annotations in the sample; determining, based on the loss value, whether training of the initial model is completed; and, in response to determining that training of the initial model is completed, determining the trained initial model as a video feature extraction model.
- The apparatus for generating a model according to claim 7, wherein the training unit is configured to: determine a first Euclidean distance and a second Euclidean distance, wherein the first Euclidean distance is a Euclidean distance between feature information of videos carrying the same soundtrack style annotation, and the second Euclidean distance is a Euclidean distance between feature information of videos carrying different soundtrack style annotations; determine a difference between the second Euclidean distance and the first Euclidean distance; and compare the difference with a first preset value to determine the loss value of the sample, wherein the first preset value is a positive number.
- The apparatus for generating a model according to claim 8, wherein the training unit is configured to: in response to determining that the difference is greater than the first preset value, determine a second preset value as the loss value of the sample, wherein the second preset value is smaller than a difference between the difference and the first preset value.
- The apparatus for generating a model according to claim 8, wherein the training unit is configured to: in response to determining that the difference is less than or equal to the first preset value, determine a difference between the first preset value and the difference as the loss value of the sample.
- The apparatus for generating a model according to claim 7, wherein the samples in the sample set are generated through the following steps: randomly extracting a video from a preset video library as the first video, wherein videos in the video library carry soundtrack annotations and soundtrack style annotations; randomly extracting, from the video library, a video having the same soundtrack annotation and the same soundtrack style annotation as the first video as the second video; randomly selecting, from the video library, a video having a different soundtrack annotation and a different soundtrack style annotation from the first video as the third video; and aggregating the first video, the second video, and the third video into a sample.
- The apparatus for generating a model according to claim 7, further comprising: an updating unit configured to, in response to determining that training of the initial model is not completed, update parameters of the initial model based on the loss value, re-extract samples from the sample set, and continue to execute the training process using the initial model with updated parameters as the initial model.
- A method for pushing information, comprising: in response to receiving a target video, inputting frames of the target video into a video feature extraction model generated by the method according to any one of claims 1 to 6 to obtain target feature information of the target video; calculating a similarity between the target feature information and feature information of videos in a preset video library, and selecting a preset number of videos from the video library as candidate videos in descending order of similarity; and obtaining soundtrack information of the candidate videos, and pushing the soundtrack information.
- An apparatus for pushing information, comprising: a receiving unit configured to, in response to receiving a target video, input frames of the target video into a video feature extraction model generated by the method according to any one of claims 1 to 6 to obtain target feature information of the target video; a selection unit configured to calculate a similarity between the target feature information and feature information of videos in a preset video library, and select a preset number of videos from the video library as candidate videos in descending order of similarity; and a pushing unit configured to obtain soundtrack information of the candidate videos and push the soundtrack information.
- An electronic device, comprising: at least one processor; and a storage device storing at least one program thereon, wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1 to 6 and 13.
- A computer-readable medium storing a computer program thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 6 and 13.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201811273701.7 | 2018-10-30 | |
CN201811273701.7A (CN109492128B) | 2018-10-30 | 2018-10-30 | Method and apparatus for generating a model
Publications (1)
Publication Number | Publication Date
---|---
WO2020087979A1 | 2020-05-07
Family
ID=65693282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/CN2019/095735 (WO2020087979A1) | Method and apparatus for generating a model | 2018-10-30 | 2019-07-12
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109492128B (zh) |
WO (1) | WO2020087979A1 (zh) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN109492128B (zh) | 2018-10-30 | 2020-01-21 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating a model
CN109862393B (zh) | 2019-03-20 | 2022-06-14 | 深圳前海微众银行股份有限公司 | Soundtrack method, system, device, and storage medium for video files
CN112397163B (zh) | 2019-08-16 | 2024-02-02 | 北京大数医达科技有限公司 | Method, apparatus, electronic device, and medium for generating a case input model
CN112397195B (zh) | 2019-08-16 | 2024-02-09 | 北京大数医达科技有限公司 | Method, apparatus, electronic device, and medium for generating a physical examination model
CN112394924B (zh) | 2019-08-16 | 2024-06-07 | 北京大数医达科技有限公司 | Method, apparatus, electronic device, and medium for generating a question-asking model
CN112397196A (zh) | 2019-08-16 | 2021-02-23 | 北京大数医达科技有限公司 | Method and apparatus for generating an imaging examination recommendation model
CN112397194B (zh) | 2019-08-16 | 2024-02-06 | 北京大数医达科技有限公司 | Method, apparatus, and electronic device for generating a patient condition attribution interpretation model
CN111324773A (zh) | 2020-02-12 | 2020-06-23 | 腾讯科技(深圳)有限公司 | Background music construction method and apparatus, electronic device, and storage medium
CN111314771B (zh) | 2020-03-13 | 2021-08-27 | 腾讯科技(深圳)有限公司 | Video playing method and related device
CN111831855B (zh) | 2020-07-20 | 2022-09-27 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device, and medium for matching videos
CN113923517B (zh) | 2021-09-30 | 2024-05-07 | 北京搜狗科技发展有限公司 | Background music generation method and apparatus, and electronic device
CN114625876B (zh) | 2022-03-17 | 2024-04-16 | 北京字节跳动网络技术有限公司 | Method for generating an author feature model, and author information processing method and apparatus
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN102968416A (zh) | 2011-09-01 | 2013-03-13 | 佳能株式会社 | Device and method for performing recommendation based on user intention recognition
US8860720B1 (en) | 2014-01-02 | 2014-10-14 | Ubitus Inc. | System and method for delivering graphics over network
CN105975472A (zh) | 2015-12-09 | 2016-09-28 | 乐视网信息技术(北京)股份有限公司 | Recommendation method and apparatus
CN108122234B (zh) | 2016-11-29 | 2021-05-04 | 北京市商汤科技开发有限公司 | Convolutional neural network training and video processing method, apparatus, and electronic device
CN106779073B (zh) | 2016-12-27 | 2019-05-31 | 西安石油大学 | Media information classification method and apparatus based on deep neural network
CN106851394A (zh) | 2017-01-18 | 2017-06-13 | 广东小天才科技有限公司 | Background music switching method and apparatus
CN108509929A (zh) | 2018-04-08 | 2018-09-07 | 百度在线网络技术(北京)有限公司 | Method for generating a face detection model, and face detection method and apparatus
CN108805091B (zh) | 2018-06-15 | 2021-08-10 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating a model
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US20170125059A1 (en) * | 2015-11-02 | 2017-05-04 | Facebook, Inc. | Systems and methods for generating videos based on selecting media content items and moods
EP3276617A1 (en) * | 2016-07-29 | 2018-01-31 | Booktrack Holdings Limited | Systems and methods for automatic-generation of soundtracks for live speech audio
CN107273458A (zh) * | 2017-06-01 | 2017-10-20 | 百度在线网络技术(北京)有限公司 | Deep model training method and apparatus, and image retrieval method and apparatus
CN107888843A (zh) * | 2017-10-13 | 2018-04-06 | 深圳市迅雷网络技术有限公司 | Audio mixing method and apparatus for user-generated content, storage medium, and terminal device
CN109492128A (zh) * | 2018-10-30 | 2019-03-19 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating a model
Non-Patent Citations (1)
Title |
---|
QIE, ZI-HAN ET AL.: "An Artificial Neural Network Model for Video Background Music Selection", COMPUTER KNOWLEDGE AND TECHNOLOGY, vol. 13, no. 21, 25 July 2017 (2017-07-25), pages 173 - 175, XP055704021, ISSN: 1009-3044, DOI: 10.14004/j.cnki.ckt.2017.2332 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN111666922A (zh) * | 2020-07-02 | 2020-09-15 | 上海眼控科技股份有限公司 | Video matching method and apparatus, computer device, and storage medium
CN113627354A (zh) * | 2021-08-12 | 2021-11-09 | 北京百度网讯科技有限公司 | Model training and video processing method, apparatus, device, and storage medium
CN113627354B (zh) * | 2021-08-12 | 2023-08-08 | 北京百度网讯科技有限公司 | Model training and video processing method, apparatus, device, and storage medium
CN114863202A (zh) * | 2022-03-23 | 2022-08-05 | 腾讯科技(深圳)有限公司 | Video representation method and apparatus
Also Published As
Publication number | Publication date |
---|---|
CN109492128B (zh) | 2020-01-21 |
CN109492128A (zh) | 2019-03-19 |
Similar Documents
Publication | Title
---|---
WO2020087979A1 (zh) | Method and apparatus for generating a model
WO2020087974A1 (zh) | Method and apparatus for generating a model
US11176423B2 (en) | Edge-based adaptive machine learning for object recognition
CN108520220B (zh) | Model generation method and apparatus
CN109376267B (zh) | Method and apparatus for generating a model
JP7222008B2 (ja) | Video clip search method and apparatus
CN108830235B (zh) | Method and apparatus for generating information
CN109447156B (zh) | Method and apparatus for generating a model
CN109360028B (zh) | Method and apparatus for pushing information
CN109740018B (zh) | Method and apparatus for generating a video tag model
WO2020000879A1 (zh) | Image recognition method and apparatus
CN110688528B (zh) | Method, apparatus, electronic device, and medium for generating classification information of a video
CN109145828B (zh) | Method and apparatus for generating a video category detection model
CN110009059B (zh) | Method and apparatus for generating a model
JP7394809B2 (ja) | Method, apparatus, electronic device, medium, and computer program for processing video
CN111738010A (zh) | Method and apparatus for generating a semantic matching model
CN111340221A (zh) | Sampling method and apparatus for neural network structures
CN111539903B (zh) | Method and apparatus for training a face image synthesis model
CN111340220A (zh) | Method and apparatus for training a prediction model
CN110209658B (zh) | Data cleaning method and apparatus
CN111783810A (zh) | Method and apparatus for determining attribute information of a user
CN109816023B (zh) | Method and apparatus for generating a picture tag model
WO2024099171A1 (zh) | Video generation method and apparatus
JP2024508502A (ja) | Method and apparatus for pushing information
US20230367972A1 (en) | Method and apparatus for processing model data, electronic device, and computer readable medium
Legal Events
Date | Code | Title | Description
---|---|---|---
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19879490; Country of ref document: EP; Kind code of ref document: A1
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 22/06/2021)
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 19879490; Country of ref document: EP; Kind code of ref document: A1