WO2022188844A1 - Video classification method, apparatus, device and medium - Google Patents

Video classification method, apparatus, device and medium Download PDF

Info

Publication number
WO2022188844A1
WO2022188844A1 · PCT/CN2022/080208
Authority
WO
WIPO (PCT)
Prior art keywords
video
model
feature vector
text
sample
Prior art date
Application number
PCT/CN2022/080208
Other languages
English (en)
French (fr)
Inventor
陈凯兵
刘国翌
Original Assignee
百果园技术(新加坡)有限公司
陈凯兵
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 陈凯兵 filed Critical 百果园技术(新加坡)有限公司
Publication of WO2022188844A1 publication Critical patent/WO2022188844A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, and more particularly, to a video classification method, apparatus, device and medium.
  • a user may add a category label to the published short video, so as to achieve the purpose of classifying the short video.
  • however, users sometimes apply category tags arbitrarily, which leads to incorrect classification of the video and causes some videos unrelated to the category tag to appear on the category tag's aggregation page.
  • Embodiments of the present disclosure provide a video classification method, apparatus, device, and medium, which can improve the accuracy of video classification.
  • a video classification method comprising:
  • when the correlation score is greater than or equal to a preset score threshold, the category label of the target video is determined to be the target category label.
  • a video classification apparatus comprising:
  • the first acquisition module is configured to acquire the target video and the target category label
  • a video module configured to extract the video content feature of the target video through a preset video model, to obtain a video feature vector corresponding to the target video;
  • a text module configured to extract the text content feature of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label
  • a second obtaining module configured to obtain a correlation score between the target video and the target category label according to the video feature vector and the text feature vector;
  • the determining module is configured to determine the category label of the target video as the target category label when the correlation score is greater than or equal to a preset score threshold.
  • an electronic device comprising a memory and a processor, the memory being configured to store executable instructions, and the processor being configured to execute, under the control of the instructions, the video classification method described in the first aspect above.
  • a non-transitory computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the video classification method according to the first aspect of the present disclosure.
  • the video content features of the target video are extracted based on the pre-trained video model, and the text content features of the target category label are extracted based on the pre-trained text model; this can improve the accuracy of the extracted video content features and text content features, so that both accurately reflect the categories of the target video and the target category label.
  • meanwhile, the correlation score is calculated directly between the video feature vector composed of the video content features of the target video and the text feature vector composed of the text content features of the target category label, which can improve the accuracy of target video classification.
  • FIG. 1 is a schematic flowchart of a video classification method according to an embodiment of the present disclosure
  • FIG. 2 is a schematic flowchart of a video classification method according to another embodiment of the present disclosure.
  • FIG. 3 is a schematic flowchart of a video classification method according to another embodiment of the present disclosure.
  • FIG. 4 is a schematic block diagram of a video classification apparatus according to an embodiment of the present disclosure.
  • FIG. 5 is a block diagram of a hardware configuration of an electronic device according to an embodiment of the present disclosure.
  • a video classification method is provided.
  • the method is implemented by an electronic device.
  • the electronic device may be a server or a terminal device.
  • the video classification method may include the following steps S1100 to S1500.
  • Step S1100 acquiring the target video and the target category label.
  • the target video is any video uploaded by the user through the video platform, for example, the target video may be any short video uploaded by the user through the short video platform.
  • the target category label is the category label of the video set in which the target video is located. For example, when uploading a short video through the short video platform, a user can add a category label to the short video to classify the short video into the video set of that category label, thereby achieving the purpose of classifying the short video.
  • multiple category tags may be added to the video.
  • exemplarily, when a user uploads a short video B through a short video platform A, the short video B can be marked with a category label C.
  • the category label C marked by the user for the short video B may not be the true category label of the short video B; that is, the content of the short video B may not actually be related to the marked category label C, resulting in inaccurate classification of the short video B, and also causing some content unrelated to the category label C to appear in the video set of the category label C, degrading the video quality of that set.
  • Step S1200 extracting video content features of the target video by using a preset video model to obtain a video feature vector corresponding to the target video.
  • the preset video model is configured to extract video content features in the target video that can accurately reflect the category to which the target video belongs, and then obtain a video feature vector corresponding to the target video.
  • the preset video model may be a video model reflecting only the video cover, or a video model reflecting other video content in the video except the video cover, and of course, may also be a video model reflecting the entire video content.
  • the above video model reflects the relationship between the target video and the video content features.
  • the input of the video model is the target video
  • the output is the video content features extracted from the target video that can reflect the category to which the target video belongs.
  • the video model may be a neural network model, such as, but not limited to, a BP (Back Propagation) neural network model, a convolutional neural network model, etc. This embodiment does not specifically limit the video model here.
  • the video feature vector X corresponding to the target video is composed of the video content features x_j extracted by the video model, where j is a natural number from 1 to p and p represents the total number of extracted video content features; the value of p can be set according to the actual application scenario and actual needs, and may for example be 128.
  • Step S1300 Extract the text content feature of the target category label by using a preset text model to obtain a text feature vector corresponding to the target category label.
  • the preset text model is configured to extract text content features in the target category label that can accurately reflect the category to which the target category label belongs, and then obtain a text feature vector corresponding to the target category label.
  • the above text model reflects the relationship between the target category label and the text content feature.
  • the input of the text model is the target category label, and the output is the text content feature extracted from the target category label.
  • the text model may be a neural network model, such as, but not limited to, a BP (Back Propagation) neural network model, a convolutional neural network model, a Word2Vec model, etc.
  • the text model is not specifically limited in this embodiment.
  • the text feature vector Y corresponding to the target category label is composed of the text content features y_i extracted by the text model.
  • the value of i is a natural number from 1 to q.
  • q represents the total number of extracted text content features.
  • the value of q can be set according to the actual application scenario and actual needs.
  • so that the similarity between the video feature vector and the text feature vector can be computed, q is usually the same as p; here, the value of q is also 128.
  • the category label C that the user has marked for the short video B can be used as the input of the text model, so that the text model extracts from the category label C the 128-dimensional text content features y1, y2, ..., y128 that accurately reflect the category to which the category label C belongs, yielding the text feature vector Y = (y1, y2, ..., y128) corresponding to the category label C.
  • the above steps S1200 and this step S1300 are performed in no particular order.
  • the above step S1200 may be performed first to extract the video content features of the target video through the preset video model and obtain the video feature vector corresponding to the target video, and then this step S1300 may be performed to extract the text content features of the target category label through the preset text model and obtain the text feature vector corresponding to the target category label.
  • this step S1300 may be performed first, and then the above step S1200 may be performed.
  • this step S1300 and the above step S1200 may also be performed simultaneously.
  • Step S1400 Obtain the correlation score between the target video and the target category label according to the video feature vector and the text feature vector.
  • after the video feature vector and the text feature vector are obtained, the correlation score between them can be calculated, so as to determine, through the correlation score, whether the category label of the target video is the target category label.
  • obtaining the correlation score between the target video and the target category label according to the video feature vector and the text feature vector in this step S1400 may further include: obtaining the correlation score between the target video and the target category label according to the distance between the video feature vector and the text feature vector.
  • any distance calculation algorithm may be used to calculate the distance between the video feature vector and the text feature vector.
  • the distance calculation algorithm may be a cosine similarity (Cosine Similarity) algorithm; of course, other distance calculation algorithms configured to calculate the distance between vectors may also be used, such as the log-likelihood similarity algorithm, the Manhattan distance algorithm, etc.
  • the distance can be directly used as the correlation score.
  • mapping data of the mapping relationship between the distance and the correlation score may also be pre-stored, so that after the distance is obtained, the correlation score is obtained according to the distance and the mapping data.
  • Step S1500 when the correlation score is greater than or equal to a preset score threshold, determine the category label of the target video as the target category label.
  • the correlation score can be compared with a preset score threshold, so as to judge, according to the comparison result, whether the category label of the target video is the target category label.
  • the preset score threshold may be a value set according to actual application scenarios and actual requirements, and the preset score threshold may be 0.25.
  • when the correlation score is greater than or equal to the preset score threshold, it can be determined that the category label of the target video is the target category label, and the target video can be kept as a video in the video set of the target category label.
  • when the correlation score is less than the score threshold, the target video needs to be filtered out from the video set of the target category label, so as to improve the video quality of the video set of the target category label.
  • for example, if the correlation score is 0.3, which is greater than the score threshold of 0.25, the category label of the short video B is determined to be the category label C.
  • if instead the correlation score is, for example, 0.1, which is less than the score threshold, the short video B is filtered out from the video set of the category label C.
  • exemplarily, if the short videos included in the initial video set of the category label C are short video B, short video D, and short video E, the short videos included in the filtered video set are short video D and short video E.
  • the video content features of the target video are extracted based on the pre-trained video model, and the text content features of the target category label are extracted based on the pre-trained text model, which can improve the accuracy of the extracted video content features and text content features, so that both accurately reflect the categories of the target video and the target category label.
  • meanwhile, the correlation score is calculated directly between the video feature vector composed of the video content features of the target video and the text feature vector composed of the text content features of the target category label, which can improve the accuracy of target video classification.
  • the disclosed video classification method further includes the following steps S2100 to S2200:
  • Step S2100 acquiring a training sample set.
  • Each training sample in the training sample set includes a video sample and a sample category label of the video sample.
  • more training samples usually yield more accurate training results, but beyond a certain number the gain in accuracy slows until it levels off; the required number of training samples can therefore be determined by weighing the accuracy of the training results against the data processing cost.
  • Step S2200 through the training sample set, the basic video model and the basic text model are synchronously trained under the set convergence condition, and the trained basic video model is obtained as the preset video model and the trained basic text model is obtained as the preset text model.
  • the convergence condition includes: the video content features of the video samples extracted by the basic video model and the text content features of the sample category labels extracted by the basic text model both yield the classification results corresponding to the sample category labels.
  • synchronously training the basic video model and the basic text model under the set convergence condition through the training sample set, to obtain the trained basic video model as the preset video model and the trained basic text model as the preset text model, may further include steps S2210a to S2220a:
  • Step S2210a through the training sample set, the model parameters of the basic video model are fixed, and the basic text model is trained under the convergence condition, so as to obtain the basic text model after the first-stage training.
  • since the video model has a very large number of model parameters, training without stages would make the training period especially long and the model converge especially slowly; a staged approach, which first fixes the model parameters of the video model to train the text model and then trains the video model, can reduce the training period of the model and improve its convergence speed.
  • the training sample set includes a first sample set and a second sample set, wherein the number of samples in the first sample set is greater than the number of samples in the second sample set.
  • for the first sample set, the first set number of category labels with the largest numbers of short videos, among all the category labels set by the short video platform for short videos, can be selected as sample category labels, and for each sample category label, a second set number of short videos can be randomly selected as training videos.
  • the first set quantity may be a value set according to actual application scenarios and actual requirements, and the first set quantity may be, for example, 30,000.
  • the second set number may also be a value set according to actual application scenarios and actual requirements, and the second set number may be 500, for example. Exemplarily, when the first set number is 30,000 and the second set number is 500, the first sample set includes 15 million first training samples.
  • for the second sample set, the above first set number of category labels may be obtained as sample category labels, and for each sample category label, a third set number of short videos that users clicked to play may be collected as training videos.
  • the third set number may also be a value set according to actual application scenarios and actual requirements, and the third set number may be 100, for example. Exemplarily, when the first set number is 30,000 and the third set number is 100, the second sample set includes 3 million second training samples.
  • fixing the model parameters of the basic video model through the training sample set and training the basic text model under the convergence condition, to obtain the basic text model after the first-stage training, may further include: first, through the first sample set, fixing the model parameters of the basic video model and training the basic text model under the convergence condition to obtain a pre-trained basic text model; then, through the second sample set, fixing the model parameters of the basic video model and continuing to train the pre-trained basic text model, to obtain the basic text model after the first-stage training.
  • in this way, the text model is first trained on a large number of randomly collected training samples and then further trained on training samples collected from users' actual click behavior; this is equivalent to first training the text model with a large number of training samples to adjust its model parameters, and then fine-tuning those parameters with a small number of real training samples, which not only reduces the training period of the text model but also improves the accuracy of text model training.
  • Step S2220a through the training sample set, train the basic video model under the convergence condition and continue to train the basic text model following the basic text model trained in the first stage to obtain a preset video model and a preset text model.
  • the above second sample set can be used to train the basic video model under the convergence condition and continue to train the basic text model following the basic text model trained in the first stage to obtain a preset video model and a preset text model.
  • a staged training method is used to train the model, which can reduce the training period of the model and improve the convergence speed of the model.
  • each step of synchronously training the basic video model and the basic text model under the set convergence condition may further include the following steps S2210b to S2240b:
  • Step S2210b extract the video content feature of the video sample through the basic video model corresponding to the current step, and obtain a first sample feature vector corresponding to the video sample.
  • the basic video model corresponding to the current step can first extract 2048-dimensional video content features of the video sample, and then reduce the 2048-dimensional video content features to 128-dimensional video content features, to obtain the first sample feature vector corresponding to the video sample.
  • Step S2220b extract the text content feature of the sample category label through the basic text model corresponding to the current step, and obtain a second sample feature vector corresponding to the sample category label.
  • the basic text model corresponding to the current step can first extract 2048-dimensional text content features of the sample category label, and then reduce the 2048-dimensional text content features to 128-dimensional text content features, to obtain the second sample feature vector corresponding to the sample category label.
  • Step S2230b classify the first sample feature vector and the second sample feature vector respectively through the multi-classifiers sharing the classification parameters, and obtain a first classification result corresponding to the first sample feature vector and a second classification result corresponding to the second sample feature vector.
  • in step S2230b, during model training, the shared classification parameters of the multi-classifiers sharing the classification parameters are also adjusted at each training step, so that the classification of the first sample feature vector and the second sample feature vector by the multi-classifiers becomes increasingly accurate.
  • classifying the first sample feature vector and the second sample feature vector respectively through the multi-classifiers sharing the classification parameters, to obtain the first classification result corresponding to the first sample feature vector and the second classification result corresponding to the second sample feature vector, may further include the following steps S2231b to S2232b:
  • Step S2231b classify the first sample feature vector and the second sample feature vector respectively through the multi-classifiers sharing the classification parameters, and obtain a first initial classification result corresponding to the first sample feature vector and a second initial classification result corresponding to the second sample feature vector.
  • the types of category labels corresponding to the multi-classifier are the same as the types of sample category labels contained in the training sample set. For example, when the training sample set includes 30,000 sample category labels, the multi-classifier also corresponds to 30,000 types of category labels.
  • in step S2231b, the multi-classifiers sharing the classification parameters can classify the first sample feature vector and the second sample feature vector respectively, so as to obtain the score of the first sample feature vector for each category label and the score of the second sample feature vector for each category label.
  • Step S2232b performing normalization processing on the first initial classification result and the second initial classification result by using a preset normalization index function to obtain the first classification result and the second classification result.
  • the preset normalized exponential function can be a softmax function, through which the score of the first sample feature vector for each category label and the score of the second sample feature vector for each category label can both be mapped into the (0, 1) interval.
  • it can be understood that category labels in real-world scenarios often include many labels with the same semantics; after the scores of the first sample feature vector and the second sample feature vector for each category label are obtained, the scores of the first sample feature vector for different category labels may be close to one another.
  • likewise, the scores of the second sample feature vector for different category labels may also be close to one another.
  • the difference between the scores can be further enlarged by using the normalized exponential function.
  • after the multi-classifiers sharing the classification parameters classify the first sample feature vector and the second sample feature vector respectively, the softmax function can be used to normalize the score of the first sample feature vector for each category label and the score of the second sample feature vector for each category label, so as to obtain the normalized values of these scores.
  • Step S2240b train the basic video model and the basic text model with the convergence condition.
  • training the basic video model and the basic text model under the convergence condition in this step S2240b may further include: obtaining a first classification loss of the multi-classifier for the sample label categories according to the first classification result; obtaining a second classification loss of the multi-classifier for the sample label categories according to the second classification result; and training the basic video model and the basic text model under the convergence condition according to the first classification loss and the second classification loss.
  • according to the training sample set provided by this embodiment, model training is performed with the convergence condition that the video content features of the video samples extracted by the basic video model and the text content features of the sample category labels extracted by the basic text model both yield the classification results corresponding to the sample category labels, which gives the model training high accuracy.
  • through the trained video model, the video content features in the target video that accurately reflect the category of the target video can be accurately extracted, and through the trained text model, the text content features in the target category label that accurately reflect the category of the target category label can be accurately extracted.
  • the video classification method may include:
  • Step S3010 acquiring a first sample set and a second sample set.
  • Step S3020 the model parameters of the basic video model are fixed through the first sample set, and the basic text model is trained under the convergence condition to obtain the pre-trained basic text model.
  • the convergence conditions include: the video content features of the video samples extracted by the basic video model and the text content features of the sample class labels extracted by the basic text model both have classification results corresponding to the sample class labels.
  • Step S3030 fixing the model parameters of the basic video model through the second sample set, continuing to train the basic text model following the basic text model trained in the previous stage, to obtain the basic text model trained in the first stage.
  • Step S3040 train the basic video model under the convergence condition through the second sample set and continue to train the basic text model following the basic text model trained in the first stage to obtain a preset video model and a preset text model.
  • Step S3050 acquiring the target video and the target category label.
  • Step S3060 extracting the video content feature of the target video by using a preset video model to obtain a video feature vector corresponding to the target video.
  • Step S3070 extracting the text content features of the target category label by using a preset text model, to obtain a text feature vector corresponding to the target category label.
  • Step S3080 obtain the correlation score between the target video and the target category label according to the distance between the video feature vector and the text feature vector.
  • Step S3090 when the correlation score is greater than or equal to a preset score threshold, determine the category label of the target video as the target category label.
  • on the one hand, different sample sets are used to train the video model and the text model in stages, which can not only reduce the training period of the models but also improve their convergence speed.
  • on the other hand, the video model can extract video content features that accurately reflect the category of the target video, and the text model can extract text content features that accurately reflect the category of the target category label; the correlation score between the video feature vector composed of the video content features and the text feature vector composed of the text content features is calculated directly, which can improve the classification accuracy of the target video.
  • a video classification apparatus 4000 is provided; as shown in FIG. 4, the video classification apparatus 4000 may include a first obtaining module 4100, a video module 4200, a text module 4300, a second obtaining module 4400, and a determining module 4500.
  • the first obtaining module 4100 is configured to obtain the target video and the target category label.
  • the video module 4200 is configured to extract the video content feature of the target video by using a preset video model to obtain a video feature vector corresponding to the target video.
  • the text module 4300 is configured to extract the text content feature of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label.
  • the second obtaining module 4400 is configured to obtain a correlation score between the target video and the target category label according to the video feature vector and the text feature vector.
  • the determining module 4500 is configured to determine the category label of the target video as the target category label when the correlation score is greater than or equal to a preset score threshold.
  • the video classification apparatus 4000 can be implemented in various ways.
  • the video classification apparatus 4000 may be implemented by configuring a processor with instructions.
  • the video classification apparatus 4000 may be implemented by storing the instructions in ROM and reading the instructions from the ROM into the programmable device when the device is started.
  • the video classification apparatus 4000 may be built into a dedicated device (eg, an ASIC).
  • the video classification apparatus 4000 may be divided into mutually independent units, or may be implemented by combining them together.
  • the video classification apparatus 4000 may be implemented by one of the above various implementation manners, or may be implemented by a combination of two or more of the above various implementation manners.
  • the video classification apparatus 4000 may have various implementation forms.
  • the video classification apparatus 4000 may be a functional module running in any software product or application program that provides video services, a peripheral embedded component, plug-in, or patch of such software products or application programs, or such a software product or application program itself.
  • Embodiments of the present disclosure provide an electronic device 5000 .
  • the electronic device 5000 includes a processor 5100 and a memory 5200, the memory 5200 stores executable instructions, and the processor 5100 executes the video classification method provided in any of the foregoing embodiments under the control of the instructions.
  • the electronic device 5000 may be a server.
  • the server provides a service point for processing, databases, and communication facilities.
  • a server can be a monolithic server or a distributed server across multiple computers or computer data centers.
  • Servers may be of various types, such as, but not limited to, web servers, news servers, mail servers, messaging servers, advertising servers, file servers, application servers, interactive servers, database servers, or proxy servers.
  • each server may include hardware, software, or embedded logical components or a combination of two or more such components configured to perform the appropriate functions supported or implemented by the server.
  • servers such as blade servers, cloud servers, and the like.
  • the electronic device 5000 may also be a terminal device, such as a smart phone, a laptop computer, a desktop computer, a tablet computer, and the like.
  • An embodiment of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements the video classification method provided by any of the foregoing embodiments.
  • the present disclosure may be an apparatus, method and/or computer program product.
  • the computer program product may include a non-transitory computer-readable storage medium having computer-readable program instructions loaded thereon configured to cause a processor to implement various aspects of the present disclosure.
  • the above-mentioned non-transitory computer-readable storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of instructions, which contains one or more executable instructions configured to implement the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions, or can be implemented in a combination of dedicated hardware and computer instructions. It is known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.
  • the accuracy of the extracted video content features and text content features can be improved, so that the extracted video content features and text content features can accurately reflect the classification of the target video and the target category label.
  • it directly calculates the correlation score between the video feature vector composed of the video content features of the target video and the text feature vector composed of the text content features of the target category label, so that the accuracy of the target video classification can be improved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video classification method, apparatus, device, and medium are provided. The method includes: acquiring a target video and a target category label (S1100); extracting video content features of the target video through a preset video model to obtain a video feature vector corresponding to the target video (S1200); extracting text content features of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label (S1300); obtaining a correlation score between the target video and the target category label according to the video feature vector and the text feature vector (S1400); and, when the correlation score is greater than or equal to a preset score threshold, determining the category label of the target video to be the target category label (S1500). That is, the correlation score is calculated directly between a video feature vector that reflects the category of the target video and a text feature vector that reflects the category of the target category label, which can improve the accuracy of target video classification.

Description

Video classification method, apparatus, device and medium
The present disclosure claims priority to Chinese patent application No. 202110267539.3, entitled "Video classification method, apparatus, device and medium" and filed with the Chinese Patent Office on March 12, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the technical field of artificial intelligence, and more particularly, to a video classification method, apparatus, device, and medium.
Background
In the short video field, users often categorize the short videos they publish according to the video content, so that the categorized short videos can be searched for and recommended based on users' specific interests.
Usually, when publishing a short video, a user may add a category label to the published short video so as to classify it. However, when labeling videos, users sometimes apply category labels arbitrarily, which leads to incorrect video classification and causes some videos unrelated to a category label to appear on that label's aggregation page.
Summary
Embodiments of the present disclosure provide a video classification method, apparatus, device, and medium, which can improve the accuracy of video classification.
According to a first aspect of the present disclosure, a video classification method is provided, the method comprising:
acquiring a target video and a target category label;
extracting video content features of the target video through a preset video model to obtain a video feature vector corresponding to the target video;
extracting text content features of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label;
obtaining a correlation score between the target video and the target category label according to the video feature vector and the text feature vector;
when the correlation score is greater than or equal to a preset score threshold, determining the category label of the target video to be the target category label.
According to a second aspect of the present disclosure, a video classification apparatus is provided, the apparatus comprising:
a first obtaining module configured to acquire a target video and a target category label;
a video module configured to extract video content features of the target video through a preset video model to obtain a video feature vector corresponding to the target video;
a text module configured to extract text content features of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label;
a second obtaining module configured to obtain a correlation score between the target video and the target category label according to the video feature vector and the text feature vector;
a determining module configured to determine the category label of the target video to be the target category label when the correlation score is greater than or equal to a preset score threshold.
According to a third aspect of the present disclosure, an electronic device is provided, comprising a memory and a processor, the memory being configured to store executable instructions, and the processor being configured to execute, under the control of the instructions, the video classification method according to the first aspect above.
According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the video classification method according to the first aspect of the present disclosure.
According to the video classification method, apparatus, device, and medium of the embodiments of the present disclosure, the video content features of the target video are extracted based on a pre-trained video model, and the text content features of the target category label are extracted based on a pre-trained text model, which can improve the accuracy of the extracted video content features and text content features, so that both accurately reflect the categories of the target video and the target category label. Meanwhile, the correlation score is calculated directly between the video feature vector composed of the video content features of the target video and the text feature vector composed of the text content features of the target category label, which can improve the accuracy of target video classification.
Other features and advantages of the present disclosure will become clear from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the present disclosure and, together with the description, are configured to explain the principles of the present disclosure.
FIG. 1 is a schematic flowchart of a video classification method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a video classification method according to another embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a video classification method according to another embodiment of the present disclosure;
FIG. 4 is a schematic block diagram of a video classification apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a hardware configuration of an electronic device according to an embodiment of the present disclosure.
Detailed description of the embodiments
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be regarded as part of the specification.
In all the examples shown and discussed herein, any specific value should be interpreted as merely exemplary rather than limiting. Therefore, other examples of the exemplary embodiments may have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
<Method embodiments>
In this embodiment, a video classification method is provided. The method is implemented by an electronic device. The electronic device may be a server or a terminal device.
As shown in FIG. 1, the video classification method of this embodiment of the present disclosure may include the following steps S1100 to S1500.
Step S1100: acquiring a target video and a target category label.
The target video is any video uploaded by a user through a video platform; for example, the target video may be any short video uploaded by the user through a short video platform.
The target category label is the category label of the video set in which the target video is located. For example, when uploading a short video through the short video platform, the user can add a category label to the short video so as to classify the short video into the video set of that category label, thereby achieving the purpose of classifying the short video.
In one example, only one category label may be added to a video when it is uploaded.
In another example, multiple category labels may be added to a video when it is uploaded.
Exemplarily, when a user uploads a short video B through a short video platform A, the user may apply a category label C to the short video B. It can be understood that the category label C applied by the user to the short video B may not be the true category label of the short video B; that is, the content of the short video B may not actually be related to the applied category label C, resulting in inaccurate classification of the short video B. At the same time, it will also cause some content unrelated to the category label C to appear in the video set of the category label C, degrading the video quality of the video set of the category label C.
After the target video and the target category label are acquired, the method proceeds to:
Step S1200: extracting video content features of the target video through a preset video model to obtain a video feature vector corresponding to the target video.
The preset video model is configured to extract video content features in the target video that can accurately reflect the category to which the target video belongs, thereby obtaining the video feature vector corresponding to the target video. The preset video model may be a video model reflecting only the video cover, a video model reflecting video content other than the video cover, or, of course, a video model reflecting the entire video content.
The above video model reflects the relationship between the target video and the video content features; its input is the target video, and its output is the video content features extracted from the target video that can reflect the category to which the target video belongs. The video model may be a neural network model, such as, but not limited to, a BP (Back Propagation) neural network model, a convolutional neural network model, etc.; this embodiment does not specifically limit the video model.
The video feature vector X corresponding to the target video is composed of the video content features x_j extracted by the video model, where j is a natural number from 1 to p and p represents the total number of extracted video content features; the value of p can be set according to the actual application scenario and actual needs, for example p = 128. In this case, the video feature vector X is composed of the 128-dimensional video content features extracted by the video model and can be expressed as X = (x1, x2, ..., x128); this video feature vector X can accurately reflect the category to which the target video belongs.
Continuing the above example, the short video B can be used as the input of the video model, so that the video model extracts from the short video B the 128-dimensional video content features x1, x2, ..., x128 that accurately reflect the category to which the short video B belongs, yielding the video feature vector X = (x1, x2, ..., x128) corresponding to the short video B.
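For illustration only, the following is a minimal sketch of such a preset video model in Python (PyTorch). The disclosure does not specify a backbone, so the `VideoEncoder` class, the tiny convolutional backbone, and the input shapes are all assumptions rather than the patented model itself:

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Illustrative video model: maps a clip of frames to a 128-dim feature vector X."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        # A tiny convolutional backbone stands in for whatever backbone the
        # "preset video model" actually uses (the disclosure leaves this open).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, feature_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W); per-frame features are averaged over time.
        b, t, c, h, w = frames.shape
        per_frame = self.backbone(frames.view(b * t, c, h, w)).view(b, t, -1)
        return self.head(per_frame.mean(dim=1))  # (batch, 128), i.e. X

# Usage: a random 8-frame clip yields a 128-dimensional video feature vector.
x = VideoEncoder()(torch.randn(1, 8, 3, 112, 112))
print(x.shape)  # torch.Size([1, 128])
```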
After the video content features of the target video are extracted through the preset video model and the video feature vector corresponding to the target video is obtained, the method proceeds to:
Step S1300: extracting text content features of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label.
The preset text model is configured to extract text content features in the target category label that can accurately reflect the category to which the target category label belongs, thereby obtaining the text feature vector corresponding to the target category label.
The above text model reflects the relationship between the target category label and the text content features; its input is the target category label, and its output is the text content features extracted from the target category label. The text model may be a neural network model, such as, but not limited to, a BP (Back Propagation) neural network model, a convolutional neural network model, a Word2Vec model, etc.; this embodiment does not specifically limit the text model.
The text feature vector Y corresponding to the target category label is composed of the text content features y_i extracted by the text model, where i is a natural number from 1 to q and q represents the total number of extracted text content features; the value of q can be set according to the actual application scenario and actual needs. In order to be able to compute the similarity between the video feature vector and the text feature vector, q is usually the same as p, so q is also 128 here. In this case, the text feature vector Y is composed of the 128-dimensional text content features extracted by the text model and can be expressed as Y = (y1, y2, ..., y128); this text feature vector Y can accurately reflect the category to which the target category label belongs.
Continuing the above example, the category label C that the user applied to the short video B can be used as the input of the text model, so that the text model extracts from the category label C the 128-dimensional text content features y1, y2, ..., y128 that accurately reflect the category to which the category label C belongs, yielding the text feature vector Y = (y1, y2, ..., y128) corresponding to the category label C.
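As a companion to the sketch above, here is a similarly hedged sketch of a text model mapping a category label to the 128-dimensional text feature vector Y. The mean-pooled embedding approach, the vocabulary size, and the tokenization are assumptions; the disclosure only requires that some text model (BP network, CNN, Word2Vec, etc.) produce the vector:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Illustrative text model: maps category-label token ids to a 128-dim vector Y."""
    def __init__(self, vocab_size: int = 50000, feature_dim: int = 128):
        super().__init__()
        # Mean-pooled token embeddings stand in for the unspecified text model.
        self.embedding = nn.EmbeddingBag(vocab_size, 256, mode="mean")
        self.head = nn.Linear(256, feature_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, label_length) integer ids of the label's tokens.
        return self.head(self.embedding(token_ids))  # (batch, 128), i.e. Y

y = TextEncoder()(torch.randint(0, 50000, (1, 4)))
print(y.shape)  # torch.Size([1, 128])
```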
It can be understood that the above step S1200 and this step S1300 may be performed in any order. For example, step S1200 may be performed first to extract the video content features of the target video through the preset video model and obtain the video feature vector corresponding to the target video, and then step S1300 may be performed to extract the text content features of the target category label through the preset text model and obtain the text feature vector corresponding to the target category label. As another example, step S1300 may be performed first and then step S1200; or step S1300 and step S1200 may be performed simultaneously.
After the text content features of the target category label are extracted through the preset text model and the text feature vector corresponding to the target category label is obtained, the method proceeds to:
Step S1400: obtaining a correlation score between the target video and the target category label according to the video feature vector and the text feature vector.
In this embodiment, after the video feature vector composed of video content features that accurately reflect the category of the target video and the text feature vector composed of text content features that accurately reflect the category of the target category label are obtained, the correlation score between the video feature vector and the text feature vector can be calculated, so as to judge, through the correlation score, whether the category label of the target video is the target category label.
In this embodiment, obtaining the correlation score between the target video and the target category label according to the video feature vector and the text feature vector in step S1400 may further include: obtaining the correlation score between the target video and the target category label according to the distance between the video feature vector and the text feature vector.
In this embodiment, any distance calculation algorithm may be used to calculate the distance between the video feature vector and the text feature vector. The distance calculation algorithm may be a cosine similarity algorithm; of course, other distance calculation algorithms configured to calculate the distance between vectors may also be used, such as the log-likelihood similarity algorithm, the Manhattan distance algorithm, etc.
In one example, the distance may be used directly as the correlation score.
In another example, mapping data describing the mapping relationship between distances and correlation scores may be stored in advance, so that after the distance is obtained, the correlation score is obtained according to the distance and the mapping data.
Continuing the above example, the cosine similarity algorithm can be used to calculate the distance between the video feature vector X = (x1, x2, ..., x128) corresponding to the short video B and the text feature vector Y = (y1, y2, ..., y128) corresponding to the category label C, and this distance is used as the correlation score between the video feature vector X = (x1, x2, ..., x128) and the text feature vector Y = (y1, y2, ..., y128).
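As an illustration of this design choice, the cosine-similarity variant of step S1400 can be computed directly from the two 128-dimensional vectors; this sketch assumes the distance itself is used as the correlation score, which is only one of the options described above:

```python
import torch
import torch.nn.functional as F

def correlation_score(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between video vector X and text vector Y, used directly as the score."""
    return F.cosine_similarity(x, y, dim=-1)

# Usage with the encoders sketched above (random vectors stand in here):
score = correlation_score(torch.randn(1, 128), torch.randn(1, 128))
```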
After the correlation score between the target video and the target category label is obtained according to the video feature vector and the text feature vector, the method proceeds to:
Step S1500: when the correlation score is greater than or equal to a preset score threshold, determining the category label of the target video to be the target category label.
In this embodiment, after the correlation score between the video feature vector corresponding to the target video and the target category label is obtained, the correlation score can be compared with the preset score threshold, so as to judge, according to the comparison result, whether the category label of the target video is the target category label.
The preset score threshold may be a value set according to the actual application scenario and actual needs; for example, the preset score threshold may be 0.25.
In this embodiment, when the correlation score is greater than or equal to the preset score threshold, it can be determined that the category label of the target video is the target category label, and the target video can be kept as a video in the video set of the target category label. When the correlation score is less than the score threshold, the target video needs to be filtered out from the video set of the target category label, so as to improve the video quality of the video set of the target category label.
Continuing the above example, if the correlation score between the obtained video feature vector X = (x1, x2, ..., x128) and text feature vector Y = (y1, y2, ..., y128) is, for example, 0.3, which is greater than the score threshold of 0.25, the category label of the short video B is determined to be the category label C.
If, as another example, the correlation score between the obtained video feature vector X = (x1, x2, ..., x128) and text feature vector Y = (y1, y2, ..., y128) is 0.1, which is less than the score threshold of 0.25, the short video B is filtered out from the video set of the category label C. Exemplarily, if the short videos included in the initial video set of the category label C are short video B, short video D, and short video E, the short videos included in the filtered video set are short video D and short video E.
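A small sketch of the threshold-and-filter logic of step S1500, using the 0.25 threshold and the B/D/E example above; the dictionary of per-video scores is an illustrative assumption:

```python
SCORE_THRESHOLD = 0.25  # preset score threshold from the example

def filter_video_set(videos_with_scores: dict[str, float]) -> list[str]:
    """Keep only videos whose correlation score with the set's label reaches the threshold."""
    return [vid for vid, score in videos_with_scores.items() if score >= SCORE_THRESHOLD]

# Short video B scores 0.1 and is filtered out of category label C's video set.
print(filter_video_set({"B": 0.1, "D": 0.3, "E": 0.4}))  # ['D', 'E']
```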
According to the method of this embodiment of the present disclosure, the video content features of the target video are extracted based on a pre-trained video model, and the text content features of the target category label are extracted based on a pre-trained text model, which can improve the accuracy of the extracted video content features and text content features, so that both accurately reflect the categories of the target video and the target category label. Meanwhile, the correlation score is calculated directly between the video feature vector composed of the video content features of the target video and the text feature vector composed of the text content features of the target category label, which can improve the accuracy of target video classification.
In one embodiment, before step S1200 of extracting the video content features of the target video through the preset video model and step S1300 of extracting the text content features of the target category label through the preset text model are performed, as shown in FIG. 2, the video classification method of the present disclosure further includes the following steps S2100 to S2200:
Step S2100: acquiring a training sample set.
Each training sample in the training sample set includes a video sample and a sample category label of the video sample.
The more training samples there are, the more accurate the training result usually is; however, after the number of training samples reaches a certain amount, the gain in accuracy becomes increasingly slow until it levels off. Here, the required number of training samples can be determined by balancing the accuracy of the training result against the data processing cost.
Step S2200: through the training sample set, synchronously training a basic video model and a basic text model under a set convergence condition, and obtaining the trained basic video model as the preset video model and the trained basic text model as the preset text model.
The convergence condition includes: the video content features of the video samples extracted by the basic video model and the text content features of the sample category labels extracted by the basic text model both yield the classification results corresponding to the sample category labels.
In one example, synchronously training the basic video model and the basic text model under the set convergence condition through the training sample set in step S2200, to obtain the trained basic video model as the preset video model and the trained basic text model as the preset text model, may further include steps S2210a to S2220a:
Step S2210a: through the training sample set, fixing the model parameters of the basic video model and training the basic text model under the convergence condition, to obtain the basic text model after first-stage training.
In this example, since the video model has a very large number of model parameters, training without stages would make the training period especially long and the model converge especially slowly. Therefore, this example adopts a staged training approach, for example first fixing the model parameters of the video model to train the text model and then training the video model, which can reduce the training period of the model and improve its convergence speed.
In this example, the training sample set includes a first sample set and a second sample set, wherein the number of samples in the first sample set is greater than the number of samples in the second sample set.
For the first sample set, for example, among all the category labels set for short videos on the short video platform, the first set number of category labels with the largest numbers of short videos may first be selected as sample category labels, and for each sample category label, a second set number of short videos may be randomly selected as training videos. The first set number may be a value set according to the actual application scenario and actual needs, for example 30,000. The second set number may also be a value set according to the actual application scenario and actual needs, for example 500. Exemplarily, when the first set number is 30,000 and the second set number is 500, the first sample set includes 15 million first training samples.
For the second sample set, for example, the above first set number of category labels may first be obtained as sample category labels, and for each sample category label, a third set number of short videos that users clicked to play may be collected as training videos. The third set number may also be a value set according to the actual application scenario and actual needs, for example 100. Exemplarily, when the first set number is 30,000 and the third set number is 100, the second sample set includes 3 million second training samples.
In this example, fixing the model parameters of the basic video model through the training sample set and training the basic text model under the convergence condition in step S2210a, to obtain the basic text model after first-stage training, may further include: first, through the first sample set, fixing the model parameters of the basic video model and training the basic text model under the convergence condition to obtain a pre-trained basic text model; then, through the second sample set, fixing the model parameters of the basic video model and continuing to train the pre-trained basic text model, to obtain the basic text model after first-stage training.
According to this example, when the model parameters of the basic video model are fixed and the basic text model is trained under the convergence condition, the text model is first trained on a large number of randomly collected training samples and then further trained on training samples collected from users' actual click behavior. This is equivalent to first training the text model with a large number of training samples to adjust its model parameters and then fine-tuning those parameters with a small number of real training samples, which not only reduces the training period of the text model but also improves the accuracy of text model training.
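A minimal sketch of this staged schedule, assuming the PyTorch encoders sketched earlier; the loader format, the loss function signature, and the optimizer choice are assumptions, and freezing is expressed with `requires_grad_(False)`:

```python
import torch

def train_stage(video_model, text_model, classifier, loader, loss_fn,
                train_video: bool, lr: float = 1e-3, epochs: int = 1):
    """Run one training stage; the video model stays frozen unless train_video is True."""
    video_model.requires_grad_(train_video)
    params = list(text_model.parameters()) + list(classifier.parameters())
    if train_video:
        params += list(video_model.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        for frames, token_ids, labels in loader:  # assumed (video, label tokens, class id) batches
            loss = loss_fn(video_model(frames), text_model(token_ids), classifier, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Assumed schedule mirroring steps S2210a-S2220a:
# train_stage(..., loader=first_sample_loader, train_video=False)   # pre-training
# train_stage(..., loader=second_sample_loader, train_video=False)  # first-stage training
# train_stage(..., loader=second_sample_loader, train_video=True)   # joint final stage
```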
Step S2220a: through the training sample set, training the basic video model under the convergence condition while continuing to train the basic text model obtained after the first-stage training, to obtain the preset video model and the preset text model.
In this step S2220a, the above second sample set may be used to train the basic video model under the convergence condition while continuing to train the basic text model obtained after the first-stage training, to obtain the preset video model and the preset text model.
According to the above steps S2210a to S2220a, a staged training approach is used to train the models, which can reduce the training period of the models and improve their convergence speed.
In one example, each training step of synchronously training the basic video model and the basic text model under the set convergence condition in step S2200 may further include the following steps S2210b to S2240b:
Step S2210b: extracting the video content features of the video sample through the basic video model corresponding to the current step, to obtain a first sample feature vector corresponding to the video sample.
In this step S2210b, the basic video model corresponding to the current step may first extract 2048-dimensional video content features of the video sample and then reduce the 2048-dimensional video content features to 128-dimensional video content features, to obtain the first sample feature vector corresponding to the video sample, i.e., the first sample feature vector X = (x1, x2, ..., x128) corresponding to the video sample.
Step S2220b: extracting the text content features of the sample category label through the basic text model corresponding to the current step, to obtain a second sample feature vector corresponding to the sample category label.
In this step S2220b, the basic text model corresponding to the current step may first extract 2048-dimensional text content features of the sample category label and then reduce the 2048-dimensional text content features to 128-dimensional text content features, to obtain the second sample feature vector corresponding to the sample category label, i.e., the second sample feature vector Y = (y1, y2, ..., y128) corresponding to the sample category label.
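One plausible reading of this 2048-to-128 reduction, sketched as a learned linear projection head on each branch (the reduction method itself is not specified by the disclosure):

```python
import torch
import torch.nn as nn

# Hypothetical projection heads: each backbone emits 2048-dim content features,
# which are reduced to the 128-dim sample feature vectors X and Y.
video_projection = nn.Linear(2048, 128)
text_projection = nn.Linear(2048, 128)

x = video_projection(torch.randn(1, 2048))  # first sample feature vector, shape (1, 128)
y = text_projection(torch.randn(1, 2048))   # second sample feature vector, shape (1, 128)
```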
Step S2230b: classifying the first sample feature vector and the second sample feature vector respectively through multi-classifiers sharing classification parameters, to obtain a first classification result corresponding to the first sample feature vector and a second classification result corresponding to the second sample feature vector.
In this step S2230b, during model training the shared classification parameters of the multi-classifiers sharing classification parameters are also adjusted at each training step, so that the classification of the first sample feature vector and the second sample feature vector by the multi-classifiers becomes increasingly accurate.
In this example, classifying the first sample feature vector and the second sample feature vector respectively through the multi-classifiers sharing classification parameters in step S2230b, to obtain the first classification result corresponding to the first sample feature vector and the second classification result corresponding to the second sample feature vector, may further include the following steps S2231b to S2232b:
Step S2231b: classifying the first sample feature vector and the second sample feature vector respectively through the multi-classifiers sharing classification parameters, to obtain a first initial classification result corresponding to the first sample feature vector and a second initial classification result corresponding to the second sample feature vector.
The types of category labels corresponding to the multi-classifier are the same as the types of sample category labels contained in the training sample set. For example, when the training sample set includes 30,000 sample category labels, the multi-classifier also corresponds to 30,000 types of category labels.
In this step S2231b, the multi-classifiers sharing classification parameters can classify the first sample feature vector and the second sample feature vector respectively, to obtain the score of the first sample feature vector for each category label and the score of the second sample feature vector for each category label.
Step S2232b: normalizing the first initial classification result and the second initial classification result through a preset normalized exponential function, to obtain the first classification result and the second classification result.
The preset normalized exponential function may be a softmax function, through which the score of the first sample feature vector for each category label and the score of the second sample feature vector for each category label can both be mapped into the (0, 1) interval.
It can be understood that category labels in real-world scenarios often include many labels with the same semantics. After the multi-classifiers sharing classification parameters classify the first sample feature vector and the second sample feature vector respectively to obtain the score of each vector for every category label, the scores of the first sample feature vector for different category labels may be close to one another, and likewise the scores of the second sample feature vector for different category labels may be close to one another; here, the normalized exponential function can further enlarge the differences between the scores.
In this step S2232b, after the scores of the first sample feature vector and the second sample feature vector for each category label are obtained, the softmax function can be used to normalize the score of the first sample feature vector for each category label and the score of the second sample feature vector for each category label, to obtain the normalized values of these scores.
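A sketch of steps S2231b and S2232b under the assumption that a single linear layer serves as the multi-classifier whose classification parameters are shared between the two branches, with 30,000 label types as in the example:

```python
import torch
import torch.nn as nn

NUM_LABELS = 30000  # label types, matching the sample category labels in the training set
shared_classifier = nn.Linear(128, NUM_LABELS)  # one weight matrix used by both branches

x = torch.randn(4, 128)  # first sample feature vectors (video branch)
y = torch.randn(4, 128)  # second sample feature vectors (text branch)

# S2231b: the same classifier scores both branches against every category label.
first_initial, second_initial = shared_classifier(x), shared_classifier(y)

# S2232b: softmax maps the scores into (0, 1) and enlarges the differences between them.
first_result = torch.softmax(first_initial, dim=-1)
second_result = torch.softmax(second_initial, dim=-1)
```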
Step S2240b: training the basic video model and the basic text model under the convergence condition.
Training the basic video model and the basic text model under the convergence condition in this step S2240b may further include: obtaining a first classification loss of the multi-classifier for the sample label categories according to the first classification result; obtaining a second classification loss of the multi-classifier for the sample label categories according to the second classification result; and training the basic video model and the basic text model under the convergence condition according to the first classification loss and the second classification loss.
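The two classification losses can be read as cross-entropy terms against the sample category label, combined for one joint update; the equal weighting in this sketch is an assumption the text does not fix:

```python
import torch
import torch.nn.functional as F

def step_loss(first_initial: torch.Tensor, second_initial: torch.Tensor,
              sample_labels: torch.Tensor) -> torch.Tensor:
    """Combine the video-branch and text-branch classification losses for one training step."""
    first_loss = F.cross_entropy(first_initial, sample_labels)    # first classification loss
    second_loss = F.cross_entropy(second_initial, sample_labels)  # second classification loss
    return first_loss + second_loss  # assumed equal weighting

loss = step_loss(torch.randn(4, 30000), torch.randn(4, 30000), torch.randint(0, 30000, (4,)))
```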
In this embodiment, model training is performed according to the training sample set with the convergence condition that the video content features of the video samples extracted by the basic video model and the text content features of the sample category labels extracted by the basic text model both yield the classification results corresponding to the sample category labels, which gives high accuracy. Through the trained video model, the video content features in the target video that accurately reflect the category of the target video can be accurately extracted, and through the trained text model, the text content features in the target category label that accurately reflect the category of the target category label can be accurately extracted.
<Example>
Next, a schematic flow of an example video classification method is shown. In this example, as shown in FIG. 3, the video classification method may include:
Step S3010: acquiring a first sample set and a second sample set.
Step S3020: through the first sample set, fixing the model parameters of the basic video model and training the basic text model under the convergence condition, to obtain a pre-trained basic text model.
The convergence condition includes: the video content features of the video samples extracted by the basic video model and the text content features of the sample category labels extracted by the basic text model both yield the classification results corresponding to the sample category labels.
Step S3030: through the second sample set, fixing the model parameters of the basic video model and continuing to train the pre-trained basic text model, to obtain the basic text model after first-stage training.
Step S3040: through the second sample set, training the basic video model under the convergence condition while continuing to train the basic text model obtained after the first-stage training, to obtain the preset video model and the preset text model.
Step S3050: acquiring a target video and a target category label.
Step S3060: extracting video content features of the target video through the preset video model to obtain a video feature vector corresponding to the target video.
Step S3070: extracting text content features of the target category label through the preset text model to obtain a text feature vector corresponding to the target category label.
Step S3080: obtaining a correlation score between the target video and the target category label according to the distance between the video feature vector and the text feature vector.
Step S3090: when the correlation score is greater than or equal to a preset score threshold, determining the category label of the target video to be the target category label.
According to this example, on the one hand, different sample sets are used to train the video model and the text model in stages, which not only reduces the training period of the models but also improves their convergence speed. On the other hand, since the video model can extract video content features that accurately reflect the category of the target video and the text model can extract text content features that accurately reflect the category of the target category label, and the correlation score between the video feature vector composed of the video content features and the text feature vector composed of the text content features is calculated directly, the classification accuracy of the target video can be improved.
<Apparatus embodiment>
In this embodiment, a video classification apparatus 4000 is provided. As shown in FIG. 4, the video classification apparatus 4000 may include a first obtaining module 4100, a video module 4200, a text module 4300, a second obtaining module 4400, and a determining module 4500.
The first obtaining module 4100 is configured to acquire a target video and a target category label.
The video module 4200 is configured to extract video content features of the target video through a preset video model to obtain a video feature vector corresponding to the target video.
The text module 4300 is configured to extract text content features of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label.
The second obtaining module 4400 is configured to obtain a correlation score between the target video and the target category label according to the video feature vector and the text feature vector.
The determining module 4500 is configured to determine the category label of the target video to be the target category label when the correlation score is greater than or equal to a preset score threshold.
Those skilled in the art should understand that the video classification apparatus 4000 can be implemented in various ways. For example, the video classification apparatus 4000 can be implemented by configuring a processor with instructions; for example, the instructions can be stored in a ROM, and when the device is started, the instructions are read from the ROM into a programmable device to implement the video classification apparatus 4000. For example, the video classification apparatus 4000 can be built into a dedicated device (e.g., an ASIC). The video classification apparatus 4000 can be divided into mutually independent units, or these units can be combined together for implementation. The video classification apparatus 4000 can be implemented by one of the above various implementations, or by a combination of two or more of the above various implementations.
In this embodiment, the video classification apparatus 4000 may have various implementation forms. For example, the video classification apparatus 4000 may be a functional module running in any software product or application program that provides video services, or a peripheral embedded component, plug-in, patch, etc. of such software products or application programs, or such a software product or application program itself.
<Device embodiment>
An embodiment of the present disclosure provides an electronic device 5000.
As shown in FIG. 5, the electronic device 5000 includes a processor 5100 and a memory 5200; the memory 5200 stores executable instructions, and the processor 5100 executes, under the control of the instructions, the video classification method provided by any of the foregoing embodiments.
In one example, the electronic device 5000 may be a server. The server provides a service point for processing, databases, and communication facilities. The server may be a monolithic server or a distributed server spanning multiple computers or computer data centers. The server may be of various types, such as, but not limited to, a web server, a news server, a mail server, a message server, an advertising server, a file server, an application server, an interactive server, a database server, or a proxy server. In some embodiments, each server may include hardware, software, or embedded logic components, or a combination of two or more such components, configured to perform the appropriate functions supported or implemented by the server; for example, the server may be a blade server, a cloud server, etc.
In another example, the electronic device 5000 may also be a terminal device, such as a smartphone, a laptop computer, a desktop computer, a tablet computer, etc.
<Medium embodiment>
An embodiment of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the video classification method provided by any of the foregoing embodiments.
The present disclosure may be an apparatus, a method, and/or a computer program product. The computer program product may include a non-transitory computer-readable storage medium carrying computer-readable program instructions configured to cause a processor to implement various aspects of the present disclosure.
The above non-transitory computer-readable storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of instructions, which contains one or more executable instructions configured to implement the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings; for example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented with a dedicated hardware-based system that performs the specified functions or actions, or with a combination of dedicated hardware and computer instructions. It is known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.
Industrial applicability
Through the embodiments of the present disclosure, the accuracy of the extracted video content features and text content features can be improved, so that both the extracted video content features and text content features accurately reflect the categories of the target video and the target category label. Meanwhile, the correlation score is calculated directly between the video feature vector composed of the video content features of the target video and the text feature vector composed of the text content features of the target category label, which can improve the accuracy of target video classification.

Claims (13)

  1. A video classification method, the method comprising:
    acquiring a target video and a target category label;
    extracting video content features of the target video through a preset video model to obtain a video feature vector corresponding to the target video;
    extracting text content features of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label;
    obtaining a correlation score between the target video and the target category label according to the video feature vector and the text feature vector;
    when the correlation score is greater than or equal to a preset score threshold, determining the category label of the target video to be the target category label.
  2. The method according to claim 1, wherein the target category label is the category label of the video set in which the target video is located, and the method further comprises:
    when the correlation score is less than the score threshold, filtering the target video out of the video set.
  3. The method according to claim 1, wherein before the extracting of the video content features of the target video through the preset video model and the extracting of the text content features of the target category label through the preset text model, the method further comprises:
    acquiring a training sample set, wherein each training sample in the training sample set comprises a video sample and a sample category label of the video sample;
    through the training sample set, synchronously training a basic video model and a basic text model under a set convergence condition, and obtaining the trained basic video model as the preset video model and the trained basic text model as the preset text model;
    wherein the convergence condition comprises: the video content features of the video sample extracted by the basic video model and the text content features of the sample category label extracted by the basic text model both yield classification results corresponding to the sample category label.
  4. The method according to claim 3, wherein the synchronously training of the basic video model and the basic text model under the set convergence condition through the training sample set, to obtain the trained basic video model as the preset video model and the trained basic text model as the preset text model, comprises:
    through the training sample set, fixing model parameters of the basic video model and training the basic text model under the convergence condition, to obtain a basic text model after first-stage training;
    through the training sample set, training the basic video model under the convergence condition while continuing to train the basic text model after the first-stage training, to obtain the preset video model and the preset text model.
  5. The method according to claim 4, wherein the training sample set comprises a first sample set and a second sample set, and the fixing of the model parameters of the basic video model through the training sample set and the training of the basic text model under the convergence condition, to obtain the basic text model after the first-stage training, comprises:
    through the first sample set, fixing the model parameters of the basic video model and training the basic text model under the convergence condition, to obtain a pre-trained basic text model;
    through the second sample set, fixing the model parameters of the basic video model and continuing to train the pre-trained basic text model, to obtain the basic text model after the first-stage training.
  6. The method according to claim 3, wherein each training step of synchronously training the basic video model and the basic text model under the set convergence condition through the training sample set comprises:
    extracting the video content features of the video sample through the basic video model corresponding to the current step, to obtain a first sample feature vector corresponding to the video sample;
    extracting the text content features of the sample category label through the basic text model corresponding to the current step, to obtain a second sample feature vector corresponding to the sample category label;
    classifying the first sample feature vector and the second sample feature vector respectively through multi-classifiers sharing classification parameters, to obtain a first classification result corresponding to the first sample feature vector and a second classification result corresponding to the second sample feature vector;
    training the basic video model and the basic text model under the convergence condition.
  7. The method according to claim 6, wherein the types of category labels corresponding to the multi-classifiers are the same as the types of sample category labels contained in the training sample set.
  8. The method according to claim 6, wherein the classifying of the first sample feature vector and the second sample feature vector respectively through the multi-classifiers sharing classification parameters, to obtain the first classification result corresponding to the first sample feature vector and the second classification result corresponding to the second sample feature vector, comprises:
    classifying the first sample feature vector and the second sample feature vector respectively through the multi-classifiers sharing classification parameters, to obtain a first initial classification result corresponding to the first sample feature vector and a second initial classification result corresponding to the second sample feature vector;
    normalizing the first initial classification result and the second initial classification result through a preset normalized exponential function, to obtain the first classification result and the second classification result.
  9. The method according to claim 8, wherein the training of the basic video model and the basic text model under the convergence condition comprises:
    obtaining a first classification loss of the multi-classifiers for the sample label category according to the first classification result;
    obtaining a second classification loss of the multi-classifiers for the sample label category according to the second classification result;
    training the basic video model and the basic text model under the convergence condition according to the first classification loss and the second classification loss.
  10. The method according to claim 1, wherein the obtaining of the correlation score between the target video and the target category label according to the video feature vector and the text feature vector comprises:
    obtaining the correlation score between the target video and the target category label according to a distance between the video feature vector and the text feature vector.
  11. A video classification apparatus, the apparatus comprising:
    a first obtaining module configured to acquire a target video and a target category label;
    a video module configured to extract video content features of the target video through a preset video model to obtain a video feature vector corresponding to the target video;
    a text module configured to extract text content features of the target category label through a preset text model to obtain a text feature vector corresponding to the target category label;
    a second obtaining module configured to obtain a correlation score between the target video and the target category label according to the video feature vector and the text feature vector;
    a determining module configured to determine the category label of the target video to be the target category label when the correlation score is greater than or equal to a preset score threshold.
  12. An electronic device, comprising a memory and a processor, wherein the memory is configured to store executable instructions, and the processor is configured to execute, under the control of the instructions, the video classification method according to any one of claims 1 to 10.
  13. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the video classification method according to any one of claims 1 to 10.
PCT/CN2022/080208 2021-03-12 2022-03-10 Video classification method, apparatus, device and medium WO2022188844A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110267539.3 2021-03-12
CN202110267539.3A CN112784111B (zh) 2021-03-12 2021-03-12 Video classification method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
WO2022188844A1 true WO2022188844A1 (zh) 2022-09-15

Family

ID=75762567

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/080208 WO2022188844A1 (zh) 2021-03-12 2022-03-10 Video classification method, apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN112784111B (zh)
WO (1) WO2022188844A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784111B (zh) * 2021-03-12 2024-07-02 有半岛(北京)信息科技有限公司 视频分类方法、装置、设备及介质
CN113449700B (zh) * 2021-08-30 2021-11-23 腾讯科技(深圳)有限公司 视频分类模型的训练、视频分类方法、装置、设备及介质
CN117112836A (zh) * 2023-09-05 2023-11-24 广西华利康科技有限公司 一种面向视频内容的大数据智能分类方法
CN118035849A (zh) * 2024-04-10 2024-05-14 浙江孚临科技有限公司 一种对货物数据进行商品分类方法、系统和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160099127A (ko) 2015-02-11 2016-08-22 중앙대학교 산학협력단 Method and apparatus for selecting a feature set used to classify multiple labels
CN111831854A (zh) 2020-06-03 2020-10-27 北京百度网讯科技有限公司 Video tag generation method and apparatus, electronic device, and storage medium
CN111967302A (zh) 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Video tag generation method, apparatus and electronic device
CN112100438A (zh) 2020-09-21 2020-12-18 腾讯科技(深圳)有限公司 Label extraction method, device, and computer-readable storage medium
CN112784111A (zh) 2021-03-12 2021-05-11 有半岛(北京)信息科技有限公司 Video classification method, apparatus, device and medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325148A (zh) 2018-08-03 2019-02-12 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN109344908B (zh) 2018-10-30 2020-04-28 北京字节跳动网络技术有限公司 Method and apparatus for generating a model
CN109257622A (zh) 2018-11-01 2019-01-22 广州市百果园信息技术有限公司 Audio/video processing method, apparatus, device and medium
CN109359636B (zh) 2018-12-14 2023-04-28 腾讯科技(深圳)有限公司 Video classification method, apparatus and server
CN110070067B (zh) 2019-04-29 2021-11-12 北京金山云网络技术有限公司 Video classification method, and training method and apparatus of its model, and electronic device
CN110674349B (zh) 2019-09-27 2023-03-14 北京字节跳动网络技术有限公司 Video POI recognition method, apparatus and electronic device
CN111444878B (zh) 2020-04-09 2023-07-18 Oppo广东移动通信有限公司 Video classification method, apparatus and computer-readable storage medium
CN111612093A (zh) 2020-05-29 2020-09-01 Oppo广东移动通信有限公司 Video classification method, video classification apparatus, electronic device and storage medium
CN111611436B (zh) 2020-06-24 2023-07-11 深圳市雅阅科技有限公司 Label data processing method, apparatus and computer-readable storage medium
CN112188295B (zh) 2020-09-29 2022-07-05 有半岛(北京)信息科技有限公司 Video recommendation method and apparatus
CN112287170B (zh) 2020-10-13 2022-05-17 泉州津大智能研究院有限公司 Short video classification method and apparatus based on multimodal joint learning
CN112149632A (zh) 2020-10-21 2020-12-29 腾讯科技(深圳)有限公司 Video recognition method, apparatus and electronic device


Also Published As

Publication number Publication date
CN112784111A (zh) 2021-05-11
CN112784111B (zh) 2024-07-02

Similar Documents

Publication Publication Date Title
US20190325259A1 (en) Feature extraction and machine learning for automated metadata analysis
WO2022188844A1 (zh) 视频分类方法、装置、设备及介质
US8930288B2 (en) Learning tags for video annotation using latent subtags
US11748401B2 (en) Generating congruous metadata for multimedia
US11074434B2 (en) Detection of near-duplicate images in profiles for detection of fake-profile accounts
US10599774B1 (en) Evaluating content items based upon semantic similarity of text
CN107463605B (zh) 低质新闻资源的识别方法及装置、计算机设备及可读介质
Murray et al. A deep architecture for unified aesthetic prediction
US10740802B2 (en) Systems and methods for gaining knowledge about aspects of social life of a person using visual content associated with that person
US11763164B2 (en) Image-to-image search method, computer-readable storage medium and server
US11195099B2 (en) Detecting content items in violation of an online system policy using semantic vectors
KR102053635B1 (ko) 불신지수 벡터 기반의 가짜뉴스 탐지 장치 및 방법, 이를 기록한 기록매체
Chen et al. Velda: Relating an image tweet’s text and images
US11436446B2 (en) Image analysis enhanced related item decision
CN108959323B (zh) 视频分类方法和装置
CN109992781B (zh) 文本特征的处理方法、装置和存储介质
US11363064B2 (en) Identifying spam using near-duplicate detection for text and images
CN106464682A (zh) 使用登录到在线服务的状态以用于内容项推荐
US20200210760A1 (en) System and method for cascading image clustering using distribution over auto-generated labels
Hou et al. Deep Hierarchical Representation from Classifying Logo‐405
de Boer et al. Improving video event retrieval by user feedback
US20200089764A1 (en) Media data classification, user interaction and processors for application integration
US20170255619A1 (en) System and methods for determining access permissions on personalized clusters of multimedia content elements
CN117493645B (zh) 一种基于大数据的电子档案推荐系统
US20240089519A1 (en) Live distribution system, estimation method, and information storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22766366

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22766366

Country of ref document: EP

Kind code of ref document: A1