CN113672783B - Feature processing method, model training method and media resource processing method - Google Patents

Info

Publication number
CN113672783B
Authority
CN
China
Prior art keywords
media resource
samples
feature
target
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110917334.5A
Other languages
Chinese (zh)
Other versions
CN113672783A (en
Inventor
曹效伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110917334.5A priority Critical patent/CN113672783B/en
Publication of CN113672783A publication Critical patent/CN113672783A/en
Application granted granted Critical
Publication of CN113672783B publication Critical patent/CN113672783B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a feature processing method, a model training method and a media resource processing method, and belongs to the field of computer technology. The feature processing method comprises: acquiring feature values of at least one media resource feature from a plurality of media resource samples, wherein the plurality of media resource samples comprise positive samples belonging to a target category and negative samples not belonging to the target category; determining, based on the acquired feature values, a relevant parameter corresponding to the at least one media resource feature, wherein the relevant parameter represents the degree to which media resources belonging to the target category are distinguished from media resources not belonging to the target category in the dimension of the at least one media resource feature; and determining a target media resource feature whose relevant parameter is greater than a threshold, and using the target media resource feature as an input feature of a media resource processing model. This improves the correlation between the input features and the target category predicted by the media resource processing model, so that the accuracy of model prediction can be improved.

Description

Feature processing method, model training method and media resource processing method
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a feature processing method, a model training method, and a media resource processing method.
Background
In recent years, machine learning and deep learning techniques have been widely used in various scenarios. Based on these techniques, models with discriminative ability can be trained to solve prediction problems. For example, in a media resource recommendation scenario, a prediction model is trained to predict whether a user will click on a media resource to be recommended.
As prediction problems become more and more complex, the features involved multiply; for example, in a media resource recommendation scenario, the features involved include multiple features related to a media resource. How to select suitable features from a large number of features as model input, so as to promote effective learning of the model and improve the accuracy of model prediction, is a problem to be solved urgently.
Disclosure of Invention
The embodiments of the disclosure provide a feature processing method, a model training method and a media resource processing method, which select suitable features as model input, promote effective model learning and improve model prediction accuracy. The technical solutions of the disclosure are as follows:
In one aspect, a feature processing method is provided, the feature processing method including:
acquiring feature values of at least one media resource feature from a plurality of media resource samples, wherein each media resource sample comprises feature values of a plurality of media resource features corresponding to one media resource, and the plurality of media resource samples comprise positive samples belonging to a target category and negative samples not belonging to the target category;
determining, based on the acquired feature values, a relevant parameter corresponding to the at least one media resource feature, wherein the relevant parameter represents the degree to which media resources belonging to the target category are distinguished from media resources not belonging to the target category in the dimension of the at least one media resource feature;
and determining, from the plurality of media resource features, a target media resource feature whose relevant parameter is greater than a threshold, and using the target media resource feature as an input feature of a media resource processing model, wherein the media resource processing model is used for predicting whether a media resource belongs to the target category.
According to the technical solution provided by the embodiments of the disclosure, the feature values of the media resource samples are analyzed separately in the dimension of each media resource feature to obtain relevant parameters. A relevant parameter represents the degree of distinction between positive and negative samples and thereby quantifies the degree of correlation between a media resource feature and the training target. Selecting the media resource features whose relevant parameters are greater than the threshold improves the accuracy of feature selection. Using those features as the input features of the media resource processing model increases the correlation between the input features and the training target, strengthens the model's ability to distinguish the positive class from the negative class, promotes effective learning of the model, and improves the accuracy of model prediction.
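The selection step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the sample layout (a dict from feature name to feature value), the function names, and the choice of `pairwise_score` as the relevant parameter are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, float]  # hypothetical layout: feature name -> feature value


def pairwise_score(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive value is larger."""
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))


def select_features(
    samples: List[Sample],
    labels: List[int],  # 1 = positive sample, 0 = negative sample
    threshold: float,
    score: Callable[[Sequence[float], Sequence[float]], float] = pairwise_score,
) -> List[str]:
    """Keep the features whose relevant parameter is greater than the threshold."""
    selected = []
    for name in samples[0].keys():
        pos = [s[name] for s, y in zip(samples, labels) if y == 1]
        neg = [s[name] for s, y in zip(samples, labels) if y == 0]
        if score(pos, neg) > threshold:
            selected.append(name)
    return selected


samples = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.8, "b": 0.7},
    {"a": 0.2, "b": 0.3},
    {"a": 0.1, "b": 0.9},
]
labels = [1, 1, 0, 0]
print(select_features(samples, labels, threshold=0.7))
```

In this toy data, feature `a` separates the two classes and feature `b` does not, so only `a` passes a threshold of 0.7.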
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
acquiring a plurality of first thresholds corresponding to the first media resource feature from between the maximum value and the minimum value of a plurality of first feature values corresponding to the first media resource feature;
determining the false positive class rate (false positive rate) and the true class rate (true positive rate) corresponding to each first threshold;
determining, based on the false positive class rate and the true class rate corresponding to each first threshold, the relevant parameter corresponding to the first media resource feature;
wherein the false positive class rate corresponding to any first threshold is the ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, the first target negative samples being negative samples whose first feature value is greater than the first threshold;
and the true class rate corresponding to any first threshold is the ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, the first target positive samples being positive samples whose first feature value is greater than the first threshold.
In the above technical solution, the relevant parameter represents the probability that, regardless of which first threshold is taken, for a randomly given positive sample and negative sample, the first feature value of the positive sample is greater than the first feature value of the negative sample. The relevant parameter therefore quantifies the degree of distinction between positive and negative samples more accurately in a statistical sense, which improves the accuracy with which the relevant parameter represents the degree of distinction.
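The two rates defined above can be computed per threshold as in the sketch below. The function names are hypothetical, and spacing the thresholds evenly between the minimum and maximum is an assumption; the text only requires that they lie between those two values.

```python
from typing import List, Sequence, Tuple


def candidate_thresholds(values: Sequence[float], count: int) -> List[float]:
    """Evenly spaced thresholds strictly between min and max (an assumed scheme)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (count + 1)
    return [lo + step * (i + 1) for i in range(count)]


def rates_at_threshold(
    values: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (false positive class rate, true class rate) for one threshold."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    # Negative samples above the threshold, over all negative samples.
    fpr = sum(v > threshold for v in neg) / len(neg)
    # Positive samples above the threshold, over all positive samples.
    tpr = sum(v > threshold for v in pos) / len(pos)
    return fpr, tpr


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
for t in candidate_thresholds(values, 3):
    print(t, rates_at_threshold(values, labels, t))
```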
In some embodiments, the determining the relevant parameter corresponding to the first media resource feature based on the false positive class rate and the true class rate corresponding to each first threshold includes:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
In the above technical solution, the statistical meaning of the area under the receiver operating characteristic curve matches the meaning that the relevant parameter is intended to carry. Therefore, by determining the relevant parameter in the same way as the area under the receiver operating characteristic curve, the relevant parameter represents the probability that, regardless of which first threshold is taken, for a randomly given positive sample and negative sample, the first feature value of the positive sample is greater than that of the negative sample. The relevant parameter thus reflects the degree of distinction between positive and negative samples more accurately, which improves the accuracy with which the relevant parameter represents the degree of distinction.
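A minimal sketch of the curve construction and area computation follows. Two choices here are assumptions rather than details from the text: the thresholds are the distinct feature values plus negative infinity, so that the curve reaches both (0, 0) and (1, 1), and the area is integrated with the trapezoidal rule.

```python
import math
from typing import List, Sequence, Tuple


def roc_points(values: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """One (false positive class rate, true class rate) point per threshold."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    thresholds = [-math.inf] + sorted(set(values))
    points = []
    for t in thresholds:
        fpr = sum(v > t for v in neg) / len(neg)
        tpr = sum(v > t for v in pos) / len(pos)
        points.append((fpr, tpr))
    # Order the points left to right so they trace the curve.
    return sorted(points)


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(area_under_curve(roc_points(values, labels)))
```

For this data three of the four positive/negative pairs are ordered correctly, and the area comes out to the same fraction, 0.75, which is the equivalence the paragraph above relies on.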
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for the plurality of threshold groups obtained, determining the false positive class rate and the true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
According to the above technical solution, a relevant parameter corresponding to a combination of a plurality of media resource features is determined to reflect the degree of correlation between that combination and the training target, which quantifies the correlation between a feature combination and the training target. The relevant parameters can thus represent the correlation between media resource features and the training target more comprehensively and accurately. Taking the correlation between feature combinations and the training target into account further improves the accuracy of feature selection, and using the selected media resource features as the input features of the media resource processing model further improves the accuracy of model prediction.
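The threshold-group rates for a feature combination can be sketched as below. Each sample is represented as a tuple of second feature values, and a threshold group takes one threshold per feature; forming the groups with `itertools.product` and the function names are assumptions. How the per-group rates are combined into a single relevant parameter is not detailed in this passage, so the sketch stops at the rates.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def group_rates(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold_group: Sequence[float],
) -> Tuple[float, float]:
    """Return (false positive class rate, true class rate) for one threshold group."""

    def exceeds(row: Sequence[float]) -> bool:
        # A sample counts only if every feature value exceeds its own threshold.
        return all(v > t for v, t in zip(row, threshold_group))

    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    fpr = sum(exceeds(r) for r in neg) / len(neg)
    tpr = sum(exceeds(r) for r in pos) / len(pos)
    return fpr, tpr


def all_group_rates(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    thresholds_per_feature: List[List[float]],
) -> Dict[Tuple[float, ...], Tuple[float, float]]:
    """Evaluate every group formed by taking one threshold from each feature."""
    return {
        group: group_rates(rows, labels, group)
        for group in product(*thresholds_per_feature)
    }


rows = [(0.9, 0.8), (0.7, 0.2), (0.3, 0.9), (0.1, 0.1)]
labels = [1, 1, 0, 0]
print(all_group_rates(rows, labels, [[0.2, 0.5], [0.1, 0.5]]))
```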
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
sorting the plurality of media resource samples in ascending order of first feature value to obtain an inter-group rank of each of the plurality of media resource samples, wherein the inter-group rank of a media resource sample is the position of the first feature value of the media resource sample in the sorted plurality of first feature values;
obtaining a first number of a plurality of sample pairs, a sample pair consisting of a positive sample and a negative sample of the plurality of media asset samples;
determining a second number of target sample pairs in which the inter-group rank of positive samples is greater than the inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
and determining the ratio of the second quantity to the first quantity as a relevant parameter corresponding to the first media resource characteristic.
In the above technical solution, the relevant parameter represents the probability that, for a randomly given positive sample and negative sample, the first feature value of the positive sample is greater than that of the negative sample. The relevant parameter therefore quantifies the degree of distinction between positive and negative samples more accurately, which improves the accuracy with which the relevant parameter represents the degree of distinction.
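The rank-based ratio can be sketched directly from the steps above. The function names are hypothetical, and the sketch assumes all first feature values are distinct; the text does not say how ties are ranked.

```python
from typing import List, Sequence


def inter_group_ranks(values: Sequence[float]) -> List[int]:
    """1-based position of each sample when all samples are sorted ascending."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def pairwise_parameter(values: Sequence[float], labels: Sequence[int]) -> float:
    """Ratio of correctly ordered (positive, negative) pairs to all such pairs."""
    ranks = inter_group_ranks(values)
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    neg = [r for r, y in zip(ranks, labels) if y == 0]
    first_number = len(pos) * len(neg)  # every positive paired with every negative
    second_number = sum(1 for p in pos for n in neg if p > n)
    return second_number / first_number


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(inter_group_ranks(values), pairwise_parameter(values, labels))
```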
In some embodiments, the determining a second number of target sample pairs based on an inter-group rank of the plurality of media resource samples comprises:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of first feature value to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the position of the first feature value of the positive sample among the sorted first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
According to the technical scheme, the second number of the target sample pairs is determined based on the inter-group rank sum and the intra-group rank sum, so that the process of traversing a plurality of sample pairs for comparison is omitted, and the determination efficiency of the second number is improved.
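The shortcut works because a positive sample with inter-group rank r and intra-group rank k has exactly r - k negative samples ranked below it, so summing r - k over the positive samples counts the target sample pairs. A minimal sketch under the assumption of distinct feature values (the function name is hypothetical):

```python
from typing import Sequence


def second_number_by_rank_sums(values: Sequence[float], labels: Sequence[int]) -> int:
    """Count pairs where the positive outranks the negative, using one sort."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    inter_rank_sum = 0  # sum of positives' ranks among all samples
    intra_rank_sum = 0  # sum of positives' ranks among positives only
    positives_seen = 0
    for inter_rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            positives_seen += 1
            inter_rank_sum += inter_rank
            intra_rank_sum += positives_seen
    return inter_rank_sum - intra_rank_sum


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
second = second_number_by_rank_sums(values, labels)
first = labels.count(1) * labels.count(0)
print(second, second / first)
```

The sort costs O(n log n), compared with one comparison per positive/negative pair when every pair is traversed.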
In one aspect, a model training method is provided, the model training method comprising:
acquiring a training sample and marking information corresponding to the training sample, wherein the training sample comprises characteristic values of a plurality of media resource characteristics corresponding to media resources, and the marking information is used for indicating whether the training sample belongs to a target class;
acquiring a feature value of a target media resource feature from the training sample, wherein the target media resource feature is a media resource feature whose relevant parameter is greater than a threshold, and the relevant parameter represents the degree to which media resources belonging to the target category are distinguished from media resources not belonging to the target category in the dimension of at least one media resource feature that includes the target media resource feature;
and taking the characteristic value of the target media resource characteristic as the input of a media resource processing model, taking the labeling information as the output target of the media resource processing model, and training the media resource processing model.
According to the technical solution provided by the embodiments of the disclosure, target media resource features that highlight the distinction between media resources belonging to the target category and media resources not belonging to it are acquired from the training sample and used as the input for training the media resource processing model. This reduces the interference of irrelevant features with the training process, promotes effective learning of the model, strengthens the model's ability to distinguish the positive class from the negative class, and improves the accuracy of model prediction.
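The training flow can be illustrated with a small stand-in model. The disclosure does not specify the model type in this passage; the logistic regression trained by stochastic gradient descent below, together with the function names and hyperparameters, is purely an assumption chosen to keep the sketch self-contained.

```python
import math
from typing import Dict, List, Sequence, Tuple


def project(sample: Dict[str, float], target_features: Sequence[str]) -> List[float]:
    """Keep only the feature values of the target media resource features."""
    return [sample[name] for name in target_features]


def train_logistic(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 500,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit weights and bias with plain SGD on the log loss (stand-in model)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Predicted probability that the sample belongs to the target category."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


training_samples = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.8, "b": 0.7},
    {"a": 0.2, "b": 0.3},
    {"a": 0.1, "b": 0.9},
]
labeling = [1, 1, 0, 0]        # labeling information: 1 = belongs to the target category
target_features = ["a"]       # assumed to have passed the relevant-parameter threshold

rows = [project(s, target_features) for s in training_samples]
weights, bias = train_logistic(rows, labeling)
print(predict(weights, bias, [0.9]), predict(weights, bias, [0.1]))
```

Feature `b` never reaches the model, which is the point of the method: features that do not separate the classes are excluded before training.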
In some embodiments, before the obtaining, from the training sample, the feature value of the target media resource feature, the model training method further includes:
and determining the target media resource characteristics corresponding to the target category based on the stored correspondence between the target category and the target media resource characteristics.
In some embodiments, before the determining the target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature, the model training method further includes:
acquiring characteristic values of at least one media resource characteristic from a plurality of media resource samples, wherein each media resource sample comprises characteristic values of a plurality of media resource characteristics corresponding to one media resource, and the plurality of media resource samples comprise positive samples belonging to a target category and negative samples not belonging to the target category;
based on the obtained characteristic value, determining a relevant parameter corresponding to the at least one media resource characteristic;
determining the target media resource characteristics with related parameters greater than a threshold value from the plurality of media resource characteristics;
and storing the correspondence between the target category and the target media resource feature.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
acquiring a plurality of first thresholds corresponding to the first media resource characteristics from a position between a maximum value and a minimum value in a plurality of first characteristic values corresponding to the first media resource characteristics;
determining false positive class rate and true class rate corresponding to each first threshold;
based on the false positive class rate and the true class rate corresponding to each first threshold, determining relevant parameters corresponding to the first media resource characteristics;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
In some embodiments, the determining the relevant parameter corresponding to the first media resource feature based on the false positive class rate and the true class rate corresponding to each first threshold includes:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
sorting the plurality of media resource samples in ascending order of first feature value to obtain an inter-group rank of each of the plurality of media resource samples, wherein the inter-group rank of a media resource sample is the position of the first feature value of the media resource sample in the sorted plurality of first feature values;
obtaining a first number of a plurality of sample pairs, a sample pair consisting of a positive sample and a negative sample of the plurality of media asset samples;
determining a second number of target sample pairs in which the inter-group rank of positive samples is greater than the inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
and determining the ratio of the second quantity to the first quantity as a relevant parameter corresponding to the first media resource characteristic.
In some embodiments, the determining a second number of target sample pairs based on an inter-group rank of the plurality of media resource samples comprises:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of first feature value to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the position of the first feature value of the positive sample among the sorted first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
In one aspect, a media resource processing method is provided, where the media resource processing method includes:
acquiring characteristic values of target media resource characteristics from the characteristic values of a plurality of media resource characteristics corresponding to the media resources;
inputting the characteristic value of the target media resource characteristic into a media resource processing model to obtain a prediction result output by the media resource processing model, wherein the media resource processing model is obtained by training based on a training sample and marking information corresponding to the training sample, and the prediction result is used for indicating whether the media resource belongs to a target class;
wherein the target media resource feature is a media resource feature whose relevant parameter is greater than a threshold, and the relevant parameter represents the degree to which media resources belonging to the target category are distinguished from media resources not belonging to the target category in the dimension of at least one media resource feature that includes the target media resource feature.
According to the technical scheme provided by the embodiment of the disclosure, the target media resource characteristics which can highlight the distinction between the media resources belonging to the target category and the media resources not belonging to the target category are obtained from the characteristic values of the plurality of media resource characteristics corresponding to the media resources, and the media resource processing model predicts based on the target media resource characteristics, so that the interference of irrelevant characteristics to the prediction process is reduced, and the accuracy of model prediction is improved.
In some embodiments, before the obtaining the feature value of the target media resource feature from the feature values of the plurality of media resource features corresponding to the media resource, the media resource processing method further includes:
acquiring the target category corresponding to the media resource processing model;
and determining the target media resource characteristics corresponding to the target category based on the stored correspondence between the target category and the target media resource characteristics.
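The lookup and prediction steps can be sketched as follows. The in-memory dictionary, the function names, the callable model interface, and the 0.5 cutoff are all assumptions for illustration; the disclosure does not prescribe how the correspondence is stored or how the model output is turned into a decision.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory store: target category -> names of target features.
FEATURES_BY_CATEGORY: Dict[str, List[str]] = {}


def store_correspondence(category: str, target_features: List[str]) -> None:
    """Record which features were selected for a target category."""
    FEATURES_BY_CATEGORY[category] = list(target_features)


def predict_membership(
    resource: Dict[str, float],
    category: str,
    model: Callable[[List[float]], float],
    cutoff: float = 0.5,  # assumed decision rule
) -> bool:
    """Look up the target features, feed their values to the model, and decide."""
    target_features = FEATURES_BY_CATEGORY[category]
    inputs = [resource[name] for name in target_features]
    return model(inputs) > cutoff


store_correspondence("clicked", ["a"])


def toy_model(x: List[float]) -> float:
    # Stand-in for a trained media resource processing model.
    return x[0]


print(predict_membership({"a": 0.9, "b": 0.0}, "clicked", toy_model))
```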
In some embodiments, before the determining the target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature, the media resource processing method further includes:
Acquiring characteristic values of at least one media resource characteristic from a plurality of media resource samples, wherein each media resource sample comprises characteristic values of a plurality of media resource characteristics corresponding to one media resource, and the plurality of media resource samples comprise positive samples belonging to a target category and negative samples not belonging to the target category;
based on the obtained characteristic value, determining a relevant parameter corresponding to the at least one media resource characteristic;
determining the target media resource characteristics with related parameters greater than a threshold value from the plurality of media resource characteristics;
and storing the corresponding relation between the target category and the target media resource characteristic.
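The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the feature and category names, the sample data, and the use of a dictionary as the store are all hypothetical. The score used here is the pair-ratio variant of the related parameter (the fraction of positive/negative sample pairs in which the positive sample has the larger feature value), chosen only because it is compact.

```python
def relevance(values, labels):
    """Fraction of (positive, negative) pairs where the positive value is larger."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))


def build_correspondence(samples, labels, category, threshold):
    """Score every feature, keep those above the threshold, key them by category."""
    feature_names = list(samples[0].keys())
    scores = {
        name: relevance([s[name] for s in samples], labels)
        for name in feature_names
    }
    targets = [name for name, score in scores.items() if score > threshold]
    return {category: targets}


# Hypothetical samples: two features per media resource, label 1 = positive sample.
samples = [
    {"feature_1": 0.1, "feature_2": 5.0},
    {"feature_1": 0.4, "feature_2": 1.0},
    {"feature_1": 0.35, "feature_2": 4.0},
    {"feature_1": 0.8, "feature_2": 2.0},
]
labels = [0, 0, 1, 1]

# The dictionary stands in for the stored correspondence (an assumption).
stored = build_correspondence(samples, labels, "category_a", threshold=0.7)
```

At prediction time, looking up `stored["category_a"]` would give the feature names whose values are passed to the model.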
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
acquiring a plurality of first thresholds corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
determining false positive class rate and true class rate corresponding to each first threshold;
based on the false positive class rate and the true class rate corresponding to each first threshold, determining relevant parameters corresponding to the first media resource characteristics;
The false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
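The two ratios defined above can be computed directly from their definitions. The following Python sketch assumes labels encoded as 1 (positive sample) and 0 (negative sample); the function name and the sample data are hypothetical.

```python
def rates_at_threshold(values, labels, threshold):
    """Return (false positive class rate, true class rate) for one threshold."""
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = sum(1 for y in labels if y == 0)
    # Target samples are those whose feature value is strictly greater than the threshold.
    target_pos = sum(1 for v, y in zip(values, labels) if y == 1 and v > threshold)
    target_neg = sum(1 for v, y in zip(values, labels) if y == 0 and v > threshold)
    return target_neg / neg_total, target_pos / pos_total


# Hypothetical first feature values for two negative and two positive samples.
values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

low = rates_at_threshold(values, labels, 0.3)   # one negative and both positives exceed 0.3
high = rates_at_threshold(values, labels, 0.5)  # only one positive exceeds 0.5
```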
In some embodiments, the determining the relevant parameter corresponding to the first media resource feature based on the false positive class rate and the true class rate corresponding to each first threshold includes:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
Determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
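A minimal Python sketch of this embodiment follows. The text only says that the area under the curve is determined; adding the end points (0, 0) and (1, 1) and integrating with the trapezoidal rule are assumptions made here, as are the function names and the data.

```python
def roc_points(values, labels, thresholds):
    """One (false positive class rate, true class rate) point per threshold."""
    pos_total = labels.count(1)
    neg_total = labels.count(0)
    points = []
    for t in thresholds:
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v > t)
        fp = sum(1 for v, y in zip(values, labels) if y == 0 and v > t)
        points.append((fp / neg_total, tp / pos_total))
    # Assumption: close the curve with the trivial end points.
    points.extend([(0.0, 0.0), (1.0, 1.0)])
    # Sort by abscissa, then ordinate, and drop duplicate points.
    return sorted(set(points))


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by abscissa."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
thresholds = [0.2, 0.375, 0.6]  # hypothetical values between the minimum and maximum

curve = roc_points(values, labels, thresholds)
auc = area_under_curve(curve)
```

A feature that separates the two classes perfectly would give an area of 1, while a feature unrelated to the class would give an area near 0.5.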
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
The true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
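For several features considered jointly, a sample counts as a target sample only when every one of its feature values exceeds the corresponding threshold in the group. The sketch below illustrates this in Python. Enumerating every combination of per-feature thresholds with `itertools.product` is an assumption; the text only requires that each group contain one threshold per feature. Function names and data are hypothetical.

```python
from itertools import product


def rates_for_threshold_groups(samples, labels, thresholds_per_feature):
    """Map each threshold group to (false positive class rate, true class rate)."""
    pos_total = labels.count(1)
    neg_total = labels.count(0)
    rates = {}
    # Each group takes exactly one threshold from every feature's candidate list.
    for group in product(*thresholds_per_feature):
        tp = fp = 0
        for sample, label in zip(samples, labels):
            # Every feature value must exceed its own threshold.
            if all(v > t for v, t in zip(sample, group)):
                if label == 1:
                    tp += 1
                else:
                    fp += 1
        rates[group] = (fp / neg_total, tp / pos_total)
    return rates


# Hypothetical samples with two second media resource features each.
samples = [(0.2, 1.0), (0.6, 3.0), (0.7, 4.0), (0.9, 2.0)]
labels = [0, 0, 1, 1]
thresholds_per_feature = [[0.5, 0.8], [1.5, 3.5]]

rates = rates_for_threshold_groups(samples, labels, thresholds_per_feature)
```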
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
sorting the plurality of media resource samples in ascending order of the first feature values to obtain an inter-group rank of each of the plurality of media resource samples, wherein the inter-group rank of a media resource sample is the position of the first feature value of that media resource sample among the plurality of first feature values;
obtaining a first number of a plurality of sample pairs, a sample pair consisting of a positive sample and a negative sample of the plurality of media asset samples;
determining a second number of target sample pairs in which the inter-group rank of positive samples is greater than the inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
And determining the ratio of the second quantity to the first quantity as a relevant parameter corresponding to the first media resource characteristic.
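The pair-counting embodiment can be sketched in Python as below. The sketch assumes that no two samples share the same feature value (the text does not say how ties are ranked); the function name and data are hypothetical.

```python
def pairwise_relevance(values, labels):
    """Ratio of target sample pairs to all (positive, negative) sample pairs."""
    # Inter-group rank: 1-based position of each sample in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = {index: position for position, index in enumerate(order, start=1)}

    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]

    first_number = len(positives) * len(negatives)  # all sample pairs
    second_number = sum(
        1 for p in positives for n in negatives if rank[p] > rank[n]
    )  # pairs in which the positive sample ranks higher
    return second_number / first_number


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
score = pairwise_relevance(values, labels)
```

Here three of the four pairs have the positive sample ranked above the negative sample, so the ratio is 0.75.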
In some embodiments, the determining a second number of target sample pairs based on an inter-group rank of the plurality of media resource samples comprises:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the position of the first feature value of that positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
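The rank-sum shortcut works because a positive sample with inter-group rank r and intra-group rank k has exactly r - k negative samples ranked below it; summing r - k over all positive samples therefore counts the target sample pairs without enumerating them. A minimal Python sketch, assuming distinct feature values and hypothetical names and data:

```python
def target_pair_count(values, labels):
    """Second number: inter-group rank sum minus intra-group rank sum of positives."""
    # Inter-group rank: 1-based position among all samples, ascending by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    inter_rank = {index: position for position, index in enumerate(order, start=1)}

    # Positives in ascending order of value; their intra-group ranks are 1..P.
    positives = [i for i in order if labels[i] == 1]
    inter_group_rank_sum = sum(inter_rank[i] for i in positives)
    intra_group_rank_sum = sum(range(1, len(positives) + 1))
    return inter_group_rank_sum - intra_group_rank_sum


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
second_number = target_pair_count(values, labels)
```

In this example the positive samples have inter-group ranks 2 and 4 and intra-group ranks 1 and 2, giving (2 + 4) - (1 + 2) = 3 target sample pairs. Sorting dominates the cost, so the count takes O(n log n) time instead of the O(P × N) of direct pair enumeration.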
In one aspect, there is provided a feature processing apparatus including:
a first feature value obtaining unit configured to obtain feature values of at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including a positive sample belonging to a target category and a negative sample not belonging to the target category;
A first parameter determining unit configured to perform determining, based on the obtained feature value, a related parameter corresponding to the at least one media resource feature, the related parameter being used to represent a degree of distinction between media resources belonging to the target category and media resources not belonging to the target category in a dimension of the at least one media resource feature;
a first feature determining unit configured to perform determining a target media resource feature, from the plurality of media resource features, for which a related parameter is larger than a threshold, and using the target media resource feature as an input feature of a media resource processing model for predicting whether a media resource belongs to the target category.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic; the first parameter determination unit includes:
a first threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a first ratio determining subunit configured to perform determining a false positive class ratio and a true class ratio corresponding to each first threshold;
a first parameter determining subunit configured to perform determining, based on the false positive class rate and the true class rate corresponding to each first threshold, a related parameter corresponding to the first media resource feature;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
In some embodiments, the first parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
Determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features; the first parameter determination unit is configured to perform:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
The true class rate corresponding to any threshold group is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic; the first parameter determination unit includes:
a first inter-group rank determining subunit configured to perform sorting of the plurality of media resource samples in ascending order of the first feature values to obtain an inter-group rank of each of the plurality of media resource samples, where the inter-group rank of a media resource sample is the position of the first feature value of that media resource sample among the plurality of first feature values;
a first number acquisition subunit configured to perform acquiring a first number of a plurality of pairs of samples, one pair of samples consisting of one positive sample and one negative sample of the plurality of media asset samples;
a second number determination subunit configured to perform determining a second number of target sample pairs in which an inter-group rank of positive samples is greater than an inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
A second parameter determination subunit configured to perform determining a ratio of the second number to the first number as a related parameter corresponding to the first media resource feature.
In some embodiments, the second number determination subunit is configured to perform:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the position of the first feature value of that positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
In some embodiments, the first feature determining unit is configured to perform determining a media resource feature whose corresponding related parameter is greater than a threshold as the target media resource feature; or,
the first feature determining unit is configured to perform ranking of the plurality of media resource features in descending order of the related parameters, and determining the top target number of media resource features as the target media resource features.
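The two selection alternatives differ only in the cut-off rule. A short Python sketch, in which the function names, feature names, and scores are hypothetical:

```python
def select_by_threshold(relevance, threshold):
    """Keep every feature whose related parameter is greater than the threshold."""
    return [name for name, score in relevance.items() if score > threshold]


def select_top_n(relevance, target_number):
    """Rank features by related parameter, descending, and keep the first target_number."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:target_number]


# Hypothetical related parameters already computed for three features.
relevance = {"feature_1": 0.91, "feature_2": 0.55, "feature_3": 0.78}

above_threshold = select_by_threshold(relevance, 0.7)
top_two = select_top_n(relevance, 2)
top_one = select_top_n(relevance, 1)
```

The threshold rule yields a variable number of features, whereas the top-N rule fixes the input size of the model regardless of how the scores are distributed.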
In one aspect, there is provided a model training apparatus comprising:
a sample acquisition unit configured to perform acquiring a training sample and marking information corresponding to the training sample, the training sample including feature values of a plurality of media resource features corresponding to a media resource, and the marking information indicating whether the training sample belongs to the target category;
a second feature value obtaining unit configured to perform obtaining a feature value of a target media resource feature from the training sample, the target media resource feature being a media resource feature whose related parameter is greater than a threshold, the related parameter representing the degree of distinction between media resources belonging to the target category and media resources not belonging to the target category in the dimension of at least one media resource feature including the target media resource feature;
and the model training unit is configured to perform training of the media resource processing model by taking the characteristic value of the target media resource characteristic as input of the media resource processing model and the marking information as output target of the media resource processing model.
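The training flow of these three units can be sketched as follows. The text does not fix the type of the media resource processing model, so the logistic regression trained by batch gradient descent below is purely an assumed stand-in; the function names, feature names, hyperparameters, and data are likewise hypothetical.

```python
import math


def predict_probability(weights, bias, x):
    """Probability that the input belongs to the target category."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(inputs, targets, learning_rate=1.0, epochs=5000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(inputs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(inputs, targets):
            error = predict_probability(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / len(inputs) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(inputs)
    return weights, bias


# Hypothetical training samples; only feature_1 was selected as a target feature.
training_samples = [
    {"feature_1": 0.1, "feature_2": 7.0},
    {"feature_1": 0.2, "feature_2": 1.0},
    {"feature_1": 0.8, "feature_2": 6.0},
    {"feature_1": 0.9, "feature_2": 2.0},
]
marking_information = [0, 0, 1, 1]
target_features = ["feature_1"]

# Only the target feature values are used as model input.
inputs = [[s[name] for name in target_features] for s in training_samples]
weights, bias = train_logistic(inputs, marking_information)
predictions = [predict_probability(weights, bias, x) > 0.5 for x in inputs]
```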
In some embodiments, the model training apparatus further comprises:
and a second feature determining unit configured to perform determination of the target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature.
In some embodiments, the model training apparatus further comprises:
a third feature value obtaining unit configured to obtain feature values of the at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including a positive sample belonging to a target category and a negative sample not belonging to the target category;
a second parameter determining unit configured to perform determining a related parameter corresponding to the at least one media resource feature based on the obtained feature value;
a third feature determination unit configured to perform determining the target media resource feature for which the relevant parameter is greater than a threshold value from among the plurality of media resource features;
and the first relation storage unit is configured to store the corresponding relation between the target category and the target media resource characteristic.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the second parameter determination unit includes:
a second threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a second ratio determining subunit configured to perform determining a false positive class rate and a true class rate corresponding to each of the first thresholds;
a third parameter determining subunit configured to perform determining, based on the false positive class rate and the true class rate corresponding to each first threshold, a related parameter corresponding to the first media resource feature;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
In some embodiments, the third parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features;
the second parameter determination unit is configured to perform:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
For a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the second parameter determination unit includes:
a second inter-group rank determining subunit configured to perform sorting of the plurality of media resource samples in ascending order of the first feature values to obtain an inter-group rank of each of the plurality of media resource samples, where the inter-group rank of a media resource sample is the position of the first feature value of that media resource sample among the plurality of first feature values;
A third number acquisition subunit configured to perform acquisition of a first number of a plurality of pairs of samples, one pair of samples consisting of one positive sample and one negative sample of the plurality of media asset samples;
a fourth number determination subunit configured to perform determining a second number of target sample pairs in which an inter-group rank of positive samples is greater than an inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
a fourth parameter determination subunit configured to perform determining a ratio of the second number to the first number as a related parameter corresponding to the first media resource feature.
In some embodiments, the fourth number determination subunit is configured to perform:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the position of the first feature value of that positive sample among the first feature values of the at least one positive sample;
Determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
In one aspect, there is provided a media resource processing device, comprising:
a fourth feature value acquisition unit configured to perform acquisition of a feature value of a target media resource feature from feature values of a plurality of media resource features corresponding to the media resource;
the media resource processing unit is configured to input the characteristic value of the target media resource characteristic into a media resource processing model to obtain a prediction result output by the media resource processing model, wherein the media resource processing model is obtained by training based on a training sample and marking information corresponding to the training sample, and the prediction result is used for indicating whether the media resource belongs to a target class;
the target media resource feature is a media resource feature whose related parameter is greater than a threshold, and the related parameter represents the degree of distinction between media resources belonging to the target category and media resources not belonging to the target category in the dimension of at least one media resource feature including the target media resource feature.
In some embodiments, the media asset processing device further comprises:
a category acquisition unit configured to perform acquisition of the target category corresponding to the media resource processing model;
and a fourth feature determining unit configured to perform determination of the target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature.
In some embodiments, the media asset processing device further comprises:
a fifth feature value obtaining unit configured to obtain feature values of the at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including a positive sample belonging to a target category and a negative sample not belonging to the target category;
a third parameter determining unit configured to perform determining a related parameter corresponding to the at least one media resource feature based on the obtained feature value;
a fifth feature determination unit configured to perform determining the target media resource feature for which the relevant parameter is greater than a threshold value from among the plurality of media resource features;
And the second relation storage unit is configured to store the corresponding relation between the target category and the target media resource characteristic.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the third parameter determination unit includes:
a third threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a third ratio determining subunit configured to perform determining a false positive class rate and a true class rate corresponding to each of the first thresholds;
a fifth parameter determining subunit configured to perform determining, based on the false positive class rate and the true class rate corresponding to each first threshold, a related parameter corresponding to the first media resource feature;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
The true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
In some embodiments, the fifth parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features;
the third parameter determination unit is configured to perform:
For each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
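The threshold-group computation described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `group_rates`, the tuple-per-sample data layout, and the choice to form threshold groups as every combination of one threshold per feature (via a Cartesian product) are all assumptions made for the example.

```python
from itertools import product


def group_rates(samples, labels, thresholds_per_feature):
    """Return {threshold_group: (false positive class rate, true class rate)}.

    samples: list of tuples, one second feature value per second feature.
    labels: 1 for a positive sample, 0 for a negative sample.
    thresholds_per_feature: one list of second thresholds per second feature.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    results = {}
    # Assumption: a threshold group takes one threshold from each feature,
    # and all combinations are enumerated.
    for group in product(*thresholds_per_feature):
        tp = fp = 0
        for feats, y in zip(samples, labels):
            # A sample counts as a target sample only if EVERY feature value
            # exceeds its corresponding threshold in the group.
            if all(v > t for v, t in zip(feats, group)):
                if y == 1:
                    tp += 1
                else:
                    fp += 1
        results[group] = (fp / n_neg, tp / n_pos)
    return results


samples = [(0.2, 0.9), (0.6, 0.3), (0.7, 0.8), (0.9, 0.6)]
labels = [0, 0, 1, 1]
rates = group_rates(samples, labels, [[0.1, 0.5], [0.2]])
```

In this toy data the group (0.5, 0.2) lets one of the two negative samples through while keeping both positive samples, whereas the looser group (0.1, 0.2) lets every sample through.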
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
the third parameter determination unit includes:
a third inter-group rank determining subunit configured to perform sorting the plurality of media resource samples in ascending order of the first feature values to obtain inter-group ranks of the plurality of media resource samples, where the inter-group rank of a media resource sample is the ordinal position of the first feature value of the media resource sample among the plurality of first feature values;
a fifth number acquisition subunit configured to perform acquiring a first number of a plurality of pairs of samples, one pair of samples consisting of one positive sample and one negative sample of the plurality of media asset samples;
a sixth number determination subunit configured to perform determining a second number of target sample pairs in which an inter-group rank of positive samples is greater than an inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
a sixth parameter determination subunit configured to perform determining a ratio of the second number to the first number as the relevant parameter corresponding to the first media resource feature.
In some embodiments, the sixth number determination subunit is configured to perform:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the ordinal position of the first feature value of the positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
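The rank-based computation above can be illustrated with a short sketch. The function names `count_target_pairs` and `rank_based_parameter` are hypothetical, ranks are assumed to start at 1, and the sketch assumes no two samples share the same first feature value (tie handling is not covered here). The idea is that a positive sample's inter-group rank counts every sample at or below it, while its intra-group rank counts only the positive samples at or below it, so the difference is the number of negative samples below it.

```python
def count_target_pairs(values, labels):
    """Second number: pairs whose positive sample outranks the negative one."""
    # Inter-group rank: 1-based position among ALL samples, ascending by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    inter_rank = {idx: rank for rank, idx in enumerate(order, start=1)}
    positives = [i for i, y in enumerate(labels) if y == 1]
    inter_sum = sum(inter_rank[i] for i in positives)
    # Intra-group rank: 1-based position among the positive samples only.
    pos_order = sorted(positives, key=lambda i: values[i])
    intra_sum = sum(rank for rank, _ in enumerate(pos_order, start=1))
    return inter_sum - intra_sum


def rank_based_parameter(values, labels):
    """Ratio of the second number to the first number (all pos/neg pairs)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    first_number = n_pos * n_neg
    return count_target_pairs(values, labels) / first_number


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
second_number = count_target_pairs(values, labels)
parameter = rank_based_parameter(values, labels)
```

Here the positive samples have inter-group ranks 2 and 4 and intra-group ranks 1 and 2, so the second number is 6 - 3 = 3 out of 4 sample pairs.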
In one aspect, there is provided a computer device comprising: one or more processors; a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the feature processing method, the model training method, or the media resource processing method described in the above embodiments.
In one aspect, a computer readable storage medium is provided, wherein instructions stored in the computer readable storage medium, when executed by a processor of a computer device, enable the computer device to perform the feature processing method, the model training method, or the media resource processing method described in the above embodiments.
In one aspect, a computer program product is provided, including a computer program, which when executed by a processor implements the feature processing method, the model training method, or the media resource processing method described in the above embodiments.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow chart illustrating a feature processing method according to an exemplary embodiment;
FIG. 2 is a flow chart illustrating a model training method according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a media asset processing method according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating a feature processing method according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a receiver operating characteristic curve in accordance with an exemplary embodiment;
FIG. 6 is a flowchart illustrating a feature processing method according to an exemplary embodiment;
FIG. 7 is a flowchart illustrating a feature processing method according to an exemplary embodiment;
FIG. 8 is a block diagram of a feature processing apparatus, shown in accordance with an exemplary embodiment;
FIG. 9 is a block diagram of a model training apparatus, according to an example embodiment;
FIG. 10 is a block diagram of a media asset processing device according to an exemplary embodiment;
FIG. 11 is a block diagram of a computer device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The user features referred to in the embodiments of the present disclosure are obtained and processed after the user and the parties are fully authorized.
The feature processing method provided by the embodiments of the present disclosure is executed by a computer device. In some embodiments, the computer device is configured as a server. The server is a single server, a plurality of servers, a cloud server, a cloud computing platform or a virtualization center. In some embodiments, the computer device is configured as a terminal. The terminal is a desktop computer, a notebook computer, a tablet computer, a smart phone or another terminal.
Fig. 1 is a flow chart illustrating a feature processing method according to an exemplary embodiment. The feature processing method is briefly described below with reference to fig. 1, and includes the steps of:
in step S101, feature values of at least one media asset feature are obtained from a plurality of media asset samples, each media asset sample including feature values of a plurality of media asset features corresponding to one media asset, the plurality of media asset samples including positive samples belonging to a target category and negative samples not belonging to the target category.
It should be noted that the media resource is text, image, audio or video. A media asset sample includes feature values for a plurality of media asset features corresponding to a media asset, each media asset feature reflecting a feature of the media asset from a different perspective. In some embodiments, the media resource is a media resource published on a resource sharing platform, and the plurality of media resource characteristics corresponding to the media resource includes at least one of a resource characteristic characterizing the media resource, a characteristic of a user performing an operation on the media resource, and a contextual characteristic. In some embodiments, the characteristics characterizing the media asset include at least one of an exposure, a click rate, and an asset characterization vector. In some embodiments, the user characteristics include at least one of age, gender, and resource browsing preferences. The contextual features are used to characterize the user behavior before and after browsing the media asset.
Another point to be noted is that at least one media asset feature refers to any one or more of a plurality of media asset features, and "plurality" refers to two or more.
In some embodiments, the at least one media asset feature is a first media asset feature, the first media asset feature is any one of a plurality of media asset features, and the computer device obtains a feature value of the first media asset feature from each media asset sample to obtain a plurality of first feature values. For example, the computer device obtains feature values of the first media resource feature x from 3 media resource samples, to obtain 3 first feature values: x1, x2 and x3.
In some embodiments, the at least one media asset feature is a plurality of second media asset features, the plurality of second media asset features being any of the plurality of media asset features. For each second media resource feature, the computer equipment obtains the feature value of the second media resource feature from each media resource sample, and obtains a plurality of second feature values corresponding to the second media resource feature. Wherein, a plurality of second characteristic values corresponding to a second media resource characteristic respectively belong to a plurality of media resource samples.
For example, the plurality of second media resource features include x and y, and for the second media resource feature x, the computer device obtains feature values of the second media resource feature x from 3 media resource samples respectively, to obtain 3 second feature values corresponding to the second media resource feature x: x1, x2 and x3; for the second media resource feature y, the computer equipment obtains feature values of the second media resource feature y from 3 media resource samples respectively to obtain 3 second feature values corresponding to the second media resource feature y: y1, y2 and y3.
Another point to be described is that a positive sample is a sample belonging to the target category, a negative sample is a sample not belonging to the target category, and the training target of the corresponding media resource processing model is to predict whether a media resource belongs to the target category. The target category may be called the positive class, not belonging to the target category may be called the negative class, and the prediction result of the media resource processing model is either the positive class or the negative class. The target category may be flexibly configured according to the problem to be predicted.
For example, in a media resource recommendation scenario, the target category indicates that the media resource is to be recommended to a user. If media resources on which the user would perform a target operation need to be recommended to the user, the target category indicates that the user would perform the target operation on the media resource. The target operation is a click operation, a collection operation, a forwarding operation, a download operation, or the like. For example, if media resources that the user would click need to be recommended, the target operation is a click operation, and the target category indicates that the user would perform the click operation on the media resource; if media resources that the user would collect need to be recommended, the target operation is a collection operation, and the target category indicates that the user would perform the collection operation on the media resource. For another example, in a media resource classification scenario, the target category is a content category to which the media resource belongs, for example, a news category, a photography category, a cute pet category, or a food category.
In addition, the number of the plurality of media asset samples may be flexibly configured, and the embodiments of the present disclosure are not limited thereto, for example, the number of the plurality of media asset samples is 10, 12, 15, or the like.
In step S102, based on the obtained feature values, a relevant parameter corresponding to at least one media resource feature is determined, where the relevant parameter is used to represent a distinction between media resources belonging to the target category and media resources not belonging to the target category in a dimension of the at least one media resource feature.
The greater the correlation parameter, the greater the degree of distinction, in the dimension of the at least one media resource feature, between media resources belonging to the target category and media resources not belonging to the target category, that is, the greater the difference between the distributions of the two. Accordingly, when the at least one media resource feature is used as an input feature, the media resource processing model is better able to tell the positive class from the negative class. In other words, the greater the correlation parameter, the more strongly the at least one media resource feature is correlated with the training target, the more it promotes effective learning by the model, and the more it improves the accuracy of model prediction.
In some embodiments, the at least one media asset characteristic is a first media asset characteristic, and the computer device determines the associated parameter corresponding to the first media asset characteristic based on the plurality of first feature values.
In some embodiments, the at least one media asset feature is a plurality of second media asset features that constitute a feature combination, and the computer device determines a correlation parameter corresponding to the feature combination based on the plurality of second feature values corresponding to each of the second media asset features.
In step S103, a target media resource feature with a related parameter greater than a threshold is determined from the plurality of media resource features, and the target media resource feature is used as an input feature of a media resource processing model for predicting whether the media resource belongs to a target category.
It should be noted that, by repeatedly executing the steps S101 to S102, the computer device obtains the relevant parameters corresponding to the multiple media resource features, where one media resource feature corresponds to at least one relevant parameter, that is, one media resource feature corresponds to one or more relevant parameters.
In some embodiments, the computer device determines, each time, a relevant parameter corresponding to one media asset feature based on a plurality of feature values corresponding to that media asset feature; correspondingly, one media asset feature corresponds to one relevant parameter.
In some embodiments, the computer device determines, based on a plurality of feature values corresponding to one feature combination at a time, a relevant parameter corresponding to the feature combination, where the relevant parameter corresponding to each media resource feature in the feature combination is the relevant parameter corresponding to the feature combination. Under the condition that the media resource characteristics in the plurality of characteristic combinations are different from each other, one media resource characteristic corresponds to one related parameter; in the case where features in a plurality of feature combinations are repeated, one media asset feature corresponds to a plurality of related parameters.
In some embodiments, the computer device determines, in addition to a related parameter corresponding to a media asset feature based on a plurality of feature values corresponding to the media asset feature, a related parameter corresponding to a feature combination including the media asset feature based on a plurality of feature values corresponding to the feature combination, and correspondingly, the media asset feature corresponds to a plurality of related parameters.
Another point to be noted is that a correlation parameter greater than the threshold value indicates that the correlation parameter corresponding to the media asset characteristic is at a higher level, and the media asset characteristic is strongly correlated with the training target. In some embodiments, the computer device determines a media asset characteristic for which the corresponding relevant parameter is greater than the threshold as the target media asset characteristic. Wherein the threshold value can be flexibly configured, for example, the threshold value is 0.8 or 0.9. In some embodiments, one media asset feature corresponds to a plurality of related parameters, and the computer device determines the media asset feature for which the corresponding plurality of related parameters are each greater than a threshold as a target media asset feature; or determining the media resource characteristics with the average value of the plurality of related parameters being larger than the threshold value as target media resource characteristics.
In other embodiments, the computer device sorts the plurality of media asset features in descending order of the related parameter and determines the top target number of media asset features as the target media asset features. The target number can be flexibly configured, for example, the target number is 5, 8 or 10.
Wherein the number of target media asset characteristics is one or more, and embodiments of the present disclosure are not limited in this regard.
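The selection rules described for step S103 can be sketched as follows. This is only an illustration under stated assumptions: the function names, the feature names, the parameter values, and the dictionary layout (feature name mapped to a list of one or more related parameters) are hypothetical, and ranking by the mean when a feature has several related parameters is an assumption made for the top-N case, which the text above does not specify.

```python
from statistics import mean


def select_by_threshold(params, threshold, rule="all"):
    """Keep features whose related parameters exceed the threshold.

    rule="all": every related parameter must exceed the threshold.
    rule="mean": the mean of the related parameters must exceed it.
    """
    selected = []
    for name, values in params.items():
        if rule == "all":
            keep = all(v > threshold for v in values)
        else:
            keep = mean(values) > threshold
        if keep:
            selected.append(name)
    return selected


def select_top_n(params, n):
    """Keep the n features with the largest (mean) related parameter."""
    ranked = sorted(params, key=lambda name: mean(params[name]), reverse=True)
    return ranked[:n]


# Hypothetical related parameters for three media resource features.
params = {"click_rate": [0.92], "exposure": [0.85, 0.78], "age": [0.55]}
strict = select_by_threshold(params, 0.8, rule="all")
averaged = select_by_threshold(params, 0.8, rule="mean")
top_two = select_top_n(params, 2)
```

With a threshold of 0.8, the "all" rule rejects `exposure` because one of its two parameters is 0.78, while the "mean" rule accepts it because the mean is 0.815.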
According to the technical solution provided by the embodiments of the present disclosure, the feature values of the media resource samples are analyzed in the dimensions of different media resource features to obtain related parameters. A related parameter represents the degree of distinction between media resources belonging to the target category and media resources not belonging to the target category, and thereby reflects the degree of correlation between a media resource feature and the training target, so that this degree of correlation is reflected quantitatively. Selecting the media resource features whose related parameters are greater than the threshold improves the accuracy of feature selection. Using those media resource features as the input features of the media resource processing model increases the degree of correlation between the input features and the training target, enhances the ability of the model to tell the positive class from the negative class, promotes effective learning by the model, and improves the accuracy of model prediction.
FIG. 2 is a flow chart illustrating a model training method according to an exemplary embodiment. The model training method is briefly described below with reference to fig. 2, and includes the following steps:
in step S201, a training sample and labeling information corresponding to the training sample are acquired.
The training sample comprises characteristic values of a plurality of media resource characteristics corresponding to the media resources, and the labeling information is used for indicating whether the training sample belongs to a target class or not.
In step S202, feature values of the target media asset feature are acquired from the training samples.
The target media asset feature is a media asset feature whose correlation parameter is greater than a threshold, the correlation parameter being indicative of the degree of distinction between media assets belonging to the target category and media assets not belonging to the target category in the dimension of at least one media asset feature that includes the target media asset feature.
The target media resource characteristics are characteristics capable of highlighting the distinction between media resources belonging to the target category and media resources not belonging to the target category, and whether the media resources belong to the target category can be accurately and rapidly determined according to the target media resource characteristics.
In step S203, the feature value of the target media resource feature is used as an input of the media resource processing model, and the labeling information is used as an output target of the media resource processing model, so as to train the media resource processing model.
The feature value of the target media resource feature is used as the input of the media resource processing model, so that the media resource processing model predicts, based on that feature value, whether the training sample belongs to the target category and obtains a prediction result. The difference between the prediction result and the labeling information corresponding to the training sample is then determined, and the media resource processing model is trained with the goal of reducing that difference, that is, with the goal of making the prediction result output by the media resource processing model consistent with the labeling information.
It should be noted that the training sample refers to any one of a plurality of training samples, and the above steps are described by taking one training iteration as an example. In the training process of the media resource processing model, the computer device performs one training iteration based on each training sample, so as to iteratively train the media resource processing model.
According to the technical scheme provided by the embodiment of the disclosure, the target media resource characteristics which can highlight the distinction between the media resources belonging to the target category and the media resources not belonging to the target category are obtained from the training sample, the target media resource characteristics are used as the input of the media resource processing model, the media resource processing model is trained, the interference of irrelevant characteristics to the training process is reduced, the effective learning of the model is promoted, the resolving power of the model to the positive and negative categories can be enhanced, and the accuracy of model prediction is improved.
Fig. 3 is a flow chart illustrating a method of media asset processing according to an exemplary embodiment. The following describes briefly the media resource processing method with reference to fig. 3, and the media resource processing method includes the following steps:
in step S301, a feature value of a target media asset feature is acquired from feature values of a plurality of media asset features corresponding to the media asset.
The target media asset feature is a media asset feature whose correlation parameter is greater than a threshold, the correlation parameter being indicative of the degree of distinction between media assets belonging to the target category and media assets not belonging to the target category in the dimension of at least one media asset feature that includes the target media asset feature.
For example, in a media asset recommendation scenario, a media asset recommendation model is used to predict whether a user will perform a target operation on a media asset, and the target media asset features include an asset characterization vector of the media asset and the user's preference for performing the target operation on media assets. For a media asset published on the resource sharing platform, the feature value of the asset characterization vector of the media asset and the feature value of the preference, for performing the target operation, of the user to whom recommendations are to be made are extracted.
In step S302, the feature value of the target media resource feature is input into the media resource processing model, so as to obtain the prediction result output by the media resource processing model.
The media resource processing model is obtained through training based on training samples and marking information corresponding to the training samples, and a prediction result is used for indicating whether the media resource belongs to a target class or not.
For example, in a media resource recommendation scenario, a media resource processing model predicts whether a user will execute a target operation on a media resource based on a feature value of a target media resource feature corresponding to the media resource, and obtains a prediction result; if the prediction result shows that the user can execute the target operation on the media resource, the media resource is recommended to the user, and the media resource is displayed on a recommendation interface corresponding to the user.
According to the technical scheme provided by the embodiment of the disclosure, the target media resource characteristics which can highlight the distinction between the media resources belonging to the target category and the media resources not belonging to the target category are obtained from the characteristic values of the plurality of media resource characteristics corresponding to the media resources, and the media resource processing model predicts based on the target media resource characteristics, so that the interference of irrelevant characteristics to the prediction process is reduced, and the accuracy of model prediction is improved.
The foregoing embodiments briefly describe the feature processing method. In some embodiments, the relevant parameter of a single media asset feature may be determined based on a receiver operating characteristic curve. A feature processing method that determines the relevant parameter in this way is described in detail below with reference to fig. 4, which is a flowchart illustrating a feature processing method according to an exemplary embodiment. Referring to fig. 4, the feature processing method includes the steps of:
in step S401, the computer device obtains a plurality of first feature values corresponding to the first media resource feature from a plurality of media resource samples.
Step S401 is the same as step S101, and will not be described again.
In step S402, the computer device obtains a plurality of first thresholds corresponding to the first media resource feature from between a maximum value and a minimum value of the plurality of first feature values.
In some embodiments, the computer device determines a maximum value and a minimum value of the plurality of first feature values, between which a plurality of first threshold values are randomly acquired, each of the first threshold values being greater than the minimum value and less than the maximum value.
In some embodiments, the maximum value is denoted as x_max, the minimum value is denoted as x_min, and the value range of the first feature value is denoted as [x_min, x_max]. The computer device divides the value range into m equal parts to obtain m-1 demarcation values, and takes the obtained demarcation values as the first thresholds, where m is a positive integer greater than 1.
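As an illustration of the equal-division approach, the following minimal sketch computes the m-1 demarcation values. The function name `equal_division_thresholds` and the sample values are hypothetical and not taken from the disclosure.

```python
def equal_division_thresholds(first_feature_values, m):
    """Split [x_min, x_max] into m equal parts and return the m-1 cut points."""
    if m <= 1:
        raise ValueError("m must be an integer greater than 1")
    x_min = min(first_feature_values)
    x_max = max(first_feature_values)
    step = (x_max - x_min) / m
    # Interior demarcation values only, so each one lies strictly
    # between x_min and x_max whenever x_max > x_min.
    return [x_min + k * step for k in range(1, m)]


thresholds = equal_division_thresholds([0.0, 1.0, 4.0], 4)
```

With a value range of [0.0, 4.0] and m = 4, the three demarcation values are 1.0, 2.0 and 3.0.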
In step S403, the computer device determines a false positive class rate and a true class rate for each first threshold.
The false positive class rate (FPR, False Positive Rate) refers to the probability of predicting a negative sample as the positive class. The true class rate (TPR, True Positive Rate) refers to the probability of predicting a positive sample as the positive class. In the embodiments of the present disclosure, samples whose first feature value is greater than the first threshold are predicted to be of the positive class. The false positive class rate corresponding to the first threshold is therefore the ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, where the first target negative samples are negative samples whose first feature value is greater than the first threshold. The true class rate corresponding to the first threshold is the ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples whose first feature value is greater than the first threshold.
The above embodiment describes the process of determining the false positive class rate and the true class rate corresponding to one first threshold. The process is the same for every first threshold and is not repeated here.
In the above embodiment, samples whose first feature value is greater than the first threshold are predicted to be of the positive class. In addition, in some embodiments, samples whose first feature value is equal to the first threshold are also predicted to be of the positive class. In some embodiments, a sample whose first feature value is equal to the first threshold is counted as one sample predicted to be of the positive class; in some embodiments, such a sample is counted as 0.5 of a sample predicted to be of the positive class.
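The computation in step S403, including the optional handling of samples whose first feature value equals the threshold, can be sketched as follows. The function name `rates_at_threshold` and the `tie_weight` parameter are hypothetical: a weight of 0.0 corresponds to counting only values strictly greater than the threshold, 1.0 to counting an equal value as one predicted-positive sample, and 0.5 to counting it as half a sample.

```python
def rates_at_threshold(values, labels, threshold, tie_weight=0.0):
    """Return (false positive class rate, true class rate) for one threshold."""
    pos_hits = neg_hits = 0.0
    n_pos = n_neg = 0
    for v, y in zip(values, labels):
        if v > threshold:
            weight = 1.0         # predicted as the positive class
        elif v == threshold:
            weight = tie_weight  # configurable treatment of equal values
        else:
            weight = 0.0         # predicted as the negative class
        if y == 1:
            n_pos += 1
            pos_hits += weight
        else:
            n_neg += 1
            neg_hits += weight
    return neg_hits / n_neg, pos_hits / n_pos


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
fpr, tpr = rates_at_threshold(values, labels, 0.3)
fpr_tie, tpr_tie = rates_at_threshold(values, labels, 0.4, tie_weight=0.5)
```

For the threshold 0.3, one of the two negative samples and both positive samples exceed it. For the threshold 0.4 with a tie weight of 0.5, the negative sample equal to 0.4 contributes half a count.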
In step S404, the computer device determines a relevant parameter corresponding to the first media asset feature based on the false positive class rate and the true class rate corresponding to each first threshold.
After determining the false positive class rate and the true class rate corresponding to each first threshold, the computer device draws an ROC curve (Receiver Operating Characteristic Curve) based on the false positive class rate and the true class rate corresponding to each first threshold, and determines an AUC value (Area Under Curve, the area enclosed by the ROC curve and the coordinate axes) based on the ROC curve. The AUC value represents the probability that, for any first threshold, when one positive sample and one negative sample are randomly given, the first feature value of the positive sample is greater than the first feature value of the negative sample. The greater this probability, the greater the degree of distinction between positive samples and negative samples; the AUC value is therefore used as the relevant parameter to express the degree of distinction between positive samples and negative samples.
In some embodiments, the step S404 includes: for each first threshold value of the plurality of first threshold values, the computer equipment determines a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value; determining an ROC curve based on a plurality of points corresponding to the plurality of first thresholds; and determining the area under the ROC curve as a relevant parameter corresponding to the first media resource characteristic.
In some embodiments, the computer device connects a plurality of points to obtain the ROC curve. In some embodiments, the computer device smoothes the curve obtained by connecting the plurality of points to obtain the ROC curve.
One first threshold corresponds to one point. In one example, the computer device determines, based on the plurality of points corresponding to the plurality of first thresholds, the ROC curve shown in fig. 5, and determines the area under the ROC curve, that is, the area enclosed by the ROC curve, the coordinate axis, and the straight line FPR = 1, as the relevant parameter corresponding to the first media resource feature.
In the above technical solution, the relevant parameter is determined in the same manner as an AUC value and therefore carries the same statistical meaning: regardless of which first threshold is taken, when a positive sample and a negative sample are drawn at random, the relevant parameter is the probability that the first feature value of the positive sample is larger than the first feature value of the negative sample. The relevant parameter thus reflects the degree of distinction between positive samples and negative samples more accurately, which improves the accuracy with which the relevant parameter represents the degree of distinction.
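The determination of the ROC points and of the area under the curve can be illustrated with the following minimal sketch. It is only an illustration under stated assumptions: the function names `roc_points` and `area_under_curve` are hypothetical, labels are assumed to be 1 for positive samples and 0 for negative samples, the points are connected by straight lines (trapezoidal rule), and the end points (0, 0) and (1, 1) are added so that the curve spans the whole FPR axis.

```python
def roc_points(values, labels, thresholds):
    """Return one (FPR, TPR) point per threshold.

    A sample counts toward a rate when its feature value is greater than
    the threshold. Labels: 1 = positive sample, 0 = negative sample.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v > t)
        fp = sum(1 for v, y in zip(values, labels) if y == 0 and v > t)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Connect the points by straight lines and sum the trapezoid areas.

    Assumption: (0, 0) and (1, 1) are added as end points of the curve.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


values = [0.1, 0.4, 0.35, 0.8]   # first feature values (made-up data)
labels = [0, 0, 1, 1]
thresholds = [0.0, 0.2, 0.375, 0.6, 0.9]
auc = area_under_curve(roc_points(values, labels, thresholds))
```

In this made-up example, three of the four positive/negative pairs have the positive sample ranked higher, and the computed area is 0.75.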
In some embodiments, the AUC value is an integral of the true class rate, and after determining the false positive class rate and the true class rate corresponding to each first threshold, the computer device may also determine an integral of the true class rate, and determine the integral as the relevant parameter corresponding to the first media resource feature. That is, the computer device determines the relevant parameters by the following formula one:
Formula one:

$$\mathrm{AUC} = \int_{x_{\max}}^{x_{\min}} \mathrm{TPR}(x_m)\,\mathrm{d}\,\mathrm{FPR}(x_m)$$

wherein AUC represents the relevant parameter; x_m represents any one of the plurality of first thresholds; x_min represents the minimum value of the plurality of first feature values; x_max represents the maximum value of the plurality of first feature values; TPR represents the true class rate; FPR represents the false positive class rate; and the right-hand side is the integral of the true class rate over the plurality of first thresholds, i.e., the area under the ROC curve.
In step S405, the computer device determines, from among the plurality of media resource features, a target media resource feature whose relevant parameter is greater than a threshold, and uses the target media resource feature as an input feature of the media resource processing model.
Step S405 is similar to step S103, and will not be described again.
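The selection in step S405 can be sketched as a simple filter. The function name `select_target_features`, the feature names, and the numeric values below are hypothetical and serve only as an illustration.

```python
def select_target_features(relevant_params, threshold):
    """Keep the features whose relevant parameter is greater than the threshold.

    relevant_params maps a feature name to its relevant parameter (e.g. an AUC value).
    """
    return [name for name, value in relevant_params.items() if value > threshold]


# Made-up relevant parameters for three hypothetical media resource features.
params = {"duration": 0.81, "like_count": 0.55, "tag_count": 0.62}
selected = select_target_features(params, threshold=0.6)
```

With these made-up values, `selected` contains `"duration"` and `"tag_count"`.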
In the above technical solution, the relevant parameter represents the probability that, regardless of which first threshold is taken, the first feature value of a randomly drawn positive sample is larger than the first feature value of a randomly drawn negative sample. The relevant parameter therefore quantifies the degree of distinction between positive samples and negative samples more accurately in a statistical sense, which improves the accuracy with which the relevant parameter represents the degree of distinction.
The foregoing embodiments describe one way to determine the relevant parameter corresponding to a single media resource feature. In some embodiments, the relevant parameter of a single media resource feature may also be determined in other ways. A feature processing method that determines the relevant parameter based on the WMW (Wilcoxon-Mann-Whitney) rank sum test is described in detail below with reference to fig. 6, which is a flowchart of a feature processing method according to an exemplary embodiment. Referring to fig. 6, the feature processing method includes the following steps:
In step S601, the computer device obtains a plurality of feature values corresponding to the first media resource feature from a plurality of media resource samples.
Step S601 is the same as step S101, and will not be described again here.
In step S602, the computer device sorts the plurality of media resource samples according to the order of the first feature values from small to large, to obtain an inter-group rank of the plurality of media resource samples, where the inter-group rank of one media resource sample refers to a sorting order of the first feature value of the media resource sample in the plurality of first feature values.
WMW rank sum test is a non-parametric statistical method, the main idea being to infer the difference between the distribution of the population in which two sample sets are located using the two sample sets. In WMW rank sum test, the difference between the two distributions is represented by the U statistic. The U statistic represents the probability that a sample in one sample set will be ranked before a sample in another sample set. The larger the U statistic, the greater the difference between the two distributions.
Based on the idea of the WMW rank sum test, the computer device sorts the plurality of media resource samples in ascending order of the first feature values to obtain the inter-group ranks of the plurality of media resource samples. The smaller the first feature value, the smaller the inter-group rank of the media resource sample; the larger the first feature value, the larger the inter-group rank of the media resource sample.
In some embodiments, among the plurality of media resource samples, the first feature values of at least two media resource samples are equal. In this case, the computer device assigns consecutive initial sequence numbers to the at least two media resource samples, determines the mean of the initial sequence numbers of the at least two media resource samples, and determines the mean as the inter-group rank of each of the at least two media resource samples. For example, if the initial sequence numbers of two media resource samples whose first feature value is 1 are 3 and 4 respectively, the mean of 3 and 4, namely 3.5, is determined as the inter-group rank of both media resource samples.
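The ranking with averaged sequence numbers for equal feature values can be sketched as follows. The function name `inter_group_ranks` is hypothetical; the sketch assumes 1-based sequence numbers in ascending order of the feature value.

```python
def inter_group_ranks(values):
    """Return the 1-based ascending rank of each value.

    Samples with equal values all receive the mean of their initial sequence numbers.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend [start, end] over the run of equal values.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based sequence numbers start+1 .. end+1.
        mean_rank = (start + end + 2) / 2.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


ranks = inter_group_ranks([0.2, 0.5, 1, 1, 3])
```

For this input, the two samples whose feature value is 1 occupy sequence numbers 3 and 4 and both receive the rank 3.5, matching the example in the text.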
In step S603, the computer device obtains a first number of a plurality of sample pairs, one sample pair consisting of one positive sample and one negative sample of the plurality of media asset samples.
The plurality of media resource samples may be divided into a positive sample group and a negative sample group. Comparing each positive sample in the positive sample group with each negative sample in the negative sample group yields a first number of cases in total. That is, if the plurality of media resource samples includes M positive samples and N negative samples, the pairwise comparison of positive samples with negative samples yields M×N cases in total, and the first number is M×N.
In step S604, the computer device determines a second number of target sample pairs in which the inter-group rank of the positive samples is greater than the inter-group rank of the negative samples based on the inter-group ranks of the plurality of media resource samples.
In some embodiments, the initial value of the second number is 0. Starting from this value, the computer device traverses the plurality of sample pairs and increments the second number by 1 each time it determines that, in a sample pair, the inter-group rank of the positive sample is greater than the inter-group rank of the negative sample. After the plurality of sample pairs have been traversed, the second number of target sample pairs is obtained.
The inter-group rank of a media resource sample represents the number of media resource samples, positive and negative, whose feature value is not greater than that of the media resource sample, and the intra-group rank of a positive sample represents the number of positive samples whose feature value is not greater than that of the positive sample. The difference between the inter-group rank and the intra-group rank of a positive sample is therefore the number of negative samples whose feature value is not greater than that of the positive sample. Accordingly, the difference between the inter-group rank sum and the intra-group rank sum of the positive samples is the total number of cases in which a positive sample is greater than a negative sample, namely the second number of target sample pairs among the M×N cases.
Accordingly, in some embodiments, the step S604 includes the following steps: the computer device determines the sum of the inter-group ranks of at least one positive sample of the plurality of media resource samples as the inter-group rank sum of the at least one positive sample; the computer device sorts the at least one positive sample in ascending order of the first feature values to obtain the intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample refers to the sequence number of the first feature value of the positive sample among the first feature values of the at least one positive sample; determines the sum of the intra-group ranks of the at least one positive sample as the intra-group rank sum of the at least one positive sample; and determines the difference between the inter-group rank sum and the intra-group rank sum as the second number.
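The traversal described above can be sketched as a double loop over positive and negative samples. The function name `count_target_pairs` is hypothetical, and labels are assumed to be 1 for positive samples and 0 for negative samples.

```python
def count_target_pairs(ranks, labels):
    """Return (first_number, second_number).

    first_number is the number of (positive, negative) sample pairs, M * N.
    second_number is the number of pairs in which the positive sample has
    the greater inter-group rank.
    """
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    neg_ranks = [r for r, y in zip(ranks, labels) if y == 0]
    first_number = len(pos_ranks) * len(neg_ranks)
    second_number = 0  # initial value 0, incremented by 1 per target sample pair
    for rp in pos_ranks:
        for rn in neg_ranks:
            if rp > rn:
                second_number += 1
    return first_number, second_number


first_number, second_number = count_target_pairs([1, 2, 3, 4], [0, 1, 0, 1])
```

In this made-up example there are 2 × 2 = 4 sample pairs, of which 3 are target sample pairs. The traversal costs M×N comparisons.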
According to the technical scheme, the second number of the target sample pairs is determined based on the inter-group rank sum and the intra-group rank sum, so that the process of traversing a plurality of sample pairs for comparison is omitted, and the determination efficiency of the second number is improved.
In step S605, the computer device determines a ratio of the second number to the first number as a relevant parameter corresponding to the first media asset characteristic.
The computer device determines the ratio of the second number to the first number as the relevant parameter corresponding to the first media resource feature; that is, the computer device determines the relevant parameter corresponding to the first media resource feature by the following formula two:
Formula two:

$$\mathrm{AUC} = \frac{R - \frac{M(M+1)}{2}}{M \times N}$$

wherein AUC represents the relevant parameter; M is the number of the at least one positive sample in the plurality of media resource samples; N is the number of negative samples in the plurality of media resource samples; R is the inter-group rank sum of the at least one positive sample; M(M+1)/2 is the intra-group rank sum of the at least one positive sample; R − M(M+1)/2 is the second number; and M×N is the first number.
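The rank-sum computation of the relevant parameter can be sketched as follows. The function name `rank_sum_auc` is hypothetical, labels are assumed to be 1 for positive samples and 0 for negative samples, and the sketch assumes that all feature values are distinct (equal feature values would require the averaged ranks described for step S602).

```python
def rank_sum_auc(values, labels):
    """Relevant parameter computed as (R - M(M+1)/2) / (M * N).

    Assumption: all feature values are distinct, so ranks are simply the
    1-based positions in ascending order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank_of = {idx: pos + 1 for pos, idx in enumerate(order)}
    m = sum(1 for y in labels if y == 1)   # number of positive samples
    n = len(labels) - m                    # number of negative samples
    r = sum(rank_of[i] for i, y in enumerate(labels) if y == 1)  # inter-group rank sum
    second_number = r - m * (m + 1) / 2    # subtract the intra-group rank sum
    return second_number / (m * n)


auc = rank_sum_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Here the positive samples have inter-group ranks 2 and 4, so R = 6, the intra-group rank sum is 3, and the result is 3 / 4 = 0.75. Apart from the sort, the computation needs only a single pass over the samples instead of M×N comparisons.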
In step S606, the computer device determines a target media asset feature having a correlation parameter greater than a threshold from the plurality of media asset features, and uses the target media asset feature as an input feature of the media asset processing model.
Step S606 is similar to step S103, and will not be described again.
As can be seen from the above process of determining the relevant parameter, the relevant parameter represents the probability that, when a positive sample and a negative sample are drawn at random, the first feature value of the positive sample is larger than the first feature value of the negative sample. The relevant parameter therefore quantifies the degree of distinction between positive samples and negative samples more accurately, which improves the accuracy with which the relevant parameter represents the degree of distinction.
The foregoing embodiments describe feature processing methods that determine the relevant parameter of a single media resource feature. In some embodiments, a plurality of media resource features form a feature combination. A feature processing method that determines the relevant parameter corresponding to a feature combination is described in detail below with reference to fig. 7, which is a flowchart of a feature processing method according to an exemplary embodiment. Referring to fig. 7, the feature processing method includes the following steps:
In step S701, for each second media resource feature of the plurality of second media resource features, the computer device obtains a plurality of second feature values corresponding to the second media resource feature from the plurality of media resource samples.
Step S701 is the same as step S101, and will not be described again here.
In step S702, for each of the plurality of second media asset features, the computer device obtains a plurality of second thresholds corresponding to the second media asset feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media asset feature.
For each second media resource feature, the computer device obtains a plurality of second thresholds corresponding to the second media resource feature. The process of obtaining the plurality of thresholds corresponding to each second media resource feature by the computer device is the same as step S402, and will not be described herein.
In step S703, the computer device obtains a second threshold from the plurality of second thresholds corresponding to each second media resource feature, to obtain a threshold group.
The computer device obtains one second threshold from the plurality of second thresholds corresponding to each second media resource feature, so that one second threshold is obtained for each of the plurality of second media resource features. The second thresholds obtained in this way, each corresponding to one second media resource feature, form a threshold group.
In some embodiments, the computer device performs step S703 multiple times, resulting in multiple threshold groups. In some embodiments, the computer device randomly selects one second threshold value from the plurality of second threshold values corresponding to each second media resource feature at a time, resulting in one threshold value group.
In some embodiments, the computer device enumerates the combinations of the plurality of second thresholds corresponding to each of the second media resource features, taking one second threshold per second media resource feature, to obtain a plurality of threshold groups.
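The two ways of forming threshold groups described above, random selection and enumeration of combinations, can be sketched as follows. The function names are hypothetical, and each inner list is assumed to hold the second thresholds of one second media resource feature.

```python
import itertools
import random


def all_threshold_groups(thresholds_per_feature):
    """Enumerate every threshold group: one second threshold per feature."""
    return list(itertools.product(*thresholds_per_feature))


def random_threshold_group(thresholds_per_feature, rng):
    """Randomly pick one second threshold per feature to form one threshold group."""
    return tuple(rng.choice(ts) for ts in thresholds_per_feature)


# Made-up second thresholds for two second media resource features.
thresholds_per_feature = [[0.1, 0.5], [1, 2, 3]]
groups = all_threshold_groups(thresholds_per_feature)
one_group = random_threshold_group(thresholds_per_feature, random.Random(0))
```

With 2 and 3 thresholds respectively, enumeration yields 2 × 3 = 6 threshold groups. The number of groups grows multiplicatively with the number of features, which is one reason random selection may be preferred for larger feature combinations.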
In step S704, the computer device determines, for the plurality of obtained threshold groups, a false positive class rate and a true class rate corresponding to each threshold group.
A threshold group includes n thresholds, n being a positive integer, and there are 2^n possible results of comparing the feature values with the threshold group. Taking a threshold group that includes 2 thresholds as an example, the threshold group includes a threshold x_{1t} and a threshold x_{2t}, and comparing the feature values with the threshold group yields the following 4 cases: (1) x_1 > x_{1t}, x_2 > x_{2t}; (2) x_1 < x_{1t}, x_2 > x_{2t}; (3) x_1 > x_{1t}, x_2 < x_{2t}; (4) x_1 < x_{1t}, x_2 < x_{2t}. Here, x_1 is the feature value of the media resource feature corresponding to the threshold x_{1t}, and x_2 is the feature value of the media resource feature corresponding to the threshold x_{2t}.
If the two intermediate cases (2) and (3) were taken into account, the feature values involved in determining the relevant parameter would be limited to a certain range, so that the media resource samples on which the relevant parameter is based would change: they would no longer be the plurality of media resource samples, but only a part of them. Accordingly, the degree of distinction represented by the relevant parameter would hold only under the premise of that range, and the accuracy with which the relevant parameter represents the degree of distinction would be reduced.
For example, if case (2) is retained, media resource samples conforming to case (1) are taken as the predicted positive class and media resource samples conforming to case (2) are taken as the predicted negative class. Both cases require x_2 > x_{2t}, which is equivalent to selecting, from the original plurality of media resource samples, the samples conforming to x_2 > x_{2t} to form a sample subset, and then considering the degree of correlation between the media resource features and the training target on the basis of that sample subset. In this process, the sample set on which the relevant parameter is based changes, and the accuracy with which the relevant parameter represents the degree of distinction is reduced.
That is, when the relevant parameter corresponding to a feature combination is determined, taking the intermediate cases of the comparison between the thresholds and the feature values into account reduces the accuracy with which the relevant parameter represents the degree of distinction. Therefore, in some embodiments, to ensure this accuracy, the intermediate cases are discarded: media resource samples whose feature values are all greater than the corresponding thresholds are predicted as the positive class, media resource samples whose feature values are all smaller than the corresponding thresholds are predicted as the negative class, and the false positive class rate and the true class rate are determined on this basis. Taking a threshold group that includes 2 thresholds as an example, media resource samples satisfying x_1 > x_{1t}, x_2 > x_{2t} are predicted as the positive class, and media resource samples satisfying x_1 < x_{1t}, x_2 < x_{2t} are predicted as the negative class.
Correspondingly, the false positive class rate corresponding to one threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, and each second characteristic value in the second target negative samples is larger than the corresponding second threshold value. The true class rate corresponding to one threshold group is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, and each second characteristic value in the second target positive samples is greater than a corresponding second threshold.
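The two rates for one threshold group can be sketched as follows. The function name `group_rates` is hypothetical; each sample is assumed to be a tuple of second feature values in the same order as the thresholds of the threshold group, and labels are assumed to be 1 for positive samples and 0 for negative samples.

```python
def group_rates(samples, labels, threshold_group):
    """Return (false positive class rate, true class rate) for one threshold group.

    A sample is counted only if every one of its second feature values is
    greater than the corresponding second threshold.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    target_pos = 0
    target_neg = 0
    for features, y in zip(samples, labels):
        if all(v > t for v, t in zip(features, threshold_group)):
            if y == 1:
                target_pos += 1
            else:
                target_neg += 1
    return target_neg / neg, target_pos / pos


# Made-up samples with two second feature values each.
samples = [(0.9, 0.8), (0.7, 0.2), (0.3, 0.9), (0.1, 0.1)]
labels = [1, 1, 0, 0]
fpr, tpr = group_rates(samples, labels, (0.5, 0.5))
```

In this made-up example only the first sample exceeds both thresholds, so the true class rate is 1/2 and the false positive class rate is 0/2. The denominators remain the total numbers of positive and negative samples, as stated in the text.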
In the above technical solution, discarding the intermediate cases of the comparison between the thresholds and the feature values improves the accuracy with which the relevant parameter represents the degree of distinction. In addition, although the number of possible results of comparing a feature combination with a threshold group grows exponentially, the number of samples involved in determining the false positive class rate and the true class rate is reduced, which improves the efficiency of determining the false positive class rate and the true class rate.
The more media resource features a feature combination contains, the more samples are discarded. The feature combination can therefore be set to contain a small number of media resource features, which reduces the number of discarded samples. The relevant parameter corresponding to a feature combination can also be determined as a supplement after the relevant parameters corresponding to the single media resource features have been determined.
In step S705, the computer device determines a relevant parameter corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group.
Step S705 is the same as step S404. That is, for each threshold group of the plurality of threshold groups, the computer device determines, based on the false positive class rate and the true class rate corresponding to the threshold group, a point corresponding to the threshold group, wherein the abscissa of the point is the false positive class rate corresponding to the threshold group and the ordinate of the point is the true class rate corresponding to the threshold group; determines an ROC curve based on a plurality of points corresponding to the plurality of threshold groups; and determines the area under the ROC curve as the relevant parameter corresponding to the plurality of second media resource features, this relevant parameter being the relevant parameter corresponding to the feature combination formed by the plurality of second media resource features.
In step S706, the computer device determines a target media asset feature having a correlation parameter greater than a threshold from the plurality of media asset features, and uses the target media asset feature as an input feature of the media asset processing model.
Step S706 is similar to step S103, and will not be described again.
In the above technical solution, the relevant parameter corresponding to a combination of a plurality of media resource features is determined to reflect the degree of correlation between the combination and the training target, so that the degree of correlation between a feature combination and the training target is represented quantitatively. The relevant parameters can thus represent the degree of correlation between media resource features and the training target more comprehensively and accurately. Taking the degree of correlation between feature combinations and the training target into account further improves the accuracy of feature selection, and using the selected media resource features as input features of the media resource processing model further improves the accuracy of model prediction.
In addition, by selecting the media resource characteristics with the relevant parameters larger than the threshold value, the technical scheme provided by the embodiment of the disclosure omits the media resource characteristics irrelevant to the training target or with low relevance, reduces the interference of the characteristics on model learning, further improves the accuracy of model prediction, reduces the invalid characteristics required to be processed by the model, and improves the model processing efficiency.
In addition, in the technical solution provided by the embodiments of the present disclosure, the degree of correlation between a media resource feature and the training target is determined by analyzing feature values, so the method can be executed before model training and does not depend on a trained model. In the single-variable method, by contrast, the feature value of a single feature is adjusted while the feature values of the other features are held unchanged, and the model output is observed to determine the degree of correlation between that feature and the training target. The present solution does not take unchanged feature values of the other features as a precondition for analysis, so the determined relevant parameter quantifies the degree of correlation between the media resource feature and the training target more accurately, on the whole and in a statistical sense, which improves the accuracy with which the relevant parameter represents that degree of correlation.
In addition, the technical solution provided by the embodiments of the present disclosure determines the relevant parameter on a statistical basis, so that the relevant parameter represents the degree of correlation between the media resource feature and the training target from the perspective of the overall distribution. Compared with a scheme that observes how frequently different values of the media resource feature co-occur with the training target, the degree of correlation is summarized and represented more accurately at the overall level, which improves the accuracy with which the relevant parameter represents the degree of correlation between the media resource feature and the training target.
It should be noted that, the target media resource features determined in the embodiments of the present disclosure may highlight the distinction between the media resources belonging to the target category and the media resources not belonging to the target category, where the target media resource features are strongly related to the target category, and when training the media resource processing model for predicting the target category, or when processing the media resources by applying the media resource processing model for predicting the target category, the corresponding feature values may be extracted as model input based on the predetermined target media resource features. In some embodiments, after determining a target media asset characteristic from the plurality of media asset characteristics for which the relevant parameter is greater than the threshold, a correspondence of the target media asset characteristic to the target category is stored. Accordingly, before training any media resource processing model for predicting a target category, determining a target media resource feature corresponding to the target category based on a stored correspondence between the target category and the target media resource feature. Before any media resource processing model is applied to process media resources, a target category corresponding to the media resource processing model is obtained, and target media resource characteristics corresponding to the target category are determined based on the corresponding relation between the stored target category and the target media resource characteristics.
In the above technical solution, after the target media resource feature is determined, the target category and the target media resource feature are stored in correspondence with each other. Before a model for predicting the target category is trained or applied, the stored correspondence can then be queried directly to determine the corresponding target media resource feature, so that the step of determining the target media resource feature does not need to be executed repeatedly, which improves the efficiency of obtaining the target media resource feature.
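The stored correspondence between target categories and target media resource features can be sketched with an in-memory mapping. The class name `TargetFeatureStore`, its methods, and the category and feature names are hypothetical; an actual embodiment may use any storage, such as a database or a configuration file.

```python
class TargetFeatureStore:
    """Keeps the correspondence between a target category and its target features."""

    def __init__(self):
        self._features_by_category = {}

    def save(self, target_category, target_features):
        # Store a copy so later changes by the caller do not alter the stored record.
        self._features_by_category[target_category] = list(target_features)

    def lookup(self, target_category):
        # Return an empty list when no features were stored for the category.
        return list(self._features_by_category.get(target_category, []))


store = TargetFeatureStore()
store.save("sports", ["duration", "tag_count"])   # made-up category and features
features_for_training = store.lookup("sports")
```

Before training or applying a model for a given target category, a lookup of this kind replaces a repeated execution of the feature determination steps.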
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
Fig. 8 is a block diagram of a feature processing apparatus according to an exemplary embodiment. Referring to fig. 8, the feature processing apparatus includes:
a first feature value obtaining unit 801 configured to obtain feature values of at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including a positive sample belonging to a target category and a negative sample not belonging to the target category;
a first parameter determining unit 802 configured to perform determining, based on the obtained feature values, a relevant parameter corresponding to at least one media resource feature, the relevant parameter being used to represent a degree of distinction between media resources belonging to the target category and media resources not belonging to the target category in a dimension of the at least one media resource feature;
A first feature determining unit 803 configured to perform determining a target media resource feature, from among the plurality of media resource features, for which the relevant parameter is larger than the threshold, using the target media resource feature as an input feature of a media resource processing model for predicting whether the media resource belongs to a target category.
According to the feature processing apparatus provided by the embodiments of the present disclosure, the feature values of the media resource samples are analyzed separately in the dimensions of different media resource features to obtain relevant parameters. A relevant parameter represents the degree of distinction between media resources belonging to the target category and media resources not belonging to the target category, and thereby reflects, quantitatively, the degree of correlation between a media resource feature and the training target. Selecting the media resource features whose relevant parameters are greater than the threshold improves the accuracy of feature selection. Using these media resource features as input features of the media resource processing model increases the degree of correlation between the input features and the training target, strengthens the ability of the model to distinguish the positive class from the negative class, promotes effective learning by the model, and improves the accuracy of model prediction.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic; the first parameter determination unit 802 includes:
a first threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a first ratio determining subunit configured to perform determining a false positive class ratio and a true class ratio corresponding to each first threshold;
a first parameter determining subunit configured to perform determining, based on the pseudo-positive class rate and the true class rate corresponding to each first threshold, a related parameter corresponding to the first media resource feature;
the false positive class rate corresponding to any first threshold value is the ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, and the first target negative samples refer to negative samples with the first characteristic value larger than the first threshold value;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
In some embodiments, the first parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the related parameter corresponding to the first media resource feature.
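As a non-limiting illustration, the threshold-based determination of the related parameter for a single media resource feature might be sketched as follows. The function name `roc_auc_from_thresholds`, the evenly spaced thresholds, the added curve endpoints (0, 0) and (1, 1), and the trapezoidal integration are assumptions made for this sketch and are not prescribed by the disclosure.

```python
def roc_auc_from_thresholds(values, labels, num_thresholds=100):
    """Approximate the area under the ROC curve for one feature.

    values: first feature values, one per media resource sample.
    labels: 1 for a positive sample, 0 for a negative sample.
    """
    if num_thresholds < 2:
        raise ValueError("num_thresholds must be at least 2")
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    lo, hi = min(values), max(values)
    # Assumption: thresholds are evenly spaced between the minimum and maximum.
    thresholds = [lo + (hi - lo) * i / (num_thresholds - 1)
                  for i in range(num_thresholds)]

    points = []
    for t in thresholds:
        # False positive class rate: negatives whose value exceeds the threshold.
        fpr = sum(v > t for v in neg) / len(neg)
        # True class rate: positives whose value exceeds the threshold.
        tpr = sum(v > t for v in pos) / len(pos)
        points.append((fpr, tpr))

    # Assumption: close the curve at (0, 0) and (1, 1) before integrating.
    points.extend([(0.0, 0.0), (1.0, 1.0)])
    points.sort()

    # Trapezoidal rule over consecutive points of the curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

In this sketch a feature that separates the two classes perfectly yields an area of 1, while a feature that ranks every negative sample above every positive sample yields 0.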
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features; a first parameter determination unit 802 configured to perform:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring one second threshold from the plurality of second thresholds corresponding to each second media resource feature to obtain a threshold group;
for the plurality of obtained threshold groups, determining the false positive class rate and the true class rate corresponding to each threshold group;
based on the false positive class rate and the true class rate corresponding to each threshold value group, determining relevant parameters corresponding to a plurality of second media resource characteristics;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, and the second target negative samples are negative samples with each second characteristic value larger than the corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, and the second target positive samples are positive samples with each second characteristic value being greater than the corresponding second threshold.
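As a non-limiting illustration, the false positive class rate and true class rate for each threshold group might be computed as sketched below. The function name `threshold_group_rates` and the choice to enumerate every combination of per-feature thresholds with `itertools.product` are assumptions of this sketch; the thresholds themselves are supplied by the caller.

```python
from itertools import product


def threshold_group_rates(samples, labels, thresholds_per_feature):
    """Return {threshold_group: (false_positive_rate, true_positive_rate)}.

    samples: one tuple of second feature values per media resource sample.
    labels: 1 for a positive sample, 0 for a negative sample.
    thresholds_per_feature: one list of second thresholds per feature.
    """
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    def exceeds(sample, group):
        # A sample counts only if every feature value exceeds its threshold.
        return all(v > t for v, t in zip(sample, group))

    rates = {}
    # Assumption: a threshold group takes one threshold from each feature,
    # and all such combinations are evaluated.
    for group in product(*thresholds_per_feature):
        fpr = sum(exceeds(s, group) for s in neg) / len(neg)
        tpr = sum(exceeds(s, group) for s in pos) / len(pos)
        rates[group] = (fpr, tpr)
    return rates
```

The number of threshold groups grows multiplicatively with the number of features, so in practice the per-feature threshold lists would likely be kept short.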
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic; the first parameter determination unit 802 includes:
a first inter-group rank determining subunit configured to perform ranking the plurality of media resource samples in ascending order of first feature value to obtain inter-group ranks of the plurality of media resource samples, where the inter-group rank of a media resource sample is the rank position of that sample's first feature value among the plurality of first feature values;
A first number acquisition subunit configured to perform acquiring a first number of a plurality of pairs of samples, one pair of samples consisting of one positive sample and one negative sample of the plurality of media asset samples;
a second number determination subunit configured to perform determining a second number of target sample pairs in which the inter-group rank of the positive samples is greater than the inter-group rank of the negative samples based on the inter-group ranks of the plurality of media resource samples;
and a second parameter determination subunit configured to perform determining a ratio of the second number to the first number as a related parameter corresponding to the first media resource feature.
In some embodiments, the second number determination subunit is configured to perform:
determining the sum of the inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
ranking the at least one positive sample in ascending order of first feature value to obtain an intra-group rank of the at least one positive sample, where the intra-group rank of a positive sample is the rank position of that positive sample's first feature value among the first feature values of the at least one positive sample;
determining the sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
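As a non-limiting illustration, the rank-based determination might be sketched as follows. The function names are hypothetical, ranks are assumed to start at 1, and the sketch assumes that no two samples share the same first feature value, since the handling of ties is not addressed in this passage.

```python
def count_target_pairs(values, labels):
    """Second number: pairs in which the positive sample outranks the negative."""
    # Inter-group rank: position of each sample among all samples (ascending).
    order = sorted(range(len(values)), key=lambda i: values[i])
    inter_rank = {idx: rank for rank, idx in enumerate(order, start=1)}

    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    inter_rank_sum = sum(inter_rank[i] for i in pos_idx)

    # Intra-group rank: position of each positive among positives only.
    pos_order = sorted(pos_idx, key=lambda i: values[i])
    intra_rank_sum = sum(rank for rank, _ in enumerate(pos_order, start=1))

    # Each positive's inter-group rank minus its intra-group rank equals the
    # number of negatives ranked below it, so the difference of the two sums
    # counts the target sample pairs.
    return inter_rank_sum - intra_rank_sum


def rank_auc(values, labels):
    """Related parameter: ratio of the second number to the first number."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    first_number = n_pos * n_neg  # every positive paired with every negative
    return count_target_pairs(values, labels) / first_number
```

Because the intra-group ranks are simply 1 through P for P positive samples, the intra-group rank sum always equals P(P + 1)/2, so the sketch avoids enumerating the sample pairs one by one.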
In some embodiments, the first feature determining unit 803 is configured to perform determining, as the target media resource feature, a media resource feature whose corresponding related parameter is greater than the threshold; or,
the first feature determining unit 803 is configured to perform ranking the plurality of media resource features in descending order of related parameter and determining the first target number of media resource features as the target media resource features.
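As a non-limiting illustration, the two alternatives for determining the target media resource features might be sketched as follows. The function names and the feature names used in the usage example are hypothetical.

```python
def select_by_threshold(related_params, threshold):
    """Keep the features whose related parameter is greater than the threshold."""
    return [name for name, value in related_params.items() if value > threshold]


def select_top_k(related_params, target_number):
    """Keep the first target_number features in descending order of parameter."""
    ranked = sorted(related_params, key=related_params.get, reverse=True)
    return ranked[:target_number]


# Hypothetical related parameters for three media resource features.
params = {"duration": 0.52, "like_rate": 0.81, "comment_count": 0.67}
by_threshold = select_by_threshold(params, 0.6)
top_two = select_top_k(params, 2)
```

The threshold variant yields a variable number of features, whereas the ranking variant fixes the number of inputs to the model in advance.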
With respect to the feature processing apparatus in the above-described embodiment, the specific manner in which the respective units perform the operations has been described in detail in the embodiment regarding the feature processing method, and is not described in detail here.
FIG. 9 is a block diagram illustrating a model training apparatus according to an example embodiment. Referring to fig. 9, the model training apparatus includes:
the sample acquiring unit 901 is configured to perform acquiring a training sample and labeling information corresponding to the training sample, where the training sample includes feature values of a plurality of media resource features corresponding to media resources, and the labeling information is used to indicate whether the training sample belongs to a target class;
a second feature value obtaining unit 902 configured to perform obtaining, from the training sample, a feature value of a target media resource feature, the target media resource feature being a media resource feature whose related parameter is greater than a threshold, the related parameter being used to represent the degree of distinction, in the dimension of at least one media resource feature including the target media resource feature, between media resources belonging to the target category and media resources not belonging to the target category;
The model training unit 903 is configured to perform training of the media asset processing model by taking the feature value of the target media asset feature as an input of the media asset processing model and the annotation information as an output target of the media asset processing model.
According to the model training apparatus provided by the embodiments of the present disclosure, target media resource features that highlight the distinction between media resources belonging to the target category and media resources not belonging to the target category are obtained from the training sample and used as the input for training the media resource processing model. This reduces the interference of irrelevant features with the training process, promotes effective learning by the model, strengthens the model's ability to distinguish the positive and negative classes, and improves the accuracy of model prediction.
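As a non-limiting illustration, training on only the target media resource features might be sketched as follows. The disclosure does not specify the model type in this passage; the logistic regression trained by stochastic gradient descent, the function names, the hyperparameters, and the feature names in the usage example are all assumptions of this sketch.

```python
import math


def project(sample, target_features):
    """Extract only the feature values of the target media resource features."""
    return [sample[name] for name in target_features]


def train_logistic(samples, labels, target_features, epochs=200, lr=0.1):
    """Fit weights and bias of a logistic model on the target features only."""
    weights = [0.0] * len(target_features)
    bias = 0.0
    for _ in range(epochs):
        for sample, label in zip(samples, labels):
            x = project(sample, target_features)
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(sample, target_features, weights, bias):
    """Return True if the sample is predicted to belong to the target class."""
    x = project(sample, target_features)
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z)) > 0.5


# Hypothetical training samples: "a" separates the classes, "noise" does not.
train_samples = [
    {"a": 0.0, "noise": 0.3}, {"a": 1.0, "noise": 0.3},
    {"a": 0.0, "noise": 0.9}, {"a": 1.0, "noise": 0.9},
]
train_labels = [0, 1, 0, 1]
w, b = train_logistic(train_samples, train_labels, ["a"])
```

Here the uninformative feature never reaches the model, which is the effect the passage above attributes to restricting the input to target media resource features.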
In some embodiments, the model training apparatus further comprises:
and a second feature determination unit configured to perform determination of a target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature.
In some embodiments, the model training apparatus further comprises:
a third feature value obtaining unit configured to obtain feature values of at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including positive samples belonging to a target category and negative samples not belonging to the target category;
A second parameter determining unit configured to perform determining a related parameter corresponding to at least one media resource feature based on the obtained feature value;
a third feature determination unit configured to perform determining a target media resource feature whose related parameter is greater than a threshold value from among the plurality of media resource features;
and the first relation storage unit is configured to store the corresponding relation between the target category and the target media resource characteristic.
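As a non-limiting illustration, storing the correspondence between a target category and its target media resource features might be sketched as follows. The in-memory dictionary, the JSON serialization, and the category and feature names are assumptions of this sketch; the disclosure does not prescribe a storage format here.

```python
import json


def store_correspondence(store, target_category, target_features):
    """Record the features for a category and return a serialized copy."""
    store[target_category] = list(target_features)
    return json.dumps(store, sort_keys=True)


def load_correspondence(serialized):
    """Restore the category-to-features mapping from its serialized form."""
    return json.loads(serialized)


# Hypothetical category and feature names.
store = {}
serialized = store_correspondence(store, "cooking", ["like_rate", "comment_count"])
restored = load_correspondence(serialized)
```

Persisting the mapping once means the related parameters need not be recomputed each time a model for the same target category is trained.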
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
a second parameter determination unit including:
a second threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a second ratio determining subunit configured to perform determining a false positive class rate and a true class rate corresponding to each of the first thresholds;
a third parameter determination subunit configured to perform determining a related parameter corresponding to the first media resource feature based on the false positive class rate and the true class rate corresponding to each first threshold;
the false positive class rate corresponding to any first threshold value is the ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, and the first target negative samples refer to negative samples with the first characteristic value larger than the first threshold value;
The true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
In some embodiments, the third parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the related parameter corresponding to the first media resource feature.
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features;
a second parameter determination unit configured to perform:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring one second threshold from the plurality of second thresholds corresponding to each second media resource feature to obtain a threshold group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
based on the false positive class rate and the true class rate corresponding to each threshold value group, determining relevant parameters corresponding to a plurality of second media resource characteristics;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, and the second target negative samples are negative samples with each second characteristic value larger than the corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, and the second target positive samples are positive samples with each second characteristic value being greater than the corresponding second threshold.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
a second parameter determination unit including:
a second inter-group rank determining subunit configured to perform ranking the plurality of media resource samples in ascending order of first feature value to obtain inter-group ranks of the plurality of media resource samples, where the inter-group rank of a media resource sample is the rank position of that sample's first feature value among the plurality of first feature values;
A third number acquisition subunit configured to perform acquiring a first number of a plurality of pairs of samples, one pair of samples consisting of one positive sample and one negative sample of the plurality of media asset samples;
a fourth number determination subunit configured to perform determining a second number of target sample pairs in which the inter-group rank of the positive samples is greater than the inter-group rank of the negative samples based on the inter-group ranks of the plurality of media resource samples;
and a fourth parameter determination subunit configured to perform determining a ratio of the second number to the first number as a related parameter corresponding to the first media resource feature.
In some embodiments, the fourth number determination subunit is configured to perform:
determining a sum value of an inter-group rank of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
ranking the at least one positive sample in ascending order of first feature value to obtain an intra-group rank of the at least one positive sample, where the intra-group rank of a positive sample is the rank position of that positive sample's first feature value among the first feature values of the at least one positive sample;
determining the sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
With respect to the model training apparatus in the above-described embodiment, the specific manner in which the respective units perform the operations has been described in detail in the embodiment regarding the model training method, and is not described in detail herein.
Fig. 10 is a block diagram illustrating a media asset processing device according to an exemplary embodiment. Referring to fig. 10, the media resource processing device includes:
a fourth feature value obtaining unit 1001 configured to obtain a feature value of a target media resource feature from feature values of a plurality of media resource features corresponding to the media resource;
the media resource processing unit 1002 is configured to perform inputting a feature value of a target media resource feature into a media resource processing model, so as to obtain a prediction result output by the media resource processing model, where the media resource processing model is obtained by training based on a training sample and label information corresponding to the training sample, and the prediction result is used to indicate whether the media resource belongs to a target class;
wherein the target media resource feature is a media resource feature whose related parameter is greater than a threshold, the related parameter being used to represent the degree of distinction, in the dimension of at least one media resource feature including the target media resource feature, between media resources belonging to the target category and media resources not belonging to the target category.
According to the media resource processing device provided by the embodiment of the disclosure, the target media resource characteristics which can highlight the distinction between the media resources belonging to the target category and the media resources not belonging to the target category are obtained from the characteristic values of the plurality of media resource characteristics corresponding to the media resources, and the media resource processing model predicts based on the target media resource characteristics, so that the interference of irrelevant characteristics to the prediction process is reduced, and the accuracy of model prediction is improved.
In some embodiments, the media asset processing device further comprises:
a category acquisition unit configured to perform acquisition of a target category corresponding to the media resource processing model;
and a fourth feature determination unit configured to perform determination of a target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature.
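As a non-limiting illustration, looking up the target media resource features for a model's target category and predicting on them might be sketched as follows. The function name, the representation of the model as a plain callable, and the toy model, category, and feature names in the usage example are assumptions of this sketch.

```python
def predict_for_media_resource(feature_values, target_category,
                               correspondence, model):
    """Predict whether a media resource belongs to the target category.

    feature_values: mapping of feature name to value for one media resource.
    correspondence: stored mapping of target category to feature names.
    model: any callable taking a list of feature values (hypothetical).
    """
    target_features = correspondence[target_category]
    # Only the target feature values are passed to the model.
    inputs = [feature_values[name] for name in target_features]
    return model(inputs)


# Hypothetical stored correspondence and toy model for demonstration only.
correspondence = {"cooking": ["like_rate", "comment_count"]}
toy_model = lambda inputs: sum(inputs) > 1.0
resource = {"duration": 42.0, "like_rate": 0.9, "comment_count": 0.4}
result = predict_for_media_resource(resource, "cooking", correspondence, toy_model)
```

In this sketch the `duration` value is present in the media resource's feature values but is never seen by the model, because it is not among the stored target features.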
In some embodiments, the media asset processing device further comprises:
a fifth feature value obtaining unit configured to obtain feature values of at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including positive samples belonging to a target category and negative samples not belonging to the target category;
A third parameter determining unit configured to perform determining a related parameter corresponding to at least one media resource feature based on the obtained feature value;
a fifth feature determination unit configured to perform determining a target media resource feature whose related parameter is greater than a threshold value from among the plurality of media resource features;
and the second relation storage unit is configured to store the corresponding relation between the target category and the target media resource characteristic.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
a third parameter determination unit including:
a third threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a third ratio determining subunit configured to perform determining a false positive class rate and a true class rate corresponding to each of the first thresholds;
a fifth parameter determining subunit configured to perform determining, based on the false positive class rate and the true class rate corresponding to each of the first thresholds, a related parameter corresponding to the first media resource feature;
the false positive class rate corresponding to any first threshold value is the ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, and the first target negative samples refer to negative samples with the first characteristic value larger than the first threshold value;
The true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
In some embodiments, the fifth parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the related parameter corresponding to the first media resource feature.
In some embodiments, the at least one media asset feature comprises a plurality of second media asset features;
a third parameter determination unit configured to perform:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring one second threshold from the plurality of second thresholds corresponding to each second media resource feature to obtain a threshold group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
based on the false positive class rate and the true class rate corresponding to each threshold value group, determining relevant parameters corresponding to a plurality of second media resource characteristics;
the false positive class rate corresponding to any threshold value group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, and the second target negative samples are negative samples with each second characteristic value larger than the corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, and the second target positive samples are positive samples with each second characteristic value being greater than the corresponding second threshold.
In some embodiments, the at least one media asset characteristic comprises a first media asset characteristic;
a third parameter determination unit including:
a third inter-group rank determining subunit configured to perform ranking the plurality of media resource samples in ascending order of first feature value to obtain inter-group ranks of the plurality of media resource samples, where the inter-group rank of a media resource sample is the rank position of that sample's first feature value among the plurality of first feature values;
A fifth number acquisition subunit configured to perform acquiring a first number of a plurality of pairs of samples, one pair of samples consisting of one positive sample and one negative sample of the plurality of media asset samples;
a sixth number determination subunit configured to perform determining a second number of target sample pairs in which the inter-group rank of the positive samples is greater than the inter-group rank of the negative samples based on the inter-group ranks of the plurality of media resource samples;
a sixth parameter determination subunit configured to perform determining a ratio of the second number to the first number as the relevant parameter corresponding to the first media resource feature.
In some embodiments, the sixth number determination subunit is configured to perform:
determining a sum value of an inter-group rank of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
ranking the at least one positive sample in ascending order of first feature value to obtain an intra-group rank of the at least one positive sample, where the intra-group rank of a positive sample is the rank position of that positive sample's first feature value among the first feature values of the at least one positive sample;
determining the sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
With respect to the media resource processing device in the above embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment regarding the media resource processing method, and is not described in detail herein.
Fig. 11 is a block diagram of a computer device 1100 according to an exemplary embodiment. The computer device 1100 may vary widely depending on configuration or performance, and may include one or more processors (Central Processing Units, CPU) 1101 and one or more memories 1102, where the memories 1102 are used to store executable instructions, and the processors 1101 are configured to execute the executable instructions to implement the feature processing method, the model training method, or the media resource processing method provided by the above-described method embodiments. Of course, the computer device may also have a wired or wireless network interface, a keyboard, an input/output interface, and the like for input and output, and the computer device may also include other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer readable storage medium is also provided, such as a memory 1102 including instructions executable by the processor 1101 of the computer device 1100 to perform the feature processing method, model training method, or media resource processing method described above. In some embodiments, the computer readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which comprises a computer program which, when being executed by a processor, implements the feature processing method, the model training method or the media resource processing method in the respective method embodiments described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (46)

1. A feature processing method, characterized in that the feature processing method comprises:
Acquiring characteristic values of at least one media resource characteristic from a plurality of media resource samples, wherein each media resource sample comprises characteristic values of a plurality of media resource characteristics corresponding to one media resource, and the plurality of media resource samples comprise positive samples belonging to a target category and negative samples not belonging to the target category;
based on the obtained characteristic values, determining relevant parameters corresponding to the at least one media resource characteristic, wherein the relevant parameters are used for representing the distinguishing degree of the media resources belonging to the target category and the media resources not belonging to the target category on the dimension of the at least one media resource characteristic;
and determining target media resource characteristics with related parameters larger than a threshold value from the plurality of media resource characteristics, and taking the target media resource characteristics as input characteristics of a media resource processing model, wherein the media resource processing model is used for predicting whether the media resource belongs to the target category.
2. The feature processing method of claim 1, wherein the at least one media asset feature comprises a first media asset feature;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
acquiring a plurality of first thresholds corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
determining false positive class rate and true class rate corresponding to each first threshold;
based on the false positive class rate and the true class rate corresponding to each first threshold, determining relevant parameters corresponding to the first media resource characteristics;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
3. The feature processing method according to claim 2, wherein the determining the relevant parameters corresponding to the first media resource feature based on the false positive class rate and the true class rate corresponding to each first threshold includes:
For each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the operation characteristic curve of the receiver as the relevant parameter corresponding to the first media resource characteristic.
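As an illustration of the curve and area steps, the sketch below plots one point per first threshold and integrates with the trapezoidal rule. Adding the end points (0, 0) and (1, 1) and using trapezoidal integration are assumptions of this example; the claim does not say how the area is computed.

```python
def roc_area(values, labels, thresholds):
    """Area under the curve through the (false positive class rate, true class rate) points."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    # Assumed end points so that the curve spans the whole horizontal axis.
    points = {(0.0, 0.0), (1.0, 1.0)}
    for t in thresholds:
        fpr = sum(v > t for v in neg) / len(neg)  # abscissa
        tpr = sum(v > t for v in pos) / len(pos)  # ordinate
        points.add((fpr, tpr))
    curve = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between neighbouring points
    return area


# Hypothetical data: one negative sample (0.4) outranks one positive sample (0.35).
values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
area = roc_area(values, labels, thresholds=[0.2, 0.375, 0.6])
```

An area near 1 suggests the feature separates the two classes well, while an area near 0.5 suggests it carries little information about the target category.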
4. The feature processing method of claim 1, wherein the at least one media resource feature comprises a plurality of second media resource features;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
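The threshold-group variant can be sketched in the same way. Here a threshold group takes one second threshold per second media resource feature, and a sample counts only when every one of its second feature values exceeds the corresponding threshold. Enumerating all combinations with `itertools.product` is an assumption of this example, as are the names and data; the step that turns the resulting rates into a related parameter is left out because the claim does not detail it.

```python
from itertools import product


def group_rates(samples, labels, threshold_group):
    """Return (false positive class rate, true class rate) for one threshold group.

    samples: one tuple of second feature values per media resource sample.
    threshold_group: one second threshold per second media resource feature.
    """
    def passes(sample):
        # Every second feature value must exceed its corresponding threshold.
        return all(v > t for v, t in zip(sample, threshold_group))

    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    fpr = sum(passes(s) for s in neg) / len(neg)
    tpr = sum(passes(s) for s in pos) / len(pos)
    return fpr, tpr


def all_group_rates(samples, labels, thresholds_per_feature):
    """Map every threshold group (one threshold per feature) to its two rates."""
    return {
        group: group_rates(samples, labels, group)
        for group in product(*thresholds_per_feature)
    }


# Hypothetical samples with two second media resource features each.
samples = [(0.9, 0.8), (0.7, 0.2), (0.6, 0.9), (0.1, 0.1)]
labels = [1, 1, 0, 0]
rates = all_group_rates(samples, labels, [[0.5], [0.5, 0.15]])
```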
5. The feature processing method of claim 1, wherein the at least one media resource feature comprises a first media resource feature;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
sorting the plurality of media resource samples in ascending order of their first feature values to obtain an inter-group rank of each of the plurality of media resource samples, wherein the inter-group rank of a media resource sample is the ordinal position of the first feature value of that media resource sample among the plurality of first feature values;
obtaining a first number of a plurality of sample pairs, each sample pair consisting of one positive sample and one negative sample of the plurality of media resource samples;
determining, based on the inter-group ranks of the plurality of media resource samples, a second number of target sample pairs, wherein a target sample pair is a sample pair in which the inter-group rank of the positive sample is greater than the inter-group rank of the negative sample;
and determining the ratio of the second number to the first number as the relevant parameter corresponding to the first media resource feature.
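A direct, unoptimized sketch of this pair-counting procedure follows. The function name and data are hypothetical, and the example assumes that no two samples share the same first feature value, since the claim does not say how ties are ranked.

```python
def pairwise_related_parameter(values, labels):
    """Fraction of (positive, negative) pairs in which the positive sample ranks higher."""
    # Inter-group rank: 1-based position of each sample after an ascending sort.
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = {idx: r for r, idx in enumerate(order, start=1)}

    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    first_number = len(pos) * len(neg)  # all positive/negative sample pairs
    second_number = sum(rank[p] > rank[n] for p in pos for n in neg)
    return second_number / first_number


# Hypothetical first feature values; labels use 1 = positive, 0 = negative.
values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
related_parameter = pairwise_related_parameter(values, labels)
```

The nested count costs time proportional to the number of pairs, which is what motivates a rank-sum shortcut on larger sample sets.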
6. The feature processing method of claim 5, wherein the determining a second number of target sample pairs based on an inter-group rank of the plurality of media resource samples comprises:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the ordinal position of the first feature value of that positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
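The reasoning behind the rank-sum difference is that a positive sample with inter-group rank r and intra-group rank k has exactly r - k negative samples below it, so summing r - k over all positive samples counts the target sample pairs. The sketch below checks this against a brute-force count; the names and data are hypothetical, and distinct first feature values are assumed.

```python
def second_number_by_rank_sums(values, labels):
    """Count target sample pairs as inter-group rank sum minus intra-group rank sum."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    inter_rank = {idx: r for r, idx in enumerate(order, start=1)}

    pos = [i for i, y in enumerate(labels) if y == 1]
    inter_rank_sum = sum(inter_rank[i] for i in pos)

    # Intra-group rank: position of each positive sample among positives only.
    pos_sorted = sorted(pos, key=lambda i: values[i])
    intra_rank = {idx: r for r, idx in enumerate(pos_sorted, start=1)}
    intra_rank_sum = sum(intra_rank[i] for i in pos)

    return inter_rank_sum - intra_rank_sum


def second_number_brute_force(values, labels):
    """Reference count: pairs where the positive value exceeds the negative value."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return sum(p > n for p in pos for n in neg)


values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
second_number = second_number_by_rank_sums(values, labels)
```

Because the intra-group ranks of m positive samples are always 1 through m, the intra-group rank sum equals m(m + 1)/2, so only one full sort is strictly needed.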
7. A model training method, characterized in that the model training method comprises:
acquiring a training sample and labeling information corresponding to the training sample, wherein the training sample comprises feature values of a plurality of media resource features corresponding to a media resource, and the labeling information is used for indicating whether the training sample belongs to a target category;
acquiring a feature value of a target media resource feature from the training sample, wherein the target media resource feature is a media resource feature whose related parameter is greater than a threshold, and the related parameter represents the degree of distinction between media resources belonging to the target category and media resources not belonging to the target category in the dimension of at least one media resource feature that includes the target media resource feature;
and training a media resource processing model by taking the feature value of the target media resource feature as the input of the media resource processing model and the labeling information as the output target of the media resource processing model.
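The training step can be illustrated with a small stand-in model. The claim does not specify the model type, so the logistic regression trained by batch gradient descent below is purely an assumption, as are the feature names, the sample values, and the hyperparameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def project(sample, target_features):
    """Keep only the feature values of the target media resource features."""
    return [sample[name] for name in target_features]


def train_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n, dim = len(rows), len(rows[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(model, x):
    """Return True when the model assigns the input to the target category."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x))) > 0.5


# Hypothetical training samples; only "like_ratio" is assumed to have passed selection.
samples = [
    {"like_ratio": 0.9, "duration_s": 30.0},
    {"like_ratio": 0.8, "duration_s": 95.0},
    {"like_ratio": 0.2, "duration_s": 40.0},
    {"like_ratio": 0.1, "duration_s": 80.0},
]
labels = [1, 1, 0, 0]  # labeling information: 1 = belongs to the target category
target_features = ["like_ratio"]

rows = [project(s, target_features) for s in samples]
model = train_logistic(rows, labels)
```

Restricting the input to the selected features keeps the model small, at the cost of discarding whatever signal the dropped features carried in combination.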
8. The model training method of claim 7, wherein before acquiring the feature value of the target media resource feature from the training sample, the model training method further comprises:
and determining the target media resource characteristics corresponding to the target category based on the stored correspondence between the target category and the target media resource characteristics.
9. The model training method of claim 8, wherein before determining the target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature, the model training method further comprises:
acquiring characteristic values of at least one media resource characteristic from a plurality of media resource samples, wherein each media resource sample comprises characteristic values of a plurality of media resource characteristics corresponding to one media resource, and the plurality of media resource samples comprise positive samples belonging to a target category and negative samples not belonging to the target category;
based on the obtained characteristic value, determining a relevant parameter corresponding to the at least one media resource characteristic;
determining the target media resource characteristics with related parameters greater than a threshold value from the plurality of media resource characteristics;
and storing the corresponding relation between the target category and the target media resource characteristic.
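A minimal sketch of the selection and storage steps follows, using an in-memory dictionary as the store. The store layout, category name, feature names, and parameter values are all hypothetical.

```python
def store_target_features(correspondence, target_category, related_params, threshold):
    """Select features whose related parameter exceeds the threshold and store them."""
    selected = [name for name, value in related_params.items() if value > threshold]
    correspondence[target_category] = selected
    return selected


# Hypothetical related parameters, e.g. areas under the curve, one per feature.
correspondence = {}
related_params = {"like_ratio": 0.91, "duration_s": 0.52, "completion_rate": 0.78}
store_target_features(correspondence, "high_quality", related_params, threshold=0.7)
```

A later lookup of `correspondence["high_quality"]` then returns the target media resource features without recomputing the related parameters.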
10. The model training method of claim 9, wherein the at least one media resource feature comprises a first media resource feature;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
acquiring a plurality of first thresholds corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
determining false positive class rate and true class rate corresponding to each first threshold;
based on the false positive class rate and the true class rate corresponding to each first threshold, determining relevant parameters corresponding to the first media resource characteristics;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
11. The model training method of claim 10, wherein the determining the relevant parameter corresponding to the first media resource feature based on the false positive class rate and the true class rate corresponding to each first threshold comprises:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
12. The model training method of claim 9, wherein the at least one media resource feature comprises a plurality of second media resource features;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
13. The model training method of claim 9, wherein the at least one media resource feature comprises a first media resource feature;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
sorting the plurality of media resource samples in ascending order of their first feature values to obtain an inter-group rank of each of the plurality of media resource samples, wherein the inter-group rank of a media resource sample is the ordinal position of the first feature value of that media resource sample among the plurality of first feature values;
obtaining a first number of a plurality of sample pairs, each sample pair consisting of one positive sample and one negative sample of the plurality of media resource samples;
determining a second number of target sample pairs in which the inter-group rank of positive samples is greater than the inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
and determining the ratio of the second number to the first number as the relevant parameter corresponding to the first media resource feature.
14. The model training method of claim 13, wherein the determining a second number of target sample pairs based on the inter-group ranks of the plurality of media resource samples comprises:
determining a sum of the inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the ordinal position of the first feature value of that positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
15. A media resource processing method, the media resource processing method comprising:
acquiring characteristic values of target media resource characteristics from the characteristic values of a plurality of media resource characteristics corresponding to the media resources;
inputting the feature value of the target media resource feature into a media resource processing model to obtain a prediction result output by the media resource processing model, wherein the media resource processing model is obtained by training based on a training sample and labeling information corresponding to the training sample, and the prediction result is used for indicating whether the media resource belongs to a target category;
the target media resource feature is a media resource feature whose related parameter is greater than a threshold, and the related parameter represents the degree of distinction between media resources belonging to the target category and media resources not belonging to the target category in the dimension of at least one media resource feature that includes the target media resource feature.
16. The media resource processing method according to claim 15, wherein before acquiring the feature value of the target media resource feature from the feature values of the plurality of media resource features corresponding to the media resource, the media resource processing method further comprises:
acquiring the target category corresponding to the media resource processing model;
and determining the target media resource characteristics corresponding to the target category based on the stored correspondence between the target category and the target media resource characteristics.
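At prediction time the lookup and the model call can be chained as sketched below. The stored correspondence, the feature names, and especially the stand-in model (a simple average compared against 0.5) are hypothetical placeholders for a trained media resource processing model.

```python
def predict_for_category(media_resource, target_category, correspondence, model):
    """Look up the target features for the category and feed their values to the model."""
    target_features = correspondence[target_category]
    inputs = [media_resource[name] for name in target_features]
    return model(inputs)


def stand_in_model(inputs):
    """Placeholder for a trained model: True means 'belongs to the target category'."""
    return sum(inputs) / len(inputs) > 0.5


# Hypothetical stored correspondence and one media resource with three feature values.
correspondence = {"high_quality": ["like_ratio", "completion_rate"]}
resource = {"like_ratio": 0.8, "completion_rate": 0.7, "duration_s": 42.0}
result = predict_for_category(resource, "high_quality", correspondence, stand_in_model)
```

Note that `duration_s` never reaches the model, because it is not among the target media resource features stored for the category.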
17. The media resource processing method according to claim 16, wherein before determining the target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature, the media resource processing method further comprises:
acquiring feature values of at least one media resource feature from a plurality of media resource samples, wherein each media resource sample comprises feature values of a plurality of media resource features corresponding to one media resource, and the plurality of media resource samples comprise positive samples belonging to the target category and negative samples not belonging to the target category;
based on the obtained characteristic value, determining a relevant parameter corresponding to the at least one media resource characteristic;
determining the target media resource characteristics with related parameters greater than a threshold value from the plurality of media resource characteristics;
and storing the corresponding relation between the target category and the target media resource characteristic.
18. The media resource processing method according to claim 17, wherein the at least one media resource feature comprises a first media resource feature;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
acquiring a plurality of first thresholds corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
determining false positive class rate and true class rate corresponding to each first threshold;
based on the false positive class rate and the true class rate corresponding to each first threshold, determining relevant parameters corresponding to the first media resource characteristics;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
19. The media resource processing method according to claim 18, wherein the determining the relevant parameter corresponding to the first media resource feature based on the false positive class rate and the true class rate corresponding to each first threshold comprises:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
20. The media resource processing method of claim 17, wherein the at least one media resource feature comprises a plurality of second media resource features;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
21. The media resource processing method according to claim 17, wherein the at least one media resource feature comprises a first media resource feature;
the determining, based on the obtained feature values, a relevant parameter corresponding to the at least one media resource feature includes:
sorting the plurality of media resource samples in ascending order of their first feature values to obtain an inter-group rank of each of the plurality of media resource samples, wherein the inter-group rank of a media resource sample is the ordinal position of the first feature value of that media resource sample among the plurality of first feature values;
obtaining a first number of a plurality of sample pairs, each sample pair consisting of one positive sample and one negative sample of the plurality of media resource samples;
determining a second number of target sample pairs in which the inter-group rank of positive samples is greater than the inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
and determining the ratio of the second number to the first number as the relevant parameter corresponding to the first media resource feature.
22. The media resource processing method of claim 21, wherein the determining a second number of target sample pairs based on the inter-group ranks of the plurality of media resource samples comprises:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the ordinal position of the first feature value of that positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
23. A feature processing apparatus, characterized in that the feature processing apparatus comprises:
a first feature value obtaining unit configured to obtain feature values of at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including a positive sample belonging to a target category and a negative sample not belonging to the target category;
a first parameter determining unit configured to perform determining, based on the obtained feature value, a related parameter corresponding to the at least one media resource feature, the related parameter being used to represent a degree of distinction between media resources belonging to the target category and media resources not belonging to the target category in a dimension of the at least one media resource feature;
a first feature determining unit configured to perform determining a target media resource feature, from the plurality of media resource features, for which a related parameter is larger than a threshold, and using the target media resource feature as an input feature of a media resource processing model for predicting whether a media resource belongs to the target category.
24. The feature processing apparatus of claim 23, wherein the at least one media resource feature comprises a first media resource feature;
the first parameter determination unit includes:
a first threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a first ratio determining subunit configured to perform determining a false positive class rate and a true class rate corresponding to each first threshold;
a first parameter determining subunit configured to perform determining, based on the false positive class rate and the true class rate corresponding to each first threshold, a related parameter corresponding to the first media resource feature;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first characteristic value larger than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
25. The feature processing apparatus of claim 24, wherein the first parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the relevant parameter corresponding to the first media resource feature.
26. The feature processing apparatus of claim 23, wherein the at least one media resource feature comprises a plurality of second media resource features;
the first parameter determination unit is configured to perform:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring a second threshold value from a plurality of second threshold values corresponding to each second media resource characteristic to obtain a threshold value group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold group is the ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples with each second characteristic value being larger than a corresponding second threshold value;
the true class rate corresponding to any one of the threshold groups is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples with each second characteristic value being greater than a corresponding second threshold.
27. The feature processing apparatus of claim 23, wherein the at least one media resource feature comprises a first media resource feature;
the first parameter determination unit includes:
a first inter-group rank determining subunit configured to perform sorting the plurality of media resource samples in ascending order of their first feature values to obtain an inter-group rank of each of the plurality of media resource samples, wherein the inter-group rank of a media resource sample is the ordinal position of the first feature value of that media resource sample among the plurality of first feature values;
a first number acquisition subunit configured to perform acquiring a first number of a plurality of sample pairs, each sample pair consisting of one positive sample and one negative sample of the plurality of media resource samples;
a second number determination subunit configured to perform determining a second number of target sample pairs in which an inter-group rank of positive samples is greater than an inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
a second parameter determination subunit configured to perform determining a ratio of the second number to the first number as a related parameter corresponding to the first media resource feature.
28. The feature processing apparatus of claim 27, wherein the second number determination subunit is configured to perform:
determining a sum of the inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of a positive sample is the ordinal position of the first feature value of that positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
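The rank-sum shortcut in this claim works because a positive sample with inter-group rank r and intra-group rank k has exactly r - k negative samples ranked below it. A hedged sketch, assuming distinct first feature values and a 0/1 label encoding (the name `target_pair_count` is hypothetical):

```python
def target_pair_count(values, labels):
    """Second number: pairs in which the positive sample outranks the negative.

    values: first feature values (assumed distinct); labels: 1 positive, 0 negative.
    """
    # Sort all samples ascending; the 1-based position is the inter-group rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    inter_rank = {index: pos for pos, index in enumerate(order, start=1)}

    # Positive samples in ascending order of feature value; their 1-based
    # positions within this list are the intra-group ranks.
    positives = [i for i in order if labels[i] == 1]

    inter_group_rank_sum = sum(inter_rank[i] for i in positives)
    intra_group_rank_sum = sum(range(1, len(positives) + 1))

    # Each positive contributes (inter rank - intra rank) = number of
    # negatives ranked below it, so the difference is the pair count.
    return inter_group_rank_sum - intra_group_rank_sum
```

Dividing this count by the first number (positives times negatives) gives the same ratio as enumerating every pair, but only one sort is needed.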
29. A model training apparatus, characterized in that the model training apparatus comprises:
a sample acquisition unit configured to perform acquiring a training sample and marking information corresponding to the training sample, wherein the training sample comprises feature values of a plurality of media resource features corresponding to a media resource, and the marking information is used for indicating whether the training sample belongs to a target category;
a second feature value obtaining unit configured to perform obtaining a feature value of a target media resource feature from the training sample, the target media resource feature being a media resource feature whose related parameter is greater than a threshold, the related parameter representing the degree of distinction, in the dimension of at least one media resource feature including the target media resource feature, between media resources belonging to the target category and media resources not belonging to the target category;
and a model training unit configured to perform training of a media resource processing model by taking the feature value of the target media resource feature as the input of the media resource processing model and the marking information as the output target of the media resource processing model.
30. The model training apparatus of claim 29 wherein said model training apparatus further comprises:
and a second feature determining unit configured to perform determination of the target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature.
31. The model training apparatus of claim 30 wherein said model training apparatus further comprises:
a third feature value obtaining unit configured to obtain feature values of the at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including a positive sample belonging to a target category and a negative sample not belonging to the target category;
a second parameter determining unit configured to perform determining a related parameter corresponding to the at least one media resource feature based on the obtained feature value;
a third feature determination unit configured to perform determining, from among the plurality of media resource features, the target media resource feature whose related parameter is greater than a threshold;
and a first relation storage unit configured to perform storing the correspondence between the target category and the target media resource feature.
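The selection and storage steps of this claim can be illustrated with a short sketch. It assumes the related parameters have already been computed per feature; the feature names, the category name, and the in-memory dictionary used as the store are all hypothetical choices for the example.

```python
def select_target_features(related_parameters, threshold):
    """Keep the media resource features whose related parameter exceeds the threshold.

    related_parameters: mapping from feature name to its related parameter.
    """
    return [name for name, value in related_parameters.items() if value > threshold]


# Hypothetical in-memory store of the category-to-feature correspondence.
correspondence = {}


def store_correspondence(target_category, target_features):
    """Record which target features were selected for a target category."""
    correspondence[target_category] = list(target_features)


def lookup_target_features(target_category):
    """Retrieve the stored target features for a category at training time."""
    return correspondence[target_category]
```

Storing the correspondence once lets later training or prediction runs for the same target category reuse the selected features without recomputing the related parameters.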
32. The model training apparatus of claim 31 wherein said at least one media asset characteristic comprises a first media asset characteristic;
the second parameter determination unit includes:
a second threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a second ratio determining subunit configured to perform determining a false positive class rate and a true class rate corresponding to each of the first thresholds;
a third parameter determining subunit configured to perform determining, based on the false positive class rate and the true class rate corresponding to each first threshold, a related parameter corresponding to the first media resource feature;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first feature value greater than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
33. The model training apparatus of claim 32 wherein the third parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the related parameter corresponding to the first media resource feature.
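The threshold sweep described in claims 32 and 33 can be sketched as below. Several details are assumptions made for the example rather than requirements of the claims: thresholds are spaced evenly between the minimum and maximum first feature value, the curve is closed with the endpoints (0, 0) and (1, 1), and the area is computed with the trapezoidal rule. The function name `roc_area_from_thresholds` is hypothetical.

```python
def roc_area_from_thresholds(values, labels, num_thresholds=101):
    """Area under an ROC curve built from evenly spaced first thresholds.

    values: first feature values; labels: 1 positive, 0 negative (assumed encoding).
    """
    lowest, highest = min(values), max(values)
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos

    points = set()
    for k in range(num_thresholds):
        threshold = lowest + (highest - lowest) * k / (num_thresholds - 1)
        # Samples whose first feature value is greater than the threshold.
        target_pos = sum(1 for v, y in zip(values, labels) if y == 1 and v > threshold)
        target_neg = sum(1 for v, y in zip(values, labels) if y == 0 and v > threshold)
        # Abscissa: false positive class rate; ordinate: true class rate.
        points.add((target_neg / num_neg, target_pos / num_pos))

    # Assumption: close the curve at both corners so the area is well defined.
    points.update({(0.0, 0.0), (1.0, 1.0)})
    curve = sorted(points)

    # Trapezoidal rule over consecutive points of the curve.
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    )
```

With enough thresholds to separate every pair of adjacent feature values, this area agrees with the pair-counting ratio used in the rank-based variant of the related parameter.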
34. The model training apparatus of claim 31 wherein the at least one media asset characteristic comprises a plurality of second media asset characteristics;
the second parameter determination unit is configured to perform:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring one second threshold from the plurality of second thresholds corresponding to each second media resource feature to obtain a threshold group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold group is a ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples in which each second feature value is greater than the corresponding second threshold;
the true class rate corresponding to any threshold group is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples in which each second feature value is greater than the corresponding second threshold.
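For the multi-feature case, the threshold groups and their rates can be sketched as follows. The sketch assumes each threshold group takes exactly one threshold per second media resource feature and enumerates every combination; the claim does not state how the per-group rates are combined into the related parameter, so the example stops at the rates. The name `threshold_group_rates` and the 0/1 label encoding are hypothetical.

```python
from itertools import product


def threshold_group_rates(samples, labels, thresholds_per_feature):
    """Map each threshold group to (false positive class rate, true class rate).

    samples: one tuple of second feature values per media resource sample.
    labels: 1 positive, 0 negative (assumed encoding).
    thresholds_per_feature: one list of second thresholds per feature.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos

    rates = {}
    # A threshold group picks one second threshold for every feature.
    for group in product(*thresholds_per_feature):
        # A sample is a target sample only if every feature value exceeds
        # the threshold chosen for that feature.
        above = [all(v > t for v, t in zip(sample, group)) for sample in samples]
        target_pos = sum(1 for a, y in zip(above, labels) if a and y == 1)
        target_neg = sum(1 for a, y in zip(above, labels) if a and y == 0)
        rates[group] = (target_neg / num_neg, target_pos / num_pos)
    return rates
```

Note that the number of threshold groups grows multiplicatively with the number of features, so in practice the number of thresholds per feature would need to stay small.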
35. The model training apparatus of claim 31 wherein said at least one media asset characteristic comprises a first media asset characteristic;
the second parameter determination unit includes:
a second inter-group rank determining subunit configured to perform ranking of the plurality of media resource samples in ascending order of the first feature values, to obtain an inter-group rank of the plurality of media resource samples, where the inter-group rank of one media resource sample refers to the ranking number of the first feature value of the media resource sample among the plurality of first feature values;
a third number acquisition subunit configured to perform acquisition of a first number of a plurality of pairs of samples, one pair of samples consisting of one positive sample and one negative sample of the plurality of media asset samples;
a fourth number determination subunit configured to perform determining a second number of target sample pairs in which an inter-group rank of positive samples is greater than an inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
a fourth parameter determination subunit configured to perform determining a ratio of the second number to the first number as a related parameter corresponding to the first media resource feature.
36. The model training apparatus of claim 35 wherein the fourth number determination subunit is configured to perform:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of one positive sample refers to the ranking number of the first feature value of the positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
37. A media asset processing device, the media asset processing device comprising:
a fourth feature value acquisition unit configured to perform acquisition of a feature value of a target media resource feature from feature values of a plurality of media resource features corresponding to the media resource;
a media resource processing unit configured to perform inputting the feature value of the target media resource feature into a media resource processing model to obtain a prediction result output by the media resource processing model, wherein the media resource processing model is obtained by training based on a training sample and marking information corresponding to the training sample, and the prediction result is used for indicating whether the media resource belongs to a target category;
wherein the target media resource feature is a media resource feature whose related parameter is greater than a threshold, and the related parameter represents the degree of distinction, in the dimension of at least one media resource feature including the target media resource feature, between media resources belonging to the target category and media resources not belonging to the target category.
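The prediction flow of this claim amounts to projecting a media resource's full feature set onto the target features before calling the trained model. A minimal sketch, in which the model is any callable supplied by the caller and the feature names are hypothetical:

```python
def predict_target_category(feature_values, target_features, model):
    """Feed only the target feature values of one media resource to the model.

    feature_values: mapping from feature name to value for one media resource.
    target_features: names of features whose related parameter exceeded the threshold.
    model: a trained media resource processing model, assumed here to be a
           callable taking a list of feature values and returning a prediction.
    """
    # Preserve the order of target_features so inputs match the training layout.
    model_input = [feature_values[name] for name in target_features]
    return model(model_input)
```

For instance, with a stand-in model such as `lambda x: sum(x) > 1.0`, the call uses only the selected features and ignores every other entry in `feature_values`.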
38. The media asset processing device of claim 37, further comprising:
a category acquisition unit configured to perform acquisition of the target category corresponding to the media resource processing model;
and a fourth feature determining unit configured to perform determination of the target media resource feature corresponding to the target category based on the stored correspondence between the target category and the target media resource feature.
39. The media asset processing device of claim 38, further comprising:
a fifth feature value obtaining unit configured to obtain feature values of the at least one media resource feature from a plurality of media resource samples, each media resource sample including feature values of a plurality of media resource features corresponding to one media resource, the plurality of media resource samples including a positive sample belonging to a target category and a negative sample not belonging to the target category;
a third parameter determining unit configured to perform determining a related parameter corresponding to the at least one media resource feature based on the obtained feature value;
a fifth feature determination unit configured to perform determining, from among the plurality of media resource features, the target media resource feature whose related parameter is greater than a threshold;
and a second relation storage unit configured to perform storing the correspondence between the target category and the target media resource feature.
40. The media asset processing device of claim 39, wherein the at least one media asset characteristic comprises a first media asset characteristic;
the third parameter determination unit includes:
a third threshold value obtaining subunit configured to obtain a plurality of first threshold values corresponding to the first media resource feature from between a maximum value and a minimum value of a plurality of first feature values corresponding to the first media resource feature;
a third ratio determining subunit configured to perform determining a false positive class rate and a true class rate corresponding to each of the first thresholds;
a fifth parameter determining subunit configured to perform determining, based on the false positive class rate and the true class rate corresponding to each first threshold, a related parameter corresponding to the first media resource feature;
the false positive class rate corresponding to any first threshold is a ratio of the number of first target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the first target negative samples refer to the negative samples with a first feature value greater than the first threshold;
the true class rate corresponding to any first threshold is a ratio of the number of first target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, where the first target positive samples are positive samples with a first feature value greater than the first threshold.
41. The media resource processing device of claim 40, wherein the fifth parameter determination subunit is configured to perform:
for each first threshold value of the plurality of first threshold values, determining a point corresponding to the first threshold value based on the false positive class rate and the true class rate corresponding to the first threshold value, wherein the abscissa of the point corresponding to the first threshold value is the false positive class rate corresponding to the first threshold value, and the ordinate of the point corresponding to the first threshold value is the true class rate corresponding to the first threshold value;
determining a receiver operating characteristic curve based on a plurality of points corresponding to the plurality of first thresholds;
and determining the area under the receiver operating characteristic curve as the related parameter corresponding to the first media resource feature.
42. The media asset processing device of claim 39, wherein the at least one media asset characteristic comprises a plurality of second media asset characteristics;
the third parameter determination unit is configured to perform:
for each second media resource feature of the plurality of second media resource features, acquiring a plurality of second thresholds corresponding to the second media resource feature from between a maximum value and a minimum value of a plurality of second feature values corresponding to the second media resource feature;
acquiring one second threshold from the plurality of second thresholds corresponding to each second media resource feature to obtain a threshold group;
for a plurality of obtained threshold groups, determining false positive class rate and true class rate corresponding to each threshold group;
determining relevant parameters corresponding to the plurality of second media resource features based on the false positive class rate and the true class rate corresponding to each threshold group;
the false positive class rate corresponding to any threshold group is a ratio of the number of second target negative samples in the plurality of media resource samples to the total number of negative samples in the plurality of media resource samples, wherein the second target negative samples are negative samples in which each second feature value is greater than the corresponding second threshold;
the true class rate corresponding to any threshold group is a ratio of the number of second target positive samples in the plurality of media resource samples to the total number of positive samples in the plurality of media resource samples, wherein the second target positive samples are positive samples in which each second feature value is greater than the corresponding second threshold.
43. The media asset processing device of claim 39, wherein the at least one media asset characteristic comprises a first media asset characteristic;
the third parameter determination unit includes:
a third inter-group rank determining subunit configured to perform ranking of the plurality of media resource samples in ascending order of the first feature values, to obtain an inter-group rank of the plurality of media resource samples, where the inter-group rank of one media resource sample refers to the ranking number of the first feature value of the media resource sample among the plurality of first feature values;
a fifth number acquisition subunit configured to perform acquiring a first number of a plurality of pairs of samples, one pair of samples consisting of one positive sample and one negative sample of the plurality of media asset samples;
a sixth number determination subunit configured to perform determining a second number of target sample pairs in which an inter-group rank of positive samples is greater than an inter-group rank of negative samples based on the inter-group ranks of the plurality of media resource samples;
a sixth parameter determination subunit configured to perform determining a ratio of the second number to the first number as the related parameter corresponding to the first media resource feature.
44. The media resource processing device of claim 43, wherein the sixth number determination subunit is configured to perform:
determining a sum of inter-group ranks of at least one positive sample of the plurality of media resource samples as an inter-group rank sum of the at least one positive sample;
sorting the at least one positive sample in ascending order of the first feature values to obtain an intra-group rank of the at least one positive sample, wherein the intra-group rank of one positive sample refers to the ranking number of the first feature value of the positive sample among the first feature values of the at least one positive sample;
determining a sum of the intra-group ranks of the at least one positive sample as an intra-group rank sum of the at least one positive sample;
and determining the difference between the inter-group rank sum and the intra-group rank sum as the second number.
45. A computer device, the computer device comprising:
one or more processors;
a memory for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to execute the instructions to implement the feature processing method of any of claims 1-6, the model training method of any of claims 7-14, or the media resource processing method of any of claims 15-22.
46. A computer readable storage medium, wherein instructions in the computer readable storage medium, when executed by a processor of a computer device, enable the computer device to perform the feature processing method of any one of claims 1-6, the model training method of any one of claims 7-14, or the media asset processing method of any one of claims 15-22.
CN202110917334.5A 2021-08-11 2021-08-11 Feature processing method, model training method and media resource processing method Active CN113672783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110917334.5A CN113672783B (en) 2021-08-11 2021-08-11 Feature processing method, model training method and media resource processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110917334.5A CN113672783B (en) 2021-08-11 2021-08-11 Feature processing method, model training method and media resource processing method

Publications (2)

Publication Number Publication Date
CN113672783A CN113672783A (en) 2021-11-19
CN113672783B true CN113672783B (en) 2023-07-11

Family

ID=78542228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110917334.5A Active CN113672783B (en) 2021-08-11 2021-08-11 Feature processing method, model training method and media resource processing method

Country Status (1)

Country Link
CN (1) CN113672783B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451894A (en) * 2017-08-03 2017-12-08 北京京东尚科信息技术有限公司 Data processing method, device and computer-readable recording medium
WO2017210949A1 (en) * 2016-06-06 2017-12-14 北京大学深圳研究生院 Cross-media retrieval method
CN110209920A (en) * 2018-05-02 2019-09-06 腾讯科技(深圳)有限公司 Treating method and apparatus, storage medium and the electronic device of media resource
CN110209921A (en) * 2018-05-10 2019-09-06 腾讯科技(深圳)有限公司 The method for pushing and device and storage medium and electronic device of media resource
CN110515904A (en) * 2019-08-13 2019-11-29 北京达佳互联信息技术有限公司 Quality prediction model training method, qualitative forecasting method and the device of media file
CN111444357A (en) * 2020-03-24 2020-07-24 腾讯科技(深圳)有限公司 Content information determination method and device, computer equipment and storage medium
CN111538852A (en) * 2020-04-23 2020-08-14 北京达佳互联信息技术有限公司 Multimedia resource processing method, device, storage medium and equipment
CN111708964A (en) * 2020-05-27 2020-09-25 北京百度网讯科技有限公司 Multimedia resource recommendation method and device, electronic equipment and storage medium
CN111708944A (en) * 2020-06-17 2020-09-25 北京达佳互联信息技术有限公司 Multimedia resource identification method, device, equipment and storage medium
CN112364185A (en) * 2020-11-23 2021-02-12 北京达佳互联信息技术有限公司 Method and device for determining characteristics of multimedia resource, electronic equipment and storage medium
CN112561082A (en) * 2020-12-22 2021-03-26 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model

Also Published As

Publication number Publication date
CN113672783A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
KR102260553B1 (en) Method for recommending related problem based on meta data
US11526799B2 (en) Identification and application of hyperparameters for machine learning
TWI677852B (en) A method and apparatus, electronic equipment, computer readable storage medium for extracting image feature
CN109634698B (en) Menu display method and device, computer equipment and storage medium
CN107784010B (en) Method and equipment for determining popularity information of news theme
CN112000822B (en) Method and device for ordering multimedia resources, electronic equipment and storage medium
CN110909222A (en) User portrait establishing method, device, medium and electronic equipment based on clustering
CN111159563A (en) Method, device and equipment for determining user interest point information and storage medium
CN114330499A (en) Method, device, equipment, storage medium and program product for training classification model
CN113657087A (en) Information matching method and device
CN110895706B (en) Method and device for acquiring target cluster number and computer system
WO2023024408A1 (en) Method for determining feature vector of user, and related device and medium
CN113239697B (en) Entity recognition model training method and device, computer equipment and storage medium
CN113672783B (en) Feature processing method, model training method and media resource processing method
CN112464007A (en) Data analysis method, system and platform based on artificial intelligence and Internet
US11676050B2 (en) Systems and methods for neighbor frequency aggregation of parametric probability distributions with decision trees using leaf nodes
US11210605B1 (en) Dataset suitability check for machine learning
CN112906785A (en) Zero-sample object type identification method, device and equipment based on fusion
CN111860655A (en) User processing method, device and equipment
CN112148865A (en) Information pushing method and device
CN117217852B (en) Behavior recognition-based purchase willingness prediction method and device
CN113326385B (en) Target multimedia resource acquisition method and device, electronic equipment and storage medium
CN115171107A (en) Use case collection method, apparatus, device, medium, and program product
CN116958720A (en) Training method of target detection model, target detection method, device and equipment
CN118212994A (en) Metabonomics data processing method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant