CN113254709B

CN113254709B - Content data processing method and device and storage medium

Info

Publication number: CN113254709B
Application number: CN202110731840.5A
Authority: CN
Inventors: 吴曙楠; 王方舟
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-12-28
Anticipated expiration: 2041-06-30
Also published as: CN113254709A

Abstract

The present disclosure provides a content data processing method and apparatus, and a storage medium, wherein the method includes: acquiring a plurality of content data and determining the content characteristics of each content data; calculating to obtain a content characteristic threshold value based on the content characteristics of the content data; acquiring published content data of each content data, and determining a published content characteristic threshold of each content data; the content data and the published content data are published by the same user account; for any one of the content data, if the content feature of any one of the content data is greater than the published content feature threshold of any one of the content data and greater than the content feature threshold, determining that any one of the content data is the target content data. Thereby, a method of mining content data that exhibits an advantage over other content data and also an advantage over its own history level from among a plurality of content data is realized.

Description

Content data processing method and device and storage medium

Technical Field

The present disclosure relates to a method for processing content data, and more particularly, to a method and an apparatus for processing content data, and a storage medium.

Background

In application scenarios such as recommending content data to users or strengthening a platform ecosystem, it is often necessary to mine highlight content data from a plurality of content data. For example, excellent short videos or articles are mined from a plurality of published videos or articles, or excellent live broadcast segments are mined from the segments of one live broadcast.

The existing content mining method mainly ranks content data by the value of their content features in descending order, for example by number of reads or number of likes, and then either selects the top N content data as highlight content data or selects the content data whose content features are greater than a content feature threshold as highlight content data.

However, the value of the content feature of content data does not completely and accurately reflect whether the content data is excellent. For example, even a poor-quality work published by an author with a large number of fans generally has larger content features than an excellent work published by an author with few fans. Therefore, the conventional mining method cannot accurately mine excellent content data.

Disclosure of Invention

The present disclosure provides a content data processing method and apparatus, and a storage medium, to at least solve the problem that the prior art cannot accurately mine excellent content data. The technical scheme of the disclosure is as follows:

according to a first aspect of the embodiments of the present disclosure, there is provided a method for processing content data, including:

acquiring a plurality of content data and determining the content characteristics of each content data;

calculating to obtain a content characteristic threshold value based on the content characteristics of the content data;

acquiring published content data of each content data, and determining a published content characteristic threshold of each content data; the content data and the published content data are published by the same user account;

for any one of the content data, if the content feature of any one of the content data is greater than the published content feature threshold of any one of the content data and greater than the content feature threshold, determining that any one of the content data is the target content data.

Optionally, in the above method for processing content data, after the obtaining a plurality of content data and determining the content characteristics of each of the content data, the method further includes:

acquiring attribute information of each content data;

performing feature processing on the attribute information of each content data to obtain a feature vector of each content data;

processing the feature vector of each content data by using a clustering algorithm, and dividing the content data into multiple categories;

wherein: for any one of the content data, if the content feature of any one of the content data is greater than the published content feature threshold of any one of the content data and greater than the content feature threshold, determining that any one of the content data is the target content data, including:

for any one content data in the content data of each category, if the content feature of any content data is greater than the published content feature threshold of any content data and greater than the content feature threshold corresponding to the category, determining that any content data is the target content data; and calculating the content characteristic threshold corresponding to each category based on the content characteristics of the content data belonging to the category.
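A minimal Python sketch of the per-category threshold, assuming each content data has already been assigned a category label by some clustering algorithm (the clustering step itself is not shown). The function name, the sample identifiers, the category names, and M = 1 are illustrative assumptions; the "mean plus M standard deviations" form follows the optional threshold calculation described in this disclosure.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def category_thresholds(features, categories, m=1):
    """Return {category: mean + m * std} over the content features in each category."""
    grouped = defaultdict(list)
    for content_id, feature in features.items():
        grouped[categories[content_id]].append(feature)
    return {cat: fmean(vals) + m * pstdev(vals) for cat, vals in grouped.items()}


# Hypothetical data: two categories whose content features differ by an order of magnitude.
features = {"c1": 100, "c2": 140, "c3": 120, "g1": 1000, "g2": 1400, "g3": 1200}
categories = {"c1": "cooking", "c2": "cooking", "c3": "cooking",
              "g1": "gaming", "g2": "gaming", "g3": "gaming"}

thresholds = category_thresholds(features, categories, m=1)

# Content data that exceed the threshold of their own category.
above = [cid for cid, f in features.items() if f > thresholds[categories[cid]]]
print(thresholds, above)
```

With a single threshold over all six items, the low-volume category could never qualify; per-category thresholds let each content data be compared only against content data of the same kind.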

Optionally, in the above method for processing content data, the calculating a content feature threshold based on a content feature of each piece of content data includes:

calculating to obtain the mean value and the standard deviation of the content characteristics of each content datum;

and taking the sum of the average value of the content features of the content data and the standard deviation of the content features of the content data multiplied by M as the content feature threshold, wherein M is a positive integer.

Optionally, in the above method for processing content data, the determining content characteristics of each content data includes:

determining the feature type under the current service scene;

determining the service characteristics of each content data under each characteristic type;

and for each piece of content data, performing weighted calculation on each service characteristic of the content data to obtain the content characteristic of the content data.

Optionally, in the above method for processing content data, before the obtaining published content data of each content data, the method further includes:

acquiring content characteristics of content data issued in each history period in a plurality of history periods; the plurality of history cycles are continuous in the time dimension, and the time length of each history cycle is a preset cycle length;

calculating the mean value of the content features of the content data issued in each history period to obtain a plurality of mean values, and calculating the variance of the content features of the content data issued in each history period to obtain a plurality of variances;

calculating the fluctuation rates of the plurality of mean values, and calculating the fluctuation rates of the plurality of variances;

if the fluctuation rates of the plurality of mean values are not greater than a preset value and the fluctuation rates of the plurality of variances are not greater than a preset value, determining the preset period length as a target period length;

if the fluctuation rates of the plurality of mean values are greater than a preset value and/or the fluctuation rates of the plurality of variances are greater than a preset value, adjusting the preset period length, and returning to execute the acquiring of the content characteristics of the content data issued in each history period of the plurality of history periods;

wherein the acquiring the published content data of each of the content data includes:

and acquiring the published content data published in the target period length before the publishing time of each content data to obtain the published content data of each content data.
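The period-length search above can be sketched in Python as follows. This is only a sketch under stated assumptions: the text here does not define the fluctuation rate or the direction of adjustment, so the sketch assumes the fluctuation rate is the coefficient of variation (standard deviation divided by mean) and that adjusting means lengthening the period by a fixed step. All function names, parameters, and the sample data are hypothetical.

```python
from statistics import fmean, pstdev, pvariance


def fluctuation_rate(values):
    """Assumed definition: coefficient of variation, std / |mean|."""
    mean = fmean(values)
    return pstdev(values) / abs(mean) if mean else float("inf")


def split_into_periods(records, period_length, num_periods, now):
    """Group (publish_time, feature) records into consecutive periods ending at `now`."""
    periods = [[] for _ in range(num_periods)]
    start = now - period_length * num_periods
    for publish_time, feature in records:
        index = (publish_time - start) // period_length
        if 0 <= index < num_periods:
            periods[index].append(feature)
    return periods


def find_target_period_length(records, now, initial_length, num_periods=3,
                              max_rate=0.2, step=1, max_length=30):
    """Return the first period length whose per-period means and variances are stable."""
    length = initial_length
    while length <= max_length:
        periods = split_into_periods(records, length, num_periods, now)
        if all(periods):  # every history period must contain some content data
            means = [fmean(p) for p in periods]
            variances = [pvariance(p) for p in periods]
            if (fluctuation_rate(means) <= max_rate
                    and fluctuation_rate(variances) <= max_rate):
                return length
        length += step  # assumed adjustment: lengthen the period and try again
    return None


# Hypothetical history: content features alternate between low and high days.
records = []
for day in range(12):
    base = 100.0 if day % 2 == 0 else 200.0
    records += [(day, base), (day, base + 10.0)]

target = find_target_period_length(records, now=12, initial_length=1)
print(target)
```

In this sample a one-day period sees alternating means and is rejected, while a two-day period always contains one low day and one high day, so the means and variances are identical across periods and the length is accepted.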

Optionally, in the above method for processing content data, the determining a published content feature threshold of each piece of content data includes:

calculating the average value of the content characteristics of the published content data of each content data to obtain the average value of the content characteristics corresponding to each content data;

calculating the standard deviation of the content characteristics of the published content data of each content data to obtain the standard deviation of the content characteristics corresponding to each content data;

respectively calculating the published content characteristic threshold value of each content data by using the content characteristic mean value and the content characteristic standard deviation corresponding to each content data; the published content feature threshold of one piece of content data is equal to the sum of the content feature mean value corresponding to the content data and N times the content feature standard deviation corresponding to the content data; and N is a positive integer.

Optionally, in the above method for processing content data, if a content feature of any content data is greater than a published content feature threshold of any content data and greater than the content feature threshold, determining that any content data is a target content data, further includes:

and when the quantity of the target content data is not within the preset quantity range, adjusting the multiple N of the content characteristic standard deviation corresponding to each piece of content data, and, according to the adjusted multiple N, returning to execute the calculating of the published content characteristic threshold value of each piece of content data by respectively using the content characteristic mean value and the content characteristic standard deviation corresponding to each piece of content data.

Optionally, in the above method for processing content data, if the number of the target content data is not within a preset number range, adjusting a multiple N of a content characteristic standard deviation corresponding to each content data includes:

if the number of the target content data is larger than the maximum value of the preset number range, reducing the multiple N of the content characteristic standard deviation corresponding to each content data;

and if the number of the target content data is smaller than the minimum value of the preset number range, increasing the multiple N of the content characteristic standard deviation corresponding to each content data.

According to a second aspect of the embodiments of the present disclosure, there is provided a processing apparatus of content data, including:

a first acquisition unit configured to perform acquisition of a plurality of content data and determine a content feature of each of the content data;

a first calculation unit configured to perform calculation of a content feature threshold value based on a content feature of each of the content data;

a first determination unit configured to perform acquisition of published content data of each of the content data and determine a published content feature threshold of each of the content data; the content data and the published content data are published by the same user account;

and the screening unit is configured to execute, for any one of the content data, if the content feature of any one of the content data is greater than the published content feature threshold of any one of the content data and greater than the content feature threshold, determining that any one of the content data is the target content data.

Optionally, the content data processing apparatus further includes:

a second acquisition unit configured to perform acquisition of attribute information of each of the content data;

the characteristic processing unit is configured to perform characteristic processing on the attribute information of each content data to obtain a characteristic vector of each content data;

a classification unit configured to perform processing of a feature vector of each of the content data using a clustering algorithm, the content data being divided into a plurality of categories;

wherein: the screening unit includes:

the screening subunit is configured to execute any content data in the content data of each category, and if the content feature of any content data is greater than the published content feature threshold of any content data and greater than the content feature threshold corresponding to the category, determine that any content data is target content data; and calculating the content characteristic threshold corresponding to each category based on the content characteristics of the content data belonging to the category.

Optionally, in the above apparatus for processing content data, the first calculating unit includes:

a second calculation unit configured to perform calculation to obtain a mean value and a standard deviation of content features of the respective content data;

a third calculation unit configured to perform, as the content feature threshold, a sum of a mean value of content features of the respective pieces of content data and M times of a standard deviation of the content features of the respective pieces of content data, the M being a positive integer.

Optionally, in the above apparatus for processing content data, the first obtaining unit includes:

a first acquisition subunit configured to perform acquisition of a plurality of content data;

a second determining unit configured to perform determining a feature type under a current service scenario;

a third determining unit configured to perform determining a service feature of each of the content data under each of the feature types;

and the fourth calculation unit is configured to perform weighted calculation on each service characteristic of the content data according to each content data to obtain the content characteristics of the content data.

Optionally, the content data processing apparatus further includes:

a second acquisition unit configured to perform acquisition of a content feature of content data issued in each of the plurality of history periods; the plurality of history cycles are continuous in the time dimension, and the time length of each history cycle is a preset cycle length;

a fifth calculation unit configured to perform calculation of a mean value of content features of the content data distributed in each of the history periods to obtain a plurality of mean values, and calculation of a variance of the content features of the content data distributed in each of the history periods to obtain a plurality of variances;

a sixth calculating unit, configured to calculate fluctuation rates of the plurality of mean values and calculate fluctuation rates of the plurality of variances;

a period determination unit configured to perform, when the fluctuation rates of the plurality of mean values are not greater than a preset value and the fluctuation rates of the plurality of variances are not greater than a preset value, determining the preset period length as a target period length;

a first returning unit, configured to perform, when the fluctuation rates of the plurality of mean values are greater than a preset value and/or the fluctuation rates of the plurality of variances are greater than a preset value, adjusting the preset period length, and returning to perform the acquiring of the content features of the content data issued in each of the plurality of history periods;

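wherein, when the first determining unit executes the acquiring of the published content data of each of the content data, it is configured to: acquire the published content data published within the target period length before the publishing time of each content data, to obtain the published content data of each content data.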

Optionally, in the above apparatus for processing content data, the first determining unit includes:

a seventh calculating unit, configured to perform calculation on an average value of content features of published content data of each piece of content data to obtain a content feature average value corresponding to each piece of content data;

the eighth calculating unit is configured to calculate a standard deviation of content features of published content data of each content data to obtain a standard deviation of content features corresponding to each content data;

a ninth calculating unit, configured to perform calculation to obtain a published content feature threshold of each content data by using a content feature mean value and a content feature standard deviation corresponding to each content data; the published content feature threshold of one piece of content data is equal to the sum of the content feature mean value corresponding to the content data and N times the content feature standard deviation corresponding to the content data; and N is a positive integer.

Optionally, the content data processing apparatus further includes:

the adjusting unit is configured to adjust the multiple N of the content characteristic standard deviation corresponding to each content data when the number of the target content data is not within the preset number range;

and the second returning unit is configured to, according to the adjusted multiple N of the content feature standard deviation corresponding to each piece of content data, return to the ninth calculating unit to execute the calculating of the published content feature threshold of each piece of content data by respectively using the content feature mean value and the content feature standard deviation corresponding to each piece of content data.

Optionally, in the above apparatus for processing content data, the adjusting unit includes:

a first adjusting unit, configured to reduce a multiple N of a content characteristic standard deviation corresponding to each of the content data if the number of the target content data is greater than a maximum value of the preset number range;

a second adjusting unit, configured to increase the multiple N of the content characteristic standard deviation corresponding to each of the content data if the number of the target content data is smaller than the minimum value of the preset number range.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of processing content data as described in any one of the above.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of processing content data according to any one of the above.

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program, characterized in that the computer program realizes the processing method of content data according to any one of the above when executed by a processor.

According to the content data processing method of the present disclosure, a plurality of pieces of content data are acquired, the content feature of each piece of content data is determined, and a content feature threshold is then calculated based on the content features of the pieces of content data. Next, a plurality of published content data corresponding to each content data are acquired, and the published content feature threshold of each content data is calculated using the content features of its corresponding published content data. Finally, the content data whose content features are greater than both their own published content feature threshold and the content feature threshold are screened out from the plurality of content data as target content data. Because the content feature threshold is calculated based on the content features of the plurality of content data, the screened content data both performs excellently among the content data to be mined and exceeds its own historical level, so excellent content data is mined accurately.
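The overall screening can be sketched in Python as below. This is a minimal sketch, not the claimed implementation: the function names, the dictionary-based data layout, the sample values, and the choice to skip content data with no published history are all assumptions made for illustration. Both thresholds use the optional "mean plus a multiple of the standard deviation" form.

```python
from statistics import fmean, pstdev


def sigma_threshold(values, k):
    """mean + k * (population) standard deviation of the given content features."""
    return fmean(values) + k * pstdev(values)


def screen_targets(features, history, m=1, n=1):
    """
    features: {content_id: content feature} of the content data to be mined.
    history:  {content_id: [content features of published content data
               from the same user account]}.
    Returns the ids whose content feature exceeds both thresholds.
    """
    content_threshold = sigma_threshold(list(features.values()), m)
    targets = []
    for content_id, feature in features.items():
        past = history.get(content_id, [])
        if not past:
            continue  # assumption: without history there is nothing to compare against
        if feature > content_threshold and feature > sigma_threshold(past, n):
            targets.append(content_id)
    return targets


# Hypothetical data: v5 comes from a small account, v6 from a large account.
features = {"v1": 100, "v2": 120, "v3": 110, "v4": 130, "v5": 600, "v6": 650}
history = {
    "v1": [90, 100, 110],
    "v2": [110, 120, 130],
    "v3": [100, 110, 120],
    "v4": [120, 130, 140],
    "v5": [200, 220, 210],    # far above its own history
    "v6": [900, 950, 1000],   # above the other content data, below its own history
}

targets = screen_targets(features, history, m=1, n=1)
print(targets)
```

In this sample, v6 clears the content feature threshold but not its own published content feature threshold, so only v5 is kept; a ranking by content feature alone would have placed v6 first.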

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.

FIG. 1 is a flow diagram illustrating a method of processing content data in accordance with an exemplary embodiment;

FIG. 2 is a flow diagram illustrating a method of determining content characteristics of content data in accordance with an exemplary embodiment;

FIG. 3 is a flow diagram illustrating a method of computing a content feature threshold in accordance with an exemplary embodiment;

FIG. 4 is a flow chart illustrating a method of determining a target cycle length in accordance with an exemplary embodiment;

FIG. 5 is a flow diagram illustrating a method of calculating a published content feature threshold in accordance with an exemplary embodiment;

FIG. 6 is a flow diagram illustrating another method of processing content data in accordance with an illustrative embodiment;

FIG. 7 is a flow diagram illustrating a method of multi-content data classification in accordance with an exemplary embodiment;

FIG. 8 is a flow diagram illustrating another method of computing a content feature threshold in accordance with an exemplary embodiment;

FIG. 9 is a schematic structural diagram showing a content data processing apparatus according to an exemplary embodiment;

FIG. 10 is a schematic diagram illustrating a first computing unit in accordance with one illustrative embodiment;

FIG. 11 is a schematic diagram illustrating a first acquisition unit according to an exemplary embodiment;

FIG. 12 is a schematic diagram illustrating a first determination unit according to an exemplary embodiment;

FIG. 13 is a schematic diagram illustrating a configuration of an adjustment calculation unit in accordance with an exemplary embodiment;

FIG. 14 is a schematic structural diagram of an electronic device according to an exemplary embodiment.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

FIG. 1 is a flowchart illustrating a method for processing content data according to an exemplary embodiment. As shown in FIG. 1, the method can be applied to electronic devices such as servers and computer terminals, and specifically includes the following steps S101 to S104.

In step S101, a plurality of content data are acquired, and the content characteristics of the respective content data are determined.

In the embodiments of the present disclosure, content data refers to media data such as video, text, sound, and images. More specifically, it may include videos, news, articles, audio, and the like published by a publishing subject. It should be noted that the content data is not limited to a complete work, such as a complete movie or a complete novel, but may also be a part of a work, such as a segment or a portion of a movie; the plurality of acquired content data may therefore be a plurality of segments of a plurality of videos. In other words, the present disclosure does not specifically limit the form, length, publication time, and the like of the content data.

It should be noted that the content feature of the content data may be used to represent the degree of excellence of the content data in the service scenario. Generally, the larger the value of the content feature, the better the content data performs in the service scenario; for example, if the content feature is the number of likes, more likes indicate better performance. The content feature of the content data may be determined according to the service scenario. Specifically, it may be a single feature datum of the content data, such as the number of plays, the number of likes, or the number of exposures, or it may be a value obtained by fusing multiple feature data. For example, if the service scenario is mining excellent videos, the selected features are features related to video consumption, such as the number of plays and the number of exposures, and the two values are then weighted and fused to obtain the content feature of the content data. In most service scenarios, the content feature is obtained by fusing multiple data.

Alternatively, if the content feature of the content data is obtained by fusing a plurality of data, as shown in fig. 2, one specific implementation of step S101 includes the following steps S201 to S203.

S201, determining the feature type in the current service scene.

The feature types may be click rate, number of comments, like rate, and so on, that is, data that can reflect the degree of excellence of the content data in the current service scenario. Specifically, the feature types of the content features corresponding to each service scenario may be preset; after the user selects a service scenario, the feature types required in that service scenario can be determined, and step S202 is then performed.

S202, determining the service characteristics of each content data under each characteristic type.

S203, for each content data, performing weighted calculation on each service feature of the content data to obtain the content feature of the content data.

For example, suppose the feature types in the current service scenario are click rate, number of comments, and number of likes, and the service features of a certain content data under these feature types are: click rate 1000, number of comments 200, and number of likes 500. If the weights corresponding to the three service features are 10%, 50%, and 40%, respectively, then weighting the service features of the content data gives a content feature of 1000 × 10% + 200 × 50% + 500 × 40% = 400.
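The weighted calculation of step S203 can be sketched in Python with the values from this example. The function name and dictionary keys are hypothetical, and a plain weighted sum is assumed as the fusion rule.

```python
def fuse_content_feature(service_features, weights):
    """Weighted sum of the service features of one piece of content data."""
    if service_features.keys() != weights.keys():
        raise ValueError("every service feature needs exactly one weight")
    return sum(service_features[name] * weights[name] for name in service_features)


# Values and weights taken from the example in the text; key names are illustrative.
service_features = {"click_rate": 1000, "comments": 200, "likes": 500}
weights = {"click_rate": 0.10, "comments": 0.50, "likes": 0.40}

content_feature = fuse_content_feature(service_features, weights)
print(round(content_feature, 6))
```

Keeping the weights in a mapping keyed by feature type makes it straightforward to swap in a different weight set when the service scenario changes.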

The weight of each service feature in the weighting calculation process may be determined according to the requirement of a service scenario, and in general, the weight corresponding to a service feature having a larger influence on the degree of superiority of the evaluation content data in the service scenario is larger.

Alternatively, the weighting calculation may be performed using an existing comprehensive capability calculation method of subjective weighting and objective weighting. The specific calculation process is the same as the prior art, and is not described herein again.

In step S102, a content feature threshold value is calculated based on the content feature of each content data.

The content feature threshold is calculated based on the content feature of each content data, and can be used to assess whether one content data is excellent in a plurality of content data.

Optionally, the content feature threshold may be calculated based on the 3-sigma anomaly detection algorithm; a specific embodiment is shown in FIG. 3 and includes the following steps S301 to S302.

S301, calculating to obtain the mean value and the standard deviation of the content characteristics of each content data.

S302, taking the sum of the average value of the content features of each content data and the standard deviation of the content features of each content data multiplied by M as a content feature threshold value, wherein M is a positive integer.

For example, if the content features of 5 pieces of content data are 350, 400, 600, 500, and 550, the mean value of the content features is 480 and the standard deviation is approximately 92.74. If M is set to 3, the content feature threshold is 480 + 92.74 × 3 = 758.22.
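Steps S301 to S302 can be reproduced with the Python standard library, as in the sketch below. The function name is hypothetical, and the population standard deviation (`statistics.pstdev`) is assumed because it matches the 92.74 in the example. Without rounding the standard deviation before multiplying, the result is about 758.21; the 758.22 in the text comes from using the rounded value 92.74.

```python
from statistics import fmean, pstdev


def content_feature_threshold(features, m=3):
    """Return mean + m * population standard deviation of the content features."""
    return fmean(features) + m * pstdev(features)


features = [350, 400, 600, 500, 550]
threshold = content_feature_threshold(features, m=3)

print(f"mean={fmean(features):.2f} std={pstdev(features):.2f} threshold={threshold:.2f}")
```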

It should be noted that the more detailed process of calculating the content feature threshold based on the 3-sigma anomaly detection algorithm is the same as the process of calculating the published content feature threshold shown in FIG. 5, except that FIG. 3 uses the content features of the content data in place of the content features of the published content data; the details are therefore not repeated here.

In step S103, the published content data of each content data is acquired, and the published content feature threshold of each content data is determined.

The content data and its corresponding published content data are published by the same user account. The published content data of one content data is published no later than that content data; that is, the published content data corresponding to one content data refers to other content data that is published by the same user account as the content data and whose publication time is not later than that of the content data. It should be noted that, in the embodiments of the present disclosure, if two pieces of content data are published by the same user account, the two pieces of content data are determined to correspond to each other, and the piece with the earlier publication time is used as published content data corresponding to the piece with the later publication time. The published content feature threshold of one content data is calculated using the content features of the acquired published content data of that content data. Optionally, the published content feature threshold may be calculated based on an anomaly detection algorithm and may be used to assess whether the content data is better than its historical level, so the published content feature threshold can also be loosely understood as the level that the content features of the published content data reach when they are excellent.
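A minimal Python sketch of selecting the published content data of one content data and deriving its published content feature threshold. The `Content` record, its field names, the integer timestamps, and the sample data are assumptions for illustration; the threshold uses the optional "mean plus N standard deviations" form described in this disclosure.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class Content:
    content_id: str
    account: str
    publish_time: int  # assumed representation, e.g. a Unix timestamp
    feature: float


def published_content_of(item, all_items):
    """Other content data from the same user account published no later than `item`."""
    return [
        other for other in all_items
        if other.account == item.account
        and other.content_id != item.content_id
        and other.publish_time <= item.publish_time
    ]


def published_threshold(item, all_items, n=3):
    """mean + n * std over the item's published content data, or None without history."""
    history = [c.feature for c in published_content_of(item, all_items)]
    if not history:
        return None
    return fmean(history) + n * pstdev(history)


items = [
    Content("a1", "alice", 1, 100.0),
    Content("a2", "alice", 2, 120.0),
    Content("a3", "alice", 3, 110.0),
    Content("a4", "alice", 4, 400.0),
    Content("b1", "bob", 2, 900.0),
]

latest = items[3]
print([c.content_id for c in published_content_of(latest, items)])
print(published_threshold(latest, items, n=3))
```

Content data from other accounts (here `b1`) never enter the history, so each content data is compared only against the historical level of its own user account.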

Generally, for each independent content data, that is, a complete content data that is published separately, such as each video published by a user, published content data corresponding to one content data generally refers to other content data that is published no later than the content data by the publishing subject who published the content data.

For non-independent content data, that is, content data that is a part of a complete, separately published content data, the content data has a time sequence characteristic, namely the temporal relation between this part and the other parts. For example, for a video clip within an episode of a television series, the clip has a temporal precedence relation with the other clips. In this case, the published content data corresponding to the content data may be the other components of the separately published content data to which it belongs whose publication time is earlier than that of the content data. Of course, the published content data corresponding to the content data may also be components of other content data separately published by the same publishing subject. For example, for a live broadcast segment, the corresponding published content data may be the other live broadcast segments of the live broadcast video to which it belongs whose publication time is earlier than that of the segment; in this case, the method provided by the present disclosure can mine excellent live broadcast segments within a live broadcast video. Of course, the published content data corresponding to the live broadcast segment may also be live broadcast segments in other live broadcast videos that are published by the same publishing subject earlier than the live broadcast segment.

It should be noted that, with the method provided by the embodiment of the present disclosure, it is necessary to determine content data that is superior to the self history level, and therefore, it is necessary to determine published content data for comparing whether the content data is excellent or not.

Optionally, and without limitation, for each content data, a preset number of corresponding published content data whose publication times are closest to that of the content data may be selected, or the published content data corresponding to the content data may be determined randomly or in another manner. If the current service scenario is mining highlight moments of content data having a time sequence characteristic, the other components of the content data to which the content data belongs may be selected.

Optionally, one specific implementation of acquiring the published content data of each content data includes: acquiring the published content data published within the target period length before the publication time of each content data, to obtain the published content data of each content data.

For example, if the publication time of a piece of content data is February 1, 2021 and the target period length is one month, the content data published in January 2021 by the subject that published the piece of content data is acquired, and the acquired content data is the published content data of the piece of content data.
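The selection described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the record layout (`id`, `account`, `published_at`) and the function name `select_published` are hypothetical, and a 31-day window stands in for "one month".

```python
from datetime import datetime, timedelta

def select_published(items, target, period):
    """Return other items from the same account published within `period`
    before the target and no later than it."""
    start = target["published_at"] - period
    return [
        it for it in items
        if it is not target
        and it["account"] == target["account"]
        and start <= it["published_at"] <= target["published_at"]
    ]

# Hypothetical sample data
items = [
    {"id": "a", "account": "u1", "published_at": datetime(2020, 12, 20)},
    {"id": "b", "account": "u1", "published_at": datetime(2021, 1, 10)},
    {"id": "c", "account": "u2", "published_at": datetime(2021, 1, 15)},
    {"id": "d", "account": "u1", "published_at": datetime(2021, 2, 1)},
]
target = items[3]
published = select_published(items, target, timedelta(days=31))
print([it["id"] for it in published])
```

Here only item `b` qualifies: item `a` falls outside the window and item `c` belongs to a different account.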

The target cycle length refers to a length of time. The target cycle length may be determined based on the service scenario and the characteristics of the content data, so that the published content data obtained can meet the requirements of the service scenario. Since the content data to be mined usually have the same characteristics, in the embodiment of the present disclosure the published content data corresponding to each content data is determined by using the same target cycle length. Of course, a target cycle length may also be determined separately for each content data, and the published content data corresponding to each content data may then be determined by using its own target cycle length.

FIG. 4 illustrates a method of determining a target cycle length in accordance with an exemplary embodiment. As shown in fig. 4, specifically includes steps S401 to S406.

In step S401, the content characteristics of the content data issued in each of the plurality of history periods are acquired; the plurality of history periods are continuous in the time dimension, and the time length of each history period is a preset period length.

The preset cycle length is an initial cycle length selected according to the service scenario and the characteristics of the content data, for example, 1 day or 7 days may be selected, or several hours may be selected.

For example, if the selected preset period length is 7 days and the content features of the content data published in each of 4 history periods are acquired, the four history periods may be: January 1 to January 7, January 8 to January 14, January 15 to January 21, and January 22 to January 28. It should be appreciated that the end of the last history period is no later than the current time.

In step S402, the mean of the content features of the content data distributed in each history period is calculated to obtain a plurality of means, and the variance of the content features of the content data distributed in each history period is calculated to obtain a plurality of variances.

For example, suppose there are three history periods. The content features of the content data of the first history period are 400, 500, and 600; those of the second history period are 400 and 600; and those of the third history period are 450, 500, 550, and 600. Calculating the mean of the content features of the content data published in each history period gives three means: 500, 500, and 525. Calculating the variance of the content features of the content data published in each history period gives three variances, approximately 6666.67, 10000, and 3125.
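The figures in this example correspond to the population variance (dividing by the number of items in the period). A short check, using Python's standard `statistics` module as an illustrative choice:

```python
from statistics import mean, pvariance

# Content features of the content data published in each history period
periods = [
    [400, 500, 600],
    [400, 600],
    [450, 500, 550, 600],
]

means = [mean(p) for p in periods]
variances = [pvariance(p) for p in periods]  # population variance

print(means)
print([round(v, 2) for v in variances])
```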

In step S403, the fluctuation rates of the plurality of mean values are calculated, and the fluctuation rates of the plurality of variances are calculated.

After the mean and the variance of each history period are obtained, the fluctuation rate of the means and the fluctuation rate of the variances over the consecutive periods can be calculated separately. Optionally, the fluctuation rate of the means or of the variances may be the absolute value of the sum of the change rates of every two adjacent means or variances. For example, for the means 500, 500, and 525 calculated above, the change rate from the first mean 500 to the second mean 500 is (500 - 500)/500 = 0, and the change rate from the second mean 500 to the third mean 525 is (525 - 500)/500 = 0.05, so the fluctuation rate of the means is |0 + 0.05| = 0.05.
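The optional definition of the fluctuation rate given above can be written as a small function. The name `fluctuation_rate` is hypothetical, and the sketch assumes that no earlier value in a pair is zero, since the change rate would otherwise be undefined.

```python
def fluctuation_rate(values):
    """Absolute value of the sum of change rates between adjacent values."""
    # Change rate of each adjacent pair: (later - earlier) / earlier
    rates = [(b - a) / a for a, b in zip(values, values[1:])]
    return abs(sum(rates))

mean_rate = fluctuation_rate([500, 500, 525])
print(mean_rate)
```

For the three means of the example this reproduces the fluctuation rate of 0.05.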

In step S404, it is determined whether the fluctuation rates of the plurality of mean values are not greater than a preset value and the fluctuation rates of the plurality of variances are not greater than a preset value.

If the target period length is too small, the number of determined published content data is small, so the result is easily affected by chance and cannot well reflect whether the content data is better than the history level. However, if the target period length is too large, the data amount of the determined published content data is too large, which affects processing efficiency. Therefore, the embodiment of the present disclosure selects, as the target period length, a period length for which both the fluctuation rate of the means and the fluctuation rate of the variances of the content features are not greater than the preset value, which both avoids chance effects and prevents the data amount from being excessively large. Accordingly, if it is determined that the fluctuation rates of the plurality of mean values are not greater than the preset value and the fluctuation rates of the plurality of variances are not greater than the preset value, step S406 is executed; if it is determined that the fluctuation rates of the plurality of mean values are greater than the preset value, and/or the fluctuation rates of the plurality of variances are greater than the preset value, step S405 is executed.

In step S405, the preset period length is adjusted.

Adjusting the preset period length is to reselect a preset period length. Specifically, the preset period length is usually adjusted by a set amplitude over the originally selected preset period length, that is, the preset period length is increased by the set amplitude, for example, the preselected period length is 7 days, the set amplitude is 3 days, and the adjusted period length is 10 days. It should be noted that, after the preset period length is adjusted, step S401 is executed for the adjusted preset period length.

In step S406, a preset cycle length is determined as a target cycle length.
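Steps S401 to S406 can be sketched as a loop that lengthens the period until both fluctuation rates fall within the preset value. Several details here are assumptions not stated in the disclosure: the history periods are taken to end at the current time, a period without data or with a zero mean or variance is treated as unstable, and `max_length` is added only as a safeguard against an endless loop. All function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, pvariance

def is_stable(values, max_rate):
    """True if |sum of adjacent change rates| <= max_rate (S403, S404)."""
    if any(a == 0 for a in values[:-1]):
        return False  # change rate undefined; treated as unstable (assumption)
    rate = abs(sum((b - a) / a for a, b in zip(values, values[1:])))
    return rate <= max_rate

def target_period_length(records, now, initial, step, n_periods, max_rate, max_length):
    """records: list of (published_at, content_feature) pairs."""
    length = initial
    while length <= max_length:
        # S401: n_periods consecutive history periods ending at `now`
        periods = []
        for i in range(n_periods, 0, -1):
            start, end = now - i * length, now - (i - 1) * length
            periods.append([f for t, f in records if start <= t < end])
        if all(periods):  # every period must contain data
            means = [mean(p) for p in periods]           # S402
            variances = [pvariance(p) for p in periods]  # S402
            if is_stable(means, max_rate) and is_stable(variances, max_rate):
                return length                            # S406
        length += step                                   # S405
    return None  # no stable length found within max_length

# Hypothetical history: six days, two items per day, alternating daily levels
now = datetime(2021, 2, 1)
records = []
for k in range(1, 7):
    day_values = [100, 300] if k % 2 else [200, 600]
    for value in day_values:
        records.append((now - timedelta(days=k) + timedelta(hours=12), value))

result = target_period_length(
    records, now, initial=timedelta(days=1), step=timedelta(days=1),
    n_periods=3, max_rate=0.1, max_length=timedelta(days=30),
)
print(result)
```

With this sample data, one-day periods alternate between two levels and are rejected, while two-day periods have identical means and variances, so a two-day length is returned.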

Optionally, when the published content feature threshold is calculated based on the 3-sigma anomaly detection algorithm, the calculation of the published content feature threshold, as shown in fig. 5, includes steps S501 to S502.

In step S501, a mean value of content features of the published content data corresponding to each content data is calculated to obtain a mean value of content features corresponding to each content data, and a standard deviation of content features of the published content data of each content data is calculated to obtain a standard deviation of content features corresponding to each content data.

In step S502, the published content feature threshold of each content data is calculated by using the content feature mean and the content feature standard deviation corresponding to that content data; the published content feature threshold of one content data is equal to the sum of the content feature mean corresponding to the content data and N times the content feature standard deviation corresponding to the content data.

Here, N is a positive integer. In the conventional 3-sigma anomaly detection algorithm, N is generally 3, and the algorithm is applied to normally distributed data. In different service scenarios, however, the content features of the published content data rarely follow a normal distribution. Therefore, for non-normally distributed data, the value of N can be determined by means of a box plot. Specifically, a box plot is made using the content features of the published content data, and N is determined based on the box plot. N is determined such that the published content feature threshold is at least at the position of the outliers of the box plot, where the data distribution starts to diverge, which ensures as far as possible that content data larger than the published content feature threshold is screened out. Specifically, an initial N is determined randomly or by using a fixed value or the like; the published content feature threshold is then calculated based on this N, and N is adjusted according to the position of the published content feature threshold in the box plot; the process then returns to recalculating the published content feature threshold based on the current N and adjusting N according to the position of the current threshold in the box plot, until the published content feature threshold is located at the position of the outliers of the box plot, where the data distribution starts to diverge.
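One possible reading of steps S501 and S502 together with the box-plot selection of N is sketched below. It rests on assumptions the disclosure does not fix: the population standard deviation is used, the "position of the outliers" is taken to be the conventional upper fence Q3 + 1.5 × IQR, and N is only ever increased from a starting value. The names `published_threshold` and `choose_n` are hypothetical.

```python
from statistics import mean, pstdev, quantiles

def published_threshold(features, n=3):
    """Mean plus n standard deviations of the published content features."""
    return mean(features) + n * pstdev(features)

def choose_n(features, start=1, max_n=10):
    """Raise N until the threshold reaches the upper fence of the box plot."""
    q1, _, q3 = quantiles(features, n=4)
    upper_fence = q3 + 1.5 * (q3 - q1)  # conventional outlier boundary (assumption)
    n = start
    while published_threshold(features, n) < upper_fence and n < max_n:
        n += 1
    return n, upper_fence

# Hypothetical content features of the published content data of one content data
history = list(range(1, 22))
n, fence = choose_n(history)
threshold = published_threshold(history, n)
print(n, fence, threshold)
```

The loop stops at the smallest N (not below `start`) whose threshold is at or above the fence, or at `max_n` if the fence is never reached.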

It can be seen that the multiple N of the standard deviation may affect the size of the published content feature threshold, and the content feature of the finally screened target content data needs to be greater than the published content feature threshold, so the multiple N of the standard deviation may affect the number of the finally screened target content data, and therefore after the target content data is screened, the following steps may be further performed: it is determined whether the amount of the target content data satisfies the expected amount.

If it is determined that the number of target content data does not satisfy the expected number, the multiple N of the standard deviation is adjusted. Specifically, if the number of target content data is greater than the expected number, the multiple N of the standard deviation is decreased; if the number of target content data is smaller than the expected number, the multiple N of the standard deviation is increased. The process then returns to the step of calculating the published content feature threshold of each content data by using the content features of the plurality of published content data corresponding to each content data, that is, the step of calculating the mean and the standard deviation of the content features of the published content data corresponding to each content data.

In step S104, for any content data, if the content feature of the content data is greater than the published content feature threshold of the content data and greater than the content feature threshold, it is determined that the content data is the target content data.
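The double condition of step S104 can be illustrated as follows. The dictionaries keyed by a content identifier and the function name `select_targets` are hypothetical.

```python
def select_targets(features, published_thresholds, content_threshold):
    """Keep content whose feature exceeds both its own published content
    feature threshold and the shared content feature threshold."""
    return [
        cid for cid, value in features.items()
        if value > published_thresholds[cid] and value > content_threshold
    ]

# Hypothetical values
features = {"a": 900, "b": 700, "c": 950}
published_thresholds = {"a": 600, "b": 750, "c": 1000}
targets = select_targets(features, published_thresholds, content_threshold=800)
print(targets)
```

Only `a` is selected: `c` has a higher content feature but does not exceed its own history level, and `b` fails both conditions.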

Therefore, the target content data screened out in the embodiment of the present disclosure is not only superior to the self history level, but also belongs to excellent content data (i.e., is also at an excellent level in the entirety) among a plurality of content data, thereby accurately obtaining excellent content data.

It should be noted that, when the service scenario is mining highlight moments of content data having a time sequence characteristic, the acquired content data are moments from a plurality of independent content data. In that case, a content feature greater than the published content feature threshold of the content data indicates that the content data is a highlight moment within its independent content data, and a content feature also greater than the content feature threshold indicates that this highlight moment is better than the highlight moments of other independent content data.

According to the content data processing method provided by the embodiment of the disclosure, a plurality of pieces of content data are acquired, the content characteristics of each piece of content data are determined, and then the content characteristic threshold value is calculated based on the content characteristics of each piece of content data. Secondly, acquiring a plurality of published content data corresponding to each content data, and calculating a published content feature threshold value of each content data by using the content features of the plurality of published content data corresponding to each content data, thereby finally screening out the content data of which the content features are larger than the published content feature threshold value of the content data and larger than the content feature threshold value from the plurality of content data as target content data, wherein the content feature threshold value is calculated based on the content features of the plurality of content data. Therefore, the content data which is excellent in performance among the plurality of content data to be mined and is superior to the history level of the content data is screened out, and the excellent content data is accurately mined.

Fig. 6 is a flowchart illustrating a method of processing content data according to an exemplary embodiment. As shown in fig. 6, the method for processing content data specifically includes steps S601-S605.

In step S601, a plurality of content data are acquired, and the content characteristics of the respective content data are determined.

The content characteristics of the content data are used for explaining the excellent degree of the content data in the service scene.

It should be noted that, for a specific implementation of step S601, reference may be made to the implementation of step S101 in the foregoing method embodiment, and details are not described here again.

In step S602, a plurality of content data items are classified into a plurality of categories.

The content features of content data are influenced by many factors, such as the category of the content data and the number of fans of its publisher. For example, the viewing volume of entertainment works is generally much larger than that of military or sports works, and the content features of content data published by a user with many fans are generally larger than those of content data published by a user with few fans. Therefore, it is necessary to classify the content data and then mine each category separately, so as to accurately mine excellent works.

Optionally, fig. 7 shows a method of classifying content data, which comprises steps S701-S703.

In step S701, attribute information of each content data is acquired.

The type of the attribute information is selected according to the service scenario. It may include service features of the content data itself, such as viewing volume and number of likes; it may also include information about the publishing subject of the content data, such as number of fans and number of works; and it may of course include other information related to the content data, selected according to the requirements of the service scenario.

In step S702, the attribute information of each piece of content data is subjected to feature processing to obtain a feature vector of each piece of content data.

Optionally, an existing feature processing tool, such as word2vec, may be used to perform feature processing on the attribute information of the content data to obtain a feature vector of each content data.

In step S703, the feature vectors of the content data are processed by a clustering algorithm, and the content data is divided into a plurality of categories.

Optionally, the content data may be classified using the existing clustering algorithm k-means. The number of categories may be set according to the requirements of the service scenario.
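To make step S703 concrete, the following is a deliberately small k-means sketch in plain Python. It is illustrative only: it initializes the centroids with the first k vectors and runs a fixed number of iterations, whereas a practical system would more likely rely on an existing library implementation. The two-dimensional feature vectors are hypothetical.

```python
from math import dist

def kmeans(vectors, k, iterations=20):
    """Assign each vector to one of k clusters (naive Lloyd iteration)."""
    centroids = [list(v) for v in vectors[:k]]  # naive init: first k vectors
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance
        labels = [min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical feature vectors of five content data
vectors = [(1.0, 1.0), (10.0, 10.0), (1.2, 0.8), (9.5, 10.5), (0.9, 1.1)]
labels = kmeans(vectors, k=2)
print(labels)
```

The value of `k` plays the role of the number of categories, which the disclosure leaves to the requirements of the service scenario.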

In step S603, a content feature threshold is calculated based on the content feature of each content data.

Specifically, a content feature threshold corresponding to each category is calculated based on the content feature of the content data of each category. It should be noted that, for the specific calculation process of each category, reference may be made to the specific implementation process of step S102, which is not described herein again.

In step S604, the published content data of each content data is acquired, and the published content feature threshold of each content data is determined.

The publication time of each content data is not earlier than that of each published content data corresponding to the content data.

It should be noted that, for the specific implementation of step S604, reference may be made to each specific implementation of step S103, and details are not described here again.

In step S605, for any content data in each category of content data, if the content feature of the content data is greater than the published content feature threshold of the content data and greater than the content feature threshold corresponding to the category, the content data is determined to be the target content data.

The content feature threshold corresponding to each category is calculated based on the content features of the content data belonging to the category. For example, the content feature threshold value corresponding to the sports category is calculated based on the content feature of each content data under the sports category, and the content feature threshold value corresponding to the entertainment category is calculated based on the content feature of each content data under the entertainment category.
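A sketch of per-category screening (steps S603 and S605) follows, assuming the threshold of a category is the mean plus M times the population standard deviation of the content features in that category. The record layout and function names are hypothetical, and M is set to 1 only because the sample is tiny.

```python
from collections import defaultdict
from statistics import mean, pstdev

def category_thresholds(items, m=3):
    """Content feature threshold per category: mean + m * standard deviation."""
    by_category = defaultdict(list)
    for it in items:
        by_category[it["category"]].append(it["feature"])
    return {c: mean(v) + m * pstdev(v) for c, v in by_category.items()}

def select_targets(items, published_thresholds, m=3):
    """Keep items above their own published threshold and their category threshold."""
    thresholds = category_thresholds(items, m)
    return [
        it["id"] for it in items
        if it["feature"] > published_thresholds[it["id"]]
        and it["feature"] > thresholds[it["category"]]
    ]

# Hypothetical data: entertainment works have much larger content features than sports works
items = [
    {"id": "s1", "category": "sports", "feature": 100},
    {"id": "s2", "category": "sports", "feature": 120},
    {"id": "s3", "category": "sports", "feature": 110},
    {"id": "s4", "category": "sports", "feature": 300},
    {"id": "e1", "category": "entertainment", "feature": 1000},
    {"id": "e2", "category": "entertainment", "feature": 1100},
    {"id": "e3", "category": "entertainment", "feature": 1050},
    {"id": "e4", "category": "entertainment", "feature": 2000},
]
published_thresholds = {"s1": 150, "s2": 150, "s3": 150, "s4": 200,
                        "e1": 1500, "e2": 1500, "e3": 1500, "e4": 2500}
targets = select_targets(items, published_thresholds, m=1)
print(targets)
```

The sports item `s4` is selected even though its content feature is far below every entertainment item, because it is compared only with its own category and its own history; `e4` stands out in its category but does not exceed its own published content feature threshold.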

Optionally, in another embodiment of the present application, step S603 is performed after step S604, and a specific implementation of step S603, as shown in fig. 8, includes the following steps S801 to S803.

In step S801, primary selected content data is screened out from the content data of each category, wherein the content feature of each primary selected content data is greater than its own published content feature threshold.

In step S802, the mean and the standard deviation of the content features of the primary selected content data belonging to the same category are calculated for each category.

It should be noted that content data that does not belong to the primary selected content data (i.e., non-primary selected content data) cannot be screened out as target content data, because its content feature is not greater than its own published content feature threshold. Excellent content data can therefore be screened by comparing only the primary selected content data, without considering the non-primary selected content data. Accordingly, in the embodiment of the present disclosure the content data is screened step by step: the primary selected content data is screened out first, and the target content data is then screened out from the primary selected content data, which reduces the amount of data to be calculated and improves calculation efficiency. Of course, this is only one optional manner. Alternatively, the content features of all content data in each category may be used to calculate the content feature threshold corresponding to each category, and finally the content data whose content feature is greater than its own published content feature threshold and greater than the content feature threshold corresponding to its category is screened out as the target content data.

In step S803, the content feature threshold corresponding to each category is calculated by using the mean and the standard deviation.

The content feature threshold is equal to the sum of the mean and M times the standard deviation, where M is a positive integer.
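The step-by-step screening of steps S801 to S803 can be sketched as below. As before, the record layout and the name `two_stage_select` are hypothetical, the population standard deviation is an assumption, and M is set to 1 only to suit the small sample.

```python
from collections import defaultdict
from statistics import mean, pstdev

def two_stage_select(items, published_thresholds, m=1):
    # S801: primary selection against each item's own published threshold
    primary = [it for it in items if it["feature"] > published_thresholds[it["id"]]]
    # S802: group the primary selected content features by category
    by_category = defaultdict(list)
    for it in primary:
        by_category[it["category"]].append(it["feature"])
    # S803: category threshold = mean + m * standard deviation
    thresholds = {c: mean(v) + m * pstdev(v) for c, v in by_category.items()}
    # Final screening among the primary selected content data only
    return [it["id"] for it in primary if it["feature"] > thresholds[it["category"]]]

# Hypothetical data for a single category
items = [
    {"id": "s1", "category": "sports", "feature": 100},
    {"id": "s2", "category": "sports", "feature": 120},
    {"id": "s3", "category": "sports", "feature": 130},
    {"id": "s4", "category": "sports", "feature": 140},
    {"id": "s5", "category": "sports", "feature": 300},
]
published_thresholds = {"s1": 150, "s2": 100, "s3": 110, "s4": 100, "s5": 200}
targets = two_stage_select(items, published_thresholds)
print(targets)
```

Item `s1` is dropped in the primary selection and does not enter the category statistics, which is where the reduction in calculated data comes from. Note that in this sketch a category with a single primary selected item yields no target, since the standard deviation is zero and the item cannot exceed its own value.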

According to the content data processing method provided by the embodiment of the present disclosure, the content features of a plurality of content data are acquired, and the content data is then divided into a plurality of categories so that data screening is performed on each category separately, which allows excellent content data to be mined more accurately. Subsequently, a plurality of published content data corresponding to each content data is determined, and the published content feature threshold of each content data is calculated by using the content features of the published content data corresponding to that content data, so that the primary selected content data, whose content feature is greater than its own published content feature threshold, is screened out from the content data; that is, the data whose content feature is higher than its own history level is screened out. Then, the content feature threshold of each category is calculated using the screened-out primary selected content data, and the content data whose content feature is greater than the content feature threshold of its category is screened out from the primary selected content data as the target content data. In this way, content data that is excellent within its category and superior to its own history level is mined, so that excellent content data is mined more accurately.

Fig. 9 is a schematic structural diagram illustrating a content data processing apparatus according to an exemplary embodiment. As shown in fig. 9, the content data processing apparatus specifically includes the following units:

a first obtaining unit 901 configured to perform obtaining a plurality of content data and determine a content feature of each content data.

A first calculating unit 902 configured to perform calculation of a content feature threshold based on the content feature of each content data.

A first determining unit 903 configured to perform acquiring published content data of each content data and determine a published content feature threshold of each content data.

Wherein, the content data and the published content data are published by the same user account.

The screening unit 904 is configured to, for any content data, determine that the content data is target content data if the content feature of the content data is greater than the published content feature threshold of the content data and greater than the content feature threshold.

Optionally, in a processing apparatus for content data provided in another embodiment of the present disclosure, the apparatus may further include the following unit:

a second acquisition unit configured to acquire attribute information of each content data.

A feature processing unit configured to perform feature processing on the attribute information of each content data to obtain a feature vector of each content data.

A classification unit configured to process the feature vector of each content data by using a clustering algorithm and divide the content data into a plurality of categories.

Wherein the screening unit comprises:

a screening subunit configured to, for any content data among the content data of each category, determine that the content data is target content data if the content feature of the content data is greater than the published content feature threshold of the content data and greater than the content feature threshold corresponding to the category to which the content data belongs.

The content feature threshold corresponding to each category is calculated based on the content features of the content data belonging to the category.

Optionally, in a processing apparatus of content data provided in another embodiment of the present disclosure, a first calculating unit, as shown in fig. 10, includes:

a second calculation unit 1001 configured to perform calculation to obtain a mean value and a standard deviation of content features of each content data;

a third calculation unit 1002 configured to take, as the content feature threshold, the sum of the mean of the content features of the respective content data and M times the standard deviation of the content features of the respective content data, M being a positive integer.

Optionally, in a processing apparatus of content data provided in another embodiment of the present disclosure, a first obtaining unit, as shown in fig. 11, includes the following units:

a first acquisition sub-unit 1101 configured to perform acquisition of a plurality of content data.

A second determining unit 1102 configured to perform determining a feature type under a current service scenario.

A third determining unit 1103 configured to perform determining service characteristics of the respective content data under the respective characteristic types.

A fourth calculating unit 1104 configured to perform weighted calculation on each service feature of the content data for each content data, resulting in the content feature of the content data.
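The weighted calculation performed by the fourth calculating unit can be illustrated with a simple weighted sum. The feature types (`views`, `likes`, `comments`) and the weights are hypothetical; in practice they would depend on the current service scenario.

```python
def content_feature(service_features, weights):
    """Weighted sum of the service features under each feature type."""
    return sum(weights[name] * service_features[name] for name in weights)

# Hypothetical feature types and weights for one service scenario
weights = {"views": 0.5, "likes": 0.3, "comments": 0.2}
video = {"views": 1000, "likes": 200, "comments": 50}
score = content_feature(video, weights)
print(score)
```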

Optionally, in a processing apparatus for content data provided in another embodiment of the present disclosure, the apparatus further includes:

a second acquisition unit configured to perform acquisition of a content feature of the content data issued in each of the plurality of history periods.

The plurality of history periods are continuous in the time dimension, and the time length of each history period is a preset period length.

And the fifth calculation unit is configured to calculate the mean value of the content features of the content data issued in each history period to obtain a plurality of mean values, and calculate the variance of the content features of the content data issued in each history period to obtain a plurality of variances.

And the sixth calculating unit is used for calculating the fluctuation rates of the plurality of mean values and calculating the fluctuation rates of the plurality of variances.

A period determination unit configured to perform a determination of a preset period length as a target period length when a fluctuation rate of the plurality of means is not greater than a preset value and a fluctuation rate of the plurality of variances is not greater than a preset value.

And the first returning unit is configured to adjust the length of the preset period and return to the step of acquiring the content characteristics of the content data issued in each history period of the plurality of history periods when the fluctuation rates of the plurality of mean values are greater than the preset value and/or the fluctuation rates of the plurality of variances are greater than the preset value.

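Wherein, when acquiring the published content data of each content data, the first determining unit is configured to: acquire the published content data published within the target period length before the publication time of each content data, to obtain the published content data of each content data.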

Optionally, in a processing apparatus of content data provided in another embodiment of the present disclosure, the first determining unit, as shown in fig. 12, includes the following units:

a seventh calculating unit 1201, configured to perform calculating an average value of the content features of the published content data of each content data, to obtain a content feature average value corresponding to each content data.

An eighth calculating unit 1202 configured to perform calculating a standard deviation of content features of published content data of each content data, resulting in a content feature standard deviation corresponding to each content data;

a ninth calculating unit 1203, configured to perform calculating to obtain a published content feature threshold of each content data by using the content feature mean and the content feature standard deviation corresponding to each content data, respectively.

The published content feature threshold of one content data is equal to the sum of the content feature mean corresponding to the content data and N times the content feature standard deviation corresponding to the content data; N is a positive integer.

An adjusting unit configured to adjust the multiple N of the content feature standard deviation corresponding to each content data when the number of the target content data is not within the preset number range.

A second returning unit configured to, using the adjusted multiple N of the content feature standard deviation corresponding to each content data, return to the ninth calculating unit, which calculates the published content feature threshold corresponding to each content data by using the content feature mean and the content feature standard deviation corresponding to that content data.

Optionally, in a processing apparatus of content data provided in another embodiment of the present disclosure, an adjusting unit, as shown in fig. 13, includes:

a first adjusting unit 1301 configured to, if the number of the target content data is greater than the maximum value of the preset number range, decrease the multiple N of the content feature standard deviation corresponding to each content data.

A second adjusting unit 1302 configured to, if the number of the target content data is smaller than the minimum value of the preset number range, increase the multiple N of the content feature standard deviation corresponding to each content data.

It should be noted that, for the specific working processes of each unit provided in the foregoing embodiments of the present disclosure, reference may be made to the implementation processes of corresponding steps in the foregoing method embodiments, and details are not described here again.

FIG. 14 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment. The electronic device 1400 may be, for example, a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.

Referring to FIG. 14, the electronic device 1400 may include one or more of the following components: a processing component 1402, a memory 1404, a power component 1406, a multimedia component 1408, an audio component 1410, an input/output (I/O) interface 1412, a sensor component 1414, and a communication component 1416.

The processing component 1402 generally controls the overall operation of the electronic device 1400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1402 may include one or more processors 1420 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1402 may include one or more modules that facilitate interaction between the processing component 1402 and other components. For example, the processing component 1402 may include a multimedia module to facilitate interaction between the multimedia component 1408 and the processing component 1402.

The memory 1404 is configured to store various types of data to support operations at the electronic device 1400. Examples of such data include instructions for any application or method operating on the electronic device 1400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1404 may be implemented by any type of volatile or non-volatile storage device or combination of devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 1406 provides power to the various components of the electronic device 1400. The power component 1406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 1400.

The multimedia component 1408 includes a screen that provides an output interface between the electronic device 1400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1408 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external content data when the electronic device 1400 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 1410 is configured to output and/or input audio signals. For example, the audio component 1410 includes a microphone (MIC) configured to receive external audio signals when the electronic device 1400 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 1404 or transmitted via the communication component 1416. In some embodiments, the audio component 1410 further includes a speaker for outputting audio signals.

I/O interface 1412 provides an interface between processing component 1402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 1414 includes one or more sensors for providing status assessments of various aspects of the electronic device 1400. For example, the sensor component 1414 may detect an open/closed state of the electronic device 1400, the relative positioning of components, such as a display and keypad of the electronic device 1400, a change in position of the electronic device 1400 or a component of the electronic device 1400, the presence or absence of user contact with the electronic device 1400, an orientation or acceleration/deceleration of the electronic device 1400, and a change in temperature of the electronic device 1400. The sensor component 1414 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor component 1414 may also include a photosensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 1416 is configured to facilitate wired or wireless communication between the electronic device 1400 and other devices. The electronic device 1400 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 1416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 1400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described content data processing methods.

An embodiment of the present disclosure further provides a computer-readable storage medium. When instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the method of processing content data according to any one of the above embodiments.

Optionally, the storage medium may be a non-transitory computer-readable storage medium, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

Another embodiment of the present disclosure provides a computer program product which, when executed, performs the method of processing content data provided in any one of the above embodiments.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method for processing content data, comprising:

2. The method according to claim 1, wherein after obtaining the plurality of content data and determining the content feature of each of the content data, further comprising:

acquiring attribute information of each content data;

3. The method according to claim 1, wherein the calculating a content feature threshold value based on the content feature of each content data comprises:

4. The method of claim 1, wherein the determining the content characteristics of each of the content data comprises:

determining the feature type under the current service scene;

5. The method according to claim 1, wherein before the obtaining published content data of each of the content data, further comprising:

6. The method of claim 1, wherein determining the published content characteristic threshold for each of the content data comprises:

7. The method according to claim 6, wherein for any of the content data, if the content characteristic of any content data is greater than the published content characteristic threshold of any content data and greater than the content characteristic threshold, after determining that any content data is the target content data, further comprising:

8. The method according to claim 7, wherein the adjusting the multiple N of the standard deviation of the content feature corresponding to each of the content data when the amount of the target content data is not within the preset amount range comprises:

9. An apparatus for processing content data, comprising:

10. The apparatus of claim 9, further comprising:

wherein: the screening unit includes:

11. The apparatus of claim 9, wherein the first computing unit comprises:

12. The apparatus of claim 9, wherein the first obtaining unit comprises:

13. The apparatus of claim 9, further comprising:

14. The apparatus of claim 9, wherein the first determining unit comprises:

15. The apparatus of claim 14, further comprising:

16. The apparatus of claim 15, wherein the adjusting unit comprises:

17. An electronic device, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement a method of processing content data as claimed in any one of claims 1 to 8.

18. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform a method of processing content data according to any one of claims 1 to 8.