Disclosure of Invention
The application aims to provide a community content risk assessment method and device that keep the estimate of the risk proportion index stable even when risk data are very scarce, reduce sampling error, improve accuracy, and avoid the loss of accuracy that occurs when potentially missed risk data are rarely drawn into the sample.
In order to solve the above problems, the present application discloses a community content risk assessment method, comprising:
word segmentation is carried out on the whole content text of the community content, word segmentation texts are obtained, and each word segmentation text is converted into a text vector;
clustering each text vector to construct clusters, wherein the clusters contain word segmentation texts corresponding to the text vectors;
determining the number of word segmentation text samples corresponding to each cluster, and in each cluster, performing word segmentation text sampling according to the corresponding number of word segmentation text samples;
judging whether the word segmentation text of each sample is risk content in each cluster, and counting the number of the word segmentation texts which are determined to be risk content in the word segmentation texts of the samples of the clusters;
and determining the risk recall index of the community content according to the number of the word segmentation texts marked as the risk content in each cluster.
In a preferred embodiment, a pre-trained TextCNN classification model is used in the step of converting each segmented text into a text vector.
In a preferred embodiment, in the step of converting each segmented text into a text vector, any one of the following preset models is used: LSTM, word2vec, doc2vec.
In a preferred embodiment, in the step of clustering the text vectors to construct clusters, any one of the following algorithms is used: the K-means algorithm, the K-MEDOIDS algorithm, the CLARANS algorithm.
In a preferred embodiment, in the step of determining the number of segmented text samples corresponding to each cluster, the number of segmented text samples is determined using any one of the following methods: the proportional allocation method, the non-proportional allocation method, the Neyman method.
In a preferred embodiment, in the step of determining the number of segmented text samples corresponding to each cluster, if the proportion of the number of segmented texts contained in the cluster to the total number of segmented texts corresponding to the entire content text of the community content is lower than a preset threshold, determining the number of segmented text samples corresponding to the cluster by adopting a non-proportional allocation method.
In a preferred embodiment, before the step of segmenting the whole content text to obtain segmented text, the method further comprises:
the whole content text is preprocessed.
The application also discloses a community content risk assessment device, which comprises:
the text vector module is used for word segmentation of the whole content text of the community content to obtain word segmentation texts, and converting each word segmentation text into a text vector;
the clustering module is used for clustering each text vector to construct clusters, wherein the clusters contain word segmentation texts corresponding to the text vectors;
the sampling module is used for determining the word segmentation text sampling number corresponding to each cluster, and in each cluster, the word segmentation text sampling is carried out according to the corresponding word segmentation text sampling number;
the risk content statistics module is used for judging whether the word segmentation text of each sample is the risk content in each cluster, and counting the number of the word segmentation texts which are determined to be the risk content in the word segmentation texts of the samples of the clusters;
and the risk recall index module is used for determining the risk recall index of the community content according to the number of the word segmentation texts marked as the risk content in each cluster.
The application also discloses community content risk assessment equipment, which comprises:
a memory for storing computer executable instructions; and
a processor for implementing the steps in the above method when executing computer executable instructions.
The application also discloses a computer readable storage medium, wherein the computer readable storage medium stores computer executable instructions which when executed by a processor implement the steps in the method.
In the embodiment of the application, firstly, the text of the community content to be evaluated is segmented, then the segmented text is converted into the text vector, the text vector is clustered, the segmented text corresponding to the text vector forms clusters on the semantic level, then the corresponding segmented text sampling number is determined for each cluster, and the risk recall index of the community content is evaluated according to the number of the segmented text of the risk content in the sampled segmented text.
The numerous technical features described in the specification of the present application are distributed among the various technical solutions, and listing all possible combinations of these technical features (i.e., all technical solutions) would make the specification excessively long. To avoid this problem, the technical features disclosed in the above summary of the application, in the following embodiments and examples, and in the drawings may be freely combined with each other to constitute various new technical solutions (all of which are regarded as having been described in this specification), unless such a combination of technical features is technically infeasible. For example, suppose that one example discloses features A+B+C and another example discloses features A+B+D+E, that features C and D are equivalent technical means performing the same function, so that only one of them can be used at a time, and that feature E can technically be combined with feature C. Then the solution A+B+C+D should not be regarded as having been described, because it is technically infeasible, whereas the solution A+B+C+E should be regarded as having been described.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, it will be understood by those skilled in the art that the claimed application may be practiced without these specific details and with various changes and modifications from the embodiments that follow.
Description of the partial concepts:
UGC: User Generated Content.
Content risk: the risk that content contains administratively sensitive language, pornography, illegal advertising or promotion, and the like.
Risk content: content that carries such risks, for example content containing administratively sensitive language, pornography, or illegal advertising or promotion.
Risk ratio: the ratio of the number of risk contents in the community content to the total number of contents in the community content, namely: risk ratio = number of risk contents / total number of contents.
Cleanliness: a measure of how clean the community content is, namely: cleanliness = 1 - risk ratio.
Sampling ratio: when n items are randomly drawn from N items to form a sampling set, the sampling ratio is n/N.
Content vector: a representation of a segmented text in the form of a vector.
Hierarchical sampling: also called stratified sampling or type sampling, a method of randomly drawing samples (individuals) in a prescribed ratio from the different layers of a population that can be divided into sub-populations (layers). In the present application, the segmented texts in each cluster are sampled, the clusters being formed by clustering the text vectors.
TextCNN is an algorithm that classifies text using convolutional neural networks.
Point estimation: also known as fixed-value estimation, in which the actual sample index value is used as the estimate of the population parameter. Point estimation is simple, but it generally does not take sampling error and reliability into account.
The following outlines some of the innovative features of the present application:
According to the application, in the specific scenario of community content risk estimation, the content texts are first clustered into clusters according to their characteristics, and the segmented texts in each cluster are then sampled and evaluated. The number of segmented text samples for each cluster can be determined according to the specific situation, so that in scenarios with an extremely low risk ratio and a low sampling ratio the stability and representativeness of the risk recall index estimate based on the sampling set are effectively improved, accuracy is improved, and potentially missed content risks are easier to find.
Further, converting the segmented texts into text vectors, clustering the text vectors into clusters, and then sampling and evaluating the segmented texts in each cluster, i.e., hierarchical sampling, effectively improves the representativeness of the sampling set and reduces the sampling error. Specifically, conventional hierarchical sampling divides the population into layers based on attributes or classifications of the sampled objects, for example by gender or age when sampling a crowd. In the present application, however, the community content is unstructured data, and no objective attribute is available for direct layering. Therefore, the application clusters the segmented texts at the semantic level so as to separate the community content into subclasses with semantic commonality for subsequent hierarchical sampling.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The first embodiment of the application relates to a community content risk assessment method, the flow of which is shown in fig. 1, and the method comprises the following steps:
Step 101: Text vector conversion
Specifically, word segmentation is performed on the whole content text of the community content to obtain segmented texts, and each segmented text is converted into a text vector.
It should be noted that text preprocessing mainly cleans or converts content such as traditional Chinese characters, special symbols, emoji, and Chinese numerals that appears in the whole content text of the community content.
It should be noted that word segmentation applies a word segmentation algorithm to the preprocessed character string to obtain the segmented texts.
For example, segmenting "the weather today is really good" yields "today / weather / really good", where "today", "weather", and "really good" are the segmented texts.
The advantage is that community content often lacks normalization and standardization and is therefore inefficient to process; cleaning and segmenting the whole content text of the community content provides segmented texts that are easy to process in the subsequent risk content evaluation.
It should be noted that there are many ways of text preprocessing and word segmentation that exist at present, and the present application is not limited to a specific way.
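As a purely illustrative sketch of preprocessing followed by segmentation (the application does not limit either to a specific method), the following Python normalizes the text, replaces symbols and emoji with spaces, and maps single Chinese numeral characters to digits. The names `CN_DIGITS`, `preprocess`, and `segment` are hypothetical, and whitespace splitting is only a stand-in for a real word segmentation algorithm.

```python
import unicodedata

# Hypothetical map: only single Chinese numeral characters are converted.
# A real system would need context, since these characters also occur in words.
CN_DIGITS = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
             "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}


def preprocess(text: str) -> str:
    """Clean a raw content string before segmentation."""
    # NFKC folds full-width letters and punctuation into their plain forms.
    text = unicodedata.normalize("NFKC", text)
    cleaned = []
    for ch in text:
        if ch in CN_DIGITS:
            cleaned.append(CN_DIGITS[ch])
        elif unicodedata.category(ch)[0] in ("L", "N") or ch.isspace():
            cleaned.append(ch)          # keep letters, digits, whitespace
        else:
            cleaned.append(" ")         # drop emoji, symbols, punctuation
    # Collapse runs of whitespace left behind by the removals.
    return " ".join("".join(cleaned).split())


def segment(text: str) -> list[str]:
    """Stand-in segmenter: splits on whitespace only (an assumption)."""
    return text.split()


print(segment(preprocess("Weather 😀 today!!")))
```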
Specifically, in this embodiment, a pre-trained TextCNN binary classification model is used as the text vector model. The TextCNN binary classification model is a binary classifier for risk content identification, and the output of the last layer of the TextCNN network structure can be used as the content vector of the input text.
It should be noted that the text vector conversion method of the present application is not limited to the TextCNN binary classification model, and other models may be used instead, such as LSTM, word2vec, doc2vec, etc.
Step 102: text vector clustering
Specifically, in this step, the obtained text vectors are clustered to construct clusters, where each cluster contains the segmented texts corresponding to its text vectors.
Specifically, in this embodiment, the k-means algorithm is selected and the number of output classes is fixed at K, that is, K clusters are constructed.
It should be noted that the text vector clustering method of the present application is not limited to the k-means algorithm described above, and other clustering algorithms may be used instead.
For example, K-MEDOIDS, CLARANS, and the like.
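A minimal sketch of the clustering step, assuming the text vectors are already available as numeric tuples. This is a plain k-means written with the standard library only; the function name `kmeans`, the naive initialization from the first k vectors, and the fixed iteration count are illustrative assumptions, not the embodiment's actual implementation.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster index per input vector."""
    # Naive initialization (an assumption): use the first k vectors as centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy vectors standing in for text vectors of six segmented texts.
toy_vectors = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
print(kmeans(toy_vectors, k=2))
```

Each cluster index then plays the role of the layer to which the corresponding segmented text belongs.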
The advantage is that, after the whole content text of the community content is preprocessed and segmented and the resulting text vectors are clustered, each segmented text effectively acquires a new attribute. Before this processing, the community content is unstructured data with no objective attributes, so the segmented texts cannot be effectively grouped and subsequently evaluated.
In other words, through the steps, the whole content text of the community content is sub-divided at the semantic level, a new attribute based on the semantic level is given to each word segmentation text, and correspondingly, if the attributes of the text vectors of the word segmentation texts are the same, the text vectors of the word segmentation texts show that the text vectors have certain commonality in terms of semantics.
Furthermore, through the steps, the sub-class separation effect of each word segmentation text of the whole content text of the community content is improved, so that a better precondition is provided for improving the final hierarchical sampling effect and the representativeness of the sampling sample.
Step 103: word segmentation text sampling
Specifically, in this step, the number of segmented text samples corresponding to each cluster is determined, and in each cluster, the segmented text samples are performed according to the corresponding number of segmented text samples.
Specifically, in this embodiment, among clusters formed by clustering the text vectors obtained by conversion, the number of segmented text samples corresponding to each cluster is determined and sampled, which may be referred to as "hierarchical sampling".
Further, the hierarchical sampling method, also called type sampling method, is a method of randomly extracting samples (individuals) from different layers in a prescribed ratio from a population that can be divided into different sub-populations (or referred to as layers).
The advantage is that the sampled segmented texts are more representative and the sampling error is smaller.
For example, the process of hierarchical sampling includes: the total units are divided into two or more complete groups (e.g., male and female) that are independent of each other, and then simply randomly sampled from the two or more groups, the sampled data being independent of each other. It can be seen that in hierarchical sampling, the units of population are grouped by primary labels, and there is a correlation between the labels of the groupings and the overall characteristics of interest. Further, grouping and sampling corresponds to clustering and sampling in the present embodiment.
It will be appreciated that after the cluster is constructed in the previous step, the original word segmentation text can be assigned to a certain cluster, so that the cluster is a concept of a layer in a hierarchical sampling method, and the hierarchical sampling can be used to obtain a sampling set of the word segmentation text of each cluster.
Further, since the effectiveness of the hierarchical sampling is affected by intra-layer variation, that is, when the variation of samples of the same layer is smaller (variation herein may be understood as the subject of the content, risk ratio, etc.), the effectiveness of the hierarchical sampling is better. In the embodiment of the application, the 'layer' separation is carried out from the semantics of the content by a clustering mode, so that the intra-layer variability can be reduced as much as possible, and the layered sampling effect is improved.
Specifically, in this step, the specific method for determining the number of word segmentation text samples of each cluster may be as follows:
First kind: proportional allocation.
Specifically, the ratio of the number of segmented text samples of each cluster to the number of segmented texts in that cluster is equal to the ratio of the total number of samples to the number of all segmented texts corresponding to the whole content text of the community content.
For example, if the total number of segmented text samples is n=50 and the total number of segmented texts is N=500, the sampling ratio is n/N=0.1, and each layer determines its number of samples according to this ratio.
Second kind: non-proportional allocation.
Specifically, when the proportion of the segmented texts of a certain cluster among all segmented texts corresponding to the whole content text of the community content is too small, that is, lower than a preset threshold, the proportion of that cluster's segmented text samples in the total number of samples can be increased appropriately through manual setting, so that the semantic features of the cluster are reflected in the samples.
Third kind: the Neyman method.
Specifically, the number of segmented text samples of each cluster is proportional to the product of the total number of segmented texts in that cluster and the standard deviation within that cluster.
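The proportional and standard-deviation-weighted allocations described above can be sketched as follows. The function names and the use of `round` are assumptions; because of rounding, the per-cluster counts may not sum exactly to the requested total, and the per-cluster standard deviations would have to be estimated in practice.

```python
def proportional_allocation(cluster_sizes, total_samples):
    """Each cluster is sampled at the same ratio total_samples / sum(sizes)."""
    total = sum(cluster_sizes)
    return [round(total_samples * size / total) for size in cluster_sizes]


def neyman_allocation(cluster_sizes, cluster_stds, total_samples):
    """Sample count proportional to (cluster size x within-cluster std)."""
    weights = [size * std for size, std in zip(cluster_sizes, cluster_stds)]
    weight_sum = sum(weights)
    return [round(total_samples * w / weight_sum) for w in weights]


# Hypothetical clusters: a large homogeneous one and a small variable one.
sizes = [400, 100]
stds = [0.1, 0.4]
print(proportional_allocation(sizes, 50))   # [40, 10]
print(neyman_allocation(sizes, stds, 50))   # [25, 25]
```

In this toy case the small but more variable cluster receives more samples under the third method than under the first.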
In the present embodiment, the "non-proportional allocation method" sampling method of the second type described above is used.
The advantage is as follows. In some special scenarios, the black data, i.e., the segmented texts that are risk content, are often few in number. If sampling follows the first kind, proportional allocation, the black data are unlikely to be drawn, which affects the stability and accuracy of the evaluation result. The non-proportional allocation method guarantees a certain number of segmented text samples for such a cluster, so that the segmented texts of all clusters contained in the whole content text of the community content are estimated in a more balanced way, and the clusters in which the black data are located are not left unsampled merely because their data volume is small.
For example, a sampling rule may be set as follows, where N is the total number of segmented texts in a cluster, n is the number of segmented text samples, and sp is the sampling ratio:
1) when N < 100, n = N
2) when 100 <= N < 1000, n = 100
3) when 1000 <= N < 10000, sp = 5%
4) when 10000 <= N < 500000, sp = 1%
5) when N >= 500000, n = 5000
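The example rule above can be sketched as a single function. Reading the five rules in order (so that each threshold implies the lower bound of the next interval) and rounding the ratio-based sizes down to an integer are assumptions; `sample_size` is a hypothetical name.

```python
def sample_size(total: int) -> int:
    """Return n, the number of texts to sample from a cluster of N = total texts."""
    if total < 100:
        return total            # rule 1: sample the whole cluster
    if total < 1000:
        return 100              # rule 2: fixed 100 samples
    if total < 10000:
        return total // 20      # rule 3: sp = 5%, rounded down
    if total < 500000:
        return total // 100     # rule 4: sp = 1%, rounded down
    return 5000                 # rule 5: cap at 5000 samples


for n_texts in (50, 999, 1000, 10000, 500000):
    print(n_texts, sample_size(n_texts))
```

Small clusters are therefore sampled far more densely than large ones, which is the point of the non-proportional allocation.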
It should be noted that, in the cluster-based hierarchical sampling of the present application, the method for determining the number of segmented text samples of each cluster is not limited to the above manner, and other allocation schemes may be used instead, which will not be described herein.
Step 105: Counting the segmented texts that are risk content
Specifically, in this step, in each cluster, it is determined whether or not the segmented text of each sample is a risk content, and the number of segmented texts determined as the risk content among the segmented texts of the samples of this cluster is counted.
It should be noted that, the specific manner of determining whether the sampled word segmentation text is the risk content is common knowledge in the art, and will not be described herein.
It should be noted that, because the number of segmented text samples for a cluster containing a small amount of black data is determined in the above sampling step by the non-proportional allocation method, even if the total number of segmented texts in the cluster where the black data is located is small, for example less than 1000, sampling can follow manually set criteria, for example the rule exemplified in the sampling step above: n=100 when N<1000, or n=N when N<100. The segmented texts of the cluster are thereby sampled more reasonably, each sampled segmented text is judged as to whether it is black data, i.e., risk content, and the number of segmented texts determined to be risk content in the cluster is counted.
The method has the advantages that the method is easier to find even if the data volume is smaller for the risk content which is not easy to find, so that the community content risk assessment result is more stable and accurate.
Step 106: determining risk recall indicators
Specifically, in this step, the risk recall index of the community content is determined according to the number r_i of segmented texts determined to be risk content in each cluster.
Specifically, in this step, the calculation formula of the risk recall index of the community content uses the following quantities:
K represents the number of clusters obtained in the above step of clustering the text vectors to construct clusters.
N_i represents the number of segmented texts contained in the i-th cluster.
n_i represents the number of segmented text samples determined for the i-th cluster in the step of determining the number of segmented text samples corresponding to each cluster, that is, the sample size of that cluster.
r_i represents the number of segmented texts in the sample of the i-th cluster that are marked as risk content after labeling, that is, the number of segmented texts determined to be risk content.
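One form consistent with the symbol definitions above is the standard stratified estimator of a proportion, in which the sampled risk rate of each cluster is weighted by that cluster's share of all segmented texts. This is an assumption rather than a formula stated in the text, and the symbol \hat{R} is introduced here only for illustration:

```latex
\hat{R} \;=\; \frac{\displaystyle\sum_{i=1}^{K} N_i \,\frac{r_i}{n_i}}{\displaystyle\sum_{i=1}^{K} N_i}
```

Under this form, a cluster with N_i = 3 texts out of 30 in total, from which n_i = 1 text is drawn and found to be risk content (r_i = 1), contributes 3/30 = 10% to the estimate.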
Thus, the present embodiment obtains the risk assessment result of this community content, i.e., the risk recall index.
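A minimal sketch of how such an index could be computed from the per-cluster counts, assuming a size-weighted (stratified) estimate in which each cluster's sampled risk rate is scaled by the cluster size. The function name `estimate_risk_ratio` and the choice of estimator are assumptions.

```python
def estimate_risk_ratio(cluster_sizes, sample_sizes, risk_counts):
    """Size-weighted estimate from N_i (sizes), n_i (samples), r_i (risk hits)."""
    total = sum(cluster_sizes)
    weighted_risk = 0.0
    for big_n, small_n, r in zip(cluster_sizes, sample_sizes, risk_counts):
        if small_n > 0:                      # skip clusters with no sample
            weighted_risk += big_n * r / small_n
    return weighted_risk / total


# Hypothetical counts: four clusters, one text sampled from each,
# and a risk hit only in the smallest cluster.
print(estimate_risk_ratio([9, 9, 9, 3], [1, 1, 1, 1], [0, 0, 0, 1]))
```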
The following example compares the effectiveness of the random sampling method with that of the cluster-based hierarchical sampling estimation method of the present application.
Fig. 3 shows a two-dimensional mapping of the segmented texts, in which the left graph is the original form and the right graph is the clustering result. The points in the left graph of fig. 3 are the mapping of UGC content in a community onto a two-dimensional plane; each point represents one content (30 points in total), light points represent normal data, and dark points represent risk data.
From the graph, the actual risk ratio is calculated to be about 7% (1/15).
Suppose that 4 text vectors are to be drawn from the 30 points to construct a sampling set, and that the risk ratio of the community content as a whole is to be estimated by calculating the risk ratio of the sampling set.
First, with the random sampling method, the possible estimates are shown in Table I. With a probability of about 75%, all 4 drawn points are light (no risk content), that is, the potential missed risk is ignored. With a probability of 23.9%, one dark point (risk content) is drawn, and the risk ratio estimate jumps directly from 0 to 25%, an overestimate relative to the true value of about 7%.
Table I: Probability of each outcome and the corresponding risk ratio estimate under random sampling
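The probabilities quoted for random sampling can be checked with a short hypergeometric calculation. The sketch assumes 2 risk points among the 30, which is what the stated ratio of 1/15 implies; the function name `random_sampling_outcomes` is hypothetical.

```python
from math import comb


def random_sampling_outcomes(population=30, risky=2, draws=4):
    """Map k (risk points drawn) to (probability, risk ratio estimate k/draws)."""
    total_ways = comb(population, draws)
    outcomes = {}
    for k in range(min(risky, draws) + 1):
        ways = comb(risky, k) * comb(population - risky, draws - k)
        outcomes[k] = (ways / total_ways, k / draws)
    return outcomes


for k, (prob, estimate) in random_sampling_outcomes().items():
    print(f"{k} risk point(s): probability {prob:.3f}, estimate {estimate:.0%}")
# 0 risk point(s): probability 0.747, estimate 0%
# 1 risk point(s): probability 0.239, estimate 25%
# 2 risk point(s): probability 0.014, estimate 50%
```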
Next, the sample estimation method proposed by the present application is used.
It is assumed that the word segmentation text can be divided into 4 clusters as shown in the right diagram in fig. 3 by clustering, and then a non-proportional allocation method is adopted to extract 4 word segmentation texts in total and ensure that at least one word segmentation text is extracted in each cluster, namely one data is extracted in each cluster.
The final risk assessment result is affected only by the data drawn from the cluster at the lower right corner, and the possible risk assessment results are shown in Table II; the probability of obtaining an estimated risk ratio of 0% is reduced to 33.3%.
It is noted that in practical applications, it is desirable to draw risk data in the sample, even if the risk ratio is overestimated, as opposed to ignoring the risk of omission.
Further, there is a 66.7% probability that an estimated overall risk ratio of 10% is obtained.
It can be seen that, compared with the existing random sampling mode, the estimate of the present application is closer to the true value, is more stable (the most likely outcome has a higher probability), and is more accurate.
Table II: Probability of each outcome and the corresponding risk ratio estimate under cluster-based sampling (where the single cluster refers to the cluster at the bottom right of the figure)
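The cluster-based figures can be reproduced under two assumptions that are inferred from the quoted 66.7% and 10% values rather than stated in the text: the lower-right cluster contains 3 of the 30 points, 2 of them risk content, and the overall estimate weights each cluster's sampled risk rate by the cluster size. The function name `stratified_outcomes` is hypothetical.

```python
from fractions import Fraction


def stratified_outcomes(total=30, cluster_size=3, cluster_risky=2):
    """Map each possible overall estimate to its probability.

    One text is drawn from every cluster; only the cluster that holds
    the risk content can change the estimate.
    """
    p_risk = Fraction(cluster_risky, cluster_size)
    # Size-weighted estimate with n_i = r_i = 1 in the risky cluster.
    estimate_if_risk = Fraction(cluster_size, total)
    return {Fraction(0): 1 - p_risk, estimate_if_risk: p_risk}


for estimate, prob in stratified_outcomes().items():
    print(f"estimate {float(estimate):.1%} with probability {float(prob):.1%}")
# estimate 0.0% with probability 33.3%
# estimate 10.0% with probability 66.7%
```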
Obviously, compared with random sampling, the method provided by the present application is better at finding potential missed risks in scenarios with an extremely low risk ratio and a low sampling ratio, and obtains a more representative sampling set and a more stable index estimate.
It should be noted that the effect of the method of the present application in actual application scenarios is strongly affected by the data distribution of the specific service scenario and by the performance of the clustering algorithm.
In general, the application provides a sampling method based on text vector clustering aiming at improving the stability and the accuracy of community content risk assessment indexes (risk ratio). Firstly, converting word segmentation texts of overall content texts of community contents to obtain text vectors, clustering the text vectors, and then calculating and estimating risk recall indexes of the community text contents by using a hierarchical sampling method based on clusters generated by clustering, wherein the clusters contain word segmentation texts corresponding to the text vectors.
The method has the advantages that the stability and the representativeness of the risk recall index estimated value based on the sampling set can be effectively improved under the scene of extremely low risk ratio and low sampling ratio, and meanwhile, the potential risk of missing content is easier to find, and the accuracy of the evaluation result is improved.
A second embodiment of the present application relates to a community content risk assessment apparatus having a structure as shown in fig. 2, the community content risk assessment apparatus comprising: the system comprises a text vector module, a clustering module, a sampling module, a risk content statistics module and a risk recall index module.
The modules are described in detail below:
the text vector module is used for word segmentation of the whole content text of the community content to obtain word segmentation texts, and converting each word segmentation text into a text vector;
the clustering module is used for clustering each text vector to construct clusters, wherein the clusters contain word segmentation texts corresponding to the text vectors;
the sampling module is used for determining the word segmentation text sampling number corresponding to each cluster, and in each cluster, the word segmentation text sampling is carried out according to the corresponding word segmentation text sampling number;
the risk content statistics module is used for judging whether the word segmentation text of each sample is the risk content in each cluster, and counting the number of the word segmentation texts which are determined to be the risk content in the word segmentation texts of the samples of the clusters;
and the risk recall index module is used for determining the risk recall index of the community content according to the number of the word segmentation texts marked as the risk content in each cluster.
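The cooperation of the five modules can be sketched as one function that receives each module's behavior as a callable. All names (`assess`, `vectorize`, `cluster`, `sample_size`, `is_risk`) are hypothetical, and the final size-weighted estimate is an assumed form of the index, not a formula taken from the text.

```python
import random
from collections import defaultdict


def assess(texts, vectorize, cluster, sample_size, is_risk, seed=0):
    """Wire the five modules together and return the estimated index."""
    rng = random.Random(seed)
    vectors = [vectorize(t) for t in texts]        # text vector module
    labels = cluster(vectors)                      # clustering module
    clusters = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[label].append(text)

    weighted_risk = 0.0
    for members in clusters.values():
        # Sampling module: never draw more than the cluster holds, nor zero.
        n_i = max(1, min(sample_size(len(members)), len(members)))
        sample = rng.sample(members, n_i)
        # Risk content statistics module: count sampled risk texts.
        r_i = sum(1 for t in sample if is_risk(t))
        # Risk recall index module: accumulate the size-weighted risk rate.
        weighted_risk += len(members) * r_i / n_i
    return weighted_risk / len(texts)


# Toy stand-ins for the real modules.
toy_texts = ["ok1", "ok2", "bad1", "bad2"]
result = assess(
    toy_texts,
    vectorize=lambda t: (1.0,) if t.startswith("bad") else (0.0,),
    cluster=lambda vs: [int(v[0]) for v in vs],
    sample_size=lambda n: n,
    is_risk=lambda t: t.startswith("bad"),
)
print(result)  # 0.5
```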
The first embodiment is a method embodiment corresponding to the present embodiment, and the technical details in the first embodiment can be applied to the present embodiment, and the technical details in the present embodiment can also be applied to the first embodiment.
It should be understood by those skilled in the art that the functions implemented by the modules shown in the above embodiment of the community content risk assessment apparatus may be understood with reference to the foregoing description of the community content risk assessment method. The functions of the modules shown in the above embodiment of the community content risk assessment apparatus may be implemented by a program (executable instructions) running on a processor, or by a specific logic circuit. If the community content risk assessment device according to the embodiment of the present application is implemented in the form of a software functional module and sold or used as a stand-alone product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing over the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program code. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application also provide a computer storage medium having stored therein computer executable instructions which when executed by a processor implement the method embodiments of the present application.
In addition, the embodiment of the application also provides community content risk assessment equipment, which comprises a memory for storing computer executable instructions and a processor; the processor is configured to implement the steps of the method embodiments described above when executing computer-executable instructions in the memory. The processor may be a central processing unit (Central Processing Unit, abbreviated as "CPU"), other general purpose processors, digital signal processors (Digital Signal Processor, abbreviated as "DSP"), application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as "ASIC"), and the like. The aforementioned memory may be a read-only memory (ROM), a random access memory (random access memory, RAM), a Flash memory (Flash), a hard disk, a solid state disk, or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied in a hardware processor for execution, or may be executed by a combination of hardware and software modules in the processor.
It should be noted that in the present patent application, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. In the present patent application, if it is mentioned that an action is performed according to an element, it means that the action is performed at least according to the element, which covers two cases: the action is performed solely on the basis of the element, and the action is performed on the basis of the element and other elements. Expressions such as "multiple", "multiple times", and "multiple kinds" include two, twice, and two kinds, as well as more than two, more than twice, and more than two kinds.
All references mentioned in this disclosure are to be considered as being included in the disclosure of the application in its entirety so that modifications may be made as necessary. Further, it is understood that various changes or modifications of the present application may be made by those skilled in the art after reading the above disclosure, and such equivalents are intended to fall within the scope of the application as claimed.