CN110046251B - Community content risk assessment method and device - Google Patents

Info

Publication number
CN110046251B
Authority
CN
China
Prior art keywords
text
word segmentation
content
risk
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910221531.6A
Other languages
Chinese (zh)
Other versions
CN110046251A (en
Inventor
赵智源
祝慧佳
周书恒
郭亚
徐陈虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201910221531.6A priority Critical patent/CN110046251B/en
Publication of CN110046251A publication Critical patent/CN110046251A/en
Application granted granted Critical
Publication of CN110046251B publication Critical patent/CN110046251B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of the Internet and discloses a community content risk assessment method and device. The method comprises the following steps: performing word segmentation on the entire content text of the community content, and converting each segmented text into a text vector; clustering the text vectors to construct clusters; determining the number of segmented text samples for each cluster, and sampling segmented texts in each cluster according to that number; judging, in each cluster, whether each sampled segmented text is risk content, and counting the sampled segmented texts of the cluster that are determined to be risk content; and determining the risk recall index of the community content according to the number of segmented texts determined to be risk content in each cluster. The method and device give a more stable estimate of the risk recall index, reduce sampling error, and avoid the loss of accuracy that arises when potentially missed risk data is unlikely to be drawn.

Description

Community content risk assessment method and device
Technical Field
The application relates to the field of Internet, in particular to a community content security assessment technology.
Background
In the daily operation of a public content community, or of a platform carrying a large amount of user-generated content (UGC), assessing the risk of the content displayed by the community is an indispensable step.
Here, risk refers to content such as politically sensitive material, pornography, illegal advertising, and the like.
Generally, the risk ratio is used to measure whether the cleanliness of a community meets requirements. Because a UGC community holds a huge volume of content, however, manually auditing all of it would consume enormous manpower and is not feasible in actual business.
Therefore, the UGC content is randomly sampled, the sampled data is manually labeled to construct a sampling set, and the risk ratio of the whole community is estimated from the risk ratio calculated on the sampling set.
However, in a real business scenario, genuine risk data is very scarce (e.g., <1%), and the estimate is also affected by the sampling strategy (mainly the sampling ratio, which is low in practice). As a result, the risk ratio estimated from a random sampling set is particularly unstable, i.e., the variance of the estimated index is large. For example, if risk data happens to be drawn, the amount of missed risk is overestimated because the sample is small, and the estimated risk ratio is far higher than the actual value. Conversely, if no risk data is drawn, the potential risk is ignored altogether.
The random sampling mentioned above draws samples at a fixed sampling ratio from the content of the entire community, and the index calculated on the sampled data set is used as the estimate of the index for the entire community.
This approach has simple logic and is easy to implement, but it has the following disadvantages: when risk data is very scarce, the estimate of the risk ratio index is unstable, the sampling error is large, and the accuracy is low; moreover, potentially missed risk data is unlikely to be drawn.
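To make the instability concrete, the following worked equations use the standard binomial approximation for simple random sampling. The values p = 0.01 and n = 100 are illustrative assumptions, not figures from the application.

```latex
\hat{p} = \frac{r}{n}, \qquad
\operatorname{Var}(\hat{p}) \approx \frac{p(1-p)}{n}, \qquad
\frac{\sqrt{\operatorname{Var}(\hat{p})}}{p} \approx \sqrt{\frac{1-p}{np}}

% Illustrative values (assumed): p = 0.01, n = 100
\sqrt{\frac{0.99}{100 \times 0.01}} \approx 0.995, \qquad
P(r = 0) = (1-p)^{n} = 0.99^{100} \approx 0.366
```

Under these assumed values the standard error is about as large as the quantity being estimated, and roughly one sample in three contains no risk data at all.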
Disclosure of Invention
The application aims to provide a community content risk assessment method and device that give a more stable estimate of the risk ratio index even when risk data is very scarce, reduce sampling error, improve accuracy, and avoid the loss of accuracy that arises when potentially missed risk data is unlikely to be drawn.
In order to solve the above problems, the present application discloses a community content risk assessment method, comprising:
word segmentation is carried out on the whole content text of the community content, word segmentation texts are obtained, and each word segmentation text is converted into a text vector;
clustering each text vector to construct clusters, wherein the clusters contain word segmentation texts corresponding to the text vectors;
determining the number of word segmentation text samples corresponding to each cluster, and in each cluster, performing word segmentation text sampling according to the corresponding number of word segmentation text samples;
judging whether the word segmentation text of each sample is risk content in each cluster, and counting the number of the word segmentation texts which are determined to be risk content in the word segmentation texts of the samples of the clusters;
and determining the risk recall index of the community content according to the number of the word segmentation texts marked as the risk content in each cluster.
In a preferred embodiment, a pre-trained TextCNN classification model is used in the step of converting each segmented text into a text vector.
In a preferred embodiment, in the step of converting each segmented text into a text vector, any one of the following preset models is used: LSTM, word2vec, doc2vec.
In a preferred embodiment, the step of clustering each text vector to construct clusters uses any one of the following algorithms: the K-means algorithm, the K-MEDOIDS algorithm, the CLARANS algorithm.
In a preferred embodiment, in the step of determining the number of segmented text samples corresponding to each cluster, the number of segmented text samples is determined in any one of the following ways: proportional allocation, non-proportional allocation, Neyman allocation.
In a preferred embodiment, in the step of determining the number of segmented text samples corresponding to each cluster, if the proportion of the number of segmented texts contained in the cluster to the total number of segmented texts corresponding to the entire content text of the community content is lower than a preset threshold, determining the number of segmented text samples corresponding to the cluster by adopting a non-proportional allocation method.
In a preferred embodiment, before the step of segmenting the whole content text to obtain segmented text, the method further comprises:
the whole content text is preprocessed.
The application also discloses a community content risk assessment device, which comprises:
the text vector module is used for word segmentation of the whole content text of the community content to obtain word segmentation texts, and converting each word segmentation text into a text vector;
the clustering module is used for clustering each text vector to construct clusters, wherein the clusters contain word segmentation texts corresponding to the text vectors;
the sampling module is used for determining the word segmentation text sampling number corresponding to each cluster, and in each cluster, the word segmentation text sampling is carried out according to the corresponding word segmentation text sampling number;
the risk content statistics module is used for judging whether the word segmentation text of each sample is the risk content in each cluster, and counting the number of the word segmentation texts which are determined to be the risk content in the word segmentation texts of the samples of the clusters;
and the risk recall index module is used for determining the risk recall index of the community content according to the number of the word segmentation texts marked as the risk content in each cluster.
The application also discloses community content risk assessment equipment, which comprises:
a memory for storing computer executable instructions; and
a processor for implementing the steps in the above method when executing computer executable instructions.
The application also discloses a computer readable storage medium, wherein the computer readable storage medium stores computer executable instructions which when executed by a processor implement the steps in the method.
In the embodiment of the application, firstly, the text of the community content to be evaluated is segmented, then the segmented text is converted into the text vector, the text vector is clustered, the segmented text corresponding to the text vector forms clusters on the semantic level, then the corresponding segmented text sampling number is determined for each cluster, and the risk recall index of the community content is evaluated according to the number of the segmented text of the risk content in the sampled segmented text.
The description of the present application sets out numerous technical features distributed among various technical solutions; listing every possible combination of these features (i.e., every technical solution) would make the description too lengthy. To avoid this, the technical features disclosed in the above summary of the application, in the following embodiments and examples, and in the drawings may be freely combined with one another to constitute various new technical solutions (which are regarded as already described in the present specification), unless such a combination is technically impossible. For example, suppose one example discloses features A+B+C and another example discloses features A+B+D+E, where C and D are equivalent technical means that perform the same function and can only be used as alternatives, not simultaneously, while feature E can technically be combined with feature C. Then the solution A+B+C+D should not be regarded as already described, because it is technically impossible, whereas the solution A+B+C+E should be regarded as already described.
Drawings
FIG. 1 is a flow chart of a community content risk assessment method according to a first embodiment of the present application;
FIG. 2 is a schematic structural diagram of a community content risk assessment apparatus according to a second embodiment of the present application;
FIG. 3 is a schematic diagram of experimental test effects of a community content risk assessment method according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, it will be understood by those skilled in the art that the claimed application may be practiced without these specific details and with various changes and modifications from the embodiments that follow.
Description of the partial concepts:
UGC: User Generated Content.
Content risk: the risk that content includes politically sensitive language, pornography, illegal advertising, and the like.
Risk content: content that carries such risks, for example politically sensitive language, pornographic content, or illegal advertising.
Risk ratio: the ratio of the number of risk contents in the community content to the total number of contents of the community content, namely: risk ratio = risk content / total content.
Cleanliness: a measure of how clean the community content is, namely: cleanliness = 1 - risk ratio.
Sampling ratio: if n items are randomly drawn from N items to form a sampling set, the sampling ratio is n/N.
Content vector: the segmented text represented in the form of a vector.
Hierarchical sampling: also called stratified sampling or type sampling, a method of randomly drawing samples (individuals) from each layer, in a prescribed ratio, out of a population that can be divided into different sub-populations (layers). In the application, the segmented texts in each cluster are sampled based on the clusters formed by clustering the text vectors.
TextCNN: an algorithm that classifies text using convolutional neural networks.
Point estimation: also known as fixed-value estimation, which uses the actual sample index value directly as the estimate of the population parameter. Point estimation is simple and generally does not take sampling error or reliability into account.
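The ratio definitions above can be written out as a minimal Python sketch. The function names are hypothetical and chosen only for illustration; the figures in the usage lines reuse the numbers that appear in the examples of this description (2 risk items among 30, and 50 samples out of 500).

```python
def risk_ratio(risk_count: int, total_count: int) -> float:
    """Risk ratio = risk content / total content."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return risk_count / total_count


def cleanliness(risk_count: int, total_count: int) -> float:
    """Cleanliness = 1 - risk ratio."""
    return 1.0 - risk_ratio(risk_count, total_count)


def sampling_ratio(n: int, N: int) -> float:
    """Sampling ratio n/N: n items drawn from a population of N items."""
    if N <= 0:
        raise ValueError("N must be positive")
    return n / N


# Usage with numbers taken from the examples in this description.
example_risk = risk_ratio(2, 30)           # 1/15
example_clean = cleanliness(2, 30)         # 14/15
example_sampling = sampling_ratio(50, 500)
```

When the sample index is used directly as the population estimate, as in point estimation, `risk_ratio` would simply be applied to the labeled sampling set.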
The following outlines some of the innovative features of the present application:
According to the application, in the specific scenario of community content risk estimation, the content texts are first clustered into clusters according to their characteristics, and the segmented texts in each cluster are then sampled and evaluated. The number of segmented text samples for each cluster can be determined according to the specific situation. This effectively improves the stability and representativeness of the risk recall index estimate based on the sampling set in scenarios with an extremely low risk ratio and a low sampling ratio, improves accuracy, and makes potentially missed content risk easier to find.
Further, converting the segmented texts into text vectors, clustering the text vectors into clusters, and then sampling and evaluating the segmented texts in each cluster, that is, hierarchical sampling, effectively improves the representativeness of the sampling set and reduces sampling error. Conventional hierarchical sampling forms layers based on attributes or classifications of the sampled objects; for example, when sampling a population of people, layers are formed by gender, age, and so on. In the present application, however, community content is unstructured data with no objective attributes that can be used directly for layering. The application therefore clusters the segmented texts at the semantic level to separate the community content into subclasses with semantic commonality for the subsequent hierarchical sampling.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The first embodiment of the application relates to a community content risk assessment method, the flow of which is shown in fig. 1, and the method comprises the following steps:
Step 101: text vector conversion
Specifically, word segmentation is carried out on the whole content text of the community content, word segmentation texts are obtained, and each word segmentation text is converted into a text vector.
It should be noted that text preprocessing mainly cleans or converts content such as traditional Chinese characters, special symbols, emoji, and Chinese numerals appearing in the entire content text of the community content.
Word segmentation applies a word segmentation algorithm to the preprocessed character string to obtain segmented texts.
For example, after word segmentation, "the weather today is really good" becomes "today / weather / really good", where "today", "weather", and "really good" are the segmented texts.
The advantage is that community content often lacks normalization and standards, which makes it inefficient to process directly; cleaning and segmenting the entire content text provides segmented texts that are convenient to process in the subsequent risk content evaluation.
It should be noted that there are many ways of text preprocessing and word segmentation that exist at present, and the present application is not limited to a specific way.
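Since no specific preprocessing or segmentation method is prescribed, the following is only a minimal sketch using the Python standard library. The functions `preprocess` and `segment` are hypothetical names; whitespace splitting stands in for a real word segmentation algorithm, and conversion of traditional Chinese characters or Chinese numerals is not attempted here.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Clean raw community text before segmentation (illustrative only)."""
    # NFKC folds full-width letters and digits into their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace everything that is not a word character or whitespace
    # (punctuation, special symbols, emoji) with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse repeated whitespace and lowercase the result.
    return re.sub(r"\s+", " ", text).strip().lower()


def segment(text: str) -> list[str]:
    """Stand-in segmenter: split the cleaned text on whitespace.

    Chinese text would need a dedicated word segmentation algorithm;
    this placeholder only illustrates where that step sits.
    """
    return preprocess(text).split()


tokens = segment("The weather  today is really good!! \U0001F600")
```

In this sketch the emoji and the punctuation are dropped during cleaning, so `tokens` holds only the lowercase words of the sentence.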
Specifically, in this embodiment, a pre-trained TextCNN binary classification model is used as the vector model. The TextCNN binary classification model is a binary classifier for risk content identification, and the output of the last layer of the TextCNN network structure can be used as the content vector of an input text.
It should be noted that the text vector conversion method of the present application is not limited to the TextCNN binary classification model, and other models may be used instead, such as LSTM, word2vec, doc2vec, etc.
Step 102: text vector clustering
Specifically, in this step, the obtained text vectors are clustered to construct clusters, where each cluster contains the segmented texts corresponding to its text vectors.
Specifically, in this embodiment, the k-means algorithm is selected, and the number of output classes is fixed at N, that is, N clusters are constructed.
It should be noted that the text vector clustering method of the present application is not limited to the k-means algorithm described above, and other clustering algorithms may be used instead.
For example, K-MEDOIDS, CLARANS, and the like.
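As an illustration of the clustering step, here is a small pure-Python k-means sketch. It is not the implementation of the application: the function names, the random initialization from the input points, the seed, and the toy two-dimensional vectors are all assumptions made for this example, and a production system would more likely rely on an established clustering library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iters=100, seed=0):
    """Plain k-means: returns (labels, centers) for a list of vectors."""
    rng = random.Random(seed)
    # Initialize the centers with k distinct input vectors.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each vector goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def build_clusters(texts, labels):
    """Group segmented texts by the cluster label of their text vector."""
    clusters = {}
    for text, lab in zip(texts, labels):
        clusters.setdefault(lab, []).append(text)
    return clusters


# Toy data (assumed): two well-separated groups of 2-D "text vectors".
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
texts = ["t0", "t1", "t2", "t3", "t4", "t5"]
labels, _ = kmeans(vectors, k=2)
clusters = build_clusters(texts, labels)
```

Each resulting cluster keeps the segmented texts themselves, not just the vectors, which is what the later sampling step operates on.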
The advantage is that, once the entire content text of the community has been preprocessed and segmented and the resulting text vectors have been clustered, each segmented text effectively acquires a new attribute. Before this processing, the community content is unstructured data with no objective attributes, so the segmented texts cannot be effectively grouped and subsequently evaluated.
In other words, through the above steps the entire content text of the community content is divided into subclasses at the semantic level, and each segmented text is given a new attribute based on semantics. Accordingly, segmented texts whose text vectors share the same attribute, i.e., fall in the same cluster, have a certain semantic commonality.
Furthermore, the above steps improve how well the segmented texts of the entire content text are separated into subclasses, which provides a better precondition for the final hierarchical sampling and for the representativeness of the samples.
Step 103: word segmentation text sampling
Specifically, in this step, the number of segmented text samples corresponding to each cluster is determined, and in each cluster, the segmented text samples are performed according to the corresponding number of segmented text samples.
Specifically, in this embodiment, among clusters formed by clustering the text vectors obtained by conversion, the number of segmented text samples corresponding to each cluster is determined and sampled, which may be referred to as "hierarchical sampling".
Further, the hierarchical sampling method, also called type sampling method, is a method of randomly extracting samples (individuals) from different layers in a prescribed ratio from a population that can be divided into different sub-populations (or referred to as layers).
The advantage is that the sampled segmented texts are more representative and the sampling error is smaller.
For example, the process of hierarchical sampling includes: dividing the population units into two or more mutually independent and exhaustive groups (e.g., male and female), and then drawing a simple random sample from each group, the samples being independent of each other. In hierarchical sampling, the population units are grouped by a primary label, and that label is correlated with the population characteristic of interest. The grouping and sampling correspond to the clustering and sampling in the present embodiment.
It will be appreciated that after the clusters are constructed in the previous step, every original segmented text is assigned to a cluster, so a cluster plays the role of a layer in the hierarchical sampling method, and hierarchical sampling can be used to obtain a sampling set of segmented texts for each cluster.
Further, the effectiveness of hierarchical sampling is affected by intra-layer variation: the smaller the variation among samples of the same layer (variation here may be understood in terms of content subject, risk ratio, etc.), the better hierarchical sampling works. In the embodiment of the application, the layers are separated according to the semantics of the content by clustering, which reduces intra-layer variation as much as possible and improves the effect of hierarchical sampling.
Specifically, in this step, the specific method for determining the number of word segmentation text samples of each cluster may be as follows:
First kind: proportional allocation.
Specifically, in each cluster the ratio of the number of segmented text samples to the number of segmented texts in that cluster equals the ratio of the total number of samples to the number of all segmented texts corresponding to the entire content text of the community content.
For example, if the total number of segmented text samples is n=50 and the total number of segmented texts is N=500, the sampling ratio is n/N=0.1, and each layer determines its number of samples according to this ratio.
Second kind: non-proportional allocation.
Specifically, when the segmented texts of a certain cluster make up too small a proportion of all segmented texts corresponding to the entire content text of the community content, that is, a proportion lower than a preset threshold, the share of that cluster's samples in the total number of samples can be increased appropriately by manual setting, so that the semantic features of the cluster are reflected in the samples.
Third kind: Neyman allocation.
Specifically, the number of segmented text samples of each cluster is proportional to the product of the total number of segmented texts in that cluster and the standard deviation within that cluster.
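The first and third allocation methods can be sketched as follows. The function names are hypothetical, the per-cluster standard deviations are assumed to be known or estimated beforehand (for example from a pilot sample), and simple rounding is used, so the per-cluster counts may not sum exactly to the requested total in general.

```python
def proportional_allocation(sizes, n_total):
    """Sample count per cluster proportional to cluster size N_i."""
    N = sum(sizes)
    return [round(n_total * N_i / N) for N_i in sizes]


def neyman_allocation(sizes, stds, n_total):
    """Sample count per cluster proportional to N_i * s_i.

    sizes: number of segmented texts N_i in each cluster
    stds:  assumed within-cluster standard deviations s_i
    """
    weights = [N_i * s_i for N_i, s_i in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one cluster needs a non-zero weight")
    return [round(n_total * w / total) for w in weights]


# Two clusters of 100 and 400 texts, 50 samples in total (n/N = 0.1).
prop = proportional_allocation([100, 400], 50)
# Same clusters, but the small cluster is assumed to be far more variable.
ney = neyman_allocation([100, 400], [2.0, 0.5], 50)
```

With these assumed inputs, proportional allocation gives the small cluster 10 of the 50 samples, while Neyman allocation shifts samples toward it because of its larger assumed variability.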
In the present embodiment, the second method, non-proportional allocation, is used.
The advantage is that in some special scenarios the black data, i.e., the segmented texts that are risk content, is often small in volume; if sampling used the first method, proportional allocation, the black data would be unlikely to be drawn, which would harm the stability and accuracy of the evaluation result. Non-proportional allocation guarantees a certain number of segmented text samples for such a cluster, so the segmented texts of all clusters contained in the entire content text of the community content can be estimated in a more balanced way, and the cluster containing the black data is not left unsampled because of its small data volume.
For example, a sampling rule may be set in which the total number of segmented texts in a cluster is N, the number of segmented text samples is n, and the sampling ratio is sp; sampling may then follow these rules:
1) n = N when N < 100
2) n = 100 when 100 <= N < 1000
3) sp = 5% when 1000 <= N < 10000
4) sp = 1% when 10000 <= N < 500000
5) n = 5000 when N >= 500000
It should be noted that, in the cluster-based hierarchical sampling of the present application, the method for determining the number of segmented text samples of each cluster is not limited to the above manner, and other allocation schemes may be used instead, which will not be described herein.
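The example rule above can be expressed as a short Python sketch. The names `cluster_sample_size` and `stratified_sample` are hypothetical; rounding fractional sample counts up is an assumption, since the rule itself does not say how to round, and the fixed seed is only there to make the example repeatable.

```python
import random


def cluster_sample_size(N: int) -> int:
    """Sample count n for a cluster of N segmented texts (example rule)."""
    if N < 100:
        return N                      # sample everything
    if N < 1000:
        return 100
    if N < 10000:
        return (N * 5 + 99) // 100    # sp = 5%, rounded up (assumption)
    if N < 500000:
        return (N + 99) // 100        # sp = 1%, rounded up (assumption)
    return 5000


def stratified_sample(clusters, seed=0):
    """Draw a simple random sample inside every cluster.

    clusters: dict mapping a cluster id to its list of segmented texts
    """
    rng = random.Random(seed)
    return {
        cid: rng.sample(items, cluster_sample_size(len(items)))
        for cid, items in clusters.items()
    }


# Usage with two toy clusters of 50 and 500 placeholder texts.
toy = {"small": list(range(50)), "medium": list(range(500))}
samples = stratified_sample(toy)
```

Under this rule a small cluster is sampled in full, which is what keeps rare risk content from being skipped.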
Step 105: counting the segmented texts that are risk content
Specifically, in this step, in each cluster, it is determined whether or not the segmented text of each sample is a risk content, and the number of segmented texts determined as the risk content among the segmented texts of the samples of this cluster is counted.
It should be noted that, the specific manner of determining whether the sampled word segmentation text is the risk content is common knowledge in the art, and will not be described herein.
Note that, because the sampling step above uses non-proportional allocation to determine the number of segmented text samples for a cluster of black data with a small data volume, in this step, even if the total number of segmented texts in the cluster containing the black data is small, for example less than 1000, manually set criteria can be followed. For example, under the rule exemplified in the sampling step, n=100 when N<1000, or n=N when N<100. The segmented texts of the cluster are thus sampled more reasonably, each sampled segmented text is judged as to whether it is black data, i.e., risk content, and the number of segmented texts determined to be risk content in the cluster is counted.
The advantage is that risk content that is otherwise hard to find is found more easily even when its data volume is small, so the community content risk assessment result is more stable and accurate.
Step 106: determining risk recall indicators
Specifically, in this step, the risk recall index of the community content is determined according to the number r_i of segmented texts determined to be risk content in each cluster.
Specifically, in this step, the risk recall index of the community content is calculated as follows:
$$\text{risk recall index} = \frac{\sum_{i=1}^{K} N_i \cdot \dfrac{r_i}{n_i}}{\sum_{i=1}^{K} N_i}$$
where K represents the number of clusters obtained in the above step of clustering the text vectors to construct clusters;
N_i represents the number of segmented texts contained in the i-th cluster;
n_i represents the number of segmented text samples determined for the i-th cluster in the step of determining the number of segmented text samples corresponding to each cluster, that is, the number of sampled units;
r_i represents the number of segmented texts in the i-th cluster marked as risk content after labeling, that is, the number of segmented texts determined to be risk content.
Thus, the present embodiment obtains the risk assessment result of this community content, i.e., the risk recall index.
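A minimal sketch of how the index might be computed from the symbols K, N_i, n_i and r_i defined above. It reads the index as a size-weighted average of the per-cluster sampled risk rates r_i/n_i, which is the usual stratified estimator; this reading is an assumption, although it reproduces the 10% figure in the worked example of this description. The function name and the cluster sizes in the usage line are hypothetical.

```python
def risk_recall_index(N, n, r):
    """Size-weighted estimate of the overall risk ratio.

    N: segmented texts per cluster (N_i)
    n: sampled segmented texts per cluster (n_i)
    r: sampled texts judged to be risk content per cluster (r_i)
    """
    if not (len(N) == len(n) == len(r)):
        raise ValueError("N, n and r must have one entry per cluster")
    total = sum(N)
    if total == 0:
        raise ValueError("the clusters contain no segmented texts")
    # Each cluster contributes its estimated number of risk texts,
    # N_i * r_i / n_i; clusters without samples contribute nothing.
    estimated_risk = sum(
        N_i * r_i / n_i for N_i, n_i, r_i in zip(N, n, r) if n_i > 0
    )
    return estimated_risk / total


# Assumed layout: 30 texts in 4 clusters, one sample per cluster,
# and only the sample from the last (3-text) cluster is risk content.
estimate = risk_recall_index([9, 9, 9, 3], [1, 1, 1, 1], [0, 0, 0, 1])
```

Weighting by N_i rather than by n_i is what keeps an oversampled small cluster from inflating the overall estimate.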
The following example compares the effectiveness of simple random sampling with that of the cluster-based stratified sampling estimation method of the present application.
Fig. 3 shows a two-dimensional mapping of the segmented texts: the left panel shows the original data and the right panel shows the clustering result. Each point in the left panel of Fig. 3 is the projection of one piece of UGC content in the community onto a two-dimensional plane (30 points in total); light points represent normal data and dark points represent risk data.
From the figure, the actual risk ratio is about 7% (2/30 = 1/15).
Suppose that 4 text vectors are to be drawn from the 30 points to construct a sample set, and that the risk ratio of the community content as a whole is to be estimated from the risk ratio of the sample set.
First, with random sampling, the possible estimates are shown in Table 1. With a probability of about 75%, all 4 sampled points are light-colored (no risk content), so the potential missed risk is overlooked entirely. With a probability of 23.9%, one dark point (risk content) is drawn, and the risk ratio estimate jumps from 0 to 25%, which overestimates the risk relative to the true value of about 7%.
Table 1: Probability of each outcome and the corresponding risk ratio estimate under random sampling
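The probabilities quoted for random sampling follow from the hypergeometric distribution. The sketch below recomputes them, assuming 30 points of which 2 are risk content (the 1/15 ratio stated above) and a sample of 4 drawn without replacement. The function name is illustrative.

```python
from fractions import Fraction
from math import comb


def random_sampling_outcomes(population, risk_items, sample_size):
    """Return {risk items drawn: (probability, estimated risk ratio)} for
    simple random sampling without replacement (hypergeometric model)."""
    total_ways = comb(population, sample_size)
    outcomes = {}
    for k in range(min(risk_items, sample_size) + 1):
        # Ways to pick k risk items and (sample_size - k) normal items.
        ways = comb(risk_items, k) * comb(population - risk_items, sample_size - k)
        outcomes[k] = (Fraction(ways, total_ways), Fraction(k, sample_size))
    return outcomes


for k, (prob, estimate) in random_sampling_outcomes(30, 2, 4).items():
    print(f"{k} risk point(s): probability {float(prob):.1%}, estimate {float(estimate):.0%}")
```

Under these assumptions the probabilities are about 74.7% for no risk point (estimate 0%), 23.9% for one risk point (estimate 25%), and 1.4% for two risk points (estimate 50%), which agrees with the figures in the text.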
Next, the sampling estimation method proposed by the present application is applied.
Assume that clustering divides the segmented texts into 4 clusters, as shown in the right panel of Fig. 3. A non-proportional allocation method is then used to draw 4 segmented texts in total while ensuring that at least one segmented text is drawn from each cluster; that is, exactly one item is drawn from each cluster.
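One way to picture the allocation step is the sketch below. It is an assumption-laden illustration, not the patent's procedure: it allocates samples proportionally to cluster size and applies a minimum of one sample to any cluster whose share of the population falls below a threshold, which is one reading of the threshold rule in claim 6. The threshold value, the function name, and the cluster sizes 9, 9, 9, and 3 are hypothetical.

```python
def allocate_samples(cluster_sizes, total_samples, share_threshold=0.2, min_samples=1):
    """Return the number of samples to draw from each cluster.

    Large clusters get a proportional allocation; clusters whose population
    share is below share_threshold get at least min_samples (non-proportional
    allocation). The result may not sum exactly to total_samples.
    """
    population = sum(cluster_sizes)
    if population <= 0:
        raise ValueError("the population must contain at least one text")

    allocation = []
    for size in cluster_sizes:
        share = size / population
        n_i = round(total_samples * share)      # proportional allocation
        if share < share_threshold:
            n_i = max(n_i, min_samples)         # floor for small clusters
        allocation.append(min(n_i, size))       # cannot draw more than exist
    return allocation


print(allocate_samples([9, 9, 9, 3], total_samples=4))
```

With the hypothetical sizes above, the small cluster would receive no sample under pure proportional allocation (4 × 3/30 rounds to 0); the floor gives it one, so every cluster contributes exactly one item, as in the example.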
The final risk assessment result is affected only by the item drawn from the cluster in the lower right corner, and the possible results are shown in Table 2. With a probability of 33.3%, the estimated risk ratio is 0%.
Note that in practical applications, drawing risk data into the sample is preferable even if it leads to an overestimated risk ratio, since the alternative is to overlook the missed risk entirely.
Further, with a probability of 66.7%, the estimated overall risk ratio is 10%.
It can be seen that, compared with the existing random sampling approach, the estimate of the present application is closer to the true value, more stable (the outcome near the true value occurs with higher probability), and more accurate.
Table 2: Probability of each outcome and the corresponding risk ratio estimate under cluster-based sampling (where "single cluster" refers to the cluster at the bottom right of the figure)
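The cluster-based outcomes can be enumerated in the same spirit. The sketch below assumes that the lower-right cluster holds 3 points, 2 of them risk content; this composition is inferred from the 66.7% and 10% figures and is not stated explicitly. The sizes of the other three clusters (9 each) are hypothetical and do not affect the result because they contain no risk content.

```python
from fractions import Fraction
from itertools import product


def stratified_outcomes(clusters):
    """clusters: list of (cluster size N_i, number of risk items in it).

    One item is drawn uniformly at random from each cluster (n_i = 1).
    Returns {estimated risk ratio: probability}.
    """
    population = sum(size for size, _ in clusters)
    # For each cluster, the two possible draws as (is_risk, probability).
    per_cluster = [
        [(1, Fraction(risk, size)), (0, Fraction(size - risk, size))]
        for size, risk in clusters
    ]
    distribution = {}
    for draws in product(*per_cluster):
        prob = Fraction(1)
        estimate = Fraction(0)
        for (size, _), (is_risk, p) in zip(clusters, draws):
            prob *= p
            # With n_i = 1, r_i / n_i is simply 0 or 1.
            estimate += Fraction(size, population) * is_risk
        if prob:
            distribution[estimate] = distribution.get(estimate, 0) + prob
    return distribution


clusters = [(9, 0), (9, 0), (9, 0), (3, 2)]
for estimate, prob in sorted(stratified_outcomes(clusters).items()):
    print(f"estimate {float(estimate):.0%} with probability {float(prob):.1%}")
```

Under these assumptions the enumeration yields an estimate of 0% with probability 1/3 and an estimate of 10% with probability 2/3.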
Clearly, compared with random sampling, the method provided by the present application is better at discovering potentially missed risk in scenarios with an extremely low risk ratio and a low sampling ratio, and it yields a more representative sample set and a more stable index estimate.
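The stability claim can be checked numerically for this toy example by comparing the mean and standard deviation of the two estimators. The sketch reuses the assumptions of the example (30 points, 2 risk items, 4 samples; a lower-right cluster of 3 points containing both risk items, which is an inference from the quoted probabilities). All names are illustrative.

```python
from fractions import Fraction
from math import comb, sqrt


def mean_and_std(distribution):
    """distribution: {estimate: probability}. Returns (mean, standard deviation)."""
    mean = sum(x * p for x, p in distribution.items())
    variance = sum(p * (x - mean) ** 2 for x, p in distribution.items())
    return float(mean), sqrt(variance)


# Simple random sampling: 4 of 30 points, 2 of which are risk content.
ways = comb(30, 4)
random_dist = {
    Fraction(k, 4): Fraction(comb(2, k) * comb(28, 4 - k), ways) for k in range(3)
}

# Cluster-based sampling: one draw per cluster; only the 3-point cluster
# with 2 risk items (an inferred composition) can change the estimate.
cluster_dist = {Fraction(0): Fraction(1, 3), Fraction(1, 10): Fraction(2, 3)}

print("random :", mean_and_std(random_dist))
print("cluster:", mean_and_std(cluster_dist))
```

In this setup both estimators have the same mean, 1/15, while the standard deviation is roughly 0.118 for random sampling and roughly 0.047 for cluster-based sampling. How large the gap is in practice depends on how well the clusters separate risk content from normal content.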
It should be noted that the effectiveness of the method of the present application in a real application scenario depends heavily on the data distribution of the specific business scenario and on the quality of the clustering algorithm.
In summary, the present application provides a sampling method based on text vector clustering, aimed at improving the stability and accuracy of the community content risk assessment index (the risk ratio). First, the segmented texts of the overall content text of the community content are converted into text vectors, and the text vectors are clustered. Then, based on the clusters produced by the clustering, where each cluster contains the segmented texts corresponding to its text vectors, a stratified sampling method is used to estimate the risk recall index of the community text content.
The method effectively improves the stability and representativeness of the sample-based estimate of the risk recall index in scenarios with an extremely low risk ratio and a low sampling ratio. At the same time, it makes potentially missed risk content easier to discover and improves the accuracy of the assessment result.
A second embodiment of the present application relates to a community content risk assessment apparatus whose structure is shown in Fig. 2. The apparatus comprises a text vector module, a clustering module, a sampling module, a risk content statistics module, and a risk recall index module.
The modules are described in detail below:
the text vector module is used for performing word segmentation on the overall content text of the community content to obtain segmented texts, and converting each segmented text into a text vector;
the clustering module is used for clustering the text vectors to construct clusters, wherein each cluster contains the segmented texts corresponding to its text vectors;
the sampling module is used for determining the number of segmented text samples corresponding to each cluster and, in each cluster, sampling segmented texts according to that number;
the risk content statistics module is used for judging, in each cluster, whether each sampled segmented text is risk content, and counting the number of sampled segmented texts in the cluster that are determined to be risk content;
and the risk recall index module is used for determining the risk recall index of the community content according to the number of segmented texts determined to be risk content in each cluster.
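The five modules can be read as stages of a single pipeline. The sketch below is a minimal illustration under several assumptions: `vectorize`, `assign_cluster`, and `is_risk` are hypothetical callables standing in for the text vector model, the clustering algorithm, and the (typically manual) risk labeling; every cluster gets the same sample size; and the final estimate uses the standard stratified form, the sum of (N_i / N) × (r_i / n_i).

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Sequence


def assess_risk(
    texts: Sequence[str],
    vectorize: Callable[[str], Sequence[float]],          # hypothetical text vector model
    assign_cluster: Callable[[Sequence[float]], Hashable],  # hypothetical clustering step
    is_risk: Callable[[str], bool],                       # hypothetical labeling step
    samples_per_cluster: int = 1,
    seed: int = 0,
) -> float:
    """Estimate the risk ratio of `texts` by cluster-based stratified sampling."""
    rng = random.Random(seed)

    # Text vector module: segment and embed each text (delegated to `vectorize`).
    vectors = [vectorize(text) for text in texts]

    # Clustering module: group the original texts by their cluster label.
    clusters = defaultdict(list)
    for text, vector in zip(texts, vectors):
        clusters[assign_cluster(vector)].append(text)

    total = len(texts)
    estimate = 0.0
    for members in clusters.values():
        # Sampling module: draw n_i texts from this cluster without replacement.
        n_i = min(samples_per_cluster, len(members))
        sample = rng.sample(members, n_i)
        # Risk content statistics module: count sampled texts labeled as risk.
        r_i = sum(1 for text in sample if is_risk(text))
        # Risk recall index module: accumulate the weighted in-sample risk rate.
        estimate += (len(members) / total) * (r_i / n_i)
    return estimate
```

Passing the stages in as callables keeps the sketch independent of any particular embedding or clustering library; a real system would substitute its own components.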
The first embodiment is a method embodiment corresponding to the present embodiment, and the technical details in the first embodiment can be applied to the present embodiment, and the technical details in the present embodiment can also be applied to the first embodiment.
Those skilled in the art should understand that the functions of the modules shown in the above embodiment of the community content risk assessment apparatus can be understood with reference to the foregoing description of the community content risk assessment method. The functions of these modules may be implemented by a program (executable instructions) running on a processor, or by specific logic circuits. If the community content risk assessment apparatus of the embodiments of the present application is implemented in the form of software functional modules and sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods of the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other medium capable of storing program code. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application also provide a computer storage medium having stored therein computer executable instructions which when executed by a processor implement the method embodiments of the present application.
In addition, the embodiment of the application also provides community content risk assessment equipment, which comprises a memory for storing computer executable instructions and a processor; the processor is configured to implement the steps of the method embodiments described above when executing computer-executable instructions in the memory. The processor may be a central processing unit (Central Processing Unit, abbreviated as "CPU"), other general purpose processors, digital signal processors (Digital Signal Processor, abbreviated as "DSP"), application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as "ASIC"), and the like. The aforementioned memory may be a read-only memory (ROM), a random access memory (random access memory, RAM), a Flash memory (Flash), a hard disk, a solid state disk, or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied in a hardware processor for execution, or may be executed by a combination of hardware and software modules in the processor.
It should be noted that in the present patent application, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. In the present patent application, a statement that an action is performed according to an element means that the action is performed at least according to that element, which covers two cases: the action is performed according to that element alone, and the action is performed according to that element together with other elements. Expressions such as "a plurality of," "multiple times," and "multiple kinds" cover 2 items, 2 times, and 2 kinds, as well as more than 2 items, more than 2 times, and more than 2 kinds.
All references mentioned in this disclosure are to be considered as being included in the disclosure of the application in its entirety so that modifications may be made as necessary. Further, it is understood that various changes or modifications of the present application may be made by those skilled in the art after reading the above disclosure, and such equivalents are intended to fall within the scope of the application as claimed.

Claims (9)

1. A community content risk assessment method, comprising:
preprocessing the whole content text of the community content;
word segmentation is carried out on the whole content text of the community content, word segmentation texts are obtained, and each word segmentation text is converted into a text vector;
clustering each text vector, constructing clusters, and dividing the community content into subclasses with semantic commonalities, wherein each cluster comprises word segmentation texts corresponding to the text vectors;
determining the number of word segmentation text samples corresponding to each cluster, and carrying out word segmentation text sampling in each cluster according to the corresponding number of word segmentation text samples;
judging whether the word segmentation text of each sample is risk content in each cluster, and counting the number of the word segmentation texts determined to be the risk content;
and determining the risk recall index of the community content according to the number of word segmentation texts determined to be the risk content in each cluster.
2. The method of claim 1, wherein the step of converting each word segmentation text into a text vector uses a pre-trained TextCNN classification model.
3. The method of claim 1, wherein the step of converting each word segmentation text into a text vector uses any one of the following predetermined models: LSTM, word2vec, doc2vec.
4. The method of claim 1, wherein said step of clustering each of said text vectors to construct clusters uses any one of the following algorithms: the K-means algorithm, the K-MEDOIDS algorithm, the CLARANS algorithm.
5. The method of claim 1, wherein in the step of determining the number of segmented text samples corresponding to each cluster, the number of segmented text samples is determined using any one of the following: proportional stratified allocation, non-proportional allocation, Neyman allocation.
6. The method of claim 5, wherein in the step of determining the number of segmented text samples corresponding to each cluster, if the proportion of the number of segmented texts contained in the cluster to the total number of segmented texts corresponding to the entire content text of the community content is lower than a preset threshold, a non-proportional allocation method is adopted to determine the number of segmented text samples corresponding to the cluster.
7. A community content risk assessment apparatus, comprising:
the text vector module is used for performing word segmentation on the preprocessed overall content text of the community content to obtain word segmentation texts, and converting each word segmentation text into a text vector;
the clustering module is used for clustering each text vector, constructing clusters and separating the community content into subclasses with semantic commonalities, wherein the clusters contain word segmentation texts corresponding to the text vectors;
the sampling module is used for determining the word segmentation text sampling number corresponding to each cluster, and in each cluster, the word segmentation text sampling is carried out according to the corresponding word segmentation text sampling number;
the risk content statistics module is used for judging whether the word segmentation text of each sample is risk content in each cluster, and counting the number of the word segmentation texts determined to be the risk content;
and the risk recall index module is used for determining the risk recall index of the community content according to the number of word segmentation texts determined to be the risk content in each cluster.
8. A community content risk assessment apparatus, comprising:
a memory for storing computer executable instructions; and
a processor for implementing the steps in the method of any one of claims 1 to 6 when executing computer executable instructions.
9. A computer readable storage medium, characterized in that computer executable instructions are stored in the computer readable storage medium, which when executed by a processor implement the steps in the method according to any of claims 1 to 6.
CN201910221531.6A 2019-03-22 2019-03-22 Community content risk assessment method and device Active CN110046251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910221531.6A CN110046251B (en) 2019-03-22 2019-03-22 Community content risk assessment method and device

Publications (2)

Publication Number Publication Date
CN110046251A CN110046251A (en) 2019-07-23
CN110046251B true CN110046251B (en) 2023-12-08

Family

ID=67273946

Country Status (1)

Country Link
CN (1) CN110046251B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200925

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200925

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant