CN110046251A - Community content methods of risk assessment and device - Google Patents

Info

Publication number
CN110046251A
Authority
CN
China
Prior art keywords
text
cluster
content
risk
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910221531.6A
Other languages
Chinese (zh)
Other versions
CN110046251B (en
Inventor
赵智源
祝慧佳
周书恒
郭亚
徐陈虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910221531.6A
Publication of CN110046251A
Application granted
Publication of CN110046251B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application relates to the Internet field and discloses a community content risk assessment method and device. The method comprises: segmenting the entire content text of the community content, and converting each participle text into a text vector; clustering the text vectors to construct clusters; determining the participle text sampling number corresponding to each cluster and, in each cluster, sampling participle texts according to that number; in each cluster, judging whether each sampled participle text is Risk Content, and counting the sampled participle texts of the cluster that are confirmed as Risk Content; and determining the risk recall index of the community content according to the number of participle texts confirmed as Risk Content in each cluster. The application makes the estimate of the risk recall index more stable, reduces sampling error, and avoids the loss of accuracy that occurs when potential missed risk data is not easily drawn into the sample.

Description

Community content methods of risk assessment and device
Technical field
This application relates to the Internet field, and in particular to the security assessment of community content.
Background art
In the daily operation of a community or platform on which a large amount of user-generated content (UGC) is published, assessing the content risk of what the community displays is an essential step.
Risk here refers to content such as politically sensitive material, pornography, and non-compliant advertising promotion.
The risk accounting is usually used to measure whether the cleanliness of a community meets requirements. However, because the volume of UGC in a community is huge, manually reviewing all content would require enormous manpower and is not feasible in practical business.
Therefore, random sampling is normally carried out over all UGC content, the sampled data is manually labeled to build a sampling set, and the risk accounting calculated on the sampling set is used to estimate the risk accounting of the entire community.
However, in practical business scenarios the true proportion of risk data is very small (for example, < 1%, depending on the risk being assessed), and the sampling strategy (mainly the relatively low sampling fraction used in practice) causes some problems. For example, the estimate of risk accounting based on a random sampling set is especially unstable (that is, the variance of the estimated index is large). As another example, if risk data happens to be drawn into the sample, the amount of missed risk is overestimated because the sample is small, so that the estimated risk accounting is far higher than the actual value; if no risk data is drawn, the potential risk is ignored.
The random sampling mentioned above draws samples at random from the content of the entire community according to a fixed sampling fraction, and the index of the entire community is then assessed from statistics of the sampled data set.
Although this approach has the benefit of simple logic and easy implementation, it has the following disadvantages: when risk data is scarce, the estimate of the risk accounting index is unstable, the sampling error is large, the accuracy is low, and potential missed risk data is not easily drawn into the sample.
Summary of the invention
The purpose of this application is to provide a community content risk assessment method and device that, even when risk data is scarce, make the estimate of the risk accounting index more stable, reduce sampling error, improve accuracy, and avoid the loss of accuracy that occurs when potential missed risk data is not easily drawn into the sample.
To solve the above problems, this application discloses a community content risk assessment method, comprising:
segmenting the entire content text of the community content to obtain participle texts, and converting each participle text into a text vector;
clustering the text vectors to construct clusters, wherein each cluster contains the participle texts corresponding to its text vectors;
determining the participle text sampling number corresponding to each cluster and, in each cluster, sampling participle texts according to the corresponding participle text sampling number;
in each cluster, judging whether each sampled participle text is Risk Content, and counting the sampled participle texts of the cluster that are confirmed as Risk Content;
determining the risk recall index of the community content according to the number of participle texts marked as Risk Content in each cluster.
In a preferred embodiment, the step of converting each participle text into a text vector uses a pre-trained TextCNN binary classification model.
In a preferred embodiment, the step of converting each participle text into a text vector uses any one of the following preset models: LSTM, word2vec, doc2vec.
In a preferred embodiment, the step of clustering the text vectors and constructing clusters uses any one of the following algorithms: the k-means algorithm, the K-MEDOIDS algorithm, the CLARANS algorithm.
In a preferred embodiment, the step of determining the participle text sampling number corresponding to each cluster determines that number in any one of the following ways: the stratified fixed-ratio method, the disproportional distribution method, the Neyman method.
In a preferred embodiment, in the step of determining the participle text sampling number corresponding to each cluster, if the proportion of the participle texts contained in a cluster, relative to the total number of participle texts corresponding to the entire content text of the community content, is lower than a preset threshold, the disproportional distribution method is used to determine the participle text sampling number corresponding to that cluster.
In a preferred embodiment, before the step of segmenting the entire content text to obtain participle texts, the method further comprises:
preprocessing the entire content text.
This application also discloses a community content risk assessment device, comprising:
a text vector module, for segmenting the entire content text of the community content to obtain participle texts, and converting each participle text into a text vector;
a cluster module, for clustering the text vectors and constructing clusters, wherein each cluster contains the participle texts corresponding to its text vectors;
a sampling module, for determining the participle text sampling number corresponding to each cluster and, in each cluster, sampling participle texts according to the corresponding participle text sampling number;
a Risk Content statistics module, for judging, in each cluster, whether each sampled participle text is Risk Content, and counting the sampled participle texts of the cluster that are confirmed as Risk Content;
a risk recall index module, for determining the risk recall index of the community content according to the number of participle texts marked as Risk Content in each cluster.
This application also discloses community content risk assessment equipment, comprising:
a memory, for storing computer-executable instructions; and
a processor, for implementing the steps of the above method when executing the computer-executable instructions.
This application also discloses a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the above method.
In the embodiments of this application, the text of the community content to be assessed is first segmented, the participle texts are then converted into text vectors, and the text vectors are clustered so that the corresponding participle texts form clusters at the semantic level. A participle text sampling number is then determined for each cluster, and the risk recall index of the community content is assessed from the number of Risk Content participle texts found among the sampled participle texts. The advantage is that, even when risk data is scarce, the estimate of the risk accounting index is more stable, the sampling error is reduced, the accuracy is improved, and the loss of accuracy caused by potential missed risk data not being drawn into the sample is avoided.
A large number of technical features are described in this specification and are distributed among the various technical solutions. Enumerating every possible combination of these technical features (that is, every technical solution) would make the specification excessively long. To avoid this, the technical features disclosed in the summary above, the technical features disclosed in the embodiments and examples below, and the technical features disclosed in the drawings may be freely combined with one another to form new technical solutions (which are all considered to have been recorded in this specification), unless such a combination is technically infeasible. For example, suppose features A+B+C are disclosed in one example and features A+B+D+E are disclosed in another, where features C and D are equivalent technical means that play the same role, so that only one of them can be selected and they cannot be used together, while feature E can technically be combined with feature C. Then the scheme A+B+C+D should not be considered recorded, because it is technically infeasible, whereas the scheme A+B+C+E should be considered recorded.
Brief description of the drawings
Fig. 1 is a flow diagram of the community content risk assessment method according to the first embodiment of this application;
Fig. 2 is a structural diagram of the community content risk assessment device according to the second embodiment of this application;
Fig. 3 is a diagram of the experimental test effect of the community content risk assessment method according to an embodiment of this application.
Detailed description of the embodiments
In the following description, many technical details are set out so that the reader can better understand this application. However, a person of ordinary skill in the art will appreciate that the technical solutions claimed in this application can be implemented even without these technical details and with various changes and modifications based on the following embodiments.
The explanation of part concept:
UGC: User Generated Content, that is, content generated by users.
Content risk: refers to content containing risks such as politically sensitive speech, pornographic content, and non-compliant advertising promotion.
Risk Content: refers to content that contains risks such as politically sensitive speech, pornographic content, and non-compliant advertising promotion.
Risk accounting: the proportion of risk, that is, the ratio of the number of Risk Content items in the community content to the total number of content items in the community: risk accounting = Risk Content amount / total content amount.
Cleanliness: measures how clean the community content is: cleanliness = 1 - risk accounting.
Sampling fraction: if n data items are randomly drawn from N data items as the sampling set, the sampling fraction is n/N.
Content vector: the representation of a participle text in vector form.
Stratified sampling: also called type sampling. From a population that can be divided into different subpopulations (or layers), samples (individuals) are drawn at random from the different layers in a prescribed proportion. In this application, the participle texts in each cluster are sampled on the basis of the clusters formed by clustering the text vectors.
TextCNN: an algorithm that classifies text with a convolutional neural network.
Point estimation: also called fixed-value estimation; the value of an index observed in the sample is used directly as the estimate of the population parameter. Point estimation is simple and generally does not take the sampling error or the degree of reliability into account.
The part innovative point of summary description the application below:
For the specific scenario of "community content risk assessment", and in view of the particular nature of content text, this application proposes first clustering to form clusters and then sampling and assessing the participle texts in each cluster, where the participle text sampling number of each cluster can be determined according to the circumstances of that cluster. In scenarios with an extremely low risk accounting and a low sampling fraction, this effectively improves the stability and representativeness of the risk recall index estimated from the sampling set, improves accuracy, and makes potential missed content risk easier to find.
Further, participle texts are first converted into text vectors, the text vectors are clustered to form clusters, and the participle texts in each cluster are then sampled and assessed. This stratified sampling effectively improves the representativeness of the sampling set and reduces sampling error. Specifically, conventional stratified sampling forms layers from attributes or categories of the target samples; for example, when sampling a crowd, layers are formed by gender, age, and so on. In this application, however, community content is entirely unstructured data and has no objective attribute that can be used directly for layering. This application therefore clusters the participle texts at the semantic level to form clusters, separating the community content as a whole into subclasses that share semantic commonality, for use in the subsequent stratified sampling.
To make the purposes, technical solutions and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the drawings.
The first embodiment of this application relates to a community content risk assessment method. Its flow is shown in Fig. 1, and the method comprises the following steps:
Step 101: text vector conversion
Specifically, the entire content text of the community content is segmented to obtain participle texts, and each participle text is converted into a text vector.
Note that text preprocessing mainly cleans or converts content such as traditional Chinese characters, special symbols, emoji, and Chinese numerals that appear in the entire content text of the community content.
Note that segmentation applies a segmentation algorithm to the preprocessed character string to obtain participle texts.
For example, "the weather today is very good" yields "today/weather/very good" after segmentation, where "today", "weather" and "very good" are participle texts.
The advantage is that community content often lacks normalization and standardization, which lowers processing efficiency; by effectively cleaning and segmenting the entire content text of the community content, participle texts that are convenient to process are provided for the subsequent Risk Content assessment.
Note that many text preprocessing and segmentation methods exist at present, and this application is not limited to any specific one.
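Because the application leaves the preprocessing and segmentation methods open, the following is only a minimal Python sketch of one possible cleaning pass. The functions `preprocess` and `segment` and the cleaning rules they apply are assumptions made for illustration. Converting traditional characters or Chinese numerals, and segmenting Chinese text, would need a conversion table and a dedicated word segmenter, which are not shown.

```python
import re
import unicodedata
from typing import List


def preprocess(text: str) -> str:
    """Clean raw community text before segmentation (illustrative rules only)."""
    # NFKC folds full-width letters, digits and punctuation to their plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace characters in the Unicode category "So" (Symbol, other),
    # which covers most emoji, with a space.
    text = "".join(" " if unicodedata.category(ch) == "So" else ch for ch in text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> List[str]:
    """Stand-in segmenter: split on whitespace and common punctuation."""
    return [token for token in re.split(r"[\s,.!?;:/]+", text) if token]
```

A call such as `segment(preprocess(raw_text))` then yields the list of tokens that plays the role of the participle text in the later steps.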
Specifically, in the present embodiment, a pre-trained TextCNN binary classification model is used as the word vector model. The TextCNN binary classification model is a binary classifier for identifying Risk Content, and the output of the last layer of the TextCNN network structure can be taken as the content vector of the input text.
Note that the text vector conversion of this application is not limited to the TextCNN binary classification model; other models such as LSTM, word2vec or doc2vec can be used instead.
Step 102: text vector cluster
Specifically, in this step, the text vectors obtained above are clustered to construct clusters, wherein each cluster contains the participle texts corresponding to its text vectors.
Specifically, in the present embodiment, the k-means algorithm is selected and the number of output classes is fixed at N, that is, N clusters are constructed.
Note that the text vector clustering of this application is not limited to the k-means algorithm; other clustering algorithms can be used instead.
For example, K-MEDOIDS, CLARANS, and so on.
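As an illustration of the clustering step, the following is a minimal pure-Python sketch of k-means (plain Lloyd iterations). The function name `k_means`, the seeded random initialisation, and the fixed iteration count are assumptions for the sketch; the application does not specify them, and a production system would more likely rely on an established library implementation.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors: List[Vector], k: int, iterations: int = 20, seed: int = 0) -> List[int]:
    """Return a cluster label in 0..k-1 for each vector."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors (an assumed strategy).
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels
```

The returned labels are what the sampling step uses as the layer of each participle text.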
The advantage is that clustering the text vectors obtained after preprocessing, segmenting and vectorizing the entire content text of the community content in effect gives the participle texts a new attribute. Before this processing, community content is entirely unstructured data with no objective attribute, so these participle texts cannot be effectively grouped for subsequent assessment.
In other words, through the above steps the entire content text of the community content is separated into subclasses at the semantic level, and each participle text is given a new attribute based on the semantic level. Accordingly, if the text vectors of two participle texts have the same attribute, the texts also share a certain commonality semantically.
Further, the above steps improve how well the participle texts of the entire content text of the community content are separated into subclasses, which provides a better basis for the final stratified sampling effect and for the representativeness of the sampled texts.
Step 103: participle text sampling
Specifically, in this step, the participle text sampling number corresponding to each cluster is determined and, in each cluster, participle texts are sampled according to the corresponding participle text sampling number.
Specifically, in the present embodiment, determining the participle text sampling number of each cluster formed by clustering the converted text vectors, and then sampling, can be called "stratified sampling".
Further, stratified sampling is also called type sampling: from a population that can be divided into different subpopulations (or layers), samples (individuals) are drawn at random from the different layers in a prescribed proportion.
The advantage of this method is that the sampled participle texts are more representative and the sampling error is smaller.
For example, the process of stratified sampling includes dividing the units of a population into two or more mutually independent and exhaustive groups (for example, male and female), and then carrying out simple random sampling within each group, the samples being independent of one another. It can be seen that in stratified sampling the units of the population are grouped by a distinguishing characteristic, and that the grouping characteristic is correlated with the population characteristic of interest. Further, grouping and sampling correspond to clustering and sampling in the present embodiment.
It can be understood that, once cluster construction has been completed in the preceding step, each participle text belongs to some cluster. The clusters here therefore play the role of layers in stratified sampling, and stratified sampling can be used to obtain the participle text sampling set of each cluster.
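Treating clusters as layers can be sketched as follows. This is a hedged Python illustration, not the application's implementation: the helpers `group_by_cluster` and `stratified_sample`, and the use of a seeded simple random draw inside each cluster, are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_cluster(texts: Sequence[str], labels: Sequence[int]) -> Dict[int, List[str]]:
    """Collect the participle texts of each cluster (layer) under its label."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[label].append(text)
    return dict(clusters)


def stratified_sample(
    clusters: Dict[int, List[str]], sample_sizes: Dict[int, int], seed: int = 0
) -> Dict[int, List[str]]:
    """Draw a simple random sample, without replacement, inside every cluster."""
    rng = random.Random(seed)
    samples = {}
    for label, members in clusters.items():
        # Never ask for more texts than the cluster contains.
        n = min(sample_sizes.get(label, 0), len(members))
        samples[label] = rng.sample(members, n)
    return samples
```

Each cluster's sample is drawn independently, which is what makes the per-cluster sampling numbers freely adjustable.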
Further, the effect of stratified sampling is affected by the variation within a layer: the smaller the variation among the samples in the same layer (variation here can be understood as the theme of the content, the risk accounting, and so on), the better the effect of stratified sampling. In the embodiments of this application, clustering separates "layers" according to the semantics of the content itself, which reduces within-layer variability as far as possible and improves the effect of stratified sampling.
Specifically, in this step, the participle text sampling number of each cluster can be determined in any of the following ways:
First: the stratified fixed-ratio method.
Specifically, every cluster is sampled at the same ratio, namely the ratio of the participle text sampling number to the number of all participle texts corresponding to the entire content text of the community content.
For example, if the participle text sampling number is n = 50 and the total number of participle texts is N = 500, then n/N = 0.1 is the sampling ratio, and the sample number of every layer is determined by this ratio.
Second: the disproportional distribution method.
Specifically, when the participle texts of some cluster make up too small a proportion of all participle texts corresponding to the entire content text of the community content, that is, a proportion lower than a preset threshold, the share of that cluster's participle text sampling number in the overall sample can be suitably increased by manual setting, so that the semantic characteristics of the cluster are adequately reflected in the sample.
Third: the Neyman method.
Specifically, the participle text sampling number of each cluster is proportional to the product of the total number of participle texts in that cluster and the standard deviation of that cluster.
In the present embodiment, the second method, the disproportional distribution method, is used.
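The first and third allocation methods can be written down directly. The sketch below is illustrative only: the function names, the use of `round`, and the way the per-cluster standard deviations are supplied are assumptions, and because each cluster is rounded separately the allocations may not add up exactly to the requested total.

```python
from typing import List, Sequence


def proportional_allocation(cluster_sizes: Sequence[int], total_samples: int) -> List[int]:
    """Fixed-ratio allocation: every cluster is sampled at the same ratio n/N."""
    total = sum(cluster_sizes)
    return [round(total_samples * size / total) for size in cluster_sizes]


def neyman_allocation(
    cluster_sizes: Sequence[int], cluster_stds: Sequence[float], total_samples: int
) -> List[int]:
    """Neyman allocation: sample number proportional to cluster size times std."""
    weights = [size * std for size, std in zip(cluster_sizes, cluster_stds)]
    total_weight = sum(weights)
    if total_weight == 0:
        # No variation anywhere: fall back to the fixed-ratio allocation.
        return proportional_allocation(cluster_sizes, total_samples)
    return [round(total_samples * weight / total_weight) for weight in weights]
```

For example, `proportional_allocation([500, 300, 200], 100)` gives 50, 30 and 20 samples, while `neyman_allocation([100, 100], [1.0, 3.0], 40)` gives 10 and 30, sending more of the budget to the cluster with the larger standard deviation.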
The advantage is as follows. In some special scenarios, black data, that is, participle texts of Risk Content, often has a small data volume. If sampling followed the first, fixed-ratio method, black data would not easily be sampled, which would affect the stability and accuracy of the assessment result. The disproportional distribution method gives the participle text sampling number of such a cluster a certain guarantee, so that the participle texts of every cluster contained in the entire content text of the community content are better balanced in the estimate, and the cluster containing black data is not left unsampled because its data volume is small.
For example, a sampling rule can be set in which the total number of participle texts in a cluster is N, the participle text sampling number is n, and the sampling fraction is sp; sampling can then follow these rules:
1) n = N when N < 100
2) n = 100 when 100 <= N < 1000
3) sp = 5% when 1000 <= N < 10000
4) sp = 1% when 10000 <= N < 500000
5) n = 5000 when N >= 500000
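The example rule maps directly onto a small function. In the Python sketch below the function name `sample_size` is hypothetical, and rounding the percentage tiers up to a whole text is an assumption, since the rule does not say how fractional sample numbers are handled.

```python
def sample_size(cluster_size: int) -> int:
    """Participle text sampling number n for a cluster holding N texts."""
    if cluster_size < 100:
        return cluster_size                      # rule 1: take every text
    if cluster_size < 1000:
        return 100                               # rule 2: fixed 100 texts
    if cluster_size < 10000:
        return (5 * cluster_size + 99) // 100    # rule 3: 5%, rounded up (assumed)
    if cluster_size < 500000:
        return (cluster_size + 99) // 100        # rule 4: 1%, rounded up (assumed)
    return 5000                                  # rule 5: capped at 5000 texts
```

Read literally, the rule is not monotone at the tier boundaries: a cluster of 999 texts yields 100 samples, while a cluster of 1000 texts yields 50.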
Note that, in the cluster-based stratified sampling of this application, the method of determining the participle text sampling number of each cluster is not limited to the above; other allocation schemes can be used instead and are not described here.
Step 105: the participle text of statistical risk content
Specifically, in this step, in each cluster, whether each sampled participle text is Risk Content is judged, and the sampled participle texts of the cluster that are confirmed as Risk Content are counted.
Note that the specific way of judging whether a sampled participle text is Risk Content is common knowledge in the art and is not described here.
Note that, in the sampling step above, the participle text sampling number of a cluster holding black data with a small data volume is determined by the disproportional distribution method. In this step, therefore, even if the cluster containing black data holds very few participle texts, for example fewer than 1000, a manually set standard can still be applied, such as the example rule given in the sampling step (n = 100 when N < 1000, or n = N when N < 100). The cluster is thus sampled more reasonably, each sampled participle text is judged as black data (that is, Risk Content) or not, and the participle texts of the cluster that are determined to be Risk Content are counted.
The advantage is that Risk Content that is otherwise hard to notice becomes easier to find even when its data volume is small, so that the result of the community content risk assessment is more stable and accurate.
Step 106: determining that risk recalls index
Specifically, in this step, the risk recall index of the community content is determined according to the number r_i of participle texts confirmed as Risk Content in each cluster.
Specifically, in this step, the risk recall index of the community content is calculated with the following formula:
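The formula itself is not reproduced in this text. A form consistent with the symbol definitions that follow, and with the worked example given later in this description, is the stratified estimate below; it is offered as an assumption and is not necessarily the original expression. The symbol \hat{R} for the index is introduced here for illustration.

```latex
\hat{R} \;=\; \frac{\displaystyle\sum_{i=1}^{K} N_i \,\frac{r_i}{n_i}}{\displaystyle\sum_{i=1}^{K} N_i}
```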
Here, K is the number of clusters obtained in the step of clustering the text vectors and constructing clusters.
N_i is the number of participle texts contained in the i-th cluster.
n_i is the participle text sampling number determined for the i-th cluster in the step of determining the participle text sampling number corresponding to each cluster, that is, the sampling amount.
r_i is the number of participle texts in the i-th cluster that are marked as Risk Content after labeling, that is, the number of participle texts confirmed as Risk Content.
The present embodiment thus obtains the risk assessment result of the community content, that is, the risk recall index.
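The calculation can be sketched in Python under one assumption: that the index is the stratified estimate of the risk accounting, in which each cluster's sampled risk rate r_i/n_i is weighted by the cluster size N_i. The function name `risk_recall_index` and the cluster sizes in the usage example are hypothetical.

```python
from typing import Sequence


def risk_recall_index(
    cluster_sizes: Sequence[int],   # N_i: participle texts in cluster i
    sample_sizes: Sequence[int],    # n_i: texts sampled from cluster i
    risk_counts: Sequence[int],     # r_i: sampled texts confirmed as Risk Content
) -> float:
    """Assumed estimator: sum(N_i * r_i / n_i) divided by sum(N_i)."""
    estimated_risk_total = 0.0
    for N_i, n_i, r_i in zip(cluster_sizes, sample_sizes, risk_counts):
        if n_i > 0:  # a cluster without samples contributes nothing
            estimated_risk_total += N_i * r_i / n_i
    return estimated_risk_total / sum(cluster_sizes)
```

With four clusters of assumed sizes 10, 9, 8 and 3, one sample per cluster, and a single risk hit in the last cluster, `risk_recall_index([10, 9, 8, 3], [1, 1, 1, 1], [0, 0, 0, 1])` evaluates to 3/30 = 0.1.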
An example is given below to compare random sampling with the cluster-based stratified sampling estimation method of the present application.
Fig. 3 shows a two-dimensional mapping of the segmented texts, in which the left panel is the original form and the right panel is the clustering result. The points in the left panel of Fig. 3 are the mapping of the UGC content of a community onto a two-dimensional plane. Each point represents one piece of content (30 points in total); light points represent normal data and dark points represent risk data.
From the figure, the actual risk proportion can be calculated as about 7% (1/15).
Suppose 4 text vectors are to be drawn from the 30 points to construct a sample set, and the risk proportion of the community content as a whole is to be estimated from the risk proportion of the sample set.
First, with random sampling, the possible estimates are as listed in Table 1. There is a probability of nearly 75% that 4 light points (risk-free content) are drawn, in which case the potential missed risk is overlooked entirely. With a probability of 23.9%, one dark point (risk content) is drawn, and the estimated risk proportion jumps from 0 directly to 25%, which overestimates the risk compared with the true value of 7%.
Table 1: Probability of each outcome and the corresponding risk proportion estimate under random sampling
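The probabilities quoted for random sampling follow from the hypergeometric distribution. The sketch below recomputes them, assuming 2 risky points among the 30 (inferred from the stated 1/15 risk proportion); the function name `random_sampling_outcomes` is hypothetical.

```python
from math import comb


def random_sampling_outcomes(total=30, risky=2, sample=4):
    """Map each possible risk-proportion estimate to its probability
    when `sample` points are drawn at random without replacement."""
    denom = comb(total, sample)
    outcomes = {}
    for k in range(min(risky, sample) + 1):
        # k risky points and (sample - k) normal points are drawn
        prob = comb(risky, k) * comb(total - risky, sample - k) / denom
        outcomes[k / sample] = prob
    return outcomes


if __name__ == "__main__":
    for estimate, prob in random_sampling_outcomes().items():
        print(f"estimate {estimate:.0%}: probability {prob:.1%}")
```

Under these assumptions the estimate is 0% with probability about 74.7%, 25% with probability about 23.9%, and 50% with probability about 1.4%, which matches the figures given in the text.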
Next, the sampling estimation method proposed in the present application is applied.
Assume that clustering divides the segmented texts into 4 clusters, as in the right panel of Fig. 3. The disproportionate allocation method is then used to ensure that, while 4 segmented texts are drawn in total, at least one segmented text is drawn from each cluster; that is, one data item is drawn from each cluster.
The final risk assessment result is then affected only by the data item drawn from the lower-right cluster, and the possible risk assessment results are listed in Table 2. The probability that the estimated risk proportion is 0% falls to 33.3%.
It should be noted that, in practical applications, drawing risk data into the sample is preferable to overlooking missed risk, even if the risk proportion may then be overestimated.
Further, there is a 66.7% probability that the estimated overall risk proportion is 10%.
It can be seen that, compared with existing random sampling, the estimate of the present application is closer to the true value and more stable (it is also obtained with higher probability), so its accuracy is higher.
Table 2: Probability of each outcome and the corresponding risk proportion estimate under stratified sampling based on the clusters obtained by clustering (the "single cluster" refers to the lower-right cluster in the figure)
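The stratified figures can be checked in the same way. The sketch below assumes that the lower-right cluster holds 3 points, 2 of them risky, and that the other three clusters contain no risk content. These counts are not stated in the text; they are inferred from the quoted 33.3% and 66.7% probabilities and the 10% estimate. The estimator is the standard stratified form, also an assumption, and all names are hypothetical.

```python
from fractions import Fraction

TOTAL_POINTS = 30          # from the example
SINGLE_CLUSTER_SIZE = 3    # assumption: size of the lower-right cluster
SINGLE_CLUSTER_RISKY = 2   # assumption: risky points in that cluster
SAMPLES_PER_CLUSTER = 1    # one item drawn from each of the 4 clusters


def stratified_outcomes():
    """Map each possible overall estimate to its probability.

    Only the lower-right cluster can yield a risky sample, so the
    other clusters contribute 0 to the weighted sum.
    """
    p_risky = Fraction(SINGLE_CLUSTER_RISKY, SINGLE_CLUSTER_SIZE)
    outcomes = {}
    for r, prob in ((0, 1 - p_risky), (1, p_risky)):
        # N_i * r_i / n_i for the single cluster, divided by the total size
        estimate = Fraction(SINGLE_CLUSTER_SIZE * r,
                            SAMPLES_PER_CLUSTER * TOTAL_POINTS)
        outcomes[estimate] = prob
    return outcomes


if __name__ == "__main__":
    for estimate, prob in stratified_outcomes().items():
        print(f"estimate {float(estimate):.1%}: probability {float(prob):.1%}")
```

Under these assumptions the expected value of the estimate is (2/3) × 10% = 1/15, which equals the true risk proportion, so the stratified estimate is unbiased in this example while taking only the values 0% and 10%.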
Clearly, compared with random sampling, the method of the present application is better at discovering potential missed risk in scenarios where the risk proportion is extremely low and the sampling ratio is low, and it yields a more representative sample set and a more stable index estimate.
It should be noted that the effect of the method in practical application scenarios depends strongly on the data distribution of the specific business scenario and on the performance of the clustering algorithm.
In summary, to improve the stability and accuracy of the community content risk assessment index (the risk proportion), the present application proposes a sampling method based on text vector clustering. The method first converts the segmented texts of the entire content text of the community content into text vectors and clusters the text vectors. Then, based on the clusters generated by clustering, each of which contains the segmented texts corresponding to its text vectors, it uses stratified sampling to calculate and estimate the risk recall index of the community's text content.
The advantage is that, in scenarios where the risk proportion is extremely low and the sampling ratio is low, the method effectively improves the stability and representativeness of the risk recall index estimated from the sample set, makes potentially missed content risk easier to discover, and improves the accuracy of the assessment result.
The second embodiment of the present application relates to a community content risk assessment device, whose structure is shown in Fig. 2. The community content risk assessment device includes a text vector module, a clustering module, a sampling module, a risk content statistics module, and a risk recall index module.
Each module is described in detail below:
Text vector module, for segmenting the entire content text of the community content to obtain segmented texts, and converting each segmented text into a text vector;
Clustering module, for clustering the text vectors and constructing clusters, wherein each cluster contains the segmented texts corresponding to its text vectors;
Sampling module, for determining the segmented-text sample size for each cluster and, in each cluster, sampling segmented texts according to the corresponding sample size;
Risk content statistics module, for judging, in each cluster, whether each sampled segmented text is risk content, and counting the number of sampled segmented texts in the cluster that are confirmed to be risk content;
Risk recall index module, for determining the risk recall index of the community content according to the number of segmented texts in each cluster that are labeled as risk content.
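The division of labor among the sampling, risk content statistics, and risk recall index modules can be sketched in Python as follows. This is a minimal illustration, not the device's implementation. The names `allocate`, `assess`, `cluster_of`, and `is_risky` are hypothetical; the text vector and clustering modules are stood in for by the `cluster_of` callable; the allocation rule (one sample per cluster first, the remainder toward the most under-sampled clusters) is only one assumed form of disproportionate allocation; and the estimator is the standard stratified form, assumed because the formula is not reproduced in the text.

```python
import random
from collections import defaultdict


def allocate(cluster_sizes, total):
    """Assumed allocation: one sample per cluster first, then the rest
    one at a time to the cluster with the most items per sample so far."""
    if total < len(cluster_sizes):
        raise ValueError("need at least one sample per cluster")
    alloc = {c: 1 for c in cluster_sizes}
    remaining = total - len(cluster_sizes)
    while remaining > 0:
        # Clusters that still have unsampled items
        candidates = [c for c in cluster_sizes if alloc[c] < cluster_sizes[c]]
        if not candidates:
            break
        best = max(candidates, key=lambda c: cluster_sizes[c] / alloc[c])
        alloc[best] += 1
        remaining -= 1
    return alloc


def assess(texts, cluster_of, is_risky, total_samples, rng):
    """Estimate the risk proportion of `texts` by stratified sampling."""
    # Clustering module (stand-in): group texts by cluster label
    clusters = defaultdict(list)
    for text in texts:
        clusters[cluster_of(text)].append(text)
    sizes = {c: len(members) for c, members in clusters.items()}

    # Sampling module: decide n_i for each cluster
    alloc = allocate(sizes, total_samples)

    weighted = 0.0
    for c, members in clusters.items():
        sample = rng.sample(members, alloc[c])
        # Risk content statistics module: count r_i in the sample
        r_i = sum(1 for text in sample if is_risky(text))
        weighted += sizes[c] * r_i / alloc[c]

    # Risk recall index module: size-weighted average of the cluster rates
    return weighted / len(texts)
```

With cluster sizes 27 and 3 and a budget of 4 samples, `allocate` returns 3 and 1, so the small cluster is still sampled even though its proportional share would round to zero.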
The first embodiment is the method embodiment corresponding to this embodiment. The technical details of the first embodiment apply to this embodiment, and the technical details of this embodiment also apply to the first embodiment.
It should be noted that those skilled in the art will appreciate that the functions implemented by the modules shown in the above embodiment of the community content risk assessment device can be understood with reference to the foregoing description of the community content risk assessment method. These functions can be implemented by a program (executable instructions) running on a processor, or by specific logic circuits. If the community content risk assessment device of the embodiments of the present application is implemented in the form of software function modules and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc. The embodiments of the present application are therefore not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application also provides a computer storage medium storing computer-executable instructions which, when executed by a processor, implement the method embodiments of the present application.
In addition, an embodiment of the present application also provides a community content risk assessment apparatus, including a memory for storing computer-executable instructions, and a processor. The processor is configured to implement the steps of the above method embodiments when executing the computer-executable instructions in the memory. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. The memory may be a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk, a solid-state drive, or the like. The steps of the methods disclosed in the embodiments of the present invention may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
It should be noted that, in the application documents of this patent, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Unless further limited, an element defined by the phrase "including a" does not exclude the presence of other identical elements in the process, method, article, or device that includes the element. In the application documents of this patent, a statement that an action is performed according to an element means that the action is performed at least according to that element, which covers two cases: the action is performed according to that element only, and the action is performed according to that element and other elements. Expressions such as "a plurality of", "multiple times", and "multiple kinds" mean 2 or more, 2 or more times, and 2 or more kinds.
All documents mentioned in the present application are considered to be incorporated in their entirety into the disclosure of the present application, so that they can serve as a basis for amendment where necessary. In addition, it should be understood that, after reading the above disclosure of the present application, those skilled in the art can make various changes or modifications to the present application, and such equivalent forms likewise fall within the scope claimed by the present application.

Claims (10)

1. A community content risk assessment method, characterized by comprising:
segmenting the entire content text of the community content to obtain segmented texts, and converting each segmented text into a text vector;
clustering the text vectors and constructing clusters, wherein each cluster contains the segmented texts corresponding to its text vectors;
determining the segmented-text sample size for each cluster and, in each cluster, sampling segmented texts according to the corresponding sample size;
in each cluster, judging whether each sampled segmented text is risk content, and counting the number of segmented texts confirmed to be risk content;
determining the risk recall index of the community content according to the number of segmented texts in each cluster that are confirmed to be risk content.
2. The method of claim 1, characterized in that, in the step of converting each segmented text into a text vector, a pre-trained TextCNN binary classification model is used.
3. The method of claim 1, characterized in that, in the step of converting each segmented text into a text vector, any one of the following preset models is used: LSTM, word2vec, doc2vec.
4. The method of claim 1, characterized in that, in the step of clustering the text vectors and constructing clusters, any one of the following algorithms is used: the k-means algorithm, the K-MEDOIDS algorithm, the CLARANS algorithm.
5. The method of claim 1, characterized in that, in the step of determining the segmented-text sample size for each cluster, the sample size is determined in any one of the following ways: stratified proportional allocation, disproportionate allocation, Neyman allocation.
6. The method of claim 5, characterized in that, in the step of determining the segmented-text sample size for each cluster, if the ratio of the number of segmented texts contained in a cluster to the total number of segmented texts corresponding to the entire content text of the community content is lower than a preset threshold, the segmented-text sample size for that cluster is determined using the disproportionate allocation method.
7. The method of claim 6, characterized in that, before the step of segmenting the entire content text to obtain the segmented texts, the method further comprises:
preprocessing the entire content text.
8. A community content risk assessment device, characterized by comprising:
a text vector module, for segmenting the entire content text of the community content to obtain segmented texts, and converting each segmented text into a text vector;
a clustering module, for clustering the text vectors and constructing clusters, wherein each cluster contains the segmented texts corresponding to its text vectors;
a sampling module, for determining the segmented-text sample size for each cluster and, in each cluster, sampling segmented texts according to the corresponding sample size;
a risk content statistics module, for judging, in each cluster, whether each sampled segmented text is risk content, and counting the number of segmented texts confirmed to be risk content;
a risk recall index module, for determining the risk recall index of the community content according to the number of segmented texts in each cluster that are confirmed to be risk content.
9. A community content risk assessment apparatus, characterized by comprising:
a memory, for storing computer-executable instructions; and
a processor, for implementing the steps of the method of any one of claims 1 to 7 when executing the computer-executable instructions.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
CN201910221531.6A 2019-03-22 2019-03-22 Community content risk assessment method and device Active CN110046251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910221531.6A CN110046251B (en) 2019-03-22 2019-03-22 Community content risk assessment method and device

Publications (2)

Publication Number Publication Date
CN110046251A true CN110046251A (en) 2019-07-23
CN110046251B CN110046251B (en) 2023-12-08

Family

ID=67273946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910221531.6A Active CN110046251B (en) 2019-03-22 2019-03-22 Community content risk assessment method and device

Country Status (1)

Country Link
CN (1) CN110046251B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776713A (en) * 2016-11-03 2017-05-31 中山大学 Massive short-text clustering method based on word-vector semantic analysis
CN107330477A (en) * 2017-07-24 2017-11-07 南京邮电大学 Improved SMOTE resampling method for imbalanced data classification
CN109446837A (en) * 2018-10-12 2019-03-08 深圳前海微众银行股份有限公司 Text review method, device, and readable storage medium based on sensitive information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张红: "分层抽样和整群抽样在审计实践中的应用", 学术纵横, no. 12, pages 1 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650849A (en) * 2019-09-25 2021-04-13 北京国双科技有限公司 File processing method and device, storage medium and equipment
CN111143577A (en) * 2019-12-27 2020-05-12 北京百度网讯科技有限公司 Data annotation method, device and system
CN111143577B (en) * 2019-12-27 2023-06-16 北京百度网讯科技有限公司 Data labeling method, device and system
US11860838B2 (en) 2019-12-27 2024-01-02 Beijing Baidu Netcom Science And Teciinology Co., Ltd. Data labeling method, apparatus and system, and computer-readable storage medium
CN111797194A (en) * 2020-05-20 2020-10-20 北京三快在线科技有限公司 Text risk detection method and device, electronic equipment and storage medium
CN111797194B (en) * 2020-05-20 2024-04-02 北京三快在线科技有限公司 Text risk detection method and device, electronic equipment and storage medium
CN111835622A (en) * 2020-07-10 2020-10-27 腾讯科技(深圳)有限公司 Information interception method and device, computer equipment and storage medium
CN112069785A (en) * 2020-08-06 2020-12-11 北京明略昭辉科技有限公司 Text sampling method and device for improving labeling efficiency

Also Published As

Publication number Publication date
CN110046251B (en) 2023-12-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200925

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200925

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant