Summary of the Invention
The purpose of the present application is to provide a community content risk assessment method and device that, even when risk data is scarce, keep the estimate of the risk ratio index stable, reduce sampling error, and improve accuracy, while avoiding the loss of accuracy that occurs when potentially missed risk data is unlikely to be drawn into the sample.
To solve the above problems, the present application discloses a community content risk assessment method, comprising:
segmenting the entire content text of the community content to obtain segmented texts, and converting each segmented text into a text vector;
clustering the text vectors to construct clusters, wherein each cluster contains the segmented texts corresponding to its text vectors;
determining a segmented-text sampling number for each cluster, and sampling segmented texts in each cluster according to the corresponding sampling number;
in each cluster, judging whether each sampled segmented text is risk content, and counting the number of sampled segmented texts in the cluster that are confirmed as risk content;
determining a risk recall index of the community content according to the number of segmented texts confirmed as risk content in each cluster.
In a preferred embodiment, in the step of converting each segmented text into a text vector, a pre-trained TextCNN binary classification model is used.
In a preferred embodiment, in the step of converting each segmented text into a text vector, any one of the following preset models is used: LSTM, word2vec, doc2vec.
In a preferred embodiment, in the step of clustering the text vectors to construct clusters, any one of the following algorithms is used: the k-means algorithm, the K-MEDOIDS algorithm, the CLARANS algorithm.
In a preferred embodiment, in the step of determining the segmented-text sampling number for each cluster, the sampling number is determined in any one of the following ways: stratified proportional allocation, disproportionate allocation, Neyman allocation.
In a preferred embodiment, in the step of determining the segmented-text sampling number for each cluster, if the proportion of the segmented texts contained in a cluster to the total number of segmented texts of the entire content text of the community content is lower than a preset threshold, the sampling number for that cluster is determined by disproportionate allocation.
In a preferred embodiment, before the step of segmenting the entire content text to obtain segmented texts, the method further comprises: preprocessing the entire content text.
The present application also discloses a community content risk assessment device, comprising:
a text vector module, configured to segment the entire content text of the community content to obtain segmented texts, and to convert each segmented text into a text vector;
a clustering module, configured to cluster the text vectors and construct clusters, wherein each cluster contains the segmented texts corresponding to its text vectors;
a sampling module, configured to determine a segmented-text sampling number for each cluster and, in each cluster, to sample segmented texts according to the corresponding sampling number;
a risk content statistics module, configured to judge, in each cluster, whether each sampled segmented text is risk content, and to count the number of sampled segmented texts in the cluster that are confirmed as risk content;
a risk recall index module, configured to determine the risk recall index of the community content according to the number of segmented texts confirmed as risk content in each cluster.
The present application also discloses community content risk assessment equipment, comprising:
a memory for storing computer-executable instructions; and
a processor for implementing the steps of the above method when executing the computer-executable instructions.
The present application also discloses a computer-readable storage medium in which computer-executable instructions are stored; when the computer-executable instructions are executed by a processor, the steps of the above method are implemented.
In the embodiments of the present application, the text of the community content to be assessed is first segmented, the segmented texts are converted into text vectors, and the text vectors are clustered so that the corresponding segmented texts form clusters at the semantic level. A segmented-text sampling number is then determined for each cluster, and the risk recall index of the community content is assessed according to the number of risk-content segmented texts found among the sampled segmented texts. The advantage is that, even when risk data is scarce, the estimate of the risk ratio index remains stable, sampling error is reduced, and accuracy is improved, while the loss of accuracy caused by potentially missed risk data being unlikely to be sampled is avoided.
A large number of technical features are described in the specification of the present application and are distributed among the various technical solutions. Listing every possible combination of these technical features (that is, every technical solution) would make the specification excessively long. To avoid this problem, the technical features disclosed in the above Summary of the Invention, the technical features disclosed in the embodiments and examples below, and the technical features disclosed in the drawings may be freely combined with one another to form new technical solutions (all of which are deemed to have been recorded in this specification), unless such a combination of technical features is technically infeasible. For example, suppose features A+B+C are disclosed in one example and features A+B+D+E are disclosed in another, where features C and D are equivalent technical means that serve the same purpose, so that technically only one of them can be chosen and they cannot be used together, while feature E can technically be combined with feature C. Then the solution A+B+C+D should not be deemed to have been recorded because it is technically infeasible, whereas the solution A+B+C+E should be deemed to have been recorded.
Specific Embodiments
In the following description, many technical details are set forth so that the reader may better understand the present application. However, those of ordinary skill in the art will understand that the technical solutions claimed in the present application can be implemented even without these technical details, and with many variations and modifications based on the following embodiments.
Explanation of some concepts:
UGC: User Generated Content, that is, content generated by users.
Content risk: the presence in content of risks such as politically sensitive speech, pornographic content, and non-compliant advertising promotion.
Risk content: content that contains risks such as politically sensitive speech, pornographic content, and non-compliant advertising promotion.
Risk ratio: the ratio of the quantity of all risk content in the community content to the total quantity of community content, that is: risk ratio = risk content quantity / total content quantity.
Cleanliness: a measure of how clean the community content is, that is: cleanliness = 1 - risk ratio.
Sampling fraction: if n data items are randomly drawn from N data items as the sample set, the sampling fraction is n/N.
Content vector: a segmented text represented in the form of a vector.
Stratified sampling: also called type sampling. From a population that can be divided into different subpopulations (also called strata), samples (individuals) are randomly drawn from each stratum in a specified proportion. In the present application, sampling is performed on the segmented texts in each cluster formed by clustering the text vectors.
TextCNN: an algorithm that classifies text using a convolutional neural network.
Point estimation: also called fixed-value estimation. The index value actually obtained from the sample is used directly as the estimate of the population parameter. Point estimation is simple and generally does not take sampling error or reliability into account.
Some of the innovative points of the present application are summarized below:
For the specific scenario of community content risk assessment, and in view of the particular nature of content text, the present application proposes first clustering to form clusters and then sampling and assessing the segmented texts in each cluster, where the segmented-text sampling number of each cluster can be determined according to the circumstances of that cluster. In scenarios with an extremely low risk ratio and a low sampling fraction, this effectively improves the stability and representativeness of the sample-based estimate of the risk recall index, improves accuracy, and makes potentially missed content risks easier to find.
Further, by first converting the segmented texts into text vectors, then clustering the text vectors to form clusters, and then sampling and assessing the segmented texts in each cluster (that is, stratified sampling), the representativeness of the sample set is effectively improved and sampling error is reduced. Specifically, conventional stratified sampling stratifies by attributes or categories of the target samples; for example, when sampling a population of people, strata are formed by gender, age, and so on. In the present application, however, community content is entirely unstructured data and has no objective attribute that can be used directly for stratification. The present application therefore clusters the segmented texts at the semantic level to form clusters, separating the community content as a whole into subclasses that share semantic commonality, which are then used for the subsequent stratified sampling.
To make the purposes, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
The first embodiment of the present application relates to a community content risk assessment method, the flow of which is shown in Fig. 1. The method comprises the following steps:
Step 101: text vector conversion
Specifically, the entire content text of the community content is segmented to obtain segmented texts, and each segmented text is converted into a text vector.
It should be noted that text preprocessing mainly cleans or converts content such as traditional Chinese characters, special symbols, emoji, and Chinese numerals that appear in the entire content text of the community content.
It should be noted that segmentation applies a word segmentation algorithm to the preprocessed character string to obtain the segmented texts.
For example, "the weather of today is very good" becomes "today / weather / very good" after segmentation, where "today", "weather", and "very good" are segmented texts.
The advantage of this is as follows. Community content often lacks normalization and standardization, which makes processing inefficient. Effectively cleaning and segmenting the entire content text of the community content provides segmented texts that are convenient to process in the subsequent risk content assessment.
It should be noted that many approaches to text preprocessing and word segmentation currently exist, and the present application is not limited to any specific one.
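The cleaning part of the preprocessing described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the method of the present application: the function name `preprocess` is hypothetical, conversion of traditional Chinese characters and Chinese numerals is left out because it needs mapping tables, and removing every Unicode symbol character is an assumption that also strips currency and math signs.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Clean one piece of community content before segmentation (illustrative)."""
    # Fold full-width letters, digits, and spaces into their standard forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop emoji and other symbols: Unicode general categories starting with "S".
    # Assumption: this also removes currency and math symbols such as "$" and "+".
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("S")
    )
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Full-width "Hello", an ideographic space, an emoji, and a star symbol.
    raw = "\uff28\uff45\uff4c\uff4c\uff4f\u3000\U0001F600 world\u2605"
    print(preprocess(raw))
```

The cleaned string would then be passed to whichever word segmentation algorithm is chosen.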
Specifically, in the present embodiment, a pre-trained TextCNN binary classification model is used as the word vector model. The TextCNN binary classification model is a binary classifier for identifying risk content, and the output of the last layer of the TextCNN network structure can be used as the content vector of the input text.
It should be noted that the text vector conversion of the present application is not limited to the above TextCNN binary classification model; other models, such as LSTM, word2vec, or doc2vec, can be used instead.
Step 102: text vector clustering
Specifically, in this step, the text vectors obtained above are clustered to construct clusters, wherein each cluster contains the segmented texts corresponding to its text vectors.
Specifically, in the present embodiment, the k-means algorithm is selected, and the number of output classes is fixed at N; that is, N clusters are constructed.
It should be noted that the text vector clustering of the present application is not limited to the above k-means algorithm; other clustering algorithms, for example K-MEDOIDS or CLARANS, can be used instead.
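As an illustration of this step, the following is a minimal sketch of plain k-means (Lloyd's algorithm) in pure Python, followed by grouping the segmented texts by cluster label. The function names `kmeans` and `build_clusters`, the fixed iteration count, and the random initialization are assumptions made for the example; a production system would more likely rely on an established clustering library.

```python
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for idx, v in enumerate(vectors):
            labels[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if the cluster emptied out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def build_clusters(texts, vectors, k):
    """Group segmented texts by the cluster label of their text vectors."""
    clusters = {c: [] for c in range(k)}
    for text, label in zip(texts, kmeans(vectors, k)):
        clusters[label].append(text)
    return clusters


if __name__ == "__main__":
    texts = ["a", "b", "c", "d"]
    vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    print(build_clusters(texts, vectors, k=2))
```

Each resulting cluster then plays the role of one stratum in the sampling step.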
The advantage of this is that clustering the text vectors obtained after preprocessing, segmentation, and vector conversion of the entire content text of the community content effectively gives the segmented texts a new attribute. Before this processing, community content is entirely unstructured data with no objective attributes, so the segmented texts cannot be effectively grouped or subsequently assessed.
In other words, the above steps separate the entire content text of the community content into subclasses at the semantic level and give each segmented text a new attribute based on the semantic level. Accordingly, if the text vectors of several segmented texts share the same attribute, those segmented texts also share a certain semantic commonality.
Further, the above steps improve how well the segmented texts of the entire content text of the community content are separated into subclasses, which provides a better basis for the final stratified sampling and for the representativeness of the sampled texts.
Step 103: segmented text sampling
Specifically, in this step, a segmented-text sampling number is determined for each cluster, and in each cluster segmented texts are sampled according to the corresponding sampling number.
Specifically, in the present embodiment, determining a segmented-text sampling number for each cluster formed by clustering the converted text vectors, and sampling accordingly, can be called "stratified sampling".
Further, stratified sampling is also called type sampling: from a population that can be divided into different subpopulations (also called strata), samples (individuals) are randomly drawn from each stratum in a specified proportion.
The advantage of this method is that the sampled segmented texts are more representative and the sampling error is smaller.
For example, the process of stratified sampling includes dividing the units of a population into two or more mutually independent and exhaustive groups (for example, male and female), and then performing simple random sampling within each group, so that the samples from different groups are independent of one another. In stratified sampling, the units of the population are thus grouped by a principal characteristic, and the grouping characteristic is correlated with the population characteristic of interest. The grouping and sampling here correspond to the clustering and sampling in the present embodiment.
It can be understood that, after cluster construction is completed in the preceding step, every segmented text belongs to some cluster. A cluster here therefore corresponds to a stratum in stratified sampling, and stratified sampling can be used to obtain the sample set of segmented texts for each cluster.
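Drawing the sample set for each cluster can be sketched as follows. This is a minimal sketch, assuming the clusters are held in a dictionary from cluster identifier to a list of segmented texts and that the per-cluster sampling numbers have already been determined; the name `stratified_sample` and its parameters are hypothetical.

```python
import random


def stratified_sample(clusters, sample_sizes, seed=None):
    """Draw a simple random sample, without replacement, from every cluster.

    clusters:     dict mapping cluster id -> list of segmented texts
    sample_sizes: dict mapping cluster id -> number of texts to draw
    """
    rng = random.Random(seed)
    samples = {}
    for cluster_id, texts in clusters.items():
        # Never ask for more texts than the cluster actually holds.
        n = min(sample_sizes[cluster_id], len(texts))
        samples[cluster_id] = rng.sample(texts, n)
    return samples


if __name__ == "__main__":
    clusters = {0: ["a", "b", "c"], 1: ["d"]}
    print(stratified_sample(clusters, {0: 2, 1: 5}, seed=1))
```

Sampling within each cluster is independent of the other clusters, which matches the independence of groups described above.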
Further, the effectiveness of stratified sampling is affected by the variation within each stratum: the smaller the variation among samples in the same stratum (variation here can be understood in terms of content topic, risk ratio, and so on), the better stratified sampling performs. In the embodiments of the present application, clustering separates the "strata" according to the semantics of the content itself, which reduces within-stratum variation as far as possible and improves the effect of stratified sampling.
Specifically, in this step, the segmented-text sampling number of each cluster can be determined in any of the following ways:
First: stratified proportional allocation.
Specifically, the ratio of the segmented-text sampling number of each cluster to the number of segmented texts in that cluster is the same for every cluster.
For example, if the segmented-text sampling number is n = 50 and the total number of segmented texts in the cluster is N = 500, then n/N = 0.1 is the sampling proportion, and the sample number of every stratum is determined by this proportion.
Second: disproportionate allocation.
Specifically, when the total number of segmented texts in a cluster accounts for too small a proportion of all segmented texts of the entire content text of the community content, that is, when the proportion is lower than a preset threshold, the share of this cluster's sampling number in the overall sample of the community content can be increased appropriately by manual setting, so that the semantic characteristics of this cluster are adequately reflected in the sample.
Third: Neyman allocation.
Specifically, the segmented-text sampling number of each cluster is proportional to the product of the total number of segmented texts in the cluster and the standard deviation of the cluster.
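The first and third allocation methods above can be sketched numerically as follows. This is a minimal sketch under stated assumptions: the function names and the use of `round` are illustrative choices, rounding means the allocations may not sum exactly to the requested total, and the Neyman version assumes at least one cluster has a non-zero standard deviation.

```python
def proportional_allocation(cluster_sizes, total_samples):
    """Give every cluster the same sampling proportion n_i / N_i."""
    total = sum(cluster_sizes)
    return [round(total_samples * size / total) for size in cluster_sizes]


def neyman_allocation(cluster_sizes, cluster_stds, total_samples):
    """Make n_i proportional to N_i times the cluster's standard deviation."""
    weights = [size * std for size, std in zip(cluster_sizes, cluster_stds)]
    total_weight = sum(weights)  # assumed to be non-zero
    return [round(total_samples * w / total_weight) for w in weights]


if __name__ == "__main__":
    # Three clusters of 500, 300, and 200 segmented texts; 100 samples in total.
    print(proportional_allocation([500, 300, 200], 100))
    # Two equally sized clusters; the second varies three times as much.
    print(neyman_allocation([100, 100], [1.0, 3.0], 40))
```

With proportional allocation the 500-text cluster receives 50 samples, which matches the n/N = 0.1 example above. Neyman allocation shifts samples toward the cluster with the larger standard deviation.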
In the present embodiment, the second method above, disproportionate allocation, is used.
The advantage of this is as follows. In some special scenarios the volume of black data, that is, segmented texts of risk content, is often small. If sampling follows the first method, stratified proportional allocation, black data is unlikely to be drawn, which harms the stability and accuracy of the assessment result. Disproportionate allocation gives a certain guarantee on the segmented-text sampling number of such a cluster, better balances the estimation across the clusters of segmented texts contained in the entire content text of the community content, and prevents the cluster containing black data from going unsampled because its data volume is small.
For example, a sampling rule can be set as follows, where the total number of segmented texts in a cluster is N, the segmented-text sampling number is n, and the sampling fraction is sp:
1) when N < 100, n = N
2) when 100 ≤ N < 1000, n = 100
3) when 1000 ≤ N < 10000, sp = 5%
4) when 10000 ≤ N < 500000, sp = 1%
5) when N ≥ 500000, n = 5000
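The example rule above translates directly into a small function. This is a minimal sketch; the name `sample_size` is hypothetical, and rounding the percentage tiers up to a whole number of texts is an assumption, since the rule itself does not say how fractional results are handled.

```python
def sample_size(cluster_size: int) -> int:
    """Map a cluster's total number of segmented texts N to its sampling number n."""
    if cluster_size < 100:
        return cluster_size  # small clusters are examined in full
    if cluster_size < 1000:
        return 100
    if cluster_size < 10000:
        # sp = 5%, rounded up with integer arithmetic (assumption)
        return (cluster_size * 5 + 99) // 100
    if cluster_size < 500000:
        # sp = 1%, rounded up with integer arithmetic (assumption)
        return (cluster_size + 99) // 100
    return 5000


if __name__ == "__main__":
    for size in (50, 999, 2000, 20000, 500000):
        print(size, sample_size(size))
```

Small clusters are sampled far more heavily than their share of the population would suggest, which is the point of disproportionate allocation.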
It should be noted that, in the cluster-based stratified sampling of the present application, the method for determining the segmented-text sampling number of each cluster is not limited to the above; other allocation schemes can be used instead and are not described further here.
Step 105: counting the segmented texts of risk content
Specifically, in this step, in each cluster, it is judged whether each sampled segmented text is risk content, and the number of sampled segmented texts in the cluster that are confirmed as risk content is counted.
It should be noted that the specific way of judging whether a sampled segmented text is risk content is common knowledge in the art and is not described further here.
It should be noted that, in the above steps, the segmented-text sampling number of a cluster containing a small volume of black data is determined by disproportionate allocation. Therefore, in this step, even if the total number of segmented texts in the cluster containing black data is very small, for example less than 1000, segmented texts can still be sampled from the cluster in a more reasonable way according to a manually set standard, such as the sampling rule illustrated above (n = 100 when N < 1000, or n = N when N < 100). It is then judged whether each sampled segmented text is black data, that is, risk content, and the number of segmented texts in the cluster that are determined to be risk content is counted.
The advantage of this is that risk content that is not easily noticed is more likely to be discovered even when its data volume is small, so that the result of the community content risk assessment is more stable and accurate.
Step 106: determining the risk recall index
Specifically, in this step, the risk recall index of the community content is determined according to the number r_i of segmented texts confirmed as risk content in each cluster.
Specifically, in this step, the risk recall index of the community content is calculated as follows (given here in the standard stratified-estimator form consistent with the symbol definitions below):
\[ \text{risk recall index} = \frac{\sum_{i=1}^{K} N_i \cdot \dfrac{r_i}{n_i}}{\sum_{i=1}^{K} N_i} \]
where K is the number of clusters obtained in the above step of clustering the text vectors to construct clusters;
N_i is the number of segmented texts contained in the i-th cluster;
n_i is the segmented-text sampling number, that is, the sample size, determined for the i-th cluster in the above step of determining the segmented-text sampling number for each cluster;
r_i is the number of segmented texts in the i-th cluster that are marked as risk content after labeling, that is, the number of segmented texts confirmed as risk content.
The present embodiment thus obtains the risk assessment result of the community content, that is, the risk recall index.
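A stratified estimate built from K, N_i, n_i, and r_i can be sketched as follows. This is a minimal sketch, assuming the index is the size-weighted average of the per-cluster sample risk rates r_i / n_i, which is the standard stratified estimator; the function name `risk_estimate` and the cluster sizes in the usage lines are hypothetical.

```python
def risk_estimate(cluster_sizes, sample_sizes, risk_counts):
    """Stratified estimate of the overall risk ratio.

    cluster_sizes: N_i, number of segmented texts in each cluster
    sample_sizes:  n_i, number of texts sampled from each cluster
    risk_counts:   r_i, number of sampled texts confirmed as risk content
    """
    total = sum(cluster_sizes)
    weighted = 0.0
    for N_i, n_i, r_i in zip(cluster_sizes, sample_sizes, risk_counts):
        if n_i <= 0:
            raise ValueError("every cluster needs at least one sampled text")
        # Scale the sample risk rate of the cluster up to the whole cluster.
        weighted += N_i * r_i / n_i
    return weighted / total


if __name__ == "__main__":
    # Hypothetical layout: four clusters, one sampled text per cluster,
    # and the single text drawn from the 3-text cluster is risk content.
    print(risk_estimate([9, 9, 9, 3], [1, 1, 1, 1], [0, 0, 0, 1]))
```

Because each cluster's rate is weighted by the cluster size, heavy sampling of a small cluster does not inflate the overall estimate.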
An example is used below to compare the effect of simple random sampling with that of the cluster-based stratified sampling estimation method of the present application.
Fig. 3 shows a two-dimensional mapping of the segmented texts, where the left panel is the original form and the right panel is the clustering result. The points in the left panel of Fig. 3 are the mapping of the UGC content of a community onto a two-dimensional plane. Each point represents one piece of content (30 points in total); light points represent normal data and dark points represent risk data.
From the figure, the actual risk ratio can be calculated as approximately 7% (1/15).
Suppose 4 text vectors are to be drawn from the 30 points to build the sample set, and the risk ratio of the community content as a whole is to be estimated from the risk ratio of the sample set.
First, consider random sampling. The possible estimates are listed in Table 1. There is a probability of nearly 75% that 4 light points (risk-free content) are drawn, in which case the potentially missed risk is overlooked. There is a probability of 23.9% that one dark point (risk content) is drawn, in which case the estimated risk ratio jumps directly from 0 to 25%, which overestimates the risk compared with the true value of 7%.
Table 1. Probability of each outcome under random sampling and the corresponding risk ratio estimate
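The probabilities quoted for random sampling follow from the hypergeometric distribution and can be checked with a short calculation. This is a minimal sketch using the figures stated in the text (30 points, 2 of them risk data, a sample of 4); the function name `risk_count_probabilities` is hypothetical.

```python
from math import comb


def risk_count_probabilities(total, risky, sample_size):
    """Probability of drawing k risky items when sampling without replacement."""
    ways = comb(total, sample_size)
    return {
        k: comb(risky, k) * comb(total - risky, sample_size - k) / ways
        for k in range(min(risky, sample_size) + 1)
    }


probs = risk_count_probabilities(total=30, risky=2, sample_size=4)
for k, p in probs.items():
    # The point estimate of the risk ratio is simply k divided by the sample size.
    print(f"{k} risky in sample: probability {p:.1%}, estimated risk ratio {k / 4:.0%}")
```

The calculation gives about 74.7% for no risk data in the sample and about 23.9% for exactly one, in line with the figures above; the remaining roughly 1.4% is the case in which both risky points are drawn.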
Next, consider the sampling estimation method proposed in the present application.
Assume that clustering divides the segmented texts into the 4 clusters shown in the right panel of Fig. 3. Disproportionate allocation is then used to guarantee that at least one segmented text is drawn from each cluster while 4 segmented texts are drawn in total; that is, one data item is drawn from each cluster.
In this case the final risk assessment result is affected only by the data item drawn from the lower-right cluster. The possible risk assessment results are listed in Table 2, where the probability that the estimated risk ratio is 0% is reduced to 33.3%.
It should be noted that, in practical applications, drawing risk data into the sample is preferable to overlooking missed risk, even if the risk ratio may be overestimated as a result.
Further, there is a probability of 66.7% that the estimated overall risk ratio is 10%.
It can be seen that, compared with the existing random approach, the estimate of the present application is more stable, is closer to the true value (and with a higher probability), and is more accurate.
Table 2. Probability of each outcome under stratified sampling based on the clusters obtained by clustering, and the corresponding risk ratio estimate (the single cluster referred to is the lower-right cluster in the figure)
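The stratified figures can be reproduced with exact fractions. This is a minimal sketch under an assumption that is inferred rather than stated in the text: the lower-right cluster is taken to hold 3 points, 2 of which are risk data, because that is the layout consistent with the 33.3% and 66.7% probabilities and the 10% estimate. The other three clusters are assumed to contain no risk data.

```python
from fractions import Fraction

TOTAL_POINTS = 30
# Assumed layout of the lower-right cluster (inferred, not stated in the text).
CLUSTER_SIZE, CLUSTER_RISKY = 3, 2

# One text is drawn from the cluster, so the draw is risky with probability 2/3.
p_hit = Fraction(CLUSTER_RISKY, CLUSTER_SIZE)

# Stratified estimate when the drawn text is risky: N_i * (r_i / n_i) / N.
estimate_if_hit = Fraction(CLUSTER_SIZE) * Fraction(1, 1) / TOTAL_POINTS

# Map each possible estimate of the overall risk ratio to its probability.
outcomes = {Fraction(0): 1 - p_hit, estimate_if_hit: p_hit}
expected = sum(estimate * prob for estimate, prob in outcomes.items())

for estimate, prob in outcomes.items():
    print(f"estimate {float(estimate):.0%} with probability {float(prob):.1%}")
print("expected estimate:", expected, "true risk ratio:", Fraction(2, TOTAL_POINTS))
```

Under this assumed layout the expected value of the stratified estimate equals the true risk ratio of 1/15, and the two possible outcomes (0% and 10%) lie much closer to it than the 0%, 25%, and 50% outcomes of random sampling.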
Clearly, compared with random sampling, the method of the present application is better at discovering potentially missed risk in scenarios with an extremely low risk ratio and a low sampling fraction, and it obtains a more representative sample set and a more stable index estimate.
It should be noted that the effect of the method of the present application in practical application scenarios is strongly affected by the data distribution of the specific business scenario and by the effectiveness of the clustering algorithm.
In general, to improve the stability and precision of the community content risk assessment index (the risk ratio), the present application proposes a sampling method based on text vector clustering. The method first converts the segmented texts of the entire content text of the community content into text vectors and clusters the text vectors. Then, based on the clusters generated by clustering, each of which contains the segmented texts corresponding to its text vectors, the method uses stratified sampling to calculate and estimate the risk recall index of the community content text.
The advantage of this is that, in scenarios with an extremely low risk ratio and a low sampling fraction, the method effectively improves the stability and representativeness of the sample-based estimate of the risk recall index, makes potentially missed content risks easier to find, and improves the accuracy of the assessment result.
The second embodiment of the present application relates to a community content risk assessment device, the structure of which is shown in Fig. 2. The community content risk assessment device includes a text vector module, a clustering module, a sampling module, a risk content statistics module, and a risk recall index module.
Each module is described in detail below:
The text vector module is configured to segment the entire content text of the community content to obtain segmented texts, and to convert each segmented text into a text vector.
The clustering module is configured to cluster the text vectors and construct clusters, wherein each cluster contains the segmented texts corresponding to its text vectors.
The sampling module is configured to determine a segmented-text sampling number for each cluster and, in each cluster, to sample segmented texts according to the corresponding sampling number.
The risk content statistics module is configured to judge, in each cluster, whether each sampled segmented text is risk content, and to count the number of sampled segmented texts in the cluster that are confirmed as risk content.
The risk recall index module is configured to determine the risk recall index of the community content according to the number of segmented texts confirmed as risk content in each cluster.
The first embodiment is the method embodiment corresponding to the present embodiment. The technical details of the first embodiment can be applied to the present embodiment, and the technical details of the present embodiment can also be applied to the first embodiment.
It should be noted that those skilled in the art will appreciate that the functions of the modules shown in the above embodiment of the community content risk assessment device can be understood with reference to the foregoing description of the community content risk assessment method. The functions of the modules shown in the above embodiment of the community content risk assessment device can be implemented by a program (executable instructions) running on a processor, or by specific logic circuits. If the above community content risk assessment device of the embodiments of the present application is implemented in the form of software function modules and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read Only Memory), a magnetic disk, or an optical disc. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the embodiments of the present application also provide a computer storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions implement the method embodiments of the present application.
In addition, the embodiments of the present application also provide community content risk assessment equipment, including a memory for storing computer-executable instructions, and a processor. The processor is configured to implement the steps of the above method embodiments when executing the computer-executable instructions in the memory. The processor may be a central processing unit (Central Processing Unit, "CPU" for short), or another general-purpose processor, a digital signal processor (Digital Signal Processor, "DSP" for short), an application-specific integrated circuit (Application Specific Integrated Circuit, "ASIC" for short), or the like. The aforementioned memory may be a read-only memory ("ROM" for short), a random access memory ("RAM" for short), a flash memory (Flash), a hard disk, a solid-state drive, or the like. The steps of the methods disclosed in the embodiments of the present invention may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
It should be noted that, in the application documents of this patent, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a" does not exclude the presence of other identical elements in the process, method, article, or device that includes the element. In the application documents of this patent, a statement that an action is performed according to a certain element means that the action is performed according to at least that element, which covers two cases: the action is performed according to that element only, and the action is performed according to that element and other elements. Expressions such as "a plurality of", "multiple times", and "multiple kinds" include two, twice, and two kinds, as well as more than two, more than twice, and more than two kinds.
All documents mentioned in the present application are deemed to be incorporated in their entirety into the disclosure of the present application, so that they can serve as a basis for amendment if necessary. In addition, it should be understood that, after reading the above disclosure of the present application, those skilled in the art can make various changes or modifications to the present application, and such equivalent forms likewise fall within the scope claimed by the present application.