Embodiment
In the related art, when the social text produced in a social networking application is audited against black samples (samples that have already been audited and found to contain harmful content) in order to complete real-time online prevention and control, the audit is generally implemented in one of the following ways:
In one illustrated implementation, when a social networking application first goes online, dedicated risk-control personnel can be assigned to manually browse the social text produced by the application and judge whether the social text published by users through the application, such as messages or service content, contains harmful content that violates the rules. As the number of users of the application keeps growing and manual review can no longer keep up, the risk-control personnel can configure a large number of keyword rules based on experience, and the auditing platform can then automatically check, based on these configured keyword rules, whether the social text produced by the application contains bad keywords.
However, keyword rules are usually distilled by auditors from their past auditing experience and cannot cover all historical audit information. Moreover, auditing content by keywords is rather mechanical, since it generally relies on direct matching, and it produces a large number of misjudgments.
In another illustrated implementation, the social text produced in the social networking application can be matched exactly, by content, against the audited black samples that contain harmful content, thereby completing the content audit of the social text produced by the application.
However, although exact matching can meet the response-speed requirement of real-time online prevention and control, the text produced by social networking applications is expressed in rich and varied forms, so the recall rate of exact content matching is too low. Moreover, the auditing platform has to spend a large amount of processing resources on exact queries, so the timeliness of content auditing is very poor and the real-time requirement cannot be met.
In a third illustrated implementation, a similarity algorithm such as edit distance or cosine distance can be used to calculate the text similarity between the social text produced by the social networking application and each audited black sample that contains harmful content. The social text is thus fuzzily matched against the black samples, and real-time online prevention and control of harmful content is completed based on the calculated text similarity.
However, with fuzzy matching, calculating the similarity between a social text and each black sample with a similarity algorithm such as edit distance or cosine distance generally involves 1:N polling: the text similarity between a single social text produced by the application and every black sample in the black sample library has to be calculated in turn. Therefore, when the number of black samples is large, polling all black samples one by one to calculate similarity cannot, in terms of response speed, meet the requirement of real-time online prevention and control.
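The cost of this 1:N poll can be illustrated with a minimal Python sketch. The function names `edit_distance` and `poll_black_samples`, and the normalization of the distance into a score between 0 and 1, are assumptions made purely for illustration; they are not taken from the related art being described.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def poll_black_samples(text: str, black_samples: List[str]) -> Tuple[str, float]:
    """Compare one text against every black sample (the 1:N poll)."""
    best_sample, best_score = "", 0.0
    for sample in black_samples:  # one full distance computation per sample
        longest = max(len(text), len(sample)) or 1
        score = 1.0 - edit_distance(text, sample) / longest
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score
```

Each call to `poll_black_samples` runs one complete distance computation per black sample, so the work per incoming text grows linearly with the size of the black sample library.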
It can be seen that, at present, when content auditing is performed on the social text produced by a social networking application to complete real-time online prevention and control, the accuracy of the audit and the response efficiency of the system cannot both be well served. Therefore, how to use the large number of black samples containing harmful content that the auditing platform has accumulated, so as to audit the social text produced by the application quickly and efficiently, has become an urgent problem in the industry.
In view of this, the present application proposes an algorithm that uses the text filtering ratio of word segments to characterize the text similarity between a newly entered text and the black samples, and that completes fuzzy matching between the newly entered text sample and the black samples by exactly matching word segments, thereby obtaining the text similarity of the two.
In this algorithm, based on the same discard policy, the word segments obtained by segmenting the text samples in the original black sample library and the newly entered text sample are each filtered according to multiple graded text filtering ratios, and the word segments remaining after filtering are used to reconstruct, respectively, the text samples in the original black sample library and the newly entered text sample. The filtering ratio of the word segments is then used to characterize the similarity between the newly entered text sample and the black samples: by matching the word segments in the reconstructed black sample libraries against those of the newly entered text sample, a black sample similarity is set for the word segments obtained by segmenting the newly entered text sample. This can significantly improve the efficiency of calculating the similarity between the newly entered text sample and the text samples in the black sample library, so that when real-time online prevention and control is performed on the newly entered text sample based on the black samples, the content audit of that sample can be completed quickly and the response speed of the system is improved.
The present application is described below through specific embodiments and with reference to specific application scenarios.
Referring to Fig. 1, Fig. 1 shows a method for calculating text similarity provided by an embodiment of the present application. The method is applied to a computer device that includes multiple black sample libraries. The multiple black sample libraries are created from the text samples that remain after part of the text samples in an original black sample library have been filtered out based on a preset filtering policy. The multiple black sample libraries correspond to different text filtering ratios, and these text filtering ratios form a gradient. The method performs the following steps:
Step 101: perform word segmentation on a newly entered text sample to obtain several word segments;
Step 102: select the multiple black sample libraries as the target sample library in turn, and, based on the preset filtering policy and according to the text filtering ratio corresponding to the selected target sample library, filter out part of the several word segments;
Step 103: select the remaining word segments among the several word segments as the target word segment in turn, and match the target word segment against the word segments in the target sample library one by one;
Step 104: if the target word segment matches any word segment in the target sample library, set a black sample similarity for the target word segment based on the text filtering ratio corresponding to the target sample library.
The computer device may be any form of computer device that carries the text similarity algorithm shown in steps 101 to 104 and that audits the content of newly entered text samples based on a number of black samples that have already been audited and found to contain harmful content. In practical applications, the computer device may be a server device or a client device. For example, the computer device may be a server in a content auditing platform, or a PC terminal that is connected to the content auditing platform and used to perform content auditing.
The text sample may specifically include social text produced by a social networking application. For example, it may include chat messages published by users through the application, and it may also include service messages related to users' social activity that are generated by the application the users are using, and so on.
The newly entered text sample may be the new social text that the computer device extracts as a user enters it while using the social networking application. The text samples in the black sample library may be the large number of social texts containing harmful content that have accumulated in the historical audit records of the content auditing platform. Of course, in practical applications, the text sample may also be any other type of online text, besides social text, that requires content auditing for real-time online prevention and control; this is not specifically limited in the present application.
The present application proposes an algorithm that uses the text filtering ratio of word segments to characterize the text similarity between a newly entered text and the black samples, and that completes fuzzy matching between the newly entered text sample and the black samples by exactly matching word segments, thereby obtaining the text similarity of the two.
Referring to Fig. 2, Fig. 2 is an overall design architecture diagram of the text similarity algorithm shown in the present application.
As shown in Fig. 2, in this algorithm, based on the same discard policy, the word segments obtained by segmenting all the text samples in the black sample library and the newly entered text sample are each filtered according to multiple graded filtering ratios, and the discrete values of the remaining word segments are used to reconstruct, respectively, the original black sample library and the newly entered text sample. The text filtering ratio of the word segments is then used to characterize the similarity between the newly entered text sample and the black samples, and, by matching the word segments in the reconstructed black sample libraries against those of the newly entered text sample, a black sample similarity is set for the word segments obtained by segmenting the newly entered text sample;
In this similarity algorithm, the calculation of text similarity can be completed quickly through simple word-segment matching, which sets a black sample similarity for the word segments obtained by segmenting the newly entered text sample. The efficiency of calculating the similarity between the newly entered text sample and the text samples in the black sample library can therefore be significantly improved, so that when real-time online prevention and control is performed on the newly entered text sample based on the black samples, the content audit of that sample can be completed quickly and the response speed of the system is improved.
The following description takes as an example the case where the text sample is social text produced by a social networking application, in the application scenario of auditing the content of social text to complete real-time online prevention and control. Obviously, taking social text produced by a social networking application as the text sample is only exemplary and is not intended to limit the technical solution of the present application.
In the present application, the computer device can collect a large number of general social texts to create a general sample library. The social texts in the general sample library may cover the social text produced by the social networking application whose text content the computer device needs to audit, and may also cover the social text produced by all other social networking applications on the Internet that the computer device is able to collect. That is, the computer device can collect the social text produced by various social networking applications on the Internet and then create the general sample library from the collected social text.
In practical applications, the number of social texts in the general sample library needs to be kept at a large order of magnitude, so that the social texts in the library cover, as far as possible, all the keywords that users may produce in their daily online social activity. For example, in one illustrated example, the computer device can collect 20 billion general online social texts to create the general sample library.
After the general sample library has been created, word segmentation can first be performed on each of the full set of social texts in the general sample library. The word segmentation algorithm used is not specifically limited in the present application; when putting the technical solution of the present application into practice, those skilled in the art may refer to the descriptions in the related art.
After word segmentation of the full set of social texts in the general sample library is complete, the large number of word segments obtained may include some invalid ones, for example punctuation marks and stop words that carry no actual meaning. Therefore, after segmentation, the computer device can further filter the large number of word segments obtained, removing the punctuation marks among them and, using the stop-word dictionary it carries, removing the stop words among them.
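This filtering step can be sketched in Python as follows. The sketch assumes that word segmentation has already produced a list of strings; the tiny `STOP_WORDS` set and the function names are hypothetical placeholders, and a real deployment would load its own stop-word dictionary.

```python
import unicodedata
from typing import Iterable, List, Set

# Hypothetical stop-word list; a real system would load its own dictionary.
STOP_WORDS: Set[str] = {"the", "a", "of", "and"}


def is_punctuation(segment: str) -> bool:
    """True when every character is in a Unicode punctuation category (P*)."""
    return bool(segment) and all(
        unicodedata.category(ch).startswith("P") for ch in segment
    )


def filter_segments(
    segments: Iterable[str], stop_words: Set[str] = STOP_WORDS
) -> List[str]:
    """Drop empty segments, punctuation marks and stop words, keeping order."""
    kept = []
    for seg in segments:
        seg = seg.strip()
        if not seg or is_punctuation(seg) or seg.lower() in stop_words:
            continue
        kept.append(seg)
    return kept
```

Checking the Unicode category rather than a fixed ASCII list means that full-width punctuation is removed as well.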
Of course, in practical applications, in addition to filtering out punctuation marks and stop words, other forms of filtering policy can be introduced according to actual needs. For example, part-of-speech analysis can be performed on the large number of word segments obtained, and, according to the result, only the word segments that carry actual meaning are selectively retained; for instance, only the word segments related to the subject, predicate and object are retained.
After the word segments obtained by segmentation have been further filtered, the computer device can, using a preset statistical analysis algorithm, quantify the importance of each word segment with respect to the general sample library and obtain a weight value of each word segment with respect to the general sample library.
The statistical method used to quantify the importance of each word segment with respect to the general sample library is not specifically limited in the present application.
In one illustrated embodiment, the weight value may specifically be an IDF (inverse document frequency) value; the computer device can use IDF values to characterize the importance of each word segment with respect to the general sample library.
When the IDF value of a target word with respect to a corpus is calculated, it is generally obtained by dividing the total number of documents in the corpus by the number of documents that contain the target word and then taking the logarithm of the quotient. Accordingly, when calculating the importance of each word segment with respect to the general sample library, the computer device can count, for each word segment in turn, the number of social texts in the general sample library that contain it, divide the total number of social texts in the general sample library by that count, and take the logarithm of the quotient to obtain the IDF value of the word segment with respect to the general sample library.
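The IDF calculation just described can be sketched in a few lines of Python. The function name `idf_weights` is hypothetical, and the use of the natural logarithm without any smoothing term is an assumption, since the text does not fix the base of the logarithm.

```python
import math
from typing import Dict, Iterable, List


def idf_weights(segmented_texts: Iterable[List[str]]) -> Dict[str, float]:
    """Return log(N / df) for every word segment in the general sample library.

    N is the total number of texts and df is the number of texts that
    contain the segment at least once.
    """
    doc_freq: Dict[str, int] = {}
    total = 0
    for segments in segmented_texts:
        total += 1
        for seg in set(segments):  # count each text at most once per segment
            doc_freq[seg] = doc_freq.get(seg, 0) + 1
    return {seg: math.log(total / df) for seg, df in doc_freq.items()}
```

A segment that appears in every text receives a weight of 0, while rarer segments receive larger weights.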
Of course, in practical applications, besides IDF values, other statistical methods can be used to quantify the importance of each word segment with respect to the general sample library;
For example, statistical methods such as the chi-square test or information entropy can also be used. These are not described in detail in the present application, and those skilled in the art may refer to the descriptions in the related art when putting the technical solution of the present application into practice.
In this example, the computer device can be preconfigured with an original black sample library, which stores the large number of audited social texts containing harmful content (that is, black samples) that have accumulated in the content auditing platform. After the computer device has quantified the importance of each word segment with respect to the general sample library and obtained the corresponding weight values, it can subsequently take these weight values as the basis and, according to multiple preconfigured graded text filtering ratios, filter out part of the black samples in the original black sample library, and then reconstruct the original black sample library from the remaining black samples for each ratio, obtaining multiple reconstructed black sample libraries.
Referring to Fig. 3, Fig. 3 is a processing flowchart, shown in the present application, of reconstructing the social texts in the original black sample library.
In the initial state, a large number of audited social texts containing harmful content have usually accumulated in the content auditing platform. To make full use of these audited social texts, the computer device can take them as black samples to create the original black sample library, and then reconstruct the full set of social texts in the original black sample library.
As shown in Fig. 3, when the full set of social texts in the black sample library is reconstructed, word segmentation can first be performed on each of them. It should be noted that the word segments obtained by segmenting the social texts in the black sample library are generally a subset of the word segments obtained by segmenting the general sample library.
After word segmentation, the computer device can further filter out the punctuation marks and stop words among the word segments, or introduce other filtering policies to filter the word segments; the specific implementation process is not repeated here.
Still referring to Fig. 3, after the word segments have been obtained by segmenting the black sample library and have been further filtered, the computer device can, based on the preset filtering policy and according to the multiple preconfigured graded text filtering ratios, filter out part of the word segments obtained from the original black sample library for each ratio, and complete the reconstruction of the black sample library based on the discrete values of the word segments remaining in each case. In this case, the reconstructed black sample libraries correspond to different text filtering ratios.
In one illustrated embodiment, the importance of each word segment in the general sample library has been quantified in advance, and a weight value characterizing that importance has been calculated. Moreover, the word segments obtained by segmenting the social texts in the original black sample library are generally a subset of the word segments obtained by segmenting the social texts in the general sample library. Therefore, each word segment obtained from the social texts in the original black sample library has a weight value with respect to the general sample library.
In this case, when the preset discard policy is set, word segments can be selectively filtered with reference to the weight value of each word segment in the original black sample library, so as to complete the reconstruction of the original black sample library.
In one illustrated embodiment, the preset filtering policy may specifically include any one of the following discard policies:
discard only the word segments with the highest weight values;
discard only the word segments with the lowest weight values;
discard both the word segments with the highest weight values and those with the lowest weight values.
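The three discard policies listed above can be sketched as a single Python function. The function name, the policy labels `"highest"`, `"lowest"` and `"both"`, the even split between the two ends under `"both"`, and the default weight of 0.0 for segments missing from the weight table are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def apply_discard_policy(
    segments: Sequence[str],
    weights: Dict[str, float],
    ratio: float,
    policy: str = "both",
) -> List[str]:
    """Discard roughly `ratio` of the distinct segments according to `policy`."""
    if policy not in {"highest", "lowest", "both"}:
        raise ValueError(f"unknown policy: {policy}")
    # Rank distinct segments by weight; ties are broken by the segment text.
    distinct = sorted(set(segments), key=lambda s: (weights.get(s, 0.0), s))
    n_drop = int(len(distinct) * ratio)  # truncates towards zero
    if policy == "highest":
        n_low, n_high = 0, n_drop
    elif policy == "lowest":
        n_low, n_high = n_drop, 0
    else:  # "both": split the quota between the two ends
        n_low = n_drop // 2
        n_high = n_drop - n_low
    dropped = set(distinct[:n_low]) | set(distinct[len(distinct) - n_high:])
    return [s for s in segments if s not in dropped]
```

Because the function only depends on the weights and the ratio, the same call can be applied both to the word segments of the black sample library and to those of a newly entered text, which is what using the same discard policy on both sides requires.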
In the present application, since the text filtering ratio of the word segments is used to characterize the text similarity between the newly entered text and the black samples, the proportion of word segments that are finally discarded affects the final text similarity result to some extent.
The word segments with the lowest weight values are the least important, and they have the least influence on the final similarity result. Preferentially filtering out the least important word segments therefore helps to improve the precision of the final text similarity result. However, excessively high precision may reduce the number of hits when the content auditing platform finally judges, based on text similarity, whether a newly entered social text hits the word segments in the black sample library, so that the platform's recall rate for social texts containing harmful content becomes too low. Therefore, in this case, if those skilled in the art are more concerned with the accuracy of the final calculation result, the preset filtering policy can be set to "discard only the word segments with the lowest weight values".
Similarly, the word segments with the highest weight values are the most important and have the greatest influence on the final similarity result. Preferentially filtering out the most important word segments therefore lowers the precision of the final text similarity result, increases the number of hits when the content auditing platform judges, based on text similarity, whether a newly entered social text hits the word segments in the black sample library, and may make the platform's recall rate for social texts containing harmful content too high. Therefore, in this case, if those skilled in the art are more concerned with the platform's recall rate for social texts containing harmful content, the preset filtering policy can be set to "discard only the word segments with the highest weight values".
Of course, in practical applications, the content auditing platform usually needs to balance the accuracy of the text similarity result against the recall rate for social texts containing harmful content. In this case, those skilled in the art can set the preset filtering policy to "discard both the word segments with the highest weight values and those with the lowest weight values"; for example, this is the filtering policy shown in Fig. 3.
In one illustrated embodiment, the specific number of graded text filtering ratios and the step between successive filtering ratios are not specifically limited in the present application; those skilled in the art can configure them according to actual needs or based on engineering experience. For example, in one implementation, the multiple graded filtering ratios are the four filtering ratios 10%, 20%, 40% and 50%.
Still referring to Fig. 3, assume that the multiple graded text filtering ratios are the four increasing ratios 10%, 20%, 40% and 50%. The computer device can select the four text filtering ratios as the target filtering ratio in turn and then, following the preset discard policy and according to the selected target filtering ratio, discard part of the word segments obtained by segmenting the black sample library. It then calculates the discrete value (for example, a hash value) of each remaining word segment and, based on the discrete values of the word segments remaining from the original black sample library, creates a discrete value sample library corresponding to the target filtering ratio (the discrete value sample library is the reconstructed black sample library).
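The reconstruction of the black sample library into one discrete value sample library per filtering ratio can be sketched in Python as follows. SHA-256 is an arbitrary choice of hash for this sketch, and the function names, the default weight of 0.0, and the assumption that each ratio is split evenly between the highest-weight and lowest-weight ends are all illustrative rather than prescribed by the text.

```python
import hashlib
from typing import Dict, Iterable, List, Set


def discrete_value(segment: str) -> str:
    """Hash value of a segment; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(segment.encode("utf-8")).hexdigest()


def drop_both_tails(ranked: List[str], ratio: float) -> List[str]:
    """Remove ratio/2 of the segments from each end of a weight-ordered list."""
    n_each = int(len(ranked) * ratio / 2)
    return ranked[n_each:len(ranked) - n_each]


def rebuild_black_libraries(
    black_segments: Iterable[str],
    weights: Dict[str, float],
    ratios: Iterable[float] = (0.10, 0.20, 0.40, 0.50),
) -> Dict[float, Set[str]]:
    """Return one set of hash values per filtering ratio, lowest ratio first."""
    ranked = sorted(set(black_segments), key=lambda s: (weights.get(s, 0.0), s))
    libraries: Dict[float, Set[str]] = {}
    for ratio in sorted(ratios):
        kept = drop_both_tails(ranked, ratio)
        libraries[ratio] = {discrete_value(s) for s in kept}
    return libraries
```

Storing each library as a set of hash values keeps only what is needed for exact matching and allows constant-time membership checks once the sets are held in memory.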
In one illustrated embodiment, when the computer device selects the preset graded filtering ratios as the target filtering ratio in turn, it can select them in ascending order of filtering ratio.
Still referring to Fig. 3, take as an example the case where IDF values characterize the importance of each word segment with respect to the general sample library. In implementation, the computer device can first, according to the 10% filtering ratio, discard from the word segments obtained by segmenting the black sample library those whose IDF values are above the 95th percentile (that is, the 5% with the highest IDF values) and those whose IDF values are below the 5th percentile (that is, the 5% with the lowest IDF values). It then calculates the discrete value of each remaining word segment and generates a first discrete value sample library from the calculated discrete values;
Further, after generating the first discrete value sample library, the computer device can continue, according to the 20% filtering ratio, to discard from the word segments obtained by segmenting the black sample library those whose IDF values are above the 90th percentile and those whose IDF values are below the 10th percentile. It then calculates the discrete value of each remaining word segment and generates a second discrete value sample library from the calculated discrete values.
By analogy, the computer device can subsequently continue, according to the 40% filtering ratio, to discard from the word segments obtained by segmenting the black sample library those whose IDF values are above the 80th percentile and those whose IDF values are below the 20th percentile, then calculate the discrete value of each remaining word segment and generate a third discrete value sample library. It can also continue, according to the 50% filtering ratio, to discard from the word segments obtained by segmenting the black sample library those whose IDF values are above the 60th percentile and those whose IDF values are below the 30th percentile, then calculate the discrete value of each remaining word segment and generate a fourth discrete value sample library.
As shown in Fig. 3, after the computer device has reconstructed the black sample library in the manner shown above, four discrete value sample libraries corresponding to different filtering ratios are obtained, and the computer device can load the discrete value records of each reconstructed discrete value sample library into memory. At this point, the reconstruction of the original black sample library is finished: the original black sample library has been reconstructed, according to the different text filtering ratios, into multiple discrete value sample libraries. Since the finally reconstructed discrete value sample libraries contain only the discrete values of a number of word segments from the black sample library, the amount of data that the computer device needs to load is greatly reduced.
Referring to Fig. 4, Fig. 4 is a processing flowchart, shown in the present application, of performing similarity scoring on a newly entered social text.
As shown in Fig. 4, after extracting the social text newly entered by a user through the social networking application, the computer device can reconstruct the text in turn, using the same discard policy, based on the filtering ratios corresponding to the multiple reconstructed discrete value sample libraries.
First, the computer device can perform word segmentation on the extracted newly entered social text to obtain several word segments. After segmentation, it can further filter out the punctuation marks and stop words among the word segments, or introduce other filtering policies to filter the word segments; the specific implementation process is not repeated here.
After the word segments have been obtained by segmenting the newly entered social text and have been further filtered, the computer device can select the multiple reconstructed discrete value sample libraries as the target sample library in turn;
In one illustrated embodiment, when the computer device selects the multiple discrete value sample libraries as the target sample library in turn, it can select them in ascending order of their corresponding filtering ratios.
Once a target sample library has been selected, the computer device can, based on the same filtering policy and according to the text filtering ratio corresponding to the selected target sample library, filter out some of the text tokens obtained by word segmentation, completing the first reconstruction of the newly entered social text.
After the first reconstruction is complete, the remaining text tokens can be selected one after another as the target token, and the discrete value of the selected target token is calculated. The calculated discrete value is then matched, one by one, against the discrete values in the target sample library loaded in memory. If the discrete value of the target token matches any discrete value in the target sample library, a black sample similarity can be set for the target token based on the text filtering ratio corresponding to the target sample library.
In one illustrated embodiment, when setting the black sample similarity for the target token based on the text filtering ratio corresponding to the target sample library, the text filtering ratio can be converted into a target value and the difference between 1 and the target value calculated; the black sample similarity of the target token is then set to be greater than or equal to that difference. For example, when the target filtering ratio is 10%, the similarity between the target token and the black samples in the black sample library can be set to be greater than or equal to 0.9.
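The conversion from filtering ratio to similarity can be written as a one-line function. This is a minimal sketch; the name `similarity_floor` and the choice to pass the ratio as an integer percentage are assumptions made for illustration.

```python
def similarity_floor(filter_ratio_percent: int) -> float:
    """Lower bound of the black sample similarity: 1 minus the ratio as a fraction."""
    if not 0 <= filter_ratio_percent <= 100:
        raise ValueError("filter ratio must be between 0 and 100 percent")
    return (100 - filter_ratio_percent) / 100

floor_10 = similarity_floor(10)  # ratio of 10% gives a floor of 0.9
```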
Of course, if the discrete value of the target token does not match any discrete value in the target sample library, the next text token can be selected as the target token and the above process executed again, and so on, until the discrete values of all text tokens have been matched against the discrete values in the target sample library. At that point, discrete-value matching after the first reconstruction is complete.
After discrete-value matching following the first reconstruction is complete, some of the text tokens obtained by segmenting the newly entered social text may still have no similarity set. In that case, the next discrete-value sample library can be selected as the target sample library, a second reconstruction of the newly entered social text performed in the manner shown above according to the text filtering ratio corresponding to that target sample library, and the discrete-value matching process for scoring each text token executed again. This continues until the newly entered social text has been reconstructed according to the text filtering ratio of each of the discrete-value sample libraries and the corresponding discrete-value matching has been completed.
It should be noted that, if the target sample libraries are selected in ascending order of the text filtering ratios of the discrete-value sample libraries, a text token that was already given a similarity score after a previous reconstruction need not take part in similarity scoring after the next reconstruction.
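The iteration over the sample libraries can be sketched in Python as below. It is an illustration under stated assumptions, not a definitive implementation: the discrete value is taken to be an MD5 hex digest, each library is an in-memory set keyed by its filtering ratio in percent, and the caller supplies the filtering function. The names `cascade_score` and `filter_tokens`, and the sample data, are hypothetical.

```python
import hashlib
from typing import Callable, Dict, FrozenSet, List

def discrete_value(token: str) -> str:
    """Assumed discrete value: the MD5 hex digest of the token."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()

def cascade_score(
    tokens: List[str],
    libraries: Dict[int, FrozenSet[str]],
    filter_tokens: Callable[[List[str], int], List[str]],
) -> Dict[str, float]:
    """Score tokens against libraries keyed by filtering ratio (percent)."""
    scores: Dict[str, float] = {}
    for ratio in sorted(libraries):               # lowest ratio first
        remaining = filter_tokens(tokens, ratio)  # reconstruct the new text
        for token in remaining:
            if token in scores:                   # scored in an earlier round
                continue
            if discrete_value(token) in libraries[ratio]:
                scores[token] = (100 - ratio) / 100
    for token in tokens:                          # no match in any library
        scores.setdefault(token, 0.0)
    return scores

# Hypothetical data; the filter keeps every token purely for illustration.
libs = {
    10: frozenset({discrete_value("casino")}),
    20: frozenset({discrete_value("casino"), discrete_value("bonus")}),
}
result = cascade_score(["casino", "bonus", "hello"], libs, lambda toks, r: list(toks))
```

Because the libraries are visited in ascending order of ratio, the first match a token receives is also its highest possible score, which is why tokens that are already scored can be skipped in later rounds.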
Continuing with Fig. 4, take as an example the case where the IDF value is used to characterize the importance of each text token relative to a general sample library, and the black sample library is reconstructed according to four graded text filtering ratios of 10%, 20%, 40% and 50%, yielding four discrete-value sample libraries. In implementation, the four discrete-value sample libraries can be selected one after another as the target sample library in ascending order of their corresponding text filtering ratios.
As shown in Fig. 4, the first discrete-value sample library, whose text filtering ratio is 10%, can be selected as the target sample library first. According to the 10% text filtering ratio, from the text tokens obtained by segmenting the newly entered social text, the tokens whose IDF values are above the 95th percentile (the 5% with the highest IDF values) and those below the 5th percentile (the 5% with the lowest IDF values) are filtered out, and the discrete value of each remaining text token is calculated. The remaining text tokens are then selected one after another as the target token, and the discrete value of the target token is matched, one by one, against the discrete values in the first discrete-value sample library. If it matches any discrete value in the first discrete-value sample library, the similarity of the target token relative to the black samples in the black sample library can be set to no less than 90%.
Of course, if the discrete value of the target token does not match any discrete value in the first discrete-value sample library, the next text token can be selected as the target token and the above process executed again, and so on, until the discrete values of all text tokens have been matched against the discrete values in the first discrete-value sample library.
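The two-sided IDF filtering used in this round can be sketched as follows, assuming the filtering ratio is split evenly between the highest and lowest IDF values, as in the 10% case above (5% from each end). The function name `trim_by_idf` and the sample weights are hypothetical.

```python
from typing import Dict, List

def trim_by_idf(idf: Dict[str, float], ratio_percent: int) -> List[str]:
    """Drop the highest- and lowest-IDF tokens, half of the ratio from each tail."""
    ranked = sorted(idf, key=idf.get)  # ascending IDF
    # Tokens to drop per tail; integer arithmetic avoids rounding surprises.
    per_tail = len(ranked) * ratio_percent // 200
    return ranked[per_tail: len(ranked) - per_tail]

# Hypothetical IDF weights for 20 tokens: t0 has the lowest, t19 the highest.
weights = {f"t{i}": float(i) for i in range(20)}
kept = trim_by_idf(weights, 10)  # drops t0 and t19
```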
Continuing with Fig. 4, once the discrete values of all text tokens obtained by segmenting the newly entered social text have been matched against the discrete values in the first discrete-value sample library, if there are still text tokens with no similarity score set, the second discrete-value sample library, whose text filtering ratio is 20%, can be selected as the target sample library. According to the 20% text filtering ratio, from the text tokens obtained by segmenting the newly entered social text, the tokens whose IDF values are above the 90th percentile and those below the 10th percentile are filtered out, and the discrete value of each remaining text token is calculated. The remaining text tokens are then selected one after another as the target token, and the discrete value of the target token is matched, one by one, against the discrete values in the second discrete-value sample library. If it matches any discrete value in the second discrete-value sample library, the similarity of the target token relative to the black samples in the black sample library can be set to no less than 80%.
If the discrete value of the target token does not match any discrete value in the second discrete-value sample library, the next text token can be selected as the target token and the above process executed again, and so on, until the discrete values of all text tokens have been matched against the discrete values in the second discrete-value sample library.
Similarly, once the discrete values of all text tokens obtained by segmenting the newly entered social text have been matched against the discrete values in the second discrete-value sample library, if there are still text tokens with no similarity score set, the third discrete-value sample library, whose text filtering ratio is 40%, can be selected as the target sample library. According to the 40% text filtering ratio, from the text tokens obtained by segmenting the newly entered social text, the tokens whose IDF values are above the 80th percentile and those below the 20th percentile are filtered out, and the similarity scoring process shown above is executed again.
Further, once the discrete values of all text tokens obtained by segmenting the newly entered social text have been matched against the discrete values in the third discrete-value sample library, if there are still text tokens with no similarity score set, the fourth discrete-value sample library, whose text filtering ratio is 50%, can be selected as the target sample library. According to the 50% text filtering ratio, from the text tokens obtained by segmenting the newly entered social text, the tokens whose IDF values are above the 75th percentile and those below the 25th percentile are filtered out, and the similarity scoring process shown above is executed again; the specific implementation is not described again here.
Of course, in practice, after the text tokens obtained by segmenting the newly entered social text have been reconstructed by filtering out some tokens according to the text filtering ratio of each discrete-value sample library, and the discrete values of all text tokens have been matched against all discrete values in the corresponding libraries, any text token in the newly entered text sample that was selected as the target token but whose discrete value matched none of the discrete-value sample libraries can have its black sample similarity, that is, its similarity to the text samples in the black sample library, set to 0.
It can be seen that, by using the text filtering ratio of the text tokens to characterize the text similarity between the newly entered social text and the black samples, and by using discrete-value matching to set a similarity score against the black samples for each text token in the newly entered social text, fuzzy matching between the newly entered text sample and the black samples can be accomplished by means of exact matching. Compared with the traditional approach of fuzzy matching the newly entered social text against the black samples with similarity algorithms such as edit distance or cosine distance, this can significantly improve computational efficiency.
In this example, after the similarity scoring flow shown in Fig. 4 has produced a similarity score for each text token obtained by segmenting the newly entered social text, the computer device can perform content auditing on the newly entered social text based on the scoring results.
Specifically, the computer device can preset a similarity threshold and then compare the similarity score of each text token in the newly entered social text with that threshold. If the similarity of any text token in the newly entered social text reaches the similarity threshold, that text token can be identified as a sensitive keyword, and corresponding protective measures (such as blocking the text) can be taken, so that the newly entered social text is handled in real time as a black sample containing harmful content.
Of course, if the similarity scores of the text tokens in the newly entered social text are all below the similarity threshold, the newly entered social text is a normal social text and no processing is required.
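The threshold comparison can be sketched as a small function. The name `audit`, the scores and the threshold value are hypothetical and serve only to illustrate the decision rule described above.

```python
from typing import Dict, List, Tuple

def audit(scores: Dict[str, float], threshold: float) -> Tuple[bool, List[str]]:
    """Return (is_black_sample, sensitive_keywords) for one newly entered text."""
    # A token is sensitive when its similarity reaches the threshold.
    sensitive = [tok for tok, score in scores.items() if score >= threshold]
    return bool(sensitive), sensitive

# Hypothetical scores and threshold.
blocked, keywords = audit({"casino": 0.9, "bonus": 0.6, "hello": 0.0}, threshold=0.8)
```

A single token at or above the threshold is enough to treat the whole text as a black sample; a text whose tokens all score below the threshold passes through untouched.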
In addition, it should be noted that, after the newly entered social text has been handled as a black sample based on its similarity scores, it can be added to the original black sample library as a black sample. In this way, based on the results of content auditing, the black samples in the original black sample library are incrementally updated, continually enriching the data samples in the original black sample library.
Corresponding to the above method embodiments, this application also provides device embodiments.
Refer to Fig. 5. This application proposes a text similarity computing device 50, applied to a computer device that includes multiple black sample libraries. The multiple black sample libraries are created from the remaining text samples after some of the text samples in an original black sample library are filtered based on a preset filtering policy, and they correspond to different text filtering ratios. Refer to Fig. 6: the hardware architecture of the computer device carrying the text similarity computing device 50 typically includes a CPU, memory, non-volatile storage, a network interface, an internal bus and so on. Taking a software implementation as an example, the text similarity computing device 50 can generally be understood as a computer program loaded in memory which, when run by the CPU, forms a logical device combining software and hardware. The device 50 includes:
a word segmentation module 501, which performs word segmentation on a newly entered text sample to obtain a number of text tokens;
a filtering module 502, which selects the multiple black sample libraries one after another as the target sample library and, based on the preset filtering policy and according to the text filtering ratio corresponding to the target sample library, filters out some of the text tokens;
a matching module 503, which selects the remaining text tokens one after another as the target text token and matches the target text token against the text tokens in the target sample library;
a setting module 504, which, if the target text token matches any text token in the target sample library, sets a black sample similarity for the target text token based on the text filtering ratio corresponding to the target sample library.
In this example, the word segmentation module 501 further:
performs word segmentation on the text samples in the black sample library one after another.
The filtering module 502 further:
selects the preset graded text filtering ratios one after another as the target filtering ratio and, based on the preset filtering policy and according to the target filtering ratio, filters out some of the text tokens obtained by segmenting the black sample library.
The device 50 also includes:
a creation module 505 (not shown in Fig. 5), which calculates the discrete values of the remaining text tokens in the black sample library and, based on the calculated discrete values, creates the black sample library corresponding to the target filtering ratio.
In this example, the text filtering ratios corresponding to the multiple black sample libraries are graded. The filtering module 502 further:
selects the multiple black sample libraries one after another as the target sample library in ascending order of their corresponding text filtering ratios.
In this example, the preset filtering policy includes any one of the following policies:
discarding only the text tokens with the highest weight values;
discarding only the text tokens with the lowest weight values;
discarding both the text tokens with the highest weight values and those with the lowest weight values.
In this example, the weight value is the IDF value of the text token relative to a general sample library.
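The three filtering policies can be sketched with an enumeration. This is an illustrative sketch only: the names `DiscardPolicy` and `apply_policy` are hypothetical, and splitting the discarded tokens evenly between the two ends in the two-sided case is an assumption.

```python
from enum import Enum
from typing import Dict, List

class DiscardPolicy(Enum):
    HIGHEST = "highest"  # discard only the highest-weight tokens
    LOWEST = "lowest"    # discard only the lowest-weight tokens
    BOTH = "both"        # discard from both ends

def apply_policy(weights: Dict[str, float], ratio_percent: int,
                 policy: DiscardPolicy) -> List[str]:
    """Return the tokens kept after discarding ratio_percent of them by weight."""
    ranked = sorted(weights, key=weights.get)   # ascending weight
    drop = len(ranked) * ratio_percent // 100   # total tokens to discard
    if policy is DiscardPolicy.HIGHEST:
        return ranked[: len(ranked) - drop]
    if policy is DiscardPolicy.LOWEST:
        return ranked[drop:]
    low = drop // 2                             # BOTH: split between the tails
    high = drop - low
    return ranked[low: len(ranked) - high]

# Hypothetical IDF weights for 10 tokens: t0 has the lowest, t9 the highest.
idf = {f"t{i}": float(i) for i in range(10)}
kept_both = apply_policy(idf, 20, DiscardPolicy.BOTH)  # drops t0 and t9
```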
In this example, the setting module 504:
converts the text filtering ratio corresponding to the target sample library into a target value;
calculates the difference between 1 and the target value;
sets the black sample similarity of the target text token to be greater than or equal to that difference.
In this example, the setting module 504 further:
sets the black sample similarity of a text token to 0 when that text token, in the newly entered text sample, matches none of the text tokens in the multiple black sample libraries.
In this example, the device 50 also includes:
a protection module 506 (not shown in Fig. 5), which, when the black sample similarity of any text token in the newly entered text sample reaches a preset threshold, applies real-time protective measures to the newly entered text sample as a black sample containing harmful content.
In this example, the text sample is a social text, and the text samples in the black sample library are social texts containing harmful content.
For the device embodiments, since they basically correspond to the method embodiments, reference may be made to the relevant descriptions of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this application. Those of ordinary skill in the art can understand and implement it without creative effort.
The systems, devices, modules or units described in the above embodiments can be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail sending and receiving device, game console, tablet computer, wearable device, or any combination of several of these devices.
Other embodiments of this application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this application indicated by the following claims.
It should be understood that this application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
The above are merely preferred embodiments of this application and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of this application shall fall within its scope of protection.