CN101446970A - Method and device for censoring and processing text content issued by a user

Info

Publication number: CN101446970A
Application number: CNA2008102200098A
Authority: CN
Inventors: 刘怀军; 刘昌毅
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co., Ltd.
Priority date: 2008-12-15
Filing date: 2008-12-15
Publication date: 2009-06-03
Anticipated expiration: 2028-12-15
Also published as: CN101446970B

Abstract

The invention discloses a method and a device for censoring and processing text content issued by a user. The method comprises the following steps: receiving the text content issued by the user and judging the user information against a list rule database; if the user information belongs neither to a white list or white rule nor to a blacklist or black rule, calculating a first similarity between a first feature vector of the user's text content and a second feature vector of pre-established spam sample content, and judging from the first similarity whether the text content issued by the user is qualified content; if it is qualified content, publishing it; otherwise, sending it for manual censoring. The method and the device censor and filter both the user information and the text content issued by the user, so that not all user-issued information has to be censored manually, which greatly reduces manual censoring time, saves human resources and correspondingly improves censoring efficiency.

Description

Method and device for reviewing text content published by a user

Technical field

The present invention relates to the field of communications, and in particular to a method and a device for reviewing text content published by a user.

Background

At present, the Wenwen community (http://wenwen.soso.com) offers a question-and-answer service similar to Baidu Zhidao and Sina iAsk: users can post questions on a page or answer questions posted by others, which makes it much easier for users to obtain information. The Wenwen community currently receives more than 200,000 new questions every day, and all information submitted by its users is reviewed manually. This consumes a large amount of manual review time, wastes human resources and keeps review efficiency low.

Summary of the invention

The invention provides a method and a device for reviewing text content published by a user, which save a large amount of manual review time and improve review efficiency.

The technical solution of the invention is a method for reviewing text content published by a user, comprising the steps of:

receiving the text content published by the user and judging the user information against a list rule database, the list rule database comprising a blacklist, black rules, a white list and white rules;

if the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, performing format conversion on the user's text content and extracting the content words from the text content;

calculating the inverse document frequency weight of each extracted content word in a pre-established document database, and obtaining a first feature vector composed of these inverse document frequency weights;

calculating a first similarity between the first feature vector and a second feature vector of pre-established spam sample content, and judging from the first similarity whether the text content published by the user is qualified content; if it is qualified content, publishing the text content.

The invention also discloses a device for reviewing text content published by a user, comprising:

a review module, configured to receive the text content published by the user and to judge the user information against a list rule database, the list rule database comprising a blacklist, black rules, a white list and white rules;

a conversion module, configured to perform format conversion on the text content published by the user and to extract the content words from the text content when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule;

a calculation module, configured to calculate the inverse document frequency weight of each extracted content word in a pre-established document database, to obtain a first feature vector composed of these inverse document frequency weights, and to calculate a first similarity between the first feature vector and a second feature vector of pre-established spam sample content;

a judgment module, configured to judge from the first similarity whether the user's text content is qualified content and, if it is, to publish the text content.

With the method and device of the invention, content filtering is applied only to text content published by users who belong neither to the white list or a white rule nor to the blacklist or a black rule. Text content published by users covered by the blacklist or a black rule, and text content judged unqualified, can be sent for manual review; text content published by users covered by the white list or a white rule, and text content judged qualified, is published directly. In this way not all user-published information has to be reviewed manually, which saves a large amount of manual review time, saves human resources and correspondingly improves review efficiency.

Description of drawings

Fig. 1 is a flowchart of the method of the invention for reviewing text content published by a user;

Fig. 2 is a first structural block diagram of the device of the invention for reviewing text content published by a user;

Fig. 3 is a second structural block diagram of the device of the invention for reviewing text content published by a user;

Fig. 4 is a third structural block diagram of the device of the invention for reviewing text content published by a user.

Embodiments

With the method and device of the invention, content filtering is applied only to text content published by users who belong neither to the white list or a white rule nor to the blacklist or a black rule. Text content published by users covered by the blacklist or a black rule, and text content judged unqualified, is sent for manual review; text content published by users covered by the white list or a white rule, and text content judged qualified, is published directly. In this way not all user-published information has to be reviewed manually, which saves a large amount of manual review time, saves human resources and correspondingly improves review efficiency.

The invention is described in detail below with reference to the drawings and specific embodiments.

The method of the invention can be applied to question-and-answer services such as the Wenwen community, Baidu Zhidao and Sina iAsk.

As shown in Fig. 1, the method of the invention for reviewing text content published by a user comprises the following steps.

S100: receive the text content published by the user. S101: judge the user information against the list rule database, which comprises a blacklist, black rules, a white list and white rules. In one embodiment, the blacklist may be a list of users with a high probability of providing spam, and the white list a list of users with a high probability of providing legitimate information. Black rules are set according to the user's level or credit rating and indicate that the level or credit rating is low; white rules are likewise set according to the user's level or credit rating and indicate that the level or credit rating is high.
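The check in S101 can be pictured as a three-way routing decision. The following is a minimal Python sketch under stated assumptions: the class name `ListRuleDatabase`, the numeric level thresholds and the choice to test the blacklist before the white list are all hypothetical, since the text only says that the rules depend on the user's level or credit rating.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    PUBLISH = "publish"              # white list or white rule
    MANUAL_REVIEW = "manual_review"  # blacklist or black rule
    CONTENT_CHECK = "content_check"  # neither: continue with S102 onwards


@dataclass
class ListRuleDatabase:
    blacklist: set = field(default_factory=set)
    whitelist: set = field(default_factory=set)
    black_max_level: int = 1   # hypothetical: level <= 1 matches the black rule
    white_min_level: int = 10  # hypothetical: level >= 10 matches the white rule

    def route(self, user_id: str, level: int) -> Route:
        # Assumption: black entries take precedence if a user matches both.
        if user_id in self.blacklist or level <= self.black_max_level:
            return Route.MANUAL_REVIEW
        if user_id in self.whitelist or level >= self.white_min_level:
            return Route.PUBLISH
        return Route.CONTENT_CHECK
```

Only users routed to `CONTENT_CHECK` go through the similarity calculation described in the following steps.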

S102: if the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, perform format conversion on the text content published by the user and extract the content words from it. In one embodiment, format conversion may include converting Traditional Chinese characters to Simplified Chinese, converting full-width characters to half-width, removing redundant spaces, and so on. Content words are the core words of the text content; function words are not treated as core words.
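A rough Python sketch of the format conversion and content-word (notional-word) extraction follows. It rests on several assumptions: the two lookup tables are tiny hypothetical stand-ins for a full Traditional-to-Simplified dictionary and a function-word list, and whitespace splitting stands in for a real Chinese word segmenter, which the text does not specify. `unicodedata.normalize("NFKC", ...)` is used for the full-width to half-width step; it also folds other compatibility characters.

```python
import re
import unicodedata

# Hypothetical miniature tables; a real system would use a complete
# conversion dictionary and a segmenter with part-of-speech information.
TRAD_TO_SIMP = {"問": "问", "題": "题", "網": "网"}
FUNCTION_WORDS = {"the", "a", "of", "is", "to", "的", "了", "吗"}


def convert_format(text: str) -> str:
    # NFKC maps full-width letters, digits and the ideographic space
    # to their half-width equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Traditional-to-Simplified conversion, character by character.
    text = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def extract_content_words(text: str) -> list:
    # Assumption: tokens are separated by spaces after conversion.
    tokens = convert_format(text).lower().split(" ")
    return [t for t in tokens if t and t not in FUNCTION_WORDS]
```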

S103: calculate the inverse document frequency (IDF) weight of each extracted content word in the pre-established document database, and obtain a first feature vector composed of these IDF weights. In one embodiment, the document database may consist of the text content published by all users. The IDF weight of each content word may be calculated according to the formula

wgt = t_{f} \times \lg \frac{U}{V}

where wgt is the IDF weight, t_{f} is the frequency with which the content word occurs in the user's text content, U is the total number of documents in the document database, and V is the number of documents in which the content word occurs.
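The weighting formula can be sketched in Python as follows. This is an illustration under assumptions rather than the patented implementation: `lg` is read as the base-10 logarithm, the document database is modelled as a list of word sets, t_f is taken as a raw occurrence count, and words that occur in no document (V = 0) are skipped because the text does not say how to handle them.

```python
import math
from collections import Counter


def idf_weights(content_words, documents):
    """Return {word: t_f * lg(U / V)} for each distinct content word.

    `documents` is a list of sets of words standing in for the
    pre-established document database.
    """
    total_docs = len(documents)          # U
    term_freq = Counter(content_words)   # t_f for each word
    weights = {}
    for word, tf in term_freq.items():
        docs_with_word = sum(1 for doc in documents if word in doc)  # V
        if docs_with_word == 0:
            # Assumption: unseen words are left out of the feature vector.
            continue
        weights[word] = tf * math.log10(total_docs / docs_with_word)
    return weights
```

For example, with U = 100 documents, a word that appears twice in the post and in only one document gets weight 2 × lg(100) = 4, while a word that appears once in the post and in ten documents gets 1 × lg(10) = 1.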

S104: calculate the first similarity between the first feature vector and the second feature vector of the pre-established spam sample content. The second feature vector can be obtained in advance by the same process as the first: take a spam sample, perform format conversion on it, extract its content words, calculate the IDF weight of each content word in the document database, and form the second feature vector from these weights. In one embodiment, the first similarity is calculated according to the formula

\cos(X, Y) = \frac{\sum_{\alpha = 1, \beta = 1}^{\alpha = m, \beta = n} x_{\alpha} y_{\beta}}{\sqrt{\sum_{\alpha = 1}^{m} x_{\alpha}^{2} \sum_{\beta = 1}^{n} y_{\beta}^{2}}}

where \cos(X, Y) denotes the first similarity, and

X = \{x_{1}, \ldots, x_{m}\}, Y = \{y_{1}, \ldots, y_{n}\}

denote the first feature vector and the second feature vector respectively.
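A small Python sketch of the similarity calculation is given below. It assumes the conventional reading of the cosine formula, in which both feature vectors are indexed by word and the numerator multiplies the weights of the same word in each vector; the dictionary representation and the zero result for an empty vector are illustrative choices, not taken from the text.

```python
import math


def cosine_similarity(x: dict, y: dict) -> float:
    """Cosine of two sparse feature vectors given as {word: weight}."""
    # Numerator: products of weights for words present in both vectors.
    dot = sum(weight * y[word] for word, weight in x.items() if word in y)
    # Denominator: sqrt(sum x^2 * sum y^2), written as a product of norms.
    norm_x = math.sqrt(sum(weight * weight for weight in x.values()))
    norm_y = math.sqrt(sum(weight * weight for weight in y.values()))
    if norm_x == 0 or norm_y == 0:
        # Assumption: an empty or all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_x * norm_y)
```

Because IDF weights are non-negative, the result lies between 0 (no shared weighted words) and 1 (identical direction).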

S105: judge from the first similarity whether the text content published by the user is qualified content. The judgment can be made in many ways and can be configured as needed. In one embodiment, a predetermined threshold is set: if the first similarity is greater than the threshold, the text content published by the user is judged to be unqualified; otherwise it is judged to be qualified.

If the content is qualified, step S107 is performed to publish the text content; otherwise, in one embodiment, step S106 may be performed to send the text content for manual review.
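The threshold test of S105 and the two outcomes S106 and S107 amount to a single comparison. In this short Python sketch the function name and the threshold value of 0.8 are hypothetical; the text leaves the threshold to be chosen as needed.

```python
def review_by_first_similarity(first_similarity: float, threshold: float = 0.8) -> str:
    """Return the action for a post that reached the content check."""
    if first_similarity > threshold:
        # Too close to the spam sample: unqualified, step S106.
        return "manual_review"
    # Qualified content: step S107.
    return "publish"
```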

In one embodiment, step S101 may also be followed by these steps: S102, if the user information belongs to the blacklist or a black rule, send the text content published by the user for manual review; S103, if the user information belongs to the white list or a white rule, publish the text content published by the user.

To judge more comprehensively and accurately whether the content published by the user is qualified, and to reduce the probability of misjudgment, in one embodiment, when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, the method may further include the step of detecting a second similarity between the text content published by the user and a pre-established feature library that includes phone number formats, web page formats, Martian-script formats and the like, and judging from the second similarity and the first similarity whether the text content is qualified. For this judgment, weights can be assigned to the first similarity and the second similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unqualified; otherwise it is judged to be qualified. Alternatively, the second similarity alone may be compared with a predetermined value, and the text content judged unqualified directly if the second similarity is greater.
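One way to picture the second similarity and the weighted combination is the Python sketch below. Everything specific in it is an assumption: the text does not say how the second similarity is computed, so it is taken here as the fraction of feature patterns that match; the two regular expressions are hypothetical library entries (an 11-digit mobile number starting with 1, and a URL-like string); Martian-script detection is omitted; and the weights and limit are illustrative values.

```python
import re

# Hypothetical feature library; Martian-script patterns are left out.
FEATURE_PATTERNS = {
    "phone_number": re.compile(r"\b1\d{10}\b"),
    "web_page": re.compile(r"(https?://|www\.)\S+", re.IGNORECASE),
}


def second_similarity(text: str) -> float:
    # Assumption: similarity = share of library patterns found in the text.
    hits = sum(1 for pattern in FEATURE_PATTERNS.values() if pattern.search(text))
    return hits / len(FEATURE_PATTERNS)


def is_unqualified_two_signals(first_sim: float, second_sim: float,
                               w1: float = 0.6, w2: float = 0.4,
                               limit: float = 0.5) -> bool:
    # Weighted sum of the two similarities compared with a predetermined value.
    return w1 * first_sim + w2 * second_sim > limit
```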

For the same purpose of judging more comprehensively and accurately whether the content published by the user is qualified and reducing the probability of misjudgment, in one embodiment, when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, the method may further include the step of counting the number of characters in the text content published by the user, and judging from the character count, the first similarity and the second similarity whether the text content is qualified. For this judgment, weights can be assigned to the character count, the first similarity and the second similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unqualified; otherwise it is judged to be qualified. Alternatively, a predetermined value may be set for the character count alone: if the character count is less than this value, the text content published by the user is judged unqualified directly.

Again for the same purpose, in one embodiment, when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, the method may further include the step of detecting a third similarity between the text content published by the user and a pre-established database of unpublishable words (a collection of special words and short phrases, content recently required to be blocked, or other configured entries), and judging from the third similarity, the character count, the first similarity and the second similarity whether the text content is qualified. For this judgment, weights can be assigned to the third similarity, the character count, the first similarity and the second similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unqualified; otherwise it is judged to be qualified. Alternatively, the third similarity alone may be compared with a predetermined value, and the text content judged unqualified if the third similarity is greater.
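The combined judgment over all four signals might look like the following Python sketch. The text fixes none of the details, so these are assumptions: the third similarity is taken as the share of words found in a hypothetical banned-word set, the character count is turned into a "shortness" score so that shorter posts score higher (in line with short posts being treated as suspect), and the weights and limit are illustrative.

```python
from dataclasses import dataclass

BANNED_WORDS = {"forbiddenword", "bannedphrase"}  # hypothetical database entries


def third_similarity(content_words) -> float:
    words = list(content_words)
    if not words:
        return 0.0
    # Assumption: similarity = share of words found in the banned-word set.
    return sum(1 for w in words if w in BANNED_WORDS) / len(words)


def shortness_score(char_count: int, min_chars: int = 10) -> float:
    # 1.0 for an empty post, falling to 0.0 at min_chars characters or more.
    return max(0.0, 1.0 - char_count / min_chars)


@dataclass(frozen=True)
class Weights:
    first: float = 0.4
    second: float = 0.2
    third: float = 0.3
    length: float = 0.1


def combined_is_unqualified(first: float, second: float, third: float,
                            char_count: int, weights: Weights = Weights(),
                            limit: float = 0.5) -> bool:
    score = (weights.first * first
             + weights.second * second
             + weights.third * third
             + weights.length * shortness_score(char_count))
    return score > limit
```

The standalone checks mentioned in the text (character count below a minimum, or third similarity above a limit) could be placed in front of the weighted sum as early returns.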

The invention also discloses a device for reviewing text content published by a user. As shown in Fig. 2, it comprises a review module, a conversion module, a calculation module and a judgment module connected in sequence.

The review module is configured to receive the text content published by the user and to judge the user information against the list rule database, which comprises a blacklist, black rules, a white list and white rules. In one embodiment, the blacklist may be a list of users with a high probability of providing spam, and the white list a list of users with a high probability of providing legitimate information. Black rules are set according to the user's level or credit rating and indicate that the level or credit rating is low; white rules are likewise set according to the user's level or credit rating and indicate that the level or credit rating is high.

The conversion module is configured to perform format conversion on the text content published by the user and to extract the content words from it when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule. In one embodiment, format conversion may include converting Traditional Chinese characters to Simplified Chinese, converting full-width characters to half-width, removing redundant spaces, and so on. Content words are the core words of the text content; function words are not treated as core words.

The calculation module is configured to calculate the inverse document frequency (IDF) weight of each extracted content word in the pre-established document database, to obtain a first feature vector composed of these IDF weights, and to calculate the first similarity between the first feature vector and the second feature vector of the pre-established spam sample content. In one embodiment, the document database may consist of the text content published by all users. The IDF weight of each content word may be calculated according to the formula

wgt = t_{f} \times \lg \frac{U}{V}

where wgt is the IDF weight, t_{f} is the frequency with which the content word occurs in the user's text content, U is the total number of documents in the document database, and V is the number of documents in which the content word occurs. The second feature vector of the spam sample content can be obtained in advance by the same process as the first: take a spam sample, perform format conversion on it, extract its content words, calculate the IDF weight of each content word in the document database, and form the second feature vector from these weights. In one embodiment, the first similarity between the first feature vector and the second feature vector of the pre-established spam sample content is calculated according to the formula

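\cos(X, Y) = \frac{\sum_{\alpha = 1, \beta = 1}^{\alpha = m, \beta = n} x_{\alpha} y_{\beta}}{\sqrt{\sum_{\alpha = 1}^{m} x_{\alpha}^{2} \sum_{\beta = 1}^{n} y_{\beta}^{2}}}

where \cos(X, Y) denotes the first similarity, and

X = \{x_{1}, \ldots, x_{m}\}, Y = \{y_{1}, \ldots, y_{n}\}

denote the first feature vector and the second feature vector respectively.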

The judgment module is configured to judge from the first similarity whether the user's text content is qualified content and, if it is, to publish the text content published by the user. In one embodiment, if the user's text content is judged to be unqualified, the judgment module sends the text content published by the user for manual review.

In one embodiment, the review module sends the text content published by the user for manual review when the user information belongs to the blacklist or a black rule, and publishes the text content published by the user when the user information belongs to the white list or a white rule.
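The chaining of the four modules can be sketched in Python as plain callables wired into one object. The class name `ReviewDevice`, the string labels and the callable signatures are hypothetical; the sketch only mirrors the order of the modules described above and leaves their internals to the caller.

```python
from typing import Callable, List


class ReviewDevice:
    """Review, conversion, calculation and judgment modules in sequence."""

    def __init__(self,
                 review_module: Callable[[str], str],
                 conversion_module: Callable[[str], List[str]],
                 calculation_module: Callable[[List[str]], float],
                 judgment_module: Callable[[float], bool]):
        self.review_module = review_module            # user id -> list status
        self.conversion_module = conversion_module    # text -> content words
        self.calculation_module = calculation_module  # words -> first similarity
        self.judgment_module = judgment_module        # similarity -> qualified?

    def handle(self, user_id: str, text: str) -> str:
        status = self.review_module(user_id)
        if status == "blacklisted":
            return "manual_review"
        if status == "whitelisted":
            return "publish"
        words = self.conversion_module(text)
        first_similarity = self.calculation_module(words)
        if self.judgment_module(first_similarity):
            return "publish"
        return "manual_review"
```

Keeping each module behind a callable makes it straightforward to add further signals to the judgment step without changing the wiring.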

To judge more comprehensively and accurately whether the content published by the user is qualified, and to reduce the probability of misjudgment, as shown in Fig. 3, a detection module is also connected between the review module and the judgment module. When the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, the detection module detects a second similarity between the text content published by the user and a pre-established feature library that includes phone number formats, web page formats and Martian-script formats, and/or detects a third similarity between the user's text content and a pre-established database of unpublishable words, and sends the second similarity and/or the third similarity to the judgment module. The judgment module judges from the first similarity, the second similarity and/or the third similarity whether the text content published by the user is qualified. For this judgment, weights can be assigned to the first similarity, the second similarity and/or the third similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unqualified; otherwise it is judged to be qualified.

For the same purpose, as shown in Fig. 4, a statistics module is also connected between the review module and the judgment module. When the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, the statistics module counts the number of characters in the text content published by the user and sends the character count to the judgment module. The judgment module judges from the character count, the first similarity, the second similarity and/or the third similarity whether the text content published by the user is qualified. For this judgment, weights can be assigned to the character count, the first similarity, the second similarity and/or the third similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unqualified; otherwise it is judged to be qualified.

In summary, the method and device of the invention can review and filter both the user information and the text content published by the user. Text content published by users covered by the blacklist or a black rule, and text content judged unqualified, is sent for manual review; text content published by users covered by the white list or a white rule, and text content judged qualified, is published directly. In this way not all user-published information has to be reviewed manually, which saves a large amount of manual review time, saves human resources and correspondingly improves review efficiency.

The embodiments of the invention described above do not limit the scope of protection of the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.

Claims

1. A method for reviewing text content published by a user, characterized by comprising the steps of:

if the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, performing format conversion on the text content published by the user and extracting the content words from the text content;

2. The method for reviewing text content published by a user according to claim 1, characterized in that: when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, the method further comprises the step of detecting a second similarity between the text content published by the user and a pre-established feature library that includes phone number formats, web page formats and Martian-script formats, and judging from the second similarity and the first similarity whether the text content published by the user is qualified content.

3. The method for reviewing text content published by a user according to claim 2, characterized in that: when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, the method further comprises the step of counting the number of characters in the text content published by the user, and judging from the character count, the first similarity and the second similarity whether the text content published by the user is qualified content.

4. The method for reviewing text content published by a user according to claim 3, characterized in that: when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, the method further comprises the step of detecting a third similarity between the text content published by the user and a pre-established database of unpublishable words, and judging from the third similarity, the character count, the first similarity and the second similarity whether the text content published by the user is qualified content.

5. The method for reviewing text content published by a user according to any one of claims 1 to 4, characterized in that calculating the inverse document frequency weight of each extracted content word in the pre-established document database is specifically performed according to the formula

wgt = t_{f} \times \lg \frac{U}{V}

where wgt is the inverse document frequency weight, t_{f} is the frequency with which the content word occurs in the user's text content, U is the total number of documents in the document database, and V is the number of documents in which the content word occurs.

6. The method for reviewing user information and text content according to claim 5, characterized in that calculating the first similarity between the first feature vector and the second feature vector of the pre-established spam sample content is specifically performed according to the formula

\cos(X, Y) = \frac{\sum_{\alpha = 1, \beta = 1}^{\alpha = m, \beta = n} x_{\alpha} y_{\beta}}{\sqrt{\sum_{\alpha = 1}^{m} x_{\alpha}^{2} \sum_{\beta = 1}^{n} y_{\beta}^{2}}}

where \cos(X, Y) denotes the first similarity, and

X = \{x_{1}, \ldots, x_{m}\}, Y = \{y_{1}, \ldots, y_{n}\}

denote the first feature vector and the second feature vector respectively.

7. The method for reviewing user information and text content according to claim 4, characterized in that judging from the third similarity, the character count, the first similarity and the second similarity whether the text content published by the user is qualified content specifically comprises: assigning corresponding weights to the third similarity, the character count, the first similarity and the second similarity, and detecting whether the weighted sum is greater than a predetermined value; if so, the text content published by the user is judged to be unqualified content; otherwise the text content published by the user is qualified content.

8. A device for reviewing text content published by a user, characterized by comprising:

a review module, configured to receive the text content published by the user and to judge the user information against a list rule database, the list rule database comprising a blacklist, black rules, a white list and white rules;

a judgment module, configured to judge from the first similarity whether the text content published by the user is qualified content and, if it is, to publish the text content published by the user.

9. The device for reviewing text content published by a user according to claim 8, characterized by further comprising a detection module configured, when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, to detect a second similarity between the text content published by the user and a pre-established feature library that includes phone number formats, web page formats and Martian-script formats, and/or to detect a third similarity between the user's text content and a pre-established database of unpublishable words, and to send the second similarity and/or the third similarity to the judgment module; the judgment module judges from the first similarity, the second similarity and/or the third similarity whether the text content published by the user is qualified content.

10. The device for reviewing user information and text content according to claim 9, characterized by further comprising a statistics module configured, when the user information belongs neither to the white list or a white rule nor to the blacklist or a black rule, to count the number of characters in the text content and to send the character count to the judgment module; the judgment module judges from the character count, the second similarity, the third similarity and the first similarity whether the text content published by the user is qualified content.