Summary of the invention
The invention provides a method and a device for reviewing text content published by users, which save a large amount of manual review time and improve review efficiency.
The technical solution of the invention is a method for reviewing text content published by a user, comprising the steps of:
receiving the text content published by a user, and checking the user information against a list-rule database, the list-rule database comprising a blacklist, a black rule, a whitelist and a white rule;
if the user information belongs neither to the whitelist or white rule nor to the blacklist or black rule, performing format conversion on the text content published by the user and extracting the content words from the text content;
calculating the inverse document frequency (IDF) weight of each extracted content word in a pre-established document database, to obtain a first feature vector composed of the IDF weights;
calculating a first similarity between the first feature vector and a second feature vector of pre-established spam sample content, and judging according to the first similarity whether the text content published by the user is qualified content; if it is qualified content, publishing the text content published by the user.
The invention also discloses a device for reviewing text content published by a user, comprising: a review module, configured to receive the text content published by a user and to check the user information against a list-rule database, the list-rule database comprising a blacklist, a black rule, a whitelist and a white rule;
a conversion module, configured to perform format conversion on the text content published by the user and to extract the content words from the text content when the user information belongs neither to the whitelist or white rule nor to the blacklist or black rule;
a calculation module, configured to calculate the inverse document frequency (IDF) weight of each extracted content word in a pre-established document database, to obtain a first feature vector composed of the IDF weights, and to calculate a first similarity between the first feature vector and a second feature vector of pre-established spam sample content;
a judgment module, configured to judge according to the first similarity whether the user's text content is qualified content and, if it is qualified content, to publish the text content published by the user.
The method and device of the invention apply review filtering only to text content published by users who belong neither to the whitelist or white rule nor to the blacklist or black rule. Text content published by users who belong to the black rule or blacklist, and text content judged unqualified, can be sent to manual review; text content published by users who belong to the white rule or whitelist, and text content judged qualified, is published directly. Because not every piece of user-published information has to pass through manual review, a large amount of manual review time and human resources is saved, and review efficiency improves accordingly.
Embodiment
The invention is described in detail below with reference to the drawings and specific embodiments.
The method of the invention for reviewing text content published by users can be applied to question-and-answer services such as the Wenwen community, Baidu Zhidao (Baidu Knows) and Sina iAsk.
As shown in Fig. 1, the method of the invention for reviewing text content published by users comprises the following steps.
S100: receive the text content published by a user. S101: check the user information against the list-rule database, which comprises a blacklist, a black rule, a whitelist and a white rule. In one embodiment, the blacklist is a list of users with a high probability of providing spam, and the whitelist is a list of users with a high probability of providing legitimate information. The black rule is set according to the user's level or credit rating and indicates that the user's level or credit rating is low; the white rule is likewise set according to the user's level or credit rating and indicates that the user's level or credit rating is high.
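The check in step S101 can be sketched as follows. This is only an illustration under stated assumptions: the names `User`, `ListStatus` and `classify_user`, the sample list entries, the level cut-offs, and the choice to test black entries before white ones are all hypothetical and are not taken from the invention.

```python
from dataclasses import dataclass
from enum import Enum


class ListStatus(Enum):
    BLACK = "black"        # blacklist or black rule
    WHITE = "white"        # whitelist or white rule
    UNLISTED = "unlisted"  # neither: the automatic content filter applies


@dataclass
class User:
    user_id: str
    level: int  # hypothetical user level; higher means more trusted


# Hypothetical contents of the list-rule database.
BLACKLIST = {"spammer01"}
WHITELIST = {"expert42"}
BLACK_RULE_MAX_LEVEL = 1  # assumed: level <= 1 matches the black rule
WHITE_RULE_MIN_LEVEL = 8  # assumed: level >= 8 matches the white rule


def classify_user(user: User) -> ListStatus:
    """Black entries are tested before white ones (an assumed precedence)."""
    if user.user_id in BLACKLIST or user.level <= BLACK_RULE_MAX_LEVEL:
        return ListStatus.BLACK
    if user.user_id in WHITELIST or user.level >= WHITE_RULE_MIN_LEVEL:
        return ListStatus.WHITE
    return ListStatus.UNLISTED
```

Only users classified as `UNLISTED` would proceed to format conversion and similarity scoring.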
S102: if the user information belongs neither to the whitelist or white rule nor to the blacklist or black rule, perform format conversion on the text content published by the user and extract the content words from the text content. In one embodiment, format conversion may include converting traditional Chinese characters to simplified characters, converting full-width characters to half-width characters, and removing redundant spaces. Content words are the core words of the text content; function words are not treated as core words.
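A minimal sketch of the format conversion and content-word extraction in step S102 is given here, under several assumptions. Full-width to half-width folding is done with Unicode NFKC normalization, which also folds some other compatibility characters. Traditional-to-simplified conversion is omitted because it needs a character mapping table. The tokenizer and the `FUNCTION_WORDS` list are hypothetical stand-ins for a real word segmenter and lexicon.

```python
import re
import unicodedata

# Hypothetical function-word list; a real system would use a fuller lexicon.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in"}


def normalize(text: str) -> str:
    """Fold full-width characters to half-width and collapse extra whitespace."""
    # NFKC maps full-width letters, digits and the ideographic space to ASCII.
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_content_words(text: str) -> list[str]:
    """Return lower-cased tokens that are not function words."""
    tokens = re.findall(r"\w+", normalize(text).lower())
    return [t for t in tokens if t not in FUNCTION_WORDS]
```

For Chinese text, the regular-expression tokenizer would have to be replaced by a proper word segmenter, since words are not separated by spaces.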
S103: calculate the inverse document frequency (IDF) weight of each extracted content word in the pre-established document database, obtaining a first feature vector composed of these IDF weights. In one embodiment, the document database may consist of the text content published by all users. Specifically, the IDF weight of each content word may be calculated according to a formula (not reproduced in this text) in which $wgt$ is the IDF weight, $t_f$ is the frequency with which the content word occurs in the user's text content, $U$ is the total number of documents in the document database, and $V$ is the number of documents in which the content word occurs.
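The weighting formula itself does not survive in this text, so the sketch below assumes the common form wgt = t_f × log(U / V), which uses exactly the three quantities defined above. The actual formula of the invention may differ, for example in smoothing or in the logarithm base. The function name `idf_weights` and its parameters are hypothetical.

```python
import math
from collections import Counter


def idf_weights(content_words: list[str], doc_count: int,
                doc_freq: dict[str, int]) -> dict[str, float]:
    """Weight each content word by t_f * log(U / V) (assumed form).

    doc_count is U, the total number of documents in the database.
    doc_freq[word] is V, the number of documents containing the word.
    Words absent from the database are skipped, since V would be zero.
    """
    term_freq = Counter(content_words)  # t_f for each word in this text
    weights = {}
    for word, tf in term_freq.items():
        v = doc_freq.get(word, 0)
        if v > 0:
            weights[word] = tf * math.log(doc_count / v)
    return weights
```

The resulting dictionary plays the role of the first feature vector, with one weight per distinct content word.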
S104: calculate the first similarity between the first feature vector and the second feature vector of the pre-established spam sample content. The second feature vector can be obtained in advance by the same process as the first: take a spam sample, perform format conversion on it, extract its content words, calculate the IDF weight of each content word in the document database, and form the second feature vector from these weights. In one embodiment, the first similarity is calculated according to the formula for $\mathrm{Cos}(X, Y)$, where $\mathrm{Cos}(X, Y)$ denotes the first similarity, and $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ denote the first feature vector and the second feature vector respectively.
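As an illustration, the sketch below assumes that Cos(X, Y) is the standard cosine similarity, that is, the dot product of the two vectors divided by the product of their lengths. Because the two vectors may have different numbers of entries (m and n), the sketch keys each vector by content word and treats a missing word as a zero weight. That representation and the function name `cosine_similarity` are assumptions.

```python
import math


def cosine_similarity(x: dict[str, float], y: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors keyed by content word."""
    dot = sum(weight * y[word] for word, weight in x.items() if word in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for an empty or all-zero vector
    return dot / (norm_x * norm_y)
```

With non-negative weights the result lies between 0 and 1, and a larger value means the text is closer to the spam sample.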
S105: judge according to the first similarity whether the text content published by the user is qualified content. The judgment can be made in many ways and can be configured as needed. In one embodiment, a predetermined threshold is set: if the first similarity is greater than the threshold, the text content published by the user is judged to be unqualified content; otherwise it is judged to be qualified content.
If the content is qualified, step S107 is performed to publish the text content published by the user; otherwise, in one embodiment, step S106 is performed to send the text content to manual review.
In one embodiment, the following steps may also follow step S101. S102: if the user information belongs to the blacklist or black rule, send the text content published by the user to manual review. S103: if the user information belongs to the whitelist or white rule, publish the text content published by the user.
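The routing described so far can be summarized in a short sketch. The function name `route`, the string labels and the default threshold of 0.8 are hypothetical; the text only requires that a predetermined threshold exists.

```python
def route(list_status: str, first_similarity: float, threshold: float = 0.8) -> str:
    """Return "publish" or "manual_review" for one submission.

    list_status is "black", "white" or "unlisted"; threshold is a hypothetical value.
    """
    if list_status == "black":
        return "manual_review"
    if list_status == "white":
        return "publish"
    # Unlisted user: high similarity to spam samples means unqualified content.
    if first_similarity > threshold:
        return "manual_review"
    return "publish"
```

Only submissions from unlisted users reach the similarity comparison, which is where the saving in manual review effort comes from.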
To judge more comprehensively and accurately whether the content published by the user is qualified content, and to reduce the probability of misjudgment, in one embodiment, when the user information is judged to belong neither to the whitelist or white rule nor to the blacklist or black rule, the method may further include the step of detecting a second similarity between the text content published by the user and a pre-established feature library comprising phone number formats, web address formats, Martian-script formats and the like, and judging according to the second similarity and the first similarity whether the text content is qualified content. For this judgment, weights may be assigned to the first similarity and the second similarity respectively, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content published by the user is judged to be unqualified content; otherwise it is qualified content. Alternatively, the second similarity alone may be compared with a predetermined value, and if it is greater, the text content published by the user is directly judged to be unqualified content.
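The text does not say how the second similarity is computed. One possible reading, offered only as a sketch, scores the fraction of feature formats that occur in the text. The patterns below are assumptions: an 11-digit number starting with 1 for a mobile phone number, and a simple prefix match for a web address. Martian-script detection is left out because it needs a character table.

```python
import re

# Hypothetical feature library: one regular expression per feature format.
FEATURE_PATTERNS = {
    "phone_number": re.compile(r"(?<!\d)1\d{10}(?!\d)"),  # assumed 11-digit mobile number
    "web_address": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
}


def second_similarity(text: str) -> float:
    """Fraction of feature formats found in the text, between 0.0 and 1.0."""
    hits = sum(1 for pattern in FEATURE_PATTERNS.values() if pattern.search(text))
    return hits / len(FEATURE_PATTERNS)
```

A score defined this way can be compared with a predetermined value on its own or combined with the first similarity in a weighted sum.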
For the same purpose of judging more comprehensively and accurately whether the content published by the user is qualified content and reducing the probability of misjudgment, in one embodiment, when the user information is judged to belong neither to the whitelist or white rule nor to the blacklist or black rule, the method may further include the step of counting the number of characters in the text content published by the user, and judging according to the character count, the first similarity and the second similarity whether the text content is qualified content. For this judgment, weights may be assigned to the character count, the first similarity and the second similarity respectively, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content published by the user is judged to be unqualified content; otherwise it is qualified content. Alternatively, a predetermined value may be set for the character count alone, and if the character count is less than this value, the text content published by the user is directly judged to be unqualified content.
For the same purpose of judging more comprehensively and accurately whether the content published by the user is qualified content and reducing the probability of misjudgment, in one embodiment, when the user information is judged to belong neither to the whitelist or white rule nor to the blacklist or black rule, the method may further include the step of detecting a third similarity between the text content published by the user and a pre-established database of unpublishable words (a collection of certain special words and phrases, content recently required to be blocked, or other configured entries), and judging according to the third similarity, the character count, the first similarity and the second similarity whether the text content is qualified content. For this judgment, weights may be assigned to the third similarity, the character count, the first similarity and the second similarity respectively, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content published by the user is judged to be unqualified content; otherwise it is qualified content. Alternatively, the third similarity alone may be compared with a predetermined value, and if it is greater, the text content published by the user is judged to be unqualified content.
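The combined judgment can be sketched as a weighted sum with two standalone checks. Every number below is hypothetical, as are the names `Signals` and `is_unqualified`. The text does not say how the character count enters the sum, so the sketch assumes it is first mapped to a "shortness" score between 0 and 1, with shorter texts scoring higher.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    first_similarity: float   # similarity to spam samples
    second_similarity: float  # similarity to the feature library
    third_similarity: float   # similarity to the unpublishable-word database
    char_count: int           # number of characters in the text


# Hypothetical weights and limits; the text leaves all of them configurable.
WEIGHTS = {"first": 0.4, "second": 0.2, "third": 0.3, "length": 0.1}
SCORE_LIMIT = 0.5
MIN_CHARS = 5
THIRD_LIMIT = 0.9


def is_unqualified(s: Signals) -> bool:
    """Apply the standalone checks, then the weighted sum."""
    if s.char_count < MIN_CHARS or s.third_similarity > THIRD_LIMIT:
        return True
    # Assumed mapping from character count to a 0..1 "shortness" score.
    shortness = max(0.0, 1.0 - s.char_count / 100)
    score = (WEIGHTS["first"] * s.first_similarity
             + WEIGHTS["second"] * s.second_similarity
             + WEIGHTS["third"] * s.third_similarity
             + WEIGHTS["length"] * shortness)
    return score > SCORE_LIMIT
```

In practice the weights and limits would be tuned on labeled samples of qualified and unqualified content.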
The invention also discloses a device for reviewing text content published by users. As shown in Fig. 2, it comprises a review module, a conversion module, a calculation module and a judgment module connected in sequence.
The review module is configured to receive the text content published by a user and to check the user information against the list-rule database, which comprises a blacklist, a black rule, a whitelist and a white rule. In one embodiment, the blacklist is a list of users with a high probability of providing spam, and the whitelist is a list of users with a high probability of providing legitimate information. The black rule is set according to the user's level or credit rating and indicates that the user's level or credit rating is low; the white rule is likewise set according to the user's level or credit rating and indicates that the user's level or credit rating is high.
The conversion module is configured to perform format conversion on the text content published by the user and to extract the content words from the text content when the user information belongs neither to the whitelist or white rule nor to the blacklist or black rule. In one embodiment, format conversion may include converting traditional Chinese characters to simplified characters, converting full-width characters to half-width characters, and removing redundant spaces. Content words are the core words of the text content; function words are not treated as core words.
The calculation module is configured to calculate the inverse document frequency (IDF) weight of each extracted content word in the pre-established document database, obtaining a first feature vector composed of these IDF weights, and to calculate the first similarity between the first feature vector and the second feature vector of the pre-established spam sample content. In one embodiment, the document database may consist of the text content published by all users. Specifically, the IDF weight of each content word may be calculated according to a formula (not reproduced in this text) in which $wgt$ is the IDF weight, $t_f$ is the frequency with which the content word occurs in the user's text content, $U$ is the total number of documents in the document database, and $V$ is the number of documents in which the content word occurs. The second feature vector of the spam sample content can be obtained in advance by the same process as the first: take a spam sample, perform format conversion on it, extract its content words, calculate the IDF weight of each content word in the document database, and form the second feature vector from these weights. In one embodiment, the first similarity is calculated according to the formula for $\mathrm{Cos}(X, Y)$, where $\mathrm{Cos}(X, Y)$ denotes the first similarity, and $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ denote the first feature vector and the second feature vector respectively.
The judgment module is configured to judge according to the first similarity whether the user's text content is qualified content and, if it is qualified content, to publish the text content published by the user. In one embodiment, if the user's text content is judged to be unqualified content, the judgment module sends the text content published by the user to manual review.
In one embodiment, when the user information belongs to the blacklist or black rule, the review module sends the text content published by the user to manual review; when the user information belongs to the whitelist or white rule, the review module publishes the text content published by the user.
To judge more comprehensively and accurately whether the content published by the user is qualified content, and to reduce the probability of misjudgment, as shown in Fig. 3, a detection module is also connected between the review module and the judgment module. When the user information belongs neither to the whitelist or white rule nor to the blacklist or black rule, the detection module detects a second similarity between the text content published by the user and a pre-established feature library comprising phone number formats, web address formats and Martian-script formats, and/or detects a third similarity between the user's text content and a pre-established database of unpublishable words, and sends the second similarity and/or the third similarity to the judgment module. The judgment module judges according to the first similarity, the second similarity and/or the third similarity whether the text content published by the user is qualified content. For this judgment, weights may be assigned to the first similarity, the second similarity and/or the third similarity respectively, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content published by the user is judged to be unqualified content; otherwise it is qualified content.
For the same purpose of judging more comprehensively and accurately whether the content published by the user is qualified content and reducing the probability of misjudgment, as shown in Fig. 4, a statistics module is also connected between the review module and the judgment module. When the user information belongs neither to the whitelist or white rule nor to the blacklist or black rule, the statistics module counts the number of characters in the text content published by the user and sends the character count to the judgment module. The judgment module judges according to the character count, the first similarity, the second similarity and/or the third similarity whether the text content published by the user is qualified content. For this judgment, weights may be assigned to the character count, the first similarity, the second similarity and/or the third similarity respectively, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content published by the user is judged to be unqualified content; otherwise it is qualified content.
In summary, the method and device of the invention for reviewing text content published by users apply review filtering to the user information and to the text content published by the user. Text content published by users who belong to the black rule or blacklist, and text content judged unqualified, is sent to manual review; text content published by users who belong to the white rule or whitelist, and text content judged qualified, is published directly. Because not every piece of user-published information has to pass through manual review, a large amount of manual review time and human resources is saved, and review efficiency improves accordingly.
The embodiments of the invention described above do not limit its scope of protection. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.