CN103605693A - Device and method used for identifying advertisement features of issued message in online game - Google Patents

Info

Publication number
CN103605693A
Authority
CN
China
Prior art keywords
text
out information
characteristic
feature
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310537964.5A
Other languages
Chinese (zh)
Inventor
孙林
陈培军
秦吉胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201310537964.5A
Publication of CN103605693A
Priority to PCT/CN2014/087175 (WO2015062377A1)
Priority to US15/034,307 (US20160283582A1)
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577: Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a device and a method for identifying advertisement features of issued messages in an online game. The method includes: detecting an issued message event of a game client; obtaining the issued message text according to the issued message event; extracting one or more feature vectors contained in the issued message text; identifying, according to the feature vectors, whether the issued message text to be tested matches one or more records in an advertisement feature database; and performing shielding processing on the issued message event when such a match is identified. With the device and the method, feature vectors can be extracted from the issued message text and used to identify whether the issued message text to be tested matches one or more records in the advertisement feature database, so that the advertisement features of issued messages in the online game can be identified accurately.

Description

Device and method for identifying advertisement features of issued messages in online games
Technical field
The present invention relates to the field of computer networks, and in particular to a device and a method for identifying advertisement features of issued messages in online games.
Background
With the rise of applications such as online games, a large number of online game products and online game users have emerged. While playing, users can communicate by issuing messages. Among the large number of issued messages, however, there are advertisements, which inconvenience users and lower the quality of the online game. To address this problem, research on identifying advertisement features of issued messages in online games is gradually being carried out, with the aim of finding spam that has advertisement features among the issued messages.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a similar-text detection device and a corresponding similar-text detection method that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a device for identifying advertisement features of issued messages in online games is provided, comprising: a detection unit, adapted to detect an issued message event of a game client; a text acquisition unit, adapted to obtain the issued message text according to the issued message event; a feature vector extraction unit, adapted to extract one or more feature vectors contained in the issued message text; a recognition unit, adapted to identify, according to the feature vectors, whether the issued message text to be tested matches one or more records in an advertisement feature database; and a shielding unit, adapted to perform shielding processing on the issued message event when the recognition unit identifies such a match.
Optionally, the detection unit is adapted to detect, before the text acquisition unit obtains the issued message text according to the issued message event, whether the type of the message event is a broadcast message event or a multicast message event; if not, the flow exits; if so, the text acquisition unit obtains the issued message text according to the issued message event.
Optionally, the shielding unit is located in the game server or in the game client that executes the issued message event.
Optionally, the recognition unit is adapted to detect, for each feature in the feature vector, whether the feature occurs repeatedly in the advertisement feature database; the recognition unit is further adapted to judge whether the ratio of the features that occur repeatedly in the advertisement feature database to all features of the feature vector reaches a first threshold; if so, the issued message text to be tested is determined to match a record in the advertisement feature database; otherwise it does not match.
Optionally, the recognition unit is adapted to look up, for each feature in the feature vector, whether the feature exists in the advertisement feature database; if it exists, the weight of the feature is further checked, and if the weight is greater than or equal to a second threshold, the feature is deemed to occur repeatedly in the advertisement feature database.
Optionally, the device further comprises an advertisement feature database update unit, adapted to, when the issued message text to be tested is determined to match a record in the advertisement feature database, increase by 1 the weight in the advertisement feature database of each feature of the feature vector that is found to exist in the database.
Optionally, the recognition unit is adapted to judge, before detecting for each feature in the feature vector whether the feature occurs in the advertisement feature database, whether the number of features in the feature vector is less than a third threshold; if so, the issued message text to be tested does not match the records in the advertisement feature database and the judging operation ends; otherwise, for each feature in the feature vector, it is detected whether the feature occurs repeatedly in the advertisement feature database.
Optionally, the feature vector extraction unit comprises: a Chinese text acquisition subunit, adapted to perform text processing on the issued message text to obtain Chinese text; a pinyin text acquisition subunit, adapted to convert the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text; and a fingerprint acquisition subunit, adapted to extract features of the pinyin text and to generate the feature vector of the pinyin text from the extracted features.
Optionally, the Chinese text acquisition subunit is adapted to perform a data cleansing operation on the issued message text so as to convert the content of the text into regular characters; to convert pinyin into Chinese characters; and to retain commonly used Chinese characters.
Optionally, the Chinese text acquisition subunit is adapted to identify and discard HTML tags, convert traditional Chinese characters into simplified Chinese characters, convert full-width characters into half-width characters, convert uppercase English letters into lowercase English letters, and identify and discard URLs and punctuation marks, so as to convert the content of the issued message text into regular characters; the Chinese text acquisition subunit is adapted to convert the pinyin in the text into Chinese characters using a bidirectional maximum matching algorithm, and, if a pinyin corresponds to several Chinese characters, to select any one of them; and the Chinese text acquisition subunit is adapted to filter the issued message text using the commonly used Chinese characters in the GBK code table, discarding all characters that are not commonly used Chinese characters, so as to retain the commonly used Chinese characters.
Optionally, the pinyin text acquisition subunit is adapted to use a Chinese character to pinyin lookup table to convert each Chinese character into the corresponding pinyin string, so as to obtain the pinyin text.
Optionally, the fingerprint acquisition subunit is adapted to extract the features of the pinyin text with a single Chinese character as the segmentation granularity, and to use a vector space model to generate the feature vector of the pinyin text from the extracted features.
According to another aspect of the present invention, a method for identifying advertisement features of issued messages in online games is provided, comprising: detecting an issued message event of a game client; obtaining the issued message text according to the issued message event; extracting one or more feature vectors contained in the issued message text; identifying, according to the feature vectors, whether the issued message text to be tested matches one or more records in an advertisement feature database; and performing shielding processing on the issued message event when such a match is identified.
Optionally, the method further comprises: before obtaining the issued message text according to the issued message event, detecting whether the type of the message event is a broadcast message event or a multicast message event; if not, the flow exits; if so, the issued message text is obtained according to the issued message event.
Optionally, the shielding processing on the issued message event is performed by the game server or the game client.
Optionally, identifying, according to the feature vectors, whether the issued message text to be tested matches one or more records in the advertisement feature database specifically comprises: for each feature in the feature vector, detecting whether the feature occurs repeatedly in the advertisement feature database; and judging whether the ratio of the features that occur repeatedly in the advertisement feature database to all features of the feature vector reaches a first threshold; if so, the issued message text to be tested is determined to match a record in the advertisement feature database; otherwise it does not match.
Optionally, detecting whether the feature occurs repeatedly in the advertisement feature database comprises: looking up whether the feature exists in the advertisement feature database; if it exists, further checking the weight of the feature; and, if the weight is greater than or equal to a second threshold, deeming the feature to occur repeatedly in the advertisement feature database.
Optionally, when the issued message text to be tested is determined to match a record in the advertisement feature database, the method further comprises: for each feature in the feature vector, if the feature is found to exist in the advertisement feature database, increasing the weight of the feature in the advertisement feature database by 1.
Optionally, before detecting, for each feature in the feature vector, whether the feature occurs in the advertisement feature database, judging whether the issued message text to be tested matches a record in the advertisement feature database further comprises: judging whether the number of features in the feature vector is less than a third threshold; if so, the issued message text to be tested does not match the records in the advertisement feature database and the judging operation ends; otherwise, for each feature in the feature vector, it is detected whether the feature occurs repeatedly in the advertisement feature database.
Optionally, extracting one or more feature vectors contained in the issued message text specifically comprises: performing text processing on the issued message text to be tested to obtain Chinese text; converting the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text; and extracting features of the pinyin text and generating the feature vector of the pinyin text from the extracted features.
Optionally, performing text processing on the text to obtain Chinese text specifically comprises: performing a data cleansing operation on the text so as to convert the content of the issued message text into regular characters; converting pinyin into Chinese characters; and retaining commonly used Chinese characters.
Optionally, performing the data cleansing operation on the issued message text specifically comprises: identifying and discarding HTML tags, converting traditional Chinese characters into simplified Chinese characters, converting full-width characters into half-width characters, converting uppercase English letters into lowercase English letters, and identifying and discarding URLs and punctuation marks. Converting the pinyin in the text into Chinese characters specifically comprises: using a bidirectional maximum matching algorithm to convert the pinyin in the text into Chinese characters and, if a pinyin corresponds to several Chinese characters, selecting any one of them. Retaining commonly used Chinese characters specifically comprises: filtering the issued message text using the commonly used Chinese characters in the GBK code table and discarding all characters that are not commonly used Chinese characters.
Optionally, converting the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text specifically comprises: using a Chinese character to pinyin lookup table to convert each Chinese character into the corresponding pinyin string, so as to obtain the pinyin text.
Optionally, extracting features of the pinyin text and generating the feature vector of the pinyin text from the extracted features specifically comprises: extracting the features of the pinyin text with a single Chinese character as the segmentation granularity, and using a vector space model to generate the feature vector of the pinyin text from the extracted features.
According to the device and method of the present invention for identifying advertisement features of issued messages in online games, feature vectors can be obtained from the issued message text, the feature vectors can then be used to identify whether the issued message text to be tested matches one or more records in the advertisement feature database, and shielding processing is performed on the issued message event when such a match is identified, so that advertisement features of issued messages in online games are identified effectively.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for identifying advertisement features of issued messages in online games according to an embodiment of the present invention;
Fig. 2 shows a detailed flowchart of step S300 shown in Fig. 1;
Fig. 3 shows a detailed flowchart of steps S310, S320 and S330 shown in Fig. 2;
Fig. 4 shows a detailed flowchart of step S400 shown in Fig. 1;
Fig. 5 shows a block diagram of a device for identifying advertisement features of issued messages in online games according to a first embodiment of the present invention;
Fig. 6 shows a detailed block diagram of the device for identifying advertisement features of issued messages in online games according to the first embodiment of the present invention; and
Fig. 7 shows a detailed block diagram of a device for identifying advertisement features of issued messages in online games according to a second embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be conveyed completely to those skilled in the art.
Fig. 1 shows a flowchart of a method for identifying advertisement features of issued messages in online games according to an embodiment of the present invention. The method comprises the following steps S100, S200, S300, S400 and S500.
S100: detect an issued message event of the game client.
Specifically, an issued message event can be detected when the game client issues a message. Further, the issued message event can be detected by monitoring the communication content between the game server and the game client.
S200: obtain the issued message text according to the issued message event. Those skilled in the art will readily appreciate that the issued message text can be obtained from the detected issued message event.
S300: extract one or more feature vectors contained in the issued message text. In this embodiment, the issued message text can be split into several text segments by detecting punctuation marks, yielding several feature vectors; alternatively, the issued message text can be left unsplit, yielding a single feature vector.
S400: identify, according to the feature vectors, whether the issued message text to be tested matches one or more records in the advertisement feature database.
In this embodiment, for each feature in a feature vector, it can be detected whether the feature occurs repeatedly in a preset advertisement feature database. After all features in the feature vector have been checked, the ratio of the features that occur repeatedly in the advertisement feature database to all features of the feature vector is computed, and this ratio determines whether the text to be tested matches a record in the advertisement feature database. In this embodiment, the preset advertisement feature database is a Redis database. It can be built by analysing a massive amount of web advertisement text (for example, crawled and collected spam such as web advertisements) to obtain a massive number of features, and by counting the occurrences of each feature to obtain its weight, so that features (Shingle) and weights (Value) together form the advertisement feature database.
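The matching rule described for step S400 can be sketched as follows. This is a minimal illustration only: a plain Python dict stands in for the Redis database, the function names and the threshold values `t1`, `t2` and `t3` (first, second and third thresholds) are hypothetical, and the source does not specify concrete values for any of them.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_ad_db(ad_features: Iterable[str]) -> Dict[str, int]:
    """Count how often each feature (shingle) occurs in known advertisement text.

    The count plays the role of the feature's weight. A dict is used here in
    place of the Redis database mentioned in the text (assumption).
    """
    return dict(Counter(ad_features))


def matches_ad_db(features: List[str], ad_db: Dict[str, int],
                  t1: float = 0.5, t2: int = 2, t3: int = 3) -> bool:
    """Return True if the feature vector matches the advertisement database.

    t1, t2, t3 are illustrative values, not values taken from the source.
    """
    # Third threshold: a vector with too few features never matches.
    if len(features) < t3:
        return False
    # A feature "occurs repeatedly" if it exists with weight >= second threshold.
    repeated = sum(1 for f in features if ad_db.get(f, 0) >= t2)
    # First threshold: ratio of repeated features to all features.
    return repeated / len(features) >= t1


def update_weights(features: List[str], ad_db: Dict[str, int]) -> None:
    """After a match, add 1 to the weight of every feature already in the database."""
    for f in features:
        if f in ad_db:
            ad_db[f] += 1
```

For instance, with a database built from the features `["a b", "a b", "c d", "c d", "b c"]`, the vector `["a b", "b c", "c d", "d e"]` has two of four features with weight at least 2 and therefore matches under the illustrative thresholds.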
S500: when such a match is identified, perform shielding processing on the issued message event. Preferably, the shielding processing on the issued message event is performed by the game server or the game client.
Further, before the issued message text is obtained according to the issued message event in step S200, the present invention also comprises: detecting whether the type of the message event is a broadcast message event or a multicast message event; if not, the flow exits; if so, the issued message text is obtained according to the issued message event.
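The event-type check that precedes step S200 could look like the sketch below. The `MessageEvent` class, its field names and the string values for the event type are all assumptions made for illustration; the source does not describe how events are represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MessageEvent:
    # Hypothetical event record; "kind" values are illustrative only.
    kind: str   # e.g. "broadcast", "multicast", "private"
    text: str


def text_to_check(event: MessageEvent) -> Optional[str]:
    """Return the message text only for broadcast or multicast events.

    For any other event type the flow exits, signalled here by None.
    """
    if event.kind not in ("broadcast", "multicast"):
        return None
    return event.text
```

Restricting the check to broadcast and multicast events means that messages of other types skip the later, more expensive steps entirely.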
Steps S300 and S400 of the present invention identify advertisement features of issued messages in online games by performing similar-text detection against the records in the advertisement feature database. A similar-text detection method that differs from steps S300 and S400 works as follows: first extract the features of the text (for example, segment the text into words and extract entity words) and expand the features using various techniques (for example, vocabulary expansion with knowledge bases such as a synonym thesaurus or a near-synonym dictionary); describe the text with a VSM model (for example, represent a text as a vector); then cluster the texts with a clustering method (for example, after two texts have been vectorized, compute the cosine of the angle between the two vectors to characterize their similarity, and regard the two texts as similar if the similarity exceeds a certain threshold). Texts clustered together are considered similar.
However, network applications contain a large number of variants of similar texts, for example texts that use traditional Chinese characters, substitute pinyin for characters, replace the original characters with homophones, or add large numbers of meaningless interference characters. The above technique has the following shortcomings: (1) the word segmentation results contain errors; (2) texts written with homophones of different characters cannot be judged similar; (3) two texts that have been processed by pinyin substitution cannot be identified as similar; (4) the computational complexity per text is too high (for example, representing a text as a vector requires a large amount of computation). This method therefore cannot meet the real-time computing requirements of today's large data volumes.
Fig. 2 shows a detailed flowchart of step S300 shown in Fig. 1. Step S300 comprises the following steps S310, S320 and S330.
S310: perform text processing on the issued message text to be tested to obtain Chinese text.
Obtaining Chinese text from the issued message text to be tested eliminates the effect on the recognition result of this embodiment of similar-text variants that contain, for example, meaningless interference characters or traditional Chinese characters.
S320: convert the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text.
Uniformly converting the Chinese characters in the Chinese text into pinyin eliminates the effect on the recognition result of this embodiment of similar-text variants that, for example, substitute pinyin for characters or replace the original characters with homophones.
S330: extract features of the pinyin text and generate the feature vector of the pinyin text from the extracted features.
In this embodiment, an N-gram language model can be used to extract the feature vector of the pinyin text: based on the Chinese character granularity of the Chinese text obtained in step S310, the N-gram features SHINGLE_1, SHINGLE_2, ..., SHINGLE_m are extracted from the pinyin text obtained in step S320. For example, if the Chinese text obtained in step S310 is "我爱北京天安门" ("I love Beijing Tiananmen"), the Chinese character granularity is "我", "爱", "北", "京", "天", "安", "门", and the pinyin text obtained in step S320 is "wo ai bei jing tian an men", so the pinyin string is split into "wo", "ai", "bei", "jing", "tian", "an", "men". With N=6, the N-gram features obtained in step S330 are SHINGLE_1 = "wo ai bei jing tian an", SHINGLE_2 = "ai bei jing tian an men", and so on. A vector space model (VSM) is then used to form the feature vector D = <SHINGLE_1, SHINGLE_2, ..., SHINGLE_m>.
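The N-gram extraction in step S330 amounts to sliding a window of N syllables over the pinyin text. A minimal sketch, assuming the pinyin text has already been split into one syllable per Chinese character (the function name `pinyin_shingles` is hypothetical):

```python
from typing import List


def pinyin_shingles(syllables: List[str], n: int = 6) -> List[str]:
    """Return all runs of n consecutive syllables, each joined by spaces.

    A text with fewer than n syllables yields an empty list.
    """
    return [" ".join(syllables[i:i + n])
            for i in range(len(syllables) - n + 1)]


features = pinyin_shingles("wo ai bei jing tian an men".split(), n=6)
# features == ["wo ai bei jing tian an", "ai bei jing tian an men"]
```

A text of L syllables yields L - N + 1 features, so very short messages produce few or no features.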
Fig. 3 shows a detailed flowchart of steps S310, S320 and S330 shown in Fig. 2. Step S310 specifically comprises:
S311: perform a data cleansing operation on the issued message text to be tested, so as to convert the content of the text into regular characters.
Here, the data cleansing operation on the issued message text to be tested specifically comprises: identifying and discarding HTML tags, converting traditional Chinese characters into simplified Chinese characters, converting full-width characters into half-width characters, converting uppercase English letters into lowercase English letters, and identifying and discarding URLs and punctuation marks.
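The cleansing operations of step S311 can be sketched with the Python standard library. Several points are assumptions rather than details from the source: the regular expressions for HTML tags and URLs are simplified, Unicode NFKC normalization is used as an approximation of full-width to half-width conversion (it also folds other compatibility characters), and `TRAD_TO_SIMP` is a tiny hypothetical table where a real system would need a complete traditional-to-simplified mapping.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny traditional-to-simplified table (assumption).
TRAD_TO_SIMP = {"貓": "猫", "頁": "页", "瀏": "浏", "覽": "览", "訪": "访", "問": "问"}

HTML_TAG = re.compile(r"<[^>]+>")
# Simplified URL pattern limited to ASCII URL characters (assumption).
URL = re.compile(r"(?:https?://|www\.)[a-z0-9\-._~:/?#@!$&'()*+,;=%]+",
                 re.IGNORECASE)


def clean_message(text: str) -> str:
    """Apply the cleansing operations listed for step S311."""
    text = HTML_TAG.sub("", text)                             # discard HTML tags
    text = unicodedata.normalize("NFKC", text)                # full-width -> half-width
    text = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)   # traditional -> simplified
    text = text.lower()                                       # uppercase -> lowercase
    text = URL.sub("", text)                                  # discard URLs
    # Discard punctuation: every character whose Unicode category starts with "P".
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
```

URLs are removed before punctuation so that the dots and slashes inside a URL are still available for the URL pattern to match.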
S312: convert pinyin into Chinese characters.
Here, converting the pinyin in the text processed by step S311 into Chinese characters specifically comprises: using a bidirectional maximum matching algorithm to convert the pinyin in the text into Chinese characters and, if a pinyin corresponds to several Chinese characters, selecting any one of them.
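One way to read step S312 is sketched below: each run of lowercase letters is segmented into pinyin syllables by forward and by backward maximum matching against a pinyin lexicon, the segmentation with fewer syllables is kept, and each syllable is replaced by one candidate character. The lexicon `PINYIN_LEXICON` is a tiny hypothetical sample, and the choices to prefer the forward result on a tie, to take the first candidate character, and to leave unmatched runs unchanged are assumptions; the source only states that bidirectional maximum matching is used and that any one candidate may be chosen.

```python
import re
from typing import Dict, List, Optional

# Hypothetical pinyin lexicon: syllable -> candidate characters (assumption).
PINYIN_LEXICON: Dict[str, str] = {"mao": "猫毛", "tian": "天田", "xi": "西", "an": "安"}
MAX_LEN = max(len(syllable) for syllable in PINYIN_LEXICON)


def forward_match(s: str) -> Optional[List[str]]:
    """Greedy longest-match segmentation from left to right; None if it fails."""
    out, i = [], 0
    while i < len(s):
        for j in range(min(len(s), i + MAX_LEN), i, -1):
            if s[i:j] in PINYIN_LEXICON:
                out.append(s[i:j])
                i = j
                break
        else:
            return None
    return out


def backward_match(s: str) -> Optional[List[str]]:
    """Greedy longest-match segmentation from right to left; None if it fails."""
    out, j = [], len(s)
    while j > 0:
        for i in range(max(0, j - MAX_LEN), j):
            if s[i:j] in PINYIN_LEXICON:
                out.insert(0, s[i:j])
                j = i
                break
        else:
            return None
    return out


def run_to_hanzi(run: str) -> str:
    """Convert one run of letters; runs not covered by the lexicon stay unchanged."""
    candidates = [seg for seg in (forward_match(run), backward_match(run))
                  if seg is not None]
    if not candidates:
        return run
    best = min(candidates, key=len)   # fewer syllables wins; forward wins ties
    # Any candidate character may be chosen; the first one is taken here.
    return "".join(PINYIN_LEXICON[syllable][0] for syllable in best)


def convert_pinyin(text: str) -> str:
    """Replace every run of lowercase ASCII letters that is valid pinyin."""
    return re.sub(r"[a-z]+", lambda m: run_to_hanzi(m.group(0)), text)
```

Because every homophone maps to the same pinyin again in step S320, the particular character chosen for a syllable does not affect the resulting pinyin text.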
S313: retain commonly used Chinese characters.
Here, retaining commonly used Chinese characters specifically comprises: filtering the text using the commonly used Chinese characters in the GBK code table and discarding all characters that are not commonly used Chinese characters, so that only Chinese characters whose GBK code lies in the range 0xB0A0 to 0xF7FE are retained.
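The GBK range filter of step S313 can be sketched with Python's built-in `gbk` codec. The function names are hypothetical, and the comparison below treats the two GBK bytes as a single 16-bit number, which is one literal reading of the stated range 0xB0A0 to 0xF7FE; a stricter reading could additionally constrain the second byte.

```python
def is_common_gbk_hanzi(ch: str) -> bool:
    """True if ch encodes to a two-byte GBK code within 0xB0A0..0xF7FE."""
    try:
        raw = ch.encode("gbk")
    except UnicodeEncodeError:
        return False                  # not representable in GBK at all
    if len(raw) != 2:
        return False                  # single-byte characters such as ASCII
    return 0xB0A0 <= int.from_bytes(raw, "big") <= 0xF7FE


def keep_common_hanzi(text: str) -> str:
    """Discard every character that is not a commonly used Chinese character."""
    return "".join(ch for ch in text if is_common_gbk_hanzi(ch))
```

Applied to a cleaned message, this drops remaining digits, Latin letters and other symbols, which is how interference strings are removed before pinyin conversion.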
Step S320 specifically comprises: using a Chinese character to pinyin lookup table to convert each Chinese character into the corresponding pinyin string, so as to obtain the pinyin text.
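Step S320 is a per-character table lookup. In the sketch below, `HANZI_TO_PINYIN` is a hypothetical sample covering only the characters of one example sentence, and skipping characters that are missing from the table is an assumption; the source does not say how such characters are handled.

```python
from typing import Dict, List

# Hypothetical excerpt of a Chinese character to pinyin lookup table (assumption).
HANZI_TO_PINYIN: Dict[str, str] = {
    "我": "wo", "爱": "ai", "北": "bei", "京": "jing",
    "天": "tian", "安": "an", "门": "men",
}


def to_pinyin_syllables(chinese_text: str) -> List[str]:
    """Map each Chinese character to its pinyin string, one syllable per character."""
    return [HANZI_TO_PINYIN[ch] for ch in chinese_text if ch in HANZI_TO_PINYIN]


pinyin_text = " ".join(to_pinyin_syllables("我爱北京天安门"))
# pinyin_text == "wo ai bei jing tian an men"
```

Keeping one syllable per character preserves the Chinese character granularity that the later feature extraction relies on.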
By obtaining Chinese text from the issued message text to be tested in step S310, and converting the Chinese characters in the obtained Chinese text into pinyin to obtain pinyin text in step S320, different variants of a similar text can be reduced to the same pinyin text. For example, the issued message text to be tested and its three variants shown in Table 1 yield the same pinyin text after steps S310 and S320.
Table 1: The issued message text to be tested and its three variants
[Table 1 appears as an image in the original publication (Figure BDA0000407820630000091) and is not reproduced here.]
Processing the above original text and the three variants separately with steps S310 and S320 of the present invention yields the same pinyin text: "tian mao shou ye zhan tie dao liu lan qi fang wen tian mao chao shi zhan tie dao liu lan qi fang wen". Take variant 3 as an example. After the data cleansing of step S311, the text is: "1x3f days Mao homepages paste Liu pull tfa days mao supermarkets of device access paste Liu and pull device access sdjh". After step S312 converts pinyin into Chinese characters, the result is: "1x3f days Mao homepages paste Liu pull tfa days cat supermarkets of device access paste Liu and pull device access sdjh". Here "1x3f", "tfa" and "sdjh" are not in the pinyin lexicon and are therefore left unprocessed, whereas "mao" is in the pinyin lexicon, so a randomly selected Chinese character, "猫" ("cat"), is substituted for it. After step S313 retains the commonly used Chinese characters, the result is: "a day Mao homepage paste Liu pull a device access day cat supermarket paste Liu and pull device access". Finally, the Chinese character to pinyin lookup table is used to convert each Chinese character into the corresponding pinyin, which gives the above pinyin text. The original text, variant 1 and variant 2 yield the same pinyin text.
When N=6, the feature vector obtained through step S330 is <tian mao shou ye zhan tie, mao shou ye zhan tie dao, shou ye zhan tie dao liu, ye zhan tie dao liu lan, zhan tie dao liu lan qi, tie dao liu lan qi fang, dao liu lan qi fang wen, liu lan qi fang wen tian, lan qi fang wen tian mao, qi fang wen tian mao chao, fang wen tian mao chao shi, wen tian mao chao shi zhan, tian mao chao shi zhan tie, mao chao shi zhan tie dao, chao shi zhan tie dao liu, shi zhan tie dao liu lan, zhan tie dao liu lan qi, tie dao liu lan qi fang, dao liu lan qi fang wen>.
Fig. 4 shows the detailed process flow diagram of step S400 in Fig. 1.To the proper vector of being obtained by above-mentioned steps S300, step S400 specifically comprises the following steps:
S410, judge whether the number K of features in the feature vector is less than a third threshold T3; if so, perform step S490, otherwise perform step S420. This step has at least two advantages. First, in actual online games an advertisement message is usually not very short, whereas a considerable share of in-game messages are very short texts (for example, no more than three to five Chinese characters). With this step, a text whose feature vector is small (the number of extracted features is below the preset threshold) skips the judgments of steps S420 to S470, which reduces the amount of computation of the method of this embodiment. Second, a short text yields few features; as follows from step S470 below, such a text could be misjudged as matching a record in the advertisement feature database merely because a few individual features happen to appear there. Step S410 avoids this misjudgment.
S420, select from the feature vector one feature (Shingle) that has not yet been compared with the records in the advertisement feature database.
S430, judge whether the feature selected in step S420 exists in the advertisement feature database; if so, perform step S440, otherwise perform step S460.
S440, judge whether the weight of this feature in the advertisement feature database is greater than or equal to a second threshold T2; if so, perform step S450, otherwise perform step S460.
S450, determine that this feature appears repeatedly in the advertisement feature database, and perform step S460. Because step S440 has established that the weight is greater than or equal to the second threshold T2, step S450 determines that the feature appears repeatedly in the advertisement feature database.
S460, judge whether all features in the feature vector have been compared with the records in the advertisement feature database; if so, perform step S470, otherwise return to step S420 and select a feature that has not yet been compared. In this way step S430 is performed for every feature of the feature vector.
S470, judge whether the proportion of features in the feature vector that appear repeatedly in the advertisement feature database, relative to all features of the feature vector, reaches a first threshold T1; if so, perform step S480, otherwise perform step S490. In this embodiment, this proportion indicates whether the text to be detected matches the records in the advertisement feature database. As can be seen from the above, the method of this embodiment uses only simple text transformations and simple data comparisons; the amount of computation is roughly linear in the text length, so the computational overhead is small.
S480, determine that the text to be detected matches a record in the advertisement feature database, and end the judgment.
S490, determine that the text to be detected does not match the records in the advertisement feature database, and end the judgment.
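Steps S410 to S490 can be sketched compactly as follows. This is a minimal illustration rather than the patented implementation: a plain dictionary mapping each feature to its weight stands in for the advertisement feature database, and the function name and default threshold values are assumptions.

```python
from typing import Dict, List


def matches_ad_database(features: List[str], ad_db: Dict[str, int],
                        t1: float = 0.3, t2: int = 2, t3: int = 10) -> bool:
    """Return True when the text is judged to match the database (S480)."""
    # S410: too few features, so the text is treated as not matching (S490).
    if not features or len(features) < t3:
        return False
    repeated = 0
    for shingle in features:                      # S420 / S460: visit every feature once
        weight = ad_db.get(shingle)               # S430: is the feature in the database?
        if weight is not None and weight >= t2:   # S440: weight reaches T2?
            repeated += 1                         # S450: counts as appearing repeatedly
    # S470: proportion of repeated features against the first threshold T1.
    return repeated / len(features) >= t1


# Hypothetical data: 20 features, 10 of them stored with weight 6.
features = ["f%d" % i for i in range(20)]
ad_db = {"f%d" % i: 6 for i in range(10)}
print(matches_ad_database(features, ad_db))
```

Each feature costs one lookup, so the work grows linearly with the number of features, in line with the roughly linear cost stated for the method.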
Preferably, when step S480 determines that the text to be detected matches a record in the advertisement feature database, the method of this embodiment further comprises: for each feature in the feature vector, if the feature is found in the advertisement feature database, incrementing the weight of this feature in the advertisement feature database by 1. In other words, if the text to be detected matches a record in the advertisement feature database, the Redis advertisement feature database is updated, so that the database is kept up to date while the method of the present invention is in use.
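The weight update described above might look like the following sketch, in which a dictionary again stands in for the Redis advertisement feature database and `update_weights` is a hypothetical name. In a Redis deployment the increment would plausibly map onto the `INCR` command, but that mapping is an assumption and is not stated in the source.

```python
from typing import Dict, List


def update_weights(features: List[str], ad_db: Dict[str, int]) -> None:
    """After a match, add 1 to the weight of every feature already in the database."""
    for shingle in features:
        if shingle in ad_db:
            ad_db[shingle] += 1  # features absent from the database are left out


ad_db = {"tian mao shou ye zhan tie": 6, "mao shou ye zhan tie dao": 1}
update_weights(["tian mao shou ye zhan tie", "chao shi zhan tie dao liu"], ad_db)
print(ad_db)
```

A feature that occurs several times in the feature vector is incremented once per occurrence in this sketch, which follows the "for each feature" wording literally.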
Continuing with the feature vector obtained from the published message text in Table 1 as an example: when N=6, the feature vector obtained through step S300 is <tian mao shou ye zhan tie, mao shou ye zhan tie dao, shou ye zhan tie dao liu, ye zhan tie dao liu lan, zhan tie dao liu lan qi, tie dao liu lan qi fang, dao liu lan qi fang wen, liu lan qi fang wen tian, lan qi fang wen tian mao, qi fang wen tian mao chao, fang wen tian mao chao shi, wen tian mao chao shi zhan, tian mao chao shi zhan tie, mao chao shi zhan tie dao, chao shi zhan tie dao liu, shi zhan tie dao liu lan, zhan tie dao liu lan qi, tie dao liu lan qi fang, dao liu lan qi fang wen>. First, step S410 judges whether the number of features in the feature vector, K=19, is less than the third threshold T3. Suppose T3=10; then K > T3, and step S420 selects a feature that has not yet been compared with the records in the advertisement feature database, for example "tian mao shou ye zhan tie". Step S430 judges whether this feature exists in the advertisement feature database. If not, step S460 returns to step S420 to select another feature. If so, step S440 judges whether the weight Value of this feature in the advertisement feature database is greater than or equal to the second threshold T2. Suppose Value=6 and T2=2; step S450 then determines that this feature appears repeatedly in the advertisement feature database. Preferably, the result of this step can be recorded in various ways, for example by marking the feature or by recording the feature in a table. When all 19 features have been judged (each having passed at least steps S420 and S430), step S470 judges whether the proportion of features that appear repeatedly in the advertisement feature database, relative to the 19 features, reaches the first threshold T1. Suppose 12 features appear repeatedly in the advertisement feature database, so that the proportion is 12/19, about 63%, and suppose the first threshold T1 is 30%; the text to be detected is then determined to match a record in the advertisement feature database, and the judgment ends.
Fig. 5 shows a block diagram of a device for identifying advertisement features of published messages in an online game according to a first embodiment of the present invention. The device comprises a detecting unit 100, a text acquiring unit 200, a feature vector extraction unit 300, a recognition unit 400, a shielding unit 500, and an advertisement feature database 600.
The detecting unit 100 is adapted to detect a message-publishing event of a game client.
Specifically, when the game client publishes a message, the detecting unit 100 can detect the message-publishing event. Further, the detecting unit 100 can detect the message-publishing event by inspecting the communication content between the game server and the game client.
Further, the detecting unit 100 is adapted to detect, before the text acquiring unit 200 obtains the published message text according to the message-publishing event, whether the type of the message-publishing event is a broadcast message event or a multicast message event; if not, the flow exits; if so, the text acquiring unit 200 obtains the published message text according to the message-publishing event.
The text acquiring unit 200 is adapted to obtain the published message text according to the message-publishing event. Those skilled in the art will readily appreciate that the text acquiring unit 200 can obtain the published message text from the detected message-publishing event.
The feature vector extraction unit 300 is adapted to extract one or more feature vectors contained in the published message text.
The recognition unit 400 is adapted to identify, according to the feature vector, whether the published message text to be detected matches one or more records in the advertisement feature database 600.
The advertisement feature database 600 in this embodiment is a Redis advertisement feature database. It can be built by analyzing a massive amount of network text (for example, crawled and collected spam such as web advertisements) to obtain a large number of features, and by counting the occurrences of each feature to obtain its weight, so that the features (Shingle) and weights (Value) together form the advertisement feature database.
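Building such a database from a corpus could be sketched as follows. The sketch assumes the corpus texts have already been converted to space-separated pinyin, uses an in-memory dictionary in place of Redis, and introduces the hypothetical name `build_ad_database`.

```python
from collections import Counter
from typing import Dict, Iterable


def build_ad_database(corpus_pinyin_texts: Iterable[str], n: int = 6) -> Dict[str, int]:
    """Count how often each n-syllable shingle occurs across the corpus."""
    weights = Counter()
    for text in corpus_pinyin_texts:
        syllables = text.split()
        for i in range(len(syllables) - n + 1):
            weights[" ".join(syllables[i:i + n])] += 1
    return dict(weights)  # feature (Shingle) -> weight (Value)


# Tiny hypothetical corpus, with n=2 to keep the output readable.
corpus = ["tian mao shou ye", "tian mao chao shi"]
print(build_ad_database(corpus, n=2))
```

A feature seen in many advertisement texts ends up with a high weight, which is what the comparison against the second threshold T2 relies on.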
Specifically, the recognition unit 400 is adapted to detect, for each feature in the feature vector, whether the feature appears repeatedly in the advertisement feature database 600. More specifically, for each feature in the feature vector, the recognition unit 400 looks up whether the feature exists in the advertisement feature database 600; if it exists, the unit further checks the weight of the feature, and if the weight is greater than or equal to the preset second threshold T2, it determines that the feature appears repeatedly in the advertisement feature database 600.
The recognition unit 400 is further adapted to judge whether the proportion of features in the feature vector that appear repeatedly in the advertisement feature database 600, relative to all features of the feature vector, reaches the first threshold T1; if so, it determines that the text to be detected matches a record in the advertisement feature database 600, and otherwise that it does not match.
Further, the recognition unit 400 is adapted to judge, before detecting for each feature in the feature vector whether the feature appears in the advertisement feature database 600, whether the number of features in the feature vector is less than the third threshold T3; if so, it determines that the text to be detected does not match the records in the advertisement feature database 600 and ends the judgment; otherwise it goes on to detect, for each feature in the feature vector, whether the feature appears repeatedly in the advertisement feature database 600.
The shielding unit 500 is adapted to perform shielding processing on the message-publishing event when the recognition unit 400 identifies the above match. The shielding unit 500 of this embodiment is located on the game server or on the game client that executes the message-publishing event.
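How the units cooperate can be illustrated with a short sketch. The `MessageEvent` type, its `kind` labels, and the callables passed to `handle_event` are all hypothetical stand-ins for units 100 to 500; the source does not prescribe any particular interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MessageEvent:
    kind: str  # hypothetical labels: "broadcast", "multicast", or "private"
    text: str


def handle_event(event: MessageEvent,
                 extract_features: Callable[[str], List[str]],
                 is_advertisement: Callable[[List[str]], bool]) -> bool:
    """Return True when the message-publishing event should be shielded."""
    # Detecting unit 100: only broadcast or multicast events are examined.
    if event.kind not in ("broadcast", "multicast"):
        return False
    text = event.text                  # text acquiring unit 200
    features = extract_features(text)  # feature vector extraction unit 300
    return is_advertisement(features)  # recognition unit 400; unit 500 shields on True


event = MessageEvent(kind="broadcast", text="tian mao shou ye")
print(handle_event(event, str.split, lambda feats: "mao" in feats))
```

Keeping the type check first means that messages outside the broadcast and multicast types never reach the extraction and matching stages.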
Fig. 6 shows a detailed block diagram of the device for identifying advertisement features of published messages in an online game according to the first embodiment of the present invention. The feature vector extraction unit 300 specifically comprises a Chinese text obtaining subunit 310, a pinyin text obtaining subunit 320, and a fingerprint obtaining subunit 330.
The Chinese text obtaining subunit 310 is adapted to perform text processing on the published message text to be detected to obtain a Chinese text.
More specifically, the Chinese text obtaining subunit 310 is adapted to perform a data cleaning operation on the published message text to be detected. The data cleaning operation comprises identifying and discarding HTML markup, converting traditional Chinese characters to simplified Chinese characters, converting full-width characters to half-width characters, converting uppercase English letters to lowercase, and identifying and discarding URLs and punctuation marks, so that the content of the text is converted to regular characters. The Chinese text obtaining subunit 310 is further adapted to convert pinyin to Chinese characters, which comprises using a bidirectional maximum matching algorithm to convert the pinyin in the text to Chinese characters; if a pinyin corresponds to several Chinese characters, any one of them is selected. The Chinese text obtaining subunit 310 is further adapted to retain common Chinese characters, which comprises filtering the text with the common Chinese characters of the GBK code table: all characters that are not common Chinese characters are discarded, and only Chinese characters whose GBK code lies in the range 0xB0A0 to 0xF7FE are retained.
The pinyin text obtaining subunit 320 is adapted to convert the Chinese characters in the obtained Chinese text to pinyin to obtain a pinyin text, which comprises using a Chinese-character-to-pinyin lookup table to convert each Chinese character to its corresponding pinyin string.
Because the Chinese text obtaining subunit 310 obtains a Chinese text from the published message text to be detected, and the pinyin text obtaining subunit 320 converts the Chinese characters of that Chinese text to pinyin to obtain a pinyin text, different variants of a similar text can be identified as the same pinyin text.
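A table-driven conversion of this kind might be sketched as below. `HANZI_TO_PINYIN` is a hypothetical five-entry excerpt of a real lookup table, and `to_pinyin` is not a name from the source. The example shows why homophone substitution does not change the result.

```python
# Hypothetical excerpt of a Chinese-character-to-pinyin lookup table.
HANZI_TO_PINYIN = {"天": "tian", "猫": "mao", "毛": "mao", "首": "shou", "页": "ye"}


def to_pinyin(chinese_text: str) -> str:
    """Map each character to its pinyin; characters missing from the table are skipped."""
    return " ".join(HANZI_TO_PINYIN[ch] for ch in chinese_text if ch in HANZI_TO_PINYIN)


print(to_pinyin("天猫首页"))  # tian mao shou ye
print(to_pinyin("天毛首页"))  # tian mao shou ye (homophone variant, same pinyin)
```

Since both spellings collapse to the same syllable sequence, a variant that swaps in a homophone produces the same features as the original text.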
The fingerprint obtaining subunit 330 is adapted to extract features of the pinyin text and to generate the feature vector of the pinyin text from the extracted features. Specifically, the fingerprint obtaining subunit 330 is adapted to extract the features of the pinyin text with a single Chinese character as the segmentation granularity, and to use a vector space model to generate the feature vector of the pinyin text from the extracted features. Preferably, the fingerprint obtaining subunit 330 uses an N-gram language model to extract the feature vector of the pinyin text: based on the Chinese character granularity of the Chinese text obtained by the Chinese text obtaining subunit 310, it extracts the N-gram features SHINGLE_1, SHINGLE_2, ..., SHINGLE_m from the pinyin text obtained by the pinyin text obtaining subunit 320, and uses a vector space model to form the feature vector D = <SHINGLE_1, SHINGLE_2, ..., SHINGLE_m>.
Fig. 7 shows a detailed block diagram of a device for identifying advertisement features of published messages in an online game according to a second embodiment of the present invention. The second embodiment of the device is largely the same as the first embodiment shown in Fig. 6; the difference is that the device further comprises an advertisement feature database updating unit 700.
The advertisement feature database updating unit 700 is adapted, when the text to be detected is determined to match a record in the advertisement feature database 600, to increment by 1 the weight in the advertisement feature database 600 of each feature of the feature vector that is found in the advertisement feature database 600. In other words, if the text to be detected matches a record in the advertisement feature database, the advertisement feature database 600 is updated.
It should be noted that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein may be implemented in various programming languages, and that the above description of a specific language is given in order to disclose the best mode of the present invention.
Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and they may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying advertisement features of published messages in an online game according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order; these words may be interpreted as names.

Claims (10)

1. A device for identifying advertisement features of published messages in an online game, comprising:
A detecting unit, adapted to detect a message-publishing event of a game client;
A text acquiring unit, adapted to obtain the published message text according to the message-publishing event;
A feature vector extraction unit, adapted to extract one or more feature vectors contained in the published message text;
A recognition unit, adapted to identify, according to the feature vector, whether the published message text to be detected matches one or more records in an advertisement feature database;
A shielding unit, adapted to perform shielding processing on the message-publishing event when the recognition unit identifies the above match.
2. The device according to claim 1, wherein,
The detecting unit is adapted to detect, before the text acquiring unit obtains the published message text according to the message-publishing event, whether the type of the message-publishing event is a broadcast message event or a multicast message event; if not, the flow exits; if so, the text acquiring unit obtains the published message text according to the message-publishing event.
3. The device according to claim 1 or 2, wherein,
The shielding unit is located on the game server or on the game client that executes the message-publishing event.
4. The device according to any one of claims 1-3, wherein,
The recognition unit is adapted to detect, for each feature in the feature vector, whether the feature appears repeatedly in the advertisement feature database;
The recognition unit is adapted to judge whether the proportion of features in the feature vector that appear repeatedly in the advertisement feature database, relative to all features of the feature vector, reaches a first threshold; if so, it determines that the published message text to be detected matches a record in the advertisement feature database, and otherwise that it does not match.
5. The device according to any one of claims 1-4, wherein,
The recognition unit is adapted to look up, for each feature in the feature vector, whether the feature exists in the advertisement feature database; if it exists, to further check the weight of the feature; and if the weight of the feature is greater than or equal to a second threshold, to determine that the feature appears repeatedly in the advertisement feature database.
6. A method for identifying advertisement features of published messages in an online game, comprising:
Detecting a message-publishing event of a game client;
Obtaining the published message text according to the message-publishing event;
Extracting one or more feature vectors contained in the published message text;
Identifying, according to the feature vector, whether the published message text to be detected matches one or more records in an advertisement feature database;
Performing shielding processing on the message-publishing event when the above match is identified.
7. The method according to claim 6, wherein the method further comprises:
Before the published message text is obtained according to the message-publishing event, detecting whether the type of the message-publishing event is a broadcast message event or a multicast message event; if not, exiting the flow; if so, obtaining the published message text according to the message-publishing event.
8. The method according to claim 6 or 7, wherein,
The shielding processing on the message-publishing event is performed by the game server or the game client.
9. The method according to any one of claims 6-8, wherein identifying, according to the feature vector, whether the published message text to be detected matches one or more records in the advertisement feature database specifically comprises:
For each feature in the feature vector, detecting whether the feature appears repeatedly in the advertisement feature database;
Judging whether the proportion of features in the feature vector that appear repeatedly in the advertisement feature database, relative to all features of the feature vector, reaches a first threshold; if so, determining that the published message text to be detected matches a record in the advertisement feature database, and otherwise that it does not match.
10. The method according to any one of claims 6-9, wherein detecting whether the feature appears repeatedly in the advertisement feature database comprises:
Looking up whether the feature exists in the advertisement feature database; if it exists, further checking the weight of the feature; and if the weight of the feature is greater than or equal to a second threshold, determining that the feature appears repeatedly in the advertisement feature database.
CN201310537964.5A 2013-11-04 2013-11-04 Device and method used for identifying advertisement features of issued message in online game Pending CN103605693A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201310537964.5A CN103605693A (en) 2013-11-04 2013-11-04 Device and method used for identifying advertisement features of issued message in online game
PCT/CN2014/087175 WO2015062377A1 (en) 2013-11-04 2014-09-23 Device and method for detecting similar text, and application
US15/034,307 US20160283582A1 (en) 2013-11-04 2014-09-23 Device and method for detecting similar text, and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310537964.5A CN103605693A (en) 2013-11-04 2013-11-04 Device and method used for identifying advertisement features of issued message in online game

Publications (1)

Publication Number Publication Date
CN103605693A true CN103605693A (en) 2014-02-26

Family

ID=50123916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310537964.5A Pending CN103605693A (en) 2013-11-04 2013-11-04 Device and method used for identifying advertisement features of issued message in online game

Country Status (1)

Country Link
CN (1) CN103605693A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140226
