CN102902714A - Method and device for detecting content change - Google Patents

Method and device for detecting content change

Publication number
CN102902714A
Authority
CN
China
Prior art keywords
vector
website
intention
text collection
advertiser
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012102998136A
Other languages
Chinese (zh)
Inventor
孙翔
吴欢琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PANGU CULTURE COMMUNICATION CO Ltd
Original Assignee
PANGU CULTURE COMMUNICATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PANGU CULTURE COMMUNICATION CO Ltd
Priority to CN2012102998136A
Publication of CN102902714A
Legal status: Pending

Abstract

The invention discloses a method and a device for detecting content change. It relates to the field of natural language processing, and it can improve the accuracy of identifying changes in advertiser content and reduce false alarms. According to an embodiment, the method includes: obtaining a creative text set from the content of any advertising creative, and a website text set from the content of the advertiser website corresponding to that advertising creative; performing text vectorization on the creative text set and the website text set respectively to obtain a creative vector and a website vector; determining the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of common elements in the creative vector and the website vector; and determining that the advertiser website content has changed when the similarity is smaller than a preset threshold. The embodiment is suitable for detecting whether advertiser content has changed.

Description

Method and device for detecting content change
Technical field
The present invention relates to the field of natural language processing, and in particular to a method and a device for detecting content change.
Background art
Content change is usually detected by building a vector space model, computing the similarity between documents, and then deciding from that similarity whether the content of a given document has changed.
The vector space model is a model commonly used in natural language processing. It reduces the processing of document content to vector operations in a space, and expresses semantic similarity between document contents as similarity in that space. Once documents are represented as vectors in the document space, the similarity between documents can be measured by computing the similarity between their vectors. Specifically, in the vector space model the semantic similarity between two documents is represented by the cosine of the angle between the two vectors.
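The cosine measure described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions made here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-frequency vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Two documents over a three-word dictionary; doc_b repeats doc_a twice,
# so the vectors point in the same direction.
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]
similarity = cosine_similarity(doc_a, doc_b)
```

Because the cosine depends only on direction, two documents with the same word proportions score close to 1 regardless of length, while documents with no shared words score 0.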
However, when a vector space model is used to determine whether advertiser content has changed, the page content of some advertising creatives is far smaller than the advertiser website content, so the term frequency of each term differs greatly between the two. As a result, the computed similarity between the advertising creative and the advertiser website content is very small. In addition, the number of terms the two share is far smaller than the number of terms that occur on the advertiser website. If cosine similarity is used to compare the two, the resulting similarity value therefore deviates considerably from the actual situation, which causes false alarms and makes it impossible to identify accurately whether the advertiser content has changed.
Summary of the invention
Embodiments of the invention provide a method and a device for detecting content change that can improve the accuracy of identifying changes in advertiser content and reduce false alarms.
To achieve this, embodiments of the invention adopt the following technical solutions:
A method for detecting content change, comprising:
obtaining the creative text set of the content of any advertising creative, and the website text set of the advertiser website content corresponding to the advertising creative;
performing text vectorization on the creative text set and the website text set respectively to obtain a creative vector and a website vector;
determining the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of common elements in the creative vector and the website vector;
when the similarity is less than a preset threshold, determining that the advertiser website content has changed.
A device for detecting content change, comprising:
a text set acquiring unit, configured to obtain the creative text set of the content of any advertising creative, and the website text set of the advertiser website content corresponding to the advertising creative;
a vector acquiring unit, configured to perform text vectorization on the creative text set and the website text set respectively to obtain a creative vector and a website vector;
a similarity determining unit, configured to determine the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of common elements in the creative vector and the website vector;
a decision unit, configured to determine that the advertiser website content has changed when the similarity is less than a preset threshold.
The embodiments of the invention provide a method and a device for detecting content change, which: obtain the creative text set of the content of any advertising creative, and the website text set of the advertiser website content corresponding to the advertising creative; perform text vectorization on the creative text set and the website text set respectively to obtain a creative vector and a website vector; determine the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of common elements in the two vectors; and, when the similarity is less than a preset threshold, determine that the advertiser website content has changed.
In the prior art, a plain vector space model is used to determine whether advertiser content has changed. Because the page content of some advertising creatives is far smaller than the advertiser website content, the term frequency of each term differs greatly between the two, the computed similarity is very small, and the number of terms the two share is far smaller than the number of terms that occur on the advertiser website. Cosine similarity therefore yields values that deviate considerably from the actual situation, causing false alarms and inaccurate identification of changes. By comparison, the solution provided by the embodiments of the invention detects changes in advertiser content with an improved vector space model and a new similarity calculation, which can improve the accuracy of identifying changes in advertiser content and reduce false alarms.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for detecting content change provided by Embodiment 1 of the invention;
Fig. 2 is a block diagram of a device for detecting content change provided by Embodiment 1 of the invention;
Fig. 3 is a flowchart of a method for detecting content change provided by Embodiment 2 of the invention;
Fig. 4 is a schematic diagram of third-level pages provided by Embodiment 2 of the invention;
Fig. 5 is a flowchart of a method for vectorizing the creative text set provided by Embodiment 2 of the invention;
Fig. 6 is a block diagram of a device for detecting content change provided by Embodiment 2 of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Embodiment 1
This embodiment of the invention provides a method for detecting content change. As shown in Fig. 1, the method includes the following steps:
Step 101: obtain the creative text set of the content of any advertising creative, and the website text set of the advertiser website content corresponding to the advertising creative.
In this step, based on the content of the advertising creative, the target page of the advertising creative and the pages that the links within that target page point to are fetched and parsed to obtain the creative text set.
Based on the advertiser website content corresponding to the advertising creative and on a preset period, the home page of the advertiser website and its second-level and third-level pages are fetched and parsed to obtain the website text set. The preset period is the period at which the website text set is updated.
Step 102: perform text vectorization on the creative text set and the website text set respectively to obtain a creative vector and a website vector.
In this step, every text in the creative text set and in the website text set is segmented into words.
For each word in a preset dictionary, its term frequency in each segmented text is counted.
From these counted term frequencies, the total term frequency of each word in the creative text set and in the website text set is computed.
The creative text set and the website text set are mapped onto the vector space model to obtain the creative vector and the website vector.
Further, mapping the creative text set and the website text set onto the vector space model to obtain the creative vector and the website vector includes: mapping the creative text set and the website text set onto the vector space model to obtain a first creative vector and a first website vector;
selecting from the first creative vector the elements with the highest term frequencies, as many as the first threshold parameter specifies, setting the values of those elements to 1, and setting the values of all other elements of the first creative vector to 0, which yields the creative vector;
selecting from the first website vector the elements with the highest term frequencies, as many as the second threshold parameter specifies, setting the values of those elements to 1, and setting the values of all other elements of the first website vector to 0, which yields the website vector;
where the first threshold parameter is less than the second threshold parameter.
Step 103: determine the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of common elements in the creative vector and the website vector.
In this step, the similarity between the advertising creative content and the advertiser website content is determined according to
[Formula image BDA00002038242400041; not reproduced in this text]
where $V_1$ is the creative vector, $V_2$ is the website vector, $N_3$ is the number of common elements whose value is 1 in both the creative vector and the website vector, and $N_1$ is the first threshold parameter.
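Since the formula image did not survive in this text, the following is only a sketch under a stated assumption: that the similarity is the ratio $N_3 / N_1$, which is one reading consistent with the symbols defined above. The function name `overlap_similarity` is hypothetical.

```python
def overlap_similarity(creative_vec, website_vec, n1):
    """ASSUMED form of the similarity: N3 / N1.

    creative_vec, website_vec: binary (0/1) vectors over the same dictionary.
    n1: the first threshold parameter (number of high-frequency creative words).
    The patent's actual formula is not reproduced in the source text.
    """
    if len(creative_vec) != len(website_vec):
        raise ValueError("vectors must have the same dimension")
    if n1 <= 0:
        raise ValueError("n1 must be positive")
    # N3: positions where both vectors hold the value 1.
    n3 = sum(1 for a, b in zip(creative_vec, website_vec) if a == 1 and b == 1)
    return n3 / n1


v1 = [1, 1, 0, 1, 0, 0]  # creative vector with N1 = 3 ones
v2 = [1, 0, 1, 1, 1, 0]  # website vector
score = overlap_similarity(v1, v2, n1=3)  # two shared ones out of three
```

Under this reading, the score is the fraction of the creative's high-frequency words that are also high-frequency words on the website, so it lies between 0 and 1 whenever the creative vector holds exactly `n1` ones.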
Step 104: when the similarity is less than a preset threshold, determine that the advertiser website content has changed.
Further, after it is determined that the advertiser website content has changed, an alert is raised.
When the similarity is greater than or equal to the preset threshold, determine that the advertiser content has not changed.
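The decision in step 104 is a single comparison; the sketch below mainly pins down the boundary case. The function name and the threshold value 0.5 are illustrative assumptions, since the text only says the threshold is preset.

```python
def website_content_changed(similarity, threshold):
    """True when the similarity is strictly below the preset threshold."""
    return similarity < threshold


THRESHOLD = 0.5  # hypothetical value; the source does not give one

# A similarity equal to the threshold counts as "not changed".
alerts = [s for s in (0.2, 0.5, 0.9) if website_content_changed(s, THRESHOLD)]
```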
This embodiment of the invention provides a method for detecting content change. By detecting changes in advertiser content with an improved vector space model and a new similarity calculation, it can improve the accuracy of identifying changes in advertiser content and reduce false alarms.
This embodiment of the invention also provides a device for detecting content change. As shown in Fig. 2, the device includes: a text set acquiring unit 201, a vector acquiring unit 202, a similarity determining unit 203, and a decision unit 204.
The text set acquiring unit 201 is configured to obtain the creative text set of the content of any advertising creative, and the website text set of the advertiser website content corresponding to the advertising creative.
The text set acquiring unit 201 is configured to: based on the content of the advertising creative, fetch and parse the target page of the advertising creative and the pages that the links within that target page point to, to obtain the creative text set; and, based on the advertiser website content corresponding to the advertising creative and on a preset period, fetch and parse the home page of the advertiser website and its second-level and third-level pages, to obtain the website text set. The preset period is the period at which the website text set is updated.
The vector acquiring unit 202 is configured to perform text vectorization on the creative text set and the website text set respectively to obtain a creative vector and a website vector.
A word segmentation module in the vector acquiring unit 202 is configured to segment every text in the creative text set and the website text set into words.
A term frequency statistics module in the vector acquiring unit 202 is configured to count, for each word in a preset dictionary, its term frequency in each segmented text.
A total term frequency acquisition module in the vector acquiring unit 202 is configured to compute, from the counted term frequencies, the total term frequency of each word in the creative text set and in the website text set.
A vector acquisition module in the vector acquiring unit 202 is configured to map the creative text set and the website text set onto the vector space model to obtain the creative vector and the website vector.
Further, a mapping submodule in the vector acquisition module is configured to map the creative text set and the website text set onto the vector space model to obtain a first creative vector and a first website vector.
A vector element value setting unit in the vector acquisition module is configured to select from the first creative vector the elements with the highest term frequencies, as many as the first threshold parameter specifies, set the values of those elements to 1, and set the values of all other elements of the first creative vector to 0, which yields the creative vector.
The vector element value setting unit is further configured to select from the first website vector the elements with the highest term frequencies, as many as the second threshold parameter specifies, set the values of those elements to 1, and set the values of all other elements of the first website vector to 0, which yields the website vector.
The first threshold parameter is less than the second threshold parameter.
The similarity determining unit 203 is configured to determine the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of common elements in the creative vector and the website vector.
Specifically, the similarity between the advertising creative content and the advertiser website content is determined according to
[Formula image BDA00002038242400061; not reproduced in this text]
where $V_1$ is the creative vector, $V_2$ is the website vector, $N_3$ is the number of common elements whose value is 1 in both the creative vector and the website vector, and $N_1$ is the first threshold parameter.
The decision unit 204 is configured to determine that the advertiser website content has changed when the similarity is less than a preset threshold.
The decision unit 204 is further configured to determine that the advertiser content has not changed when the similarity is greater than or equal to the preset threshold.
This embodiment of the invention provides a device for detecting content change in which the text set acquiring unit obtains the creative text set and the website text set, the vector acquiring unit vectorizes both sets into a creative vector and a website vector, the similarity determining unit determines the similarity between the advertising creative content and the advertiser website content from the two vectors and the number of their common elements, and the decision unit determines that the advertiser website content has changed when the similarity is less than a preset threshold. By detecting changes in advertiser content with an improved vector space model and a new similarity calculation, the embodiment can improve the accuracy of identifying changes in advertiser content and reduce false alarms.
Embodiment 2
This embodiment of the invention provides a method for detecting content change. As shown in Fig. 3, the method includes:
Step 301: obtain the creative text set of the content of any advertising creative.
An advertising creative uses distinctive technique or a clever advertising script to highlight a product's features and quality, and thereby promotes sales of the product.
In this step, for any advertising creative, a crawler can be used to fetch the target page of the advertising creative and the pages that the links within that target page point to. Parsing these pages yields the creative text set $D_1 = \{d_1, d_2, \ldots, d_n\}$, where $d_1, d_2, \ldots, d_n$ each denote a creative text.
A crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web for a search engine and is an important component of a search engine.
Step 302: obtain the website text set of the advertiser website content corresponding to the advertising creative.
The advertiser is the publisher of an advertising campaign: a merchant that sells or promotes its own products and services online, and the provider of affiliate marketing advertisements. Any merchant that promotes or sells its products or services can be an advertiser.
Based on the advertiser website content corresponding to the advertising creative, a crawler can be used to fetch the home page of the advertiser website and its second-level and third-level pages. Parsing these pages yields the website text set $D_2 = \{d_1, d_2, \ldots, d_m\}$, where $d_1, d_2, \ldots, d_m$ each denote a website text. As shown in Fig. 4, P0 is the home page of the advertiser website; P1, P2, P3 and P4 are its second-level pages; and P11, P12, P13, P21, P22, P23, P31, P32, P33, P41, P42 and P43 are its third-level pages. P11, P12 and P13 are the next-level pages of P1; P21, P22 and P23 are the next-level pages of P2; P31, P32 and P33 are the next-level pages of P3; and P41, P42 and P43 are the next-level pages of P4.
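The three-level page collection can be sketched as a depth-limited breadth-first walk. To stay self-contained, the sketch runs over an in-memory link graph shaped like Fig. 4 instead of fetching real pages; the function name `collect_pages`, the dictionary `FIG4_LINKS`, and the extra page `P111` are assumptions for illustration.

```python
from collections import deque


def collect_pages(link_graph, homepage, max_depth=2):
    """Return pages reachable from homepage within max_depth link hops.

    Depth 0 is the home page, depth 1 the second-level pages,
    depth 2 the third-level pages.
    """
    seen = {homepage}
    order = [homepage]
    queue = deque([(homepage, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links out of third-level pages
        for child in link_graph.get(page, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append((child, depth + 1))
    return order


FIG4_LINKS = {
    "P0": ["P1", "P2", "P3", "P4"],
    "P1": ["P11", "P12", "P13"],
    "P2": ["P21", "P22", "P23"],
    "P3": ["P31", "P32", "P33"],
    "P4": ["P41", "P42", "P43"],
    "P11": ["P111"],  # hypothetical fourth-level page, not in Fig. 4
}

pages = collect_pages(FIG4_LINKS, "P0")  # 1 + 4 + 12 pages; P111 is skipped
```

In a real crawler the `link_graph.get(page, [])` lookup would be replaced by fetching the page and extracting its links; the depth bookkeeping stays the same.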
Further, detecting changes in advertiser content requires obtaining the advertiser website text set repeatedly. A preset period can therefore be configured, and the advertiser website text set is obtained again each period. The preset period is the period at which the website text set is updated, and it can be set empirically.
Step 303: perform text vectorization on the creative text set to obtain the creative vector.
In this step, text vectorization is performed on the creative text set to obtain the creative vector. As shown in Fig. 5, this includes the following steps:
Step 3031: segment every text in the creative text set $D_1 = \{d_1, d_2, \ldots, d_n\}$ into words. Word segmentation means cutting a sequence of Chinese characters into individual words. For example, segmenting the (Chinese) sentence "perform text vectorization on the creative text set" yields eight separate words, which translate as "on, the, creative, text, set, perform, text, vectorization".
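The patent does not say which segmentation algorithm is used. As one possible illustration, the sketch below uses forward maximum matching against a small dictionary, a common baseline for Chinese word segmentation. The sample sentence is an assumed back-translation of the example above, and the dictionary contents are made up for the demonstration.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Assumed Chinese original of the example sentence in step 3031.
SENTENCE = "对所述创意文本集合进行文本向量化"
TOY_DICTIONARY = {"所述", "创意", "文本", "集合", "进行", "向量化"}

segments = forward_max_match(SENTENCE, TOY_DICTIONARY)
```

With this toy dictionary the sentence splits into eight words, matching the count given in the text. A production system would more likely rely on a dedicated segmentation library with a full dictionary.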
Step 3032: using a preset dictionary, for example an existing dictionary of K words, count the term frequency (TF) of each dictionary word in each segmented text. The term frequency is the number of times a word occurs in a text.
Step 3033: from the counted term frequencies, compute the total term frequency of each word in the creative text set.
That is, the term frequencies of a word across all texts are added together to obtain the word's total term frequency in the creative text set $D_1$.
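Steps 3032 and 3033 can be sketched with `collections.Counter`. The function name `total_term_frequencies` and the English sample words are assumptions for illustration; the output is ordered like the dictionary so it can serve directly as the K-dimensional vector used in the next step.

```python
from collections import Counter


def total_term_frequencies(segmented_texts, dictionary):
    """Sum per-text term frequencies of dictionary words over a text set.

    segmented_texts: list of texts, each already split into a list of words.
    dictionary: ordered list of the K dictionary words.
    Returns a list of K totals, in dictionary order.
    """
    vocab = set(dictionary)
    totals = Counter()
    for words in segmented_texts:
        # Term frequency of each dictionary word within this one text.
        per_text = Counter(w for w in words if w in vocab)
        totals.update(per_text)  # add this text's counts to the running totals
    return [totals[term] for term in dictionary]


DICTIONARY = ["phone", "price", "camera", "discount"]
TEXTS = [
    ["phone", "price", "phone"],
    ["camera", "phone", "sale"],  # "sale" is not in the dictionary
]

tf_vector = total_term_frequencies(TEXTS, DICTIONARY)
```

Words outside the preset dictionary are ignored, and dictionary words that never occur get a total of 0.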
Step 3034: map the creative text set onto the vector space model to obtain the creative vector $V_1$.
Specifically, in this step the creative text set is first mapped onto the vector space model to obtain the first creative vector $V_1'$.
The dimension of the vector is the dictionary size K, and the value of each element is the total term frequency TF, in the creative text set, of the term that the element corresponds to, i.e. $V_1' = (t_1, TF_1; t_2, TF_2; \ldots; t_K, TF_K)$. This vector can be abbreviated as $V_1' = (TF_1, TF_2, \ldots, TF_K)$.
According to the element values in $V_1'$, the elements with the highest term frequencies in the first creative vector are selected, as many as the first threshold parameter specifies. The values of those elements are set to 1 and the values of all other elements of the first creative vector are set to 0, which yields the creative vector $V_1$. The elements of $V_1$ take only the values 1 and 0. The first threshold parameter can be set empirically; it represents the number of high-frequency words to keep from the advertising creative content.
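The binarization in step 3034 can be sketched as below. Two details are assumptions made here because the text does not specify them: ties are broken in favour of the lower dictionary index, and words with zero frequency are never marked as high-frequency. The function name `binarize_top_n` is hypothetical.

```python
def binarize_top_n(tf_vector, n):
    """Set the n highest non-zero term frequencies to 1 and everything else to 0.

    Assumptions: ties go to the lower index; zero-frequency terms stay 0.
    """
    candidates = [i for i, tf in enumerate(tf_vector) if tf > 0]
    ranked = sorted(candidates, key=lambda i: (-tf_vector[i], i))
    keep = set(ranked[:n])
    return [1 if i in keep else 0 for i in range(len(tf_vector))]


first_creative_vector = [5, 0, 3, 3, 1]
creative_vector = binarize_top_n(first_creative_vector, n=2)
```

The same function would produce the website vector in step 304 when called with the second threshold parameter, which the text requires to be larger than the first.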
Step 304: perform text vectorization on the website text set to obtain the website vector.
The operations in this step are the same as in step 303; the difference is that step 303 vectorizes the creative text set, whereas this step vectorizes the website text set.
Every text in the website text set $D_2 = \{d_1, d_2, \ldots, d_m\}$ is segmented into words. The term frequency of each word of the preset dictionary in each segmented text is counted, and from these counts the total term frequency of each word in the website text set is computed. The website text set is mapped onto the vector space model to obtain the first website vector $V_2'$. The elements with the highest term frequencies in the first website vector are selected, as many as the second threshold parameter specifies; the values of those elements are set to 1 and the values of all other elements of the first website vector are set to 0, which yields the website vector $V_2$. The elements of the website vector $V_2$ take only the values 1 and 0. The second threshold parameter can be set empirically; it represents the number of high-frequency words to keep from the advertiser website content. The first threshold parameter is less than the second threshold parameter.
Step 305: determine the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of common elements in the creative vector and the website vector.
In this step, the similarity between the advertising creative content and the advertiser website content can be determined according to
[Formula image BDA00002038242400091; not reproduced in this text]
where $V_1$ is the creative vector, $V_2$ is the website vector, and $N_3$ is the number of common elements whose value is 1 in both the creative vector and the website vector, that is, the number of words that appear in both the advertising creative content and the advertiser website content and are high-frequency words in both.
The prior art obtains the similarity from the cosine of the angle between the vectors, that is,
$$\cos\theta = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$
which is complex to compute and therefore inefficient, whereas the present invention determines the similarity between the advertising creative content and the advertiser website content according to
[Formula image BDA00002038242400093; not reproduced in this text]
which is simpler to compute and therefore more efficient.
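The contrast between the two measures can be made concrete with a small made-up example in which a short creative is compared with a much larger website. As before, the ratio $N_3 / N_1$ is an assumed reading of the missing formula, and all names and numbers below are illustrative, not taken from the patent.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_terms(tf, n):
    """Indices of the n highest non-zero frequencies (ties: lower index first)."""
    ranked = sorted((i for i, f in enumerate(tf) if f > 0), key=lambda i: (-tf[i], i))
    return set(ranked[:n])


# Total term frequencies over an eight-word dictionary (made-up numbers).
creative_tf = [3, 2, 1, 0, 0, 0, 0, 0]
website_tf = [20, 15, 10, 90, 80, 0, 0, 0]
N1, N2 = 3, 5  # first and second threshold parameters, with N1 < N2

cos_sim = cosine(creative_tf, website_tf)

# With binary vectors, N3 is just the size of a set intersection.
n3 = len(top_terms(creative_tf, N1) & top_terms(website_tf, N2))
overlap_sim = n3 / N1  # ASSUMED formula, see the note above
```

Here every high-frequency word of the creative is also a high-frequency word of the website, so the overlap score is 1.0, while the cosine over raw frequencies stays low because two website-only words dominate the website vector. The overlap count also needs no multiplications or square roots, which is consistent with the efficiency claim in the text.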
Step 306: judge whether the similarity is less than a preset threshold.
The preset threshold is a value set empirically.
Step 307: when the similarity is less than the preset threshold, determine that the advertiser website content has changed.
When the similarity is less than the preset threshold, the advertising creative content is no longer consistent with the advertiser's state at the time it was reviewed, which means the advertiser website content has changed, and an alert is raised.
Step 308: when the similarity is greater than or equal to the preset threshold, determine that the advertiser content has not changed.
It should be noted that this embodiment can obtain the website text set of the advertiser website repeatedly and thereby keep updating the similarity between the advertising creative content and the advertiser website content, so that changes in advertiser content can be detected in real time.
This embodiment of the invention provides a method for detecting content change. By detecting changes in advertiser content with an improved vector space model and a new similarity calculation, it can improve the accuracy of identifying changes in advertiser content and reduce false alarms, so that the advertising creative and the advertiser website content stay more closely related, which improves the user experience and increases the click probability.
The embodiment of the invention provides a kind of device of Detection of content change, as shown in Figure 6, this device comprises: text collection acquiring unit 601, vectorial acquiring unit 602, cut word module 6021, word frequency statistics module 6022, total word frequency acquisition module 6023, vectorial acquisition module 6024, mapping submodule 60241, vector element value setting unit 60242, similarity determining unit 603, decision unit 604;
Text collection acquiring unit 601 is used for obtaining respectively the intention text collection of the content of arbitrary advertising creative, and the website text collection of advertiser's web site contents corresponding to described advertising creative;
Further, described text collection acquiring unit 601 is used for: according to the content of arbitrary advertising creative, adopt the reptile program to obtain and resolve the page of advertising creative target pages and described advertising creative target pages internal chaining sensing, obtain the intention text collection; And advertiser web site contents and the predetermined period corresponding according to described advertising creative, adopt the reptile program to obtain and resolve the secondary page and three grades of pages of advertiser's website homepage, described advertiser's website homepage, obtain the website text collection, described predetermined period is for upgrading the cycle of described website text collection.
The vector acquiring unit 602 is configured to perform text vectorization on the creative text set and the website text set respectively, obtaining a creative vector and a website vector.
Further, the word segmentation module 6021 in the vector acquiring unit 602 is configured to segment every text in the creative text set and the website text set into words.
The term frequency statistics module 6022 in the vector acquiring unit 602 is configured to count the frequency of each word of a preset dictionary in every segmented text.
The total term frequency acquisition module 6023 in the vector acquiring unit 602 is configured to calculate, from the counted frequency of each word, the total frequency of each word in the creative text set and in the website text set respectively.
The vector acquisition module 6024 in the vector acquiring unit 602 is configured to map the creative text set and the website text set onto a vector space model, obtaining the creative vector and the website vector.
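The segmentation, per-text counting, and total-frequency steps can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary contents, the sample texts, and the whitespace-based `segment` function are all hypothetical stand-ins (Chinese text would need a real word-segmentation tool), and the source does not specify how the vector dimensions are ordered.

```python
from collections import Counter

# Hypothetical preset dictionary; its word order fixes the vector dimensions.
DICTIONARY = ["flower", "delivery", "rose", "loan", "credit"]


def segment(text: str) -> list[str]:
    """Stand-in segmenter (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def frequency_vector(texts: list[str], dictionary: list[str]) -> list[int]:
    """Sum each dictionary word's per-text frequency over a whole text set."""
    totals = Counter()
    for text in texts:
        per_text = Counter(segment(text))   # term frequency within one text
        for word in dictionary:
            totals[word] += per_text[word]  # accumulate the total term frequency
    return [totals[word] for word in dictionary]


creative_texts = ["Rose delivery today", "flower delivery rose rose"]
website_texts = ["flower shop flower delivery", "credit loan offers"]

first_creative_vector = frequency_vector(creative_texts, DICTIONARY)  # [1, 2, 3, 0, 0]
first_website_vector = frequency_vector(website_texts, DICTIONARY)    # [2, 1, 0, 1, 1]
```

Words outside the dictionary ("today", "shop", "offers") contribute nothing, so both text sets land in the same fixed-dimension space.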
Further, the mapping submodule 60241 in the vector acquisition module 6024 is configured to map the creative text set and the website text set onto a vector space model, obtaining a first creative vector and a first website vector.
The vector element value setting unit 60242 in the vector acquisition module 6024 is configured to select, in the first creative vector, the elements with the highest term frequency, their number being given by a first threshold parameter, set the value of the selected elements to 1, and set the value of every other element in the first creative vector to 0, obtaining the creative vector.
The vector element value setting unit 60242 is further configured to select, in the first website vector, the elements with the highest term frequency, their number being given by a second threshold parameter, set the value of the selected elements to 1, and set the value of every other element in the first website vector to 0, obtaining the website vector.
The first threshold parameter can be set empirically and represents the threshold on the number of high-frequency words in the advertising creative content. The second threshold parameter can likewise be set empirically and represents the threshold on the number of high-frequency words in the advertiser website content. The first threshold parameter is smaller than the second threshold parameter.
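The element value setting step amounts to a top-N binarization of a term frequency vector. The sketch below is illustrative only: the function name, the sample frequencies, and the threshold values are hypothetical, and two behaviours are assumptions the source does not address, namely that ties go to the lower index and that words with zero frequency are never selected.

```python
def binarize_top_n(frequencies: list[int], n: int) -> list[int]:
    """Set the n highest-frequency elements to 1 and every other element to 0.

    Assumptions not stated in the source: ties go to the lower index, and
    words with zero frequency are never selected.
    """
    ranked = sorted(
        (i for i, f in enumerate(frequencies) if f > 0),
        key=lambda i: (-frequencies[i], i),
    )
    keep = set(ranked[:n])
    return [1 if i in keep else 0 for i in range(len(frequencies))]


N1 = 2  # hypothetical first threshold parameter (creative side)
N2 = 3  # hypothetical second threshold parameter (website side), N1 < N2

creative_vector = binarize_top_n([1, 2, 3, 0, 0], N1)  # [0, 1, 1, 0, 0]
website_vector = binarize_top_n([2, 1, 0, 1, 1], N2)   # [1, 1, 0, 1, 0]
```

Keeping more high-frequency words on the website side than on the creative side matches the stated constraint that the first threshold parameter is smaller than the second.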
The similarity determining unit 603 is configured to determine the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of elements common to the creative vector and the website vector.
Further, the similarity determining unit 603 is configured to determine the similarity between the advertising creative content and the advertiser website content according to the formula
[formula image BDA00002038242400111, not reproduced in this text]
where V_1 is the creative vector, V_2 is the website vector, N_3 is the number of common elements whose value is 1 in both the creative vector and the website vector, and N_1 is the first threshold parameter.
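The similarity formula itself survives only as an image reference, so it is not reconstructed here. What can be sketched from the text is one of its inputs, N_3, the number of positions at which both binary vectors hold the value 1. The function name and sample vectors below are hypothetical.

```python
def count_common_ones(v1: list[int], v2: list[int]) -> int:
    """Count positions where both binary vectors hold the value 1 (N_3)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must share the same dictionary dimensions")
    return sum(1 for a, b in zip(v1, v2) if a == 1 and b == 1)


creative_vector = [0, 1, 1, 0, 0]
website_vector = [1, 1, 0, 1, 0]
n3 = count_common_ones(creative_vector, website_vector)  # 1
```

Because the creative vector holds at most N_1 ones, N_3 can never exceed N_1.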
The decision unit 604 is configured to determine that the advertiser website content has changed when the similarity is less than a preset threshold. A similarity below the preset threshold indicates that the advertising creative content is no longer consistent with the state of the advertiser's website at the time it was reviewed, that is, the advertiser website content has changed, and an alarm is then raised.
Further, the decision unit 604 is also configured to determine that the advertiser's content has not changed when the similarity is greater than or equal to the preset threshold.
The preset threshold is a value set empirically.
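The decision rule can be sketched as a single comparison. The threshold value, ad identifiers, and similarity scores below are hypothetical; the source gives no numbers. The sketch treats a similarity at or above the threshold as "no change", which is the reading consistent with the rule that a similarity below the threshold means the content has changed.

```python
def content_changed(similarity: float, threshold: float) -> bool:
    """Return True when the similarity falls below the preset threshold."""
    return similarity < threshold


THRESHOLD = 0.5  # hypothetical empirical value; the source gives no number

# Hypothetical similarity scores for three advertising creatives.
alerts = [
    (ad_id, content_changed(sim, THRESHOLD))
    for ad_id, sim in [("ad-1", 0.2), ("ad-2", 0.5), ("ad-3", 0.9)]
]
# [("ad-1", True), ("ad-2", False), ("ad-3", False)]
```

Only "ad-1" would raise an alarm; a score exactly equal to the threshold counts as unchanged.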
An embodiment of the invention provides a device for detecting content change. The text set acquiring unit obtains, respectively, the creative text set of the content of any advertising creative and the website text set of the advertiser website content corresponding to the advertising creative; the vector acquiring unit performs text vectorization on the creative text set and the website text set respectively, obtaining the creative vector and the website vector; the similarity determining unit determines the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of elements common to both vectors; and the decision unit determines that the advertiser website content has changed when the similarity is less than the preset threshold. By using an improved vector space model and a new similarity calculation method to detect advertiser content changes, the embodiment can improve the accuracy of identifying such changes and reduce false alarms.
The above is only a specific embodiment of the invention, but the scope of protection of the invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. Therefore, the scope of protection of the invention shall be determined by the scope of protection of the claims.

Claims (12)

1. A method for detecting content change, characterized by comprising:
obtaining, respectively, a creative text set of the content of any advertising creative and a website text set of the advertiser website content corresponding to the advertising creative;
performing text vectorization on the creative text set and the website text set respectively, obtaining a creative vector and a website vector;
determining the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of elements common to the creative vector and the website vector;
when the similarity is less than a preset threshold, determining that the advertiser website content has changed.
2. The method according to claim 1, characterized in that obtaining, respectively, the creative text set of the content of any advertising creative and the website text set of the advertiser website content corresponding to the advertising creative comprises:
according to the content of any advertising creative, fetching and parsing the target page of the advertising creative and the pages pointed to by links within that target page, obtaining the creative text set;
according to the advertiser website content corresponding to the advertising creative and a predetermined period, fetching and parsing the homepage of the advertiser website and the second-level and third-level pages of that homepage, obtaining the website text set, the predetermined period being the period at which the website text set is updated.
3. The method according to claim 2, characterized in that performing text vectorization on the creative text set and the website text set respectively, obtaining the creative vector and the website vector, comprises:
segmenting every text in the creative text set and the website text set into words;
counting the frequency of each word of a preset dictionary in every segmented text;
calculating, from the counted frequency of each word, the total frequency of each word in the creative text set and in the website text set respectively;
mapping the creative text set and the website text set onto a vector space model, obtaining the creative vector and the website vector.
4. The method according to claim 3, characterized in that mapping the creative text set and the website text set onto the vector space model, obtaining the creative vector and the website vector, comprises:
mapping the creative text set and the website text set onto the vector space model, obtaining a first creative vector and a first website vector;
selecting, in the first creative vector, the elements with the highest term frequency, their number being given by a first threshold parameter, setting the value of the selected elements to 1, and setting the value of every other element in the first creative vector to 0, obtaining the creative vector;
selecting, in the first website vector, the elements with the highest term frequency, their number being given by a second threshold parameter, setting the value of the selected elements to 1, and setting the value of every other element in the first website vector to 0, obtaining the website vector;
wherein the first threshold parameter is smaller than the second threshold parameter.
5. The method according to claim 4, characterized in that determining the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of elements common to the creative vector and the website vector comprises:
determining the similarity between the advertising creative content and the advertiser website content according to the formula
[formula image FDA00002038242300021, not reproduced in this text]
where V_1 is the creative vector, V_2 is the website vector, N_3 is the number of common elements whose value is 1 in both the creative vector and the website vector, and N_1 is the first threshold parameter.
6. The method according to any one of claims 1-5, characterized in that the method further comprises:
when the similarity is greater than or equal to the preset threshold, determining that the advertiser's content has not changed.
7. A device for detecting content change, characterized by comprising:
a text set acquiring unit, configured to obtain, respectively, a creative text set of the content of any advertising creative and a website text set of the advertiser website content corresponding to the advertising creative;
a vector acquiring unit, configured to perform text vectorization on the creative text set and the website text set respectively, obtaining a creative vector and a website vector;
a similarity determining unit, configured to determine the similarity between the advertising creative content and the advertiser website content according to the creative vector, the website vector, and the number of elements common to the creative vector and the website vector;
a decision unit, configured to determine that the advertiser website content has changed when the similarity is less than a preset threshold.
8. The device according to claim 7, characterized in that the text set acquiring unit is configured to:
according to the content of any advertising creative, fetch and parse the target page of the advertising creative and the pages pointed to by links within that target page, obtaining the creative text set;
according to the advertiser website content corresponding to the advertising creative and a predetermined period, fetch and parse the homepage of the advertiser website and the second-level and third-level pages of that homepage, obtaining the website text set, the predetermined period being the period at which the website text set is updated.
9. The device according to claim 8, characterized in that the vector acquiring unit comprises:
a word segmentation module, configured to segment every text in the creative text set and the website text set into words;
a term frequency statistics module, configured to count the frequency of each word of a preset dictionary in every segmented text;
a total term frequency acquisition module, configured to calculate, from the counted frequency of each word, the total frequency of each word in the creative text set and in the website text set respectively;
a vector acquisition module, configured to map the creative text set and the website text set onto a vector space model, obtaining the creative vector and the website vector.
10. The device according to claim 9, characterized in that the vector acquisition module comprises:
a mapping submodule, configured to map the creative text set and the website text set onto the vector space model, obtaining a first creative vector and a first website vector;
a vector element value setting unit, configured to select, in the first creative vector, the elements with the highest term frequency, their number being given by a first threshold parameter, set the value of the selected elements to 1, and set the value of every other element in the first creative vector to 0, obtaining the creative vector;
the vector element value setting unit being further configured to select, in the first website vector, the elements with the highest term frequency, their number being given by a second threshold parameter, set the value of the selected elements to 1, and set the value of every other element in the first website vector to 0, obtaining the website vector;
wherein the first threshold parameter is smaller than the second threshold parameter.
11. The device according to claim 10, characterized in that the similarity determining unit is configured to:
determine the similarity between the advertising creative content and the advertiser website content according to the formula
[formula image FDA00002038242300031, not reproduced in this text]
where V_1 is the creative vector, V_2 is the website vector, N_3 is the number of common elements whose value is 1 in both the creative vector and the website vector, and N_1 is the first threshold parameter.
12. The device according to any one of claims 7-11, characterized in that the decision unit is further configured to:
when the similarity is greater than or equal to the preset threshold, determine that the advertiser's content has not changed.
CN2012102998136A 2012-08-21 2012-08-21 Method and device for detecting content change Pending CN102902714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012102998136A CN102902714A (en) 2012-08-21 2012-08-21 Method and device for detecting content change

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012102998136A CN102902714A (en) 2012-08-21 2012-08-21 Method and device for detecting content change

Publications (1)

Publication Number Publication Date
CN102902714A true CN102902714A (en) 2013-01-30

Family

ID=47574947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012102998136A Pending CN102902714A (en) 2012-08-21 2012-08-21 Method and device for detecting content change

Country Status (1)

Country Link
CN (1) CN102902714A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446118A (en) * 2016-09-19 2017-02-22 中国南方电网有限责任公司信息中心 Method for automatically generating page change template
CN109740094A (en) * 2018-12-27 2019-05-10 上海掌门科技有限公司 Page monitoring method, equipment and computer storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1741012A (en) * 2004-08-23 2006-03-01 富士施乐株式会社 Test search apparatus and method
CN101251841A (en) * 2007-05-17 2008-08-27 华东师范大学 Method for establishing and searching feature matrix of Web document based on semantics

Similar Documents

Publication Publication Date Title
CN107437038B (en) Webpage tampering detection method and device
CN104572977B (en) A kind of agricultural product quality and safety event online test method
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
CN110727766B (en) Sensitive word detection method
CN107943954A (en) Detection method, device and the electronic equipment of webpage sensitive information
CN102446255B (en) Method and device for detecting page tamper
CN104536980A (en) To-be-commented item quality information determination method and device
CN102436563A (en) Method and device for detecting page tampering
CN103605691B (en) Device and method used for processing issued contents in social network
CN105824898A (en) Label extracting method and device for network comments
CN104424308A (en) Web page classification standard acquisition method and device and web page classification method and device
CN102591965A (en) Method and device for detecting black chain
CN110602045A (en) Malicious webpage identification method based on feature fusion and machine learning
CN104933191A (en) Spam comment recognition method and system based on Bayesian algorithm and terminal
CN103839172A (en) Goods recommendation method and system
CN104133870A (en) Web page similarity calculation method and web page similarity calculation device
CN103617213A (en) Method and system for identifying newspage attributive characters
CN112532624B (en) Black chain detection method and device, electronic equipment and readable storage medium
CN103425680A (en) Selection method and system for page advertisement demonstration
CN103778122A (en) Searching method and system
CN105095381A (en) Method and device for new word identification
CN110688455A (en) Method, medium and computer equipment for filtering invalid comments based on artificial intelligence
CN103399872A (en) Method and device for optimizing webpage capture
CN103136213A (en) Method and device for providing related words
CN106383862A (en) Violation short message detection method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20161019

C20 Patent right or utility model deemed to be abandoned or is abandoned