CN107729489A - Advertisement text recognition methods and device - Google Patents

Publication number
CN107729489A
Authority
CN
China
Prior art keywords
text
categories
advertisement
similarity
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710966609.8A
Other languages
Chinese (zh)
Inventor
李树海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201710966609.8A priority Critical patent/CN107729489A/en
Publication of CN107729489A publication Critical patent/CN107729489A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/194 - Calculation of difference between files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an advertisement text recognition method and device, and relates to the field of computer technology. One embodiment of the method includes: obtaining text to be identified; clustering the text to be identified according to the association relationship between texts to form at least one text category; and identifying the advertisement text in the text to be identified according to the at least one text category. The embodiment identifies advertisement text automatically with an unsupervised method and requires no manual participation, which reduces cost and improves recognition efficiency, so that advertisement text can be identified quickly in massive volumes of text.

Description

Advertisement text recognition method and device
Technical field
The present invention relates to the field of computer technology, and in particular to an advertisement text recognition method and device.
Background art
Social media is a relationship-based platform for sharing, disseminating, and obtaining information. Users can publish text and multimedia information through the Web or mobile applications and share it instantly. Because social media has developed rapidly, text data has accumulated on a large scale. However, with the rapid development of various Internet communication carriers, a large amount of junk text, especially advertisements, has also appeared in cyberspace. These advertisement texts greatly interfere with the analysis of all kinds of text: analysis becomes less efficient and more difficult, and the results cannot reflect the true state of the data. In addition, ordinary users have to pick out meaningful text from a large amount of advertisement text, which degrades the user experience. Filtering advertisement text in cyberspace is therefore of great significance.
At present, there are mainly two existing methods for filtering advertisement text:
1. Manually specifying sensitive words and advertising words, and then filtering out the texts that contain these words.
2. Using supervised classification: a large number of advertisement texts and non-advertisement texts are labeled manually in advance, a classification model is trained with a machine learning classification method such as the SVM (Support Vector Machine) algorithm or a neural network algorithm, and the classification model is then used to predict whether a text is an advertisement.
In the course of realizing the present invention, the inventor found that the prior art has at least the following problems:
1. Methods that rely on manual participation are inefficient and costly. In the big data era, hundreds, thousands, or even tens of thousands of texts are generated every day, and manual processing cannot meet the demand.
2. Manually specifying sensitive words and advertising words requires staff with very strong background knowledge of the advertising field, and full coverage of sensitive words and advertising words cannot be guaranteed, so the recall of advertisement text detected by this method is very low.
3. Supervised classification requires a large training set to be obtained or labeled manually. Public training sets obtained from other data sources usually differ greatly in their features from the texts that currently need to be classified, so the advertisement filtering effect cannot be guaranteed.
Summary of the invention
In view of this, the embodiments of the present invention provide an advertisement text recognition method that identifies advertisement text automatically with an unsupervised method and requires no manual participation, which reduces cost and improves recognition efficiency, so that advertisement text can be identified quickly in massive volumes of text. The method clusters texts according to the similarity between them to form at least one text category and identifies advertisement text according to the at least one text category, which improves the accuracy of the recognition result.
To achieve the above object, according to one aspect of the embodiments of the present invention, an advertisement text recognition method is provided.
The advertisement text recognition method of the embodiments of the present invention includes: obtaining text to be identified; clustering the text to be identified according to the association relationship between texts to form at least one text category; and identifying the advertisement text in the text to be identified according to the at least one text category.
Optionally, identifying the advertisement text in the text to be identified according to the at least one text category includes: determining the number of texts in each text category; and, if the number of texts in the current text category is greater than a quantity threshold, determining that the texts in the current text category are advertisement texts.
Optionally, the association relationship between texts includes the similarity between texts;
clustering the text to be identified according to the association relationship between texts to form at least one text category includes: randomly selecting one text from the text to be identified to form a first text category; determining the longest common subsequence of the current text and each text in the first text category; determining, according to the longest common subsequence, the similarity between the current text and each text in the first text category; when the maximum of these similarities is greater than or equal to a similarity threshold, assigning the current text to the first text category; and, when the maximum of these similarities is less than the similarity threshold, creating a second text category and assigning the current text to the second text category.
Optionally, clustering the text to be identified according to the association relationship between texts to form at least one text category includes: obtaining the text publisher of each text, and clustering the texts that have the same text publisher to form at least one text set, each text set corresponding to one text publisher.
Optionally, identifying the advertisement text in the text to be identified according to the at least one text category includes, for the text set corresponding to each text publisher: determining, according to the similarity between the texts in the text set, the text repetition degree of the text publisher corresponding to the text set, and, if the text repetition degree is greater than a repetition threshold, determining that the texts in the text set are advertisement texts; and/or obtaining the following count and follower count of the text publisher corresponding to the text set, determining the following ratio of that text publisher based on the following count and the follower count, and, if the following ratio is greater than a ratio threshold, determining that the texts in the text set are advertisement texts.
To achieve the above object, according to another aspect of the embodiments of the present invention, an advertisement text recognition device is provided.
The advertisement text recognition device of the embodiments of the present invention includes: a text acquisition module for obtaining text to be identified; a text clustering module for clustering the text to be identified according to the association relationship between texts to form at least one text category; and an advertisement identification module for identifying the advertisement text in the text to be identified according to the at least one text category.
Optionally, the advertisement identification module is further configured to: determine the number of texts in each text category; and, if the number of texts in the current text category is greater than a quantity threshold, determine that the texts in the current text category are advertisement texts.
Optionally, the association relationship between texts includes the similarity between texts;
the text clustering module is further configured to: randomly select one text from the text to be identified to form a first text category; determine the longest common subsequence of the current text and each text in the first text category; determine, according to the longest common subsequence, the similarity between the current text and each text in the first text category; when the maximum of these similarities is greater than or equal to a similarity threshold, assign the current text to the first text category; and, when the maximum of these similarities is less than the similarity threshold, create a second text category and assign the current text to the second text category.
Optionally, the text clustering module is further configured to: obtain the text publisher of each text, and cluster the texts that have the same text publisher to form at least one text set, each text set corresponding to one text publisher.
Optionally, the advertisement identification module is further configured to, for the text set corresponding to each text publisher: determine, according to the similarity between the texts in the text set, the text repetition degree of the text publisher corresponding to the text set, and, if the text repetition degree is greater than a repetition threshold, determine that the texts in the text set are advertisement texts; and/or obtain the following count and follower count of the text publisher corresponding to the text set, determine the following ratio of that text publisher based on the following count and the follower count, and, if the following ratio is greater than a ratio threshold, determine that the texts in the text set are advertisement texts.
To achieve the above object, according to a further aspect of the embodiments of the present invention, an electronic device for implementing the advertisement text recognition method of the above embodiments is provided.
The electronic device of the embodiments of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the advertisement text recognition method of the embodiments of the present invention.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided.
The computer-readable medium of the embodiments of the present invention has a computer program stored thereon, and the program, when executed by a processor, implements the advertisement text recognition method of the embodiments of the present invention.
An embodiment of the above invention has the following advantages or beneficial effects: because advertisement text is identified automatically with an unsupervised method, the technical problems of low efficiency, high cost, and poor filtering effect caused by prior-art filtering methods that require manual participation are overcome, thereby achieving the technical effects of identifying advertisement text automatically and quickly, improving filtering efficiency, and reducing cost.
Further effects of the above optional implementations are explained below in conjunction with the embodiments.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the present invention and do not constitute an undue limitation of it. In the drawings:
Fig. 1 is a schematic diagram of the main flow of the advertisement text recognition method of one embodiment of the present invention;
Fig. 2 is a schematic diagram of the main flow of clustering the text to be identified according to the similarity between texts in the advertisement text recognition method of an embodiment of the present invention;
Fig. 3 is a schematic diagram of the main flow of the advertisement text recognition method of another embodiment of the present invention;
Fig. 4 is a schematic diagram of the main flow of the advertisement text recognition method of yet another embodiment of the present invention;
Fig. 5 is a schematic diagram of the main modules of the advertisement text recognition device of an embodiment of the present invention;
Fig. 6 is a diagram of an exemplary system architecture to which an embodiment of the present invention can be applied;
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are explained below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 is a schematic diagram of the main flow of the advertisement text recognition method of an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step S101: obtaining text to be identified;
Step S102: clustering the text to be identified according to the association relationship between texts to form at least one text category;
Step S103: identifying the advertisement text in the text to be identified according to the at least one text category.
For step S101, in an optional embodiment, microblogs or comments on social media such as microblog platforms, WeChat, and Twitter can be obtained as the text to be identified by calling an API (Application Programming Interface) or by using crawler technology. As a specific example, one or more target keywords can be specified according to the field to be analyzed, and the texts that contain the target keywords within a period of time are obtained as the text to be identified. For example, to analyze microblogs about mobile phones, the keyword "mobile phone" can be specified, and the microblogs that contain "mobile phone" within a period of time are then obtained as the text to be identified by calling the microblog API or by using crawler technology.
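The keyword-and-time-window selection described above can be sketched as a simple filter over posts that have already been fetched. This is only an illustrative sketch: the record layout (`text`, `created_at`) and the function name `select_candidates` are assumptions, and the actual API call or crawler is deliberately left out.

```python
from datetime import datetime

def select_candidates(posts, keywords, start, end):
    """Keep posts published in [start, end) whose text contains any target keyword.

    `posts` is assumed to be a list of dicts with hypothetical keys
    "text" and "created_at"; fetching them (API or crawler) is out of scope.
    """
    return [
        post for post in posts
        if start <= post["created_at"] < end
        and any(keyword in post["text"] for keyword in keywords)
    ]

# Hypothetical sample data standing in for fetched microblogs.
posts = [
    {"text": "New mobile phone review", "created_at": datetime(2017, 10, 1)},
    {"text": "Weather is nice today", "created_at": datetime(2017, 10, 2)},
    {"text": "mobile phone flash sale", "created_at": datetime(2017, 11, 5)},
]
selected = select_candidates(
    posts, ["mobile phone"], datetime(2017, 10, 1), datetime(2017, 11, 1)
)
```

With this sample data only the first post is kept: the second lacks the keyword and the third falls outside the time window.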
In other optional embodiments, the social characteristic information of the microblog publisher, such as the publisher ID, following count, and follower count, can also be obtained.
In an optional embodiment, only original microblogs may be obtained. This is because an original microblog (post) carries information such as the number of forwards, comments, and likes, so in most scenarios analyzing original microblogs (posts) is more valuable than analyzing forwarded microblogs (reposts). In other optional embodiments, both original microblogs and forwarded microblogs can be obtained; the present invention places no limitation on this.
In an optional embodiment, after the text to be identified is obtained, the method further includes: performing word segmentation on each text of the text to be identified.
For example, each text of the text to be identified can be segmented with a Chinese word segmentation tool such as LTP (the Language Technology Platform developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology), NLPIR (a Chinese word segmentation system, also known as ICTCLAS2013), THULAC (THU Lexical Analyzer for Chinese, a word segmentation toolkit developed and released by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University), or jieba.
In an optional embodiment, the association relationship between texts includes the similarity between texts.
For step S102, the inventor found in the course of realizing the present invention that, in practical application scenarios, most advertisement texts appear in highly similar forms. Consider, for example, the following two microblogs:
(1) " 1 yuan is taken by force precious 2 weeks festivals, is supplemented with money now and is sent 500 red packets, double to block in 24 hours effectively!Running quickly Audi will Announce the winners in a lottery, taken away for 1 person-time with red packet!http://t.cn/R5Ugot7”
(2) " 1 yuan is taken by force precious 2 weeks festivals, is supplemented with money now and is sent 500 red packets, double to block in 24 hours effectively!Running quickly Audi will Announce the winners in a lottery, taken away for 1 person-time with red packet!http://t.cn/R5UgNRo”.
The two microblogs above are highly similar, so in the embodiments of the present invention they can be determined to be advertisement microblogs.
Therefore, in the embodiments of the present invention, the text to be identified is clustered according to the similarity between texts, so that similar texts are grouped into one text category.
In an optional implementation, after Chinese word segmentation is performed on the text to be identified, the basic element of each text is the word, so the similarity between texts can be judged from the similarity of the words (or character strings) they contain. For example, the similarity between texts can be calculated based on the longest common subsequence (LCS); specifically, it can be calculated from the length of the longest common subsequence. If a sequence is a subsequence of two or more known sequences and is the longest of all their common subsequences, it is the longest common subsequence of those sequences. For example, for the sequences "aabcd" and "12abcabcd", the longest common subsequence is "aabcd".
Fig. 2 is a schematic diagram of the main flow of clustering the text to be identified according to the association relationship between texts in the advertisement text recognition method of an embodiment of the present invention. In this embodiment, the association relationship between texts is the similarity between texts. As shown in Fig. 2, the flow includes:
Step S201: randomly selecting one text from the text to be identified to form a first text category;
Step S202: determining the longest common subsequence of the current text and each text in the first text category;
Step S203: determining, according to the longest common subsequence, the similarity between the current text and each text in the first text category;
Step S204: when the maximum of these similarities is greater than or equal to a similarity threshold, assigning the current text to the first text category;
Step S205: when the maximum of these similarities is less than the similarity threshold, creating a second text category and assigning the current text to the second text category.
For step S202, in an optional embodiment, dynamic programming can be used to calculate the longest common subsequence. For example, for a text X = [X1, X2, …, Xm] of length m and a text Y = [Y1, Y2, …, Yn] of length n, the longest common subsequence between the two is calculated as follows:
1. Create a two-dimensional array C[i, j] (0 ≤ i ≤ m, 0 ≤ j ≤ n) and initialize it to 0, where C[i, j] represents the length of the longest common subsequence of the first i elements of text X and the first j elements of text Y;
2. Loop over i and j, incrementing each by 1 (that is, i++, j++, 1 ≤ i ≤ m, 1 ≤ j ≤ n);
3. If Xi = Yj, then C[i, j] = C[i-1, j-1] + 1;
4. If Xi != Yj, then C[i, j] = max{C[i, j-1], C[i-1, j]};
5. The maximum value in the two-dimensional array C, that is, C[m, n], is the length of the longest common subsequence of text X and text Y;
6. Determine the longest common subsequence of text X and text Y by tracing back through the two-dimensional array C.
In summary, the length of the longest common subsequence (LCS) can be calculated according to formula (1) below. Calculated this way, the time complexity is only O(m*n) + O(m+n): O(m*n) to fill the array and O(m+n) to trace back the subsequence.
C[i, j] = 0, if i = 0 or j = 0
C[i, j] = C[i-1, j-1] + 1, if i, j > 0 and Xi = Yj
C[i, j] = max{C[i, j-1], C[i-1, j]}, if i, j > 0 and Xi != Yj (1)
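The dynamic-programming steps above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `lcs_table` and `lcs` are assumptions, and the inputs can be any sequences, for example lists of segmented words or plain strings.

```python
def lcs_table(x, y):
    """Fill the table C, where C[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c

def lcs(x, y):
    """Trace back through C to recover one longest common subsequence as a list."""
    c = lcs_table(x, y)
    i, j = len(x), len(y)
    result = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            result.append(x[i - 1])  # matching element belongs to the LCS
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]

# Usage on two segmented (hypothetical) texts.
words_a = ["top", "up", "now", "and", "get", "red", "packets"]
words_b = ["top", "up", "today", "get", "free", "red", "packets"]
common = lcs(words_a, words_b)
```

Filling the table takes O(m*n) time and the trace-back takes O(m+n). When several subsequences share the maximum length, the trace-back returns only one of them.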
For step S203, after the length of the longest common subsequence between the texts has been calculated, the similarity between the texts is calculated from that length. The similarity between text X and text Y can be calculated according to formula (2):
similarity(X, Y) = 2 × length(LCS) / (length(X) + length(Y)) (2)
where similarity(X, Y) represents the similarity between text X and text Y, length(LCS) represents the length of the longest common subsequence of text X and text Y, length(X) represents the length of text X, and length(Y) represents the length of text Y.
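A minimal sketch of formula (2) follows, assuming the texts are given as sequences (strings or lists of segmented words). The helper `lcs_length` keeps only two rows of the dynamic-programming table because only the length is needed here. The function names, and the choice to return 1.0 for two empty texts, are assumptions made for the sketch.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(curr[j - 1], prev[j]))
        prev = curr
    return prev[-1]

def similarity(x, y):
    """similarity(X, Y) = 2 * length(LCS) / (length(X) + length(Y))."""
    if not x and not y:
        return 1.0  # assumption: two empty texts are treated as identical
    return 2 * lcs_length(x, y) / (len(x) + len(y))

score = similarity("abcd", "abxd")  # LCS is "abd", so 2 * 3 / (4 + 4) = 0.75
```

The score ranges from 0 (no common element) to 1 (identical sequences), which makes a fixed similarity threshold straightforward to apply.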
For steps S204 and S205, because advertisement texts are characterized by being published repeatedly in highly similar forms, the segmented microblogs can be taken as input and clustered with a streaming clustering algorithm that uses a set similarity threshold. The similarity threshold can be determined according to the field of the text to be analyzed; the present invention places no limitation on this.
The process of clustering texts according to the similarity between them in the embodiments of the present invention is illustrated below with a specific example. The text to be identified includes text A, text B, text C, and text D, and the similarity threshold is set to 0.85.
Text A is selected to form the first text category T1.
The longest common subsequence between text B and text A in the first text category T1 is determined according to formula (1).
The similarity between text B and text A is determined according to formula (2) to be 0.87.
This similarity is greater than the similarity threshold, so text B is assigned to the first text category T1. The first text category T1 now contains two texts: text A and text B.
The longest common subsequences between text C and text A and between text C and text B in the first text category T1 are determined according to formula (1).
According to formula (2), the similarity between text C and text A is 0.84, and the similarity between text C and text B is 0.81.
The larger of these two similarities, 0.84, is less than the similarity threshold, so a second text category T2 is created and text C is assigned to the second text category T2.
The longest common subsequences between text D and text A and text B in the first text category T1, and between text D and text C in the second text category T2, are determined according to formula (1).
According to formula (2), the similarity between text D and text A is 0.84, the similarity between text D and text B is 0.87, and the similarity between text D and text C is 0.86.
The similarity between text D and text B is the largest and is greater than the similarity threshold, so text D is assigned to the first text category T1.
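The worked example can be reproduced with a short streaming-clustering sketch. It is an illustration under stated assumptions: the function name `stream_cluster` is hypothetical, texts are processed in the given order rather than starting from a random one, and the similarity function is passed in, so the lookup table below simply replays the similarity values quoted in the example instead of computing them from real texts.

```python
def stream_cluster(texts, similarity, threshold):
    """Assign each text to the category holding its most similar member,
    or open a new category when the best similarity is below the threshold."""
    categories = []
    for text in texts:
        best_sim, best_category = -1.0, None
        for category in categories:
            for member in category:
                sim = similarity(text, member)
                if sim > best_sim:
                    best_sim, best_category = sim, category
        if best_category is not None and best_sim >= threshold:
            best_category.append(text)
        else:
            categories.append([text])
    return categories

# Similarity values quoted in the worked example, keyed as (current text, member).
SIMILARITIES = {
    ("B", "A"): 0.87,
    ("C", "A"): 0.84, ("C", "B"): 0.81,
    ("D", "A"): 0.84, ("D", "B"): 0.87, ("D", "C"): 0.86,
}

def table_similarity(current, member):
    return SIMILARITIES[(current, member)]

categories = stream_cluster(["A", "B", "C", "D"], table_similarity, threshold=0.85)
# categories == [["A", "B", "D"], ["C"]]
```

Note that text D is compared with the members of every existing category, as in the example, and joins T1 because its best match (text B, 0.87) is there, even though its similarity to text C (0.86) also exceeds the threshold.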
The inventor found in the course of realizing the present invention that most advertisement texts not only appear in highly similar forms but are also published many times.
Therefore, for step S103, in an optional implementation, the number of texts in each text category can be determined; if the number of texts in the current text category is greater than a quantity threshold, the texts in the current text category are determined to be advertisement texts. The quantity threshold can be determined by those skilled in the art according to actual requirements or experience. As a specific example, the quantity threshold can be 10.
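The quantity-threshold rule for step S103 can be sketched as follows, assuming the text categories are already available as lists of texts. The function name `flag_by_category_size` is hypothetical, and the default of 10 is the example value given above.

```python
def flag_by_category_size(categories, quantity_threshold=10):
    """Return the texts of every category whose size exceeds the threshold."""
    advertisement_texts = []
    for category in categories:
        if len(category) > quantity_threshold:  # strictly greater, as described
            advertisement_texts.extend(category)
    return advertisement_texts

# Hypothetical categories: one with 11 near-duplicate texts, two smaller ones.
categories = [
    [f"promo #{k}" for k in range(11)],
    [f"chat #{k}" for k in range(10)],
    ["single post"],
]
ads = flag_by_category_size(categories)
```

Only the first category is flagged here; a category with exactly 10 texts is not, because the rule requires the count to be greater than the threshold.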
The advertisement text recognition method of the embodiments of the present invention identifies advertisement text automatically with an unsupervised method, which overcomes the technical problems of low efficiency, high cost, and poor filtering effect caused by prior-art filtering methods that require manual participation, thereby achieving the technical effects of identifying advertisement text automatically and quickly, improving filtering efficiency, and reducing cost.
Fig. 3 is a schematic diagram of the main flow of the advertisement text recognition method of another embodiment of the present invention. As shown in Fig. 3, the method includes:
Step S301: obtaining text to be identified;
Step S302: obtaining the text publisher of each text, and clustering the texts that have the same text publisher to form at least one text set, each text set corresponding to one text publisher;
Step S303: for the text set corresponding to each text publisher: determining, according to the similarity between the texts in the text set, the text repetition degree of the text publisher corresponding to the text set; and, if the text repetition degree is greater than a repetition threshold, determining that the texts in the text set are advertisement texts.
Step S301 in this embodiment is the same as step S101 in the embodiment shown in Fig. 1 and is not described again here.
For step S302, the microblogs of the same microblog publisher can be clustered according to the publisher's ID to form one text set.
For step S303, in an optional embodiment, the microblogs in each microblog publisher's text set can be clustered according to the method shown in Fig. 2 to form at least one microblog category. The number of microblog categories of each microblog publisher is determined, and the text repetition degree of each microblog publisher is determined from the total number of microblogs the publisher has issued and the corresponding number of microblog categories, that is, text repetition degree = (total number of microblogs) / (number of microblog categories). A microblog publisher whose text repetition degree is greater than the repetition threshold is regarded as an advertisement publisher, and the microblogs in the text set corresponding to that publisher are advertisement texts. The repetition threshold can be determined according to actual requirements or experience; the present invention places no limitation on this. As a specific example, the repetition threshold can be 2.
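Steps S302 and S303 can be sketched as follows. The record layout (`publisher_id`, `text`) and all function names are assumptions for illustration, and the per-publisher clustering is passed in as a function so that any implementation of the Fig. 2 method can be plugged in. The stand-in used in the usage lines simply groups exactly identical texts.

```python
from collections import defaultdict

def group_by_publisher(posts):
    """Step S302: one text set per publisher ID (hypothetical record layout)."""
    text_sets = defaultdict(list)
    for post in posts:
        text_sets[post["publisher_id"]].append(post["text"])
    return dict(text_sets)

def repetition_degree(texts, cluster):
    """Text repetition degree = total number of texts / number of categories."""
    categories = cluster(texts)
    return len(texts) / len(categories)

def find_ad_publishers(posts, cluster, repetition_threshold=2.0):
    """Step S303: publishers whose repetition degree exceeds the threshold."""
    return [
        publisher
        for publisher, texts in group_by_publisher(posts).items()
        if repetition_degree(texts, cluster) > repetition_threshold
    ]

def cluster_identical(texts):
    """Stand-in for the Fig. 2 clustering: groups exactly identical texts."""
    groups = defaultdict(list)
    for text in texts:
        groups[text].append(text)
    return list(groups.values())

posts = (
    [{"publisher_id": "u1", "text": "buy now"} for _ in range(5)]
    + [{"publisher_id": "u1", "text": "hello"}]
    + [{"publisher_id": "u2", "text": t} for t in ("lunch", "rain", "movie")]
)
ad_publishers = find_ad_publishers(posts, cluster_identical)
```

Publisher `u1` has 6 texts in 2 categories (repetition degree 3) and is flagged with the example threshold of 2; publisher `u2` has 3 texts in 3 categories (repetition degree 1) and is not.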
In the above embodiment, the obtained microblogs serve as the text to be identified, so the text category described earlier can be called a microblog category and the text repetition degree can be called the microblog repetition degree.
The advertisement text recognition method of this embodiment identifies advertisement text by using a social characteristic of advertisement text publishers (the text repetition degree). It is simple and convenient, has high accuracy, requires no manual participation, and can identify advertisement text in massive volumes of text quickly and accurately. In a network environment, the users of social media can be divided into ordinary users and advertisement publishers. Compared with ordinary users, advertisement publishers have some distinctive social characteristics, such as a large following count, a small follower count, and a high microblog repetition degree.
In an optional embodiment, other social characteristics of advertisement text publishers (such as a large following count and a small follower count) can also be used to identify advertisement text.
Fig. 4 is a schematic diagram of the main flow of the advertisement text recognition method of yet another embodiment of the present invention. As shown in Fig. 4, the method includes:
Step S401: obtaining text to be identified;
Step S402: obtaining the text publisher of each text, and clustering the texts that have the same text publisher to form at least one text set, each text set corresponding to one text publisher;
Step S403: for the text set corresponding to each text publisher: obtaining the following count and follower count of the text publisher corresponding to the text set, and determining the following ratio of that text publisher based on the following count and the follower count; and, if the following ratio is greater than a ratio threshold, determining that the texts in the text set are advertisement texts.
Steps S401-S402 are identical to steps S301-S302 in Fig. 3 and are not described again here.
For step S403, in an optional implementation, the following ratio of a microblog publisher = (following count) / (following count + follower count). The ratio threshold can be set according to actual requirements or experience, and the present invention places no limitation on it. As a specific example, the ratio threshold can be 0.9.
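Steps S401-S403 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names, the `(publisher, text)` input layout and the `profiles` mapping are assumptions made for the example, and the zero-division guard is an added safety measure. Only the ratio formula and the example threshold of 0.9 come from the text above.

```python
from collections import defaultdict


def following_ratio(following, followers):
    """Following ratio = following / (following + followers)."""
    total = following + followers
    # Guard against accounts with no social links at all (an added assumption).
    return following / total if total else 0.0


def flag_ads_by_following_ratio(posts, profiles, ratio_threshold=0.9):
    """Return the texts whose publisher's following ratio exceeds the threshold.

    posts:    iterable of (publisher, text) pairs (hypothetical layout)
    profiles: dict mapping publisher -> (following count, follower count)
    """
    # S402: group texts by publisher, one text set per publisher.
    text_sets = defaultdict(list)
    for publisher, text in posts:
        text_sets[publisher].append(text)

    # S403: flag every text in a set whose publisher exceeds the ratio threshold.
    ads = []
    for publisher, texts in text_sets.items():
        following, followers = profiles[publisher]
        if following_ratio(following, followers) > ratio_threshold:
            ads.extend(texts)
    return ads
```

Note that the comparison is strict, matching "greater than a ratio threshold" in step S403, so a publisher whose ratio equals the threshold exactly is not flagged.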
The advertisement text recognition method of this embodiment identifies advertisement text using social characteristics of advertisement publishers (a large following count and a small follower count). It is simple and convenient, requires no manual participation, and can quickly and accurately identify advertisement text in massive volumes of text.
In an optional embodiment, advertisement text can be identified by combining the methods provided by the embodiments shown in Fig. 3 and Fig. 4. For example, a text publisher whose text repetition degree is greater than the repetition threshold and whose following ratio is greater than the ratio threshold is treated as an advertisement publisher, and the texts in the text set corresponding to that publisher are advertisement texts.
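The combined rule can be expressed as a single predicate. The sketch below assumes that the repetition degree and following ratio have already been computed for a publisher; the function name and the default repetition threshold of 0.5 are placeholders chosen for illustration, since the text gives no value for that threshold.

```python
def is_ad_publisher(repetition_degree, following_ratio,
                    repetition_threshold=0.5, ratio_threshold=0.9):
    """Treat a publisher as an advertiser only if both signals exceed their thresholds.

    repetition_threshold=0.5 is a placeholder; the source text does not give a value.
    """
    return (repetition_degree > repetition_threshold
            and following_ratio > ratio_threshold)
```

Requiring both conditions makes the rule stricter than either test alone, which would be expected to trade some recall for fewer false positives.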
In an optional embodiment, advertisement text can be identified by combining the methods of the embodiments shown in Fig. 1 and Fig. 3, in Fig. 1 and Fig. 4, or in Fig. 1, Fig. 3 and Fig. 4.
The method of identifying advertisement text according to the embodiments of the present invention is simple and convenient, achieves high accuracy, requires no manual participation, and can quickly and accurately identify advertisement text in massive volumes of text. It can therefore help people filter out advertisement text effectively before analyzing social media text data, which greatly improves the quality of the social media text data used for analysis and mining and improves the efficiency and effect of subsequent analysis and mining.
Fig. 5 is a schematic diagram of the main modules of an advertisement text identification device 500 according to an embodiment of the present invention. As shown in Fig. 5, the device includes:
a text acquisition module 501, configured to obtain the texts to be identified;
a text clustering module 502, configured to cluster the texts to be identified according to the association relationship between texts, to form at least one text category;
an advertisement identification module 503, configured to identify advertisement text among the texts to be identified according to the at least one text category.
In an optional embodiment, the advertisement identification module 503 is further configured to: determine the number of texts in each text category; and, if the number of texts in a current text category is greater than a quantity threshold, determine that the texts in the current text category are advertisement texts.
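The quantity-threshold rule amounts to a filter over the clusters. The following is a small sketch under the assumption that each text category is represented as a list of strings; the function name and that representation are illustrative choices, not taken from the patent.

```python
def ads_from_categories(categories, quantity_threshold):
    """Collect the texts of every category that holds more texts than the threshold.

    categories: list of text categories, each a list of texts (assumed layout)
    """
    ads = []
    for texts in categories:
        # A category with many near-identical texts is treated as advertising.
        if len(texts) > quantity_threshold:
            ads.extend(texts)
    return ads
```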
In an optional embodiment, the association relationship between texts includes the similarity between texts.
The text clustering module 502 is further configured to: randomly select one text from the texts to be identified to form a first text category; determine the longest common subsequence between a current text and each text in the first text category; determine, according to the longest common subsequence, the similarity between the current text and each text in the first text category; when the maximum of these similarities is greater than or equal to a similarity threshold, assign the current text to the first text category; and when the maximum of these similarities is less than the similarity threshold, create a second text category and assign the current text to the second text category.
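The clustering procedure can be sketched as below. Two points are assumptions rather than statements from this passage: the similarity is taken to be the longest-common-subsequence length divided by the length of the longer text, and the procedure is generalized so that each new text is compared against all categories created so far, not only the first. All function names are hypothetical.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, 1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: LCS length over the longer text's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def cluster_texts(texts, similarity_threshold, seed=None):
    """Greedy single-pass clustering driven by LCS similarity."""
    remaining = list(texts)
    if not remaining:
        return []
    # Seed the first category with a randomly chosen text.
    rng = random.Random(seed)
    categories = [[remaining.pop(rng.randrange(len(remaining)))]]
    for current in remaining:
        # Find the category containing the most similar existing text.
        best_score, best_category = -1.0, None
        for category in categories:
            score = max(similarity(current, member) for member in category)
            if score > best_score:
                best_score, best_category = score, category
        if best_score >= similarity_threshold:
            best_category.append(current)
        else:
            categories.append([current])
    return categories
```

Each comparison costs time proportional to the product of the two text lengths, so this greedy pass suits short texts such as microblog posts; longer documents or very large corpora would likely need a cheaper pre-filter.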
In an optional embodiment, the text clustering module 502 is further configured to: obtain the text publisher of each text, and cluster the texts having the same text publisher to form at least one text set, each text set corresponding to one text publisher.
In an optional embodiment, the advertisement identification module 503 is further configured to, for the text set corresponding to each text publisher: determine the text repetition degree of the text publisher corresponding to the text set according to the similarity between the texts in the text set, and, if the text repetition degree is greater than a repetition threshold, determine that the texts in the text set are advertisement texts; and/or obtain the following count and follower count of the text publisher corresponding to the text set, determine the following ratio of that text publisher based on the following count and follower count, and, if the following ratio is greater than a ratio threshold, determine that the texts in the text set are advertisement texts.
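One possible way to compute a per-publisher repetition degree is sketched below. The definition used here (the share of a publisher's texts that have at least one near-duplicate in the same text set) is an assumption, since this passage does not define the measure. The sketch also substitutes Python's `difflib.SequenceMatcher.ratio()` for the longest-common-subsequence similarity to keep the example short, and the 0.8 similarity threshold is a placeholder.

```python
from difflib import SequenceMatcher
from itertools import combinations


def repetition_degree(texts, similarity_threshold=0.8):
    """Share of texts that have at least one near-duplicate in the same set.

    This definition and the default threshold are assumptions for illustration.
    """
    if len(texts) < 2:
        return 0.0
    repeated = set()
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        # SequenceMatcher.ratio() stands in for the LCS-based similarity.
        if SequenceMatcher(None, a, b).ratio() >= similarity_threshold:
            repeated.update((i, j))
    return len(repeated) / len(texts)
```

A publisher's text set would then be flagged when this value exceeds the repetition threshold, alone or together with the following-ratio test.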
Because the advertisement text identification device of this embodiment identifies advertisement text automatically with an unsupervised method, it overcomes the technical problems of low efficiency, high cost and poor filtering caused by the prior-art practice of filtering advertisement text with manual participation, and thereby achieves the technical effects of identifying advertisement text automatically and quickly, improving filtering efficiency and reducing cost.
The above advertisement text identification device can execute the method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present invention.
Fig. 6 shows an exemplary system architecture 600 to which the advertisement text recognition method or advertisement text identification device of the embodiments of the present invention can be applied.
As shown in Fig. 6, the system architecture 600 can include terminal devices 601, 602 and 603, a network 604 and a server 605. The network 604 is the medium that provides communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 can include various connection types, such as wired links, wireless communication links or fiber-optic cables.
A user can use the terminal devices 601, 602, 603 to interact with the server 605 through the network 604, for example to receive or send messages. Various communication client applications can be installed on the terminal devices 601, 602, 603, such as shopping applications, web browsers, search applications, instant messaging tools, mailbox clients and social platform software.
The terminal devices 601, 602, 603 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.
The server 605 can be a server providing various services, for example a back-end management server that supports a shopping website browsed by users with the terminal devices 601, 602, 603. The back-end management server can analyze and otherwise process received data such as product information query requests, and feed the processing result (for example, targeted push information or product information) back to the terminal device.
It should be noted that the advertisement text recognition method provided by the embodiments of the present invention is generally executed by the server 605; accordingly, the advertisement text identification device is generally arranged in the server 605.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 6 are merely illustrative. There can be any number of terminal devices, networks and servers, as the implementation requires.
Referring now to Fig. 7, it shows a schematic structural diagram of a computer system 700 suitable for implementing a terminal device of an embodiment of the present invention. The terminal device shown in Fig. 7 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores the various programs and data required for the operation of the system 700. The CPU 701, the ROM 702 and the RAM 703 are connected to one another by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse and the like; an output section 707 including a cathode ray tube (CRT) or liquid crystal display (LCD), a loudspeaker and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. When the computer program is executed by the central processing unit (CPU) 701, the above-described functions defined in the system of the present invention are performed.
It should be noted that the computer-readable medium described in the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example (but not limited to), an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium can be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus or device. Also in the present invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. The program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless links, electric wires, optical cables, RF and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions and operations of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram can represent a module, a program segment or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks can occur in an order different from that noted in the drawings. For example, two blocks shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention can be implemented in software or in hardware. The described modules can also be provided in a processor; for example, a processor can be described as including a sending module, an acquisition module, a determining module and a first processing module. The names of these modules do not, under certain circumstances, limit the modules themselves; for example, the sending module can also be described as "a module that sends a picture acquisition request to a connected server".
In another aspect, the present invention also provides a computer-readable medium, which can be included in the device described in the above embodiments, or can exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to:
obtain the texts to be identified;
cluster the texts to be identified according to the association relationship between texts, to form at least one text category;
identify advertisement text among the texts to be identified according to the at least one text category.
The technical solution of the embodiments of the present invention identifies advertisement text automatically with an unsupervised method. It overcomes the technical problems of low efficiency, high cost and poor filtering caused by the prior-art practice of filtering advertisement text with manual participation, and thereby achieves the technical effects of identifying advertisement text automatically and quickly, improving filtering efficiency and reducing cost.
The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations and substitutions can occur. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (12)

  1. An advertisement text recognition method, characterized by comprising:
    obtaining texts to be identified;
    clustering the texts to be identified according to an association relationship between texts, to form at least one text category;
    identifying advertisement text among the texts to be identified according to the at least one text category.
  2. The method according to claim 1, characterized in that identifying advertisement text among the texts to be identified according to the at least one text category comprises:
    determining the number of texts in each text category;
    if the number of texts in a current text category is greater than a quantity threshold, determining that the texts in the current text category are advertisement texts.
  3. The method according to claim 1 or 2, characterized in that
    the association relationship between texts includes the similarity between texts;
    clustering the texts to be identified according to the association relationship between texts, to form at least one text category, comprises:
    randomly selecting one text from the texts to be identified to form a first text category;
    determining the longest common subsequence between a current text and each text in the first text category;
    determining, according to the longest common subsequence, the similarity between the current text and each text in the first text category;
    when the maximum of the similarities is greater than or equal to a similarity threshold, assigning the current text to the first text category;
    when the maximum of the similarities is less than the similarity threshold, creating a second text category and assigning the current text to the second text category.
  4. The method according to claim 1, characterized in that clustering the texts to be identified according to the association relationship between texts, to form at least one text category, comprises:
    obtaining the text publisher of each text, and clustering the texts having the same text publisher to form at least one text set, each text set corresponding to one text publisher.
  5. The method according to claim 4, characterized in that
    identifying advertisement text among the texts to be identified according to the at least one text category comprises, for the text set corresponding to each text publisher:
    determining the text repetition degree of the text publisher corresponding to the text set according to the similarity between the texts in the text set; if the text repetition degree is greater than a repetition threshold, determining that the texts in the text set are advertisement texts; and/or
    obtaining the following count and follower count of the text publisher corresponding to the text set, and determining the following ratio of that text publisher based on the following count and follower count; if the following ratio is greater than a ratio threshold, determining that the texts in the text set are advertisement texts.
  6. An advertisement text identification device, characterized by comprising:
    a text acquisition module, configured to obtain texts to be identified;
    a text clustering module, configured to cluster the texts to be identified according to an association relationship between texts, to form at least one text category;
    an advertisement identification module, configured to identify advertisement text among the texts to be identified according to the at least one text category.
  7. The device according to claim 6, characterized in that the advertisement identification module is further configured to:
    determine the number of texts in each text category;
    if the number of texts in a current text category is greater than a quantity threshold, determine that the texts in the current text category are advertisement texts.
  8. The device according to claim 6 or 7, characterized in that the association relationship between texts includes the similarity between texts;
    the text clustering module is further configured to:
    randomly select one text from the texts to be identified to form a first text category;
    determine the longest common subsequence between a current text and each text in the first text category;
    determine, according to the longest common subsequence, the similarity between the current text and each text in the first text category;
    when the maximum of the similarities is greater than or equal to a similarity threshold, assign the current text to the first text category;
    when the maximum of the similarities is less than the similarity threshold, create a second text category and assign the current text to the second text category.
  9. The device according to claim 6, characterized in that the text clustering module is further configured to:
    obtain the text publisher of each text, and cluster the texts having the same text publisher to form at least one text set, each text set corresponding to one text publisher.
  10. The device according to claim 9, characterized in that the advertisement identification module is further configured to:
    for the text set corresponding to each text publisher:
    determine the text repetition degree of the text publisher corresponding to the text set according to the similarity between the texts in the text set; if the text repetition degree is greater than a repetition threshold, determine that the texts in the text set are advertisement texts; and/or
    obtain the following count and follower count of the text publisher corresponding to the text set, and determine the following ratio of that text publisher based on the following count and follower count; if the following ratio is greater than a ratio threshold, determine that the texts in the text set are advertisement texts.
  11. An electronic device, characterized by comprising:
    one or more processors;
    a storage device, configured to store one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-5.
  12. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-5.
CN201710966609.8A 2017-10-17 2017-10-17 Advertisement text recognition methods and device Pending CN107729489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710966609.8A CN107729489A (en) 2017-10-17 2017-10-17 Advertisement text recognition methods and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710966609.8A CN107729489A (en) 2017-10-17 2017-10-17 Advertisement text recognition methods and device

Publications (1)

Publication Number Publication Date
CN107729489A true CN107729489A (en) 2018-02-23

Family

ID=61211610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710966609.8A Pending CN107729489A (en) 2017-10-17 2017-10-17 Advertisement text recognition methods and device

Country Status (1)

Country Link
CN (1) CN107729489A (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030002060A1 (en) * 2000-12-28 2003-01-02 Kazuyuki Yokoyama Apparatus for generating two color printing data, a method for generating two color printing data and recording media
US20070203903A1 (en) * 2006-02-28 2007-08-30 Ilial, Inc. Methods and apparatus for visualizing, managing, monetizing, and personalizing knowledge search results on a user interface
CN101097570A (en) * 2006-06-29 2008-01-02 上海唯客网广告传播有限公司 Advertisement classification method capable of automatic recognizing classified advertisement type
CN101067858A (en) * 2006-09-28 2007-11-07 腾讯科技(深圳)有限公司 Network advertisment realizing method and device
CN101071433A (en) * 2007-05-10 2007-11-14 腾讯科技(深圳)有限公司 Picture download system and method
CN101102316A (en) * 2007-06-22 2008-01-09 腾讯科技(深圳)有限公司 A method and system for removing duplicate webpages
CN101071443A (en) * 2007-06-26 2007-11-14 腾讯科技(深圳)有限公司 Content-related advertising identifying method and content-related advertising server
CN101159834A (en) * 2007-10-25 2008-04-09 中国科学院计算技术研究所 Method and system for detecting repeatable video and audio program fragment
CN101226533A (en) * 2007-12-28 2008-07-23 腾讯科技(北京)有限公司 Method and system for arranging web page again
CN101650740A (en) * 2009-08-27 2010-02-17 中国科学技术大学 Method and device for detecting television advertisements
CN101814083A (en) * 2010-01-08 2010-08-25 上海复歌信息科技有限公司 Automatic webpage classification method and system
US20120074680A1 (en) * 2010-09-23 2012-03-29 Theodosios Kountotsis Removable or peelable articles, advertisements, and illustrations from newspapers, magazines and publications
CN102419777A (en) * 2012-01-10 2012-04-18 凤凰在线(北京)信息技术有限公司 System and method for filtering internet image advertisements
CN102663065A (en) * 2012-03-30 2012-09-12 浙江盘石信息技术有限公司 Method for identifying and screening abnormal data of advertising positions
CN102945244A (en) * 2012-09-24 2013-02-27 南京大学 Chinese web page repeated document detection and filtration method based on full stop characteristic word string
CN104462301A (en) * 2014-11-28 2015-03-25 北京奇虎科技有限公司 Network data processing method and device
CN104636487A (en) * 2015-02-26 2015-05-20 湖北光谷天下传媒股份有限公司 Advertising information management method
CN105787133A (en) * 2016-03-31 2016-07-20 北京小米移动软件有限公司 Method and device for filtering advertisement information
CN106844430A (en) * 2016-12-12 2017-06-13 天格科技(杭州)有限公司 A kind of improved real-time social platform advertisement and sensitive information quickly know method for distinguishing
CN107067045A (en) * 2017-05-31 2017-08-18 北京京东尚科信息技术有限公司 Data clustering method, device, computer-readable medium and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827044A (en) * 2018-08-07 2020-02-21 北京京东尚科信息技术有限公司 Method and device for extracting user interest mode
CN110362680A (en) * 2019-06-14 2019-10-22 西安交通大学 A kind of soft wide detection and advertisement abstracting method based on figure Crosslinking Structural
CN110362680B (en) * 2019-06-14 2021-07-13 西安交通大学 Soft-wide detection and advertisement extraction method based on graph network structure analysis
CN110348539A (en) * 2019-07-19 2019-10-18 知者信息技术服务成都有限公司 Short text correlation method of discrimination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180223