CN102184256A - Clustering method and system aiming at massive similar short texts

Publication number: CN102184256A
Authority: CN (China)
Prior art keywords: text, short, short text, trunk, texts
Prior art date: 2011-06-02
Legal status: Pending
Application number: CN2011101473403A
Other languages: Chinese (zh)
Inventors: 白俊良 (Bai Junliang), 陈光 (Chen Guang)
Current assignee: Beijing University of Posts and Telecommunications
Original assignee: Beijing University of Posts and Telecommunications
Priority date: 2011-06-02
Filing date: 2011-06-02
Publication date: 2011-09-14
Application filed by: Beijing University of Posts and Telecommunications

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a clustering method and system for massive similar short texts, and belongs to the field of duplicate short-text detection in information technology. Because of the inherent characteristics of short texts, applying traditional duplicate-text analysis methods to them gives unsatisfactory results. By adopting a duplicate-analysis method based on the main content (trunk) of the short text, combined with related word groups, the invention can detect not only exact-duplicate texts but also texts with extremely high similarity. The method and system have high processing speed and efficiency and handle massive data well. The method removes redundant short texts, greatly reduces the processing scale of the system, and can to a certain extent discover hot short texts; the method and system therefore help to discover social hotspots.

Description

A clustering method and system for massive similar short texts
One. Technical Field
Information technology.
Two. Background Art
Against the background of informatization as a worldwide trend of development, the Internet is characterized by extremely broad application, the largest scale of development, and close relevance to people's daily lives. On the one hand, the Internet has created enormous economic and social benefits and allows people to receive instant, up-to-date information. At the same time, as the network becomes ubiquitous, the amount of online information keeps growing; this not only poses a severe challenge to the acquisition, storage, and real-time analysis of such massive information by computers, but also makes it harder for people to search for information accurately and reliably. On the other hand, the Internet has also brought negative effects: harmful content such as pornography and reactionary material spreads widely on the network, spam runs rampant, copyrighted films, music, and software are distributed in violation of intellectual property rights, users are defrauded through the network, and Internet-related violence occurs. Therefore, in the process of building an information society, improving the level of information content security and the ability to detect the various kinds of harmful information on the Internet is an important link in network information technology and a solid foundation for the smooth construction of the information society.
With the convergence of telecommunication, broadcasting, and Internet networks ("triple play"), text forms on the next-generation Internet are becoming diversified: ordinary web pages account for a smaller and smaller proportion, while content such as microblogs, WAP pages, comments, and short messages (SMS) gradually increases in share. As with ordinary web pages, this class of text also contains a large amount of identical or highly similar content. For example:
[1] Beijing certificate services: diplomas, ID cards, seal engraving. QQ731787311
[2] Beijing cer,tificates: dip.loma, ID. card, seal en,graving. QQ7317@87@311
[3] I send a blessing message so the bachelor may laugh to his heart's content. No matter whether the festival is big or small, may it be happy and lively. May all troubles be blown away by the wind, and may everything go as you wish, free of worry!
[4] <Blessing message> I send so the <bachelor> may laugh to his heart's content. No matter whether the <festival> is big or small, may it be <happy> and lively. May <all troubles> be blown away by the wind, and may everything go as you wish, free of worry!
[5] Timely snow drifts, the winter plum blossoms proudly, and the ox lows to herald an early spring. Gongs and drums sound, firecrackers crackle, and laughter fills the land. May friendship be firm, and may good fortune arrive today. Good health, wealth, and great ox-year luck to you! --- Sincerely yours, Zhang San
[6] Timely snow drifts, the winter plum blossoms proudly, and the ox lows to herald an early spring. Gongs and drums sound, firecrackers crackle, and laughter fills the land. May friendship be firm, and may good fortune arrive today. Good health, wealth, and great ox-year luck to you! --- Sincerely yours, Li Si
Comparing example 1 and example 2 shows that punctuation marks and special symbols have been inserted into the message at improper positions; this is done by illegal advertisers sending advertisement SMS in order to evade the operator's advertisement filter. Comparing example 3 and example 4 shows that the forwarder has bracketed the keywords to be emphasized. Comparing example 5 and example 6 shows that the body of the message is identical, while different forwarders have each appended their own name at the end. Although such messages have been changed to some extent, their main body remains the same.
Another class consists of messages that mobile-phone users compose themselves about the same or similar topics, such as festival blessing messages or messages exchanged about some public event. These are original messages; although the wording differs, they concern the same topic and therefore show great similarity.
Three. Summary of the Invention
1. Technical problem to be solved by the invention (object of the invention)
Redundancy is especially severe in short-text corpora. In SMS corpora, it comes mainly from mass-sent spam, from the mass sending and forwarding of joke and blessing messages, and from the everyday stock phrases that appear in large numbers in junk messages. In BBS or news-comment corpora, it comes mainly from the massive reposting of and replies to hot posts. In instant messages, humorous messages, blessing messages, and everyday expressions occur so frequently that they cause a large amount of redundancy. Microsoft once analyzed an Internet corpus consisting of 150 million web pages and found that 6% of the pages were exact duplicates; the proportion of exact duplicates among short texts is far higher than that of the Internet corpus. In addition to short texts with completely identical content, short-text corpora contain an even larger number of short texts whose content is approximately identical: they obviously describe the same event in an almost identical way, differing only in punctuation or in a few characters added at the beginning or end of the message. Microsoft's statistics put the proportion of near-duplicates in the Internet corpus at 29.2%, and the proportion of near-duplicates in short-text corpora is much higher still. The existence of exact-duplicate and near-duplicate short texts wastes disk space; detecting and removing redundant short texts can greatly reduce the processing scale of the system, and can also, to a certain extent, reveal hot short texts and thus assist in discovering social hotspots.
Most traditional duplicate-text detection algorithms are designed to determine whether two texts are exact duplicates; they cannot solve the duplicate-detection problem of the similar short texts described above.
Traditional duplicate-text analysis methods are not suitable for the duplicate analysis of short texts. Traditional text-relevance analysis mainly adopts the vector space model or a probability model. In the vector space model, the words in a text are used as features to represent the text, and the similarity between feature vectors measures the relevance of texts. However, texts such as SMS messages and microblogs are too short, which makes the feature vectors too sparse; the computed similarity cannot meet the requirements of similarity analysis, and the results are in particular unacceptable at the semantic level. The probability model suffers from a similar problem: for texts as short as SMS messages, most features end up as smoothed probability estimates and do not reflect the information in the real data. The results are therefore unsatisfactory and cannot solve the duplicate-detection problem of similar short texts. This invention adopts a duplicate-analysis method based on the content trunk of the text, combined with related word groups, and solves this problem well.
2. Complete technical solution provided by the invention (the inventive scheme)
2.1 Duplicate-analysis method based on the short-text content trunk
This algorithm removes highly similar texts according to the consistency of their content trunks. Whether in the probability model or in the vector space model, relevance analysis is based on the word frequencies in the text. At the same time, if two short texts (for example SMS messages or microblogs) are similar, a large number of identical or semantically close words will necessarily appear in both. Therefore we extract the content trunk of the text and use it for the relevance analysis of short-text samples. The scheme comprises the following steps:
1) Preprocessing
This step improves text quality and comprises the following sub-steps (a minimal sketch follows the list):
a) Text filtering (removing texts that are too short or carry no information)
b) Text pruning (removing prefixes, suffixes, and special symbols in the text that only interfere with analysis)
c) Text encoding conversion
d) Text content normalization (unifying traditional and simplified Chinese characters, upper and lower case letters, full-width and half-width characters, and the various numbering formats)
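By way of illustration only (the patent does not prescribe any particular implementation), a minimal Python sketch of this preprocessing step might look as follows; the length threshold, the affix pattern, and the source encoding are assumptions, and traditional-to-simplified character conversion would additionally require a mapping table (e.g. an OpenCC-style dictionary) that is omitted here.

```python
import re
import unicodedata
from typing import Optional

MIN_LENGTH = 10                                  # hypothetical threshold for "too short"
NOISE = re.compile(r'^[\s>*#\-]+|[\s>*#\-]+$')   # hypothetical interfering prefixes/suffixes

def preprocess(raw: bytes, encoding: str = "gbk") -> Optional[str]:
    """Filter, prune, re-encode, and normalize one short text; return None if it is discarded."""
    text = raw.decode(encoding, errors="ignore")         # c) encoding conversion
    text = NOISE.sub("", text)                           # b) prune interfering affixes and symbols
    text = unicodedata.normalize("NFKC", text).lower()   # d) full-width -> half-width, case folding
    if len(text) < MIN_LENGTH:                           # a) drop texts that are too short
        return None
    return text
```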
2) Word segmentation
This step cuts the complete text content into words, each tagged with its part of speech.
3) Trunk extraction
This step keeps only the verbs, nouns, and numerals; words of all other parts of speech are discarded. Synonyms and near-synonyms with the same meaning are then replaced by a single representative word (semantic normalization).
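A minimal sketch of steps 2) and 3), assuming the jieba segmenter's part-of-speech tags and a hypothetical synonym table (the patent names neither a segmenter nor a concrete synonym dictionary):

```python
import jieba.posseg as pseg         # assumption: any POS-tagging segmenter could be used instead

SYNONYMS = {"手机": "电话"}          # hypothetical synonym table: near-synonyms -> one representative word
KEEP_FLAGS = ("v", "n", "m")        # verbs, nouns, numerals (tag prefixes in the jieba tag set)

def extract_trunk(text: str) -> list:
    """Segment the text, keep only verbs, nouns, and numerals, and normalize synonyms."""
    words = []
    for token in pseg.cut(text):                  # yields (word, part-of-speech flag) pairs
        if token.flag.startswith(KEEP_FLAGS):     # keep v*/n*/m* tags only
            words.append(SYNONYMS.get(token.word, token.word))
    return words                                  # the original word order is preserved
```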
4) Similarity computation
After the trunk has been extracted, we assume that the more words two texts have in common (with the word order unchanged), the more similar they are.
This step therefore inserts the text trunk into a HASH table and, according to the mapping relation, divides texts into two kinds: related and unrelated.
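Under the simplifying assumption that two texts are related exactly when their trunks coincide, the HASH-table lookup can be sketched as follows (extract_trunk is the sketch above; all names are illustrative):

```python
from collections import defaultdict

trunk_table = defaultdict(list)     # hash table: trunk string -> texts sharing that trunk

def classify(text: str) -> bool:
    """Return True if the text maps onto an already-seen trunk, i.e. it is a related (repeated) text."""
    trunk = "".join(extract_trunk(text))    # joining keeps the word order
    related = trunk in trunk_table
    trunk_table[trunk].append(text)
    return related
```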
5) Similar-text clustering
This step groups the related documents into one class, thereby forming a number of "related text" classes, and selects the keywords with the highest frequency (keyword repetition rate) to represent each class.
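Selecting the highest-frequency keyword of a class might, under the same assumptions, be sketched as:

```python
from collections import Counter

def label_cluster(texts) -> str:
    """Label one "related text" class with its most frequent trunk word."""
    counts = Counter()
    for text in texts:
        counts.update(extract_trunk(text))   # trunk words of every member of the class
    word, _ = counts.most_common(1)[0]
    return word
```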
Four. Description of the Drawings
Fig. 1: Flowchart of the strongly-related duplicate-text detection algorithm
Fig. 2: Architecture diagram of the distributed processing scheme
Fig. 3: Time-sequence synchronization scheme for short-text data
Fig. 4: Server-side deployment diagram
Fig. 5: Flowchart of text processing at each processing node
Five. Embodiments
In order to process massive network data, the above scheme must be deployed in a distributed manner. Each distributed processing node obtains data from the short-text data source and extracts the short-text trunk; it then communicates with the HASH database server and looks up the trunk in the HASH database to determine whether the short text is a repeat. If it is, the node updates the count of that short-text class in its local TokyoCabinet HASH table and passes the result on to subsequent processing. To improve processing speed, each processing node additionally uses two buffer structures, BUFFER_DEQUE and DB_DEQUE, as a two-level cache of the duplicate-text class information held on the HASH server.
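A sketch of the per-node processing loop under these assumptions (the HASH-server client, the downstream queue, and the plain dictionary standing in for the local TokyoCabinet table are all hypothetical):

```python
local_counts = {}    # stand-in for the node's local TokyoCabinet HASH table

def process(text, hash_server, downstream):
    """Handle one incoming short text at a processing node."""
    trunk = "".join(extract_trunk(text))
    if hash_server.contains(trunk):                            # hypothetical client call
        local_counts[trunk] = local_counts.get(trunk, 0) + 1   # update the class count locally
        downstream.put((trunk, text))                          # hand the result to later stages
    else:
        hash_server.insert(trunk)                              # hypothetical client call
```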
1. Notes on this framework
1) Why a cache is placed at each processing node
To guarantee high read performance of the HASH server, it is essential to keep the amount of data in the HASH database within bounds (below the level of hundreds of millions of records), so a cache is set up at each processing node.
On the other hand, every deletion of a record locks the database file, and other requests must wait; a "centralized deletion strategy" or "batch deletion strategy" therefore cannot be adopted. Instead, each processing node is responsible for deleting from the HASH server database the records it has handled itself, which spreads the deletion operations out and avoids long waits (delays in answering database operations). In addition, short texts are kept in time order in the cache, so when "outdated" short-text records are deleted, the short-text class to be deleted can be found with O(1) time complexity.
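The O(1) deletion of outdated records relies only on the cache being ordered by time; a sketch (ignoring, for the moment, the distinction between repeated and non-repeated classes introduced below):

```python
from collections import deque
import time

buffer_deque = deque()   # (first_seen_timestamp, trunk) pairs in arrival order, oldest at the head

def delete_expired(short_cycle, hash_server):
    """Remove outdated classes; the oldest class is always at the head, so each removal is O(1)."""
    now = time.time()
    while buffer_deque and now - buffer_deque[0][0] > short_cycle:
        _, trunk = buffer_deque.popleft()
        hash_server.delete(trunk)    # each node deletes only the records it handled itself
```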
2) Why a two-level cache is used
The final application is usually concerned with time-sensitive events: a short-text class in which no repetition is found within a short time window (called the "short cycle") is regarded as a short text of no interest. Such short texts account for the overwhelming majority, for example the messages we send in daily life.
Even a short-text class in which repetition is found within the short cycle becomes "outdated" after a further period of time (called the "long cycle") and is then likewise regarded as of no interest; it would be meaningless, for example, to keep discussing the financial crisis now.
To keep the number of records stored in the HASH database as small as possible, short-text class records are treated differently according to the above reasoning. The buffer structure Buffer_Deque stores all short-text records within the "short cycle", both repeated and non-repeated. The buffer structure DB_Deque stores the repeated short texts within the "long cycle".
While processing the short-text stream, short-text records that exceed the short cycle without any repetition having been found are deleted in time from the HASH server and from Buffer_Deque; records that exceed the short cycle but in which repetition has been found are moved into DB_Deque; and short-text class records that exceed the long cycle are deleted in time from DB_Deque and from the HASH database.
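The movement rules between the two cache levels can be sketched as follows, reusing buffer_deque, local_counts, and the hypothetical hash_server client from the earlier sketches:

```python
db_deque = deque()   # (first_seen_timestamp, trunk) of repeated classes within the long cycle

def age_out(now, short_cycle, long_cycle, hash_server):
    """Apply the two-level cache rules as short-text classes age."""
    # Classes leaving the short cycle: drop if never repeated, otherwise move into DB_Deque.
    while buffer_deque and now - buffer_deque[0][0] > short_cycle:
        ts, trunk = buffer_deque.popleft()
        if local_counts.get(trunk, 0) > 0:     # repetition was found within the short cycle
            db_deque.append((ts, trunk))
        else:                                  # never repeated: of no interest, delete everywhere
            hash_server.delete(trunk)
    # Classes leaving the long cycle are deleted from DB_Deque and from the HASH database.
    while db_deque and now - db_deque[0][0] > long_cycle:
        _, trunk = db_deque.popleft()
        hash_server.delete(trunk)
```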
3) Data synchronization between the two caches and the HASH database
All short-text records within the "short cycle" are kept synchronized between the buffer structure Buffer_Deque and the HASH database; the repeated short-text records within the "long cycle" are kept synchronized between the buffer structure DB_Deque and the HASH database.
4) Why the TokyoCabinet HASH table on each processing node is used for short-text counting
Recording the count of each short-text class directly in the HASH database (centralized counting) looks simpler and avoids counting errors, but it has the following problems:
a) The count results must periodically be written to the analysis-results database (an Oracle database or the like). This requires locking the database table and the HASH database for a relatively long time, during which no processing node can access the HASH database to obtain the result of a similarity duplicate check, and the instantaneous load on the Oracle database is also high.
b) For every short-text class whose count exceeds 3, centralized counting adds an extra database write operation, which increases the load on the HASH server.
Distributed short-text counting avoids the above problems: it reduces the number of accesses to the HASH database, and writing the data to the database in a dispersed manner reduces the database's instantaneous load.
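Distributed counting with a periodic, dispersed flush to the analysis-results database might be sketched as follows (the analysis_db client call is hypothetical):

```python
def flush_counts(analysis_db):
    """Periodically push this node's local class counts to the analysis-results database."""
    for trunk, count in list(local_counts.items()):
        if count:
            analysis_db.add_count(trunk, count)   # hypothetical call into the results database
            local_counts[trunk] = 0               # keep counting locally between flushes
```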
2. Data storage involved in this framework
1) The HASH database is installed on the HASH database server and stores the trunks of repeated short texts.
2) TokyoCabinet is installed on each processing node and stores the count information of the short-text classes.
3) The buffer structure Buffer_Deque stores all short texts within the short cycle. Buffer_Deque consists of two hash structures, buffer_queue and buffer_index.
Two hash structures are used because buffer_queue is keyed by the short-text trunk, so whether a given short-text class already exists can be queried quickly, while buffer_index is keyed by the sending time of the short text, so the classes that have exceeded the "short cycle" can be identified quickly. The short-text classes in buffer_queue and buffer_index therefore have to be kept synchronized (see the sketch after this list).
4) DB_Deque stores all short-text classes found to be repeated within the long cycle.
The short-text classes in the DB_Deque queue are arranged in ascending time order, so deleting data by time threshold only requires reading from the head of the queue.
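A minimal sketch of the Buffer_Deque structure: a trunk-keyed dictionary paired with a time-ordered queue so that both existence queries and expiry are fast (the patent describes buffer_index as a hash keyed by sending time; a deque in ascending time order serves the same purpose here, and all names are illustrative):

```python
from collections import deque

class BufferDeque:
    """Short-cycle cache: trunk-keyed lookup plus time-ordered expiry (illustrative only)."""

    def __init__(self):
        self.buffer_queue = {}        # keyed by trunk: O(1) "does this class exist?" queries
        self.buffer_index = deque()   # (sending_time, trunk) in ascending time order

    def add(self, trunk, timestamp, record):
        self.buffer_queue[trunk] = record
        self.buffer_index.append((timestamp, trunk))   # the two structures stay synchronized

    def pop_expired(self, now, short_cycle):
        """Yield and remove every class older than the short cycle, reading only from the head."""
        while self.buffer_index and now - self.buffer_index[0][0] > short_cycle:
            _, trunk = self.buffer_index.popleft()
            record = self.buffer_queue.pop(trunk, None)
            if record is not None:
                yield trunk, record
```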
3. Structure of the HASH server
At the HASH server, requests are processed in the order in which they are received. The server consists of three main parts: a main thread, a global queue, and a group of worker threads. The main thread listens for connection requests on the network interface and places the received requests into the global queue; the worker threads then take requests from the head of the queue, query the HASH database, and return the query results to the client.
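A rough sketch of this main-thread / global-queue / worker-thread arrangement, using Python's standard threading and queue modules (the wire format, port, and hash_db query call are assumptions, not part of the patent):

```python
import queue
import socket
import threading

requests = queue.Queue()   # the global queue shared by the main thread and the workers

def main_thread(port=9000):
    """Listen on the network interface and enqueue incoming connections in arrival order."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("0.0.0.0", port))
    server.listen()
    while True:
        conn, _ = server.accept()
        requests.put(conn)

def worker(hash_db):
    """Take a request from the head of the queue, query the HASH database, and return the result."""
    while True:
        conn = requests.get()
        trunk = conn.recv(4096).decode("utf-8", errors="ignore")
        found = hash_db.contains(trunk)            # hypothetical database query
        conn.sendall(b"1" if found else b"0")
        conn.close()

# e.g. start one listener and a small worker pool:
# threading.Thread(target=main_thread, daemon=True).start()
# for _ in range(8):
#     threading.Thread(target=worker, args=(hash_db,), daemon=True).start()
```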

Claims (6)

1. A duplicate-detection method for massive similar short texts based on the content trunk, comprising: preprocessing the text; segmenting the complete text content into words tagged with their parts of speech; extracting the trunk of the text, in which only the verbs, nouns, and numerals in the text are kept, words of other parts of speech are discarded, and synonyms and near-synonyms with the same meaning are then replaced by a single representative word (semantic normalization); computing text similarity, in which, after the trunk has been extracted, it is assumed that the more words two texts have in common (with the word order unchanged) the more similar they are; and grouping related documents into one class, thereby forming a number of "related text" classes, and selecting the several keywords with the highest frequency (keyword repetition rate) to represent each class.
2. The content-trunk-based duplicate-detection method for massive similar short texts according to claim 1, characterized in that preprocessing the text comprises filtering and pruning the text, namely removing texts that are too short or carry no information as well as the prefixes, suffixes, and special symbols in the text that interfere with analysis.
3. The content-trunk-based duplicate-detection method for massive similar short texts according to claim 1, characterized in that preprocessing the text comprises converting the text encoding and normalizing the text content, namely unifying traditional and simplified Chinese characters, upper and lower case letters, full-width and half-width characters, and the various numbering formats.
4. The content-trunk-based duplicate-detection method for massive similar short texts according to claim 1, characterized in that, during the similarity computation, the text trunk is inserted into a HASH table and texts are divided into two kinds, related and unrelated, according to the mapping relation.
5. A distributed architecture for massive similar short texts comprising duplicate-detection and repetition-degree statistics functions, in which each distributed processing node obtains data from the short-text data source, extracts the short-text trunk, communicates with the HASH database server, looks up the trunk in the HASH database to determine whether the short text is a repeat, and, if it is, updates the count of that short-text class in the local TokyoCabinet and passes the result on to subsequent processing.
6. The distributed architecture for massive similar short texts comprising duplicate-detection and repetition-degree statistics functions according to claim 5, characterized in that, while each distributed processing node obtains data from the short-text data source and extracts the short-text trunk, it uses BUFFER_DEQUE and DB_DEQUE on the processing node as a two-level cache of the duplicate-text class information held on the HASH server.
CN2011101473403A, filed 2011-06-02 (priority date 2011-06-02) — Clustering method and system aiming at massive similar short texts — status: Pending — published as CN102184256A (en)

Priority Applications (1)

CN2011101473403A — priority date 2011-06-02, filing date 2011-06-02: Clustering method and system aiming at massive similar short texts


Publications (1)

CN102184256A (en) — published 2011-09-14

Family

ID=44570433



Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360372A (en) * 2011-10-09 2012-02-22 北京航空航天大学 Cross-language document similarity detection method
US9424253B2 (en) 2012-03-07 2016-08-23 International Business Machines Corporation Domain specific natural language normalization
CN103324604A (en) * 2012-03-07 2013-09-25 国际商业机器公司 Domain specific natural language normalization method and system
US9122673B2 (en) 2012-03-07 2015-09-01 International Business Machines Corporation Domain specific natural language normalization
CN103324604B (en) * 2012-03-07 2016-04-27 国际商业机器公司 For the standardized method and system of the specific natural language in territory
CN103049524A (en) * 2012-12-20 2013-04-17 中国科学技术信息研究所 Method for automatically clustering synonym search results according to lexical meanings
CN103049524B (en) * 2012-12-20 2016-01-06 中国科学技术信息研究所 Synonym result for retrieval presses meaning of a word automatic clustering method
CN103177125A (en) * 2013-04-17 2013-06-26 镇江诺尼基智能技术有限公司 Method for realizing fast-speed short text bi-cluster
CN103177125B (en) * 2013-04-17 2016-04-27 镇江诺尼基智能技术有限公司 One short text double focusing fast class methods
CN103729422A (en) * 2013-12-23 2014-04-16 武汉传神信息技术有限公司 Information fragment associative output method and system
CN103744883A (en) * 2013-12-23 2014-04-23 武汉传神信息技术有限公司 Method and system for rapidly selecting information fragments
CN103744884A (en) * 2013-12-23 2014-04-23 武汉传神信息技术有限公司 Method and system for collating information fragments
CN104317883B (en) * 2014-10-21 2017-11-21 北京国双科技有限公司 Network text processing method and processing device
CN104317883A (en) * 2014-10-21 2015-01-28 北京国双科技有限公司 Web text processing method and web text processing device
CN105843818A (en) * 2015-01-15 2016-08-10 富士通株式会社 Training device, training method, determining device, and recommendation device
CN106919549A (en) * 2015-12-24 2017-07-04 阿里巴巴集团控股有限公司 Method and device for business processing
CN106933901B (en) * 2015-12-31 2020-07-17 北京大学 Data integration method and system
CN106933901A (en) * 2015-12-31 2017-07-07 北京大学 data integrating method and system
CN106202057B (en) * 2016-08-30 2019-07-12 东软集团股份有限公司 The recognition methods of similar news information and device
CN106202057A (en) * 2016-08-30 2016-12-07 东软集团股份有限公司 The recognition methods of similar news information and device
CN106383814A (en) * 2016-09-13 2017-02-08 电子科技大学 Word segmentation method of English social media short text
CN106407019A (en) * 2016-11-23 2017-02-15 青岛海信移动通信技术股份有限公司 Database processing method of mobile terminal and mobile terminal thereof
CN106407020A (en) * 2016-11-23 2017-02-15 青岛海信移动通信技术股份有限公司 Database processing method of mobile terminal and mobile terminal thereof
CN106682082B (en) * 2016-11-23 2021-03-26 青岛海信移动通信技术股份有限公司 Writing method and device for database
CN107330127A (en) * 2017-07-21 2017-11-07 湘潭大学 A kind of Similar Text detection method retrieved based on textual image
CN107330127B (en) * 2017-07-21 2020-06-05 湘潭大学 Similar text detection method based on text picture retrieval
CN109472008A (en) * 2018-11-20 2019-03-15 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus and electronic equipment
CN112597284A (en) * 2021-03-08 2021-04-02 中邮消费金融有限公司 Company name matching method and device, computer equipment and storage medium
CN112597284B (en) * 2021-03-08 2021-06-15 中邮消费金融有限公司 Company name matching method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN102184256A (en) Clustering method and system aiming at massive similar short texts
CN106980692B (en) Influence calculation method based on microblog specific events
JP6007088B2 (en) Question answering program, server and method using a large amount of comment text
CN100478961C (en) New word of short-text discovering method and system
CN109241274A (en) text clustering method and device
CN101820398A (en) Instant messenger for dynamically managing messaging group and method thereof
WO2008014702A1 (en) Method and system of extracting new words
WO2007143914A1 (en) Method, device and inputting system for creating word frequency database based on web information
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
Man Feature extension for short text categorization using frequent term sets
CN103313248A (en) Method and device for identifying junk information
CN105404677B (en) A kind of search method based on tree structure
CN105608232A (en) Bug knowledge modeling method based on graphic database
CN105279159B (en) The reminding method and device of contact person
CN112905800A (en) Public character public opinion knowledge graph and XGboost multi-feature fusion emotion early warning method
Devika et al. A semantic graph-based keyword extraction model using ranking method on big social data
CN106502990A (en) A kind of microblogging Attribute selection method and improvement TF IDF method for normalizing
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN111782970B (en) Data analysis method and device
US9547701B2 (en) Method of discovering and exploring feature knowledge
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN105426490B (en) A kind of indexing means based on tree structure
CN111400617A (en) Social robot detection data set extension method and system based on active learning
Lim et al. ClaimFinder: A Framework for Identifying Claims in Microblogs.
JP6173958B2 (en) Program, apparatus and method for searching using a plurality of hash tables

Legal Events

C06 / PB01: Publication
DD01: Delivery of document by public notice (Addressee: Chen Guang; Document name: Notification of Publication of the Application for Invention)
C02 / WD01: Invention patent application deemed withdrawn after publication (patent law 2001); application publication date: 2011-09-14