CN108334628A - Method, apparatus, device and storage medium for news event clustering - Google Patents

Method, apparatus, device and storage medium for news event clustering

Info

Publication number
CN108334628A
Authority
CN
China
Prior art keywords
similarity
newsletter
text
participle
archives
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810155131.5A
Other languages
Chinese (zh)
Inventor
王云
刘丹
肖天鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Green Oriental Data Technology Co Ltd
Beijing Dong Run Huan Neng Science And Technology Co Ltd
Original Assignee
Beijing Green Oriental Data Technology Co Ltd
Beijing Dong Run Huan Neng Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Green Oriental Data Technology Co Ltd and Beijing Dong Run Huan Neng Science And Technology Co Ltd
Priority to CN201810155131.5A
Publication of CN108334628A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

An embodiment of the invention discloses a method, apparatus, device and storage medium for news event clustering. The method includes: crawling news texts from preset websites; performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens; comparing the token similarity of tokens of preset types among the text tokens of two news texts, and assigning a corresponding token similarity weight; comparing the text content similarity of the two news texts, and assigning a corresponding text content similarity weight; determining the similarity of the two news texts according to their token similarity, token similarity weight, text content similarity and text content similarity weight; and, when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events. The technical solution of the embodiment identifies identical news events and saves users time when browsing news.

Description

Method, apparatus, device and storage medium for news event clustering
Technical Field
Embodiments of the present invention relate to text processing technology, and in particular to a method, apparatus, device and storage medium for news event clustering.
Background
The environmental protection industry is developing rapidly, and industry news is widely reported on the internet. News events with similar content are reposted across different websites, so a keyword search for news returns leading results that are largely alike. Users sometimes have to page through many results before finding other news events; those events are buried and may never come to the user's attention.
At present, this can be handled manually: a person repeatedly searches keywords, inspects the results, judges from experience whether the news events are the same, marks identical events, and merges them into a single record, so that each identical news event is shown only once. Although this manual approach does identify identical news events, it is very costly, the number of records it can cover is limited, and when the staff are off duty its timeliness cannot meet production requirements.
Summary of the Invention
Embodiments of the present invention provide a method, apparatus, device and storage medium for news event clustering, so as to identify identical news events and save users time when browsing news.
In a first aspect, an embodiment of the present invention provides a method for news event clustering, including:
crawling news texts from preset websites;
performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;
comparing the token similarity of tokens of preset types among the text tokens of two news texts, and assigning a corresponding token similarity weight, wherein the tokens of preset types include time nouns, geographic names and named entities;
comparing the text content similarity of the two news texts according to those of their text tokens whose part of speech is noun or verb, and assigning a corresponding text content similarity weight;
determining the similarity of the two news texts according to the token similarity, the token similarity weight, the text content similarity and the text content similarity weight of the two news texts;
when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events.
In a second aspect, an embodiment of the present invention further provides an apparatus for news event clustering, including:
a news text crawling module, configured to crawl news texts from preset websites;
a text segmentation module, configured to perform word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;
a token similarity comparison module, configured to compare the token similarity of tokens of preset types among the text tokens of two news texts, and to assign a corresponding token similarity weight, wherein the tokens of preset types include time nouns, geographic names and named entities;
a content similarity comparison module, configured to compare the text content similarity of the two news texts according to those of their text tokens whose part of speech is noun or verb, and to assign a corresponding text content similarity weight;
a news text similarity determination module, configured to determine the similarity of the two news texts according to the token similarity, the token similarity weight, the text content similarity and the text content similarity weight of the two news texts;
a similar news event determination module, configured to determine that the two news texts are similar news events when the similarity of the two news texts exceeds a similarity threshold.
In a third aspect, an embodiment of the present invention further provides a device, the device including:
one or more processors;
a storage apparatus, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for news event clustering provided in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method for news event clustering provided in the first aspect.
By assigning different weights to the similarities of different elements of a news event, the embodiments of the present invention determine whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, and achieves the effect of identifying identical news events and saving users time when browsing news.
Brief Description of the Drawings
Fig. 1 is a flowchart of a method for news event clustering in Embodiment One of the present invention;
Fig. 2 is a flowchart of a method for news event clustering in Embodiment Two of the present invention;
Fig. 3 is a structural schematic diagram of an apparatus for news event clustering in Embodiment Three of the present invention;
Fig. 4 is a structural schematic diagram of a device in Embodiment Four of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment One
Fig. 1 is a flowchart of a method for news event clustering provided by Embodiment One of the present invention. This embodiment is applicable to clustering news events on websites related to a preset industry. The method may be performed by an apparatus for news event clustering, which may be implemented in software and/or hardware. The method specifically includes the following steps:
Step 110: crawl news texts from preset websites.
The preset websites are websites of the same industry, for example websites of the environmental protection industry. Such websites usually display news related to that industry, and a single website may display one or more related news items. By classifying the different data sources, a large number of preset websites can be crawled efficiently within a short time through simple configuration, which avoids complex development work and reduces development cost.
Optionally, crawling news texts from preset websites includes: crawling the news texts from the preset websites with a web crawler.
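As an illustration only, step 110 could be sketched with the Python standard library as follows. The patent does not say how pages are fetched or how the article body is located, so the choice of `<p>` elements as the news text and the names `ParagraphExtractor`, `extract_news_text` and `crawl` are assumptions made for this sketch.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements (assumed to hold the article)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def _flush(self):
        # Store the buffered paragraph text, if any, and reset the buffer.
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # also covers a previous <p> that was never closed
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_news_text(html):
    """Return the paragraph text of one HTML page, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


def crawl(urls):
    """Fetch each preset URL and return a dict mapping URL to extracted text."""
    texts = {}
    for url in urls:
        with urlopen(url, timeout=10) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
        texts[url] = extract_news_text(html)
    return texts
```

The list of preset websites would be passed to `crawl` as plain configuration, which is one way to read the "simple configuration" mentioned above.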
Step 120: perform word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens.
The crawled news texts are segmented, tagged with parts of speech and processed for named entities using Chinese natural language processing techniques. This processing splits the sentences of each news text into a number of text tokens, determines the part of speech of each token, and identifies the named entities among the tokens.
Optionally, after performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens, the method further includes: removing from the text tokens those that carry no meaning, for example auxiliary words, conjunctions and modal particles.
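The optional removal of meaningless tokens can be sketched as a filter over already tagged tokens. The patent names no segmenter or tag set; the sketch assumes each token arrives as a `(word, tag)` pair with ICTCLAS-style tags in which tags starting with `u`, `c` and `y` mark auxiliary words, conjunctions and modal particles. The sample sentence is tagged by hand for illustration.

```python
# Tag prefixes treated as carrying no meaning of their own.
# Assumption: ICTCLAS-style tags (u* = auxiliary, c* = conjunction, y* = modal particle).
FUNCTION_WORD_PREFIXES = ("u", "c", "y")


def drop_function_words(tagged_tokens):
    """Keep only the (word, tag) pairs whose tag is not a function-word tag."""
    return [
        (word, tag)
        for word, tag in tagged_tokens
        if not tag.startswith(FUNCTION_WORD_PREFIXES)
    ]


# Hand-tagged sample, standing in for the output of a real segmenter.
tagged = [
    ("环保部", "nt"), ("和", "c"), ("北京", "ns"), ("的", "u"),
    ("企业", "n"), ("昨天", "t"), ("签署", "v"), ("协议", "n"), ("了", "y"),
]
kept = drop_function_words(tagged)
```

Matching on the tag prefix rather than the full tag keeps the filter usable with tag sets that subdivide these classes.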
Step 130: compare the token similarity of tokens of preset types among the text tokens of two news texts, and assign a corresponding token similarity weight.
Optionally, the tokens of preset types include time nouns, geographic names and named entities. The elements of a news event include the time and place at which the event occurred, the persons involved, and the main content of the event. Comparing the similarity of time nouns, geographic names and named entities therefore gives an effective measure of the similarity between the different news events that were crawled.
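The patent does not state which measure is used for the token similarity. One plausible choice, shown here purely as an assumption, is the Jaccard index of the token sets of each preset type. The tag groups in `PRESET_TYPES` (`t` for time nouns, `ns` for geographic names, `nr` and `nt` for person and organization names) are likewise assumed ICTCLAS-style tags.

```python
# Assumed mapping from preset type to part-of-speech tags.
PRESET_TYPES = {
    "time": {"t"},
    "place": {"ns"},
    "entity": {"nr", "nt"},
}


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets score 0.0 (a design choice)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def tokens_of_type(tagged_tokens, tags):
    """Set of words whose tag belongs to the given tag group."""
    return {word for word, tag in tagged_tokens if tag in tags}


def token_similarities(doc_a, doc_b):
    """Similarity per preset type between two tagged news texts."""
    return {
        name: jaccard(tokens_of_type(doc_a, tags), tokens_of_type(doc_b, tags))
        for name, tags in PRESET_TYPES.items()
    }


doc_a = [("昨天", "t"), ("北京", "ns"), ("环保部", "nt"), ("发布", "v")]
doc_b = [("昨天", "t"), ("北京", "ns"), ("环保部", "nt"), ("生态环境部", "nt")]
similarities = token_similarities(doc_a, doc_b)
```

Here the two texts share their time and place tokens, while only one of the two distinct organization names appears in both, so the named entity similarity is one half.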
Step 140: compare the text content similarity of the two news texts according to those of their text tokens whose part of speech is noun or verb, and assign a corresponding text content similarity weight.
Nouns and verbs matter most for understanding the meaning of the content. The similarity of the news event content formed by the verb and noun tokens is compared and assigned a weight. Existing methods in the industry give the content, the acting subject, the time and the place of a news event identical consideration and do not distinguish between them. In fact, these four elements contribute differently to semantic understanding. Assigning the similarities of time nouns, geographic names and named entities their own weights according to these different contributions can therefore improve the accuracy with which similar events are clustered.
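The content comparison of step 140 is likewise not tied to a particular measure in the patent. A common option, used here as an assumption, is cosine similarity between bag-of-words count vectors built only from noun and verb tokens (tags `n` and `v` in the assumed tag set).

```python
import math
from collections import Counter

# Assumed tags for plain nouns and verbs.
CONTENT_TAGS = ("n", "v")


def content_vector(tagged_tokens):
    """Count the noun and verb tokens of one news text."""
    return Counter(word for word, tag in tagged_tokens if tag in CONTENT_TAGS)


def cosine(vec_a, vec_b):
    """Cosine similarity of two count vectors; 0.0 if either is empty."""
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


text_a = [("企业", "n"), ("签署", "v"), ("协议", "n"), ("北京", "ns")]
text_b = [("企业", "n"), ("签署", "v"), ("合同", "n")]
content_similarity = cosine(content_vector(text_a), content_vector(text_b))
```

The geographic name in `text_a` is ignored here because it is handled by the token similarity of step 130, which keeps the two kinds of evidence separate before they are weighted.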
Step 150: determine the similarity of the two news texts according to the token similarity, the token similarity weight, the text content similarity and the text content similarity weight of the two news texts.
The token similarities are, for example, the similarities of time nouns, geographic names and named entities. The three corresponding token similarity weights and the text content similarity weight sum to 1. Each token similarity determined in the steps above is multiplied by its token similarity weight, the text content similarity is multiplied by the text content similarity weight, and the products are summed to give the similarity of the two news texts. For example, if the time noun similarity of two news texts is 1 (weight 0.3), the geographic name similarity is 1 (weight 0.2), the named entity similarity is 0.5 (weight 0.2) and the text content similarity is 0.8 (weight 0.3), the similarity of the two news texts is calculated as 0.84 (1*0.3 + 1*0.2 + 0.5*0.2 + 0.8*0.3 = 0.84).
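The weighted sum of step 150 can be reproduced in a few lines. The scores and weights below are the ones from the worked example; the function name `combined_similarity` and the dictionary keys are illustrative only.

```python
def combined_similarity(scores, weights):
    """Weighted sum of per-element similarities; the weights must sum to 1."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same elements")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weights[name] for name in scores)


scores = {"time": 1.0, "place": 1.0, "entity": 0.5, "content": 0.8}
weights = {"time": 0.3, "place": 0.2, "entity": 0.2, "content": 0.3}
similarity = combined_similarity(scores, weights)  # about 0.84
```

Checking that the weights sum to 1 keeps the result in the same 0 to 1 range as the individual similarities, so it can be compared directly with a threshold.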
Step 160: when the similarity of the two news texts exceeds the similarity threshold, determine that the two news texts are similar news events.
The similarity threshold may be preset and adjusted later, but once set it stays constant while the crawled news texts are being checked for similarity. For example, the similarity threshold is 0.6.
Optionally, after determining that the two news texts are similar news events when their similarity exceeds the similarity threshold, the method further includes: gathering the news texts that belong to similar news events and assigning them to the same news event. The number of crawled news texts is usually large. When comparing news texts, one news text can be chosen as a fixed reference and the other news texts compared against it; then, by the transitivity of similarity (for example, if A is similar to B and A is similar to C, then A, B and C are similar), the similar news texts are gathered and merged into the same news event.
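One way to realize the grouping by transitivity is a union-find (disjoint-set) structure over all pairs. This is a substitute technique chosen for the sketch, not the fixed-reference comparison described above; the function name, the pairwise scores and the use of 0.6 as the default threshold (taken from the example value) are illustrative.

```python
def cluster_by_similarity(n_items, pair_similarity, threshold=0.6):
    """Group item indices so that any pair scoring above the threshold,
    directly or through a chain of such pairs, ends up in the same group."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_items):
        for j in range(i + 1, n_items):
            if pair_similarity(i, j) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up pairwise similarities for four news texts.
sims = {(0, 1): 0.84, (0, 2): 0.70, (1, 2): 0.40,
        (0, 3): 0.10, (1, 3): 0.20, (2, 3): 0.30}
clusters = cluster_by_similarity(4, lambda i, j: sims[(i, j)])
```

Texts 1 and 2 score only 0.40 against each other, yet both are similar to text 0, so all three fall into one news event, mirroring the A, B, C example.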
The technical solution of this embodiment assigns different weights to the similarities of different elements of a news event and determines whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, and achieves the effect of identifying identical news events and saving users time when browsing news.
Embodiment Two
Fig. 2 is a flowchart of a method for news event clustering provided by Embodiment Two of the present invention. The technical solution of this embodiment further refines the technical solution above and specifically includes:
Step 210: establish a keyword library that includes preset leader names and preset industry jargon.
The keyword library may be compiled by business experts, or trending web terms of the preset industry may be obtained as keywords. The keyword library is updated regularly, with entries added, deleted and overwritten.
Step 220: crawl news texts from preset websites.
Step 230: filter out, from the crawled news texts, the news texts that are not news of the preset industry.
The news texts crawled from a website may include content unrelated to the preset industry. Each news text is scanned for the keywords in the keyword library; if a news text contains no keyword, it is filtered out.
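The keyword scan of step 230 amounts to a substring test against the keyword library. In the sketch below the library is a plain set of strings and the sample keywords and texts are made up for illustration.

```python
def filter_industry_news(news_texts, keyword_library):
    """Keep only the news texts that contain at least one library keyword."""
    return [
        text
        for text in news_texts
        if any(keyword in text for keyword in keyword_library)
    ]


# Hypothetical keyword library and crawled texts.
keyword_library = {"环保", "污水处理"}
crawled = [
    "环保部发布新的排放标准",
    "某球队赢得联赛冠军",
    "多地启动污水处理厂升级改造",
]
industry_news = filter_industry_news(crawled, keyword_library)
```

Because the library is an ordinary set, the regular additions and deletions described for step 210 need no change to the filter itself.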
Step 240: according to the keyword library, perform word segmentation, part-of-speech tagging and named entity recognition on the two news texts to obtain the corresponding text tokens.
Step 250: compare the token similarity of tokens of preset types among the text tokens of the two news texts, and assign a corresponding token similarity weight.
Step 260: compare the text content similarity of the two news texts according to those of their text tokens whose part of speech is noun or verb, and assign a corresponding text content similarity weight.
Step 270: determine the similarity of the two news texts according to the token similarity, the token similarity weight, the text content similarity and the text content similarity weight of the two news texts.
Step 280: when the similarity of the two news texts exceeds the similarity threshold, determine that the two news texts are similar news events.
The technical solution of this embodiment uses the keyword library as a reference when determining the similarity between news texts, which reduces development cost and saves users time when browsing news.
Embodiment Three
Fig. 3 is a structural schematic diagram of an apparatus for news event clustering provided by Embodiment Three of the present invention. The apparatus may be configured in a computer device. The apparatus for news event clustering includes:
a news text crawling module 310, configured to crawl news texts from preset websites;
a text segmentation module 320, configured to perform word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;
a token similarity comparison module 330, configured to compare the token similarity of tokens of preset types among the text tokens of two news texts, and to assign a corresponding token similarity weight, wherein the tokens of preset types include time nouns, geographic names and named entities;
a content similarity comparison module 340, configured to compare the text content similarity of the two news texts according to those of their text tokens whose part of speech is noun or verb, and to assign a corresponding text content similarity weight;
a news text similarity determination module 350, configured to determine the similarity of the two news texts according to the token similarity, the token similarity weight, the text content similarity and the text content similarity weight of the two news texts;
a similar news event determination module 360, configured to determine that the two news texts are similar news events when the similarity of the two news texts exceeds a similarity threshold.
Optionally, the apparatus for news event clustering further includes:
a meaningless-token removal module, configured to remove from the text tokens those that carry no meaning, after the word segmentation, part-of-speech tagging and named entity recognition have been performed on the news texts and the corresponding text tokens obtained.
Optionally, the apparatus for news event clustering further includes:
a news text filtering module, configured to filter out, from the crawled news texts, the news texts that are not news of the preset industry, before the word segmentation, part-of-speech tagging and named entity recognition are performed on the news texts to obtain the corresponding text tokens.
Optionally, the apparatus for news event clustering further includes:
a keyword library establishment module, configured to establish, before the news texts are crawled from the preset websites, a keyword library that includes preset leader names and preset industry jargon; correspondingly, the text segmentation module includes a text segmentation unit, configured to perform, according to the keyword library, word segmentation, part-of-speech tagging and named entity recognition on the two news texts to obtain the corresponding text tokens.
Optionally, the apparatus for news event clustering further includes:
a similar news event gathering module, configured to gather the news texts that belong to similar news events and assign them to the same news event, after the two news texts have been determined to be similar news events because their similarity exceeds the similarity threshold.
Optionally, the news text crawling module includes:
a crawler crawling unit, configured to crawl the news texts from the preset websites with a web crawler.
The technical solution of this embodiment assigns different weights to the similarities of different elements of a news event and determines whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, and achieves the effect of identifying identical news events and saving users time when browsing news.
The above product can perform the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to performing the method.
Embodiment Four
Fig. 4 is a structural schematic diagram of a device provided by Embodiment Four of the present invention. As shown in Fig. 4, the device includes a processor 40, a memory 41, an input apparatus 42 and an output apparatus 43. The device may have one or more processors 40; Fig. 4 takes one processor 40 as an example. The processor 40, the memory 41, the input apparatus 42 and the output apparatus 43 in the device may be connected by a bus or in other ways; Fig. 4 takes connection by a bus as an example.
The memory 41, as a computer-readable storage medium, can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the method for news event clustering in the embodiments of the present invention (for example, the news text crawling module 310, the text segmentation module 320, the token similarity comparison module 330, the content similarity comparison module 340, the news text similarity determination module 350 and the similar news event determination module 360 in the apparatus for news event clustering). By running the software programs, instructions and modules stored in the memory 41, the processor 40 performs the various functional applications and data processing of the device, that is, implements the method for news event clustering described above.
The memory 41 may mainly include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function; the data storage area can store data created through use of the terminal, and so on. In addition, the memory 41 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 41 may further include memory located remotely from the processor 40, and such remote memory can be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input apparatus 42 can be used to receive input numeric or character information and to generate key signal input related to the user settings and function control of the device. The output apparatus 43 may include a display device such as a display screen.
Embodiment Five
Embodiment Five of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a method for news event clustering, the method including:
crawling news texts from preset websites;
performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;
comparing the token similarity of tokens of preset types among the text tokens of two news texts, and assigning a corresponding token similarity weight, wherein the tokens of preset types include time nouns, geographic names and named entities;
comparing the text content similarity of the two news texts according to those of their text tokens whose part of speech is noun or verb, and assigning a corresponding text content similarity weight;
determining the similarity of the two news texts according to the token similarity, the token similarity weight, the text content similarity and the text content similarity weight of the two news texts;
when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events.
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above, and can also perform the related operations in the method for news event clustering provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk or optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.
It is worth noting that, in the above apparatus embodiment, the units and modules included are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the protection scope of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (10)

1. A method for news event clustering, characterized by including:
crawling news texts from preset websites;
performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;
comparing the token similarity of tokens of preset types among the text tokens of two news texts, and assigning a corresponding token similarity weight, wherein the tokens of preset types include time nouns, geographic names and named entities;
comparing the text content similarity of the two news texts according to those of their text tokens whose part of speech is noun or verb, and assigning a corresponding text content similarity weight;
determining the similarity of the two news texts according to the token similarity, the token similarity weight, the text content similarity and the text content similarity weight of the two news texts;
when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events.
2. according to the method described in claim 1, it is characterized in that, it is described the newsletter archive is segmented, part of speech mark Note and name Entity recognition further include after obtaining corresponding text participle:
Reject the text participle without competency in the text participle.
3. according to the method described in claim 1, it is characterized in that, it is described the newsletter archive is segmented, part of speech mark Note and name Entity recognition further include before obtaining corresponding text participle:
Filter out the newsletter archive of non-default INDUSTRY OVERVIEW in the newsletter archive grabbed.
4. according to the method described in claim 1, it is characterized in that, the newsletter archive that the crawl is preset in website includes:
The newsletter archive in the default website is captured by web crawlers.
5. The method according to claim 1, characterized in that, before crawling the news texts from the preset websites, the method further comprises:
building a keyword library that includes preset leader names and preset industry terms;
correspondingly, performing word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain the corresponding text word segments comprises:
performing, according to the keyword library, word segmentation, part-of-speech tagging, and named entity recognition on the two news texts to obtain the corresponding text word segments.
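To illustrate how a keyword library can steer segmentation as in claim 5, the sketch below uses forward maximum matching against a small lexicon. This is a stand-in technique chosen for brevity, not the segmenter named in the patent (none is named); the lexicon entries and tag names are hypothetical, and a real system would more likely load the keyword library into a full segmenter as a user dictionary.

```python
# Hypothetical keyword library: term -> tag.
KEYWORD_LIBRARY = {
    "张三": "leader_name",        # placeholder leader name
    "光伏发电": "industry_term",  # "photovoltaic power generation"
}

def segment_with_library(text, library=KEYWORD_LIBRARY):
    """Forward maximum matching: at each position take the longest
    library entry that matches, otherwise emit a single character."""
    max_len = max((len(term) for term in library), default=1)
    result = []
    i = 0
    while i < len(text):
        matched = False
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in library:
                result.append((candidate, library[candidate]))
                i += length
                matched = True
                break
        if not matched:
            # Characters outside the library are left for a general segmenter.
            result.append((text[i], "other"))
            i += 1
    return result
```

Matching the longest entry first keeps multi-character names and industry terms from being split into unrelated single characters.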
6. The method according to claim 1, characterized in that, after determining that the two news texts are similar news events when the similarity of the two news texts exceeds the similarity threshold, the method further comprises:
aggregating the news texts that belong to similar news events and assigning them to the same news event.
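The aggregation in claim 6 can be sketched with a union-find structure that merges every pair judged similar. Treating similarity as transitive (if A matches B and B matches C, all three form one event) is an assumption made here for illustration; the claim itself only says that similar news texts are assigned to the same news event. The function name and the index-based input format are hypothetical.

```python
def group_events(n_docs, similar_pairs):
    """Group document indices 0..n_docs-1 into events, given the
    pairs (i, j) whose similarity exceeded the threshold."""
    parent = list(range(n_docs))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(n_docs):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```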
7. An apparatus for clustering news events, characterized by comprising:
a news text crawling module, configured to crawl news texts from preset websites;
a text segmentation module, configured to perform word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding text word segments;
a word-segment similarity comparison module, configured to compare the word-segment similarity of preset-type text word segments among the text word segments corresponding to two news texts, and to assign a corresponding word-segment similarity weight, wherein the preset-type text word segments include time nouns, place names, and named entities;
a content similarity comparison module, configured to compare the text content similarity of the two news texts according to the text word segments whose part of speech is noun or verb among the text word segments corresponding to the two news texts, and to assign a corresponding text content similarity weight;
a news text similarity determination module, configured to determine the similarity of the two news texts according to the word-segment similarity, the word-segment similarity weight, the text content similarity, and the text content similarity weight of the two news texts; and
a similar news event determination module, configured to determine that the two news texts are similar news events when the similarity of the two news texts exceeds a similarity threshold.
8. The apparatus according to claim 7, characterized by further comprising:
a non-meaningful word segment removal module, configured to remove, after the word segmentation, part-of-speech tagging, and named entity recognition are performed on the news texts to obtain the corresponding text word segments, the text word segments that carry no meaning;
a news text filtering module, configured to filter out, before the word segmentation, part-of-speech tagging, and named entity recognition are performed on the news texts to obtain the corresponding text word segments, the crawled news texts that fall outside the preset industry categories;
a keyword library building module, configured to build, before the news texts are crawled from the preset websites, a keyword library that includes preset leader names and preset industry terms; correspondingly, the text segmentation module comprises: a text segmentation unit, configured to perform, according to the keyword library, word segmentation, part-of-speech tagging, and named entity recognition on the two news texts to obtain the corresponding text word segments;
a similar news event aggregation module, configured to aggregate, after the two news texts are determined to be similar news events when their similarity exceeds the similarity threshold, the news texts that belong to similar news events and assign them to the same news event;
wherein the news text crawling module comprises:
a crawler unit, configured to crawl the news texts from the preset websites with a web crawler.
9. A device, characterized in that the device comprises:
one or more processors; and
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for clustering news events according to any one of claims 1-6.
10. A storage medium containing computer-executable instructions, characterized in that the computer-executable instructions, when executed by a computer processor, are used to perform the method for clustering news events according to any one of claims 1-6.
CN201810155131.5A 2018-02-23 2018-02-23 A kind of method, apparatus, equipment and the storage medium of media event cluster Pending CN108334628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810155131.5A CN108334628A (en) 2018-02-23 2018-02-23 A kind of method, apparatus, equipment and the storage medium of media event cluster

Publications (1)

Publication Number Publication Date
CN108334628A true CN108334628A (en) 2018-07-27

Family

ID=62929742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810155131.5A Pending CN108334628A (en) 2018-02-23 2018-02-23 A kind of method, apparatus, equipment and the storage medium of media event cluster

Country Status (1)

Country Link
CN (1) CN108334628A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411583A (en) * 2010-09-20 2012-04-11 阿里巴巴集团控股有限公司 Method and device for matching texts
CN102495872A (en) * 2011-11-30 2012-06-13 中国科学技术大学 Method and device for conducting personalized news recommendation to mobile device users
CN103377239A (en) * 2012-04-26 2013-10-30 腾讯科技(深圳)有限公司 Method and device for calculating inter-textual similarity
CN104715014A (en) * 2015-01-26 2015-06-17 中山大学 Online news topic detection method
US9477714B1 (en) * 2002-09-20 2016-10-25 Google Inc. Methods and apparatus for ranking documents
CN106383877A (en) * 2016-09-12 2017-02-08 电子科技大学 On-line short text clustering and topic detection method of social media
CN106934005A (en) * 2017-03-07 2017-07-07 重庆邮电大学 A kind of Text Clustering Method based on density
CN107145568A (en) * 2017-05-04 2017-09-08 成都华栖云科技有限公司 A kind of quick media event clustering system and method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241438B (en) * 2018-09-27 2022-06-24 国家计算机网络与信息安全管理中心 Element-based cross-channel hot event discovery method and device and storage medium
CN109241438A (en) * 2018-09-27 2019-01-18 国家计算机网络与信息安全管理中心 Across channel focus incident discovery method, apparatus and storage medium based on element
CN111046271B (en) * 2018-10-15 2023-04-25 阿里巴巴集团控股有限公司 Mining method and device for searching, storage medium and electronic equipment
CN111046271A (en) * 2018-10-15 2020-04-21 阿里巴巴集团控股有限公司 Mining method and device for search, storage medium and electronic equipment
CN109857866A (en) * 2019-01-14 2019-06-07 中国科学院信息工程研究所 A kind of keyword abstraction method and event query suggestion generation method and searching system towards event query suggestion
CN109857866B (en) * 2019-01-14 2021-05-25 中国科学院信息工程研究所 Event query suggestion-oriented keyword extraction method, event query suggestion generation method and retrieval system
JP2020174342A (en) * 2019-04-08 2020-10-22 バイドゥ ユーエスエイ エルエルシーBaidu USA LLC Method, device, server, computer-readable storage medium, and computer program for generating video
JP2020174338A (en) * 2019-04-08 2020-10-22 バイドゥ ドットコム タイムス テクノロジー (ベイジン) カンパニー リミテッド Method, device, server, computer-readable storage media, and computer program for generating information
JP7108259B2 (en) 2019-04-08 2022-07-28 バイドゥドットコム タイムズ テクノロジー (ベイジン) カンパニー リミテッド Methods, apparatus, servers, computer readable storage media and computer programs for generating information
CN112231470A (en) * 2019-06-28 2021-01-15 上海智臻智能网络科技股份有限公司 Topic mining method and device, storage medium and terminal
CN110704617A (en) * 2019-09-17 2020-01-17 平安科技(深圳)有限公司 News text classification method and device, electronic equipment and storage medium
CN110704617B (en) * 2019-09-17 2023-10-03 平安科技(深圳)有限公司 News text classification method, device, electronic equipment and storage medium
CN112926298A (en) * 2021-03-02 2021-06-08 北京百度网讯科技有限公司 News content identification method, related device and computer program product
CN113420112A (en) * 2021-06-21 2021-09-21 中国科学院声学研究所 News entity analysis method and device based on unsupervised learning
CN115146065A (en) * 2022-09-02 2022-10-04 安徽商信政通信息技术股份有限公司 Intelligent information reporting similar content merging method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180727
