CN108334628A

CN108334628A - Method, apparatus, device and storage medium for news event clustering

Info

Publication number: CN108334628A
Application number: CN201810155131.5A
Authority: CN
Inventors: 王云; 刘丹; 肖天鹤
Original assignee: Beijing Green Oriental Data Technology Co Ltd; Beijing Dong Run Huan Neng Science And Technology Co Ltd
Current assignee: Beijing Green Oriental Data Technology Co Ltd; Beijing Dong Run Huan Neng Science And Technology Co Ltd
Priority date: 2018-02-23
Filing date: 2018-02-23
Publication date: 2018-07-27

Abstract

Embodiments of the invention disclose a method, apparatus, device and storage medium for news event clustering. The method includes: crawling news texts from preset websites; performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens; comparing the token similarity of text tokens of preset types among the text tokens corresponding to two news texts, and assigning a corresponding token similarity weight; comparing the text content similarity of the two news texts, and assigning a corresponding text content similarity weight; determining the similarity of the two news texts according to their token similarity, the token similarity weight, the text content similarity and the text content similarity weight; and, when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events. The technical solution of the embodiments identifies identical news events and saves the time users spend browsing news.

Description

Method, apparatus, device and storage medium for news event clustering

Technical field

Embodiments of the present invention relate to text processing technology, and in particular to a method, apparatus, device and storage medium for news event clustering.

Background

The environmental protection industry is developing rapidly, and industry news is widely reported on the internet. News events with similar content are reposted across different websites. When news is searched by keyword, the top search results turn out to be largely alike, and it is sometimes necessary to page through many results before other news events appear. Those other news events are thus buried and may not come to the user's attention.

At present, this can be handled manually: a person repeatedly searches for keywords, checks the search results, judges from experience whether the news events are the same, marks the identical news events, and merges them into a single record, so that identical news events are displayed only once. Although manual work does solve the identification of identical news events, this approach is very costly, the number of records it can cover is limited, and when the personnel concerned are off duty, timeliness cannot meet production requirements.

Summary of the invention

Embodiments of the present invention provide a method, apparatus, device and storage medium for news event clustering, so as to identify identical news events and save the time users spend browsing news.

In a first aspect, an embodiment of the present invention provides a method for news event clustering, including:

crawling news texts from preset websites;

performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;

comparing the token similarity of text tokens of preset types among the text tokens corresponding to two of the news texts, and assigning a corresponding token similarity weight, wherein the text tokens of preset types include time nouns, place names and named entities;

comparing the text content similarity of the two news texts according to the text tokens whose part of speech is noun or verb among the text tokens corresponding to the two news texts, and assigning a corresponding text content similarity weight;

determining the similarity of the two news texts according to the token similarity of the two news texts, the token similarity weight, the text content similarity and the text content similarity weight;

when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events.

In a second aspect, an embodiment of the present invention further provides an apparatus for news event clustering, including:

a news text crawling module, for crawling news texts from preset websites;

a text segmentation module, for performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;

a token similarity comparison module, for comparing the token similarity of text tokens of preset types among the text tokens corresponding to two of the news texts, and assigning a corresponding token similarity weight, wherein the text tokens of preset types include time nouns, place names and named entities;

a content similarity comparison module, for comparing the text content similarity of the two news texts according to the text tokens whose part of speech is noun or verb among the text tokens corresponding to the two news texts, and assigning a corresponding text content similarity weight;

a news text similarity determination module, for determining the similarity of the two news texts according to the token similarity of the two news texts, the token similarity weight, the text content similarity and the text content similarity weight;

a similar news event determination module, for determining, when the similarity of the two news texts exceeds a similarity threshold, that the two news texts are similar news events.

In a third aspect, an embodiment of the present invention further provides a device, the device including:

one or more processors;

a storage device, for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for news event clustering provided in the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the method for news event clustering provided in the first aspect.

By assigning different weights to the similarities of the different elements of a news event, embodiments of the present invention determine whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, identifies identical news events, and saves the time users spend browsing news.

Brief description of the drawings

Fig. 1 is a flowchart of a method for news event clustering in Embodiment one of the present invention;

Fig. 2 is a flowchart of a method for news event clustering in Embodiment two of the present invention;

Fig. 3 is a schematic structural diagram of an apparatus for news event clustering in Embodiment three of the present invention;

Fig. 4 is a schematic structural diagram of a device in Embodiment four of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the present invention rather than the entire structure.

Embodiment one

Fig. 1 is a flowchart of a method for news event clustering provided in Embodiment one of the present invention. This embodiment is applicable to clustering news events on websites related to a preset industry. The method may be performed by an apparatus for news event clustering, which may be implemented in software and/or hardware. The method specifically includes the following steps:

Step 110: crawl news texts from preset websites.

Here, the preset websites are websites of the same industry, for example websites of the environmental protection industry. Such websites usually display news related to that industry, and a single website may display one or more related news items. By classifying the different data sources, efficient data crawling can be carried out over a large number of preset websites within a short time through simple configuration, which avoids a complex development effort and reduces development cost.

Optionally, crawling news texts from preset websites includes: crawling the news texts from the preset websites by means of a web crawler.
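As an illustration of step 110, the following is a minimal sketch using only the Python standard library. The patent does not describe how the crawler is built, so the names `ParagraphExtractor`, `extract_news_text` and `crawl`, and the assumption that the article body sits inside `<p>` elements, are hypothetical and introduced here only for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements (an assumed page layout)."""

    def __init__(self):
        super().__init__()
        self._depth = 0          # > 0 while the parser is inside a <p> element
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.paragraphs[-1] += data


def extract_news_text(html: str) -> str:
    """Return the non-empty paragraph texts of a page, one per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())


def crawl(urls, timeout=10.0):
    """Fetch each preset URL and map it to its extracted news text."""
    texts = {}
    for url in urls:
        with urlopen(url, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
        texts[url] = extract_news_text(html)
    return texts
```

In this sketch the list of preset websites is simply the `urls` argument; a production crawler would also need per-site configuration, politeness delays and error handling, none of which are shown.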

Step 120: perform word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens.

Here, Chinese natural language processing techniques are used to perform word segmentation, part-of-speech tagging and named entity recognition on the crawled news texts. Through this processing, the sentences in a news text are split into a number of text tokens, the part of speech of each text token is determined, and the named entities among the text tokens are identified.

Optionally, after performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain the corresponding text tokens, the method further includes: removing, from the text tokens, those that carry no meaning of their own, for example auxiliary words, conjunctions and modal particles.
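The optional removal of meaningless tokens can be sketched as a filter over already tagged tokens. The patent does not name a tagger or a tag set; the `Token` type and the tags `u` (auxiliary word), `c` (conjunction) and `y` (modal particle) below are assumptions borrowed from tag sets commonly used for Chinese, and the tagging itself is taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A segmented word with its part-of-speech tag (hypothetical type)."""
    word: str
    pos: str


# Assumed tags for tokens that carry no meaning of their own:
# "u" auxiliary word, "c" conjunction, "y" modal particle.
FUNCTION_POS = frozenset({"u", "c", "y"})


def drop_function_words(tokens):
    """Keep only the tokens whose part of speech is not a function-word tag."""
    return [t for t in tokens if t.pos not in FUNCTION_POS]


tagged = [
    Token("环保部", "nt"),
    Token("的", "u"),
    Token("通知", "n"),
    Token("和", "c"),
    Token("发布", "v"),
]
content_tokens = drop_function_words(tagged)
```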

Step 130: compare the token similarity of text tokens of preset types among the text tokens corresponding to two of the news texts, and assign a corresponding token similarity weight.

Optionally, the text tokens of preset types include time nouns, place names and named entities. The elements of a news event include the time and place at which the event occurred, the persons involved, and the main content of the event. Comparing the similarity of time nouns, place names and named entities therefore makes it possible to determine effectively the similarity between the different crawled news events.
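The patent does not state which measure is used for the token similarity of each preset type. The sketch below assumes Jaccard similarity over the sets of words of each type; the tag groups in `PRESET_TYPES`, the function names, and the choice to score two empty sets as 0.0 are all assumptions made for illustration.

```python
# Assumed mapping from preset type to part-of-speech tags:
# "t" time noun, "ns" place name, "nr"/"nt"/"nz" other named entities.
PRESET_TYPES = {
    "time": {"t"},
    "place": {"ns"},
    "entity": {"nr", "nt", "nz"},
}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (assumption)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def typed_similarities(doc_a, doc_b):
    """Compare two documents, given as (word, pos) pairs, per preset type."""
    result = {}
    for name, tags in PRESET_TYPES.items():
        words_a = {word for word, pos in doc_a if pos in tags}
        words_b = {word for word, pos in doc_b if pos in tags}
        result[name] = jaccard(words_a, words_b)
    return result


doc_a = [("2018年", "t"), ("北京", "ns"), ("环保部", "nt"), ("发布", "v")]
doc_b = [("2018年", "t"), ("北京", "ns"), ("环保部", "nt"), ("王某", "nr")]
scores = typed_similarities(doc_a, doc_b)
```

Here the two documents share the same time noun and place name but only one of two named entities, so the entity score is lower than the other two.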

Step 140: compare the text content similarity of the two news texts according to the text tokens whose part of speech is noun or verb among the text tokens corresponding to the two news texts, and assign a corresponding text content similarity weight.

Here, nouns and verbs are the most critical to understanding the meaning of the content. The similarity of the news event content formed by the text tokens whose part of speech is verb or noun is compared, and a weight is assigned. Existing industry methods treat the content body, the acting subject, the time and the place of a news event identically, without distinguishing among them; in fact, the content body, the acting subject, the time and the place contribute differently to semantic understanding. The similarities of time nouns, place names and named entities are therefore each assigned a weight according to their different contributions, which can improve the accuracy of clustering similar events.
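One possible way to compare content over nouns and verbs is cosine similarity between word-count vectors. This is an assumption: the patent only says that the content similarity is computed from noun and verb tokens. The tags `n` and `v` and the name `content_similarity` are likewise hypothetical.

```python
import math
from collections import Counter

# Assumed part-of-speech tags for nouns and verbs.
CONTENT_POS = frozenset({"n", "v"})


def content_similarity(doc_a, doc_b):
    """Cosine similarity of noun/verb counts for two (word, pos) sequences."""
    vec_a = Counter(word for word, pos in doc_a if pos in CONTENT_POS)
    vec_b = Counter(word for word, pos in doc_b if pos in CONTENT_POS)
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no nouns or verbs on one side: treated as dissimilar
    return dot / (norm_a * norm_b)


text_a = [("发布", "v"), ("政策", "n"), ("的", "u")]
text_b = [("发布", "v"), ("标准", "n")]
similarity = content_similarity(text_a, text_b)
```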

Step 150: determine the similarity of the two news texts according to the token similarity of the two news texts, the token similarity weight, the text content similarity and the text content similarity weight.

Here, the token similarity weights (for example, the three weights corresponding to the similarities of time nouns, place names and named entities) and the text content similarity weight sum to 1. Each token similarity determined in the above steps is multiplied by its corresponding token similarity weight, the text content similarity is multiplied by the text content similarity weight, and the products are summed to give the similarity of the two news texts. For example, suppose that for two news texts the time noun similarity is 1 (weight 0.3), the place name similarity is 1 (weight 0.2), the named entity similarity is 0.5 (weight 0.2), and the text content similarity is 0.8 (weight 0.3). The calculation then gives a similarity of 0.84 for the two news texts (1*0.3 + 1*0.2 + 0.5*0.2 + 0.8*0.3 = 0.84).
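The weighted sum in step 150 can be reproduced with a few lines of Python. The function name `combined_similarity` and the dictionary keys are hypothetical; the similarity values and weights are the ones from the worked example in the text.

```python
import math


def combined_similarity(similarities, weights):
    """Weighted sum of component similarities; the weights must sum to 1."""
    if similarities.keys() != weights.keys():
        raise ValueError("similarities and weights must have the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(similarities[key] * weights[key] for key in similarities)


# Values taken from the example above.
similarities = {"time": 1.0, "place": 1.0, "entity": 0.5, "content": 0.8}
weights = {"time": 0.3, "place": 0.2, "entity": 0.2, "content": 0.3}

score = combined_similarity(similarities, weights)
print(round(score, 2))  # 0.84
```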

Step 160: when the similarity of the two news texts exceeds the similarity threshold, determine that the two news texts are similar news events.

Here, the similarity threshold may be preset and adjusted later; once set, however, it stays constant while the crawled news texts are being checked for similarity. For example, the similarity threshold is 0.6.

Optionally, after determining that the two news texts are similar news events when their similarity exceeds the similarity threshold, the method further includes: gathering the news texts that belong to similar news events and assigning them to the same news event. The number of crawled news texts is usually large. When comparing the similarity of news texts, one news text may be selected as a fixed comparison object against which the other news texts are compared; according to the principle of similarity transitivity (for example, if A is similar to B and A is similar to C, then A, B and C are similar), the similar news texts are gathered and merged into one news event.
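The thresholding and the transitive grouping can be sketched with a union-find (disjoint set) structure. Union-find is swapped in here as a convenient way to apply transitivity over all pairs; the patent itself describes comparing the other texts against one fixed reference text. The function name `cluster_news` and the sample scores are hypothetical.

```python
from itertools import combinations


def cluster_news(count, similarity, threshold=0.6):
    """Group text indices 0..count-1 whose pairwise similarity exceeds the threshold.

    `similarity(i, j)` returns the similarity of texts i and j for i < j.
    Texts linked through a chain of similar pairs end up in the same group.
    """
    parent = list(range(count))

    def find(i):
        # Follow parent links to the group root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(count), 2):
        if similarity(i, j) > threshold:   # strictly greater, as in step 160
            parent[find(i)] = find(j)

    groups = {}
    for i in range(count):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up pairwise scores for four texts.
pair_scores = {
    (0, 1): 0.84, (0, 2): 0.70, (0, 3): 0.10,
    (1, 2): 0.30, (1, 3): 0.20, (2, 3): 0.60,
}
events = cluster_news(4, lambda i, j: pair_scores[(i, j)])
```

Texts 1 and 2 score only 0.30 against each other but are both similar to text 0, so all three fall into one event; text 3 stays alone because 0.60 does not exceed the threshold.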

By assigning different weights to the similarities of the different elements of a news event, the technical solution of this embodiment determines whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, identifies identical news events, and saves the time users spend browsing news.

Embodiment two

Fig. 2 is a flowchart of a method for news event clustering provided in Embodiment two of the present invention. The technical solution of this embodiment is a further refinement of the above technical solution, and specifically includes:

Step 210: establish a keyword library containing preset leader names and preset industry jargon.

Here, the keyword library may be compiled by business experts, or trending internet terms of the preset industry may be collected as keywords. The keyword library is updated regularly by adding, deleting and overwriting the data in it.

Step 220: crawl news texts from preset websites.

Step 230: filter out, from the crawled news texts, the news texts that are not news of the preset industry.

Here, a crawled news text may be website content unrelated to the preset industry. The news text is scanned for the keywords in the keyword library; if the news text contains none of the keywords, it is filtered out.
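The keyword-based filter of step 230, together with a simple library update, might look like the following sketch. It assumes the keyword library is a plain set of strings and that a substring match counts as containing a keyword; the function names and sample keywords are hypothetical.

```python
def update_keywords(library, add=(), remove=()):
    """Return a new keyword set with `remove` deleted and `add` inserted."""
    return (set(library) - set(remove)) | set(add)


def is_industry_news(text, library):
    """True if the text contains at least one keyword from the library."""
    return any(keyword in text for keyword in library)


def filter_industry_news(texts, library):
    """Keep only the texts that mention at least one keyword."""
    return [text for text in texts if is_industry_news(text, library)]


library = update_keywords({"大气污染", "足球"}, add={"污水处理"}, remove={"足球"})
crawled = ["某市启动大气污染治理", "足球联赛今晚开赛", "新建污水处理厂投产"]
kept = filter_industry_news(crawled, library)
```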

Step 240: according to the keyword library, perform word segmentation, part-of-speech tagging and named entity recognition on two of the news texts to obtain the corresponding text tokens.
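The patent does not say how the keyword library enters the segmentation. One plausible reading is that it serves as a custom dictionary so that leader names and industry terms are kept whole. The sketch below illustrates that reading with forward maximum matching, a simple dictionary-based segmenter chosen here as an assumption; part-of-speech tagging and named entity recognition are not shown.

```python
def segment(text, dictionary, max_len=None):
    """Forward maximum matching: take the longest dictionary word at each position.

    Characters that start no dictionary word become single-character tokens.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)

    tokens = []
    position = 0
    while position < len(text):
        longest = min(max_len, len(text) - position)
        for size in range(longest, 0, -1):
            piece = text[position:position + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                position += size
                break
    return tokens


keyword_library = {"大气污染", "大气", "防治"}
tokens = segment("大气污染防治法", keyword_library)
```

Because the longest match wins, the term "大气污染" is kept whole rather than being split after "大气".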

Step 250: compare the token similarity of text tokens of preset types among the text tokens corresponding to the two news texts, and assign a corresponding token similarity weight.

Step 260: compare the text content similarity of the two news texts according to the text tokens whose part of speech is noun or verb among the text tokens corresponding to the two news texts, and assign a corresponding text content similarity weight.

Step 270: determine the similarity of the two news texts according to the token similarity of the two news texts, the token similarity weight, the text content similarity and the text content similarity weight.

Step 280: when the similarity of the two news texts exceeds the similarity threshold, determine that the two news texts are similar news events.

The technical solution of this embodiment uses the keyword library as a reference in determining the similarity between news texts, which reduces development cost and saves the time users spend browsing news.

Embodiment three

Fig. 3 is a schematic structural diagram of an apparatus for news event clustering provided in Embodiment three of the present invention. The apparatus may be configured in a computer device. The apparatus for news event clustering includes:

a news text crawling module 310, for crawling news texts from preset websites;

a text segmentation module 320, for performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;

a token similarity comparison module 330, for comparing the token similarity of text tokens of preset types among the text tokens corresponding to two of the news texts, and assigning a corresponding token similarity weight, wherein the text tokens of preset types include time nouns, place names and named entities;

a content similarity comparison module 340, for comparing the text content similarity of the two news texts according to the text tokens whose part of speech is noun or verb among the text tokens corresponding to the two news texts, and assigning a corresponding text content similarity weight;

a news text similarity determination module 350, for determining the similarity of the two news texts according to the token similarity of the two news texts, the token similarity weight, the text content similarity and the text content similarity weight;

a similar news event determination module 360, for determining, when the similarity of the two news texts exceeds a similarity threshold, that the two news texts are similar news events.

Optionally, the apparatus for news event clustering further includes:

a meaningless-token removal module, for removing from the text tokens those that carry no meaning of their own, after the word segmentation, part-of-speech tagging and named entity recognition have been performed on the news texts and the corresponding text tokens obtained;

Optionally, the apparatus for news event clustering further includes:

a news text filtering module, for filtering out, from the crawled news texts, the news texts that are not news of the preset industry, before the word segmentation, part-of-speech tagging and named entity recognition are performed on the news texts to obtain the corresponding text tokens;

Optionally, the apparatus for news event clustering further includes:

a keyword library establishing module, for establishing, before the news texts are crawled from the preset websites, a keyword library containing preset leader names and preset industry jargon; correspondingly, the text segmentation module includes: a text segmentation unit, for performing, according to the keyword library, word segmentation, part-of-speech tagging and named entity recognition on two of the news texts to obtain the corresponding text tokens;

Optionally, the apparatus for news event clustering further includes:

a similar news event gathering module, for gathering the news texts that belong to similar news events and assigning them to the same news event, after the two news texts have been determined to be similar news events because their similarity exceeds the similarity threshold;

Optionally, the news text crawling module includes:

a crawler crawling unit, for crawling the news texts from the preset websites by means of a web crawler.

By assigning different weights to the similarities of the different elements of a news event, the technical solution of this embodiment determines whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, identifies identical news events, and saves the time users spend browsing news.

The above product can perform the method provided in any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the performed method.

Embodiment four

Fig. 4 is a schematic structural diagram of a device provided in Embodiment four of the present invention. As shown in Fig. 4, the device includes a processor 40, a memory 41, an input apparatus 42 and an output apparatus 43. The device may have one or more processors 40; one processor 40 is taken as an example in Fig. 4. The processor 40, memory 41, input apparatus 42 and output apparatus 43 in the device may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 4.

As a computer-readable storage medium, the memory 41 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the method for news event clustering in the embodiments of the present invention (for example, the news text crawling module 310, text segmentation module 320, token similarity comparison module 330, content similarity comparison module 340, news text similarity determination module 350 and similar news event determination module 360 in the apparatus for news event clustering). By running the software programs, instructions and modules stored in the memory 41, the processor 40 executes the various functional applications and data processing of the device, that is, implements the above method for news event clustering.

The memory 41 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 41 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 41 may further include memory located remotely from the processor 40, and such remote memory may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.

The input apparatus 42 can be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the device. The output apparatus 43 may include a display device such as a display screen.

Embodiment five

Embodiment five of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a method for news event clustering, the method including:

crawling news texts from preset websites;

Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above, and can also perform the relevant operations of the method for news event clustering provided in any embodiment of the present invention.

From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course can also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.

It is worth noting that, in the above apparatus embodiments, the units and modules included are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the scope of protection of the present invention.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; without departing from the inventive concept it may include further equivalent embodiments, and its scope is determined by the scope of the appended claims.

Claims

1. A method for news event clustering, characterized by comprising:

crawling news texts from preset websites;

comparing the token similarity of text tokens of preset types among the text tokens corresponding to two of the news texts, and assigning a corresponding token similarity weight, wherein the text tokens of preset types include time nouns, place names and named entities;

comparing the text content similarity of the two news texts according to the text tokens whose part of speech is noun or verb among the text tokens corresponding to the two news texts, and assigning a corresponding text content similarity weight;

determining the similarity of the two news texts according to the token similarity of the two news texts, the token similarity weight, the text content similarity and the text content similarity weight;

when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events.

2. The method according to claim 1, characterized in that, after performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain the corresponding text tokens, the method further comprises:

removing, from the text tokens, those that carry no meaning of their own.

3. The method according to claim 1, characterized in that, before performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain the corresponding text tokens, the method further comprises:

filtering out, from the crawled news texts, the news texts that are not news of the preset industry.

4. The method according to claim 1, characterized in that crawling news texts from preset websites comprises:

crawling the news texts from the preset websites by means of a web crawler.

5. The method according to claim 1, characterized in that, before crawling news texts from preset websites, the method further comprises:

establishing a keyword library containing preset leader names and preset industry jargon;

correspondingly, performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain the corresponding text tokens comprises:

performing, according to the keyword library, word segmentation, part-of-speech tagging and named entity recognition on two of the news texts to obtain the corresponding text tokens.

6. The method according to claim 1, characterized in that, after determining that the two news texts are similar news events when the similarity of the two news texts exceeds the similarity threshold, the method further comprises:

gathering the news texts that belong to similar news events and assigning them to the same news event.

7. An apparatus for news event clustering, characterized by comprising:

a text segmentation module, for performing word segmentation, part-of-speech tagging and named entity recognition on the news texts to obtain corresponding text tokens;

a token similarity comparison module, for comparing the token similarity of text tokens of preset types among the text tokens corresponding to two of the news texts, and assigning a corresponding token similarity weight, wherein the text tokens of preset types include time nouns, place names and named entities;

a content similarity comparison module, for comparing the text content similarity of the two news texts according to the text tokens whose part of speech is noun or verb among the text tokens corresponding to the two news texts, and assigning a corresponding text content similarity weight;

a news text similarity determination module, for determining the similarity of the two news texts according to the token similarity of the two news texts, the token similarity weight, the text content similarity and the text content similarity weight;

a similar news event determination module, for determining, when the similarity of the two news texts exceeds a similarity threshold, that the two news texts are similar news events.

8. The apparatus according to claim 7, characterized by further comprising:

a meaningless-token removal module, for removing from the text tokens those that carry no meaning of their own, after the word segmentation, part-of-speech tagging and named entity recognition have been performed on the news texts and the corresponding text tokens obtained;

a news text filtering module, for filtering out, from the crawled news texts, the news texts that are not news of the preset industry, before the word segmentation, part-of-speech tagging and named entity recognition are performed on the news texts to obtain the corresponding text tokens;

a keyword library establishing module, for establishing, before the news texts are crawled from the preset websites, a keyword library containing preset leader names and preset industry jargon; correspondingly, the text segmentation module comprises: a text segmentation unit, for performing, according to the keyword library, word segmentation, part-of-speech tagging and named entity recognition on two of the news texts to obtain the corresponding text tokens;

a similar news event gathering module, for gathering the news texts that belong to similar news events and assigning them to the same news event, after the two news texts have been determined to be similar news events because their similarity exceeds the similarity threshold;

the news text crawling module comprises:

9. A device, characterized in that the device comprises:

one or more processors;

a storage device, for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for news event clustering according to any one of claims 1-6.

10. A storage medium containing computer-executable instructions, characterized in that the computer-executable instructions, when executed by a computer processor, are used to perform the method for news event clustering according to any one of claims 1-6.