CN108334628A - Method, apparatus, device, and storage medium for news event clustering - Google Patents
- Publication number
- CN108334628A (application CN201810155131.5A)
- Authority
- CN
- China
- Prior art keywords
- similarity
- newsletter
- text
- participle
- archives
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
Embodiments of the present invention disclose a method, apparatus, device, and storage medium for news event clustering. The method includes: crawling news texts from preset websites; performing word segmentation, part-of-speech tagging, and named entity recognition on each news text to obtain its word segments; comparing the word segments of preset types in two news texts to obtain word-segment similarities, and assigning each a corresponding word-segment similarity weight; comparing the text content of the two news texts to obtain a text content similarity, and assigning it a corresponding text content similarity weight; determining the similarity of the two news texts from the word-segment similarities, the word-segment similarity weights, the text content similarity, and the text content similarity weight; and, when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events. The technical solution identifies identical news events and saves the time users spend browsing news.
Description
Technical field
Embodiments of the present invention relate to text processing technology, and in particular to a method, apparatus, device, and storage medium for news event clustering.
Background
The environmental protection industry is developing rapidly, and industry news is reported widely on the Internet. News events with similar content are reposted across different websites. When a user searches for news by keyword, the top search results are often nearly identical, and the user may need to page through many results before finding other news events. Those other news events are thus buried and may go unnoticed by the user.
At present, this can be handled manually: a person repeatedly searches for keywords, checks the search results, judges from experience whether the news events are identical, marks the identical ones, and merges them into a single record, so that each identical news event is shown only once rather than repeatedly. Although manual work does solve the identification of identical news events, it is very costly, the number of records it can cover is limited, and timeliness falls short of production requirements whenever the staff are off duty.
Summary of the invention
Embodiments of the present invention provide a method, apparatus, device, and storage medium for news event clustering, so as to identify identical news events and save the time users spend browsing news.
In a first aspect, an embodiment of the present invention provides a method for news event clustering, including:
crawling news texts from preset websites;
performing word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments;
comparing the word-segment similarity of the word segments of preset types among the word segments corresponding to two news texts, and assigning a corresponding word-segment similarity weight; wherein the word segments of preset types include time nouns, geographic names, and named entities;
comparing the text content similarity of the two news texts according to the word segments whose part of speech is noun or verb among the word segments corresponding to the two news texts, and assigning a corresponding text content similarity weight;
determining the similarity of the two news texts according to the word-segment similarity, the word-segment similarity weight, the text content similarity, and the text content similarity weight of the two news texts;
when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events.
In a second aspect, an embodiment of the present invention further provides an apparatus for news event clustering, including:
a news text crawling module, configured to crawl news texts from preset websites;
a text word segmentation module, configured to perform word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments;
a word-segment similarity comparison module, configured to compare the word-segment similarity of the word segments of preset types among the word segments corresponding to two news texts, and to assign a corresponding word-segment similarity weight; wherein the word segments of preset types include time nouns, geographic names, and named entities;
a content similarity comparison module, configured to compare the text content similarity of the two news texts according to the word segments whose part of speech is noun or verb among the word segments corresponding to the two news texts, and to assign a corresponding text content similarity weight;
a news text similarity determining module, configured to determine the similarity of the two news texts according to the word-segment similarity, the word-segment similarity weight, the text content similarity, and the text content similarity weight of the two news texts;
a similar news event determining module, configured to determine that the two news texts are similar news events when the similarity of the two news texts exceeds a similarity threshold.
In a third aspect, an embodiment of the present invention further provides a device, the device including:
one or more processors;
a storage device, configured to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the method for news event clustering provided in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method for news event clustering provided in the first aspect.
By assigning different weights to the similarities of the different elements of a news event, embodiments of the present invention determine whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, identifies identical news events, and saves the time users spend browsing news.
Description of the drawings
Fig. 1 is a flowchart of a method for news event clustering in Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a method for news event clustering in Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of an apparatus for news event clustering in Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of a device in Embodiment 4 of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment 1
Fig. 1 is a flowchart of a method for news event clustering provided by Embodiment 1 of the present invention. This embodiment is applicable to clustering news events on websites related to a preset industry. The method may be performed by a news event clustering apparatus, which may be implemented in software and/or hardware. The method specifically includes the following steps:
Step 110: crawl news texts from preset websites.
The preset websites are websites of the same industry, for example websites of the environmental protection industry. These websites usually display news related to that industry, and a single website may display one or more related news items. By classifying the different data sources, a large number of preset websites can be crawled efficiently within a short time through simple configuration, which avoids a complex development workload and reduces development cost.
Optionally, crawling news texts from the preset websites includes: crawling the news texts from the preset websites with a web crawler.
Step 120: perform word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments.
Chinese natural language processing techniques are applied to the crawled news texts to perform word segmentation, part-of-speech tagging, and named entity recognition. Through this processing, the sentences in a news text are divided into a number of word segments, the part of speech of each word segment is determined, and the named entities among the word segments are identified.
Optionally, after performing word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments, the method further includes: removing the word segments that carry no meaning, for example auxiliary words, conjunctions, and modal particles.
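The optional filtering in step 120 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the segmentation and tagging have already produced `(word, tag)` pairs, and the tag names (`auxiliary`, `conjunction`, `modal_particle`, and so on) are hypothetical labels rather than the tag set of any particular tagger.

```python
# Hypothetical tag names for function words; a real tagger defines its own tag set.
FUNCTION_WORD_TAGS = {"auxiliary", "conjunction", "modal_particle"}


def drop_function_words(tagged_tokens):
    """Keep only the (word, tag) pairs whose tag is not a function-word tag."""
    return [(word, tag) for word, tag in tagged_tokens if tag not in FUNCTION_WORD_TAGS]


# Illustrative, already-tagged input (an assumption, not real tagger output).
tagged = [
    ("2018", "time"),
    ("Beijing", "place"),
    ("and", "conjunction"),
    ("ministry", "noun"),
    ("released", "verb"),
    ("the", "auxiliary"),
    ("report", "noun"),
]
kept = drop_function_words(tagged)
```

Filtering on the tag rather than on a fixed word list keeps the rule independent of the vocabulary of any one industry.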
Step 130: compare the word-segment similarity of the word segments of preset types among the word segments corresponding to two news texts, and assign a corresponding word-segment similarity weight.
Optionally, the word segments of preset types include time nouns, geographic names, and named entities. The elements of a news event include the time and place at which the event occurred, the persons involved, and the main content of the event. Comparing the similarity of time nouns, geographic names, and named entities therefore effectively determines the similarity between the different crawled news events.
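The text does not state how the similarity of each preset type is measured. As one possible reading, the sketch below uses Jaccard similarity (shared items divided by all items) computed separately for each type. The choice of Jaccard, the dictionary layout, the type keys, and the sample values are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: no evidence on either side counts as no similarity
    return len(a & b) / len(a | b)


def typed_similarities(doc_a, doc_b, types=("time", "place", "entity")):
    """Compare two documents type by type; each doc maps a type to its word segments."""
    return {t: jaccard(doc_a.get(t, ()), doc_b.get(t, ())) for t in types}


# Illustrative inputs: word segments already grouped by preset type.
doc_a = {"time": ["2018-02-23"], "place": ["Beijing"], "entity": ["Ministry A", "Company B"]}
doc_b = {"time": ["2018-02-23"], "place": ["Beijing"], "entity": ["Ministry A"]}
sims = typed_similarities(doc_a, doc_b)
```

Keeping one score per type is what allows each type to receive its own weight in the later steps.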
Step 140: compare the text content similarity of the two news texts according to the word segments whose part of speech is noun or verb among the word segments corresponding to the two news texts, and assign a corresponding text content similarity weight.
Nouns and verbs are the most critical to understanding the semantics of the content, so the similarity of the news event content formed by the noun and verb word segments is compared and assigned a weight. Existing industry methods treat the content body, the acting subject, the time, and the place of a news event identically and do not distinguish among them; in fact, these elements contribute differently to semantic understanding. Assigning separate weights to the similarities of time nouns, geographic names, and named entities according to their different contributions can therefore improve the accuracy of clustering similar events.
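The content comparison of step 140 could be sketched as a cosine similarity over bag-of-words counts restricted to nouns and verbs. Cosine similarity is an assumption here, since the text does not name a measure, and the tag names `noun` and `verb` as well as the sample inputs are hypothetical.

```python
from collections import Counter
from math import sqrt


def content_similarity(tagged_a, tagged_b, keep=("noun", "verb")):
    """Cosine similarity of word counts, using only words tagged as noun or verb."""
    counts_a = Counter(word for word, tag in tagged_a if tag in keep)
    counts_b = Counter(word for word, tag in tagged_b if tag in keep)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm = sqrt(sum(v * v for v in counts_a.values())) * sqrt(sum(v * v for v in counts_b.values()))
    return dot / norm if norm else 0.0


# Illustrative, already-tagged inputs.
text_a = [("ministry", "noun"), ("released", "verb"), ("report", "noun"), ("2018", "time")]
text_b = [("ministry", "noun"), ("released", "verb"), ("report", "noun"), ("2017", "time")]
text_c = [("team", "noun"), ("won", "verb"), ("cup", "noun")]

same_content = content_similarity(text_a, text_b)   # time words are ignored here
different_content = content_similarity(text_a, text_c)
```

Because time nouns, geographic names, and named entities are scored separately, excluding them here avoids counting the same evidence twice.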
Step 150: determine the similarity of the two news texts according to the word-segment similarity, the word-segment similarity weight, the text content similarity, and the text content similarity weight of the two news texts.
The word-segment similarities are, for example, the similarities of time nouns, geographic names, and named entities. The three corresponding word-segment similarity weights and the text content similarity weight sum to 1. Each word-segment similarity determined in the preceding steps is multiplied by its word-segment similarity weight, the text content similarity is multiplied by the text content similarity weight, and the products are summed to give the similarity of the two news texts. For example, if the time noun similarity of two news texts is 1 (weight 0.3), the geographic name similarity is 1 (weight 0.2), the named entity similarity is 0.5 (weight 0.2), and the text content similarity is 0.8 (weight 0.3), then the similarity of the two news texts is calculated as 0.84 (1*0.3 + 1*0.2 + 0.5*0.2 + 0.8*0.3 = 0.84).
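The weighted sum of step 150 can be reproduced with a short sketch. The scores and weights are the ones from the example above; the function name, the dictionary keys, and the validation checks are illustrative assumptions.

```python
from math import isclose


def combined_similarity(scores, weights):
    """Weighted sum of per-element similarities; the weights are expected to sum to 1."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same elements")
    if not isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weights[name] for name in scores)


scores = {"time": 1.0, "place": 1.0, "entity": 0.5, "content": 0.8}
weights = {"time": 0.3, "place": 0.2, "entity": 0.2, "content": 0.3}
total = combined_similarity(scores, weights)  # approximately 0.84, as in the example
```

Checking that the weights sum to 1 keeps the combined score within the same 0 to 1 range as its inputs, so a single threshold remains meaningful.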
Step 160: when the similarity of the two news texts exceeds a similarity threshold, determine that the two news texts are similar news events.
The similarity threshold may be preset and adjusted later, but once set it remains constant while determining whether the crawled news texts are similar. For example, the similarity threshold is 0.6.
Optionally, after determining that the two news texts are similar news events when their similarity exceeds the similarity threshold, the method further includes: aggregating the news texts that belong to similar news events and assigning them to the same news event. The number of crawled news texts is usually large. When comparing the similarity of news texts, one news text may be chosen as a fixed reference against which the other news texts are compared; then, by the principle of similarity transitivity (for example, if A is similar to B and A is similar to C, then A, B, and C are similar), the similar news texts are aggregated and merged into one news event.
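The aggregation by similarity transitivity can be sketched with a union-find (disjoint-set) structure. Union-find is a substitute technique chosen for this illustration; the text itself describes a fixed reference text plus transitivity. The function name, the pair-score layout, and the sample scores are assumptions, and 0.6 is the example threshold given above.

```python
def group_similar(count, pair_similarity, threshold=0.6):
    """Group text indices 0..count-1; pairs scoring above the threshold share a group."""
    parent = list(range(count))

    def find(i):
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in pair_similarity.items():
        if score > threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for i in range(count):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Illustrative pairwise scores for four news texts.
pairs = {(0, 1): 0.84, (0, 2): 0.7, (1, 2): 0.4, (0, 3): 0.2}
events = group_similar(4, pairs)  # [[0, 1, 2], [3]]
```

Texts 1 and 2 end up in the same event even though their direct score is 0.4, because both are similar to text 0; this is the transitivity described above.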
By assigning different weights to the similarities of the different elements of a news event, the technical solution of this embodiment determines whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, identifies identical news events, and saves the time users spend browsing news.
Embodiment 2
Fig. 2 is a flowchart of a method for news event clustering provided by Embodiment 2 of the present invention. The technical solution of this embodiment further refines the technical solution above and specifically includes:
Step 210: establish a keyword library containing preset leader names and preset industry terms.
The keyword library may be compiled by business experts, or trending online terms of the preset industry may be collected as keywords. The keyword library is updated regularly by adding, deleting, and overwriting its entries.
Step 220: crawl news texts from preset websites.
Step 230: filter out, from the crawled news texts, the news texts that are not news of the preset industry.
The crawled news texts may include content on the website that is unrelated to the preset industry. Each news text is scanned for the keywords in the keyword library; if a news text contains no keyword, it is filtered out.
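The filtering in step 230 amounts to a keyword scan, sketched below. The function names, the case-insensitive matching, and the sample keywords and headlines are illustrative assumptions.

```python
def is_industry_news(text, keyword_library):
    """Return True if the text contains at least one keyword from the library."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keyword_library)


def filter_industry_news(texts, keyword_library):
    """Keep only the texts that mention at least one keyword."""
    return [text for text in texts if is_industry_news(text, keyword_library)]


# Illustrative keyword library and headlines.
keywords = {"wastewater", "air pollution"}
crawled = [
    "New wastewater discharge standard takes effect",
    "Local football team wins the cup",
    "Air pollution control report released",
]
industry_news = filter_industry_news(crawled, keywords)
```

A plain substring scan is the simplest reading of the step; an empty keyword library filters out every text, since no text can then contain a keyword.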
Step 240: according to the keyword library, perform word segmentation, part-of-speech tagging, and named entity recognition on the two news texts to obtain the corresponding word segments.
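One way a keyword library can guide segmentation is dictionary-based forward maximum matching, in which the longest library entry found at each position is kept whole. The sketch below illustrates that idea only; it is not the segmenter used in the patent, it covers neither part-of-speech tagging nor named entity recognition, and the function name and sample library are assumptions.

```python
def segment_with_keywords(text, keyword_library):
    """Forward maximum matching: at each position take the longest library entry,
    falling back to a single character when no entry matches."""
    max_len = max((len(keyword) for keyword in keyword_library), default=1)
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in keyword_library:
                segments.append(piece)
                i += size
                break
    return segments


# Illustrative library: ministry name, "release", "air pollution control", "report".
library = {"环保部", "发布", "大气污染防治", "报告"}
segments = segment_with_keywords("环保部发布大气污染防治报告", library)
```

With the industry term in the library, "大气污染防治" stays a single word segment instead of being split into shorter words, which is the benefit a domain keyword library brings to segmentation.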
Step 250: compare the word-segment similarity of the word segments of preset types among the word segments corresponding to two news texts, and assign a corresponding word-segment similarity weight.
Step 260: compare the text content similarity of the two news texts according to the word segments whose part of speech is noun or verb among the word segments corresponding to the two news texts, and assign a corresponding text content similarity weight.
Step 270: determine the similarity of the two news texts according to the word-segment similarity, the word-segment similarity weight, the text content similarity, and the text content similarity weight of the two news texts.
Step 280: when the similarity of the two news texts exceeds a similarity threshold, determine that the two news texts are similar news events.
The technical solution of this embodiment uses the keyword library as a reference when determining the similarity between news texts, which reduces development cost and saves the time users spend browsing news.
Embodiment 3
Fig. 3 is a schematic structural diagram of an apparatus for news event clustering provided by Embodiment 3 of the present invention. The apparatus may be configured in a computer device. The news event clustering apparatus includes:
a news text crawling module 310, configured to crawl news texts from preset websites;
a text word segmentation module 320, configured to perform word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments;
a word-segment similarity comparison module 330, configured to compare the word-segment similarity of the word segments of preset types among the word segments corresponding to two news texts, and to assign a corresponding word-segment similarity weight; wherein the word segments of preset types include time nouns, geographic names, and named entities;
a content similarity comparison module 340, configured to compare the text content similarity of the two news texts according to the word segments whose part of speech is noun or verb among the word segments corresponding to the two news texts, and to assign a corresponding text content similarity weight;
a news text similarity determining module 350, configured to determine the similarity of the two news texts according to the word-segment similarity, the word-segment similarity weight, the text content similarity, and the text content similarity weight of the two news texts;
a similar news event determining module 360, configured to determine that the two news texts are similar news events when the similarity of the two news texts exceeds a similarity threshold.
Optionally, the news event clustering apparatus further includes:
a meaningless word segment removal module, configured to remove the word segments that carry no meaning after the word segmentation, part-of-speech tagging, and named entity recognition have been performed on the news texts to obtain corresponding word segments.
Optionally, the news event clustering apparatus further includes:
a news text filtering module, configured to filter out, from the crawled news texts, the news texts that are not news of the preset industry, before the word segmentation, part-of-speech tagging, and named entity recognition are performed on the news texts to obtain corresponding word segments.
Optionally, the news event clustering apparatus further includes:
a keyword library establishing module, configured to establish a keyword library containing preset leader names and preset industry terms before the news texts are crawled from the preset websites; correspondingly, the text word segmentation module includes: a text word segmentation unit, configured to perform, according to the keyword library, word segmentation, part-of-speech tagging, and named entity recognition on the two news texts to obtain the corresponding word segments.
Optionally, the news event clustering apparatus further includes:
a similar news event aggregation module, configured to aggregate the news texts that belong to similar news events and assign them to the same news event, after the two news texts have been determined to be similar news events because their similarity exceeds the similarity threshold.
Optionally, the news text crawling module includes:
a crawler crawling unit, configured to crawl the news texts from the preset websites with a web crawler.
By assigning different weights to the similarities of the different elements of a news event, the technical solution of this embodiment determines whether different news texts are similar news events. This solves the problems of high cost and poor timeliness, identifies identical news events, and saves the time users spend browsing news.
The above product can perform the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to that method.
Embodiment 4
Fig. 4 is a schematic structural diagram of a device provided by Embodiment 4 of the present invention. As shown in Fig. 4, the device includes a processor 40, a memory 41, an input apparatus 42, and an output apparatus 43. The device may have one or more processors 40; Fig. 4 takes one processor 40 as an example. The processor 40, the memory 41, the input apparatus 42, and the output apparatus 43 in the device may be connected by a bus or by other means; Fig. 4 takes a bus connection as an example.
As a computer-readable storage medium, the memory 41 can store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for news event clustering in the embodiments of the present invention (for example, the news text crawling module 310, the text word segmentation module 320, the word-segment similarity comparison module 330, the content similarity comparison module 340, the news text similarity determining module 350, and the similar news event determining module 360 in the news event clustering apparatus). By running the software programs, instructions, and modules stored in the memory 41, the processor 40 performs the various functional applications and data processing of the device, that is, implements the above method for news event clustering.
The memory 41 may mainly include a program storage area and a data storage area. The program storage area can store an operating system and the application programs required by at least one function; the data storage area can store data created through use of the terminal, and the like. In addition, the memory 41 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 41 may further include memory located remotely from the processor 40, and such remote memory may be connected to the device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input apparatus 42 can receive input numeric or character information and generate key signal inputs related to the user settings and function control of the device. The output apparatus 43 may include a display device such as a display screen.
Embodiment 5
Embodiment 5 of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a method for news event clustering, the method including:
crawling news texts from preset websites;
performing word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments;
comparing the word-segment similarity of the word segments of preset types among the word segments corresponding to two news texts, and assigning a corresponding word-segment similarity weight; wherein the word segments of preset types include time nouns, geographic names, and named entities;
comparing the text content similarity of the two news texts according to the word segments whose part of speech is noun or verb among the word segments corresponding to the two news texts, and assigning a corresponding text content similarity weight;
determining the similarity of the two news texts according to the word-segment similarity, the word-segment similarity weight, the text content similarity, and the text content similarity weight of the two news texts;
when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events.
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above; they can also perform the related operations in the method for news event clustering provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), a hard disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
It is worth noting that, in the above apparatus embodiment, the units and modules included are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the protection scope of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; without departing from the inventive concept, it may include further equivalent embodiments, and its scope is determined by the scope of the appended claims.
Claims (10)
1. A method for news event clustering, characterized by comprising:
crawling news texts from preset websites;
performing word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments;
comparing the word-segment similarity of the word segments of preset types among the word segments corresponding to two news texts, and assigning a corresponding word-segment similarity weight; wherein the word segments of preset types include time nouns, geographic names, and named entities;
comparing the text content similarity of the two news texts according to the word segments whose part of speech is noun or verb among the word segments corresponding to the two news texts, and assigning a corresponding text content similarity weight;
determining the similarity of the two news texts according to the word-segment similarity, the word-segment similarity weight, the text content similarity, and the text content similarity weight of the two news texts;
when the similarity of the two news texts exceeds a similarity threshold, determining that the two news texts are similar news events.
2. The method according to claim 1, characterized in that, after performing word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments, the method further comprises:
removing the word segments that carry no meaning.
3. The method according to claim 1, characterized in that, before performing word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments, the method further comprises:
filtering out, from the crawled news texts, the news texts that are not news of the preset industry.
4. The method according to claim 1, characterized in that crawling news texts from preset websites comprises:
crawling the news texts from the preset websites with a web crawler.
5. The method according to claim 1, characterized in that, before crawling news texts from preset websites, the method further comprises:
establishing a keyword library containing preset leader names and preset industry terms;
correspondingly, performing word segmentation, part-of-speech tagging, and named entity recognition on the news texts to obtain corresponding word segments comprises:
performing, according to the keyword library, word segmentation, part-of-speech tagging, and named entity recognition on the two news texts to obtain the corresponding word segments.
6. The method according to claim 1, characterized in that, after determining that the two news texts are similar news events when the similarity of the two news texts exceeds the similarity threshold, the method further comprises:
aggregating the news texts that belong to similar news events and assigning them to the same news event.
7. A news event clustering apparatus, comprising:
A news text capture module, configured to capture news texts from a preset website;
A text segmentation module, configured to perform word segmentation, part-of-speech tagging, and named entity
recognition on the news texts to obtain the corresponding text tokens;
A token similarity comparison module, configured to compare the token similarity of the preset-type text tokens
among the text tokens corresponding to two news texts, and to assign a corresponding token similarity weight;
wherein the preset-type text tokens comprise time nouns, geographic names, and named entities;
A content similarity comparison module, configured to compare the text content similarity of the two news texts
according to the text tokens whose part of speech is noun or verb among the text tokens corresponding to the two
news texts, and to assign a corresponding text content similarity weight;
A news text similarity determination module, configured to determine the similarity of the two news texts
according to the token similarity, the token similarity weight, the text content similarity, and the text content
similarity weight of the two news texts;
A similar news event determination module, configured to determine that the two news texts are similar news
events when the similarity of the two news texts exceeds a similarity threshold.
8. The apparatus according to claim 7, further comprising:
A non-expressive token removal module, configured to remove, from the text tokens, those that carry no expressive
meaning, after word segmentation, part-of-speech tagging, and named entity recognition have been performed on the
news texts to obtain the corresponding text tokens;
A news text filtering module, configured to filter out, from the captured news texts, those that do not concern
the preset industry, before word segmentation, part-of-speech tagging, and named entity recognition are performed
on the news texts to obtain the corresponding text tokens;
A keyword library establishment module, configured to establish, before the news texts are captured from the
preset website, a keyword library containing preset leader names and preset industry terminology; correspondingly,
the text segmentation module comprises: a text segmentation unit, configured to perform, according to the keyword
library, word segmentation, part-of-speech tagging, and named entity recognition on the two news texts to obtain
the corresponding text tokens;
A similar news event aggregation module, configured to aggregate the news texts that belong to similar news
events and assign them to the same news event, after the two news texts have been determined to be similar news
events because their similarity exceeds the similarity threshold;
Wherein the news text capture module comprises:
A crawler capture unit, configured to capture the news texts from the preset website by means of a web crawler.
9. A device, comprising:
One or more processors;
A storage apparatus, configured to store one or more programs,
Wherein, when the one or more programs are executed by the one or more processors, the one or more processors
implement the news event clustering method according to any one of claims 1-6.
10. A storage medium containing computer-executable instructions, wherein the computer-executable instructions,
when executed by a computer processor, are used to perform the news event clustering method according to any one of claims 1-6.
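The weighted similarity and grouping steps recited in claims 6 and 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the claims do not name a similarity measure, weight values, a threshold value, or a grouping strategy, so the Jaccard overlap, the equal 0.5 weights, the 0.5 threshold, the union-find grouping, and every identifier below (NewsText, jaccard, similarity, cluster_events) are assumptions chosen only for illustration. Segmentation, part-of-speech tagging, and named entity recognition are assumed to have already produced the token sets.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class NewsText:
    """Hypothetical container for the tokens extracted from one news text."""
    doc_id: str
    times: Set[str]      # time nouns
    places: Set[str]     # geographic names
    entities: Set[str]   # named entities
    content: Set[str]    # tokens tagged as noun or verb


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]; an assumed stand-in for the unspecified measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(x: NewsText, y: NewsText,
               token_weight: float = 0.5,
               content_weight: float = 0.5) -> float:
    """Weighted sum of token similarity and text content similarity."""
    token_sim = (jaccard(x.times, y.times)
                 + jaccard(x.places, y.places)
                 + jaccard(x.entities, y.entities)) / 3
    content_sim = jaccard(x.content, y.content)
    return token_weight * token_sim + content_weight * content_sim


def cluster_events(docs: List[NewsText], threshold: float = 0.5) -> List[List[str]]:
    """Group texts whose pairwise similarity exceeds the threshold (union-find)."""
    parent = list(range(len(docs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if similarity(docs[i], docs[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, doc in enumerate(docs):
        groups.setdefault(find(i), []).append(doc.doc_id)
    return list(groups.values())
```

Grouping through union-find makes similarity transitive: if text A is similar to B and B to C, all three land in the same news event even when A and C fall below the threshold. Whether the claimed method intends that behaviour is not stated, so treat it as a design choice of this sketch.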
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810155131.5A CN108334628A (en) | 2018-02-23 | 2018-02-23 | A kind of method, apparatus, equipment and the storage medium of media event cluster |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810155131.5A CN108334628A (en) | 2018-02-23 | 2018-02-23 | A kind of method, apparatus, equipment and the storage medium of media event cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108334628A (en) | 2018-07-27 |
Family
ID=62929742
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810155131.5A Pending CN108334628A (en) | 2018-02-23 | 2018-02-23 | A kind of method, apparatus, equipment and the storage medium of media event cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108334628A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241438A (en) * | 2018-09-27 | 2019-01-18 | 国家计算机网络与信息安全管理中心 | Across channel focus incident discovery method, apparatus and storage medium based on element |
CN109857866A (en) * | 2019-01-14 | 2019-06-07 | 中国科学院信息工程研究所 | A kind of keyword abstraction method and event query suggestion generation method and searching system towards event query suggestion |
CN110704617A (en) * | 2019-09-17 | 2020-01-17 | 平安科技(深圳)有限公司 | News text classification method and device, electronic equipment and storage medium |
CN111046271A (en) * | 2018-10-15 | 2020-04-21 | 阿里巴巴集团控股有限公司 | Mining method and device for search, storage medium and electronic equipment |
JP2020174342A (en) * | 2019-04-08 | 2020-10-22 | バイドゥ ユーエスエイ エルエルシーBaidu USA LLC | Method, device, server, computer-readable storage medium, and computer program for generating video |
JP2020174338A (en) * | 2019-04-08 | 2020-10-22 | バイドゥ ドットコム タイムス テクノロジー (ベイジン) カンパニー リミテッド | Method, device, server, computer-readable storage media, and computer program for generating information |
CN112231470A (en) * | 2019-06-28 | 2021-01-15 | 上海智臻智能网络科技股份有限公司 | Topic mining method and device, storage medium and terminal |
CN112926298A (en) * | 2021-03-02 | 2021-06-08 | 北京百度网讯科技有限公司 | News content identification method, related device and computer program product |
CN113420112A (en) * | 2021-06-21 | 2021-09-21 | 中国科学院声学研究所 | News entity analysis method and device based on unsupervised learning |
CN115146065A (en) * | 2022-09-02 | 2022-10-04 | 安徽商信政通信息技术股份有限公司 | Intelligent information reporting similar content merging method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102411583A (en) * | 2010-09-20 | 2012-04-11 | 阿里巴巴集团控股有限公司 | Method and device for matching texts |
CN102495872A (en) * | 2011-11-30 | 2012-06-13 | 中国科学技术大学 | Method and device for conducting personalized news recommendation to mobile device users |
CN103377239A (en) * | 2012-04-26 | 2013-10-30 | 腾讯科技(深圳)有限公司 | Method and device for calculating inter-textual similarity |
CN104715014A (en) * | 2015-01-26 | 2015-06-17 | 中山大学 | Online news topic detection method |
US9477714B1 (en) * | 2002-09-20 | 2016-10-25 | Google Inc. | Methods and apparatus for ranking documents |
CN106383877A (en) * | 2016-09-12 | 2017-02-08 | 电子科技大学 | On-line short text clustering and topic detection method of social media |
CN106934005A (en) * | 2017-03-07 | 2017-07-07 | 重庆邮电大学 | A kind of Text Clustering Method based on density |
CN107145568A (en) * | 2017-05-04 | 2017-09-08 | 成都华栖云科技有限公司 | A kind of quick media event clustering system and method |
- 2018-02-23: CN application CN201810155131.5A (publication CN108334628A), status: Pending
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9477714B1 (en) * | 2002-09-20 | 2016-10-25 | Google Inc. | Methods and apparatus for ranking documents |
CN102411583A (en) * | 2010-09-20 | 2012-04-11 | 阿里巴巴集团控股有限公司 | Method and device for matching texts |
CN102495872A (en) * | 2011-11-30 | 2012-06-13 | 中国科学技术大学 | Method and device for conducting personalized news recommendation to mobile device users |
CN103377239A (en) * | 2012-04-26 | 2013-10-30 | 腾讯科技(深圳)有限公司 | Method and device for calculating inter-textual similarity |
CN104715014A (en) * | 2015-01-26 | 2015-06-17 | 中山大学 | Online news topic detection method |
CN106383877A (en) * | 2016-09-12 | 2017-02-08 | 电子科技大学 | On-line short text clustering and topic detection method of social media |
CN106934005A (en) * | 2017-03-07 | 2017-07-07 | 重庆邮电大学 | A kind of Text Clustering Method based on density |
CN107145568A (en) * | 2017-05-04 | 2017-09-08 | 成都华栖云科技有限公司 | A kind of quick media event clustering system and method |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241438B (en) * | 2018-09-27 | 2022-06-24 | 国家计算机网络与信息安全管理中心 | Element-based cross-channel hot event discovery method and device and storage medium |
CN109241438A (en) * | 2018-09-27 | 2019-01-18 | 国家计算机网络与信息安全管理中心 | Across channel focus incident discovery method, apparatus and storage medium based on element |
CN111046271B (en) * | 2018-10-15 | 2023-04-25 | 阿里巴巴集团控股有限公司 | Mining method and device for searching, storage medium and electronic equipment |
CN111046271A (en) * | 2018-10-15 | 2020-04-21 | 阿里巴巴集团控股有限公司 | Mining method and device for search, storage medium and electronic equipment |
CN109857866A (en) * | 2019-01-14 | 2019-06-07 | 中国科学院信息工程研究所 | A kind of keyword abstraction method and event query suggestion generation method and searching system towards event query suggestion |
CN109857866B (en) * | 2019-01-14 | 2021-05-25 | 中国科学院信息工程研究所 | Event query suggestion-oriented keyword extraction method, event query suggestion generation method and retrieval system |
JP2020174342A (en) * | 2019-04-08 | 2020-10-22 | バイドゥ ユーエスエイ エルエルシーBaidu USA LLC | Method, device, server, computer-readable storage medium, and computer program for generating video |
JP2020174338A (en) * | 2019-04-08 | 2020-10-22 | バイドゥ ドットコム タイムス テクノロジー (ベイジン) カンパニー リミテッド | Method, device, server, computer-readable storage media, and computer program for generating information |
JP7108259B2 (en) | 2019-04-08 | 2022-07-28 | バイドゥドットコム タイムズ テクノロジー (ベイジン) カンパニー リミテッド | Methods, apparatus, servers, computer readable storage media and computer programs for generating information |
CN112231470A (en) * | 2019-06-28 | 2021-01-15 | 上海智臻智能网络科技股份有限公司 | Topic mining method and device, storage medium and terminal |
CN110704617A (en) * | 2019-09-17 | 2020-01-17 | 平安科技(深圳)有限公司 | News text classification method and device, electronic equipment and storage medium |
CN110704617B (en) * | 2019-09-17 | 2023-10-03 | 平安科技(深圳)有限公司 | News text classification method, device, electronic equipment and storage medium |
CN112926298A (en) * | 2021-03-02 | 2021-06-08 | 北京百度网讯科技有限公司 | News content identification method, related device and computer program product |
CN113420112A (en) * | 2021-06-21 | 2021-09-21 | 中国科学院声学研究所 | News entity analysis method and device based on unsupervised learning |
CN115146065A (en) * | 2022-09-02 | 2022-10-04 | 安徽商信政通信息技术股份有限公司 | Intelligent information reporting similar content merging method and system |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN108334628A (en) | A kind of method, apparatus, equipment and the storage medium of media event cluster | |
US7937338B2 (en) | System and method for identifying document structure and associated metainformation | |
CN111831802B (en) | Urban domain knowledge detection system and method based on LDA topic model | |
CN101271459A (en) | Word library generation method, input method and input method system | |
CN103778548A (en) | Goods information and keyword matching method, and goods information releasing method and device | |
AU2018250372B2 (en) | Method to construct content based on a content repository | |
US20160299891A1 (en) | Matching of an input document to documents in a document collection | |
CN104978332A (en) | UGC label data generating method, UGC label data generating device, relevant method and relevant device | |
CN110909120A (en) | Resume searching/delivering method, device and system and electronic equipment | |
CN110232126A (en) | Hot spot method for digging and server and computer readable storage medium | |
US20200012722A1 (en) | System for real-time expression of semantic mind map, and operation method therefor | |
CN105512300A (en) | Information filtering method and system | |
CN103226601A (en) | Method and device for image search | |
US9946765B2 (en) | Building a domain knowledge and term identity using crowd sourcing | |
US10606899B2 (en) | Categorically filtering search results | |
CN109670047B (en) | Abstract note generation method, computer device and readable storage medium | |
CN102982029B (en) | A kind of search need recognition methods and device | |
CN113407678B (en) | Knowledge graph construction method, device and equipment | |
CN111401047A (en) | Method and device for generating dispute focus of legal document and computer equipment | |
US11232088B2 (en) | Method and system for interactive search indexing | |
US20120047128A1 (en) | Open class noun classification | |
CN113468339A (en) | Label extraction method, system, electronic device and medium based on knowledge graph | |
CN113761104A (en) | Method and device for detecting entity relationship in knowledge graph and electronic equipment | |
CN109511000A (en) | Barrage classification determines method, apparatus, equipment and storage medium | |
US20230061773A1 (en) | Automated systems and methods for generating technical questions from technical documents |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180727 |