CN107203509A

CN107203509A - Title generation method and device

Info

Publication number: CN107203509A
Application number: CN201710262158.XA
Authority: CN
Inventors: 王洪俊; 肖诗斌
Original assignee: BEIJING TRS INFORMATION TECHNOLOGY Co Ltd
Current assignee: Tols Information Technology Co ltd
Priority date: 2017-04-20
Filing date: 2017-04-20
Publication date: 2017-09-26
Anticipated expiration: 2037-04-20
Also published as: CN107203509B

Abstract

Embodiments of the present invention provide a title generation method and device. The title generation method includes: obtaining the original title of each news document in a first news set and splicing the titles into a title text string, where the first news set includes at least one news document about the same news event; extracting high-frequency word strings from the title text string and filtering the extracted high-frequency word strings; and determining the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set. With the technical solution of the embodiments, a high-quality short title can be generated automatically for news documents, which ensures the semantic quality and conciseness of the title, reduces the computational difficulty of short-title generation, and offers higher adaptability.

Description

Title generation method and device

Technical field

The present invention relates to the field of computer technology, and more particularly to a title generation method and device.

Background

The titles of news documents are generally long, typically 20 to 30 characters, which limits the number of news items that can be displayed on a news web page. To display more news on a news web page, the titles of news documents can be compressed or rewritten so that the title length is shortened without affecting the meaning of the title.

At present, title compression methods for news documents mainly shorten titles based on set rules or grammatical patterns. For example, based on set rules, word strings in the title are replaced with shorter synonyms or abbreviations, or the core sentence or key sentence of the news document is used in place of the title. As another example, based on grammatical patterns, the grammatical patterns of title generation are learned from a database in order to generate shorter titles.

However, because the coverage of set rules is limited and grammatical patterns are constrained by the scope of the database, the semantic quality and conciseness of news titles generated from set rules or grammatical patterns cannot be guaranteed, and titles cannot be compressed effectively.

Summary of the invention

Embodiments of the present invention provide a title generation method and device for automatically generating high-quality short titles for news documents.

According to one aspect of the embodiments of the present invention, a title generation method is provided, including: obtaining the original title of each news document in a first news set and splicing the titles into a title text string, where the first news set includes at least one news document about the same news event; extracting high-frequency word strings from the title text string and filtering the extracted high-frequency word strings; and determining the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set.

Optionally, the method further includes: obtaining the first news set by clustering a second news set, where the second news set includes at least the first news set.

Optionally, obtaining the first news set by clustering the second news set includes: calculating the content similarity between the news documents in the second news set; and determining at least one candidate news set according to the content similarity, and determining the first news set from the at least one candidate news set.

Optionally, obtaining the original title of each news document in the first news set and splicing the titles into a title text string includes: placing a punctuation mark between each pair of adjacent original titles in the title text string; and/or replacing corresponding word strings in the original titles with synonyms or abbreviations.

Optionally, filtering the extracted high-frequency word strings includes: filtering out, from the extracted high-frequency word strings, word strings that do not occur at the beginning or end of an original title; and/or filtering out, from the extracted high-frequency word strings, word strings that include punctuation marks; and/or filtering out, from the extracted high-frequency word strings, word strings whose length is less than a set length threshold.

According to another aspect of the embodiments of the present invention, a title generating device is also provided, including: an acquisition module, configured to obtain the original title of each news document in a first news set and splice the titles into a title text string, where the first news set includes at least one news document about the same news event; an extraction and filtering module, configured to extract high-frequency word strings from the title text string and filter the extracted high-frequency word strings; and a generation module, configured to determine the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set.

Optionally, the device further includes: a cluster module, configured to obtain the first news set by clustering a second news set, where the second news set includes at least the first news set.

Optionally, the cluster module includes: a computing unit, configured to calculate the content similarity between the news documents in the second news set; and a determining unit, configured to determine at least one candidate news set according to the content similarity and to determine the first news set from the at least one candidate news set.

Optionally, the acquisition module includes: a setting unit, configured to place a punctuation mark between each pair of adjacent original titles in the title text string; and/or a replacement unit, configured to replace corresponding word strings in the original titles with synonyms or abbreviations.

Optionally, the extraction and filtering module includes a filtering unit, configured to: filter out, from the extracted high-frequency word strings, word strings that do not occur at the beginning or end of an original title; and/or filter out, from the extracted high-frequency word strings, word strings that include punctuation marks; and/or filter out, from the extracted high-frequency word strings, word strings whose length is less than a set length threshold.

The title generation method and device of the embodiments of the present invention obtain the original titles of multiple news documents about the same news event and splice them into a title text string, extract high-frequency word strings from the title text string, and filter the extracted high-frequency word strings to select those that fit the characteristics of a title. The highest-frequency word string that passes the filter is then determined as the new title, so that a high-quality short title is generated for each news document and the semantic quality and conciseness of the title are ensured. In addition, the computational difficulty of short-title generation is reduced, and the approach has higher adaptability.

Brief description of the drawings

Fig. 1 is a flowchart of the steps of a title generation method according to Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the steps of a title generation method according to Embodiment 2 of the present invention;

Fig. 3 is a structural block diagram of a title generating device according to Embodiment 3 of the present invention;

Fig. 4 is a structural block diagram of a title generating device according to Embodiment 4 of the present invention.

Detailed description

The specific implementation of the embodiments of the present invention is described in further detail below with reference to the accompanying drawings (the same reference numerals denote the same elements throughout the drawings) and the embodiments. The following embodiments are intended to illustrate the present invention, not to limit its scope.

Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present invention are used only to distinguish different steps, devices, modules, and so on. They carry no particular technical meaning and do not indicate any necessary logical order.

Embodiment one

Referring to Fig. 1, a flowchart of the steps of a title generation method according to Embodiment 1 of the present invention is shown.

The title generation method of this embodiment includes the following steps:

Step S102: Obtain the original title of each news document in a first news set and splice the titles into a title text string.

Here, the first news set includes at least one news document about the same news event.

In this embodiment, the one or more news documents in the first news set concern the same news event, which can be any news event. Each news document in the first news set has an original title.

For example, Table 1 shows an example of the first news set.

After the first news set is obtained, the original title of each news document in the first news set is extracted, and the extracted original titles are spliced into one long text string, forming the title text string.

Step S104: Extract high-frequency word strings from the title text string, and filter the extracted high-frequency word strings.

Here, a high-frequency word string is a word string in the title text string whose length exceeds a preset length (for example, two English words or two Chinese characters) and whose number of occurrences exceeds a preset number (for example, twice).

For example, for the title text string spliced from the original titles of the first news set shown in Table 1, the extracted high-frequency word strings may include "Liu Shishi's wedding", "Liu Shishi", "wedding dress", "Wu Qilong", "wedding", "Wu Qilong Liu Shishi", and so on. After the high-frequency word strings are extracted, they are filtered to remove word strings that do not fit the characteristics of a title. This embodiment does not limit the manner in which high-frequency word strings are extracted or the rules by which they are filtered.

Step S106: Determine the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set.

The filtered high-frequency word strings basically fit the characteristics of a title. The word string with the highest frequency of occurrence is selected from the set of filtered high-frequency word strings and used as the new title of each news document in the first news set. Using the highest-frequency filtered word string as the title has two benefits. On the one hand, it ensures the semantic quality of the new title: the title can express the news event that the news documents in the first news set refer to, and so satisfies the basic characteristics of a title. On the other hand, using a word string as the title amounts to regenerating a short title for each news document in the first news set, which ensures the conciseness of the title.

The title generation method according to this embodiment of the present invention obtains the original titles of multiple news documents about the same news event and splices them into a title text string, extracts high-frequency word strings from the title text string, and filters the extracted high-frequency word strings to select those that fit the characteristics of a title. The highest-frequency word string that passes the filter is then determined as the new title, so that a high-quality short title is generated for each news document and the semantic quality and conciseness of the title are ensured.

Compared with prior-art methods that shorten titles based on set rules and grammatical patterns, the title generation method of this embodiment does not need complex short-title generation rules, which reduces the computational difficulty of short-title generation. Moreover, there is no need to consider the coverage of set rules or of a database: the titles of the news documents can be obtained, spliced, screened, and compressed to generate a high-quality short title automatically, which gives the method higher adaptability.

The title generation method of this embodiment can be performed by any device with the corresponding data processing capability, including but not limited to the server side of a news web page.

Embodiment two

Referring to Fig. 2, a flowchart of the steps of a title generation method according to Embodiment 2 of the present invention is shown.

Step S202: Obtain the first news set by clustering a second news set.

Here, the second news set includes at least the first news set.

In this embodiment, the second news set includes at least one news document about at least one news event. That is, in addition to the at least one news document about the same news event that makes up the first news set, the second news set may also include other news documents about other news events.

The second news set is clustered to obtain a class of news documents about the same news event, which serves as the first news set. In an optional implementation, the content similarity between the news documents in the second news set is calculated, at least one candidate news set is determined according to the content similarity, and the first news set is determined from the at least one candidate news set.

Specifically, the content similarity between news documents can be calculated by performing word segmentation and vectorization on each news document in the second news set, for example, by calculating the cosine similarity between news document vectors. If the content similarity between two news documents is greater than a preset similarity threshold (for example, 0.5), the two news documents can be determined to be about the same news event. That is, multiple news documents whose content similarity is greater than the similarity threshold can be determined to be news documents about the same news event, and these news documents are then determined as a candidate news set. One or more candidate news sets may be determined from the second news set, and any one of the candidate news sets can be determined as the first news set.
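
The similarity-based grouping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the embodiment: whitespace splitting stands in for word segmentation, raw term counts stand in for vectorization, and a greedy single pass that compares each document with the first document of each cluster stands in for the clustering algorithm, none of which the text specifies. The names `cosine_similarity` and `cluster_documents` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_documents(docs, threshold=0.5):
    """Greedy grouping (an assumption): a document joins the first cluster
    whose seed document it resembles by more than `threshold`."""
    # Whitespace tokenization is a placeholder for real word segmentation.
    vectors = [Counter(doc.split()) for doc in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) > threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters
```

Each returned list of indices corresponds to one candidate news set; any one of them could then be taken as the first news set.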

Step S204: Obtain the original title of each news document in the first news set and splice the titles into a title text string.

After the first news set is determined, the original title of each news document in the first news set is extracted, and the titles are spliced into one title text string.

Optionally, when the original titles are spliced into the title text string, a punctuation mark can be placed between each pair of adjacent original titles in the title text string to separate the original titles, which prevents word strings from being formed across the end of one original title and the beginning of the next. Preferably, the same punctuation mark is placed between every pair of adjacent original titles to reduce the amount of computation. For example, a period is placed at the end of each original title. In addition, the space characters in each original title can also be replaced with periods.
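
The splicing step above can be sketched as follows. This is a minimal illustration that assumes the Chinese full stop "。" as the single separator and treats any run of whitespace as a space to be replaced; the function name `splice_titles` and the sample titles in the usage note are hypothetical.

```python
def splice_titles(titles, separator="。"):
    """Join original titles into one text string, ending each with the same separator."""
    parts = []
    for title in titles:
        # Replace whitespace inside a title with the separator so that
        # no word string can later span the gap.
        cleaned = separator.join(title.split())
        if cleaned:
            parts.append(cleaned + separator)
    return "".join(parts)
```

For two made-up titles, `splice_titles(["北京暴雨红色预警", "气象台 发布预警"])` gives `"北京暴雨红色预警。气象台。发布预警。"`, so every boundary between titles (and every former space) is marked by the same punctuation mark.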

In this embodiment, after the original title of each news document in the first news set is extracted, corresponding word strings in each original title are replaced with synonyms (synonyms shorter than the word string being replaced) or abbreviations to shorten the word strings, so that the title is shortened further if a replaced word string ends up as the title.
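
A minimal sketch of this replacement step follows, assuming the synonyms and abbreviations are supplied as a plain dictionary from long form to short form. The text does not say where such a dictionary comes from, and the function name `shorten_with_synonyms` and the sample entry are hypothetical.

```python
def shorten_with_synonyms(title, replacements):
    """Replace word strings in a title with shorter synonyms or abbreviations."""
    # Apply longer keys first so that a long phrase is not broken up
    # by a shorter key that overlaps it.
    for long_form in sorted(replacements, key=len, reverse=True):
        short_form = replacements[long_form]
        if len(short_form) < len(long_form):  # only accept replacements that shorten
            title = title.replace(long_form, short_form)
    return title
```

With the illustrative entry `{"中国人民银行": "央行"}`, the title `"中国人民银行宣布降息"` becomes `"央行宣布降息"`. Entries whose replacement is not shorter are ignored, in line with the requirement that the synonym be shorter than the word string it replaces.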

Step S206: Extract high-frequency word strings from the title text string.

In an optional implementation, an n-gram word string statistics method is used to extract, from the title text string, word strings whose length exceeds a preset length and whose number of occurrences exceeds a preset number; these are the high-frequency word strings. If the extracted high-frequency word strings include same-frequency substrings, the same-frequency substrings are filtered out. For example, if the word strings "China" and "Chinese people" each occur 4 times in the title text string, and "Chinese people" contains "China", then "China" is a same-frequency substring of "Chinese people", and only the word string "Chinese people" is extracted as a high-frequency word string.
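
The n-gram counting and same-frequency substring removal described above can be sketched as follows. The sketch works on character n-grams and makes several assumptions that the text leaves open: the thresholds are treated as inclusive minimums (`min_len`, `min_count`), an upper bound `max_len` is added to keep the enumeration small, and the function name `extract_high_frequency_strings` is hypothetical.

```python
from collections import Counter


def extract_high_frequency_strings(text, min_len=2, max_len=8, min_count=2):
    """Count character n-grams and keep those meeting the length and count thresholds."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    # Drop same-frequency substrings: s is removed when a longer frequent
    # string contains it and occurs exactly as often.
    return {
        s: c
        for s, c in frequent.items()
        if not any(s != t and s in t and c == frequent[t] for t in frequent)
    }
```

For the made-up text `"中国人民万岁。中国人民团结。"`, the strings "中国", "国人", "人民", "中国人", "国人民" and "中国人民" each occur twice, and only "中国人民" survives because every shorter string is a same-frequency substring of it. The pairwise containment check is quadratic in the number of frequent strings, which is acceptable for a handful of titles but would need a better data structure at scale.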

Step S208: Filter out, from the extracted high-frequency word strings, word strings that do not occur at the beginning or end of an original title, word strings that include punctuation marks, and word strings whose length is less than a set length threshold.

In this embodiment, word strings that do not occur at the beginning or end of an original title are filtered out of the extracted high-frequency word strings; and/or word strings that include punctuation marks are filtered out of the extracted high-frequency word strings; and/or word strings whose length is less than a set length threshold are filtered out of the extracted high-frequency word strings. A word string that does not occur at the beginning or end of an original title is less likely to become a title, a word string that includes a punctuation mark generally cannot be a title, and a word string shorter than the set length threshold is not enough to describe the news event clearly. Filtering out these word strings therefore makes the extracted high-frequency word strings fit the characteristics of a title more closely.

It should be noted that, in other embodiments, one or more of the above three kinds of word strings that do not fit the characteristics of a title may be filtered out of the extracted high-frequency word strings, and other word strings that do not fit the characteristics of a title may be filtered out as well.
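
The three filtering rules above can be sketched as follows. This is a minimal illustration in which all three rules are applied together; the punctuation set, the default threshold `min_len=4`, and the function name `filter_candidates` are assumptions, and the candidate counts in the usage note are made up.

```python
import string

# ASCII punctuation plus a few common Chinese marks (an illustrative set).
PUNCTUATION = set(string.punctuation) | set("。，、；：？！“”‘’（）《》")


def filter_candidates(candidates, titles, min_len=4):
    """Keep only candidate word strings that pass all three rules."""
    kept = {}
    for s, count in candidates.items():
        # Rule 1: the string must open or close at least one original title.
        at_edge = any(t.startswith(s) or t.endswith(s) for t in titles)
        # Rule 2: the string must not contain punctuation.
        has_punct = any(ch in PUNCTUATION for ch in s)
        # Rule 3: the string must reach the length threshold.
        if at_edge and not has_punct and len(s) >= min_len:
            kept[s] = count
    return kept
```

Given the titles `["北京暴雨红色预警", "气象台发布北京暴雨预警"]` and the candidates `{"北京暴雨": 2, "预警": 2, "预警。": 2, "京暴": 2}`, only "北京暴雨" is kept: "预警" is too short, "预警。" contains punctuation, and "京暴" never opens or closes a title.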

Step S210: Determine the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set.
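
A minimal sketch of this selection step follows. The text does not say what happens when several word strings share the highest frequency or when no word string survives the filter; preferring the longer string on a tie and returning `None` for an empty input are assumptions, and the function name `choose_title` is hypothetical.

```python
def choose_title(filtered):
    """Return the most frequent candidate; ties go to the longer string (an assumption)."""
    if not filtered:
        return None  # nothing survived filtering
    return max(filtered, key=lambda s: (filtered[s], len(s)))
```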

The title generation method of this embodiment can be regarded as an optional specific implementation of the title generation method of Embodiment 1 above; for identical steps, refer to the way the corresponding steps are performed in Embodiment 1.

The title generation method of this embodiment aggregates news about the same event by clustering, extracts the original titles of these news documents and splices them into a title text string, extracts high-frequency word strings from the title text string, and screens for high-frequency word strings that fit the characteristics of a title based on features such as the position and length of each word string. The highest-frequency word string that fits the characteristics of a title is then selected as the new title, so that a high-quality short title is generated for each news document and the semantic quality and conciseness of the title are ensured. Moreover, the high-quality short title is generated automatically, which reduces the computational difficulty of short-title generation and gives the method higher adaptability.

Embodiment three

Referring to Fig. 3, a structural block diagram of a title generating device according to Embodiment 3 of the present invention is shown.

The title generating device of this embodiment includes an acquisition module 302, an extraction and filtering module 304, and a generation module 306. The acquisition module 302 is configured to obtain the original title of each news document in a first news set and splice the titles into a title text string, where the first news set includes at least one news document about the same news event. The extraction and filtering module 304 is configured to extract high-frequency word strings from the title text string and filter the extracted high-frequency word strings. The generation module 306 is configured to determine the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set.
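
One way to picture how the three modules cooperate is the sketch below, in which each module becomes a method of a single class. It is an illustration under assumptions rather than the structure of the device: the class and method names are hypothetical, character n-grams with inclusive thresholds are used for extraction, only the punctuation and title-edge rules are applied for filtering, and ties are resolved in favor of the longer word string.

```python
from collections import Counter


class TitleGenerator:
    """Illustrative counterpart of modules 302, 304 and 306 (names are hypothetical)."""

    def __init__(self, min_len=2, max_len=8, min_count=2, sep="。"):
        self.min_len, self.max_len = min_len, max_len
        self.min_count, self.sep = min_count, sep

    def acquire(self, titles):
        """Acquisition module 302: splice the original titles into one text string."""
        return "".join(title + self.sep for title in titles)

    def extract_and_filter(self, text, titles):
        """Module 304: count character n-grams, then drop unsuitable candidates."""
        counts = Counter(
            text[i:i + n]
            for n in range(self.min_len, self.max_len + 1)
            for i in range(len(text) - n + 1)
        )
        return {
            s: c
            for s, c in counts.items()
            if c >= self.min_count
            and self.sep not in s  # no punctuation inside a candidate
            and any(t.startswith(s) or t.endswith(s) for t in titles)
        }

    def generate(self, candidates):
        """Generation module 306: pick the most frequent candidate (ties: longest)."""
        if not candidates:
            return None
        return max(candidates, key=lambda s: (candidates[s], len(s)))

    def run(self, titles):
        """Chain the three modules for one news set."""
        text = self.acquire(titles)
        return self.generate(self.extract_and_filter(text, titles))
```

For the three made-up titles "北京暴雨红色预警", "气象台发布北京暴雨预警" and "北京暴雨致航班延误", `TitleGenerator().run(...)` returns "北京暴雨", the longest of the word strings that occur three times and open a title.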

The title generating device provided by this embodiment of the present invention obtains the original titles of multiple news documents about the same news event and splices them into a title text string, extracts high-frequency word strings from the title text string, and filters the extracted high-frequency word strings to select those that fit the characteristics of a title. The highest-frequency word string that passes the filter is then determined as the new title, so that a high-quality short title is generated for each news document and the semantic quality and conciseness of the title are ensured. In addition, the computational difficulty of short-title generation is reduced, and the device has higher adaptability.

Embodiment four

Referring to Fig. 4, a structural block diagram of a title generating device according to Embodiment 4 of the present invention is shown.

The title generating device of this embodiment includes an acquisition module 402, an extraction and filtering module 404, and a generation module 406. The acquisition module 402 is configured to obtain the original title of each news document in a first news set and splice the titles into a title text string, where the first news set includes at least one news document about the same news event. The extraction and filtering module 404 is configured to extract high-frequency word strings from the title text string and filter the extracted high-frequency word strings. The generation module 406 is configured to determine the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set.

Optionally, the title generating device of this embodiment further includes a cluster module 408, configured to obtain the first news set by clustering a second news set, where the second news set includes at least the first news set.

Optionally, the cluster module 408 includes a computing unit 4082 and a determining unit 4084. The computing unit 4082 is configured to calculate the content similarity between the news documents in the second news set. The determining unit 4084 is configured to determine at least one candidate news set according to the content similarity and to determine the first news set from the at least one candidate news set.

Optionally, the acquisition module 402 includes a setting unit 4022 and/or a replacement unit 4024. The setting unit 4022 is configured to place a punctuation mark between each pair of adjacent original titles in the title text string. The replacement unit 4024 is configured to replace corresponding word strings in the original titles with synonyms or abbreviations.

Optionally, the extraction and filtering module 404 includes an extraction unit 4042 and a filtering unit 4044. The extraction unit 4042 is configured to extract high-frequency word strings from the title text string. The filtering unit 4044 is configured to filter out, from the extracted high-frequency word strings, word strings that do not occur at the beginning or end of an original title; and/or to filter out, from the extracted high-frequency word strings, word strings that include punctuation marks; and/or to filter out, from the extracted high-frequency word strings, word strings whose length is less than a set length threshold.

The title generating device of this embodiment is used to implement the title generation method of Embodiment 1 or Embodiment 2 above and has the beneficial effects of the corresponding method embodiments, which are not repeated here.

It should be noted that, depending on the needs of implementation, each component or step described in the embodiments of the present invention can be split into more components or steps, and two or more components or steps, or partial operations of components or steps, can be combined into a new component or step to achieve the purpose of the embodiments of the present invention.

The above method according to the embodiments of the present invention can be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is downloaded over a network, originally stored in a remote recording medium or a non-transitory machine-readable medium, and then stored in a local recording medium, so that the method described here can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It will be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, or flash memory) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the processing method described here is implemented. In addition, when a general-purpose computer accesses code for implementing the processing shown here, execution of the code converts the general-purpose computer into a special-purpose computer for performing that processing.

Those of ordinary skill in the art will appreciate that the units and method steps of each example described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the embodiments of the present invention.

The above implementations are intended only to illustrate the embodiments of the present invention, not to limit them. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention, so all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention shall be defined by the claims.

Claims

1. A title generation method, characterized by including:

obtaining the original title of each news document in a first news set and splicing the titles into a title text string, where the first news set includes at least one news document about the same news event;

extracting high-frequency word strings from the title text string, and filtering the extracted high-frequency word strings;

determining the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set.

2. The method according to claim 1, characterized by further including:

obtaining the first news set by clustering a second news set, where the second news set includes at least the first news set.

3. The method according to claim 2, characterized in that obtaining the first news set by clustering the second news set includes:

calculating the content similarity between the news documents in the second news set;

determining at least one candidate news set according to the content similarity, and determining the first news set from the at least one candidate news set.

4. The method according to claim 1, characterized in that obtaining the original title of each news document in the first news set and splicing the titles into a title text string includes:

placing a punctuation mark between each pair of adjacent original titles in the title text string; and/or,

replacing corresponding word strings in the original titles with synonyms or abbreviations.

5. The method according to any one of claims 1 to 4, characterized in that filtering the extracted high-frequency word strings includes:

filtering out, from the extracted high-frequency word strings, word strings that do not occur at the beginning or end of an original title; and/or,

filtering out, from the extracted high-frequency word strings, word strings that include punctuation marks; and/or,

filtering out, from the extracted high-frequency word strings, word strings whose length is less than a set length threshold.

6. A title generating device, characterized by including:

an acquisition module, configured to obtain the original title of each news document in a first news set and splice the titles into a title text string, where the first news set includes at least one news document about the same news event;

an extraction and filtering module, configured to extract high-frequency word strings from the title text string and filter the extracted high-frequency word strings;

a generation module, configured to determine the word string with the highest frequency of occurrence among the filtered high-frequency word strings as the title of the first news set.

7. The device according to claim 6, characterized by further including:

a cluster module, configured to obtain the first news set by clustering a second news set, where the second news set includes at least the first news set.

8. The device according to claim 7, characterized in that the cluster module includes:

a computing unit, configured to calculate the content similarity between the news documents in the second news set;

a determining unit, configured to determine at least one candidate news set according to the content similarity and to determine the first news set from the at least one candidate news set.

9. The device according to claim 6, characterized in that the acquisition module includes:

a setting unit, configured to place a punctuation mark between each pair of adjacent original titles in the title text string; and/or,

a replacement unit, configured to replace corresponding word strings in the original titles with synonyms or abbreviations.

10. The device according to any one of claims 6 to 9, characterized in that the extraction and filtering module includes a filtering unit, the filtering unit being configured to: