CN107203509B - Title generation method and device

Info

Publication number: CN107203509B
Application number: CN201710262158.XA
Authority: CN
Inventors: 王洪俊; 肖诗斌
Original assignee: BEIJING TRS INFORMATION TECHNOLOGY CO LTD
Current assignee: Tols Information Technology Co ltd
Priority date: 2017-04-20
Filing date: 2017-04-20
Publication date: 2023-06-20
Anticipated expiration: 2037-04-20
Also published as: CN107203509A

Abstract

The embodiment of the invention provides a title generation method and device. The title generation method comprises the following steps: acquiring original titles of all news documents in a first news set and splicing the original titles into title text strings, wherein the first news set comprises at least one news document related to the same news event; extracting a high-frequency word string from the title text string, and filtering the extracted high-frequency word string; and determining the word string with highest occurrence frequency in the filtered high-frequency word strings as the title of the first news set. By adopting the technical scheme of the embodiment of the invention, a high-quality short title can be automatically generated for the news document, the semantic effect and the conciseness of the title are ensured, the calculation difficulty of short title generation is reduced, and the method has higher adaptability.

Description

Title generation method and device

Technical Field

The invention relates to the technical field of computers, and more particularly, to a title generation method and apparatus.

Background

News documents typically have long titles, usually 20-30 words, which limits the number of news items that can be displayed on a news web page. To display more news on the page, the title of a news document can be compressed or rewritten so that its length is shortened without affecting its semantics.

Currently, title compression methods for news documents mainly shorten the title based on set rules or grammar patterns. For example, based on a set rule, a shorter synonym or abbreviation replaces the corresponding word string in the title, or a core sentence or key sentence of the news document is acquired to replace the title. As another example, grammar patterns are learned from a database, and a shorter title is generated based on the learned grammar patterns.

However, because the coverage of set rules is limited and grammar patterns are confined to the range of the database, the semantic effect and conciseness of a news title generated from set rules or grammar patterns are difficult to ensure, and the title cannot be compressed effectively.

Disclosure of Invention

The embodiment of the invention provides a title generation method and device, which are used for automatically generating high-quality short titles for news documents.

According to an aspect of an embodiment of the present invention, there is provided a title generation method including: acquiring original titles of all news documents in a first news set and splicing the original titles into title text strings, wherein the first news set comprises at least one news document related to the same news event; extracting a high-frequency word string from the title text string, and filtering the extracted high-frequency word string; and determining the word string with highest occurrence frequency in the filtered high-frequency word strings as the title of the first news set.

Optionally, the method further includes: obtaining the first news set by clustering a second news set, wherein the second news set includes at least the first news set.

Optionally, obtaining the first news set by clustering the second news set includes: calculating content similarity among the news documents in the second news set; and determining at least one candidate news set based on the content similarity, and determining the first news set from the at least one candidate news set.

Optionally, acquiring the original titles of the news documents in the first news set and splicing them into the title text string includes: setting punctuation marks between adjacent original titles in the title text string; and/or replacing corresponding word strings in the original titles with synonyms or abbreviations.

Optionally, filtering the extracted high-frequency word strings includes: filtering out, from the extracted high-frequency word strings, word strings that do not appear at the head or tail of an original title; and/or filtering out word strings that include punctuation marks; and/or filtering out word strings whose length is smaller than a set length threshold.

According to another aspect of the embodiment of the present invention, there is also provided a title generation apparatus, including: an acquisition module, configured to acquire original titles of all news documents in a first news set and splice the original titles into a title text string, wherein the first news set comprises at least one news document related to the same news event; an extraction and filtering module, configured to extract high-frequency word strings from the title text string and filter the extracted high-frequency word strings; and a generating module, configured to determine the word string with the highest occurrence frequency among the filtered high-frequency word strings as the title of the first news set.

Optionally, the apparatus further includes a clustering module, configured to obtain the first news set by clustering a second news set, wherein the second news set includes at least the first news set.

Optionally, the clustering module includes: a calculating unit, configured to calculate content similarity between the news documents in the second news set; and a determining unit, configured to determine at least one candidate news set according to the content similarity and determine the first news set from the at least one candidate news set.

Optionally, the acquisition module includes: a setting unit, configured to set punctuation marks between adjacent original titles in the title text string; and/or a replacing unit, configured to replace corresponding word strings in the original titles with synonyms or abbreviations.

Optionally, the extraction and filtering module includes a filtering unit, configured to: filter out, from the extracted high-frequency word strings, word strings that do not appear at the head or tail of an original title; and/or filter out word strings that include punctuation marks; and/or filter out word strings whose length is smaller than a set length threshold.

According to the title generation method and device of the embodiments of the invention, the original titles of a plurality of news documents related to the same news event are acquired and spliced into a title text string; high-frequency word strings are then extracted from the title text string and filtered so that only those conforming to title characteristics remain; and the most frequent of the filtered word strings is determined as the new title. A high-quality short title is thereby generated for each news document, and the semantic effect and conciseness of the title are ensured. Moreover, the calculation difficulty of short title generation is reduced, and the method has higher adaptability.

Drawings

Fig. 1 is a flowchart showing steps of a title generation method according to a first embodiment of the present invention;

Fig. 2 is a flowchart showing steps of a title generation method according to a second embodiment of the present invention;

Fig. 3 is a block diagram showing the construction of a title generation apparatus according to a third embodiment of the present invention;

Fig. 4 is a block diagram showing the construction of a title generation apparatus according to a fourth embodiment of the present invention.

Detailed Description

The following description of embodiments of the present invention will be made in further detail with reference to the drawings (like numerals designate like elements throughout the several views) and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present invention are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.

Example 1

Referring to fig. 1, a flowchart illustrating steps of a title generation method according to a first embodiment of the present invention is shown.

The title generation method of the present embodiment includes the steps of:

Step S102: the original titles of all news documents in the first news set are acquired and spliced into a title text string.

Wherein the first news collection includes at least one news document pertaining to the same news event.

In this embodiment, one or more news documents in the first news collection pertain to the same news event, which may be any news event. One or more news documents in the first news collection each have an original title.

After the first news set is acquired, the original titles of all news documents in the first news set are extracted, and the acquired original titles are spliced into one long text string to form the title text string.

Step S104: high-frequency word strings are extracted from the title text string, and the extracted high-frequency word strings are filtered.

A high-frequency word string is a word string in the title text string whose length exceeds a preset length (for example, two English words or two Chinese characters) and whose number of occurrences exceeds a preset number (for example, twice).
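The definition above can be illustrated with a minimal sketch. It assumes that the title text string has already been split into tokens, and the function name and the parameters `min_len`, `min_count`, and `max_len` are hypothetical. The thresholds are treated here as inclusive lower bounds, which is an assumption, since the text does not fix the exact comparison.

```python
from collections import Counter


def high_frequency_strings(tokens, min_len=2, min_count=2, max_len=8):
    """Count contiguous token runs and keep the frequent ones.

    min_len, min_count and max_len are illustrative parameters, treated
    here as inclusive bounds (an assumption, not fixed by the text).
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    # Keep only the runs that occur often enough.
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Three invented titles, already spliced with "." between them.
tokens = "storm hits coast . storm hits city . storm hits coast".split()
frequent = high_frequency_strings(tokens)
```

In this invented example, the run `("storm", "hits")` is counted three times. Runs that straddle a separator, such as `(".", "storm")`, are also counted at this stage and are left for the later filtering step to remove.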

For example, high-frequency word strings may be extracted from a title text string spliced from the original titles of the first news set shown in Table 1. After the high-frequency word strings are extracted, a filtering operation is performed on them to remove word strings that do not conform to title characteristics. In this embodiment, neither the extraction method nor the filtering rule for high-frequency word strings is limited.

Step S106: the word string with the highest occurrence frequency among the filtered high-frequency word strings is determined as the title of the first news set.

The filtered high-frequency word strings basically conform to title characteristics, and the one with the highest occurrence frequency is selected as the new title of each news document in the first news set. Using the filtered highest-frequency word string as the title, on the one hand, ensures the semantic effect of the new title: it can express the news event to which all news documents in the first news set refer and satisfies the basic characteristics of a title. On the other hand, it amounts to regenerating a short title for each news document in the first news set, which ensures the conciseness of the title.
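The selection in step S106 can be sketched as follows. The function name `pick_title` is hypothetical, and the tie-breaking rule (preferring the longer string when two candidates are equally frequent) is an assumption, since the text does not say how ties are resolved.

```python
def pick_title(filtered_counts):
    """Return the most frequent candidate string, or None if there is none.

    Ties are broken in favour of the longer string; this rule is an
    assumption made for the sketch.
    """
    if not filtered_counts:
        return None
    return max(filtered_counts, key=lambda s: (filtered_counts[s], len(s)))


# Invented counts for illustration.
title = pick_title({"storm hits coast": 4, "storm hits": 4, "coast": 2})
```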

According to the title generation method provided by the embodiment of the invention, the original titles of a plurality of news documents related to the same news event are acquired to be spliced into the title text strings, then the high-frequency word strings are extracted from the title text strings, the extracted high-frequency word strings are filtered to screen the high-frequency word strings conforming to the title characteristics, and then the filtered highest-frequency word strings are determined to be new titles, so that a high-quality short title is generated for each news document, and the semantic effect and the conciseness of the title are ensured.

Compared with the method for shortening the title based on the setting rule and the grammar mode in the prior art, the title generation method provided by the embodiment of the invention does not need to set a complex short title generation rule, and reduces the calculation difficulty of short title generation; moreover, the titles of all news documents can be acquired for splicing, screened and compressed without considering the setting rules and the coverage range of the database, and high-quality short titles can be automatically generated, so that the method has higher adaptability.

The title generation method of the present embodiment may be executed and implemented by any device having a corresponding data processing capability, including, but not limited to, a server side corresponding to a news web page.

Example 2

Referring to fig. 2, a flowchart illustrating steps of a title generation method according to a second embodiment of the present invention is shown.

The title generation method of the present embodiment includes the steps of:

Step S202: the first news set is obtained by clustering the second news set.

Wherein the second news set includes at least the first news set.

In this embodiment, the second news set includes at least one news document related to at least one news event, that is, the second news set may include other news documents related to other news events in addition to the at least one news document related to the same news event in the first news set.

By clustering the second news set, a class of news documents about the same news event is obtained as the first news set. In an alternative embodiment, content similarity between the news documents in the second news set is calculated, at least one candidate news set is determined according to the content similarity, and the first news set is determined from the at least one candidate news set.

Specifically, word segmentation and vectorization may be performed on each news document in the second news set, and the content similarity between news documents, for example the cosine of the angle between the news document vectors, may then be calculated. If the content similarity between two news documents is greater than a preset similarity threshold (e.g., 0.5), it may be determined that the two news documents are related to the same news event. That is, a plurality of news documents whose content similarity is greater than the similarity threshold may be determined to concern the same news event and further determined as a candidate news set. One or more candidate news sets may be determined from the second news set, and one of them may be determined to be the first news set.
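A minimal sketch of this clustering step is given below. Several choices are assumptions made for illustration: word segmentation is approximated by whitespace splitting (Chinese text would need a real segmenter), documents are vectorized as raw term counts, and documents are grouped transitively, so that any chain of pairs above the threshold ends up in one candidate set. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def candidate_sets(documents, threshold=0.5):
    """Group document indices whose pairwise similarity exceeds threshold."""
    # Whitespace splitting stands in for word segmentation (assumption).
    vectors = [Counter(doc.lower().split()) for doc in documents]
    parent = list(range(len(documents)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(documents)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Invented documents for illustration.
docs = [
    "storm hits the coast tonight",
    "storm hits the coast again",
    "markets rally on earnings",
]
sets = candidate_sets(docs)
```

With these invented documents, the first two share four of five terms (cosine 0.8) and form one candidate set, while the third stays alone. Any one of the returned sets could then be taken as the first news set.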

Step S204: the original titles of all news documents in the first news set are acquired and spliced into title text strings.

After the first news collection is determined, the original headlines of each news document in the first news collection are extracted to be spliced into a headline text string.

Optionally, when the original titles are spliced into the title text string, punctuation marks may be set between adjacent original titles to separate them, which prevents word strings from being formed across the end of one original title and the start of the next. Preferably, the same punctuation mark is set between every pair of adjacent original titles to reduce the amount of calculation. For example, a period is set at the end of each original title. In addition, periods may be used in place of the space characters within each original title.
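The splicing described here might look like the following sketch. The function name and the `replace_spaces` flag are hypothetical. The flag is off by default because, for titles written with spaces between words, replacing spaces with periods would also break the word boundaries; that default is an assumption of this sketch.

```python
def splice_titles(titles, separator=".", replace_spaces=False):
    """Join original titles into one title text string.

    Each title is terminated with the same separator so that no word
    string can span two titles. If replace_spaces is True, runs of
    whitespace inside a title are replaced by the separator as well.
    """
    parts = []
    for title in titles:
        text = title.strip()
        if replace_spaces:
            text = separator.join(text.split())
        parts.append(text + separator)
    return "".join(parts)


# Invented titles for illustration.
text = splice_titles(["Storm hits coast", "Coast braces for storm"])
```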

In this embodiment, after the original titles of the news documents in the first news set are extracted, corresponding word strings in each original title may be replaced with synonyms (shorter than the word strings they replace) or abbreviations to shorten the word strings, so that when a replaced word string is taken as the title, the title length is further reduced.
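As an illustration of this replacement, the sketch below applies a lookup table of shorter equivalents to a title. The table contents and the function name are hypothetical; the text does not say where the synonyms or abbreviations come from.

```python
def shorten_terms(title, replacements):
    """Replace phrases in a title with shorter synonyms or abbreviations.

    replacements maps a phrase to its shorter form. Entries whose
    replacement is not shorter than the phrase are skipped.
    """
    # Try longer phrases first so a long phrase is not pre-empted by a
    # shorter phrase contained in it.
    for phrase in sorted(replacements, key=len, reverse=True):
        short = replacements[phrase]
        if len(short) < len(phrase):
            title = title.replace(phrase, short)
    return title


# Hypothetical abbreviation table.
table = {"United Nations": "UN", "United States": "US"}
shortened = shorten_terms("United Nations urges United States to act", table)
```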

Step S206: a high frequency word string is extracted from the title text string.

In an alternative implementation, n-gram word string statistics are used to extract, from the title text string, word strings whose length is greater than a preset length and whose number of occurrences exceeds a preset number as high-frequency word strings. If the extracted high-frequency word strings include a same-frequency substring, the same-frequency substring is filtered out. For example, if the word strings "Chinese" and "Chinese people" both appear 4 times in the title text string and "Chinese people" contains "Chinese", then "Chinese" is a same-frequency substring of "Chinese people", and only the word string "Chinese people" is extracted as a high-frequency word string.
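The same-frequency substring rule can be sketched as below, assuming the candidate word strings and their counts are already available as a dictionary. The function name is hypothetical.

```python
def drop_same_frequency_substrings(counts):
    """Remove any string contained in a longer string with the same count."""
    kept = {}
    for s, c in counts.items():
        covered = any(
            other != s and s in other and counts[other] == c
            for other in counts
        )
        if not covered:
            kept[s] = c
    return kept


# Counts from the example in the text, plus one invented entry ("people").
result = drop_same_frequency_substrings(
    {"Chinese": 4, "Chinese people": 4, "people": 6}
)
```

Here "Chinese" is dropped because "Chinese people" contains it and occurs equally often, while "people" is kept because its count differs.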

Step S208: word strings that do not appear at the head or tail of an original title, word strings that include punctuation marks, and word strings whose length is less than a set length threshold are filtered out of the extracted high-frequency word strings.

In this embodiment, word strings that do not appear at the head or tail of an original title are filtered out of the extracted high-frequency word strings; and/or word strings that include punctuation marks are filtered out; and/or word strings whose length is smaller than a set length threshold are filtered out. A word string that does not appear at the head or tail of an original title is unlikely to serve as a title, a word string that includes punctuation marks usually cannot serve as a title, and a word string shorter than the set length threshold cannot clearly express the news event. Filtering out these word strings makes the remaining high-frequency word strings better match title characteristics.
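The three filtering rules can be combined in one pass, as in the sketch below. The function name, the character-based `min_length` default, and the punctuation set are assumptions made for illustration. In particular, Python's `string.punctuation` includes characters such as the hyphen and apostrophe that may occur in legitimate titles, so a real system would likely use a narrower set.

```python
import string

# ASCII punctuation plus a few common Chinese marks (illustrative set).
PUNCTUATION = string.punctuation + "。，、；：？！"


def filter_candidates(counts, titles, min_length=4, punctuation=PUNCTUATION):
    """Keep candidates that look like titles.

    A candidate is kept only if it starts or ends at least one original
    title, contains no punctuation, and has at least min_length characters.
    """
    kept = {}
    for s, c in counts.items():
        at_edge = any(t.startswith(s) or t.endswith(s) for t in titles)
        has_punct = any(ch in punctuation for ch in s)
        if at_edge and not has_punct and len(s) >= min_length:
            kept[s] = c
    return kept


# Invented titles and candidate counts for illustration.
titles = ["Storm hits coast", "Storm hits coast overnight"]
candidates = {"Storm hits coast": 2, "hits": 2, "coast.Storm": 1, "St": 2}
kept = filter_candidates(candidates, titles)
```

Of the invented candidates, "hits" is dropped for its position, "coast.Storm" for its punctuation, and "St" for its length.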

In other embodiments, any one or more of the above three kinds of word strings that do not fit the title characteristics may be filtered out of the extracted high-frequency word strings, and other word strings that do not fit the title characteristics may also be filtered out.

Step S210: the word string with the highest occurrence frequency among the filtered high-frequency word strings is determined as the title of the first news set.

The title generation method of this embodiment may be regarded as an optional specific implementation of the title generation method of the first embodiment; for steps that are the same, refer to the implementation of the corresponding steps in the first embodiment.

According to the title generation method of this embodiment, news documents related to the same news event are aggregated by clustering, and their original titles are extracted and spliced into a title text string. High-frequency word strings are then extracted from the title text string and screened based on features such as position and length, and the highest-frequency word string that conforms to title characteristics is selected as the new title. A high-quality short title is thus generated for each news document, and the semantic effect and conciseness of the title are ensured. Moreover, because high-quality short titles are generated automatically, the calculation difficulty of short title generation is reduced, and the method has high adaptability.

Example 3

Referring to fig. 3, there is shown a block diagram of a title generation apparatus according to a third embodiment of the present invention.

The title generation apparatus of the present embodiment includes an acquisition module 302, an extraction and filtering module 304, and a generating module 306. The acquisition module 302 is configured to acquire the original title of each news document in a first news set and splice the original titles into a title text string, where the first news set includes at least one news document related to the same news event. The extraction and filtering module 304 is configured to extract high-frequency word strings from the title text string and filter the extracted high-frequency word strings. The generating module 306 is configured to determine the word string with the highest occurrence frequency among the filtered high-frequency word strings as the title of the first news set.
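One way to picture how the three modules fit together is the sketch below, in which each module is a callable and the apparatus simply chains them. The class name, field names, and the toy stand-in callables are all hypothetical; the text does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TitleGenerationApparatus:
    # Stands in for acquisition module 302: titles -> title text string.
    acquire: Callable[[List[str]], str]
    # Stands in for module 304: title text string -> filtered counts.
    extract_and_filter: Callable[[str], Dict[str, int]]
    # Stands in for generating module 306: filtered counts -> title.
    generate: Callable[[Dict[str, int]], Optional[str]]

    def run(self, original_titles: List[str]) -> Optional[str]:
        text = self.acquire(original_titles)
        candidates = self.extract_and_filter(text)
        return self.generate(candidates)


# Toy stand-ins, only to show the data flow between the modules.
apparatus = TitleGenerationApparatus(
    acquire=lambda ts: ".".join(ts) + ".",
    extract_and_filter=lambda text: {"storm hits": text.count("storm hits")},
    generate=lambda c: max(c, key=c.get) if c else None,
)
new_title = apparatus.run(["storm hits coast", "storm hits city"])
```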

According to the title generation apparatus provided by the embodiment of the invention, the original titles of a plurality of news documents related to the same news event are acquired and spliced into a title text string, high-frequency word strings are then extracted from the title text string and filtered so that only those conforming to title characteristics remain, and the most frequent of the filtered word strings is determined as the new title, so that a high-quality short title is generated for each news document and the semantic effect and conciseness of the title are ensured; moreover, the calculation difficulty of short title generation is reduced, and the apparatus has higher adaptability.

Example 4

Referring to fig. 4, there is shown a block diagram of a title generating apparatus according to a fourth embodiment of the present invention.

The title generation apparatus of this embodiment includes an acquisition module 402, an extraction and filtering module 404, and a generating module 406. The acquisition module 402 is configured to acquire the original title of each news document in a first news set and splice the original titles into a title text string, where the first news set includes at least one news document related to the same news event. The extraction and filtering module 404 is configured to extract high-frequency word strings from the title text string and filter the extracted high-frequency word strings. The generating module 406 is configured to determine the word string with the highest occurrence frequency among the filtered high-frequency word strings as the title of the first news set.

Optionally, the title generation apparatus of this embodiment further includes a clustering module 408, configured to obtain the first news set by clustering a second news set, where the second news set includes at least the first news set.

Optionally, the clustering module 408 includes a calculating unit 4082 and a determining unit 4084, where the calculating unit 4082 is configured to calculate content similarity between the news documents in the second news set, and the determining unit 4084 is configured to determine at least one candidate news set according to the content similarity and determine the first news set from the at least one candidate news set.

Optionally, the acquisition module 402 includes a setting unit 4022 and/or a replacing unit 4024, where the setting unit 4022 is configured to set punctuation marks between adjacent original titles in the title text string, and the replacing unit 4024 is configured to replace corresponding word strings in the original titles with synonyms or abbreviations.

Optionally, the extraction and filtering module 404 includes an extraction unit 4042 and a filtering unit 4044. The extraction unit 4042 is configured to extract high-frequency word strings from the title text string. The filtering unit 4044 is configured to filter out, from the extracted high-frequency word strings, word strings that do not appear at the head or tail of an original title; and/or word strings that include punctuation marks; and/or word strings whose length is smaller than a set length threshold.

The title generation apparatus of the present embodiment is used to implement the title generation method of the first embodiment or the second embodiment and has the beneficial effects of the corresponding method embodiment, which are not described again here.

It should be noted that, according to implementation requirements, each component/step described in the embodiments of the present invention may be split into more components/steps, or two or more components/steps or part of operations of the components/steps may be combined into new components/steps, so as to achieve the objects of the embodiments of the present invention.

The methods according to embodiments of the present invention described above may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA. It is understood that a computer, processor, microprocessor controller, or programmable hardware includes a memory component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing those processes.

Those of ordinary skill in the art will appreciate that the elements and method steps of the examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present invention.

The above embodiments are only for illustrating the embodiments of the present invention, but not for limiting the embodiments of the present invention, and various changes and modifications may be made by one skilled in the relevant art without departing from the spirit and scope of the embodiments of the present invention, so that all equivalent technical solutions also fall within the scope of the embodiments of the present invention, and the scope of the embodiments of the present invention should be defined by the claims.

Claims

1. A title generation method, comprising: acquiring original titles of all news documents in a first news set and splicing the original titles into title text strings, wherein the first news set comprises at least one news document related to the same news event;

extracting a high-frequency word string from the title text string, and filtering the extracted high-frequency word string; determining the word string with highest occurrence frequency in the filtered high-frequency word strings as the title of the first news set;

further comprises: obtaining a first news set by clustering a second news set, wherein the second news set at least comprises the first news set;

the obtaining the first news collection by clustering the second news collection includes:

calculating content similarity among all news documents in the second news collection;

determining at least one candidate news set according to the content similarity, and determining the first news set from the at least one candidate news set;

the obtaining the original titles of all news documents in the first news set and splicing the original titles into title text strings comprises the following steps:

punctuation marks are arranged between adjacent original titles in the title text strings; and/or,

replacing the corresponding word strings in the original title by synonyms or abbreviations;

the filtering the extracted high-frequency word strings comprises the following steps:

filtering word strings which do not appear at the head or tail of the original title from the extracted high-frequency word strings; and/or,

filtering word strings comprising punctuation marks from the extracted high-frequency word strings; and/or,

and filtering out word strings with word string lengths smaller than a set length threshold value from the extracted high-frequency word strings.

2. A title generation apparatus, comprising:

an acquisition module, configured to acquire original titles of all news documents in a first news set and splice the original titles into title text strings, wherein the first news set comprises at least one news document related to the same news event;

the extraction and filtration module is used for extracting a high-frequency word string from the title text string and filtering the extracted high-frequency word string;

the generation module is used for determining the word string with highest occurrence frequency in the filtered high-frequency word strings as the title of the first news set;

further comprises:

the clustering module is used for acquiring the first news set by clustering a second news set, wherein the second news set at least comprises the first news set;

the clustering module comprises:

a calculating unit, configured to calculate content similarity between news documents in the second news set;

a determining unit, configured to determine at least one candidate news set according to the content similarity, and determine the first news set from the at least one candidate news set;

the acquisition module comprises:

a setting unit, configured to set punctuation marks between adjacent original titles in the title text string; and/or,

the replacing unit is used for replacing the corresponding word strings in the original title by synonyms or abbreviations;

the extraction filtration module comprises a filtration unit for: