CN112597776A

CN112597776A - Keyword extraction method and system

Info

Publication number: CN112597776A
Application number: CN202110251354.3A
Authority: CN
Inventors: 郑志军; 程国艮
Original assignee: Global Tone Communication Technology Co ltd
Current assignee: Global Tone Communication Technology Co ltd
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2021-04-02

Abstract

The invention discloses a keyword extraction method and system. The method comprises the following steps: segmenting a news text into a sequence in which words are the minimum semantic units; performing entity recognition on the news text; combining at least two adjacent words in the sequence to obtain a combined word, judging whether the combined word is an entity word, and if so, replacing the constituent words in the sequence with that entity word; extracting a first candidate keyword vocabulary set from the sequence with a keyword extraction algorithm based on a word graph model, and extracting a second candidate keyword vocabulary set from the sequence with a keyword extraction algorithm based on statistical characteristics; and computing the intersection of the first and second candidate keyword vocabulary sets. The keyword extraction method and system can extract keywords quickly and accurately.

Description

Keyword extraction method and system

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a keyword extraction method and system.

Background

With the spread of the internet, more and more people obtain information online. Reading news has become part of daily life, but the web holds a vast amount of text, so how to help people browse news quickly and grasp its main idea has long been a research focus.

Keyword extraction is a common task in natural language processing (NLP). It extracts the words most relevant to the meaning of an article, so that a user can quickly grasp the main idea of the text by reading its keywords; this reduces, to some extent, the time people spend browsing information. Common keyword extraction methods currently fall into two categories: unsupervised and supervised.

An unsupervised keyword extraction method extracts candidate words, scores each candidate, and outputs the highest-scoring candidates as keywords. Depending on the scoring strategy, such methods can be divided into keyword extraction based on a word graph model, keyword extraction based on statistical characteristics, and keyword extraction based on a topic model. Specifically, extraction based on a word graph model constructs a word network graph of the document, analyzes the words in the graph, and finds the words or phrases that play an important role in it; these words or phrases are the keywords of the document. Extraction based on statistical characteristics uses statistical information about the words in the document. Extraction based on a topic model mainly uses the topic distribution properties of the topic model.
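As an illustration of the word graph approach described above, the following is a minimal sketch that assumes a TextRank-style ranking: words are linked when they co-occur within a small window, and a PageRank iteration scores them. The source does not name a specific algorithm; the function name `textrank_scores` and the parameters `window`, `damping`, and `iterations` are hypothetical choices made for this example.

```python
from collections import defaultdict


def textrank_scores(tokens, window=2, damping=0.85, iterations=30):
    """Score words by PageRank over an undirected co-occurrence graph.

    Assumption: a TextRank-style ranker stands in for the unspecified
    "word graph model" algorithm.
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it within the window.
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return scores


tokens = "trade talks resume as trade ministers meet for trade talks".split()
scores = textrank_scores(tokens)
best = max(scores, key=scores.get)
```

In this toy sequence the word with the most distinct neighbors receives the highest score; the highest-scoring words would then be taken as candidates.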

A supervised keyword extraction method treats keyword extraction as a classification task or a sequence labeling task. In the classification task, candidate words are extracted first, and each candidate is then classified as a keyword or a non-keyword (binary classification). In the sequence labeling task, an algorithm labels the minimum semantic units of the text (characters, words, and the like) and extracts the keywords by combining the labels.
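The step of "combining the labels" in the sequence labeling task can be sketched as follows. This is only an illustration under an assumed B/I/O tagging scheme (B begins a keyword, I continues it, O is outside any keyword); the source does not specify the label set, and `decode_bio` and its `joiner` parameter are hypothetical names.

```python
def decode_bio(tokens, tags, joiner=" "):
    """Collect keyword spans from per-token B/I/O labels (assumed scheme)."""
    keywords, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyword starts; close any keyword still open.
            if current:
                keywords.append(joiner.join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O" (or a stray "I" with no open keyword) ends the current span.
            if current:
                keywords.append(joiner.join(current))
            current = []
    if current:
        keywords.append(joiner.join(current))
    return keywords


tokens = ["central", "bank", "raises", "interest", "rates"]
tags = ["B", "I", "O", "B", "I"]
keywords = decode_bio(tokens, tags)
```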

In the course of realizing the present invention, the inventor found that supervised keyword extraction requires costly manual annotation of the corpus, which makes it difficult to apply at scale. Unsupervised methods need no manually labeled training set and are therefore faster, but they perform poorly: word segmentation errors occur, the various kinds of information cannot be used effectively and comprehensively to screen keywords, and the resulting keywords are not ordered logically.

The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

The invention aims to provide a keyword extraction method and a keyword extraction system, which can quickly and accurately extract keywords.

To achieve the above object, the present invention provides a keyword extraction method, which includes: segmenting a news text into a sequence with words as the minimum semantic units; performing entity recognition on the news text and extracting each entity word; combining at least two adjacent words in the sequence to obtain a combined word, judging whether the combined word is an entity word, and if so, replacing the constituent words in the sequence with that entity word; extracting candidate keyword words from the sequence with a keyword extraction algorithm based on a word graph model so as to obtain a first candidate keyword vocabulary set, and extracting candidate keyword words from the sequence with a keyword extraction algorithm based on statistical characteristics so as to obtain a second candidate keyword vocabulary set; and computing the intersection of the first candidate keyword vocabulary set and the second candidate keyword vocabulary set.

In an embodiment of the present invention, the keyword extraction method further includes: sorting the candidate keyword words in the intersection according to the order in which they appear in the news text, so as to obtain a third candidate keyword vocabulary set.

In an embodiment of the present invention, the keyword extraction method further includes: sorting the candidate keyword words in the intersection according to a linguistic rule, so as to obtain a third candidate keyword vocabulary set.

In an embodiment of the present invention, the keyword extraction method further includes: calculating the mutual information of two adjacent words in the third candidate keyword vocabulary set, and combining two adjacent words whose mutual information is greater than a preset threshold into one word, so as to obtain a final keyword vocabulary set.

Based on the same inventive concept, the invention also provides a keyword extraction system comprising: a word segmentation module, an entity recognition module, a first combination module, a first keyword extraction algorithm module, a second keyword extraction algorithm module, and an intersection solving module. The word segmentation module is used for segmenting the news text into a sequence with words as the minimum semantic units. The entity recognition module is coupled with the word segmentation module and is used for performing entity recognition on the news text and extracting each entity word. The first combination module is coupled with the word segmentation module and the entity recognition module and is used for combining at least two adjacent words in the sequence to obtain a combined word, judging whether the combined word is an entity word, and if so, replacing the constituent words in the sequence with that entity word. The first keyword extraction algorithm module is coupled with the first combination module and is used for extracting candidate keyword words from the sequence with a keyword extraction algorithm based on a word graph model, so as to obtain a first candidate keyword vocabulary set. The second keyword extraction algorithm module is coupled with the first combination module and is used for extracting candidate keyword words from the sequence with a keyword extraction algorithm based on statistical characteristics, so as to obtain a second candidate keyword vocabulary set. The intersection solving module is coupled with the first keyword extraction algorithm module and the second keyword extraction algorithm module and is used for computing the intersection of the first candidate keyword vocabulary set and the second candidate keyword vocabulary set.

In an embodiment of the present invention, the keyword extraction system further includes a sorting module, which is coupled with the intersection solving module and is used for sorting the candidate keyword words in the intersection according to the order in which they appear in the news text, so as to obtain a third candidate keyword vocabulary set.

In an embodiment of the present invention, the keyword extraction system further includes a sorting module, which is coupled with the intersection solving module and is used for sorting the candidate keyword words in the intersection according to a linguistic rule, so as to obtain a third candidate keyword vocabulary set.

In an embodiment of the present invention, the keyword extraction system further includes a second combination module, which is coupled with the sorting module and is used for calculating the mutual information of two adjacent words in the third candidate keyword vocabulary set and combining two adjacent words whose mutual information is greater than a preset threshold into one word, so as to obtain a final keyword vocabulary set.

Based on the same inventive concept, the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the keyword extraction method according to any of the above embodiments.

Based on the same inventive concept, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the keyword extraction method according to any one of the above embodiments.

Compared with the prior art, the keyword extraction method and system of the invention require no corpus annotation. Entity recognition repairs wrongly segmented words, the word graph model and the statistical characteristics each screen out a keyword set, and the intersection of the two sets is taken, so that keywords can be extracted quickly and accurately. Preferably, the keywords in the intersection are also sorted and words are combined by means of mutual information, which further improves extraction accuracy.

Drawings

FIG. 1 is a block diagram of the steps of a keyword extraction method according to an embodiment of the present invention;

FIG. 2 is a block diagram of the steps of a keyword extraction method according to an embodiment of the present invention;

FIG. 3 is a block diagram of a keyword extraction system according to an embodiment of the present invention;

FIG. 4 is a block diagram of a keyword extraction system according to an embodiment of the present invention.

Detailed Description

The following detailed description of the present invention is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.

Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.

To overcome the above problems, the invention provides a keyword extraction method that can extract keywords quickly and accurately.

FIG. 1 shows a keyword extraction method according to an embodiment of the present invention. The method comprises steps S1 to S5.

In step S1, the news text is segmented into a sequence with words as the minimum semantic units. The inventor also found that conventional word segmentation tends to split entity words apart, which hinders accurate keyword extraction. Therefore, in this embodiment, the text is first segmented, the segmented words are then combined and checked against the recognized entities, and if a combination is an entity, the original words are replaced directly by that entity.

Thus, in step S2, entity recognition is performed on the news text and each entity word is extracted. In step S3, at least two adjacent words in the sequence are combined to obtain a combined word, and it is judged whether the combined word is an entity word; if it is, the words that were combined are replaced in the sequence with that entity word.
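Steps S1 to S3 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the segmented sequence and the recognized entity words are already available from some segmenter and entity recognizer, and it tries the longest combination first. The names `merge_entities`, `max_span`, and `joiner` are hypothetical.

```python
def merge_entities(tokens, entities, max_span=4, joiner=""):
    """Replace runs of adjacent tokens whose combination is a known entity.

    joiner="" suits unspaced scripts such as Chinese; use " " for
    space-delimited text. max_span is an assumed upper bound on how many
    adjacent words are combined at once.
    """
    entity_set = set(entities)
    merged = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest combination first so that longer entities win.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            if joiner.join(tokens[i:i + span]) in entity_set:
                match_len = span
                break
        if match_len:
            # Replace the constituent words with the single entity word.
            merged.append(joiner.join(tokens[i:i + match_len]))
            i += match_len
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sequence = merge_entities(
    ["New", "York", "hosts", "summit"], {"New York"}, joiner=" "
)
```

Here the two words wrongly split by segmentation are restored to the single entity word before any keyword scoring takes place.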

In step S4, a keyword extraction algorithm based on a word graph model extracts candidate keyword words from the sequence to obtain a first candidate keyword vocabulary set, and a keyword extraction algorithm based on statistical characteristics extracts candidate keyword words from the sequence to obtain a second candidate keyword vocabulary set.
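The statistical branch of step S4 could, for instance, use TF-IDF; the source does not name the statistical algorithm, so this is an assumption. The sketch below also assumes a precomputed document-frequency table `doc_freq` over a background corpus of `n_docs` documents and a smoothed IDF formula; all names are hypothetical.

```python
import math
from collections import Counter


def tfidf_top_k(tokens, doc_freq, n_docs, k=5):
    """Return the k words with the highest TF-IDF score as a set.

    Assumption: TF-IDF with add-one smoothed IDF stands in for the
    unspecified "statistical characteristics" algorithm.
    """
    counts = Counter(tokens)
    scores = {
        word: (count / len(tokens))
        * math.log((1 + n_docs) / (1 + doc_freq.get(word, 0)))
        for word, count in counts.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:k])


tokens = ["tariff", "on", "steel", "and", "tariff", "on", "aluminium"]
doc_freq = {"on": 90, "and": 95, "tariff": 5, "steel": 10, "aluminium": 8}
second_set = tfidf_top_k(tokens, doc_freq, n_docs=100, k=2)
```

Frequent function words such as "on" score low despite appearing twice, because they occur in most documents of the assumed background corpus.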

In step S5, an intersection of the first candidate keyword vocabulary set and the second candidate keyword vocabulary set is obtained.
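Step S5 reduces to a set intersection. A minimal sketch, with the hypothetical helper name `intersect_candidates` and made-up candidate words:

```python
def intersect_candidates(first_set, second_set):
    """Keep only the candidates proposed by both extraction algorithms."""
    return set(first_set) & set(second_set)


first_set = {"tariff", "steel", "trade"}       # e.g. from the word graph model
second_set = {"tariff", "aluminium", "steel"}  # e.g. from statistical features
shared = intersect_candidates(first_set, second_set)
```

A word survives only if both algorithms select it, which trades some recall for precision.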

Therefore, the keyword extraction method of this embodiment requires no corpus annotation. It uses entity recognition to repair wrongly segmented words, screens out candidate keyword vocabulary sets with a word graph model algorithm and a statistical characteristic algorithm respectively, and takes the intersection of the two sets as the keyword extraction result, which effectively improves extraction accuracy while keeping extraction fast.

Preferably, the keyword extraction method further includes: step S6 and step S7.

In step S6, the candidate keyword words in the intersection are sorted to obtain a third candidate keyword vocabulary set. The keywords in the intersection generally have no logical order. Optionally, in one embodiment, step S6 sorts the candidate keyword words according to the order in which they appear in the news text, which is relatively fast to compute. In another embodiment, step S6 sorts the candidate keyword words in the intersection according to a linguistic rule (e.g., an n-gram determination method), which yields a more logical ordering and can improve the accuracy of the combination result in the subsequent step S7.
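The first sorting variant of step S6, ordering by appearance in the news text, can be sketched as below. The function name `order_by_first_appearance` is hypothetical, and using the position of each word's first occurrence in the segmented sequence is an assumption about how "order of appearance" is measured.

```python
def order_by_first_appearance(candidates, tokens):
    """Sort candidate keywords by where they first occur in the sequence."""
    first_index = {}
    for i, token in enumerate(tokens):
        # setdefault keeps the earliest position of each word.
        first_index.setdefault(token, i)
    # Candidates absent from the sequence are dropped (not expected in practice).
    present = [c for c in candidates if c in first_index]
    return sorted(present, key=first_index.__getitem__)


tokens = ["tariff", "on", "steel", "and", "tariff", "on", "aluminium"]
third_set = order_by_first_appearance({"steel", "tariff"}, tokens)
```

This needs only one pass over the sequence, which matches the remark that this variant is fast to compute.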

In step S7, the mutual information of each pair of adjacent words in the third candidate keyword vocabulary set is calculated, and two adjacent words whose mutual information exceeds a preset threshold are combined into one word, yielding the final keyword vocabulary set. In this way, mutual information is used to combine words that commonly occur together in news texts, which can further improve the accuracy of keyword extraction. For example, current political news often contains specialized terms that are not entities but are split apart by word segmentation; this embodiment can combine the two resulting words into one.
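Step S7 can be sketched with pointwise mutual information (PMI), PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), estimated from unigram and adjacent-bigram counts. The source does not state which mutual information estimate, corpus, or threshold is used, so the counting scheme, the greedy left-to-right merge, and the names `pmi_table` and `merge_by_pmi` are assumptions made for illustration.

```python
import math
from collections import Counter


def pmi_table(corpus_tokens):
    """Build a PMI function for ordered adjacent word pairs from raw counts."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(left, right):
        joint = bigrams[(left, right)]
        if joint == 0:
            return float("-inf")  # never seen adjacent: never merged
        p_joint = joint / n_bi
        p_left = unigrams[left] / n_uni
        p_right = unigrams[right] / n_uni
        return math.log(p_joint / (p_left * p_right))

    return pmi


def merge_by_pmi(ordered_keywords, pmi, threshold, joiner=" "):
    """Greedily merge neighbouring keywords whose PMI exceeds the threshold."""
    merged = []
    i = 0
    while i < len(ordered_keywords):
        if (i + 1 < len(ordered_keywords)
                and pmi(ordered_keywords[i], ordered_keywords[i + 1]) > threshold):
            merged.append(joiner.join(ordered_keywords[i:i + 2]))
            i += 2
        else:
            merged.append(ordered_keywords[i])
            i += 1
    return merged


corpus = ("the trade war hurt exports while the trade war "
          "lifted prices and exports fell").split()
final = merge_by_pmi(["trade", "war", "exports"], pmi_table(corpus), threshold=1.0)
```

Because the merge looks at neighbours in the sorted list, the quality of the ordering from step S6 directly affects which pairs are considered.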

Based on the same inventive concept, an embodiment further provides a keyword extraction system, which includes: the system comprises a word segmentation module 10, an entity recognition module 11, a first combination module 12, a first keyword extraction algorithm module 13, a second keyword extraction algorithm module 14 and an intersection solving module 15.

The word segmentation module 10 is configured to segment the news text into a sequence with words as the minimum semantic units.

The entity recognition module 11 is coupled to the word segmentation module 10, and is configured to perform entity recognition on the news text and extract each entity word.

The first combination module 12 is coupled to the word segmentation module 10 and the entity recognition module 11, and is configured to combine at least two adjacent words in the sequence to obtain a combined word, determine whether the combined word is an entity word, and if so, replace the constituent words in the sequence with that entity word.

The first keyword extraction algorithm module 13 is coupled to the first combination module 12, and is configured to extract candidate keyword words from the sequence with a keyword extraction algorithm based on a word graph model, so as to obtain a first candidate keyword vocabulary set.

The second keyword extraction algorithm module 14 is coupled to the first combination module 12, and is configured to extract candidate keyword words from the sequence with a keyword extraction algorithm based on statistical characteristics, so as to obtain a second candidate keyword vocabulary set.

The intersection finding module 15 is coupled to both the first keyword extraction algorithm module 13 and the second keyword extraction algorithm module 14, and is configured to find an intersection of the first candidate keyword vocabulary set and the second candidate keyword vocabulary set.

Preferably, the keyword extraction system of an embodiment further includes: a sorting module 16 and a second combining module 17.

The sorting module 16 is coupled to the intersection solving module 15, and is configured to sort the candidate keyword words in the intersection according to the order in which they appear in the news text, so as to obtain a third candidate keyword vocabulary set. In other embodiments, the sorting module 16 is configured to sort the candidate keyword words in the intersection according to a linguistic rule, so as to obtain a third candidate keyword vocabulary set.

The second combination module 17 is coupled to the sorting module 16, and is configured to calculate the mutual information of two adjacent words in the third candidate keyword vocabulary set and combine two adjacent words whose mutual information is greater than a preset threshold into one word, so as to obtain a final keyword vocabulary set.
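The coupling between modules 10 to 17 can be pictured as a simple pipeline of callables. The sketch below is only one possible arrangement; the class name `KeywordPipeline`, its field names, and the stub functions in the usage example are hypothetical and stand in for real segmentation, entity recognition, and extraction components.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class KeywordPipeline:
    segment: Callable[[str], List[str]]                # word segmentation module 10
    recognize: Callable[[str], Set[str]]               # entity recognition module 11
    merge: Callable[[List[str], Set[str]], List[str]]  # first combination module 12
    graph_extract: Callable[[List[str]], Set[str]]     # first extraction module 13
    stat_extract: Callable[[List[str]], Set[str]]      # second extraction module 14
    order: Callable[[Set[str], List[str]], List[str]]  # sorting module 16
    combine: Callable[[List[str]], List[str]]          # second combination module 17

    def run(self, text: str) -> List[str]:
        sequence = self.merge(self.segment(text), self.recognize(text))
        # Intersection solving module 15: keep words chosen by both extractors.
        shared = self.graph_extract(sequence) & self.stat_extract(sequence)
        return self.combine(self.order(shared, sequence))


# Usage with trivial stand-in components (placeholders, not real algorithms).
pipeline = KeywordPipeline(
    segment=str.split,
    recognize=lambda text: set(),
    merge=lambda tokens, entities: tokens,
    graph_extract=lambda seq: set(seq[:3]),
    stat_extract=lambda seq: set(seq[1:]),
    order=lambda shared, seq: [w for w in seq if w in shared],
    combine=lambda keywords: keywords,
)
result = pipeline.run("a b c d")
```

Passing each module in as a callable keeps the sorting and combination stages optional in spirit: an identity function can be supplied where an embodiment omits them.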

Based on the same inventive concept, an embodiment further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the keyword extraction method according to any of the above embodiments when executing the program.

Based on the same inventive concept, an embodiment further provides a non-transitory computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the keyword extraction method according to any of the above embodiments.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. A keyword extraction method, characterized by comprising the following steps:

segmenting a news text into a sequence with words as the minimum semantic units;

performing entity recognition on the news text and extracting each entity word;

combining at least two adjacent words in the sequence to obtain a combined word, judging whether the combined word is an entity word, and if so, replacing the constituent words in the sequence with that entity word;

extracting candidate keyword words from the sequence with a keyword extraction algorithm based on a word graph model so as to obtain a first candidate keyword vocabulary set, and extracting candidate keyword words from the sequence with a keyword extraction algorithm based on statistical characteristics so as to obtain a second candidate keyword vocabulary set; and

computing the intersection of the first candidate keyword vocabulary set and the second candidate keyword vocabulary set.

2. The keyword extraction method according to claim 1, further comprising:

sorting the candidate keyword words in the intersection according to the order in which they appear in the news text, so as to obtain a third candidate keyword vocabulary set.

3. The keyword extraction method according to claim 1, further comprising:

sorting the candidate keyword words in the intersection according to a linguistic rule, so as to obtain a third candidate keyword vocabulary set.

4. The keyword extraction method according to claim 2 or 3, characterized by further comprising:

calculating the mutual information of two adjacent words in the third candidate keyword vocabulary set, and combining two adjacent words whose mutual information is greater than a preset threshold into one word, so as to obtain a final keyword vocabulary set.

5. A keyword extraction system, comprising:

a word segmentation module, configured to segment a news text into a sequence with words as the minimum semantic units;

an entity recognition module, coupled to the word segmentation module and configured to perform entity recognition on the news text and extract each entity word;

a first combination module, coupled to the word segmentation module and the entity recognition module and configured to combine at least two adjacent words in the sequence to obtain a combined word, judge whether the combined word is an entity word, and if so, replace the constituent words in the sequence with that entity word;

a first keyword extraction algorithm module, coupled to the first combination module and configured to extract candidate keyword words from the sequence with a keyword extraction algorithm based on a word graph model, so as to obtain a first candidate keyword vocabulary set;

a second keyword extraction algorithm module, coupled to the first combination module and configured to extract candidate keyword words from the sequence with a keyword extraction algorithm based on statistical characteristics, so as to obtain a second candidate keyword vocabulary set; and

an intersection solving module, coupled to the first keyword extraction algorithm module and the second keyword extraction algorithm module and configured to compute the intersection of the first candidate keyword vocabulary set and the second candidate keyword vocabulary set.

6. The keyword extraction system of claim 5, wherein the keyword extraction system further comprises:

a sorting module, coupled to the intersection solving module and configured to sort the candidate keyword words in the intersection according to the order in which they appear in the news text, so as to obtain a third candidate keyword vocabulary set.

7. The keyword extraction system of claim 5, wherein the keyword extraction system further comprises:

a sorting module, coupled to the intersection solving module and configured to sort the candidate keyword words in the intersection according to a linguistic rule, so as to obtain a third candidate keyword vocabulary set.

8. The keyword extraction system according to claim 6 or 7, wherein the keyword extraction system further comprises:

a second combination module, coupled to the sorting module and configured to calculate the mutual information of two adjacent words in the third candidate keyword vocabulary set and combine two adjacent words whose mutual information is greater than a preset threshold into one word, so as to obtain a final keyword vocabulary set.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 4 are implemented when the processor executes the program.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.