CN110427626B - Keyword extraction method and device

Info

Publication number: CN110427626B
Application number: CN201910703459.0A
Authority: CN
Inventors: 崔峭
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2022-12-09
Anticipated expiration: 2039-07-31
Also published as: CN110427626A

Abstract

The invention provides a keyword extraction method and device. Specifically, the method comprises the following steps: performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text; performing semantic analysis on each word in the text to determine a second weight value corresponding to each word; adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculating a third weight value of each word according to the adjusted TF value; and extracting specified words from the words as keywords for retrieval according to the third weight value. The method and the device solve the problems that TF-IDF calculation depends on multiple content-related documents, cannot calculate word weights for a single text, and performs poorly on discrete data with low relevance, and thereby improve the accuracy of keyword extraction.

Description

Keyword extraction method and device

Technical Field

The invention relates to the field of communication, in particular to a keyword extraction method and device.

Background

At present, the most common retrieval systems are keyword-based, and keyword extraction almost always uses the term frequency (TF) and inverse document frequency (IDF) calculation method. However, TF-IDF calculation relies on multiple content-related documents, cannot calculate word weights for a single text, and performs poorly on discrete data with low relevance.

Disclosure of Invention

Embodiments of the invention provide a keyword extraction method and device, which at least solve the problems in the related art that TF-IDF calculation depends on multiple content-related documents, cannot calculate word weights for a single text, and performs poorly on discrete data with low relevance.

According to an embodiment of the present invention, there is provided a keyword extraction method, including: performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text; performing semantic analysis on each word in the text to determine a second weight value corresponding to each word; adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculating a third weight value of each word according to the adjusted TF value; and extracting specified words from the words as keywords for retrieval according to the third weight value.

Optionally, before performing semantic analysis on the words in the text, the method further comprises: performing word segmentation on the text according to a preset rule, and determining the part of speech of each word according to the relations between the words after segmentation.

Optionally, performing semantic analysis on each word in the text to determine the second weight value corresponding to each word includes: ordering the words according to a preset part-of-speech priority rule; and assigning the corresponding second weight value to each word according to the part-of-speech priority order.

Optionally, adjusting the TF value of each word according to the first weight value and the second weight value includes: obtaining the TF value of each word; and multiplying the TF value by the first weight value and the second weight value to determine the adjusted TF value.

Optionally, calculating the third weight value of each word according to the adjusted TF value includes: obtaining the IDF value of each word; and determining the third weight value according to the adjusted TF value and the IDF value of each word.
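Read together, the two optional steps above can be written compactly. The notation is illustrative and does not appear in the original text: w_1(t) is the first (structural) weight that applies where word t occurs, and w_2(t) is its second (semantic) weight. The text only says the third weight is determined "according to" the adjusted TF and the IDF, so the product form below is an assumption, chosen to match the TF-IDF value used in the later worked example.

```latex
\mathrm{TF}'(t) = \mathrm{TF}(t)\cdot w_1(t)\cdot w_2(t),
\qquad
w_3(t) = \mathrm{TF}'(t)\cdot \mathrm{IDF}(t)
```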

Optionally, extracting specified words from the words as keywords for retrieval according to the third weight value includes: removing, from the words, those whose third weight value is smaller than a preset weight threshold, and taking the remaining words as the specified words.

Optionally, the content type includes at least one of: a text content type divided according to a preset text format of the text, a position type of a paragraph in the text, and a position type of a sentence in the text.

According to another embodiment of the present invention, there is provided a keyword extraction device, including: a first determining module, configured to perform text structure analysis on an input text to determine a first weight value corresponding to each content type in the text; a second determining module, configured to perform semantic analysis on each word in the text to determine a second weight value corresponding to each word; an adjusting module, configured to adjust the TF value of each word according to the first weight value and the second weight value and to calculate a third weight value of each word according to the adjusted TF value; and an extraction module, configured to extract specified words from the words as keywords for retrieval according to the third weight value.

According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

According to the invention, the TF value of each word is adjusted using the structural analysis result of the text and the semantic analysis result of the words. This solves the problems in the related art that TF-IDF calculation depends on multiple content-related documents, cannot calculate word weights for a single text, and performs poorly on discrete data with low relevance, and it improves the accuracy of keyword extraction.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow diagram of keyword extraction according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a test text according to an embodiment of the present invention;

FIG. 3 is a result graph of an extraction result according to an embodiment of the present invention;

FIG. 4 is a block diagram of a keyword extraction device according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Example 1

In this embodiment, a keyword extraction method is provided. Fig. 1 is a flowchart of keyword extraction according to an embodiment of the present invention. As shown in Fig. 1, the flow includes the following steps:

Step S102, performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text;

Step S104, performing semantic analysis on each word in the text to determine a second weight value corresponding to each word;

Step S106, adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculating a third weight value of each word according to the adjusted TF value;

Step S108, extracting specified words from the words as keywords for retrieval according to the third weight value.
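The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name extract_keywords and its parameters are hypothetical, TF is computed per sentence, the third weight is taken as the product of the adjusted TF and the IDF, and a word missing from the IDF table is given an IDF of 1.0.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def extract_keywords(
    sentences: List[List[str]],
    first_weight: Callable[[int], float],   # S102: sentence index -> structural weight
    second_weight: Callable[[str], float],  # S104: word -> semantic weight
    idf: Dict[str, float],
    threshold: float,
) -> Dict[str, float]:
    adjusted_tf: Dict[str, float] = defaultdict(float)
    for index, words in enumerate(sentences):
        for word in words:
            # S106: each occurrence adds 1/len(words) to the sentence-level TF,
            # scaled by the structural and semantic weights.
            adjusted_tf[word] += first_weight(index) * second_weight(word) / len(words)
    # S106, continued: combine with IDF (product form and default of 1.0 are assumptions).
    third = {word: tf * idf.get(word, 1.0) for word, tf in adjusted_tf.items()}
    # S108: keep only words whose third weight reaches the threshold.
    return {word: value for word, value in third.items() if value >= threshold}
```

The weights are passed in as callables so that the structural and semantic analyses, which the following paragraphs describe, can be swapped without touching the weighting logic.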

Specifically, the text content types divided according to the preset text format refer to the content parts of the text. For example, a patent application document can be divided into the abstract, the abstract drawing, the claims, the description, and the drawings of the description, each of which is a text content type. The description itself can in turn be divided into the technical field, the background, the summary of the invention, the description of the drawings, and the detailed description.

Specifically, the position type of a paragraph in the text refers to where the paragraph is located in the text content: for example, whether it is in the technical field, the summary, or the description of the drawings, and whether it is the first paragraph, the last paragraph, or a paragraph at some intermediate position.

Specifically, the position type of a sentence in the text is analogous to that of a paragraph and refers to where the sentence is located within its paragraph or within the text content: for example, whether it is in the technical field, the summary, or the description of the drawings, and whether it is the first sentence, the last sentence, or a sentence in the middle of a paragraph.

The above are merely examples, and any representation of content types based on these concepts is within the scope of the present embodiment.

Specifically, when determining the first weight value corresponding to each content type, take the description of a patent application as an example. The content of the detailed description is clearly more important than the content of the other four parts, so the paragraphs and sentences of the detailed description are given higher weights than the other content types. Within the detailed description, the first paragraph or the first few paragraphs are often the most important, so they are given higher weights than the other paragraphs. Within each paragraph, the first or last sentence usually gives the conclusive statement, so the first and last sentences are given higher weights than the other sentences of the paragraph.
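One way to picture the structural weighting just described is a small lookup keyed on section, paragraph position, and sentence position. The sketch below is only an illustration: the source gives the ordering of the weights but no numbers, so every numeric value, the Position type, the section names, and the choice to combine the factors by multiplication are assumptions.

```python
from dataclasses import dataclass

# Hypothetical values: the source states only which positions rank higher.
SECTION_WEIGHT = {"detailed_description": 2.0}  # other sections default to 1.0
LEADING_PARAGRAPHS = 2       # how many opening paragraphs count as "the first few"
LEADING_PARAGRAPH_BOOST = 1.5
EDGE_SENTENCE_BOOST = 1.2


@dataclass(frozen=True)
class Position:
    section: str                  # e.g. "background", "detailed_description"
    paragraph: int                # 0-based index within the section
    sentence: int                 # 0-based index within the paragraph
    sentences_in_paragraph: int


def first_weight(pos: Position) -> float:
    weight = SECTION_WEIGHT.get(pos.section, 1.0)
    # Opening paragraphs of the detailed description rank above later ones.
    if pos.section == "detailed_description" and pos.paragraph < LEADING_PARAGRAPHS:
        weight *= LEADING_PARAGRAPH_BOOST
    # First and last sentences of a paragraph rank above the middle ones.
    if pos.sentence in (0, pos.sentences_in_paragraph - 1):
        weight *= EDGE_SENTENCE_BOOST
    return weight
```

A content type that should be ignored altogether could simply be given a weight of 0, which removes its words from the adjusted TF.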

Optionally, performing semantic analysis on each word in the text to determine the second weight value corresponding to each word includes: ordering the words according to a preset part-of-speech priority rule; and assigning the corresponding second weight value to each word according to the part-of-speech priority order.

Specifically, semantic analysis can identify the main entities discussed in an article and remove uninformative components of its sentences. For example, from "Xiaoming is Chinese", the subject "Xiaoming" and the phrase "Chinese" can be extracted. Key entities such as core phrases and the subjects and objects of sentences are then given higher weights, and the weights of other words are set directly to 1. As a further example, quantifiers and adjectives are sometimes also critical. They may therefore be given weights lower than those of the key entities but higher than those of other words.
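The priority-based assignment can be sketched as a sort followed by a table lookup. This is an illustrative sketch only: the role labels are assumed to come from an upstream parser, and the numeric weights are hypothetical. The source fixes only the ordering (key entities highest, quantifiers and adjectives in between, all other words at 1); the relative rank of predicates against quantifiers and adjectives is an assumption here.

```python
# Hypothetical priority order and weights; only the ordering comes from the text.
ROLE_PRIORITY = ["subject", "object", "predicate", "quantifier", "adjective"]
ROLE_WEIGHT = {
    "subject": 3.0,
    "object": 3.0,
    "predicate": 2.0,
    "quantifier": 1.5,
    "adjective": 1.5,
}


def second_weights(tagged_words):
    """tagged_words: list of (word, role) pairs produced by an upstream parser."""
    def rank(item):
        role = item[1]
        # Roles outside the priority list sort last.
        return ROLE_PRIORITY.index(role) if role in ROLE_PRIORITY else len(ROLE_PRIORITY)

    ordered = sorted(tagged_words, key=rank)  # order by part-of-speech priority
    # Any role not in the table falls back to the neutral weight 1.0.
    return [(word, ROLE_WEIGHT.get(role, 1.0)) for word, role in ordered]
```

For the example sentence, tagging "Xiaoming" as subject and "Chinese" as object would place both at the top of the ordering with the highest weight.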

Optionally, extracting specified words from the words as keywords for retrieval according to the third weight value includes: removing, from the words, those whose third weight value is smaller than a preset weight threshold, and taking the remaining words as the specified words.

It should be noted that the preset weight threshold is intended to avoid outputting too many results, which would affect subsequent retrieval. When too many keywords are used to search a knowledge graph, either too much information is returned because there are too many primary keys, or no information is returned because the query is over-constrained.

To make the technical solution of the above embodiment easier to understand, the following scenario is provided.

Fig. 2 is a schematic diagram of a test text according to an embodiment of the present invention. The test text shown in Fig. 2 is processed as follows:

Step 1: Perform text structure analysis on the input test text. The analysis assigns the highest weight value to lines 1 and 31, assigns a weight value of 0 to lines 7 and 9 (which is equivalent to removing those lines), and assigns the remaining lines weight values lower than those of lines 1 and 31.

Step 2: Perform semantic analysis on each word in the text, including word segmentation, part-of-speech tagging, stop-word handling, syntactic analysis, and other processing. For example, from the sentence in line 2, "Many people truly remember artificial intelligence perhaps because of the movie 'Artificial Intelligence' directed by Steven Spielberg in 2001", the subjects "people" and "Steven Spielberg", the predicates "remember" and "direct", and the object "artificial intelligence" can be extracted. The core subjects and objects, such as "people", "Steven Spielberg", and "artificial intelligence", are given the highest weight values, which are greater than 1. The predicates "remember" and "direct" are given weight values greater than 1 but smaller than those of the core subjects and objects. The weight values of all other words are set directly to 1.

Step 3: Adjust the term frequency (TF) value of each word according to the first weight value and the second weight value. The TF value of each word is calculated within each sentence and then multiplied by the weight values obtained in Step 1 and Step 2 to obtain the adjusted TF value of each word.

Step 4: Calculate the TF-IDF value of each word (where the IDF value is zero, the word weight is used directly as the TF-IDF value). The TF-IDF value may be used as is, or it may be transformed with a sigmoid function and the result normalized. This yields the keyword weight value of each word.
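The scoring in this step can be sketched as below. Several details are assumptions rather than statements of the source: the "word weight" used when the IDF is zero is read here as the adjusted TF from the previous step, the normalization divides by the largest value (the text does not say which normalization is meant), and the name third_weights is hypothetical.

```python
import math


def third_weights(adjusted_tf, idf, use_sigmoid=True):
    """Combine adjusted TF with IDF, optionally squashing and normalizing."""
    scores = {}
    for word, tf in adjusted_tf.items():
        word_idf = idf.get(word, 0.0)
        # IDF of zero: fall back to the weighted TF itself (assumed reading).
        scores[word] = tf * word_idf if word_idf != 0 else tf
    if not use_sigmoid:
        return scores
    # Sigmoid maps every score into (0, 1) while preserving the ranking.
    squashed = {word: 1.0 / (1.0 + math.exp(-score)) for word, score in scores.items()}
    top = max(squashed.values(), default=1.0)
    # Max-normalization (an assumption): the best-scoring word gets exactly 1.0.
    return {word: value / top for word, value in squashed.items()}
```

Because the sigmoid is monotonic, squashing changes the spacing between scores but not their order, so the same words survive a threshold placed at the corresponding point on either scale.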

Step 5: Compare the weight value of each word with the preset threshold to screen out the result shown in Fig. 3. Fig. 3 is a result diagram of an extraction result according to an embodiment of the present invention. As shown in Fig. 3, the final extraction results are "artificial intelligence", "human", "robot", "machine", and "law". Among them, "law" has the highest weight and "robot" has the lowest.

Step 6: according to the output result in fig. 3, the search is performed in a targeted manner.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

In this embodiment, a device is further provided, and the device is used to implement the above embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Fig. 4 is a block diagram of a device for extracting keywords according to an embodiment of the present invention, and as shown in fig. 4, the device includes:

a first determining module 42, configured to perform text structure analysis on an input text to determine a first weight value corresponding to each content type in the text;

a second determining module 44, configured to perform semantic analysis on each word in the text to determine a second weight value corresponding to each word;

an adjusting module 46, configured to adjust the term frequency (TF) value of each word according to the first weight value and the second weight value, and to calculate a third weight value of each word according to the adjusted TF value;

an extracting module 48, configured to extract specified words from the words as keywords for retrieval according to the third weight value.
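In a software implementation, the four modules could be wired together roughly as follows. This is a sketch of the module structure only: the class name KeywordExtractor, the field names, and the type signatures are hypothetical, and each module is left as an injectable callable because the source does not prescribe how any of them is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Weights = Dict[str, float]


@dataclass
class KeywordExtractor:
    # Module 42: one structural weight per sentence.
    first_determining: Callable[[List[str]], List[float]]
    # Module 44: one semantic weight per word.
    second_determining: Callable[[List[str]], Weights]
    # Module 46: adjusted TF and third weight per word.
    adjusting: Callable[[List[str], List[float], Weights], Weights]
    # Module 48: selection of the specified words.
    extracting: Callable[[Weights], List[str]]

    def run(self, sentences: List[str]) -> List[str]:
        first = self.first_determining(sentences)
        second = self.second_determining(sentences)
        third = self.adjusting(sentences, first, second)
        return self.extracting(third)
```

Keeping the modules as separate callables mirrors the note that follows: they may run in one processor or be distributed across several without changing the data flow between them.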

It should be noted that the above modules may be implemented by software or hardware. In the latter case, they may be implemented in, but are not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.

Example 3

Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:

S1, performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text;

S2, performing semantic analysis on each word in the text to determine a second weight value corresponding to each word;

S3, adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculating a third weight value of each word according to the adjusted TF value.

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A keyword extraction method is characterized by comprising the following steps:

performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text;

performing semantic analysis on each word in the text to determine a second weight value corresponding to each word;

adjusting a word frequency TF value of each word according to the first weight value and the second weight value, and calculating a third weight value of each word according to the adjusted TF value;

extracting specified words from the words as keywords for retrieval according to the third weight value;

wherein performing semantic analysis on each word in the text to determine the second weight value corresponding to each word includes:

sequencing the words according to a preset part-of-speech priority rule;

giving the corresponding second weight value to each word according to the sequence of the part of speech priority;

wherein calculating the third weight value of each word according to the adjusted TF value includes:

obtaining the IDF value of each word;

determining the third weight value according to the adjusted TF value of each word and the IDF value of each word;

wherein extracting specified words from the words as keywords for retrieval according to the third weight value includes:

removing, from the words, those whose third weight value is smaller than a preset weight threshold, and taking the remaining words as the specified words.

2. The method of claim 1, wherein prior to semantically analyzing words in the text, the method further comprises:

performing word segmentation on the text according to a preset rule; and

determining the part of speech of each word according to the relations between the words after segmentation.

3. The method of claim 1, wherein adjusting a word frequency (TF) value of each of the words according to the first and second weight values further comprises:

obtaining the TF value of each word;

multiplying the TF value by the first and second weight values to determine the adjusted TF value.

4. The method of claim 1, wherein the content type comprises at least one of:

a text content type divided according to a preset text format of the text, a position type of a paragraph in the text, and a position type of a sentence in the text.

5. An extraction device of a keyword, characterized by comprising:

the first determining module is used for performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text;

the second determining module is used for performing semantic analysis on each word in the text to determine a second weight value corresponding to each word;

the adjusting module is used for adjusting a word frequency TF value of each word according to the first weight value and the second weight value and calculating a third weight value of each word according to the adjusted TF value;

the extraction module is used for extracting specified words from the words as keywords for retrieval according to the third weight value;

the second determining module is further configured to order the words according to a preset part-of-speech priority rule, and to assign the corresponding second weight value to each word according to the part-of-speech priority order;

the adjusting module is further configured to obtain the IDF value of each word, and to determine the third weight value according to the adjusted TF value of each word and the IDF value of each word;

the extraction module is further configured to remove, from the words, those whose third weight value is smaller than a preset weight threshold, and to take the remaining words as the specified words.

6. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 4 when executed.

7. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 4.