CN113673223A

CN113673223A - Keyword extraction method and system based on semantic similarity

Info

Publication number: CN113673223A
Application number: CN202110981460.7A
Authority: CN
Inventors: 史晓凌; 刘弦弦; 唐先明; 柳晶晶; 李立琴; 高艳
Original assignee: Beijing Zhitong Yunlian Technology Co Ltd
Current assignee: Beijing Zhitong Yunlian Technology Co Ltd
Priority date: 2021-08-25
Filing date: 2021-08-25
Publication date: 2021-11-19

Abstract

The invention discloses a keyword extraction method and system based on semantic similarity. The method comprises the following steps: splitting a text into sentences and segmenting each sentence into words according to a domain word segmentation dictionary; vectorizing the segmented words and the sentences in which they occur; calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords; clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models; and ranking the candidate keywords in each topic model to obtain the final keyword result. With this method, word segmentation is more accurate, and the extracted keywords better reflect the topic of the article.

Description

Keyword extraction method and system based on semantic similarity

Technical Field

The invention relates to the technical field of natural language processing in artificial intelligence, and in particular to a keyword extraction method and system based on semantic similarity.

Background

In the field of natural language processing, the key to handling massive volumes of text is to extract the information that users care about most; whether a text is long or short, its main idea can be glimpsed through a few keywords. Moreover, both text-based recommendation and text-based search depend heavily on text keywords, and the accuracy of keyword extraction directly affects the final performance of a recommendation or search system. Keyword extraction is therefore an important part of text mining.

Current keyword extraction methods comprise supervised extraction algorithms and unsupervised extraction algorithms:

Supervised keyword extraction algorithms require a labeled training corpus. A keyword extraction model is trained on the corpus and then used to extract keywords from the target documents; however, manual labeling is costly.

Unsupervised keyword extraction algorithms fall into three major categories: keyword extraction based on statistical features, keyword extraction based on word graph models, and keyword extraction based on topic models.

Algorithms based on statistical features extract the keywords of a document from statistical information about its words, mainly word weight, the position of words in the document, and word association information. This approach depends mainly on feature selection; if the features are poorly chosen, performance suffers greatly.

Algorithms based on word graph models first construct a language network graph of the document and then analyze the graph to find the words or phrases that play an important role in it; these words or phrases are the keywords of the document. When applied in vertical domains such as the petroleum industry and the military industry, the results are unsatisfactory, and the extracted keywords are biased toward general-purpose words.

Algorithms based on topic models mainly use the distribution of topics in the topic model to extract keywords, but the extracted keywords are broad and do not reflect the topic of the document well.

Disclosure of Invention

The invention aims to provide a keyword extraction method and system based on semantic similarity in order to solve the problems described above.

The invention provides a keyword extraction method based on semantic similarity, which comprises the following steps:

S1, splitting a text into sentences, and segmenting each sentence into words according to a domain word segmentation dictionary;

S2, vectorizing the segmented words and the sentences in which the words occur;

S3, calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;

S4, clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models;

S5, ranking the candidate keywords in each topic model to obtain the final keyword result.

The invention further provides a keyword extraction system based on semantic similarity, which comprises the following modules:

a word segmentation module, used for splitting the text into sentences and segmenting each sentence into words according to the domain word segmentation dictionary;

a vectorization module, used for vectorizing the segmented words and the sentences in which the words occur;

a similarity calculation module, used for calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;

a keyword clustering module, used for clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models;

a keyword ranking module, used for ranking the candidate keywords in each topic model to obtain the final keyword result.

An embodiment of the present invention further provides a keyword extraction device based on semantic similarity, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the keyword extraction method.

The embodiment of the invention also provides a computer readable storage medium, wherein an implementation program for information transmission is stored on the computer readable storage medium, and the implementation program is executed by a processor to implement the steps of the keyword extraction method.

With the embodiments of the invention, word segmentation is more accurate, and the extracted keywords better reflect the topic of the article.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of a keyword extraction method based on semantic similarity according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating specific steps of a keyword extraction method based on semantic similarity according to an embodiment of the present invention;

FIG. 3 is a diagram of a domain segmentation dictionary in accordance with an embodiment of the present invention;

FIG. 4 is a schematic diagram of the input and output when BERT is used for vectorization according to an embodiment of the present invention;

FIG. 5 is a diagram of a keyword extraction system based on semantic similarity according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a keyword extraction device based on semantic similarity according to an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise. Furthermore, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may, for example, be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intervening medium; or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.

Method embodiment

According to an embodiment of the present invention, a keyword extraction method based on semantic similarity is provided. FIG. 1 is a flowchart of the method. As shown in FIG. 1, the method specifically includes:

S1, the text is split into sentences, and each sentence is segmented into words according to the domain word segmentation dictionary.

S2, the segmented words and the sentences in which they occur are vectorized. Specifically, the whole sentence is first split into characters and then encoded with a BERT model, which outputs a vector for each character; the vectors of the characters in a word are summed and averaged to obtain the vector of the word; the sentence vector is the CLS vector.

S3, the similarity between each vectorized word and the sentence in which it occurs is calculated, and candidate keywords are extracted. Specifically, the similarity is calculated using cosine similarity, a similarity threshold is set, and the words whose similarity to their sentence is greater than the threshold are kept as candidate keywords.

S4, the candidate keywords are clustered with a clustering algorithm to obtain candidate keyword topic models. Specifically, in this embodiment, the candidate keywords are clustered using the K-means algorithm.

S5, the candidate keywords in each topic model are ranked to obtain the final keyword result. Specifically, the average of the word vectors in each topic model is calculated, and then the similarity between each candidate keyword and the average word vector of its topic model is calculated; the similarities are sorted in descending order, and the top n candidates of each topic model are kept as text keywords, where the value of n is user-defined according to the actual situation.
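The quantities used in vectorization, candidate selection, and ranking can be written compactly as follows. The notation is an assumption introduced here for illustration and does not appear in the source: c_1, ..., c_k are the BERT character vectors of a word w, s is the CLS vector of the sentence containing w, tau is the similarity threshold, and T is the set of candidate keywords in one topic model. The source does not name the similarity measure used for ranking; cosine similarity is assumed there as well.

```latex
\begin{aligned}
\mathbf{v}_w &= \frac{1}{k}\sum_{i=1}^{k}\mathbf{c}_i
  && \text{(word vector)}\\
\operatorname{sim}(w, s) &= \frac{\mathbf{v}_w\cdot\mathbf{s}}
  {\lVert\mathbf{v}_w\rVert\,\lVert\mathbf{s}\rVert}
  && \text{(word-sentence cosine similarity)}\\
w \text{ is a candidate keyword} &\iff \operatorname{sim}(w, s) > \tau
  && \text{(threshold)}\\
\boldsymbol{\mu}_T &= \frac{1}{|T|}\sum_{w\in T}\mathbf{v}_w,\qquad
\operatorname{score}(w) = \frac{\mathbf{v}_w\cdot\boldsymbol{\mu}_T}
  {\lVert\mathbf{v}_w\rVert\,\lVert\boldsymbol{\mu}_T\rVert}
  && \text{(ranking within a topic model)}
\end{aligned}
```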

FIG. 2 is a schematic diagram of the specific steps of the keyword extraction method based on semantic similarity according to an embodiment of the present invention. As shown in FIG. 2, the specific steps are as follows:

After the text is split into sentences, each sentence is segmented according to a user-defined dictionary: a domain word segmentation dictionary is customized according to the specialized terms of each domain, which prevents domain terms from being split apart as they would be with a general-purpose segmentation dictionary. The domain word segmentation dictionary is shown in FIG. 3.
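The following is a minimal sketch of sentence splitting followed by dictionary-based segmentation. The source does not say which matching algorithm is used; forward maximum matching is assumed here purely for illustration, and the function names, the punctuation set, and the two dictionary entries are hypothetical. Characters not covered by the dictionary fall back to single-character tokens, whereas a practical segmenter would combine the domain dictionary with a general one.

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [p.strip() for p in parts if p.strip()]

def segment(sentence, domain_dict):
    """Forward maximum matching (an assumed strategy, not stated in the source).

    At each position the longest dictionary entry is taken; if none matches,
    a single character is emitted.
    """
    max_len = max((len(w) for w in domain_dict), default=1)
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in domain_dict:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical domain dictionary with two petroleum-domain terms.
domain_dict = {"页岩油气", "储量管理"}
sentences = split_sentences("页岩油气勘探。储量管理工作！")
words = segment("页岩油气储量管理工作", domain_dict)
```

With the domain dictionary in place, the two domain terms survive as whole words instead of being broken into shorter general-purpose words.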

For example, for the sentence "The research results of this project can support future petrochemical shale oil and gas exploration, development, and reserve management work, and can directly provide technical preparation for the shale oil and gas tertiary reserve calculation to be declared by petrochemicals in China.", the word segmentation results according to the custom dictionary are shown in Table 1:

TABLE 1

After the article has been segmented into words, the sentences and words are vectorized:

In this embodiment, the general-purpose pre-trained language representation model BERT is selected for vectorization. As shown in FIG. 4, the specific process is as follows:

First, the whole sentence is split into characters and then encoded with the BERT model; the output is a vector for each character, as shown in Table 2:

TABLE 2

Character	Character vector
Stone	V1
Transforming	V2
Probe	V3
Zone	V4
Page	V5

After each character in the sentence has been encoded by the BERT model, the vector of each word is obtained by averaging the vectors of the characters in the word, as shown in Table 3:

TABLE 3

The sentence vector is the CLS vector in the output, denoted SV.
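The averaging step can be sketched as follows. The sketch assumes the character vectors have already been produced by a BERT model, one per character and in sentence order, with the CLS vector kept separately as the sentence vector; the function name and the tiny two-dimensional vectors are hypothetical stand-ins, not real model outputs.

```python
def word_vectors(words, char_vectors):
    """Average consecutive character vectors into one vector per word.

    words:        segmented words of one sentence, in order.
    char_vectors: one vector per character of the sentence, in the same order
                  (stand-ins for BERT character outputs, CLS excluded).
    """
    if sum(len(w) for w in words) != len(char_vectors):
        raise ValueError("words and char_vectors must cover the same characters")
    result, pos = [], 0
    for word in words:
        chunk = char_vectors[pos:pos + len(word)]
        dim = len(chunk[0])
        # Sum the character vectors of the word, then divide by their count.
        mean = [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        result.append((word, mean))
        pos += len(word)
    return result

# Toy data: a two-character word followed by a one-character word.
toy_char_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
toy_word_vectors = word_vectors(["石化", "页"], toy_char_vectors)
```

A single-character word simply keeps its character vector, since the average of one vector is the vector itself.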

Word-sentence similarity is calculated, and the words above a threshold are kept:

After the vectorized representations of each sentence and of the words in it are obtained, the similarity between each word and the sentence in which it occurs is calculated using cosine similarity, as shown in Table 4:

TABLE 4

A threshold is set, and the words whose similarity is greater than the threshold are kept as candidate keywords.
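A minimal sketch of this filtering step is given below. The function names are hypothetical, and the threshold value 0.5 in the example is an arbitrary illustration; the source does not specify a threshold value.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def candidate_keywords(word_vecs, sentence_vec, threshold):
    """Keep the (word, vector) pairs whose similarity to the sentence exceeds threshold."""
    return [
        (word, vec)
        for word, vec in word_vecs
        if cosine_similarity(vec, sentence_vec) > threshold
    ]

# Toy vectors: "a" points almost the same way as the sentence vector, "b" does not.
toy_words = [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]
kept = candidate_keywords(toy_words, sentence_vec=[1.0, 0.1], threshold=0.5)
```

Because each word is compared only with the sentence in which it occurs, this function would be called once per sentence and the results concatenated.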

Clustering the candidate keywords:

The retained candidate keywords that exceed the threshold are clustered using the K-means algorithm.
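For illustration, a small pure-Python version of K-means (Lloyd's algorithm with squared Euclidean distance) is sketched below. The source names only the K-means algorithm; the deterministic farthest-first initialization, the iteration cap, and the function names are assumptions made here so that the example is reproducible. In practice an existing library implementation would normally be used.

```python
def _sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def kmeans(vectors, k, max_iter=100):
    """Return one cluster label (0..k-1) per input vector."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    # Farthest-first initialization: an illustrative, deterministic choice.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(_sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: _sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = _mean(members)
    return labels

toy_vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_labels = kmeans(toy_vectors, k=2)
```

Each resulting cluster plays the role of one category, or candidate keyword topic model, in the ranking step.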

Ranking the words in each category:

The average of the word vectors in each category is calculated, and then the similarity between each candidate keyword and the average word vector of its category is calculated; the top n keywords in each category are kept and, taken together, form the keywords of the whole text.
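The ranking step can be sketched as follows, assuming cosine similarity is used to compare each candidate with the category mean (the source says only "similarity"). The function names and toy vectors are hypothetical.

```python
import math
from collections import defaultdict

def _cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_keywords(candidates, labels, n):
    """Return {label: top-n words ranked by similarity to the category mean}.

    candidates: list of (word, vector) pairs.
    labels:     one category label per candidate, e.g. from K-means.
    """
    clusters = defaultdict(list)
    for (word, vec), label in zip(candidates, labels):
        clusters[label].append((word, vec))
    result = {}
    for label, members in clusters.items():
        dim = len(members[0][1])
        mean = [sum(vec[d] for _, vec in members) / len(members) for d in range(dim)]
        # Sort from most to least similar to the category mean vector.
        ranked = sorted(members, key=lambda m: _cosine(m[1], mean), reverse=True)
        result[label] = [word for word, _ in ranked[:n]]
    return result

toy_candidates = [
    ("a", [1.0, 0.0]),
    ("b", [1.0, 0.2]),
    ("c", [1.0, 1.0]),
    ("d", [0.0, 1.0]),
]
ranked_keywords = top_keywords(toy_candidates, labels=[0, 0, 0, 1], n=2)
```

The keywords of the whole text are then the union of the per-category lists.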

With the embodiments of the invention, the article is segmented according to the domain word segmentation dictionary, which makes the segmentation result more accurate; candidate keywords are extracted by calculating the similarity between the vectorized words and their sentences, and keywords are selectively retained after ranking, so that the extracted keywords reflect the topic of the article more accurately.

System embodiment

According to an embodiment of the present invention, a keyword extraction system based on semantic similarity is provided. FIG. 5 is a schematic diagram of the system. As shown in FIG. 5, the system specifically includes:

The word segmentation module 50: used for splitting the text into sentences and segmenting each sentence into words according to the domain word segmentation dictionary.

The vectorization module 52: used for vectorizing the segmented words and the sentences in which the words occur.

The vectorization module 52 is specifically configured to: split the whole sentence into characters, encode it with a BERT model, and output a vector for each character; and sum and average the vectors of the characters in a word to obtain the vector of the word.

The similarity calculation module 54: used for calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords.

The similarity calculation module 54 is specifically configured to: calculate the similarity between each vectorized word and the sentence in which it occurs using cosine similarity, set a similarity threshold, and keep the words whose similarity to their sentence is greater than the threshold as candidate keywords.

The keyword clustering module 56: used for clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models.

The keyword ranking module 58: used for ranking the candidate keywords in each topic model to obtain the final keyword result.

The keyword ranking module 58 is specifically configured to: calculate the average of the word vectors in each topic model, then calculate the similarity between each candidate keyword and the average word vector of its topic model, sort the similarities in descending order, and keep the top n candidates of each topic model as text keywords according to the set value of n.

This embodiment is the system embodiment corresponding to the method embodiment above; the specific operation of each module can be understood with reference to the description of the method embodiment and is not repeated here.
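One way to picture how the five modules fit together is a small container that chains five pluggable stages. This is only an architectural sketch: the class name, field names, and type signatures are hypothetical, and the toy stage functions in the usage example are trivial stand-ins, not the dictionary segmenter, BERT encoder, or K-means clusterer themselves.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]
WordVec = Tuple[str, Vector]

@dataclass
class KeywordExtractionSystem:
    """Chains five stages that mirror modules 50, 52, 54, 56, and 58."""
    segment: Callable[[str], List[List[str]]]   # module 50: text -> words per sentence
    vectorize: Callable[[List[str]], Tuple[List[WordVec], Vector]]  # module 52
    select: Callable[[List[WordVec], Vector], List[WordVec]]        # module 54
    cluster: Callable[[List[WordVec]], List[int]]                   # module 56
    rank: Callable[[List[WordVec], List[int]], List[str]]           # module 58

    def extract(self, text: str) -> List[str]:
        candidates: List[WordVec] = []
        for words in self.segment(text):
            word_vecs, sentence_vec = self.vectorize(words)
            # Each word is compared only with its own sentence vector.
            candidates.extend(self.select(word_vecs, sentence_vec))
        if not candidates:
            return []
        labels = self.cluster(candidates)
        return self.rank(candidates, labels)

# Usage with toy stand-in stages (word length plays the role of a vector).
system = KeywordExtractionSystem(
    segment=lambda text: [s.split() for s in text.split(".") if s.strip()],
    vectorize=lambda words: ([(w, [float(len(w))]) for w in words], [1.0]),
    select=lambda word_vecs, sentence_vec: [(w, v) for w, v in word_vecs if v[0] > 3],
    cluster=lambda cands: [0] * len(cands),
    rank=lambda cands, labels: [w for w, _ in cands],
)
keywords = system.extract("shale oil reserves. gas exploration")
```

Keeping each stage behind a plain callable means any one module can be swapped, for example a different clustering algorithm, without touching the others.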

Device embodiment one

An embodiment of the present invention provides a keyword extraction device based on semantic similarity, as shown in FIG. 6, including: a memory 60, a processor 62, and a computer program stored on the memory 60 and executable on the processor 62; when executed by the processor 62, the computer program implements the steps of the keyword extraction method described in the method embodiment above.

Device embodiment two

An embodiment of the present invention provides a computer-readable storage medium on which an implementation program for information transmission is stored; when executed by the processor 62, the implementation program implements the steps of the keyword extraction method described in the method embodiment above.

The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

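1. A keyword extraction method based on semantic similarity, characterized by comprising the following steps: S1, splitting a text into sentences, and segmenting each sentence into words according to a domain word segmentation dictionary; S2, vectorizing the segmented words and the sentences in which the words occur; S3, calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords; S4, clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models; and S5, ranking the candidate keywords in each topic model to obtain the final keyword result.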

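2. The method according to claim 1, wherein vectorizing the segmented words and the sentences in which the words occur in step S2 specifically comprises: first splitting the whole sentence into characters, then encoding the sentence with a BERT model and outputting a vector for each character; summing and averaging the vectors of the characters in a word to obtain the vector of the word; and using the CLS vector as the sentence vector.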

3. The method according to claim 1, wherein calculating the similarity between each vectorized word and the sentence in which it occurs and extracting the candidate keywords in step S3 specifically comprises: calculating the similarity between each vectorized word and the sentence in which it occurs using cosine similarity, setting a similarity threshold, and keeping the words whose similarity to their sentence is greater than the threshold as candidate keywords.

4. The method according to claim 1, wherein ranking the candidate keywords in each topic model to obtain the final keyword result in step S5 specifically comprises: calculating the average of the word vectors in each topic model, then calculating the similarity between each candidate keyword and the average word vector of its topic model, sorting the similarities in descending order, and keeping the top n candidates of each topic model as text keywords, wherein the value of n is user-defined according to the actual situation.

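5. A keyword extraction system based on semantic similarity, characterized by comprising: a word segmentation module, used for splitting a text into sentences and segmenting each sentence into words according to a domain word segmentation dictionary; a vectorization module, used for vectorizing the segmented words and the sentences in which the words occur; a similarity calculation module, used for calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords; a keyword clustering module, used for clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models; and a keyword ranking module, used for ranking the candidate keywords in each topic model to obtain the final keyword result.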

6. The system of claim 5, wherein the vectorization module is specifically configured to: split the whole sentence into characters, encode it with a BERT model, and output a vector for each character; and sum and average the vectors of the characters in a word to obtain the vector of the word.

7. The system of claim 5, wherein the similarity calculation module is specifically configured to: calculate the similarity between each vectorized word and the sentence in which it occurs using cosine similarity, set a similarity threshold, and keep the words whose similarity to their sentence is greater than the threshold as candidate keywords.

8. The system of claim 5, wherein the keyword ranking module is specifically configured to: calculate the average of the word vectors in each topic model, then calculate the similarity between each candidate keyword and the average word vector of its topic model, sort the similarities in descending order, and keep the top n candidates of each topic model as text keywords according to the set value of n.

9. A keyword extraction device based on semantic similarity is characterized by comprising: memory, processor and computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the keyword extraction method as claimed in any one of claims 1 to 4.

10. A computer-readable storage medium, on which an information transfer implementation program is stored, which, when executed by a processor, implements the steps of the keyword extraction method according to any one of claims 1 to 4.