CN117669561A - Unsupervised keyword extraction method, system, equipment and medium - Google Patents


Info

Publication number
CN117669561A
CN117669561A
Authority
CN
China
Prior art keywords
mask, document, cls, vector, original document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311628915.2A
Other languages
Chinese (zh)
Inventor
李嘉豪
戴宪华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202311628915.2A
Publication of CN117669561A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present invention relates to the field of keyword extraction technology, and in particular to an unsupervised keyword extraction method, system, device, and medium. An original document is preprocessed to obtain a plurality of candidate keywords; a masking operation is performed on the original document for each candidate keyword, yielding a mask document corresponding to each candidate keyword; the original document and each mask document are input into a pre-trained language characterization model to obtain an original-document cls vector and a mask-document cls vector for each candidate keyword; a first cosine similarity between the original-document cls vector and each candidate keyword's mask-document cls vector, and a second cosine similarity between each candidate keyword's mask-document cls vector and the remaining mask-document cls vectors, are calculated; the first and second cosine similarities are weighted to obtain a total similarity, according to which target candidate keywords are screened. The accuracy and diversity of keyword extraction can thereby be improved.

Description

Unsupervised keyword extraction method, system, equipment and medium
Technical Field
The present invention relates to the field of keyword extraction technology, and in particular, to an unsupervised keyword extraction method, system, device, and medium.
Background
Keyword extraction currently falls into two categories: supervised and unsupervised. In practical engineering, document data is easy to obtain while labeled data is difficult to obtain, so unsupervised keyword extraction is more widely used. Traditional unsupervised keyword extraction focuses only on low-level features such as word frequency, position, and part of speech, without using semantics; yet semantics is the decisive factor in keyword extraction, so the accuracy of traditional methods is low. Word-embedding extraction techniques based on pre-trained models greatly improve accuracy over traditional methods, but words with similar semantics may be extracted repeatedly, so these techniques lack diversity. Most embedding methods compute similarity between word embeddings and document embeddings, but a word is usually much shorter than a document, and a single word can hardly represent a whole document, so such a calculation loses much information. Moreover, when a pre-trained language model is used to obtain embeddings, only the output of the last layer is used; the information in the intermediate layers is not utilized and is therefore lost.
Disclosure of Invention
The invention aims to solve the problems of low keyword extraction accuracy and lack of diversity in the prior art.
In order to achieve the above object, the present invention provides an unsupervised keyword extraction method, which is characterized in that the method includes:
preprocessing an original document to obtain a plurality of candidate keywords;
masking operation is carried out on the original document according to the candidate keywords respectively, so that a masking document corresponding to each candidate keyword is obtained;
inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
respectively calculating the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents;
weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
and screening target candidate keywords according to the total similarity.
Further, the preprocessing the original document to obtain a plurality of candidate keywords includes:
and performing word segmentation, part-of-speech tagging and stop word removal on the original document through a jieba tool.
Further, the pre-trained language characterization model is an ALBERT model.
Further, the original document cls vector is represented by the following formula:

cls = Σ_{i=1}^{n} h_i · cls_i

where h_i is a trainable parameter representing the weight of the output of the i-th layer, and cls_i represents the [cls] embedding of the i-th layer;

the cls vector of the mask document corresponding to each candidate keyword is expressed by the following formula:

cls'_j = Σ_{i=1}^{n} h_i · cls'_{j,i}

where cls'_{j,i} represents the [cls] embedding of the i-th layer for the j-th mask document;

the first cosine similarity is calculated using the following formula:

sim_i = (cls · cls'_i) / (‖cls‖ ‖cls'_i‖)

where sim_i represents the cosine similarity between the i-th mask document vector and the original document vector;

the second cosine similarity is calculated using the following formula:

sim'_{i,k} = (cls'_i · cls'_k) / (‖cls'_i‖ ‖cls'_k‖), k ≠ i

where sim'_{i,k} represents the cosine similarity between the i-th and k-th mask document vectors.
Further, the weighting the first cosine similarity and the second cosine similarity to obtain a total similarity includes:
summing cosine similarities of all mask document vectors;
setting a weighting coefficient of the first cosine similarity and the second cosine similarity;
the overall similarity is calculated according to the following formula:

SIM_i = λ · sim_i + (1 - λ) · Σ_{k≠i} sim'_{i,k}

where λ is the weighting coefficient, with a value range of 0 to 1.
Further, the screening the target candidate keywords according to the total similarity includes:
and sequentially screening a predetermined number of target candidate keywords according to the sequence from small to large of the total similarity.
The invention provides an unsupervised keyword extraction system, which is characterized by comprising:
the preprocessing module is used for preprocessing the original document to obtain a plurality of candidate keywords;
the mask operation module is used for performing mask operation on the original document according to the plurality of candidate keywords to obtain mask documents corresponding to each candidate keyword;
the vector acquisition module is used for inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
the computing module is used for respectively computing the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents;
the weighting module is used for weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
and the screening module is used for screening target candidate keywords according to the total similarity.
Another embodiment of the present invention also proposes a computer-readable storage medium including a stored computer program; wherein the computer program, when run, controls a device in which the computer-readable storage medium resides to perform the unsupervised keyword extraction method as described above.
Another embodiment of the present invention also proposes a terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the unsupervised keyword extraction method as described above when executing the computer program.
According to the method, the system, the equipment and the medium for extracting the unsupervised keywords, which are disclosed by the embodiment of the invention, the original document is preprocessed to obtain a plurality of candidate keywords; masking operation is carried out on the original document according to the candidate keywords respectively, so that a masking document corresponding to each candidate keyword is obtained; inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword; respectively calculating the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents; weighting the first cosine similarity and the second cosine similarity to obtain total similarity; the accuracy and the diversity of keyword extraction can be improved.
Drawings
FIG. 1 is a flowchart of an unsupervised keyword extraction method provided by an embodiment of the present invention;
FIG. 2 is a block diagram of an unsupervised keyword extraction system according to an embodiment of the present invention;
fig. 3 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
It should be noted that, the step numbers herein are only for convenience of explanation of the specific embodiments, and are not used as limiting the order of execution of the steps. The method provided in this embodiment may be executed by a relevant server, and the following description will take the server as an execution body as an example.
As shown in fig. 1, an unsupervised keyword extraction method according to a preferred embodiment of the present invention includes steps S1 to S6:
step S1, preprocessing an original document to obtain a plurality of candidate keywords;
according to the embodiment of the invention, the jieba tool is utilized to segment the original document, label the part of speech and remove the stop word, so that the candidate word is obtained. The stopping word set uses a Chinese stopping word set built in jieba, words marked as n, nr, ns, nt, nw, nz, vn are extracted, and the words respectively represent common nouns, person names, place names, organization names, work names, other proper nouns and proper nouns to form a candidate word set A. The word segmentation, part-of-speech tagging and stop word removal operations performed on the original document in this embodiment are not limited to the use of jieba tools, which are used to explain the process of obtaining candidate keywords in this embodiment, so that other word segmentation tools may be selected to perform the above preprocessing operations on the original document.
S2, masking operation is carried out on the original document according to the candidate keywords, and masking documents corresponding to the candidate keywords are obtained;
for the candidate word set A obtained in the above step, the embodiment of the invention performs the following steps on each candidate word A in the candidate word set A n And (n candidate words) performing masking operation, namely mask operation, shielding one candidate keyword at a time, and sequentially obtaining mask documents corresponding to each candidate keyword. Through the step, the embodiment of the invention can keep the lengths of the document after mask and the original document consistent, so that the information quantity difference only exists whether the document has a shielded screen or notThe candidate keywords are masked.
If the similarity between a mask document and the original document is high, the candidate keyword removed by the mask has little influence on the document, i.e., its keyword degree is low; if the similarity is low, the masked-out candidate keyword has a great influence on the document, i.e., its keyword degree is high. Therefore, after obtaining the mask documents, the embodiment of the present invention proceeds to compare the similarity between the original document and the mask documents, i.e., steps S3 and S4.
S3, inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
in this embodiment, an ALBERT model (a lightweight BERT model) is selected to perform training and inference on the original document and the mask documents corresponding to each candidate keyword. The original document and the mask documents are input into the ALBERT model to obtain the final document vector representations, denoted cls and cls' respectively,
where cls = (cls_1, cls_2, ..., cls_n) and cls' = (cls'_1, cls'_2, ..., cls'_n).
By training the documents with the ALBERT model, this embodiment can make full use of the information in the intermediate layers: the outputs of the intermediate layers are utilized, and the final embedded representation of the document is obtained by a weighted summation of each layer's output. Compared with the existing approach of using only the last layer of the pre-trained model as output, no intermediate-layer information is lost, and the accuracy of training and inference is greatly improved.
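The weighted summation over layer outputs described above can be sketched as follows; the random layer activations stand in for real ALBERT [cls] outputs, and the uniform initialization of the trainable weights h_i is an assumption made for illustration:

```python
# Sketch of step S3's embedding: the final cls vector is a trainable weighted
# sum of the [cls] output of every layer, not just the last one.
import numpy as np

def weighted_cls(layer_cls, h):
    """Combine per-layer [cls] embeddings with layer weights h (h_i trainable)."""
    layer_cls = np.asarray(layer_cls)      # shape: (n_layers, hidden_dim)
    h = np.asarray(h).reshape(-1, 1)       # shape: (n_layers, 1)
    return (h * layer_cls).sum(axis=0)     # shape: (hidden_dim,)

rng = np.random.default_rng(0)
layers = rng.normal(size=(12, 4))          # 12 layers, toy hidden size 4
weights = np.full(12, 1 / 12)              # uniform init; learned in practice
print(weighted_cls(layers, weights))
```

With uniform weights the result equals the mean over layers; training would sharpen the weights toward the most informative layers.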
Step S4, respectively calculating the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vector of the rest mask documents;
the original document cls vector is expressed by the following formula:
wherein h is i As a trainable parameter, representing the weight output by the ith layer; cls i Cls representing the i-th layer]Embedding the representation; n is the number of layers of the Albert model, and n=12 is preferred in this embodiment.
The cls vector of the mask document corresponding to each candidate keyword is expressed by the following formula:
the first cosine similarity is calculated using the following formula:
wherein sim is i Representing cosine similarity of the ith mask document vector and the original document vector;
the second cosine similarity is calculated using the following formula:
wherein,representing the cosine similarity of the ith and kth mask-document vectors.
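The two cosine similarities of step S4 may be sketched as follows; the toy vectors are illustrative stand-ins for the cls vectors the model would produce:

```python
# Sketch of step S4: first similarity (original vs. each mask document)
# and second similarity (each mask document vs. the remaining mask documents).
import numpy as np

def cosine(a, b):
    """Standard cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cls = np.array([1.0, 0.0, 1.0])                               # original-document vector (toy)
mask_cls = [np.array([1.0, 0.0, 0.9]), np.array([0.0, 1.0, 0.1])]

first = [cosine(cls, m) for m in mask_cls]                    # sim_i
second = [[cosine(mask_cls[i], mask_cls[k])
           for k in range(len(mask_cls)) if k != i]
          for i in range(len(mask_cls))]                      # sim'_{i,k}, k != i
print(first, second)
```

Here the first candidate barely changes the document when masked (high sim), so it would rank as less of a keyword than the second.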
This calculation is intended to make the selected keywords as diverse as possible, i.e., to have larger semantic differences between them.
Step S5, weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
specifically, this embodiment sums the cosine similarities of all mask document vectors, sets a weighting coefficient for the first cosine similarity and the second cosine similarity, and calculates the total similarity according to the following formula:

SIM_i = λ · sim_i + (1 - λ) · Σ_{k≠i} sim'_{i,k}

where λ is the weighting coefficient, with a value range of 0 to 1.
In this embodiment, the weighting coefficient λ is an adjustable parameter. When λ is smaller than 0.5, the similarity calculation focuses more on the diversity of the extracted candidate keywords: the extracted vocabulary may be more varied, but its relevance to the original document may be weakened. When λ is larger than 0.5, the calculation focuses more on the degree of association between the extracted candidate keywords and the original document: that association is strengthened, but diversity may be impaired. λ is adjusted according to the actual situation: set λ below 0.5 when diversity matters more, and above 0.5 when relevance to the original document matters more.
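The weighting of step S5 can be sketched as below. The exact combination formula is not reproduced in the extracted text, so the form used here, a λ-weighted sum of the first similarity and the summed second similarities, is an assumption consistent with the description:

```python
# Sketch of step S5: combine relevance (first similarity) and diversity
# (summed second similarities) with weighting coefficient lambda in [0, 1].
# The combination form is an assumption reconstructed from the description.
import numpy as np

def total_similarity(first_sims, second_sims, lam=0.5):
    """SIM_i = lam * sim_i + (1 - lam) * sum_k sim'_{i,k} (assumed form)."""
    first = np.asarray(first_sims)
    diversity = np.array([sum(row) for row in second_sims])
    return lam * first + (1 - lam) * diversity

first = [0.95, 0.40, 0.90]                      # sim_i for three candidates
second = [[0.8, 0.7], [0.8, 0.6], [0.7, 0.6]]   # sim' rows, entries with k != i
print(total_similarity(first, second, lam=0.7))
```

With lam = 1 only relevance to the original document counts; lowering lam lets the pairwise mask-document similarities (the diversity term) dominate.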
And S6, screening target candidate keywords according to the total similarity.
The smaller the SIM value, the more important the keyword; a preset number of target candidate keywords are screened in ascending order of the total similarity.
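Step S6 (ascending-order screening) may be sketched as follows, with illustrative candidate words and SIM values:

```python
# Sketch of step S6: keywords with the smallest total similarity are the
# most important, so take the first `top_k` candidates in ascending SIM order.
import numpy as np

def screen_keywords(candidates, total_sims, top_k):
    order = np.argsort(total_sims)             # ascending: smallest SIM first
    return [candidates[i] for i in order[:top_k]]

cands = ["模型", "方法", "关键词"]
sims = [0.92, 0.35, 0.60]
print(screen_keywords(cands, sims, top_k=2))   # → ['方法', '关键词']
```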
In summary, this embodiment uses the jieba tool to process the original document and obtain candidate keywords, then performs the mask operation on the document and inputs the mask document corresponding to each candidate keyword, together with the original document, into the ALBERT model, using the outputs of the ALBERT intermediate layers to obtain the final vector representations and thereby improve accuracy. The final similarity is obtained from the cosine similarity between the original document vector and each candidate keyword's mask document vector, and the cosine similarity between each candidate keyword's mask document vector and the remaining mask document vectors; keywords are screened according to this final similarity, so that the extracted keywords are more diverse and of high importance.
As shown in fig. 2, the embodiment of the present invention further provides an unsupervised keyword extraction system, configured to perform an unsupervised keyword extraction method as described above, where the system includes:
a preprocessing module 21, configured to preprocess an original document to obtain a plurality of candidate keywords;
the mask operation module 22 is configured to perform a mask operation on the original document according to the plurality of candidate keywords, so as to obtain a mask document corresponding to each candidate keyword;
the vector obtaining module 23 is configured to input the original document and each of the mask documents into a pre-trained language representation model, so as to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
a calculating module 24, configured to calculate a first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword, and a second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the remaining mask documents, respectively;
a weighting module 25, configured to weight the first cosine similarity and the second cosine similarity to obtain a total similarity;
and a screening module 26, configured to screen the target candidate keywords according to the total similarity.
The technical features and technical effects of the unsupervised keyword extraction system provided by the embodiment of the present invention are the same as those of the unsupervised keyword extraction method provided by the embodiment of the present invention, and are not repeated here. The modules in the above-described unsupervised keyword extraction system may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
The embodiment of the invention also provides a computer readable storage medium, which comprises a stored computer program; wherein the computer program, when run, controls a device in which the computer-readable storage medium resides to perform an unsupervised keyword extraction method as described above.
As shown in fig. 3, the embodiment of the present invention further provides a computer device, and fig. 3 is a block diagram of a preferred embodiment of the computer device provided by the present invention, where the computer device includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements an unsupervised keyword extraction method as described above when executing the computer program.
Preferably, the computer program may be divided into one or more modules/units (e.g. computer program 1, computer program 2, … …) stored in the memory and executed by the processor to complete the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program in the computer device.
The processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.; the general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the terminal device and connects the various parts of the terminal device using various interfaces and lines.
The memory mainly includes a program storage area, which may store an operating system, an application program required for at least one function, and the like, and a data storage area, which may store related data and the like. In addition, the memory may be a high-speed random access memory, a nonvolatile memory such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), or the like, or may be other volatile solid-state memory devices.
It should be noted that the above-mentioned terminal device may include, but is not limited to, a processor, a memory, and those skilled in the art will understand that the structural block diagram of fig. 3 is merely an example of the terminal device, and does not constitute limitation of the terminal device, and may include more or less components than those illustrated, or may combine some components, or different components.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and substitutions will now occur to those skilled in the art without departing from the spirit of the present invention, and these modifications and substitutions should also be considered to be within the scope of the present invention.

Claims (9)

1. An unsupervised keyword extraction method, comprising:
preprocessing an original document to obtain a plurality of candidate keywords;
masking operation is carried out on the original document according to the candidate keywords respectively, so that a masking document corresponding to each candidate keyword is obtained;
inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
respectively calculating the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents;
weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
and screening target candidate keywords according to the total similarity.
2. The method for extracting unsupervised keywords of claim 1, wherein the preprocessing the original document to obtain a plurality of candidate keywords comprises:
and performing word segmentation, part-of-speech tagging and stop word removal on the original document through a jieba tool.
3. The method for extracting an unsupervised keyword according to claim 1, wherein
the pre-trained language characterization model is an ALBERT model.
4. The method for extracting unsupervised keywords according to claim 1, wherein the cls vector of the original document is represented by the following formula:

cls = Σ_{i=1}^{n} h_i · cls_i

where h_i is a trainable parameter representing the weight of the output of the i-th layer, and cls_i represents the [cls] embedding of the i-th layer;

the cls vector of the mask document corresponding to each candidate keyword is expressed by the following formula:

cls'_j = Σ_{i=1}^{n} h_i · cls'_{j,i}

where cls'_{j,i} represents the [cls] embedding of the i-th layer for the j-th mask document;

the first cosine similarity is calculated using the following formula:

sim_i = (cls · cls'_i) / (‖cls‖ ‖cls'_i‖)

where sim_i represents the cosine similarity between the i-th mask document vector and the original document vector;

the second cosine similarity is calculated using the following formula:

sim'_{i,k} = (cls'_i · cls'_k) / (‖cls'_i‖ ‖cls'_k‖), k ≠ i

where sim'_{i,k} represents the cosine similarity between the i-th and k-th mask document vectors.
5. The method for extracting an unsupervised keyword according to claim 4, wherein weighting the first cosine similarity and the second cosine similarity to obtain a total similarity comprises:
summing cosine similarities of all mask document vectors;
setting a weighting coefficient of the first cosine similarity and the second cosine similarity;
the overall similarity is calculated according to the following formula:

SIM_i = λ · sim_i + (1 - λ) · Σ_{k≠i} sim'_{i,k}

where λ is the weighting coefficient, with a value range of 0 to 1.
6. The method for extracting an unsupervised keyword according to claim 1, wherein the screening the target candidate keywords according to the total similarity comprises:
and sequentially screening a predetermined number of target candidate keywords according to the sequence from small to large of the total similarity.
7. An unsupervised keyword extraction system, the system comprising:
the preprocessing module is used for preprocessing the original document to obtain a plurality of candidate keywords;
the mask operation module is used for performing mask operation on the original document according to the plurality of candidate keywords to obtain mask documents corresponding to each candidate keyword;
the vector acquisition module is used for inputting the original document and each mask document into a pre-trained language characterization model to obtain an original document cls vector and a mask document cls vector corresponding to each candidate keyword;
the computing module is used for respectively computing the first cosine similarity of the cls vector of the original document and the cls vector of the mask document corresponding to each candidate keyword and the second cosine similarity of the cls vector of the mask document corresponding to each candidate keyword and the cls vectors of the rest mask documents;
the weighting module is used for weighting the first cosine similarity and the second cosine similarity to obtain total similarity;
and the screening module is used for screening target candidate keywords according to the total similarity.
8. A computer device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the unsupervised keyword extraction method of any one of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, wherein the computer readable storage medium comprises a stored computer program; wherein the computer program, when run, controls a device in which the computer-readable storage medium is located to perform the unsupervised keyword extraction method according to any one of claims 1 to 6.
CN202311628915.2A 2023-11-30 2023-11-30 Unsupervised keyword extraction method, system, equipment and medium Pending CN117669561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311628915.2A CN117669561A (en) 2023-11-30 2023-11-30 Unsupervised keyword extraction method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311628915.2A CN117669561A (en) 2023-11-30 2023-11-30 Unsupervised keyword extraction method, system, equipment and medium

Publications (1)

Publication Number Publication Date
CN117669561A true CN117669561A (en) 2024-03-08

Family

ID=90070770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311628915.2A Pending CN117669561A (en) 2023-11-30 2023-11-30 Unsupervised keyword extraction method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN117669561A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination