CN113673223A - Keyword extraction method and system based on semantic similarity - Google Patents

Keyword extraction method and system based on semantic similarity

Publication number
CN113673223A
Authority
CN
China
Prior art keywords
words
similarity
keyword
sentences
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110981460.7A
Other languages
Chinese (zh)
Inventor
史晓凌
刘弦弦
唐先明
柳晶晶
李立琴
高艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhitong Yunlian Technology Co Ltd
Original Assignee
Beijing Zhitong Yunlian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhitong Yunlian Technology Co Ltd
Priority to CN202110981460.7A
Publication of CN113673223A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a keyword extraction method and a keyword extraction system based on semantic similarity. The method comprises the following steps: splitting the text into sentences and segmenting each sentence into words according to a domain segmentation dictionary; vectorizing the segmented words and the sentences in which the words occur; calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords; clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models; and ranking the candidate keywords in each topic model to obtain the final keyword result. With this method, word segmentation is more accurate and the extracted keywords better reflect the topic of the article.

Description

Keyword extraction method and system based on semantic similarity
Technical Field
The invention relates to the technical field of natural language processing in artificial intelligence, and in particular to a keyword extraction method and system based on semantic similarity.
Background
In the field of natural language processing, the key to handling massive numbers of text files is extracting the issues that users care about most; for long and short texts alike, a few keywords give a glimpse of the main idea of the whole text. Text-based recommendation and text-based search also depend heavily on text keywords, and the accuracy of keyword extraction directly affects the final performance of a recommendation system or a search system. Keyword extraction is therefore an important part of the field of text mining.
Current keyword extraction methods comprise supervised extraction algorithms and unsupervised extraction algorithms:
a supervised keyword extraction algorithm requires a labeled training corpus, trains a keyword extraction model on that corpus, and then uses the model to extract keywords from the documents to be processed, but the cost of manual labeling is high;
unsupervised keyword extraction algorithms fall into three major categories: keyword extraction based on statistical features, keyword extraction based on word graph models, and keyword extraction based on topic models:
keyword extraction based on statistical features uses statistical information about the words in a document to extract its keywords; the statistical information mainly comprises word weight, the position of a word in the document, and word association information; the method depends heavily on feature selection, and poorly chosen features greatly degrade the results;
keyword extraction based on a word graph model first constructs a language network graph of the document, then analyzes the graph to find the words or phrases that play an important role in it, and these are taken as the keywords of the document; when the method is used in vertical domains such as the petroleum industry and the military industry, the results are unsatisfactory and the extraction is biased toward general words;
keyword extraction based on topic models mainly uses the distribution of topics in a topic model to extract keywords, but the extracted keywords are broad and do not reflect the topic of the document well.
Disclosure of Invention
The invention aims to provide a keyword extraction method and system based on semantic similarity that solve the above problems.
The invention provides a keyword extraction method based on semantic similarity, which comprises the following steps:
S1, splitting the text into sentences, and segmenting each sentence into words according to a domain segmentation dictionary;
S2, vectorizing the segmented words and the sentences in which the words occur;
S3, calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;
S4, clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models;
S5, ranking the candidate keywords in each topic model to obtain the final keyword result.
The invention provides a keyword extraction system based on semantic similarity, which comprises the following modules:
a word segmentation module: used for splitting the text into sentences and segmenting each sentence into words according to the domain segmentation dictionary;
a vectorization module: used for vectorizing the segmented words and the sentences in which the words occur;
a similarity calculation module: used for calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;
a keyword clustering module: used for clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models;
a keyword ranking module: used for ranking the candidate keywords in each topic model to obtain the final keyword result.
The embodiment of the present invention further provides a keyword extraction device based on semantic similarity, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the keyword extraction method.
The embodiment of the invention also provides a computer-readable storage medium on which an implementation program for information transmission is stored, wherein the implementation program, when executed by a processor, implements the steps of the keyword extraction method.
With the embodiments of the invention, word segmentation is more accurate and the extracted keywords better reflect the topic of the article.
The foregoing is only an overview of the technical solutions of the present invention. Specific embodiments are described below so that the technical means of the invention can be understood more clearly and its objects, features, and advantages become more apparent.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a keyword extraction method based on semantic similarity according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating specific steps of a keyword extraction method based on semantic similarity according to an embodiment of the present invention;
FIG. 3 is a diagram of a domain segmentation dictionary in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of the input and output when BERT is used for vectorization according to an embodiment of the present invention;
FIG. 5 is a diagram of a keyword extraction system based on semantic similarity according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a keyword extraction device based on semantic similarity according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise. Furthermore, the terms "mounted", "coupled", and "connected" are to be construed broadly: a connection may, for example, be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal connection between two elements. Those skilled in the art can understand the specific meanings of these terms in the present invention according to the specific case.
Method embodiment
According to an embodiment of the present invention, a keyword extraction method based on semantic similarity is provided. FIG. 1 is a flowchart of the method; as shown in FIG. 1, the method specifically includes:
S1, splitting the text into sentences, and segmenting each sentence into words according to a domain segmentation dictionary;
S2, vectorizing the segmented words and the sentences in which the words occur;
first, the whole sentence is split into characters and encoded with a BERT model, which outputs a vector for each character; the vectors of the characters in a word are summed and then averaged to obtain the vector of the word; the CLS vector is used as the sentence vector.
S3, calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;
the similarity between each vectorized word and the sentence in which it occurs is calculated using cosine similarity; a similarity threshold is set, and words whose similarity to their sentence is greater than the threshold are kept as candidate keywords.
S4, clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models;
specifically, in this embodiment the candidate keywords are clustered with the K-means algorithm.
S5, ranking the candidate keywords in each topic model to obtain the final keyword result;
the average of the word vectors in each topic model is calculated, and then the similarity between each candidate keyword and the average word vector of its topic model is calculated; the similarities are sorted in descending order and the top n candidates of each topic model are kept as text keywords, where the value of n is user-defined according to the actual situation.
FIG. 2 is a schematic diagram of the specific process steps of the keyword extraction method based on semantic similarity according to an embodiment of the present invention. As shown in FIG. 2, the specific process steps are as follows:
After the text is split into sentences, each sentence is segmented according to a user-defined dictionary: a domain segmentation dictionary is customized from the specialized terms of each domain, which prevents domain terms from being split apart as they would be with a general-purpose segmentation dictionary. The domain segmentation dictionary is shown in FIG. 3.
Take the following sentence as an example: "The research results of this subject can play a supporting role in the future exploration and development of petrochemical shale oil and gas and in reserve management work, and can directly provide technical preparation for the shale oil and gas tertiary reserve calculation to be declared by petrochemicals in China." The word segmentation results obtained with the custom dictionary are shown in Table 1:
TABLE 1
[Table 1, word segmentation results: image BDA0003229177240000061 in the original publication]
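The dictionary-driven segmentation described above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not name a segmentation algorithm, so forward maximum matching is used here as a stand-in, and the dictionary entries, the function names `split_sentences` and `segment`, and the punctuation set are all hypothetical.

```python
import re

# Hypothetical domain dictionary; the real one is the custom dictionary of FIG. 3.
DOMAIN_DICT = {
    "页岩油气",  # shale oil and gas
    "储量管理",  # reserve management
    "勘探开发",  # exploration and development
}

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    parts = re.split(r"(?<=[。！？!?；;])", text)
    return [p.strip() for p in parts if p.strip()]

def segment(sentence, dictionary):
    """Forward maximum matching (an assumed stand-in algorithm).

    At each position the longest dictionary entry is taken; when nothing
    matches, a single character is emitted.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

Because the longest entry wins, a domain term present in the dictionary is kept whole rather than being broken into shorter general-purpose words, which is the effect the custom dictionary is meant to achieve.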
After the article has been segmented into words, the sentences and words are vectorized:
In this embodiment, the general-purpose pre-trained language representation model BERT is selected for vectorization. As shown in FIG. 4, the specific process is as follows:
First, the whole sentence is split into characters and then encoded with the BERT model; the output is a vector for each character, as shown in Table 2:
TABLE 2
Character (English gloss)    Character vector
Stone                        V1
Transforming                 V2
Probe                        V3
Zone                         V4
Page                         V5
After each character in the sentence has been encoded by the BERT model, the vector of each word is obtained by averaging the vectors of the characters in that word, as shown in Table 3:
TABLE 3
[Table 3, word vectors: image BDA0003229177240000071 in the original publication]
The sentence vector is the CLS vector in the output, denoted SV.
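The averaging step can be sketched in plain Python as below. The encoder itself is not called: the `encoded` dictionary is a toy stand-in for BERT output (two-dimensional vectors instead of real hidden states), and the word list, the names `mean_vector` and `word_vectors`, and all numeric values are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def word_vectors(words, char_vectors):
    """Average the character vectors covered by each word.

    `words` must concatenate to the encoded sentence character for
    character, so that word boundaries line up with the vectors.
    """
    if sum(len(w) for w in words) != len(char_vectors):
        raise ValueError("words and character vectors are misaligned")
    result, pos = [], 0
    for word in words:
        result.append((word, mean_vector(char_vectors[pos:pos + len(word)])))
        pos += len(word)
    return result

# Toy stand-in for the encoder output: V1..V5 of Table 2 plus the CLS vector.
encoded = {
    "cls": [0.5, 0.5],
    "chars": [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [4.0, 0.0]],
}
words = ["石化", "探区", "页"]          # hypothetical segmentation of 5 characters
vectors = word_vectors(words, encoded["chars"])
sentence_vector = encoded["cls"]        # SV in the text
```

With a real BERT encoder the output sequence also contains positions for the special tokens [CLS] and [SEP], so those positions would have to be separated from the character vectors before averaging.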
Calculating word-sentence similarity and keeping the words above a threshold:
After the vectorized representations of each sentence and of the words in it are obtained, the similarity between each word and the sentence in which it occurs is calculated using cosine similarity, as shown in Table 4:
TABLE 4
[Table 4, word-sentence similarities: image BDA0003229177240000072 in the original publication]
A threshold is set, and the words whose similarity is greater than the threshold are kept as candidate keywords.
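A minimal sketch of the similarity filter follows. Cosine similarity is computed as the dot product of the two vectors divided by the product of their norms. The threshold value 0.6, the sample words, and the function names are assumptions; the patent does not state a threshold value.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def candidate_keywords(word_vecs, sentence_vec, threshold):
    """Keep (word, similarity) pairs whose similarity exceeds the threshold."""
    kept = []
    for word, vec in word_vecs:
        similarity = cosine(vec, sentence_vec)
        if similarity > threshold:
            kept.append((word, similarity))
    return kept

# Toy data: two words from one sentence and the sentence vector SV.
sentence_vec = [1.0, 1.0]
word_vecs = [("shale", [1.0, 0.9]), ("of", [1.0, -1.0])]
candidates = candidate_keywords(word_vecs, sentence_vec, threshold=0.6)  # assumed value
```

In this toy data "shale" points almost in the same direction as the sentence vector and is kept, while "of" is orthogonal to it and is dropped.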
Clustering the candidate keywords:
The candidate keywords that exceeded the threshold are clustered with the K-means algorithm.
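The clustering step can be illustrated with a small self-contained K-means (Lloyd's algorithm). The patent names only K-means; the squared Euclidean distance, the seeded random initialization, the iteration cap, and the function names here are assumptions, and a production system would more likely rely on an existing library implementation.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(vectors, k, iterations=100, seed=0):
    """Return one cluster label per input vector."""
    rng = random.Random(seed)
    # Initial centroids: k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels

# Toy candidate keyword vectors forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = kmeans(points, k=2)
```

Each resulting cluster plays the role of one candidate keyword topic model in the text; the number of clusters k has to be chosen by the user, since the source does not specify it.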
Ranking the words in each category:
The average of the word vectors in each category is calculated, and then the similarity between each candidate keyword and the average word vector of its category is calculated; the top n keywords of each category are kept, and together they form the keywords of the whole text, that is, the keywords of the article.
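The ranking step can be sketched as below, assuming the cluster labels are already available. Cosine similarity is assumed as the similarity measure here because the earlier word-sentence step uses it; the names `top_keywords`, `candidates`, and `labels` and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def top_keywords(candidates, labels, n):
    """Keep the n words closest to the average vector of each cluster.

    candidates: list of (word, vector); labels: one cluster id per candidate.
    """
    clusters = defaultdict(list)
    for (word, vec), label in zip(candidates, labels):
        clusters[label].append((word, vec))
    keywords = []
    for label in sorted(clusters):
        members = clusters[label]
        centre = mean_vector([vec for _, vec in members])
        ranked = sorted(members, key=lambda m: cosine(m[1], centre), reverse=True)
        keywords.extend(word for word, _ in ranked[:n])
    return keywords

candidates = [
    ("a", [1.0, 0.0]), ("b", [1.0, 0.2]), ("c", [1.0, 1.0]),  # cluster 0
    ("d", [0.0, 1.0]),                                        # cluster 1
]
labels = [0, 0, 0, 1]
keywords = top_keywords(candidates, labels, n=2)
```

Ranking by similarity to the cluster average favors the words most representative of each topic, and taking the top n per cluster rather than overall keeps every topic represented in the final keyword list.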
With this embodiment of the invention, the article is segmented according to the domain segmentation dictionary, which makes the segmentation result more accurate; candidate keywords are extracted by calculating the similarity between the vectorized words and their sentences, and the keywords are selectively kept after ranking, so that the extracted keywords reflect the topic of the article more accurately.
System embodiment
According to an embodiment of the present invention, a keyword extraction system based on semantic similarity is provided. FIG. 5 is a schematic diagram of the system; as shown in FIG. 5, the system specifically includes:
The word segmentation module 50: used for splitting the text into sentences and segmenting each sentence into words according to the domain segmentation dictionary.
The vectorization module 52: used for vectorizing the segmented words and the sentences in which the words occur;
the vectorization module 52 is specifically configured to: split the whole sentence into characters, encode it with a BERT model, and output a vector for each character; and sum the vectors of the characters in a word and then average them to obtain the vector of the word.
The similarity calculation module 54: used for calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;
the similarity calculation module 54 is specifically configured to: calculate the similarity between each vectorized word and the sentence in which it occurs using cosine similarity, set a similarity threshold, and keep the words whose similarity to their sentence is greater than the threshold as candidate keywords.
The keyword clustering module 56: used for clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models.
The keyword ranking module 58: used for ranking the candidate keywords in each topic model to obtain the final keyword result.
The keyword ranking module 58 is specifically configured to: calculate the average of the word vectors in each topic model, then calculate the similarity between each candidate keyword and the average word vector of its topic model, sort the similarities in descending order, and keep the top n candidates of each topic model as text keywords according to the set value of n.
The embodiment of the present invention is a system embodiment corresponding to the above method embodiment, and specific operations of each module may be understood with reference to the description of the method embodiment, which is not described herein again.
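One possible way to wire the five modules together is sketched below. The class name, field names, and type signatures are hypothetical and only mirror the division of responsibilities among modules 50 to 58; each module is passed in as a callable so that any concrete segmenter, encoder, or clustering routine can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]
WordVec = Tuple[str, Vector]

@dataclass
class KeywordExtractionSystem:
    # Module 50: text -> list of sentences, each a list of words.
    segment: Callable[[str], List[List[str]]]
    # Module 52: words of one sentence -> (word vectors, sentence vector).
    vectorize: Callable[[List[str]], Tuple[List[Vector], Vector]]
    # Module 54: (word, vector) pairs + sentence vector -> candidate pairs.
    filter_by_similarity: Callable[[List[WordVec], Vector], List[WordVec]]
    # Module 56: candidate vectors -> one cluster label per candidate.
    cluster: Callable[[List[Vector]], List[int]]
    # Module 58: candidates + labels -> final keyword list.
    rank: Callable[[List[WordVec], List[int]], List[str]]

    def extract(self, text: str) -> List[str]:
        """Run the modules in the order S1 to S5."""
        candidates: List[WordVec] = []
        for words in self.segment(text):
            word_vecs, sentence_vec = self.vectorize(words)
            pairs = list(zip(words, word_vecs))
            candidates.extend(self.filter_by_similarity(pairs, sentence_vec))
        labels = self.cluster([vec for _, vec in candidates])
        return self.rank(candidates, labels)
```

Candidates are collected across all sentences before clustering, which matches the description that clustering and ranking operate on the candidate keywords of the whole text.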
Device embodiment I
The embodiment of the present invention provides a keyword extraction device based on semantic similarity, as shown in FIG. 6, including: a memory 60, a processor 62, and a computer program stored on the memory 60 and executable on the processor 62, wherein the computer program, when executed by the processor 62, carries out the following method steps:
S1, splitting the text into sentences, and segmenting each sentence into words according to a domain segmentation dictionary;
S2, vectorizing the segmented words and the sentences in which the words occur;
first, the whole sentence is split into characters and encoded with a BERT model, which outputs a vector for each character; the vectors of the characters in a word are summed and then averaged to obtain the vector of the word; the CLS vector is used as the sentence vector.
S3, calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;
the similarity between each vectorized word and the sentence in which it occurs is calculated using cosine similarity; a similarity threshold is set, and words whose similarity to their sentence is greater than the threshold are kept as candidate keywords.
S4, clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models;
specifically, in this embodiment the candidate keywords are clustered with the K-means algorithm.
S5, ranking the candidate keywords in each topic model to obtain the final keyword result;
the average of the word vectors in each topic model is calculated, and then the similarity between each candidate keyword and the average word vector of its topic model is calculated; the similarities are sorted in descending order and the top n candidates of each topic model are kept as text keywords, where the value of n is user-defined according to the actual situation.
Device embodiment II
The embodiment of the present invention provides a computer-readable storage medium on which an implementation program for information transmission is stored, wherein the implementation program, when executed by a processor 62, implements the following method steps:
S1, splitting the text into sentences, and segmenting each sentence into words according to a domain segmentation dictionary;
S2, vectorizing the segmented words and the sentences in which the words occur;
first, the whole sentence is split into characters and encoded with a BERT model, which outputs a vector for each character; the vectors of the characters in a word are summed and then averaged to obtain the vector of the word; the CLS vector is used as the sentence vector.
S3, calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;
the similarity between each vectorized word and the sentence in which it occurs is calculated using cosine similarity; a similarity threshold is set, and words whose similarity to their sentence is greater than the threshold are kept as candidate keywords.
S4, clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models;
specifically, in this embodiment the candidate keywords are clustered with the K-means algorithm.
S5, ranking the candidate keywords in each topic model to obtain the final keyword result;
the average of the word vectors in each topic model is calculated, and then the similarity between each candidate keyword and the average word vector of its topic model is calculated; the similarities are sorted in descending order and the top n candidates of each topic model are kept as text keywords, where the value of n is user-defined according to the actual situation.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A keyword extraction method based on semantic similarity, characterized by comprising the following steps:
S1, splitting the text into sentences, and segmenting each sentence into words according to a domain segmentation dictionary;
S2, vectorizing the segmented words and the sentences in which the words occur;
S3, calculating the similarity between each vectorized word and the sentence in which it occurs, and extracting candidate keywords;
S4, clustering the candidate keywords with a clustering algorithm to obtain candidate keyword topic models;
S5, ranking the candidate keywords in each topic model to obtain the final keyword result.
2. The method according to claim 1, wherein in step S2 the segmented words and the sentences in which the words occur are vectorized as follows:
first, the whole sentence is split into characters and encoded with a BERT model, which outputs a vector for each character; the vectors of the characters in a word are summed and then averaged to obtain the vector of the word; the CLS vector is used as the sentence vector.
3. The method according to claim 1, wherein in step S3 calculating the similarity between each vectorized word and the sentence in which it occurs and extracting the candidate keywords specifically comprises: calculating the similarity between each vectorized word and the sentence in which it occurs using cosine similarity, setting a similarity threshold, and keeping the words whose similarity to their sentence is greater than the threshold as candidate keywords.
4. The method according to claim 1, wherein in step S5 ranking the candidate keywords in each topic model to obtain the final keyword result specifically comprises: calculating the average of the word vectors in each topic model, then calculating the similarity between each candidate keyword and the average word vector of its topic model, sorting the similarities in descending order, and keeping the top n candidates of each topic model as text keywords, where the value of n is user-defined according to the actual situation.
5. A keyword extraction system based on semantic similarity is characterized by comprising:
a word segmentation module: the system is used for segmenting the text and segmenting each sentence according to the field segmentation dictionary;
a vectorization module: the system is used for vectorizing the words and sentences of the words after word segmentation;
a similarity calculation module: the method is used for calculating the similarity of the vectorized words and sentences in which the words are positioned and extracting candidate keywords;
a keyword clustering module: the candidate keywords are clustered by using a clustering algorithm to obtain a candidate keyword topic model;
a keyword ordering module: and the method is used for sequencing the candidate keywords in each topic model to obtain a final keyword result.
6. The system of claim 5, wherein the vectorization module is specifically configured to: dividing the whole sentence into words, then coding the words by using a bert model, and outputting a vector corresponding to each word; and adding the vectors of each word in the word, and then averaging to obtain the vector of the word.
7. The system of claim 5, wherein the similarity calculation module is specifically configured to: calculate, using cosine similarity, the similarity between each vectorized word and the sentence in which it is located; set a similarity threshold; and retain as candidate keywords the words whose similarity to their sentence is greater than the similarity threshold.
8. The system of claim 5, wherein the keyword ranking module is specifically configured to: calculate the average word vector of the word vectors in each topic model; calculate the similarity between each candidate keyword and the average word vector of its topic model; sort the calculated similarities in descending order; and retain the first n candidate keywords of each topic model as text keywords according to the set value of n.
9. A keyword extraction device based on semantic similarity, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the keyword extraction method according to any one of claims 1 to 4.
10. A computer-readable storage medium, on which an information transfer implementation program is stored, which, when executed by a processor, implements the steps of the keyword extraction method according to any one of claims 1 to 4.
CN202110981460.7A 2021-08-25 2021-08-25 Keyword extraction method and system based on semantic similarity Pending CN113673223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110981460.7A CN113673223A (en) 2021-08-25 2021-08-25 Keyword extraction method and system based on semantic similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110981460.7A CN113673223A (en) 2021-08-25 2021-08-25 Keyword extraction method and system based on semantic similarity

Publications (1)

Publication Number Publication Date
CN113673223A (en) 2021-11-19

Family

ID=78546085

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110981460.7A Pending CN113673223A (en) 2021-08-25 2021-08-25 Keyword extraction method and system based on semantic similarity

Country Status (1)

Country Link
CN (1) CN113673223A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562717A (en) * 2017-07-24 2018-01-09 南京邮电大学 Text keyword extraction method combining Word2Vec with term co-occurrence
CN109840532A (en) * 2017-11-24 2019-06-04 南京大学 Court similar-case recommendation method based on k-means
CN108052593A (en) * 2017-12-12 2018-05-18 山东科技大学 Topic keyword extraction method based on topic word vectors and network structure
CN109885831A (en) * 2019-01-30 2019-06-14 广州杰赛科技股份有限公司 Key term extraction method, device, equipment and computer readable storage medium
CN111859961A (en) * 2020-07-29 2020-10-30 华中师范大学 Text keyword extraction method based on improved TopicRank algorithm

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
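Jiang Fang et al.: "Semantics-based document keyword extraction method", Application Research of Computers, vol. 32, no. 1, pages 142 - 145 *
Li Jun et al.: "Keyword extraction method fusing BERT semantic weighting and a network graph", Computer Engineering, vol. 46, no. 9, pages 89 - 94 *
Wang Lixia et al.: "Semantics-based keyword extraction algorithm for Chinese text", Computer Engineering, vol. 38, no. 1, pages 1 - 4 *
Zheng Lei: "Extraction and dynamic modeling of microblog user interests", China Master's Theses Full-text Database, Information Science and Technology, no. 1, pages 138 - 1905 *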

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637840A (en) * 2022-04-27 2022-06-17 北京清博智能科技有限公司 Keyword-based abstract generation system and method
CN114970523A (en) * 2022-05-20 2022-08-30 浙江省科技信息研究院 Topic prompting type keyword extraction method based on text semantic enhancement
CN114970523B (en) * 2022-05-20 2022-11-29 浙江省科技信息研究院 Topic prompting type keyword extraction method based on text semantic enhancement
CN115687576A (en) * 2022-12-29 2023-02-03 安徽大学 Keyword extraction method and device represented by theme constraint
CN116564539A (en) * 2023-07-10 2023-08-08 神州医疗科技股份有限公司 Medical similar case recommending method and system based on information extraction and entity normalization
CN116564539B (en) * 2023-07-10 2023-10-24 神州医疗科技股份有限公司 Medical similar case recommending method and system based on information extraction and entity normalization

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
KR101999152B1 (en) English text formatting method based on convolution network
CN110019843B (en) Knowledge graph processing method and device
CN113673223A (en) Keyword extraction method and system based on semantic similarity
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN112507711B (en) Text abstract extraction method and system
CN110162771B (en) Event trigger word recognition method and device and electronic equipment
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN109902290B (en) Text information-based term extraction method, system and equipment
CN114358007A (en) Multi-label identification method and device, electronic equipment and storage medium
CN112559684A (en) Keyword extraction and information retrieval method
CN109815400A (en) Personage's interest extracting method based on long text
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN111191031A (en) Entity relation classification method of unstructured text based on WordNet and IDF
CN112307190A (en) Medical literature sorting method and device, electronic equipment and storage medium
CN111639189B (en) Text graph construction method based on text content features
CN110929022A (en) Text abstract generation method and system
CN117332788A (en) Semantic analysis method based on spoken English text
KR102370171B1 (en) Device and method to retrieve medical documents using contextual relevance
CN117076946A (en) Short text similarity determination method, device and terminal
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN111681731A (en) Method for automatically marking colors of inspection report
CN115269846A (en) Text processing method and device, electronic equipment and storage medium
CN110413782B (en) Automatic table theme classification method and device, computer equipment and storage medium
CN111241848B (en) Article reading comprehension answer retrieval method and device based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination