CN116720516A - Keyword extraction method, keyword extraction device, electronic equipment and computer readable storage medium - Google Patents

Keyword extraction method, keyword extraction device, electronic equipment and computer readable storage medium

Info

Publication number
CN116720516A
Authority
CN
China
Prior art keywords
word
word segmentation
nodes
target text
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310723836.3A
Other languages
Chinese (zh)
Inventor
刘羲
田巍
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202310723836.3A
Publication of CN116720516A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of intelligent medical treatment and discloses a keyword extraction method comprising: performing word segmentation on a target text; counting the co-occurrence relation among the segmented words and using it to construct a word segmentation connection graph of the target text; sequentially calculating the similarity between every two connected word segmentation nodes and acquiring the position relation of the two nodes in the target text; calculating the connection edge weight between the two nodes from the similarity and the position relation; updating the connection edge weight into a preset TextRank algorithm to obtain the influence score of each word segmentation node in the graph; and sorting the segmented words of the target text by influence score and selecting those at a preset ranking position as keywords of the target text. The invention also provides a keyword extraction device, electronic equipment and a computer readable storage medium. The method and the device can improve the accuracy of medical text keyword extraction.

Description

Keyword extraction method, keyword extraction device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of intelligent medical technology, and in particular, to a keyword extraction method, a keyword extraction device, an electronic device, and a computer readable storage medium.
Background
In the medical field, texts such as medical papers, gene detection reports, multi-party joint consultation reports, patient rehabilitation evaluations and heart disease monitoring reports contain many technical terms and are long, so their core information cannot be grasped quickly. Keywords are the smallest units that express the core meaning of a text; they summarize its topic concisely and help readers quickly grasp the subject and content of a document. Graph-based keyword extraction algorithms, represented by TextRank, are the mainstream approach to keyword extraction.
Research on the TextRank algorithm is focused on aspects of feature integration, transition probability matrix improvement and the like, for example, three features of word average information entropy, part of speech and word position are utilized, weight distribution of the three features is optimized through neural network training to complete feature fusion, the fused features are used for improving word initial weight and probability transition matrix of the TextRank algorithm, and extraction accuracy is improved.
For another example, the word coverage, the word position and the word frequency comprehensive weighting are used as the word position distribution weight, the probability transition matrix is calculated and the candidate keyword scores are obtained through iteration, and the improved method achieves good effects.
A further approach adds sentence-title similarity, sentence position within special paragraphs, key sentences, special sentence weights and sentence length to TextRank, yielding the iTextrank method.
However, these improvements ignore potential differences in position and semantic information among words. This matters especially for medical texts, whose logic is strict and whose argumentation is complex: the role and meaning of the same medical term at different positions in the same text depend closely on its context. As a result, keyword extraction from medical texts with the TextRank algorithm has low accuracy.
Disclosure of Invention
The invention provides a keyword extraction method, a keyword extraction device, electronic equipment and a computer readable storage medium, and mainly aims to improve the accuracy of medical text keyword extraction.
In order to achieve the above object, the present invention provides a keyword extraction method, including:
obtaining a target text, and segmenting the target text;
counting the co-occurrence relation among the word segmentation, and constructing a word segmentation connection diagram of the target text by utilizing the co-occurrence relation, wherein the word segmentation connection diagram comprises word segmentation nodes and connection edges between any two connected word segmentation nodes;
sequentially calculating the similarity between every two connected word segmentation nodes, and acquiring the position relation of the corresponding two word segmentation nodes in the target text;
calculating the connecting edge weight between the two corresponding word segmentation nodes by utilizing the similarity and the position relation;
updating the connection edge weight to a preset TextRank algorithm to obtain an influence score of each word segmentation node in the word segmentation connection diagram;
and sorting the segmented words of the target text according to the influence score of each segmented word node, and selecting segmented words with a preset ranking order as keywords of the target text.
Optionally, the sequentially calculating the similarity between every two connected word segmentation nodes includes:
the similarity between every two connected segmentation nodes is calculated by using the following formula:
wherein $v_i$ represents the i-th word segmentation node in the word segmentation connection diagram, $v_j$ represents the j-th word segmentation node connected to $v_i$, $\mathrm{sim}(v_i, v_j)$ represents the similarity between word segmentation nodes $v_i$ and $v_j$, $w(v_i)$ is the word vector of word segmentation node $v_i$, and $w(v_j)$ is the word vector of word segmentation node $v_j$.
Optionally, the obtaining the positional relationship of the two corresponding word segmentation nodes in the target text includes:
judging whether the two corresponding word segmentation nodes are in the same natural segment in the target text;
when the two word segmentation nodes are in the same natural segment, recording that the position relationship is the same natural segment;
when the two word segmentation nodes are not in the same natural segment, recording that the position relationship is different natural segments.
Optionally, calculating the connection edge weight between the two corresponding word segmentation nodes by using the similarity and the position relationship includes:
when the position relation of the two word segmentation nodes is different natural segments, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the following first weight formula:
$W_{ij} = (1 - \mathrm{sim}(v_i, v_j)) \cdot \beta$
wherein $W_{ij}$ represents the connecting edge weight between connected word segmentation nodes $v_i$ and $v_j$, $\mathrm{sim}(v_i, v_j)$ represents the similarity between word segmentation nodes $v_i$ and $v_j$, and $\beta$ is a hyperparameter with a value range of 0 to 1;
when the position relation of the two word segmentation nodes is the same natural segment, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the following second weight formula:
$W_{ij} = 1 - \mathrm{sim}(v_i, v_j)$
optionally, the counting the co-occurrence relation between each word segment includes:
selecting one word from the target text one by one as a target word;
counting co-occurrence times of the target word and adjacent words of the target word in a preset neighborhood range of the target word;
and constructing a co-occurrence matrix by utilizing the co-occurrence times corresponding to each target word, and taking the co-occurrence matrix as a co-occurrence relation between the corresponding words.
Optionally, the constructing a co-occurrence matrix by using the co-occurrence times corresponding to each target word includes:
constructing a co-occurrence matrix shown as follows by using the co-occurrence times corresponding to each target word:
wherein $V_{i,j}$ is the number of co-occurrences, in the target text, of word segment i and its adjacent word segment j.
In order to solve the above problems, the present invention also provides a keyword extraction apparatus, the apparatus comprising:
The connection diagram construction module is used for acquiring a target text, segmenting the target text, counting the co-occurrence relation among each segmented word, and constructing a segmented word connection diagram of the target text by utilizing the co-occurrence relation, wherein the segmented word connection diagram comprises segmented word nodes and connection edges between any two connected segmented word nodes;
the node edge weight calculation module is used for sequentially calculating the similarity between every two connected word segmentation nodes, acquiring the position relation of the corresponding two word segmentation nodes in the target text, and calculating the connection edge weight between the corresponding two word segmentation nodes by utilizing the similarity and the position relation;
the node influence calculation module is used for updating the connection edge weight to a preset TextRank algorithm to obtain influence scores of each word segmentation node in the word segmentation connection diagram;
and the keyword screening module is used for sorting the segmented words of the target text according to the influence score of each segmented word node, and selecting segmented words with a preset ranking order as keywords of the target text.
Optionally, the connection graph construction module counts co-occurrence relationships between each of the segmentation words by:
Selecting one word from the target text one by one as a target word;
counting co-occurrence times of the target word and adjacent words of the target word in a preset neighborhood range of the target word;
and constructing a co-occurrence matrix by utilizing the co-occurrence times corresponding to each target word, and taking the co-occurrence matrix as a co-occurrence relation between the corresponding words.
In order to solve the above-mentioned problems, the present application also provides an electronic apparatus including:
a memory storing at least one computer program; and
a processor that executes the program stored in the memory to implement the keyword extraction method described above.
In order to solve the above-mentioned problems, the present application also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the keyword extraction method described above.
According to the embodiment of the application, the co-occurrence relation among the words of the target text is utilized to construct the word segmentation connection diagram of the target text, the similarity between two connected word segmentation nodes and the position relation of the corresponding word segmentation nodes in the target text are combined, the connection side weight among the corresponding word segmentation nodes is calculated, the connection side weight obtained through calculation is updated to a preset TextRank algorithm, and the influence score of each word is obtained.
Drawings
Fig. 1 is a flow chart of a keyword extraction method according to an embodiment of the present application;
FIG. 2 is a functional block diagram of a keyword extraction device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device for implementing the keyword extraction method according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiment of the application provides a keyword extraction method. The execution subject of the keyword extraction method includes, but is not limited to, at least one of a server, a terminal, or another electronic device that can be configured to execute the method provided by the embodiment of the application. In other words, the keyword extraction method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
Referring to fig. 1, a flow chart of a keyword extraction method according to an embodiment of the invention is shown. In this embodiment, the keyword extraction method includes:
s1, acquiring a target text, and segmenting the target text;
In the embodiment of the invention, the target text is a medical text from which keywords are to be extracted. For example, the target text may be a medical paper on clinical analysis of genes and genetic diseases, whose core summary information is obtained by extracting its keywords. The target text may also be a patient rehabilitation evaluation report, whose keywords reveal the patient's vital health indicators that need attention.
In the embodiment of the invention, the target text can be subjected to word segmentation processing by adopting a preset standard medical dictionary to obtain a plurality of segmented words, wherein the standard medical dictionary comprises a plurality of standard medical segmented words.
Substrings of different lengths taken from the target text are looked up in the standard dictionary; if a standard word identical to a substring is found, that standard word is determined to be a segmented word of the target text.
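The dictionary lookup described above can be illustrated with forward maximum matching, one common dictionary-based segmentation strategy. This is a minimal sketch, not the patent's stated procedure: the function name `segment_with_dictionary`, the longest-first search order, the single-character fallback, and the sample dictionary are all assumptions made for illustration.

```python
def segment_with_dictionary(text, dictionary, max_len=None):
    """Forward maximum matching: at each position, try the longest substring first."""
    if max_len is None:
        # Assumption: the longest dictionary entry bounds the substring length.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Look up substrings of different lengths, longest first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # Assumption: an unmatched character becomes its own token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical mini-dictionary standing in for the standard medical dictionary.
medical_dictionary = {"基因", "基因缺失", "癫痫", "多态性"}
print(segment_with_dictionary("基因缺失与癫痫", medical_dictionary))
# ['基因缺失', '与', '癫痫']
```

Trying the longest substring first keeps multi-character terms such as "基因缺失" intact instead of splitting them into shorter dictionary entries.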
In another embodiment of the invention, the jieba word segmentation tool can be used to perform sentence splitting and word segmentation on the target text.
By word segmentation, the embodiment of the invention divides the target text from a whole into a set of segmented words, so that segmented words meeting the requirements can be screened out of the set by subsequent calculation as keywords of the target text. For example, segmenting a medical paper on clinical analysis of genes and genetic diseases yields words characterizing infectious diseases, such as epilepsy, multiple malformations and autism spectrum disorders; words characterizing symptoms of infectious diseases, such as abnormal complexion, congenital malformation and mental retardation; and words characterizing abnormal genes, such as gene deletion, gene micro-weight and polymorphism.
S2, counting the co-occurrence relation among each word segmentation, and constructing a word segmentation connection diagram of the target text by utilizing the co-occurrence relation, wherein the word segmentation connection diagram comprises word segmentation nodes and connection edges between any two connected word segmentation nodes;
in the embodiment of the invention, the word segmentation connection diagram of the target text can be constructed based on a TextRank algorithm, wherein the TextRank algorithm is a graph-based ordering algorithm for the text, and the word segmentation connection diagram is constructed by dividing the target text into a plurality of component unit word segments. The word segmentation connection graph refers to a network constructed according to co-occurrence relationships among the words.
In detail, the counting the co-occurrence relation among each word segment includes:
selecting one word from the target text one by one as a target word;
counting co-occurrence times of the target word and adjacent words of the target word in a preset neighborhood range of the target word;
and constructing a co-occurrence matrix by utilizing the co-occurrence times corresponding to each target word, and taking the co-occurrence matrix as a co-occurrence relation between the corresponding words.
Illustratively, the co-occurrence matrix shown below is constructed by using the co-occurrence times corresponding to each target word:
wherein $V_{i,j}$ is the number of co-occurrences, in the target text, of word segment i and its adjacent word segment j.
By constructing the word segmentation connection diagram of the target text, the embodiment of the invention lays out and associates the segmented words (word segmentation nodes) in graph form according to the co-occurrence relations among them.
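The co-occurrence counting in this step can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `build_cooccurrence`, the `window` parameter (standing in for the preset neighborhood range), the symmetric counting, and the skipping of identical tokens are choices made here, not details given in the source.

```python
from collections import defaultdict


def build_cooccurrence(tokens, window=2):
    """Count how often two distinct tokens occur within `window` positions of each other."""
    counts = defaultdict(int)
    for i, target in enumerate(tokens):
        # Look only forward so that each pair of positions is counted once.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            neighbor = tokens[j]
            if neighbor == target:
                continue  # assumption: a token does not co-occur with itself
            counts[(target, neighbor)] += 1
            counts[(neighbor, target)] += 1  # keep the matrix symmetric
    return dict(counts)


tokens = ["gene", "deletion", "epilepsy", "gene", "deletion"]
cooccurrence = build_cooccurrence(tokens, window=1)
print(cooccurrence[("gene", "deletion")])  # 2
```

Every key with a non-zero count corresponds to a connection edge between two word segmentation nodes in the graph.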
S3, sequentially calculating the similarity between every two connected word segmentation nodes, and acquiring the position relation of the corresponding two word segmentation nodes in the target text;
in the embodiment of the invention, word vector conversion can be carried out on the word segmentation nodes, so that the word segmentation nodes can be converted from text description information into digital vector data which can be recognized and calculated by a computer.
Illustratively, the similarity between each two connected participle nodes is calculated using the following formula:
wherein $v_i$ represents the i-th word segmentation node in the word segmentation connection diagram, $v_j$ represents the j-th word segmentation node connected to $v_i$, $\mathrm{sim}(v_i, v_j)$ represents the similarity between word segmentation nodes $v_i$ and $v_j$, $w(v_i)$ is the word vector of word segmentation node $v_i$, and $w(v_j)$ is the word vector of word segmentation node $v_j$.
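The similarity formula itself is not reproduced in this text, so the sketch below assumes cosine similarity between the two word vectors, a common choice for comparing word embeddings. The function name `cosine_similarity` and the zero-vector handling are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length word vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Toy two-dimensional vectors; real word vectors would come from an embedding model.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```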
In the embodiment of the present invention, the positional relationship refers to how two word segmentation nodes are distributed in the target text, including, but not limited to, whether the two word segmentation nodes are in the same natural segment, whether they are in the same sentence, or whether they are in two consecutive natural segments.
Illustratively, the obtaining the positional relationship of the two corresponding word segmentation nodes in the target text includes:
judging whether the two corresponding word segmentation nodes are in the same natural segment in the target text;
when the two word segmentation nodes are in the same natural segment, recording that the position relationship is the same natural segment;
when the two word segmentation nodes are not in the same natural segment, recording that the position relationship is different natural segments.
In the embodiment of the invention, the similarity and the position relation between two word segmentation nodes are taken as two important consideration factors for keyword extraction.
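The same-natural-segment check can be sketched as below. The source does not say how to treat a word that appears in several natural segments, so this sketch assumes two words count as "same" when they share at least one natural segment; the function names and the list-of-token-lists input format are likewise hypothetical.

```python
def paragraph_indices(paragraphs):
    """Map each token to the set of natural-segment indices in which it occurs."""
    index = {}
    for position, tokens in enumerate(paragraphs):
        for token in tokens:
            index.setdefault(token, set()).add(position)
    return index


def positional_relation(token_a, token_b, index):
    """Return 'same' if the tokens share at least one natural segment, else 'different'."""
    shared = index.get(token_a, set()) & index.get(token_b, set())
    return "same" if shared else "different"


# Each inner list holds the segmented words of one natural segment.
paragraphs = [["gene", "deletion"], ["epilepsy", "gene"]]
index = paragraph_indices(paragraphs)
print(positional_relation("gene", "deletion", index))      # same
print(positional_relation("deletion", "epilepsy", index))  # different
```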
S4, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the similarity and the position relation;
It can be understood that when two word segmentation nodes are in the same natural segment, a lower similarity between them means a larger semantic difference, and hence a higher transition probability between the two nodes in the TextRank algorithm and a larger connecting edge weight. When the two word segmentation nodes are not in the same natural segment, their degree of association is lower, so the transition probability between them is attenuated.
In detail, the calculating the connection edge weight between the two corresponding word segmentation nodes by using the similarity and the position relationship includes:
when the position relation of the two word segmentation nodes is different natural segments, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the following first weight formula:
$W_{ij} = (1 - \mathrm{sim}(v_i, v_j)) \cdot \beta$
wherein $W_{ij}$ represents the connecting edge weight between connected word segmentation nodes $v_i$ and $v_j$, $\mathrm{sim}(v_i, v_j)$ represents the similarity between word segmentation nodes $v_i$ and $v_j$, and $\beta$ is a hyperparameter with a value range of 0 to 1;
When the position relation of the two word segmentation nodes is the same natural segment, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the following second weight formula:
$W_{ij} = 1 - \mathrm{sim}(v_i, v_j)$
In the embodiment of the invention, the connecting edge weight differs from that of the traditional TextRank algorithm, which uses the similarity between segmented words as the connecting edge weight. That approach considers only the semantics of the segmented words and ignores their positional distribution in the target text. Integrating both the similarity between word segmentation nodes and their positional relationship improves the accuracy of the connecting edge weight.
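The two weight formulas can be combined into one small function, sketched below. The function name `edge_weight`, the boolean `same_segment` flag, and the default `beta=0.5` are assumptions for illustration; the source only states that β lies between 0 and 1.

```python
def edge_weight(similarity, same_segment, beta=0.5):
    """Connecting edge weight: 1 - sim, attenuated by beta across natural segments."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    base = 1.0 - similarity
    # Second weight formula within one natural segment, first formula otherwise.
    return base if same_segment else base * beta


print(edge_weight(0.25, same_segment=True))             # 0.75
print(edge_weight(0.25, same_segment=False, beta=0.5))  # 0.375
```

With β below 1, an edge between words in different natural segments always weighs less than an edge with the same similarity inside one natural segment.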
S5, updating the connection edge weight to a preset TextRank algorithm to obtain an influence score of each word segmentation node in the word segmentation connection diagram;
In the embodiment of the present invention, the preset TextRank algorithm may adopt the following formula:
$$S(v_i) = (1 - d) + d \sum_{v_j \in \mathrm{In}(v_i)} \frac{w_{ji}}{\sum_{v_k \in \mathrm{Out}(v_j)} w_{jk}} S(v_j)$$
wherein $S(v_i)$ represents the influence score of word segmentation node $v_i$ in the target text, $d$ is a smoothing factor that typically takes a value of 0.85, $\mathrm{In}(v_i)$ is the set of other word segmentation nodes in the target text that connect to word segmentation node $v_i$, $\mathrm{Out}(v_j)$ is the set of other word segmentation nodes in the target text that word segmentation node $v_j$ points to, $w_{jk}$ represents the connecting edge weight between word segmentation nodes $v_j$ and $v_k$, and $w_{ji}$ represents the connecting edge weight between word segmentation nodes $v_j$ and $v_i$.
It should be noted that in the conventional TextRank algorithm the connection edge weight generally considers only the similarity between the two connected word segmentation nodes. In the embodiment of the present invention, the connection edge weight integrates both the similarity and the positional relationship between the word segmentation nodes, which improves its accuracy. The calculated connection edge weights are updated into the preset TextRank algorithm, and the $S(v_i)$ of all word segmentation nodes are computed over repeated iterations to obtain the final $S(v_i)$, i.e., the influence score of each word segmentation node.
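The iterative computation can be sketched as follows. The sketch assumes the commonly used weighted TextRank update, in which each node receives from every incoming neighbor a share of that neighbor's score proportional to the edge weight. The function name `weighted_textrank`, the edge-dictionary input, the initial score of 1.0, and the stopping parameters `max_iter` and `tol` are assumptions, not values taken from the source.

```python
def weighted_textrank(weights, d=0.85, max_iter=100, tol=1e-6):
    """Iterate weighted TextRank over `weights`, a dict {(source, target): edge weight}."""
    nodes = sorted({node for edge in weights for node in edge})
    if not nodes:
        return {}
    out_sum = {node: 0.0 for node in nodes}   # total outgoing weight per node
    incoming = {node: [] for node in nodes}   # (source, weight) pairs per node
    for (src, dst), w in weights.items():
        out_sum[src] += w
        incoming[dst].append((src, w))
    scores = {node: 1.0 for node in nodes}    # assumption: uniform initial scores
    for _ in range(max_iter):
        new_scores = {}
        for node in nodes:
            rank = sum(w / out_sum[src] * scores[src]
                       for src, w in incoming[node] if out_sum[src] > 0)
            new_scores[node] = (1 - d) + d * rank
        delta = max(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:  # stop once the scores have stabilised
            break
    return scores


# Undirected graph expressed as two directed edges per connection.
star = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0}
scores = weighted_textrank(star)
print(scores["b"] > scores["a"])  # True: the hub node scores highest
```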
S6, sorting the segmented words of the target text according to the influence score of each segmented word node, and selecting segmented words with a preset ranking order as keywords of the target text.
In the embodiment of the invention, all the segmented words can be sorted by the influence score of their word segmentation nodes in descending order; generally, the higher the influence score of a segmented word, the higher the probability that it is a keyword of the target text.
In the embodiment of the application, the preset ranking order may be the top 5 or the top 3, i.e., the 5 or 3 highest-ranked segmented words are selected as the keywords of the target text.
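The sorting and selection step can be sketched in a few lines. The function name `select_keywords` and the sample scores are hypothetical; `top_k` stands in for the preset ranking order.

```python
def select_keywords(scores, top_k=5):
    """Sort segmented words by influence score, descending, and keep the first top_k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [token for token, _ in ranked[:top_k]]


# Made-up influence scores, for illustration only.
scores = {"gene": 1.4, "deletion": 0.9, "epilepsy": 1.1}
print(select_keywords(scores, top_k=2))  # ['gene', 'epilepsy']
```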
According to the embodiment of the application, the co-occurrence relation among the words of the target text is utilized to construct the word segmentation connection diagram of the target text, the similarity between two connected word segmentation nodes and the position relation of the corresponding word segmentation nodes in the target text are combined, the connection side weight among the corresponding word segmentation nodes is calculated, the connection side weight obtained through calculation is updated to a preset TextRank algorithm, and the influence score of each word is obtained.
Fig. 2 is a functional block diagram of a keyword extraction device according to an embodiment of the present application.
The keyword extraction apparatus 100 of the present application may be installed in an electronic device. According to the implemented functions, the keyword extraction apparatus 100 includes a connection diagram construction module 101, a node edge weight calculation module 102, a node influence calculation module 103 and a keyword screening module 104. A module of the application, which may also be referred to as a unit, is a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the connection diagram construction module 101 is configured to obtain a target text, segment the target text, count co-occurrence relationships between each segment, and construct a segment connection diagram of the target text using the co-occurrence relationships, where the segment connection diagram includes segment nodes and connection edges between any two connected segment nodes;
the node edge weight calculation module 102 is configured to sequentially calculate a similarity between each two connected word segmentation nodes, obtain a positional relationship between the corresponding two word segmentation nodes in the target text, and calculate a connection edge weight between the corresponding two word segmentation nodes by using the similarity and the positional relationship;
the node influence calculation module 103 is configured to update the connection edge weight to a preset TextRank algorithm, and obtain an influence score of each word segmentation node in the word segmentation connection graph;
the keyword screening module 104 is configured to sort the tokens of the target text according to the influence score of each token node, and select a token of a preset ranking order as a keyword of the target text.
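One way to picture how the four modules cooperate is a small container that chains them in order. This is only a structural sketch: the class name `KeywordExtractionDevice`, the field names, and the callable signatures are hypothetical and are not defined by the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]


@dataclass
class KeywordExtractionDevice:
    """Chains four callables that stand in for modules 101 to 104."""
    build_graph: Callable[[str], List[Edge]]                              # module 101
    compute_edge_weights: Callable[[str, List[Edge]], Dict[Edge, float]]  # module 102
    compute_influence: Callable[[Dict[Edge, float]], Dict[str, float]]    # module 103
    screen_keywords: Callable[[Dict[str, float]], List[str]]              # module 104

    def extract(self, text: str) -> List[str]:
        edges = self.build_graph(text)                    # segment text, build the graph
        weights = self.compute_edge_weights(text, edges)  # similarity plus position
        scores = self.compute_influence(weights)          # TextRank influence scores
        return self.screen_keywords(scores)               # sort and select keywords
```

Passing the modules in as callables keeps each stage replaceable, for example swapping the segmentation strategy without touching the scoring stage.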
In detail, the specific embodiments of each module of the keyword extraction apparatus 100 are as follows:
step one, acquiring a target text, and word segmentation is carried out on the target text;
In the embodiment of the invention, the target text is a text from which keywords are to be extracted. For example, the target text may be a medical paper on clinical analysis of genes and genetic diseases, whose core summary information is obtained by extracting its keywords. The target text may also be a patient rehabilitation evaluation report, whose keywords reveal the patient's vital health indicators that need attention.
In the embodiment of the invention, the target text can be subjected to word segmentation processing by adopting a preset standard medical dictionary to obtain a plurality of segmented words, wherein the standard medical dictionary comprises a plurality of standard medical segmented words.
Candidate strings of different lengths taken from the target text are looked up in the standard medical dictionary, and if a standard word identical to a candidate string is found, that standard word is determined to be a segmented word of the target text.
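The dictionary lookup described above can be sketched as a greedy longest-match segmenter. This is a minimal illustration, not the patented implementation: the function name `segment_by_dictionary`, the `max_len` parameter, the fallback to single characters, and the tiny sample dictionary are all assumptions introduced here.

```python
def segment_by_dictionary(text, dictionary, max_len=6):
    """Greedy longest-match segmentation against a set of standard words."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # Assumption: with no dictionary hit, emit a single character.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical stand-in for the standard medical dictionary.
medical_dictionary = {"基因", "缺失", "基因缺失", "癫痫"}
tokens = segment_by_dictionary("基因缺失与癫痫", medical_dictionary)
```

Trying longer candidates first means a compound term such as "基因缺失" is kept whole rather than being split into "基因" and "缺失".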
In another embodiment of the invention, the jieba word segmentation tool can be used to perform sentence splitting and word segmentation on the target text.
Through word segmentation, the target text is divided from a whole into a set of segmented words, so that the segmented words meeting the requirements can be screened out of this set as keywords of the target text through subsequent calculation. For example, segmenting a medical paper on the clinical analysis of genes and genetic diseases yields words that characterize diseases, such as epilepsy, multiple malformations, and autism spectrum disorders; words that characterize disease symptoms, such as abnormal facial appearance, congenital malformation, and mental retardation; and words that characterize gene abnormalities, such as gene deletion, gene microduplication, and polymorphism.
Step two, counting the co-occurrence relation among the segmented words, and constructing a segmented word connection diagram of the target text by utilizing the co-occurrence relation, wherein the segmented word connection diagram comprises segmented word nodes and connection edges between any two connected segmented word nodes;
in the embodiment of the invention, the word segmentation connection diagram of the target text can be constructed based on a TextRank algorithm, wherein the TextRank algorithm is a graph-based ordering algorithm for the text, and the word segmentation connection diagram is constructed by dividing the target text into a plurality of component unit word segments. The word segmentation connection graph refers to a network constructed according to co-occurrence relationships among the words.
In detail, the counting the co-occurrence relation among each word segment includes:
selecting one word from the target text one by one as a target word;
counting co-occurrence times of the target word and adjacent words of the target word in a preset neighborhood range of the target word;
and constructing a co-occurrence matrix by utilizing the co-occurrence times corresponding to each target word, and taking the co-occurrence matrix as a co-occurrence relation between the corresponding words.
Illustratively, the co-occurrence matrix shown below is constructed by using the co-occurrence times corresponding to each target word:
$$V = (V_{i,j})$$
wherein $V_{i,j}$ is the number of co-occurrences in the target text of word segmentation $i$ and its adjacent word segmentation $j$.
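The counting steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the window size `window`, the choice to count each unordered pair once per occurrence, the exclusion of a word co-occurring with itself, and the helper names `cooccurrence_counts` and `to_matrix` are all hypothetical and not taken from the source.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count how often two distinct words appear within `window` positions."""
    counts = Counter()
    for i, target in enumerate(tokens):
        # Look only to the right so each unordered pair is counted once
        # per occurrence.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            neighbour = tokens[j]
            if neighbour != target:
                counts[frozenset((target, neighbour))] += 1
    return counts


def to_matrix(tokens, counts):
    """Arrange the pair counts as a symmetric matrix over the sorted vocabulary."""
    vocab = sorted(set(tokens))
    index = {word: k for k, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for pair, n in counts.items():
        a, b = tuple(pair)
        matrix[index[a]][index[b]] = n
        matrix[index[b]][index[a]] = n
    return vocab, matrix


tokens = ["gene", "deletion", "epilepsy", "gene", "deletion"]
counts = cooccurrence_counts(tokens, window=2)
vocab, V = to_matrix(tokens, counts)
```

Any non-zero entry of the matrix corresponds to a connection edge between the two word segmentation nodes in the connection graph.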
In the embodiment of the invention, by constructing the word segmentation connection graph of the target text, the segmented words (word segmentation nodes) are laid out and linked in graph form according to the co-occurrence relationships among them.
Step three, sequentially calculating the similarity between every two connected word segmentation nodes, and acquiring the position relation of the corresponding two word segmentation nodes in the target text;
in the embodiment of the invention, word vector conversion can be carried out on the word segmentation nodes, so that the word segmentation nodes can be converted from text description information into digital vector data which can be recognized and calculated by a computer.
Illustratively, the similarity between each two connected participle nodes is calculated using the following formula:
wherein $v_i$ represents the i-th word segmentation node in the word segmentation connection graph, $v_j$ represents the j-th word segmentation node connected to $v_i$, $\mathrm{sim}(v_i, v_j)$ represents the similarity between word segmentation node $v_i$ and word segmentation node $v_j$, $w(v_i)$ is the word vector of word segmentation node $v_i$, and $w(v_j)$ is the word vector of word segmentation node $v_j$.
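As an illustration of a similarity computed from two word vectors, the sketch below uses cosine similarity. Cosine similarity is an assumption made here for the example; the text above only states that the similarity is computed from $w(v_i)$ and $w(v_j)$. The function name and the zero-vector handling are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)
```

For word vectors with non-negative components the result lies between 0 and 1, which keeps a weight of the form 1 minus similarity non-negative.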
In the embodiment of the present invention, the positional relationship refers to how two word segmentation nodes are distributed in the target text, including, but not limited to, whether the two word segmentation nodes are in the same natural segment (paragraph), whether they are in the same sentence, or whether they are in two adjacent natural segments.
Illustratively, the obtaining the positional relationship of the two corresponding word segmentation nodes in the target text includes:
judging whether the two corresponding word segmentation nodes are in the same natural segment in the target text;
when the two word segmentation nodes are in the same natural segment, recording that the position relationship is the same natural segment;
when the two word segmentation nodes are not in the same natural segment, recording that the position relationship is different natural segments.
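The judgment above can be sketched as a lookup of the natural segments in which each word occurs. One point is not specified in the source and is assumed here: a word that occurs in several natural segments is treated as being in the same natural segment as another word if the two share at least one segment. The helper names are hypothetical.

```python
def paragraphs_of(paragraph_tokens):
    """Map each word to the set of paragraph indices in which it occurs."""
    where = {}
    for idx, tokens in enumerate(paragraph_tokens):
        for token in tokens:
            where.setdefault(token, set()).add(idx)
    return where


def positional_relation(word_a, word_b, where):
    """Return the recorded positional relationship of two words."""
    shared = where.get(word_a, set()) & where.get(word_b, set())
    return "same natural segment" if shared else "different natural segments"


# Each inner list holds the segmented words of one natural segment.
paragraph_tokens = [["gene", "deletion"], ["epilepsy", "gene"]]
where = paragraphs_of(paragraph_tokens)
```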
In the embodiment of the invention, the similarity and the position relation between two word segmentation nodes are taken as two important consideration factors for keyword extraction.
Step four, calculating the connecting edge weight between the two corresponding word segmentation nodes by utilizing the similarity and the position relation;
It can be understood that when two adjacent word segmentation nodes are in the same natural segment, the lower their similarity (that is, the larger their semantic difference), the higher the transition probability between the two nodes in the TextRank algorithm, and therefore the larger the connecting edge weight between them. When two word segmentation nodes are not in the same natural segment, their degree of association is reduced, and the transition probability between them is attenuated.
In detail, the calculating the connection edge weight between the two corresponding word segmentation nodes by using the similarity and the position relationship includes:
when the positional relationship of the two word segmentation nodes is different natural segments, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the following first weight formula:
$$W_{ij} = (1 - \mathrm{sim}(v_i, v_j)) \cdot \beta$$
wherein $W_{ij}$ represents the connecting edge weight between connected word segmentation nodes $v_i$ and $v_j$, $\mathrm{sim}(v_i, v_j)$ represents the similarity between word segmentation node $v_i$ and word segmentation node $v_j$, and $\beta$ is a hyperparameter with a value range of 0 to 1;
when the positional relationship of the two word segmentation nodes is the same natural segment, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the following second weight formula:
$$W_{ij} = 1 - \mathrm{sim}(v_i, v_j)$$
The traditional TextRank algorithm uses only the similarity between segmented words as the connecting edge weight, which considers the influence of word semantics but ignores the influence of how the segmented words are distributed in the target text. By integrating both the similarity between word segmentation nodes and their positional relationship, the embodiment of the invention can improve the accuracy of the connecting edge weight.
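The two weight formulas can be combined into one small function, sketched below. The function name `edge_weight`, the boolean flag `same_segment`, and the default `beta=0.5` are assumptions for illustration; the source only states that the hyperparameter lies between 0 and 1.

```python
def edge_weight(similarity, same_segment, beta=0.5):
    """Connecting edge weight from similarity and positional relationship."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in the range 0 to 1")
    base = 1.0 - similarity
    # Same natural segment: second formula. Different segments: the same
    # quantity attenuated by beta (first formula).
    return base if same_segment else base * beta
```

With a similarity of 0.2, for example, the weight is 0.8 for two words in the same natural segment and 0.4 for words in different natural segments when `beta` is 0.5.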
Step five, updating the connection edge weight to a preset TextRank algorithm to obtain an influence score of each word segmentation node in the word segmentation connection diagram;
in the embodiment of the present invention, the preset TextRank algorithm may adopt the following algorithm:
$$S(v_i) = (1 - d) + d \sum_{v_j \in \mathrm{In}(v_i)} \frac{w_{ji}}{\sum_{v_k \in \mathrm{Out}(v_j)} w_{jk}} S(v_j)$$
wherein $S(v_i)$ represents the influence score of word segmentation node $v_i$ in the target text, $d$ is a smoothing factor, typically taking a value of 0.85, $\mathrm{In}(v_i)$ is the set of other word segmentation nodes in the target text that connect to word segmentation node $v_i$, $\mathrm{Out}(v_j)$ is the set of other word segmentation nodes in the target text that word segmentation node $v_j$ points to, $w_{jk}$ represents the connecting edge weight between word segmentation node $v_j$ and word segmentation node $v_k$, and $w_{ji}$ represents the connecting edge weight between word segmentation node $v_j$ and word segmentation node $v_i$.
It should be noted that, in the conventional TextRank algorithm, the connecting edge weight generally considers only the similarity between the two connected word segmentation nodes. In the embodiment of the present invention, the connecting edge weight integrates both the similarity and the positional relationship between the word segmentation nodes, which improves the accuracy of the final connecting edge weight. The calculated connecting edge weights are updated into the preset TextRank algorithm, and according to the TextRank algorithm all $S(v_i)$ are computed through repeated iteration to obtain the final $S(v_i)$, i.e., the influence score of each word segmentation node.
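The iterative computation can be sketched with the standard weighted TextRank update, in which each node's score is (1 - d) plus d times the sum of its neighbours' scores, each scaled by the share of that neighbour's total edge weight that falls on the edge to the node. The sketch assumes an undirected graph stored as a symmetric dictionary of dictionaries; the initial score of 1.0, the iteration cap `max_iter`, and the tolerance `tol` are illustrative choices not stated in the source.

```python
def textrank(weights, d=0.85, max_iter=100, tol=1e-6):
    """weights maps node -> {neighbour: edge weight}, symmetric for both ends."""
    if not weights:
        return {}
    scores = {node: 1.0 for node in weights}
    # Total edge weight leaving each node: the denominator of the update.
    out_sum = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    for _ in range(max_iter):
        new_scores = {}
        for i in weights:
            rank = 0.0
            for j, w_ji in weights[i].items():
                if out_sum[j] > 0:
                    rank += w_ji / out_sum[j] * scores[j]
            new_scores[i] = (1 - d) + d * rank
        converged = max(abs(new_scores[n] - scores[n]) for n in weights) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Toy graph: "gene" is connected to both other words, which are not
# connected to each other.
graph = {
    "gene": {"deletion": 1.0, "epilepsy": 1.0},
    "deletion": {"gene": 1.0},
    "epilepsy": {"gene": 1.0},
}
scores = textrank(graph)
```

In this toy graph the central node ends up with the highest score, and the two outer nodes receive equal scores.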
Step six, sorting the word segmentation of the target text according to the influence score of each word segmentation node, and selecting the word segmentation with a preset ranking order as a keyword of the target text.
In the embodiment of the invention, all the segmented words can be sorted by the influence score of each word segmentation node from largest to smallest. Generally, the higher the influence score of a segmented word, the higher the probability that the segmented word is a keyword of the target text.
In the embodiment of the application, the preset ranking order may be the top 5 or the top 3, that is, the top 5 or the top 3 segmented words in the ranking are selected as the keywords of the target text.
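The sorting and selection step can be sketched as follows. The function name `top_keywords`, the alphabetical tie-break, and the sample scores are assumptions made for the example.

```python
def top_keywords(scores, k=5):
    """Return the k words with the highest influence scores."""
    # Sort by descending score; break ties alphabetically so the result
    # is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical influence scores for four segmented words.
sample_scores = {"gene": 1.4, "deletion": 0.9, "epilepsy": 1.1, "and": 0.2}
keywords = top_keywords(sample_scores, k=3)
```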
According to the embodiment of the application, the co-occurrence relation among the words of the target text is utilized to construct the word segmentation connection diagram of the target text, the similarity between two connected word segmentation nodes and the position relation of the corresponding word segmentation nodes in the target text are combined, the connection side weight among the corresponding word segmentation nodes is calculated, the connection side weight obtained through calculation is updated to a preset TextRank algorithm, and the influence score of each word is obtained.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a keyword extraction method according to an embodiment of the present application.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a keyword extraction program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of keyword extraction programs, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects respective components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules (e.g., keyword extraction programs, etc.) stored in the memory 11, and calling data stored in the memory 11.
The bus may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 3 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.
The keyword extraction program stored in the memory 11 of the electronic device 1 is a combination of a plurality of instructions, which when executed in the processor 10, may implement:
Obtaining a target text, and segmenting the target text;
counting the co-occurrence relation among the word segmentation, and constructing a word segmentation connection diagram of the target text by utilizing the co-occurrence relation, wherein the word segmentation connection diagram comprises word segmentation nodes and connection edges between any two connected word segmentation nodes;
sequentially calculating the similarity between every two connected word segmentation nodes, and acquiring the position relation of the corresponding two word segmentation nodes in the target text;
calculating the connecting edge weight between the two corresponding word segmentation nodes by utilizing the similarity and the position relation;
updating the connection edge weight to a preset TextRank algorithm to obtain an influence score of each word segmentation node in the word segmentation connection diagram;
and sorting the segmented words of the target text according to the influence score of each segmented word node, and selecting segmented words with a preset ranking order as keywords of the target text.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
obtaining a target text, and segmenting the target text;
counting the co-occurrence relation among the word segmentation, and constructing a word segmentation connection diagram of the target text by utilizing the co-occurrence relation, wherein the word segmentation connection diagram comprises word segmentation nodes and connection edges between any two connected word segmentation nodes;
sequentially calculating the similarity between every two connected word segmentation nodes, and acquiring the position relation of the corresponding two word segmentation nodes in the target text;
calculating the connecting edge weight between the two corresponding word segmentation nodes by utilizing the similarity and the position relation;
updating the connection edge weight to a preset TextRank algorithm to obtain an influence score of each word segmentation node in the word segmentation connection diagram;
and sorting the segmented words of the target text according to the influence score of each segmented word node, and selecting segmented words with a preset ranking order as keywords of the target text.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims may also be implemented by a single unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application and not for limiting the same, and although the present application has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present application without departing from the spirit and scope of the technical solution of the present application.

Claims (10)

1. A keyword extraction method, the method comprising:
obtaining a target text, and segmenting the target text;
Counting the co-occurrence relation among the word segmentation, and constructing a word segmentation connection diagram of the target text by utilizing the co-occurrence relation, wherein the word segmentation connection diagram comprises word segmentation nodes and connection edges between any two connected word segmentation nodes;
sequentially calculating the similarity between every two connected word segmentation nodes, and acquiring the position relation of the corresponding two word segmentation nodes in the target text;
calculating the connecting edge weight between the two corresponding word segmentation nodes by utilizing the similarity and the position relation;
updating the connection edge weight to a preset TextRank algorithm to obtain an influence score of each word segmentation node in the word segmentation connection diagram;
and sorting the segmented words of the target text according to the influence score of each segmented word node, and selecting segmented words with a preset ranking order as keywords of the target text.
2. The keyword extraction method as claimed in claim 1, wherein the sequentially calculating the similarity between each two connected segmentation nodes comprises:
the similarity between every two connected segmentation nodes is calculated by using the following formula:
wherein $v_i$ represents the i-th word segmentation node in the word segmentation connection graph, $v_j$ represents the j-th word segmentation node connected to $v_i$, $\mathrm{sim}(v_i, v_j)$ represents the similarity between word segmentation node $v_i$ and word segmentation node $v_j$, $w(v_i)$ is the word vector of word segmentation node $v_i$, and $w(v_j)$ is the word vector of word segmentation node $v_j$.
3. The keyword extraction method as claimed in claim 1, wherein the obtaining the positional relationship of the two corresponding segmentation nodes in the target text includes:
judging whether the two corresponding word segmentation nodes are in the same natural segment in the target text;
when the two word segmentation nodes are in the same natural segment, recording that the position relationship is the same natural segment;
when the two word segmentation nodes are not in the same natural segment, recording that the position relationship is different natural segments.
4. The keyword extraction method as claimed in claim 3, wherein calculating the connection edge weight between the corresponding two word segmentation nodes by using the similarity and the position relationship comprises:
when the position relation of the two word segmentation nodes is different natural segments, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the following first weight formula:
$$W_{ij} = (1 - \mathrm{sim}(v_i, v_j)) \cdot \beta$$
wherein $W_{ij}$ represents the connecting edge weight between connected word segmentation nodes $v_i$ and $v_j$, $\mathrm{sim}(v_i, v_j)$ represents the similarity between word segmentation node $v_i$ and word segmentation node $v_j$, and $\beta$ is a hyperparameter with a value range of 0 to 1;
when the positional relationship of the two word segmentation nodes is the same natural segment, calculating the connecting edge weight between the two corresponding word segmentation nodes by using the following second weight formula:
$$W_{ij} = 1 - \mathrm{sim}(v_i, v_j).$$
5. The keyword extraction method of claim 1, wherein said counting co-occurrence relationships between each of said tokens comprises:
selecting one word from the target text one by one as a target word;
counting co-occurrence times of the target word and adjacent words of the target word in a preset neighborhood range of the target word;
and constructing a co-occurrence matrix by utilizing the co-occurrence times corresponding to each target word, and taking the co-occurrence matrix as a co-occurrence relation between the corresponding words.
6. The keyword extraction method of claim 5, wherein the constructing a co-occurrence matrix using the co-occurrence times corresponding to each of the target tokens includes:
constructing a co-occurrence matrix shown as follows by using the co-occurrence times corresponding to each target word:
$$V = (V_{i,j})$$
wherein $V_{i,j}$ is the number of co-occurrences in the target text of word segmentation $i$ and its adjacent word segmentation $j$.
7. A keyword extraction apparatus, the apparatus comprising:
the connection diagram construction module is used for acquiring a target text, segmenting the target text, counting the co-occurrence relation among each segmented word, and constructing a segmented word connection diagram of the target text by utilizing the co-occurrence relation, wherein the segmented word connection diagram comprises segmented word nodes and connection edges between any two connected segmented word nodes;
the node edge weight calculation module is used for sequentially calculating the similarity between every two connected word segmentation nodes, acquiring the position relation of the corresponding two word segmentation nodes in the target text, and calculating the connection edge weight between the corresponding two word segmentation nodes by utilizing the similarity and the position relation;
the node influence calculation module is used for updating the connection edge weight to a preset TextRank algorithm to obtain influence scores of each word segmentation node in the word segmentation connection diagram;
and the keyword screening module is used for sorting the segmented words of the target text according to the influence score of each segmented word node, and selecting segmented words with a preset ranking order as keywords of the target text.
8. The keyword extraction apparatus of claim 7, wherein the connection graph construction module counts co-occurrence relationships between each of the tokens by:
selecting one word from the target text one by one as a target word;
counting co-occurrence times of the target word and adjacent words of the target word in a preset neighborhood range of the target word;
and constructing a co-occurrence matrix by utilizing the co-occurrence times corresponding to each target word, and taking the co-occurrence matrix as a co-occurrence relation between the corresponding words.
9. An electronic device, the electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the keyword extraction method of any one of claims 1 to 6.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the keyword extraction method of any one of claims 1 to 6.
CN202310723836.3A 2023-06-16 2023-06-16 Keyword extraction method, keyword extraction device, electronic equipment and computer readable storage medium Pending CN116720516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310723836.3A CN116720516A (en) 2023-06-16 2023-06-16 Keyword extraction method, keyword extraction device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310723836.3A CN116720516A (en) 2023-06-16 2023-06-16 Keyword extraction method, keyword extraction device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116720516A true CN116720516A (en) 2023-09-08

Family

ID=87873069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310723836.3A Pending CN116720516A (en) 2023-06-16 2023-06-16 Keyword extraction method, keyword extraction device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116720516A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination