CN109902290B - Text information-based term extraction method, system and equipment - Google Patents
- Publication number: CN109902290B (application CN201910063975.1A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- node
- words
- term
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a text information-based term extraction method, which comprises the following steps: acquiring a text to be processed and preprocessing it; extracting words that meet the mutual information judgment index and the context-dependence judgment index from the text to be processed and recording them into a seed word set; constructing a seed word network based on the nodes of the seed word set and the edges of the nodes; defining the weights of the nodes and iterating them through a preset model until they converge; and sorting the weight values of the nodes, and extracting adjacent phrases as candidate terms when sequentially arranged seed words form such adjacent phrases. The invention also discloses a text information-based term extraction system and a text information-based term extraction device. The embodiment of the invention fully considers the Chinese grammar level, has the characteristics of automation and dynamic updating, and meets the requirement of high-speed extraction of terms from modern massive texts.
Description
Technical Field
The present invention relates to the field of language identification technologies, and in particular, to a method, a system, and an apparatus for extracting terms based on text information.
Background
Automatic term extraction has become a hotspot in the field of natural language processing research. The automatic term extraction method in the prior art specifically comprises the following steps: first, extracting the seed words of a text by using mutual information and context dependence; then, splicing words by a word-frequency method to form compound words of the key field; finally, quantitatively measuring the degree of association between terms by using field consistency, field relevance, and field membership. The seed word extraction method based on mutual information, context dependence, and information entropy takes the frequent words of a text as reference points and synthesizes text seed words by forward or backward splicing; the extracted terms have high completeness, but the method does not consider the Chinese grammar level and can produce a large number of non-domain compound words or terms. In addition, although the term extraction method based on field consistency, field relevance, and field membership can better extract the compound words and terms of a field, it is difficult to find an optimal threshold value for each index.
Disclosure of Invention
The embodiment of the invention aims to provide a text information-based term extraction method, system, and device, which fully consider the Chinese grammar level, have the characteristics of automation and dynamic updating, and meet the requirement of high-speed extraction of terms from modern massive texts.
In order to achieve the above object, an embodiment of the present invention provides a text information-based term extraction method, including:
acquiring a text to be processed, and preprocessing the text to be processed;
extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed, and recording the words into a seed word set;
constructing a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges;
sorting the weight values of the nodes, and extracting adjacent phrases as candidate terms when the seed words arranged in sequence form the adjacent phrases; wherein the adjacent phrases meet a preset term rule.
Compared with the prior art, the text information-based term extraction method disclosed by the invention works as follows: first, on the basis of preprocessing, the expected seed words are mined using the mutual information judgment index and the context-dependence judgment index and recorded into a seed word set; then, a seed word network is constructed based on the nodes of the seed word set and the edges of the nodes, and the weights of the nodes are iterated by the algorithm of a preset model until they converge; finally, the weight values of the nodes are sorted, and adjacent phrases are extracted as candidate terms when sequentially arranged seed words form such phrases. The method solves the prior-art problem that ignoring the Chinese grammar level extracts a large number of non-domain compound words or terms; it fully considers the Chinese grammar level, has the characteristics of automation and dynamic updating, and meets the requirement of high-speed extraction of terms from modern massive texts.
As an improvement of the above solution, after extracting the adjacent phrase as a candidate term, the method further includes:
calculating the support and the confidence of the candidate term in a database; the database comprises a plurality of words in a preset field;
and when the candidate term belongs to the preset domain, extracting the candidate term to form a term dictionary of the preset domain.
As an improvement of the above scheme, preprocessing the text to be processed specifically includes:
carrying out minimum unit division of words on the text to be processed by using the HanLP word segmentation system; the minimum unit is the smallest single word into which the text to be processed can be divided under the current word segmentation system.
As an improvement of the above scheme, the mutual information judgment index satisfies the following formula:
wherein the word string S = t₁t₂…tᵢ, and tᵢ is a word or a word combination segmented by the HanLP segmentation system; f(tᵢ) represents the frequency of occurrence of tᵢ; nᵢ is the number of times the word string S appears, and Nᵢ is the number of occurrences of all words in the database.
As an improvement of the above-described scheme, the context-dependent determination index satisfies the following formula:
H(W|tᵢ) = −∑_{w∈W} p(w|tᵢ)·log₂ p(w|tᵢ)   Formula (2);
wherein w denotes a specific word that appears again within a particular window given that tᵢ has occurred, with probability p(w|tᵢ); W denotes the set of all such specific words; the particular window is a window of specific length over the text to be processed, and the window of specific length contains a plurality of words.
As an improvement of the above solution, the defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges specifically includes:
defining the weight of the node by adopting semantic relevance; wherein the semantic relevance satisfies the following formula:
wherein wᵢⱼ is the semantic relatedness between the words tᵢ and tⱼ and represents the importance of the edge connecting the two nodes;
iterating the weight of the node through the Textrank model until the weight of the node converges; wherein the iterative process satisfies the following formula:
wherein WS(tᵢ) represents the importance of node tᵢ; d represents a damping coefficient, typically less than 1; tⱼ ∈ In(tᵢ) indicates that word tᵢ follows word tⱼ; tₖ ∈ Out(tⱼ) indicates that word tₖ follows word tⱼ; WS(tⱼ) represents the importance of node tⱼ; wⱼₖ is the semantic relatedness between the words tⱼ and tₖ.
As an improvement of the above scheme, the extracting the adjacent phrase as a candidate term specifically includes:
extracting the adjacent phrases as candidate terms by using a sliding window.
To achieve the above object, an embodiment of the present invention further provides a text information-based term extraction system, including:
the to-be-processed text preprocessing unit is used for acquiring the text to be processed and preprocessing it;
the seed word set recording unit is used for extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed and recording the words into a seed word set;
a seed word network construction unit, configured to construct a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
the convergence unit is used for defining the weight of the node and iterating the weight of the node through a preset model until the weight of the node converges;
the candidate term extraction unit is used for sequencing the weights of the nodes, and extracting adjacent phrases as candidate terms when the seed words which are sequentially arranged form the adjacent phrases; wherein the adjacent phrases meet a preset term rule.
Compared with the prior art, the text information-based term extraction system disclosed by the invention works as follows: first, on the basis of preprocessing by the to-be-processed text preprocessing unit, the seed word set recording unit mines the expected seed words using the mutual information judgment index and the context-dependence judgment index and records them into a seed word set; then, the seed word network construction unit constructs a seed word network based on the nodes of the seed word set and the edges of the nodes, and the convergence unit iterates the weights of the nodes by the algorithm of a preset model until they converge; finally, the candidate term extraction unit sorts the weights of the nodes and, when sequentially arranged seed words form adjacent phrases, extracts the adjacent phrases as candidate terms. The system fully considers the Chinese grammar level, has the characteristics of automation and dynamic updating, and meets the requirement of high-speed extraction of terms from modern massive texts.
As an improvement of the above solution, the system further comprises:
a support and confidence calculating unit for calculating the support and confidence of the candidate term in the database; the database comprises a plurality of words in a preset field;
and the term dictionary generating unit is used for extracting the candidate terms to form a term dictionary of the preset domain when the candidate terms belong to the preset domain.
To achieve the above object, an embodiment of the present invention further provides a text information based term extraction device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the text information based term extraction method according to any one of the embodiments described above when executing the computer program.
Drawings
Fig. 1 is a flowchart of a text information-based term extraction method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a seed word network in a term extraction method based on text information according to an embodiment of the present invention;
FIG. 3 is another flow chart of a text information based term extraction method provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a text-based term extraction system 10 according to an embodiment of the present invention;
fig. 5 is a block diagram of a text information based term extracting device 20 according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, fig. 1 is a flowchart of a text information-based term extraction method according to an embodiment of the present invention; comprising the following steps:
s1, acquiring a text to be processed, and preprocessing the text to be processed;
s2, extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed, and recording the words into a seed word set;
s3, constructing a seed word network based on the nodes of the seed word set and the edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
s4, defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges;
s5, sorting the weight values of the nodes, and extracting adjacent phrases as candidate terms when the seed words arranged in sequence form the adjacent phrases; wherein the adjacent phrases meet a preset term rule.
Specifically, in step S1, the text to be processed is unstructured text, and the unstructured text may be several words, several sentences, or an article.
Preferably, preprocessing the text to be processed specifically includes: carrying out minimum unit division of words on the text to be processed by using the HanLP word segmentation system; the minimum unit is the smallest single word into which the text to be processed can be divided under the current word segmentation system. The minimum units into which the same word is divided differ according to the dictionary. For example, "cloud computing" may be partitioned into "cloud/computing" using the jieba segmenter, but kept as the single unit "cloud computing" if another custom dictionary is used. The minimum unit is therefore the word that can be divided under the current tool.
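The minimum-unit idea above can be sketched with a toy forward-maximum-matching segmenter; this is not HanLP itself, and the dictionaries below are hypothetical stand-ins for the custom dictionaries the text mentions:

```python
# A minimal forward-maximum-matching segmenter, sketched only to illustrate
# "minimum unit division"; the patent itself uses the HanLP segmentation
# system, and the dictionaries here are hypothetical.
def segment(text, dictionary, max_len=4):
    """Split `text` into the smallest units the dictionary allows,
    preferring the longest dictionary match at each position."""
    units, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # Fall back to a single character when nothing matches.
            if text[i:j] in dictionary or j == i + 1:
                units.append(text[i:j])
                i = j
                break
    return units

# With "云计算" (cloud computing) in a custom dictionary the phrase stays
# whole; with only "计算" (computing) it splits, as the text describes.
whole = segment("云计算", {"云计算"})    # ["云计算"]
split = segment("云计算", {"计算"})      # ["云", "计算"]
```

The design point is the one the paragraph makes: the "minimum unit" is relative to whichever dictionary the current tool loads.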
Specifically, in step S2, the conventional mutual information calculation weakens the influence of the occurrence probability of a word within a word combination, so the probability influence coefficient of word occurrence needs to be considered when calculating the mutual information. The mutual information judgment index satisfies the following formula:
wherein the word string S = t₁t₂…tᵢ, and tᵢ is a word or a word combination segmented by the HanLP segmentation system; f(tᵢ) represents the frequency of occurrence of tᵢ; nᵢ is the number of times the word string S appears, and Nᵢ is the number of occurrences of all words in the database.
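Because the image of Formula (1) is not reproduced in the source, the sketch below shows only the standard point-wise mutual information of a word string computed from raw counts; the patent's own formula additionally applies a word-occurrence probability influence coefficient, so treat this as a baseline, not the claimed index:

```python
import math

def pmi(string_count, part_counts, total):
    """Point-wise mutual information of a word string S = t1...tk:
    log2( p(S) / prod_i p(t_i) ), probabilities taken from raw counts.
    Hedged sketch: the patent's Formula (1) adds a probability influence
    coefficient that is not reproduced here."""
    p_s = string_count / total
    p_parts = 1.0
    for count in part_counts:
        p_parts *= count / total
    return math.log2(p_s / p_parts)

# Toy (hypothetical) counts: the string appears 50 times, its two parts
# 80 and 120 times, in a database of 10,000 word occurrences.
score = pmi(50, [80, 120], 10_000)   # high score -> strongly associated
```

A word combination whose PMI exceeds the corpus-specific threshold would be admitted to the seed word set, mirroring the thresholding step described later.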
Context dependence refers to the conditional entropy of the context words within a particular window given that tᵢ has occurred; the context-dependence judgment index satisfies the following formula:
H(W|tᵢ) = −∑_{w∈W} p(w|tᵢ)·log₂ p(w|tᵢ)   Formula (2);
wherein w denotes a specific word that appears again within a particular window given that tᵢ has occurred, with probability p(w|tᵢ); W denotes the set of all such specific words; the particular window is a window of specific length over the text to be processed, and the window of specific length contains a plurality of words. The benefit of setting the particular window is that it largely precludes misjudging certain word combinations as terms.
For example, take tᵢ to be the word "a segment of"; p(w|tᵢ) is then the probability that the word "program" appears after "a segment of" has appeared, and w is the specific word "program" reappearing when "a segment of" appears within the particular window. Across the whole corpus, after "a segment of" appears, specific words such as "program", "road surface", "telephone", or "ribbon" may follow, and none of these specific words includes "a segment of" itself; all the specific words therefore form the set of words that appear given that a certain word has appeared, i.e., the set of specific conditional states.
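Formula (2) itself is straightforward to compute; the sketch below uses toy follower words modeled on the example above:

```python
import math
from collections import Counter

def context_entropy(followers):
    """Direct implementation of Formula (2):
    H(W|t_i) = -sum_{w in W} p(w|t_i) * log2 p(w|t_i),
    where `followers` lists every word observed after t_i in the window."""
    counts = Counter(followers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Varied followers give high entropy, so t_i binds to no single word;
# a follower that always repeats gives entropy 0, hinting at a fixed phrase.
varied = context_entropy(["program", "road surface", "telephone", "ribbon"])
fixed = context_entropy(["program", "program", "program"])
```

Low entropy (a nearly deterministic continuation) is evidence the window spans one multi-word unit, which is why the index pairs with mutual information in the seed-word test.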
Specifically, thresholds for mutual information and context dependence are set according to the corpus; if a word or word combination meets the thresholds, it is recorded into the seed word set.
Specifically, in step S3, referring to fig. 2, fig. 2 is a schematic diagram of a seed word network in a term extraction method based on text information according to an embodiment of the present invention; the nodes V of the seed word set and the edges E between the nodes form a seed word network G = (V, E), wherein a node is any seed word in the seed word set (such as "algorithm" in fig. 2), and the edges of a node connect it to the seed words adjacent to the current node (for example, the edges of "algorithm" in fig. 2 connect to "unsupervised", "neural network", and "intelligent"); each edge weight is 1 or any equal constant.
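A minimal sketch of the network construction under the stated assumptions (adjacent seed words share an edge of constant weight); the token stream below is a hypothetical stand-in echoing the Fig. 2 labels:

```python
from collections import defaultdict

def build_seed_network(tokens, seed_set):
    """Nodes are seed words; an edge links two seed words that appear
    adjacently in the token stream. Edge weights are an implicit constant 1,
    matching the text's G = (V, E) description."""
    edges = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a in seed_set and b in seed_set and a != b:
            edges[a].add(b)
            edges[b].add(a)
    return edges

# Hypothetical token stream in which "algorithm" neighbours the three
# seed words shown in Fig. 2.
tokens = ["algorithm", "unsupervised", "algorithm",
          "neural network", "algorithm", "intelligent"]
seeds = {"algorithm", "unsupervised", "neural network", "intelligent"}
net = build_seed_network(tokens, seeds)
# net["algorithm"] == {"unsupervised", "neural network", "intelligent"}
```

The adjacency sets are what the next step's iteration walks over; the constant edge weight is later replaced by the semantic relatedness wᵢⱼ.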
Specifically, in step S4: the mutual information and context-dependence indexes used in the above steps focus too heavily on statistics when measuring the features of words, and do not reflect the semantic features between words at the semantic level.
Aiming at this problem, the embodiment of the invention first defines the weight of a node by semantic relatedness. Node semantic relatedness refers to the probability of seed words co-occurring, which accords with the assumption of the embedding method, namely that related words have similar contexts; whether seed words belong to the same category is judged by quantitatively measuring the semantic hierarchical relations among them. Word vectors trained by the text-corpus-based embedding method carry semantic correlation, so on the basis of word2vec training preprocessing on each corpus, the similarity between vectors is used to reflect the feature of semantic relatedness. The semantic relatedness satisfies the following formula:
wherein wᵢⱼ is the semantic relatedness between the words tᵢ and tⱼ and represents the importance of the edge connecting the two nodes;
then iterating the weight of the node through the Textrank model until the weight of the node converges; wherein the iterative process satisfies the following formula:
wherein WS(tᵢ) represents the importance of node tᵢ; d represents a damping coefficient, typically less than 1; tⱼ ∈ In(tᵢ) indicates that word tᵢ follows word tⱼ; tₖ ∈ Out(tⱼ) indicates that word tₖ follows word tⱼ; WS(tⱼ) represents the importance of node tⱼ; wⱼₖ is the semantic relatedness between the words tⱼ and tₖ. The iteration continues according to the word-ordering rule of the corpus until the stopping condition is met.
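Because the image of Formula (4) is not reproduced in the source, the sketch below follows the standard weighted TextRank iteration (Mihalcea & Tarau, 2004), using the semantic relatedness as the edge weight; that the patent's formula matches this form is an assumption:

```python
def textrank(edges, w, d=0.85, tol=1e-8, max_iter=200):
    """Weighted TextRank: WS(v_i) = (1-d) + d * sum over neighbours v_j of
    w[j,i] / (sum_k w[j,k]) * WS(v_j). `edges[v]` is the neighbour set of v,
    `w[(u, v)]` the semantic relatedness of edge u-v."""
    ws = {v: 1.0 for v in edges}
    for _ in range(max_iter):
        new = {}
        for vi in edges:
            total = 0.0
            for vj in edges[vi]:
                out_sum = sum(w[(vj, vk)] for vk in edges[vj])
                if out_sum:
                    total += w[(vj, vi)] / out_sum * ws[vj]
            new[vi] = (1 - d) + d * total
        if max(abs(new[v] - ws[v]) for v in new) < tol:  # convergence check
            return new
        ws = new
    return ws

# Toy 3-node chain a - b - c with unit weights: the middle node, having
# more incoming support, converges to the highest importance.
edges = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
w = {(u, v): 1.0 for u in edges for v in edges[u]}
scores = textrank(edges, w)
```

With d = 0.85 the fixed point for the middle node is 0.405/0.2775 ≈ 1.459, which the iteration reaches well within the iteration cap.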
Specifically, in step S5, the weights of the nodes are ranked and the Top-N seed words are obtained; if adjacent phrases are formed among the Top-N seed words, the adjacent phrases are extracted as terms. The method reflects the semantic features among the words constituting a term at the semantic level, and can reduce the interference of irrelevant word combinations to a certain extent.
Preferably, the adjacent phrases are extracted as candidate terms by using a sliding window. For example, for the passage "set mutual information and context-dependent thresholds according to the corpus; if the word or word combination meets the above thresholds, it is included in the seed node set", suppose the Top-N seed words extracted are: "corpus", "set", "mutual information", "context", "dependency", "threshold", "word combination", "seed node", "collection". A window of length 6 slides over the passage from left to right; when the framed words are inside the Top-N seed word set, adjacent ones are joined and taken as candidate terms (e.g. "seed node set", "context dependency"), and otherwise the lone seed word itself is taken as a candidate term (e.g. "corpus", "mutual information").
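The sliding-window step can be sketched as follows; the tokens and the Top-N seed set are toy stand-ins for the passage in the example above:

```python
def extract_candidates(tokens, seeds, window=6):
    """Slide a fixed-length window over the tokens; runs of consecutive
    seed words inside a window join into one candidate phrase, and lone
    seed words remain single-word candidates."""
    candidates = set()
    last_start = max(1, len(tokens) - window + 1)
    for start in range(last_start):
        run = []
        for tok in tokens[start:start + window] + [None]:  # None flushes run
            if tok is not None and tok in seeds:
                run.append(tok)
            else:
                if run:
                    candidates.add(" ".join(run))
                run = []
    return candidates

# Hypothetical token stream and Top-N seed set.
tokens = ["the", "seed", "node", "collection", "of", "corpus"]
seeds = {"seed", "node", "collection", "corpus"}
cands = extract_candidates(tokens, seeds)
# cands == {"seed node collection", "corpus"}
```

Joining runs of adjacent seeds is what turns ranked single words back into multi-word candidate terms before the rule check.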
Preferably, the preset term rule is the Chinese term rule shown in Table 1, wherein the determiner phrase includes: adjectives, distinguishing words, verbs, nouns, and numeral + adjective combinations.
TABLE 1 Chinese term rules
Part-of-speech phrase | Template
baseNP | verb + noun
baseNP | baseNP + baseNP
baseNP | baseNP + noun
baseNP | determiner phrase + baseNP
baseNP | determiner phrase + noun
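A minimal sketch of checking a two-part candidate against the Table 1 templates; the POS labels ("verb", "noun", "det", "baseNP") are illustrative stand-ins for whatever tag set the segmenter's tagger actually produces:

```python
# The five baseNP templates of Table 1, expressed as POS-label pairs.
# Labels are hypothetical simplifications of the real tag set.
RULES = {
    ("verb", "noun"),
    ("baseNP", "baseNP"),
    ("baseNP", "noun"),
    ("det", "baseNP"),
    ("det", "noun"),
}

def is_base_np(pos_pair):
    """True when a two-part candidate's POS pair matches a Table 1 template."""
    return pos_pair in RULES

ok = is_base_np(("verb", "noun"))   # matches the "verb + noun" template
bad = is_base_np(("noun", "verb"))  # no template reverses that order
```

Candidates failing every template are the grammar-level rejects the method is designed to filter out.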
Further, after extracting the candidate term, the method further includes step S6: calculating the support and the confidence of the candidate term in a database; the database comprises a plurality of words in a preset field; and when the candidate term belongs to the preset domain, extracting the candidate term to form a term dictionary of the preset domain.
Wherein the support reveals the probability that terms mᵢ and mⱼ occur simultaneously, and its formula is:
Support(mᵢ→mⱼ) = P(mᵢ ∪ mⱼ)   Formula (5);
the confidence reveals whether, or with what probability, term mⱼ appears after term mᵢ has appeared, and its formula is:
Confidence(mᵢ→mⱼ) = P(mⱼ|mᵢ)   Formula (6);
The support and confidence of each candidate term in the specific field are calculated through Formula (5) and Formula (6) and compared with the set minimum support and minimum confidence; candidate terms whose support or confidence falls below the minimum are excluded, and the Chinese term dictionary of the specific field is finally formed.
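The Formula (5)/(6) filter can be sketched as below; the counts are hypothetical, and confidence follows the prose definition (of the records containing mᵢ, the share that also contain mⱼ):

```python
def support(n_both, n_total):
    """Support(m_i -> m_j): fraction of records containing both terms
    (association-rule support over the event record database)."""
    return n_both / n_total

def confidence(n_both, n_mi):
    """Confidence(m_i -> m_j) = P(m_j | m_i): of the records containing
    m_i, the share that also contain m_j."""
    return n_both / n_mi

def keep_term(n_both, n_mi, n_total, minsup, minconf):
    """Retain a candidate pair only if it clears both minimum thresholds
    (Minsup and Minconf in the text)."""
    return (support(n_both, n_total) >= minsup
            and confidence(n_both, n_mi) >= minconf)

# Hypothetical counts: 30 of 1,000 records contain both terms, 60 contain
# m_i -> support = 0.03, confidence = 0.5.
kept = keep_term(30, 60, 1_000, minsup=0.02, minconf=0.4)   # True
```

Pairs failing either threshold are the "non-field" candidates the text says are excluded before the dictionary is assembled.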
The acquisition of association rules mainly uses data mining to find, from a large database of event records, frequent patterns that satisfy a minimum support Minsup and a minimum confidence Minconf. After the candidate terms are found, the embodiment of the invention calculates the support and confidence of each candidate term in the preset field, compares them with the minimum support and minimum confidence of the preset field, excludes the large number of non-field candidate terms, and finally forms the Chinese dictionary of the preset field. The preset field may be any arbitrarily set specific field; terms in different fields have different confidences and supports, which the present invention does not specifically limit.
Further, the process of steps S1 to S6 may refer to fig. 3.
When the method is implemented, first, on the basis of preprocessing, the expected seed words are mined using the mutual information judgment index and the context-dependence judgment index and recorded into a seed word set; then, a seed word network is constructed based on the nodes of the seed word set and the edges of the nodes, and the weights of the nodes are iterated by the algorithm of a preset model until they converge; finally, the weight values of the nodes are sorted, and adjacent phrases are extracted as candidate terms when sequentially arranged seed words form the adjacent phrases.
Compared with the prior art, the text information-based term extraction method disclosed by the invention solves the problem that a large number of non-domain compound words or terms are extracted due to no consideration of Chinese grammar levels in the prior art, can fully consider the problem of Chinese grammar levels, has the characteristics of automation and dynamic update, and meets the requirement of high-speed extraction of modern massive text terms.
Example two
Referring to fig. 4, fig. 4 is a block diagram illustrating a text-based term extraction system 10 according to an embodiment of the present invention; comprising the following steps:
a to-be-processed text preprocessing unit 1, used for acquiring the text to be processed and preprocessing it;
a seed word set recording unit 2, configured to extract words satisfying the mutual information judgment index and the context dependent judgment index from the text to be processed, and record the words into a seed word set;
a seed word network construction unit 3, configured to construct a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
a convergence unit 4, configured to define a weight of the node, and iterate the weight of the node through a preset model until the weight of the node converges;
a candidate term extraction unit 5, configured to sort weights of the nodes, and extract adjacent phrases as candidate terms when the seed words arranged in sequence form the adjacent phrases; wherein the adjacent phrases meet a preset term rule.
Preferably, the text-based term extracting system 10 further includes:
a support and confidence calculation unit 6 for calculating the support and confidence of the candidate term in the database; the database comprises a plurality of words in a preset field;
a term dictionary generating unit 7, configured to extract the candidate terms to form a term dictionary of the preset domain when the candidate terms belong to the preset domain.
Preferably, the to-be-processed text preprocessing unit 1 performs minimum unit division of words on the text to be processed by using the HanLP word segmentation system; the minimum unit is the smallest single word into which the text to be processed can be divided under the current word segmentation system.
Preferably, the mutual information judgment index satisfies the following formula:
wherein the word string S = t₁t₂…tᵢ, and tᵢ is a word or a word combination segmented by the HanLP segmentation system; f(tᵢ) represents the frequency of occurrence of tᵢ; nᵢ is the number of times the word string S appears, and Nᵢ is the number of occurrences of all words in the database.
Preferably, the context dependent decision index satisfies the following formula:
H(W|tᵢ) = −∑_{w∈W} p(w|tᵢ)·log₂ p(w|tᵢ)   Formula (2);
wherein w denotes a specific word that appears again within a particular window given that tᵢ has occurred, with probability p(w|tᵢ); W denotes the set of all such specific words; the particular window is a window of specific length over the text to be processed, and the window of specific length contains a plurality of words.
Preferably, the convergence unit 4 defines the weight of the node by adopting semantic relevance; wherein the semantic relevance satisfies the following formula:
wherein wᵢⱼ is the semantic relatedness between the words tᵢ and tⱼ and represents the importance of the edge connecting the two nodes;
the convergence unit 4 iterates the weight of the node through the Textrank model until the weight of the node converges; wherein the iterative process satisfies the following formula:
wherein WS(tᵢ) represents the importance of node tᵢ; d represents a damping coefficient, typically less than 1; tⱼ ∈ In(tᵢ) indicates that word tᵢ follows word tⱼ; tₖ ∈ Out(tⱼ) indicates that word tₖ follows word tⱼ; WS(tⱼ) represents the importance of node tⱼ; wⱼₖ is the semantic relatedness between the words tⱼ and tₖ.
Preferably, the candidate term extraction unit 5 extracts the adjacent phrase as a candidate term using a sliding window.
The working process of each unit is referred to the working process of steps S1 to S6 in the above embodiment, and will not be described herein.
When the system is implemented, first, on the basis of preprocessing by the to-be-processed text preprocessing unit 1, the seed word set recording unit 2 mines the expected seed words using the mutual information judgment index and the context-dependence judgment index and records them into the seed word set; then, the seed word network construction unit 3 constructs a seed word network based on the nodes of the seed word set and the edges of the nodes, and the convergence unit 4 iterates the weights of the nodes by the algorithm of a preset model until they converge; finally, the candidate term extraction unit 5 sorts the weights of the nodes and, when sequentially arranged seed words form adjacent phrases, extracts the adjacent phrases as candidate terms.
Compared with the prior art, the text information-based term extraction system 10 disclosed by the invention solves the problem that a large number of non-domain compound words or terms are extracted due to the fact that the Chinese grammar level is not considered in the prior art, and the text information-based term extraction system 10 disclosed by the invention can fully consider the problem of the Chinese grammar level, has the characteristics of automation and dynamic update, and meets the requirement of high-speed extraction of modern massive text terms.
Example III
Referring to fig. 5, fig. 5 is a schematic structural diagram of a text information based term extracting apparatus 20 according to an embodiment of the present invention. The text information based term extracting device 20 of this embodiment includes: a processor 21, a memory 22 and a computer program stored in said memory 22 and executable on said processor 21. The processor 21, when executing the computer program, implements the steps of the above-described respective text information based term extraction method embodiments, such as steps S1 to S5 shown in fig. 1. Alternatively, the processor 21 may implement the functions of the modules/units in the above-described device embodiments when executing the computer program.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory 22 and executed by the processor 21 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used to describe the execution of the computer program in the text information based term extracting device 20. For example, the computer program may be divided into a to-be-processed text preprocessing unit 1, a seed word set recording unit 2, a seed word network construction unit 3, a convergence unit 4, a candidate term extraction unit 5, a support and confidence calculation unit 6, and a term dictionary generating unit 7; for the specific functions of each module unit, refer to the specific functions of each unit in the text information based term extraction system 10 described in the second embodiment above, which are not repeated herein.
The text information-based term extraction device 20 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The text information-based term extraction device 20 may include, but is not limited to, the processor 21 and the memory 22. It will be appreciated by those skilled in the art that the schematic diagram is merely an example of the text information-based term extraction device 20 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the text information-based term extraction device 20 may also include input and output devices, network access devices, buses, and the like.
The processor 21 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 21 is the control center of the text information-based term extraction device 20, connecting the various parts of the entire device using various interfaces and lines.
The memory 22 may be used to store the computer program and/or modules, and the processor 21 implements the various functions of the text information-based term extraction device 20 by running the computer program and/or modules stored in the memory 22 and invoking data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function or an image playing function); the data storage area may store data created according to the use of the device (such as audio data or a phone book). In addition, the memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The modules/units integrated in the text information-based term extraction device 20 may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as a separate product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiments by instructing related hardware through a computer program, which may be stored in a computer-readable storage medium; when executed by the processor 21, the computer program implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in each jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the invention, the connection relation between the modules indicates that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the invention without undue effort.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention; such changes and modifications are also intended to fall within the scope of the invention.
Claims (5)
1. A text information-based term extraction method, comprising:
acquiring a text to be processed, and preprocessing the text to be processed;
extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed, and recording the words into a seed word set;
constructing a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges;
sorting the weight values of the nodes, and extracting adjacent phrases as candidate terms when the seed words arranged in sequence form the adjacent phrases; wherein, the adjacent phrase meets the preset term rule;
calculating the support and the confidence of the candidate term in a database; wherein the database comprises a plurality of words in a preset field; the support reveals the probability that the terms m_i and m_j occur simultaneously, expressed as Support(m_i → m_j) = P(m_i ∪ m_j); the confidence reveals whether, and how likely it is that, the term m_j will appear after the term m_i appears, with the formula Confidence(m_i → m_j) = P(m_i | m_j);
Extracting the candidate terms to form a term dictionary of a preset domain when the candidate terms belong to the preset domain;
the preprocessing of the text to be processed specifically comprises the following steps:
performing minimum-unit word segmentation on the text to be processed by using the hanlp word segmentation system; wherein the minimum unit is the smallest single word into which the text to be processed can be divided under the current word segmentation system;
the mutual information judgment index satisfies the following formula:
wherein the word string S = t_1 t_2 … t_i, and t_i is a word or word combination segmented by the hanlp word segmentation system; f(t_i) denotes the frequency of occurrence of t_i; n_i is the number of occurrences of the word string S, and N_i is the number of occurrences of all words in the database;
the context dependent decision index satisfies the following formula:
H(W|t_i) = -∑_{w∈W} p(w|t_i) · log₂ p(w|t_i)   formula (2);
wherein w denotes a specific word that reappears within a particular window when t_i occurs, and p(w|t_i) is the probability of that word appearing; W denotes the set of all such specific words; the particular window is a window of a specific length over the text to be processed, and the window contains a plurality of words.
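The context-dependence index of formula (2) can be sketched in code. The following is a minimal illustration, not the patent's implementation: the function name `context_entropy`, the toy token list, and the window size of one are all hypothetical choices, and a real system would operate on the output of the hanlp word segmentation system.

```python
from collections import Counter
from math import log2

def context_entropy(tokens, target, window=1):
    # H(W | t_i): entropy of the specific words that reappear within a
    # fixed-size window after each occurrence of `target` (formula (2)).
    # Higher entropy means `target` combines freely with its context.
    followers = Counter()
    for idx, tok in enumerate(tokens):
        if tok == target:
            followers.update(tokens[idx + 1: idx + 1 + window])
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in followers.values())

# Toy example: "term" is followed by "extraction" twice and "frequency" once.
tokens = ["term", "extraction", "uses", "term", "frequency", "and", "term", "extraction"]
h = context_entropy(tokens, "term")
```

For this toy sequence the entropy is −(2/3)·log₂(2/3) − (1/3)·log₂(1/3) ≈ 0.918 bits.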
2. The text information based term extraction method of claim 1, wherein the defining the weight of the node and iterating the weight of the node through a preset model until the weight of the node converges specifically includes:
defining the weight of the node by adopting semantic relevance; wherein the semantic relevance satisfies the following formula:
wherein w_ij is the semantic relatedness between the words t_i and t_j, representing the importance of the edge connecting the two nodes;
iterating the weight of the node through the TextRank model until the weight of the node converges; wherein the iterative process satisfies the following formula:
WS(t_i) = (1 − d) + d · ∑_{t_j∈In(t_i)} ( w_ji / ∑_{t_k∈Out(t_j)} w_jk ) · WS(t_j);
wherein WS(t_i) denotes the importance of node t_i; d denotes a damping coefficient, typically less than 1; t_j∈In(t_i) means that the word t_i follows the word t_j; t_k∈Out(t_j) means that the word t_k follows the word t_j; WS(t_j) denotes the importance of node t_j; and w_jk is the semantic relatedness between the words t_j and t_k.
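The iteration of claim 2 can be sketched as follows, under the assumption that the update is the standard weighted PageRank-style rule over the symbols defined above (WS, d, In, Out, w_jk). The function name `textrank` and the edge-weight dictionary format are illustrative, not from the patent.

```python
def textrank(nodes, weight, d=0.85, tol=1e-6, max_iter=100):
    # Iterate WS(t_i) = (1 - d) + d * sum_{t_j in In(t_i)}
    #   [ w_ji / sum_{t_k in Out(t_j)} w_jk ] * WS(t_j)
    # until the largest per-node change falls below `tol`.
    # `weight[(a, b)]` is the semantic relatedness of the edge a -> b.
    ws = {n: 1.0 for n in nodes}
    out_sum = {n: sum(w for (a, _), w in weight.items() if a == n) for n in nodes}
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            incoming = sum(ws[j] * w / out_sum[j]
                           for (j, k), w in weight.items()
                           if k == i and out_sum[j] > 0)
            new[i] = (1 - d) + d * incoming
        converged = max(abs(new[n] - ws[n]) for n in nodes) < tol
        ws = new
        if converged:
            break
    return ws

# Two seed words linked in both directions converge to equal importance.
scores = textrank(["a", "b"], {("a", "b"): 1.0, ("b", "a"): 1.0})
```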
3. The text information based term extraction method of claim 1, wherein the extracting the adjacent phrase as a candidate term specifically includes:
and extracting, by using a sliding window, the adjacent phrases as candidate terms.
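A sliding-window pass of the kind claim 3 describes might look as follows; this is a hedged sketch with a hypothetical name (`extract_candidates`), it joins words with a space for readability, and it omits the preset term rule that the patent additionally applies to each adjacent phrase.

```python
def extract_candidates(tokens, seeds, window=2):
    # Slide a fixed-size window over the word sequence and keep every span
    # whose words are all seed words as an adjacent-phrase candidate term.
    candidates = []
    for i in range(len(tokens) - window + 1):
        span = tokens[i:i + window]
        if all(t in seeds for t in span):
            candidates.append(" ".join(span))
    return candidates

tokens = ["deep", "learning", "improves", "term", "extraction"]
seeds = {"deep", "learning", "term", "extraction"}
cands = extract_candidates(tokens, seeds)  # ["deep learning", "term extraction"]
```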
4. A text information-based term extraction system, comprising:
a text preprocessing unit, configured to acquire the text to be processed and preprocess the text to be processed;
the seed word set recording unit is used for extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed and recording the words into a seed word set;
a seed word network construction unit, configured to construct a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
the convergence unit is used for defining the weight of the node and iterating the weight of the node through a preset model until the weight of the node converges;
the candidate term extraction unit is used for sequencing the weights of the nodes, and extracting adjacent phrases as candidate terms when the seed words which are sequentially arranged form the adjacent phrases; wherein, the adjacent phrase meets the preset term rule;
a support and confidence calculating unit, configured to calculate the support and the confidence of the candidate term in a database; wherein the database comprises a plurality of words in a preset field; the support reveals the probability that the terms m_i and m_j occur simultaneously, expressed as Support(m_i → m_j) = P(m_i ∪ m_j); the confidence reveals whether, and how likely it is that, the term m_j will appear after the term m_i appears, with the formula Confidence(m_i → m_j) = P(m_i | m_j);
a term dictionary generating unit, configured to extract the candidate terms to form a term dictionary of a preset domain when the candidate terms belong to the preset domain;
the text preprocessing unit is specifically configured to:
perform minimum-unit word segmentation on the text to be processed by using the hanlp word segmentation system; wherein the minimum unit is the smallest single word into which the text to be processed can be divided under the current word segmentation system;
the mutual information judgment index satisfies the following formula:
wherein the word string S = t_1 t_2 … t_i, and t_i is a word or word combination segmented by the hanlp word segmentation system; f(t_i) denotes the frequency of occurrence of t_i; n_i is the number of occurrences of the word string S, and N_i is the number of occurrences of all words in the database;
the context dependent decision index satisfies the following formula:
H(W|t_i) = -∑_{w∈W} p(w|t_i) · log₂ p(w|t_i)   formula (2);
wherein w denotes a specific word that reappears within a particular window when t_i occurs, and p(w|t_i) is the probability of that word appearing; W denotes the set of all such specific words; the particular window is a window of a specific length over the text to be processed, and the window contains a plurality of words.
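The support and confidence of claim 4 can be sketched over a database represented as word sets. The names below are hypothetical, and the confidence is computed as P(m_i | m_j) exactly as the claim writes it (association-rule literature more commonly uses P(m_j | m_i)).

```python
def support_confidence(database, mi, mj):
    # Support(mi -> mj): fraction of records where mi and mj occur together.
    # Confidence(mi -> mj) = P(mi | mj), as written in the claims: among the
    # records containing mj, the fraction that also contain mi.
    n = len(database)
    both = sum(1 for record in database if mi in record and mj in record)
    with_mj = sum(1 for record in database if mj in record)
    support = both / n if n else 0.0
    confidence = both / with_mj if with_mj else 0.0
    return support, confidence

database = [{"neural", "network"}, {"neural"}, {"network"}, {"neural", "network"}]
s, c = support_confidence(database, "neural", "network")  # (0.5, 2/3)
```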
5. A text information based term extracting device, characterized by comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the text information based term extracting method according to any one of claims 1 to 3 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910063975.1A CN109902290B (en) | 2019-01-23 | 2019-01-23 | Text information-based term extraction method, system and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910063975.1A CN109902290B (en) | 2019-01-23 | 2019-01-23 | Text information-based term extraction method, system and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109902290A CN109902290A (en) | 2019-06-18 |
CN109902290B true CN109902290B (en) | 2023-06-30 |
Family
ID=66944048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910063975.1A Active CN109902290B (en) | 2019-01-23 | 2019-01-23 | Text information-based term extraction method, system and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109902290B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230118640A1 (en) * | 2020-03-25 | 2023-04-20 | Metis Ip (Suzhou) Llc | Methods and systems for extracting self-created terms in professional area |
CN111680128A (en) * | 2020-06-16 | 2020-09-18 | 杭州安恒信息技术股份有限公司 | Method and system for detecting web page sensitive words and related devices |
CN112966508B (en) * | 2021-04-05 | 2023-08-25 | 集智学园(北京)科技有限公司 | Universal automatic term extraction method |
CN115130472B (en) * | 2022-08-31 | 2023-02-21 | 北京澜舟科技有限公司 | Method, system and readable storage medium for segmenting subwords based on BPE |
CN116756298B (en) * | 2023-08-18 | 2023-10-20 | 太仓市律点信息技术有限公司 | Cloud database-oriented AI session information optimization method and big data optimization server |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102360383A (en) * | 2011-10-15 | 2012-02-22 | 西安交通大学 | Method for extracting text-oriented field term and term relationship |
CN107329950A (en) * | 2017-06-13 | 2017-11-07 | 武汉工程大学 | It is a kind of based on the Chinese address segmenting method without dictionary |
CN108287825A (en) * | 2018-01-05 | 2018-07-17 | 中译语通科技股份有限公司 | A kind of term identification abstracting method and system |
CN108549626A (en) * | 2018-03-02 | 2018-09-18 | 广东技术师范学院 | A kind of keyword extracting method for admiring class |
- 2019-01-23: CN application CN201910063975.1A filed; granted as patent CN109902290B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102360383A (en) * | 2011-10-15 | 2012-02-22 | 西安交通大学 | Method for extracting text-oriented field term and term relationship |
CN107329950A (en) * | 2017-06-13 | 2017-11-07 | 武汉工程大学 | It is a kind of based on the Chinese address segmenting method without dictionary |
CN108287825A (en) * | 2018-01-05 | 2018-07-17 | 中译语通科技股份有限公司 | A kind of term identification abstracting method and system |
CN108549626A (en) * | 2018-03-02 | 2018-09-18 | 广东技术师范学院 | A kind of keyword extracting method for admiring class |
Non-Patent Citations (2)
Title |
---|
Keyword extraction method based on context relations and the TextRank algorithm; Du Haizhou et al.; Journal of Shanghai University of Electric Power; 2017-12-30; pp. 607-612 *
Research on ontology concept extraction based on association rules and semantic rules; He Haitao et al.; Journal of Jilin University (Information Science Edition); 2014-11-30; pp. 657-663 *
Also Published As
Publication number | Publication date |
---|---|
CN109902290A (en) | 2019-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109902290B (en) | Text information-based term extraction method, system and equipment | |
US11301637B2 (en) | Methods, devices, and systems for constructing intelligent knowledge base | |
CN109086265B (en) | Semantic training method and multi-semantic word disambiguation method in short text | |
CN110347790B (en) | Text duplicate checking method, device and equipment based on attention mechanism and storage medium | |
CN111177375B (en) | Electronic document classification method and device | |
CN113590811B (en) | Text abstract generation method and device, electronic equipment and storage medium | |
CN112434134B (en) | Search model training method, device, terminal equipment and storage medium | |
CN110705247A (en) | Text similarity calculation method based on χ²-C | |
CN115062621A (en) | Label extraction method and device, electronic equipment and storage medium | |
CN111046662B (en) | Training method, device and system of word segmentation model and storage medium | |
CN112836491B (en) | NLP-oriented Mashup service spectrum clustering method based on GSDPMM and topic model | |
CN109858035A (en) | A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing | |
CN112948561A (en) | Method and device for automatically expanding question-answer knowledge base | |
CN116502637A (en) | Text keyword extraction method combining context semantics | |
CN117057346A (en) | Domain keyword extraction method based on weighted textRank and K-means | |
CN113157857B (en) | Hot topic detection method, device and equipment for news | |
KR20050033852A (en) | Apparatus, method, and program for text classification using frozen pattern | |
CN111813934B (en) | Multi-source text topic model clustering method based on DMA model and feature division | |
CN117573956B (en) | Metadata management method, device, equipment and storage medium | |
CN117725555B (en) | Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium | |
CN111159393B (en) | Text generation method for abstract extraction based on LDA and D2V | |
CN111125350B (en) | Method and device for generating LDA topic model based on bilingual parallel corpus | |
Liang et al. | Learning mention and relation representation with convolutional neural networks for relation extraction | |
CN117726423A (en) | Identification method and device of client loan information, storage medium and electronic equipment | |
Mimouni et al. | Comparing Performance of Text Pre-processing Methods for Predicting a Binary Position by LASSO |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||