CN109902290A - Term extraction method, system and device based on text information - Google Patents
Term extraction method, system and device based on text information
- Publication number: CN109902290A (application CN201910063975.1A)
- Authority: CN (China)
- Prior art keywords: node, words, text, word, weight
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
- Landscapes: Machine Translation (AREA)
Abstract
The invention discloses a term extraction method based on text information, comprising: obtaining a text to be processed and preprocessing it; extracting from the text the words that satisfy a mutual-information criterion and a context-dependency criterion and adding them to a seed word set; constructing a seed word network from the nodes of the seed word set and the edges of those nodes; defining a weight for each node and iterating the weights with a preset model until they converge; ranking the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracting that phrase as a candidate term. The invention further discloses a term extraction system and a term extraction device based on text information. Embodiments of the invention take the grammatical structure of Chinese fully into account, support automatic and dynamic updating, and meet the demand for high-speed term extraction from modern large-scale text.
Description
Technical field
The present invention relates to the technical field of language recognition, and in particular to a term extraction method, system and device based on text information.
Background art
Automatic term extraction has become a research hotspot in the field of natural language processing. Existing automatic term extraction methods proceed as follows: first, seed words are extracted from the text using mutual information and context dependency; next, word-frequency statistics are combined to splice words into domain compound words; finally, the association between terms is measured quantitatively using domain consistency, domain relevance and domain membership. Seed-word extraction methods based on mutual information, context dependency and information entropy start from the frequent words of a text and synthesize seed words by concatenating forward or backward; the terms they extract have high completeness, but because these methods ignore the grammatical structure of Chinese, they produce many non-domain compound words and terms. Methods based on domain consistency, domain relevance and domain membership extract domain compound words and terms more accurately, but an optimal threshold for each index is difficult to find.
Summary of the invention
Embodiments of the present invention aim to provide a term extraction method, system and device based on text information that take the grammatical structure of Chinese fully into account, support automatic and dynamic updating, and meet the demand for high-speed term extraction from modern large-scale text.
To achieve the above object, an embodiment of the present invention provides a term extraction method based on text information, comprising:
obtaining a text to be processed and preprocessing it;
extracting from the text to be processed the words that satisfy the mutual-information criterion and the context-dependency criterion, and adding them to a seed word set;
constructing a seed word network from the nodes of the seed word set and the edges of those nodes; wherein a node is any seed word in the seed word set, and the edges of a node connect it to its adjacent seed words;
defining the weight of each node, and iterating the node weights with a preset model until the weights converge;
ranking the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracting that phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
Compared with the prior art, the term extraction method based on text information disclosed by the invention proceeds as follows: first, on the basis of the preprocessing, the mutual-information criterion and the context-dependency criterion are used to mine the expected seed words and add them to the seed word set; then, the seed word network is constructed from the nodes of the seed word set and their edges, and the node weights are iterated with the preset model until they converge; finally, the node weights are sorted, and when consecutively ranked seed words form an adjacent phrase, that phrase is extracted as a candidate term. This solves the problem that the prior art ignores the grammatical structure of Chinese and therefore extracts many non-domain compound words and terms; the disclosed method takes the grammatical structure of Chinese fully into account, supports automatic and dynamic updating, and meets the demand for high-speed term extraction from modern large-scale text.
As an improvement of the above scheme, after extracting the adjacent phrase as a candidate term, the method further comprises:
calculating the support and confidence of the candidate term in a database, wherein the database comprises several words of a preset domain;
when the candidate term belongs to the preset domain, adding the candidate term to the term dictionary of the preset domain.
As an improvement of the above scheme, preprocessing the text to be processed specifically comprises:
segmenting the text to be processed into minimum word units using the hanlp segmentation system, wherein a minimum unit is the smallest single word into which the text can be divided under the current segmentation system.
As an improvement of the above scheme, the mutual-information criterion satisfies formula (1), in which the word string S = t1t2…ti, ti is a word or word combination produced by the hanlp segmentation system, f(ti) denotes the frequency of ti, ni is the number of occurrences of the word string S, and Ni is the total number of occurrences of all words in the database.
As an improvement of the above scheme, the context-dependency criterion satisfies the following formula:
H(W|ti) = −Σw∈W p(w|ti) · log2 p(w|ti)   formula (2);
where w denotes a particular word that appears again within a certain window after ti has appeared, p(w|ti) is its conditional probability, and W is the set of all such particular words; the window is a span of specified length set over the text to be processed, containing several words.
As an improvement of the above scheme, defining the weight of each node and iterating the node weights with the preset model until they converge specifically comprises:
defining the weight of each node by semantic relevance, which satisfies formula (3), where wij is the semantic relevance between words ti and tj and indicates the importance of the edge connecting the two nodes;
iterating the node weights with the TextRank model until they converge, the iteration satisfying the following formula:
WS(ti) = (1 − d) + d · Σ_{tj∈In(ti)} [ wji / Σ_{tk∈Out(tj)} wjk ] · WS(tj)   formula (4);
where WS(ti) denotes the importance of node ti; d is a damping coefficient, usually less than 1; tj∈In(ti) denotes the words tj that word ti immediately follows; tk∈Out(tj) denotes the words tk that immediately follow word tj; WS(tj) denotes the importance of node tj; and wjk is the semantic relevance between words tj and tk.
As an improvement of the above scheme, extracting the adjacent phrase as a candidate term specifically comprises:
extracting the adjacent phrase as a candidate term using a sliding window.
To achieve the above object, an embodiment of the present invention also provides a term extraction system based on text information, comprising:
a text preprocessing unit, configured to obtain a text to be processed and preprocess it;
a seed word set recording unit, configured to extract from the text to be processed the words that satisfy the mutual-information criterion and the context-dependency criterion and add them to a seed word set;
a seed word network construction unit, configured to construct a seed word network from the nodes of the seed word set and the edges of those nodes; wherein a node is any seed word in the seed word set, and the edges of a node connect it to its adjacent seed words;
a convergence unit, configured to define the weight of each node and iterate the node weights with a preset model until the weights converge;
a candidate term extraction unit, configured to sort the node weights and, when consecutively ranked seed words form an adjacent phrase, extract that phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
Compared with the prior art, the term extraction system based on text information disclosed by the invention works as follows: first, on the basis of the preprocessing performed by the text preprocessing unit, the seed word set recording unit uses the mutual-information criterion and the context-dependency criterion to mine the expected seed words and add them to the seed word set; then, the seed word network construction unit constructs the seed word network from the nodes of the seed word set and their edges, and the convergence unit iterates the node weights with the preset model until they converge; finally, the candidate term extraction unit sorts the node weights and, when consecutively ranked seed words form an adjacent phrase, extracts that phrase as a candidate term. This solves the problem that the prior art ignores the grammatical structure of Chinese and therefore extracts many non-domain compound words and terms; the disclosed system takes the grammatical structure of Chinese fully into account, supports automatic and dynamic updating, and meets the demand for high-speed term extraction from modern large-scale text.
As an improvement of the above scheme, the system further comprises:
a support and confidence computation unit, configured to calculate the support and confidence of the candidate terms in a database, wherein the database comprises several words of a preset domain;
a term dictionary creation unit, configured to, when a candidate term belongs to the preset domain, add the candidate term to the term dictionary of the preset domain.
To achieve the above object, an embodiment of the present invention also provides a term extraction device based on text information, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the term extraction method based on text information as described in any of the above embodiments.
Brief description of the drawings
Fig. 1 is a flowchart of a term extraction method based on text information provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the seed word network in a term extraction method based on text information provided by an embodiment of the present invention;
Fig. 3 is another flowchart of a term extraction method based on text information provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of a term extraction system 10 based on text information provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of a term extraction device 20 based on text information provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 1, Fig. 1 is a flowchart of a term extraction method based on text information provided by an embodiment of the present invention, comprising:
S1: obtaining a text to be processed and preprocessing it;
S2: extracting from the text to be processed the words that satisfy the mutual-information criterion and the context-dependency criterion, and adding them to a seed word set;
S3: constructing a seed word network from the nodes of the seed word set and the edges of those nodes; wherein a node is any seed word in the seed word set, and the edges of a node connect it to its adjacent seed words;
S4: defining the weight of each node, and iterating the node weights with a preset model until the weights converge;
S5: ranking the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracting that phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
Specifically, in step S1, the text to be processed is unstructured text; it may be several paragraphs, several sentences or an entire article.
Preferably, preprocessing the text to be processed specifically comprises: segmenting the text into minimum word units using the hanlp segmentation system, wherein a minimum unit is the smallest single word into which the text can be divided under the current segmentation system. Different dictionaries yield different minimum units for the same word. For example, "cloud computing" may be split into "cloud / computing" by the jieba segmenter, while another custom dictionary may keep it as the single unit "cloud computing". A minimum unit is simply a word that can no longer be divided under the current tool.
Specifically, in step S2, the traditional mutual-information calculation underestimates the probability that a word re-occurs as part of a word combination; therefore, when computing mutual information, an impact coefficient for the occurrence of the word must be taken into account. The mutual-information criterion satisfies formula (1), in which the word string S = t1t2…ti, ti is a word or word combination produced by the hanlp segmentation system, f(ti) denotes the frequency of ti, ni is the number of occurrences of the word string S, and Ni is the total number of occurrences of all words in the database.
Context dependency refers to the conditional entropy of the context words within a certain window, given that ti has already appeared. The context-dependency criterion satisfies the following formula:
H(W|ti) = −Σw∈W p(w|ti) · log2 p(w|ti)   formula (2);
where w denotes a particular word that appears again within the window after ti has appeared, p(w|ti) is its conditional probability, and W is the set of all such particular words; the window is a span of specified length set over the text to be processed and contains several words. The advantage of setting this window is that it largely prevents certain specific word combinations from being misjudged as terms.
For example, if the text in the window is "a segment of program", then ti is "a segment", and p(w|ti) is the probability that, after the word "a segment" appears, the word w = "program" appears within the window. Over the whole corpus, "a segment" may be followed by particular words such as "program", "road", "speech" or "ribbon"; the set of all these particular words (which does not include "a segment" itself) is W, i.e. the set of words that may appear after a given word under that specific condition.
Specifically, thresholds for mutual information and context dependency are set according to the corpus; a word or word combination that satisfies both thresholds is added to the seed word set.
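The context-dependency score of formula (2) can be sketched as a plain conditional-entropy computation over windowed follower counts. This is an illustrative sketch, not the patent's implementation: the tokenizer, corpus and window representation here are assumptions.

```python
import math
from collections import Counter

def context_entropy(tokens, target, window=3):
    """H(W | target): uncertainty over which words follow `target`
    within `window` tokens -- formula (2). Higher entropy means the
    word's right context is more varied, i.e. more context-dependent."""
    followers = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            followers.update(tokens[i + 1:i + 1 + window])
    total = sum(followers.values())
    if total == 0:
        return 0.0
    # H(W|t_i) = -sum_w p(w|t_i) * log2 p(w|t_i)
    return -sum((c / total) * math.log2(c / total)
                for c in followers.values())

# Toy corpus mirroring the "a segment of program / road / speech" example:
tokens = ["一段", "程序", "一段", "路面", "一段", "话"]
h = context_entropy(tokens, "一段", window=1)  # three equiprobable followers
```

With three equiprobable followers the entropy is log2(3) ≈ 1.585 bits; a word whose follower distribution is this flat passes a context-dependency threshold more easily than one always followed by the same word.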
Specifically, in step S3, referring to Fig. 2, Fig. 2 is a schematic diagram of the seed word network in a term extraction method based on text information provided by an embodiment of the present invention. The seed word network G = (V, E) is formed by the nodes V of the seed word set and the edges E between nodes, wherein a node is any seed word in the seed word set (such as "algorithm" in Fig. 2), and the edges of a node connect it to its adjacent seed words (the edges of "algorithm" in Fig. 2 include "unsupervised", "neural network" and "intelligence"); initially, every edge carries the same constant weight, e.g. 1.
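The construction of G = (V, E) described above can be sketched as a pass over the token sequence that links seed words occurring adjacently. The seed set, tokens and adjacency rule below are illustrative assumptions consistent with the description.

```python
from collections import defaultdict

def build_seed_network(tokens, seed_words):
    """Build graph G = (V, E): V is the seed word set; an (undirected)
    edge links two seed words that appear adjacently in the text.
    Edge weights are implicitly the constant 1 at this stage."""
    edges = defaultdict(set)
    prev = None
    for tok in tokens:
        if tok in seed_words:
            if prev is not None and prev != tok:
                edges[prev].add(tok)
                edges[tok].add(prev)
            prev = tok
        else:
            prev = None  # a non-seed word breaks the adjacency chain
    return edges

seeds = {"neural", "network", "algorithm"}
toks = ["the", "neural", "network", "algorithm", "runs"]
g = build_seed_network(toks, seeds)
```

In this toy run, "network" ends up connected to both "neural" and "algorithm", matching the Fig. 2 picture of a hub seed word with several adjacent seed words.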
Specifically, in step S4, the mutual-information and context-dependency indices of the preceding steps measure word features mainly statistically and do not reflect the semantic relationships between words.
To address this, the embodiment of the present invention first defines the weight of each node by semantic relevance. The semantic relevance of nodes measures the probability that seed words co-occur, which matches the assumption behind embedding methods: words with similar contexts are semantically close. By quantitatively measuring the semantic relationship between seed words, it can be judged whether two seed words belong to the same category. Since word vectors trained by embedding methods carry semantic relatedness, the embodiment of the present invention trains word2vec on each preprocessed corpus and uses the similarity between the resulting vectors to reflect semantic relevance. The semantic relevance satisfies formula (3), where wij is the semantic relevance between words ti and tj and indicates the importance of the edge connecting the two nodes.
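The text does not reproduce formula (3), but given the description (similarity between word2vec vectors), cosine similarity is one common choice for w_ij; the sketch below assumes that choice and plain list vectors rather than a trained word2vec model.

```python
import math

def cosine_relevance(vec_i, vec_j):
    """One plausible w_ij: cosine similarity between the word2vec
    vectors of t_i and t_j. The patent's exact formula (3) is not
    reproduced in the text; this is an assumed stand-in."""
    dot = sum(a * b for a, b in zip(vec_i, vec_j))
    norm_i = math.sqrt(sum(a * a for a in vec_i))
    norm_j = math.sqrt(sum(b * b for b in vec_j))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0

same = cosine_relevance([1.0, 0.0], [2.0, 0.0])        # parallel vectors
orthogonal = cosine_relevance([1.0, 0.0], [0.0, 1.0])  # unrelated words
```

Parallel vectors score 1.0 and orthogonal ones 0.0, so w_ij directly grades how strongly an edge should pull weight between two seed words in the iteration that follows.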
Then, the node weights are iterated with the TextRank model until they converge; the iteration satisfies the following formula:
WS(ti) = (1 − d) + d · Σ_{tj∈In(ti)} [ wji / Σ_{tk∈Out(tj)} wjk ] · WS(tj)   formula (4);
where WS(ti) denotes the importance of node ti; d is a damping coefficient, usually less than 1; tj∈In(ti) denotes the words tj that word ti immediately follows; tk∈Out(tj) denotes the words tk that immediately follow word tj; WS(tj) denotes the importance of node tj; and wjk is the semantic relevance between words tj and tk. The iteration proceeds over the word order of the corpus until the stopping condition is satisfied.
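The iteration of formula (4) can be sketched as follows. The undirected adjacency representation, damping value and tolerance are illustrative assumptions; on an undirected seed network, In(ti) and Out(ti) coincide with the neighbour set.

```python
def textrank(edges, w, d=0.85, tol=1e-6, max_iter=200):
    """Weighted TextRank iteration (formula (4)).
    edges[n]  : neighbours of node n (undirected graph, so In == Out)
    w[(a, b)] : semantic relevance w_ab of the edge a-b."""
    ws = {n: 1.0 for n in edges}          # initial importance WS(ti)
    for _ in range(max_iter):
        new = {}
        for ti in edges:
            s = 0.0
            for tj in edges[ti]:          # tj in In(ti)
                out_sum = sum(w[(tj, tk)] for tk in edges[tj])  # tk in Out(tj)
                if out_sum > 0:
                    s += w[(tj, ti)] / out_sum * ws[tj]
            new[ti] = (1 - d) + d * s
        if max(abs(new[n] - ws[n]) for n in edges) < tol:  # convergence test
            return new
        ws = new
    return ws

# Tiny chain a - b - c with unit relevance on every edge:
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
w = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0}
scores = textrank(edges, w)
```

After convergence the middle node "b" carries the highest weight, since both neighbours route their full out-weight to it, which is the ranking behaviour step S5 relies on.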
Specifically, in step S5, the node weights are sorted to obtain the Top-N seed words; if adjacent Top-N seed words form a phrase, the phrase is extracted as a term. This method reflects, at the semantic level, the semantic features of the words that make up a term, and can reduce the interference of unrelated word combinations to a certain degree.
Preferably, the adjacent phrase is extracted as a candidate term using a sliding window. For example, suppose the text is "set the mutual information and context dependency thresholds according to the corpus; if a word or word combination satisfies both thresholds, add it to the seed node set", and the extracted Top-N seed words are: "corpus", "set", "mutual information", "context", "dependency", "threshold", "word", "word combination", "seed node", "set". A window of length 6 slides slowly over the text from left to right; if all the words framed by the window belong to the Top-N seed word set, they are taken together as a candidate term (e.g. "seed node set", "context dependency"); otherwise the individual seed words are taken as candidate terms (e.g. "corpus", "mutual information").
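The sliding-window extraction above can be sketched as grouping maximal runs of adjacent Top-N seed words, capped at the window length. The tokens and seed set below are illustrative assumptions echoing the example.

```python
def extract_candidates(tokens, top_seeds, max_len=6):
    """Group maximal runs of adjacent Top-N seed words (capped at the
    sliding-window length) into candidate phrases; an isolated seed
    word becomes a single-word candidate."""
    candidates, run = [], []
    for tok in tokens + [None]:          # sentinel flushes the final run
        if tok in top_seeds and len(run) < max_len:
            run.append(tok)
        else:
            if run:
                candidates.append(" ".join(run))
            run = [tok] if tok in top_seeds else []
    return candidates

seeds = {"seed", "node", "set", "mutual", "information"}
toks = ["the", "seed", "node", "set", "and", "mutual", "information"]
cands = extract_candidates(toks, seeds)
```

Here the run "seed node set" survives intact as one multi-word candidate while "and", a non-seed word, breaks the chain before "mutual information" starts a new one. The term rules of Table 1 would then filter these candidates.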
Preferably, the preset term rules are the Chinese terminology rules shown in Table 1, where a limiting attribute includes: adjectives, distinguishing words, verbs, nouns, and numeral + classifier combinations.
Table 1  Chinese terminology rules
Nominal phrase | Template |
baseNP | verb + noun |
baseNP | baseNP + baseNP |
baseNP | baseNP + noun |
baseNP | limiting attribute + baseNP |
baseNP | limiting attribute + noun |
Further, after the candidate terms are extracted, the method comprises step S6: calculating the support and confidence of each candidate term in a database, wherein the database comprises several words of a preset domain; when a candidate term belongs to the preset domain, the candidate term is added to the term dictionary of the preset domain.
The support gives the probability that terms mi and mj occur together:
Support(mi -> mj) = P(mi ∪ mj)   formula (5);
The confidence gives the probability that, once term mi has appeared, term mj also appears:
Confidence(mi -> mj) = P(mj | mi)   formula (6);
The support and confidence of each candidate term in the specific domain are calculated by formulas (5) and (6) and compared with the preset minimum support and minimum confidence; candidate terms falling below the minimum support or minimum confidence are excluded, and the Chinese term dictionary of the specific domain is finally formed.
Association-rule mining finds, by data-mining methods over a large event database, the frequent patterns that satisfy a minimum support Minsup and a minimum confidence Minconf. After finding the candidate terms, the embodiment of the present invention calculates the support and confidence of each candidate term in the preset domain, compares them with the minimum support and confidence of terms in that domain, and excludes the large number of non-domain candidates, finally forming the Chinese dictionary of the preset domain. The preset domain may be any specific domain; terms of different domains have different confidences and supports, which the present invention does not specifically limit.
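The support/confidence filter of formulas (5) and (6) can be sketched over a document collection represented as word sets. The thresholds, documents and candidate pairs below are illustrative assumptions, not values from the patent.

```python
def filter_terms(docs, pairs, min_sup=0.2, min_conf=0.5):
    """Association-rule filter: keep a candidate pair (mi -> mj) only if
    Support    = P(mi and mj)  >= min_sup   (formula (5)), and
    Confidence = P(mj | mi)    >= min_conf  (formula (6))."""
    n = len(docs)
    kept = []
    for mi, mj in pairs:
        both = sum(1 for d in docs if mi in d and mj in d)
        mi_count = sum(1 for d in docs if mi in d)
        support = both / n
        confidence = both / mi_count if mi_count else 0.0
        if support >= min_sup and confidence >= min_conf:
            kept.append((mi, mj))
    return kept

# Four toy "documents" from an assumed computing domain:
docs = [{"cloud", "computing"}, {"cloud", "computing"},
        {"cloud"}, {"data"}]
kept = filter_terms(docs, [("cloud", "computing"), ("data", "cloud")])
```

The pair ("cloud", "computing") survives (support 0.5, confidence 2/3), while ("data", "cloud") never co-occurs and is excluded, mirroring how non-domain candidates are dropped before the dictionary is formed.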
The overall flow of steps S1–S6 is shown in Fig. 3.
In specific implementation, first, on the basis of the preprocessing, the mutual-information criterion and the context-dependency criterion are used to mine the expected seed words and add them to the seed word set; then, the seed word network is constructed from the nodes of the seed word set and their edges, and the node weights are iterated with the preset model until they converge; finally, the node weights are sorted, and when consecutively ranked seed words form an adjacent phrase, that phrase is extracted as a candidate term.
Compared with the prior art, the term extraction method based on text information disclosed by the invention solves the problem that the prior art ignores the grammatical structure of Chinese and therefore extracts many non-domain compound words and terms; the disclosed method takes the grammatical structure of Chinese fully into account, supports automatic and dynamic updating, and meets the demand for high-speed term extraction from modern large-scale text.
Embodiment two
Referring to Fig. 4, Fig. 4 is a structural block diagram of a term extraction system 10 based on text information provided by an embodiment of the present invention, comprising:
a text preprocessing unit 1, configured to obtain a text to be processed and preprocess it;
a seed word set recording unit 2, configured to extract from the text to be processed the words that satisfy the mutual-information criterion and the context-dependency criterion and add them to a seed word set;
a seed word network construction unit 3, configured to construct a seed word network from the nodes of the seed word set and the edges of those nodes; wherein a node is any seed word in the seed word set, and the edges of a node connect it to its adjacent seed words;
a convergence unit 4, configured to define the weight of each node and iterate the node weights with a preset model until the weights converge;
a candidate term extraction unit 5, configured to sort the node weights and, when consecutively ranked seed words form an adjacent phrase, extract that phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
Preferably, the term extraction system 10 based on text information further comprises:
a support and confidence computation unit 6, configured to calculate the support and confidence of the candidate terms in a database, wherein the database comprises several words of a preset domain;
a term dictionary creation unit 7, configured to, when a candidate term belongs to the preset domain, add the candidate term to the term dictionary of the preset domain.
Preferably, the text preprocessing unit 1 segments the text to be processed into minimum word units using the hanlp segmentation system, wherein a minimum unit is the smallest single word into which the text can be divided under the current segmentation system.
Preferably, the mutual-information criterion satisfies formula (1), in which the word string S = t1t2…ti, ti is a word or word combination produced by the hanlp segmentation system, f(ti) denotes the frequency of ti, ni is the number of occurrences of the word string S, and Ni is the total number of occurrences of all words in the database.
Preferably, the context-dependency criterion satisfies the following formula:
H(W|ti) = −Σw∈W p(w|ti) · log2 p(w|ti)   formula (2);
where w denotes a particular word that appears again within a certain window after ti has appeared, p(w|ti) is its conditional probability, and W is the set of all such particular words; the window is a span of specified length set over the text to be processed and contains several words.
Preferably, the convergence unit 4 defines the weight of each node by semantic relevance, which satisfies formula (3), where wij is the semantic relevance between words ti and tj and indicates the importance of the edge connecting the two nodes.
The convergence unit 4 iterates the node weights with the TextRank model until they converge; the iteration satisfies the following formula:
WS(ti) = (1 − d) + d · Σ_{tj∈In(ti)} [ wji / Σ_{tk∈Out(tj)} wjk ] · WS(tj)   formula (4);
where WS(ti) denotes the importance of node ti; d is a damping coefficient, usually less than 1; tj∈In(ti) denotes the words tj that word ti immediately follows; tk∈Out(tj) denotes the words tk that immediately follow word tj; WS(tj) denotes the importance of node tj; and wjk is the semantic relevance between words tj and tk.
Preferably, the candidate term extraction unit 5 extracts the adjacent phrase as a candidate term using a sliding window.
For the specific working process of each unit above, please refer to steps S1–S6 in the above embodiment, which will not be repeated here.
In specific implementation, first, on the basis of the preprocessing performed by the text preprocessing unit 1, the seed word set recording unit 2 uses the mutual-information criterion and the context-dependency criterion to mine the expected seed words and add them to the seed word set; then, the seed word network construction unit 3 constructs the seed word network from the nodes of the seed word set and their edges, and the convergence unit 4 iterates the node weights with the preset model until they converge; finally, the candidate term extraction unit 5 sorts the node weights and, when consecutively ranked seed words form an adjacent phrase, extracts that phrase as a candidate term.
Compared with the prior art, the term extraction system 10 based on text information disclosed by the invention solves the problem that the prior art ignores the grammatical structure of Chinese and therefore extracts many non-domain compound words and terms; the disclosed system takes the grammatical structure of Chinese fully into account, supports automatic and dynamic updating, and meets the demand for high-speed term extraction from modern large-scale text.
Embodiment three
Referring to Fig. 5, Fig. 5 is a structural schematic diagram of a term extraction device 20 based on text information provided by an embodiment of the present invention. The term extraction device 20 of this embodiment comprises a processor 21, a memory 22, and a computer program stored in the memory 22 and executable on the processor 21. When executing the computer program, the processor 21 implements the steps of the term extraction method embodiments described above, such as steps S1–S5 shown in Fig. 1; alternatively, the processor 21 implements the functions of the modules/units of the apparatus embodiments described above.
Illustratively, the computer program may be divided into one or more modules/units, which are stored in the memory 22 and executed by the processor 21 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments describing the execution of the computer program in the term extraction device 20. For example, the computer program may be divided into the text preprocessing unit 1, the seed word set recording unit 2, the seed word network construction unit 3, the convergence unit 4, the candidate term extraction unit 5, the support and confidence computation unit 6, and the term dictionary creation unit 7; for the specific functions of each unit, please refer to the corresponding units of the term extraction system 10 based on text information described in Embodiment two, which will not be repeated here.
The term extraction equipment 20 based on text information may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. It may include, but is not limited to, the processor 21 and the memory 22. Those skilled in the art will understand that the schematic diagram is merely an example of the term extraction equipment 20 based on text information and does not constitute a limitation on it; the equipment may include more or fewer components than illustrated, combine certain components, or use different components. For example, the term extraction equipment 20 based on text information may also include input/output devices, network access devices, buses, and the like.
The processor 21 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the term extraction equipment 20 based on text information and connects the various parts of the entire equipment using various interfaces and lines.
The memory 22 may be used to store the computer program and/or modules. The processor 21 realizes the various functions of the term extraction equipment 20 based on text information by running or executing the computer program and/or modules stored in the memory 22 and by calling data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the device (such as audio data or a phone book). In addition, the memory 22 may include a high-speed random access memory and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules/units of the term extraction equipment 20 based on text information are realized in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may realize all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by the processor 21, the computer program can realize the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, executable file form, certain intermediate forms, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.
It should be noted that the device embodiments described above are merely exemplary. The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided by the invention, the connection relationships between modules indicate that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
The above are preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications are also considered within the protection scope of the present invention.
Claims (10)
1. A term extraction method based on text information, characterized by comprising:
obtaining a text to be processed, and pre-processing the text to be processed;
extracting from the text to be processed the words that meet a mutual information judging index and a context-dependency judging index, and including them in a seed word set;
constructing a seed word network based on the nodes of the seed word set and the edges of the nodes; wherein a node is any seed word in the seed word set, and the edges of a node link it to the seed words adjacent to the current node;
defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges;
sorting the weights of the nodes, and when seed words ranked in sequence form an adjacent phrase, extracting the adjacent phrase as a candidate term; wherein the adjacent phrase meets preset term rules.
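The network construction in claim 1 can be sketched as an undirected graph whose nodes are seed words and whose edges link seed words that appear next to each other in the segmented text. A minimal sketch (the `tokens` and seed-word data are illustrative assumptions, not from the patent):

```python
from collections import defaultdict

def build_seed_word_network(tokens, seed_words):
    """Build an adjacency map: an edge links two seed words that
    appear next to each other in the token sequence."""
    seeds = set(seed_words)
    edges = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left in seeds and right in seeds and left != right:
            edges[left].add(right)
            edges[right].add(left)  # undirected: adjacency is symmetric
    return dict(edges)

tokens = ["deep", "learning", "model", "deep", "network"]
network = build_seed_word_network(tokens, {"deep", "learning", "network"})
```

Non-seed tokens (here "model") break adjacency, so only seed words that actually neighbor each other become connected nodes.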
2. The term extraction method based on text information according to claim 1, characterized in that, after extracting the adjacent phrase as a candidate term, the method further comprises:
calculating the support and confidence of the candidate term in a database; wherein the database includes several words of a preset field;
when the candidate term belongs to the preset field, extracting the candidate term into a term dictionary of the preset field.
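The support and confidence in claim 2 can be read in the usual association-rule sense. The sketch below assumes support is the fraction of database entries containing the candidate term and confidence is the fraction of term-containing entries that belong to the preset field; the `(text, in_field)` database layout is an illustrative assumption:

```python
def support_and_confidence(candidate, database):
    """database: list of (text, in_field) pairs, where in_field flags
    whether the entry belongs to the preset field."""
    containing = [in_field for text, in_field in database if candidate in text]
    support = len(containing) / len(database) if database else 0.0
    confidence = sum(containing) / len(containing) if containing else 0.0
    return support, confidence

db = [("neural network training", True),
      ("network switch manual", False),
      ("gradient descent", True),
      ("neural network pruning", True)]
s, c = support_and_confidence("neural network", db)
```

A candidate with high support and high confidence would then be admitted into the field's term dictionary.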
3. The term extraction method based on text information according to claim 1, characterized in that pre-processing the text to be processed specifically comprises:
dividing the text to be processed into minimum units of words using the hanlp word segmentation system; wherein a minimum unit denotes a single word into which the text to be processed can be divided under the current word segmentation system.
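Claim 3 relies on the hanlp segmentation system. As a dependency-free stand-in, the sketch below splits text into minimum units by taking each non-ASCII character alone while keeping runs of ASCII word characters together; this splitting rule is an illustrative assumption, not hanlp's actual behavior:

```python
import re

def minimum_units(text):
    """Split into minimum units: runs of ASCII word characters stay
    together, every other non-space character stands alone."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

units = minimum_units("TextRank算法")
```

A real implementation would replace `minimum_units` with a call into the hanlp tokenizer and keep the rest of the pipeline unchanged.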
4. The term extraction method based on text information according to claim 3, characterized in that the mutual information judging index meets the following formula:
[formula (1) omitted in the source text]
wherein the word string S = t1t2…ti, and ti is a word or word combination cut by the hanlp word segmentation system; f(ti) denotes the frequency with which ti appears; ni is the number of times the word string S appears, and Ni is the number of times all words appear in the database.
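The mutual-information formula itself is not reproduced in the text. A common form consistent with the quantities claim 4 names (f(ti), ni, Ni) compares the probability of the whole string with the product of the probabilities of its parts; the sketch below implements that assumed form, and the counts used are illustrative:

```python
import math

def mutual_information(string_count, part_freqs, total):
    """Assumed form: MI(S) = log2( p(S) / prod(p(t_i)) ),
    with p(S) = n_i / N_i and p(t_i) = f(t_i) / N_i."""
    p_string = string_count / total
    p_parts = 1.0
    for f in part_freqs:
        p_parts *= f / total
    return math.log2(p_string / p_parts)

# Assumed counts: string seen 50 times, parts 200 and 300 times, N = 100000
mi = mutual_information(50, [200, 300], 100_000)
```

A large positive value means the parts co-occur far more often than chance, which is the usual threshold test for admitting a string into the seed word set.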
5. The term extraction method based on text information according to claim 4, characterized in that the context-dependency judging index meets the following formula:
H(W|ti) = -Σ_{w∈W} p(w|ti) · log2 p(w|ti)   formula (2);
wherein p(w|ti) denotes the probability that a particular word w appears given that ti has appeared within a certain window; W denotes the set of all such particular words that appear given that ti has appeared within the window; the certain window is a window of a specific length arranged over the text to be processed, the window of the specific length containing several words.
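Formula (2) is the conditional entropy of the words co-occurring with ti inside the window: a flat co-occurrence distribution (high entropy) signals that ti combines freely with its context. A minimal sketch, with illustrative co-occurrence counts:

```python
import math

def context_entropy(cooccur_counts):
    """H(W|t_i) = -sum over w of p(w|t_i) * log2 p(w|t_i), where the
    probabilities come from counts of words seen in t_i's window."""
    total = sum(cooccur_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in cooccur_counts.values() if c > 0)

h = context_entropy({"network": 2, "model": 1, "graph": 1})
```

Here the distribution (0.5, 0.25, 0.25) yields an entropy of 1.5 bits; a threshold on this value would serve as the context-dependency judging index.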
6. The term extraction method based on text information according to claim 1, characterized in that defining the weight of the node and iterating the weight of the node through the preset model until the weight of the node converges specifically comprises:
defining the weight of the node using semantic relevance; wherein the semantic relevance meets the following formula:
[formula (3) omitted in the source text]
wherein wij is the semantic relevance between the words ti and tj, denoting the significance level of the edge connecting the nodes;
iterating the weight of the node through the TextRank model until the weight of the node converges; wherein the iterative process meets the following formula:
WS(ti) = (1 - d) + d · Σ_{tj∈In(ti)} [ wji / Σ_{tk∈Out(tj)} wjk ] · WS(tj)
wherein WS(ti) denotes the significance level of node ti; d denotes a damping coefficient, usually less than 1; tj∈In(ti) denotes that word ti immediately follows word tj; tk∈Out(tj) denotes that word tk immediately follows word tj; WS(tj) denotes the significance level of node tj; and wjk is the semantic relevance between the words tj and tk.
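The iteration in claim 6 matches the standard weighted TextRank update, where each node's score is redistributed to its neighbors in proportion to edge weight. A minimal sketch, assuming the standard update form and an illustrative three-node graph:

```python
def textrank(weights, d=0.85, tol=1e-6, max_iter=200):
    """weights: dict mapping (i, j) -> w_ij edge weight (symmetric).
    Iterate node scores until the largest change falls below tol."""
    nodes = {n for edge in weights for n in edge}
    neighbors = {n: [m for m in nodes if (n, m) in weights] for n in nodes}
    ws = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            # Each neighbor j passes on ws[j], scaled by w_ji over j's
            # total outgoing weight, then damped by d.
            s = sum(weights[(j, i)] /
                    sum(weights[(j, k)] for k in neighbors[j]) * ws[j]
                    for j in neighbors[i])
            new[i] = (1 - d) + d * s
        delta = max(abs(new[n] - ws[n]) for n in nodes)
        ws = new
        if delta < tol:
            break
    return ws

w = {("a", "b"): 1.0, ("b", "a"): 1.0,
     ("b", "c"): 2.0, ("c", "b"): 2.0}
scores = textrank(w)
```

With these weights, node "b" sits on both edges and ends with the largest converged score, which is the ordering the subsequent sorting step relies on.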
7. The term extraction method based on text information according to claim 1, characterized in that extracting the adjacent phrase as a candidate term specifically comprises:
extracting the adjacent phrase as a candidate term using a sliding window.
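Claim 7's sliding-window extraction can be sketched as scanning the token sequence with windows of increasing size and emitting every window made entirely of seed words as a candidate term; the window sizes and example data are illustrative assumptions:

```python
def sliding_window_candidates(tokens, seed_words, max_len=3):
    """Slide windows of length 2..max_len over the token sequence and
    emit every window made entirely of seed words as a candidate term."""
    seeds = set(seed_words)
    candidates = []
    for size in range(2, max_len + 1):
        for i in range(len(tokens) - size + 1):
            window = tokens[i:i + size]
            if all(t in seeds for t in window):
                candidates.append(" ".join(window))
    return candidates

cands = sliding_window_candidates(
    ["deep", "neural", "network", "is", "trained"],
    {"deep", "neural", "network"})
```

The preset term rules of claim 1 (e.g. length or part-of-speech constraints) would then filter this candidate list.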
8. A term extraction system based on text information, characterized by comprising:
a to-be-processed text pretreatment unit, for obtaining a text to be processed and pre-processing the text to be processed;
a seed word set inclusion unit, for extracting from the text to be processed the words that meet a mutual information judging index and a context-dependency judging index and including them in a seed word set;
a seed word network construction unit, for constructing a seed word network based on the nodes of the seed word set and the edges of the nodes; wherein a node is any seed word in the seed word set, and the edges of a node link it to the seed words adjacent to the current node;
a convergence unit, for defining the weight of the node and iterating the weight of the node through a preset model until the weight of the node converges;
a candidate terms extraction unit, for sorting the weights of the nodes and, when seed words ranked in sequence form an adjacent phrase, extracting the adjacent phrase as a candidate term; wherein the adjacent phrase meets preset term rules.
9. The term extraction system based on text information according to claim 8, characterized in that the system further comprises:
a support and confidence computation unit, for calculating the support and confidence of the candidate term in a database; wherein the database includes several words of a preset field;
a term dictionary creation unit, for extracting, when the candidate term belongs to the preset field, the candidate term into a term dictionary of the preset field.
10. A term extraction equipment based on text information, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, realizes the term extraction method based on text information according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910063975.1A CN109902290B (en) | 2019-01-23 | 2019-01-23 | Text information-based term extraction method, system and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109902290A true CN109902290A (en) | 2019-06-18 |
CN109902290B CN109902290B (en) | 2023-06-30 |
Family
ID=66944048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910063975.1A Active CN109902290B (en) | 2019-01-23 | 2019-01-23 | Text information-based term extraction method, system and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109902290B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102360383A (en) * | 2011-10-15 | 2012-02-22 | 西安交通大学 | Method for extracting text-oriented field term and term relationship |
CN107329950A (en) * | 2017-06-13 | 2017-11-07 | 武汉工程大学 | It is a kind of based on the Chinese address segmenting method without dictionary |
CN108287825A (en) * | 2018-01-05 | 2018-07-17 | 中译语通科技股份有限公司 | A kind of term identification abstracting method and system |
CN108549626A (en) * | 2018-03-02 | 2018-09-18 | 广东技术师范学院 | A kind of keyword extracting method for admiring class |
Non-Patent Citations (2)
Title |
---|
杜海舟 et al.: "Keyword Extraction Method Based on Context Relations and the TextRank Algorithm", Journal of Shanghai University of Electric Power * |
贺海涛 et al.: "Research on Ontology Concept Extraction Based on Association Rules and Semantic Rules", Journal of Jilin University (Information Science Edition) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115066679A (en) * | 2020-03-25 | 2022-09-16 | 苏州七星天专利运营管理有限责任公司 | Method and system for extracting self-made terms in professional field |
CN115066679B (en) * | 2020-03-25 | 2024-02-20 | 苏州七星天专利运营管理有限责任公司 | Method and system for extracting self-made terms in professional field |
CN111680128A (en) * | 2020-06-16 | 2020-09-18 | 杭州安恒信息技术股份有限公司 | Method and system for detecting web page sensitive words and related devices |
CN112966508A (en) * | 2021-04-05 | 2021-06-15 | 集智学园(北京)科技有限公司 | General automatic term extraction method |
CN112966508B (en) * | 2021-04-05 | 2023-08-25 | 集智学园(北京)科技有限公司 | Universal automatic term extraction method |
CN115130472A (en) * | 2022-08-31 | 2022-09-30 | 北京澜舟科技有限公司 | Method, system and readable storage medium for segmenting subwords based on BPE |
CN116756298A (en) * | 2023-08-18 | 2023-09-15 | 太仓市律点信息技术有限公司 | Cloud database-oriented AI session information optimization method and big data optimization server |
CN116756298B (en) * | 2023-08-18 | 2023-10-20 | 太仓市律点信息技术有限公司 | Cloud database-oriented AI session information optimization method and big data optimization server |
Also Published As
Publication number | Publication date |
---|---|
CN109902290B (en) | 2023-06-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11301637B2 (en) | Methods, devices, and systems for constructing intelligent knowledge base | |
CN109902290A (en) | A kind of term extraction method, system and equipment based on text information | |
CN110390006B (en) | Question-answer corpus generation method, device and computer readable storage medium | |
CN108304375A (en) | A kind of information identifying method and its equipment, storage medium, terminal | |
KR101548096B1 (en) | Method and server for automatically summarizing documents | |
US11113470B2 (en) | Preserving and processing ambiguity in natural language | |
CN110347790B (en) | Text duplicate checking method, device and equipment based on attention mechanism and storage medium | |
CN107885717B (en) | Keyword extraction method and device | |
CN112256822A (en) | Text search method and device, computer equipment and storage medium | |
WO2017177809A1 (en) | Word segmentation method and system for language text | |
CN108304377B (en) | Extraction method of long-tail words and related device | |
CN103309852A (en) | Method for discovering compound words in specific field based on statistics and rules | |
CN103971684A (en) | Method and system for adding punctuations and method and device for establishing language model for adding punctuations | |
EP3232336A1 (en) | Method and device for recognizing stop word | |
CN109063184A (en) | Multilingual newsletter archive clustering method, storage medium and terminal device | |
CN109460499A (en) | Target search word generation method and device, electronic equipment, storage medium | |
CN113901214B (en) | Method and device for extracting form information, electronic equipment and storage medium | |
CN111309916A (en) | Abstract extraction method and device, storage medium and electronic device | |
CN112328747A (en) | Event context generation method and device, terminal equipment and storage medium | |
CN111444713B (en) | Method and device for extracting entity relationship in news event | |
CN110874408B (en) | Model training method, text recognition device and computing equipment | |
US20110106849A1 (en) | New case generation device, new case generation method, and new case generation program | |
CN113157857B (en) | Hot topic detection method, device and equipment for news | |
CN111339287B (en) | Abstract generation method and device | |
CN112182283A (en) | Song searching method, device, network equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||