CN109902290A - Term extraction method, system and device based on text information - Google Patents


Publication number
CN109902290A
CN109902290A
Authority
CN
China
Prior art keywords
node
words
text
word
weight
Prior art date
Legal status
Granted
Application number
CN201910063975.1A
Other languages
Chinese (zh)
Other versions
CN109902290B (en)
Inventor
杜翠凤
沈文明
周冠宇
Current Assignee
Guangzhou Jay Communications Planning And Design Institute Co Ltd
GCI Science and Technology Co Ltd
Original Assignee
Guangzhou Jay Communications Planning And Design Institute Co Ltd
GCI Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Jay Communications Planning And Design Institute Co Ltd, GCI Science and Technology Co Ltd filed Critical Guangzhou Jay Communications Planning And Design Institute Co Ltd
Priority to CN201910063975.1A priority Critical patent/CN109902290B/en
Publication of CN109902290A publication Critical patent/CN109902290A/en
Application granted granted Critical
Publication of CN109902290B publication Critical patent/CN109902290B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a term extraction method based on text information, comprising: obtaining a text to be processed and pre-processing it; extracting from the text the words that satisfy a mutual-information index and a context-dependency index and adding them to a seed-word set; constructing a seed-word network from the nodes of the seed-word set and the edges between them; defining a weight for each node and iterating the node weights with a preset model until they converge; and sorting the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracting that phrase as a candidate term. The invention further discloses a term extraction system and a term extraction device based on text information. Embodiments of the invention take the grammatical structure of Chinese fully into account, are automated and dynamically updatable, and meet the demand for high-speed term extraction from modern large-scale text.

Description

Term extraction method, system and device based on text information
Technical field
The present invention relates to the technical field of language recognition, and more particularly to a term extraction method, system and device based on text information.
Background technique
Automated term extraction has become a research hotspot in the field of natural language processing. Prior-art automated term extraction methods proceed as follows: first, seed words are extracted from the text using mutual information and context dependency; next, a word-frequency method splices words into compound words for key areas; finally, the association between terms is measured quantitatively using domain consistency, domain relevance, and domain membership. Seed-word extraction methods based on mutual information, context dependency, and information entropy start from the frequent words of a text and synthesize seed words by splicing forward or backward; the terms they extract have high completeness, but because these methods ignore the grammatical structure of Chinese, they produce a large number of non-domain compound words and terms. Methods based on domain consistency, domain relevance, and domain membership extract domain compound words and terms more accurately, but an optimal threshold for each of these indexes is difficult to find.
Summary of the invention
The object of the embodiments of the present invention is to provide a term extraction method, system and device based on text information that fully take the grammatical structure of Chinese into account, are automated and dynamically updatable, and meet the demand for high-speed term extraction from modern large-scale text.
To achieve the above object, an embodiment of the invention provides a term extraction method based on text information, comprising:
obtaining a text to be processed and pre-processing it;
extracting from the text to be processed the words that satisfy a mutual-information index and a context-dependency index, and adding them to a seed-word set;
constructing a seed-word network from nodes based on the seed-word set and the edges of those nodes; wherein a node is any seed word in the seed-word set, and the edges of a node connect it to its adjacent seed words;
defining the weight of each node, and iterating the node weights with a preset model until they converge; and
sorting the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracting that phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
Compared with the prior art, the disclosed term extraction method based on text information first mines the seed words of the corpus, after pre-processing, using the mutual-information and context-dependency indexes and adds them to a seed-word set; it then constructs a seed-word network from the nodes of that set and their edges, and iterates the node weights with the algorithm of a preset model until they converge; finally, it sorts the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracts that phrase as a candidate term. This solves the prior-art problem that ignoring the grammatical structure of Chinese leads to extracting a large number of non-domain compound words and terms; the disclosed method fully takes the grammatical structure of Chinese into account, is automated and dynamically updatable, and meets the demand for high-speed term extraction from modern large-scale text.
As an improvement of the above scheme, after extracting the adjacent phrase as a candidate term, the method further includes:
calculating the support and confidence of the candidate term in a database, wherein the database includes a number of words of a preset domain; and
when the candidate term belongs to the preset domain, adding the candidate term to the term dictionary of the preset domain.
As an improvement of the above scheme, pre-processing the text to be processed specifically includes: segmenting the text into the minimal word units produced by the hanlp segmentation system; wherein a minimal unit is a single word into which the text can be divided under the current segmentation system.
As an improvement of the above scheme, the mutual-information index satisfies formula (1), in which the word string S = t_1 t_2 … t_i; t_i is a word or word combination produced by the hanlp segmenter; f(t_i) denotes the frequency of t_i; n_i is the number of occurrences of the word string S; and N_i is the total number of occurrences of all words in the database.
As an improvement of the above scheme, the context-dependency index satisfies the following formula:

    H(W | t_i) = −Σ_{w∈W} p(w | t_i) · log₂ p(w | t_i)    formula (2);

where w is a particular word that appears within a certain window given that t_i has already appeared; W is the set of all such particular words; and the certain window is a window of specified length placed over the text to be processed, containing several words.
As an improvement of the above scheme, defining the weight of each node and iterating the node weights with a preset model until they converge specifically includes:
defining the node weights by semantic relevance, which satisfies formula (3), in which w_ij is the semantic relevance between words t_i and t_j and represents the importance of the edge connecting the two nodes;
and iterating the node weights with the TextRank model until they converge, where the iteration satisfies

    WS(t_i) = (1 − d) + d · Σ_{t_j ∈ In(t_i)} [ w_ji / Σ_{t_k ∈ Out(t_j)} w_jk ] · WS(t_j)    formula (4);

in which WS(t_i) denotes the importance of node t_i; d is a damping coefficient, usually less than 1; t_j ∈ In(t_i) means the word t_i immediately follows the word t_j; t_k ∈ Out(t_j) means the word t_k immediately follows the word t_j; WS(t_j) denotes the importance of node t_j; and w_jk is the semantic relevance between words t_j and t_k.
As an improvement of the above scheme, extracting the adjacent phrase as a candidate term specifically includes: extracting the adjacent phrase as a candidate term with a sliding window.
To achieve the above object, an embodiment of the invention further provides a term extraction system based on text information, comprising:
a text pre-processing unit for obtaining a text to be processed and pre-processing it;
a seed-word collection unit for extracting from the text to be processed the words that satisfy the mutual-information index and the context-dependency index and adding them to a seed-word set;
a seed-word network construction unit for constructing a seed-word network from nodes based on the seed-word set and the edges of those nodes; wherein a node is any seed word in the seed-word set, and the edges of a node connect it to its adjacent seed words;
a convergence unit for defining the weight of each node and iterating the node weights with a preset model until they converge; and
a candidate-term extraction unit for sorting the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracting that phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
Compared with the prior art, in the disclosed term extraction system based on text information, first, on the basis of the pre-processing performed by the text pre-processing unit, the seed-word collection unit mines the seed words of the corpus with the mutual-information and context-dependency indexes and adds them to a seed-word set; then, the seed-word network construction unit constructs a seed-word network from the nodes of the seed-word set and their edges, and the convergence unit iterates the node weights with the algorithm of the preset model until they converge; finally, the candidate-term extraction unit sorts the node weights and, when consecutively ranked seed words form an adjacent phrase, extracts that phrase as a candidate term. This solves the prior-art problem that ignoring the grammatical structure of Chinese leads to extracting a large number of non-domain compound words and terms; the disclosed system fully takes the grammatical structure of Chinese into account, is automated and dynamically updatable, and meets the demand for high-speed term extraction from modern large-scale text.
As an improvement of the above scheme, the system further includes:
a support and confidence computation unit for calculating the support and confidence of the candidate term in the database, wherein the database includes a number of words of a preset domain; and
a term dictionary creation unit for, when the candidate term belongs to the preset domain, adding the candidate term to the term dictionary of the preset domain.
To achieve the above object, an embodiment of the invention further provides a term extraction device based on text information, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; when the processor executes the computer program, the term extraction method based on text information described in any of the above embodiments is implemented.
Detailed description of the invention
Fig. 1 is a flowchart of a term extraction method based on text information provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the seed-word network in a term extraction method based on text information provided by an embodiment of the present invention;
Fig. 3 is another flowchart of a term extraction method based on text information provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of a term extraction system 10 based on text information provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of a term extraction device 20 based on text information provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 1, Fig. 1 is a flowchart of a term extraction method based on text information provided by an embodiment of the present invention; the method includes:
S1, obtaining a text to be processed and pre-processing it;
S2, extracting from the text to be processed the words that satisfy the mutual-information index and the context-dependency index, and adding them to a seed-word set;
S3, constructing a seed-word network from nodes based on the seed-word set and the edges of those nodes; wherein a node is any seed word in the seed-word set, and the edges of a node connect it to its adjacent seed words;
S4, defining the weight of each node and iterating the node weights with a preset model until they converge; and
S5, sorting the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracting that phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
Specifically, in step S1, the text to be processed is unstructured text, which may be several paragraphs, several sentences, or an entire article.
Preferably, pre-processing the text to be processed specifically includes segmenting it into the minimal word units produced by the hanlp segmentation system, where a minimal unit is a single word into which the text can be divided under the current segmentation system. Different dictionaries yield different minimal units for the same word. For example, 'cloud computing' may be segmented into 'cloud / computing' by the jieba segmenter, whereas with a custom dictionary it can be kept whole as 'cloud computing'. A minimal unit is simply a word that can no longer be divided under the current tool.
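The segmentation step above can be sketched as follows. Since the exact hanlp configuration is not given in the source, a toy forward longest-match segmenter over a small, hypothetical dictionary stands in for the hanlp Words partition system; the name `longest_match_segment` and the sample vocabulary are illustrative, not from the patent.

```python
# Sketch of the pre-processing step: split raw text into minimal word units.
# A greedy forward longest-match segmenter over a hypothetical dictionary
# stands in for the hanlp segmenter used in the patent.

def longest_match_segment(text, dictionary, max_len=4):
    """Greedy forward longest-match word segmentation."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i;
        # fall back to a single character (the true minimal unit).
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)
                i += length
                break
    return words

if __name__ == "__main__":
    vocab = {"云计算", "计算", "平台"}
    # With "云计算" in the dictionary, the word stays whole:
    print(longest_match_segment("云计算平台", vocab))  # -> ['云计算', '平台']
```

This mirrors the 'cloud computing' example in the text: the minimal unit depends entirely on the dictionary in use.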
Specifically, in step S2, the traditional mutual-information calculation underestimates the probability that a word combination reappears in the corpus; therefore, when calculating mutual information, an impact-probability coefficient for word occurrence must be taken into account. The mutual-information index satisfies formula (1), in which the word string S = t_1 t_2 … t_i; t_i is a word or word combination produced by the hanlp segmenter; f(t_i) denotes the frequency of t_i; n_i is the number of occurrences of the word string S; and N_i is the total number of occurrences of all words in the database.
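The image of formula (1) is not reproduced in this text, so the sketch below is only one plausible reading of the surrounding description: plain pointwise mutual information of the segmented parts, weighted by the occurrence coefficient n_i / N_i mentioned in the prose. The actual patent formula may differ; the function name `mi_score` is illustrative.

```python
import math

def mi_score(count_s, counts_parts, total_words):
    """Hedged sketch of a mutual-information index for a word string S.

    count_s      -- n_i, occurrences of the whole string S
    counts_parts -- occurrences f(t_i) of each segmented part of S
    total_words  -- N_i, total occurrences of all words in the database

    Computes pointwise mutual information of the parts, weighted by the
    occurrence coefficient n_i / N_i; an assumption, not the patent formula.
    """
    p_s = count_s / total_words
    p_parts = 1.0
    for c in counts_parts:
        p_parts *= c / total_words
    pmi = math.log2(p_s / p_parts)
    return (count_s / total_words) * pmi  # weight by n_i / N_i
```

A string that recurs often relative to its parts gets a higher score, which matches the intent of boosting recurring combinations.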
Context-dependent refers in certain window in context words tiConditional entropy in the case where having already appeared, it is described Context-dependent Judging index meets following formula:
H(W|ti)=- ∑w∈Wp(w|ti)*log2p(w|ti) formula (2);
Wherein, w indicates the t in certain windowiOccurs the probability of some particular words in the case where appearance again;W is expressed as T in certain windowiOccurs the set of all particular words in the case where appearance again;The certain window be show it is described to The window that a specific length is arranged in text is handled, contains several words in the window of the specific length.It is arranged described specific The case where window is advantageous in that, largely eliminates some specific word combination erroneous judgements into term.
For example, suppose the text in the window is '每一段程序' ('each section of program'). Here t_i is the word '一段' ('a section'); after '一段' appears, the probability that the word '程序' ('program') follows is p(w | t_i), i.e., the probability that '程序' appears within the window given that '一段' has appeared. Across the whole corpus, '一段' may be followed by particular words such as '程序' (program), '路面' (road surface), '话' (words), or '丝带' (silk ribbon); the set of all these particular words, which does not include '一段' itself, is W. The set of all particular words is thus the set of words that may appear after a given word, i.e., the set for that specific condition.
Specifically, the threshold value of mutual information and Context-dependent is set according to corpus, if word or word combination are equal Meet above-mentioned threshold value, is then included into the seed set of words.
Specifically, in step S3, referring to Fig. 2, Fig. 2 is a schematic diagram of the seed-word network in a term extraction method based on text information provided by an embodiment of the present invention. The seed-word network G = (V, E) is formed by the nodes V of the seed-word set and the edges E between nodes, where a node is any seed word in the seed-word set (e.g., 'algorithm' in Fig. 2) and the edges of a node connect it to its adjacent seed words (e.g., the edges of 'algorithm' in Fig. 2 include 'unsupervised', 'neural network', and 'intelligence'); each edge weight is initialized to 1 or any other constant.
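The construction of G = (V, E) can be sketched as follows, with constant initial edge weights as stated above. The helper name `build_seed_network` and the adjacency-dict representation are illustrative choices, not from the patent.

```python
def build_seed_network(tokens, seed_words):
    """Build the seed-word network G = (V, E): nodes are seed words, and an
    edge links two seed words that occur adjacently in the token stream.
    Edge weights are initialized to the constant 1, per the construction
    step described in the text."""
    graph = {w: {} for w in seed_words}
    for a, b in zip(tokens, tokens[1:]):
        if a in seed_words and b in seed_words and a != b:
            graph[a][b] = 1
            graph[b][a] = 1
    return graph

if __name__ == "__main__":
    seeds = {"算法", "神经网络", "智能"}
    tokens = ["算法", "神经网络", "路", "算法", "智能"]
    g = build_seed_network(tokens, seeds)
    print(sorted(g["算法"]))  # -> ['智能', '神经网络']
```

These constant weights are only the starting point; step S4 replaces them with semantic relevance before iteration.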
Specifically, in step S4, the mutual-information and context-dependency indexes of the above steps measure word features purely statistically and do not reflect the semantic relationships between words.

To address this, the embodiment of the present invention first defines the node weights by semantic relevance. The semantic relevance of nodes reflects the probability that seed words co-occur, which fits the assumption of the embedding method that words with similar contexts are semantically close; the semantic hierarchy between seed words is measured quantitatively to judge whether they belong to the same category. Since word vectors trained with the embedding method on a corpus carry semantic relatedness, the embodiment trains word2vec on each pre-processed corpus and uses the similarity between vectors to reflect semantic relevance. The semantic relevance satisfies formula (3), in which w_ij is the semantic relevance between words t_i and t_j and represents the importance of the edge connecting the two nodes.
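Formula (3) is not reproduced in this text. Since the prose says the similarity between word2vec vectors reflects semantic relevance, cosine similarity is assumed here as a stand-in for w_ij; this is an assumption, not the patent's stated formula.

```python
import math

def semantic_relevance(vec_i, vec_j):
    """Edge weight w_ij between seed words t_i and t_j, taken here as the
    cosine similarity of their word2vec embeddings. Cosine similarity is
    the usual reading of 'similarity between vectors' and is assumed as a
    stand-in for the patent's formula (3)."""
    dot = sum(a * b for a, b in zip(vec_i, vec_j))
    norm_i = math.sqrt(sum(a * a for a in vec_i))
    norm_j = math.sqrt(sum(b * b for b in vec_j))
    return dot / (norm_i * norm_j)
```

In practice the vectors would come from a word2vec model trained on the pre-processed corpus, as the text describes.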
Then, the node weights are iterated with the TextRank model until they converge; the iteration satisfies

    WS(t_i) = (1 − d) + d · Σ_{t_j ∈ In(t_i)} [ w_ji / Σ_{t_k ∈ Out(t_j)} w_jk ] · WS(t_j)    formula (4);

where WS(t_i) denotes the importance of node t_i; d is a damping coefficient, usually less than 1; t_j ∈ In(t_i) means the word t_i immediately follows the word t_j; t_k ∈ Out(t_j) means the word t_k immediately follows the word t_j; WS(t_j) denotes the importance of node t_j; and w_jk is the semantic relevance between words t_j and t_k. Iteration proceeds over the word sequence of the corpus until the stop condition is met.
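The iteration can be sketched as a weighted-TextRank loop over the seed-word network, with convergence detected when the largest weight change falls below a tolerance. This is a minimal sketch under the assumption that the network built earlier is undirected (so In(t_i) and Out(t_i) are both its neighbours); it is not the patent's exact implementation.

```python
def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the weighted-TextRank update until the node weights converge:

        WS(t_i) = (1 - d) + d * sum over t_j in In(t_i) of
                  [ w_ji / (sum over t_k in Out(t_j) of w_jk) ] * WS(t_j)

    `graph` maps each node to {neighbour: edge weight}."""
    ws = {v: 1.0 for v in graph}
    for _ in range(max_iter):
        new_ws = {}
        for ti in graph:
            rank = 0.0
            for tj, w_ji in graph[ti].items():
                out_sum = sum(graph[tj].values())  # sum of w_jk over Out(t_j)
                if out_sum > 0:
                    rank += w_ji / out_sum * ws[tj]
            new_ws[ti] = (1 - d) + d * rank
        converged = max(abs(new_ws[v] - ws[v]) for v in graph) < tol
        ws = new_ws
        if converged:
            break
    return ws
```

On a symmetric two-node network the weights settle at 1.0 immediately, which is the fixed point of the update above.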
Specifically, in step S5, the node weights are sorted to obtain the Top-N seed words; if adjacent phrases form between Top-N seed words, they are extracted as terms. This method reflects, at the semantic level, the features of the words making up a term, and can reduce the interference of unrelated word combinations to a certain extent.

Preferably, the adjacent phrases are extracted as candidate terms with a sliding window. For example, given the sentence 'set the mutual-information and context-dependency thresholds according to the corpus; if a word or word combination satisfies both thresholds, add it to the seed-node set', the extracted Top-N seed words are: 'corpus', 'set', 'mutual information', 'context', 'dependency', 'threshold', 'word', 'word combination', 'seed node', 'collection'. A window of length 6 then slides over the sentence from left to right; if every word framed by the window belongs to the Top-N seed-word set, the framed words are taken together as a candidate term (e.g., 'seed-node set', 'context dependency'); otherwise, single seed words are taken as candidate terms (e.g., 'corpus', 'mutual information').
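The sliding-window extraction can be approximated as follows. This sketch simplifies the window mechanics to emitting maximal runs of adjacent Top-N seed words, capped at the window length, which matches the 'adjacent seed words form a phrase' rule described above; the name `extract_candidates` is illustrative.

```python
def extract_candidates(tokens, top_seeds, window=6):
    """Emit each maximal run (length >= 2, capped at `window`) of adjacent
    Top-N seed words in the token stream as a candidate term phrase.
    Single seed words would fall back to one-word candidates, per the text."""
    candidates = []
    run = []
    for tok in tokens:
        if tok in top_seeds:
            run.append(tok)
        else:
            if len(run) >= 2:
                candidates.append("".join(run[:window]))
            run = []
    if len(run) >= 2:
        candidates.append("".join(run[:window]))
    return candidates

if __name__ == "__main__":
    seeds = {"种子", "节点", "集合", "上下文", "依赖"}
    tokens = ["种子", "节点", "集合", "的", "上下文", "依赖"]
    print(extract_candidates(tokens, seeds))  # -> ['种子节点集合', '上下文依赖']
```

The emitted phrases would still have to pass the term rules of Table 1 before being kept as candidate terms.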
Preferably, the preset term rules are the Chinese terminology rules shown in Table 1, where the restrictive attributive includes: adjectives, distinguishing words, verbs, nouns, and numeral + classifier combinations.

Table 1  Chinese terminology rules

    Nominal phrase    Template
    baseNP            verb + noun
    baseNP            baseNP + baseNP
    baseNP            baseNP + noun
    baseNP            restrictive attributive + baseNP
    baseNP            restrictive attributive + noun
Further, after the candidate terms are extracted, the method includes step S6: calculating the support and confidence of each candidate term in a database, wherein the database includes a number of words of a preset domain; when a candidate term belongs to the preset domain, it is added to the term dictionary of the preset domain.
The support reveals the probability that terms m_i and m_j occur together:

    Support(m_i -> m_j) = P(m_i ∪ m_j)    formula (5);

The confidence reveals whether, and with what probability, term m_j appears after term m_i has appeared:

    Confidence(m_i -> m_j) = P(m_j | m_i)    formula (6);
The support and confidence of each candidate term in the specific domain are calculated by formulas (5) and (6) and compared with the preset minimum support and minimum confidence; candidate terms below the minimum support or minimum confidence are excluded, ultimately forming the Chinese terminology dictionary of the specific domain.
Association rules are obtained mainly by data-mining a large event-log database for frequent patterns that satisfy a minimum support Minsup and a minimum confidence Minconf. After finding the candidate terms, the embodiment of the present invention calculates the support and confidence of each candidate term in the preset domain, compares them with the minimum support and confidence of terms in that domain, and excludes the large number of non-domain candidate terms, ultimately forming the Chinese dictionary of the preset domain. The preset domain may be any specific domain; terms of different domains have different confidences and supports, and the present invention does not specifically limit this.
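The filtering by formulas (5) and (6) can be sketched as follows, with co-occurrence counts over a document collection standing in for the event-log database. Following the prose ('after m_i appears, whether m_j will appear'), confidence is computed as P(m_j | m_i), the standard association-rule convention; the names `filter_terms`, `cooccur`, and `totals` are illustrative.

```python
def filter_terms(cooccur, totals, n_docs, min_sup, min_conf):
    """Keep a candidate pair (m_i, m_j) only if
        Support(m_i -> m_j)    = P(m_i and m_j) >= min_sup   (formula 5)
        Confidence(m_i -> m_j) = P(m_j | m_i)   >= min_conf  (formula 6)
    `cooccur[(m_i, m_j)]` counts documents containing both terms;
    `totals[m_i]` counts documents containing m_i; `n_docs` is the total."""
    kept = []
    for (mi, mj), both in cooccur.items():
        support = both / n_docs
        confidence = both / totals[mi]
        if support >= min_sup and confidence >= min_conf:
            kept.append((mi, mj))
    return kept
```

Pairs below either threshold are dropped, leaving the domain term dictionary described in the text.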
Further, the flow of the above steps S1 to S6 is shown in Fig. 3.
In specific implementation, first, on the basis of pre-processing, the seed words of the corpus are mined with the mutual-information and context-dependency indexes and added to a seed-word set; then, a seed-word network is constructed from the nodes of the seed-word set and their edges, and the node weights are iterated with the algorithm of the preset model until they converge; finally, the node weights are sorted and, when consecutively ranked seed words form an adjacent phrase, that phrase is extracted as a candidate term.
Compared with the prior art, the disclosed term extraction method based on text information solves the prior-art problem that ignoring the grammatical structure of Chinese leads to extracting a large number of non-domain compound words and terms; it fully takes the grammatical structure of Chinese into account, is automated and dynamically updatable, and meets the demand for high-speed term extraction from modern large-scale text.
Embodiment two
Referring to Fig. 4, Fig. 4 is a structural block diagram of a term extraction system 10 based on text information provided by an embodiment of the present invention; the system includes:
a text pre-processing unit 1 for obtaining a text to be processed and pre-processing it;
a seed-word collection unit 2 for extracting from the text to be processed the words that satisfy the mutual-information index and the context-dependency index and adding them to a seed-word set;
a seed-word network construction unit 3 for constructing a seed-word network from nodes based on the seed-word set and the edges of those nodes, wherein a node is any seed word in the seed-word set and the edges of a node connect it to its adjacent seed words;
a convergence unit 4 for defining the weight of each node and iterating the node weights with a preset model until they converge; and
a candidate-term extraction unit 5 for sorting the nodes by weight and, when consecutively ranked seed words form an adjacent phrase, extracting that phrase as a candidate term, wherein the adjacent phrase satisfies preset term rules.
Preferably, the term extraction system 10 based on text information further includes:
a support and confidence computation unit 6 for calculating the support and confidence of the candidate term in the database, wherein the database includes a number of words of a preset domain; and
a term dictionary creation unit 7 for, when the candidate term belongs to the preset domain, adding the candidate term to the term dictionary of the preset domain.
Preferably, the text pre-processing unit 1 segments the text to be processed into the minimal word units produced by the hanlp segmentation system, where a minimal unit is a single word into which the text can be divided under the current segmentation system.
Preferably, the mutual-information index satisfies formula (1), in which the word string S = t_1 t_2 … t_i; t_i is a word or word combination produced by the hanlp segmenter; f(t_i) denotes the frequency of t_i; n_i is the number of occurrences of the word string S; and N_i is the total number of occurrences of all words in the database.
Preferably, the context-dependency index satisfies the following formula:

    H(W | t_i) = −Σ_{w∈W} p(w | t_i) · log₂ p(w | t_i)    formula (2);

where w is a particular word that appears within a certain window given that t_i has already appeared; W is the set of all such particular words; and the window is a window of specified length placed over the text to be processed, containing several words.
Preferably, the convergence unit 4 defines the node weights by semantic relevance, which satisfies formula (3), in which w_ij is the semantic relevance between words t_i and t_j and represents the importance of the edge connecting the two nodes.
The convergence unit 4 iterates the node weights with the TextRank model until they converge; the iteration satisfies formula (4), in which WS(t_i) denotes the importance of node t_i; d is a damping coefficient, usually less than 1; t_j ∈ In(t_i) means the word t_i immediately follows the word t_j; t_k ∈ Out(t_j) means the word t_k immediately follows the word t_j; WS(t_j) denotes the importance of node t_j; and w_jk is the semantic relevance between words t_j and t_k.
Preferably, the candidate-term extraction unit 5 extracts the adjacent phrases as candidate terms with a sliding window.
The specific working process of each of the above units follows steps S1 to S6 in the above embodiment and is not repeated here.
In specific implementation, first, on the basis of the pre-processing performed by the text pre-processing unit 1, the seed-word collection unit 2 mines the seed words of the corpus with the mutual-information and context-dependency indexes and adds them to a seed-word set; then, the seed-word network construction unit 3 constructs a seed-word network from the nodes of the seed-word set and their edges, and the convergence unit 4 iterates the node weights with the algorithm of the preset model until they converge; finally, the candidate-term extraction unit 5 sorts the node weights and, when consecutively ranked seed words form an adjacent phrase, extracts that phrase as a candidate term.
Compared with the prior art, the disclosed term extraction system 10 based on text information solves the prior-art problem that ignoring the grammatical structure of Chinese leads to extracting a large number of non-domain compound words and terms; it fully takes the grammatical structure of Chinese into account, is automated and dynamically updatable, and meets the demand for high-speed term extraction from modern large-scale text.
Embodiment Three
Referring to Fig. 5, Fig. 5 is a structural schematic diagram of the term extraction equipment 20 based on text information provided by an embodiment of the present invention. The term extraction equipment 20 based on text information of this embodiment includes: a processor 21, a memory 22, and a computer program stored in the memory 22 and executable on the processor 21. When executing the computer program, the processor 21 implements the steps in each of the above embodiments of the term extraction method based on text information, such as steps S1 to S5 shown in Fig. 1. Alternatively, when executing the computer program, the processor 21 implements the functions of each module/unit in each of the above apparatus embodiments.
Exemplarily, the computer program may be divided into one or more modules/units, which are stored in the memory 22 and executed by the processor 21 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program in the term extraction equipment 20 based on text information. For example, the computer program may be divided into the to-be-processed text preprocessing unit 1, the seed word set inclusion unit 2, the seed word network construction unit 3, the convergence unit 4, the candidate term extraction unit 5, the support and confidence computation unit 6, and the term dictionary creation unit 7. For the specific functions of each unit, refer to the specific functions of each unit in the term extraction system 10 based on text information described in the above Embodiment Two, which are not repeated here.
The term extraction equipment 20 based on text information may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The term extraction equipment 20 based on text information may include, but is not limited to, the processor 21 and the memory 22. Those skilled in the art will understand that the schematic diagram is merely an example of the term extraction equipment 20 based on text information and does not constitute a limitation on it; it may include more or fewer components than illustrated, combine certain components, or have different components. For example, the term extraction equipment 20 based on text information may also include input/output devices, network access devices, buses, and the like.
The processor 21 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the term extraction equipment 20 based on text information, and connects the various parts of the entire term extraction equipment 20 based on text information using various interfaces and lines.
The memory 22 may be used to store the computer program and/or modules. The processor 21 implements the various functions of the term extraction equipment 20 based on text information by running or executing the computer program and/or modules stored in the memory 22 and by calling the data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the device (such as audio data, a phone book), etc. In addition, the memory 22 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules/units of the term extraction equipment 20 based on text information are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program, which may be stored in a computer-readable storage medium; when executed by the processor 21, the computer program may implement the steps of each of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
It should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationships between modules indicate that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement them without creative effort.
The above are preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principles of the present invention, and these improvements and modifications are also considered within the protection scope of the present invention.

Claims (10)

1. A term extraction method based on text information, characterized by comprising:
obtaining a text to be processed, and preprocessing the text to be processed;
extracting, from the text to be processed, words that satisfy a mutual information judging index and a context-dependency judging index, and adding them into a seed word set;
constructing a seed word network based on nodes of the seed word set and edges of the nodes; wherein a node is any seed word in the seed word set, and the edges of a node are the seed words adjacent to the current node;
defining weights of the nodes, and iterating the weights of the nodes via a preset model until the weights of the nodes converge;
sorting the weights of the nodes, and, when the sorted seed words form an adjacent phrase, extracting the adjacent phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
2. The term extraction method based on text information according to claim 1, characterized in that, after extracting the adjacent phrase as a candidate term, the method further comprises:
calculating the support and confidence of the candidate term in a database; wherein the database comprises several words of a preset domain;
when the candidate term belongs to the preset domain, extracting the candidate term to constitute a term dictionary of the preset domain.
3. The term extraction method based on text information according to claim 1, characterized in that preprocessing the text to be processed specifically comprises:
dividing the text to be processed into minimum units of words using the hanlp word segmentation system; wherein a minimum unit denotes a single word into which the text to be processed can be divided under the current word segmentation system.
4. The term extraction method based on text information according to claim 3, characterized in that the mutual information judging index satisfies the following formula:
wherein the word string S = t1t2…ti, and ti is a word or word combination segmented by the hanlp word segmentation system; f(ti) denotes the frequency of occurrence of ti; ni is the number of occurrences of the word string S, and Ni is the number of occurrences of all words in the database.
5. The term extraction method based on text information according to claim 4, characterized in that the context-dependency judging index satisfies the following formula:
H(W|ti) = −∑_{w∈W} p(w|ti) · log2 p(w|ti)   formula (2);
wherein p(w|ti) denotes the probability that a particular word w occurs given that ti has occurred within a certain window; W denotes the set of all such particular words occurring given that ti has occurred within the window; and the certain window is a window of a specific length set over the text to be processed, containing several words.
6. The term extraction method based on text information according to claim 1, characterized in that defining the weights of the nodes and iterating the weights of the nodes via the preset model until the weights of the nodes converge specifically comprises:
defining the weights of the nodes using semantic relevance; wherein the semantic relevance satisfies the following formula:
wherein wij is the semantic relevance between words ti and tj, indicating the importance of the edge connection between the nodes;
iterating the weights of the nodes via the TextRank model until the weights of the nodes converge; wherein the iterative process satisfies the following formula:
WS(ti) = (1 − d) + d × Σ_{tj∈In(ti)} [ wji / Σ_{tk∈Out(tj)} wjk ] × WS(tj)
wherein WS(ti) denotes the importance of node ti; d denotes a damping coefficient, usually less than 1; tj∈In(ti) denotes that word ti immediately follows word tj; tk∈Out(tj) denotes that word tk immediately follows word tj; WS(tj) denotes the importance of node tj; and wjk is the semantic relevance between words tj and tk.
7. The term extraction method based on text information according to claim 1, characterized in that extracting the adjacent phrase as a candidate term specifically comprises:
extracting the adjacent phrase as a candidate term using a sliding window.
8. A term extraction system based on text information, characterized by comprising:
a to-be-processed text preprocessing unit, configured to obtain a text to be processed and preprocess the text to be processed;
a seed word set inclusion unit, configured to extract, from the text to be processed, words that satisfy a mutual information judging index and a context-dependency judging index, and add them into a seed word set;
a seed word network construction unit, configured to construct a seed word network based on nodes of the seed word set and edges of the nodes; wherein a node is any seed word in the seed word set, and the edges of a node are the seed words adjacent to the current node;
a convergence unit, configured to define weights of the nodes and iterate the weights of the nodes via a preset model until the weights of the nodes converge;
a candidate term extraction unit, configured to sort the weights of the nodes and, when the sorted seed words form an adjacent phrase, extract the adjacent phrase as a candidate term; wherein the adjacent phrase satisfies preset term rules.
9. The term extraction system based on text information according to claim 8, characterized in that the system further comprises:
a support and confidence computation unit, configured to calculate the support and confidence of the candidate term in a database; wherein the database comprises several words of a preset domain;
a term dictionary creation unit, configured to, when the candidate term belongs to the preset domain, extract the candidate term to constitute a term dictionary of the preset domain.
10. A term extraction equipment based on text information, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the term extraction method based on text information according to any one of claims 1 to 7.
CN201910063975.1A 2019-01-23 2019-01-23 Text information-based term extraction method, system and equipment Active CN109902290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910063975.1A CN109902290B (en) 2019-01-23 2019-01-23 Text information-based term extraction method, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910063975.1A CN109902290B (en) 2019-01-23 2019-01-23 Text information-based term extraction method, system and equipment

Publications (2)

Publication Number Publication Date
CN109902290A true CN109902290A (en) 2019-06-18
CN109902290B CN109902290B (en) 2023-06-30

Family

ID=66944048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910063975.1A Active CN109902290B (en) 2019-01-23 2019-01-23 Text information-based term extraction method, system and equipment

Country Status (1)

Country Link
CN (1) CN109902290B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680128A (en) * 2020-06-16 2020-09-18 杭州安恒信息技术股份有限公司 Method and system for detecting web page sensitive words and related devices
CN112966508A (en) * 2021-04-05 2021-06-15 集智学园(北京)科技有限公司 General automatic term extraction method
CN115066679A (en) * 2020-03-25 2022-09-16 苏州七星天专利运营管理有限责任公司 Method and system for extracting self-made terms in professional field
CN115130472A (en) * 2022-08-31 2022-09-30 北京澜舟科技有限公司 Method, system and readable storage medium for segmenting subwords based on BPE
CN116756298A (en) * 2023-08-18 2023-09-15 太仓市律点信息技术有限公司 Cloud database-oriented AI session information optimization method and big data optimization server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN107329950A (en) * 2017-06-13 2017-11-07 武汉工程大学 It is a kind of based on the Chinese address segmenting method without dictionary
CN108287825A (en) * 2018-01-05 2018-07-17 中译语通科技股份有限公司 A kind of term identification abstracting method and system
CN108549626A (en) * 2018-03-02 2018-09-18 广东技术师范学院 A kind of keyword extracting method for admiring class

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN107329950A (en) * 2017-06-13 2017-11-07 武汉工程大学 It is a kind of based on the Chinese address segmenting method without dictionary
CN108287825A (en) * 2018-01-05 2018-07-17 中译语通科技股份有限公司 A kind of term identification abstracting method and system
CN108549626A (en) * 2018-03-02 2018-09-18 广东技术师范学院 A kind of keyword extracting method for admiring class

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DU Haizhou et al.: "Keyword Extraction Method Based on Context Relationships and the TextRank Algorithm", Journal of Shanghai University of Electric Power *
HE Haitao et al.: "Research on Ontology Concept Extraction Based on Association Rules and Semantic Rules", Journal of Jilin University (Information Science Edition) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115066679A (en) * 2020-03-25 2022-09-16 苏州七星天专利运营管理有限责任公司 Method and system for extracting self-made terms in professional field
CN115066679B (en) * 2020-03-25 2024-02-20 苏州七星天专利运营管理有限责任公司 Method and system for extracting self-made terms in professional field
CN111680128A (en) * 2020-06-16 2020-09-18 杭州安恒信息技术股份有限公司 Method and system for detecting web page sensitive words and related devices
CN112966508A (en) * 2021-04-05 2021-06-15 集智学园(北京)科技有限公司 General automatic term extraction method
CN112966508B (en) * 2021-04-05 2023-08-25 集智学园(北京)科技有限公司 Universal automatic term extraction method
CN115130472A (en) * 2022-08-31 2022-09-30 北京澜舟科技有限公司 Method, system and readable storage medium for segmenting subwords based on BPE
CN116756298A (en) * 2023-08-18 2023-09-15 太仓市律点信息技术有限公司 Cloud database-oriented AI session information optimization method and big data optimization server
CN116756298B (en) * 2023-08-18 2023-10-20 太仓市律点信息技术有限公司 Cloud database-oriented AI session information optimization method and big data optimization server

Also Published As

Publication number Publication date
CN109902290B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN109902290A Term extraction method, system and equipment based on text information
CN110390006B (en) Question-answer corpus generation method, device and computer readable storage medium
CN108304375A (en) A kind of information identifying method and its equipment, storage medium, terminal
KR101548096B1 (en) Method and server for automatically summarizing documents
US11113470B2 (en) Preserving and processing ambiguity in natural language
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN107885717B (en) Keyword extraction method and device
CN112256822A (en) Text search method and device, computer equipment and storage medium
WO2017177809A1 (en) Word segmentation method and system for language text
CN108304377B (en) Extraction method of long-tail words and related device
CN103309852A (en) Method for discovering compound words in specific field based on statistics and rules
CN103971684A (en) Method and system for adding punctuations and method and device for establishing language model for adding punctuations
EP3232336A1 (en) Method and device for recognizing stop word
CN109063184A (en) Multilingual newsletter archive clustering method, storage medium and terminal device
CN109460499A (en) Target search word generation method and device, electronic equipment, storage medium
CN113901214B (en) Method and device for extracting form information, electronic equipment and storage medium
CN111309916A (en) Abstract extraction method and device, storage medium and electronic device
CN112328747A (en) Event context generation method and device, terminal equipment and storage medium
CN111444713B (en) Method and device for extracting entity relationship in news event
CN110874408B (en) Model training method, text recognition device and computing equipment
US20110106849A1 (en) New case generation device, new case generation method, and new case generation program
CN113157857B (en) Hot topic detection method, device and equipment for news
CN111339287B (en) Abstract generation method and device
CN112182283A (en) Song searching method, device, network equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant