CN104102847B - Chinese thesaurus construction system

Info

Publication number: CN104102847B
Application number: CN201410359650.5A
Authority: CN
Inventors: 曾文; 乔晓东; 朱礼军; 张均胜
Original assignee: INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Current assignee: INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date: 2014-07-25
Filing date: 2014-07-25
Publication date: 2017-11-10
Anticipated expiration: 2034-07-25
Also published as: CN104102847A

Abstract

The invention provides a Chinese thesaurus construction system comprising an input device, a system processor, a memory, and an output device. The system processor comprises a data processor, a descriptor identification and extraction device, a descriptor relation identification and extraction device, and a thesaurus generator. The memory is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device, and the thesaurus generator of the system processor. The output device is communicatively connected to the system processor. The system thereby overcomes the shortcomings of the existing manual method, saves manpower and material resources, and improves the efficiency of Chinese thesaurus construction, so that a Chinese thesaurus can be dynamically constructed, updated, and maintained conveniently, quickly, and at low cost. It can guarantee the quality of descriptor construction and can support thesaurus construction or information extraction in all fields. It benefits information organization and use in the library, information, and archives management field, and can serve digital libraries.

Description

Chinese thesaurus construction system

Technical field

The present invention relates to data processing technology, and more particularly to a Chinese thesaurus construction system.

Background art

A thesaurus is a standardized, dynamic vocabulary that displays descriptors and the semantic relations between them. It contains a large number of terms from a specific subject field that are related semantically and hierarchically. Functionally, a thesaurus is a bridge of thought between document indexers and searchers: it is a terminology control tool for converting between natural language (the language used in documents) and system language (the standardized language of the retrieval system), and it is also a medium of communication between people and the system. Today, with science and technology developing rapidly and networked information services increasingly widespread, the traditional method of constructing a thesaurus manually is time-consuming and costly. The greatest shortcoming of manual construction is that it cannot solve the "knowledge acquisition bottleneck" of the compiling experts themselves, and it also hinders timely updating and maintenance of the thesaurus. When a manually constructed thesaurus is applied in a networked, digital environment, its slow updating leaves the vocabulary deficient in timeliness, in the scale of its descriptors, and in quality. This makes it difficult to use and promote among the various types of users in that environment; that is, it cannot meet the needs of professionals and searchers in the library, information, and archives management field. In addition, the digitized literature data of this field grows every year on a massive scale. The added documents continually update and extend existing terminology, and the literature arising from new fields and new technologies produces an endless stream of new terms. Existing thesauri therefore need to be revised and updated, and new thesauri need to be built for emerging technical fields or specialties. Constructing thesauri is currently a consensus of the library and archives management community at home and abroad; see Robert M. Losee, "Decisions in Thesaurus Construction and Use," Information Processing & Management, 2007 (4): 958-968. How to construct a Chinese thesaurus efficiently and quickly is a practical need that the library, information, and archives management field urgently has to address.

Judging from the published literature and from practical applications, no Chinese thesaurus construction system or device has yet been reported. Domestic research on thesaurus generation technology is currently scarce. Examples include: Du Huiping, He Lin, Hou Hanqing, "Automatic construction of a natural language thesaurus based on cluster analysis," Journal of the National Library, 2007, 3: 44-49; Xu Ruifang, Li Xiaowen, Hou Hanqing, "A comparative study of automated rules for inter-term relations in thesauri," Information Science, 2009 (1): 89-93; and Yuan Xu, Chang Chun, "Research on methods of acquiring related-term relations for thesaurus construction," Information Science, 2013, 31 (1): 68-72. These documents are limited to local research on one particular stage of thesaurus generation and do not amount to system development in the full sense. Another document (Liu Hua, Shen Yulan, Zeng Jianxun, "A comparative study of the national standards for thesaurus compilation of China, the United States, and Britain," Library Information Service, 2009, 53 (22): 72-75) is a follow-up report on the state of thesaurus research and compilation abroad. Two further documents (Liu Wei, Zhou Jie, "Research on the concurrency mechanism in a thesaurus compilation system in a network environment," Library Information Service, 2011, 55 (22): 11-14; Zhao Jianhua, Zhao Jianguo, et al., "Development of a microcomputer compilation and management system for Chinese thesauri," Information Journal, 1995: 184-193) describe what are in essence techniques for computer-aided manual entry, compilation, and maintenance of a thesaurus. They use computer database technology to assist in editing and processing the vocabulary and provide structural construction and basic editing functions, but they do not address the construction of the thesaurus content itself. Research abroad on thesaurus construction technology is relatively mature, with related work under way since the 1970s. However, because of the inherent differences in expression between languages, copying foreign thesaurus construction techniques and methods wholesale is inadvisable. Research and development on the construction of Chinese thesauri is therefore work of practical significance.

Summary of the invention

In view of the deficiencies described in the background art, an object of the present invention is to provide a Chinese thesaurus construction system that overcomes the shortcomings of the existing manual method, saves manpower and material resources, improves the efficiency of Chinese thesaurus construction, and allows a Chinese thesaurus to be dynamically constructed, updated, and maintained conveniently, quickly, and at low cost.

Another object of the present invention is to provide a Chinese thesaurus construction system that, compared with manual construction, better guarantees the quality of the constructed thesaurus and can support the construction of, or information extraction for, any Chinese thesaurus based on digital documents.

A further object of the present invention is to benefit information organization and use in the library, information, and archives management field, and to serve digital libraries.

To achieve these objects, the invention provides a Chinese thesaurus construction system comprising an input device, a system processor, a memory, and an output device.

The input device takes in the raw data files needed to construct the Chinese thesaurus and outputs them.

The system processor comprises a data processor, a descriptor identification and extraction device, a descriptor relation identification and extraction device, and a thesaurus generator.

The data processor is communicatively connected to the input device. It receives the raw data files output by the input device, provides the storage address of each raw data file, and judges whether each received raw data file is standardized. If a received raw data file is a non-standardized file that the data processor cannot process as it stands, the data processor converts it to generate a normalized text data file, performs word segmentation and part-of-speech tagging on the normalized text data file, and outputs normalized text data. If a received raw data file is a standardized file that the data processor can process, the data processor performs word segmentation and part-of-speech tagging on it directly and outputs normalized text data.

The descriptor identification and extraction device is communicatively connected to the data processor. It receives the segmented and part-of-speech-tagged normalized text data output by the data processor, performs word combination and the identification and extraction of descriptors according to the standard GB/T 13190-91 rules for compiling Chinese thesauri, and generates and outputs the extracted descriptors, which form the selected descriptor set.

The descriptor relation identification and extraction device is communicatively connected to the data processor and to the descriptor identification and extraction device. It receives the normalized text data output by the data processor and the selected descriptor set output by the descriptor identification and extraction device. According to the standard GB/T 13190-91 rules for compiling Chinese thesauri, it identifies and extracts the associative relations and genus-species relations of each descriptor in the selected descriptor set, and outputs those relations.

The thesaurus generator is communicatively connected to the descriptor identification and extraction device and to the descriptor relation identification and extraction device. It receives the selected descriptor set output by the former and the associative and genus-species relations of each descriptor output by the latter. According to the standard GB/T 13190-91 rules for compiling Chinese thesauri, it combines and sorts the descriptors and the relations between them to generate and output the thesaurus.

The memory is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device, and the thesaurus generator of the system processor, and stores the results that each of them outputs.

The output device is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device, and the thesaurus generator of the system processor. It receives and outputs the normalized text data output by the data processor, the selected descriptor set output by the descriptor identification and extraction device, the associative and genus-species relations output by the descriptor relation identification and extraction device, and the thesaurus output by the thesaurus generator.

The beneficial effects of the present invention are as follows:

The Chinese thesaurus construction system provided by the invention overcomes the shortcomings of the existing manual method, saves manpower and material resources, and improves the efficiency of Chinese thesaurus construction, so that a Chinese thesaurus can be dynamically constructed, updated, and maintained conveniently, quickly, and at low cost.

The Chinese thesaurus construction system provided by the invention can guarantee the quality of the constructed Chinese thesaurus and can support thesaurus construction or information extraction in all fields.

The Chinese thesaurus construction system provided by the invention benefits information organization and use in the library, information, and archives management field, and can serve digital libraries.

Brief description of the drawings

Fig. 1 is a block diagram of the Chinese thesaurus construction system according to the present invention.

The reference numerals are as follows:

1 input device

2 system processor

21 data processor

211 system initialization processor

212 data initialization processor

213 data processor memory

22 descriptor identification and extraction device

221 candidate descriptor judgment and generation processor

222 descriptor judgment and generation processor

223 descriptor result memory

23 descriptor relation identification and extraction device

231 descriptor associative relation identification and extraction processor

232 descriptor genus-species relation identification and extraction processor

233 descriptor relation result memory

24 thesaurus generator

241 thesaurus generation processor

242 thesaurus result memory

3 memory

4 output device

5 verification and modification device

Detailed description of the embodiments

The Chinese thesaurus construction system according to the present invention is described in detail below with reference to the accompanying drawing.

Referring to Fig. 1, the Chinese thesaurus construction system according to the present invention comprises an input device 1, a system processor 2, a memory 3, and an output device 4.

The input device 1 takes in the raw data files needed to construct the Chinese thesaurus and outputs them.

The system processor 2 comprises a data processor 21, a descriptor identification and extraction device 22, a descriptor relation identification and extraction device 23, and a thesaurus generator 24.

The data processor 21 is communicatively connected to the input device 1. It receives the raw data files output by the input device 1 (there is at least one raw data file), provides the storage address of each raw data file, and judges whether each received raw data file is standardized. If a received raw data file is a non-standardized file that the data processor 21 cannot process as it stands, the data processor 21 converts it to generate a normalized text data file, performs word segmentation and part-of-speech tagging on the normalized text data file, and outputs normalized text data. If a received raw data file is a standardized file that the data processor 21 can process, the data processor 21 performs word segmentation and part-of-speech tagging on it directly and outputs normalized text data.

The descriptor identification and extraction device 22 is communicatively connected to the data processor 21. It receives the segmented and part-of-speech-tagged normalized text data output by the data processor 21, performs word combination and the identification and extraction of descriptors according to the standard GB/T 13190-91 rules for compiling Chinese thesauri, and generates and outputs the extracted descriptors, which form the selected descriptor set.

The descriptor relation identification and extraction device 23 is communicatively connected to the data processor 21 and to the descriptor identification and extraction device 22. It receives the normalized text data output by the data processor 21 and the selected descriptor set output by the descriptor identification and extraction device 22. According to the standard GB/T 13190-91 rules for compiling Chinese thesauri, it identifies and extracts the associative relations and genus-species relations of each descriptor in the selected descriptor set, and outputs those relations.

The thesaurus generator 24 is communicatively connected to the descriptor identification and extraction device 22 and to the descriptor relation identification and extraction device 23. It receives the selected descriptor set output by the descriptor identification and extraction device 22 and the associative and genus-species relations of each descriptor output by the descriptor relation identification and extraction device 23. According to the standard GB/T 13190-91 rules for compiling Chinese thesauri, it combines and sorts the descriptors and the relations between them to generate and output the thesaurus.

The memory 3 is communicatively connected to the data processor 21, the descriptor identification and extraction device 22, the descriptor relation identification and extraction device 23, and the thesaurus generator 24 of the system processor 2, and stores the results that each of them outputs.

The output device 4 is communicatively connected to the data processor 21, the descriptor identification and extraction device 22, the descriptor relation identification and extraction device 23, and the thesaurus generator 24 of the system processor 2. It receives and outputs the normalized text data output by the data processor 21, the selected descriptor set output by the descriptor identification and extraction device 22, the associative and genus-species relations output by the descriptor relation identification and extraction device 23, and the thesaurus output by the thesaurus generator 24.
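The data flow among components 21 to 24 and the memory 3 can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class name `ThesaurusPipeline`, its field names, and the dictionary used as a stand-in for the memory 3 are hypothetical and do not come from the patent, and each stage is supplied by the caller as a plain function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ThesaurusPipeline:
    """Hypothetical wiring of the four processing stages (21, 22, 23, 24)."""
    process_data: Callable[[List[Any]], Any]        # stands in for data processor 21
    extract_descriptors: Callable[[Any], Any]       # stands in for device 22
    extract_relations: Callable[[Any, Any], Any]    # stands in for device 23
    generate_thesaurus: Callable[[Any, Any], Any]   # stands in for generator 24

    def run(self, raw_files: List[Any], store: Dict[str, Any]) -> Any:
        # Every intermediate result is written to `store`, mirroring how
        # the memory 3 keeps the output of each component.
        text = self.process_data(raw_files)
        store["normalized_text"] = text
        descriptors = self.extract_descriptors(text)
        store["descriptors"] = descriptors
        # The relation stage receives both the text and the descriptor set.
        relations = self.extract_relations(text, descriptors)
        store["relations"] = relations
        thesaurus = self.generate_thesaurus(descriptors, relations)
        store["thesaurus"] = thesaurus
        return thesaurus
```

Keeping each stage as an injected function reflects the description, in which the relation stage consumes both the normalized text and the selected descriptor set, while the generator consumes only descriptors and relations.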

In the Chinese thesaurus construction system according to the present invention, the raw data files include text data files, XML files, and PDF files, and the normalized text data files include text files and XML files.

In one embodiment of the data processor 21, referring to Fig. 1, the data processor 21 comprises a system initialization processor 211, a data initialization processor 212, and a data processor memory 213.

The system initialization processor 211 is communicatively connected to the input device 1. It receives the raw data files output by the input device 1, provides the storage address of each raw data file, and judges whether each received raw data file is standardized. If a received raw data file is a non-standardized file that the data processor 21 cannot process as it stands, the system initialization processor 211 converts it to generate a normalized text data file and outputs that file. If a received raw data file is a standardized file that the data processor 21 can process, the system initialization processor 211 outputs it directly as a normalized text data file.

The data initialization processor 212 is communicatively connected to the system initialization processor 211. It receives the normalized text data file output by the system initialization processor 211, performs word segmentation and part-of-speech tagging on it, and outputs the segmented and tagged normalized text data.

The data processor memory 213 is communicatively connected to the data initialization processor 212, and receives and stores the segmented and part-of-speech-tagged normalized text data file output by the data initialization processor 212.
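The standardization check performed before segmentation might look like the following Python sketch. It assumes, purely for illustration, that a file counts as standardized when its extension is `.txt` or `.xml`; the patent does not say how the judgment is made. The `convert` function (for example a PDF-to-text converter) and the function names are hypothetical and are supplied by the caller.

```python
from pathlib import Path
from typing import Callable

# Assumption: text and XML files are treated as already normalized,
# everything else (for example PDF) must be converted first.
NORMALIZED_SUFFIXES = {".txt", ".xml"}


def is_normalized(path: Path) -> bool:
    """Return True if the file can be segmented without conversion."""
    return path.suffix.lower() in NORMALIZED_SUFFIXES


def to_normalized_text(
    path: Path,
    convert: Callable[[Path], str],
    read: Callable[[Path], str] = lambda p: p.read_text(encoding="utf-8"),
) -> str:
    """Mimic the role of the system initialization processor 211.

    Normalized files are read as they are; other files go through the
    caller-supplied `convert` function (hypothetical, not from the patent).
    """
    if is_normalized(path):
        return read(path)
    return convert(path)
```

The returned text would then be handed to a word segmenter and part-of-speech tagger, which is outside the scope of this sketch.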

In one embodiment of the descriptor identification and extraction device 22, referring to Fig. 1, the descriptor identification and extraction device 22 comprises a candidate descriptor judgment and generation processor 221, a descriptor judgment and generation processor 222, and a descriptor result memory 223.

The candidate descriptor judgment and generation processor 221 is communicatively connected to the data initialization processor 212 of the data processor 21. It receives the segmented and part-of-speech-tagged normalized text data file output by the data initialization processor 212, identifies and extracts candidate descriptors from that file on the basis of linguistic rules and mutual information statistics, and generates and outputs the candidate descriptor set.

The descriptor judgment and generation processor 222 is communicatively connected to the candidate descriptor judgment and generation processor 221. It receives the candidate descriptor set output by the candidate descriptor judgment and generation processor 221 and, on the basis of position weighting and word frequency statistics, judges and extracts descriptors from the candidate descriptors in that set to generate and output the selected descriptor set.

The descriptor result memory 223 is communicatively connected to the descriptor judgment and generation processor 222, and receives and stores the selected descriptor set output by the descriptor judgment and generation processor 222.

The normalized text data file output by the data initialization processor 212 contains term data that has undergone word segmentation and part-of-speech tagging, and it consists of word strings. The candidate descriptor judgment and generation processor 221 obtains the candidate descriptor set by applying the following linguistic rules:

A candidate descriptor contains at least one verb, noun, or nominal component;

The last word of a candidate descriptor is a verb, noun, or nominal component;

The first word of a candidate descriptor is not a preposition or a measure word;

A candidate descriptor contains no conjunction, pronoun, or modal particle;

Word strings of length 2 to 8 are extracted to form candidate descriptors;

Candidate descriptor phrase strings whose segmented words follow the part-of-speech patterns noun + noun, adjective + noun, verb + noun, or noun + verb are extracted; the maximum length of a phrase string is 8.
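The rules above can be sketched as a filter over part-of-speech-tagged tokens. The Python below is a minimal illustration, not the patented implementation. It assumes a tag set in which `n` marks nouns, `v` verbs, `a` adjectives, `p` prepositions, `q` measure words, `c` conjunctions, `r` pronouns, and `y` modal particles, with `vn` and `an` counted as nominal components; it also assumes that length is measured in characters. The tag set, the length unit, and all function names are assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

BAD_FIRST = {"p", "q"}        # preposition, measure word
FORBIDDEN = {"c", "r", "y"}   # conjunction, pronoun, modal particle
PATTERNS = {("n", "n"), ("a", "n"), ("v", "n"), ("n", "v")}
MIN_LEN, MAX_LEN = 2, 8       # assumed to be counted in characters


def is_head(tag: str) -> bool:
    """Verb, noun, or nominal component (the tag choice is an assumption)."""
    return tag == "v" or tag.startswith("n") or tag in {"vn", "an"}


def is_candidate(tokens: List[Token]) -> bool:
    """Apply the first five rules to one span of tagged words."""
    if not tokens:
        return False
    tags = [tag for _, tag in tokens]
    length = sum(len(word) for word, _ in tokens)
    return (
        MIN_LEN <= length <= MAX_LEN
        and any(is_head(t) for t in tags)          # contains verb/noun/nominal
        and is_head(tags[-1])                      # last word is verb/noun/nominal
        and tags[0] not in BAD_FIRST               # no leading preposition/measure word
        and not any(t in FORBIDDEN for t in tags)  # no conjunction/pronoun/particle
    )


def is_phrase_candidate(tokens: List[Token]) -> bool:
    """Apply the sixth rule: two-word part-of-speech patterns, length at most 8."""
    if len(tokens) != 2:
        return False
    tags = (tokens[0][1], tokens[1][1])
    return tags in PATTERNS and sum(len(w) for w, _ in tokens) <= MAX_LEN


def candidates(sentence: List[Token], max_words: int = 4) -> List[str]:
    """Enumerate contiguous spans of up to `max_words` words and keep those that pass."""
    found = []
    for start in range(len(sentence)):
        for end in range(start + 1, min(start + max_words, len(sentence)) + 1):
            span = sentence[start:end]
            if is_candidate(span):
                found.append("".join(word for word, _ in span))
    return found
```

For a tagged fragment such as `[("在", "p"), ("机械", "n"), ("工程", "n")]`, spans that start with the preposition are rejected, while the noun spans survive as candidates.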

To improve the quality of the generated candidate descriptors, a candidate descriptor is regarded as being composed of several candidate words, that is, as a candidate descriptor phrase. The candidate descriptor judgment and generation processor 221 then obtains the selected candidate descriptor set using a mutual information statistic. The mutual information is calculated as:

Formula 1

Here, a candidate descriptor T is a combination of the words (t_i, t_j). The word t consists of t_1 t_2 ... t_n, and the word strings are written t_i = t_1 t_2 ... t_{n-r} and t_j = t_r t_{r+1} ... t_n. probability(t_i) is the probability that t_i occurs on its own in all the normalized text data files of the data initialization processor 212; probability(t_j) is the probability that t_j occurs on its own in all the normalized text data files of the data initialization processor 212; and probability(t_i, t_j) is the probability that t_i and t_j occur together in the same normalized text data file of the data initialization processor 212. If t_i and t_j are closely bound, probability(t_i, t_j) differs little from probability(t_i) or probability(t_j) (the specific difference can be set by the user), and the mutual information calculated by the formula for the word strings t_i and t_j in the candidate descriptor T is large. Conversely, if probability(t_i) and probability(t_j) are much greater than probability(t_i, t_j), the calculated mutual information of t_i and t_j is small. In other words, the larger the value of Mutual-information(t_i, t_j), the more likely it is that t_i and t_j combine into a candidate descriptor.
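Formula 1 itself is not reproduced in this text, so the following Python sketch substitutes the standard pointwise mutual information, log(P(t_i, t_j) / (P(t_i) P(t_j))), which behaves as the paragraph above describes but is an assumption rather than the patent's confirmed formula. Probabilities are estimated as document frequencies, and the function name is hypothetical.

```python
import math
from typing import Iterable


def mutual_information(t_i: str, t_j: str, documents: Iterable[str]) -> float:
    """Pointwise mutual information of two word strings (assumed form).

    probability(t) is estimated as the share of documents containing t,
    and probability(t_i, t_j) as the share containing both.
    """
    docs = list(documents)
    n = len(docs)
    n_i = sum(t_i in d for d in docs)
    n_j = sum(t_j in d for d in docs)
    n_ij = sum(t_i in d and t_j in d for d in docs)
    if n_ij == 0:
        # The two strings never co-occur, so there is no evidence of binding.
        return float("-inf")
    p_i, p_j, p_ij = n_i / n, n_j / n, n_ij / n
    return math.log(p_ij / (p_i * p_j))
```

A pair that almost always appears together scores high because the joint probability stays close to the individual probabilities; a pair whose parts mostly appear apart scores low.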

The descriptor judgment and generation processor 222 uses a method that combines position weighting with word frequency statistics as the basis for judging and extracting descriptors.

The weight function is constructed as follows:

For keywords, the weight function is:

Weight_i = a × TFIDF_i    (Formula 2)

For descriptors that are not keywords (with reference to Formula 1):

Weight_i = Mutual-information(T) × a × TFIDF_i    (Formula 3)

In Formula 2 and Formula 3,

Formula 4

where ft_i is the frequency with which the word t_i occurs in all the normalized text data files d from the data initialization processor 212; N is the number of all normalized text data files from the data initialization processor 212; n_i is the number of normalized text data files that contain the word t_i; and a is the weight assigned to the position of the word t_i in the normalized text data file (that is, which of the four positions, title, abstract, keywords, or body text, the word occupies).
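Formulas 2 and 3 can be sketched in Python as follows. Because Formula 4 is not reproduced in this text, the sketch assumes the common form TFIDF_i = ft_i × log(N / n_i); the position weights in `POSITION_WEIGHT` are invented placeholder values, since the patent does not state the values of a. Function names are hypothetical.

```python
import math
from typing import Optional

# Hypothetical values for the position weight a; not taken from the patent.
POSITION_WEIGHT = {"title": 4.0, "keyword": 3.0, "abstract": 2.0, "body": 1.0}


def tfidf(term_freq: int, n_docs: int, docs_with_term: int) -> float:
    """Assumed form of Formula 4: ft_i * log(N / n_i)."""
    return term_freq * math.log(n_docs / docs_with_term)


def weight(
    term_freq: int,
    n_docs: int,
    docs_with_term: int,
    position: str,
    mutual_information: Optional[float] = None,
) -> float:
    """Formula 2 when `mutual_information` is None, Formula 3 otherwise."""
    base = POSITION_WEIGHT[position] * tfidf(term_freq, n_docs, docs_with_term)
    if mutual_information is None:
        return base                      # keyword: a * TFIDF
    return mutual_information * base     # non-keyword: MI(T) * a * TFIDF
```

Candidates could then be ranked by this weight and the highest-scoring ones kept as the selected descriptor set; the cut-off is left to the user in this sketch.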

In one embodiment of the descriptor relation identification and extraction device 23, referring to Fig. 1, the descriptor relation identification and extraction device 23 comprises a descriptor associative relation identification and extraction processor 231, a descriptor genus-species relation identification and extraction processor 232, and a descriptor relation result memory 233.

The descriptor associative relation identification and extraction processor 231 is communicatively connected to the data initialization processor 212 of the data processor 21 and to the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22. It receives the segmented and part-of-speech-tagged normalized text data file output by the data initialization processor 212 and the selected descriptor set output by the descriptor judgment and generation processor 222. Based on the co-occurrence probability statistics of the descriptors of the selected descriptor set in the segmented and tagged normalized text data files, it identifies and extracts the associative relations of each descriptor and outputs the extracted associative relations.

The descriptor genus-species relation identification and extraction processor 232 is communicatively connected to the data initialization processor 212 of the data processor 21 and to the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22. It receives the segmented and part-of-speech-tagged normalized text data file output by the data initialization processor 212 and the selected descriptor set output by the descriptor judgment and generation processor 222. Based on inclusion relations in the compositional form of the descriptors in the selected descriptor set, and on a calculated measure of similarity between descriptors, it identifies and extracts the genus-species relations of the descriptors and outputs the extracted genus-species relations.

The descriptor relation result memory 233 is communicatively connected to the descriptor associative relation identification and extraction processor 231 and to the descriptor genus-species relation identification and extraction processor 232, and receives and stores the associative relations output by the former and the genus-species relations output by the latter.

The descriptor associative relation identification and extraction processor 231 receives the segmented and part-of-speech-tagged normalized text data file output by the data initialization processor 212 of the data processor 21 and the selected descriptor set output by the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22. Associative relations are identified and extracted using a similarity calculation method based on the joint probability distribution. The calculation formula is as follows:

Formula 5

where probability(A, B) is the frequency with which descriptor A and descriptor B of the selected descriptor set occur together within the same window across all the normalized text data files generated by the data initialization processor 212 (a window being one of the positions within each normalized text data file: title, abstract, keywords, or body text); probability(A, ¬B) is the frequency, across all the normalized text data files, with which descriptor A of the selected descriptor set occurs within a window and descriptor B does not; and probability(¬A, B) is the frequency, across all the normalized text data files, with which descriptor B of the selected descriptor set occurs within a window and descriptor A does not.
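Formula 5 is not reproduced in this text. As an illustration only, the Python sketch below combines the three window counts defined above into a Jaccard-style ratio, both / (both + only A + only B). That combination is an assumption, not the patent's confirmed formula. Each document is modelled as a dictionary mapping a window name to its text, and the function name is hypothetical.

```python
from typing import Dict, Iterable


def association(a: str, b: str, documents: Iterable[Dict[str, str]]) -> float:
    """Jaccard-style co-occurrence score over windows (assumed form).

    A window is one position of a document: title, abstract, keywords, or body.
    """
    both = only_a = only_b = 0
    for doc in documents:
        for text in doc.values():
            has_a, has_b = a in text, b in text
            if has_a and has_b:
                both += 1        # A and B occur in the same window
            elif has_a:
                only_a += 1      # A occurs, B does not
            elif has_b:
                only_b += 1      # B occurs, A does not
    total = both + only_a + only_b
    return both / total if total else 0.0
```

A score near 1 means the two descriptors nearly always appear in the same windows; a threshold above which a pair is recorded as associatively related would have to be chosen by the user.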

The descriptor genus-species relation identification and extraction processor 232 receives the segmented and part-of-speech-tagged normalized text data file output by the data initialization processor 212 of the data processor 21 and the selected descriptor set output by the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22. The genus-species relations of the descriptors are obtained using a formula that measures the similarity between descriptors. The calculation formula is as follows:

Formula 6

where sim_number is the total number of identical words contained in both of two descriptors of the selected descriptor set (the included descriptor and the descriptor to be included); sub_number is the total number of words contained in the included descriptor; number is the total number of words contained in the descriptor to be included; one summation term is the sum of the position weights of the identical words contained in both descriptors, taken by their position in the included descriptor (that is, whether the word lies at the prefix, in the middle, or at the suffix of the included descriptor); Qsim_number(i) is the weight of the position of the i-th identical word contained in both descriptors; sim_number(i) is the position number of the i-th word within the set of identical words contained in both descriptors (that is, the value of i); q_1 is the weight coefficient of a word at the prefix, middle, and suffix positions of the included descriptor; sub_number(i) is the position number, within the included descriptor, of the i-th identical word contained in both descriptors; a second summation term is the sum of the position weights of the identical words contained in both descriptors, taken by their position in the descriptor to be included (that is, whether the word lies at the prefix, in the middle, or at the suffix of the descriptor to be included); q_2 is the weight coefficient of a word at the prefix, middle, and suffix positions of the descriptor to be included; number(i) is the position number, within the descriptor to be included, of the i-th identical word contained in both descriptors; Dp is the position coefficient, whose value is the ratio of the total number of words of the included descriptor to the total number of words of the descriptor to be included; and i = 1, 2, ..., sim_number.
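Formula 6 is not reproduced in this text, so no attempt is made here to compute the similarity measure itself. The Python sketch below only gathers the raw quantities named in the definitions above (sim_number, sub_number, number, Dp, and the prefix, middle, or suffix position of each shared word). Descriptors are passed as non-empty lists of units, which may be words or characters; that choice, the function names, and the convention that a one-unit descriptor counts as a prefix are assumptions.

```python
from typing import Dict, List


def position_label(index: int, length: int) -> str:
    """Classify a unit as prefix, middle, or suffix of its descriptor.

    A descriptor with a single unit is labelled "prefix" (an arbitrary choice).
    """
    if index == 0:
        return "prefix"
    if index == length - 1:
        return "suffix"
    return "middle"


def inclusion_features(included: List[str], to_include: List[str]) -> Dict[str, object]:
    """Collect the counts that the similarity measure is defined over.

    Both descriptors are assumed to be non-empty lists of units.
    """
    shared = [w for w in included if w in to_include]
    return {
        "sim_number": len(shared),          # identical units in both descriptors
        "sub_number": len(included),        # units in the included descriptor
        "number": len(to_include),          # units in the descriptor to be included
        "Dp": len(included) / len(to_include),
        "positions": {
            w: (
                position_label(included.index(w), len(included)),
                position_label(to_include.index(w), len(to_include)),
            )
            for w in shared
        },
    }
```

With these features in hand, position weights such as q_1 and q_2 could be applied once the exact form of the similarity measure is known.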

In one embodiment of the thesaurus generator 24, referring to Fig. 1, the thesaurus generator 24 comprises a thesaurus generation processor 241 and a thesaurus result memory 242.

The thesaurus generation processor 241 is communicatively connected to the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22, and to the descriptor associative relation identification and extraction processor 231 and the descriptor genus-species relation identification and extraction processor 232 of the descriptor relation identification and extraction device 23. It receives the selected descriptor set output by the descriptor judgment and generation processor 222, and the associative relations and genus-species relations of each descriptor output by the processors 231 and 232 respectively. According to the standard GB/T 13190-91 rules for compiling Chinese thesauri, it combines and sorts the descriptors and the relations between them to generate and output the thesaurus.

The thesaurus result memory 242 is communicatively connected to the thesaurus generation processor 241, and receives and stores the thesaurus output by the thesaurus generation processor 241.
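The combining and sorting step can be illustrated with a short Python sketch. It does not implement the GB/T 13190-91 rules, whose details are not given in this text. Instead it assumes plain lexicographic ordering and the familiar BT (broader term), NT (narrower term), and RT (related term) labels, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_thesaurus(
    descriptors: Iterable[str],
    associative: Iterable[Tuple[str, str]],
    genus_species: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return display lines: each descriptor followed by its relations.

    `genus_species` holds (broader, narrower) pairs; `associative` pairs
    are treated as symmetric.
    """
    rt: Dict[str, Set[str]] = defaultdict(set)
    bt: Dict[str, Set[str]] = defaultdict(set)
    nt: Dict[str, Set[str]] = defaultdict(set)
    for a, b in associative:
        rt[a].add(b)
        rt[b].add(a)
    for broader, narrower in genus_species:
        nt[broader].add(narrower)
        bt[narrower].add(broader)

    lines: List[str] = []
    for term in sorted(set(descriptors)):   # assumed ordering: lexicographic
        lines.append(term)
        for label, table in (("BT", bt), ("NT", nt), ("RT", rt)):
            for other in sorted(table[term]):
                lines.append(f"  {label} {other}")
    return lines
```

For example, with the descriptors `engine`, `diesel engine`, and `fuel`, one genus-species pair `("engine", "diesel engine")`, and one associative pair `("engine", "fuel")`, the entry for `engine` lists `diesel engine` as a narrower term and `fuel` as a related term.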

In an embodiment of the Chinese thesaurus constructing system according to the present invention, referring to Fig. 1, the Chinese thesaurus constructing system may further include: a verification and modification device 5, communicatively coupled to the data processor memory 213, the descriptor result memory 223, the descriptor relation result memory 233, and the thesaurus result memory 242, for manually verifying, modifying, and deleting the normalized text data files stored in the data processor memory 213, the selected descriptor set stored in the descriptor result memory 223, the descriptor dependency relations and genus-species relations stored in the descriptor relation result memory 233, and the thesaurus stored in the thesaurus result memory 242. With the verification and modification device 5, the user can at any time open the stored content that needs to be checked, modified, or deleted and perform the corresponding operation.

In an embodiment of the memory 3, the memory 3 may be selected from a hard disk, a USB flash drive, a portable hard disk, and a memory card.

In an embodiment of the Chinese thesaurus constructing system according to the present invention, the Chinese thesaurus constructing system may further include: a visual operation interface (not shown), communicatively coupled to the input device 1, the system processor 2, the memory 3, the output device 4, and the verification and modification device 5. Through the visual operation interface, the user can conveniently carry out the whole automatic thesaurus construction process.

The identification and extraction of descriptors and descriptor relations according to the present invention are further verified below, using all of the technical features given with reference to Fig. 1.

Using the system of the present invention, descriptors and descriptor relations were identified and extracted from 2426 Chinese patent documents in the field of mechanical engineering and 2783 Chinese journal articles in the field of natural language processing technology. Following the steps described above, the test results were obtained as follows:

Table 1. Example test results for candidate descriptors and selected descriptors identified and extracted by the present invention

Table 2. Example test results for descriptors identified and extracted by the present invention

Table 3. Example test results for descriptor dependency relations identified and extracted by the present invention

Table 4. Example test results for descriptor genus-species relations identified and extracted by the present invention

Claims

1. A Chinese thesaurus constructing system, characterized by comprising:

an input device (1), which inputs the raw data files needed to build the Chinese thesaurus and outputs the raw data files, the raw data files including text data files, XML files, and PDF files;

a system processor (2), comprising:

a data processor (21), communicatively coupled to the input device (1), which receives the raw data files output by the input device (1), provides the storage address of the raw data files, and judges whether each received raw data file is normalized; if a received raw data file is a non-normalized raw data file that the data processor (21) cannot process, the raw data file is converted into a normalized text data file, the normalized text data file is subjected to word segmentation and part-of-speech tagging, and the normalized text data is output; if a received raw data file is a normalized raw data file that the data processor (21) can process, the raw data file is directly subjected to word segmentation and part-of-speech tagging and the normalized text data is output; the normalized text data files include text files and XML files;

a descriptor identification and extraction device (22), communicatively coupled to the data processor (21), which receives the segmented and part-of-speech-tagged normalized text data output by the data processor (21), performs word grouping and descriptor identification and extraction based on the Chinese thesaurus compilation rules of national standard GB13190-91, and generates and outputs the extracted descriptors, the extracted descriptors forming the selected descriptor set;

a descriptor relation identification and extraction device (23), communicatively coupled to the data processor (21) and the descriptor identification and extraction device (22), which receives the normalized text data output by the data processor (21) and the selected descriptor set output by the descriptor identification and extraction device (22), identifies and extracts, based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, the descriptor dependency relation and the genus-species relation of each descriptor in the selected descriptor set, and outputs the descriptor dependency relation and the genus-species relation of each descriptor; and

a descriptor table generator (24), communicatively coupled to the descriptor identification and extraction device (22) and the descriptor relation identification and extraction device (23), which receives the selected descriptor set output by the descriptor identification and extraction device (22), receives the descriptor dependency relation and the genus-species relation of each descriptor output by the descriptor relation identification and extraction device (23), and combines and sorts the descriptors and the relations between descriptors based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, so as to generate and output the thesaurus;

a memory (3), communicatively coupled to the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the descriptor table generator (24) of the system processor (2), which stores the results output respectively by the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the descriptor table generator (24); and

an output device (4), communicatively coupled to the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the descriptor table generator (24) of the system processor (2), which receives and outputs the normalized text data output by the data processor (21), the selected descriptor set output by the descriptor identification and extraction device (22), the descriptor dependency relations and genus-species relations output by the descriptor relation identification and extraction device (23), and the thesaurus output by the descriptor table generator (24);

the data processor (21) comprises:

a system initialization processor (211), communicatively coupled to the input device (1), which receives the raw data files output by the input device (1), provides the storage address of the raw data files, and judges whether each received raw data file is normalized; if a received raw data file is a non-normalized raw data file that the data processor (21) cannot process, the raw data file is converted into a normalized text data file and the normalized text data file is output; if a received raw data file is a normalized raw data file that the data processor (21) can process, the raw data file is output directly as a normalized text data file;

a data initialization processor (212), communicatively coupled to the system initialization processor (211), which receives the normalized text data files output by the system initialization processor (211), performs word segmentation and part-of-speech tagging on the normalized text data files, and outputs the segmented and part-of-speech-tagged normalized text data; and

a data processor memory (213), communicatively coupled to the data initialization processor (212), which receives and stores the segmented and part-of-speech-tagged normalized text data files output by the data initialization processor (212);

the descriptor identification and extraction device (22) comprises:

a candidate descriptor judgment and generation processor (221), communicatively coupled to the data initialization processor (212) of the data processor (21), which receives the segmented and part-of-speech-tagged normalized text data files output by the data initialization processor (212), identifies and extracts candidate descriptors from the received files based on language rules and mutual information statistics, and generates and outputs a candidate descriptor set;

a descriptor judgment and generation processor (222), communicatively coupled to the candidate descriptor judgment and generation processor (221), which receives the candidate descriptor set output by the candidate descriptor judgment and generation processor (221), judges and extracts descriptors from the candidate descriptors in the received set based on position weighting and word frequency statistics, and generates and outputs the selected descriptor set; and

a descriptor result memory (223), communicatively coupled to the descriptor judgment and generation processor (222), which receives and stores the selected descriptor set output by the descriptor judgment and generation processor (222);

wherein the content of the normalized text data files output by the data initialization processor (212) is term data that has undergone word segmentation and part-of-speech tagging, the term data consisting of word strings, and the candidate descriptor judgment and generation processor (221) obtains the candidate descriptor set using the following language rules and mutual information statistics;

the language rules are:

a candidate descriptor contains at least one verb, noun, or nominal constituent;

the last word of a candidate descriptor is a verb, noun, or nominal constituent;

the first word of a candidate descriptor is not a preposition or a measure word;

a candidate descriptor contains no conjunction, pronoun, or modal particle;

word-string elements with a length between 2 and 8 are extracted to form candidate descriptors;

candidate descriptor phrase strings whose part-of-speech composition after word segmentation is noun+noun, adjective+noun, verb+noun, or noun+verb are extracted, the maximum length of a phrase string being 8;
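The language rules above can be sketched as a filter over segmented, part-of-speech-tagged tokens. This is a minimal illustration under stated assumptions, not the claimed implementation: the single-letter tag codes (n, v, a, p, q, c, r, y) follow a common Chinese tagging convention and are assumed here, length is counted in characters of the joined string, the part-of-speech pattern rule is applied only to two-word candidates, and the function name `passes_language_rules` is hypothetical.

```python
ALLOWED_PAIRS = {("n", "n"), ("a", "n"), ("v", "n"), ("n", "v")}
BANNED_ANYWHERE = {"c", "r", "y"}   # conjunction, pronoun, modal particle
BANNED_FIRST = {"p", "q"}           # preposition, measure word
CONTENT = {"n", "v"}                # noun or nominal constituent, verb


def passes_language_rules(tokens, min_len=2, max_len=8):
    """tokens: list of (word, tag) pairs forming one candidate string."""
    if not tokens:
        return False
    tags = [tag[:1] for _, tag in tokens]      # coarse tag, e.g. 'vn' -> 'v'
    text = "".join(word for word, _ in tokens)

    if not (min_len <= len(text) <= max_len):  # length between 2 and 8
        return False
    if not any(t in CONTENT for t in tags):    # at least one verb or noun
        return False
    if tags[-1] not in CONTENT:                # last word is a verb or noun
        return False
    if tags[0] in BANNED_FIRST:                # first word not preposition/measure word
        return False
    if any(t in BANNED_ANYWHERE for t in tags):
        return False
    if len(tags) == 2 and (tags[0], tags[1]) not in ALLOWED_PAIRS:
        return False
    return True
```

A candidate such as `[("信息", "n"), ("检索", "v")]` passes, while one that starts with a preposition or contains a conjunction is rejected.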

a candidate descriptor is regarded as being made up of multiple candidate descriptors, that is, as a candidate descriptor phrase, and the mutual information statistic used by the candidate descriptor judgment and generation processor (221) is calculated by the following formula (Formula 1):

where the candidate descriptor T is a combination of the words (t_i, t_j); the word t consists of t₁t₂...t_n, with t_i = t₁t₂...t_(n-r) and t_j = t_r t_(r+1)...t_n; probability(t_i) denotes the probability that the word t_i occurs on its own in all the normalized text data files of the data initialization processor (212); probability(t_j) denotes the probability that the word t_j occurs on its own in all the normalized text data files of the data initialization processor (212); probability(t_i,t_j) denotes the probability that the words t_i and t_j occur together in the same normalized text data file of the data initialization processor (212); if t_i and t_j are closely bound, probability(t_i,t_j) differs little from probability(t_i) or probability(t_j), the specific difference being determined by the user, and the mutual information value of the words t_i and t_j in the candidate descriptor T calculated by the formula is relatively large; conversely, if probability(t_i) and probability(t_j) are much larger than probability(t_i,t_j), the calculated mutual information value of t_i and t_j is relatively small; that is, the larger the value of Mutual-information(t_i,t_j), the greater the probability that the words t_i and t_j combine into a candidate descriptor;
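The behaviour described above matches the usual pointwise mutual information, so a minimal Python sketch can show how the three probabilities interact. The formula itself is not legible in this text; the form used here, the base-2 logarithm of probability(t_i,t_j) divided by probability(t_i) × probability(t_j), is an assumption, as is estimating each probability as the share of files in which the word occurs. The function name `mutual_information` is hypothetical.

```python
import math


def mutual_information(t_i, t_j, documents):
    """Pointwise mutual information of two words over a list of files.

    documents: list of sets, each holding the words found in one
    normalized text data file (document-level estimate, an assumption).
    """
    n = len(documents)
    p_i = sum(t_i in doc for doc in documents) / n
    p_j = sum(t_j in doc for doc in documents) / n
    p_ij = sum((t_i in doc) and (t_j in doc) for doc in documents) / n
    if p_ij == 0:
        return float("-inf")  # the words never occur in the same file
    return math.log2(p_ij / (p_i * p_j))
```

When two words nearly always appear together, the joint probability stays close to the individual ones and the score is high; when each word is frequent but they rarely share a file, the score drops, which is the ordering the text describes.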

the descriptor relation identification and extraction device (23) comprises:

a descriptor dependency relation identification and extraction processor (231), communicatively coupled to the data initialization processor (212) of the data processor (21) and to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), which receives the segmented and part-of-speech-tagged normalized text data files output by the data initialization processor (212) and the selected descriptor set output by the descriptor judgment and generation processor (222), identifies and extracts the descriptor dependency relation of each descriptor in the selected descriptor set based on the co-occurrence probability statistics of the descriptors in the segmented and part-of-speech-tagged normalized text data files, and outputs the extracted descriptor dependency relations;

a descriptor genus-species relation identification and extraction processor (232), communicatively coupled to the data initialization processor (212) of the data processor (21) and to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), which receives the segmented and part-of-speech-tagged normalized text data output by the data initialization processor (212) and the selected descriptor set output by the descriptor judgment and generation processor (222), identifies and extracts the descriptor genus-species relations based on the inclusion relation that descriptors in the selected descriptor set have in their construction form and on a calculated measure of the similarity between descriptors, and outputs the extracted descriptor genus-species relations; and

a descriptor relation result memory (233), communicatively coupled to the descriptor dependency relation identification and extraction processor (231) and the descriptor genus-species relation identification and extraction processor (232), which receives and stores the descriptor dependency relations output by the descriptor dependency relation identification and extraction processor (231) and the descriptor genus-species relations output by the descriptor genus-species relation identification and extraction processor (232);

wherein:

the descriptor judgment and generation processor (222) uses a method combining position weighting and word frequency statistics as the basis for descriptor judgment and extraction,

the weighting function value is constructed as follows:

the weighting function of a keyword is calculated as:

Weight_i = a × TFIDF_i    (Formula 2)

for a non-keyword descriptor, the weight is calculated with reference to Formula 1:

Weight_i = Mutual-information(T) × a × TFIDF_i    (Formula 3)

in Formula 2 and Formula 3,

where: ft_i is the frequency with which the word t_i occurs in all the normalized text data files d from the data initialization processor (212); N is the number of all normalized text data files from the data initialization processor (212); n_i is the number of normalized text data files containing the word t_i; a is the weight of the position of the word t_i, that is, the weight assigned according to which of the four positions (title, abstract, keywords, body text) of the normalized text data file the word t_i occurs in;
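Formulas 2 and 3 can be sketched in a few lines of Python. The expression for TFIDF_i is not legible in this text, so the sketch assumes the common form ft_i × log(N / n_i); the position weights in `POSITION_WEIGHT` are invented placeholder values, and all function names are hypothetical.

```python
import math

# Placeholder position weights; the text does not give the actual values of a.
POSITION_WEIGHT = {"title": 4.0, "keyword": 3.0, "abstract": 2.0, "body": 1.0}


def tfidf(ft_i, n_docs, n_i):
    """Assumed TF-IDF form: term frequency times log inverse document frequency."""
    return ft_i * math.log(n_docs / n_i)


def weight_keyword(ft_i, n_docs, n_i, position):
    """Formula 2: Weight_i = a * TFIDF_i."""
    return POSITION_WEIGHT[position] * tfidf(ft_i, n_docs, n_i)


def weight_non_keyword(ft_i, n_docs, n_i, position, mutual_info):
    """Formula 3: Weight_i = Mutual-information(T) * a * TFIDF_i."""
    return mutual_info * weight_keyword(ft_i, n_docs, n_i, position)
```

With these placeholder weights, a word seen 3 times across 100 files, 10 of which contain it, scores four times higher in a title than in body text, and a non-keyword candidate is further scaled by its mutual information value.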

the dependency relation is identified and extracted using a similarity calculation method based on joint probability distribution, with the following calculation formula:

where: probability(A,B) denotes the frequency with which word A and word B of the selected descriptor set occur simultaneously in the same window across all the normalized text data files generated by the data initialization processor (212), a window being a position within each normalized text data file, namely one of the four positions title, abstract, keywords, and body text; the A-without-B term of the formula denotes the frequency, across all the normalized text data files, with which word A of the selected descriptor set occurs in a window while word B of the selected descriptor set does not occur in that window; the B-without-A term of the formula denotes the frequency, across all the normalized text data files, with which word B of the selected descriptor set occurs in a window while word A of the selected descriptor set does not occur in that window;
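The three window-level frequencies defined above can be counted with a short Python sketch. The similarity formula that combines them is not legible in this text, so the ratio used in `relatedness` (co-occurrences divided by all windows in which either word appears) is only one plausible combination, chosen for illustration. The data layout, a dictionary from window name to a set of descriptors per file, and both function names are likewise assumptions.

```python
WINDOWS = ("title", "abstract", "keyword", "body")


def cooccurrence_counts(a, b, documents):
    """Count windows holding both words, only a, and only b.

    documents: list of dicts mapping a window name to the set of
    descriptors found in that window of one file.
    """
    both = only_a = only_b = 0
    for doc in documents:
        for window in WINDOWS:
            terms = doc.get(window, set())
            has_a, has_b = a in terms, b in terms
            both += has_a and has_b
            only_a += has_a and not has_b
            only_b += has_b and not has_a
    return both, only_a, only_b


def relatedness(a, b, documents):
    """Illustrative ratio: shared windows over windows containing either word."""
    both, only_a, only_b = cooccurrence_counts(a, b, documents)
    total = both + only_a + only_b
    return both / total if total else 0.0
```

Pairs whose score exceeds a chosen threshold would then be recorded as dependency relations; the text gives no threshold value.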

the genus-species relation of descriptors is obtained using a formula that measures the similarity between descriptors, with the following calculation formula:

where sim_number denotes the total number of identical words shared by two descriptors in the selected descriptor set, the two descriptors being the contained descriptor and the descriptor to be included; sub_number denotes the total number of words in the contained descriptor in the selected descriptor set; number denotes the total number of words in the descriptor to be included in the selected descriptor set; one summation term of the formula denotes the sum of the weights of the positions, within the contained descriptor, of the identical words shared by the two descriptors in the selected descriptor set, where the position of an identical word within the contained descriptor means which of the three positions (prefix, mid-word, suffix) of the contained descriptor the word occupies; q_sim_number(i) denotes the weight of the position of the i-th identical word shared by the two descriptors in the selected descriptor set; sim_number(i) denotes the position number of the i-th word within the set of identical words shared by the two descriptors in the selected descriptor set, that is, the value of i; q₁ denotes the weight coefficient of the prefix, mid-word, and suffix positions in the contained descriptor; sub_number(i) denotes the position number, in the contained descriptor, of the i-th identical word shared by the two descriptors in the selected descriptor set; a second summation term of the formula denotes the sum of the weights of the positions, within the descriptor to be included, of the identical words shared by the two descriptors in the selected descriptor set, where the position of an identical word within the descriptor to be included means which of the three positions (prefix, mid-word, suffix) of the descriptor to be included the word occupies; q₂ denotes the weight coefficient of the prefix, mid-word, and suffix positions in the descriptor to be included; number(i) denotes the position number, in the descriptor to be included, of the i-th identical word shared by the two descriptors in the selected descriptor set; Dp denotes the position coefficient, whose value is the ratio of the total number of words in the contained descriptor to the total number of words in the descriptor to be included, both taken from the selected descriptor set; i = 1, 2, ..., sim_number.
2. The Chinese thesaurus constructing system according to claim 1, characterized in that the descriptor table generator (24) comprises:

a thesaurus generation processor (241), communicatively coupled to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22) and to the descriptor dependency relation identification and extraction processor (231) and the descriptor genus-species relation identification and extraction processor (232) of the descriptor relation identification and extraction device (23), which receives the selected descriptor set output by the descriptor judgment and generation processor (222), receives the descriptor dependency relation and the genus-species relation of each descriptor output respectively by the descriptor dependency relation identification and extraction processor (231) and the descriptor genus-species relation identification and extraction processor (232), and combines and sorts the descriptors and the relations between descriptors based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, so as to generate and output the thesaurus; and

a thesaurus result memory (242), communicatively coupled to the thesaurus generation processor (241), which receives and stores the thesaurus output by the thesaurus generation processor (241).
3. The Chinese thesaurus constructing system according to claim 2, characterized in that the Chinese thesaurus constructing system further comprises:

a verification and modification device (5), communicatively coupled to the data processor memory (213), the descriptor result memory (223), the descriptor relation result memory (233), and the thesaurus result memory (242), for manually verifying, modifying, and deleting the normalized text data files stored in the data processor memory (213), the selected descriptor set stored in the descriptor result memory (223), the descriptor dependency relations and genus-species relations stored in the descriptor relation result memory (233), and the thesaurus stored in the thesaurus result memory (242).
4. The Chinese thesaurus constructing system according to claim 1, characterized in that the memory (3) is one or more selected from a hard disk, a USB flash drive, and a memory card.
5. The Chinese thesaurus constructing system according to claim 1, characterized in that the Chinese thesaurus constructing system further comprises:

a visual operation interface, communicatively coupled to the input device (1), the system processor (2), the memory (3), the output device (4), and the verification and modification device (5).