CN104102847A - Chinese descriptor list building system

Info

Publication number: CN104102847A
Application number: CN201410359650.5A
Authority: CN
Inventors: 曾文; 乔晓东; 朱礼军; 张均胜
Original assignee: INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Current assignee: INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date: 2014-07-25
Filing date: 2014-07-25
Publication date: 2014-10-15
Anticipated expiration: 2034-07-25
Also published as: CN104102847B

Abstract

The invention provides a Chinese descriptor list (thesaurus) building system comprising input equipment, a system processor, a memory and output equipment. The system processor comprises a data processor, a descriptor identification and extraction device, a descriptor relation identification and extraction device, and a descriptor list generator; the memory is communicatively connected to each of these four components of the system processor; and the output equipment is communicatively connected to the system processor. The system thereby overcomes the defects of the original manual method: it saves labor and materials; improves the efficiency of building Chinese descriptor lists; allows Chinese descriptor lists to be built, updated and maintained dynamically, conveniently, quickly and at low cost; ensures the quality of the resulting descriptor list; supports information extraction and the building of Chinese descriptor lists in all fields; benefits the organization and use of information in library, information and archive management; and can serve digital libraries.

Description

Chinese thesaurus constructing system

Technical field

The present invention relates to data processing technology, and in particular to a Chinese thesaurus constructing system.

Background technology

A thesaurus is a standardized, dynamic vocabulary that shows the semantic relations between descriptors. It covers the vocabulary of a specific subject field, with many terms related to one another semantically and hierarchically. Functionally, a thesaurus is the conceptual bridge between indexers and searchers: a term-control tool for converting between natural language (the language used in documents) and system language (the normalized language of the retrieval system), and at the same time a medium of exchange between people and the system. Today, with science and technology developing rapidly and network information services becoming ever more widespread, the traditional method of building a thesaurus manually is time-consuming and expensive. The greatest shortcoming of manual construction is that it cannot solve the "knowledge acquisition bottleneck" inherent in the thesaurus-compiling experts themselves, and it also hinders timely updating and maintenance of the thesaurus. When a manually built thesaurus is applied in a networked, digitized environment, its slow rate of updating leaves the vocabulary out of date and deficient in the scale and quality of its descriptors. This makes it difficult to use and promote among all types of users in a digital network environment, and it cannot meet the needs of professionals in library, information and archive management or of retrieval users. In addition, the digital literature in library, information and archive management grows on a massive scale every year; both the literature produced as existing technologies are updated and developed and the literature produced as frontier technologies appear give rise to an endless stream of new terms. Existing thesauri therefore need to be reworked and updated, and new thesauri need to be built for emerging technical fields or specialties. Building thesauri is currently a consensus of the library, information and archive management profession at home and abroad; see Robert M. Losee, Decisions in Thesaurus Construction and Use, Information Processing & Management, 2007 (4): 958-968. How to build a Chinese thesaurus quickly and efficiently is an urgent practical need in library, information and archive management.

Judging from the published literature and from practical applications, no Chinese thesaurus constructing system or device has yet been reported. At present, domestic research on thesaurus generation technology is scarce. Examples include: Du Huiping, He Lin, Hou Hanqing, Automatic construction of a natural-language thesaurus based on cluster analysis, Journal of the National Library, 2007, 3: 44-49; Xu Ruifang, Li Xiaowen, Hou Hanqing, A comparative study of processing rules for relations between thesaurus terms, Information Science, 2009 (1): 89-93; and Yuan Xu, Chang Chun, Research on ways of acquiring associative relations for thesaurus construction, Information Science, 2013, 31 (1): 68-72. Each of these is only a partial study confined to one stage of the thesaurus generation process, not a complete, systematic development. Another document (Liu Hua, Shen Yulan, Zeng Jianxun, A comparative study of the national standards of China, the United States and Britain for thesaurus compilation, Library and Information Service, 2009, 53 (22): 72-75) mainly tracks and reports on thesaurus compilation abroad. Two further documents (Liu Wei, Zhou Jie, Research on the concurrency mechanism in a networked thesaurus compilation system, Library and Information Service, 2011, 55 (22): 11-14; Zhao Jianhua, Zhao Jianguo et al., Development of a microcomputer-based compilation and management system for Chinese thesauri, Information Journal, 1995: 184-193) are in essence technologies for computer-aided manual entry, compilation and maintenance of a thesaurus: they use computer database technology to assist in compiling and processing the vocabulary and provide vocabulary structuring and basic editing functions, but they do not implement techniques for constructing the thesaurus content itself. Research on thesaurus constructing technology abroad is relatively mature, with related work dating back to the 1970s. However, because of the inherent differences in expression between languages, it is inadvisable to copy foreign thesaurus constructing techniques and methods wholesale. Research and development on the construction of Chinese thesauri is therefore work of practical significance.

Summary of the invention

In view of the deficiencies described in the background, an object of the present invention is to provide a Chinese thesaurus constructing system that overcomes the shortcomings of the original manual method, saves manpower and material resources, improves the efficiency of constructing a Chinese thesaurus, and allows a Chinese thesaurus to be built, updated and maintained dynamically, conveniently, quickly and at low cost.

Another object of the present invention is to provide a Chinese thesaurus constructing system that, compared with manual construction, better guarantees the quality of the constructed thesaurus and supports thesaurus construction or information extraction for any field for which Chinese digital literature is available.

A further object of the present invention is to benefit the organization and use of information in library, information and archive management and to serve digital libraries.

To achieve these objects, the invention provides a Chinese thesaurus constructing system comprising input equipment, a system processor, a memory and output equipment.

The input equipment inputs the raw data files required to build the Chinese thesaurus and outputs them.

The system processor comprises a data processor, a descriptor identification and extraction device, a descriptor relation identification and extraction device, and a thesaurus generator.

The data processor is communicatively connected to the input equipment. It receives the raw data files output by the input equipment, provides the storage address of the raw data files, and judges whether each received raw data file is standardized. If a received raw data file is a non-standardized file that the data processor cannot process directly, the data processor converts it to generate a standard text data file, performs word segmentation and part-of-speech tagging on that file, and outputs the standard text data. If a received raw data file is a standardized file that the data processor can process, the data processor performs word segmentation and part-of-speech tagging on it directly and outputs the standard text data.

The descriptor identification and extraction device is communicatively connected to the data processor. It receives the segmented and part-of-speech-tagged standard text data output by the data processor, identifies and extracts compound words and descriptors according to the rules of national standard GB/T 13190-91 for compiling Chinese thesauri, and generates and outputs the extracted descriptors as the selected descriptor set.

The descriptor relation identification and extraction device is communicatively connected to the data processor and to the descriptor identification and extraction device. It receives the standard text data output by the data processor and the selected descriptor set output by the descriptor identification and extraction device, identifies and extracts the associative relations and hierarchical (genus-species) relations of each descriptor in the selected descriptor set according to GB/T 13190-91, and outputs these relations for each descriptor.

The thesaurus generator is communicatively connected to the descriptor identification and extraction device and to the descriptor relation identification and extraction device. It receives the selected descriptor set output by the former and the associative and hierarchical relations of each descriptor output by the latter, combines and sorts the descriptors and the relations between them according to GB/T 13190-91, and generates and outputs the thesaurus.

The memory is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device and the thesaurus generator of the system processor, and stores the results output by each of them.

The output equipment is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device and the thesaurus generator of the system processor. It receives and outputs the standard text data output by the data processor, the selected descriptor set output by the descriptor identification and extraction device, the associative and hierarchical relations output by the descriptor relation identification and extraction device, and the thesaurus output by the thesaurus generator.

The beneficial effects of the present invention are as follows:

The Chinese thesaurus constructing system provided by the invention overcomes the shortcomings of the original manual method, saves manpower and material resources, improves the efficiency of constructing a Chinese thesaurus, and allows a Chinese thesaurus to be built, updated and maintained dynamically, conveniently, quickly and at low cost.

The Chinese thesaurus constructing system provided by the invention guarantees the construction quality of the Chinese thesaurus and supports thesaurus construction or information extraction for Chinese thesauri in all fields.

The Chinese thesaurus constructing system provided by the invention benefits the organization and use of information in library, information and archive management and can serve digital libraries.

Brief description of the drawings

Fig. 1 is a block diagram of the Chinese thesaurus constructing system according to the present invention.

The reference numerals are as follows:

1 input equipment

2 system processor

21 data processor

211 system initialization processor

212 data initialization processor

213 data processor memory

22 descriptor identification and extraction device

221 candidate descriptor judgment and generation processor

222 descriptor judgment and generation processor

223 descriptor result memory

23 descriptor relation identification and extraction device

231 descriptor associative relation identification and extraction processor

232 descriptor hierarchical (genus-species) relation identification and extraction processor

233 descriptor relation result memory

24 thesaurus generator

241 thesaurus generation processor

242 thesaurus result memory

3 memory

4 output equipment

5 verification and modification device

Detailed description of the embodiments

The Chinese thesaurus constructing system according to the present invention is described in detail below with reference to the accompanying drawing.

Referring to Fig. 1, the Chinese thesaurus constructing system according to the present invention comprises input equipment 1, system processor 2, memory 3 and output equipment 4.

Input equipment 1 inputs the raw data files required to build the Chinese thesaurus and outputs them.

System processor 2 comprises data processor 21, descriptor identification and extraction device 22, descriptor relation identification and extraction device 23, and thesaurus generator 24.

Data processor 21 is communicatively connected to input equipment 1. It receives the raw data files output by input equipment 1 (there is at least one raw data file), provides the storage address of the raw data files, and judges whether each received raw data file is standardized. If a received raw data file is a non-standardized file that data processor 21 cannot process directly, data processor 21 converts it to generate a standard text data file, performs word segmentation and part-of-speech tagging on that file, and outputs the standard text data. If a received raw data file is a standardized file that data processor 21 can process, data processor 21 performs word segmentation and part-of-speech tagging on it directly and outputs the standard text data.

Descriptor identification and extraction device 22 is communicatively connected to data processor 21. It receives the segmented and part-of-speech-tagged standard text data output by data processor 21, identifies and extracts compound words and descriptors according to the rules of national standard GB/T 13190-91 for compiling Chinese thesauri, and generates and outputs the extracted descriptors as the selected descriptor set.

Descriptor relation identification and extraction device 23 is communicatively connected to data processor 21 and to descriptor identification and extraction device 22. It receives the standard text data output by data processor 21 and the selected descriptor set output by device 22, identifies and extracts the associative relations and hierarchical (genus-species) relations of each descriptor in the selected descriptor set according to GB/T 13190-91, and outputs these relations for each descriptor.

Thesaurus generator 24 is communicatively connected to descriptor identification and extraction device 22 and to descriptor relation identification and extraction device 23. It receives the selected descriptor set output by device 22 and the associative and hierarchical relations of each descriptor output by device 23, combines and sorts the descriptors and the relations between them according to GB/T 13190-91, and generates and outputs the thesaurus.

Memory 3 is communicatively connected to data processor 21, descriptor identification and extraction device 22, descriptor relation identification and extraction device 23 and thesaurus generator 24 in system processor 2, and stores the results output by each of them.

Output equipment 4 is communicatively connected to data processor 21, descriptor identification and extraction device 22, descriptor relation identification and extraction device 23 and thesaurus generator 24 in system processor 2. It receives and outputs the standard text data output by data processor 21, the selected descriptor set output by device 22, the associative and hierarchical relations output by device 23, and the thesaurus output by thesaurus generator 24.

In the Chinese thesaurus constructing system according to the present invention, the raw data files include text data files, XML files and PDF files, and the standard text data files include text files and XML files.

In one embodiment of data processor 21, referring to Fig. 1, data processor 21 comprises system initialization processor 211, data initialization processor 212 and data processor memory 213. System initialization processor 211 is communicatively connected to input equipment 1; it receives the raw data files output by input equipment 1, provides the storage address of the raw data files, and judges whether each received raw data file is standardized. If a received raw data file is a non-standardized file that data processor 21 cannot process directly, system initialization processor 211 converts it to generate a standard text data file and outputs that file; if it is a standardized file that data processor 21 can process, system initialization processor 211 outputs it directly as a standard text data file. Data initialization processor 212 is communicatively connected to system initialization processor 211; it receives the standard text data files output by system initialization processor 211, performs word segmentation and part-of-speech tagging on them, and outputs the segmented and tagged standard text data. Data processor memory 213 is communicatively connected to data initialization processor 212 and receives and stores the segmented and tagged standard text data files output by data initialization processor 212.
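The standardization judgment made by system initialization processor 211 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a file's type can be read from its extension and that conversion is delegated to a caller-supplied function; the names `STANDARD_SUFFIXES`, `is_standard` and `to_standard` are hypothetical and do not come from the patent.

```python
from pathlib import Path
from typing import Callable

# Assumption: text and XML files count as "standard" (as the description states);
# identifying them by file extension is an illustrative simplification.
STANDARD_SUFFIXES = {".txt", ".xml"}


def is_standard(path: Path) -> bool:
    """Return True if the raw data file can be processed without conversion."""
    return path.suffix.lower() in STANDARD_SUFFIXES


def to_standard(path: Path, convert: Callable[[Path], Path]) -> Path:
    """Pass standard files through unchanged; convert everything else.

    `convert` is a hypothetical converter (for example PDF to text) supplied
    by the caller; the patent does not specify how conversion is done.
    """
    return path if is_standard(path) else convert(path)
```

Segmentation and part-of-speech tagging would then run on whatever `to_standard` returns, which mirrors the hand-off from processor 211 to processor 212.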

In one embodiment of descriptor identification and extraction device 22, referring to Fig. 1, device 22 comprises candidate descriptor judgment and generation processor 221, descriptor judgment and generation processor 222, and descriptor result memory 223. Candidate descriptor judgment and generation processor 221 is communicatively connected to data initialization processor 212 of data processor 21; it receives the segmented and part-of-speech-tagged standard text data files output by data initialization processor 212, identifies and extracts candidate descriptors from them on the basis of language rules and mutual information statistics, and generates and outputs the candidate descriptor set. Descriptor judgment and generation processor 222 is communicatively connected to processor 221; it receives the candidate descriptor set output by processor 221, judges and extracts descriptors from the candidates on the basis of position weighting and word frequency statistics, and generates and outputs the selected descriptor set. Descriptor result memory 223 is communicatively connected to processor 222 and receives and stores the selected descriptor set output by processor 222.

The standard text data files output by data initialization processor 212 contain text that has been segmented and part-of-speech tagged, in the form of word strings. Candidate descriptor judgment and generation processor 221 obtains the candidate descriptor set by applying the following language rules:

A candidate descriptor contains at least one verb, noun or nominal component;

The last word of a candidate descriptor is a verb, noun or nominal component;

The first word of a candidate descriptor is not a preposition or a measure word;

A candidate descriptor contains no conjunction, pronoun or modal particle;

Word-string units with a length of 2 to 8 are extracted to form candidate descriptors;

Candidate descriptor phrase strings whose segmented parts of speech follow the patterns noun+noun, adjective+noun, verb+noun or noun+verb are extracted, with a maximum phrase-string length of 8.
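The first five language rules above can be sketched as a filter over segmented, tagged text. This is an illustrative Python sketch, not the patented implementation: the part-of-speech tag names (`n`, `v`, `vn`, `p`, `q`, `c`, `r`, `y`) are assumed, the length of 2 to 8 is assumed to be counted in characters, and the `max_words` limit on the search window is a hypothetical parameter. The part-of-speech pattern rule is left out for brevity.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Assumed tag set: n = noun, v = verb, vn = nominal component,
# p = preposition, q = measure word, c = conjunction, r = pronoun,
# y = modal particle. The patent does not name a tag set.
CONTENT_TAGS = {"n", "v", "vn"}
BAD_FIRST_TAGS = {"p", "q"}
FORBIDDEN_TAGS = {"c", "r", "y"}
MIN_LEN, MAX_LEN = 2, 8  # assumed to be measured in characters


def is_candidate(tokens: List[Token]) -> bool:
    """Check one word string against rules 1 to 5."""
    if not tokens:
        return False
    tags = [tag for _, tag in tokens]
    length = sum(len(word) for word, _ in tokens)
    return (
        any(tag in CONTENT_TAGS for tag in tags)            # rule 1
        and tags[-1] in CONTENT_TAGS                        # rule 2
        and tags[0] not in BAD_FIRST_TAGS                   # rule 3
        and not any(tag in FORBIDDEN_TAGS for tag in tags)  # rule 4
        and MIN_LEN <= length <= MAX_LEN                    # rule 5
    )


def candidate_descriptors(sentence: List[Token], max_words: int = 4) -> List[str]:
    """Enumerate contiguous word strings and keep those that pass the rules."""
    found = []
    for start in range(len(sentence)):
        stop = min(len(sentence), start + max_words)
        for end in range(start + 1, stop + 1):
            span = sentence[start:end]
            if is_candidate(span):
                found.append("".join(word for word, _ in span))
    return found
```

For the tagged input `[("齿轮", "n"), ("传动", "vn")]`, the sketch keeps "齿轮", "齿轮传动" and "传动" as candidates.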

To improve the quality of the generated candidates, a candidate descriptor can be regarded as composed of several candidate descriptors, that is, as a candidate descriptor phrase. Candidate descriptor judgment and generation processor 221 uses a mutual information statistic to obtain the selected candidate descriptor set. The mutual information is computed as:

\text{Mutual-information}(T) = \text{Mutual-information}(t_{i}, t_{j}) = \log_{2} \frac{\text{probability}(t_{i}, t_{j})}{\text{probability}(t_{i}) \times \text{probability}(t_{j})}

Formula 1

Here, candidate descriptor T is the combination of the words (t_i, t_j). The word t consists of t_1 t_2 ... t_n, and the word strings are written t_i = t_1 t_2 ... t_{n-r} and t_j = t_r t_{r+1} ... t_n. probability(t_i) is the probability that word t_i occurs on its own in all standard text data files from data initialization processor 212; probability(t_j) is the corresponding probability for word t_j; probability(t_i, t_j) is the probability that t_i and t_j occur together in the same standard text data file from data initialization processor 212. If t_i and t_j are tightly bound, probability(t_i, t_j) differs little from probability(t_i) or probability(t_j) (the exact difference can be set by the user), and the mutual information of the word strings t_i and t_j in candidate descriptor T given by this formula is large. Otherwise, probability(t_i) and probability(t_j) are much larger than probability(t_i, t_j), and the computed mutual information is small. In other words, the larger the value of Mutual-information(t_i, t_j), the more likely t_i and t_j are to combine into a candidate descriptor.
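Formula 1 can be tried out with a short Python sketch. It assumes that each probability is estimated as the fraction of standard text data files in which the word (or word pair) appears, which is one reading of the definitions above; the function name `mutual_information` and the set-per-document representation are illustrative choices, not part of the patent.

```python
import math
from typing import List, Set


def mutual_information(t_i: str, t_j: str, documents: List[Set[str]]) -> float:
    """Formula 1 with document-level probability estimates (an assumption).

    Each document is represented as the set of word strings it contains.
    Returns -inf when the two word strings never occur in the same document.
    """
    n = len(documents)
    if n == 0:
        raise ValueError("at least one document is required")
    p_i = sum(t_i in doc for doc in documents) / n
    p_j = sum(t_j in doc for doc in documents) / n
    p_ij = sum(t_i in doc and t_j in doc for doc in documents) / n
    if p_ij == 0:
        return float("-inf")
    return math.log2(p_ij / (p_i * p_j))
```

A candidate phrase would be kept when its score exceeds a threshold chosen by the user; the patent leaves that threshold open.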

Descriptor judgment and generation processor 222 bases its judgment and extraction of descriptors on a method that combines position weighting with word frequency statistics.

The weighting function is constructed as follows.

For keywords, the weight is computed as:

\text{Weight}_{i} = a \times \text{TFIDF}_{i}

Formula 2

For non-keyword descriptors, the weight is computed as follows (using Formula 1):

\text{Weight}_{i} = \text{Mutual-information}(T) \times a \times \text{TFIDF}_{i}

Formula 3

In Formula 2 and Formula 3,

\text{TFIDF}_{i} = \frac{ft_{i} \times \log (N / n_{i})}{\sqrt{\sum_{j} \left( ft_{j} \times \log (N / n_{j}) \right)^{2}}}

Formula 4

Here, ft_i is the frequency with which word t_i occurs in all standard text data files d from data initialization processor 212; N is the number of all standard text data files from data initialization processor 212; n_i is the number of standard text data files that contain word t_i; and a is the weight of the position of word t_i, that is, which of four positions in the standard text data file the word occupies: title, abstract, keywords or body text.
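The weighting in Formulas 2 to 4 can be sketched in Python as follows. The sketch reads the denominator of Formula 4 as the Euclidean norm of the raw scores of all words, which is the usual TF-IDF normalization and an assumption here. The position weights in `POSITION_WEIGHTS` are hypothetical values; the patent names the four positions but gives no numbers.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def tfidf_scores(documents: List[List[str]]) -> Dict[str, float]:
    """Formula 4: corpus frequency times log(N / n_i), normalized over all words."""
    n_docs = len(documents)
    term_freq = Counter(word for doc in documents for word in doc)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    raw = {w: term_freq[w] * math.log(n_docs / doc_freq[w]) for w in term_freq}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    if norm == 0:
        return {w: 0.0 for w in raw}
    return {w: value / norm for w, value in raw.items()}


# Hypothetical position weights "a"; not taken from the patent.
POSITION_WEIGHTS = {"title": 3.0, "keyword": 2.5, "abstract": 2.0, "body": 1.0}


def descriptor_weight(tfidf: float, position: str,
                      mutual_info: Optional[float] = None) -> float:
    """Formula 2 when mutual_info is None, Formula 3 otherwise."""
    weight = POSITION_WEIGHTS[position] * tfidf
    return weight if mutual_info is None else mutual_info * weight
```

A word that appears in every file gets a score of zero because log(N / n_i) is zero, so common function words drop out without a separate stop list.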

In one embodiment of descriptor relation identification and extraction device 23, referring to Fig. 1, device 23 comprises descriptor associative relation identification and extraction processor 231, descriptor hierarchical (genus-species) relation identification and extraction processor 232, and descriptor relation result memory 233. Processor 231 is communicatively connected to data initialization processor 212 of data processor 21 and to descriptor judgment and generation processor 222 of descriptor identification and extraction device 22. It receives the segmented and part-of-speech-tagged standard text data files output by processor 212 and the selected descriptor set output by processor 222, identifies and extracts the associative relations of each descriptor on the basis of the co-occurrence probability statistics of the descriptors of the selected descriptor set in the segmented and tagged standard text data files, and outputs the extracted associative relations. Processor 232 is connected to the same two processors and receives the same inputs. It identifies and extracts hierarchical relations on the basis of whether descriptors in the selected descriptor set contain one another in their word formation, together with a computed measure of similarity between descriptors, and outputs the extracted hierarchical relations. Descriptor relation result memory 233 is communicatively connected to processors 231 and 232 and receives and stores the associative relations output by processor 231 and the hierarchical relations output by processor 232.

Descriptor associative relation identification and extraction processor 231 receives the segmented and part-of-speech-tagged standard text data files output by data initialization processor 212 of data processor 21 and the selected descriptor set output by descriptor judgment and generation processor 222 of descriptor identification and extraction device 22. It identifies and extracts associative relations using a similarity measure based on the joint probability distribution, computed as follows:

\text{Similarity}(A, B) = \frac{\text{probability}(A \cap B)}{\text{probability}(A \cup B)} = \frac{\text{probability}(A, B)}{\text{probability}(A, \bar{B}) + \text{probability}(A, B) + \text{probability}(\bar{A}, B)}

Formula 5

Here, probability(A, B) is the frequency with which word A and word B of the selected descriptor set occur together in the same window, over all standard text data files generated by data initialization processor 212; a window is one of the four positions in each standard text data file (title, abstract, keywords, body text). probability(A, \bar{B}) is the frequency, over all standard text data files, with which word A of the selected descriptor set occurs in a window and word B does not. probability(\bar{A}, B) is the frequency, over all standard text data files, with which word B of the selected descriptor set occurs in a window and word A does not.
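Formula 5 is a ratio of co-occurrence to occurrence of either word, and can be sketched in Python by counting windows. The sketch assumes that each window (for example, the title of one file) has already been reduced to the set of selected descriptors it contains; the function name `cooccurrence_similarity` is illustrative.

```python
from typing import List, Set


def cooccurrence_similarity(a: str, b: str, windows: List[Set[str]]) -> float:
    """Formula 5 over windows, each given as the set of descriptors it contains."""
    both = sum(a in w and b in w for w in windows)
    only_a = sum(a in w and b not in w for w in windows)
    only_b = sum(b in w and a not in w for w in windows)
    total = both + only_a + only_b
    return both / total if total else 0.0
```

With the windows `[{"A", "B"}, {"A"}, {"B"}, {"A", "B"}, {"C"}]`, the two words share two windows and each appears alone once, so the similarity is 2 / 4 = 0.5. Pairs whose score passes a user-chosen threshold would be recorded as associatively related.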

Descriptor hierarchical (genus-species) relation identification and extraction processor 232 receives the segmented and part-of-speech-tagged standard text data files output by data initialization processor 212 of data processor 21 and the selected descriptor set output by descriptor judgment and generation processor 222 of descriptor identification and extraction device 22. It obtains the hierarchical relations of the descriptors using a measure of similarity between descriptors, computed as follows:

\text{Similarity} = \left( \frac{sim\_number}{sub\_number} + \frac{sim\_number}{number} \right) / 2 + dp \times \left( \frac{\sum q_{1} \, sim\_number(i)}{\sum sub\_number(i)} + \frac{\sum q_{2} \, sim\_number(i)}{\sum number(i)} \right) / 2

Formula 6

Here, the two descriptors compared are a contained descriptor and a containing descriptor (the descriptor that is to contain it), both from the selected descriptor set. sim_number is the total number of identical words the two descriptors have in common; sub_number is the total number of words in the contained descriptor; number is the total number of words in the containing descriptor. \sum q_1 sim_number(i) is the sum of the position weights, within the contained descriptor, of the identical words, where the position of a word is one of three: prefix, middle or suffix of the contained descriptor. q sim_number(i) is the position weight of the i-th identical word; sim_number(i) is the position number (the value of i) of the i-th word within the set of identical words; q_1 is the weight coefficient of the prefix, middle and suffix positions in the contained descriptor; sub_number(i) is the position number of the i-th identical word within the contained descriptor. \sum q_2 sim_number(i) is the sum of the position weights, within the containing descriptor, of the identical words, where the position is again prefix, middle or suffix; q_2 is the weight coefficient of the prefix, middle and suffix positions in the containing descriptor; number(i) is the position number of the i-th identical word within the containing descriptor. dp is the position parameter, whose value is the ratio of the total number of words in the contained descriptor to the total number of words in the containing descriptor. i = 1, 2, ..., sim_number.
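As a partial illustration, the Python sketch below computes only the first term of Formula 6, the average overlap ratio, together with a simple check for containment in word formation. It assumes that a "word" inside a descriptor is a single character and that containment means one descriptor is a proper substring of the other; both are assumptions, and the position-weighted term (with dp, q_1 and q_2) is deliberately left out because the patent gives no weight values. The function names are hypothetical.

```python
from collections import Counter


def overlap_term(contained: str, containing: str) -> float:
    """First term of Formula 6: (sim/sub_number + sim/number) / 2.

    Units are counted per character, which is an assumption.
    """
    if not contained or not containing:
        raise ValueError("both descriptors must be non-empty")
    # Multiset intersection counts the units the two descriptors share.
    shared = sum((Counter(contained) & Counter(containing)).values())
    return (shared / len(contained) + shared / len(containing)) / 2


def is_form_contained(contained: str, containing: str) -> bool:
    """True if one descriptor appears inside the other as a proper substring."""
    return contained != containing and contained in containing
```

For "齿轮" and "圆柱齿轮", two characters are shared, so the term is (2/2 + 2/4) / 2 = 0.75, and the containment check also succeeds, which is the kind of pair the processor would consider for a hierarchical relation.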

In one embodiment of thesaurus generator 24, referring to Fig. 1, thesaurus generator 24 comprises thesaurus generation processor 241 and thesaurus result memory 242. Thesaurus generation processor 241 is communicatively connected to descriptor judgment and generation processor 222 of descriptor identification and extraction device 22 and to descriptor associative relation identification and extraction processor 231 and descriptor hierarchical relation identification and extraction processor 232 of descriptor relation identification and extraction device 23. It receives the selected descriptor set output by processor 222 and the associative and hierarchical relations of each descriptor output by processors 231 and 232 respectively, combines and sorts the descriptors and the relations between them according to the rules of national standard GB/T 13190-91 for compiling Chinese thesauri, and generates and outputs the thesaurus. Thesaurus result memory 242 is communicatively connected to thesaurus generation processor 241 and receives and stores the thesaurus output by thesaurus generation processor 241.
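The combining and sorting step can be sketched in Python as assembling one entry per descriptor. This is a simplified illustration under stated assumptions: the entry layout with `broader`, `narrower` and `related` lists is hypothetical rather than the notation of GB/T 13190-91, and sorting uses plain Unicode code-point order rather than any ordering the standard prescribes.

```python
from typing import Dict, Iterable, List, Tuple

Entry = Dict[str, List[str]]


def build_thesaurus(
    descriptors: Iterable[str],
    hierarchical: Iterable[Tuple[str, str]],  # (broader, narrower) pairs
    associative: Iterable[Tuple[str, str]],   # unordered related pairs
) -> Dict[str, Entry]:
    """Combine descriptors and their relations into sorted entries."""
    sets = {d: {"broader": set(), "narrower": set(), "related": set()}
            for d in descriptors}
    for broader, narrower in hierarchical:
        if broader in sets and narrower in sets:
            sets[narrower]["broader"].add(broader)
            sets[broader]["narrower"].add(narrower)
    for a, b in associative:
        if a in sets and b in sets and a != b:
            sets[a]["related"].add(b)
            sets[b]["related"].add(a)
    # Sort entries and each relation list so the output is deterministic.
    return {d: {kind: sorted(terms) for kind, terms in sets[d].items()}
            for d in sorted(sets)}
```

Relations that mention a term outside the selected descriptor set are dropped, so the output never refers to a descriptor that has no entry of its own.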

In an embodiment of the Chinese thesaurus construction system according to the present invention, with reference to Fig. 1, the Chinese thesaurus construction system may further comprise: a verification and modification device 5, communicatively connected to the data processor memory 213, the descriptor result memory 223, the descriptor relation result memory 233, and the thesaurus result memory 242, for manually checking, modifying, and deleting the standard text data files stored in the data processor memory 213, the selected descriptor set stored in the descriptor result memory 223, the descriptor correlation relationships and descriptor genus-species relations stored in the descriptor relation result memory 233, and the thesaurus stored in the thesaurus result memory 242. With the verification and modification device 5, a user can at any time open whichever of the above stored contents needs to be checked, modified, or deleted, and operate on it accordingly.

In an embodiment of the memory 3, the memory 3 may be selected from a hard disk, a USB flash drive, a portable hard drive, and a memory card.

In an embodiment of the Chinese thesaurus construction system according to the present invention, the Chinese thesaurus construction system may further comprise: a visual operation interface (not shown), communicatively connected to the input device 1, the system processor 2, the memory 3, the output device 4, and the verification and modification device 5. The visual operation interface makes it convenient for the user to carry out the entire automatic thesaurus construction process.

The following further verifies the descriptor and descriptor relation identification and extraction of the present invention, in conjunction with all of the technical features shown in Fig. 1.

The system of the present invention was applied to 2,426 Chinese patent documents in mechanical engineering and 2,783 Chinese journal articles on natural language processing technology in order to identify and extract descriptors and descriptor relations. After the operations of the above steps, the following test results were obtained:

Table 1. Example test results for candidate descriptors and selected descriptors identified and extracted by the present invention

Table 2. Example test results for descriptors identified and extracted by the present invention

Table 3. Example test results for descriptor correlation relationships identified and extracted by the present invention

Table 4. Example test results for descriptor genus-species relations identified and extracted by the present invention

Claims

1. A Chinese thesaurus construction system, characterized in that it comprises:

An input device (1), which inputs the raw data files required for building the Chinese thesaurus and outputs the raw data files;

A system processor (2), comprising:

A data processor (21), communicatively connected to the input device (1), which receives the raw data file output by the input device (1), provides the storage address of the raw data file, and performs a standardization judgment on the received raw data file: if the received raw data file is a non-standardized raw data file that does not meet the processing requirements of the data processor (21), the raw data file is converted to generate a standard text data file, and word segmentation and part-of-speech tagging are performed on the standard text data file to output standard text data; if the received raw data file is a standardized raw data file that meets the processing requirements of the data processor (21), word segmentation and part-of-speech tagging are performed directly on the raw data file to output standard text data;

A descriptor identification and extraction device (22), communicatively connected to the data processor (21), which receives the segmented and part-of-speech-tagged standard text data output by the data processor (21), performs word combination and descriptor identification and extraction based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, and generates and outputs the extracted descriptors, the extracted descriptors serving as a selected descriptor set;

A descriptor relation identification and extraction device (23), communicatively connected to the data processor (21) and the descriptor identification and extraction device (22), which receives the standard text data output by the data processor (21) and the selected descriptor set output by the descriptor identification and extraction device (22), identifies and extracts, based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, the correlation relationship and the genus-species relation of each descriptor in the selected descriptor set, and outputs the correlation relationship and the genus-species relation of each descriptor; and

A thesaurus generator (24), communicatively connected to the descriptor identification and extraction device (22) and the descriptor relation identification and extraction device (23), which receives the selected descriptor set output by the descriptor identification and extraction device (22), receives the correlation relationship and the genus-species relation of each descriptor output by the descriptor relation identification and extraction device (23), and, based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, combines and sorts the descriptors and the relations between descriptors to generate and output the thesaurus;

A memory (3), communicatively connected to the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the thesaurus generator (24) in the system processor (2), which stores the results output by each of the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the thesaurus generator (24); and

An output device (4), communicatively connected to the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the thesaurus generator (24) in the system processor (2), which receives and outputs the standard text data output by the data processor (21), the selected descriptor set output by the descriptor identification and extraction device (22), the descriptor correlation relationships and genus-species relations output by the descriptor relation identification and extraction device (23), and the thesaurus output by the thesaurus generator (24).

2. The Chinese thesaurus construction system according to claim 1, characterized in that:

The raw data files comprise text data files, XML files, and PDF files;

The standard text data files comprise text files and XML files.

3. The Chinese thesaurus construction system according to claim 1, characterized in that the data processor (21) comprises:

A system initialization processor (211), communicatively connected to the input device (1), which receives the raw data file output by the input device (1), provides the storage address of the raw data file, and performs a standardization judgment on the received raw data file: if the received raw data file is a non-standardized raw data file that does not meet the processing requirements of the data processor (21), the raw data file is converted to generate and output a standard text data file; if the received raw data file is a standardized raw data file that meets the processing requirements of the data processor (21), the raw data file is output directly as a standard text data file;

A data initialization processor (212), communicatively connected to the system initialization processor (211), which receives the standard text data file output by the system initialization processor (211), performs word segmentation and part-of-speech tagging on the standard text data file, and outputs the segmented and part-of-speech-tagged standard text data file; and

A data processor memory (213), communicatively connected to the data initialization processor (212), which receives and stores the segmented and part-of-speech-tagged standard text data file output by the data initialization processor (212).

4. The Chinese thesaurus construction system according to claim 3, characterized in that the descriptor identification and extraction device (22) comprises:

A candidate descriptor judgment and generation processor (221), communicatively connected to the data initialization processor (212) of the data processor (21), which receives the segmented and part-of-speech-tagged standard text data file output by the data initialization processor (212) of the data processor (21), identifies and extracts candidate descriptors from the received segmented and part-of-speech-tagged standard text data file based on linguistic rules and mutual information statistics, and generates and outputs a candidate descriptor set;

A descriptor judgment and generation processor (222), communicatively connected to the candidate descriptor judgment and generation processor (221), which receives the candidate descriptor set output by the candidate descriptor judgment and generation processor (221), judges and extracts descriptors from the candidate descriptors in the received candidate descriptor set based on position weighting and word frequency statistics, and generates and outputs the selected descriptor set; and

A descriptor result memory (223), communicatively connected to the descriptor judgment and generation processor (222), which receives and stores the selected descriptor set output by the descriptor judgment and generation processor (222).

5. The Chinese thesaurus construction system according to claim 4, characterized in that the descriptor relation identification and extraction device (23) comprises:

A descriptor correlation relationship identification and extraction processor (231), communicatively connected to the data initialization processor (212) of the data processor (21) and to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), which receives the segmented and part-of-speech-tagged standard text data file output by the data initialization processor (212) and the selected descriptor set output by the descriptor judgment and generation processor (222), identifies and extracts the correlation relationship of each descriptor based on the co-occurrence probability statistics of the descriptors of the selected descriptor set in the segmented and part-of-speech-tagged standard text data file, and outputs the extracted descriptor correlation relationships;

A descriptor genus-species relation identification and extraction processor (232), communicatively connected to the data initialization processor (212) of the data processor (21) and to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), which receives the segmented and part-of-speech-tagged standard text data output by the data initialization processor (212) and the selected descriptor set output by the descriptor judgment and generation processor (222), identifies and extracts descriptor genus-species relations based on the inclusion relations that the descriptors in the selected descriptor set have in their structural form and on a computed measure of the similarity between descriptors, and outputs the extracted descriptor genus-species relations; and

A descriptor relation result memory (233), communicatively connected to the descriptor correlation relationship identification and extraction processor (231) and the descriptor genus-species relation identification and extraction processor (232), which receives and stores the descriptor correlation relationships output by the descriptor correlation relationship identification and extraction processor (231) and the descriptor genus-species relations output by the descriptor genus-species relation identification and extraction processor (232).

6. The Chinese thesaurus construction system according to claim 5, characterized in that the thesaurus generator (24) comprises:

A thesaurus generation processor (241), communicatively connected to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22) and to the descriptor correlation relationship identification and extraction processor (231) and the descriptor genus-species relation identification and extraction processor (232) of the descriptor relation identification and extraction device (23), which receives the selected descriptor set output by the descriptor judgment and generation processor (222), receives the correlation relationship and the genus-species relation of each descriptor output respectively by the processors (231) and (232), and, based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, combines and sorts the descriptors and the relations between descriptors to generate and output the thesaurus; and

A thesaurus result memory (242), communicatively connected to the thesaurus generation processor (241), which receives and stores the thesaurus output by the thesaurus generation processor (241).

7. The Chinese thesaurus construction system according to claim 6, characterized in that the Chinese thesaurus construction system further comprises:

A verification and modification device (5), communicatively connected to the data processor memory (213), the descriptor result memory (223), the descriptor relation result memory (233), and the thesaurus result memory (242), for manually checking, modifying, and deleting the standard text data files stored in the data processor memory (213), the selected descriptor set stored in the descriptor result memory (223), the descriptor correlation relationships and descriptor genus-species relations stored in the descriptor relation result memory (233), and the thesaurus stored in the thesaurus result memory (242).

8. The Chinese thesaurus construction system according to claim 1, characterized in that the memory (3) is selected from a hard disk, a USB flash drive, a portable hard drive, and a memory card.

9. The Chinese thesaurus construction system according to claim 1, characterized in that the Chinese thesaurus construction system further comprises:

A visual operation interface, communicatively connected to the input device (1), the system processor (2), the memory (3), the output device (4), and the verification and modification device (5).