CN104102847A - Chinese descriptor list building system - Google Patents


Info

Publication number
CN104102847A
Authority
CN
China
Prior art keywords
descriptor
processor
thesaurus
data file
withdrawal device
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410359650.5A
Other languages
Chinese (zh)
Other versions
CN104102847B (en)
Inventor
曾文
乔晓东
朱礼军
张均胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Original Assignee
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Application filed by INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority to CN201410359650.5A
Publication of CN104102847A
Application granted
Publication of CN104102847B
Legal status: Expired - Fee Related

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Chinese thesaurus (descriptor list) construction system comprising an input device, a system processor, a memory and an output device. The system processor comprises a data processor, a descriptor identification and extraction device, a descriptor relation identification and extraction device, and a thesaurus generator. The memory is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device and the thesaurus generator of the system processor, and the output device is communicatively connected to the system processor. The system thereby overcomes the shortcomings of the original manual method, saves labor and materials, and improves the efficiency of building a Chinese thesaurus; it allows the thesaurus to be built, updated and maintained dynamically, conveniently, quickly and at low cost; it can ensure the quality of the resulting thesaurus; it can support information extraction or thesaurus construction in all fields; and it benefits the organization and use of information in library, information and archive management and can serve digital libraries.

Description

Chinese thesaurus construction system
Technical field
The present invention relates to data processing technology, and in particular to a Chinese thesaurus construction system.
Background art
A thesaurus is a standardized, dynamic vocabulary that shows the semantic relations between descriptors. It contains many terms from a specific subject field that are related semantically and hierarchically. Functionally, a thesaurus is the conceptual bridge between indexers and searchers: a terminology control tool that converts between natural language (the language used in documents) and system language (the normalized language of the retrieval system), and at the same time a medium of communication between people and the system. With science and technology developing rapidly and networked information services becoming ever more widespread, the traditional method of building a thesaurus by hand is time-consuming and expensive. Its greatest shortcoming is that it cannot solve the "knowledge acquisition bottleneck" of the thesaurus-compiling experts themselves, and it also hinders timely updating and maintenance of the thesaurus. When a manually built thesaurus is used in a networked, digital environment, its slow rate of update leaves its content out of date and its descriptors deficient in scale and quality. This makes it hard to use and promote among all kinds of users in a digital network environment, and it cannot meet the needs of professionals in library, information and archive management or of retrieval users. In addition, the digital literature in this field grows by a massive volume every year, and both the literature produced as existing technologies are updated and developed and the literature produced by emerging technologies give rise to an endless stream of new terms. Existing thesauri therefore need to be revised and updated, and new thesauri need to be built for emerging technical fields or specialties. Building thesauri is now a consensus of the library, information and archive management community at home and abroad; see Robert M. Losee, Decisions in Thesaurus Construction and Use, Information Processing & Management, 2007 (4): 958-968. How to build a Chinese thesaurus quickly and efficiently is a pressing practical need in this field.
Judging from published documents and practical applications, no Chinese thesaurus construction system or device has yet been reported. Domestic research on thesaurus generation technology is currently scarce. Examples include: Du Huiping, He Lin, Hou Hanqing, Automatic construction of a natural-language thesaurus based on cluster analysis, Journal of the National Library, 2007, 3: 44-49; Xu Ruifang, Li Xiaowen, Hou Hanqing, A comparative study of the rules for handling relations between thesaurus terms, Information Science, 2009 (1): 89-93; and Yuan Xu, Chang Chun, Research on ways of acquiring associative relations for thesaurus construction, Information Science, 2013, 31 (1): 68-72. Each of these documents is confined to one stage of the thesaurus generation process, and none constitutes a complete system. Another document (Liu Hua, Shen Yulan, Zeng Jianxun, A comparative study of the national standards for thesaurus compilation in China, the United States and Britain, Library and Information Service, 2009, 53 (22): 72-75) mainly tracks and reports on thesaurus compilation abroad. Two further documents (Liu Wei, Zhou Jie, Research on the concurrency mechanism in a networked thesaurus compilation system, Library and Information Service, 2011, 55 (22): 11-14; Zhao Jianhua, Zhao Jianguo et al., Development of a microcomputer-based compilation and management system for Chinese thesauri, Information Journal, 1995: 184-193) are in essence techniques for computer-aided manual entry, compilation and maintenance of a thesaurus: they use database technology to assist in compiling and processing the vocabulary and provide vocabulary structuring and basic editing functions, but they do not address the construction of the thesaurus content itself. Research on thesaurus construction technology abroad is relatively mature, with related work dating back to the 1970s. However, because of the inherent differences in expression between languages, copying foreign thesaurus construction techniques and methods wholesale is inadvisable. Research and development on the construction of Chinese thesauri is therefore work of practical significance.
Summary of the invention
In view of the deficiencies described in the background art, an object of the present invention is to provide a Chinese thesaurus construction system that overcomes the shortcomings of the original manual method, saves manpower and material resources, improves the efficiency of building a Chinese thesaurus, and allows the thesaurus to be built, updated and maintained dynamically, conveniently, quickly and at low cost.
Another object of the present invention is to provide a Chinese thesaurus construction system that, compared with building a Chinese thesaurus manually, better ensures the quality of the thesaurus and can support thesaurus construction or information extraction for any field that has digital literature.
A further object of the present invention is to benefit the organization and use of information in library, information and archive management, and to serve digital libraries.
To achieve these objects, the invention provides a Chinese thesaurus construction system comprising an input device, a system processor, a memory and an output device.
The input device takes in the raw data files needed to build the Chinese thesaurus and outputs them.
The system processor comprises the following components. A data processor is communicatively connected to the input device and receives the raw data files output by the input device. It provides the storage address of each raw data file and performs a standardization check on it. If the raw data file is a non-standard file that the data processor cannot process directly, the data processor converts it into a standard text data file, performs word segmentation and part-of-speech tagging on that file, and outputs the resulting standard text data. If the raw data file is already a standard file that the data processor can process, the data processor performs word segmentation and part-of-speech tagging on it directly and outputs the standard text data. A descriptor identification and extraction device is communicatively connected to the data processor and receives the segmented and part-of-speech-tagged standard text data. Based on the compilation rules of the national standard GB/T 13190-91 for Chinese thesauri, it identifies and extracts word combinations and descriptors, and generates and outputs the extracted descriptors as the selected descriptor set. A descriptor relation identification and extraction device is communicatively connected to the data processor and to the descriptor identification and extraction device, and receives the standard text data from the former and the selected descriptor set from the latter. Based on the same GB/T 13190-91 rules, it identifies and extracts the associative relations and the genus-species (hierarchical) relations of each descriptor in the selected descriptor set and outputs them. A thesaurus generator is communicatively connected to the descriptor identification and extraction device and to the descriptor relation identification and extraction device. It receives the selected descriptor set from the former and the associative and genus-species relations of each descriptor from the latter and, based on the GB/T 13190-91 rules, combines and sorts the descriptors and the relations between them to generate and output the thesaurus.
The memory is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device and the thesaurus generator of the system processor, and stores the results output by each of them.
The output device is communicatively connected to the same four components of the system processor. It receives and outputs the standard text data from the data processor, the selected descriptor set from the descriptor identification and extraction device, the associative and genus-species relations from the descriptor relation identification and extraction device, and the thesaurus from the thesaurus generator.
The beneficial effects of the present invention are as follows:
The Chinese thesaurus construction system provided by the invention overcomes the shortcomings of the original manual method, saves manpower and material resources, improves the efficiency of building a Chinese thesaurus, and allows the thesaurus to be built, updated and maintained dynamically, conveniently, quickly and at low cost.
The system can ensure the quality of the Chinese thesaurus it builds and can support thesaurus construction or information extraction in all fields.
The system benefits the organization and use of information in library, information and archive management, and can serve digital libraries.
Brief description of the drawings
Fig. 1 is a block diagram of the Chinese thesaurus construction system according to the present invention.
The reference numerals are as follows:
1 input device
2 system processor
21 data processor
211 system initialization processor
212 data initialization processor
213 data processor memory
22 descriptor identification and extraction device
221 candidate descriptor determination and generation processor
222 descriptor determination and generation processor
223 descriptor result memory
23 descriptor relation identification and extraction device
231 descriptor associative relation identification and extraction processor
232 descriptor genus-species relation identification and extraction processor
233 descriptor relation result memory
24 thesaurus generator
241 thesaurus generation processor
242 thesaurus result memory
3 memory
4 output device
5 verification and modification device
Detailed description of the embodiments
The Chinese thesaurus construction system according to the present invention is described in detail below with reference to the accompanying drawing.
With reference to Fig. 1, the Chinese thesaurus construction system according to the present invention comprises an input device 1, a system processor 2, a memory 3 and an output device 4.
The input device 1 takes in the raw data files needed to build the Chinese thesaurus and outputs them.
The system processor 2 comprises the following components. A data processor 21 is communicatively connected to the input device 1 and receives the raw data files output by the input device 1 (there is at least one raw data file). It provides the storage address of each raw data file and performs a standardization check on it. If the raw data file is a non-standard file that the data processor 21 cannot process directly, the data processor 21 converts it into a standard text data file, performs word segmentation and part-of-speech tagging on that file, and outputs the resulting standard text data. If the raw data file is already a standard file that the data processor 21 can process, the data processor 21 performs word segmentation and part-of-speech tagging on it directly and outputs the standard text data. A descriptor identification and extraction device 22 is communicatively connected to the data processor 21 and receives the segmented and part-of-speech-tagged standard text data output by the data processor 21. Based on the compilation rules of the national standard GB/T 13190-91 for Chinese thesauri, it identifies and extracts word combinations and descriptors, and generates and outputs the extracted descriptors as the selected descriptor set. A descriptor relation identification and extraction device 23 is communicatively connected to the data processor 21 and to the descriptor identification and extraction device 22, and receives the standard text data output by the data processor 21 and the selected descriptor set output by the descriptor identification and extraction device 22. Based on the same GB/T 13190-91 rules, it identifies and extracts the associative relations and the genus-species (hierarchical) relations of each descriptor in the selected descriptor set and outputs them. A thesaurus generator 24 is communicatively connected to the descriptor identification and extraction device 22 and to the descriptor relation identification and extraction device 23. It receives the selected descriptor set output by the descriptor identification and extraction device 22 and the associative and genus-species relations of each descriptor output by the descriptor relation identification and extraction device 23 and, based on the GB/T 13190-91 rules, combines and sorts the descriptors and the relations between them to generate and output the thesaurus.
The memory 3 is communicatively connected to the data processor 21, the descriptor identification and extraction device 22, the descriptor relation identification and extraction device 23 and the thesaurus generator 24 of the system processor 2, and stores the results output by each of them.
The output device 4 is communicatively connected to the data processor 21, the descriptor identification and extraction device 22, the descriptor relation identification and extraction device 23 and the thesaurus generator 24 of the system processor 2. It receives and outputs the standard text data from the data processor 21, the selected descriptor set from the descriptor identification and extraction device 22, the associative and genus-species relations from the descriptor relation identification and extraction device 23, and the thesaurus from the thesaurus generator 24.
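The data flow between the four components of the system processor 2 and the memory 3 can be sketched as follows. This is only an illustrative sketch: the function name `run_pipeline`, the dictionary standing in for the memory 3, and the four callables passed in are hypothetical and are not named in the text.

```python
def run_pipeline(raw_files, preprocess, extract_descriptors,
                 extract_relations, generate):
    """Chain the four stages and keep every intermediate result.

    All four stage functions are supplied by the caller (hypothetical
    stand-ins for components 21, 22, 23 and 24).
    """
    memory = {}  # stands in for memory 3: one stored result per stage
    # Data processor 21: raw files -> segmented, tagged standard text data
    memory["text"] = preprocess(raw_files)
    # Device 22: standard text data -> selected descriptor set
    memory["descriptors"] = extract_descriptors(memory["text"])
    # Device 23: text data + descriptor set -> relations between descriptors
    memory["relations"] = extract_relations(memory["text"],
                                            memory["descriptors"])
    # Generator 24: descriptor set + relations -> thesaurus
    memory["thesaurus"] = generate(memory["descriptors"],
                                   memory["relations"])
    return memory
```

Keeping every intermediate result mirrors the description, in which the output of each component is stored and can be output separately.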
In the Chinese thesaurus construction system according to the present invention, the raw data files include text data files, XML files and PDF files, and the standard text data files include text files and XML files.
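A minimal sketch of the standardization check follows, assuming that the file extension is enough to tell a standard file (text or XML) from a non-standard one (such as PDF). The extension test, the names `is_standard` and `to_standard_text`, and the caller-supplied `convert` function are all assumptions made for illustration; the text does not say how the check or the conversion is carried out.

```python
from pathlib import Path

# Assumption: the extension decides whether a file is already "standard".
STANDARD_SUFFIXES = {".txt", ".xml"}


def is_standard(path: str) -> bool:
    """Return True if the file can be passed on without conversion."""
    return Path(path).suffix.lower() in STANDARD_SUFFIXES


def to_standard_text(path: str, convert) -> str:
    """Return the path of a standard text data file for `path`.

    `convert` is a hypothetical caller-supplied function that turns a
    non-standard file (for example a PDF) into a text file and returns
    the path of the new file.
    """
    return path if is_standard(path) else convert(path)
```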
In one embodiment of the data processor 21, with reference to Fig. 1, the data processor 21 comprises the following components. A system initialization processor 211 is communicatively connected to the input device 1 and receives the raw data files output by the input device 1. It provides the storage address of each raw data file and performs a standardization check on it. If the raw data file is a non-standard file that the data processor 21 cannot process directly, the system initialization processor 211 converts it into a standard text data file and outputs that file; if the raw data file is already a standard file that the data processor 21 can process, it outputs the file directly as a standard text data file. A data initialization processor 212 is communicatively connected to the system initialization processor 211, receives the standard text data files output by the system initialization processor 211, performs word segmentation and part-of-speech tagging on them, and outputs the segmented and tagged standard text data. A data processor memory 213 is communicatively connected to the data initialization processor 212 and receives and stores the segmented and tagged standard text data files output by the data initialization processor 212.
In one embodiment of the descriptor identification and extraction device 22, with reference to Fig. 1, the descriptor identification and extraction device 22 comprises the following components. A candidate descriptor determination and generation processor 221 is communicatively connected to the data initialization processor 212 of the data processor 21 and receives the segmented and part-of-speech-tagged standard text data files output by the data initialization processor 212. Using language rules and mutual information statistics, it identifies and extracts candidate descriptors from those files and generates and outputs the candidate descriptor set. A descriptor determination and generation processor 222 is communicatively connected to the candidate descriptor determination and generation processor 221 and receives the candidate descriptor set it outputs. Using position weighting and word frequency statistics, it evaluates the candidate descriptors in that set and extracts descriptors from them, generating and outputting the selected descriptor set. A descriptor result memory 223 is communicatively connected to the descriptor determination and generation processor 222 and receives and stores the selected descriptor set it outputs.
The standard text data files output by the data initialization processor 212 contain text that has been segmented and part-of-speech tagged, in the form of word strings. The candidate descriptor determination and generation processor 221 obtains the candidate descriptor set by applying the following language rules:
A candidate descriptor contains at least one verb, noun or nominal component;
The last word of a candidate descriptor is a verb, noun or nominal component;
The first word of a candidate descriptor is not a preposition or a measure word;
A candidate descriptor contains no conjunction, pronoun or modal particle;
Word strings whose length is between 2 and 8 units are extracted as candidate descriptors;
Candidate descriptor phrase strings whose segmented words follow the part-of-speech patterns noun + noun, adjective + noun, verb + noun or noun + verb are extracted, with a maximum phrase string length of 8.
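The first five rules above can be sketched as a filter over segmented, tagged text. The sketch makes several assumptions that are not in the text: the part-of-speech tags (`n`, `v`, `vn`, `p`, `q`, `c`, `r`, `y`) follow a common Chinese tag set, the length limit of 2 to 8 is counted in segmented words, and the function names are hypothetical. The part-of-speech pattern rule (noun + noun and so on) is left out for brevity.

```python
# Assumed tag set: n noun, v verb, vn nominal component, p preposition,
# q measure word, c conjunction, r pronoun, y modal particle.
CONTENT = {"n", "v", "vn"}
BAD_FIRST = {"p", "q"}
FORBIDDEN = {"c", "r", "y"}


def is_candidate(tagged):
    """Check one span, given as a list of (word, tag) pairs."""
    tags = [tag for _, tag in tagged]
    if not 2 <= len(tags) <= 8:              # length between 2 and 8 units
        return False
    if not any(t in CONTENT for t in tags):  # at least one content word
        return False
    if tags[-1] not in CONTENT:              # last word is a content word
        return False
    if tags[0] in BAD_FIRST:                 # no leading preposition/measure
        return False
    return not any(t in FORBIDDEN for t in tags)


def candidates(sentence):
    """Return every contiguous span of the sentence that passes the rules."""
    found = []
    for start in range(len(sentence)):
        for end in range(start + 2, min(start + 8, len(sentence)) + 1):
            span = sentence[start:end]
            if is_candidate(span):
                found.append("".join(word for word, _ in span))
    return found
```

For example, a sentence tagged as noun, nominal component, conjunction, noun yields only the first two words as a candidate, because every longer span contains or ends with the conjunction.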
To improve the quality of the candidate descriptors, a candidate descriptor can be regarded as composed of several candidate words, that is, as a candidate descriptor phrase. The candidate descriptor determination and generation processor 221 uses a mutual information statistic to obtain the selected candidate descriptor set. The mutual information is calculated as:
$$\text{Mutual-information}(T) = \text{Mutual-information}(t_i, t_j) = \log_2 \frac{\text{probability}(t_i, t_j)}{\text{probability}(t_i) \times \text{probability}(t_j)}$$
Formula 1
Here the candidate descriptor $T$ is the combination of the words $(t_i, t_j)$. The word string $t$ consists of $t_1 t_2 \ldots t_n$, and the substrings are written $t_i = t_1 t_2 \ldots t_{n-r}$ and $t_j = t_r t_{r+1} \ldots t_n$. probability($t_i$) is the probability that $t_i$ occurs on its own in all the standard text data files from the data initialization processor 212; probability($t_j$) is the probability that $t_j$ occurs on its own in all those files; and probability($t_i, t_j$) is the probability that $t_i$ and $t_j$ occur together in the same standard text data file from the data initialization processor 212. If $t_i$ and $t_j$ are tightly bound, probability($t_i, t_j$) differs little from probability($t_i$) or probability($t_j$) (the acceptable difference can be set by the user), and the mutual information that the formula gives for the strings $t_i$ and $t_j$ in the candidate descriptor $T$ is relatively large. Otherwise probability($t_i$) and probability($t_j$) are much larger than probability($t_i, t_j$), and the mutual information is relatively small. In other words, the larger the value of Mutual-information($t_i, t_j$), the more likely $t_i$ and $t_j$ are to combine into a candidate descriptor.
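Formula 1 can be illustrated with a short sketch. It assumes that each probability is estimated as the share of standard text data files in which the word (or both words) occurs, and that each file is represented as a set of words; both choices, and the function name, are assumptions made for illustration.

```python
import math


def mutual_information(docs, ti, tj):
    """log2( P(ti, tj) / (P(ti) * P(tj)) ), estimated per document.

    docs is a non-empty list of word sets, one set per standard text
    data file (an assumed representation).
    """
    n = len(docs)
    p_i = sum(ti in d for d in docs) / n
    p_j = sum(tj in d for d in docs) / n
    p_ij = sum(ti in d and tj in d for d in docs) / n
    if p_ij == 0:
        # The two words never occur in the same file.
        return float("-inf")
    return math.log2(p_ij / (p_i * p_j))
```

With four files of which two contain both words and the other two contain neither, each probability is 0.5 and the result is log2(0.5 / 0.25) = 1.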
The descriptor determination and generation processor 222 combines position weighting with word frequency statistics as the basis for evaluating and extracting descriptors.
The weight function is constructed as follows:
For keywords, the weight is calculated as:
$$Weight_i = a \times TFIDF_i$$
Formula 2
For descriptors that are not keywords, the weight is calculated (using Formula 1) as:
$$Weight_i = \text{Mutual-information}(T) \times a \times TFIDF_i$$
Formula 3
In Formula 2 and Formula 3,
$$TFIDF_i = \frac{ft_i \times \log(N / n_i)}{\sum_j \left( ft_i \times \log(N / n_j) \right)^2}$$
Formula 4
Here $ft_i$ is the frequency with which the word $t_i$ occurs in all the standard text data files $d$ from the data initialization processor 212; $N$ is the number of all standard text data files from the data initialization processor 212; $n_i$ is the number of standard text data files that contain the word $t_i$; and $a$ is the weight of the position of the word $t_i$ (that is, which of four positions in the standard text data file the word occupies: title, abstract, keywords or body text).
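The weighting in Formulas 2 and 3 can be sketched as below. Two things in the sketch are assumptions rather than statements of the text. First, the TF-IDF score uses the common cosine-normalised form (the frequency inside the sum indexed by j, and a square root over the sum), which is one reading of Formula 4 and may not be the intended one. Second, the position weights in `POSITION_WEIGHT` are made-up values, since the text gives no numbers for a.

```python
import math

# Hypothetical position weights a; the text does not give values.
POSITION_WEIGHT = {"title": 1.0, "keyword": 0.8, "abstract": 0.6, "body": 0.4}


def tfidf(freq, doc_freq, n_docs):
    """Cosine-normalised TF-IDF for every word (assumed reading).

    freq[w] is the frequency of w, doc_freq[w] the number of files
    containing w, n_docs the total number of files.
    """
    raw = {w: freq[w] * math.log(n_docs / doc_freq[w]) for w in freq}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: (v / norm if norm else 0.0) for w, v in raw.items()}


def weight(word, position, scores, mi=None):
    """Formula 2 when mi is None (keyword), Formula 3 otherwise."""
    base = POSITION_WEIGHT[position] * scores[word]
    return base if mi is None else mi * base
```

The mutual information passed as `mi` would come from Formula 1 for the candidate descriptor in question.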
In one embodiment of the descriptor relation identification and extraction device 23, with reference to Fig. 1, the descriptor relation identification and extraction device 23 comprises the following components. A descriptor associative relation identification and extraction processor 231 is communicatively connected to the data initialization processor 212 of the data processor 21 and to the descriptor determination and generation processor 222 of the descriptor identification and extraction device 22. It receives the segmented and part-of-speech-tagged standard text data files output by the data initialization processor 212 and the selected descriptor set output by the descriptor determination and generation processor 222. Based on co-occurrence probability statistics of the descriptors of the selected descriptor set in those files, it identifies and extracts the associative relations of each descriptor and outputs them. A descriptor genus-species relation identification and extraction processor 232 is communicatively connected to the same two processors 212 and 222 and receives the same inputs. Based on whether descriptors in the selected descriptor set contain one another in their construction, and on a computed measure of similarity between descriptors, it identifies and extracts the genus-species relations of each descriptor and outputs them. A descriptor relation result memory 233 is communicatively connected to the descriptor associative relation identification and extraction processor 231 and to the descriptor genus-species relation identification and extraction processor 232, and receives and stores the associative relations output by the former and the genus-species relations output by the latter.
The descriptor associative relation identification and extraction processor 231 receives the segmented and part-of-speech-tagged standard text data files output by the data initialization processor 212 of the data processor 21 and the selected descriptor set output by the descriptor determination and generation processor 222 of the descriptor identification and extraction device 22. It identifies and extracts associative relations using a similarity measure based on the joint probability distribution, calculated as follows:
$$Similarity(A, B) = \frac{probability(A \cap B)}{probability(A \cup B)} = \frac{probability(A, B)}{probability(A, \overline{B}) + probability(A, B) + probability(\overline{A}, B)}$$
Formula 5
Here probability($A, B$) is the frequency with which word A and word B of the selected descriptor set occur together within the same window across all the standard text data files generated by the data initialization processor 212, a window being one position (title, abstract, keywords or body text) in a standard text data file; probability($A, \overline{B}$) is the frequency, across all the standard text data files, with which word A of the selected descriptor set occurs in a window and word B does not; and probability($\overline{A}, B$) is the frequency, across all the standard text data files, with which word B of the selected descriptor set occurs in a window and word A does not.
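Formula 5 amounts to counting, over all windows, how often the two descriptors occur together relative to how often at least one of them occurs. A minimal sketch follows, assuming each window (one position of one file) is represented as a set of descriptors; that representation and the function name are assumptions.

```python
def cooccurrence_similarity(windows, a, b):
    """Share of windows containing both a and b among those containing either.

    windows is a list of sets, one set per window (assumed representation).
    """
    both = sum(a in w and b in w for w in windows)
    only_a = sum(a in w and b not in w for w in windows)
    only_b = sum(b in w and a not in w for w in windows)
    union = both + only_a + only_b
    # If neither descriptor occurs anywhere, report no similarity.
    return both / union if union else 0.0
```

Because the same total number of windows would divide every term, raw counts can be used in place of frequencies without changing the ratio.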
The descriptor genus-species relation identification and extraction processor 232 likewise receives the segmented and part-of-speech-tagged standard text data files output by the data initialization processor 212 of the data processor 21 and the selected descriptor set output by the descriptor determination and generation processor 222 of the descriptor identification and extraction device 22. It obtains the genus-species relations of the descriptors using a measure of similarity between descriptors, calculated as follows:
$$Similarity = \left( \frac{sim\_number}{sub\_number} + \frac{sim\_number}{number} \right) / 2 \; dp \times \left( \frac{\sum q_1\, sim\_number(i)}{\sum sub\_number(i)} + \frac{\sum q_2\, sim\_number(i)}{\sum number(i)} \right) / 2$$
Formula 6
Here sim_number is the total number of identical words contained in both of two descriptors of the selected descriptor set (the contained descriptor and the containing descriptor); sub_number is the total number of words in the contained descriptor; number is the total number of words in the containing descriptor; $\sum q_1\, sim\_number(i)$ is the sum of the position weights, within the contained descriptor, of the words the two descriptors have in common (the position being whether the word is at the beginning, in the middle or at the end of the contained descriptor); $q\, sim\_number(i)$ is the position weight of the i-th word the two descriptors have in common; sim_number(i) is the position number (the value of i) of the i-th word within the set of words the two descriptors have in common; $q_1$ is the weight coefficient of the three positions (beginning, middle and end) in the contained descriptor; sub_number(i) is the position number, within the contained descriptor, of the i-th word the two descriptors have in common; $\sum q_2\, sim\_number(i)$ is the sum of the position weights, within the containing descriptor, of the words the two descriptors have in common (the position being whether the word is at the beginning, in the middle or at the end of the containing descriptor); $q_2$ is the weight coefficient of the three positions (beginning, middle and end) in the containing descriptor; number(i) is the position number, within the containing descriptor, of the i-th word the two descriptors have in common; dp is the position parameter, whose value is the ratio of the total number of words of the contained descriptor to the total number of words of the containing descriptor; and i = 1, 2, ..., sim_number.
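The structural part of this step can be illustrated with a deliberately partial sketch. It covers only the containment test between two descriptors and the word-overlap term, (sim_number / sub_number + sim_number / number) / 2. The position parameter dp and the position-weight factor of Formula 6 are left out, and the representation of a descriptor as a list of segmented words, like the function names, is an assumption.

```python
def is_contained(sub, full):
    """True if the word list `sub` occurs as a contiguous run inside `full`.

    A descriptor is not treated as contained in itself.
    """
    n = len(sub)
    if not 0 < n < len(full):
        return False
    return any(full[i:i + n] == sub for i in range(len(full) - n + 1))


def overlap_term(sub, full):
    """(sim_number / sub_number + sim_number / number) / 2.

    sim_number is counted over distinct shared words (a simplification).
    Both lists must be non-empty.
    """
    sim = len(set(sub) & set(full))
    return (sim / len(sub) + sim / len(full)) / 2
```

For a two-word descriptor contained in a four-word descriptor that shares both words, the overlap term is (2/2 + 2/4) / 2 = 0.75.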
In an embodiment of the thesaurus maker 24, with reference to Fig. 1, the thesaurus maker 24 comprises: a thesaurus generation processor 241, communicatively connected to the descriptor judgment and generation processor 222 of the descriptor identification and withdrawal device 22, and to the descriptor correlationship identification and extraction processor 231 and the descriptor genus-species relation identification and extraction processor 232 of the descriptor relation recognition and withdrawal device 23, which receives the selected descriptor set output by the descriptor judgment and generation processor 222, receives the descriptor correlationships and genus-species relations of each descriptor output respectively by the processors 231 and 232, and, based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, combines and sorts the descriptors and the relations between descriptors to generate and output the thesaurus; and a thesaurus result memory 242, communicatively connected to the thesaurus generation processor 241, which receives and stores the thesaurus output by the thesaurus generation processor 241.
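The combining and sorting step performed by the thesaurus generation processor 241 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `build_thesaurus`, the entry layout with "broader", "narrower", and "related" fields, and the plain code-point sort order are all hypothetical, and the specific ordering and notation rules of GB/T 13190-91 are not modelled.

```python
def build_thesaurus(descriptors, related_pairs, genus_species_pairs):
    """Combine descriptors and their relations into sorted entries.

    related_pairs: iterable of (a, b) correlationships, treated as symmetric.
    genus_species_pairs: iterable of (genus, species) pairs.
    """
    entries = {
        d: {"broader": set(), "narrower": set(), "related": set()}
        for d in descriptors
    }
    for a, b in related_pairs:
        # Relations that mention an unselected descriptor are ignored.
        if a in entries and b in entries:
            entries[a]["related"].add(b)
            entries[b]["related"].add(a)
    for genus, species in genus_species_pairs:
        if genus in entries and species in entries:
            entries[genus]["narrower"].add(species)
            entries[species]["broader"].add(genus)
    # Code-point ordering is a placeholder for the standard's sort rules.
    return [
        (d, {field: sorted(values) for field, values in entries[d].items()})
        for d in sorted(entries)
    ]
```

Each returned item pairs a descriptor with its sorted relation lists, so a caller could write the entries to a file or display them in order without further processing.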
In an embodiment of the Chinese thesaurus constructing system according to the present invention, with reference to Fig. 1, the Chinese thesaurus constructing system may further comprise: a verification and modifier 5, communicatively connected to the data processor storer 213, the descriptor result memory 223, the descriptor relational result storer 233, and the thesaurus result memory 242, for manually verifying, modifying, and deleting the standard text data file stored in the data processor storer 213, the selected descriptor set stored in the descriptor result memory 223, the descriptor correlationships and genus-species relations stored in the descriptor relational result storer 233, and the thesaurus stored in the thesaurus result memory 242. With the verification and modifier 5, the user can at any time open whichever stored content needs to be checked, modified, or deleted, and perform the corresponding operation.
In an embodiment of the storer 3, the storer 3 may be selected from a hard disk, a USB flash drive, a portable hard drive, and a memory card.
In an embodiment of the Chinese thesaurus constructing system according to the present invention, the Chinese thesaurus constructing system may further comprise: a visual operation interface (not shown), communicatively connected to the input equipment 1, the system processor 2, the storer 3, the output device 4, and the verification and modifier 5. Through the visual operation interface, the user can conveniently carry out the entire automatic thesaurus construction process.
The following further verifies the descriptor and descriptor relation identification and extraction of the present invention, using all of the technical features described with reference to Fig. 1.
The system of the present invention was applied to 2,426 Chinese patent documents in mechanical engineering and 2,783 Chinese journal articles on natural language processing technology in order to identify and extract descriptors and descriptor relations. After the steps described above were carried out, the following test results were obtained:
Table 1. Example test results for candidate descriptors and selected descriptors identified and extracted by the present invention
Table 2. Example test results for descriptors identified and extracted by the present invention
Table 3. Example test results for descriptor correlationships identified and extracted by the present invention
Table 4. Example test results for descriptor genus-species relations identified and extracted by the present invention

Claims (9)

1. A Chinese thesaurus constructing system, characterized in that it comprises:
Input equipment (1), which inputs the raw data file required to build the Chinese thesaurus and outputs the raw data file;
A system processor (2), comprising:
A data processor (21), communicatively connected to the input equipment (1), which receives the raw data file output by the input equipment (1), provides the storage address of the raw data file, and judges whether the received raw data file is standardized: if the received raw data file is a non-standardized file that does not meet the processing requirements of the data processor (21), the raw data file is converted to generate a standard text data file, the standard text data file is word-segmented and part-of-speech tagged, and the standard text data is output; if the received raw data file is a standardized file that meets the processing requirements of the data processor (21), the raw data file is directly word-segmented and part-of-speech tagged, and the standard text data is output;
A descriptor identification and withdrawal device (22), communicatively connected to the data processor (21), which receives the word-segmented and part-of-speech-tagged standard text data output by the data processor (21), identifies and extracts word combinations and descriptors based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, and generates and outputs the extracted descriptors, the extracted descriptors forming the selected descriptor set;
A descriptor relation recognition and withdrawal device (23), communicatively connected to the data processor (21) and the descriptor identification and withdrawal device (22), which receives the standard text data output by the data processor (21) and the selected descriptor set output by the descriptor identification and withdrawal device (22), identifies and extracts, based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, the descriptor correlationships and genus-species relations of each descriptor in the selected descriptor set, and outputs the descriptor correlationships and genus-species relations of each descriptor; and
A thesaurus maker (24), communicatively connected to the descriptor identification and withdrawal device (22) and the descriptor relation recognition and withdrawal device (23), which receives the selected descriptor set output by the descriptor identification and withdrawal device (22), receives the descriptor correlationships and genus-species relations of each descriptor output by the descriptor relation recognition and withdrawal device (23), and, based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, combines and sorts the descriptors and the relations between descriptors to generate and output the thesaurus;
A storer (3), communicatively connected to the data processor (21), the descriptor identification and withdrawal device (22), the descriptor relation recognition and withdrawal device (23), and the thesaurus maker (24) of the system processor (2), which stores the results output by each of the data processor (21), the descriptor identification and withdrawal device (22), the descriptor relation recognition and withdrawal device (23), and the thesaurus maker (24); and
An output device (4), communicatively connected to the data processor (21), the descriptor identification and withdrawal device (22), the descriptor relation recognition and withdrawal device (23), and the thesaurus maker (24) of the system processor (2), which receives and outputs the standard text data output by the data processor (21), the selected descriptor set output by the descriptor identification and withdrawal device (22), the descriptor correlationships and genus-species relations output by the descriptor relation recognition and withdrawal device (23), and the thesaurus output by the thesaurus maker (24).
2. The Chinese thesaurus constructing system according to claim 1, characterized in that:
The raw data file comprises text data files, XML files, and PDF files;
The standard text data file comprises text files and XML files.
3. The Chinese thesaurus constructing system according to claim 1, characterized in that the data processor (21) comprises:
A system initialization processor (211), communicatively connected to the input equipment (1), which receives the raw data file output by the input equipment (1), provides the storage address of the raw data file, and judges whether the received raw data file is standardized: if the received raw data file is a non-standardized file that does not meet the processing requirements of the data processor (21), the raw data file is converted to generate a standard text data file and the standard text data file is output; if the received raw data file is a standardized file that meets the processing requirements of the data processor (21), the raw data file is output directly as the standard text data file;
A data initialization processor (212), communicatively connected to the system initialization processor (211), which receives the standard text data file output by the system initialization processor (211), performs word segmentation and part-of-speech tagging on the standard text data file, and outputs the word-segmented and part-of-speech-tagged standard text data; and
A data processor storer (213), communicatively connected to the data initialization processor (212), which receives and stores the word-segmented and part-of-speech-tagged standard text data file output by the data initialization processor (212).
4. The Chinese thesaurus constructing system according to claim 3, characterized in that the descriptor identification and withdrawal device (22) comprises:
A candidate descriptor judgment and generation processor (221), communicatively connected to the data initialization processor (212) of the data processor (21), which receives the word-segmented and part-of-speech-tagged standard text data file output by the data initialization processor (212) of the data processor (21), identifies and extracts candidate descriptors from the received file based on linguistic rules and mutual-information statistics, and generates and outputs the candidate descriptor set;
A descriptor judgment and generation processor (222), communicatively connected to the candidate descriptor judgment and generation processor (221), which receives the candidate descriptor set output by the candidate descriptor judgment and generation processor (221), judges and extracts descriptors from the candidate descriptors in the received set based on position weighting and word-frequency statistics, and generates and outputs the selected descriptor set; and
A descriptor result memory (223), communicatively connected to the descriptor judgment and generation processor (222), which receives and stores the selected descriptor set output by the descriptor judgment and generation processor (222).
5. The Chinese thesaurus constructing system according to claim 4, characterized in that the descriptor relation recognition and withdrawal device (23) comprises:
A descriptor correlationship identification and extraction processor (231), communicatively connected to the data initialization processor (212) of the data processor (21) and to the descriptor judgment and generation processor (222) of the descriptor identification and withdrawal device (22), which receives the word-segmented and part-of-speech-tagged standard text data file output by the data initialization processor (212) and the selected descriptor set output by the descriptor judgment and generation processor (222), identifies and extracts the descriptor correlationships of each descriptor based on the co-occurrence probability statistics of the descriptors in the selected descriptor set within the word-segmented and part-of-speech-tagged standard text data file, and outputs the extracted descriptor correlationships;
A descriptor genus-species relation identification and extraction processor (232), communicatively connected to the data initialization processor (212) of the data processor (21) and to the descriptor judgment and generation processor (222) of the descriptor identification and withdrawal device (22), which receives the word-segmented and part-of-speech-tagged standard text data output by the data initialization processor (212) and the selected descriptor set output by the descriptor judgment and generation processor (222), identifies and extracts descriptor genus-species relations based on the inclusion relations that the descriptors in the selected descriptor set exhibit in their structural form and on a computed metric of similarity between descriptors, and outputs the extracted descriptor genus-species relations; and
A descriptor relational result storer (233), communicatively connected to the descriptor correlationship identification and extraction processor (231) and the descriptor genus-species relation identification and extraction processor (232), which receives and stores the descriptor correlationships output by the descriptor correlationship identification and extraction processor (231) and the descriptor genus-species relations output by the descriptor genus-species relation identification and extraction processor (232).
6. The Chinese thesaurus constructing system according to claim 5, characterized in that the thesaurus maker (24) comprises:
A thesaurus generation processor (241), communicatively connected to the descriptor judgment and generation processor (222) of the descriptor identification and withdrawal device (22), and to the descriptor correlationship identification and extraction processor (231) and the descriptor genus-species relation identification and extraction processor (232) of the descriptor relation recognition and withdrawal device (23), which receives the selected descriptor set output by the descriptor judgment and generation processor (222), receives the descriptor correlationships and genus-species relations of each descriptor output respectively by the processors (231) and (232), and, based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, combines and sorts the descriptors and the relations between descriptors to generate and output the thesaurus; and
A thesaurus result memory (242), communicatively connected to the thesaurus generation processor (241), which receives and stores the thesaurus output by the thesaurus generation processor (241).
7. The Chinese thesaurus constructing system according to claim 6, characterized in that the Chinese thesaurus constructing system further comprises:
A verification and modifier (5), communicatively connected to the data processor storer (213), the descriptor result memory (223), the descriptor relational result storer (233), and the thesaurus result memory (242), for manually verifying, modifying, and deleting the standard text data file stored in the data processor storer (213), the selected descriptor set stored in the descriptor result memory (223), the descriptor correlationships and genus-species relations stored in the descriptor relational result storer (233), and the thesaurus stored in the thesaurus result memory (242).
8. The Chinese thesaurus constructing system according to claim 1, characterized in that the storer (3) is selected from a hard disk, a USB flash drive, a portable hard drive, and a memory card.
9. The Chinese thesaurus constructing system according to claim 1, characterized in that the Chinese thesaurus constructing system further comprises:
A visual operation interface, communicatively connected to the input equipment (1), the system processor (2), the storer (3), the output device (4), and the verification and modifier (5).
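The two statistical steps named in claims 4 and 5 (mutual-information statistics for candidate descriptors, and co-occurrence probability statistics for descriptor correlationships) can be sketched in Python as follows. The claims do not give formulas, so this is one common reading rather than the patented computation: pointwise mutual information over adjacent segmented tokens, and document-level co-occurrence frequency. The function names, the threshold value, and the choice of a document as the co-occurrence window are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_candidates(sentences, threshold=1.0):
    """Return adjacent token pairs whose pointwise mutual information
    exceeds the (assumed) threshold. Each sentence is a list of tokens."""
    unigrams = Counter(tok for s in sentences for tok in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    result = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi > threshold:
            result[(a, b)] = pmi
    return result


def cooccurrence_probability(documents, descriptors):
    """Fraction of documents in which each pair of descriptors co-occurs.
    Each document is any container that supports the `in` test."""
    counts = Counter()
    for doc in documents:
        present = sorted(d for d in descriptors if d in doc)
        counts.update(combinations(present, 2))
    return {pair: c / len(documents) for pair, c in counts.items()}
```

Pairs with high mutual information would be candidates for merging into a single candidate descriptor, and descriptor pairs with a high co-occurrence fraction would be candidates for a correlationship. Where to set either cutoff is left open here.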
CN201410359650.5A 2014-07-25 2014-07-25 Chinese thesaurus constructing system Expired - Fee Related CN104102847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410359650.5A CN104102847B (en) 2014-07-25 2014-07-25 Chinese thesaurus constructing system

Publications (2)

Publication Number Publication Date
CN104102847A true CN104102847A (en) 2014-10-15
CN104102847B CN104102847B (en) 2017-11-10

Family

ID=51670992

Country Status (1)

Country Link
CN (1) CN104102847B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204620A (en) * 2021-05-12 2021-08-03 首都师范大学 Method, system, equipment and computer storage medium for automatically constructing narrative table

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090177668A1 (en) * 2008-01-08 2009-07-09 International Business Machines Corporation Term-Driven Records File Plan and Thesaurus Design
CN102087669A (en) * 2011-03-11 2011-06-08 北京汇智卓成科技有限公司 Intelligent search engine system based on semantic association
CN102243649A (en) * 2011-06-07 2011-11-16 上海交通大学 Semi-automatic information extraction processing device of ontology
CN102930022A (en) * 2012-10-31 2013-02-13 中国运载火箭技术研究院 User-oriented information search engine system and method
CN102982095A (en) * 2012-10-31 2013-03-20 中国运载火箭技术研究院 Noumenon automatic generating system and method thereof based on thesaurus
WO2013166949A1 (en) * 2012-05-08 2013-11-14 Shenzhen Shi Ji Guang Su Information Technology Co., Ltd. System, apparatus and method for recommending thesaurus in input method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
曾文等: "网络化数字化时代主题词表自动构建技术的探究与实践", 《国家图书馆学刊》 *

Also Published As

Publication number Publication date
CN104102847B (en) 2017-11-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171110

CF01 Termination of patent right due to non-payment of annual fee