CN104102847B - Chinese thesaurus constructing system - Google Patents
- Publication number: CN104102847B
- Application number: CN201410359650.5A
- Authority: CN (China)
- Prior art keywords: descriptor, processor, word, data file, relation
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a Chinese thesaurus construction system comprising an input device, a system processor, a memory, and an output device. The system processor comprises a data processor, a descriptor identification and extraction device, a descriptor relation identification and extraction device, and a thesaurus generator. The memory is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device, and the thesaurus generator of the system processor. The output device is communicatively connected to the system processor. The system thereby overcomes the shortcomings of the existing manual method, saves manpower and material resources, improves the efficiency of building a Chinese thesaurus, and allows a Chinese thesaurus to be built dynamically, updated, and maintained conveniently, quickly, and at low cost. It can ensure the quality of the descriptors that are built and can support thesaurus construction or information extraction for Chinese thesauri in all fields. It benefits the organization and use of information in the library and archives management fields and can serve digital libraries.
Description
Technical field
The present invention relates to data processing technology, and in particular to a Chinese thesaurus construction system.
Background art
A thesaurus is a standardized, dynamic vocabulary that shows the semantic relations between descriptors and between descriptor terms. It contains many terms from a specific subject field that are related semantically and hierarchically. From a functional point of view, a thesaurus is the bridge between the thinking of document indexers and that of searchers: it is a term control tool for converting between natural language (the language used in documents) and system language (the normalized language of the retrieval system), and it is also a medium of communication between people and the system.
With the rapid development of science and technology and the growing popularity of network information services, the traditional method of building a thesaurus manually is time-consuming and costly. The greatest shortcoming of a manually built thesaurus is that it cannot solve the "knowledge acquisition bottleneck" of the compiling experts themselves, and it also hinders timely updating and maintenance of the thesaurus. When a manually built thesaurus is applied in a networked, digital environment, its slow updating leaves the vocabulary deficient in timeliness and in the scale and quality of its descriptor terms. This makes it difficult to use and promote among the various types of users in a digital, networked environment; that is, it cannot meet the needs of professionals and retrieval users in the library and archives management fields. In addition, the digitized literature data of the library and archives management fields grows every year on a massive scale. The added literature continually updates and extends existing terminology, and the literature produced by the emergence of new fields and new technologies brings an endless stream of new terms. Existing thesauri therefore have to be revised and updated, and new thesauri have to be built for emerging technical fields or specialties. Building thesauri is currently the consensus of the library and archives management community at home and abroad; see Robert M. Losee, Decisions in Thesaurus Construction and Use, Information Processing & Management, 2007 (4): 958-968. How to build a Chinese thesaurus efficiently and quickly is a practical need that urgently has to be addressed in the library and archives management fields.
Judging from published literature and practical applications, no Chinese thesaurus construction system or device has yet been reported. At present, domestic research on thesaurus generation technology is scarce. Examples are: Du Huiping, He Lin, Hou Hanqing, Automatic construction of a natural-language thesaurus based on cluster analysis, Journal of the National Library, 2007, 3: 44-49; Xu Ruifang, Li Xiaowen, Hou Hanqing, Comparative study of automated generalization rules between thesaurus words, Information Science, 2009 (1): 89-93; Yuan Xu, Chang Chun, Research on methods of acquiring related relations for thesaurus construction, Information Science, 2013, 31 (1): 68-72. These documents are confined to local research on a particular stage of the thesaurus generation process and do not amount to systematic development in the full sense. Another document (Liu Hua, Shen Yulan, Zeng Jianxun, Comparative study of the national standards for thesaurus compilation of China, the United States and Britain, Library and Information Service, 2009, 53 (22): 72-75) is a follow-up report on the state of thesaurus research and compilation abroad. Two further documents (Liu Wei, Zhou Jie, Research on the concurrency mechanism in a thesaurus compilation system in a network environment, Library and Information Service, 2011, 55 (22): 11-14; Zhao Jianhua, Zhao Jianguo et al., Development of a microcomputer compilation and management system for Chinese thesauri, Information Journal, 1995: 184-193) are in essence techniques for computer-aided manual entry, compilation, and maintenance of a thesaurus. They use computer database technology to assist in editing and processing the vocabulary, implementing the structure of the vocabulary and basic editing functions, and do not implement a technique for building the content of the thesaurus itself. Research abroad on thesaurus construction technology is relatively mature, and related work began as early as the 1970s. However, because of the inherent differences in expression between languages, it is not advisable to copy foreign thesaurus construction techniques and methods wholesale. Research and development on building Chinese thesauri is therefore work of practical significance.
Summary of the invention
In view of the deficiencies described in the background art, one object of the present invention is to provide a Chinese thesaurus construction system that overcomes the shortcomings of the existing manual method, saves manpower and material resources, improves the efficiency of building a Chinese thesaurus, and allows a Chinese thesaurus to be built dynamically, updated, and maintained conveniently, quickly, and at low cost.
Another object of the present invention is to provide a Chinese thesaurus construction system that, compared with building a Chinese thesaurus manually, better ensures the quality of the thesaurus that is built and can support thesaurus construction or information extraction for all Chinese thesauri based on digital literature.
Another object of the present invention is to benefit the organization and use of information in the library and archives management fields and to serve digital libraries.
To achieve the above objects, the invention provides a Chinese thesaurus construction system comprising an input device, a system processor, a memory, and an output device.
The input device inputs the raw data files needed to build the Chinese thesaurus and outputs those raw data files.
The system processor comprises the following components.
A data processor, communicatively connected to the input device. It receives the raw data files output by the input device, provides the storage address of the raw data files, and judges whether each received raw data file is normalized. If a received raw data file is non-normalized, that is, not in a form the data processor can process, the data processor converts it into a normalized text data file, performs word segmentation and part-of-speech tagging on that file, and outputs the normalized text data. If a received raw data file is already normalized, the data processor performs word segmentation and part-of-speech tagging on it directly and outputs the normalized text data.
A descriptor identification and extraction device, communicatively connected to the data processor. It receives the segmented and part-of-speech-tagged normalized text data output by the data processor, performs word combination and the identification and extraction of descriptors based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, and generates and outputs the extracted descriptors, which form the selected descriptor set.
A descriptor relation identification and extraction device, communicatively connected to the data processor and to the descriptor identification and extraction device. It receives the normalized text data output by the data processor and the selected descriptor set output by the descriptor identification and extraction device. Based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, it identifies and extracts the associative relations and hierarchical (genus-species) relations of each descriptor in the selected descriptor set, and outputs those relations.
A thesaurus generator, communicatively connected to the descriptor identification and extraction device and to the descriptor relation identification and extraction device. It receives the selected descriptor set output by the descriptor identification and extraction device and the associative and hierarchical relations of each descriptor output by the descriptor relation identification and extraction device. Based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, it combines and sorts the descriptors and the relations between them to generate and output the thesaurus.
The memory is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device, and the thesaurus generator of the system processor, and stores the results that each of them outputs.
The output device is communicatively connected to the data processor, the descriptor identification and extraction device, the descriptor relation identification and extraction device, and the thesaurus generator of the system processor. It receives and outputs the normalized text data output by the data processor, the selected descriptor set output by the descriptor identification and extraction device, the associative and hierarchical relations output by the descriptor relation identification and extraction device, and the thesaurus output by the thesaurus generator.
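The data flow among the four components of the system processor and the memory can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Memory`, `build_thesaurus`, and the four stage callables are hypothetical, and each stage is supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Memory:
    """Hypothetical stand-in for the memory: keeps each component's output."""
    results: Dict[str, Any] = field(default_factory=dict)


def build_thesaurus(
    raw_files: List[Any],
    process_data: Callable,         # data processor: raw files -> tagged text
    extract_descriptors: Callable,  # tagged text -> selected descriptor set
    extract_relations: Callable,    # (tagged text, descriptors) -> relations
    generate_thesaurus: Callable,   # (descriptors, relations) -> thesaurus
    memory: Memory,
):
    """Run the four stages in order and store every intermediate result."""
    text = process_data(raw_files)
    memory.results["normalized_text"] = text

    selected = extract_descriptors(text)
    memory.results["selected_descriptors"] = selected

    # The relation stage needs both the tagged text and the descriptor set.
    relations = extract_relations(text, selected)
    memory.results["relations"] = relations

    thesaurus = generate_thesaurus(selected, relations)
    memory.results["thesaurus"] = thesaurus
    return thesaurus
```

Passing the stages in as callables mirrors the description, in which each component only consumes the outputs of the components it is connected to.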
The beneficial effects of the present invention are as follows:
The Chinese thesaurus construction system provided by the invention overcomes the shortcomings of the existing manual method, saves manpower and material resources, improves the efficiency of building a Chinese thesaurus, and allows a Chinese thesaurus to be built dynamically, updated, and maintained conveniently, quickly, and at low cost.
The Chinese thesaurus construction system provided by the invention can ensure the quality of the Chinese thesaurus that is built and can support thesaurus construction or information extraction for Chinese thesauri in all fields.
The Chinese thesaurus construction system provided by the invention benefits the organization and use of information in the library and archives management fields and can serve digital libraries.
Brief description of the drawings
Fig. 1 is a block diagram of the Chinese thesaurus construction system according to the present invention.
The reference numerals are as follows:
1 input device
2 system processor
21 data processor
211 system initialization processor
212 data initialization processor
213 data processor memory
22 descriptor identification and extraction device
221 candidate descriptor judgment and generation processor
222 descriptor judgment and generation processor
223 descriptor result memory
23 descriptor relation identification and extraction device
231 descriptor associative relation identification and extraction processor
232 descriptor hierarchical relation identification and extraction processor
233 descriptor relation result memory
24 thesaurus generator
241 thesaurus generation processor
242 thesaurus result memory
3 memory
4 output device
5 verification and modification device
Detailed description of the embodiments
The Chinese thesaurus construction system according to the present invention is described in detail below with reference to the accompanying drawing.
Referring to Fig. 1, the Chinese thesaurus construction system according to the present invention comprises an input device 1, a system processor 2, a memory 3, and an output device 4.
The input device 1 inputs the raw data files needed to build the Chinese thesaurus and outputs those raw data files.
The system processor 2 comprises the following components.
A data processor 21, communicatively connected to the input device 1. It receives the raw data files output by the input device 1 (at least one raw data file), provides the storage address of the raw data files, and judges whether each received raw data file is normalized. If a received raw data file is non-normalized, that is, not in a form the data processor 21 can process, the data processor 21 converts it into a normalized text data file, performs word segmentation and part-of-speech tagging on that file, and outputs the normalized text data. If a received raw data file is already normalized, the data processor 21 performs word segmentation and part-of-speech tagging on it directly and outputs the normalized text data.
A descriptor identification and extraction device 22, communicatively connected to the data processor 21. It receives the segmented and part-of-speech-tagged normalized text data output by the data processor 21, performs word combination and the identification and extraction of descriptors based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, and generates and outputs the extracted descriptors, which form the selected descriptor set.
A descriptor relation identification and extraction device 23, communicatively connected to the data processor 21 and to the descriptor identification and extraction device 22. It receives the normalized text data output by the data processor 21 and the selected descriptor set output by the descriptor identification and extraction device 22. Based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, it identifies and extracts the associative relations and hierarchical (genus-species) relations of each descriptor in the selected descriptor set, and outputs those relations.
A thesaurus generator 24, communicatively connected to the descriptor identification and extraction device 22 and to the descriptor relation identification and extraction device 23. It receives the selected descriptor set output by the descriptor identification and extraction device 22 and the associative and hierarchical relations of each descriptor output by the descriptor relation identification and extraction device 23. Based on the Chinese thesaurus compilation rules of national standard GB/T 13190-91, it combines and sorts the descriptors and the relations between them to generate and output the thesaurus.
The memory 3 is communicatively connected to the data processor 21, the descriptor identification and extraction device 22, the descriptor relation identification and extraction device 23, and the thesaurus generator 24 of the system processor 2, and stores the results that each of them outputs.
The output device 4 is communicatively connected to the data processor 21, the descriptor identification and extraction device 22, the descriptor relation identification and extraction device 23, and the thesaurus generator 24 of the system processor 2. It receives and outputs the normalized text data output by the data processor 21, the selected descriptor set output by the descriptor identification and extraction device 22, the associative and hierarchical relations output by the descriptor relation identification and extraction device 23, and the thesaurus output by the thesaurus generator 24.
In the Chinese thesaurus construction system according to the present invention, the raw data files include text data files, XML files, and PDF files, and the normalized text data files include text files and XML files.
In one embodiment of the data processor 21, referring to Fig. 1, the data processor 21 comprises the following components.
A system initialization processor 211, communicatively connected to the input device 1. It receives the raw data files output by the input device 1, provides the storage address of the raw data files, and judges whether each received raw data file is normalized. If a received raw data file is non-normalized, that is, not in a form the data processor 21 can process, it converts the file into a normalized text data file and outputs that file. If a received raw data file is already normalized, it outputs the file directly as a normalized text data file.
A data initialization processor 212, communicatively connected to the system initialization processor 211. It receives the normalized text data files output by the system initialization processor 211, performs word segmentation and part-of-speech tagging on them, and outputs the segmented and tagged normalized text data.
A data processor memory 213, communicatively connected to the data initialization processor 212. It receives and stores the segmented and part-of-speech-tagged normalized text data files output by the data initialization processor 212.
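The normalization judgment followed by segmentation and tagging can be sketched as below. The sketch assumes that a file's format can be judged from its suffix, with text and XML treated as normalized and anything else (for example PDF) converted first. The names `is_normalized` and `data_processor`, and the caller-supplied `read`, `convert`, and `segment_and_tag` callables, are hypothetical; the description does not name a converter or a segmentation tool.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Assumption: text and XML files count as normalized, as stated for the
# normalized text data files; every other suffix needs conversion.
NORMALIZED_SUFFIXES = {".txt", ".xml"}


def is_normalized(path: str) -> bool:
    """Normalization judgment, here reduced to a suffix check."""
    return Path(path).suffix.lower() in NORMALIZED_SUFFIXES


def data_processor(
    paths: Iterable[str],
    read: Callable[[str], str],        # reads a normalized file as text
    convert: Callable[[str], str],     # converts a non-normalized file to text
    segment_and_tag: Callable[[str], List[tuple]],  # text -> [(word, pos), ...]
) -> List[List[tuple]]:
    """Return segmented, part-of-speech-tagged text for every raw file."""
    tagged = []
    for path in paths:
        # Normalized files are used directly; others are converted first.
        text = read(path) if is_normalized(path) else convert(path)
        tagged.append(segment_and_tag(text))
    return tagged
```

In practice `segment_and_tag` would wrap a Chinese word segmenter with part-of-speech tagging; which one is used is left open here.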
In one embodiment of the descriptor identification and extraction device 22, referring to Fig. 1, the descriptor identification and extraction device 22 comprises the following components.
A candidate descriptor judgment and generation processor 221, communicatively connected to the data initialization processor 212 of the data processor 21. It receives the segmented and part-of-speech-tagged normalized text data files output by the data initialization processor 212 and, based on language rules and mutual information statistics, identifies and extracts candidate descriptors from them, generating and outputting a candidate descriptor set.
A descriptor judgment and generation processor 222, communicatively connected to the candidate descriptor judgment and generation processor 221. It receives the candidate descriptor set output by the candidate descriptor judgment and generation processor 221 and, based on position weighting and word-frequency statistics, judges and extracts descriptor terms from the candidate descriptors in the set, generating and outputting the selected descriptor set.
A descriptor result memory 223, communicatively connected to the descriptor judgment and generation processor 222. It receives and stores the selected descriptor set output by the descriptor judgment and generation processor 222.
The normalized text data files output by the data initialization processor 212 contain term data that has undergone word segmentation and part-of-speech tagging and consists of word strings. The candidate descriptor judgment and generation processor 221 obtains the candidate descriptor set by applying the following language rules:
A candidate descriptor contains at least one verb, noun, or nominal component;
The last word of a candidate descriptor is a verb, noun, or nominal component;
The first word of a candidate descriptor is not a preposition or a measure word;
A candidate descriptor contains no conjunction, pronoun, or modal particle;
Word-string elements whose length is between 2 and 8 are extracted to form candidate descriptors;
Candidate descriptor phrase strings whose part-of-speech composition after segmentation is noun + noun, adjective + noun, verb + noun, or noun + verb are extracted, with a maximum phrase-string length of 8.
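The language rules above can be sketched as a filter over a segmented, tagged word string. Several points are assumptions made for illustration: the single-letter tag names (`n`, `v`, `a`, `p`, `q`, `c`, `r`, `y`) are hypothetical, length is counted in characters, and the part-of-speech patterns are enforced only for two-word phrases. The function name `is_candidate` is likewise hypothetical.

```python
from typing import List, Tuple

# Hypothetical tag names: n noun, v verb, a adjective, p preposition,
# q measure word, c conjunction, r pronoun, y modal particle.
NOMINAL_OR_VERB = {"n", "v"}
BAD_FIRST = {"p", "q"}
FORBIDDEN = {"c", "r", "y"}
PHRASE_PATTERNS = {("n", "n"), ("a", "n"), ("v", "n"), ("n", "v")}


def is_candidate(tagged: List[Tuple[str, str]]) -> bool:
    """Apply the six language rules to one tagged word string."""
    if not tagged:
        return False
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]

    # Length rule (assumed to count characters): between 2 and 8.
    if not 2 <= sum(len(w) for w in words) <= 8:
        return False
    # At least one verb or noun, and the last word is a verb or noun.
    if not any(t in NOMINAL_OR_VERB for t in tags):
        return False
    if tags[-1] not in NOMINAL_OR_VERB:
        return False
    # First word is not a preposition or measure word.
    if tags[0] in BAD_FIRST:
        return False
    # No conjunction, pronoun, or modal particle anywhere.
    if any(t in FORBIDDEN for t in tags):
        return False
    # Two-word phrases must match one of the allowed tag patterns.
    if len(tags) == 2 and tuple(tags) not in PHRASE_PATTERNS:
        return False
    return True
```

For example, `is_candidate([("数字", "n"), ("图书馆", "n")])` passes every rule, while a string that starts with a preposition or contains a conjunction is rejected.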
To improve the quality of the generated candidate descriptors, a candidate descriptor is regarded as composed of several candidate words, that is, as a candidate descriptor phrase, and the candidate descriptor judgment and generation processor 221 uses mutual information statistics to obtain the selected candidate descriptor set. The mutual information formula is:
Formula 1
where the candidate descriptor T is a combination of the words (t_i, t_j); the word string t consists of t_1 t_2 ... t_n and is split into t_i = t_1 t_2 ... t_(n-r) and t_j = t_r t_(r+1) ... t_n; probability(t_i) is the probability that word t_i occurs on its own in all the normalized text data files of the data initialization processor 212; probability(t_j) is the probability that word t_j occurs on its own in all the normalized text data files of the data initialization processor 212; and probability(t_i, t_j) is the probability that t_i and t_j occur together in the same normalized text data file of the data initialization processor 212. If t_i and t_j are closely bound, probability(t_i, t_j) differs little from probability(t_i) or probability(t_j) (the specific difference can be set by the user), and the mutual information value that the formula gives for the strings t_i and t_j in candidate descriptor T is large. Conversely, if probability(t_i) and probability(t_j) are much larger than probability(t_i, t_j), the computed mutual information value is small. In other words, the larger Mutual-information(t_i, t_j) is, the more likely it is that t_i and t_j combine into a candidate descriptor.
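Formula 1 itself is not reproduced in this text. The sketch below assumes the standard pointwise mutual information form, log2 of probability(t_i, t_j) divided by the product of probability(t_i) and probability(t_j), which behaves as described: it grows when the two strings co-occur about as often as each occurs alone. Estimating each probability as the fraction of files that contain the string, and the function name `mutual_information`, are also assumptions.

```python
import math
from typing import List, Set


def mutual_information(docs: List[Set[str]], ti: str, tj: str) -> float:
    """Pointwise mutual information of ti and tj over a list of documents.

    Each document is the set of word strings it contains. Probabilities are
    estimated as document fractions (an assumption, not taken from Formula 1).
    """
    if not docs:
        raise ValueError("at least one document is required")
    n = len(docs)
    p_i = sum(ti in d for d in docs) / n
    p_j = sum(tj in d for d in docs) / n
    p_ij = sum(ti in d and tj in d for d in docs) / n
    if p_ij == 0:
        # The strings never co-occur: treat the association as minimal.
        return float("-inf")
    return math.log2(p_ij / (p_i * p_j))
```

A threshold on the returned value, chosen by the user, would then decide which pairs are kept as candidate descriptors.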
The descriptor judgment and generation processor 222 judges and extracts descriptor terms by a method that combines position weighting with word-frequency statistics.
The weight function is constructed as follows.
For a keyword, the weight is calculated as:
Weight_i = a × TFIDF_i (Formula 2)
For a non-keyword descriptor (with reference to Formula 1), the weight is calculated as:
Weight_i = Mutual-information(T) × a × TFIDF_i (Formula 3)
In Formula 2 and Formula 3, TFIDF_i is given by:
Formula 4
where ft_i is the frequency with which word t_i occurs in all the normalized text data files d from the data initialization processor 212; N is the number of all normalized text data files from the data initialization processor 212; n_i is the number of normalized text data files that contain word t_i; and a is the weight of the position of word t_i in a normalized text data file (the position being one of four: title, abstract, keywords, body).
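Formulas 2 and 3 can be sketched as below. Formula 4 is not reproduced in this text, so the sketch assumes the common form TFIDF_i = ft_i × log(N / n_i) with a natural logarithm. The position weights in `POSITION_WEIGHT` are hypothetical placeholder values, since the description does not give the values of a, and the function names are illustrative.

```python
import math
from typing import Optional

# Hypothetical position weights a; the description leaves the values open.
POSITION_WEIGHT = {"title": 3.0, "keywords": 2.5, "abstract": 2.0, "body": 1.0}


def tfidf(ft_i: float, n_docs: int, n_i: int) -> float:
    """Assumed form of Formula 4: term frequency times log(N / n_i)."""
    return ft_i * math.log(n_docs / n_i)


def weight(ft_i: float, n_docs: int, n_i: int, position: str,
           mutual_info: Optional[float] = None) -> float:
    """Formula 2 for keywords; Formula 3 when mutual_info is given."""
    a = POSITION_WEIGHT[position]
    w = a * tfidf(ft_i, n_docs, n_i)   # Formula 2: a x TFIDF_i
    if mutual_info is not None:
        w *= mutual_info               # Formula 3: MI(T) x a x TFIDF_i
    return w
```

A term that appears in every file gets a TFIDF of zero under this assumed form, so its weight is zero regardless of position.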
In one embodiment of the descriptor relation identification and extraction device 23, referring to Fig. 1, the descriptor relation identification and extraction device 23 comprises the following components.
A descriptor associative relation identification and extraction processor 231, communicatively connected to the data initialization processor 212 of the data processor 21 and to the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22. It receives the segmented and part-of-speech-tagged normalized text data files output by the data initialization processor 212 and the selected descriptor set output by the descriptor judgment and generation processor 222. Based on the co-occurrence probability statistics of the descriptors of the selected descriptor set in the segmented and tagged normalized text data files, it identifies and extracts the associative relations of each descriptor and outputs the extracted associative relations.
A descriptor hierarchical relation identification and extraction processor 232, communicatively connected to the data initialization processor 212 of the data processor 21 and to the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22. It receives the segmented and part-of-speech-tagged normalized text data files output by the data initialization processor 212 and the selected descriptor set output by the descriptor judgment and generation processor 222. It identifies and extracts hierarchical relations based on the inclusion relations in the composition of the descriptors in the selected descriptor set and on a computed measure of similarity between descriptors, and outputs the extracted hierarchical relations.
A descriptor relation result memory 233, communicatively connected to the descriptor associative relation identification and extraction processor 231 and to the descriptor hierarchical relation identification and extraction processor 232. It receives and stores the associative relations output by the processor 231 and the hierarchical relations output by the processor 232.
The descriptor associative relation identification and extraction processor 231 receives the segmented and part-of-speech-tagged normalized text data files output by the data initialization processor 212 of the data processor 21 and the selected descriptor set output by the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22. Associative relations are identified and extracted with a similarity calculation based on the joint probability distribution. The formula is:
Formula 5
The formula uses three quantities. probability(A, B) is the frequency with which words A and B of the selected descriptor set occur together in the same window across all the normalized text data files generated by the data initialization processor 212, a window being the same position in each normalized text data file (one of four positions: title, abstract, keywords, body). The second quantity, whose symbol is not reproduced in this text, is the frequency with which, across all the normalized text data files, word A of the selected descriptor set occurs in a window while word B does not. The third quantity, whose symbol is likewise not reproduced, is the frequency with which word B of the selected descriptor set occurs in a window while word A does not.
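Formula 5 is not reproduced in this text, so the following is only one plausible reading, labeled as an assumption: a Jaccard-style ratio of the co-occurrence frequency to the sum of the three frequencies defined above (both occur, only A occurs, only B occurs). Each window is modeled as the set of descriptors found at one position (title, abstract, keywords, or body) of one file. The function name `cooccurrence_similarity` is hypothetical.

```python
from typing import List, Set


def cooccurrence_similarity(windows: List[Set[str]], a: str, b: str) -> float:
    """Assumed Jaccard-style similarity from window co-occurrence counts.

    windows: one set of descriptors per (file, position) window.
    Raw counts are used; a common normalizer would cancel in the ratio.
    """
    both = sum(a in w and b in w for w in windows)        # A and B together
    only_a = sum(a in w and b not in w for w in windows)  # A without B
    only_b = sum(b in w and a not in w for w in windows)  # B without A
    total = both + only_a + only_b
    return both / total if total else 0.0
```

Under this reading the value lies between 0 and 1, and a user-chosen threshold would decide which descriptor pairs are recorded as related terms.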
The descriptor category-division relation identification and extraction processor 232 receives the word-segmented and part-of-speech-tagged specification text data files output by the data initialization processor 212 of the data processor 21, and the selected descriptor set output by the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22. The category-division relation between descriptors is obtained using a metric formula for the similarity between descriptors. The calculation formula is as follows:
Formula 6
where:
sim_number denotes the total number of identical words contained in both of two descriptors in the selected descriptor set (the contained descriptor and the descriptor to be included);
sub_number denotes the total number of words in the contained descriptor in the selected descriptor set;
number denotes the total number of words in the descriptor to be included in the selected descriptor set;
the first sum in the formula denotes the sum of the position weights of the identical words of the two descriptors within the contained descriptor, the position of a word being whichever of three positions (prefix, middle, or suffix) it occupies in the contained descriptor;
qsim_number(i) denotes the weight of the position of the i-th identical word of the two descriptors;
sim_number(i) denotes the position number of the i-th word within the set of identical words of the two descriptors, i.e. the value of i;
q1 denotes the weight coefficient of a word for the prefix, middle, and suffix positions in the contained descriptor;
sub_number(i) denotes the position number of the i-th identical word within the contained descriptor;
the second sum in the formula denotes the sum of the position weights of the identical words of the two descriptors within the descriptor to be included, the position of a word being whichever of three positions (prefix, middle, or suffix) it occupies in the descriptor to be included;
q2 denotes the weight coefficient of a word for the prefix, middle, and suffix positions in the descriptor to be included;
number(i) denotes the position number of the i-th identical word within the descriptor to be included;
dp denotes the position coefficient, whose value is the ratio of the total number of words in the contained descriptor to the total number of words in the descriptor to be included;
i = 1, 2, ..., sim_number.
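The input quantities of Formula 6 can be made concrete with a short Python sketch. Several assumptions apply: a "word" is taken to be a single character of a descriptor, position numbers are 1-based, the first and last characters are treated as prefix and suffix, and the function and key names are hypothetical. Formula 6 itself is not reproduced in this text, so the sketch stops at collecting its inputs and does not compute the final similarity.

```python
def position_class(index, length):
    """Classify a 0-based character index as prefix, middle, or suffix (assumed rule)."""
    if index == 0:
        return "prefix"
    if index == length - 1:
        return "suffix"
    return "middle"


def shared_word_features(contained, to_be_included):
    """Collect sim_number, sub_number, number, dp and per-word positions for two descriptors.

    Both descriptors are assumed to be non-empty strings.
    """
    seen = set()
    shared = []
    for i, ch in enumerate(contained):
        if ch in seen:
            continue  # count each identical character once (an assumption)
        j = to_be_included.find(ch)
        if j == -1:
            continue  # character does not appear in the other descriptor
        seen.add(ch)
        shared.append({
            "word": ch,
            "sub_number_i": i + 1,  # position number in the contained descriptor
            "sub_position": position_class(i, len(contained)),
            "number_i": j + 1,      # position number in the descriptor to be included
            "position": position_class(j, len(to_be_included)),
        })
    return {
        "sim_number": len(shared),
        "sub_number": len(contained),
        "number": len(to_be_included),
        "dp": len(contained) / len(to_be_included),
        "shared": shared,
    }


# Hypothetical pair: "发动机" (engine) and "柴油发动机" (diesel engine).
features = shared_word_features("发动机", "柴油发动机")
print(features["sim_number"], features["sub_number"], features["number"], features["dp"])
```

For this pair all three characters of the shorter descriptor are shared, and the last shared character sits in the suffix position of both descriptors, which is the kind of structural inclusion the category-division relation is meant to capture. The weight coefficients q1 and q2 are left out because their values are not given in the text.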
In an embodiment of the descriptor table generator 24, referring to Fig. 1, the descriptor table generator 24 includes: a thesaurus generation processor 241, communicatively coupled to the descriptor judgment and generation processor 222 of the descriptor identification and extraction device 22, and to the descriptor dependency relation identification and extraction processor 231 and the descriptor category-division relation identification and extraction processor 232 of the descriptor relation identification and extraction device 23. The thesaurus generation processor 241 receives the selected descriptor set output by the descriptor judgment and generation processor 222, receives the descriptor dependency relation and category-division relation of each descriptor output respectively by the processors 231 and 232, and combines and sorts the descriptors and the relations between them based on the compilation rules for Chinese thesauri of national standard GB 13190-91, to generate and output the thesaurus. The descriptor table generator 24 also includes a thesaurus result memory 242, communicatively coupled to the thesaurus generation processor 241, which receives and stores the thesaurus output by the thesaurus generation processor 241.
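The combine-and-sort step of the thesaurus generation processor 241 can be sketched as follows. This is only an illustration under stated assumptions: the compilation rules of GB 13190-91 are not reproduced here, the BT/NT/RT labels (broader, narrower, related term) are conventional thesaurus notation rather than something named in this text, sorting uses plain string order, and the function name `build_thesaurus` is hypothetical.

```python
def build_thesaurus(descriptors, dependency_pairs, category_pairs):
    """Combine descriptors with their relations and return entries in sorted order.

    dependency_pairs: iterable of (a, b) descriptor pairs that are related (symmetric).
    category_pairs:   iterable of (broader, narrower) descriptor pairs.
    Pairs that mention a descriptor outside the selected set are ignored.
    """
    entries = {d: {"BT": set(), "NT": set(), "RT": set()} for d in descriptors}
    for a, b in dependency_pairs:
        if a in entries and b in entries:
            entries[a]["RT"].add(b)
            entries[b]["RT"].add(a)
    for broader, narrower in category_pairs:
        if broader in entries and narrower in entries:
            entries[broader]["NT"].add(narrower)
            entries[narrower]["BT"].add(broader)
    # Sort both the entries and the terms listed under each relation.
    return {
        d: {label: sorted(terms) for label, terms in entries[d].items()}
        for d in sorted(entries)
    }


# Hypothetical input standing in for the outputs of processors 222, 231, and 232.
thesaurus = build_thesaurus(
    ["engine", "diesel engine", "fuel injection"],
    dependency_pairs=[("diesel engine", "fuel injection")],
    category_pairs=[("engine", "diesel engine")],
)
for term, relations in thesaurus.items():
    print(term, relations)
```

Keeping the relations in sets until the final step avoids duplicate entries when the same pair is reported more than once, and sorting at the end gives a stable output that can be stored as-is in a result memory.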
In an embodiment of the Chinese thesaurus constructing system according to the present invention, referring to Fig. 1, the Chinese thesaurus constructing system may further include: a verification and modifier 5, communicatively coupled to the data processor memory 213, the descriptor result memory 223, the descriptor relation result memory 233, and the thesaurus result memory 242, for manually checking, modifying, and deleting the specification text data files stored in the data processor memory 213, the selected descriptor set stored in the descriptor result memory 223, the descriptor dependency relations and descriptor category-division relations stored in the descriptor relation result memory 233, and the thesaurus stored in the thesaurus result memory 242. With the verification and modifier 5, the user can at any time open any of the stored content that needs to be checked, modified, or deleted, and perform the corresponding operation.
In an embodiment of the memory 3, the memory 3 may be selected from a hard disk, a USB flash drive, a mobile hard disk, and a memory card.
In an embodiment of the Chinese thesaurus constructing system according to the present invention, the Chinese thesaurus constructing system may further include: a visual operation interface (not shown), communicatively coupled to the input device 1, the system processor 2, the memory 3, the output device 4, and the verification and modifier 5. Through the visual operation interface, the user can conveniently carry out the entire automatic thesaurus construction process.
The identification and extraction of descriptors and descriptor relations according to the present invention are further verified below, with reference to all the technical features shown in Fig. 1.
Using the system of the present invention, descriptors and descriptor relations were identified and extracted from 2,426 Chinese patent documents in mechanical engineering and 2,783 Chinese journal articles on natural language processing technology. Following the steps described above, the test results obtained are as follows:
Table 1: Example test results for candidate descriptors and selected descriptors identified and extracted by the present invention
Table 2: Example test results for descriptors identified and extracted by the present invention
Table 3: Example test results for descriptor dependency relations identified and extracted by the present invention
Table 4: Example test results for descriptor category-division relations identified and extracted by the present invention
Claims (5)
- 1. A Chinese thesaurus constructing system, characterized by comprising:
an input device (1), which inputs the raw data files needed to build a Chinese thesaurus and outputs the raw data files, the raw data files including text data files, XML files, and PDF files;
a system processor (2), including:
a data processor (21), communicatively coupled to the input device (1), which receives the raw data files output by the input device (1), provides the storage address of the raw data files, and judges whether each received raw data file is normalized; if the received raw data file is a non-normalized raw data file that does not meet the processing requirements of the data processor (21), the raw data file is converted to generate a specification text data file, and the specification text data file is word-segmented and part-of-speech tagged and output as specification text data; if the received raw data file is a normalized raw data file that meets the processing requirements of the data processor (21), the raw data file is directly word-segmented and part-of-speech tagged and output as specification text data; the specification text data files including text files and XML files;
a descriptor identification and extraction device (22), communicatively coupled to the data processor (21), which receives the word-segmented and part-of-speech-tagged specification text data output by the data processor (21), performs word grouping and descriptor identification and extraction based on the compilation rules for Chinese thesauri of national standard GB 13190-91, and generates and outputs the extracted descriptors, the extracted descriptors forming a selected descriptor set;
a descriptor relation identification and extraction device (23), communicatively coupled to the data processor (21) and the descriptor identification and extraction device (22), which receives the specification text data output by the data processor (21) and the selected descriptor set output by the descriptor identification and extraction device (22), identifies and extracts the descriptor dependency relation and category-division relation of each descriptor in the selected descriptor set based on the compilation rules for Chinese thesauri of national standard GB 13190-91, and outputs the descriptor dependency relation and category-division relation of each descriptor; and
a descriptor table generator (24), communicatively coupled to the descriptor identification and extraction device (22) and the descriptor relation identification and extraction device (23), which receives the selected descriptor set output by the descriptor identification and extraction device (22), receives the descriptor dependency relation and category-division relation of each descriptor output by the descriptor relation identification and extraction device (23), and combines and sorts the descriptors and the relations between them based on the compilation rules for Chinese thesauri of national standard GB 13190-91, to generate and output a thesaurus;
a memory (3), communicatively coupled to the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the descriptor table generator (24) of the system processor (2), which stores the results output by each of the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the descriptor table generator (24); and
an output device (4), communicatively coupled to the data processor (21), the descriptor identification and extraction device (22), the descriptor relation identification and extraction device (23), and the descriptor table generator (24) of the system processor (2), which receives and outputs the specification text data output by the data processor (21), the selected descriptor set output by the descriptor identification and extraction device (22), the descriptor dependency relations and category-division relations output by the descriptor relation identification and extraction device (23), and the thesaurus output by the descriptor table generator (24);
the data processor (21) includes:
a system initialization processor (211), communicatively coupled to the input device (1), which receives the raw data files output by the input device (1), provides the storage address of the raw data files, and judges whether each received raw data file is normalized; if the received raw data file is a non-normalized raw data file that does not meet the processing requirements of the data processor (21), the raw data file is converted to generate a specification text data file and the specification text data file is output; if the received raw data file is a normalized raw data file that meets the processing requirements of the data processor (21), the raw data file is output directly as a specification text data file;
a data initialization processor (212), communicatively coupled to the system initialization processor (211), which receives the specification text data file output by the system initialization processor (211), performs word segmentation and part-of-speech tagging on the specification text data file, and outputs the specification text data after word segmentation and part-of-speech tagging; and
a data processor memory (213), communicatively coupled to the data initialization processor (212), which receives and stores the word-segmented and part-of-speech-tagged specification text data file output by the data initialization processor (212);
the descriptor identification and extraction device (22) includes:
a candidate descriptor judgment and generation processor (221), communicatively coupled to the data initialization processor (212) of the data processor (21), which receives the word-segmented and part-of-speech-tagged specification text data file output by the data initialization processor (212) of the data processor (21), identifies and extracts candidate descriptors from the received word-segmented and part-of-speech-tagged specification data file based on language rules and a mutual information statistic, and generates and outputs a candidate descriptor set;
a descriptor judgment and generation processor (222), communicatively coupled to the candidate descriptor judgment and generation processor (221), which receives the candidate descriptor set output by the candidate descriptor judgment and generation processor (221) and, based on position weighting and word frequency statistics, judges and extracts descriptors from the candidate descriptors in the received candidate descriptor set, to generate and output the selected descriptor set; and
a descriptor result memory (223), communicatively coupled to the descriptor judgment and generation processor (222), which receives and stores the selected descriptor set output by the descriptor judgment and generation processor (222);
wherein the content of the specification text data file output by the data initialization processor (212) is term data that has undergone word segmentation and part-of-speech tagging, the term data consists of word strings, and the candidate descriptor judgment and generation processor (221) computes the candidate descriptor set using the following language rules and mutual information statistic;
the language rules are:
a candidate descriptor contains at least one verb, noun, or nominal component;
the last word of a candidate descriptor is a verb, noun, or nominal component;
the first word of a candidate descriptor is not a preposition or a measure word;
a candidate descriptor contains no conjunction, pronoun, or modal particle;
word string elements with a length between 2 and 8 are extracted to form candidate descriptors;
candidate descriptor phrase strings whose part-of-speech composition after word segmentation is noun+noun, adjective+noun, verb+noun, or noun+verb are extracted, the maximum length of a phrase string being 8;
a candidate descriptor is considered to be composed of multiple candidate descriptor words, i.e. a candidate descriptor phrase, and the mutual information statistic used by the candidate descriptor judgment and generation processor (221) is calculated by the formula:
where the candidate descriptor T is the combination of words (t_i, t_j), the word t consists of t_1 t_2 ... t_n, t_i = t_1 t_2 ... t_(n-r), and t_j = t_r t_(r+1) ... t_n; probability(t_i) denotes the probability that word t_i occurs on its own in all specification text data files of the data initialization processor (212); probability(t_j) denotes the probability that word t_j occurs on its own in all specification text data files of the data initialization processor (212); probability(t_i, t_j) denotes the probability that words t_i and t_j appear together in the same specification text data file of the data initialization processor (212); if t_i and t_j are very closely bound, then probability(t_i, t_j) differs little from probability(t_i) or probability(t_j), the specific difference being determined by the user, and the mutual information value of words t_i and t_j in candidate descriptor T calculated by the formula is larger; conversely, if probability(t_i) and probability(t_j) are much larger than probability(t_i, t_j), the calculated mutual information value of t_i and t_j is smaller; that is, the larger the value of Mutual-information(t_i, t_j), the greater the probability that words t_i and t_j combine into a candidate descriptor;
the descriptor relation identification and extraction device (23) includes:
a descriptor dependency relation identification and extraction processor (231), communicatively coupled to the data initialization processor (212) of the data processor (21) and to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), which receives the word-segmented and part-of-speech-tagged specification text data file output by the data initialization processor (212) of the data processor (21) and the selected descriptor set output by the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), identifies and extracts the descriptor dependency relation of each descriptor based on the co-occurrence probability statistics of the descriptors in the selected descriptor set within the word-segmented and part-of-speech-tagged specification text data files, and outputs the extracted descriptor dependency relations;
a descriptor category-division relation identification and extraction processor (232), communicatively coupled to the data initialization processor (212) of the data processor (21) and to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), which receives the word-segmented and part-of-speech-tagged specification text data output by the data initialization processor (212) of the data processor (21) and the selected descriptor set output by the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), identifies and extracts descriptor category-division relations based on the inclusion relation that the descriptors in the selected descriptor set have in their construction form and on a calculated metric of the similarity between descriptors, and outputs the extracted descriptor category-division relations; and
a descriptor relation result memory (233), communicatively coupled to the descriptor dependency relation identification and extraction processor (231) and the descriptor category-division relation identification and extraction processor (232), which receives and stores the descriptor dependency relations output by the descriptor dependency relation identification and extraction processor (231) and the descriptor category-division relations output by the descriptor category-division relation identification and extraction processor (232);
wherein:
the descriptor judgment and generation processor (222) uses a method combining position weighting and word frequency statistics as the basis for descriptor judgment and extraction, the weighting function values being constructed as follows:
the weighting function of a keyword is calculated as:
Weight_i = a × TFIDF_i    Formula 2
for a non-keyword descriptor, with reference to Formula 1:
Weight_i = Mutual-information(T) × a × TFIDF_i    Formula 3
in Formula 2 and Formula 3,
where: f_ti is the frequency with which word t_i occurs in all specification text data files d from the data initialization processor (212); N is the number of all specification text data files from the data initialization processor (212); n_i is the number of specification text data files that contain word t_i; a is the weight of the position of word t_i, i.e. the weight of whichever of the four positions (title, abstract, keywords, body text) word t_i occupies in the specification text data file;
dependency relations are identified and extracted using a similarity calculation method based on a joint probability distribution, the calculation formula being as follows:
where probability(A, B) denotes the frequency with which word A and word B of the selected descriptor set occur together in the same window, over all specification text data files generated by the data initialization processor (212), a window being a position within a specification text data file, namely one of four positions: title, abstract, keywords, or body text; the second quantity in the formula denotes the frequency, over all specification text data files, with which word A of the selected descriptor set occurs in a window while word B of the selected descriptor set does not occur; the third quantity denotes the frequency, over all specification text data files, with which word B of the selected descriptor set occurs in a window while word A of the selected descriptor set does not occur;
the category-division relation between descriptors is obtained using a metric formula for the similarity between descriptors, the calculation formula being as follows:
where sim_number denotes the total number of identical words contained in both of two descriptors in the selected descriptor set, the two descriptors being the contained descriptor and the descriptor to be included; sub_number denotes the total number of words in the contained descriptor in the selected descriptor set; number denotes the total number of words in the descriptor to be included in the selected descriptor set; the first sum in the formula denotes the sum of the position weights of the identical words of the two descriptors within the contained descriptor, where the position of an identical word in the contained descriptor refers to which of the three positions (prefix, middle, suffix) of the contained descriptor the word occupies; qsim_number(i) denotes the weight of the position of the i-th identical word of the two descriptors; sim_number(i) denotes the position number of the i-th word within the set of identical words of the two descriptors, i.e. the value of i; q1 denotes the weight coefficient of a word for the prefix, middle, and suffix positions in the contained descriptor; sub_number(i) denotes the position number of the i-th identical word within the contained descriptor; the second sum in the formula denotes the sum of the position weights of the identical words of the two descriptors within the descriptor to be included, where the position of an identical word in the descriptor to be included refers to which of the three positions (prefix, middle, suffix) of the descriptor to be included the word occupies; q2 denotes the weight coefficient of a word for the prefix, middle, and suffix positions in the descriptor to be included; number(i) denotes the position number of the i-th identical word within the descriptor to be included; dp denotes the position coefficient, whose value is the ratio of the total number of words in the contained descriptor to the total number of words in the descriptor to be included; i = 1, 2, ..., sim_number.
- 2. The Chinese thesaurus constructing system according to claim 1, characterized in that the descriptor table generator (24) includes:
a thesaurus generation processor (241), communicatively coupled to the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), and to the descriptor dependency relation identification and extraction processor (231) and the descriptor category-division relation identification and extraction processor (232) of the descriptor relation identification and extraction device (23), which receives the selected descriptor set output by the descriptor judgment and generation processor (222) of the descriptor identification and extraction device (22), receives the descriptor dependency relation and category-division relation of each descriptor output respectively by the descriptor dependency relation identification and extraction processor (231) and the descriptor category-division relation identification and extraction processor (232) of the descriptor relation identification and extraction device (23), and combines and sorts the descriptors and the relations between them based on the compilation rules for Chinese thesauri of national standard GB 13190-91, to generate and output the thesaurus; and
a thesaurus result memory (242), communicatively coupled to the thesaurus generation processor (241), which receives and stores the thesaurus output by the thesaurus generation processor (241).
- 3. The Chinese thesaurus constructing system according to claim 2, characterized in that the Chinese thesaurus constructing system further includes:
a verification and modifier (5), communicatively coupled to the data processor memory (213), the descriptor result memory (223), the descriptor relation result memory (233), and the thesaurus result memory (242), for manually checking, modifying, and deleting the specification text data files stored in the data processor memory (213), the selected descriptor set stored in the descriptor result memory (223), the descriptor dependency relations and descriptor category-division relations stored in the descriptor relation result memory (233), and the thesaurus stored in the thesaurus result memory (242).
- 4. The Chinese thesaurus constructing system according to claim 1, characterized in that the memory (3) is one or more selected from a hard disk, a USB flash drive, and a memory card.
- 5. The Chinese thesaurus constructing system according to claim 1, characterized in that the Chinese thesaurus constructing system further includes:
a visual operation interface, communicatively coupled to the input device (1), the system processor (2), the memory (3), the output device (4), and the verification and modifier (5).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410359650.5A CN104102847B (en) | 2014-07-25 | 2014-07-25 | Chinese thesaurus constructing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104102847A (en) | 2014-10-15 |
CN104102847B (en) | 2017-11-10 |
Family
ID=51670992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410359650.5A (Expired - Fee Related; CN104102847B (en)) | Chinese thesaurus constructing system | 2014-07-25 | 2014-07-25 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104102847B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113204620A (en) * | 2021-05-12 | 2021-08-03 | 首都师范大学 | Method, system, equipment and computer storage medium for automatically constructing narrative table |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102087669A (en) * | 2011-03-11 | 2011-06-08 | 北京汇智卓成科技有限公司 | Intelligent search engine system based on semantic association |
CN102243649A (en) * | 2011-06-07 | 2011-11-16 | 上海交通大学 | Semi-automatic information extraction processing device of ontology |
CN102930022A (en) * | 2012-10-31 | 2013-02-13 | 中国运载火箭技术研究院 | User-oriented information search engine system and method |
CN102982095A (en) * | 2012-10-31 | 2013-03-20 | 中国运载火箭技术研究院 | Noumenon automatic generating system and method thereof based on thesaurus |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10733223B2 (en) * | 2008-01-08 | 2020-08-04 | International Business Machines Corporation | Term-driven records file plan and thesaurus design |
CN103389979B (en) * | 2012-05-08 | 2018-10-12 | 深圳市世纪光速信息技术有限公司 | Recommend system, the device and method of classified lexicon in input method |
Non-Patent Citations (1)
Title |
---|
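Exploration and practice of automatic thesaurus construction technology in the networked digital era; Zeng Wen et al.; Journal of the National Library of China (《国家图书馆学刊》); 2012-08-31 (No. 4); pp. 78-82 * |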
Also Published As
Publication number | Publication date |
---|---|
CN104102847A (en) | 2014-10-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20171110 |