CN102831134A - Novel semi-automatic indexing method of Chinese scientific and technical documents - Google Patents


Info

Publication number
CN102831134A
Authority
CN
China
Prior art keywords
document
index
offered
candidate
references
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104243691A
Other languages
Chinese (zh)
Other versions
CN102831134B (en
Inventor
刘伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Original Assignee
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority to CN201110424369.1A
Publication of CN102831134A
Application granted
Publication of CN102831134B
Expired - Fee Related (current legal status)
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a novel semi-automatic indexing method of Chinese scientific and technical documents. The method comprises the following steps: acquiring cited documents of a documents collection to be labeled by users, so as to obtain a cited document collection; labeling all documents in the cited document collection to obtain labeled cited documents; constructing a network of citing relations among Chinese documents in the cited document collection, to obtain the network of citing relations among the Chinese documents in the cited document collection; and performing iterative labeling on the documents in the documents collection to be labeled by users until each document in the documents collection to be labeled by users is labeled. By adopting the method, the shortcomings of low indexing efficiency and low accuracy existing in the current automatic indexing method of Chinese scientific and technical documents can be effectively overcome.

Description

A novel semi-automatic indexing method for Chinese scientific and technical documents
Technical field
The present invention relates to the fields of literature retrieval and text classification, and in particular to a novel semi-automatic indexing method for Chinese scientific and technical documents.
Background
Indexing is an important step in document processing. Indexing assigns retrieval identifiers to a document that indicate the subject category of its content, so that catalogues and indexes can be compiled alongside bibliographic records, or the identifiers can be stored in a computer to support document retrieval. Indexing is a necessary means of information filtering; it refines and enriches information, adding intellectual value to the information itself; and it makes retrieval more efficient and more accurate.
The rate at which scientific and technical documents are published and cited is growing rapidly. Taking China as an example, about 720,000 scientific and technical documents were published between 2000 and 2010, and they were cited 4,230,000 times. Faced with this mass of literature, manual indexing cannot keep up with the growing indexing demands of researchers and the general public, and automatic indexing technology has developed in response. Automatic indexing of scientific and technical literature is the process of using a computer system to extract indexing terms from documents automatically. Since the 1950s, automatic indexing has been a focus of attention in both research and industry.
Existing automatic indexing techniques for scientific and technical literature all extract several keywords as indexing terms by analyzing internal features of a document, such as its body text, abstract, or title. These methods share some common defects when applied to Chinese scientific and technical literature, as follows:
(1) Word segmentation algorithms are defective.
Identifying each semantic unit is the first step in deciding how to index a text, yet the word segmentation problem in Chinese persists: to date, no segmentation algorithm has solved ambiguous segmentation. As a result, the foundation on which automatic indexing rests is itself flawed.
(2) Thesauri do not keep pace with the development of science.
Every discipline in modern society is developing extremely rapidly, and sub-disciplines and frontier disciplines keep emerging. The compilation of thesauri always lags behind scientific development, so dictionary-based segmentation algorithms always fail to segment some new words, which also greatly reduces the accuracy of automatic indexing systems that rely on a thesaurus for vocabulary control.
(3) Although current automatic indexing is far more efficient than manual indexing, complex semantic analysis techniques leave its efficiency short of practical demand, and these techniques require fairly large training samples in practice.
In summary, existing methods suffer from low indexing efficiency and low accuracy when indexing Chinese scientific and technical literature.
Therefore, a technical problem that those skilled in the art urgently need to solve is how to find a novel automatic indexing method for Chinese scientific and technical literature that effectively overcomes the low indexing efficiency and low accuracy of current automatic indexing methods.
Summary of the invention
A technical problem to be solved by the present invention is to provide a novel semi-automatic indexing method for Chinese scientific and technical documents that effectively overcomes the low indexing efficiency and low accuracy of current automatic indexing methods for such documents.
To solve the above problem, the invention discloses a novel semi-automatic indexing method for Chinese scientific and technical documents, comprising: acquiring the documents cited by the set of documents the user needs to label, to obtain a cited-document set;
labeling every document in the cited-document set, to obtain labeled cited documents;
constructing the citation network among the documents in the cited-document set, to obtain the citation network among those documents;
iteratively labeling the documents in the set of documents the user needs to label, until every document in that set has been labeled.
Preferably, the step of acquiring the documents cited by the set of documents the user needs to label, to obtain the cited-document set, comprises:
initializing the cited-document set;
for every document in the set of documents the user needs to label, obtaining all references of that document; if those references are already in the cited-document set, the number of documents in the cited-document set remains unchanged; if they are not in the cited-document set, they are put into it, yielding the updated cited-document set.
Preferably, the step of labeling every document in the cited-document set, to obtain the labeled cited documents, comprises:
if no cited document remains in the cited-document set, every document in the set has been labeled; the indexing of the cited documents in the cited-document set is complete, and the indexed cited documents are obtained;
if cited documents remain in the cited-document set, each of them is labeled manually, yielding manually labeled cited documents; once every document in the cited-document set has been labeled in this way, the indexing of the cited documents in the set is complete, and the indexed cited documents are obtained.
Preferably, the step of constructing the citation network among the documents in the cited-document set, to obtain that citation network, comprises:
initializing the citation-relation set;
putting the documents for which the citation network is to be built into a document set, to obtain the correspondingly updated document set;
for any two documents in the updated document set, putting the two different citation relations between them (either document may cite the other) into the citation network, to obtain the citation relations of that pair;
returning the citation relations of all such pairs, collected into the citation-relation set among the documents; this completes the construction of the citation network among the documents in the cited-document set and yields that network.
Preferably, the step of iteratively labeling the documents in the set of documents the user needs to label, until every document in that set has been labeled, comprises:
initializing the indexed-document set;
iterating over each document in the set of documents the user needs to index; if all references of a document have already been indexed, performing the operation of indexing that document;
removing the document that has been indexed from the set of documents to be indexed, and at the same time putting it into the indexed-document set and into the cited-document set;
if the set of documents to be indexed is not empty, repeating the loop that runs from iterating over each document in the set of documents to be indexed through removing the indexed document and putting it into the indexed-document set and the cited-document set, until the set of documents to be indexed is empty, which completes the whole process of iteratively labeling the documents in the set the user needs to label;
if the set of documents to be indexed is empty, all documents have been indexed and the indexed-document set is obtained; the indexed-document set is output, completing the whole process of iteratively indexing the documents in the set the user needs to index.
Preferably, the step of iterating over each document in the set of documents the user needs to index and, if all references of a document have already been indexed, performing the operation of indexing that document, comprises:
iterating over each document in the set of documents the user needs to index; if all references of a document have already been indexed, calculating the indexing weight of each document in the reference set of that document, to obtain the indexing weight of each document in the reference set;
acquiring the candidate indexing term set;
calculating the weight of each term in the candidate indexing term set, to obtain the weight of each term in that set;
selecting the 6 terms with the largest weights in the candidate indexing term set as the indexing terms of the document, which completes the operation of indexing that document.
Preferably, the step of calculating the indexing weight of each document in the reference set of the document, to obtain the indexing weight of each document in the reference set, comprises:
obtaining the number of times each document in the reference set has been cited;
calculating, for each document in the reference set, the length of the shortest path in the citation network between that document and the document being indexed;
calculating the weight of each document in the reference set;
returning the collected weights of all documents in the reference set, which completes the calculation and yields the indexing weight of each document in the reference set.
Preferably, the step of acquiring the candidate indexing term set comprises:
initializing the candidate indexing term set, to obtain the initialized candidate indexing term set;
putting the indexing terms of the references of the selected document into the candidate indexing term set, to obtain the corresponding candidate indexing term set;
returning the candidate indexing term set, which completes the acquisition of the candidate indexing term set.
Preferably, if the reference set contains candidate indexing terms, a document is taken from the reference set and its indexing terms are put into the candidate indexing term set; this operation of selecting a document and putting its indexing terms into the candidate indexing term set is then repeated until all candidate indexing terms in the reference set have been found; during this loop, identical indexing terms are counted only once;
if the reference set contains no candidate indexing terms, a document is selected from the reference set and indexed, yielding an indexed document, which is added to the candidate indexing term set; the operation of selecting a document and putting its indexing terms into the candidate indexing term set is then repeated until all candidate indexing terms in the reference set have been found; during this loop, identical indexing terms are counted only once.
Preferably, the step of calculating the weight of each term in the candidate indexing term set comprises:
selecting the document the user needs to index;
for each candidate indexing term of that document, finding all references corresponding to the term;
calculating the weights of all references corresponding to the term, to obtain their indexing weights;
adding up the indexing weights of all references corresponding to the term, to obtain their sum;
returning that sum for each candidate indexing term, which completes the calculation and yields the weight of each term in the candidate indexing term set;
performing the above calculation for every candidate indexing term in the candidate indexing term set, traversing the whole set, so that the weight of each term in the set is obtained.
Compared with the prior art, the present invention has the following advantages:
Unlike existing automatic indexing methods for scientific and technical literature, the present invention uses only three kinds of information: the citation relations among scientific and technical documents, the selectable keywords provided with the documents, and the thesaurus. All three are objective and explicit, so the invention avoids the low indexing efficiency, inaccurate indexing, and proneness to ambiguity of the prior art.
In short, the invention provides a novel semi-automatic indexing method for Chinese scientific and technical documents that effectively overcomes the low indexing efficiency and low accuracy of current automatic indexing methods for such documents.
Description of drawings
Fig. 1 is a flow chart of Embodiment 1 of the novel semi-automatic indexing method for Chinese scientific and technical documents according to the present invention;
Fig. 2 is a structural diagram of the overall framework of semi-automatic indexing of Chinese scientific and technical documents in the present invention;
Fig. 3 is a flow chart of step 201 of the present invention, the acquisition of the initial cited-document set;
Fig. 4 is a flow chart of step 202 of the present invention, the labeling of the initial cited documents;
Fig. 5 is a flow chart of step 203 of the present invention, the construction of the citation network;
Fig. 6 is a flow chart of step 204 of the present invention, the labeling of a single document;
Fig. 7 is a flow chart of calculating the indexing weight of each document in the set in step 204 of the present invention;
Fig. 8 is a flow chart of acquiring the candidate indexing term set in step 204 of the present invention;
Fig. 9 is a flow chart of calculating the weight of each term in the candidate indexing term set in step 204 of the present invention;
Fig. 10 is a flow chart of step 205 of the present invention, the iterative labeling of multiple documents;
Fig. 11 is a flow chart of Embodiment 2 of the novel semi-automatic indexing method for Chinese scientific and technical documents according to the present invention;
Fig. 12 is a structural diagram of the citation network among four documents in Embodiment 2 of the present invention;
Fig. 13 is a structural diagram of the novel semi-automatic indexing device for Chinese scientific and technical documents according to the present invention.
Embodiments
To make the above objects, features, and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the drawings and embodiments.
One of the core ideas of the present invention is to provide a novel semi-automatic indexing method for Chinese scientific and technical documents, comprising: acquiring the documents cited by the set of documents the user needs to label, to obtain a cited-document set; labeling every document in the cited-document set, to obtain labeled cited documents; constructing the citation network among the documents in the cited-document set, to obtain that citation network; and iteratively labeling the documents in the set of documents the user needs to label, until every document in that set has been labeled. The method effectively overcomes the low indexing efficiency and low accuracy of current automatic indexing methods for Chinese scientific and technical documents.
Referring to Fig. 1, which shows a flow chart of Embodiment 1 of the novel semi-automatic indexing method for Chinese scientific and technical documents according to the present invention, the method may specifically comprise:
Step 101: acquire the documents cited by the set of documents the user needs to label, to obtain the cited-document set.
To help those skilled in the art understand the present invention better, in a preferred embodiment of the invention, step 101 may specifically comprise:
Substep A1: initialize the cited-document set.
Substep A2: for every document in the set of documents the user needs to label, obtain all references of that document; if those references are already in the cited-document set, the number of documents in the cited-document set remains unchanged; if they are not in the cited-document set, put them into it, yielding the updated cited-document set.
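Substeps A1 and A2 can be sketched as follows. This is only an illustrative sketch: the function name collect_cited_documents, the dictionary-based representation of references, and the document identifiers are assumptions made for the example, not part of the original text.

```python
def collect_cited_documents(to_label, references):
    """Return the set of documents cited by any document in `to_label`.

    `references` is a hypothetical mapping: document id -> list of cited ids.
    """
    cited = set()  # substep A1: initialise the cited-document set
    for doc in to_label:  # substep A2: visit every document to be labeled
        for ref in references.get(doc, ()):
            # Adding a reference that is already present leaves the set
            # unchanged; a new reference enlarges the set.
            cited.add(ref)
    return cited


references = {"d1": ["r1", "r2"], "d2": ["r2", "r3"]}
cited = collect_cited_documents(["d1", "d2"], references)
```

Using a set makes the two cases of substep A2 (reference already present, reference not yet present) fall out of a single add operation.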
Step 102: label every document in the cited-document set, to obtain labeled cited documents.
To help those skilled in the art understand the present invention better, in another preferred embodiment of the invention, step 102 may specifically comprise:
Substep B1: if no cited document remains in the cited-document set, every document in the set has been labeled; the indexing of the cited documents in the cited-document set is complete, and the indexed cited documents are obtained.
Substep B2: if cited documents remain in the cited-document set, each of them is labeled manually, yielding manually labeled cited documents; once every document in the cited-document set has been labeled in this way, the indexing of the cited documents in the set is complete, and the indexed cited documents are obtained.
Step 103: construct the citation network among the documents in the cited-document set, to obtain that citation network.
To help those skilled in the art understand the present invention better, in another preferred embodiment of the invention, step 103 may specifically comprise:
Substep C1: initialize the citation-relation set.
Substep C2: put the documents for which the citation network is to be built into a document set, to obtain the correspondingly updated document set.
Substep C3: for any two documents in the updated document set, put the two different citation relations between them (either document may cite the other) into the citation network, to obtain the citation relations of that pair.
Substep C4: return the citation relations of all such pairs, collected into the citation-relation set among the documents; this completes the construction of the citation network among the documents in the cited-document set and yields that network.
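A minimal sketch of substeps C1 to C4 is given below. It assumes that a citation relation can be stored as an ordered pair (citing, cited) and that the references of each document are available as a dictionary; both the representation and the name build_citation_network are assumptions for illustration.

```python
from itertools import permutations


def build_citation_network(documents, references):
    """Collect every citation relation (citing, cited) among `documents`."""
    relations = set()  # substep C1: initialise the citation-relation set
    docs = set(documents)  # substep C2: documents the network is built over
    # Substep C3: for every pair, check both directions (a cites b, b cites a).
    for a, b in permutations(docs, 2):
        if b in references.get(a, ()):
            relations.add((a, b))
    return relations  # substep C4: return the collected relations


references = {"a": ["b"], "b": ["c"], "c": []}
network = build_citation_network(["a", "b", "c"], references)
```

Iterating over ordered pairs covers the two possible relations between any two documents in one loop; relations pointing outside the document set are ignored.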
Step 104: iteratively label the documents in the set of documents the user needs to label, until every document in that set has been labeled.
To help those skilled in the art understand the present invention better, in another preferred embodiment of the invention, step 104 may specifically comprise:
Substep D1: initialize the indexed-document set.
Substep D2: iterate over each document in the set of documents the user needs to index; if all references of a document have already been indexed, perform the operation of indexing that document.
Substep D2 may specifically comprise:
Substep E1: iterate over each document in the set of documents the user needs to index; if all references of a document have already been indexed, calculate the indexing weight of each document in the reference set of that document, to obtain the indexing weight of each document in the reference set.
The step of calculating the indexing weight of each document in the reference set of the document may specifically comprise:
Substep F1: obtain the number of times each document in the reference set has been cited.
Substep F2: calculate, for each document in the reference set, the length of the shortest path in the citation network between that document and the document being indexed.
Substep F3: calculate the weight of each document in the reference set.
Substep F4: return the collected weights of all documents in the reference set, which completes the calculation and yields the indexing weight of each document in the reference set.
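Substeps F1 to F4 can be sketched as below. Several points are assumptions rather than statements of the method: the citation network is searched as an undirected graph, the shortest path is found by breadth-first search, and, because this passage does not state how citation count and path length are combined in substep F3, the combining rule is passed in as a caller-supplied function (the ratio used in the example is purely illustrative).

```python
from collections import deque


def shortest_path_length(edges, source, target):
    """Breadth-first search over citation relations, treated as undirected."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two documents are not connected


def reference_weights(target, refs, citation_counts, edges, combine):
    """Weight every reference of `target`; `combine` is a hypothetical rule."""
    weights = {}
    for ref in refs:
        count = citation_counts.get(ref, 0)  # substep F1
        dist = shortest_path_length(edges, ref, target)  # substep F2
        weights[ref] = combine(count, dist)  # substep F3 (assumed rule)
    return weights  # substep F4


edges = {("d", "r1"), ("d", "r2"), ("r1", "r2")}
counts = {"r1": 4, "r2": 2}
ref_weights = reference_weights("d", ["r1", "r2"], counts, edges,
                                combine=lambda count, dist: count / dist)
```

A real combining rule would also need to decide what to do when no path exists, in which case shortest_path_length returns None.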
Substep E2: acquire the candidate indexing term set.
Substep E2 may specifically comprise:
Substep G1: initialize the candidate indexing term set, to obtain the initialized candidate indexing term set.
Substep G2: put the indexing terms of the references of the selected document into the candidate indexing term set, to obtain the corresponding candidate indexing term set.
Depending on whether the reference set contains candidate indexing terms, the corresponding operation is performed, as follows:
Substep H1: if the reference set contains candidate indexing terms, take a document from the reference set and put its indexing terms into the candidate indexing term set; then repeat this operation of selecting a document and putting its indexing terms into the candidate indexing term set until all candidate indexing terms in the reference set have been found. During this loop, identical indexing terms are counted only once.
Substep H2: if the reference set contains no candidate indexing terms, select a document from the reference set and index it, yielding an indexed document, which is added to the candidate indexing term set; then repeat the operation of selecting a document and putting its indexing terms into the candidate indexing term set until all candidate indexing terms in the reference set have been found. During this loop, identical indexing terms are counted only once.
Substep G3: return the candidate indexing term set, which completes the acquisition of the candidate indexing term set.
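The collection of candidate indexing terms in substeps G1 to G3 and H1 might look like the sketch below. It assumes every reference already carries indexing terms (the case of substep H1), and the mapping index_terms and the function name are hypothetical.

```python
def candidate_terms(refs, index_terms):
    """Gather the indexing terms of all references, each term counted once."""
    candidates = []  # substep G1: initialise the candidate set
    seen = set()
    for ref in refs:  # substep H1: take the references one at a time
        for term in index_terms.get(ref, ()):
            if term not in seen:  # identical terms are counted only once
                seen.add(term)
                candidates.append(term)
    return candidates  # substep G3: return the candidate set


index_terms = {
    "r1": ["indexing", "citation"],
    "r2": ["citation", "thesaurus"],
}
candidates = candidate_terms(["r1", "r2"], index_terms)
```

A list plus a membership set is used so that the order in which terms were first met is preserved; a plain set would serve equally well if order does not matter.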
Substep E3: calculate the weight of each term in the candidate indexing term set, to obtain the weight of each term in that set.
Substep E3 may specifically comprise:
Substep I1: select the document the user needs to index.
Substep I2: for each candidate indexing term of that document, find all references corresponding to the term.
Substep I3: calculate the weights of all references corresponding to the term, to obtain their indexing weights.
Substep I4: add up the indexing weights of all references corresponding to the term, to obtain their sum.
Substep I5: return that sum for each candidate indexing term, which completes the calculation and yields the weight of each term in the candidate indexing term set.
Substep I6: perform the above calculation for every candidate indexing term in the candidate indexing term set, traversing the whole set, so that the weight of each term in the set is obtained.
Substep E4, choose the index term of 6 maximum speech of weighted value in the set of said candidate's index term, accomplish and carry out the operation of the document being carried out index as corresponding document.
Substep D3: remove the document that has just been indexed from the set of documents to be indexed, and put it into the indexed-document set and the corresponding cited-document set.
Substep D4: if the set of documents to be indexed is not empty, repeat the cycle of iteratively indexing each document in that set, removing each indexed document from it, and putting the indexed document into the indexed-document set and the corresponding cited-document set, until the set of documents to be indexed is empty. This completes the iterative indexing of the documents in the set that the user needs to index.
Substep D5: if the set of documents to be indexed is empty, all documents have been indexed and the indexed-document set is obtained. Output the indexed-document set, which completes the iterative indexing of the documents in the set that the user needs to index.
Fig. 2 shows the overall framework of the semi-automatic indexing of Chinese scientific and technical documents in the present invention.
As can be seen from Fig. 2, the overall process of the semi-automatic indexing of Chinese scientific and technical documents may specifically comprise:
Step 201: acquisition of the initial cited-document set.
Step 202: indexing of the initial cited documents.
Step 203: construction of the citation network.
Step 204: single-document indexing.
The single-document indexing method calculates a weight for each index term from the index terms of the document's references and the citation counts of those references, and then selects the terms with the largest weights as the index terms of the document.
Step 205: multi-document iterative indexing.
Multi-document iterative indexing first builds the citation network over all documents. A node with in-degree 0 is selected from the network and indexed, and that node is then deleted from the citation network. The next node with in-degree 0 is then selected and indexed. This process is repeated until all nodes have been indexed.
Fig. 3 shows the flow chart of step 201 of the present invention, the acquisition of the initial cited-document set. Step 201 may specifically comprise:
Substep J1: initialize the set TCDS of documents to be indexed and the cited-document set CDS.
Substep J2: put all documents to be indexed into set TCDS.
Substep J3: if set TCDS is empty, go to substep J6; otherwise go to substep J4.
Substep J4: take one document d out of set TCDS.
Substep J5: put all references of d into set CDS, then go to substep J3.
Substep J6: perform the set operation CDS - TCDS, so that set CDS contains none of the documents to be indexed.
Substep J7: return set CDS.
Fig. 3 thus shows the overall process of step 201, the acquisition of the initial cited-document set.
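Substeps J1 to J7 can be sketched in Python as follows. This is only a minimal illustration: the function name `initial_cited_set` and the dictionary layout (document number mapped to its reference list) are assumptions made here, and the sample data is the reference list of the four documents used in Embodiment 2.

```python
from typing import Dict, Iterable, Set


def initial_cited_set(references: Dict[str, Iterable[str]]) -> Set[str]:
    """Return CDS: every reference of the documents to index, minus those documents."""
    tcds = set(references)          # J1-J2: the documents that are to be indexed
    cds: Set[str] = set()           # J1: cited-document set, initially empty
    pending = list(tcds)
    while pending:                  # J3: stop once no document is left
        d = pending.pop()           # J4: take one document d
        cds.update(references[d])   # J5: add all references of d
    return cds - tcds               # J6-J7: set difference CDS - TCDS


# Reference lists of documents 1 to 4 as given in Embodiment 2.
example = {
    "1": ["c1", "c2", "c3"],
    "2": ["c2", "c4", "1"],
    "3": ["c5", "1"],
    "4": ["c3", "1", "3"],
}
cds = initial_cited_set(example)
print(sorted(cds))
```

The final set difference is what keeps documents 1 and 3 out of the result even though documents 2, 3 and 4 cite them.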
Fig. 4 shows the flow chart of step 202 of the present invention, the indexing of the initial cited documents. Step 202 may specifically comprise:
Substep K1: if set CDS is empty, go to substep K4; otherwise go to substep K2.
Substep K2: take a document d out of set CDS.
Substep K3: index document d manually, then go to substep K1.
Substep K4: return the indexing results.
Fig. 5 shows the flow chart of step 203 of the present invention, the construction of the citation network. Step 203 may specifically comprise:
Substep M1: initialize the document set DS and the citation relationship set CRS.
Substep M2: put all documents for which the citation network is to be built into set DS.
Substep M3: for any two documents d1 and d2 in set DS, if d1 cites d2, put the citation relationship d1 → d2 into set CRS; if d2 cites d1, put the citation relationship d2 → d1 into set CRS.
Substep M4: return set CRS.
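A possible Python sketch of substeps M1 to M4 is given below. The function name and the input layout are assumptions for illustration. Instead of testing every pair of documents, the sketch walks each document's reference list, which yields the same set of relationships.

```python
from typing import Dict, Iterable, Set, Tuple


def build_citation_relations(
    docs: Iterable[str], references: Dict[str, Iterable[str]]
) -> Set[Tuple[str, str]]:
    """Return CRS as a set of (citing, cited) pairs restricted to the documents in DS."""
    ds = set(docs)                                 # M1-M2: document set DS
    crs: Set[Tuple[str, str]] = set()              # M1: citation relationship set CRS
    for d1 in ds:                                  # M3: record d1 -> d2 when d1 cites d2
        for d2 in references.get(d1, ()):
            if d2 in ds and d2 != d1:
                crs.add((d1, d2))
    return crs                                     # M4


# Reference lists of documents 1 to 4 as given in Embodiment 2.
refs = {
    "1": ["c1", "c2", "c3"],
    "2": ["c2", "c4", "1"],
    "3": ["c5", "1"],
    "4": ["c3", "1", "3"],
}
crs = build_citation_relations(["1", "2", "3", "4"], refs)
print(sorted(crs))
```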
Fig. 6 shows the flow chart of step 204 of the present invention, single-document indexing. Step 204 may specifically comprise:
This step assumes that all references of the document to be indexed have already been indexed.
Substep N1: initialize the reference set CS.
Substep N2: put all references of the document d to be indexed into set CS.
Substep N3: using the citation network construction step, build the citation network over the documents in set CS.
Substep N4: calculate the index weight value w(cd_i) of each document cd_i in set CS.
Substep N5: obtain the candidate index term set IWS.
Substep N6: calculate the weight of each term in the candidate index term set IWS.
Substep N7: select the 6 terms with the largest weights in set IWS as the index terms of document d.
Fig. 7 shows the flow chart for calculating the index weight value of each document in the set within step 204 of the present invention. This calculation may specifically comprise:
Substep O1: obtain the citation count c_i of each document cd_i in set CS.
Substep O2: calculate the shortest path length l_i between each document cd_i in set CS and d in the citation network.
Substep O3: calculate the weight w(cd_i) of each document in set CS using the formula:
$$w(cd_i) = \frac{c_i}{\log_2 l_i + 1}$$
Substep O4: return the weights of all documents in set CS.
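Substeps O2 and O3 can be sketched as follows. Two assumptions are made and should be checked against the original formula image: the weight is read as c_i divided by (log2 of l_i, plus 1), and the shortest path is found by breadth-first search that ignores edge direction. The alternative reading, c_i divided by log2 of (l_i + 1), gives the same value when l_i = 1, which is the only case that occurs in the worked example. The function names are hypothetical.

```python
import math
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def shortest_path_length(edges: Iterable[Tuple[str, str]], src: str, dst: str) -> int:
    """Breadth-first search over the citation relations, ignoring edge direction."""
    neighbours: Dict[str, Set[str]] = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path between {src!r} and {dst!r}")


def index_weight(citations: int, path_length: int) -> float:
    """Assumed reading of the formula: c_i / (log2(l_i) + 1), for l_i >= 1."""
    return citations / (math.log2(path_length) + 1)


# Document 1 cites c1, c2 and c3; c1 has been cited 31 times (Table 2).
edges = {("1", "c1"), ("1", "c2"), ("1", "c3")}
l_c1 = shortest_path_length(edges, "c1", "1")
print(l_c1, index_weight(31, l_c1))
```

Under this reading a directly cited reference keeps its full citation count as weight, and the weight shrinks as the reference lies further from the document in the network.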
Fig. 8 shows the flow chart for obtaining the candidate index term set within step 204 of the present invention. Obtaining the candidate index term set may specifically comprise:
Substep P1: initialize the candidate index term set IWS.
Substep P2: if set CS is empty, return set IWS; otherwise go to substep P3.
Substep P3: take one document d out of set CS.
Substep P4: put the index terms of document d into set IWS without duplicates, then go to substep P2.
Substep P5: return set IWS.
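A minimal Python sketch of substeps P1 to P5 follows. The function name and the `index_terms` mapping (document mapped to its index terms) are illustrative assumptions; the sample data comes from Table 2 of Embodiment 2. A list is used rather than a set only so that the terms keep their order of first appearance.

```python
from typing import Dict, Iterable, List


def candidate_index_terms(cs: Iterable[str], index_terms: Dict[str, List[str]]) -> List[str]:
    """Collect the index terms of every document in CS, keeping each term once."""
    iws: List[str] = []                  # P1: candidate index term set IWS
    seen = set()
    for d in cs:                         # P2-P3: take documents until CS is exhausted
        for term in index_terms[d]:      # P4: add the terms of d, skipping duplicates
            if term not in seen:
                seen.add(term)
                iws.append(term)
    return iws                           # P5


# Index terms of the cited documents as listed in Table 2.
table2_terms = {
    "c1": ["w1", "w2"],
    "c2": ["w3", "w4", "w5"],
    "c3": ["w1", "w4"],
    "c4": ["w2"],
    "c5": ["w4", "w6"],
}
iws_doc1 = candidate_index_terms(["c1", "c2", "c3"], table2_terms)
print(iws_doc1)
```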
Fig. 9 shows the flow chart for calculating the weight of each term in the candidate index term set within step 204 of the present invention. This calculation may specifically comprise:
Substep Q1: for each term iw in set IWS, perform substeps Q2 to Q3, then go to substep Q4.
Substep Q2: obtain all references of document d that have iw as an index term.
Substep Q3: add up the index weights of these references; the sum is the weight of iw.
Substep Q4: return the weight of each term iw in set IWS.
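Substeps Q1 to Q4, together with the final selection of the highest-weighted terms, might be sketched as below. The function names, the alphabetical tie-break and the parameter `k` are assumptions introduced for illustration. The sample weights are those of document 1 in Embodiment 2.

```python
from typing import Dict, List


def term_weights(doc_weights: Dict[str, float], index_terms: Dict[str, List[str]]) -> Dict[str, float]:
    """Weight of a term = sum of the index weights of the references that carry it."""
    weights: Dict[str, float] = {}
    for ref, w in doc_weights.items():           # every reference of document d
        for term in set(index_terms[ref]):       # Q2: references having this term
            weights[term] = weights.get(term, 0) + w   # Q3: accumulate the sum
    return weights                               # Q4


def select_index_terms(weights: Dict[str, float], k: int = 6) -> List[str]:
    """Return the k highest-weighted terms; ties are broken alphabetically (an assumption)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# References of document 1 with their index weights and index terms (Embodiment 2).
doc1_ref_weights = {"c1": 31, "c2": 27, "c3": 6}
doc1_ref_terms = {"c1": ["w1", "w2"], "c2": ["w3", "w4", "w5"], "c3": ["w1", "w4"]}
doc1_term_weights = term_weights(doc1_ref_weights, doc1_ref_terms)
print(doc1_term_weights)
print(select_index_terms(doc1_term_weights, k=3))
```

The method selects up to 6 terms; `k=3` is used in the usage lines only because the worked example keeps three terms for document 1.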
Fig. 10 shows the flow chart of step 205 of the present invention, multi-document iterative indexing. Step 205 may specifically comprise:
Substep R1: following the citation network construction step, build the citation network G over the set TCDS of documents to be indexed.
Substep R2: following the step for indexing the initial cited documents, index every document in set CDS.
Substep R3: if G is empty, go to substep R7; otherwise go to substep R4.
Substep R4: select a node v with in-degree 0 in G.
Substep R5: delete v and the edges connected to v from G.
Substep R6: index the document corresponding to v following the single-document indexing step, then go to substep R3.
Substep R7: return the indexing results of all documents in set TCDS.
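The loop of substeps R3 to R6 amounts to a topological ordering of the citation network. The sketch below only computes the order in which documents would be indexed. It assumes that edges point from the cited document to the citing document, so that in-degree 0 means no unindexed reference remains inside TCDS; this orientation is an assumption chosen because it reproduces the order of the worked example (document 1 first, document 4 last). The function name is hypothetical.

```python
from typing import Dict, Iterable, List


def iterative_indexing_order(references: Dict[str, Iterable[str]]) -> List[str]:
    """Order in which the documents of TCDS can be indexed one after another."""
    tcds = set(references)
    # In-degree of d = number of its references that are themselves in TCDS.
    in_degree = {d: 0 for d in tcds}
    cited_by: Dict[str, List[str]] = {d: [] for d in tcds}
    for d in tcds:
        for r in set(references[d]):
            if r in tcds and r != d:
                in_degree[d] += 1
                cited_by[r].append(d)
    order: List[str] = []
    ready = sorted(d for d, deg in in_degree.items() if deg == 0)
    while ready:                                  # R3: stop when nothing is left
        v = ready.pop(0)                          # R4: a node with in-degree 0
        order.append(v)                           # R6: v would be indexed here
        for d in sorted(cited_by[v]):             # R5: delete v and its edges
            in_degree[d] -= 1
            if in_degree[d] == 0:
                ready.append(d)
    if len(order) != len(tcds):
        raise ValueError("citation cycle: some documents never reach in-degree 0")
    return order


# Reference lists of documents 1 to 4 as given in Embodiment 2.
refs = {
    "1": ["c1", "c2", "c3"],
    "2": ["c2", "c4", "1"],
    "3": ["c5", "1"],
    "4": ["c3", "1", "3"],
}
order = iterative_indexing_order(refs)
print(order)
```

The explicit cycle check is an addition of this sketch: if two documents cite each other, no node ever reaches in-degree 0 and the loop described in substeps R3 to R6 would not terminate on its own.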
Fig. 11 shows the flow chart of Embodiment 2 of the novel semi-automatic indexing method for Chinese scientific and technical documents of the present invention, which may specifically comprise:
The semi-automatic indexing method proposed by the present invention is described below using four scientific and technical documents as an example. The citation relationships of the four documents are shown in Table 1:
No.    Citation relationship    Times cited
1      (none)                   48
2      Cites 1                  22
3      Cites 1                  15
4      Cites 1, 3               3
Table 1
Step 1101: acquisition of the initial cited-document set.
Obtain the cited documents (references) of document 1: c1, c2, c3;
Obtain the cited documents of document 2: c2, c4, 1;
Obtain the cited documents of document 3: c5, 1;
Obtain the cited documents of document 4: c3, 1, 3;
Initial cited-document set = {c1, c2, c3} + {c2, c4, 1} + {c5, 1} + {c3, 1, 3} - {1, 2, 3, 4} = {c1, c2, c3, c4, c5}
Step 1102: indexing of the initial cited documents.
The 5 cited documents c1, c2, c3, c4 and c5 are each indexed manually, as shown in Table 2:
No.    Index terms    Times cited
c1     w1, w2         31
c2     w3, w4, w5     27
c3     w1, w4         6
c4     w2             52
c5     w4, w6         11
Table 2
Step 1103: construction of the citation network.
Fig. 12 shows the structure of the citation network among the four documents in Embodiment 2 of the present invention.
Fig. 12A is the overall citation network among the four documents in Embodiment 2 of the present invention;
Fig. 12B is the citation network with document 1 as the document to be indexed in Embodiment 2 of the present invention;
Fig. 12C is the citation network with document 2 as the document to be indexed in Embodiment 2 of the present invention;
Fig. 12D is the citation network with document 3 as the document to be indexed in Embodiment 2 of the present invention;
Fig. 12E is the citation network with document 4 as the document to be indexed in Embodiment 2 of the present invention.
Step 1104: indexing of a single document.
To help those of ordinary skill in the art understand the present invention better, the indexing of a single document is explained below using document 1 and document 3 as examples:
Indexing of document 1:
All references of document 1 (c1, c2, c3) have already been indexed;
Obtain the citation counts of c1, c2, c3 (see Table 1 and Table 2);
Calculate the shortest path lengths from c1, c2, c3 to document 1, which are 1, 1, 1 respectively;
Calculate the index weights of c1, c2, c3: w(c1)=31, w(c2)=27, w(c3)=6;
Obtain the candidate index term set of document 1 = {w1, w2} + {w3, w4, w5} + {w1, w4} = {w1, w2, w3, w4, w5}
Calculate the weight of each index term: w1=37, w2=31, w3=27, w4=33, w5=27;
Sort the terms by weight in descending order and select w1, w2, w4 as the index terms of document 1.
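The term weights for document 1 can be checked by hand. Writing W(t) for the weight of candidate term t (a notation introduced here only for illustration) and w(·) for the index weights of the references, each term weight is the sum over the references of document 1 that carry that term according to Table 2:

$$
\begin{aligned}
W(w1) &= w(c1) + w(c3) = 31 + 6 = 37\\
W(w2) &= w(c1) = 31\\
W(w3) &= w(c2) = 27\\
W(w4) &= w(c2) + w(c3) = 27 + 6 = 33\\
W(w5) &= w(c2) = 27
\end{aligned}
$$

In descending order this gives w1 (37), w4 (33), w2 (31), followed by w3 and w5 (27 each), which agrees with the three terms selected for document 1.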
Indexing of document 3:
All references of document 3 (1, c5) have already been indexed;
Obtain the citation counts of 1 and c5 (see Table 1 and Table 2);
Calculate the shortest path lengths from 1 and c5 to document 3, which are 1, 1 respectively;
Calculate the index weights of 1 and c5: w(1)=48, w(c5)=11;
Obtain the candidate index term set of document 3 = {w1, w2, w4} + {w4, w6} = {w1, w2, w4, w6}
Calculate the weight of each index term: w1=48, w2=48, w4=59, w6=11;
Sort the terms by weight in descending order and select w1, w2, w4 as the index terms of document 3.
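The same check for document 3 shows how results are carried forward between iterations: document 1 contributes the terms w1, w2, w4 that were assigned to it earlier, while c5 contributes w4 and w6 from Table 2. With W(t) again denoting the weight of candidate term t (notation assumed for illustration only):

$$
\begin{aligned}
W(w1) &= w(1) = 48\\
W(w2) &= w(1) = 48\\
W(w4) &= w(1) + w(c5) = 48 + 11 = 59\\
W(w6) &= w(c5) = 11
\end{aligned}
$$

Only w4 is carried by both references, which is why it receives the largest weight.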
Step 1105: iterative indexing of multiple documents.
According to the overall citation network, the iterative indexing process is:
Initially only document 1 has all of its references indexed; index document 1 using step D;
All references of document 2 and document 3 are now indexed; index document 2 and document 3 respectively using step D;
All references of document 4 are now indexed; index document 4 using step D;
All indexing is complete.
Fig. 13 shows the structure of a novel semi-automatic indexing device for Chinese scientific and technical documents of the present invention, which may specifically comprise:
Acquisition device 1301 for the initial cited-document set, used to obtain all references of the documents to be indexed.
The acquisition device 1301 for the initial cited-document set may specifically comprise:
Acquisition device 1311 for the references of a single document, used to extract the references of each document from that document.
Merging device 1321 for references, used to merge all extracted references and remove duplicate references, obtaining the initial cited-document set.
Indexing device 1331 for the initial cited documents, used to index each document in the initial cited-document set.
Indexing device 1302 for the initial cited-document set, used to index every document in the initial cited-document set.
Citation relationship construction device 1303, used to build the citation network among documents.
The citation relationship construction device 1303 may specifically comprise:
Point set generating device 1313, used to map each document uniquely to a point of the citation network.
Edge set generating device 1323, used to map each citation relationship uniquely to an edge of the citation network.
Network construction device 1333, used to build the citation network from the generated point set and edge set.
Single-document indexing device 1304, used to index each document automatically.
Multi-document iterative indexing device 1305, used to index multiple documents automatically.
In summary, the present invention provides a novel semi-automatic indexing method for Chinese scientific and technical documents, which can effectively overcome the low efficiency and inaccuracy of existing automatic indexing methods for Chinese scientific and technical documents.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Because the system or device embodiment is basically similar to the method embodiment, its description is relatively brief; for the relevant parts, refer to the description of the method embodiment.
The novel semi-automatic indexing method for Chinese scientific and technical documents provided by the present invention has been described in detail above. Specific examples have been used herein to explain the principle and embodiments of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A novel semi-automatic indexing method for Chinese scientific and technical documents, characterized by comprising:
Acquiring the cited documents of the document set that the user needs to index, obtaining a cited-document set;
Indexing every document in the cited-document set, obtaining indexed cited documents;
Constructing the citation network among the documents in the cited-document set, obtaining the citation network among the documents in the cited-document set;
Iteratively indexing the documents in the document set that the user needs to index, so that every document in that document set is indexed.
2. The method according to claim 1, characterized in that the step of acquiring the cited documents of the document set that the user needs to index, obtaining a cited-document set, comprises:
Initializing the cited-document set;
For every document in the document set that the user needs to index, obtaining all references of that document; if those references are already in the cited-document set, the number of documents in the cited-document set remains unchanged; if those references are not in the cited-document set, putting them into the cited-document set, obtaining the updated cited-document set.
3. The method according to claim 1, characterized in that the step of indexing every document in the cited-document set, obtaining indexed cited documents, comprises:
If the cited-document set contains no cited documents, indexing every document in the cited-document set, completing the indexing of the cited documents in the cited-document set and obtaining indexed cited documents;
If the cited-document set contains cited documents, manually indexing every cited document in the cited-document set, obtaining manually indexed cited documents; then indexing every document in the cited-document set, completing the indexing of the cited documents in the cited-document set and obtaining indexed cited documents.
4. The method according to claim 1, characterized in that the step of constructing the citation network among the documents in the cited-document set, obtaining the citation network among the documents in the cited-document set, comprises:
Initializing the citation relationship set;
Putting the documents for which the citation network is to be built into the document set, obtaining the updated document set;
Putting the two possible citation relationships between any two documents in the updated document set into the citation network, obtaining the citation relationships of any two documents;
Returning the citation relationships of any two documents and collecting them into the citation relationship set among the documents; this completes the construction of the citation network among the documents in the cited-document set and yields that citation network.
5. The method according to claim 1, characterized in that the step of iteratively indexing the documents in the document set that the user needs to index, so that every document in that document set is indexed, comprises:
Initializing the indexed-document set;
Iteratively indexing each document in the set of documents that the user needs to index; if all references of a document have been indexed, performing the operation of indexing that document;
Removing the indexed document from the set of documents to be indexed, and putting it into the indexed-document set and the corresponding cited-document set;
If the set of documents to be indexed is not empty, repeating the cycle of iteratively indexing each document in that set, removing each indexed document from it, and putting the indexed document into the indexed-document set and the corresponding cited-document set, until the set of documents to be indexed is empty, which completes the iterative indexing of the documents in the document set that the user needs to index;
If the set of documents to be indexed is empty, all documents have been indexed and the indexed-document set is obtained; outputting the indexed-document set, which completes the iterative indexing of the documents in the document set that the user needs to index.
6. The method according to claim 5, characterized in that the step of iteratively indexing each document in the set of documents that the user needs to index and, if all references of a document have been indexed, performing the operation of indexing that document, comprises:
Iteratively indexing each document in the set of documents that the user needs to index; if all references of a document have been indexed, calculating the index weight of each document in the reference set of that document, obtaining the index weight value of each document in the reference set;
Acquiring the candidate index term set;
Calculating the weight of each term in the candidate index term set, obtaining the weight value of each term in the candidate index term set;
Selecting the 6 terms with the largest weight values in the candidate index term set as the index terms of the document, which completes the operation of indexing the document.
7. The method according to claim 6, characterized in that the step of calculating the index weight of each document in the reference set of the document, obtaining the index weight value of each document in the reference set, comprises:
Obtaining the citation count of each document in the reference set;
Calculating the length of the shortest path in the citation network between each document in the reference set and the document being indexed;
Calculating the weight of each document in the reference set, obtaining the weight value of each document in the reference set;
Returning and collecting the weight values of all documents in the reference set, which completes the calculation of the index weight value of each document in the reference set.
8. The method according to claim 6, characterized in that the step of acquiring the candidate index term set comprises:
Initializing the candidate index term set, obtaining the initialized candidate index term set;
Putting the index terms of the references of the selected document into the candidate index term set, obtaining the corresponding candidate index term set;
Returning the candidate index term set, which completes the acquisition of the candidate index term set.
9. The method according to claim 8, characterized in that:
If the reference set contains candidate index terms, one document is taken out of the reference set and the index terms of that document are put into the candidate index term set; after this operation has been performed once, the operation of selecting a document and putting its index terms into the candidate index term set is repeated until all candidate index terms in the reference set have been found; during this loop, an index term that occurs more than once is counted only once;
If the reference set contains no candidate index terms, a document is selected from the reference set and indexed to obtain an indexed document, and the index terms of the resulting indexed document are added to the candidate index term set; after this operation has been performed once, the operation of selecting a document and putting its index terms into the candidate index term set is repeated until all candidate index terms in the reference set have been found; during this loop, an index term that occurs more than once is counted only once.
10. The method according to claim 6, characterized in that the step of calculating the weight of each term in the candidate index term set, obtaining the weight value of each term in the candidate index term set, comprises:
Selecting the document that the user needs to index;
For each candidate index term of the document that the user needs to index, finding all references that carry that index term;
Calculating the weights of all references that carry the index term, obtaining their index weight values;
Adding up the index weight values of all references that carry the index term, obtaining their sum;
Returning this sum as the weight of the index term, which completes the weight calculation for that term;
Performing the above weight calculation for every candidate index term in the candidate index term set, traversing the whole set, so that the weight value of each term in the candidate index term set is obtained.
CN201110424369.1A 2011-12-16 2011-12-16 Novel semi-automatic indexing method of Chinese scientific and technical documents Expired - Fee Related CN102831134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110424369.1A CN102831134B (en) 2011-12-16 2011-12-16 Novel semi-automatic indexing method of Chinese scientific and technical documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110424369.1A CN102831134B (en) 2011-12-16 2011-12-16 Novel semi-automatic indexing method of Chinese scientific and technical documents

Publications (2)

Publication Number Publication Date
CN102831134A true CN102831134A (en) 2012-12-19
CN102831134B CN102831134B (en) 2015-02-25

Family

ID=47334276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110424369.1A Expired - Fee Related CN102831134B (en) 2011-12-16 2011-12-16 Novel semi-automatic indexing method of Chinese scientific and technical documents

Country Status (1)

Country Link
CN (1) CN102831134B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539904A (en) * 2009-04-21 2009-09-23 武汉大学 Automatic indexing method of quotations
CN102163222A (en) * 2011-04-02 2011-08-24 中国医学科学院医学信息研究所 Information search sequencing method based on index association relation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
曾洪京: "试论引文在自动标引中的应用", 《情报杂志》, vol. 12, no. 04, 30 September 1993 (1993-09-30), pages 58 - 61 *
马然等: "基于引文的自动标引法初探", 《江苏图书馆学报》, no. 01, 31 December 2002 (2002-12-31), pages 13 - 15 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654114A (en) * 2015-12-24 2016-06-08 国家电网公司信息通信分公司 Literature novelty checking method and device
CN105740452A (en) * 2016-02-03 2016-07-06 北京工业大学 Scientific and technical literature importance degree evaluation method based on PageRank and time decay
CN105740452B (en) * 2016-02-03 2019-04-19 北京工业大学 The scientific and technical literature different degree evaluation method to be decayed based on PageRank and time

Also Published As

Publication number Publication date
CN102831134B (en) 2015-02-25

Similar Documents

Publication Publication Date Title
US9104749B2 (en) Semantically aggregated index in an indexer-agnostic index building system
CN104899314B (en) Lineage analysis method and apparatus for a data warehouse
CN103678412B (en) Method and device for document retrieval
CN103886099B (en) Semantic retrieval system and method of vague concepts
CN102982076A (en) Multi-dimensionality content labeling method based on semanteme label database
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN105117487A (en) Book semantic retrieval method based on content structures
CN109947921A (en) Intelligent question answering system based on natural language processing
CN107463711A (en) Data tag matching method and device
CN102831131A (en) Method and device for establishing labeling webpage linguistic corpus
CN105631018A (en) Article feature extraction method based on topic model
CN104572758A (en) Method and system for automatically extracting power field specialized vocabularies
CN105608075A (en) Related knowledge point acquisition method and system
CN103440343B (en) Knowledge base construction method facing domain service target
CN102411568A (en) Chinese word segmentation method based on travel industry feature word stock
CN107679124B (en) Knowledge graph Chinese question-answer retrieval method based on dynamic programming algorithm
CN101661469A (en) System and method for indexing and retrieving keywords of academic documents
CN105447104A (en) Knowledge map generating method and apparatus
CN106980639B (en) Short text data aggregation system and method
CN104063382B (en) Multi-strategy fusion standard terminology processing method for the oil and gas pipeline field
CN102831134A (en) Novel semi-automatic indexing method of Chinese scientific and technical documents
CN107391690B (en) Method for processing document information
CN108595413B (en) Answer extraction method based on semantic dependency tree
CN109271560A (en) Linked data keyword query method based on tree template
CN106933844A (en) Construction method of a reachability query index for large-scale RDF data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150225

Termination date: 20161216