CN101539904B - Automatic indexing method of quotations - Google Patents


Info

Publication number
CN101539904B
Authority
CN
China
Prior art keywords
document
text block
quoted passage
text
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100617119A
Other languages
Chinese (zh)
Other versions
CN101539904A (en)
Inventor
沈阳
沈劲枝
田晨耕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN2009100617119A priority Critical patent/CN101539904B/en
Publication of CN101539904A publication Critical patent/CN101539904A/en
Application granted granted Critical
Publication of CN101539904B publication Critical patent/CN101539904B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an automatic indexing method of quotations, comprising the following steps. Step 1: cut a submitted document into text blocks, extract feature phrase strings or information fingerprints from the text blocks, and submit them to a search engine. Step 2: for each submitted feature phrase string or information fingerprint, when the search engine returns a corresponding search result, record the search result as the quotation source of the corresponding text block, record the end position of the text block in the document, and record the association between the quotation source and that end position. Step 3: combine the existing citations in the submitted document with the search results to eliminate repeated quotation sources, sort all quotation sources by their order of appearance in the submitted document, and index them. The method overcomes the extremely low efficiency of the existing manual approach and improves indexing speed and accuracy.

Description

Automatic indexing method of quotations
Technical field
The invention belongs to the field of computer tools, and in particular relates to an automatic indexing method of quotations.
Background art
Indexing takes two forms: reference lists, and footnotes or endnotes. A reference list cites or refers to a work or paper as a whole in the course of academic research and is usually placed at the end of the article. Footnotes and endnotes are supplementary notes to the text: a footnote generally sits at the bottom of the page and annotates a particular passage, while an endnote generally sits at the end of the document and lists the sources of quotations and the like. References, footnotes and endnotes all consist of two associated parts: a citation marker, and the corresponding descriptive text or source statement, which the present invention refers to as the quotation source. Common forms of citation marker include *, [1] and superscript [1]. Common forms of the corresponding descriptive text or source statement include a short note such as "[18] same book as cited in note 4, p. 153", or a full entry such as: [3] Heider, E.R. & D.C. Oliver. The structure of color space in naming and memory of two languages [J]. Foreign Language Teaching and Research, 1999, (3): 62-67.
When writing textbooks or papers, people often copy and paste written material into their own work. By the time the draft is finished, the source of that material has been lost and the quotation can no longer be indexed. The author has no subjective intention to plagiarize, yet plagiarism has regrettably occurred in objective terms.
Software such as EndNote and NoteExpress currently helps users organize literature. When writing a scientific paper, dissertation, monograph or report, the user can conveniently add an in-text note at a chosen position, and a reference list is then generated automatically in the format required by the target journal. This makes it very convenient to insert or modify references the user already knows, but it does not solve the problem of quotations that cannot be indexed because the source of the material has been lost.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing a method for automatically indexing quotations, replacing the inefficient manual retrieval described above.
The technical scheme of the invention comprises the following steps:
Step 1: cut the submitted document into text blocks and extract a feature phrase string or an information fingerprint from each text block; then submit the feature phrase string or information fingerprint to a search engine.
Step 2: for each submitted feature phrase string or information fingerprint, when the search engine returns a corresponding search result, record the search result as the quotation source of the corresponding text block, record the end position of the text block in the document, and record the association between the quotation source of the text block and that end position.
Step 3: combine the existing citations in the submitted document with the search results, remove duplicate quotation sources, sort all quotation sources by their order of appearance in the submitted document, and then index them.
Duplicate quotation sources are removed, by combining the existing citations in the submitted document with the search results, as follows.
Information about the existing citations is extracted from the submitted document and compared with the information about the search results obtained in step 2. The information about an existing citation comprises its citation marker position, its quotation source, and the association between the two. The information about a search result comprises the end position of the text block in the document, the quotation source of the text block, and the association between the two.
When a duplicate quotation source is found, the citation marker position of the corresponding existing citation, or the end position of the corresponding text block in the document, is located through the association between marker position and quotation source, or between the quotation source of the text block and its end position. The occurrence whose position is earliest in the submitted document is kept, and the other duplicates are removed.
After all quotation sources have been sorted by their order of appearance in the submitted document, indexing is carried out as follows.
Citation markers are added, in sorted order, at the citation marker positions of the existing citations or at the end positions of the text blocks, and the quotation sources are added to the submitted document according to the association between marker position and quotation source, or between the quotation source of the text block and its end position.
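The indexing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are character offsets into the document text, markers take the "[n]" form, and the function name `insert_markers` and the tuple layout `(number, position, source)` are hypothetical choices, not taken from the patent.

```python
def insert_markers(text, numbered):
    """Insert '[n]' markers and append a source list.

    numbered: list of (number, position, source) tuples, where position is a
    character offset in text (an assumed representation).
    """
    out = text
    # Insert from the back of the document so earlier offsets stay valid.
    for number, position, _ in sorted(numbered, key=lambda r: r[1], reverse=True):
        out = out[:position] + f"[{number}]" + out[position:]
    # Append the quotation sources in numeric order at the end of the document.
    refs = "\n".join(f"[{n}] {src}" for n, _, src in sorted(numbered))
    return out + "\n\n" + refs
```

Inserting from the last position backwards is a design choice that avoids recomputing offsets after each insertion.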
Further, when an information fingerprint is submitted to the search engine in step 1, exact (rigid) string matching is used to search for the fingerprint, and in step 2 the search result whose fingerprint matches is recorded as the quotation source of the corresponding text block.
Further, when a feature phrase string is submitted to the search engine in step 1, fuzzy (flexible) string matching or a string information relevance technique is used to search for it, and in step 2 only search results whose relevance exceeds a preset relevance threshold are recorded as the quotation source of the corresponding text block.
Further, the start position of the text block in the document is recorded. When the search engine returns a search result whose relevance to the feature phrase string submitted in step 1 exceeds a preset match threshold, quotation marks are added around the text block in the submitted document according to its start and end positions.
Further, when more than one feature phrase string is extracted from the text blocks in step 1, either step 2 is executed in a loop over all feature phrase strings and step 3 is executed once step 2 has finished for all of them, or steps 2 and 3 are executed in sequence for each feature phrase string in turn.
Further, after the search engine returns the search results corresponding to the feature phrase strings or information fingerprints, the proportion of the submitted document occupied by the text blocks that currently have quotation sources is computed and displayed to the user.
Further, before step 3 is executed, three logical jumps are offered to the user through a human-machine interface: mark the text block, modify the text block, and delete the text block. When the user chooses to mark the text block, step 3 is allowed to execute.
Further, in step 2 the start position of the text block in the document is recorded. When the user chooses to modify the text block, step 3 is not allowed to execute; the text block is highlighted in the submitted document according to its start and end positions so that the user can modify it, and after the user saves the modification the method returns to step 1 and automatic indexing is performed again on the modified text block.
Further, in step 2 the start position of the text block in the document is recorded. When the user chooses to delete the text block, step 3 is not allowed to execute, and the text block is deleted automatically from the submitted document according to its start and end positions.
Further, after a logical jump has been taken and the corresponding processing completed, the proportion of the submitted document occupied by the text blocks that currently have quotation sources is computed and displayed to the user.
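The proportion statistic can be illustrated with a small Python sketch. It assumes the proportion is measured in characters and that each text block is given as a (start, end) pair with an exclusive end; the patent does not fix the unit, and the name `cited_ratio` is hypothetical.

```python
def cited_ratio(doc_length, cited_spans):
    """Fraction of the document covered by text blocks that have a source.

    cited_spans: iterable of (start, end) character offsets, end exclusive.
    Overlapping spans are counted only once.
    """
    if doc_length <= 0:
        return 0.0
    covered = 0
    last_end = 0
    for start, end in sorted(cited_spans):
        start = max(start, last_end)   # skip the part already counted
        end = min(end, doc_length)     # clip to the document
        if end > start:
            covered += end - start
            last_end = end
    return covered / doc_length
```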
Compared with the prior art, the invention has the following advantages:
1. The invention overcomes the weakness of the existing manual method, which is so inefficient as to be practically unworkable, and largely remedies the defect that citation-marking systems such as EndNote can only work with known references or a closed literature database.
2. Through quotation deduplication and merging, the invention makes each quotation marker unique, reducing the number of repeated quotations while improving indexing speed and accuracy.
3. Through fuzzy matching and information relevance techniques, the invention can mark a quotation accurately even when the quoted text has been modified. This solves the problem that, once the user has altered quoted text, the source of that earlier quotation can no longer be found.
4. The invention targets the whole Internet together with literature resource databases. It uses a unified search engine built from four basic engines: a meta search engine, a vertical search engine, an information relevance (including information similarity) detection engine over literature resource databases, and a deep web mining engine. This fundamentally solves the problem of covering a massive range of information sources during automatic indexing.
Description of drawings
Fig. 1 is a flow chart of an embodiment of the invention.
Embodiment
The automatic indexing method of quotations provided by the invention comprises the following steps, which in practice can be run automatically using computer software:
Step 1: cut the submitted document into text blocks and extract a feature phrase string or an information fingerprint from each text block; then submit the feature phrase string or information fingerprint to a search engine.
In practice, an interactive window can be provided in which the user enters a document, a document source or a source link, so that the submitted text is accepted and the document to be indexed is determined. The user can directly submit a document file in a format such as Doc, Txt or Docx, or can submit the address or content of an online document held in an online office system such as Google Doc.
Cutting the submitted document into text blocks and extracting feature phrase strings or information fingerprints from them can rely on existing technology; for example, the ROST anti-plagiarism system and Turnitin both provide a document cutting module and a text block feature phrase extraction module. When cutting, the user can customize the block size or the cutting rule: blocks can be cut by character count, or by paragraph, sentence or a particular symbol. When the document to be checked is very small, a text block obtained by cutting can serve directly as a feature phrase string, but more often each text block yields several feature phrase strings, so steps 2 and 3 can be executed in sequence for each feature phrase string in turn. In that case, each time a feature phrase string is cut out of a text block it is submitted to the search engine and steps 2 and 3 are executed; the next feature phrase string is then submitted and steps 2 and 3 are executed again. The flow can also be simplified: all feature phrase strings are submitted to the search engine in a loop and step 2 is executed for each, and step 3 is executed only after step 2 has finished for all of them, so that all search results are combined at the end, the quotation sources are sorted, and indexing is carried out once. This mode is more efficient because it avoids wasting resources on repeatedly removing duplicate quotation sources and re-sorting all quotation sources before indexing.
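A minimal Python sketch of the cutting and fingerprinting in step 1 is given below. It is an illustration under assumptions, not the ROST or Turnitin implementation: blocks are cut either by a fixed character count or at sentence-ending punctuation, positions are character offsets with an exclusive end, and SHA-256 over whitespace-normalised, lower-cased text stands in for the information fingerprint. The names `cut_into_blocks` and `fingerprint` are hypothetical.

```python
import hashlib
import re


def cut_into_blocks(text, max_chars=None):
    """Cut text into blocks, returning (start, end, block_text) tuples.

    If max_chars is None, cut after sentence-ending punctuation; otherwise
    cut every max_chars characters. Offsets are character positions and the
    end offset is exclusive.
    """
    blocks = []
    if max_chars is None:
        # A block is a run of non-terminators followed by its terminators.
        for m in re.finditer(r"[^.!?。！？]+[.!?。！？]*", text):
            if m.group().strip():
                blocks.append((m.start(), m.end(), m.group()))
    else:
        for start in range(0, len(text), max_chars):
            end = min(start + max_chars, len(text))
            blocks.append((start, end, text[start:end]))
    return blocks


def fingerprint(block_text):
    """Hash of normalised text, used here as a stand-in information fingerprint."""
    normalised = " ".join(block_text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

Recording the start and end offsets at cutting time means the later steps (adding markers, quotation marks, highlighting or deleting a block) need no second pass over the document.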
The embodiment adopts the simplified flow above. The document is cut into text blocks K_1, K_2, ..., K_N, and feature phrase strings S_1, S_2, ..., S_N are extracted from these blocks. All feature phrase strings S_1, S_2, ..., S_N are then submitted in a loop to the query server. Once all feature phrase strings of the submitted document have been queried, step 2 is executed, yielding a quotation source record set {U_1, U_2, ..., U_N} associated with the start/end position record set {P_1, P_2, ..., P_N} of the text blocks containing the feature phrase strings, together with a relevance record set {R_1, R_2, ..., R_N} for each pair. If the feature phrase strings extracted from two or more consecutive text blocks retrieve the same source, those consecutive blocks can be merged into one new text block, and the end position (and in some cases the start position) of the merged block in the document is obtained, so a further consolidation pass can be applied. This pass compares the start/end position record set {P_1, P_2, ..., P_N} with the quotation source record set {U_1, U_2, ..., U_N}: where the end position of an earlier text block is contiguous with the start position of a later text block and both have the same quotation source, the blocks are merged. The start position of the earlier block and the end position of the later block replace the corresponding entries in {P_1, P_2, ..., P_N}, the merged entry is associated with the quotation source shared by the original blocks, and the corresponding entries in {U_1, U_2, ..., U_N} are updated. Step 3 is finally executed on the consolidated result.
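The consolidation pass can be sketched as follows. The sketch assumes each record is a (start, end, source) tuple sorted by start, with an exclusive end offset so that "contiguous" means the earlier end equals the later start; this representation and the name `merge_adjacent` are illustrative assumptions.

```python
def merge_adjacent(records):
    """Merge consecutive text blocks that share a quotation source.

    records: list of (start, end, source), sorted by start, end exclusive.
    Two records merge when the earlier end equals the later start and the
    sources are equal.
    """
    merged = []
    for start, end, source in records:
        if merged and merged[-1][2] == source and merged[-1][1] == start:
            # Extend the previous block to the end of the current one.
            merged[-1] = (merged[-1][0], end, source)
        else:
            merged.append((start, end, source))
    return merged
```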
In practice, the feature phrase string or information fingerprint can be submitted to a query server on the Internet or an intranet for retrieval. The query server can be the server of one or more existing search engines, a server built specifically for quotation marking, the server of a literature content database (such as a Wanfang Data query server), or a forum, deep web database, encyclopedia, question-and-answer or social network query server that supports forwarding of query tasks.
The invention suggests gathering data from four basic search engines: a meta search engine, a vertical search engine, an information relevance detection engine over literature resource databases, and a deep web mining engine, so that the literature that may have been quoted is covered on a massive scale. A meta search engine is an engine that calls other independent search engines; it integrates, invokes, controls and optimizes the use of multiple independent search engines. A vertical search engine is a specialized search engine for a particular industry. It is a subdivision and extension of the general search engine that integrates one class of specialized information from the web page library, and it can therefore search specific websites or local files for a specific literature field. The information relevance detection engine over literature resource databases has a detection scope on the scale of ten thousand documents, including academic journals, doctoral dissertations, outstanding master's theses, reference books, major conference papers, yearbooks, monographs, newspapers, patents, standards, scientific and technological achievements, knowledge units, review databases and ancient books. As for the deep web mining engine: Dr. Jill Ellsworth first proposed the Deep Web, also called the invisible Web or deep web resources, in 1996. Ordinary search engines cannot find the information it contains, yet its data volume is very large and its content is often more authoritative and of higher quality. This is exactly the content users most like to quote. According to research by Gary Price, the Deep Web on the World Wide Web is currently 2 to 50 times the size of the Visible Web, and its quality is much higher. A deep web mining engine that can retrieve such Deep Web documents is therefore very useful for tracing the quotation sources of a document.
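How a unified engine might combine several basic engines can be sketched as below. Everything here is a hypothetical illustration: each engine is modelled as a plain Python callable that takes a query string and returns (source, relevance) pairs, and the combined result keeps the best relevance seen for each source. The patent does not describe the engines' interfaces.

```python
def unified_search(query, engines):
    """Query every engine and keep the best relevance per source.

    engines: iterable of callables, each mapping a query string to a list of
    (source, relevance) pairs (an assumed interface).
    Returns (source, relevance) pairs sorted by descending relevance.
    """
    best = {}
    for engine in engines:
        for source, relevance in engine(query):
            if source not in best or relevance > best[source]:
                best[source] = relevance
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```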
Before a feature phrase string is submitted to the search engine, the query can be configured to use existing fuzzy (flexible) string matching, exact (rigid) string matching, or a string information relevance technique. The advantage of fuzzy matching and the information relevance technique is that quotations which have been partly modified can still be marked, whereas exact string matching is faster. With exact string matching, the feature phrase string is submitted to the query server and the server retrieves content identical to the submitted string. Fuzzy string matching can additionally retrieve literature that differs somewhat from the submitted feature phrase, and it gives a relevance score between the submitted string and the matched phrase. For example, if the submitted feature phrase string is "automatic quotation marking", then with fuzzy matching the query server can find not only "automatic quotation marking" but also "manual quotation marking", and it reports the relevance between the two. In this way even a partly modified quotation can be recognized and marked, increasing precision and recall. The string information relevance technique is similar to fuzzy string matching: the search results include not only content identical to the feature phrase string but also literature that is relevant to the submitted feature phrase, and a relevance score is likewise given for each result. In the invention, "relevant" covers not only similar wording but also related content. Submitting an information fingerprint to the search engine is suited to exact string matching; an information fingerprint is the hash value obtained by performing fingerprint extraction on a text block.
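The difference between exact and fuzzy matching can be illustrated with Python's standard library. This sketch uses `difflib.SequenceMatcher` as a stand-in similarity measure; the patent does not specify the matching algorithm, so the choice of measure and the function names are assumptions.

```python
from difflib import SequenceMatcher


def rigid_match(query, candidate):
    """Exact matching: true only when the two strings are identical."""
    return query == candidate


def flexible_match(query, candidate):
    """Fuzzy matching: a similarity score in [0, 1] standing in for relevance."""
    return SequenceMatcher(None, query, candidate).ratio()
```

With the example from the text, `rigid_match("automatic quotation marking", "manual quotation marking")` is false, while `flexible_match` on the same pair gives a score strictly between 0 and 1 that can be compared against a threshold.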
Step 2: for each submitted feature phrase string or information fingerprint, when the search engine returns a corresponding search result, record the search result as the quotation source of the corresponding text block, record the end position of the text block in the document, and record the association between the quotation source of the text block and that end position.
The invention also provides a further technical scheme. When a feature phrase string is submitted to the search engine in step 1 and fuzzy string matching or the string information relevance technique is used to search for it, step 2 records only search results whose relevance exceeds a preset relevance threshold A as the quotation source of the corresponding text block, that is, of the text block containing the feature phrase string; search results with lower relevance are excluded. This improves the accuracy of automatic indexing. For a submitted information fingerprint, the search engine returns the search results whose hash value equals that of the fingerprint; the search result is recorded as the quotation source of the text block, together with the end position in the document of the text block to which the hash value belongs and the association between the quotation source and that end position. A quotation source in the sense of the invention may take the form of a detailed description, stating for example the document title, author and date, or may simply take the form of a direct source link.
By convention, when quoted content is identical to the original text of the source document, quotation marks, usually double quotation marks, are placed around it to indicate a direct quotation. The invention provides a corresponding automatic treatment: the start position of the text block in the document is recorded, and when the search engine returns a search result whose relevance to the feature phrase string submitted in step 1 exceeds a preset match threshold B, quotation marks are added around the text block in the submitted document according to its start and end positions. The match threshold B should be higher than the relevance threshold A. B can be set so that quotation marks are added to a text block only when a search result matching the feature phrase string exactly is obtained.
When step 1 extracts an information fingerprint from the text block and submits it to the search engine, exact string matching can be used to search for the fingerprint, and step 2 records the search result whose fingerprint matches as the quotation source of the corresponding text block. Because fingerprint retrieval generally matches hash values that are exactly equal, a returned result should be identical to the quoted content of the text block. Therefore, when a result with a matching fingerprint is returned, quotation marks can be added around the text block in the submitted document directly, according to its start and end positions.
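The interplay of the relevance threshold A and the match threshold B can be sketched as follows. The default values 0.6 and 1.0, the use of "greater than or equal" so that an exact match (relevance 1.0) can reach B, and the names `classify_result` and `add_quotation_marks` are all assumptions made for illustration.

```python
def classify_result(relevance, threshold_a=0.6, threshold_b=1.0):
    """Decide how a search result is treated.

    Returns "quote" (record the source and add quotation marks), "cite"
    (record the source only) or "discard". The default thresholds are
    illustrative; B must not be lower than A.
    """
    if threshold_b < threshold_a:
        raise ValueError("threshold B should not be lower than threshold A")
    if relevance >= threshold_b:
        return "quote"
    if relevance >= threshold_a:
        return "cite"
    return "discard"


def add_quotation_marks(text, start, end):
    """Wrap text[start:end] in double quotation marks (end exclusive)."""
    return text[:start] + '"' + text[start:end] + '"' + text[end:]
```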
In the embodiment, after all feature phrase strings S_1, S_2, ..., S_N extracted from the submitted document have been submitted in a loop to the query server, the quotation sources obtained are recorded in the quotation source record set {U_1, U_2, ..., U_N}, and the relevance between each search result and the text block containing the corresponding feature phrase string is recorded in the relevance record set {R_1, R_2, ..., R_N}. The corresponding entries of {U_1, U_2, ..., U_N}, of the start/end position record set {P_1, P_2, ..., P_N} and of {R_1, R_2, ..., R_N}, that is, each source, the start and end positions of its text block and the relevance between them, are associated with one another. Associating the relevance supports subsequent processing. The start and end positions of the text block containing a feature phrase string can be obtained during cutting and extraction in step 1. Besides being used to add quotation marks around a text block automatically, the start position can support other operations. For example, before step 3 is executed, three logical jumps are offered to the user through a human-machine interface: mark the text block, modify the text block, and delete the text block. Step 3 is allowed to execute only when the user chooses to mark the text block. When the user chooses to modify the text block, step 3 is not allowed to execute; the text block is highlighted in the submitted document according to its start and end positions so that the user can modify it, and after the user saves the modification the method returns to step 1 and automatic indexing is performed again on the modified text block. When the user chooses to delete the text block, step 3 is not allowed to execute, and the text block is deleted automatically from the submitted document according to its start and end positions.
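One way to keep the three record sets associated is to store them as a single list of records. The sketch below is a hypothetical data layout, not the patent's: the class `BlockRecord` and the helper `build_records` are illustrative names, and positions are assumed to be (start, end) character offsets.

```python
from dataclasses import dataclass


@dataclass
class BlockRecord:
    start: int        # start position of the text block (from set P)
    end: int          # end position of the text block (from set P)
    source: str       # quotation source (from set U)
    relevance: float  # relevance of the search result (from set R)


def build_records(positions, sources, relevances):
    """Zip the P, U and R record sets into associated records."""
    if not (len(positions) == len(sources) == len(relevances)):
        raise ValueError("record sets must have equal length")
    return [
        BlockRecord(start, end, source, relevance)
        for (start, end), source, relevance in zip(positions, sources, relevances)
    ]
```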
These three logical jumps can also be taken automatically, without user selection, by setting jump conditions in advance. The invention suggests that, in practice, a relevance threshold or the position of the text block be set as the basis for the jump decision according to the user's needs. By setting a text block position, the user can specify a particular treatment, for example automatically deleting a text block located in a particular part of the submitted document (such as the middle-to-rear part) when a search result is obtained for it. The embodiment implements automatic logical jumps and the corresponding processing as follows:
By default, when neither the modify jump nor the delete jump is triggered, the embodiment jumps automatically to marking the text block, that is, it executes step 3: the end position of the text block is reconciled with the citation marker positions of the existing citations in the submitted document (including the reference list at the end of the text and footnote and endnote markers), and, according to the quotation source record set {U_1, U_2, ..., U_N}, repeated references other than their first occurrence are removed and the rest are renumbered. At the same time, citation markers such as "[2]" or superscript "[2]" are added at the end positions of the text blocks in the submitted document. The original references of the submitted document are modified, deleted or backed up, and the quotation sources obtained by retrieval are inserted at the end of the document or at the corresponding footnote positions. The end positions of the text blocks are taken from the start/end position record set {P_1, P_2, ..., P_N}. In practice, meeting the preset match threshold B can also be used as the jump condition for marking, so that a text block is given quotation marks and marked only when the content of the retrieved document is identical to the quoted content of the text block.
When the search engine returns a search result whose relevance to the feature phrase string submitted in step 1 exceeds a preset threshold C, the method jumps automatically to prompting the user to modify the text. The start address of the text block containing the submitted feature phrase string is taken from the start/end position record set {P_1, P_2, ..., P_N}, the text block is located in the document being checked and highlighted, and the user can then modify its content. After the user saves the modification, the method returns automatically to step 1 and searches the modified content again to see whether the similarity still exceeds threshold C. If it does, the method can again jump to prompting the user to modify the text.
When the search engine returns a retrieval result whose correlation with the characteristic expression string submitted in step 1 is higher than a preset threshold D, the position of the corresponding text block in the document is automatically extracted from the start and end position record set {P1, P2, ..., PN}; the text block is located in the submitted document and is either deleted directly from the original text or marked with a deletion flag.
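The three jumps described above (mark, modify, delete) can be pictured as a small dispatch on the correlation score. The sketch below is only one possible reading: the text does not state how thresholds C and D relate, so the ordering D > C, the numeric defaults, and the names `Jump` and `choose_jump` are assumptions introduced for illustration.

```python
from enum import Enum


class Jump(Enum):
    MARK = "mark"      # default: proceed to step 3 and index the block
    MODIFY = "modify"  # highlight the block and ask the user to revise it
    DELETE = "delete"  # delete the block or flag it for deletion


def choose_jump(correlation, threshold_c=0.6, threshold_d=0.9):
    """Pick a jump for one text block from its correlation score.

    Assumption: threshold D is stricter than threshold C, so the delete
    check runs first. The default values are placeholders.
    """
    if correlation > threshold_d:
        return Jump.DELETE
    if correlation > threshold_c:
        return Jump.MODIFY
    return Jump.MARK


print(choose_jump(0.95), choose_jump(0.7), choose_jump(0.3))
```

After a MODIFY jump the revised block would be sent back through step 1 and scored again, so `choose_jump` may be called repeatedly for the same block.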
Step 3: the existing quotation index in the submitted document is combined with the retrieval results, repeated quotation sources are removed, and all quotation sources are then sorted by their order of position in the submitted document and indexed.
The removal of repeated quotation sources by combining the existing quotation index in the submitted document with the retrieval results is implemented as follows.
The relevant information of the existing quotation index is extracted from the submitted document and compared with the relevant information of the retrieval results obtained in step 2. The relevant information of the existing quotation index comprises the citation-mark positions of the existing quotation index, its quotation sources, and the association between citation-mark positions and quotation sources. The relevant information of the retrieval results comprises the ending position of each text block in the document, the quotation source of the text block, and the association between the quotation source and the ending position of the text block.
When a quotation source is repeated, the corresponding citation-mark position of the existing quotation index, or the corresponding ending position of the text block in the document, is found through the association between citation-mark positions and quotation sources or the association between the quotation source and the ending position of the text block. The quotation source whose position is earliest in the submitted document is kept, and the other repeated quotation sources are removed.
After all quotation sources have been sorted by their order of position in the submitted document, the indexing is implemented as follows.
Citation marks are added in sorted order at the citation-mark positions of the existing quotation index or at the ending positions of the text blocks in the document, and the quotation sources are added to the submitted document according to the association between citation-mark positions and quotation sources or the association between the quotation source and the ending position of the text block.
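The de-duplication and ordering in step 3 can be sketched as below. This is a simplified illustration under stated assumptions: each entry is a hypothetical (position, source) pair, where the position is either a citation-mark position from the existing index or a text-block ending position from step 2, and two sources count as "repeated" when their strings are equal. A real system would need to normalize bibliographic records before comparing them.

```python
def dedupe_and_number(entries):
    """Keep the earliest occurrence of each source and number by position.

    `entries` is a list of (position, source) pairs merged from the existing
    quotation index and the retrieval results (assumed representation).
    Returns a list of (number, position, source) tuples.
    """
    earliest = {}
    for position, source in entries:
        # Retain only the occurrence that comes first in the document.
        if source not in earliest or position < earliest[source]:
            earliest[source] = position

    # Sort the surviving sources by their position in the document.
    ordered = sorted(earliest.items(), key=lambda item: item[1])
    return [(number, position, source)
            for number, (source, position) in enumerate(ordered, start=1)]


entries = [(120, "Source B"), (40, "Source A"),
           (300, "Source A"), (210, "Source C")]
result = dedupe_and_number(entries)
print(result)
```

Here "Source A" appears at offsets 40 and 300; only the occurrence at 40 is kept, and the sources are then numbered 1 to 3 in document order.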
For this step, the embodiment of the invention is implemented as follows. The existing quotation index of the submitted document is extracted, and its relevant information is recorded in the original quotation record set {Reference/Footnote/Annotation ...}. The relevant information comprises the citation-mark positions of the existing quotation index, its quotation sources, and the association between citation-mark positions and quotation sources. Because a citation mark is always placed at the ending position of the text block it refers to, the citation-mark position can be compared with the ending position in the document of the text block containing a characteristic expression string. According to commonly used annotation formats, the existing quotation index present when the document is submitted may include three kinds: Reference, the reference list (quotation sources placed at the end of the article); Footnote, endnotes (quotation sources generally placed at the end of the article, before the reference list); and Annotation, footnotes (quotation sources placed at the bottom of the page). The quotation sources in the quotation source record set {U1, U2, ..., UN} whose correlation, as recorded in the correlation record set {P1, P2, ..., PN}, is higher than a preset correlation threshold A are compared with the records in the original quotation record set {Reference/Footnote/Annotation ...}. If any quotation source appears repeatedly, the sources are re-sorted, de-duplicated, and merged. The result is then marked up as a reference list, endnotes, or footnotes according to a chosen reference format (for example, the national standard (GB) or a journal's own standard).
In a specific implementation, each characteristic expression string may be submitted to the search engine as soon as it is cut from a text block, with steps 2 and 3 executed before the next characteristic expression string is cut out, submitted, and processed in the same way. In that case, the existing quotation index includes not only the index already present when the document was submitted but also the index added in earlier indexing rounds. For the index added in earlier rounds, the citation-mark positions, the quotation sources, and the association between them correspond respectively to the ending positions of the text blocks in the document, the quotation sources of the text blocks, and the association between quotation source and ending position recorded in step 2 of those rounds, together with the records kept in the original quotation record set in step 3 of those rounds.
Some publication venues currently stipulate that quoted text may not exceed a certain percentage of the full text. The invention therefore provides a further technical scheme: the proportion of the submitted document's total word count that is taken up by the text blocks corresponding to all current quotation sources is computed and displayed to the user. The current quotation sources include both the new quotation sources formed from the retrieval results and the quotation sources of the existing quotation index in the submitted document. The statistic can be displayed after the search engine returns the retrieval results corresponding to a characteristic expression string or information fingerprint, or after a logical jump has been carried out and its handling completed. Reporting the proportion promptly lets the user keep track in real time of how many words are currently quoted or copied. In a specific implementation, the computed proportion can be cached in computer memory and presented to the user through a human-computer interface such as a display screen.
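The quoted-text proportion described above can be computed along these lines. The sketch makes several assumptions not fixed by the text: text blocks are given as hypothetical (start, end) character offsets, length is measured in characters as a stand-in for word count, and overlapping blocks are counted once.

```python
def quoted_ratio(doc_length, blocks):
    """Return the fraction of the document covered by quoted text blocks.

    `blocks` is a list of (start, end) offsets, end exclusive, for every text
    block that has a quotation source (new or pre-existing). Overlapping
    blocks are merged so that no character is counted twice.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(blocks):
        start = max(start, last_end)  # skip the part already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / doc_length if doc_length else 0.0


ratio = quoted_ratio(100, [(0, 10), (5, 20), (50, 60)])
print(f"{ratio:.0%} of the document is quoted")
```

The value can be recomputed after each retrieval round or after each jump is handled, which matches the two display points mentioned in the text.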

Claims (7)

1. An automatic indexing method of quotations, characterized by comprising the following steps:
Step 1: the submitted document is cut to obtain text blocks, and characteristic expression strings or information fingerprints are extracted from the text blocks; the characteristic expression strings or information fingerprints are then submitted to a search engine;
Step 2: for each submitted characteristic expression string or information fingerprint, when the search engine returns the corresponding retrieval results, the retrieval results are recorded as the quotation sources of the corresponding text block, the ending position of the text block in the document is recorded, and the association between the quotation sources and the ending position of the text block is recorded;
Step 3: the existing quotation index in the submitted document is combined with the retrieval results, repeated quotation sources are removed, and all quotation sources are then sorted by their order of position in the submitted document and indexed;
wherein the removal of repeated quotation sources by combining the existing quotation index in the submitted document with the retrieval results is implemented as follows:
the relevant information of the existing quotation index is extracted from the submitted document and compared with the relevant information of the retrieval results obtained in step 2; the relevant information of the existing quotation index comprises the citation-mark positions of the existing quotation index, its quotation sources, and the association between citation-mark positions and quotation sources; the relevant information of the retrieval results comprises the ending position of each text block in the document, the quotation source of the text block, and the association between the quotation source and the ending position of the text block;
when a quotation source is repeated, the corresponding citation-mark position of the existing quotation index, or the corresponding ending position of the text block in the document, is found through the association between citation-mark positions and quotation sources or the association between the quotation source and the ending position of the text block; the quotation source whose position is earliest in the submitted document is kept, and the other repeated quotation sources are removed;
and wherein, after all quotation sources have been sorted by their order of position in the submitted document, the indexing is implemented as follows:
citation marks are added in sorted order at the citation-mark positions of the existing quotation index or at the ending positions of the text blocks in the document, and the quotation sources are added to the submitted document according to the association between citation-mark positions and quotation sources or the association between the quotation source and the ending position of the text block.
2. The automatic indexing method of quotations according to claim 1, characterized in that: when an information fingerprint is submitted to the search engine in step 1, the information fingerprint is retrieved using a rigid string matching technique, and in step 2 the retrieval results that match the information fingerprint are recorded as the quotation sources of the corresponding text block.
3. The automatic indexing method of quotations according to claim 1, characterized in that: when a characteristic expression string is submitted to the search engine in step 1, the characteristic expression string is retrieved using a flexible string matching technique or a string information correlation technique, and in step 2 only retrieval results whose correlation is higher than a preset correlation threshold are recorded as the quotation sources of the corresponding text block.
4. The automatic indexing method of quotations according to claim 3, characterized in that: the start position of the text block in the document is recorded; when the search engine returns a retrieval result whose correlation with the characteristic expression string submitted in step 1 is higher than a preset matching threshold, quotation marks are added to the text block in the submitted document according to the start position and ending position of the text block in the document.
5. The automatic indexing method of quotations according to claim 1, 2, 3, or 4, characterized in that: before step 3 is executed, three logical jumps are offered to the user through a man-machine interface, namely marking the text block, modifying the text block, and deleting the text block; when the user chooses to mark the text block, step 3 is allowed to execute.
6. The automatic indexing method of quotations according to claim 5, characterized in that: in step 2, the start position of the text block in the document is recorded; when the user chooses to modify the text block, step 3 is not allowed to execute; according to the start position and ending position of the text block in the document, the text block is highlighted in the submitted document for the user to modify, and after the user saves the modification the method returns to step 1 and performs automatic indexing again based on the modified text block.
7. The automatic indexing method of quotations according to claim 5, characterized in that: in step 2, the start position of the text block in the document is recorded; when the user chooses to delete the text block, step 3 is not allowed to execute, and the text block is automatically deleted from the submitted document according to the start position and ending position of the text block in the document.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100617119A CN101539904B (en) 2009-04-21 2009-04-21 Automatic indexing method of quotations

Publications (2)

Publication Number Publication Date
CN101539904A CN101539904A (en) 2009-09-23
CN101539904B true CN101539904B (en) 2012-05-30

Family

ID=41123095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100617119A Expired - Fee Related CN101539904B (en) 2009-04-21 2009-04-21 Automatic indexing method of quotations

Country Status (1)

Country Link
CN (1) CN101539904B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110107194A1 (en) * 2009-11-05 2011-05-05 Xu Cui System, methods, and user interface for conveniently creating citations in a document
CN104063368B (en) * 2010-12-01 2018-09-04 百度在线网络技术(北京)有限公司 Mark citation methods of exhibiting and device when online editing
CN102033864B (en) * 2010-12-01 2014-07-09 百度在线网络技术(北京)有限公司 Method and device for displaying quotation marks in on-line editing process
CN102033962B (en) * 2010-12-31 2012-05-30 中国传媒大学 File data replication method for quick deduplication
CN102831134B (en) * 2011-12-16 2015-02-25 中国科学技术信息研究所 Novel semi-automatic indexing method of Chinese scientific and technical documents
CN103064892B (en) * 2012-12-13 2016-11-16 北京海量融通软件技术有限公司 A kind of network patch literary composition indexing system and indexing method
CN104050158B (en) * 2014-06-27 2017-05-17 吴涛军 Automatic quotation extraction method and device with semantic integrity kept
CN106126497A (en) * 2016-06-21 2016-11-16 同方知网数字出版技术股份有限公司 A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
CN105930546B (en) * 2016-07-08 2020-04-03 北京北大英华科技有限公司 File association display method
CN106708934A (en) * 2016-11-16 2017-05-24 百度在线网络技术(北京)有限公司 Artificial intelligence-based academic literature search method and apparatus
CN109117435B (en) * 2017-06-22 2021-07-27 索意互动(北京)信息技术有限公司 Client, server, retrieval method and system thereof
CN107562932A (en) * 2017-09-18 2018-01-09 西华大学 The academic reference of books data in literature acquisition method of Chinese
CN109241364A (en) * 2018-07-13 2019-01-18 广州神马移动信息科技有限公司 Generation method, device and the equipment/terminal/server of reference information
CN109325093A (en) * 2018-08-24 2019-02-12 深圳职业技术学院 Bibliography automatic generation method, device and computer-readable storage medium
CN111460765B (en) * 2020-03-30 2020-12-29 掌阅科技股份有限公司 Electronic book labeling processing method, electronic equipment and storage medium
CN114091456B (en) * 2022-01-20 2022-04-15 京华信息科技股份有限公司 Intelligent positioning method and system for quotation contents

Also Published As

Publication number Publication date
CN101539904A (en) 2009-09-23

Similar Documents

Publication Publication Date Title
CN101539904B (en) Automatic indexing method of quotations
CN109992645B (en) Data management system and method based on text data
US20210342404A1 (en) System and method for indexing electronic discovery data
US9460396B1 (en) Computer-implemented method and system for automated validity and/or invalidity claim charts with context associations
US20130046754A1 (en) Method and system to formulate intellectual property search and to organize results of intellectual property search
US20130018805A1 (en) Method and system for linking information regarding intellectual property, items of trade, and technical, legal or interpretive analysis
US20050119995A1 (en) Apparatus for and method of searching and organizing intellectual property information utilizing an IP thesaurus
CA2807494C (en) Method and system for integrating web-based systems with local document processing applications
US11860914B1 (en) Natural language database generation and query system
US20120066580A1 (en) System for extracting relevant data from an intellectual property database
CN113190687B (en) Knowledge graph determining method and device, computer equipment and storage medium
CN101432733A (en) Augmenting the contents of an electronic document with data retrieved from a search
CN112926299B (en) Text comparison method, contract review method and auditing system
TWI682286B (en) System for document searching using results of text analysis and natural language input
CN114064851A (en) Multi-machine retrieval method and system for government office documents
JP4469432B2 (en) INTERNET INFORMATION PROCESSING DEVICE, INTERNET INFORMATION PROCESSING METHOD, AND COMPUTER-READABLE RECORDING MEDIUM CONTAINING PROGRAM FOR CAUSING COMPUTER TO EXECUTE THE METHOD
CN115146030A (en) Official document writing method and system based on knowledge graph
JP2000231569A (en) Internet information retrieving device, internet information retrieving method and computer readable recording medium with program making computer execute method recorded therein
Varvel Jr et al. Google Digital Humanities Awards recipient interviews report
Veena et al. A Personalized and Scalable Machine Learning-Based File Management System
TW202145029A (en) Legal information displaying system and displaying method with convenient functions
Raghallaigh et al. Ainm. ie: Breathing New Life into a Canonical Collection of Irish-language Biographies.
Aluthman Technology-Based Platforms in the Translation Workflow: An Investigation of the Use of CAT Tools among Saudi Professional Translators
Cooper et al. Extracting database information from e-mail messages
CN117313676A (en) Text data cleaning method, system, device and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120530

Termination date: 20130421