CN101539904A - Automatic indexing method of quotations - Google Patents

Info

Publication number
CN101539904A
Authority
CN
China
Prior art keywords
document
text block
quoted passage
text
feature words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910061711A
Other languages
Chinese (zh)
Other versions
CN101539904B (en)
Inventor
沈阳
沈劲枝
田晨耕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN2009100617119A priority Critical patent/CN101539904B/en
Publication of CN101539904A publication Critical patent/CN101539904A/en
Application granted granted Critical
Publication of CN101539904B publication Critical patent/CN101539904B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides an automatic indexing method for quotations, comprising the following steps. Step 1: cut a submitted document into text blocks, extract feature strings or information fingerprints from the text blocks, and submit the feature strings or information fingerprints to a search engine. Step 2: when the search engine returns search results corresponding to a submitted feature string or information fingerprint, record the search results as the quotation source of the corresponding text block, record the end position of the text block in the document, and record the association between the quotation source and that end position. Step 3: combine the existing citation index entries in the submitted document with the search results, remove duplicate quotation sources, sort all quotation sources by their order of position in the submitted document, and then index them. The method overcomes the extremely low efficiency of existing manual methods and improves indexing speed and accuracy.

Description

An automatic indexing method for quotations
Technical field
The invention belongs to the field of computer tools, and in particular relates to an automatic indexing method for quotations.
Background art
Indexing takes two forms: reference lists, and footnotes or endnotes. A reference is a citation of, or reference to, a work or paper as a whole made in the course of academic research, and is generally listed at the end of the article. Footnotes and endnotes are supplementary notes to the text. A footnote is generally placed at the bottom of the page and can serve as a note on some content in the document; an endnote is generally placed at the end of the document and lists the sources of quotations and the like. References, footnotes, and endnotes all consist of two associated parts: one is the citation mark, and the other is the corresponding explanatory text or source description, which the present invention refers to simply as the quotation source. Common forms of citation mark include *, [1], and a superscript [1]. Common forms of the corresponding explanatory text or source description include "[18] Same book as cited in note 4, p. 153." or "[3] Heider, E.R. & D.C. Oliver. The structure of color space in naming and memory of two languages [J]. Foreign Language Teaching and Research, 1999, (3): 62-67."
When writing textbooks or papers, people often copy and paste written materials into their own work. By the time the draft is finished, the source of the material has been lost and the quotation can no longer be indexed, so that plagiarism regrettably occurs in fact even though the author had no intention to plagiarize.
At present, software such as EndNote and NoteExpress helps users organize literature. When writing scientific papers, dissertations, monographs, or reports, users can conveniently insert in-text notes at specified positions and then automatically generate a reference list in the format required by a given journal. This approach makes it very convenient to insert or modify references the user already knows, but it does not solve the problem that a quotation cannot be indexed once the source of the material has been lost.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing a method for automatically indexing quotations, replacing inefficient manual retrieval and annotation.
The technical solution of the invention comprises the following steps.
Step 1: cut the submitted document into text blocks, and extract a feature string or an information fingerprint from each text block; then submit the feature string or information fingerprint to a search engine.
Step 2: for each submitted feature string or information fingerprint, when the search engine returns the corresponding search results, record the search results as the quotation source of the corresponding text block, record the end position of the text block in the document, and record the association between the quotation source of the text block and the end position.
Step 3: combine the existing citation index entries in the submitted document with the search results, remove duplicate quotation sources, sort all quotation sources by their order of position in the submitted document, and then index them.
Combining the existing citation index entries in the submitted document with the search results to remove duplicate quotation sources is implemented as follows.
The information about the existing citation index entries is extracted from the submitted document and compared with the information about the search results obtained in step 2. The information about the existing citation index entries comprises the citation mark position, the quotation source, and the association between the citation mark position and the quotation source. The information about the search results comprises the end position of the text block in the document, the quotation source of the text block, and the association between that quotation source and the end position.
When a quotation source is duplicated, the citation mark position of the existing citation index entry, or the end position of the text block in the document, that corresponds to each occurrence of the quotation source is found from the association between citation mark position and quotation source, or from the association between the text block's quotation source and its end position. The occurrence whose position is earliest in the submitted document is retained, and the other duplicates are removed.
Sorting all quotation sources by their order of position in the submitted document and then indexing them is implemented as follows.
Citation marks are added, in sorted order, at the citation mark positions of the existing citation index entries or at the end positions of the text blocks in the document, and the quotation sources are added to the submitted document according to the association between citation mark position and quotation source, or between the text block's quotation source and its end position.
Moreover, when an information fingerprint is submitted to the search engine in step 1, the information fingerprint is retrieved using rigid (exact) string matching, and in step 2 the search results that match the information fingerprint are recorded as the quotation source of the corresponding text block.
Moreover, when a feature string is submitted to the search engine in step 1, the feature string is retrieved using flexible (approximate) string matching or a string information-relevance technique, and in step 2 only search results whose relevance exceeds a preset relevance threshold are recorded as the quotation source of the corresponding text block.
Moreover, the start position of the text block in the document is recorded. When the search engine returns a search result whose relevance to the feature string submitted in step 1 exceeds a preset match threshold, quotation marks are added around the text block in the submitted document according to the start and end positions of the text block in the document.
Moreover, when more than one feature string is extracted from the text blocks in step 1, either step 2 is executed in a loop over all feature strings and step 3 is executed once step 2 has finished for all of them, or steps 2 and 3 are executed for the feature strings one by one in order.
Moreover, after the search engine returns the search results corresponding to a feature string or information fingerprint, the proportion of the whole submitted document accounted for by the text blocks that currently have quotation sources is calculated and displayed to the user.
Moreover, before step 3 is executed, three logic jumps are offered to the user through a human-machine interface: annotate the text block, modify the text block, and delete the text block. When the user chooses to annotate the text block, step 3 is allowed to execute.
Moreover, in step 2 the start position of the text block in the document is recorded. When the user chooses to modify the text block, step 3 is not allowed to execute; the text block is highlighted in the submitted document according to its start and end positions so that the user can modify it, and after the user saves the modification the method returns to step 1 and performs automatic indexing again on the modified text block.
Moreover, in step 2 the start position of the text block in the document is recorded. When the user chooses to delete the text block, step 3 is not allowed to execute, and the text block is automatically deleted from the submitted document according to its start and end positions.
Moreover, after a logic jump has been taken and the corresponding processing has been carried out, the proportion of the whole submitted document accounted for by the text blocks that currently have quotation sources is recalculated and displayed to the user.
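The proportion displayed to the user can be illustrated with a minimal Python sketch. The patent does not say how the proportion is measured; counting characters and counting overlapping blocks only once are assumptions made here, and the function name `quoted_ratio` is hypothetical.

```python
from typing import Iterable, Tuple


def quoted_ratio(document_length: int, blocks: Iterable[Tuple[int, int]]) -> float:
    """Fraction of the document's characters covered by sourced text blocks.

    `blocks` holds (start, end) offsets, end exclusive, of every text block
    that currently has a quotation source. Measuring by character count is
    an assumption; the patent only asks for a proportion.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(blocks):
        start = max(start, last_end)  # skip the part already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / document_length if document_length else 0.0
```

After a block is modified or deleted, calling the function again on the updated block list gives the refreshed figure.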
Compared with the prior art, the invention has the following advantages:
1. The invention overcomes the weakness that existing manual methods are so inefficient as to be practically unworkable, and it largely remedies the limitation that quotation annotation systems such as EndNote can only work with known references or closed literature databases.
2. Through quotation deduplication and merging, the invention improves the uniqueness of quotation annotations, and improves indexing speed and accuracy while reducing the number of duplicate quotations.
3. Through flexible matching and information-relevance techniques, the invention can annotate accurately even when the quoted text has been modified. This solves the problem that, once users have modified quoted text, they can no longer find the source they originally quoted.
4. The invention targets the whole Internet together with literature resource databases, using a unified retrieval engine built from four basic engines: a meta search engine, a vertical search engine, an information-relevance (including information-similarity) detection engine over literature resource databases, and a deep web mining engine. This fundamentally solves the problem of covering a massive range of information sources during automatic indexing.
Description of drawings
Fig. 1 is a flowchart of an embodiment of the invention.
Embodiment
The automatic indexing method for quotations provided by the invention comprises the following steps, which in a concrete implementation can be run automatically using computer software:
Step 1: cut the submitted document into text blocks, and extract a feature string or an information fingerprint from each text block; then submit the feature string or information fingerprint to a search engine.
In a concrete implementation, an interactive window for importing a document or a document source can be provided, accepting either the text submitted by the user or a link to its source, and thereby determining the submitted document to be indexed. The user can directly submit a document file in a format such as Doc, Txt, or Docx, or can submit the address or content of an online document in an online office system, for example Google Doc.
Cutting the submitted document into text blocks and extracting feature strings or information fingerprints from the text blocks can use existing techniques; for example, the ROST anti-plagiarism system and Turnitin both provide a document cutting module and a text-block feature extraction module. When cutting, the user can customize the block size or the cutting rule: blocks can be cut by character count, or by paragraph, sentence, or some special symbol. When the document to be indexed is very small, each text block obtained by cutting can be used directly as a feature string, but more often each text block is cut into several feature strings. Steps 2 and 3 can then be executed for the feature strings one by one in order: each time a feature string is cut out of a text block, it is submitted to the search engine and steps 2 and 3 are executed, then the next feature string is submitted and steps 2 and 3 are executed again. The flow can also be simplified: all feature strings are submitted to the search engine in a loop with step 2 executed for each, and step 3 is executed only after step 2 has finished for all feature strings, so that all search results are combined at the end to obtain the quotation sources, which are then sorted and indexed. This approach is more efficient, because it avoids wasting resources on repeatedly removing duplicate quotation sources, sorting all quotation sources, and indexing.
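The cutting rules described above can be sketched in Python as follows. This is only an illustration of cutting by character count and by sentence while keeping each block's start and end positions; the names `TextBlock`, `cut_by_length`, and `cut_by_sentence` and the default terminator set are assumptions, not part of the patent or of any existing cutting module.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    start: int  # offset of the block's first character in the document
    end: int    # offset one past the block's last character
    text: str


def cut_by_length(document: str, size: int) -> List[TextBlock]:
    """Cut the document into consecutive blocks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        TextBlock(i, min(i + size, len(document)), document[i:i + size])
        for i in range(0, len(document), size)
    ]


def cut_by_sentence(document: str, terminators: str = ".!?。！？") -> List[TextBlock]:
    """Cut the document after each run of sentence terminators."""
    chars = re.escape(terminators)
    # One block = a run of non-terminator characters plus trailing terminators.
    pattern = re.compile(f"[^{chars}]+[{chars}]*")
    return [
        TextBlock(m.start(), m.end(), m.group())
        for m in pattern.finditer(document)
        if m.group().strip()  # ignore whitespace-only fragments
    ]
```

Recording both offsets at cutting time means later steps (adding citation marks, highlighting, deleting) can address a block without searching the document again.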
The embodiment of the invention adopts the simplified approach above. The document is cut into text blocks K_1, K_2, ..., K_N, and feature strings S_1, S_2, ..., S_N are extracted from these text blocks. All feature strings S_1, S_2, ..., S_N are then submitted in a loop to the query server. After all feature strings of the submitted document have been queried, step 2 is executed, yielding a quotation source record set {U_1, U_2, ..., U_N} associated with the start/end position record set {P_1, P_2, ..., P_N} of the text blocks containing the feature strings, together with a relevance record set {R_1, R_2, ..., R_N} between the two. If the feature strings extracted from two or more consecutive text blocks retrieve the same source, those consecutive text blocks can be merged into one new text block, and the end position of the merged block in the document (and in some cases also its start position) obtained, so a further consolidation step can be performed. In this step, the start/end position record set {P_1, P_2, ..., P_N} and the quotation source record set {U_1, U_2, ..., U_N} are compared: if an earlier text block and a later text block have the same quotation source and the end position of the earlier block is contiguous with the start position of the later block, the blocks are merged. The start position of the earlier block and the end position of the later block replace the corresponding records in {P_1, P_2, ..., P_N}, the merged block is associated with the shared quotation source, and the corresponding records in {U_1, U_2, ..., U_N} are updated. Finally, step 3 is carried out on the consolidated result.
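The consolidation step can be sketched in Python as below. The sketch assumes the position and source record sets are parallel lists in document order, that end offsets are exclusive so that "contiguous" means the earlier end equals the later start, and that a block with no source is stored as None; the function name `merge_adjacent` is hypothetical.

```python
from typing import List, Optional, Tuple


def merge_adjacent(
    positions: List[Tuple[int, int]],
    sources: List[Optional[str]],
) -> Tuple[List[Tuple[int, int]], List[Optional[str]]]:
    """Merge consecutive blocks that are contiguous and share a source.

    `positions` plays the role of the start/end position record set and
    `sources` the role of the quotation source record set.
    """
    merged_pos: List[Tuple[int, int]] = []
    merged_src: List[Optional[str]] = []
    for (start, end), src in zip(positions, sources):
        if (
            merged_pos
            and src is not None
            and merged_src[-1] == src
            and merged_pos[-1][1] == start  # earlier end meets later start
        ):
            # Keep the earlier start, take the later end.
            merged_pos[-1] = (merged_pos[-1][0], end)
        else:
            merged_pos.append((start, end))
            merged_src.append(src)
    return merged_pos, merged_src
```

Blocks that share a source but are separated by other text are left unmerged; removing that kind of duplicate is left to step 3.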
In a concrete implementation, the feature strings or information fingerprints can be submitted to a query server on the Internet or an intranet for retrieval. The query server can be the servers of several existing search engines, a server built specifically for quotation annotation, a literature content database server (such as the Wanfang Data query server), a forum or deep web database that supports forwarding of query tasks, or an encyclopedia, question-and-answer, or social network query server.
The invention suggests gathering data through four basic search engines: a meta search engine, a vertical search engine, an information-relevance detection engine over literature resource databases, and a deep web mining engine, so that the literature being quoted is covered on a massive scale. A meta search engine is an engine that calls other independent search engines; it integrates, invokes, controls, and optimizes the use of multiple independent search engines. A vertical search engine is a specialized search engine for a particular industry. It is a refinement and extension of the general search engine that integrates a particular class of information from the web page library, and can therefore run retrieval queries against specific websites or local files in a specific literature field. The information-relevance detection engine over literature resource databases has a detection range whose total number of documents reaches the scale of ten thousand items. The document types include: academic journals, doctoral dissertations, outstanding master's theses, reference books, important conference papers, yearbooks, monographs, newspapers, patents, standards, scientific and technological achievements, knowledge units, commentary databases, ancient books, and so on. As for the deep web mining engine: Dr. Jill Ellsworth first proposed the Deep Web in 1996, meaning the invisible Web or deep web resources. Ordinary search engines cannot find the information in it, yet its data volume is very large and it often has higher authority and quality. This is exactly the content users most like to quote. According to research by Gary Price, the amount of Deep Web content on the WWW is currently 2 to 50 times that of the Visible Web, and its quality is much higher. To trace the quotation source of a document, it is therefore very useful to build a deep web mining engine that can retrieve such Deep Web documents.
Before a feature string is submitted to the search engine, the retrieval query for the feature string can be set to use an existing flexible string matching technique, rigid string matching technique, or string information-relevance technique. The advantage of flexible matching and information-relevance techniques is that they can annotate quotations that have been partly modified, whereas rigid string matching is faster. With rigid string matching, the feature string is submitted to the query server and the query server retrieves only content identical to the submitted feature string. Flexible string matching can also retrieve related literature that differs somewhat from the submitted feature string, and gives a relevance score between the submitted feature string and the matched text. For example, if the submitted feature string is "automatic quotation annotation", then with flexible matching the query server can find not only "automatic quotation annotation" but also "manual quotation annotation", and reports the relevance between the two. In this way even partly modified quotations can be identified and annotated, increasing precision and recall. The string information-relevance technique is similar to flexible string matching: the search results include not only content identical to the feature string but also related literature that is relevant to the submitted feature string, and a relevance score is likewise given for each result. "Relevant" as used in the invention covers relevance of content as well as similarity of wording. Submitting an information fingerprint to the search engine is suited to rigid string matching; an information fingerprint is a hash value obtained by fingerprint extraction from a text block.
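The difference between rigid and flexible matching can be illustrated with a short Python sketch using only the standard library. The patent does not name a hash function or a similarity measure: SHA-256, the whitespace and case normalization, and the `difflib.SequenceMatcher` ratio used here are stand-in assumptions, and a real query server would apply its own techniques on its side.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(text: str) -> str:
    """Information fingerprint of a text block as a hex digest.

    Stripping whitespace and lowercasing before hashing is an assumption
    added here so that trivial layout differences do not change the hash.
    """
    normalized = "".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def rigid_match(query: str, candidate: str) -> bool:
    """Rigid matching: the two fingerprints must be exactly equal."""
    return fingerprint(query) == fingerprint(candidate)


def flexible_match(query: str, candidate: str) -> float:
    """Flexible matching: a relevance score in [0, 1], 1.0 for identical text."""
    return SequenceMatcher(None, query, candidate).ratio()
```

With these definitions, "automatic quotation annotation" and "manual quotation annotation" fail the rigid test but still receive a high flexible score, which is the behaviour the paragraph above describes for partly modified quotations.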
Step 2: for each submitted feature string or information fingerprint, when the search engine returns the corresponding search results, record the search results as the quotation source of the corresponding text block, record the end position of the text block in the document, and record the association between the quotation source of the text block and the end position.
The invention also provides a further technical solution. When a feature string is submitted to the search engine in step 1 and is retrieved using flexible string matching or a string information-relevance technique, in step 2 only search results whose relevance exceeds a preset relevance threshold A are recorded as the quotation source of the corresponding text block, that is, the text block containing the feature string; other search results with low relevance are excluded. This improves the accuracy of automatic indexing. For a submitted information fingerprint, the search engine returns search results whose hash value equals that of the information fingerprint; the search results are recorded as the quotation source of the text block, the end position in the document of the text block to which the hash value belongs is recorded, and the association between the quotation source of the text block and the end position is recorded. A quotation source as referred to in the invention can take the form of a detailed description of the source, for example the document title, author, date, and so on, or can simply take the form of a direct link to the source.
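Filtering by the relevance threshold A can be sketched as follows. The `Hit` record and the function name `filter_hits` are hypothetical, and returning the surviving results sorted by relevance is a convenience added here rather than something the patent specifies.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    source: str       # detailed source description or a link to the source
    relevance: float  # relevance reported for this search result


def filter_hits(hits: List[Hit], threshold_a: float) -> List[Hit]:
    """Keep only results whose relevance exceeds the preset threshold A."""
    kept = [h for h in hits if h.relevance > threshold_a]
    # Most relevant first (an assumption; the patent gives no ordering).
    return sorted(kept, key=lambda h: h.relevance, reverse=True)
```

An empty return value means the text block receives no quotation source and is left unannotated.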
By the conventions of quotation, when the quoted content is identical to the original text of the source, quotation marks (usually double quotation marks) are added around the quoted content in the document to indicate a direct quotation. The invention provides a corresponding automatic procedure: the start position of the text block in the document is recorded; when the search engine returns a search result whose relevance to the feature string submitted in step 1 exceeds a preset match threshold B, quotation marks are added around the text block in the submitted document according to its start and end positions. The preset match threshold B should be higher than the preset relevance threshold A. B can be set so that a text block receives quotation marks only when a search result fully matching the feature string is obtained.
When an information fingerprint is extracted from the text block in step 1 and submitted to the search engine, rigid string matching can be used to retrieve the information fingerprint, and in step 2 the search results that match the information fingerprint are recorded as the quotation source of the corresponding text block. Because information fingerprint retrieval generally matches exactly equal hash values, the returned search results should be identical to the quoted content of the text block. When a search result matching the information fingerprint is returned, quotation marks can therefore be added around the text block in the submitted document directly, according to its start and end positions.
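Adding quotation marks from the recorded start and end positions can be sketched in Python as below. The tuple layout (start, end, relevance), the exclusive end offset, the curly quote characters, and the function name `add_quotation_marks` are assumptions for illustration; a fingerprint match can be represented by giving the block a relevance of 1.0.

```python
from typing import List, Tuple


def add_quotation_marks(
    document: str,
    blocks: List[Tuple[int, int, float]],
    threshold_b: float,
    open_quote: str = "\u201c",
    close_quote: str = "\u201d",
) -> str:
    """Wrap every block whose relevance exceeds threshold B in quotation marks.

    Blocks are assumed not to overlap. They are processed from the end of
    the document backwards so that earlier offsets stay valid while text
    is being inserted.
    """
    result = document
    for start, end, relevance in sorted(blocks, reverse=True):
        if relevance > threshold_b:
            result = (
                result[:start] + open_quote + result[start:end]
                + close_quote + result[end:]
            )
    return result
```

Working from the last block to the first avoids having to shift the remaining offsets after each insertion.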
In the embodiment of the invention, after all feature strings S_1, S_2, ..., S_N extracted from the submitted document have been submitted in a loop to the query server, the quotation sources obtained are recorded in the quotation source record set {U_1, U_2, ..., U_N}, and the relevance between each search result and the text block containing the corresponding feature string is recorded in the relevance record set {R_1, R_2, ..., R_N}. The corresponding sources, text block start/end positions, and relevance values in the quotation source record set {U_1, U_2, ..., U_N}, the start/end position record set {P_1, P_2, ..., P_N}, and the relevance record set {R_1, R_2, ..., R_N} are associated with one another. Associating the relevance values supports subsequent processing. The start and end positions of the text block containing a feature string can be obtained during cutting and extraction in step 1. Besides being used to add quotation marks around a text block automatically, the start position can also help implement other operations. For example, before step 3 is executed, three logic jumps are offered to the user through a human-machine interface: annotate the text block, modify the text block, and delete the text block. Step 3 is allowed to execute only when the user chooses to annotate the text block. When the user chooses to modify the text block, step 3 is not allowed to execute; the text block is highlighted in the submitted document according to its start and end positions so that the user can modify it, and after the user saves the modification the method returns to step 1 and performs automatic indexing again on the modified text block. When the user chooses to delete the text block, step 3 is not allowed to execute, and the text block is automatically deleted from the submitted document according to its start and end positions.
These three logic jumps can also be taken automatically, without the user having to choose, by setting jump conditions. The invention suggests that in a concrete implementation a threshold, or the position of the text block, be set according to the user's needs as the criterion for the jump. By setting text block positions, the user can specify a particular treatment, for example automatically deleting a text block when a search result is obtained for a text block located at a particular position in the submitted document (such as the middle and later part). The embodiment of the invention realizes automatic logic jumps and the corresponding processing as follows:
By default, when neither the modify nor the delete jump is triggered, the embodiment jumps automatically to annotating the text block, that is, it proceeds to step 3. The end position of the text block is consolidated with the citation mark positions of the existing citation index entries in the submitted document (including the reference list at the end of the text and the literature marks of footnotes and endnotes), and, according to the quotation source record set {U_1, U_2, ..., U_N}, duplicate references other than the first occurrence are removed and the remainder re-sorted. At the same time, citation marks such as "[2]" or a superscript "[2]" are added at the end positions of the text blocks in the submitted document. The original references of the submitted document are modified, deleted, or backed up, and the quotation sources obtained by retrieval are inserted at the end of the document or at the corresponding footnote position. The end position of each text block is taken from the start/end position record set {P_1, P_2, ..., P_N}. In a concrete implementation, whether the preset match threshold B is satisfied can also be used as the jump condition for annotating a text block, so that the text block is given quotation marks and annotated only when the literature content in the search result is identical to the quoted content of the text block.
When the search engine returns a search result whose relevance to the feature string submitted in step 1 exceeds a preset threshold C, the embodiment jumps automatically to prompting the user to modify the text. The start position of the text block containing the queried feature string is taken from the start/end position record set {P_1, P_2, ..., P_N}, and the text block is located and highlighted in the document to be indexed so that the user can modify its content. After the user saves the modification, the method automatically returns to step 1 and retrieves the modified content again to check whether the similarity still exceeds the preset threshold C. If it does, the method can again jump to prompting the user to modify the text.
When the search engine returns a search result whose relevance to the feature string submitted in step 1 exceeds a preset threshold D, the position of the corresponding text block in the document is automatically taken from the start/end position record set {P_1, P_2, ..., P_N}, and the text block is located in the submitted document and either deleted directly from the original text or marked for deletion.
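The automatic jump logic can be summarized in a small Python sketch. The patent does not state how thresholds C and D relate to each other; the sketch assumes D is higher than C, so that the most relevant blocks are deleted and moderately relevant ones are offered for modification. The `Action` enum and `choose_action` function are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    ANNOTATE = "annotate"  # default: proceed to step 3
    MODIFY = "modify"      # highlight the block and prompt the user
    DELETE = "delete"      # remove the block or mark it for deletion


def choose_action(relevance: float, threshold_c: float, threshold_d: float) -> Action:
    """Pick the automatic jump for one text block from its relevance.

    Assumes threshold_d > threshold_c; that ordering is not stated in
    the source and is chosen here only for illustration.
    """
    if relevance > threshold_d:
        return Action.DELETE
    if relevance > threshold_c:
        return Action.MODIFY
    return Action.ANNOTATE
```

After a MODIFY jump the modified block would be sent through step 1 again and `choose_action` re-evaluated on the new relevance value.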
Step 3: repeated quotation sources are removed by combining the existing quotation index in the submitted document with the retrieval results; all quotation sources are then ordered according to their relative positions in the submitted document and indexed.
The removal of repeated quotation sources by combining the existing quotation index in the submitted document with the retrieval results is implemented as follows.
The related information of the existing quotation index is extracted from the submitted document and compared with the related information of the retrieval results obtained in step 2. The related information of the existing quotation index comprises the citation mark positions of the existing quotation index, the quotation sources, and the association between citation mark positions and quotation sources. The related information of the retrieval results comprises the end position of each text block in the document, the quotation source of each text block, and the association between the quotation source and the end position of the text block.
When a repeated quotation source occurs, the citation mark position of the existing quotation index, or the end position of the text block in the document, that corresponds to the quotation source is found from the association between citation mark position and quotation source, or from the association between the quotation source and the end position of the text block. The quotation source whose position is foremost in the submitted document is retained, and the other repeated quotation sources are removed.
After all quotation sources have been ordered according to their relative positions in the submitted document, the indexing is implemented as follows.
Citation marks are added in the document, in order, at the citation mark positions of the existing quotation index or at the end positions of the text blocks, and the quotation sources are added to the submitted document according to the association between citation mark position and quotation source, or the association between the quotation source and the end position of the text block.
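The deduplication and ordering in step 3 can be illustrated with a short sketch. It is not the patent's implementation: `Citation` and `index_citations` are hypothetical names, positions are assumed to be character offsets, and a quotation source is assumed to be identified by a plain string. The sketch also assumes that a later occurrence of a repeated source reuses the number assigned at its foremost position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    position: int  # citation mark position, or end position of a text block
    source: str    # identifier of the quotation source (assumed: title or URL)


def index_citations(existing: list[Citation], retrieved: list[Citation]):
    """Merge existing and retrieved citations, deduplicate, and number them.

    Returns (marks, references): marks is a list of (position, number)
    pairs in document order; references lists each unique source once,
    so that number n refers to references[n - 1].
    """
    # Order every citation by where it appears in the submitted document.
    ordered = sorted(existing + retrieved, key=lambda c: c.position)
    numbers: dict[str, int] = {}
    marks = []
    for citation in ordered:
        # Keep only the foremost occurrence of each source in the list;
        # a repeated source reuses the number it was first given.
        if citation.source not in numbers:
            numbers[citation.source] = len(numbers) + 1
        marks.append((citation.position, numbers[citation.source]))
    references = list(numbers)  # dicts preserve insertion order
    return marks, references
```

For example, with one existing citation of source "B" at offset 50 and retrieved citations of "A" at 20, "B" at 80 and "A" at 120, the reference list becomes ["A", "B"] and the marks are numbered 1, 2, 2, 1 in document order.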
In the embodiment of the invention, this step is implemented as follows. The existing quotation index of the submitted document is extracted, and its related information is recorded in the original quotation record set {Reference/Footnote/Annotation ...}. The related information comprises the citation mark positions of the existing quotation index, the quotation sources, and the association between citation mark positions and quotation sources. Because a citation mark is always placed at the end position of the text block it refers to, a citation mark position can be compared directly with the end position, in the document, of the text block containing a feature phrase string. Under commonly used annotation formats, the existing quotation index of a submitted document may be of three kinds: Reference, a reference list (quotation sources placed at the end of the article); Footnote, endnotes (quotation sources generally placed at the end of the article, before the reference list); and Annotation, footnotes (quotation sources placed at the bottom of the page). The quotation sources in the quotation source record set {U1, U2, ..., UN} whose relevance, as recorded in the relevance record set {P1, P2, ..., PN}, exceeds the preset relevance threshold A are compared with the records in the original quotation record set {Reference/Footnote/Annotation ...}; if repeated quotation sources exist, they are deduplicated, merged and reordered. The result is then marked up as references, endnotes or footnotes according to a chosen reference format (for example the national standard (GB) or a journal's house style).
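Because a citation mark sits at the end position of the text block it refers to, adding marks amounts to inserting short strings at recorded offsets. The following is a minimal sketch under stated assumptions: `insert_marks` is a hypothetical helper, positions are character offsets into a plain-text document, and marks take the bracketed form "[n]".

```python
def insert_marks(text: str, marks: list[tuple[int, int]]) -> str:
    """Insert "[n]" at each recorded end position.

    marks holds (end_position, number) pairs, where end_position is a
    character offset into the original, unmarked text.
    """
    # Work from the last offset to the first so that inserting a mark
    # never shifts an offset that has not been processed yet.
    for position, number in sorted(marks, reverse=True):
        text = text[:position] + f"[{number}]" + text[position:]
    return text
```

Processing back to front is the design choice that matters here: inserting in document order would require adjusting every later offset by the length of the marks already added.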
If, in a specific implementation, each feature phrase string is submitted to the search engine as soon as it is cut from a text block, with steps 2 and 3 executed before the next feature phrase string is cut out and submitted, then the related information of the "existing quotation index" covers not only the quotation index already present when the document was submitted but also the index entries added in earlier indexing rounds. For entries added in an earlier round, the citation mark position, the quotation source, and the association between them correspond respectively to the end position of the text block in the document, the quotation source of the text block, and the association between the quotation source and the end position that were recorded in step 2 of that round; in step 3 of that round they were recorded in the original quotation record set.
Some publication venues currently stipulate that quoted text may not exceed a certain percentage of the full text. The invention therefore provides a further technical scheme: the proportion of the submitted document accounted for by the text blocks corresponding to all current quotation sources is computed and displayed to the user. The current quotation sources include both the new quotation sources formed from the retrieval results and the quotation sources of the existing quotation index in the submitted document. The statistic may be computed and displayed after the search engine returns the retrieval results corresponding to a feature phrase string or information fingerprint, or after a logic jump has been made and the corresponding processing completed. Reporting the proportion promptly lets the user see in real time how much text is currently quoted or copied. In a specific implementation, the computed proportion may be cached in the computer's memory and presented to the user through a human-computer interface such as a display screen.
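The quoted-text proportion can be computed from the recorded start and end positions of the text blocks. The sketch below is illustrative only: `quoted_ratio` is a hypothetical name, positions are assumed to be character offsets with an exclusive end, and overlapping blocks are merged so that no character is counted twice (the text does not say how overlaps are handled).

```python
def quoted_ratio(doc_length: int, blocks: list[tuple[int, int]]) -> float:
    """Fraction of the document covered by quoted text blocks.

    blocks holds (start, end) character offsets, end exclusive.
    """
    if doc_length <= 0:
        return 0.0
    covered = 0
    current_start = current_end = None
    for start, end in sorted(blocks):
        if current_end is None or start > current_end:
            # A new, non-overlapping run begins: bank the previous one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # The block overlaps or touches the current run: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / doc_length
```

A tool could recompute this value after each retrieval result or logic jump and compare it with the venue's permitted percentage before displaying it.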

Claims (10)

1. An automatic indexing method of quotations, characterized by comprising the following steps:
Step 1: the submitted document is cut to obtain text blocks, and a feature phrase string or an information fingerprint is extracted from each text block; the feature phrase string or information fingerprint is then submitted to a search engine;
Step 2: for each submitted feature phrase string or information fingerprint, when the search engine returns the corresponding retrieval results, the retrieval results are recorded as the quotation source of the corresponding text block, and the end position of the text block in the document and the association between the quotation source and the end position of the text block are also recorded;
Step 3: repeated quotation sources are removed by combining the existing quotation index in the submitted document with the retrieval results; all quotation sources are then ordered according to their relative positions in the submitted document and indexed;
the removal of repeated quotation sources by combining the existing quotation index in the submitted document with the retrieval results is implemented as follows:
the related information of the existing quotation index is extracted from the submitted document and compared with the related information of the retrieval results obtained in step 2; the related information of the existing quotation index comprises the citation mark positions of the existing quotation index, the quotation sources, and the association between citation mark positions and quotation sources; the related information of the retrieval results comprises the end position of each text block in the document, the quotation source of each text block, and the association between the quotation source and the end position of the text block;
when a repeated quotation source occurs, the citation mark position of the existing quotation index, or the end position of the text block in the document, that corresponds to the quotation source is found from the association between citation mark position and quotation source, or from the association between the quotation source and the end position of the text block; the quotation source whose position is foremost in the submitted document is retained, and the other repeated quotation sources are removed;
after all quotation sources have been ordered according to their relative positions in the submitted document, the indexing is implemented as follows:
citation marks are added in the document, in order, at the citation mark positions of the existing quotation index or at the end positions of the text blocks, and the quotation sources are added to the submitted document according to the association between citation mark position and quotation source, or the association between the quotation source and the end position of the text block.
2. The automatic indexing method of quotations according to claim 1, characterized in that: when an information fingerprint is submitted to the search engine in step 1, rigid string matching is used to retrieve the information fingerprint, and in step 2 the retrieval results that match the information fingerprint are recorded as the quotation source of the corresponding text block.
3. The automatic indexing method of quotations according to claim 1, characterized in that: when a feature phrase string is submitted to the search engine in step 1, flexible string matching or string information correlation is used to retrieve the feature phrase string, and in step 2 only retrieval results whose relevance exceeds a preset relevance threshold are recorded as the quotation source of the corresponding text block.
4. The automatic indexing method of quotations according to claim 3, characterized in that: the start position of the text block in the document is recorded; when the search engine returns a retrieval result whose relevance to the feature phrase string submitted in step 1 exceeds a preset match threshold, quotation marks are added around the text block in the submitted document according to the start position and end position of the text block in the document.
5. The automatic indexing method of quotations according to claim 1, 3 or 4, characterized in that: when more than one feature phrase string is extracted from a text block in step 1,
either step 2 is executed in a loop over all feature phrase strings and step 3 is executed after step 2 has been completed for all of them; or steps 2 and 3 are executed in sequence for each feature phrase string, one at a time.
6. The automatic indexing method of quotations according to claim 1, 2, 3 or 4, characterized in that: after the search engine returns the retrieval results corresponding to a feature phrase string or information fingerprint, the proportion of the submitted document accounted for by the text blocks corresponding to all current quotation sources is computed and displayed to the user.
7. The automatic indexing method of quotations according to claim 1, 2, 3 or 4, characterized in that: before step 3 is executed, three logic jumps are offered to the user through a human-computer interface, namely marking the text block, modifying the text block and deleting the text block; when the user selects marking the text block, step 3 is allowed to execute.
8. The automatic indexing method of quotations according to claim 7, characterized in that: in step 2, the start position of the text block in the document is recorded; when the user selects modifying the text block, step 3 is not allowed to execute; according to the start position and end position of the text block in the document, the text block is highlighted in the submitted document for the user to modify, and after the user saves the modification the method returns to step 1 and performs automatic indexing again on the modified text block.
9. The automatic indexing method of quotations according to claim 7, characterized in that: in step 2, the start position of the text block in the document is recorded; when the user selects deleting the text block, step 3 is not allowed to execute, and the text block is deleted automatically from the submitted document according to its start position and end position in the document.
10. The automatic indexing method of quotations according to claim 7, characterized in that: after a logic jump has been made and the corresponding processing completed, the proportion of the submitted document accounted for by the text blocks corresponding to all current quotation sources is computed and displayed to the user.
CN2009100617119A 2009-04-21 2009-04-21 Automatic indexing method of quotations Expired - Fee Related CN101539904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100617119A CN101539904B (en) 2009-04-21 2009-04-21 Automatic indexing method of quotations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100617119A CN101539904B (en) 2009-04-21 2009-04-21 Automatic indexing method of quotations

Publications (2)

Publication Number Publication Date
CN101539904A true CN101539904A (en) 2009-09-23
CN101539904B CN101539904B (en) 2012-05-30

Family

ID=41123095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100617119A Expired - Fee Related CN101539904B (en) 2009-04-21 2009-04-21 Automatic indexing method of quotations

Country Status (1)

Country Link
CN (1) CN101539904B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033864A (en) * 2010-12-01 2011-04-27 百度在线网络技术(北京)有限公司 Method and device for displaying quotation marks in on-line editing process
CN102033962A (en) * 2010-12-31 2011-04-27 中国传媒大学 File data replication method for quick deduplication
CN102156690A (en) * 2009-11-05 2011-08-17 崔旭 System, methods, and user interface for conveniently creating citations in a document
CN102831134A (en) * 2011-12-16 2012-12-19 中国科学技术信息研究所 Novel semi-automatic indexing method of Chinese scientific and technical documents
CN103064892A (en) * 2012-12-13 2013-04-24 北京海量融通软件技术有限公司 Network post indexing system and method
CN104050158A (en) * 2014-06-27 2014-09-17 吴涛军 Automatic quotation extraction method and device with semantic integrity kept
CN104063368A (en) * 2010-12-01 2014-09-24 百度在线网络技术(北京)有限公司 Display method and device for reference marks in on-line edit
CN105930546A (en) * 2016-07-08 2016-09-07 北京北大英华科技有限公司 File association display method
CN106126497A (en) * 2016-06-21 2016-11-16 同方知网数字出版技术股份有限公司 A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
CN106708934A (en) * 2016-11-16 2017-05-24 百度在线网络技术(北京)有限公司 Artificial intelligence-based academic literature search method and apparatus
CN107562932A (en) * 2017-09-18 2018-01-09 西华大学 The academic reference of books data in literature acquisition method of Chinese
CN109117435A (en) * 2017-06-22 2019-01-01 索意互动(北京)信息技术有限公司 A kind of client, server, search method and its system
CN109241364A (en) * 2018-07-13 2019-01-18 广州神马移动信息科技有限公司 Generation method, device and the equipment/terminal/server of reference information
CN109325093A (en) * 2018-08-24 2019-02-12 深圳职业技术学院 Bibliography automatic generation method, device and computer-readable storage medium
CN111460765A (en) * 2020-03-30 2020-07-28 掌阅科技股份有限公司 Electronic book labeling processing method, electronic equipment and storage medium
CN114091456A (en) * 2022-01-20 2022-02-25 京华信息科技股份有限公司 Intelligent positioning method and system for quotation contents

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156690A (en) * 2009-11-05 2011-08-17 崔旭 System, methods, and user interface for conveniently creating citations in a document
CN102033864B (en) * 2010-12-01 2014-07-09 百度在线网络技术(北京)有限公司 Method and device for displaying quotation marks in on-line editing process
CN102033864A (en) * 2010-12-01 2011-04-27 百度在线网络技术(北京)有限公司 Method and device for displaying quotation marks in on-line editing process
CN104063368A (en) * 2010-12-01 2014-09-24 百度在线网络技术(北京)有限公司 Display method and device for reference marks in on-line edit
CN102033962B (en) * 2010-12-31 2012-05-30 中国传媒大学 File data replication method for quick deduplication
CN102033962A (en) * 2010-12-31 2011-04-27 中国传媒大学 File data replication method for quick deduplication
CN102831134A (en) * 2011-12-16 2012-12-19 中国科学技术信息研究所 Novel semi-automatic indexing method of Chinese scientific and technical documents
CN102831134B (en) * 2011-12-16 2015-02-25 中国科学技术信息研究所 Novel semi-automatic indexing method of Chinese scientific and technical documents
CN103064892A (en) * 2012-12-13 2013-04-24 北京海量融通软件技术有限公司 Network post indexing system and method
CN103064892B (en) * 2012-12-13 2016-11-16 北京海量融通软件技术有限公司 A kind of network patch literary composition indexing system and indexing method
CN104050158A (en) * 2014-06-27 2014-09-17 吴涛军 Automatic quotation extraction method and device with semantic integrity kept
CN104050158B (en) * 2014-06-27 2017-05-17 吴涛军 Automatic quotation extraction method and device with semantic integrity kept
CN106126497A (en) * 2016-06-21 2016-11-16 同方知网数字出版技术股份有限公司 A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
CN105930546A (en) * 2016-07-08 2016-09-07 北京北大英华科技有限公司 File association display method
CN105930546B (en) * 2016-07-08 2020-04-03 北京北大英华科技有限公司 File association display method
CN106708934A (en) * 2016-11-16 2017-05-24 百度在线网络技术(北京)有限公司 Artificial intelligence-based academic literature search method and apparatus
CN109117435A (en) * 2017-06-22 2019-01-01 索意互动(北京)信息技术有限公司 A kind of client, server, search method and its system
CN107562932A (en) * 2017-09-18 2018-01-09 西华大学 The academic reference of books data in literature acquisition method of Chinese
CN109241364A (en) * 2018-07-13 2019-01-18 广州神马移动信息科技有限公司 Generation method, device and the equipment/terminal/server of reference information
CN109325093A (en) * 2018-08-24 2019-02-12 深圳职业技术学院 Bibliography automatic generation method, device and computer-readable storage medium
CN111460765A (en) * 2020-03-30 2020-07-28 掌阅科技股份有限公司 Electronic book labeling processing method, electronic equipment and storage medium
CN114091456A (en) * 2022-01-20 2022-02-25 京华信息科技股份有限公司 Intelligent positioning method and system for quotation contents

Also Published As

Publication number Publication date
CN101539904B (en) 2012-05-30

Similar Documents

Publication Publication Date Title
CN101539904B (en) Automatic indexing method of quotations
CN109992645B (en) Data management system and method based on text data
US20210342404A1 (en) System and method for indexing electronic discovery data
CN108829858B (en) Data query method and device and computer readable storage medium
US7945600B1 (en) Techniques for organizing data to support efficient review and analysis
US9460396B1 (en) Computer-implemented method and system for automated validity and/or invalidity claim charts with context associations
US20130046754A1 (en) Method and system to formulate intellectual property search and to organize results of intellectual property search
US20130018805A1 (en) Method and system for linking information regarding intellectual property, items of trade, and technical, legal or interpretive analysis
US20020138465A1 (en) Apparatus for and method of searching and organizing intellectual property information utilizing a classification system
CA2807494C (en) Method and system for integrating web-based systems with local document processing applications
US20120066580A1 (en) System for extracting relevant data from an intellectual property database
CN113190687B (en) Knowledge graph determining method and device, computer equipment and storage medium
CN103324622A (en) Method and device for automatic generating of front page abstract
CN112926299B (en) Text comparison method, contract review method and auditing system
CN114064851A (en) Multi-machine retrieval method and system for government office documents
US11861321B1 (en) Systems and methods for structure discovery and structure-based analysis in natural language processing models
US20090083312A1 (en) Document composition system and method
JP2000231570A (en) Internet information processor, internet information processing method and computer readable recording medium with program making computer execute method recorded therein
JP2000250908A (en) Support device for production of electronic book
US11860914B1 (en) Natural language database generation and query system
CN109359023A (en) Based on the mobile application location of mistake method for submitting information
JP2004234582A (en) Dictionary construction method, system, and screen
Varvel Jr et al. Google Digital Humanities Awards recipient interviews report
Raghallaigh et al. Ainm. ie: Breathing New Life into a Canonical Collection of Irish-language Biographies.
TW202145029A (en) Legal information displaying system and displaying method with convenient functions

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120530

Termination date: 20130421