CN103377187A - Method, device and program for paragraph segmentation - Google Patents

Info

Publication number
CN103377187A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105481901A
Other languages
Chinese (zh)
Other versions
CN103377187B (en
Inventor
柿下容弓
服部英春
村上智一
今一修
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Application filed by Hitachi Ltd
Publication of CN103377187A
Application granted
Publication of CN103377187B
Legal status: Expired - Fee Related

Abstract

The invention provides a method, a device, and a program for paragraph segmentation. In the prior art, multiple paragraphs in one document that contain sentences similar in meaning, and therefore similar in feature quantity, cannot be segmented correctly. Under the control of a control unit of the paragraph segmentation device, a document entered through an input unit is divided into sentences by a sentence segmentation unit. A feature calculation unit uses each sentence as a query to retrieve related documents from those prestored in a corpus database and generates a document vector for it. A similarity calculation unit finds the two document vectors with the highest similarity. When that similarity is at or above a predetermined threshold, a search query generation unit generates a query from the elements common to the two vectors, and the feature calculation unit generates a new document vector from that query. A feature update unit updates the feature quantity based on a reliability measure. Whenever a feature quantity is updated, the corresponding sentences are connected to form a paragraph.

Description

Paragraph segmentation method, device and program
Technical field
The present invention relates to the processing of electronic documents, and in particular to techniques for segmenting electronic documents into paragraphs.
Background art
In recent years, the digitization of documents and their organization into databases have advanced, and natural language processing has developed considerably along with them. For example, automatic summarization of documents and automatic keyword extraction for document retrieval have been studied extensively. In most cases, however, these techniques assume that the target document has already been divided into paragraphs, that is, into units of meaning corresponding to individual topics or content, or that it contains only a single paragraph. For a document containing multiple paragraphs, it is therefore effective to segment the paragraphs in advance. Known paragraph segmentation methods of this kind include the text segmentation methods described in Patent Document 1 and Patent Document 2.
However, with existing paragraph segmentation and text segmentation methods, it is difficult to segment paragraphs correctly when one document contains multiple paragraphs with sentences that are close in meaning, that is, sentences whose feature quantities are similar. As a result, automatic summarization of documents and automatic keyword extraction for document retrieval cannot be performed efficiently.
Patent Document 1: Japanese Patent Laid-Open No. 2009-15795
Patent Document 2: Japanese Patent Laid-Open No. 2004-145790
Summary of the invention
The present invention was made in view of the above problem, and its object is to provide a paragraph segmentation method, device, and program that effectively segment a document containing multiple paragraphs.
To achieve this object, the present invention provides a paragraph segmentation method in which a processing unit segments a document into paragraphs. The processing unit divides the document into sentences, uses each resulting sentence as a query to extract related documents from a plurality of prestored documents and generate a feature quantity, and, for two of the generated feature quantities whose similarity is at or above a predetermined threshold, updates the feature quantity using the elements common to those two feature quantities.
To achieve the same object, the present invention also provides a paragraph segmentation device that segments an input document into paragraphs. The device has a processing unit and a storage unit. The processing unit divides the document into sentences, uses each divided and stored sentence as a query to extract related documents from a plurality of documents prestored in the storage unit and generate a feature quantity, and, for two of the generated feature quantities whose similarity is at or above a predetermined threshold, updates the feature quantity using the elements common to those two feature quantities.
To achieve the same object, the present invention further provides a paragraph segmentation program executed by the processing unit of a paragraph segmentation device that has a processing unit and a storage unit and segments an input document into paragraphs. The program causes the processing unit to divide the document into sentences, to use each resulting sentence as a query to extract related documents from a plurality of documents prestored in the storage unit, to generate a feature quantity from the document vector of the extracted related documents, and, for two of the generated feature quantities whose similarity is at or above a predetermined threshold, to update the feature quantity using the elements common to those two feature quantities.
According to the present invention, paragraphs can be segmented correctly even when one document contains multiple paragraphs with sentences that are close in meaning, that is, sentences whose feature quantities are similar.
Brief description of the drawings
Fig. 1A is a diagram showing a functional configuration of the paragraph segmentation device of the first embodiment.
Fig. 1B is a diagram showing a hardware configuration of the paragraph segmentation device of the first embodiment.
Fig. 2 is a diagram showing an example of the operation of the paragraph segmentation program of the first embodiment.
Fig. 3 is a diagram showing how sentences are connected according to the similarity of document vectors in the first embodiment.
Fig. 4 is a diagram showing a functional configuration of the paragraph segmentation device of the second embodiment.
Fig. 5 is a diagram showing an example of the operation of the paragraph segmentation program of the second embodiment.
Fig. 6 is a diagram for explaining an example of the document vector used in the embodiments.
Fig. 7 is a diagram for explaining an example of the word vector used in the embodiments.
Description of reference numerals
11 CPU
12 storage unit
13 input/output unit
14 communication unit
100, 400 paragraph segmentation device
101, 401 control unit
102, 402 input unit
103, 403 sentence segmentation unit
104, 404 feature calculation unit
105, 405 similarity calculation unit
106, 406 search query generation unit
107, 407 feature update unit
108, 408 paragraph update unit
109, 409 output unit
110, 410 sentence storage unit
111, 411 corpus unit
112, 412 feature storage unit
113, 413 paragraph storage unit
114, 414 morphological analysis unit
Embodiments
Embodiments of the present invention are described below with reference to the drawings, but the invention is not limited to the embodiments described. In this specification, "file" and "document" have the same meaning. A "paragraph" is a unit in which the meaning of a topic or piece of content is grouped together. A document vector is a vector whose dimensions are the stored documents, and a word vector is a vector whose dimensions are all the words that appear in all the documents. The "feature quantity" of a sentence expresses the meaning of the sentence quantitatively; document vectors and word vectors are described as examples of it.
(Embodiment 1)
The first embodiment is a paragraph segmentation method, device, and program that use document vectors for the similarity calculation and word vectors for the retrieval of similar documents. In this embodiment, a document vector is a vector whose dimensions are all the documents contained in the corpus unit of the segmentation device.
Before this embodiment is described in detail, examples of the document vector and the word vector are explained.
Fig. 6 shows an example of a document vector. In Fig. 6, the total number of documents contained in the corpus unit is assumed to be 10. When documents 1, 3, 4, and 8 are obtained as the search result, the document vector can be expressed as document vector 601 in (a) of the figure. Likewise, when search scores are obtained as the search result, the document vector can be expressed using those scores, as document vector 602 in (b) of the figure.
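The Fig. 6 example can be reproduced in a few lines. The following is a minimal sketch, not the patent's implementation: the function name `document_vector`, the 1-based document numbering, and the score values are all assumptions made for illustration (the figure itself is not reproduced in this text).

```python
def document_vector(num_docs, hits, scores=None):
    """Return a vector with one dimension per corpus document."""
    vec = [0.0] * num_docs
    for rank, doc_id in enumerate(hits):
        # Assumption: document numbers are 1-based, as the prose example suggests.
        vec[doc_id - 1] = 1.0 if scores is None else scores[rank]
    return vec

# Binary form, corresponding to the kind of vector described for Fig. 6 (a).
binary_vec = document_vector(10, [1, 3, 4, 8])

# Score-weighted form, as described for Fig. 6 (b); the scores are made up.
scored_vec = document_vector(10, [1, 3, 4, 8], scores=[0.9, 0.7, 0.4, 0.2])
```

The binary form marks only which documents were retrieved, while the weighted form also preserves how strongly each document matched.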
Fig. 7 shows an example of a word vector. A word vector is a vector whose dimensions are all the words that appear in all the documents; the word vector of Fig. 7 illustrates the case where 10 distinct words appear across all the documents. When a given document contains words 3, 6, 7, and 8 with occurrence frequencies of 1, 5, 3, and 9 respectively, substituting each frequency into the corresponding element yields word vector 701 shown in the figure.
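The word vector of this example can be built the same way. This is a small sketch under the assumption that word numbers are 1-based; `word_vector` is a hypothetical helper name, not something defined in the patent.

```python
def word_vector(vocab_size, counts):
    """Build a frequency vector with one dimension per vocabulary word."""
    vec = [0] * vocab_size
    for word_id, freq in counts.items():
        vec[word_id - 1] = freq  # assumption: word numbers start at 1
    return vec

# Words 3, 6, 7, 8 occur 1, 5, 3, 9 times in a 10-word vocabulary.
vec_701 = word_vector(10, {3: 1, 6: 5, 7: 3, 8: 9})
```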
Fig. 1A shows an example of the functional modules of the paragraph segmentation device of Embodiment 1. Fig. 1B shows an example of a hardware configuration that realizes the device. The hardware configuration of Fig. 1B is a computer composed of a central processing unit (CPU) 11 serving as an ordinary processing unit; a storage unit 12 such as memory, RAM, ROM, a hard disk drive (HDD), or another storage device; an input/output unit 13; and a communication unit 14 serving as a network interface, all interconnected by an internal bus 15.
In Fig. 1A, the paragraph segmentation device 100 has a control unit 101, an input unit 102, a sentence segmentation unit 103, a feature calculation unit 104, a similarity calculation unit 105, a search query generation unit 106, a feature update unit 107, a paragraph update unit 108, an output unit 109, a sentence storage unit 110, a corpus unit 111, a feature storage unit 112, a paragraph storage unit 113, and a morphological analysis unit 114. As a precondition, it is assumed that S_D files or documents, such as newspaper articles, are stored in the corpus unit 111.
Of these, the input unit 102 and the output unit 109 correspond to the input/output unit 13 and the communication unit 14, and the sentence storage unit 110, corpus unit 111, feature storage unit 112, and paragraph storage unit 113 correspond to the memory and storage devices of the storage unit 12. The remaining control unit 101, sentence segmentation unit 103, feature calculation unit 104, similarity calculation unit 105, search query generation unit 106, feature update unit 107, paragraph update unit 108, and morphological analysis unit 114 can be realized by the CPU 11 executing various programs, such as the operating system (OS), stored in a storage unit such as the ROM.
The operation of each functional module of the paragraph segmentation device of Embodiment 1 shown in Fig. 1A is described in turn.
First, the document to be segmented into paragraphs is input to the device from the input unit 102. The sentence segmentation unit 103, through execution of a program by the CPU 11 serving as the processing unit, divides the input document into sentences and stores the resulting sentences in the sentence storage unit 110.
Likewise, the feature calculation unit 104 uses each sentence read from the sentence storage unit 110 to obtain related documents from the corpus unit 111, converts the related documents obtained into a document vector, and stores it in the feature storage unit 112. That is, the feature calculation unit 104 generates a document vector like those illustrated in Fig. 6 by substituting values into the dimensions corresponding to the related documents obtained.
The search query generation unit 106 has the function of generating a search query and sending it to the control unit 101.
When the feature calculation unit 104 is given a search query via the control unit 101, it obtains the documents related to that search query from the sentence storage unit 110, converts the related documents obtained into a document vector, stores the vector in the feature storage unit 112 as a feature quantity, and outputs it to the feature update unit 107 via the control unit 101.
The similarity calculation unit 105 has the function of reading two document vectors from the feature storage unit 112, as specified by the control unit 101, and calculating the similarity of the two document vectors. The method of calculating the similarity in this embodiment is described later. The similarity calculation unit 105 also determines whether the calculated similarity is at or above a predetermined threshold.
The search query generation unit 106 reads two document vectors from the feature storage unit 112, as specified by the control unit 101, and extracts from the corpus unit 111 the group of documents common to the two document vectors. It generates a search query from the extracted common documents and outputs it to the control unit 101. The method of generating the search query is described later.
The feature update unit 107 reads two document vectors V_i and V_j from the feature storage unit 112, as specified by the control unit 101. In addition, a document vector V_k is input to the feature update unit 107 from the control unit 101. The unit calculates a reliability from the three input document vectors V_k, V_i, and V_j and corrects V_k based on the reliability. The reliability is described later. It then deletes V_i and V_j from the feature storage unit 112 and stores V_k in the feature storage unit 112.
The paragraph update unit 108 reads two sentences or paragraph candidates from the sentence storage unit 110 or the paragraph storage unit 113, as specified by the control unit 101. It deletes the sentences or paragraph candidates it has read from the sentence storage unit 110 or the paragraph storage unit 113, connects them, and stores the connected result in the paragraph storage unit 113 as a paragraph candidate.
The output unit 109 reads the sentences and paragraph candidates from the sentence storage unit 110 and the paragraph storage unit 113 respectively, determines whether each is an unknown paragraph, assigns labels to the paragraphs based on the determination result, and outputs them. Here, an unknown paragraph is a sentence or paragraph candidate for which it cannot be determined which paragraph it should be connected to. The method of determining unknown paragraphs is described later.
Fig. 2 is a flowchart showing the operation of the paragraph segmentation program executed in the paragraph segmentation device of this embodiment. An example of the operation of the program is described below using Fig. 2.
Here, the case where a document containing two paragraphs is input is described as an example. The input document may contain more than two paragraphs; the subsequent processing is the same in that case, so a two-paragraph document is used for the explanation.
The sentences contained in the first paragraph are defined as a_1, a_2, ..., a_N, and the sentences contained in the second paragraph as b_1, b_2, ..., b_M. Here, N is the number of sentences in the first paragraph (a natural number), and M is the number of sentences in the second paragraph (a natural number).
First, in step 201, a document is input from the input unit 102.
In step 202, the input document is divided into sentences by the sentence segmentation unit 103, and the sentences are stored in the sentence storage unit 110.
In step 203, all the sentences a_1, a_2, ..., a_N, b_1, b_2, ..., b_M stored in the sentence storage unit 110 are input to the feature calculation unit 104, and document vectors are obtained as described above. One way to compute the document vector is to use the cosine measure, which is one method of calculating the similarity of two vectors. The cosine measure of two vectors Q and P is calculated by Formula 1 below.
[Mathematical expression 1]
$$\frac{\sum_i q_i p_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i p_i^2}} \qquad (q_i \in Q,\ p_i \in P) \tag{Formula 1}$$
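Formula 1 translates directly into code. The sketch below assumes dense vectors of equal length; returning 0.0 for an all-zero vector is an assumption added here to avoid division by zero, not something the source specifies.

```python
import math

def cosine_measure(q, p):
    """Cosine measure of two equal-length vectors Q and P (Formula 1)."""
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    if norm_q == 0 or norm_p == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_q * norm_p)
```

Identical directions give a value of 1, and vectors with no shared nonzero dimension give 0.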
In this embodiment, as noted above, word vectors are used for retrieving similar documents. To that end, for each document stored in the corpus unit 111, a word vector W_i (0 <= i < S_D) is generated whose elements are the occurrence frequencies of the words the document contains. The input sentence is converted into a word vector in the same way, denoted W_current. The cosine measure between W_current and each W_i (0 <= i < S_D) is calculated, the top L documents (L being a predetermined natural number) are obtained in descending order of similarity, and these are converted into a document vector and stored in the feature storage unit 112.
The cosine measure is used here as an example of similarity calculation, but similarity may also be calculated with other measures. As for the value of each element of the document vector, as illustrated in Fig. 6 (a) and (b), the selected documents may be set to 1 and all others to 0, or some weighting may be applied, for example using the calculated similarity.
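Step 203 as described here (word-vectorize the sentence, rank corpus documents by cosine measure, keep the top L, and mark them in a document vector) might be sketched as follows. Several things are assumptions for illustration: whitespace tokenization stands in for morphological analysis, documents with zero similarity are left out even if they fall within the top L, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine measure between two sparse word vectors (word -> frequency)."""
    dot = sum(freq * b[word] for word, freq in a.items() if word in b)
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def sentence_to_document_vector(sentence, corpus, top_l, weighted=False):
    # Assumption: whitespace tokenization replaces morphological analysis.
    w_current = Counter(sentence.lower().split())
    sims = [cosine(w_current, Counter(doc.lower().split())) for doc in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:top_l]
    vec = [0.0] * len(corpus)
    for i in ranked:
        if sims[i] > 0:  # assumption: no word overlap means "not related"
            vec[i] = sims[i] if weighted else 1.0
    return vec

corpus = [
    "stock market prices fell",
    "the team won the match",
    "market prices rose again",
]
doc_vec = sentence_to_document_vector("market prices today", corpus, top_l=2)
```

Passing `weighted=True` stores the similarity itself in each element, which corresponds to the weighted variant mentioned in the text.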
Next, in step 204, the document vectors stored in the feature storage unit 112 are read out two at a time, and the similarity calculation unit 105 finds the pair of document vectors V_i, V_j with the highest similarity. The similarity here may be calculated with the cosine measure described above, or with, for example, the number of elements present in both document vectors, that is, the number of common elements.
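The pair search of step 204 can be sketched as an exhaustive comparison of all pairs. In this illustration the similarity is the count of dimensions that are nonzero in both vectors, one of the options the text mentions; the names `shared_count` and `most_similar_pair` and the toy vectors are assumptions.

```python
from itertools import combinations

def shared_count(u, v):
    """Number of dimensions that are nonzero in both vectors."""
    return sum(1 for x, y in zip(u, v) if x != 0 and y != 0)

def most_similar_pair(vectors, similarity=shared_count):
    """Return (score, id_i, id_j) for the most similar pair, or None."""
    best = None
    for (i, u), (j, v) in combinations(vectors.items(), 2):
        score = similarity(u, v)
        if best is None or score > best[0]:
            best = (score, i, j)
    return best

vectors = {
    "a1": [1, 1, 0, 0],
    "a2": [1, 1, 1, 0],
    "b1": [0, 0, 1, 1],
}
best_pair = most_similar_pair(vectors)
```

Any other two-argument similarity function, such as a cosine measure, can be passed in place of `shared_count`.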
In step 205, the similarity calculation unit 105 determines whether the maximum similarity calculated in step 204 is at or above a preset threshold. The threshold may be a preset fixed value, or it may be derived from the mean or variance of the similarities calculated in step 204.
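The text leaves open how a threshold would be derived from the mean or variance. One possible reading, shown purely as an assumption, is "mean plus k standard deviations" of the similarities computed in step 204; the function name and the parameter `k` are hypothetical.

```python
import statistics

def adaptive_threshold(similarities, k=1.0):
    """Assumed rule: mean of the similarities plus k population standard deviations."""
    mean = statistics.fmean(similarities)
    spread = statistics.pstdev(similarities)
    return mean + k * spread
```

With a fixed threshold this function is simply not needed; the comparison in step 205 is the same either way.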
Steps 206 and 207 are carried out by the search query generation unit 106. In step 206, when the maximum similarity calculated in step 204 is at or above the threshold, the elements common to the pair of document vectors V_i, V_j are extracted and taken as the common elements V_ij of the document vectors.
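Extracting the elements shared by the two vectors in step 206 amounts to collecting the dimensions that are nonzero in both. A minimal sketch, where representing V_ij as a list of 0-based dimension indices is an assumption of this example:

```python
def common_elements(v_i, v_j):
    """Indices of the dimensions (documents) present in both vectors."""
    return [idx for idx, (x, y) in enumerate(zip(v_i, v_j)) if x != 0 and y != 0]

v_ij = common_elements([1, 0, 1, 1, 0], [0, 0, 1, 1, 1])
```

For document vectors, each returned index identifies a corpus document that was retrieved for both sentences.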
In step 207, a search query is generated from the common elements V_ij obtained in step 206. One way to generate the search query is to use TFIDF, which is a kind of weighting for words. TF (Term Frequency) and IDF (Inverse Document Frequency) are each given by the formulas below, and TFIDF is obtained as the product of TF and IDF.
[Mathematical expression 2]
$$tf_i = \frac{n_i}{\sum_k n_k}, \qquad idf_i = \log\frac{|D|}{|\{d : t_i \in d\}|} \tag{Formula 2}$$
Here, n_i is the number of occurrences of word i in document d, |D| is the total number of documents, and |{d : t_i ∈ d}| is the number of documents that contain word t_i. In this embodiment, the total number of documents |D| corresponds to the number of all documents stored in the corpus unit 111.
Morphological analysis is performed on document d using the morphological analysis unit 114, S_W words are extracted in descending order of TFIDF, and these are used as the search query. Besides TFIDF, importance may be determined, for example, by occurrence frequency, the title of the document may be used as the query, or the search query may be generated by some other method.
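The TFIDF-based query generation can be sketched as below. Documents are assumed to be already tokenized into word lists (the patent uses morphological analysis for this), ties are broken alphabetically so the result is deterministic, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_query(doc_tokens, corpus_tokens, s_w):
    """Return the s_w words of one document with the highest TF*IDF."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    n_docs = len(corpus_tokens)
    scores = {}
    for word, n_i in counts.items():
        df = sum(1 for d in corpus_tokens if word in d)
        if df == 0:
            continue  # assumption: ignore words that no corpus document contains
        scores[word] = (n_i / total) * math.log(n_docs / df)
    # Descending TF*IDF; alphabetical order breaks ties (an added assumption).
    return sorted(scores, key=lambda w: (-scores[w], w))[:s_w]

corpus_tokens = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran"]]
query = tfidf_query(corpus_tokens[0], corpus_tokens, s_w=2)
```

Here "cat" appears in two of the three documents, so its IDF is lower than that of "sat" and "mat", and it drops out of the two-word query.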
In step 208, the search query generated in step 207 is input to the feature calculation unit 104 via the control unit 101, and the feature calculation unit 104 obtains a new document vector V'_ij.
Next, steps 209 and 210, which include calculating the reliability of the newly obtained document vector V'_ij, are carried out. These steps are executed by the feature update unit 107 shown in Fig. 1A. First, in step 209, the reliability of the document vector V'_ij obtained in step 208 is calculated, and the magnitude of the document vector is corrected according to the result.
In this embodiment, the reliability is an index that quantifies how many of the common elements V_ij are contained in the document vector V'_ij. One way to calculate it is to count how many of the common elements V_ij of the pair V_i, V_j are contained in V'_ij and divide that count by the number of elements of V_ij. When the elements of V_ij are weighted by importance, the reliability may be calculated according to those weights. In short, the reliability is fed back, for example by increasing or decreasing the magnitude of the obtained document vector V'_ij when the reliability is lower than a predetermined value.
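The counting-based reliability described above is straightforward to compute. How the vector magnitude is then adjusted is not specified in the text, so the scaling rule in `apply_reliability` (multiply by the reliability when it falls below a cutoff) and the cutoff value are assumptions made only to complete the sketch.

```python
def reliability(v_new, common_idx):
    """Fraction of the common elements V_ij that also appear in V'_ij."""
    if not common_idx:
        return 0.0
    hits = sum(1 for idx in common_idx if v_new[idx] != 0)
    return hits / len(common_idx)

def apply_reliability(v_new, common_idx, min_reliability=0.5):
    """Assumed correction: shrink the vector when reliability is below the cutoff."""
    r = reliability(v_new, common_idx)
    if r < min_reliability:
        return [x * r for x in v_new]
    return list(v_new)
```

A vector that recovers most of the common documents is left unchanged, while one that recovers few of them contributes less to later similarity calculations.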
In step 210, the document vectors V_i and V_j from which the common elements V_ij were generated are deleted from the feature storage unit 112, and the newly obtained document vector V'_ij is stored in the feature storage unit 112.
In step 211, the paragraph update unit 108 connects the two sentences or paragraphs corresponding to V_i and V_j. Sentences that have never been connected are stored in the sentence storage unit 110. When two sentences are connected, the sentences before connection are deleted from the sentence storage unit 110. When a paragraph candidate is connected with a sentence, or paragraph candidates are connected with each other, not only is the sentence before connection deleted, but the paragraph candidates before connection are also deleted from the paragraph storage unit 113. The connected sentences or paragraph candidates are stored in the paragraph storage unit 113 as a new paragraph candidate.
In the paragraph segmentation method and device of this embodiment, the target paragraphs are generated by repeating steps 204 through 211 of the flow in Fig. 2. When, in step 205, the maximum similarity of two document vectors falls below the predetermined threshold, paragraph generation ends and step 212 is executed.
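The loop over steps 204 to 211 behaves like a threshold-stopped agglomerative merge. The sketch below shows only that control flow. The re-query of steps 206 to 210 is abstracted into a caller-supplied `merge_features` function, and the demo stands in for it with an element-wise minimum that keeps only shared dimensions; that stand-in, the function names, and the toy vectors are all assumptions rather than the patent's procedure.

```python
from itertools import combinations

def segment(units, similarity, merge_features, threshold):
    """units: list of (sentence_indices, feature_vector) pairs."""
    units = list(units)
    while len(units) >= 2:
        # Step 204: find the most similar pair.
        i, j = max(combinations(range(len(units)), 2),
                   key=lambda p: similarity(units[p[0]][1], units[p[1]][1]))
        best = similarity(units[i][1], units[j][1])
        # Step 205: stop once the best similarity is below the threshold.
        if best < threshold:
            break
        # Steps 206-210 (abstracted): derive the merged feature quantity.
        merged_vec = merge_features(units[i][1], units[j][1])
        # Step 211: connect the two sentences or paragraph candidates.
        merged_ids = sorted(units[i][0] + units[j][0])
        units = [u for k, u in enumerate(units) if k not in (i, j)]
        units.append((merged_ids, merged_vec))
    return sorted(ids for ids, _ in units)

shared = lambda u, v: sum(1 for x, y in zip(u, v) if x and y)
keep_common = lambda u, v: [min(x, y) for x, y in zip(u, v)]

demo_units = [
    ([0], [1, 1, 1, 0, 0, 0]),
    ([1], [1, 1, 1, 1, 0, 0]),
    ([2], [0, 0, 0, 1, 1, 1]),
    ([3], [0, 0, 1, 1, 1, 1]),
]
paragraphs = segment(demo_units, shared, keep_common, threshold=2)
```

In the demo, sentences 0 and 1 merge first, then 2 and 3, after which the two candidates share no dimensions and the loop stops.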
Step 212 is carried out by the output unit 109; it determines unknown paragraphs and outputs the paragraphs. One example of how to determine an unknown paragraph is to examine the number of morphemes contained in a sentence or paragraph candidate. When a sentence or paragraph candidate contains few morphemes, the document vector sometimes cannot be generated properly, which makes connection difficult. Therefore, in step 212, when the number of morphemes contained in a remaining sentence or paragraph candidate is at or below a certain threshold, the output unit 109 labels it as an unknown paragraph and outputs it, and the processing flow ends.
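The morpheme-count check of step 212 can be sketched as a simple labeling pass. Whitespace-separated tokens stand in for morphemes here, and the label strings and function name are hypothetical; a real system would plug in a morphological analyzer as the `tokenize` argument.

```python
def label_units(units, min_morphemes, tokenize=str.split):
    """Label each remaining sentence or paragraph candidate for output."""
    labeled = []
    next_label = 1
    for text in units:
        # Units at or below the morpheme threshold are marked as unknown.
        if len(tokenize(text)) <= min_morphemes:
            labeled.append(("unknown", text))
        else:
            labeled.append((f"paragraph-{next_label}", text))
            next_label += 1
    return labeled

labeled = label_units(["Yes.", "The market fell sharply on Monday."], min_morphemes=2)
```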
Fig. 3 schematically shows an example of how sentences are connected according to the similarity of document vectors in this embodiment. The threshold in step 205 of Fig. 2 is assumed to be 10.
The first similarity calculation result is 301. In result 301, the highest similarity is 40, for the pair a_2 and a_3.
This pair therefore undergoes the processing from step 205 to step 211 of Fig. 2, and processing returns to step 204 of Fig. 2. The result of the connection is denoted a_23. Similarly, b_1 and b_2 are selected as the pair with the highest similarity in result 302, and a_1 and a_23 are selected in result 303, and the processing from step 205 to step 211 of Fig. 2 is carried out for each. Because the threshold is set to 10, no pair is selected in result 304, and paragraph generation is complete.
According to Embodiment 1 described in detail above, multiple paragraphs can be segmented correctly even when one document contains multiple paragraphs with sentences that are close in meaning, that is, sentences whose feature quantities are similar, which in turn enables automatic summarization of documents and automatic keyword extraction for document retrieval.
(Embodiment 2)
Embodiment 2 is a paragraph segmentation method, device, and program that use word vectors for the similarity calculation and also use word vectors for the retrieval of similar documents.
Fig. 4 is a functional block diagram of the paragraph segmentation device of Embodiment 2. As with the device of Fig. 1A of Embodiment 1, the device in this figure can of course be realized by a computer such as the one shown in Fig. 1B, so illustration of the hardware configuration is omitted here.
The input unit 402, sentence segmentation unit 403, paragraph update unit 408, output unit 409, sentence storage unit 410, feature storage unit 412, paragraph storage unit 413, and morphological analysis unit 414 are the same as the corresponding modules of Embodiment 1. Only the corpus unit 411, feature calculation unit 404, similarity calculation unit 405, search query generation unit 406, and feature update unit 407, which differ from Embodiment 1, are therefore described. In addition, the morphological analysis unit 414 is connected to the feature calculation unit 404.
The corpus unit 411 uses, for example, a collection of documents such as newspaper articles, a thesaurus (synonym dictionary), or both.
The feature calculation unit 404 performs morphological analysis, using the morphological analysis unit 414, on each sentence read from the sentence storage unit 410 and converts the sentence into a word vector. When the word vector has too few elements, it is effective to increase the number of elements using the corpus unit 411. For example, when a thesaurus is used as the corpus, each word obtained from the input sentence is used as a query to retrieve near-synonyms, and the near-synonyms obtained are appended to the word vector. When a collection of documents is used as the corpus, the word vectors of several documents extracted from the corpus may be appended to the word vector obtained from the input sentence.
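The thesaurus-based expansion of a sparse word vector can be sketched as follows. The thesaurus is modeled as a plain dictionary from a word to its near-synonyms, and giving each appended synonym a fixed weight is an assumption; the source does not say what value appended elements receive.

```python
from collections import Counter

def expand_with_synonyms(tokens, thesaurus, synonym_weight=1):
    """Append near-synonyms of each input word to the word vector."""
    vec = Counter(tokens)
    for word in tokens:
        for syn in thesaurus.get(word, ()):
            if syn not in vec:
                vec[syn] = synonym_weight  # assumption: fixed weight for added words
    return vec

expanded = expand_with_synonyms(
    ["car", "fast"],
    {"car": ["automobile", "vehicle"]},  # toy thesaurus for illustration
)
```

Words already present in the sentence keep their original frequency; only missing synonyms are added.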
As another example of appending elements to the word vector, important words may be selected from the several documents mentioned above, using TFIDF or the like, and appended to the word vector. The method is not limited to these; words related to the sentence may be obtained and appended by other methods so that the word vector has enough elements. The resulting word vector is then stored in the feature storage unit 412. When a word vector is provided from the search query generation unit 406 to the feature calculation unit 404 via the control unit 401, the number of elements of the word vector is expanded by the same method, the vector is stored in the feature storage unit 412, and the word vector is output to the feature update unit 407 via the control unit 401.
The similarity calculation unit 405 of this embodiment reads two word vectors from the feature storage unit 412, as specified by the control unit 401, and calculates the similarity of the two word vectors. The similarity may be calculated with, for example, the cosine measure described above.
The search query generation unit 406 of this embodiment reads two word vectors from the feature storage unit 412, as specified by the control unit 401, and extracts from the corpus unit 411 the group of words common to the two word vectors. It generates a word vector from the extracted common words and outputs it to the feature calculation unit 404 via the control unit 401.
The feature update unit 407 reads two word vectors V_i and V_j from the feature storage unit 412, as specified by the control unit 401. In addition, one word vector V_k is input from the control unit 401. The unit calculates a reliability from the three input word vectors V_k, V_i, and V_j and corrects the magnitude of V_k based on the reliability. It then deletes V_i and V_j from the feature storage unit 412 and stores V_k in the feature storage unit 412.
Fig. 5 is a processing flowchart showing the operation of the program of Embodiment 2. Embodiment 1 uses document vectors for the similarity calculation, whereas Embodiment 2 uses word vectors as described above. Apart from this difference, the operation is the same as in Embodiment 1.
According to Embodiment 2, paragraphs can be segmented correctly even when one document contains multiple paragraphs with sentences that are close in meaning, that is, sentences whose feature quantities are similar.
The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments have been described in detail to make the invention easy to understand, and the invention is not necessarily limited to configurations having all of the described elements. The configuration of one embodiment may be added to the configuration of another embodiment. In addition, part of the configuration of each embodiment may be subject to addition, deletion, or replacement of other configurations.
Some or all of the configurations, functions, processing parts, processing means, and the like described above may be realized in hardware, for example by designing them as integrated circuits. The configurations and functions above have been described for the case where they are realized in software by executing programs that implement the respective functions. Information such as the programs, tables, and files that implement each function may be placed not only in memory but also in a storage device such as a hard disk or SSD (Solid State Drive), or on a storage medium such as an IC card, SD card, or DVD, and may also be downloaded and installed via a network or the like as required.

Claims (15)

1. A paragraph segmentation method in which a processing part segments a document into paragraphs, the paragraph segmentation method being characterized in that
the processing part:
segments the document into sentence units;
uses each segmented sentence as a query to extract associated documents from a plurality of pre-stored documents, and generates feature amounts; and
regenerates a feature amount using the elements common to two feature amounts, among the generated feature amounts, whose similarity is equal to or greater than a predetermined threshold.
2. The paragraph segmentation method according to claim 1, characterized in that
the processing part uses document vectors as the feature amounts.
3. The paragraph segmentation method according to claim 2, characterized in that
the processing part,
when the similarity of two document vectors V_i and V_j serving as the two feature amounts is equal to or greater than the predetermined threshold, selects the common elements V_ij of the two document vectors V_i and V_j and generates a retrieval query.
4. The paragraph segmentation method according to claim 3, characterized in that
the processing part obtains a new document vector V'_ij using the generated retrieval query.
5. The paragraph segmentation method according to claim 4, characterized in that
the processing part corrects the vector magnitude of the new document vector V'_ij according to the degree to which the new document vector V'_ij contains the elements of the common elements V_ij.
6. The paragraph segmentation method according to claim 4, characterized in that
the processing part joins the sentences or paragraph candidates corresponding to the new document vector V'_ij into a new paragraph candidate.
7. The paragraph segmentation method according to claim 1, characterized in that
the processing part uses word vectors as the feature amounts.
8. The paragraph segmentation method according to claim 7, characterized in that,
when the similarity of two word vectors V_i and V_j serving as the two feature amounts is equal to or greater than the predetermined threshold, the common elements V_ij of the two word vectors V_i and V_j are selected and a retrieval query is generated, and
a new word vector V'_ij is obtained using the generated retrieval query.
9. The paragraph segmentation method according to claim 8, characterized in that
the processing part corrects the vector magnitude of the new word vector V'_ij according to the degree to which the new word vector V'_ij contains the elements of the common elements V_ij.
10. The paragraph segmentation method according to claim 9, characterized in that
the processing part joins the sentences or paragraph candidates corresponding to the new word vector V'_ij into a new paragraph candidate.
11. A paragraph segmentation device that segments an input document into paragraphs, the paragraph segmentation device being characterized in that
it comprises a processing part and a storage part, and
the processing part:
segments the document into sentence units;
uses each segmented sentence as a query to extract associated documents from a plurality of documents pre-stored in the storage part, and generates feature amounts; and
regenerates a feature amount using the elements common to two feature amounts, among the generated feature amounts, whose similarity is equal to or greater than a predetermined threshold.
12. The paragraph segmentation device according to claim 11, characterized in that
the processing part uses document vectors or word vectors based on the associated documents as the feature amounts.
13. The paragraph segmentation device according to claim 12, characterized in that
the processing part:
when the similarity of two document vectors or word vectors V_i and V_j serving as the two feature amounts is equal to or greater than the predetermined threshold, selects the common elements V_ij of the two document vectors or word vectors V_i and V_j and generates a retrieval query;
obtains a new document vector or word vector V'_ij using the generated retrieval query; and
corrects the vector magnitude of the new document vector or word vector V'_ij according to the degree to which the new document vector or word vector V'_ij contains the elements of the common elements V_ij.
14. The paragraph segmentation device according to claim 13, characterized in that
the processing part joins the sentences or paragraph candidates corresponding to the new document vector V'_ij, and stores the newly joined paragraph candidate in the storage part.
15. A paragraph segmentation program executed by the processing part of a paragraph segmentation device that comprises a processing part and a storage part and segments an input document into paragraphs, the paragraph segmentation program being characterized in that
it causes the processing part to perform the following operations:
segmenting the document into sentence units;
using each segmented sentence as a query to extract associated documents from a plurality of documents pre-stored in the storage part;
generating feature amounts from the extracted associated documents; and
regenerating a feature amount using the elements common to two feature amounts, among the generated feature amounts, whose similarity is equal to or greater than a predetermined threshold.
CN201210548190.1A 2012-04-19 2012-12-17 Paragraph segmentation and paragraph segmentation device Expired - Fee Related CN103377187B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012095344A JP5869948B2 (en) 2012-04-19 2012-04-19 Passage dividing method, apparatus, and program
JP2012-095344 2012-04-19

Publications (2)

Publication Number Publication Date
CN103377187A true CN103377187A (en) 2013-10-30
CN103377187B CN103377187B (en) 2016-09-28

Family

ID=49462320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210548190.1A Expired - Fee Related CN103377187B (en) 2012-04-19 2012-12-17 Paragraph segmentation and paragraph segmentation device

Country Status (2)

Country Link
JP (1) JP5869948B2 (en)
CN (1) CN103377187B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948518A (en) * 2019-03-18 2019-06-28 武汉汉王大数据技术有限公司 A kind of method of PDF document content text paragraph polymerization neural network based

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649762A (en) * 2016-12-27 2017-05-10 竹间智能科技(上海)有限公司 Intention recognition method and system based on inquiry question and feedback information
JP6543283B2 (en) * 2017-02-03 2019-07-10 日本電信電話株式会社 Passage type question answering device, method and program
CN108009151B (en) * 2017-11-29 2021-04-16 深圳中泓在线股份有限公司 News text automatic segmentation method and device, server and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1447261A (en) * 2002-03-27 2003-10-08 精工爱普生株式会社 Specific factor, generation of alphabetic string and device and method of similarity calculation
US20040093557A1 (en) * 2002-11-08 2004-05-13 Takahiko Kawatani Evaluating commonality of documents
JP2004145790A (en) * 2002-10-28 2004-05-20 Advanced Telecommunication Research Institute International Segmentation method of document and computer program therefor
CN101231634A (en) * 2007-12-29 2008-07-30 中国科学院计算技术研究所 Autoabstract method for multi-document
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
CN102004724A (en) * 2010-12-23 2011-04-06 哈尔滨工业大学 Document paragraph segmenting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
傅间莲等: "基于连续段落相似度的主题划分算法", 《计算机应用》, vol. 25, no. 9, 30 September 2005 (2005-09-30), pages 2022 - 2024 *

Also Published As

Publication number Publication date
JP5869948B2 (en) 2016-02-24
CN103377187B (en) 2016-09-28
JP2013222418A (en) 2013-10-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160928

Termination date: 20211217