CN103377187A

CN103377187A - Method, device and program for paragraph segmentation

Info

Publication number: CN103377187A
Application number: CN2012105481901A
Authority: CN
Inventors: 柿下容弓; 服部英春; 村上智一; 今一修
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-04-19
Filing date: 2012-12-17
Publication date: 2013-10-30
Anticipated expiration: 2032-12-17
Also published as: JP5869948B2; CN103377187B; JP2013222418A

Abstract

The invention provides a method, a device and a program for paragraph segmentation. In the prior art, multiple paragraphs within one document cannot be segmented correctly when they contain sentences that are close in meaning, that is, sentences with similar feature quantities. Under the control of a control unit of the paragraph segmentation device, a document received through an input unit is split into sentence units by a sentence segmentation unit. A feature quantity calculation unit uses each sentence as a query to retrieve related documents from those prestored in a corpus database and generates a document vector from the result. A similarity calculation unit finds the two document vectors with the highest similarity. When that similarity is at or above a predetermined threshold, a search query generation unit generates a query from the elements common to the two document vectors, and the feature quantity calculation unit generates a new document vector based on that query. A feature quantity update unit updates the feature quantity according to a reliability measure. When the feature quantity is updated, the corresponding sentences are connected to form a paragraph.

Description

Paragraph dividing method, device and program

Technical field

The present invention relates to the processing of electronic documents, and in particular to a technique for segmenting electronic documents into paragraphs.

Background technology

In recent years, the digitization of documents and the building of document databases have advanced, and natural language processing has developed greatly along with them; for example, much research has been done on automatic summarization of documents and on automatic keyword extraction for document retrieval. In most cases, however, these techniques assume that the target document has already been divided into paragraphs, that is, into units of meaning corresponding to a topic or a block of content, or that it consists of a single paragraph. For a document that contains multiple paragraphs, it is therefore effective to segment it into paragraphs in advance. Known paragraph segmentation methods include the text segmentation methods described in Patent Document 1 and Patent Document 2.

However, with existing paragraph segmentation and text segmentation methods, it is difficult to segment paragraphs correctly when one document contains multiple paragraphs that include sentences close in meaning, that is, sentences whose feature quantities are similar. As a result, automatic summarization of documents and automatic keyword extraction for document retrieval cannot be performed efficiently.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2009-15795

Patent Document 2: Japanese Unexamined Patent Application Publication No. 2004-145790

Summary of the invention

The present invention was made in view of the above problem, and its object is to provide a paragraph segmentation method, device and program that effectively segment a document containing multiple paragraphs.

To achieve this object, the present invention provides a paragraph segmentation method in which a processing unit segments a document into paragraphs. The processing unit splits the document into sentence units, uses each resulting sentence as a query to extract related documents from a plurality of prestored documents and generate a feature quantity, and, for two of the generated feature quantities whose similarity is at or above a predetermined threshold, updates the feature quantity using the elements common to those two feature quantities.

In addition, to achieve this object, the present invention provides a paragraph segmentation device that segments an input document into paragraphs and comprises a processing unit and a storage unit. The processing unit splits the document into sentence units and stores them, uses each stored sentence as a query to extract related documents from a plurality of documents prestored in the storage unit and generate a feature quantity, and, for two of the generated feature quantities whose similarity is at or above a predetermined threshold, updates the feature quantity using the elements common to those feature quantities.

Furthermore, to achieve this object, the present invention provides a paragraph segmentation program executed by the processing unit of a paragraph segmentation device that comprises a processing unit and a storage unit and segments an input document into paragraphs. The program causes the processing unit to split the document into sentence units, use each resulting sentence as a query to extract related documents from a plurality of documents prestored in the storage unit, generate a feature quantity from the document vector of the extracted related documents, and, for two of the generated feature quantities whose similarity is at or above a predetermined threshold, update the feature quantity using the elements common to those feature quantities.

According to the present invention, paragraphs can be segmented correctly even when one document contains multiple paragraphs that include sentences close in meaning, that is, sentences with similar feature quantities.

Description of drawings

Fig. 1A shows the functional configuration of the paragraph segmentation device of the first embodiment.

Fig. 1B shows the hardware configuration of the paragraph segmentation device of the first embodiment.

Fig. 2 shows an example of the operation of the paragraph segmentation program of the first embodiment.

Fig. 3 shows how sentences are connected according to the similarity of document vectors in the first embodiment.

Fig. 4 shows the functional configuration of the paragraph segmentation device of the second embodiment.

Fig. 5 shows an example of the operation of the paragraph segmentation program of the second embodiment.

Fig. 6 illustrates an example of the document vector used in the embodiments.

Fig. 7 illustrates an example of the word vector used in the embodiments.

Description of reference numerals

11 CPU

12 storage unit

13 input/output unit

14 communication unit

100, 400 paragraph segmentation device

101, 401 control unit

102, 402 input unit

103, 403 sentence segmentation unit

104, 404 feature quantity calculation unit

105, 405 similarity calculation unit

106, 406 search query generation unit

107, 407 feature quantity update unit

108, 408 paragraph update unit

109, 409 output unit

110, 410 sentence storage unit

111, 411 corpus unit

112, 412 feature quantity storage unit

113, 413 paragraph storage unit

114, 414 morphological analysis unit

Embodiment

Embodiments of the present invention are described below with reference to the drawings; the invention is not limited to the embodiments described. In this specification, "file" and "document" have the same meaning. A "paragraph" is a unit in which a topic or a block of content is grouped by meaning. A document vector is a vector whose dimensions are the stored documents, and a word vector is a vector whose dimensions are all the words that appear in all the documents. The "feature quantity" of a sentence expresses the meaning of the sentence quantitatively; document vectors and word vectors are described here as examples.

(embodiment 1)

The first embodiment is a paragraph segmentation method, device and program that use document vectors for the similarity calculation and word vectors for similar-document retrieval. In this embodiment, a document vector is a vector whose dimensions are all the documents contained in the corpus unit of the segmentation device.

Before the embodiment is described in detail, examples of a document vector and a word vector are given.

Fig. 6 shows an example of a document vector. In Fig. 6, the total number of documents in the corpus unit is assumed to be 10. When documents 1, 3, 4 and 8 are obtained as the search result, the document vector can be expressed as document vector 601 in part (a) of the figure. Likewise, when search scores are obtained as the search result, the scores can be used to express the document vector as document vector 602 in part (b) of the figure.
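The Fig. 6 example can be sketched in a few lines of Python. This is only an illustration: the function name `document_vector`, the 1-based document numbering and the score values are assumptions made for the sketch and do not come from the figure.

```python
def document_vector(hit_ids, corpus_size, scores=None):
    """Build a vector with one dimension per corpus document.

    hit_ids are assumed to be 1-based document numbers.
    Without scores, retrieved documents get 1 and all others 0 (as in part (a));
    with scores, each retrieved document gets its search score (as in part (b)).
    """
    vec = [0.0] * corpus_size
    for pos, doc_id in enumerate(hit_ids):
        vec[doc_id - 1] = 1.0 if scores is None else float(scores[pos])
    return vec


# Documents 1, 3, 4 and 8 retrieved from a corpus of 10 documents.
binary = document_vector([1, 3, 4, 8], corpus_size=10)

# Hypothetical search scores, used only to illustrate the weighted form.
weighted = document_vector([1, 3, 4, 8], corpus_size=10,
                           scores=[0.9, 0.7, 0.4, 0.2])
```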

Fig. 7 shows an example of a word vector. A word vector is a vector whose dimensions are all the words that appear in all the documents; the word vector of Fig. 7 illustrates the case where 10 distinct words appear across all documents. When a given document contains words 3, 6, 7 and 8 with occurrence frequencies 1, 5, 3 and 9 respectively, substituting each frequency into the corresponding element yields the word vector 701 shown in the figure.
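The word vector of the Fig. 7 example can be built the same way. In this sketch the function name `word_vector` and the 1-based word numbering are assumptions.

```python
def word_vector(frequencies, vocabulary_size):
    """Build a vector with one dimension per vocabulary word.

    frequencies maps a 1-based word number (assumed) to its occurrence count.
    """
    vec = [0] * vocabulary_size
    for word_id, count in frequencies.items():
        vec[word_id - 1] = count
    return vec


# Words 3, 6, 7 and 8 occur 1, 5, 3 and 9 times in a 10-word vocabulary.
vec_701 = word_vector({3: 1, 6: 5, 7: 3, 8: 9}, vocabulary_size=10)
```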

Fig. 1A shows an example of the functional modules of the paragraph segmentation device of Embodiment 1. Fig. 1B shows an example of a hardware configuration that realizes the paragraph segmentation device of Embodiment 1. The hardware of Fig. 1B is an ordinary computer consisting of a central processing unit (CPU) 11 serving as the processing unit; a storage unit 12 such as memory, RAM, ROM, a hard disk drive (HDD) or another storage device; an input/output unit 13; and a communication unit 14 serving as a network interface, all interconnected by an internal bus 15.

In Fig. 1A, the paragraph segmentation device 100 has a control unit 101, an input unit 102, a sentence segmentation unit 103, a feature quantity calculation unit 104, a similarity calculation unit 105, a search query generation unit 106, a feature quantity update unit 107, a paragraph update unit 108, an output unit 109, a sentence storage unit 110, a corpus unit 111, a feature quantity storage unit 112, a paragraph storage unit 113 and a morphological analysis unit 114. As a premise, the corpus unit 111 is assumed to store S_D documents, such as newspaper articles.

The input unit 102 and output unit 109 correspond to the input/output unit 13 and communication unit 14, and the sentence storage unit 110, corpus unit 111, feature quantity storage unit 112 and paragraph storage unit 113 correspond to the memory and storage devices of the storage unit 12. The remaining units, namely the control unit 101, sentence segmentation unit 103, feature quantity calculation unit 104, similarity calculation unit 105, search query generation unit 106, feature quantity update unit 107, paragraph update unit 108 and morphological analysis unit 114, can be realized by the CPU 11 executing the operating system (OS) and the various programs stored in the storage unit, such as ROM.

The operation of each functional module of the paragraph segmentation device of Embodiment 1 shown in Fig. 1A is described in turn.

First, the document to be segmented into paragraphs is input to the device through the input unit 102. The sentence segmentation unit 103, realized by the CPU 11 (the processing unit) executing a program, splits the input document into sentence units and stores the resulting sentences in the sentence storage unit 110.
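The text does not say how sentence boundaries are detected. As a minimal sketch, assuming that a sentence ends at `.`, `!`, `?` or the CJK full stop `。`, the splitting step could look like this (the function name `split_sentences` is hypothetical, and abbreviations or decimals such as "3.14" would be split incorrectly):

```python
import re


def split_sentences(document):
    """Split text after sentence-final punctuation (an assumed rule).

    Whitespace after the punctuation is optional so that text without
    spaces between sentences, such as Japanese or Chinese, is also split.
    """
    parts = re.split(r"(?<=[.!?。])\s*", document)
    return [p.strip() for p in parts if p.strip()]


sentences = split_sentences("Rain fell all night. The river rose. Markets opened lower.")
```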

Similarly, the feature quantity calculation unit 104 uses each sentence read from the sentence storage unit 110 to obtain related documents from the corpus unit 111, converts the obtained related documents into a document vector, and stores it in the feature quantity storage unit 112. That is, the feature quantity calculation unit 104 generates a document vector like those illustrated in Fig. 6 by substituting values into the dimensions that correspond to the obtained related documents.

The search query generation unit 106 generates a search query and sends it to the control unit 101.

When a search query is supplied via the control unit 101, the feature quantity calculation unit 104 obtains the documents related to that search query from the sentence storage unit 110, converts the obtained related documents into a document vector, stores it in the feature quantity storage unit 112 as a feature quantity, and outputs it to the feature quantity update unit 107 via the control unit 101.

The similarity calculation unit 105 reads two document vectors from the feature quantity storage unit 112 as specified by the control unit 101 and calculates their similarity. The method of calculating similarity in this embodiment is described later. The similarity calculation unit 105 also determines whether the calculated similarity is at or above a predetermined threshold.

The search query generation unit 106 reads two document vectors from the feature quantity storage unit 112 as specified by the control unit 101 and extracts from the corpus unit 111 the group of documents common to both vectors. It generates a search query from the extracted common documents and outputs it to the control unit 101. The method of generating the search query is described later.

The feature quantity update unit 107 reads two document vectors V_i and V_j from the feature quantity storage unit 112 as specified by the control unit 101. In addition, a document vector V_k is input to the feature quantity update unit 107 from the control unit 101. A reliability is calculated from the three input document vectors V_k, V_i and V_j, and V_k is corrected based on that reliability. The reliability is described later. V_i and V_j are then deleted from the feature quantity storage unit 112, and V_k is stored in it.

The paragraph update unit 108 reads two sentences or paragraph candidates from the sentence storage unit 110 or the paragraph storage unit 113 as specified by the control unit 101. The sentences or paragraph candidates that were read are deleted from the sentence storage unit 110 or the paragraph storage unit 113 and connected to each other, and the connected result is stored in the paragraph storage unit 113 as a paragraph candidate.

The output unit 109 reads the sentences and paragraph candidates from the sentence storage unit 110 and the paragraph storage unit 113, determines whether each is an unknown paragraph, assigns labels to the paragraphs based on the result, and outputs them. Here, an unknown paragraph is a sentence or paragraph candidate for which it cannot be determined which paragraph it should be connected to. The method of identifying unknown paragraphs is described later.

Fig. 2 is a flowchart showing the operation of the paragraph segmentation program executed in the paragraph segmentation device of this embodiment. An example of the program's operation is described below with reference to Fig. 2.

Here, the case where a document containing two paragraphs is input is described as an example. The input document may contain more than two paragraphs, in which case the subsequent processing is the same.

The sentences in the first paragraph are denoted a₁, a₂, ..., a_N, and the sentences in the second paragraph are denoted b₁, b₂, ..., b_M. Here N is the number of sentences in the first paragraph and M is the number of sentences in the second paragraph (both natural numbers).

First, in step 201, a document is input through the input unit 102.

In step 202, the sentence segmentation unit 103 splits the input document into sentence units and stores them in the sentence storage unit 110.

In step 203, all the sentences a₁, a₂, ..., a_N, b₁, b₂, ..., b_M stored in the sentence storage unit 110 are input to the feature quantity calculation unit 104, and document vectors are obtained as described above. One way to compute a document vector is, for example, to use the cosine measure. The cosine measure is one method of calculating the similarity of two vectors. The cosine measure of two vectors Q and P is calculated by formula 1 below.

[mathematical expression 1]

\frac{\sum_{i} q_{i} p_{i}}{\sqrt{\sum_{i} q_{i}^{2}} \sqrt{\sum_{i} p_{i}^{2}}} \quad (q_{i} \in Q,\ p_{i} \in P)

(formula 1)
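Formula 1 translates directly into code. The sketch below assumes both vectors are equal-length lists of numbers; returning 0 for an all-zero vector is an added convention, since the formula is undefined in that case.

```python
import math


def cosine(q, p):
    """Cosine measure of two equal-length vectors (formula 1)."""
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(pi * pi for pi in p))
    # Added convention: similarity with an all-zero vector is taken as 0.
    return dot / norm if norm else 0.0
```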

In this embodiment, as noted above, word vectors are used for similar-document retrieval. For each document stored in the corpus unit 111, for example, a word vector W_i (0 ≤ i < S_D) is generated whose elements are the occurrence frequencies of the words the document contains. The input sentence is converted into a word vector in the same way, denoted W_current. The cosine measure between W_current and each W_i (0 ≤ i < S_D) is calculated, the top L documents (L is a predetermined natural number) in descending order of similarity are obtained, and these documents are converted into a document vector, which is stored in the feature quantity storage unit 112.
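The retrieval step can be sketched as follows. The sketch assumes that sentences and corpus documents are already tokenized into word lists (morphological analysis is not shown), that each selected document is weighted by its similarity score, and that documents with zero similarity are left out even if they fall within the top L. The names `cosine_sparse` and `sentence_document_vector` are hypothetical.

```python
import math
from collections import Counter


def cosine_sparse(a, b):
    """Cosine measure of two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def sentence_document_vector(sentence_words, corpus_words, top_l):
    """Rank corpus documents against one sentence and keep the top L."""
    w_current = Counter(sentence_words)
    scores = [cosine_sparse(w_current, Counter(doc)) for doc in corpus_words]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    vec = [0.0] * len(corpus_words)
    for i in ranked[:top_l]:
        if scores[i] > 0:          # assumption: unrelated documents are not selected
            vec[i] = scores[i]     # assumption: weight by similarity score
    return vec


corpus = [["rain", "wind"], ["stock", "market"], ["rain", "storm"]]
vec = sentence_document_vector(["rain", "storm"], corpus, top_l=2)
```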

The cosine measure is used here as an example of similarity calculation, but other measures may be used. As for the value of each element of the document vector, as illustrated in Fig. 6 (a) and (b), the selected documents may be set to 1 and the others to 0, or some weighting may be applied using, for example, the calculated similarity.

Next, in step 204, the document vectors stored in the feature quantity storage unit 112 are read two at a time, and the similarity calculation unit 105 finds the pair of document vectors V_i, V_j with the highest similarity. The similarity here may be calculated with the cosine measure described above, or with, for example, the number of common elements, that is, elements present in both document vectors.
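Step 204 amounts to an exhaustive search over all pairs. The following sketch assumes the vectors are held in a dictionary keyed by sentence label; `most_similar_pair` and `common_count` are hypothetical names, and both similarity options mentioned in the text are shown.

```python
import math
from itertools import combinations


def cosine(q, p):
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in p))
    return dot / norm if norm else 0.0


def common_count(q, p):
    """Alternative measure: number of elements that are non-zero in both vectors."""
    return sum(1 for qi, pi in zip(q, p) if qi and pi)


def most_similar_pair(vectors, similarity=cosine):
    """Return (similarity, key_i, key_j) for the most similar pair, or None."""
    best = None
    for (i, v_i), (j, v_j) in combinations(vectors.items(), 2):
        s = similarity(v_i, v_j)
        if best is None or s > best[0]:
            best = (s, i, j)
    return best


vectors = {"a1": [1, 1, 0, 0], "a2": [1, 1, 1, 0], "b1": [0, 0, 0, 1]}
best = most_similar_pair(vectors)
```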

In step 205, the similarity calculation unit 105 determines whether the maximum similarity calculated in step 204 is at or above a preset threshold. The threshold may be a fixed value set in advance, or it may be derived from the mean or variance of the similarities calculated in step 204.

Steps 206 and 207 are executed by the search query generation unit 106. In step 206, when the maximum similarity calculated in step 204 is at or above the threshold, the elements common to the pair of document vectors V_i, V_j are extracted and taken as the common elements V_ij of the document vectors.
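A minimal sketch of step 206, assuming that an element counts as "common" when it is non-zero in both document vectors and that common elements are represented by their 0-based indices (the function name `common_elements` is hypothetical):

```python
def common_elements(v_i, v_j):
    """Indices of the documents selected in both document vectors."""
    return [k for k, (a, b) in enumerate(zip(v_i, v_j)) if a > 0 and b > 0]


v_ij = common_elements([1, 0, 1, 1], [0, 0, 1, 1])
```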

In step 207, a search query is generated from the common elements V_ij obtained in step 206. One way to generate the search query is, for example, to use TFIDF. TFIDF is a kind of word weighting. TF (Term Frequency) and IDF (Inverse Document Frequency) are given by the following formulas, and TFIDF is obtained as the product of TF and IDF.

[mathematical expression 2]

{tf}_{i} = \frac{n_{i}}{\sum_{k} n_{k}},

{idf}_{i} = \log \frac{|D|}{|\{d : t_{i} \in d\}|}

(formula 2)

Here n_i is the number of occurrences of word i in document d, |D| is the total number of documents, and |{d : t_i ∈ d}| is the number of documents that contain word t_i. In this embodiment, the total number of documents |D| corresponds to the number of all documents stored in the corpus unit 111.

The morphological analysis unit 114 performs morphological analysis on document d, and the S_W words with the largest TFIDF values are extracted and used as the search query. Besides TFIDF, importance may be determined, for example, by occurrence frequency, the title of the document may be used as the query, or the search query may be generated by some other method.
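Formula 2 and the top-S_W selection can be sketched as below. The sketch assumes that document d and the corpus documents are already split into words, so the morphological analysis itself is not shown; the function name `tfidf_query` and the alphabetical tie-break are assumptions.

```python
import math
from collections import Counter


def tfidf_query(doc_words, corpus_words, s_w):
    """Return the s_w words of one document with the largest TFIDF values."""
    counts = Counter(doc_words)
    total = sum(counts.values())
    n_docs = len(corpus_words)
    scores = {}
    for word, n in counts.items():
        df = sum(1 for d in corpus_words if word in d)  # documents containing the word
        if df == 0:
            continue  # word absent from the corpus: idf is undefined, so skip it
        scores[word] = (n / total) * math.log(n_docs / df)
    # Largest score first; ties are broken alphabetically (an assumption).
    return sorted(scores, key=lambda w: (-scores[w], w))[:s_w]


corpus = [["rain", "storm", "rain"], ["rain", "market"], ["market", "stock"]]
query = tfidf_query(corpus[0], corpus, s_w=2)
```

In this toy corpus "storm" ranks above "rain" even though "rain" occurs more often, because "rain" also appears in a second document and so has a lower IDF.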

In step 208, the search query generated in step 207 is input to the feature quantity calculation unit 104 via the control unit 101, and the feature quantity calculation unit 104 obtains a new document vector V'_ij.

Next, steps 209 and 210 are performed, which include calculating the reliability of the newly obtained document vector V'_ij. Steps 209 and 210 are executed by the feature quantity update unit 107 shown in Fig. 1A. First, in step 209, the reliability of the document vector V'_ij obtained in step 208 is calculated, and the vector magnitude of the document vector is corrected according to the result.

In this embodiment, the reliability is an index that quantifies how many of the common elements V_ij are contained in the document vector V'_ij. One way to calculate it, for example, is to count how many of the common elements V_ij of the pair V_i, V_j are contained in V'_ij and divide that count by the number of common elements V_ij. When the common elements V_ij are weighted by importance, the reliability may be calculated according to those weights. In either case, when the reliability is lower than a predetermined value, the reliability is fed back, for example by increasing or decreasing the vector magnitude of the obtained document vector V'_ij.
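The counting form of the reliability can be sketched as follows. The text does not specify the feedback rule, so scaling the whole vector by the reliability when it falls below a cutoff is purely an assumption for illustration, as are the names `reliability` and `apply_reliability` and the default cutoff of 0.5.

```python
def reliability(new_vec, common_ids):
    """Fraction of the common elements V_ij that also appear in V'_ij."""
    if not common_ids:
        return 0.0
    hits = sum(1 for k in common_ids if new_vec[k] > 0)
    return hits / len(common_ids)


def apply_reliability(new_vec, common_ids, min_reliability=0.5):
    """Shrink the vector magnitude when reliability is low (assumed rule)."""
    r = reliability(new_vec, common_ids)
    if r < min_reliability:
        return [x * r for x in new_vec]
    return list(new_vec)


r = reliability([1, 0, 1, 0], [0, 2, 3])                 # 2 of 3 common elements kept
scaled = apply_reliability([1, 0, 0, 0], [0, 1, 2, 3])   # reliability 0.25
```

Note that uniform scaling changes dot-product style similarities but leaves the cosine measure of formula 1 unchanged, so the effect of this correction depends on which similarity measure is used afterwards.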

In step 210, the document vectors V_i and V_j that were used to generate the common elements V_ij are deleted from the feature quantity storage unit 112, and the newly obtained document vector V'_ij is stored in the feature quantity storage unit 112.

In step 211, the paragraph update unit 108 connects the two sentences or paragraph candidates that correspond to V_i and V_j. Sentences that have never been connected are stored in the sentence storage unit 110. When two sentences are connected, the sentences before connection are deleted from the sentence storage unit 110. When a paragraph candidate is connected to a sentence, or two paragraph candidates are connected to each other, not only the sentences before connection but also the paragraph candidates before connection are deleted, the latter from the paragraph storage unit 113. The connected sentences or paragraph candidates are stored in the paragraph storage unit 113 as a new paragraph candidate.

In the paragraph segmentation method and device of this embodiment, the target paragraphs are generated by repeating steps 204 to 211 of the flow in Fig. 2. When, in step 205, the maximum similarity of two document vectors is below the predetermined threshold, paragraph generation ends and step 212 is executed.
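The loop over steps 204 to 211 can be sketched as a greedy merge. This is a simplified illustration, not the patented implementation: the `requery` callback stands in for steps 207 and 208 (query generation and retrieval), the reliability correction of step 209 is omitted, and when no callback is given the new vector is simply set to 1 on the common elements. All names are hypothetical.

```python
import math
from itertools import combinations


def cosine(q, p):
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in p))
    return dot / norm if norm else 0.0


def segment(sentences, vectors, threshold, requery=None):
    """Merge the most similar pair repeatedly until none reaches the threshold."""
    units = [{"start": k, "text": s, "vec": list(v)}
             for k, (s, v) in enumerate(zip(sentences, vectors))]
    while len(units) > 1:
        # Step 204: find the most similar pair of units.
        sim, a, b = max((cosine(units[a]["vec"], units[b]["vec"]), a, b)
                        for a, b in combinations(range(len(units)), 2))
        if sim < threshold:                      # step 205: stop merging
            break
        u, w = units[a], units[b]
        # Step 206: elements present in both vectors.
        common = [k for k, (x, y) in enumerate(zip(u["vec"], w["vec"]))
                  if x > 0 and y > 0]
        if requery is not None:
            new_vec = requery(common)            # stand-in for steps 207-208
        else:
            new_vec = [1.0 if k in common else 0.0 for k in range(len(u["vec"]))]
        # Steps 210-211: replace the pair with one connected paragraph candidate.
        first, second = sorted((u, w), key=lambda t: t["start"])
        merged = {"start": first["start"],
                  "text": first["text"] + " " + second["text"],
                  "vec": new_vec}
        units = [t for k, t in enumerate(units) if k not in (a, b)] + [merged]
    return [t["text"] for t in sorted(units, key=lambda t: t["start"])]


paragraphs = segment(
    ["a1", "a2", "b1", "b2"],
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]],
    threshold=0.5,
)
```

As in the text, the pair that is merged does not have to be adjacent in the input; the sketch only restores document order when the result is returned.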

Step 212 is executed by the output unit 109 and consists of identifying unknown paragraphs and outputting the paragraphs. One way to identify unknown paragraphs is to examine the number of morphemes contained in a sentence or paragraph candidate. When a sentence or paragraph candidate contains few morphemes, a document vector sometimes cannot be generated properly, which makes connection difficult. Therefore, in step 212, when the number of morphemes in a remaining sentence or paragraph candidate is at or below a certain threshold, the output unit 109 labels it as an unknown paragraph and outputs it, and the processing flow ends.
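The labeling in step 212 can be sketched as below. A real system would count morphemes with a morphological analyzer; here whitespace-separated tokens are used as a stand-in, which is an assumption, as are the label strings and the function name `label_units`.

```python
def label_units(units, max_unknown_morphemes,
                count_morphemes=lambda text: len(text.split())):
    """Label each remaining unit as a numbered paragraph or as unknown.

    Units whose morpheme count is at or below max_unknown_morphemes are
    labeled "unknown". count_morphemes defaults to a whitespace token count,
    a stand-in for real morphological analysis.
    """
    labeled = []
    paragraph_no = 0
    for text in units:
        if count_morphemes(text) <= max_unknown_morphemes:
            labeled.append(("unknown", text))
        else:
            paragraph_no += 1
            labeled.append((f"paragraph {paragraph_no}", text))
    return labeled


labels = label_units(["Rain fell all night and the river rose.", "Yes."],
                     max_unknown_morphemes=2)
```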

Fig. 3 schematically shows an example of how sentences are connected according to the similarity of document vectors in this embodiment. The threshold in step 205 of Fig. 2 is assumed to be 10.

The result of the first similarity calculation is 301. In result 301, the highest similarity is 40, for the pair a₂ and a₃.

This pair is therefore processed through steps 205 to 211 of Fig. 2, and the flow returns to step 204 of Fig. 2. The connected result is denoted a₂₃. Similarly, b₁ and b₂ are selected as the most similar pair in result 302, and a₁ and a₂₃ in result 303, and each is processed through steps 205 to 211 of Fig. 2. Because the threshold is set to 10, no pair is selected in result 304, and paragraph generation is complete.

According to Embodiment 1 described in detail above, multiple paragraphs can be segmented correctly even when one document contains multiple paragraphs that include sentences close in meaning, that is, sentences with similar feature quantities, which in turn makes it possible to carry out automatic summarization of documents and automatic keyword extraction for document retrieval.

(embodiment 2)

Embodiment 2 is a paragraph segmentation method, device and program that use word vectors both for the similarity calculation and for similar-document retrieval.

Fig. 4 is a functional block diagram of the paragraph segmentation device of Embodiment 2. As with the device of Fig. 1A in Embodiment 1, this device can of course be realized by a computer such as that shown in Fig. 1B, so its hardware configuration is not illustrated here.

The input unit 402, sentence segmentation unit 403, paragraph update unit 408, output unit 409, sentence storage unit 410, feature quantity storage unit 412, paragraph storage unit 413 and morphological analysis unit 414 are the same as the corresponding modules of Embodiment 1, so only the units that differ from Embodiment 1 are described: the corpus unit 411, feature quantity calculation unit 404, similarity calculation unit 405, search query generation unit 406 and feature quantity update unit 407. In addition, the morphological analysis unit 414 is connected to the feature quantity calculation unit 404.

The corpus unit 411 uses, for example, a collection of documents such as newspaper articles, a synonym dictionary (thesaurus), or both.

The feature quantity calculation unit 404 uses the morphological analysis unit 414 to perform morphological analysis on each sentence read from the sentence storage unit 410 and converts the sentence into a word vector. When the word vector has too few elements, it is effective to increase the number of elements using the corpus unit 411. For example, when a thesaurus is used as the corpus, each word obtained from the input sentence is used as a query to look up near-synonyms, and the near-synonyms obtained are appended to the word vector. When a collection of documents is used as the corpus, the word vectors of documents extracted from the corpus can be appended to the word vector obtained from the input sentence.
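The thesaurus-based expansion can be sketched as follows. The sketch assumes the thesaurus is a plain dictionary from a word to its near-synonyms and that each appended synonym receives the same count as the word that triggered it; the text does not specify the weight of appended elements. The function name `expand_with_thesaurus` is hypothetical.

```python
from collections import Counter


def expand_with_thesaurus(words, thesaurus):
    """Build a sparse word vector and append near-synonyms of each word."""
    vec = Counter(words)
    for word in words:
        for synonym in thesaurus.get(word, ()):
            vec[synonym] += 1   # assumed weight: one count per triggering occurrence
    return vec


expanded = expand_with_thesaurus(
    ["rain", "flood"],
    {"rain": ["shower", "drizzle"], "flood": ["deluge"]},
)
```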

Another way to append elements to the word vector is to select the main words from those extracted documents using TFIDF or the like and append them to the word vector. The method is not limited to these; words related to the sentence may be obtained and appended by other methods so that the word vector has enough elements. The resulting word vector is stored in the feature quantity storage unit 412. When a word vector is supplied from the search query generation unit 406 to the feature quantity calculation unit 404 via the control unit 401, its number of elements is expanded in the same way, and the word vector is stored in the feature quantity storage unit 412 and output to the feature quantity update unit 407 via the control unit 401.

The similarity calculation unit 405 of this embodiment reads two word vectors from the feature quantity storage unit 412 as specified by the control unit 401 and calculates their similarity. As the similarity calculation method, the cosine measure described above can be used, for example.

The search query generation unit 406 of this embodiment reads two word vectors from the feature quantity storage unit 412 as specified by the control unit 401 and extracts from the corpus unit 411 the group of words common to both word vectors. It generates a word vector from the extracted common words and outputs it to the feature quantity calculation unit 404 via the control unit 401.

The feature quantity update unit 407 reads two word vectors V_i and V_j from the feature quantity storage unit 412 as specified by the control unit 401. In addition, a word vector V_k is input from the control unit 401. A reliability is calculated from the three input word vectors V_k, V_i and V_j, and the vector magnitude of V_k is corrected based on that reliability. V_i and V_j are then deleted from the feature quantity storage unit 412, and V_k is stored in it.

Fig. 5 is a flowchart showing the operation of the program of Embodiment 2. Embodiment 1 uses document vectors for the similarity calculation, whereas Embodiment 2 uses word vectors as described above. This is the only difference; the operation is otherwise the same as in Embodiment 1.

According to Embodiment 2, paragraphs can be segmented correctly even when one document contains multiple paragraphs that include sentences close in meaning, that is, sentences with similar feature quantities.

The present invention is not limited to the embodiments above and includes various modifications. For example, the embodiments have been described in detail to make the invention easy to understand, and the invention is not necessarily limited to configurations that have all the elements described. The configuration of another embodiment may be added to the configuration of a given embodiment. For part of the configuration of each embodiment, other configurations may be added, deleted or substituted.

Some or all of the structures, functions, processing units, processing means and so on described above may be realized in hardware, for example by designing them as integrated circuits. The structures and functions above have been described for the case where they are realized in software by executing programs that implement each function; the programs, tables, files and other information that implement each function may be placed not only in memory but also in a storage device such as a hard disk or SSD (Solid State Drive), or on a storage medium such as an IC card, SD card or DVD, and may be downloaded and installed via a network or the like as needed.

Claims

1. A paragraph segmentation method in which a processing part segments a document into paragraphs, the paragraph segmentation method being characterized in that

the processing part

segments the document into sentence units,

uses each segmented sentence as a query to extract related documents from a plurality of pre-stored documents, and generates feature amounts, and

regenerates a feature amount by using the common elements of two feature amounts, among the generated feature amounts, whose similarity is at or above a predetermined threshold.

2. The paragraph segmentation method according to claim 1, characterized in that

the processing part uses document vectors as the feature amounts.

3. The paragraph segmentation method according to claim 2, characterized in that

the processing part,

when the similarity of two document vectors V_i, V_j serving as the two feature amounts is at or above the predetermined threshold, selects the common elements V_ij of the two document vectors V_i, V_j and generates a retrieval query.

4. The paragraph segmentation method according to claim 3, characterized in that

the processing part uses the generated retrieval query to obtain a new document vector V'_ij.

5. The paragraph segmentation method according to claim 4, characterized in that

the processing part corrects the vector magnitude of the new document vector V'_ij according to the degree to which the new document vector V'_ij contains the elements of the common elements V_ij.

6. The paragraph segmentation method according to claim 4, characterized in that

the processing part connects the sentences or paragraph candidates corresponding to the new document vector V'_ij to form a new paragraph candidate.

7. The paragraph segmentation method according to claim 1, characterized in that

the processing part uses word vectors as the feature amounts.

8. The paragraph segmentation method according to claim 7, characterized in that

when the similarity of two word vectors V_i, V_j serving as the two feature amounts is at or above the predetermined threshold, the common elements V_ij of the two word vectors V_i, V_j are selected and a retrieval query is generated, and

the generated retrieval query is used to obtain a new word vector V'_ij.

9. The paragraph segmentation method according to claim 8, characterized in that

the processing part corrects the vector magnitude of the new word vector V'_ij according to the degree to which the new word vector V'_ij contains the elements of the common elements V_ij.

10. The paragraph segmentation method according to claim 9, characterized in that

the processing part connects the sentences or paragraph candidates corresponding to the new word vector V'_ij to form a new paragraph candidate.

11. A paragraph segmentation device that segments an input document into paragraphs, the paragraph segmentation device being characterized in that

it comprises a processing part and a storage part, and

the processing part

segments the document into sentence units,

uses each segmented sentence as a query to extract related documents from a plurality of documents pre-stored in the storage part, and generates feature amounts, and

regenerates a feature amount by using the common elements of two feature amounts, among the generated feature amounts, whose similarity is at or above a predetermined threshold.

12. The paragraph segmentation device according to claim 11, characterized in that

the processing part uses, as the feature amounts, document vectors or word vectors based on the related documents.

13. The paragraph segmentation device according to claim 12, characterized in that

the processing part,

when the similarity of two document vectors or word vectors V_i, V_j serving as the two feature amounts is at or above the predetermined threshold, selects the common elements V_ij of the two document vectors or word vectors V_i, V_j and generates a retrieval query,

uses the generated retrieval query to obtain a new document vector or word vector V'_ij, and

corrects the vector magnitude of the new document vector or word vector V'_ij according to the degree to which the new document vector or word vector V'_ij contains the elements of the common elements V_ij.

14. The paragraph segmentation device according to claim 13, characterized in that

the processing part connects the sentences or paragraph candidates corresponding to the new document vector V'_ij, and stores the newly connected paragraph candidate in the storage part.

15. A paragraph segmentation program executed by the processing part of a paragraph segmentation device that comprises a processing part and a storage part and segments an input document into paragraphs, the paragraph segmentation program being characterized in that

it causes the processing part to perform the following operations:

segmenting the document into sentence units,

using each segmented sentence as a query to extract related documents from a plurality of documents pre-stored in the storage part,

generating feature amounts from the extracted related documents,