CN107122350A

CN107122350A - Feature extraction system and method for multi-paragraph text

Info

Publication number: CN107122350A
Application number: CN201710287337.9A
Authority: CN
Inventors: 许延祥; 王飞剑; 刘宗福; 周东红; 黄世祥
Original assignee: Beijing Easy Mike Technology Co Ltd
Current assignee: Beijing Easy Mike Technology Co Ltd
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2017-09-01
Anticipated expiration: 2037-04-27
Also published as: CN107122350B

Abstract

The present invention relates to a feature extraction system and method for multi-paragraph text. The system comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, the weight setting module, the text processing module, the word segmenter, and the second computing module each exchange data with the main control module. The technical solution provided by the invention is a general, feasible way to extract text features, and during extraction it reflects the differing weights of different paragraphs in the text.

Description

Feature extraction system and method for multi-paragraph text

Technical field

The present invention relates to text feature extraction technology, and in particular to a feature extraction system and method for multi-paragraph text.

Background art

An original document passes through the steps of a text processing system, such as preprocessing, word segmentation, word frequency statistics, TF-IDF calculation, and vector generation, and the result is stored in persistent storage so that text computing applications can call it later.
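The word frequency and TF-IDF steps of such a pipeline can be sketched as follows. This is a minimal illustration, not the text processing system the document refers to: it assumes the documents are already segmented into tokens, uses one common TF-IDF variant (term frequency normalized by document length, idf = log(N / df)), and the function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(token_lists):
    """Return (vocabulary, vectors) for a list of already-tokenized documents."""
    # Shared, sorted vocabulary so every vector has the same dimensions.
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors
```

With this variant, a term that occurs in every document gets an idf of zero and so contributes nothing to any vector.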

Extracting text features and converting them into vectors for storage makes it possible to compute on and compare texts, provided that the main semantics of the text are retained in the text vector. The key measure of the quality of text feature extraction is therefore how well the semantics of the text are preserved. The prior art has one obvious shortcoming in text feature extraction: it treats the content of the whole text equally. When people organize the content of a text, however, they treat it as a complete article. The title usually summarizes the theme of the whole piece and implies the field and category of the article; the first paragraph sets out the main content and core idea of the full text; the other paragraphs each elaborate on some aspect of the theme, and the first sentence of each paragraph usually expresses the topic of that paragraph (although this rule is often broken). The final paragraph generally serves as a summary, stating conclusions or revisiting the central idea (news items and simple articles may not follow this pattern). Consequently, the same sentence, word, or word frequency carries a different semantic weight (relative importance in expressing the semantics of the text) in different paragraphs.

On the whole, for paragraphs: title weight > abstract weight (if an abstract exists) > first-paragraph weight > last-paragraph weight > weight of other paragraphs. For the sentences within a paragraph: first-sentence weight > weight of other sentences. Current text feature extraction techniques do not take into account this characteristic of organizing semantics by paragraph during writing.
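The ordering above can be turned into a relative weight vector as in the sketch below. The source gives only the ordering, so the numeric values, the role names, and the function `expected_weights` are all hypothetical and chosen purely for illustration.

```python
# Hypothetical values: the source states only the ordering
# title > abstract > first paragraph > last paragraph > other paragraphs.
ROLE_WEIGHTS = {
    "title": 5.0,
    "abstract": 4.0,
    "first": 3.0,
    "last": 2.0,
    "other": 1.0,
}


def expected_weights(roles):
    """Map a list of paragraph roles to relative weights that sum to 1."""
    raw = [ROLE_WEIGHTS[role] for role in roles]
    total = sum(raw)
    return [w / total for w in raw]
```

For a text with a title, a first paragraph, one middle paragraph, and a last paragraph, `expected_weights(["title", "first", "other", "last"])` yields weights that preserve the ordering and sum to 1.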

Summary of the invention

In view of the deficiencies of the prior art, the object of the present invention is to provide a feature extraction system and method for multi-paragraph text. The invention provides a general, feasible way to extract text features such that the differing weights of different paragraphs in the text are reflected during extraction.

The object of the present invention is achieved by the following technical solutions:

The present invention provides a feature extraction system for multi-paragraph text, the improvement being that it comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, the weight setting module, the text processing module, the word segmenter, and the second computing module each exchange data with the main control module.

Further, the system also comprises a text vector library, which stores the paragraph text vectors transmitted by the main control module.

Further, the first computing module is used to solve the system of equations for the paragraph texts; the second computing module is used to compute TF-IDF, a weighting technique commonly used in information retrieval and data mining.

Further, the weight setting module is used to set weights for the generated system of equations, and the text processing module is used to segment the text into paragraphs.

The present invention also provides an extraction method for the feature extraction system for multi-paragraph text, the improvement being that it comprises:

Marking the paragraphs of any text T;

Setting an expected relative weight vector for any text T;

Performing feature extraction on the marked paragraphs and the expected relative weight vector using the weight setting module and the text processing module respectively, to obtain a text vector that reflects the different weights of the paragraphs.

Further, marking the paragraphs of any text T comprises:

For any text T consisting of n paragraphs, the i-th paragraph is marked Pi, so that T=[P1, P2, ..., Pn].
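A minimal sketch of the marking step follows. The source does not say how paragraph boundaries are detected; this example assumes paragraphs are separated by blank lines, and the function name `mark_paragraphs` is hypothetical.

```python
import re


def mark_paragraphs(text):
    """Split text into the paragraph list T = [P1, ..., Pn].

    Assumption: paragraphs are separated by one or more blank lines.
    """
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]
```

The list index plays the role of the paragraph mark: `mark_paragraphs(text)[i - 1]` corresponds to Pi.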

Further, setting the expected relative weight vector for any text T comprises:

For any text T, there is an expected relative weight vector weights=[w1, w2, ..., wn], where wi denotes the relative weight of Pi; wi is expressed either as an absolute value or as a relative value.
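Because wi may be given as an absolute value or as a relative value, it can help to bring both forms to a common scale before use. The sketch below is one way to do that, assuming non-negative weights; the function name `normalize_weights` is hypothetical and not part of the source.

```python
def normalize_weights(weights):
    """Scale absolute or percentage weights so that they sum to 1."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]
```

Integer weights such as [3, 2, 1] and percentages such as [50, 30, 20] are both mapped to fractions of 1, so later steps can treat them uniformly.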

Further, performing feature extraction on the marked paragraphs and the expected relative weight vector to obtain a text vector that reflects the different weights of the paragraphs comprises the following steps:

1) For each paragraph P in T, perform vectorization using the word segmenter and the second computing module, and store the resulting paragraph text vectors in an n-dimensional vector array, in which each array element is the text vector corresponding to a paragraph Pi;

2) For each array element in the n-dimensional vector array, the text processing module generates the weight sum of the text vector corresponding to paragraph Pi and stores it in a weight-sum array;

3) Based on the weight-sum array and the expected relative weight vector weights, generate a system of homogeneous linear equations for weight distribution, and add adjustment coefficients to the system of equations;

4) Solve the system of equations using the first computing module; the solution is the adjustment coefficient array;

5) Adjust each paragraph text vector: multiply each adjustment coefficient by the corresponding text vector to obtain the adjusted paragraph text vector;

6) Merge the paragraph text vectors: add together the paragraph text vectors in the n-dimensional vector array that have been multiplied by their adjustment coefficients, obtaining the final text vector that reflects the different weights of the paragraphs, and store this vector in the text vector library.

Further, in step 2), the weight sum of a text vector is computed as follows: the element values of the text vector corresponding to paragraph Pi are added together, the accumulated result is returned, and the weight sum is stored at the corresponding position of the weight-sum array.
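The weight-sum computation amounts to a per-paragraph sum, as in the sketch below. It assumes each paragraph text vector is a plain list of numbers; the function name `paragraph_weight_sums` is hypothetical, while `vectorArray` and `paragraphWeight` in the comments are the names used in the source.

```python
def paragraph_weight_sums(vector_array):
    """Return paragraphWeight, where paragraphWeight[i] = sum of vectorArray[i]."""
    return [sum(vector) for vector in vector_array]
```

Position i of the returned list corresponds to paragraph Pi, mirroring the layout of the vector array.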

Further, in step 3), the system of homogeneous linear equations is represented as a matrix, and two arrays are finally returned; an adjustment coefficient is added for each paragraph, wherein the adjustment coefficient ci satisfies the equation (paragraphWeight[i] * ci) / sum(paragraphWeight[j] * cj) = weights[i], the sum running over all paragraphs j = 1, ..., n;

Wherein: ci is the adjustment coefficient of paragraph Pi; to obtain a specific solution, the constraint sum(ci) = 1 is added to the system of equations; paragraphWeight[] is the weight-sum array; weights[i] is the i-th element of the expected relative weight vector;
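The adjustment coefficients can be sketched in closed form under one reading of the equation, namely that the adjusted weight of paragraph i, as a share of the total adjusted weight over all paragraphs, must equal weights[i]. With the weights scaled to sum to 1, each product paragraphWeight[i] * ci is then proportional to weights[i], and the constraint sum(ci) = 1 fixes the scale, giving ci = (weights[i] / paragraphWeight[i]) / sum over k of (weights[k] / paragraphWeight[k]). The source solves the system with its first computing module and does not state this closed form, so the function below is an assumption-laden illustration with a hypothetical name.

```python
def adjustment_coefficients(paragraph_weight, weights):
    """Solve for c so that paragraph i's share of sum(p[j] * c[j]) is weights[i].

    Assumes every weight sum is positive. The weights may be on any scale,
    because a common factor cancels when the ratios are normalized.
    """
    if len(paragraph_weight) != len(weights):
        raise ValueError("one expected weight is needed per paragraph")
    if any(p <= 0 for p in paragraph_weight):
        raise ValueError("every paragraph weight sum must be positive")
    ratios = [w / p for w, p in zip(weights, paragraph_weight)]
    scale = sum(ratios)
    if scale <= 0:
        raise ValueError("at least one expected weight must be positive")
    # Dividing by the total enforces the constraint sum(c) = 1.
    return [r / scale for r in ratios]
```

A paragraph whose weight sum is zero (for example, an empty TF-IDF vector) cannot be scaled to a non-zero share, which is why the sketch rejects it rather than guessing a value.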

Preferably, in step 5), for each array element in the n-dimensional vector array, the product vectorArray[i][j] * coefficients[i] is computed and then stored back in the original position of the text vector.

Wherein: coefficients[i] is an element of the adjustment coefficient array, and vectorArray[i][j] is an element of the n-dimensional vector array; i, j = 1, 2, 3, ..., n; i denotes the row of the array and j denotes the column.
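The in-place scaling of step 5) can be sketched as follows, assuming the vector array is a list of lists with one row per paragraph. The function name `scale_in_place` is hypothetical; `vectorArray` and `coefficients` are the names used in the source, and Python indices start at 0 rather than 1.

```python
def scale_in_place(vector_array, coefficients):
    """Overwrite vectorArray[i][j] with vectorArray[i][j] * coefficients[i]."""
    if len(vector_array) != len(coefficients):
        raise ValueError("one coefficient is needed per paragraph vector")
    for i, coefficient in enumerate(coefficients):
        row = vector_array[i]
        for j in range(len(row)):
            row[j] = row[j] * coefficient  # stored back at the original position
    return vector_array
```

Returning the same list object is only a convenience; the caller's array is already modified when the function returns.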

Compared with the closest prior art, the technical solution provided by the present invention achieves the following beneficial effects:

The present invention provides a general, feasible way to extract text features such that the differing weights of different paragraphs in the text are reflected during extraction. Specifically:

1. High precision and efficiency: the extracted text vector better reflects the semantic features of the original text, which can substantially improve the user-perceivable accuracy of text recommendation, and the relative weight of each paragraph can be adjusted at any time according to the needs of different applications.

2. Low cost: the solution can be connected easily to various text processing systems, since only the original text vector generation part needs to be replaced.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the main structure and principle of text feature extraction in the prior art;

Fig. 2 is a structural diagram of the feature extraction system for multi-paragraph text.

Detailed description of the embodiments

To make the object, technical solutions, and advantages of the present invention clearer, the technical solutions are described in detail below. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment 1

The present invention provides a feature extraction system for multi-paragraph text. As shown in Fig. 2, the system comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, the weight setting module, the text processing module, the word segmenter, and the second computing module each exchange data with the main control module.

In the above embodiment, the system also comprises a text vector library, which stores the paragraph text vectors transmitted by the main control module.

In the above embodiment, the first computing module is used to solve the system of equations for the paragraph texts; the second computing module is used to compute TF-IDF, a weighting technique commonly used in information retrieval and data mining.

In the above embodiment, the weight setting module is used to set weights for the generated system of equations, and the text processing module is used to segment the text into paragraphs.
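The star-shaped arrangement, in which every module exchanges data only with the main control module, can be sketched as below. This is a hypothetical wiring, not the structure of Fig. 2: the class name `MainControl`, the callable signatures, and the exact division of work between modules are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class MainControl:
    """Hypothetical hub: the other modules are plain callables it coordinates."""
    segmenter: Callable[[str], List[str]]                    # word segmentation
    second_computing: Callable[[List[List[str]]], List[Vector]]  # e.g. TF-IDF
    text_processing: Callable[[str], List[str]]              # paragraph splitting
    weight_setting: Callable[[int], List[float]]             # expected weights
    first_computing: Callable[[List[float], List[float]], List[float]]  # solver
    vector_store: Dict[str, Vector] = field(default_factory=dict)  # vector library

    def extract(self, doc_id: str, text: str) -> Vector:
        paragraphs = self.text_processing(text)
        vectors = self.second_computing([self.segmenter(p) for p in paragraphs])
        weight_sums = [sum(v) for v in vectors]
        coefficients = self.first_computing(
            weight_sums, self.weight_setting(len(paragraphs))
        )
        # Scale each paragraph vector and add the results dimension by dimension.
        merged = [
            sum(c * x for c, x in zip(coefficients, column))
            for column in zip(*vectors)
        ]
        self.vector_store[doc_id] = merged
        return merged
```

Passing the modules in as callables keeps the hub independent of any particular segmenter or solver, which matches the stated aim of replacing only the vector generation part of an existing system.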

Embodiment 2

The present invention also provides an extraction method for the feature extraction system for multi-paragraph text, comprising:

S1: For any text T, assume it consists of n paragraphs and the i-th paragraph is marked Pi, so that T=[P1, P2, ..., Pn].

S2: For any text T, assume there is an expected relative weight vector weights=[w1, w2, ..., wn], where wi denotes the relative weight of Pi. wi can be expressed as an absolute value (for example, an integer) or as a relative value (for example, a percentage).

S3: Perform feature extraction on the marked paragraphs and the expected relative weight vector using the weight setting module and the text processing module respectively, to obtain a text vector that reflects the different weights of the paragraphs, comprising the following sub-steps:

1) For each paragraph P in T, perform vectorization using the word segmenter and the second computing module, and store the resulting paragraph text vectors in the n-dimensional vector array vectorArray[], in which the array element vectorArray[i] is the text vector corresponding to paragraph Pi;

2) For each element vectorArray[i] in vectorArray, generate the weight sum of that vector. It is computed by adding together the element values of the vector and returning the accumulated result. The sum is stored at the corresponding position of the array paragraphWeight[].

3) Based on paragraphWeight and weights, generate the system of homogeneous linear equations for weight distribution. The system is represented directly as a matrix, and two arrays are finally returned. For the text vector of each paragraph to reach its required relative weight in the final vector, an adjustment coefficient must be added for each paragraph, and the system of equations is expressed in terms of these adjustment coefficients. Let ci be the adjustment coefficient of paragraph Pi; then ci must satisfy the equation (paragraphWeight[i] * ci) / sum(paragraphWeight[j] * cj) = weights[i], the sum running over all paragraphs j = 1, ..., n. To obtain a specific solution, the constraint sum(ci) = 1 is added to the system of equations.

4) Solve the system of equations using the first computing module; the solution is the adjustment coefficient array coefficients[n].

5) Adjust each paragraph text vector: multiply each adjustment coefficient by the corresponding text vector to obtain the adjusted text vector. That is, for each vector vectorArray[i] in vectorArray, each of its elements vectorArray[i][j] is multiplied by coefficients[i] and then stored back in the original position of the text vector.

6) Merge the text vectors: add together the text vectors in vectorArray that have been multiplied by their adjustment coefficients, obtaining the final text vector that reflects the different weights of the paragraphs, and store this vector in the text vector library.
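Sub-steps 2) to 6) can be sketched end to end as below, starting from paragraph vectors that have already been produced in sub-step 1). The sketch makes several assumptions that the source does not state: all paragraph vectors share one vocabulary and therefore one length, every weight sum is positive, the equation in sub-step 3) is read as "paragraph i's share of the total adjusted weight equals weights[i]", and the system is solved in closed form instead of through a matrix solver. The function name `weighted_text_vector` is hypothetical.

```python
def weighted_text_vector(vector_array, weights):
    """Return (merged_vector, coefficients) for paragraph vectors and weights."""
    if len(vector_array) != len(weights):
        raise ValueError("one expected weight is needed per paragraph")
    # Sub-step 2: weight sum of each paragraph vector.
    paragraph_weight = [sum(vector) for vector in vector_array]
    if any(p <= 0 for p in paragraph_weight):
        raise ValueError("every paragraph weight sum must be positive")
    # Sub-steps 3 and 4: coefficients with sum(c) = 1 whose adjusted weight
    # shares match the expected relative weights.
    ratios = [w / p for w, p in zip(weights, paragraph_weight)]
    scale = sum(ratios)
    coefficients = [r / scale for r in ratios]
    # Sub-step 5: multiply each paragraph vector by its coefficient.
    scaled = [
        [value * c for value in vector]
        for vector, c in zip(vector_array, coefficients)
    ]
    # Sub-step 6: add the scaled vectors dimension by dimension.
    merged = [sum(column) for column in zip(*scaled)]
    return merged, coefficients
```

For two paragraph vectors [1, 1, 0] and [0, 2, 2] with expected weights 0.75 and 0.25, the first paragraph contributes three quarters of the total weight of the merged vector even though its raw weight sum is the smaller of the two.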

With the technical solution of the present invention, the extracted text vector better reflects the semantic features of the original text, which can substantially improve the user-perceivable accuracy of text recommendation; precision and efficiency are high, and the relative weight of each paragraph can be adjusted at any time according to the needs of different applications. The solution can be connected easily to various text processing systems, since only the original text vector generation part needs to be replaced, so the cost is low.

The above is only a specific embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A feature extraction system for multi-paragraph text, characterized by comprising a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, the weight setting module, the text processing module, the word segmenter, and the second computing module each exchange data with the main control module.

2. The feature extraction system according to claim 1, characterized by further comprising a text vector library, which stores the paragraph text vectors transmitted by the main control module.

3. The feature extraction system according to claim 1, characterized in that the first computing module is used to solve the system of equations for the paragraph texts; the second computing module is used to compute TF-IDF, a weighting technique commonly used in information retrieval and data mining.

4. The feature extraction system according to claim 1, characterized in that the weight setting module is used to set weights for the generated system of equations, and the text processing module is used to segment the text into paragraphs.

5. An extraction method using the feature extraction system for multi-paragraph text according to any one of claims 1-4, characterized by comprising:

Marking the paragraphs of any text T;

Setting an expected relative weight vector for any text T;

Performing feature extraction on the marked paragraphs and the expected relative weight vector using the weight setting module and the text processing module respectively, to obtain a text vector that reflects the different weights of the paragraphs.

6. The extraction method according to claim 5, characterized in that marking the paragraphs of any text T comprises:

For any text T consisting of n paragraphs, the i-th paragraph is marked Pi, so that T=[P1, P2, ..., Pn].

7. The extraction method according to claim 5, characterized in that setting the expected relative weight vector for any text T comprises:

For any text T, there is an expected relative weight vector weights=[w1, w2, ..., wn], where wi denotes the relative weight of Pi; wi is expressed either as an absolute value or as a relative value.

8. The extraction method according to claim 5, characterized in that performing feature extraction on the marked paragraphs and the expected relative weight vector to obtain a text vector that reflects the different weights of the paragraphs comprises the following steps:

1) For each paragraph P in T, perform vectorization using the word segmenter and the second computing module, and store the resulting paragraph text vectors in an n-dimensional vector array, in which each array element is the text vector corresponding to a paragraph Pi;

2) For each array element in the n-dimensional vector array, the text processing module generates the weight sum of the text vector corresponding to paragraph Pi and stores it in a weight-sum array;

3) Based on the weight-sum array and the expected relative weight vector weights, generate a system of homogeneous linear equations for weight distribution, and add adjustment coefficients to the system of equations;

4) Solve the system of equations using the first computing module; the solution is the adjustment coefficient array;

5) Adjust each paragraph text vector: multiply each adjustment coefficient by the corresponding text vector to obtain the adjusted paragraph text vector;

6) Merge the paragraph text vectors: add together the paragraph text vectors in the n-dimensional vector array that have been multiplied by their adjustment coefficients, obtaining the final text vector that reflects the different weights of the paragraphs, and store this vector in the text vector library.

9. The extraction method according to claim 8, characterized in that, in step 2), the weight sum of a text vector is computed as follows: the element values of the text vector corresponding to paragraph Pi are added together, the accumulated result is returned, and the weight sum is stored at the corresponding position of the weight-sum array.

10. The extraction method according to claim 8, characterized in that, in step 3), the system of homogeneous linear equations is represented as a matrix, and two arrays are finally returned; an adjustment coefficient is added for each paragraph, wherein the adjustment coefficient ci satisfies the equation (paragraphWeight[i] * ci) / sum(paragraphWeight[j] * cj) = weights[i], the sum running over all paragraphs j = 1, ..., n;

Wherein: ci is the adjustment coefficient of paragraph Pi; to obtain a specific solution, the constraint sum(ci) = 1 is added to the system of equations; paragraphWeight[] is the weight-sum array; weights[i] is the i-th element of the expected relative weight vector;

Preferably, in step 5), for each array element in the n-dimensional vector array, the product vectorArray[i][j] * coefficients[i] is computed and then stored back in the original position of the text vector.

Wherein: coefficients[i] is an element of the adjustment coefficient array, and vectorArray[i][j] is an element of the n-dimensional vector array; i, j = 1, 2, 3, ..., n; i denotes the row of the array and j denotes the column.