CN107122350A - Feature extraction system and method for multi-paragraph text - Google Patents
- Publication number
- CN107122350A (application number CN201710287337.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- paragraph
- vector
- array
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a feature extraction system and method for multi-paragraph text. The system comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, weight setting module, text processing module, word segmenter, and second computing module each exchange data with the main control module. The technical solution provided by the invention offers a general and feasible way to extract text features and, during extraction, reflects the weight differences between the different paragraphs of a text.
Description
Technical field
The present invention relates to text feature extraction technology, and in particular to a feature extraction system and method for multi-paragraph text.
Background art
An original document passes through the preprocessing, word segmentation, word-frequency statistics, TF-IDF calculation, and vector generation steps of a text processing system, and the result is saved to persistent storage for later use by text-computation applications.
Extracting text features and storing them as vectors makes texts computable and comparable, provided that the main semantics of the text are retained in the text vector. The key measure of text feature extraction quality is therefore how well the semantics of the text are preserved. The prior art has an obvious shortcoming in text feature extraction: it treats the content of the whole text equally. When people organize the content of a text, however, they treat it as one complete piece. The title usually summarizes the theme of the whole piece and implies the field and category of the article; the first paragraph states the main content and core idea of the full text; the other paragraphs each elaborate on one aspect of the theme, and the first sentence of each paragraph usually expresses the theme of that paragraph (although this convention is often broken). The last paragraph generally serves as a summary that states conclusions or revisits the central idea (informational pieces or simple articles may not follow this convention). Consequently, the same sentence, word, or word frequency carries a different semantic weight (its relative importance in expressing the semantics of the text) depending on the paragraph in which it appears.
In general, for paragraphs: title weight > abstract (if any) weight > first paragraph weight > last paragraph weight > other paragraph weight. For the sentences within a paragraph: first sentence weight > other sentence weight. Current text feature extraction techniques do not take into account this practice of organizing semantics by paragraph.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a feature extraction system and method for multi-paragraph text. The invention offers a general and feasible way to extract text features and, during extraction, reflects the weight differences between the different paragraphs of a text.
The object of the present invention is achieved by the following technical solution:
The present invention provides a feature extraction system for multi-paragraph text, wherein the improvement is that the system comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, weight setting module, text processing module, word segmenter, and second computing module each exchange data with the main control module.
Further, the system also comprises a text vector library, which stores the paragraph text vectors sent by the main control module.
Further, the first computing module is used to solve the system of equations for the paragraph text; the second computing module is used to calculate TF-IDF, a weighting technique commonly used in information retrieval and data mining.
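As an illustration of the TF-IDF calculation assigned to the second computing module, the following Python sketch turns a list of paragraphs into aligned vectors. It is a minimal sketch under several assumptions that the patent does not state: whitespace splitting stands in for the word segmenter, the paragraphs of one text serve as the document collection for the IDF term, a smoothed IDF formula is used, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(paragraphs):
    """Return (vocabulary, vectors) with one TF-IDF vector per paragraph.

    Assumptions (not from the patent): tokens are split on whitespace,
    the paragraphs themselves form the IDF collection, and the IDF is
    smoothed as log((1 + N) / (1 + df)) + 1.
    """
    tokenized = [p.lower().split() for p in paragraphs]
    # A shared, sorted vocabulary keeps all vectors aligned by position.
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    n = len(tokenized)
    # Document frequency: number of paragraphs that contain each term.
    df = Counter(tok for tokens in tokenized for tok in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty paragraphs
        vectors.append([
            (counts[term] / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term in vocabulary
        ])
    return vocabulary, vectors


vocabulary, vector_array = tfidf_vectors(
    ["text feature extraction", "paragraph weight", "text vector"]
)
```

Aligning every vector to one vocabulary is a design choice made here so that paragraph vectors can later be added element by element.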
Further, the weight setting module is used to set the weights for the generated system of equations, and the text processing module is used to divide the text into paragraphs.
The present invention also provides an extraction method for the feature extraction system for multi-paragraph text, wherein the improvement is that the method comprises:
labeling the paragraphs of any text T;
setting an expected relative weight vector for the text T;
performing feature extraction on the labeled paragraphs and the expected relative weight vector using the weight setting module and the text processing module, respectively, to obtain a text vector that reflects the different weights of the paragraphs.
Further, labeling the paragraphs of any text T comprises:
for any text T consisting of n paragraphs, the i-th paragraph is labeled Pi, so that T = [P1, P2, ..., Pn].
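The labeling T = [P1, P2, ..., Pn] can be sketched in Python as a simple split of the raw text. This is only a sketch: it assumes that paragraphs are separated by one or more blank lines, which the patent does not specify, and `split_paragraphs` is a hypothetical helper name.

```python
import re


def split_paragraphs(text):
    """Split raw text into the list T = [P1, ..., Pn].

    Assumption (not from the patent): blank lines separate paragraphs.
    Line breaks inside a paragraph are collapsed to single spaces.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


T = split_paragraphs("Title\n\nFirst paragraph,\nsecond line.\n\n\nLast paragraph.")
# With 0-based Python indexing, T[i - 1] corresponds to paragraph Pi.
```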
Further, setting the expected relative weight vector for any text T comprises:
for any text T there is an expected relative weight vector weights = [w1, w2, ..., wn], where wi denotes the relative weight of Pi; wi is expressed either as an absolute value or as a relative value.
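Because wi may be given either as absolute numbers or as relative values, it can help to bring both forms to a common scale before further processing. The sketch below assumes (this is not stated in the patent) that the weights are non-negative and are normalized so that they sum to 1; `normalize_weights` is a hypothetical helper name.

```python
def normalize_weights(weights):
    """Rescale absolute or percentage weights into fractions that sum to 1.

    Assumption (not from the patent): weights are non-negative and at
    least one of them is positive.
    """
    if not weights:
        raise ValueError("weights must not be empty")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


absolute = normalize_weights([5, 3, 2])        # integer weights
percent = normalize_weights([50.0, 30.0, 20.0])  # percentages
```

Both calls describe the same relative importance of three paragraphs, so they normalize to the same fractions.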
Further, performing feature extraction on the labeled paragraphs and the expected relative weight vector to obtain a text vector that reflects the different weights of the paragraphs comprises the following steps:
1) for each paragraph P in T, perform vectorization using the word segmenter and the second computing module, and store the resulting paragraph text vectors in an n-dimensional vector array, where each array element is the text vector corresponding to paragraph Pi;
2) for each array element in the n-dimensional vector array, the text processing module generates the weight sum of the text vector corresponding to paragraph Pi and stores it in the weight-sum array;
3) based on the weight-sum array and the expected relative weight vector weights, generate the system of homogeneous linear equations for weight distribution, and add adjustment coefficients to the system;
4) solve the system using the first computing module; the solution is the adjustment coefficient array;
5) adjust each paragraph text vector: multiply each text vector by its corresponding adjustment coefficient to obtain the adjusted paragraph text vector;
6) merge the paragraph text vectors: accumulate the paragraph text vectors in the n-dimensional vector array that have been multiplied by their adjustment coefficients to obtain the final paragraph text vector that reflects the different weights of the paragraphs, and store the paragraph text vector in the text vector library.
Further, in step 2), the weight sum of a text vector is calculated as follows: the values of all elements in the text vector corresponding to paragraph Pi are added together, the accumulated result is returned, and the weight sum of the text vector is stored at the corresponding position of the weight-sum array.
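The weight-sum calculation amounts to one sum per paragraph vector. A minimal Python sketch follows; the function name `weight_sums` is hypothetical, and the result plays the role of the weight-sum array that the document elsewhere calls paragraphWeight.

```python
import math


def weight_sums(vector_array):
    """Return the weight-sum array: entry i is the sum of all element
    values of the text vector for paragraph Pi."""
    # math.fsum limits floating-point rounding error for long vectors.
    return [math.fsum(vector) for vector in vector_array]


vector_array = [
    [0.5, 1.5, 0.0],  # text vector of P1
    [1.0, 0.0, 3.0],  # text vector of P2
    [0.0, 4.0, 0.0],  # text vector of P3
]
paragraph_weight = weight_sums(vector_array)
```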
Further, in step 3), the system of homogeneous linear equations is represented as a matrix, and two arrays are finally returned. Adding an adjustment coefficient for each paragraph comprises: the adjustment coefficient ci satisfies the equation
(paragraphWeight[i] * ci) / sum_j(paragraphWeight[j] * cj) = weights[i], with j = 1, 2, ..., n,
where ci is the adjustment coefficient of paragraph Pi, paragraphWeight[] is the weight-sum array, and weights[i] is the i-th element of the expected relative weight vector. To obtain a particular solution, the constraint sum(ci) = 1 is added to the system.
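The relation for ci can be sketched as a small linear system. The code below assumes that the denominator of the relation is the sum of paragraphWeight[j] * cj over all paragraphs and that weights sums to 1; under those assumptions each equation rearranges to paragraphWeight[i] * ci - weights[i] * sum_j(paragraphWeight[j] * cj) = 0. Because the n homogeneous equations are then linearly dependent, the sketch replaces the last one with the constraint sum(ci) = 1 and solves by Gaussian elimination. The function names are hypothetical, and the patent does not specify a solver.

```python
def build_system(paragraph_weight, weights):
    """Return (matrix, rhs) for the adjustment coefficients.

    Row i (i < n-1): paragraph_weight[i]*c_i - weights[i]*sum_j(paragraph_weight[j]*c_j) = 0
    Last row:        c_1 + ... + c_n = 1  (replaces one redundant equation)
    Assumes weights sums to 1 and all entries are positive.
    """
    n = len(paragraph_weight)
    matrix = [
        [(paragraph_weight[i] if i == j else 0.0) - weights[i] * paragraph_weight[j]
         for j in range(n)]
        for i in range(n)
    ]
    rhs = [0.0] * n
    matrix[n - 1] = [1.0] * n
    rhs[n - 1] = 1.0
    return matrix, rhs


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


matrix, rhs = build_system([2.0, 4.0, 4.0], [0.5, 0.3, 0.2])
coefficients = solve_linear_system(matrix, rhs)
```

Returning the matrix and the right-hand side as two separate arrays mirrors the statement that the system is represented as a matrix and two arrays are returned, although the patent does not say what those two arrays contain.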
Preferably, in step 5), for each array element in the n-dimensional vector array, each element vectorArray[i][j] is multiplied by coefficients[i], and the product is stored back at its original position in the original text vector.
Here coefficients[i] is an element of the adjustment coefficient array and vectorArray[i][j] is an element of the n-dimensional vector array; i, j = 1, 2, 3, ..., n, where i is the row index and j is the column index of the array.
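The in-place multiplication of vectorArray[i][j] by coefficients[i] can be sketched in Python as follows. The function name `scale_in_place` is hypothetical, and Python indices start at 0 rather than 1.

```python
def scale_in_place(vector_array, coefficients):
    """Multiply every element of row i by coefficients[i] and store the
    product back at the same position, overwriting the original value."""
    if len(vector_array) != len(coefficients):
        raise ValueError("one coefficient is required per paragraph vector")
    for i, vector in enumerate(vector_array):
        for j in range(len(vector)):
            vector[j] *= coefficients[i]


vector_array = [[1.0, 2.0], [3.0, 4.0]]
scale_in_place(vector_array, [0.5, 2.0])
# vector_array now holds the adjusted paragraph vectors.
```

Overwriting the original vectors saves memory, at the cost of losing the unscaled values; a copy would be needed if the paragraph weights are to be changed again later.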
Compared with the closest prior art, the technical solution provided by the present invention achieves the following beneficial effects:
The invention offers a general and feasible way to extract text features and, during extraction, reflects the weight differences between the different paragraphs of a text. Specifically:
1. High precision and efficiency: the extracted text vector better reflects the semantic features of the original text and can substantially improve the text recommendation precision perceived by users, and the relative weight of each paragraph can be adjusted at any time according to the needs of different applications.
2. Low cost: the solution can easily be connected to various text processing systems; only the original text vector generation part needs to be replaced.
Brief description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a diagram of the main structure and principle of text extraction in the existing technology;
Fig. 2 is a structure diagram of the feature extraction system for multi-paragraph text.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the invention are described in detail below. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Embodiment one
The present invention provides a feature extraction system for multi-paragraph text. As shown in Fig. 2, the system comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, weight setting module, text processing module, word segmenter, and second computing module each exchange data with the main control module.
In the above embodiment, the system also comprises a text vector library, which stores the paragraph text vectors sent by the main control module.
In the above embodiment, the first computing module is used to solve the system of equations for the paragraph text; the second computing module is used to calculate TF-IDF, a weighting technique commonly used in information retrieval and data mining.
In the above embodiment, the weight setting module is used to set the weights for the generated system of equations, and the text processing module is used to divide the text into paragraphs.
Embodiment two
The present invention also provides an extraction method for the feature extraction system for multi-paragraph text, comprising:
S1: For any text T, assume that it consists of n paragraphs and that the i-th paragraph is labeled Pi; then T = [P1, P2, ..., Pn].
S2: For any text T, assume that there is an expected relative weight vector weights = [w1, w2, ..., wn], where wi denotes the relative weight of Pi. wi may be expressed as an absolute value (for example, an integer) or as a relative value (for example, a percentage).
S3: Perform feature extraction on the labeled paragraphs and the expected relative weight vector using the weight setting module and the text processing module, respectively, to obtain a text vector that reflects the different weights of the paragraphs. This step comprises the following sub-steps:
1) For each paragraph P in T, perform vectorization using the word segmenter and the second computing module, and store the resulting paragraph text vectors in the n-dimensional vector array vectorArray[], where the array element vectorArray[i] is the text vector corresponding to paragraph Pi.
2) For each element vectorArray[i] in vectorArray, generate the weight sum of the vector. The calculation is: add the values of all elements in the vector and return the accumulated result. The sum is stored at the corresponding position of the array paragraphWeight[].
3) Based on paragraphWeight and weights, generate the system of homogeneous linear equations for weight distribution. The system is represented directly as a matrix, and two arrays are finally returned. For the text vector of each paragraph to reach its corresponding relative weight in the final vector, an adjustment coefficient must be added for each paragraph; the adjustment coefficients are the unknowns of the system. Assuming that the adjustment coefficient of paragraph Pi is ci, ci must satisfy the equation (paragraphWeight[i] * ci) / sum_j(paragraphWeight[j] * cj) = weights[i], with j = 1, 2, ..., n. To obtain a particular solution, the constraint sum(ci) = 1 is added to the system.
4) Solve the system using the first computing module; the solution is the adjustment coefficient array coefficients[n].
5) Adjust each paragraph text vector: multiply each text vector by its corresponding adjustment coefficient to obtain the adjusted text vector. That is, for each vector vectorArray[i] in vectorArray, multiply each of its elements vectorArray[i][j] by coefficients[i] and store the product back at its original position in the original text vector.
6) Merge the text vectors: accumulate the text vectors in vectorArray that have been multiplied by their adjustment coefficients to obtain the final text vector that reflects the different weights of the paragraphs, and store the paragraph text vector in the text vector library.
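Steps 5) and 6) can be sketched together in Python as follows. The sketch assumes that all paragraph vectors are aligned to one shared vocabulary so that they can be added element by element, which the patent does not state explicitly; `merge_paragraph_vectors` and `paragraph_shares` are hypothetical names. The example coefficients are the ones that satisfy the step 3) relation for weight sums [2, 4, 4] and expected weights [0.5, 0.3, 0.2], assuming the denominator of that relation is the total over all paragraphs.

```python
def merge_paragraph_vectors(vector_array, coefficients):
    """Scale each paragraph vector by its coefficient and add the scaled
    vectors element by element into one final text vector."""
    if len(vector_array) != len(coefficients):
        raise ValueError("one coefficient is required per paragraph vector")
    if not vector_array:
        return []
    width = len(vector_array[0])
    merged = [0.0] * width
    for vector, coefficient in zip(vector_array, coefficients):
        if len(vector) != width:
            raise ValueError("paragraph vectors must share one vocabulary")
        for j, value in enumerate(vector):
            merged[j] += coefficient * value
    return merged


def paragraph_shares(vector_array, coefficients):
    """Return each paragraph's share of the total weight after scaling,
    which should match the expected relative weight vector."""
    scaled = [c * sum(v) for v, c in zip(vector_array, coefficients)]
    total = sum(scaled)
    return [s / total for s in scaled]


vector_array = [
    [0.5, 1.5, 0.0],  # P1, weight sum 2
    [1.0, 0.0, 3.0],  # P2, weight sum 4
    [0.0, 4.0, 0.0],  # P3, weight sum 4
]
coefficients = [2 / 3, 1 / 5, 2 / 15]
final_vector = merge_paragraph_vectors(vector_array, coefficients)
shares = paragraph_shares(vector_array, coefficients)
```

Checking the shares against the expected weights is a cheap sanity test before the final vector is written to the text vector library.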
With this technical solution, the extracted text vector better reflects the semantic features of the original text and can substantially improve the text recommendation precision perceived by users, giving high precision and efficiency, and the relative weight of each paragraph can be adjusted at any time according to the needs of different applications. The solution can easily be connected to various text processing systems; only the original text vector generation part needs to be replaced, so the cost is low.
The above is only a specific embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.
Claims (10)
1. A feature extraction system for multi-paragraph text, characterized in that it comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, weight setting module, text processing module, word segmenter, and second computing module each exchange data with the main control module.
2. The feature extraction system according to claim 1, characterized in that it further comprises a text vector library, which stores the paragraph text vectors sent by the main control module.
3. The feature extraction system according to claim 1, characterized in that the first computing module is used to solve the system of equations for the paragraph text, and the second computing module is used to calculate TF-IDF, a weighting technique commonly used in information retrieval and data mining.
4. The feature extraction system according to claim 1, characterized in that the weight setting module is used to set the weights for the generated system of equations, and the text processing module is used to divide the text into paragraphs.
5. An extraction method for the feature extraction system for multi-paragraph text according to any one of claims 1-4, characterized in that it comprises:
labeling the paragraphs of any text T;
setting an expected relative weight vector for the text T;
performing feature extraction on the labeled paragraphs and the expected relative weight vector using the weight setting module and the text processing module, respectively, to obtain a text vector that reflects the different weights of the paragraphs.
6. The extraction method according to claim 5, characterized in that labeling the paragraphs of any text T comprises:
for any text T consisting of n paragraphs, the i-th paragraph is labeled Pi, so that T = [P1, P2, ..., Pn].
7. The extraction method according to claim 5, characterized in that setting the expected relative weight vector for any text T comprises:
for any text T there is an expected relative weight vector weights = [w1, w2, ..., wn], where wi denotes the relative weight of Pi; wi is expressed either as an absolute value or as a relative value.
8. The extraction method according to claim 5, characterized in that performing feature extraction on the labeled paragraphs and the expected relative weight vector to obtain a text vector that reflects the different weights of the paragraphs comprises the following steps:
1) for each paragraph P in T, perform vectorization using the word segmenter and the second computing module, and store the resulting paragraph text vectors in an n-dimensional vector array, where each array element is the text vector corresponding to paragraph Pi;
2) for each array element in the n-dimensional vector array, the text processing module generates the weight sum of the text vector corresponding to paragraph Pi and stores it in the weight-sum array;
3) based on the weight-sum array and the expected relative weight vector weights, generate the system of homogeneous linear equations for weight distribution, and add adjustment coefficients to the system;
4) solve the system using the first computing module; the solution is the adjustment coefficient array;
5) adjust each paragraph text vector: multiply each text vector by its corresponding adjustment coefficient to obtain the adjusted paragraph text vector;
6) merge the paragraph text vectors: accumulate the paragraph text vectors in the n-dimensional vector array that have been multiplied by their adjustment coefficients to obtain the final paragraph text vector that reflects the different weights of the paragraphs, and store the paragraph text vector in the text vector library.
9. The extraction method according to claim 8, characterized in that in step 2) the weight sum of a text vector is calculated as follows: the values of all elements in the text vector corresponding to paragraph Pi are added together, the accumulated result is returned, and the weight sum of the text vector is stored at the corresponding position of the weight-sum array.
10. The extraction method according to claim 8, characterized in that in step 3) the system of homogeneous linear equations is represented as a matrix, and two arrays are finally returned; adding an adjustment coefficient for each paragraph comprises: the adjustment coefficient ci satisfies the equation
(paragraphWeight[i] * ci) / sum_j(paragraphWeight[j] * cj) = weights[i], with j = 1, 2, ..., n,
where ci is the adjustment coefficient of paragraph Pi, paragraphWeight[] is the weight-sum array, and weights[i] is the i-th element of the expected relative weight vector; to obtain a particular solution, the constraint sum(ci) = 1 is added to the system;
preferably, in step 5), for each array element in the n-dimensional vector array, each element vectorArray[i][j] is multiplied by coefficients[i], and the product is stored back at its original position in the original text vector,
where coefficients[i] is an element of the adjustment coefficient array and vectorArray[i][j] is an element of the n-dimensional vector array; i, j = 1, 2, 3, ..., n, where i is the row index and j is the column index of the array.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710287337.9A CN107122350B (en) | 2017-04-27 | 2017-04-27 | Method of multi-paragraph text feature extraction system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107122350A (en) | 2017-09-01 |
CN107122350B (en) | 2021-02-05 |
Family
ID=59725061
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201710287337.9A | Active | CN107122350B (en) | 2017-04-27 | 2017-04-27 | Method of multi-paragraph text feature extraction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107122350B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101035128A (en) * | 2007-04-18 | 2007-09-12 | 大连理工大学 | Three-folded webpage text content recognition and filtering method based on the Chinese punctuation |
WO2013038824A1 (en) * | 2011-09-15 | 2013-03-21 | 株式会社富士通マーケティング | Accounting data generating device, method, program, system, server device, and recording medium |
CN103678274A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Feature extraction method for text categorization based on improved mutual information and entropy |
CN104408083A (en) * | 2014-10-27 | 2015-03-11 | 六盘水职业技术学院 | Socialized media analyzing system |
CN105760474A (en) * | 2016-02-14 | 2016-07-13 | Tcl集团股份有限公司 | Document collection feature word extracting method and system based on position information |
CN105808524A (en) * | 2016-03-11 | 2016-07-27 | 江苏畅远信息科技有限公司 | Patent document abstract-based automatic patent classification method |
CN106372064A (en) * | 2016-11-18 | 2017-02-01 | 北京工业大学 | Characteristic word weight calculating method for text mining |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115952279A (en) * | 2022-12-02 | 2023-04-11 | 杭州瑞成信息技术股份有限公司 | Text outline extraction method and device, electronic device and storage medium |
CN115952279B (en) * | 2022-12-02 | 2023-09-12 | 杭州瑞成信息技术股份有限公司 | Text outline extraction method and device, electronic device and storage medium |
CN118568266A (en) * | 2024-08-05 | 2024-08-30 | 湖州南浔交水规划设计研究有限公司 | Municipal engineering design data processing method |
Also Published As
Publication number | Publication date |
---|---|
CN107122350B (en) | 2021-02-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||