CN107122350A - A feature extraction system and method for multi-paragraph texts - Google Patents

A feature extraction system and method for multi-paragraph texts

Info

Publication number
CN107122350A
Authority
CN
China
Prior art keywords
text
paragraph
vector
array
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710287337.9A
Other languages
Chinese (zh)
Other versions
CN107122350B (en)
Inventor
许延祥
王飞剑
刘宗福
周东红
黄世祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Easy Mike Technology Co Ltd
Original Assignee
Beijing Easy Mike Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Easy Mike Technology Co Ltd
Priority to CN201710287337.9A
Publication of CN107122350A
Application granted
Publication of CN107122350B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a feature extraction system and method for multi-paragraph texts. The system comprises a first computing module, a main control module, a weight setting module, a text processing module, a segmenter, and a second computing module; the first computing module, weight setting module, text processing module, segmenter, and second computing module each exchange data with the main control module. The technical solution provided by the invention offers a general, feasible way to perform text feature extraction, and the extraction process reflects the differing weights of the different paragraphs in a text.

Description

A feature extraction system and method for multi-paragraph texts
Technical field
The present invention relates to text feature extraction technology, and in particular to a feature extraction system and method for multi-paragraph texts.
Background
An original document passes through the steps of a text processing system (preprocessing, word segmentation, term-frequency counting, TF-IDF calculation, and vector generation), and the result is stored in persistent storage for later use by text-computation applications.
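The pipeline described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the whitespace tokenizer stands in for a real segmenter, the function names `tokenize` and `tf_idf_vectors` are hypothetical, and the IDF form log(N/df) is one common variant chosen here as an assumption.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed stand-in for the segmenter: lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty input
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors
```

Every vector shares the same vocabulary order, so the vectors can be compared or added element by element. A term that occurs in every document gets a TF-IDF value of zero under this IDF variant.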
By extracting text features and converting them into stored vectors, texts can be computed on and compared with one another, provided that the main semantics of the text are retained in the text vector. The key measure of text feature extraction quality is therefore whether the semantics of the text are well preserved. The prior art has one obvious shortcoming in text feature extraction: it treats the content of the whole text equally. When people organize text content, however, they treat the text as one complete piece. The title usually summarizes the theme of the whole piece and implies the field and category of the article; the first paragraph states the main content and core idea of the full text; the other paragraphs each elaborate on some aspect of the theme, and the first sentence of each paragraph usually expresses the theme of that paragraph (although this rule is often broken). The final paragraph generally serves as a summary, stating a conclusion or revisiting the central idea (news items or short articles may not follow this). Consequently, the same sentence, word, or term frequency carries a different semantic weight (its relative importance in expressing the semantics of the text) in different paragraphs.
On the whole, for paragraphs: title weight > abstract weight (if an abstract exists) > first-paragraph weight > last-paragraph weight > weight of the other paragraphs; for the sentences within a paragraph: first-sentence weight > weight of the other sentences. Current text feature extraction techniques do not take into account this habit of organizing semantics by paragraph when writing.
Summary of the invention
To address the deficiencies of the prior art, the object of the present invention is to provide a feature extraction system and method for multi-paragraph texts. The invention provides a general, feasible method of text feature extraction in which the extraction process reflects the differing weights of the different paragraphs in a text.
The object of the present invention is achieved by the following technical solutions:
The present invention provides a feature extraction system for multi-paragraph texts, the improvement being that it comprises a first computing module, a main control module, a weight setting module, a text processing module, a segmenter, and a second computing module; the first computing module, weight setting module, text processing module, segmenter, and second computing module each exchange data with the main control module.
Further, the system also comprises a text vector store, which stores the paragraph text vectors transmitted by the main control module.
Further, the first computing module solves the system of equations generated for the paragraph texts; the second computing module performs the TF-IDF calculation, a weighting scheme commonly used in information retrieval and data mining.
Further, the weight setting module sets the weights for the generated system of equations, and the text processing module segments the text into paragraphs.
The present invention also provides an extraction method for the feature extraction system for multi-paragraph texts, the improvement being that it comprises:
marking the paragraphs of any text T;
setting an expected relative weight vector for the text T;
using the weight setting module and the text processing module to perform feature extraction on the marked paragraphs and the expected relative weight vector, obtaining a text vector that reflects the different weights of the paragraphs.
Further, marking the paragraphs of any text T comprises:
Any text T consists of n paragraphs; the i-th paragraph is denoted Pi, so that T=[P1, P2, ..., Pn].
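As a minimal sketch of the paragraph-marking step, the function below splits a text into the list [P1, ..., Pn]. The source does not say how paragraph boundaries are detected; treating blank lines as separators, and the name `mark_paragraphs`, are assumptions made for illustration.

```python
import re


def mark_paragraphs(text):
    """Split text into non-empty paragraphs separated by blank lines."""
    # Assumption: one or more blank lines mark a paragraph boundary.
    parts = re.split(r"\n\s*\n", text.strip())
    return [part.strip() for part in parts if part.strip()]
```

The index of each entry in the returned list plays the role of the paragraph label i, so the list lines up position by position with the expected relative weight vector.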
Further, setting an expected relative weight vector for the text T comprises:
For any text T there is an expected relative weight vector weights=[w1, w2, ..., wn], where wi denotes the relative weight of Pi; wi is expressed either as an absolute number or as a relative value.
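Because wi may be given either as absolute numbers or as relative values, it can help to bring both forms to a common scale before they are used in later calculations. The sketch below is one possible way to do that and is an assumption rather than part of the source: it rescales any non-negative weight list so that its entries sum to 1. The name `normalize_weights` is hypothetical.

```python
def normalize_weights(weights):
    """Rescale non-negative weights so that they sum to 1."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]
```

For example, the integer weights [3, 2, 1] and the percentages [50, 30, 20] both become fractions of the whole, so either input style yields a vector that can be compared directly with each paragraph's share of the final text vector.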
Further, performing feature extraction on the marked paragraphs and the expected relative weight vector to obtain a text vector reflecting the different weights of the paragraphs comprises the following steps:
1) each paragraph P in T is vectorized using the segmenter and the second computing module, and the resulting paragraph text vectors are stored in an n-dimensional vector array, where each array element is the text vector corresponding to paragraph Pi;
2) for each array element in the n-dimensional vector array, the text processing module generates the weight sum of the text vector corresponding to paragraph Pi and stores it in a weight-sum array;
3) based on the weight-sum array and the expected relative weight vector weights, a system of homogeneous linear equations for weight distribution is generated, and adjustment coefficients are introduced into the system;
4) the first computing module solves the system of equations; the solution is the adjustment coefficient array;
5) each paragraph text vector is adjusted: each adjustment coefficient is multiplied by the corresponding text vector, yielding the adjusted paragraph text vector;
6) the paragraph text vectors are merged: the paragraph text vectors in the n-dimensional vector array, already multiplied by their adjustment coefficients, are added together to obtain the final text vector reflecting the different weights of the paragraphs, which is stored in the text vector store.
Further, in step 2), the weight sum of a text vector is computed as follows: the element values of the text vector corresponding to paragraph Pi are added together, the accumulated result is returned, and the weight sum is stored at the corresponding position in the weight-sum array.
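The weight-sum calculation of step 2) amounts to summing the elements of each paragraph vector. A minimal sketch follows; the function name `paragraph_weight_sums` is hypothetical, and the use of `math.fsum` to limit floating-point rounding error is a choice made here, not something the source specifies.

```python
import math


def paragraph_weight_sums(vector_array):
    """Return one weight sum per paragraph vector, in paragraph order."""
    # The i-th result corresponds to paragraphWeight[i] in the text.
    return [math.fsum(vector) for vector in vector_array]
```

A paragraph whose vector sums to zero has no weight to redistribute, so code that later divides by these sums would need to handle that case explicitly.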
Further, in step 3), the system of homogeneous linear equations is represented as a matrix, and two arrays are finally returned. An adjustment coefficient is added for each paragraph: the adjustment coefficient ci satisfies the equation (paragraphWeight[i] * ci) / Σj (paragraphWeight[j] * cj) = weights[i],
where ci is the adjustment coefficient of paragraph Pi, the sum runs over all paragraphs j = 1, ..., n, paragraphWeight[] is the weight-sum array, and weights[i] is the i-th element of the expected relative weight vector. To obtain a specific solution, the constraint sum(ci) = 1 is added to the system.
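One way to set up and solve this system is sketched below, under stated assumptions: the denominator of the equation is read as the total of paragraphWeight[j] * cj over all paragraphs, the weights are already normalized to sum to 1, and every weight sum is non-zero. Under those assumptions the n homogeneous equations are linearly dependent, so the sketch replaces the last one with the constraint sum(ci) = 1. The function names and the hand-written Gaussian elimination are illustrative choices, not the patent's first computing module.

```python
def build_system(paragraph_weight, weights):
    """Return (A, b) so that A * c = b encodes the weight-distribution system."""
    n = len(paragraph_weight)
    A = []
    for i in range(n - 1):
        # Row i: paragraph_weight[i]*c_i - weights[i] * sum_j paragraph_weight[j]*c_j = 0
        row = [-weights[i] * paragraph_weight[j] for j in range(n)]
        row[i] += paragraph_weight[i]
        A.append(row)
    A.append([1.0] * n)  # normalization constraint: sum(c) = 1
    b = [0.0] * (n - 1) + [1.0]
    return A, b


def solve(A, b):
    """Solve A * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("system is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        acc = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / M[i][i]
    return x
```

Here `build_system` returns two arrays, the coefficient matrix and the right-hand side, which is one plausible reading of "two arrays are finally returned". With weight sums [2.0, 4.0, 1.0] and weights [0.5, 0.3, 0.2], the solved coefficients give each paragraph exactly its requested share of the total.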
Preferably, in step 5), for each array element in the n-dimensional vector array, each element vectorArray[i][j] is multiplied by coefficients[i], and the product is stored back in its original position in the text vector,
where coefficients[i] is an element of the adjustment coefficient array and vectorArray[i][j] is an element of the n-dimensional vector array; i, j = 1, 2, 3, ..., n, with i indexing the rows of the array and j its columns.
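The in-place adjustment of step 5) can be sketched as follows. The function name `scale_in_place` is hypothetical, and the vectors are assumed to be mutable Python lists.

```python
def scale_in_place(vector_array, coefficients):
    """Multiply every element of vector i by coefficients[i], overwriting it."""
    for i, vector in enumerate(vector_array):
        for j in range(len(vector)):
            # The product is written back to the same position [i][j].
            vector[j] *= coefficients[i]
    return vector_array
```

Overwriting the original positions avoids allocating a second array of the same size, at the cost of losing the unscaled vectors; a caller that still needs them would have to copy the array first.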
Compared with the closest prior art, the technical solution provided by the present invention achieves the following beneficial effects:
The invention provides a general, feasible method of text feature extraction in which the extraction process reflects the differing weights of the different paragraphs in a text. Specifically:
1. High precision and efficiency: the extracted text vector better reflects the semantic features of the original text and can substantially improve the text recommendation precision perceived by users; the relative weight of each paragraph can be adjusted at any time to suit the needs of different applications.
2. Low cost: the method can easily be integrated with various text processing systems; only the original text vector generation component needs to be replaced.
Brief description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the main structure and principle of text feature extraction in the prior art;
Fig. 2 is a structural diagram of the feature extraction system for multi-paragraph texts.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the invention are described in detail below. The described embodiments are evidently only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment one
The present invention provides a feature extraction system for multi-paragraph texts, whose structure is shown in Fig. 2. It comprises a first computing module, a main control module, a weight setting module, a text processing module, a segmenter, and a second computing module; the first computing module, weight setting module, text processing module, segmenter, and second computing module each exchange data with the main control module.
In the above embodiment, the system also comprises a text vector store, which stores the paragraph text vectors transmitted by the main control module.
In the above embodiment, the first computing module solves the system of equations generated for the paragraph texts; the second computing module performs the TF-IDF calculation, a weighting scheme commonly used in information retrieval and data mining.
In the above embodiment, the weight setting module sets the weights for the generated system of equations, and the text processing module segments the text into paragraphs.
Embodiment two
The present invention also provides an extraction method for the feature extraction system for multi-paragraph texts, comprising:
S1: For any text T, assume it consists of n paragraphs and denote the i-th paragraph Pi, so that T=[P1, P2, ..., Pn].
S2: For any text T, assume there is an expected relative weight vector weights=[w1, w2, ..., wn], where wi denotes the relative weight of Pi. wi can be expressed as an absolute number (for example, an integer) or as a relative value (for example, a percentage).
S3: The weight setting module and the text processing module are used to perform feature extraction on the marked paragraphs and the expected relative weight vector, obtaining a text vector that reflects the different weights of the paragraphs. This comprises the following sub-steps:
1) Each paragraph P in T is vectorized using the segmenter and the second computing module, and the resulting paragraph text vectors are stored in the n-dimensional vector array vectorArray[], where the array element vectorArray[i] is the text vector corresponding to paragraph Pi;
2) For each element vectorArray[i] in vectorArray, the weight sum of that vector is generated as follows: the element values of the vector are added together and the accumulated result is returned. The vector sum is stored at the corresponding position of the array paragraphWeight[].
3) Based on paragraphWeight and weights, a system of homogeneous linear equations for weight distribution is generated. The system is represented directly as a matrix, and two arrays are finally returned. For the text vector of each paragraph to reach its required relative weight in the final vector, an adjustment coefficient must be added for each paragraph; the unknowns of the system are these adjustment coefficients. Let the adjustment coefficient of paragraph Pi be ci; then ci must satisfy the equation (paragraphWeight[i] * ci) / Σj (paragraphWeight[j] * cj) = weights[i], where the sum runs over all paragraphs j = 1, ..., n. To obtain a specific solution, the constraint sum(ci) = 1 is added to the system.
4) The first computing module solves the system of equations; the solution is the adjustment coefficient array coefficients[n].
5) Each paragraph text vector is adjusted: each adjustment coefficient is multiplied by the corresponding text vector, yielding the adjusted text vector. That is, for each vector vectorArray[i] in vectorArray, each element vectorArray[i][j] is multiplied by coefficients[i], and the product is stored back in its original position in the text vector.
6) The text vectors are merged: the text vectors in vectorArray, already multiplied by their adjustment coefficients, are added together to obtain the final text vector reflecting the different weights of the paragraphs, which is stored in the text vector store.
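Sub-steps 2) to 6) can be tied together in a short end-to-end sketch. It rests on several assumptions that the source does not state: all paragraph vectors share one vocabulary and therefore one length, the weights are normalized to sum to 1, every weight sum is non-zero, and the share of a paragraph is measured against the total over all paragraphs. Under those assumptions the system has the closed-form solution ci proportional to weights[i] / paragraphWeight[i], which the sketch uses as a shortcut instead of a general equation solver. The name `merge_weighted` is hypothetical.

```python
def merge_weighted(vector_array, weights):
    """Return (coefficients, merged_vector) for the given paragraph vectors."""
    # Sub-step 2: weight sum of each paragraph vector.
    sums = [sum(vector) for vector in vector_array]
    if any(s == 0 for s in sums):
        raise ValueError("a paragraph vector sums to zero")
    # Sub-steps 3-4 (shortcut): c_i is proportional to weights[i] / sums[i],
    # rescaled so that the coefficients add up to 1.
    raw = [w / s for w, s in zip(weights, sums)]
    norm = sum(raw)
    coefficients = [r / norm for r in raw]
    # Sub-steps 5-6: scale each vector and add the results element by element.
    merged = [0.0] * len(vector_array[0])
    for c, vector in zip(coefficients, vector_array):
        for j, value in enumerate(vector):
            merged[j] += c * value
    return coefficients, merged
```

For the vectors [1, 1, 0] and [0, 2, 2] with weights [0.75, 0.25], the first paragraph contributes three quarters of the element total of the merged vector even though its raw weight sum is the smaller of the two.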
With this technical solution, the extracted text vector better reflects the semantic features of the original text and can substantially improve the text recommendation precision perceived by users, giving high precision and efficiency; the relative weight of each paragraph can be adjusted at any time to suit the needs of different applications. The solution can easily be integrated with various text processing systems, since only the original text vector generation component needs to be replaced, so the cost is low.
The foregoing is only a specific embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (10)

1. A feature extraction system for multi-paragraph texts, characterized in that it comprises a first computing module, a main control module, a weight setting module, a text processing module, a segmenter, and a second computing module; the first computing module, weight setting module, text processing module, segmenter, and second computing module each exchange data with the main control module.
2. The feature extraction system of claim 1, characterized in that it further comprises a text vector store, which stores the paragraph text vectors transmitted by the main control module.
3. The feature extraction system of claim 1, characterized in that the first computing module solves the system of equations generated for the paragraph texts, and the second computing module performs the TF-IDF calculation, a weighting scheme commonly used in information retrieval and data mining.
4. The feature extraction system of claim 1, characterized in that the weight setting module sets the weights for the generated system of equations, and the text processing module segments the text into paragraphs.
5. An extraction method for the feature extraction system for multi-paragraph texts of any one of claims 1-4, characterized in that it comprises:
marking the paragraphs of any text T;
setting an expected relative weight vector for the text T;
using the weight setting module and the text processing module to perform feature extraction on the marked paragraphs and the expected relative weight vector, obtaining a text vector that reflects the different weights of the paragraphs.
6. The extraction method of claim 5, characterized in that marking the paragraphs of any text T comprises:
Any text T consists of n paragraphs; the i-th paragraph is denoted Pi, so that T=[P1, P2, ..., Pn].
7. The extraction method of claim 5, characterized in that setting an expected relative weight vector for the text T comprises:
For any text T there is an expected relative weight vector weights=[w1, w2, ..., wn], where wi denotes the relative weight of Pi; wi is expressed either as an absolute number or as a relative value.
8. The extraction method of claim 5, characterized in that performing feature extraction on the marked paragraphs and the expected relative weight vector to obtain a text vector reflecting the different weights of the paragraphs comprises the following steps:
1) each paragraph P in T is vectorized using the segmenter and the second computing module, and the resulting paragraph text vectors are stored in an n-dimensional vector array, where each array element is the text vector corresponding to paragraph Pi;
2) for each array element in the n-dimensional vector array, the text processing module generates the weight sum of the text vector corresponding to paragraph Pi and stores it in a weight-sum array;
3) based on the weight-sum array and the expected relative weight vector weights, a system of homogeneous linear equations for weight distribution is generated, and adjustment coefficients are introduced into the system;
4) the first computing module solves the system of equations; the solution is the adjustment coefficient array;
5) each paragraph text vector is adjusted: each adjustment coefficient is multiplied by the corresponding text vector, yielding the adjusted paragraph text vector;
6) the paragraph text vectors are merged: the paragraph text vectors in the n-dimensional vector array, already multiplied by their adjustment coefficients, are added together to obtain the final text vector reflecting the different weights of the paragraphs, which is stored in the text vector store.
9. The extraction method of claim 8, characterized in that, in step 2), the weight sum of a text vector is computed as follows: the element values of the text vector corresponding to paragraph Pi are added together, the accumulated result is returned, and the weight sum is stored at the corresponding position in the weight-sum array.
10. The extraction method of claim 8, characterized in that, in step 3), the system of homogeneous linear equations is represented as a matrix, and two arrays are finally returned; an adjustment coefficient is added for each paragraph: the adjustment coefficient ci satisfies the equation (paragraphWeight[i] * ci) / Σj (paragraphWeight[j] * cj) = weights[i],
where ci is the adjustment coefficient of paragraph Pi, the sum runs over all paragraphs j = 1, ..., n, paragraphWeight[] is the weight-sum array, and weights[i] is the i-th element of the expected relative weight vector; to obtain a specific solution, the constraint sum(ci) = 1 is added to the system;
preferably, in step 5), for each array element in the n-dimensional vector array, each element vectorArray[i][j] is multiplied by coefficients[i], and the product is stored back in its original position in the text vector,
where coefficients[i] is an element of the adjustment coefficient array and vectorArray[i][j] is an element of the n-dimensional vector array; i, j = 1, 2, 3, ..., n, with i indexing the rows of the array and j its columns.
CN201710287337.9A 2017-04-27 2017-04-27 Method of multi-paragraph text feature extraction system Active CN107122350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710287337.9A CN107122350B (en) 2017-04-27 2017-04-27 Method of multi-paragraph text feature extraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710287337.9A CN107122350B (en) 2017-04-27 2017-04-27 Method of multi-paragraph text feature extraction system

Publications (2)

Publication Number Publication Date
CN107122350A (en) 2017-09-01
CN107122350B (en) 2021-02-05

Family

ID=59725061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710287337.9A Active CN107122350B (en) 2017-04-27 2017-04-27 Method of multi-paragraph text feature extraction system

Country Status (1)

Country Link
CN (1) CN107122350B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101035128A (en) * 2007-04-18 2007-09-12 大连理工大学 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
WO2013038824A1 (en) * 2011-09-15 2013-03-21 株式会社富士通マーケティング Accounting data generating device, method, program, system, server device, and recording medium
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN104408083A (en) * 2014-10-27 2015-03-11 六盘水职业技术学院 Socialized media analyzing system
CN105760474A (en) * 2016-02-14 2016-07-13 Tcl集团股份有限公司 Document collection feature word extracting method and system based on position information
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Characteristic word weight calculating method for text mining

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952279A (en) * 2022-12-02 2023-04-11 杭州瑞成信息技术股份有限公司 Text outline extraction method and device, electronic device and storage medium
CN115952279B (en) * 2022-12-02 2023-09-12 杭州瑞成信息技术股份有限公司 Text outline extraction method and device, electronic device and storage medium
CN118568266A (en) * 2024-08-05 2024-08-30 湖州南浔交水规划设计研究有限公司 Municipal engineering design data processing method

Also Published As

Publication number Publication date
CN107122350B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
US9418147B2 (en) Method and apparatus of determining product category information
Glimm et al. Conservative front tracking and level set algorithms
CN101650709B (en) Report generation method and report system
CN107204184A (en) Audio recognition method and system
CN106897340A (en) A kind of data table updating method and device
CN102317943B (en) Method and device for full-text search
CN110059163B (en) Method and device for generating template, electronic equipment and computer readable medium
CN103440288A (en) Big data storage method and device
CN102567421B (en) Document retrieval method and device
CN108228745A (en) A kind of proposed algorithm and device based on collaborative filtering optimization
CN107679208A (en) A kind of searching method of picture, terminal device and storage medium
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN106156239A (en) A kind of form abstracting method and device
CN102289523A (en) Method for intelligently extracting text labels
CN104239373A (en) Document tag adding method and document tag adding device
CN106528877A (en) Modular method and system for word document
CN116644168A (en) Interactive data construction method, device, equipment and storage medium
CN107256144A (en) Front and back code automatic generation method, terminal and computer-readable recording medium
CN104572785A (en) Method and device for establishing index in distributed form
CN105354182B (en) The method and the method and device using its generation special topic for obtaining correlated digital resource
CN107122350A (en) A kind of feature extraction system and method for many paragraph texts
CN105786901B (en) A kind of method and device adjusting webpage font size
CN113761114A (en) Phrase generation method and device and computer-readable storage medium
CN109409848A (en) Node intelligent recommended method, terminal device and the storage medium of open process
CN107766036A (en) A kind of construction method of module, construction device and terminal device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant