CN107122350A - Feature extraction system and method for multi-paragraph text - Google Patents

Feature extraction system and method for multi-paragraph text

Info

Publication number
CN107122350A
Authority
CN
China
Prior art keywords
text
paragraph
vector
array
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710287337.9A
Other languages
Chinese (zh)
Other versions
CN107122350B (en)
Inventor
许延祥
王飞剑
刘宗福
周东红
黄世祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Easy Mike Technology Co Ltd
Original Assignee
Beijing Easy Mike Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Easy Mike Technology Co Ltd
Priority to CN201710287337.9A
Publication of CN107122350A
Application granted
Publication of CN107122350B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a feature extraction system and method for multi-paragraph text. The system comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, weight setting module, text processing module, word segmenter, and second computing module each exchange data with the main control module. The technical solution provided by the invention implements text feature extraction in a general and feasible way, and the extraction process reflects the different weights of the different paragraphs in a text.

Description

Feature extraction system and method for multi-paragraph text
Technical field
The present invention relates to text feature extraction technology, and in particular to a feature extraction system and method for multi-paragraph text.
Background technology
An original document passes through the steps of a text processing system, such as preprocessing, word segmentation, word-frequency statistics, TF-IDF calculation, and vector generation, and the result is saved to persistent storage for later use by text computation applications.
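The conventional pipeline described above can be illustrated with a minimal sketch. It is not the patent's implementation: the function name `tfidf_vectors` is hypothetical, whitespace splitting stands in for a real word segmenter, and the smoothed IDF formula is only one common variant among several.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return (vocabulary, vectors) for a list of already tokenized texts."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    n_docs = len(token_lists)
    # Document frequency: the number of texts that contain each term.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed IDF (an assumed variant); it is always positive.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty texts
        # TF is the term count divided by the text length.
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Whitespace splitting is a placeholder for the word segmenter.
docs = ["text feature extraction", "paragraph weight for text"]
vocab, vectors = tfidf_vectors([d.split() for d in docs])
```

Every position in such a vector is weighted the same way regardless of where in the document the term occurred, which is the limitation the background section goes on to discuss.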
Extracting text features and converting them into stored vectors makes it possible to compute on and compare texts, but the main semantics of the text must be preserved in the text vector. The key measure of text feature extraction quality is therefore whether the semantics of the text are well preserved. The prior art has one significant shortcoming in text feature extraction: it treats the content of the whole text equally. When people organize the content of a text, however, they treat it as one complete piece. Typically the title summarizes the theme of the whole piece and implies the field and category of the article; the first paragraph states the main content and core idea of the full text; the other paragraphs each elaborate on some aspect of the theme, and the first sentence of each paragraph usually expresses the topic of that paragraph (although this convention is often broken). The final paragraph generally serves as a summary, stating the conclusion or revisiting the central idea (news briefs or simple articles may not follow this). Consequently, the same sentence, word, or word frequency carries a different semantic weight (its relative importance in expressing the semantics of the text) in different paragraphs.
In general, for paragraphs: title weight > abstract weight (if present) > first paragraph weight > last paragraph weight > weight of other paragraphs. For sentences within a paragraph: first sentence weight > weight of other sentences. Current text feature extraction techniques do not take into account this way in which writing organizes semantics by paragraph.
Summary of the invention
In view of the deficiencies of the prior art, the object of the invention is to provide a feature extraction system and method for multi-paragraph text. The invention implements text feature extraction by a general and feasible method, and the extraction process reflects the different weights of the different paragraphs in a text.
The object of the invention is achieved by the following technical solution:
The invention provides a feature extraction system for multi-paragraph text, the improvement being that it comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, weight setting module, text processing module, word segmenter, and second computing module each exchange data with the main control module.
Further, the system also comprises a text vector library, which stores the paragraph text vectors transmitted by the main control module.
Further, the first computing module solves the system of equations for the paragraph text; the second computing module performs the TF-IDF calculation, a weighting technique commonly used in information retrieval and data mining.
Further, the weight setting module sets the weights for the generated system of equations, and the text processing module performs segmentation processing on the paragraph text.
The invention also provides an extraction method for the feature extraction system for multi-paragraph text, the improvement being that it comprises:
marking the paragraphs of any text T;
setting an expected relative weight vector for any text T;
performing feature extraction on the marked paragraphs and the expected relative weight vector, using the weight setting module and the text processing module respectively, to obtain a text vector that reflects the different weights of the paragraphs.
Further, marking the paragraphs of any text T comprises:
Any text T consists of n paragraphs; the i-th paragraph is marked Pi, so T = [P1, P2, ..., Pn].
Further, setting the expected relative weight vector for any text T comprises:
For any text T there is an expected relative weight vector weights = [w1, w2, ..., wn], where wi denotes the relative weight of Pi; wi is expressed as an absolute value or as a relative value.
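Because wi may be given either as an absolute value or as a relative value, it can help to bring both forms onto one scale before further processing. The following is a minimal sketch under that assumption; the helper name `to_relative_weights` is hypothetical and does not appear in the patent.

```python
def to_relative_weights(weights):
    """Scale non-negative paragraph weights so that they sum to 1."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = float(sum(weights))
    if total == 0.0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


# Absolute integer weights for a title, first, middle, and last paragraph.
relative = to_relative_weights([5, 3, 1, 1])
# Percentages are handled the same way.
from_percent = to_relative_weights([50, 30, 10, 10])
```

Both calls describe the same relative importance, so downstream steps only ever need to handle weights that sum to 1.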
Further, performing feature extraction on the marked paragraphs and the expected relative weight vector to obtain a text vector that reflects the different weights of the paragraphs comprises the following steps:
1) each paragraph P in T is vectorized using the word segmenter and the second computing module, and the resulting paragraph text vectors are stored in an n-dimensional vector array, in which each array element is the text vector corresponding to paragraph Pi;
2) for each array element in the n-dimensional vector array, the text processing module generates the weight sum of the text vector corresponding to paragraph Pi and stores it in the weight-sum array;
3) based on the weight-sum array and the expected relative weight vector weights, a system of homogeneous linear equations for weight distribution is generated, and adjustment factors are added to the system;
4) the first computing module solves the system of equations; the solution is the adjustment factor array;
5) each paragraph text vector is adjusted: the adjustment factor is multiplied by the corresponding text vector to obtain the adjusted paragraph text vector;
6) the paragraph text vectors are merged: the paragraph text vectors in the n-dimensional vector array, already multiplied by their adjustment factors, are summed to give the final text vector that reflects the different weights of the paragraphs, and this vector is stored in the text vector library.
Further, in step 2), the weight sum of a text vector is calculated as follows: the values of all elements of the text vector corresponding to paragraph Pi are added, the accumulated result is returned, and the weight sum of the text vector is stored at the corresponding position of the weight-sum array.
Further, in step 3), the system of homogeneous linear equations is represented as a matrix, and two arrays are finally returned; an adjustment factor is added for each paragraph, such that the adjustment factor ci satisfies the equation (paragraphWeight[i] * ci) / sum_j(paragraphWeight[j] * cj) = weights[i];
Where: ci is the adjustment factor of paragraph Pi; to obtain a specific solution, the constraint sum(ci) = 1 is added to the system of equations; paragraphWeight[] is the weight-sum array; weights[] is the expected relative weight vector;
Preferably, in step 5), for each array element in the n-dimensional vector array, the product vectorArray[i][j] * coefficients[i] is computed and stored back at the original position in the text vector.
Where: coefficients[i] is an element of the adjustment factor array, vectorArray[i][j] is the n-dimensional vector array, i, j = 1, 2, 3, ..., n, i denotes the row of the array, and j denotes the column of the array.
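The condition on the adjustment factors can be worked through as follows. This derivation is a sketch, not part of the original text. It assumes that the denominator of the condition is the sum over all paragraphs, that every weight sum is positive, and that the expected weights are relative values summing to 1. The shorthand s_i for paragraphWeight[i] and w_i for weights[i] is introduced here for brevity.

```latex
% Condition on the adjustment factor of each paragraph, i = 1, ..., n:
\frac{s_i c_i}{\sum_{j=1}^{n} s_j c_j} = w_i
\quad\Longleftrightarrow\quad
s_i c_i - w_i \sum_{j=1}^{n} s_j c_j = 0 .

% These n equations are homogeneous and linear in c_1, ..., c_n.
% Summing them over i gives (1 - \sum_i w_i) \sum_j s_j c_j = 0, which holds
% identically when \sum_i w_i = 1, so they are linearly dependent and
% determine c only up to a common scale factor \lambda:
c_i = \lambda \, \frac{w_i}{s_i} .

% The added constraint \sum_i c_i = 1 fixes the scale:
c_i = \frac{w_i / s_i}{\sum_{k=1}^{n} w_k / s_k} .
```

Substituting back, s_i c_i = \lambda w_i and \sum_j s_j c_j = \lambda, so each paragraph contributes exactly the share w_i of the merged vector's total weight.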
Compared with the closest prior art, the technical solution provided by the invention achieves the following beneficial effects:
The invention implements text feature extraction by a general and feasible method, and the extraction process reflects the different weights of the different paragraphs in a text. Specifically:
1. High precision and efficiency: the extracted text vector better reflects the semantic features of the original text and can significantly improve the text recommendation precision perceived by users, and the relative weight of each paragraph can be adjusted at any time according to the needs of different applications.
2. Low cost: the invention connects easily to various text processing systems; only the original text vector generation part needs to be replaced.
Brief description of the drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 shows the main structure and principle of text feature extraction in the prior art;
Fig. 2 is a structure diagram of the feature extraction system for multi-paragraph text.
Detailed description of the embodiments
To make the object, technical solution, and advantages of the invention clearer, the technical solution of the invention is described in detail below. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention, without creative effort, fall within the scope of protection of the invention.
Embodiment 1
The invention provides a feature extraction system for multi-paragraph text, whose structure is shown in Fig. 2. It comprises a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, weight setting module, text processing module, word segmenter, and second computing module each exchange data with the main control module.
In the above embodiment, the system also comprises a text vector library, which stores the paragraph text vectors transmitted by the main control module.
In the above embodiment, the first computing module solves the system of equations for the paragraph text; the second computing module performs the TF-IDF calculation, a weighting technique commonly used in information retrieval and data mining.
In the above embodiment, the weight setting module sets the weights for the generated system of equations, and the text processing module performs segmentation processing on the paragraph text.
Embodiment 2
The invention also provides an extraction method for the feature extraction system for multi-paragraph text, comprising:
S1: For any text T, assume that it consists of n paragraphs and that the i-th paragraph is marked Pi, so T = [P1, P2, ..., Pn].
S2: For any text T, assume that there is an expected relative weight vector weights = [w1, w2, ..., wn], where wi denotes the relative weight of Pi. wi may be expressed as an absolute value (for example, an integer) or as a relative value (for example, a percentage).
S3: Feature extraction is performed on the marked paragraphs and the expected relative weight vector, using the weight setting module and the text processing module respectively, to obtain a text vector that reflects the different weights of the paragraphs. This comprises the following sub-steps:
1) Each paragraph P in T is vectorized using the word segmenter and the second computing module, and the resulting paragraph text vectors are stored in the n-dimensional vector array vectorArray[], in which the array element vectorArray[i] is the text vector corresponding to paragraph Pi;
2) For each element vectorArray[i] in vectorArray, the weight sum of the vector is generated. It is calculated by adding the values of all elements of the vector and returning the accumulated result. The vector sum is stored at the corresponding position of the array paragraphWeight[].
3) Based on paragraphWeight and weights, the system of homogeneous linear equations for weight distribution is generated. The system is represented directly as a matrix, and two arrays are finally returned. For the text vector of each paragraph to reach its required relative weight in the final vector, an adjustment factor must be added for each paragraph; the system of equations is expressed in terms of these adjustment factors. Assuming that the adjustment factor of paragraph Pi is ci, ci must satisfy the equation (paragraphWeight[i] * ci) / sum_j(paragraphWeight[j] * cj) = weights[i]. To obtain a specific solution, the constraint sum(ci) = 1 is added to the system of equations.
4) The first computing module solves the system of equations; the solution is the adjustment factor array coefficients[n].
5) Each paragraph text vector is adjusted: the adjustment factor is multiplied by the corresponding text vector to obtain the adjusted text vector. That is, for each vector vectorArray[i] in vectorArray, each element is replaced by the product vectorArray[i][j] * coefficients[i], which is stored back at the original position in the text vector.
6) The text vectors are merged: the text vectors in vectorArray, already multiplied by their adjustment factors, are summed to give the final text vector that reflects the different weights of the paragraphs, and this vector is stored in the text vector library.
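Sub-steps 2) to 6) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: all function names are hypothetical, the two returned arrays are taken to be the coefficient matrix and the right-hand side, the denominator of the condition is read as the sum over all paragraphs, every paragraph weight sum is assumed positive, and Gaussian elimination stands in for whatever solver the first computing module uses.

```python
def weight_sums(vector_array):
    """Sub-step 2: sum the elements of each paragraph text vector."""
    return [sum(vec) for vec in vector_array]


def build_system(paragraph_weight, weights):
    """Sub-step 3: return (matrix, rhs) for the adjustment factors."""
    n = len(weights)
    total = float(sum(weights))
    rel = [w / total for w in weights]  # relative weights summing to 1
    # Row i encodes paragraphWeight[i]*c_i - rel[i]*sum_j(paragraphWeight[j]*c_j) = 0.
    matrix = [[((1.0 if i == j else 0.0) - rel[i]) * paragraph_weight[j]
               for j in range(n)] for i in range(n - 1)]
    rhs = [0.0] * (n - 1)
    # The n homogeneous rows are linearly dependent, so the last one is
    # replaced by the constraint sum(c_i) = 1.
    matrix.append([1.0] * n)
    rhs.append(1.0)
    return matrix, rhs


def solve(matrix, rhs):
    """Sub-step 4: Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= f * a[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (a[i][n] - tail) / a[i][i]
    return x


def merge(vector_array, weights):
    """Sub-steps 2 to 6: return (coefficients, merged text vector)."""
    paragraph_weight = weight_sums(vector_array)
    matrix, rhs = build_system(paragraph_weight, weights)
    coefficients = solve(matrix, rhs)
    # Sub-step 5: scale each paragraph vector by its adjustment factor.
    scaled = [[v * c for v in vec] for vec, c in zip(vector_array, coefficients)]
    # Sub-step 6: add the scaled vectors element by element.
    merged = [sum(column) for column in zip(*scaled)]
    return coefficients, merged


# Two paragraphs over a three-term vocabulary; the first should carry 60%.
vector_array = [[1.0, 1.0, 0.0], [0.0, 2.0, 2.0]]
coefficients, merged = merge(vector_array, [60, 40])
```

In this example the first paragraph's scaled vector sums to 1.5 out of a merged total of 2.5, which is the requested 60% share. For positive weight sums the solution is proportional to weights[i] / paragraphWeight[i], so a direct formula would give the same factors; the explicit matrix form is kept here only to mirror the description.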
The technical solution of the invention allows the extracted text vector to better reflect the semantic features of the original text and can significantly improve the text recommendation precision perceived by users, giving high precision and efficiency; the relative weight of each paragraph can be adjusted at any time according to the needs of different applications. It connects easily to various text processing systems, since only the original text vector generation part needs to be replaced, so the cost is low.
The foregoing is only a specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims (10)

1. A feature extraction system for multi-paragraph text, characterized by comprising a first computing module, a main control module, a weight setting module, a text processing module, a word segmenter, and a second computing module; the first computing module, weight setting module, text processing module, word segmenter, and second computing module each exchange data with the main control module.
2. The feature extraction system of claim 1, characterized by further comprising a text vector library, the text vector library storing the paragraph text vectors transmitted by the main control module.
3. The feature extraction system of claim 1, characterized in that the first computing module solves the system of equations for the paragraph text; the second computing module performs the TF-IDF calculation, a weighting technique commonly used in information retrieval and data mining.
4. The feature extraction system of claim 1, characterized in that the weight setting module sets the weights for the generated system of equations, and the text processing module performs segmentation processing on the paragraph text.
5. An extraction method for the feature extraction system for multi-paragraph text of any one of claims 1-4, characterized by comprising:
marking the paragraphs of any text T;
setting an expected relative weight vector for any text T;
performing feature extraction on the marked paragraphs and the expected relative weight vector, using the weight setting module and the text processing module respectively, to obtain a text vector that reflects the different weights of the paragraphs.
6. The extraction method of claim 5, characterized in that marking the paragraphs of any text T comprises:
Any text T consists of n paragraphs; the i-th paragraph is marked Pi, so T = [P1, P2, ..., Pn].
7. The extraction method of claim 5, characterized in that setting the expected relative weight vector for any text T comprises:
For any text T there is an expected relative weight vector weights = [w1, w2, ..., wn], where wi denotes the relative weight of Pi; wi is expressed as an absolute value or as a relative value.
8. The extraction method of claim 5, characterized in that performing feature extraction on the marked paragraphs and the expected relative weight vector to obtain a text vector that reflects the different weights of the paragraphs comprises the following steps:
1) each paragraph P in T is vectorized using the word segmenter and the second computing module, and the resulting paragraph text vectors are stored in an n-dimensional vector array, in which each array element is the text vector corresponding to paragraph Pi;
2) for each array element in the n-dimensional vector array, the text processing module generates the weight sum of the text vector corresponding to paragraph Pi and stores it in the weight-sum array;
3) based on the weight-sum array and the expected relative weight vector weights, a system of homogeneous linear equations for weight distribution is generated, and adjustment factors are added to the system;
4) the first computing module solves the system of equations; the solution is the adjustment factor array;
5) each paragraph text vector is adjusted: the adjustment factor is multiplied by the corresponding text vector to obtain the adjusted paragraph text vector;
6) the paragraph text vectors are merged: the paragraph text vectors in the n-dimensional vector array, already multiplied by their adjustment factors, are summed to give the final text vector that reflects the different weights of the paragraphs, and this vector is stored in the text vector library.
9. The extraction method of claim 8, characterized in that in step 2) the weight sum of a text vector is calculated as follows: the values of all elements of the text vector corresponding to paragraph Pi are added, the accumulated result is returned, and the weight sum of the text vector is stored at the corresponding position of the weight-sum array.
10. The extraction method of claim 8, characterized in that in step 3) the system of homogeneous linear equations is represented as a matrix, and two arrays are finally returned; an adjustment factor is added for each paragraph, such that the adjustment factor ci satisfies the equation (paragraphWeight[i] * ci) / sum_j(paragraphWeight[j] * cj) = weights[i];
Where: ci is the adjustment factor of paragraph Pi; to obtain a specific solution, the constraint sum(ci) = 1 is added to the system of equations; paragraphWeight[] is the weight-sum array; weights[] is the expected relative weight vector;
Preferably, in step 5), for each array element in the n-dimensional vector array, the product vectorArray[i][j] * coefficients[i] is computed and stored back at the original position in the text vector.
Where: coefficients[i] is an element of the adjustment factor array, vectorArray[i][j] is the n-dimensional vector array, i, j = 1, 2, 3, ..., n, i denotes the row of the array, and j denotes the column of the array.
CN201710287337.9A 2017-04-27 2017-04-27 Method of multi-paragraph text feature extraction system Active CN107122350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710287337.9A CN107122350B (en) 2017-04-27 2017-04-27 Method of multi-paragraph text feature extraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710287337.9A CN107122350B (en) 2017-04-27 2017-04-27 Method of multi-paragraph text feature extraction system

Publications (2)

Publication Number Publication Date
CN107122350A true CN107122350A (en) 2017-09-01
CN107122350B CN107122350B (en) 2021-02-05

Family

ID=59725061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710287337.9A Active CN107122350B (en) 2017-04-27 2017-04-27 Method of multi-paragraph text feature extraction system

Country Status (1)

Country Link
CN (1) CN107122350B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101035128A (en) * 2007-04-18 2007-09-12 大连理工大学 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
WO2013038824A1 (en) * 2011-09-15 2013-03-21 株式会社富士通マーケティング Accounting data generating device, method, program, system, server device, and recording medium
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN104408083A (en) * 2014-10-27 2015-03-11 六盘水职业技术学院 Socialized media analyzing system
CN105760474A (en) * 2016-02-14 2016-07-13 Tcl集团股份有限公司 Document collection feature word extracting method and system based on position information
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method
CN106372064A (en) * 2016-11-18 2017-02-01 北京工业大学 Characteristic word weight calculating method for text mining

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952279A (en) * 2022-12-02 2023-04-11 杭州瑞成信息技术股份有限公司 Text outline extraction method and device, electronic device and storage medium
CN115952279B (en) * 2022-12-02 2023-09-12 杭州瑞成信息技术股份有限公司 Text outline extraction method and device, electronic device and storage medium

Also Published As

Publication number Publication date
CN107122350B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN101290632B (en) Input method for user words participating in intelligent word-making and input method system
US9047369B2 (en) Method and apparatus of determining product category information
CN107038207A (en) A kind of data query method, data processing method and device
CN107204184A (en) Audio recognition method and system
CN106897340A (en) A kind of data table updating method and device
CN102317943B (en) Method and device for full-text search
CN103440288A (en) Big data storage method and device
CN103049433A (en) Automatic question answering method, automatic question answering system and method for constructing question answering case base
CN102567421B (en) Document retrieval method and device
CN109635077A (en) Calculation method, device, electronic equipment and the storage medium of text similarity
CN107943952A (en) A kind of implementation method that full-text search is carried out based on Spark frames
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN103186612A (en) Lexical classification method and system and realization method
CN108171528A (en) A kind of attribution method and attribution system
CN110020312A (en) The method and apparatus for extracting Web page text
CN105159927B (en) Method and device for selecting subject term of target text and terminal
CN104572785A (en) Method and device for establishing index in distributed form
CN109213480A (en) A kind of method, storage medium, equipment and system for developing the back-stage management page
CN105786901B (en) A kind of method and device adjusting webpage font size
CN109409848A (en) Node intelligent recommended method, terminal device and the storage medium of open process
CN104077274B (en) Method and device for extracting hot word phrases from document set
Romein The tensor-core correlator
CN107122350A (en) A kind of feature extraction system and method for many paragraph texts
Zhang et al. Efficient generation and processing of word co-occurrence networks using corpus2graph
CN110119410A (en) Processing method and processing device, computer equipment and the storage medium of reference book data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant