CN104281566A - Semantic text description method and semantic text description system - Google Patents


Info

Publication number
CN104281566A
Authority
CN
China
Prior art keywords
semantic
paragraph
dimension
text
semantization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410537829.5A
Other languages
Chinese (zh)
Inventor
贾岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201410537829.5A
Publication of CN104281566A


Abstract

The invention discloses a semantic text description method and a semantic text description system. The method comprises the steps of: performing paragraph-level semantic analysis on each semantic paragraph of a whole text document; summarizing semantic information for each paragraph-level semantic result; performing semantic dimensional description of each paragraph; performing document feature description; and performing mutual correction of the document feature description dimensions. The advantage of the method and system is that the opening, development, turn, and conclusion among natural language paragraphs are embodied in the semantic space and in the dimensional features, so that the semantic vectors and dimensions of adjacent paragraphs are strongly correlated; by strengthening the calculation of relevant dimensions and suppressing the calculation of irrelevant dimensions, core semantic features are enhanced and noise is suppressed.

Description

A semantic text description method and system
Technical field
The present invention relates to the technical field of grid computing, and in particular to a semantic text description method and system.
Background
The meaning of "text" is roughly the same as that of "message": it refers to an information structure composed of certain symbols or codes, and this structure can take different forms of presentation, such as speech, writing, or images. A text is produced by a particular person, and its semantics inevitably reflect that person's position, viewpoint, values, and interests. Therefore, by analyzing the content of a text, the intention and purpose of its author can be inferred.
Text analysis refers to the representation of a text and the selection of its feature items. It is a basic problem in text mining and information retrieval: it quantifies the feature words extracted from a text in order to represent the text's information. This converts an unstructured original text into structured information that a computer can recognize and process; that is, the text is scientifically abstracted and a mathematical model is built to describe and stand in for it, so that the computer can recognize the text by computing and operating on this model. Because text is unstructured data, mining useful information from a large body of text first requires converting the text into a processable structured form. At present, the vector space model is commonly used to describe text as vectors. However, if the feature items obtained directly from word segmentation and word frequency statistics are used as the dimensions of the text vector, the dimensionality of the vector becomes very large. Such an unprocessed text vector not only imposes a huge computational cost on subsequent work and makes the whole process inefficient, but also impairs the accuracy of classification and clustering algorithms, so that the results are unsatisfactory.
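The dimensionality problem described above can be illustrated with a minimal sketch. It assumes whitespace tokenization of English sentences (a real Chinese pipeline would need a word segmenter), and the function names `build_vocabulary` and `to_term_vector` are hypothetical, not taken from the patent. The point is only that every distinct token adds one dimension to every document vector.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect every distinct token across the corpus, in sorted order."""
    return sorted({token for doc in documents for token in doc.split()})


def to_term_vector(document, vocabulary):
    """Map one document to a raw term-frequency vector over the vocabulary."""
    counts = Counter(document.split())
    return [counts[term] for term in vocabulary]


# Three short documents (illustrative data only).
docs = [
    "the market rose sharply today",
    "the team won the final match",
    "rain is expected in the north today",
]
vocab = build_vocabulary(docs)
vectors = [to_term_vector(d, vocab) for d in docs]

# The vector length equals the vocabulary size, so it grows with the corpus.
print(len(vocab), [len(v) for v in vectors])
```

Even for these three sentences each vector already has one entry per distinct token, and most entries are zero, which is the sparsity and cost issue the background section refers to.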
Summary of the invention
To solve the technical problems described in the background, the present invention proposes a semantic text description method and system that suppresses the calculation of irrelevant dimensions, strengthens core semantic features, and thereby suppresses noise.
The semantic text description method proposed by the present invention comprises the following steps:
performing paragraph-level semantic analysis on each semantic paragraph in the whole text document;
summarizing semantic information for each paragraph-level semantic result;
performing semantic dimensional description of each paragraph;
performing document feature description;
performing mutual correction of the document feature description dimensions.
Preferably, performing paragraph-level semantic analysis on each semantic paragraph in the whole text document specifically comprises performing semantic analysis on every sentence in the text document and marking verb semantic points, noun semantic points, and semantic tendency.
Preferably, summarizing semantic information for each paragraph-level semantic result specifically comprises: aggregating the semantic emphases of each paragraph and of the whole document, and finally using these semantic emphases, in combination with discourse features and with word count as the constraint, to select several sentence groups that cover as much of the full-text semantics as possible to form a full-text summary.
Preferably, the semantic dimensional description of each paragraph includes time series and regional distribution dimensions.
Preferably, performing document feature description specifically comprises calculating document feature parameters and using them to describe the document.
Preferably, the mutual correction of the document feature description dimensions specifically comprises correcting the dimension analysis vectors and dimension vectors of adjacent semantic paragraphs: according to word-level semantic association and dimension association, the sum of squares is multiplied, the square root is taken, and the result is combined with the sentence dimension sum to give the corrected value.
The semantic text description system proposed by the present invention comprises:
an analysis module, for performing paragraph-level semantic analysis on each semantic paragraph in the whole text document;
a summarizing module, connected to the analysis module, for summarizing semantic information for each paragraph-level semantic result;
a dimension description module, connected to the summarizing module, for performing semantic dimensional description of each paragraph;
a feature description module, connected to the dimension description module, for performing document feature description;
a mutual correction module, connected to the feature description module, for performing mutual correction of the document feature description dimensions.
Preferably, the analysis module is specifically configured to perform paragraph-level semantic analysis on each semantic paragraph in the whole text document, that is, to perform semantic analysis on every sentence in the text document and mark verb semantic points, noun semantic points, and semantic tendency.
Preferably, the summarizing module is specifically configured to aggregate the semantic emphases of each paragraph and of the whole document and, using these semantic emphases in combination with discourse features and with word count as the constraint, to select several sentence groups that cover as much of the full-text semantics as possible to form a full-text summary.
Preferably, the mutual correction module is specifically configured to correct the dimension analysis vectors and dimension vectors of adjacent semantic paragraphs: according to word-level semantic association and dimension association, the sum of squares is multiplied, the square root is taken, and the result is combined with the sentence dimension sum to give the corrected value.
In the present invention, the opening, development, turn, and conclusion among natural language paragraphs are embodied in the semantic space and in the dimensional features, so that the semantic vectors and dimensions of adjacent paragraphs are strongly correlated. By strengthening the calculation of relevant dimensions and suppressing the calculation of irrelevant dimensions, core semantic features are strengthened and noise is suppressed.
Brief description of the drawings
Fig. 1 is a flowchart of a semantic text description method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a semantic text description system according to an embodiment of the present invention.
Detailed description of the embodiments
As shown in Fig. 1, an embodiment of the present invention proposes a semantic text description method, comprising the following steps:
Step 101: perform paragraph-level semantic analysis on each semantic paragraph in the whole text document. Specifically, semantic analysis is performed on every sentence in the text document, and verb semantic points, noun semantic points, semantic tendency, and the like are marked.
Step 102: summarize semantic information for each paragraph-level semantic result. Specifically, the semantic emphases of each paragraph and of the whole document are aggregated; finally, using these semantic emphases in combination with discourse features and with word count as the constraint, several "sentence groups" that cover as much of the full-text semantics as possible are selected to form a full-text summary.
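The selection in Step 102 can be sketched as a budgeted selection problem. The patent does not say how sentence groups are chosen, so the greedy weight-per-word rule below is an assumption, as are the function name `select_summary` and the numeric emphasis weights supplied by the caller.

```python
def select_summary(sentence_groups, weights, max_words):
    """Greedily pick sentence groups by emphasis weight per word.

    sentence_groups: list of strings (whitespace-tokenized for simplicity).
    weights: hypothetical semantic-emphasis score for each group.
    max_words: the word-count constraint on the summary.
    """
    def length(i):
        return len(sentence_groups[i].split())

    # Rank groups by how much emphasis they carry per word spent.
    ranked = sorted(
        range(len(sentence_groups)),
        key=lambda i: weights[i] / max(length(i), 1),
        reverse=True,
    )
    chosen, used = [], 0
    for i in ranked:
        if used + length(i) <= max_words:
            chosen.append(i)
            used += length(i)
    # Restore document order so the summary reads naturally.
    return [sentence_groups[i] for i in sorted(chosen)]


groups = ["a b c d", "e f", "g h i"]
summary = select_summary(groups, weights=[4.0, 3.0, 1.5], max_words=6)
print(summary)
```

A greedy rule is not guaranteed to be optimal for this kind of knapsack-style constraint; it is used here only because it is short and easy to follow.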
Step 103: perform semantic dimensional description of each paragraph. This includes description along multiple dimensions such as time series and regional distribution.
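As a rough illustration of Step 103, the sketch below tags a paragraph with a time dimension and a region dimension. The patent does not specify how these dimensions are detected, so the year regex, the tiny `KNOWN_REGIONS` gazetteer, and the function name `describe_dimensions` are all assumptions made for the example.

```python
import re

# Hypothetical gazetteer; a real system would use a much larger resource.
KNOWN_REGIONS = {"Anhui", "Beijing", "Shanghai"}


def describe_dimensions(paragraph):
    """Return the time and region mentions found in one paragraph."""
    # Four-digit years starting with 19 or 20 stand in for the time dimension.
    years = sorted(set(re.findall(r"\b(?:19|20)\d{2}\b", paragraph)))
    # Exact whole-word matches against the gazetteer give the region dimension.
    regions = sorted(
        name for name in KNOWN_REGIONS
        if re.search(rf"\b{re.escape(name)}\b", paragraph)
    )
    return {"time": years, "region": regions}


description = describe_dimensions(
    "In 2013 sales in Anhui grew; by 2014 Shanghai followed."
)
print(description)
```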
Step 104: perform document feature description. Document feature parameters are calculated and used to describe the document; the document can be retrieved and invoked by means of these feature parameters. At the document level, the main feature parameters are the document semantic vector, the dimension vector, and the document semantic flow graph.
A document feature should truly represent the content of the text, be able to distinguish the target text from other texts, be moderate in number, and be relatively easy to extract. In Chinese documents, characters, words, or phrases can serve as the features representing a text. Words have stronger expressive power than characters, and segmenting words is much less difficult than segmenting phrases. Therefore, words are adopted as document features and as the intermediate representation of a document, and are used to compute similarity between documents and between a document and a user target. Usually, a score is calculated for each feature according to some feature evaluation function, the features are sorted by score, and the several highest-scoring ones are chosen as feature words. This is feature extraction, which can be carried out in several ways: transforming the original features into a smaller set of new features by mapping or transformation; picking out the most representative features from the original features; selecting the most influential features according to expert knowledge; or selecting features by mathematical methods to find those carrying the most classification information. The last is a more precise method with less interference from human factors, and it is particularly suitable for automatic text classification and mining systems.
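The "score, sort, keep the top few" procedure described above can be sketched as follows. The patent leaves the feature evaluation function unspecified; TF-IDF is swapped in here purely as a familiar example, and the function name `top_k_features` and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def top_k_features(documents, target_index, k):
    """Score each word of one document with TF-IDF and return the k best."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    term_freq = Counter(tokenized[target_index])
    scores = {
        term: term_freq[term] * math.log(n_docs / doc_freq[term])
        for term in term_freq
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


corpus = [
    "price price market the",
    "the match team",
    "the rain north",
]
features = top_k_features(corpus, target_index=0, k=2)
print(features)
```

A word such as "the" that occurs in every document scores zero and drops out, which matches the requirement that a feature help distinguish the target text from other texts.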
Step 105: perform mutual correction of the document feature description dimensions. The dimension analysis vectors and dimension vectors of adjacent semantic paragraphs are corrected: according to word-level semantic association and dimension association, the sum of squares is multiplied, the square root is taken, and the result is combined with the sentence dimension sum to give the corrected value. Dimension features that are associated are thereby enhanced, while dimensions with no defined dimension association are appropriately weakened.
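The wording of Step 105 is ambiguous in translation, so the sketch below shows only one possible reading and should not be taken as the patent's actual formula. It assumes that, for each dimension flagged as associated, the squared values of the current and adjacent paragraph are multiplied, the square root is taken, and the result is added to the paragraph's own value (standing in for the sentence dimension sum). The function name `mutually_correct` and the boolean association flags are hypothetical.

```python
import math


def mutually_correct(current, neighbor, associated):
    """Correct one paragraph's dimension vector using its adjacent paragraph.

    current, neighbor: per-dimension values for two adjacent paragraphs.
    associated: per-dimension flags, True where an association is defined.
    """
    corrected = []
    for cur, nbr, linked in zip(current, neighbor, associated):
        # Assumed reading: sqrt(cur^2 * nbr^2) reinforces shared dimensions.
        reinforcement = math.sqrt((cur ** 2) * (nbr ** 2)) if linked else 0.0
        corrected.append(cur + reinforcement)
    return corrected


result = mutually_correct([3.0, 1.0], [4.0, 2.0], [True, False])
print(result)
```

Under this reading an associated dimension gains a boost proportional to its strength in both paragraphs, while an unassociated dimension keeps its raw value, so its relative share of the vector shrinks.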
As shown in Fig. 2, an embodiment of the present invention proposes a semantic text description system, comprising: an analysis module 10, for performing paragraph-level semantic analysis on each semantic paragraph in the whole text document; a summarizing module 20, connected to the analysis module 10, for summarizing semantic information for each paragraph-level semantic result; a dimension description module 30, connected to the summarizing module 20, for performing semantic dimensional description of each paragraph; a feature description module 40, connected to the dimension description module 30, for performing document feature description; and a mutual correction module 50, connected to the feature description module 40, for performing mutual correction of the document feature description dimensions.
The analysis module 10 is specifically configured to perform paragraph-level semantic analysis on each semantic paragraph in the whole text document, that is, to perform semantic analysis on every sentence in the text document and mark verb semantic points, noun semantic points, and semantic tendency.
The summarizing module 20 is specifically configured to aggregate the semantic emphases of each paragraph and of the whole document and, using these semantic emphases in combination with discourse features and with word count as the constraint, to select several sentence groups that cover as much of the full-text semantics as possible to form a full-text summary.
The mutual correction module 50 is specifically configured to correct the dimension analysis vectors and dimension vectors of adjacent semantic paragraphs: according to word-level semantic association and dimension association, the sum of squares is multiplied, the square root is taken, and the result is combined with the sentence dimension sum to give the corrected value.
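The chain of modules 10 through 50 can be pictured as a linear pipeline in which each module consumes the state left by the previous one. The sketch below shows only that wiring; every stage is a placeholder stub, and the names `DocumentState`, `make_stub`, `PIPELINE`, and `run_pipeline` are hypothetical rather than part of the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DocumentState:
    """Shared state passed from one module to the next."""
    paragraphs: List[str]
    results: Dict[str, str] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


Stage = Callable[[DocumentState], None]


def make_stub(key: str) -> Stage:
    """Return a placeholder stage that only records that it ran."""
    def stage(state: DocumentState) -> None:
        state.results[key] = f"{key} for {len(state.paragraphs)} paragraphs"
    return stage


# Module order follows the connections described for Fig. 2.
PIPELINE: List[Tuple[str, Stage]] = [
    ("analysis_module_10", make_stub("paragraph_semantics")),
    ("summarizing_module_20", make_stub("summary")),
    ("dimension_description_module_30", make_stub("dimensions")),
    ("feature_description_module_40", make_stub("document_features")),
    ("mutual_correction_module_50", make_stub("corrected_dimensions")),
]


def run_pipeline(paragraphs: List[str]) -> DocumentState:
    """Run every stage in order and record the execution trace."""
    state = DocumentState(paragraphs=list(paragraphs))
    for name, stage in PIPELINE:
        stage(state)
        state.trace.append(name)
    return state


final_state = run_pipeline(["first paragraph", "second paragraph"])
print(final_state.trace)
```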
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (10)

1. A semantic text description method, characterized in that it comprises the following steps:
performing paragraph-level semantic analysis on each semantic paragraph in the whole text document;
summarizing semantic information for each paragraph-level semantic result;
performing semantic dimensional description of each paragraph;
performing document feature description;
performing mutual correction of the document feature description dimensions.
2. The semantic text description method according to claim 1, characterized in that performing paragraph-level semantic analysis on each semantic paragraph in the whole text document specifically comprises performing semantic analysis on every sentence in the text document and marking verb semantic points, noun semantic points, and semantic tendency.
3. The semantic text description method according to claim 1, characterized in that summarizing semantic information for each paragraph-level semantic result specifically comprises: aggregating the semantic emphases of each paragraph and of the whole document, and finally using these semantic emphases, in combination with discourse features and with word count as the constraint, to select several sentence groups that cover as much of the full-text semantics as possible to form a full-text summary.
4. The semantic text description method according to claim 1, characterized in that the semantic dimensional description of each paragraph includes time series and regional distribution dimensions.
5. The semantic text description method according to claim 1, characterized in that performing document feature description specifically comprises calculating document feature parameters and using them to describe the document.
6. The semantic text description method according to claim 1, characterized in that the mutual correction of the document feature description dimensions specifically comprises correcting the dimension analysis vectors and dimension vectors of adjacent semantic paragraphs: according to word-level semantic association and dimension association, the sum of squares is multiplied, the square root is taken, and the result is combined with the sentence dimension sum to give the corrected value.
7. A semantic text description system, characterized in that it comprises:
an analysis module, for performing paragraph-level semantic analysis on each semantic paragraph in the whole text document;
a summarizing module, connected to the analysis module, for summarizing semantic information for each paragraph-level semantic result;
a dimension description module, connected to the summarizing module, for performing semantic dimensional description of each paragraph;
a feature description module, connected to the dimension description module, for performing document feature description;
a mutual correction module, connected to the feature description module, for performing mutual correction of the document feature description dimensions.
8. The semantic text description system according to claim 7, characterized in that the analysis module is specifically configured to perform paragraph-level semantic analysis on each semantic paragraph in the whole text document, that is, to perform semantic analysis on every sentence in the text document and mark verb semantic points, noun semantic points, and semantic tendency.
9. The semantic text description system according to claim 7, characterized in that the summarizing module is specifically configured to aggregate the semantic emphases of each paragraph and of the whole document and, using these semantic emphases in combination with discourse features and with word count as the constraint, to select several sentence groups that cover as much of the full-text semantics as possible to form a full-text summary.
10. The semantic text description system according to claim 7, characterized in that the mutual correction module is specifically configured to correct the dimension analysis vectors and dimension vectors of adjacent semantic paragraphs: according to word-level semantic association and dimension association, the sum of squares is multiplied, the square root is taken, and the result is combined with the sentence dimension sum to give the corrected value.
CN201410537829.5A 2014-10-13 2014-10-13 Semantic text description method and semantic text description system Pending CN104281566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410537829.5A CN104281566A (en) 2014-10-13 2014-10-13 Semantic text description method and semantic text description system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410537829.5A CN104281566A (en) 2014-10-13 2014-10-13 Semantic text description method and semantic text description system

Publications (1)

Publication Number Publication Date
CN104281566A true CN104281566A (en) 2015-01-14

Family

ID=52256451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410537829.5A Pending CN104281566A (en) 2014-10-13 2014-10-13 Semantic text description method and semantic text description system

Country Status (1)

Country Link
CN (1) CN104281566A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004111997A1 (en) * 2003-06-19 2004-12-23 International Business Machines Corporation System and method for configuring voice readers using semantic analysis
CN1788305A (en) * 2003-06-19 2006-06-14 国际商业机器公司 System and method for configuring voice readers using semantic analysis
CN101645083A (en) * 2009-01-16 2010-02-10 中国科学院声学研究所 Acquisition system and method of text field based on concept symbols
WO2013110288A1 (en) * 2012-01-23 2013-08-01 Microsoft Corporation Fixed format document conversion engine
CN103389998A (en) * 2012-05-11 2013-11-13 安徽华贞信息科技有限公司 Novel Internet commercial intelligence information semantic analysis technology based on cloud service
CN103678273A (en) * 2012-09-14 2014-03-26 安徽华贞信息科技有限公司 Internet paragraph level topic recognition system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017137859A1 (en) * 2016-02-09 2017-08-17 International Business Machines Corporation Systems and methods for language feature generation over multi-layered word representation
GB2562983A (en) * 2016-02-09 2018-11-28 Ibm Systems and methods for language feature generation over multi-layered word representation

Similar Documents

Publication Publication Date Title
EP3508992A1 (en) Error correction method and device for search term
CN101408898B (en) Method and device for extracting web page text
CN109684476B (en) Text classification method, text classification device and terminal equipment
CN103970733B (en) A kind of Chinese new word identification method based on graph structure
US9286526B1 (en) Cohort-based learning from user edits
CN103324745A (en) Text garbage identifying method and system based on Bayesian model
CN107748745B (en) Enterprise name keyword extraction method
CN102693279A (en) Method, device and system for fast calculating comment similarity
KR20150037924A (en) Information classification based on product recognition
CN105117740A (en) Font identification method and device
CN103324641B (en) Information record recommendation method and device
CN112818693A (en) Automatic extraction method and system for electronic component model words
CN108241682B (en) Method and device for determining text emotion
CN109213974A (en) A kind of electronic document conversion method and device
CN111339778B (en) Text processing method, device, storage medium and processor
CN107665222B (en) Keyword expansion method and device
CN103729354B (en) web information processing method and device
CN110489514B (en) System and method for improving event extraction labeling efficiency, event extraction method and system
CN104281566A (en) Semantic text description method and semantic text description system
US20210081607A1 (en) Argument structure extension device, argument structure extension method, program, and data structure
CN107590163B (en) The methods, devices and systems of text feature selection
CN115630304A (en) Event segmentation and extraction method and system in text extraction task
CN106649367B (en) Method and device for detecting keyword popularization degree
CN104281692A (en) Method and system for realizing paragraph dimensionalized description
CN111708862B (en) Text matching method and device and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150114