CN108363696A - Method and device for processing text information - Google Patents

Method and device for processing text information

Info

Publication number
CN108363696A
Authority
CN
China
Prior art keywords
information
text
ultimate constituent
importance
information unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810157712.2A
Other languages
Chinese (zh)
Inventor
李小明
李大明
杜鸣笛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201810157712.2A
Publication of CN108363696A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for processing text information. The method comprises: segmenting a text to generate ultimate constituents; determining the corresponding information unit according to the functional role each ultimate constituent plays in the text, and marking the importance level of the ultimate constituents of the information unit; establishing linked databases and indexes centered on the ultimate constituents and their related combined information; and outputting the data in the corresponding linked database in order according to the importance level of the ultimate constituents of an input text, and outputting the corresponding documents in order according to the importance of the associated data in the source text. The method and device can effectively filter out the one-sided and noise information that appears in information retrieval and improve the integration and output efficiency of text information, so that users can make fuller and more efficient use of existing text information.

Description

A method and device for processing text information
Technical field
The present invention relates to the technical field of text information processing, and in particular to a method and device for processing text information.
Background
Text information has played an important role in the course of human civilization. It records the wisdom and accumulated experience of earlier generations and carries the inheritance of culture and the progress of science and technology. As society develops, all kinds of text data are growing explosively, and this data is also an important resource driving every industry forward. How to make the large body of existing text data serve people better and more efficiently is of great practical significance to the country and to humanity as a whole.
In the prior art, making full use of existing text data to guide or innovate on current work depends on enabling users to obtain useful information precisely and efficiently from a large amount of complex text. The most common approach is to present the documents and sources related to the input keywords through a search engine or a specialized database. In general, a single document reflects at most one aspect of the content related to the keyword of interest, so the retrieved information is mostly one-sided, fragmentary, or even irrelevant. Obtaining valuable information comprehensively often requires consulting a large number of documents. Moreover, current computer systems cannot precisely analyze the content and semantics of a text and cannot distinguish the function and importance of every possible keyword in it.
If particular keywords are retrieved in the traditional way, for example ranked by click volume or hyperlink count, a great deal of unimportant or even noise information is ranked near the top (the truly valuable core content of these documents is not the keyword the user cares about; the documents merely mention it), so the information shown to the user is likewise one-sided and disorderly. This unordered, fragmentary delivery of information hinders people, now and in the future, from using documents efficiently, and the problem grows more severe as the number of documents increases.
In summary, the prior-art ways of processing and classifying text information cannot present the specific information a particular user cares about clearly and accurately, and cannot make efficient use of existing text information; further improvement is urgently needed.
Summary of the invention
The purpose of the present invention is to provide a method and device for processing text information, so as to remedy the defects of the prior art, in which text information cannot be presented clearly and precisely and text data cannot be used efficiently.
To achieve the above object, the present invention provides a method for processing text information, the method comprising:
segmenting a text to generate ultimate constituents;
determining the corresponding information unit according to the functional role each ultimate constituent plays in the text, and marking the importance level of the ultimate constituents of the information unit;
establishing linked databases and indexes centered on the ultimate constituents and their related combined information;
outputting the data in the corresponding linked database in order according to the importance level of the ultimate constituents of an input text, and outputting the corresponding documents in order according to the importance of the associated data in the source text.
Preferably, the corresponding information unit is determined according to the functional role the ultimate constituent plays in the text: main body, behavior expression mode, receptor, characteristic result, or additional information that modifies and limits;
the main body is information on the main object or executor of the described event;
the behavior expression mode is information on the action or behavior that is expressed, described, or executed;
the receptor is information on the object acted upon or the party on which the action is performed;
the characteristic result is information describing or explaining the result or conclusion after the behavior is executed;
the additional information that modifies and limits is information on the background, purpose, and conditions under which the main event occurs.
Preferably, marking the importance level of the ultimate constituents of the information unit specifically comprises:
determining the importance level according to the role and status of the ultimate constituent within the information unit and the role and status of the information unit within the document;
an ultimate constituent that, within its information unit, expresses the central idea, conclusion, or significance of the containing paragraph, chapter, or full text is set to high-level importance;
an ultimate constituent that, within its information unit, is only weakly associated with the central idea of the original text is set to low-level importance;
an ultimate constituent that, within its information unit, gives a detailed description or procedural explanation of the core work or idea of the original text is set to intermediate-level importance.
Preferably, establishing the linked database centered on the ultimate constituents and their related combined information specifically comprises:
setting each ultimate constituent and each of its meaningful combinations as centre data, and gathering the information units in other documents that are associated with the centre data to form a linked database;
associated information units are two or more information units that are identical or whose similarity is higher than a specified threshold.
Preferably, establishing the index means establishing an index between the centre data and the linked database, and between the centre data and the related documents.
Preferably, outputting the data in the corresponding linked database in order according to the importance level of the ultimate constituents of the input text, and outputting the corresponding documents in order according to the importance of the associated data in the source text, specifically comprises: segmenting the input text to generate ultimate constituents; marking the importance of the ultimate constituents and forming their related combinations; matching them against the centre data in the linked databases; outputting the data in the linked database in order of matching degree; and outputting the related documents in order according to the importance level, in the source text, of the target data in the linked database.
The present invention also provides a text processing device, the text processing device comprising:
a text segmentation module, for segmenting a text to generate ultimate constituents;
an information marking module, for determining the corresponding information unit according to the functional role each ultimate constituent plays in the text, and for marking the importance level of the ultimate constituents of the information unit;
a database module, for establishing linked databases and indexes centered on the ultimate constituents and their related combined information;
an information retrieval module, for outputting the corresponding linked database according to the importance level of the ultimate constituents of an input text, and for outputting the corresponding documents according to the importance, in the source text, of the target data in the linked database.
Preferably, the information marking module is further configured to determine the corresponding information unit according to the main body, behavior expression mode, receptor, characteristic result, and additional information that modifies and limits in the text;
the main body is information on the main object or executor of the described event;
the behavior expression mode is information on the action or behavior that is expressed, described, or executed;
the receptor is information on the object acted upon or the party on which the action is performed;
the characteristic result is information describing or explaining the result or conclusion after the behavior is executed;
the additional information that modifies and limits is information on the background, purpose, and conditions under which the main event occurs.
Preferably, the information marking module is further configured to determine the importance level according to the role and status of the ultimate constituent within the information unit and the role and status of the information unit within the document;
an ultimate constituent that, within its information unit, expresses the central idea, conclusion, or significance of the containing paragraph, chapter, or full text is set to high-level importance;
an ultimate constituent that, within its information unit, is only weakly associated with the central idea of the original text is set to low-level importance;
an ultimate constituent that, within its information unit, gives a detailed description or procedural explanation of the core work or idea of the original text is set to intermediate-level importance.
Preferably, the database module is configured to set each ultimate constituent and each of its meaningful combinations as centre data, and to gather the information units in other documents that are associated with the centre data to form a linked database;
associated information units are two or more information units that are identical or whose similarity is higher than a specified threshold.
Preferably, the database module is further configured to establish an index between the centre data and the linked database, and between the centre data and the related documents.
Preferably, the information retrieval module is configured to segment the input text to generate ultimate constituents, combine the ultimate constituents, match them against the centre data in the linked databases, output the data in the linked database in order of matching degree, and output the related documents in order according to the importance level, in the source text, of the target data in the linked database.
The invention has the following advantages:
Based on the syntactic structure of the text content and the role each ultimate constituent plays in its sentence, the present invention functionally marks and processes the content of every sentence in the text and marks the importance level of each ultimate constituent, so that the resulting structure of the text content is clearly layered and distinguishes primary from secondary. The text processing method of the present invention effectively solves problems of text retrieval systems such as fragmentary, one-sided, and unordered output, enabling users to query all related information precisely, systematically, and in order, according to the function and importance of the keywords of the input text.
Description of the drawings
Fig. 1 is a flow chart of the method for processing text information of the present invention;
Fig. 2 is a schematic diagram of the division of ultimate constituents into information units in the present invention;
Fig. 3 is a schematic diagram of the importance hierarchy of the ultimate constituents in an information unit in the present invention;
Fig. 4 is a flow diagram of the retrieval of database data and the output of documents in the present invention;
Fig. 5 is a structural schematic diagram of the text processing device of the present invention.
Detailed description
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present application.
Embodiment 1
This embodiment describes the method for processing text information in detail. Specifically, the method of the present invention comprises the following steps:
Step S100: segment the text to generate ultimate constituents.
First, according to the context, some of the pronouns and omitted content in the text to be processed are replaced and supplemented as necessary, so that when a single sentence and its ultimate constituents are processed, the relevant important content can be expressed and reproduced more completely. Because of language habits and the need for concision, sentences contain many pronouns (such as "here", "those", and "they") and some omitted content (for example, a limiting or modifying word that applies to two different objects may be stated for only one of them and omitted when the other is described).
When the structural hierarchy of a single sentence is processed, the information of each constituent in that sentence becomes more detailed, complete, and meaningful only after the important pronouns are replaced and the omitted content is supplemented. In the present invention, the text to be processed comes from all kinds of documents, pictures, speech, video data, and other sources that contain text information or verbal descriptions. The text may be expressed in any format and any language.
According to certain grammar rules, the text is segmented into ultimate constituents that carry practical meaning, and each is marked with its part of speech. Here the ultimate constituents include characters, words, phrases, formulas, and so on; the parts of speech include noun, verb, adjective, article, preposition, conjunction, adverb, and so on. To divide a sentence into constituents with practical meaning, the following analyses can be combined. First, the whole sentence is divided into several large modules by its core verb, special characters, and the like; for example, it can be divided into the part before the verb, the verb, and the part after the verb. Then, according to the dividing behavior of characteristic conjunctions and prepositions, these large modules are divided into smaller units; for example, the conjunction "and" often separates two different coordinate constituents. Next, according to how parts of speech combine, the resulting units are divided into still smaller components. For example, by identifying structural particles (such as the adverbial marker "...地"), adjectives can be separated from the noun phrase they belong to and adverbs from the verb phrase they belong to; an article or numeral-classifier can separate two adjacent nouns. Finally, combinations that cannot be distinguished or segmented by form must be analyzed and judged on the basis of the context or an existing corpus. For example, if a constituent contains a smaller constituent that has already been segmented elsewhere, it can be further divided into smaller ultimate constituents. In the present invention, the specific segmentation approach differs with the language of the text.
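The staged splitting described in step S100 can be sketched roughly as follows. This is only a minimal illustration, assuming English input, whitespace tokenization, and two tiny hand-written lexicons (`CORE_VERBS` and `CONJUNCTIONS`, both hypothetical); the source does not specify an implementation, and a real system would rely on a part-of-speech tagger and a corpus.

```python
# Hypothetical mini-lexicons for illustration; not part of the source text.
CORE_VERBS = {"improves", "reduces", "uses"}
CONJUNCTIONS = {"and", "or"}


def split_on(tokens, separators):
    """Split a token list into chunks at any token found in `separators`."""
    chunks, current = [], []
    for tok in tokens:
        if tok.lower() in separators:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        chunks.append(current)
    return chunks


def segment(sentence):
    """Stage 1: split around the core verb. Stage 2: split each module at conjunctions."""
    tokens = sentence.rstrip(".").split()
    verb_idx = next((i for i, t in enumerate(tokens) if t.lower() in CORE_VERBS), None)
    if verb_idx is None:
        modules = [tokens]
    else:
        modules = [tokens[:verb_idx], [tokens[verb_idx]], tokens[verb_idx + 1:]]
    constituents = []
    for module in modules:
        constituents.extend(" ".join(chunk) for chunk in split_on(module, CONJUNCTIONS))
    return constituents


print(segment("The new index improves recall and query speed."))
```

The later stages in the text (splitting by part-of-speech combinations and resolving ambiguous cases from context) are deliberately left out of this sketch.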
Step S101: determine the corresponding information unit according to the functional role each ultimate constituent plays in the text, and mark the importance level of the ultimate constituents of the information unit. The corresponding information unit is determined according to whether the ultimate constituent serves in the text as main body, behavior expression mode, receptor, characteristic result, or additional information that modifies and limits. The main body is information on the main object or executor of the described event; it generally comprises the subject of the sentence and the additional information that modifies or limits it. The behavior expression mode is information on the action or behavior that is expressed, described, or executed; it is generally the predicate verb of the sentence and its modifiers. The receptor is information on the object acted upon or the party on which the action is performed; it is generally the object of an active-voice sentence and its associated content. The characteristic result is information describing or explaining the result or conclusion after the behavior is executed; it generally comprises state or result information that complements the object, subject, or predicate. The additional information that modifies and limits is information on the background, purpose, and conditions under which the main event occurs; it generally consists of the adverbial information that modifies or limits the whole sentence.
As shown in Fig. 2, specifically, the core verb and its associated parts (such as adjacent modifying adverbs) are first taken as the behavior information unit. The core noun before the behavior and its related supplementary information are then taken as the main body unit, the core noun after it and its related parts as the receptor unit, and the content immediately following the receptor unit as the characteristic result unit (when there is no receptor, this unit can directly follow the behavior unit; it may be another noun phrase, an adjective, or an infinitive that supplements or explains the state of the main body or receptor, or an adverb that describes the state of the behavior). Finally, the punctuation in the sentence and the relevant conjunctions or prepositions can be combined to determine the additional information unit that modifies or limits the whole sentence, for example an adverbial of purpose or condition introduced by certain prepositions. When the modifying, limiting, or explanatory content is a subordinate clause, the clause can be extracted separately and processed again as a complete sentence.
Fig. 2 shows one example of dividing information units. In practice, a single information unit is not limited to adjacent content and may contain several different core elements; for example, the main body may contain several coordinate nouns or corresponding phrases and infinitive constituents. In addition, to obtain complete information units, the dividing and connecting functions of all parts of speech, such as characteristic prepositions, conjunctions, articles, and verbs, and the syntactic structures each information unit may have must be fully considered (for example, a main body unit may be a simple adjective + noun form, or a noun + preposition-linked limiting unit). Specifically, manually processed and labeled text serves as samples in the early stage so that a machine can learn the rules, and large-scale machine processing and recognition is eventually achieved by means of artificial intelligence.
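The positional rules above (behavior around the core verb, main body before it, receptor after it, characteristic result after the receptor) can be sketched as a small rule-based function. This is a hedged illustration only: it assumes the sentence is already part-of-speech tagged with a hypothetical tag set (`NOUN`, `VERB`, `ADJ`, `ADV`, `DET`), and it ignores the additional-information unit and subordinate clauses. The source instead envisions learning these rules from manually labeled samples.

```python
def assign_units(tagged):
    """Group (word, tag) pairs into functional information units.

    Assumed (hypothetical) tag set: NOUN, VERB, ADJ, ADV, DET.
    """
    units = {"main_body": [], "behavior": [], "receptor": [], "characteristic_result": []}
    verb_idx = next((i for i, (_, tag) in enumerate(tagged) if tag == "VERB"), None)
    if verb_idx is None:
        units["main_body"] = [w for w, _ in tagged]
        return units

    # Adverbs directly before the core verb belong to the behavior unit.
    start = verb_idx
    while start > 0 and tagged[start - 1][1] == "ADV":
        start -= 1
    units["main_body"] = [w for w, _ in tagged[:start]]
    units["behavior"] = [w for w, _ in tagged[start:verb_idx + 1]]

    # The receptor extends to the end of the first run of nouns after the verb;
    # whatever follows is treated as the characteristic result.
    rest = tagged[verb_idx + 1:]
    end, seen_noun = 0, False
    for i, (_, tag) in enumerate(rest):
        if tag == "NOUN":
            seen_noun, end = True, i + 1
        elif seen_noun:
            break
    units["receptor"] = [w for w, _ in rest[:end]]
    units["characteristic_result"] = [w for w, _ in rest[end:]]
    return units


example = [("The", "DET"), ("heat", "NOUN"), ("treatment", "NOUN"), ("quickly", "ADV"),
           ("makes", "VERB"), ("the", "DET"), ("alloy", "NOUN"), ("brittle", "ADJ")]
print(assign_units(example))
```

When no noun follows the verb, the receptor stays empty and the remaining words fall into the characteristic result, which mirrors the case where that unit directly follows the behavior unit.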
In the present invention, marking the importance level of the ultimate constituents of an information unit specifically comprises determining the importance level according to the role and status of the ultimate constituent within the information unit and the role and status of the information unit within the document.
As shown in Fig. 3, the importance level of each ultimate constituent (character, word, phrase, etc.) is determined according to its role and status in the information unit. The first step is to determine the importance of the ultimate constituent within the information unit it belongs to. In general, the core backbone constituents of an information unit are set to high-level importance, and the related modifying or limiting constituents are set to lower-level importance. For example, in a main body information unit, the core noun can be set to high-level importance, while modifying or limiting adjectives, phrases, or infinitives are set to a low level. When the segmented ultimate constituents of one information unit include several different elements, they can be given the same or different importance levels according to practical factors such as the internal relations among them. The second step is to determine the importance, within the text, of the information unit to which the constituent belongs, that is, to mark the status and role of the information unit as a whole in its paragraph, chapter, or the full text. The core criteria for this step include, but are not limited to, the following. (1) Whether the sentence contains words or phrases carrying certain specific information, or a particular voice or tense. For example, when the sentence contains wording such as "this paper introduces (or analyzes, compares, etc.) ...", the importance of the sentence and its information units can be set to high-level as a whole. A sentence that summarizes earlier work, or that states a method or viewpoint cited for comparison, generally contains third-person reference or past-tense verbs; on the basis of these elements, the importance of the sentence and its information units in the paper can be judged to be low-level. A sentence containing none of the above can be provisionally judged to be of intermediate importance. (2) Comprehensive judgment based on the semantics and the context, which requires a comprehensive analysis of the key information in all related sentences and the various internal logical relations among the key information, in order to judge and mark the importance level of the unit within the global scope. The key information here includes not only conjunctions, adversatives, and progressive words, but also whether the ultimate constituents of the sentences stand in progressive, contrastive, or subordinate semantic relations.
Combining the results of the two steps above gives the final importance level of the ultimate constituent. In general, when the ultimate constituent is a core constituent of its information unit and that information unit plays a core role in the text, the final importance level of the ultimate constituent is the highest; when the ultimate constituent is a non-core constituent of its information unit and that information unit is unrelated to the core idea of the full text, its final importance level is the lowest. Note that any text (data) information formed by combining ultimate constituents can be given the importance level of the most important ultimate constituent it contains.
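The two-step scheme above can be illustrated with a short sketch. Several things here are assumptions rather than anything stated in the source: the three-value `Level` enum, the cue-phrase lists, and in particular the rule of adding the two levels to get a final score (the text only fixes the two extremes, highest when both are core and lowest when neither is).

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MIDDLE = 2
    HIGH = 3


# Hypothetical cue phrases for criterion (1); a real list would be language-specific.
HIGH_CUES = ("this paper introduces", "this paper analyzes", "this paper compares")
LOW_CUES = ("previous studies", "was reported by")


def unit_level_in_text(sentence: str) -> Level:
    """Step 2, criterion (1): judge the sentence's role in the text from cue phrases."""
    s = sentence.lower()
    if any(cue in s for cue in HIGH_CUES):
        return Level.HIGH
    if any(cue in s for cue in LOW_CUES):
        return Level.LOW
    return Level.MIDDLE  # provisional default when no cue is present


def final_score(in_unit: Level, in_text: Level) -> int:
    """Combine both steps; simple addition is an assumed rule (range 2 to 6)."""
    return int(in_unit) + int(in_text)


def combined_score(scores) -> int:
    """A combination of constituents takes the score of its most important member."""
    return max(scores)


sentence = "This paper introduces a layered index for text retrieval."
core_noun = final_score(Level.HIGH, unit_level_in_text(sentence))
modifier = final_score(Level.LOW, unit_level_in_text(sentence))
print(core_noun, modifier, combined_score([core_noun, modifier]))
```

Any monotone combination rule would fit the description equally well; addition is used only because it keeps the two extremes at the ends of the scale.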
Fig. 3 shows one example of a layering approach. Each functional unit may contain several different backbone, modifying, and/or limiting constituents, and the importance level can be expressed and set in any form, format, number of levels, or evaluation rule. When the modifying or limiting content is a subordinate clause, the clause can be extracted separately as a complete information unit, marked with the attribute of a subordinate clause, and then assigned importance levels by the method above.
Step S102: establish linked databases and indexes centered on the ultimate constituents and their related combined information.
Specifically: each ultimate constituent and each of its meaningful combinations is set as centre data, and the information units in other documents associated with the centre data are gathered to form a linked database; indexes are established between the centre data and the linked database, and between the centre data and the related documents. To form a linked database centered on each ultimate constituent and its related combined information, each single segmented ultimate constituent, or any combination of two or more of them, is set as centre data; the segmented information or related combined information of all other documents is compared with the centre data in turn; and all associated data is gathered, added, and updated in real time to form the linked database. Here, associated data information units are two or more pieces of data (characters, words, phrases, etc.) that are identical or whose similarity is higher than a specified threshold. "Identical" covers the same information unit in different parts of speech, tenses, or voices; data whose similarity is higher than the specified threshold includes data with the same semantics but different wording, and information units whose literal meanings have a degree of association higher than the specified threshold.
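The "identical or similarity above a specified threshold" test can be sketched as follows. The source does not say how similarity is measured; this sketch substitutes character-level similarity from Python's standard `difflib.SequenceMatcher` purely as a stand-in for a semantic measure, and the function names and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_associated(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the units are identical (ignoring case) or similar above the threshold."""
    return a.lower() == b.lower() or similarity(a, b) > threshold


def build_linked_database(centre: str, units: list[str], threshold: float = 0.8) -> list[str]:
    """Gather every information unit associated with the given centre data."""
    return [u for u in units if is_associated(centre, u, threshold)]


print(build_linked_database("color", ["colour", "Color", "banana"]))
```

Character similarity will not catch synonyms with unrelated spellings, so a production system would need a lemmatizer for the "identical" case and some semantic similarity model for the threshold case.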
The indexes between the centre data and the linked database, and between each piece of data and its source document, are established as follows. An index is established between each segmented ultimate constituent, or its related combined information, and the linked database centered on it; this helps users grasp, faster and more comprehensively, all keywords associated with the centre-data keyword and speeds up finding their target keyword. An index is also established between each segmented constituent or combination and the documents that contain it, to speed up queries.
In this step, two different linked databases may contain many identical data information units. For example, all data in the linked database of centre data AB must contain at least one of the information units A or B, such as ACD/BD; all data in the linked database centered on BC must contain at least one of B or C, such as BD/CD. The linked databases centered on AB and BC therefore necessarily share a large amount of redundant information. The main purpose is to trade storage space for later retrieval time. Each time a new segmented ultimate constituent or combination is obtained, it must be compared with all existing databases, and the information in the existing linked databases is updated in real time. If it coincides with an existing centre data, the new data is merged into the existing linked database and no new linked database is created.
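The AB/BC example and the incremental update rule can be sketched as below. This is an illustrative data structure, not the patented implementation: the class name `LinkedDatabases`, the `max_combo` limit on combination size, and the simplification that "associated" means "shares at least one constituent" are all assumptions.

```python
from itertools import combinations


class LinkedDatabases:
    """Maps each centre data (a set of constituents) to its associated units."""

    def __init__(self, max_combo: int = 2):
        self.max_combo = max_combo  # largest combination used as centre data (assumption)
        self.databases = {}         # frozenset centre -> set of frozenset units
        self.units = set()          # every information unit seen so far

    def _centres_of(self, unit):
        for size in range(1, self.max_combo + 1):
            for combo in combinations(sorted(unit), size):
                yield frozenset(combo)

    def add_unit(self, constituents):
        unit = frozenset(constituents)
        # Compare the new unit with all existing databases and update them.
        for centre, members in self.databases.items():
            if centre & unit:
                members.add(unit)
        # Create a database only for centre data that does not exist yet.
        for centre in self._centres_of(unit):
            if centre not in self.databases:
                self.databases[centre] = {u for u in self.units if centre & u}
                self.databases[centre].add(unit)
        self.units.add(unit)


db = LinkedDatabases()
for unit in (["A", "B"], ["B", "C"], ["A", "C", "D"], ["B", "D"], ["C", "D"]):
    db.add_unit(unit)

ab = db.databases[frozenset({"A", "B"})]
bc = db.databases[frozenset({"B", "C"})]
shared = ab & bc  # units stored redundantly in both databases
print(sorted("".join(sorted(u)) for u in shared))
```

The overlap in `shared` is the storage cost the text accepts in exchange for answering a query on AB or BC with a single lookup.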
Step S103: output the data in the corresponding linked database in order according to the importance level of the ultimate constituents of the input text, and output the corresponding documents in order according to the importance of the associated data in the source text. Specifically: the input text is segmented to generate ultimate constituents; the importance of the ultimate constituents is marked and their related combinations are formed; they are matched against the centre data in the linked databases; the data in the linked database is output in order of matching degree; and the related documents are output in order according to the importance level, in the source text, of the target data in the linked database.
Analysing and judging the importance level of the ultimate constituents of the input text includes: segmenting the input text into ultimate constituents and marking their importance levels according to steps S101 and S102. Retrieval then proceeds in turn over the keywords among the ultimate constituents of the user's input text: first the keyword as a whole and its related combinations, then the important components of the keyword, then its secondary components and their combinations. Each of these is matched against the centre data in each database.
Outputting the data in the corresponding linked database in order includes: after a keyword of the input text has matched the centre data of a linked database, the matching degree between each data item in that linked database and the input keyword must be calculated, taking into account the importance weight of each ultimate constituent; the calculated matching degree then determines the order in which all data in the linked database are displayed. In practical applications, the order of the data also depends on various other factors, such as the attention the data has received within a given period.
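The text does not give a formula for the matching degree. The Python sketch below assumes one simple possibility: the share of the query's total importance weight that is covered by the constituents of a data item. The function names, the example keywords, and the numeric weights are hypothetical.

```python
def matching_degree(query_weights, entry):
    """Share of the query's total importance weight covered by the entry's constituents."""
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    matched = sum(weight for constituent, weight in query_weights.items()
                  if constituent in entry)
    return matched / total


def rank_entries(query_weights, entries):
    """Order the data items of a linked database by descending matching degree."""
    return sorted(entries,
                  key=lambda entry: matching_degree(query_weights, entry),
                  reverse=True)


# Hypothetical query: each ultimate constituent carries an importance weight.
query = {"battery": 3.0, "lithium": 2.0, "cost": 1.0}
entries = [
    {"lithium", "cost"},
    {"battery", "lithium", "anode"},
    {"cost"},
]
ranked = rank_entries(query, entries)
```

With these weights, the item containing the two most important constituents is listed first. Other signals mentioned in the text, such as attention over a period, could be added as further sort keys.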
As shown in Figure 4, outputting the documents corresponding to the target data in order includes: when the user queries the specific documents corresponding to the target data, all related documents are ordered according to the importance of the target data in the original text. The documents shown first are those in which the target data is most important within its information unit and the sentence containing it; the remaining documents follow in order of their respective importance. For related documents of the same importance, the ordering may also take other factors into account, such as attention, time, source publication, and document classification. The displayed list of documents may also include other important related keywords from each document, to speed up the user's selection of a specific document. This output order not only gives the user a systematic view of all existing related results around the input keyword, but also lets the user quickly find the target data of interest or need, and then learn the relevant details from the corresponding documents.
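The ordering described above can be sketched as a multi-key sort. In the Python sketch below, the `DocumentHit` fields, the use of a view count as the attention signal, and the choice of attention before recency as tie-breakers are assumptions made for illustration; the text only says that such factors may be considered.

```python
from dataclasses import dataclass

# Lower rank sorts first: high-level importance is shown before middle and low.
LEVEL_RANK = {"high": 0, "middle": 1, "low": 2}


@dataclass
class DocumentHit:
    doc_id: str
    importance: str   # importance level of the target data within this document
    attention: int    # hypothetical attention signal, e.g. a view count
    year: int         # publication year, used as the last tie-breaker


def order_documents(hits):
    """Sort by importance level, then by attention, then by recency."""
    return sorted(hits,
                  key=lambda h: (LEVEL_RANK[h.importance], -h.attention, -h.year))


hits = [
    DocumentHit("d1", "low", 900, 2017),
    DocumentHit("d2", "high", 10, 2015),
    DocumentHit("d3", "high", 50, 2016),
    DocumentHit("d4", "middle", 5, 2018),
]
ordered_ids = [hit.doc_id for hit in order_documents(hits)]
```

Importance dominates the order here, so a widely viewed document in which the target data is marginal still appears after less viewed documents in which it is central.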
Embodiment 2
Based on the method of the above embodiment, the present invention also provides a text processing apparatus. Specifically, as shown in Figure 5, the text processing apparatus provided by this embodiment includes: a text segmentation module, for segmenting text to generate ultimate constituents; an information marking module, for determining the corresponding information unit according to the functional role the ultimate constituent plays in the text, and for marking the importance level of the ultimate constituents of the information unit; a database module, for establishing linked databases centred on the ultimate constituents and their related combination information, together with indexes; and an information retrieval module, for outputting the corresponding linked database according to the importance level of the ultimate constituents of the input text, and for outputting the corresponding documents according to the importance, in the source text, of the target data in the linked database.
The information marking module is further used to determine the corresponding information unit according to the main body, the mode of behavioural expression, the receptor, the characteristic result, and the modifying and restricting additional information in the text. The main body is the information on the main object or executor of the described event; the mode of behavioural expression is the information on the expression, description, or action performed; the receptor is the information on the object acted upon or the party on which the action is performed; the characteristic result is the information describing or explaining the result or conclusion after the behaviour is performed; and the modifying and restricting additional information is the information on the background, purpose, and conditions under which the main event occurs.
In one specific implementation of this embodiment, the information marking module is further used to determine the importance level according to the role and position of the ultimate constituent within the information unit and the role and position of the information unit within the text. An ultimate constituent whose information unit expresses the central idea, conclusion, or significance of the containing paragraph, chapter, or full text is set to high-level importance; an ultimate constituent whose information unit has little association with the central idea of the original text is set to low-level importance; and an ultimate constituent whose information unit gives a detailed introduction to, or a process description of, the core work or idea of the original text is set to middle-level importance.
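The three levels can be represented as an ordered enumeration. The Python sketch below is a minimal illustration: the role labels in `ROLE_TO_LEVEL` are hypothetical names for what an information unit contributes to its source text, and how a unit comes to receive such a label is outside the sketch.

```python
from enum import IntEnum


class Importance(IntEnum):
    LOW = 1
    MIDDLE = 2
    HIGH = 3


# Hypothetical labels for the role an information unit plays in its source text.
ROLE_TO_LEVEL = {
    "central_idea": Importance.HIGH,
    "conclusion": Importance.HIGH,
    "significance": Importance.HIGH,
    "detailed_introduction": Importance.MIDDLE,
    "process_description": Importance.MIDDLE,
}


def importance_of(unit_role):
    """Map a unit's role to a level; roles not tied to the central idea default to LOW."""
    return ROLE_TO_LEVEL.get(unit_role, Importance.LOW)


levels = [importance_of(role)
          for role in ("conclusion", "process_description", "background_remark")]
```

Because `IntEnum` members compare as integers, the levels can be used directly as sort keys or weights when ordering results.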
In one specific implementation of this embodiment, the database module is used to set each ultimate constituent and each of its meaningful combinations as centre data, and to gather the information units in other documents that are associated with that centre data into a linked database.
An associated information unit is one of two or more information units that are identical or whose similarity is higher than a specified threshold. Here, "identical" refers to the same information unit appearing in different parts of speech, tenses, or voices. Data whose similarity is higher than the specified threshold include two or more data items that are semantically the same but expressed differently, and/or information units whose literal meanings have a degree of association higher than the specified threshold.
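The association test can be sketched as follows. This Python sketch makes several assumptions: suffix stripping stands in for real morphological analysis of part of speech, tense, and voice; the standard-library `difflib.SequenceMatcher` ratio stands in for the similarity measure; and the threshold of 0.8 is arbitrary. It covers only literal similarity; semantic equivalence between differently worded units would need a separate model.

```python
from difflib import SequenceMatcher

SUFFIXES = ("ing", "ed", "s")  # crude stand-in for real morphological analysis


def normalize(word):
    """Lowercase the word and strip one common inflectional suffix, if present."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def is_associated(unit_a, unit_b, threshold=0.8):
    """True if the units are identical after normalization or similar above the threshold."""
    a, b = normalize(unit_a), normalize(unit_b)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


same_word_forms = is_associated("Segmented", "segmenting")
spelling_variants = is_associated("colour", "color")
unrelated = is_associated("retrieval", "database")
```

The first two calls show the two routes to association named in the text: inflected forms of the same word, and literal similarity above the threshold.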
In one specific implementation of this embodiment, the database module is further used to establish the index between the centre data and the linked database, and the index between the centre data and the related documents.
In one specific implementation of this embodiment, the information retrieval module is used to segment the input text to generate ultimate constituents, combine the ultimate constituents, match them against the centre data in the linked databases, output the data in the linked database in order of matching degree, and output the related documents in order of the importance level, in the source text, of the target data in the linked database.
The text information processing method and apparatus of the present invention can improve the efficiency with which text information is integrated and output. Based on the functional role that each ultimate constituent plays in the text, the invention divides the ultimate constituents of the text data into functional units and marks their importance level in terms of content. This text processing technique gives an information retrieval system a more reasonable presentation of results, with a priority ordering that better matches users' needs and habits; it also makes the scattered information in individual texts more concentrated and systematic. The method and apparatus can effectively filter out the one-sided and irrelevant information encountered in information retrieval, improve the efficiency of integrating and outputting text information, and help users make better use of existing text information.
Although the present invention has been described in detail above by way of general description and specific embodiments, modifications or improvements can be made on the basis of the invention, as will be apparent to those skilled in the art. Such modifications or improvements made without departing from the spirit of the invention therefore fall within the scope of protection claimed.

Claims (12)

1. A method for processing text information, characterized in that the method comprises:
segmenting a text to generate ultimate constituents;
determining a corresponding information unit according to the functional role that the ultimate constituent plays in the text, and marking the importance level of the ultimate constituents of the information unit;
establishing linked databases centred on the ultimate constituents and their related combination information, together with indexes;
outputting, in order, the data in the corresponding linked database according to the importance level of the ultimate constituents of an input text, and outputting, in order, the corresponding documents according to the importance of the associated data in the source text.
2. The method for processing text information according to claim 1, characterized in that
the corresponding information unit is determined according to the functional role of the ultimate constituent in the text as main body, mode of behavioural expression, receptor, characteristic result, or modifying and restricting additional information;
the main body is the information on the main object or executor of the described event;
the mode of behavioural expression is the information on the expression, description, or action performed;
the receptor is the information on the object acted upon or the party on which the action is performed;
the characteristic result is the information describing or explaining the result or conclusion after the behaviour is performed;
the modifying and restricting additional information is the information on the background, purpose, and conditions under which the main event occurs.
3. The method for processing text information according to claim 2, characterized in that
marking the importance level of the ultimate constituents of the information unit specifically comprises:
determining the importance level according to the role and position of the ultimate constituent within the information unit and the role and position of the information unit within the document;
an ultimate constituent whose information unit expresses the central idea, conclusion, or significance of the containing paragraph, chapter, or full text is set to high-level importance;
an ultimate constituent whose information unit has little association with the central idea of the original text is set to low-level importance;
an ultimate constituent whose information unit gives a detailed introduction to, or a process description of, the core work or idea of the original text is set to middle-level importance.
4. The method for processing text information according to claim 1, characterized in that
establishing the linked databases centred on the ultimate constituents and their related combination information specifically comprises:
setting each ultimate constituent and each of its meaningful combinations as centre data, and gathering the information units in other documents that are associated with the centre data into a linked database;
an associated information unit is one of two or more information units that are identical or whose similarity is higher than a specified threshold.
5. The method for processing text information according to claim 4, characterized in that
the established indexes are the index between the centre data and the linked database, and the index between the centre data and the related documents.
6. The method for processing text information according to claim 1, characterized in that
outputting, in order, the data in the corresponding linked database according to the importance level of the ultimate constituents of the input text, and outputting, in order, the corresponding documents according to the importance of the associated data in the source text, specifically comprises: segmenting the input text to generate ultimate constituents; marking the importance of the ultimate constituents and forming their related combinations; matching them against the centre data in the linked databases; outputting the data in the linked database in order of matching degree; and outputting the related documents in order of the importance level, in the source text, of the target data in the linked database.
7. A text processing apparatus, characterized in that the text processing apparatus comprises:
a text segmentation module, for segmenting a text to generate ultimate constituents;
an information marking module, for determining a corresponding information unit according to the functional role that the ultimate constituent plays in the text, and for marking the importance level of the ultimate constituents of the information unit;
a database module, for establishing linked databases centred on the ultimate constituents and their related combination information, together with indexes;
an information retrieval module, for outputting the corresponding linked database according to the importance level of the ultimate constituents of an input text, and for outputting the corresponding documents according to the importance, in the source text, of the target data in the linked database.
8. The text processing apparatus according to claim 7, characterized in that
the information marking module is further used to determine the corresponding information unit according to the main body, mode of behavioural expression, receptor, characteristic result, and modifying and restricting additional information in the text;
the main body is the information on the main object or executor of the described event;
the mode of behavioural expression is the information on the expression, description, or action performed;
the receptor is the information on the object acted upon or the party on which the action is performed;
the characteristic result is the information describing or explaining the result or conclusion after the behaviour is performed;
the modifying and restricting additional information is the information on the background, purpose, and conditions under which the main event occurs.
9. The text processing apparatus according to claim 7, characterized in that
the information marking module is further used to determine the importance level according to the role and position of the ultimate constituent within the information unit and the role and position of the information unit within the text;
an ultimate constituent whose information unit expresses the central idea, conclusion, or significance of the containing paragraph, chapter, or full text is set to high-level importance;
an ultimate constituent whose information unit has little association with the central idea of the original text is set to low-level importance;
an ultimate constituent whose information unit gives a detailed introduction to, or a process description of, the core work or idea of the original text is set to middle-level importance.
10. The text processing apparatus according to claim 7, characterized in that
the database module is used to set each ultimate constituent and each of its meaningful combinations as centre data, and to gather the information units in other documents that are associated with the centre data into a linked database;
an associated information unit is one of two or more information units that are identical or whose similarity is higher than a specified threshold.
11. The text processing apparatus according to claim 8, characterized in that
the database module is further used to establish the index between the centre data and the linked database, and the index between the centre data and the related documents.
12. The text processing apparatus according to claim 8, characterized in that
the information retrieval module is used to segment the input text to generate ultimate constituents, combine the ultimate constituents, match them against the centre data in the linked databases, output the data in the linked database in order of matching degree, and output the related documents in order of the importance level, in the source text, of the target data in the linked database.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810157712.2A CN108363696A (en) 2018-02-24 2018-02-24 A kind of processing method and processing device of text message

Publications (1)

Publication Number Publication Date
CN108363696A true CN108363696A (en) 2018-08-03

Family

ID=63002680

Country Status (1)

Country Link
CN (1) CN108363696A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040243408A1 (en) * 2003-05-30 2004-12-02 Microsoft Corporation Method and apparatus using source-channel models for word segmentation
CN101482881A (en) * 2003-07-30 2009-07-15 Google公司 Methods and systems for determining a meaning of a document to match the document to content
US20070124301A1 (en) * 2004-09-30 2007-05-31 Elbaz Gilad I Methods and systems for improving text segmentation
CN101876965A (en) * 2009-04-30 2010-11-03 国际商业机器公司 Method and system used for processing text
CN102682040A (en) * 2011-03-16 2012-09-19 日电(中国)有限公司 Device and method for calculating importance of documents
US20160041949A1 (en) * 2014-08-06 2016-02-11 International Business Machines Corporation Dynamic highlighting of repetitions in electronic documents
CN105740375A (en) * 2016-01-27 2016-07-06 西安小光子网络科技有限公司 Multi-optical label based information pushing system and method
CN107133213A (en) * 2017-05-06 2017-09-05 广东药科大学 A kind of text snippet extraction method and system based on algorithm

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378576A (en) * 2021-05-08 2021-09-10 重庆航天信息有限公司 Food safety data mining method
CN113378576B (en) * 2021-05-08 2023-05-26 重庆航天信息有限公司 Food safety data mining method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180803