CN106250490A - A text gene extraction method, device, and electronic equipment - Google Patents

A text gene extraction method, device, and electronic equipment

Info

Publication number: CN106250490A
Authority: CN (China)
Prior art keywords: text, text gene, candidate, gene, word
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201610622162.8A
Other languages: Chinese (zh)
Inventor: 康潮明
Current Assignee: LeTV Holding Beijing Co Ltd; LeTV Information Technology Beijing Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: LeTV Holding Beijing Co Ltd; LeTV Information Technology Beijing Co Ltd
Application filed by LeTV Holding Beijing Co Ltd and LeTV Information Technology Beijing Co Ltd
Priority to CN201610622162.8A
Publication of CN106250490A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to natural language processing, and in particular to a text gene extraction method, device, and electronic equipment. The text gene extraction method includes: building a text gene dictionary and, according to the text gene dictionary, generating a first candidate text gene set for a text to be extracted; generating a second candidate text gene set for the text to be extracted according to a text gene extraction rule; and generating a target text gene set according to the first candidate text gene set and the second candidate text gene set. By obtaining two candidate text gene sets in two different ways and deriving the target text gene set of the text from them, the embodiments of the present invention enrich the available text gene extraction methods and improve the accuracy of text gene extraction.

Description

A text gene extraction method, device, and electronic equipment
[Technical field]
The present invention relates to natural language processing, and in particular to a text gene extraction method, device, and electronic equipment.
[Background]
With the arrival of the Internet era, the information available to users covers documents of many categories and forms, from technical data and business information to news reports and entertainment information. Together these form an extremely large, heterogeneous, and open distributed database, and what this database stores is unstructured text data. Text gene mining methods can be used to obtain knowledge that is interesting or useful to the user from such unstructured text data.
At present, text gene mining methods mainly comprise statistical methods and dictionary-based methods. Statistics-based gene mining collects a large amount of data related to the text to be extracted and then organizes, analyzes, and interprets it; commonly used statistical methods include regression analysis, principal component analysis, discriminant analysis, and cluster analysis. Dictionary-based gene mining looks up all the words to be extracted in the text.
In the course of realizing the present invention, the inventor found that the prior art has at least the following problems. Statistics-based gene mining is limited by the amount of data available, so the accuracy of text gene extraction is often relatively low. Dictionary-based gene mining only finds the words to be extracted and does not consider each word's context within the sentence, so it cannot handle noise present in the sentence, which seriously affects the accuracy of word extraction.
[Summary of the invention]
The object of the present invention is to provide a text gene extraction method, device, and electronic equipment that improve the accuracy of text gene extraction.
In one aspect, an embodiment of the present invention provides a text gene extraction method, including:
building a text gene dictionary and, according to the text gene dictionary, generating a first candidate text gene set for a text to be extracted;
generating a second candidate text gene set for the text to be extracted according to a text gene extraction rule;
generating a target text gene set according to the first candidate text gene set and the second candidate text gene set.
In another aspect, an embodiment of the present invention provides a text gene extraction device, including:
a first candidate text gene set generation module, configured to build a text gene dictionary and, according to the text gene dictionary, generate a first candidate text gene set for a text to be extracted;
a second candidate text gene set generation module, configured to generate a second candidate text gene set for the text to be extracted according to a text gene extraction rule;
a target text gene set generation module, configured to generate a target text gene set according to the first candidate text gene set and the second candidate text gene set.
In yet another aspect, an embodiment of the present invention provides an electronic equipment, including:
at least one processor; and
a memory; wherein
the memory stores an instruction program executable by the at least one processor, the instruction program being configured to:
build a text gene dictionary and, according to the text gene dictionary, generate a first candidate text gene set for a text to be extracted;
generate a second candidate text gene set for the text to be extracted according to a text gene extraction rule;
generate a target text gene set according to the first candidate text gene set and the second candidate text gene set.
The embodiments of the present invention generate the first candidate text gene set in a dictionary-based way and the second candidate text gene set in a rule-based way, and finally obtain the target text gene set from these two candidate sets. Compared with the prior art, which uses a statistics-based or a dictionary-based gene mining method alone, the embodiments take the semantics of text genes fully into account, which both improves the accuracy of text gene extraction and enriches the available extraction methods.
[Brief description of the drawings]
One or more embodiments are illustrated by way of example by the figures in the corresponding accompanying drawings. These illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures in the drawings are not drawn to scale.
Fig. 1 is a flowchart of the text gene extraction method provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of the method for building the text gene dictionary provided by Embodiment two of the present invention;
Fig. 3 is a flowchart of the method for generating the first candidate text gene set provided by Embodiment two of the present invention;
Fig. 4 is a flowchart of the method for generating the text gene extraction rule provided by Embodiment two of the present invention;
Fig. 5 is a flowchart of the method for generating the target text gene set provided by Embodiment three of the present invention;
Fig. 6 is a structural block diagram of the text gene extraction device provided by Embodiment four of the present invention;
Fig. 7 is a structural block diagram of the text gene extraction device provided by Embodiment five of the present invention;
Fig. 8 is a structural block diagram of the target text gene set generation module 203 provided by Embodiment six of the present invention;
Fig. 9 is a schematic structural diagram of an electronic equipment provided by Embodiment seven of the present invention.
[Detailed description]
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
The schemes in the following embodiments are mainly executed by a server. The server may be a single server, a server cluster made up of several servers, or a cloud computing service center.
The text to be mined in the embodiments of the present invention is an unstructured text source with a huge data volume (for example, films). A text gene in the embodiments can be understood as a keyword of the text to be mined. Such a keyword distills the subject information of the text, helps the user quickly grasp the gist of the text and judge whether it belongs to a field of interest, and improves the efficiency of information access and information search. In addition, by mining the synonyms of text genes, texts related to the text to be mined can be clustered, so that other similar knowledge or information in the user's field of interest can be recommended to the user.
Embodiment one:
As shown in Fig. 1, Embodiment one of the present invention provides a text gene extraction method, which includes steps 11 to 13:
Step 11: build a text gene dictionary and, according to the text gene dictionary, generate a first candidate text gene set for the text to be extracted.
In this embodiment, building the text gene dictionary mainly includes: preprocessing a text data set to generate a first sentence set; segmenting each sentence in the first sentence set into words and filtering the result to generate a first word set; combining the words in the first word set to generate a set of two-tuples; computing the support and confidence of each two-tuple in the set; determining whether the support of a two-tuple meets a preset minimum support threshold and whether its confidence meets a preset minimum confidence threshold and, if both are met, using the two-tuple to build an association set; and matching the words of each two-tuple in the association set against a predetermined center word set and, if a word matches, using the other word of that two-tuple to build the text gene dictionary.
In this embodiment, generating the first candidate text gene set of the text to be extracted according to the text gene dictionary includes: preprocessing the text to be extracted to generate a second sentence set; segmenting the sentences in the second sentence set into words and filtering the result to generate a second word set; and matching the second word set against the predetermined center word set and, if the match succeeds, generating the first candidate text gene set according to the second word set and the text gene dictionary.
In this embodiment, generating the first candidate text gene set of the text to be extracted according to the text gene dictionary further includes: merging the synonyms in the text gene dictionary to generate a synonym dictionary set; counting how often each word in the first candidate text gene set occurs; and matching the words in the first candidate text gene set against the synonym dictionary set and, if a word matches, adding the word to the synonym dictionary set and accumulating the occurrence frequencies of the word and its synonyms.
Step 12: generate a second candidate text gene set for the text to be extracted according to a text gene extraction rule.
In this embodiment, the method for generating the text gene extraction rule includes: preprocessing a text data set to generate a third sentence set; matching the third sentence set against the predetermined center word set and, if the match succeeds, generating a fourth sentence set; matching the fourth sentence set against the text gene dictionary and, if the match succeeds, generating a fifth sentence set; extracting from the fifth sentence set a target word set in a specified direction that satisfies a predetermined condition; and generating the text gene extraction rule according to the center word set and the target word set.
It should be noted that, besides being generated from the text gene dictionary as described above, the text gene extraction rule may also be a rule set in advance by a person according to a particular algorithm.
In this embodiment, the second candidate text gene set is the set of words generated from the text to be extracted according to the above text gene extraction rule.
Step 13: generate a target text gene set according to the first candidate text gene set and the second candidate text gene set.
In this embodiment, the first candidate text gene set is the set of words generated according to the text gene dictionary, and the second candidate text gene set is the set of words generated according to the text gene extraction rule. The words in the two sets are matched one by one, and a word that appears in both is used to generate the target text gene set.
The embodiment generates the first candidate text gene set in a dictionary-based way and the second candidate text gene set in a rule-based way, matches the two sets against each other, and finally obtains the target text gene set. Compared with the prior art, which uses a statistics-based or a dictionary-based gene mining method alone, the embodiment takes the semantics of text genes fully into account, which both improves the accuracy of text gene extraction and enriches the available extraction methods.
The text gene extraction method above is described in detail through the following example; see Embodiment two.
Embodiment two:
As shown in Fig. 1, Embodiment two of the present invention provides a text gene extraction method, which includes steps 11 to 13:
Step 11: build a text gene dictionary and, according to the text gene dictionary, generate a first candidate text gene set for the text to be extracted.
In this embodiment, the text gene dictionary is built with an association rule algorithm (the Apriori algorithm). The core idea of this algorithm is a recursive method based on frequent itemset theory: its purpose is to mine from the data the items whose support and confidence are both not less than given minimum support and minimum confidence thresholds, and to analyze the association relationships between those items.
Specifically, as shown in Fig. 2, building the text gene dictionary includes steps 111 to 117:
Step 111: generate a center word set of words that collocate with the text genes.
This step generates the center word set mainly by manual summarization. The center word set is a set of nouns, verbs, adjectives, and other words related to the text genes.
For example, suppose the text concerns films. The center words that collocate with film genes include words that define plot content, such as comedy, romance, ethics, and literary; words that define film labels, such as title, country, era, duration, color, director, actor, and rating; and words related to film production, such as special makeup, 3D, electronic animation, computer graphics, and image compositing. A center word set based on film genes is established by manual summarization.
Step 112: preprocess the text data set to generate a first sentence set.
This step preprocesses all text data sets related to the text genes to be extracted. Specifically, the preprocessing splits the text data set into sentences at punctuation marks. This is typically done by matching the relevant punctuation marks with a regular expression, with the sentence splitting implemented in software, thereby generating the first sentence set. The punctuation marks may be commas, full stops, semicolons, exclamation marks, question marks, ellipses, and so on.
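The punctuation-based splitting in step 112 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `split_sentences`, the constant `SENTENCE_DELIMITERS`, and the exact delimiter list (ASCII plus full-width marks) are assumptions.

```python
import re

# Assumed delimiter list: ASCII and full-width comma, full stop, semicolon,
# exclamation mark, question mark, plus the ellipsis character.
SENTENCE_DELIMITERS = re.compile(r"[,.;!?，。；！？…]+")

def split_sentences(text):
    """Split raw text at punctuation and return the non-empty, stripped pieces."""
    pieces = SENTENCE_DELIMITERS.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Example: build a first sentence set from one line of raw text.
first_sentence_set = split_sentences("Life is like a box of chocolates, you never know. Really?")
```

Splitting on every full stop also breaks decimal numbers and abbreviations, so a production splitter would likely need a more careful pattern.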
Step 113: segment each sentence in the first sentence set into words and filter the result to generate a first word set.
This step performs word segmentation on each sentence in the first sentence set. Word segmentation generally includes string-matching-based segmentation, word-sense-based segmentation, statistics-based segmentation, and so on. String-matching-based segmentation includes the forward maximum matching method, which segments a sentence from left to right; for example, "life is just like a box of chocolates" is segmented into "life, just, like, a box, chocolates". It also includes the backward maximum matching method, which segments a sentence from right to left; applied to the same sentence, it yields "life, just, like, one, box, chocolates". It further includes the shortest-path segmentation method, which requires that the number of words cut from a sentence be minimal. Word-sense-based segmentation is a segmentation method based on machine judgment of language; it resolves ambiguity by using syntactic and semantic information. Statistics-based segmentation relies mainly on phrase statistics: the frequency with which two adjacent characters occur together is counted, and the higher the frequency, the more significant the resulting word, which can then serve as a delimiter for segmenting the string.
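The forward and backward maximum matching methods mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions: the vocabulary, the example string, and the `max_len` parameter are hypothetical and are not taken from the patent, and unknown single characters are simply emitted as one-character words.

```python
def forward_max_match(sentence, vocabulary, max_len=4):
    """Scan left to right, always taking the longest vocabulary word available."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is accepted even if it is not in the vocabulary.
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(sentence, vocabulary, max_len=4):
    """Scan right to left, always taking the longest vocabulary word available."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # words were collected from the end, so reverse them

# Hypothetical vocabulary and sentence chosen so the two directions disagree.
vocabulary = {"研究", "研究生", "生命", "命", "起源"}
forward_words = forward_max_match("研究生命起源", vocabulary)
backward_words = backward_max_match("研究生命起源", vocabulary)
```

With this vocabulary the forward scan greedily takes the three-character word at the start, while the backward scan finds a different split, which is the kind of divergence the two methods in the text can produce.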
It should be noted that the segmentation methods described in this embodiment do not limit the present invention; other segmentation methods may also be used.
Further, the word set produced by word segmentation must be filtered before the first word set is generated. The filtering includes removing non-nominal words and the like from the segmented word set.
Step 114: combine the words in the first word set to generate a set of two-tuples.
In this embodiment, the words in the first word set are combined by generating a set of two-tuples of the form {<c1,c2>, <c1,c3>, …, <c1,cn>, <c2,c3>, <c2,c4>, …, <c2,cn>, …, <cn-1,cn>}, where ci denotes a word. This way of combining ensures that every word in the first word set is paired with every other word and that no pair is repeated.
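The pairing scheme of step 114 corresponds to taking all unordered pairs of distinct words. A minimal sketch, assuming the first word set arrives as a list that may contain duplicates (the function name `build_pairs` is hypothetical):

```python
from itertools import combinations

def build_pairs(words):
    """Return every <ci, cj> pair with i < j, without repeated pairs."""
    # dict.fromkeys removes duplicate words while keeping first-seen order.
    unique_words = list(dict.fromkeys(words))
    return list(combinations(unique_words, 2))

pairs = build_pairs(["c1", "c2", "c3", "c2"])
```

For n distinct words this produces n(n-1)/2 two-tuples, so the set grows quadratically with the size of the first word set.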
Step 115: compute the support and confidence of each two-tuple in the set of two-tuples.
In this embodiment, the support reflects the probability that the two words of a two-tuple occur together. A low support indicates that the two words are weakly related; a high support indicates that they are related. The confidence reflects how likely it is that, when one word of the two-tuple occurs, the other word also occurs. A low confidence indicates that the occurrence of one word has little bearing on whether the other occurs; a confidence of 100% indicates that the two words of the two-tuple are closely connected.
This step computes the support and confidence of each two-tuple. For example, take a two-tuple <ci, cj> obtained by the combination in step 114, with <ci, cj> = <wordA, wordB>, and compute the support and confidence of wordA and wordB. The support, that is, the joint probability of A and B, is computed as P(A, B) = count(A ∩ B) / (count(A) + count(B)), where count(A ∩ B) is the number of times A and B occur together, count(A) is the number of times A occurs, and count(B) is the number of times B occurs. The confidence, that is, the probability that B occurs given that A occurs, is computed as P(B | A) = P(A, B) / P(A), where P(A, B) is the support and P(A) is the probability that A occurs.
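A rough sketch of these two measures follows. Several points are assumptions rather than statements from the text: occurrences are counted per sentence, the helper names are hypothetical, and the confidence is computed as the conditional frequency count(A ∩ B) / count(A), because the text does not say how P(A) is normalized. The support follows the formula given above literally; note that conventional Apriori support instead divides the co-occurrence count by the total number of transactions.

```python
from collections import Counter
from itertools import combinations

def count_occurrences(sentences):
    """Count the sentences containing each word and each unordered word pair."""
    word_count, pair_count = Counter(), Counter()
    for words in sentences:
        unique = sorted(set(words))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))  # keys are sorted tuples
    return word_count, pair_count

def support(a, b, word_count, pair_count):
    """count(A and B) / (count(A) + count(B)), as written in the text."""
    both = pair_count[tuple(sorted((a, b)))]
    total = word_count[a] + word_count[b]
    return both / total if total else 0.0

def confidence(a, b, word_count, pair_count):
    """Assumed reading of P(B | A): count(A and B) / count(A)."""
    both = pair_count[tuple(sorted((a, b)))]
    return both / word_count[a] if word_count[a] else 0.0

# Hypothetical tokenized sentences.
sentences = [["comedy", "funny"], ["comedy", "funny"], ["comedy", "sad"], ["funny"]]
word_count, pair_count = count_occurrences(sentences)
```

Under this reading the confidence is asymmetric: the value for wordA given wordB generally differs from the value for wordB given wordA.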
Step 116: determine whether the support of each two-tuple meets the preset minimum support threshold and whether its confidence meets the preset minimum confidence threshold; if both are met, use the two-tuple to build the association set.
As step 115 shows, the smaller the support and confidence, the weaker the association between the two words of a two-tuple, so minimum thresholds are needed to exclude weakly associated two-tuples. Specifically, the support of each two-tuple is computed as in step 115 and compared with the preset minimum support threshold; two-tuples whose support exceeds the threshold are selected to form the frequent itemset. The confidence of each two-tuple in the frequent itemset is then computed as in step 115 and compared with the preset minimum confidence threshold; two-tuples whose confidence exceeds the threshold are selected to build the association set.
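The two-stage filtering of step 116 can be sketched as below. The data layout (a mapping from each pair to a precomputed support and confidence) and the function name are assumptions; the comparison uses "greater than or equal", which is one reading of a threshold being "met", whereas the text also speaks of values "exceeding" the threshold.

```python
def build_association_set(pair_stats, min_support, min_confidence):
    """pair_stats maps (word_a, word_b) -> (support, confidence)."""
    # Stage 1: keep pairs whose support meets the threshold (the frequent itemset).
    frequent = {pair: stats for pair, stats in pair_stats.items()
                if stats[0] >= min_support}
    # Stage 2: among frequent pairs, keep those whose confidence meets its threshold.
    return {pair for pair, stats in frequent.items() if stats[1] >= min_confidence}

# Hypothetical statistics and thresholds.
pair_stats = {
    ("comedy", "hilarious"): (0.40, 0.90),
    ("comedy", "rain"): (0.10, 0.95),
    ("director", "rain"): (0.50, 0.20),
}
association_set = build_association_set(pair_stats, min_support=0.3, min_confidence=0.6)
```

Filtering on support first means confidence only has to be evaluated for the pairs that survive, which mirrors the order described in the text.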
Step 117: match the words of each two-tuple in the association set against the predetermined center word set; if a word matches, use the other word of that two-tuple to build the text gene dictionary.
This step matches each two-tuple in the association set against the center word set generated in step 111. Specifically, each word of the two-tuple is looked up in the center word set, and finding it means the match succeeds. The other word, the one associated with the matched word, is then used to build the text gene dictionary. For example, for the two-tuple <wordA, wordB>: if wordA is found in the center word set, wordB is added to the text gene dictionary; if wordB is found in the center word set, wordA is added to the text gene dictionary; and if both wordA and wordB are found in the center word set, both are added to the text gene dictionary.
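The three cases in step 117 reduce to two independent checks per pair, as in this minimal sketch (the function name and sample words are hypothetical):

```python
def build_gene_dictionary(association_set, center_words):
    """Add the partner of every center word found in an associated pair."""
    gene_dictionary = set()
    for word_a, word_b in association_set:
        if word_a in center_words:
            gene_dictionary.add(word_b)
        if word_b in center_words:
            gene_dictionary.add(word_a)
    return gene_dictionary

# Hypothetical association set and center word set.
association_set = {("comedy", "hilarious"), ("director", "actor"), ("rain", "umbrella")}
center_words = {"comedy", "director", "actor"}
gene_dictionary = build_gene_dictionary(association_set, center_words)
```

When both words of a pair are center words, both checks fire and both words enter the dictionary, which matches the third case in the text.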
In this embodiment, the first candidate text gene set of the text to be extracted is generated according to the text gene dictionary. Specifically, as shown in Fig. 3, generating the first candidate text gene set includes steps 111' to 113':
Step 111': preprocess the text to be extracted to generate a second sentence set.
This step preprocesses the text to be extracted (for example, film gene text). Specifically, the preprocessing splits the text to be extracted into sentences at punctuation marks. This is typically done by matching the relevant punctuation marks with a regular expression, with the sentence splitting implemented in software, thereby generating the second sentence set. The punctuation marks may be commas, full stops, semicolons, exclamation marks, question marks, ellipses, and so on.
Step 112': segment the sentences in the second sentence set into words and filter the result to generate a second word set.
This step performs word segmentation on each sentence in the second sentence set. The segmentation method is the same as that explained in step 113 of this embodiment and is not repeated here. It should be noted that the segmentation methods described in this embodiment do not limit the present invention; other segmentation methods may also be used. Further, the word set produced by word segmentation must be filtered before the second word set is generated. The filtering includes removing non-nominal words and the like from the segmented word set.
Step 113': match the second word set against the predetermined center word set and, if the match succeeds, generate the first candidate text gene set according to the second word set and the text gene dictionary.
In this step, each word in the second word set, starting with the first, is looked up in the predetermined center word set. If an identical word is found, the match succeeds, and the word is then looked up in the text gene dictionary; if an identical word is found again, the word is taken as a first candidate text gene. The remaining words are handled in the same way, so that every word in the second word set is matched against the center word set and the text gene dictionary in turn. If none of the words in the second word set can be found in the center word set, the procedure returns to step 112' and the second sentence set is segmented and filtered again. In this embodiment, the first candidate text gene set is generated by performing this loop.
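Read literally, step 113' keeps a word only if it is found first in the center word set and then in the text gene dictionary. A minimal sketch under that reading (the function name is hypothetical; the retry of step 112' when nothing matches is omitted, and matched occurrences are kept as a list so that frequencies can be counted later):

```python
def first_candidate_genes(second_word_set, center_words, gene_dictionary):
    """Return the word occurrences found in both lookup structures, in order."""
    candidates = []
    for word in second_word_set:
        if word not in center_words:
            continue  # first lookup failed: not a center word
        if word in gene_dictionary:
            candidates.append(word)  # second lookup succeeded
    return candidates

# Hypothetical inputs.
second_word_set = ["comedy", "rain", "director", "comedy"]
center_words = {"comedy", "director"}
gene_dictionary = {"comedy", "hilarious"}
candidates = first_candidate_genes(second_word_set, center_words, gene_dictionary)
```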
In another embodiment, generating the first candidate text gene set of the text to be extracted according to the text gene dictionary further includes:
merging the synonyms in the text gene dictionary to generate a synonym dictionary set; counting how often each word in the first candidate text gene set occurs; and matching the words in the first candidate text gene set against the synonym dictionary set and, if a word matches, adding the word to the synonym dictionary set and accumulating the occurrence frequencies of the word and its synonyms.
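The frequency accumulation over synonyms might look like the sketch below. The text does not say how synonyms are identified, so the synonym groups are taken as given input; the function name and the choice of the alphabetically smallest word as each group's representative are assumptions.

```python
from collections import Counter

def merge_synonym_counts(candidates, synonym_groups):
    """Count candidate occurrences, folding each synonym group into one entry."""
    canonical = {}
    for group in synonym_groups:
        representative = min(group)  # arbitrary but deterministic (assumption)
        for word in group:
            canonical[word] = representative
    merged = Counter()
    for word in candidates:
        # Words outside every synonym group are counted under their own name.
        merged[canonical.get(word, word)] += 1
    return merged

# Hypothetical candidate occurrences and synonym groups.
merged = merge_synonym_counts(
    ["funny", "hilarious", "funny", "sad"],
    [{"funny", "hilarious"}],
)
```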
Step 12: generate a second candidate text gene set for the text to be extracted according to a text gene extraction rule.
In this embodiment, the text gene extraction rule may be a predetermined rule, a rule generated from the text gene dictionary, or a rule generated in some other way. When the rule is generated from the text gene dictionary, specifically, as shown in Fig. 4, generating the text gene extraction rule includes steps 121 to 125:
Step 121: preprocess the text data set to generate a third sentence set.
This step preprocesses all text data sets related to the text genes to be extracted. Specifically, the preprocessing splits the text data set into sentences at punctuation marks. This is typically done by matching the relevant punctuation marks with a regular expression, with the sentence splitting implemented in software, thereby generating the third sentence set. The punctuation marks may be commas, full stops, semicolons, exclamation marks, question marks, ellipses, and so on.
Step 122: match the third sentence set against the predetermined center word set and, if the match succeeds, generate a fourth sentence set.
This step matches the third sentence set against the predetermined center word set. Specifically, it finds the sentences in the third sentence set that contain words from the center word set, and these sentences form the fourth sentence set. There is no limit on how many center words each sentence in the fourth sentence set contains.
Step 123: match the fourth sentence set against the text gene dictionary and, if the match succeeds, generate a fifth sentence set.
This step further matches the fourth sentence set against the text gene dictionary. Specifically, it finds the sentences in the fourth sentence set that contain words from the text gene dictionary, and these sentences form the fifth sentence set. There is no limit on how many dictionary words each sentence in the fifth sentence set contains.
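Steps 122 and 123 apply the same filter twice with different word lists. A minimal sketch, assuming the sentences have already been segmented into word lists (the text does not state whether matching is done on tokens or on raw substrings) and using a hypothetical helper name:

```python
def filter_sentences(sentences, vocabulary):
    """Keep the tokenized sentences that contain at least one vocabulary word."""
    return [words for words in sentences
            if any(word in vocabulary for word in words)]

# Hypothetical third sentence set, center word set, and text gene dictionary.
third_sentence_set = [
    ["this", "comedy", "is", "hilarious"],
    ["the", "director", "arrived"],
    ["it", "rained", "all", "day"],
]
center_words = {"comedy", "director"}
gene_dictionary = {"hilarious"}

fourth_sentence_set = filter_sentences(third_sentence_set, center_words)
fifth_sentence_set = filter_sentences(fourth_sentence_set, gene_dictionary)
```

Because the second filter runs on the output of the first, every sentence in the fifth set contains at least one center word and at least one dictionary word.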
Step 124, extract the target set of words meeting predetermined condition of assigned direction in described 5th sentence set.
Extract the target set of words meeting predetermined condition of assigned direction in the 5th sentence set described in this step to include: Extract the word of the gene word next-door neighbour that in described 5th sentence set, each sentence comprises, it is judged that whether described word is non-dynamic Part of speech word, if being non-verb word, directly abandons, if verb word then retains, is used for generating target word Set.
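Steps 122 to 124 can be illustrated with the sketch below. It rests on several assumptions that the text does not fix: sentences are already segmented and part-of-speech tagged as lists of `(word, tag)` pairs, the tag `"v"` marks a verb, and the "specified direction" is either the word to the left or to the right of the gene word. The function name and parameters are hypothetical.

```python
def extract_target_words(tagged_sentences, center_words, gene_dictionary,
                         direction="left"):
    """Collect verbs adjacent to gene words in sentences that mention a center word.

    tagged_sentences: list of sentences, each a list of (word, pos_tag) pairs.
    direction: "left" inspects the word before the gene word, "right" the word after.
    """
    offset = -1 if direction == "left" else 1
    targets = set()
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        # Step 122: keep only sentences containing at least one center word.
        if not any(word in center_words for word in words):
            continue
        # Step 123: sentences without a gene word contribute nothing below.
        for i, word in enumerate(words):
            if word not in gene_dictionary:
                continue
            # Step 124: inspect the adjacent word; keep it only if it is a verb.
            j = i + offset
            if 0 <= j < len(sentence) and sentence[j][1] == "v":
                targets.add(sentence[j][0])
    return targets
```

For example, with center word "film" and gene word "romance", the tagged sentence film/n tells/v romance/n contributes the verb "tells" when the direction is "left".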
Step 125: generate the text gene extraction rule from the center word set and the target word set.
Step 13: generate the target text gene set from the first candidate text gene set and the second candidate text gene set.
In this step, the first candidate text gene set is the word set generated in step 11 from the text gene dictionary, and the second candidate text gene set is the word set generated in step 12 according to the text gene extraction rule. The words in the two word sets are matched one by one; a word that appears in both is used to generate the target text gene set.
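The one-by-one matching of step 13 amounts to keeping the words common to both candidate sets. A minimal sketch, assuming the candidates are given as sequences of strings and that the order of the first candidate set should be preserved (the function name is hypothetical):

```python
def target_gene_set(first_candidates, second_candidates):
    """Return the words present in both candidate sets, in first-set order."""
    second = set(second_candidates)
    seen = set()
    result = []
    for word in first_candidates:
        # Keep a word only if the rule-based set also produced it,
        # and emit each word once even if it repeats in the input.
        if word in second and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```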
The embodiment of the present invention generates the first candidate text gene set, which is a word set, by an approach based on the text gene dictionary, and generates the second candidate text gene set, which is also a word set, by a rule-based approach. The words present in both word sets are finally taken as the target text genes. The embodiment is therefore grounded in a large data set and also takes the semantics of the text genes into account, so that the accuracy, recall, and precision of the text genes obtained are all improved. In addition, merging the synonyms in the text gene set reduces the unnecessary overhead that repeated computation may cause, and solves the problem of redundant representation in the text gene extraction results.
Further text gene extraction methods can be built on the above method embodiment; see Embodiment three for details.
Embodiment three:
As shown in Figure 5, on the basis of Embodiment two, generating the target text gene set in the text gene extraction method of this embodiment of the present invention includes:
Step 131: filter the words in the first candidate text gene set and the second candidate text gene set according to word length.
In embodiments of the present invention, filtering the words in the first candidate text gene set and the second candidate text gene set includes removing the words that do not conform to a preset length (for example, word length L = 8). A word does not conform when its length is greater than, or greater than or equal to, the preset length. Word length may be measured in number of characters. Screening out the longer words leaves the words that conform to the preset length.
Step 132: sort the words in the filtered first candidate text gene set and second candidate text gene set according to frequency of occurrence.
In embodiments of the present invention, the frequency is the frequency with which a word occurs in the first candidate text gene set and the second candidate text gene set, where occurrences of the identical word and of its synonyms are counted together. The words are sorted by frequency value from high to low.
Step 133: according to the sort order, extract a predetermined number of words as the target text gene set.
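Steps 131 to 133 can be sketched as a filter, a sort, and a truncation. The sketch makes the following assumptions, none of which are fixed by the text: the candidate sets are passed as lists in which a word appears once per occurrence, a word conforms when its character count is at most `max_length`, ties in frequency are broken alphabetically, and synonyms have already been merged. The function name and parameters are hypothetical.

```python
from collections import Counter


def select_target_genes(first_candidates, second_candidates,
                        max_length=8, top_n=10):
    """Filter candidates by length, rank by frequency, keep the top_n words."""
    # Count occurrences across both candidate lists.
    counts = Counter(first_candidates) + Counter(second_candidates)
    # Step 131: drop words longer than the preset length (in characters).
    kept = {word: n for word, n in counts.items() if len(word) <= max_length}
    # Step 132: sort by frequency, high to low; alphabetical order breaks ties.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    # Step 133: take the predetermined number of words.
    return [word for word, _ in ranked[:top_n]]
```

If the stricter reading of the preset length is intended (a word of exactly length L also fails), the comparison would be `len(word) < max_length` instead.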
By filtering and sorting the target text gene set obtained, the embodiment of the present invention selects words of suitable length and high frequency of occurrence as the final extracted text genes, so that the text genes express the subject information of the text comprehensively and accurately, which effectively improves the quality of text gene extraction.
Based on the above text gene extraction method, the text gene extraction device corresponding to the method is given below.
Embodiment four:
As shown in Figure 6, Embodiment four of the present invention provides a text gene extraction device 10. The device 10 includes a first candidate text gene set generation module 101, a second candidate text gene set generation module 102, and a target text gene set generation module 103, wherein:
The first candidate text gene set generation module 101 is used for building a text gene dictionary and, according to the text gene dictionary, generating a first candidate text gene set of the text to be extracted;
The second candidate text gene set generation module 102 is used for generating a second candidate text gene set of the text to be extracted according to a text gene extraction rule;
The target text gene set generation module 103 is used for generating a target text gene set according to the first candidate text gene set and the second candidate text gene set.
In embodiments of the present invention, the first candidate text gene set generation module and the second candidate text gene set generation module generate the first candidate text gene set and the second candidate text gene set of the text to be extracted, respectively. The target text gene set generation module receives the first candidate text gene set and the second candidate text gene set and generates the target text gene set from them.
It should be noted that the information exchange between the modules of the above device, their execution process, and similar details are based on the same concept as method Embodiment one of the present invention. For the specific content, see the description in method Embodiment one, which is not repeated here.
The embodiment of the present invention generates the first candidate text gene set by a dictionary-based approach and the second candidate text gene set by a rule-based approach, matches the two text gene sets, and finally obtains the target text gene set. Compared with the prior art, which uses a statistics-based or dictionary-based gene mining method alone, the embodiment fully considers the semantics of the text genes, which not only improves the accuracy of text gene extraction but also enriches the available text gene extraction methods.
The above text gene extraction device is described in more detail in the following example; see Embodiment five.
Embodiment five:
As shown in Figure 7, Embodiment five of the present invention provides a text gene extraction device 20. The device 20 includes:
A first candidate text gene set generation module 201, used for building a text gene dictionary and, according to the text gene dictionary, generating a first candidate text gene set of the text to be extracted.
A second candidate text gene set generation module 202, used for generating a second candidate text gene set of the text to be extracted according to a text gene extraction rule.
A target text gene set generation module 203, used for generating a target text gene set according to the first candidate text gene set and the second candidate text gene set.
Further, the first candidate text gene set generation module 201 includes a first submodule 2011, and the first submodule 2011 includes:
A first sentence set generation unit 2011a, used for preprocessing the text data set to generate a first sentence set.
A first word set generation unit 2011b, used for performing word segmentation and filtering on each sentence in the first sentence set to generate a first word set.
A two-tuple set generation unit 2011c, used for combining the words in the first word set in pairs to generate a two-tuple set.
A first computing unit 2011d, used for computing the support and the confidence of each two-tuple in the two-tuple set.
A relation set generation unit 2011e, used for judging whether the support of a two-tuple meets a preset minimum support threshold and whether the confidence of the two-tuple meets a preset minimum confidence threshold; if the support meets the preset minimum support threshold and the confidence meets the preset minimum confidence threshold, the two-tuple is used to build a relation set.
A first processing unit 2011f, used for matching each two-tuple in the relation set against the predetermined center word set; if the match succeeds, the other word of the two-tuple, that is, the word paired with the matched word, is used to build the text gene dictionary.
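The dictionary-building units 2011c to 2011f can be illustrated with the sketch below. This passage does not define support and confidence, so the sketch assumes the usual association-rule definitions: the support of an ordered pair (a, b) is the fraction of sentences containing both words, and its confidence is the number of sentences containing both divided by the number containing a. The function name, parameters, and default thresholds are hypothetical.

```python
from collections import Counter
from itertools import permutations


def build_gene_dictionary(sentence_words, center_words,
                          min_support=0.1, min_confidence=0.5):
    """Build a text gene dictionary from per-sentence word collections.

    sentence_words: one iterable of words per sentence, after segmentation
    and filtering. Returns the set of words paired with a center word by a
    two-tuple that passes both thresholds.
    """
    total = len(sentence_words)
    word_count = Counter()
    pair_count = Counter()
    for words in sentence_words:
        unique = sorted(set(words))
        word_count.update(unique)
        # Ordered pairs, because confidence(a, b) differs from confidence(b, a).
        pair_count.update(permutations(unique, 2))

    relation_set = set()
    for (a, b), both in pair_count.items():
        support = both / total
        confidence = both / word_count[a]
        if support >= min_support and confidence >= min_confidence:
            relation_set.add((a, b))

    dictionary = set()
    for a, b in relation_set:
        # If one word of the pair is a center word, the other enters the dictionary.
        if a in center_words:
            dictionary.add(b)
        if b in center_words:
            dictionary.add(a)
    return dictionary
```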
Further, the first candidate text gene set generation module 201 also includes a second submodule 2012, and the second submodule 2012 includes:
A second sentence set generation unit 2012a, used for preprocessing the text to be extracted to generate a second sentence set;
A second word set generation unit 2012b, used for performing word segmentation and filtering on the sentences in the second sentence set to generate a second word set;
A first matching unit 2012c, used for matching the second word set against the predetermined center word set; if the match succeeds, the first candidate text gene set is generated according to the second word set and the text gene dictionary.
The center word set is a set of nouns, verbs, adjectives, and other words related to the text genes. It is produced mainly by manually summarizing the center words that collocate with the text genes.
Further, the second submodule 2012 also includes:
A synonym dictionary set generation unit 2012d, used for merging the synonyms in the text gene dictionary to generate a synonym dictionary set;
A second computing unit 2012e, used for counting the frequency with which each word occurs in the first candidate text gene set;
A synonym merging unit 2012f, used for matching the words in the first candidate text gene set against the synonym dictionary set; if the match succeeds, the word is added to the synonym dictionary set, and the frequencies of the word and of its synonyms are accumulated.
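The frequency counting and synonym merging performed by units 2012d to 2012f can be sketched as follows. The representation is an assumption: the synonym dictionary set is modelled as a list of synonym groups, and the first word of each group is arbitrarily chosen as the group's representative. The function name is hypothetical.

```python
from collections import Counter


def merge_synonyms(candidates, synonym_groups):
    """Count candidate words, accumulating synonyms under one representative.

    candidates: list of words, one entry per occurrence.
    synonym_groups: list of word lists; group[0] is used as the representative.
    """
    representative = {}
    for group in synonym_groups:
        for word in group:
            representative[word] = group[0]

    merged = Counter()
    for word in candidates:
        # A word without a synonym group simply represents itself.
        merged[representative.get(word, word)] += 1
    return merged
```

With the group `["film", "movie"]`, one occurrence of "film" and two of "movie" are reported as a single entry "film" with frequency 3, which avoids listing both words as separate text genes.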
In embodiments of the present invention, the second candidate text gene set generation module 202 includes a text gene extraction rule submodule 2021, and the text gene extraction rule submodule 2021 includes:
A third sentence set generation unit 2021a, used for preprocessing the text data set to generate a third sentence set;
A second matching unit 2021b, used for matching the third sentence set against the predetermined center word set; if the match succeeds, a fourth sentence set is generated;
A third matching unit 2021c, used for matching the fourth sentence set against the text gene dictionary; if the match succeeds, a fifth sentence set is generated;
A target word set extraction unit 2021d, used for extracting, from the fifth sentence set, a target word set that lies in a specified direction and satisfies a predetermined condition;
A text gene extraction rule unit 2021e, used for generating the text gene extraction rule according to the center word set and the target word set.
It should be noted that the information exchange between the modules, submodules, and units of the above device, their execution process, and similar details are based on the same concept as method Embodiment two of the present invention. For the specific content, see the description in method Embodiment two, which is not repeated here.
The embodiment of the present invention generates the first candidate text gene set, which is a word set, by a dictionary-based approach, and generates the second candidate text gene set, which is also a word set, by a rule-based approach. The words present in both word sets are finally taken as the target text genes. The embodiment is therefore grounded in a large data set and also takes the semantics of the text genes into account, so that the accuracy, recall, and precision of the text genes obtained are all improved. In addition, merging the synonyms in the text gene set reduces the unnecessary overhead that repeated computation may cause, and solves the problem of redundant representation in the text gene extraction results.
On the basis of the above device embodiment, the device also includes other modules for text gene extraction; see Embodiment six for details.
Embodiment six:
As shown in Figure 8, on the basis of Embodiment five, the target text gene set generation module 203 of the text gene extraction device of this embodiment of the present invention includes:
A filtering submodule 2031, used for filtering the words in the first candidate text gene set and the second candidate text gene set according to word length.
A sorting submodule 2032, used for sorting the words in the filtered first candidate text gene set and second candidate text gene set according to frequency of occurrence.
An extraction submodule 2033, used for extracting, according to the sort order, a predetermined number of words as the target text gene set.
It should be noted that the information exchange between the submodules of the above device, their execution process, and similar details are based on the same concept as method Embodiment three of the present invention. For the specific content, see the description in method Embodiment three, which is not repeated here.
By filtering and sorting the target text gene set obtained, the embodiment of the present invention selects words of suitable length and high frequency of occurrence as the final extracted text genes, so that the text genes express the subject information of the text comprehensively and accurately, which effectively improves the quality of text gene extraction.
Following the above description of the text gene extraction method and device, an electronic equipment embodiment that implements the above method embodiments and device embodiments is given below; see Embodiment seven.
Embodiment seven:
As shown in Figure 9, the embodiment of the present invention provides an electronic equipment 30. The equipment 30 includes one or more processors 301 and a memory 302. Figure 9 takes one processor 301 as an example.
The electronic equipment that performs the text gene extraction method may also include an input device 303 and an output device 304. The processor 301, memory 302, input device 303, and output device 304 may be connected by a bus or in other ways; Figure 9 takes connection by a bus as an example.
Memory 302, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions or modules corresponding to the text gene extraction method in the embodiments of the present invention, for example the first candidate text gene set generation module 101, the second candidate text gene set generation module 102, and the target text gene set generation module 103 shown in Figure 6, or the modules shown in Figure 7. By running the non-volatile software programs, instructions, and modules stored in memory 302, processor 301 executes the various functional applications and data processing of the server, that is, implements the text gene extraction method of the above method embodiments.
Memory 302 may include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function; the data storage area can store data created through use of the search result presentation device, and the like. In addition, memory 302 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 302 optionally includes memory located remotely from processor 301, and such remote memory may be connected to the video preview device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Input device 303 can receive input numeric or character information and generate key signal inputs related to the user settings and function control of the search result presentation device. Output device 304 may include a display device such as a display screen.
The one or more modules are stored in memory 302 and, when executed by the one or more processors 301, perform the text gene extraction method of any of the above method embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
The embodiment of the present invention generates the first candidate text gene set by a dictionary-based approach and the second candidate text gene set by a rule-based approach, and finally obtains the target text gene set from the two candidate text gene sets. Compared with the prior art, which uses a statistics-based or dictionary-based gene mining method alone, the embodiment fully considers the semantics of the text genes, which not only improves the accuracy of text gene extraction but also enriches the available text gene extraction methods. The electronic equipment can perform the method provided by the embodiments of the present invention and has the functional modules and beneficial effects corresponding to performing the method. For technical details not described in detail here, see the method provided by the embodiments of the present invention.
The foregoing is only the preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (11)

1. A text gene extraction method, characterized by including:
Building a text gene dictionary and, according to the text gene dictionary, generating a first candidate text gene set of the text to be extracted;
Generating a second candidate text gene set of the text to be extracted according to a text gene extraction rule;
Generating a target text gene set according to the first candidate text gene set and the second candidate text gene set.
2. The method of claim 1, characterized in that building the text gene dictionary includes:
Preprocessing a text data set to generate a first sentence set;
Performing word segmentation and filtering on each sentence in the first sentence set to generate a first word set;
Combining the words in the first word set to generate a two-tuple set;
Computing the support and the confidence of each two-tuple in the two-tuple set;
Judging whether the support of a two-tuple meets a preset minimum support threshold and whether the confidence of the two-tuple meets a preset minimum confidence threshold; if the support of the two-tuple meets the preset minimum support threshold and the confidence of the two-tuple meets the preset minimum confidence threshold, using the two-tuple to build a relation set;
Matching each two-tuple in the relation set against a predetermined center word set; if the match succeeds, using the other word of the two-tuple, that is, the word paired with the matched word, to build the text gene dictionary.
3. The method of claim 1, characterized in that generating the first candidate text gene set of the text to be extracted according to the text gene dictionary includes:
Preprocessing the text to be extracted to generate a second sentence set;
Performing word segmentation and filtering on the sentences in the second sentence set to generate a second word set;
Matching the second word set against a predetermined center word set; if the match succeeds, generating the first candidate text gene set according to the second word set and the text gene dictionary.
4. The method of claim 3, characterized by also including:
Merging the synonyms in the text gene dictionary to generate a synonym dictionary set;
Counting the frequency with which each word occurs in the first candidate text gene set;
Matching the words in the first candidate text gene set against the synonym dictionary set; if the match succeeds, adding the word to the synonym dictionary set and accumulating the frequencies of the word and of its synonyms.
5. The method of claim 1, characterized in that the method of generating the text gene extraction rule includes:
Preprocessing a text data set to generate a third sentence set;
Matching the third sentence set against a predetermined center word set; if the match succeeds, generating a fourth sentence set;
Matching the fourth sentence set against the text gene dictionary; if the match succeeds, generating a fifth sentence set;
Extracting, from the fifth sentence set, a target word set that lies in a specified direction and satisfies a predetermined condition;
Generating the text gene extraction rule according to the center word set and the target word set.
6. A text gene extraction device, characterized by including:
A first candidate text gene set generation module, used for building a text gene dictionary and, according to the text gene dictionary, generating a first candidate text gene set of the text to be extracted;
A second candidate text gene set generation module, used for generating a second candidate text gene set of the text to be extracted according to a text gene extraction rule;
A target text gene set generation module, used for generating a target text gene set according to the first candidate text gene set and the second candidate text gene set.
7. The device of claim 6, characterized in that the first candidate text gene set generation module includes a first submodule, and the first submodule includes:
A first sentence set generation unit, used for preprocessing a text data set to generate a first sentence set;
A first word set generation unit, used for performing word segmentation and filtering on each sentence in the first sentence set to generate a first word set;
A two-tuple set generation unit, used for combining the words in the first word set in pairs to generate a two-tuple set;
A first computing unit, used for computing the support and the confidence of each two-tuple in the two-tuple set;
A relation set generation unit, used for judging whether the support of a two-tuple meets a preset minimum support threshold and whether the confidence of the two-tuple meets a preset minimum confidence threshold; if the support of the two-tuple meets the preset minimum support threshold and the confidence of the two-tuple meets the preset minimum confidence threshold, the two-tuple is used to build a relation set;
A first processing unit, used for matching each two-tuple in the relation set against a predetermined center word set; if the match succeeds, the other word of the two-tuple, that is, the word paired with the matched word, is used to build the text gene dictionary.
8. The device of claim 6, characterized in that the first candidate text gene set generation module also includes a second submodule, and the second submodule includes:
A second sentence set generation unit, used for preprocessing the text to be extracted to generate a second sentence set;
A second word set generation unit, used for performing word segmentation and filtering on the sentences in the second sentence set to generate a second word set;
A first matching unit, used for matching the second word set against a predetermined center word set; if the match succeeds, the first candidate text gene set is generated according to the second word set and the text gene dictionary.
9. The device of claim 8, characterized in that the second submodule also includes:
A synonym dictionary set generation unit, used for merging the synonyms in the text gene dictionary to generate a synonym dictionary set;
A second computing unit, used for counting the frequency with which each word occurs in the first candidate text gene set;
A synonym merging unit, used for matching the words in the first candidate text gene set against the synonym dictionary set; if the match succeeds, the word is added to the synonym dictionary set, and the frequencies of the word and of its synonyms are accumulated.
10. The device of claim 6, characterized in that the second candidate text gene set generation module includes a text gene extraction rule submodule, and the text gene extraction rule submodule includes:
A third sentence set generation unit, used for preprocessing a text data set to generate a third sentence set;
A second matching unit, used for matching the third sentence set against a predetermined center word set; if the match succeeds, a fourth sentence set is generated;
A third matching unit, used for matching the fourth sentence set against the text gene dictionary; if the match succeeds, a fifth sentence set is generated;
A target word set extraction unit, used for extracting, from the fifth sentence set, a target word set that lies in a specified direction and satisfies a predetermined condition;
A text gene extraction rule unit, used for generating the text gene extraction rule according to the center word set and the target word set.
11. An electronic equipment, characterized by including:
At least one processor; and
A memory; wherein
The memory stores an instruction program executable by the at least one processor, and the instruction program is configured to:
Build a text gene dictionary and, according to the text gene dictionary, generate a first candidate text gene set of the text to be extracted;
Generate a second candidate text gene set of the text to be extracted according to a text gene extraction rule;
Generate a target text gene set according to the first candidate text gene set and the second candidate text gene set.
CN201610622162.8A 2016-08-01 2016-08-01 A kind of text gene extracting method, device and electronic equipment Pending CN106250490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610622162.8A CN106250490A (en) 2016-08-01 2016-08-01 A kind of text gene extracting method, device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610622162.8A CN106250490A (en) 2016-08-01 2016-08-01 A kind of text gene extracting method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN106250490A true CN106250490A (en) 2016-12-21

Family

ID=57605845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610622162.8A Pending CN106250490A (en) 2016-08-01 2016-08-01 A kind of text gene extracting method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN106250490A (en)

CN113886535B (en) Knowledge graph-based question and answer method and device, storage medium and electronic equipment
US20220292126A1 (en) Information management system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161221
