CN106156196A - Apparatus and method for extracting text features - Google Patents

Apparatus and method for extracting text features


Publication number
CN106156196A
CN106156196A
Authority
CN
China
Prior art keywords
part of speech
word
weight
comparison matrix
calculate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510193912.XA
Other languages
Chinese (zh)
Inventor
杨振华
皮冰锋
周恩策
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201510193912.XA
Publication of CN106156196A
Legal status: Pending


Abstract

The present invention relates to an apparatus and method for extracting text features. An apparatus for extracting text features includes: a word segmentation unit configured to segment an input document to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word with its adjacent word; an importance calculation unit configured to calculate the importance of each word; a part-of-speech weight calculation unit configured to calculate the weight of each word's part of speech; a part-of-speech combination weight calculation unit configured to calculate the weight of the part-of-speech combination of each word with its adjacent word; and a text feature extraction unit configured to extract, for each word, the text feature of that word according to its importance, the weight of its part of speech, and the weight of its part-of-speech combination. According to the apparatus and method of the present invention, the contributions of parts of speech and part-of-speech combinations to text features are incorporated into the feature extraction method, so that text information is extracted more fully and real-time data is processed faster.

Description

Apparatus and method for extracting text features
Technical field
The present invention relates to the field of information processing, and more particularly to an apparatus and method for extracting text features.
Background art
As information on the Internet keeps growing, text data is becoming more and more abundant. With the rapid development of networks, convenient channels for obtaining information are available, and the number of electronic documents such as web pages, e-mails, and e-books keeps increasing. While people gain access to a large amount of information, they also have to spend a great deal of time reading and organizing it; obtaining the key information of these texts easily, quickly, and accurately has therefore become extremely important. Because processing at the basic level of Chinese word segmentation is relatively complex, Chinese information extraction technology has lagged behind, so information extraction for Chinese text is increasingly important.
One class of traditional text extraction techniques calculates document word frequencies, that is, computes the document frequency and term frequency of each feature in the training text set. Such computational methods have the following problems: (1) they do not consider the contribution of parts of speech to text features, and (2) they do not take into account the description of text features by semantic structure.
At present, most text feature extraction methods and their refinements use the traditional term frequency-inverse document frequency approach, which computes word frequencies alone and does not introduce the influence of parts of speech and sentence structure on text feature extraction. Moreover, sentence structures differ considerably between text languages, so it is difficult to apply a unified extraction method to different text languages.
Summary of the invention
A brief overview of the present invention is given below in order to provide a basic understanding of some aspects of the present invention. It should be understood that this overview is not an exhaustive summary of the present invention. It is not intended to identify key or critical parts of the present invention, nor to limit the scope of the present invention. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description discussed later.
One main object of the present invention is to provide an apparatus for extracting text features, including: a word segmentation unit configured to segment an input document to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word with its adjacent word; an importance calculation unit configured to calculate the importance of each word; a part-of-speech weight calculation unit configured to calculate the weight of each word's part of speech; a part-of-speech combination weight calculation unit configured to calculate the weight of the part-of-speech combination of each word with its adjacent word; and a text feature extraction unit configured to extract, for each word, the text feature of that word according to its importance, the weight of its part of speech, and the weight of its part-of-speech combination.
According to an aspect of the present invention, there is provided a method of extracting text features, including: segmenting an input document to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word with its adjacent word; calculating the importance of each word; calculating the weight of each word's part of speech; calculating the weight of the part-of-speech combination of each word with its adjacent word; and, for each word, extracting the text feature of that word according to its importance, the weight of its part of speech, and the weight of its part-of-speech combination.
In addition, embodiments of the present invention also provide a computer program for implementing the above method.
Furthermore, embodiments of the present invention also provide a computer program product in at least the form of a computer-readable medium, on which computer program code for implementing the above method is recorded.
These and other advantages of the present invention will become more apparent from the following detailed description of preferred embodiments of the present invention in conjunction with the accompanying drawings.
Brief description of the drawings
The above and other objects, features, and advantages of the present invention can be understood more easily with reference to the description of embodiments of the present invention given below in conjunction with the accompanying drawings. The components in the drawings are only intended to illustrate the principles of the present invention. In the drawings, the same or similar technical features or components are denoted by the same or similar reference signs.
Fig. 1 is a flowchart showing an example process of a method 100 of extracting text features according to an embodiment of the present invention;
Fig. 2 is a flowchart showing an example process of step S106 in Fig. 1;
Fig. 3 shows a concrete example of the process of calculating part-of-speech weights;
Fig. 4 is a flowchart showing an example process of step S108 in Fig. 1;
Fig. 5 is a hierarchy diagram showing parts of speech and part-of-speech combinations;
Fig. 6 shows a system structure diagram of the method of extracting text features according to an embodiment of the present invention;
Fig. 7 is a block diagram showing an example configuration of an apparatus 700 for extracting text features according to an embodiment of the present invention;
Fig. 8 is a block diagram showing an example configuration of the part-of-speech weight calculation unit 706 in Fig. 7;
Fig. 9 is a block diagram showing an example configuration of the part-of-speech combination weight calculation unit 708 in Fig. 7; and
Fig. 10 is an exemplary block diagram showing a computing device that may be used to implement the apparatus and method of the present invention for extracting text features.
Detailed description of the invention
Embodiments of the present invention are described with reference to the accompanying drawings. Elements and features described in one drawing or embodiment of the present invention may be combined with elements and features shown in one or more other drawings or embodiments. It should be noted that, for purposes of clarity, the drawings and description omit representations and descriptions of components and processes that are unrelated to the present invention and known to those of ordinary skill in the art.
The present invention proposes a text feature extraction method based on the relative positions of parts of speech and words and on the importance of words. The present invention differs essentially from traditional methods in its text features: traditional methods only consider the importance of a word, such as term frequency or inverse document frequency, as the text feature. In order to introduce the contribution of parts of speech and the relative positions of words to the features, the present invention calculates the weights of parts of speech and of the relative positions of words, and then combines these two weights with the importance of the word to determine the final text feature. In this way, the contributions of parts of speech and word position information to text features are incorporated into the feature extraction method.
The method and apparatus for extracting text features according to embodiments of the present invention are described in detail below in conjunction with the accompanying drawings.
Fig. 1 is a flowchart showing an example process of a method 100 of extracting text features according to an embodiment of the present invention.
First, in step S102, the input document is segmented to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word with its adjacent word. Here, the part-of-speech combination of each word with its adjacent word represents the relative position information of that word.
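As a concrete illustration (not taken from the patent), the following sketch shows the kind of output step S102 produces: each word is paired with its part of speech and with the part-of-speech combination formed with the following word. The sentence, the tag names, and the pair-with-the-next-word convention are all assumptions.

```python
def pos_combinations(tagged):
    """tagged: list of (word, pos) pairs for one segmented sentence.
    Returns (word, pos, pos-combination) triples, where the combination
    pairs each word's tag with the tag of the following word (None at
    the end of the sentence)."""
    out = []
    for i, (word, pos) in enumerate(tagged):
        nxt = tagged[i + 1][1] if i + 1 < len(tagged) else None
        out.append((word, pos, (pos, nxt)))
    return out

# Hypothetical segmented and tagged sentence:
tagged = [("quick", "adjective"), ("fox", "noun"), ("runs", "verb")]
features = pos_combinations(tagged)
# features[1] == ("fox", "noun", ("noun", "verb"))
```

Each triple carries exactly the three pieces of information the later steps consume: the word itself (for importance), its part of speech (for the part-of-speech weight), and its part-of-speech combination (for the combination weight).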
In one example, segmentation may be carried out based on a hidden Markov model to obtain the words, the parts of speech, and the part-of-speech combination of each word with its adjacent word.
The specific means or manner that may be adopted to segment a document is well known to those skilled in the art and is not repeated here.
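The patent leaves the hidden-Markov-model machinery unspecified, so the following is only a minimal Viterbi decoding sketch over toy data; the two-tag state set and all probabilities below are hypothetical, not values from the patent.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state (tag) sequence for obs under the HMM."""
    # V[t][s] = (best probability of reaching state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p].get(s, 0.0)
                 * emit_p[s].get(obs[t], 0.0), p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]

states = ["N", "V"]
tags = viterbi(
    ["dog", "runs"], states,
    start_p={"N": 0.6, "V": 0.4},
    trans_p={"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}},
    emit_p={"N": {"dog": 0.9, "runs": 0.1}, "V": {"dog": 0.1, "runs": 0.9}},
)
# tags == ["N", "V"]
```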
Next, in step S104, the importance of each word is calculated. Term frequency, term frequency-inverse document frequency (TF-IDF), or the like may be used to represent the importance of a word. In the following description, term frequency-inverse document frequency is used as the example representation of word importance.
Next, in step S106, the weight of the part of speech of each word is calculated.
Fig. 2 is a flowchart showing an example process of step S106 in Fig. 1.
As shown in Fig. 2, when calculating part-of-speech weights, a part-of-speech comparison matrix is first constructed in step S1062. That is, according to the importance of the parts of speech, the relative importance of every pair of parts of speech is compared and scored, and the part-of-speech comparison matrix is constructed from all the scores. In one example, the part-of-speech comparison matrix may be constructed based on the analytic hierarchy process (AHP).
In one example, Table 1 below may be used for the scoring to build the part-of-speech comparison matrix.
Table 1: the 1-9 scoring scale
That is to say, when two factors are equally important, a score of 1 is given; when factor 1 is slightly more important than factor 2, a score of 2 is given; and when factor 2 is then compared with factor 1, a score of 1/2 is given. By analogy, the importance of all parts of speech can be compared pairwise and scored.
The part-of-speech comparison matrix A can then be constructed from all the scores.
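The pairwise scoring just described can be sketched as follows: the upper-triangle judgments are supplied once, the diagonal is fixed at 1, and the lower triangle is filled with reciprocals, as the scoring rule requires. The example judgments are chosen to reproduce the noun/verb/adjective/adverb matrix A used later in the description; the helper name is my own.

```python
def build_comparison_matrix(scores, n):
    """Build an n x n pairwise comparison matrix.

    scores maps (i, j) with i < j to the judged importance of factor i
    over factor j on the 1-9 scale; the diagonal is 1 and the lower
    triangle holds the reciprocals."""
    m = [[1.0] * n for _ in range(n)]
    for (i, j), s in scores.items():
        m[i][j] = float(s)
        m[j][i] = 1.0 / s
    return m

# Judgments for the order (noun, verb, adjective, adverb):
scores = {(0, 1): 1/3, (0, 2): 1/7, (0, 3): 1/3,
          (1, 2): 1/5, (1, 3): 1.0, (2, 3): 3.0}
A = build_comparison_matrix(scores, 4)
# Row 3 of A is [7, 5, 1, 3]: the adjective judgments.
```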
Then, in step S1064, the eigenvector corresponding to the largest eigenvalue of the part-of-speech comparison matrix is calculated. Then, in step S1066, the eigenvector is normalized to obtain the part-of-speech weights.
In one example, before the step of calculating the eigenvector corresponding to the largest eigenvalue of the part-of-speech comparison matrix, a step of determining whether the logic of the part-of-speech comparison matrix holds is also included (not shown in the figure).
In one example, whether the logic of the part-of-speech comparison matrix holds is determined by performing a consistency check on it.
Fig. 3 shows a concrete example of the process of calculating part-of-speech weights.
First, in step S301, the constructed comparison matrix is input.
In step S302, the largest eigenvalue of this matrix is calculated.
In step S303, the consistency index CI = (λmax - m)/(m - 1) is calculated, where m is the order of the matrix; the corresponding average random consistency index RI is looked up, and the consistency ratio CR = CI/RI is then calculated.
In step S304, a judgment is made: when CR < 0.1, the consistency of the matrix is considered acceptable and its logic holds, and step S305 is executed next; if CR < 0.1 is not satisfied, the comparison matrix is adjusted by revising the scores, and steps S302 to S304 above are repeated.
In step S305, the eigenvector corresponding to the largest eigenvalue is calculated.
Finally, in step S306, the calculated eigenvector is normalized to obtain the part-of-speech weights.
After the weight of the part of speech of each word is calculated in step S106, the weight of the part-of-speech combination of each word with its adjacent word is calculated in step S108.
The method of calculating the weight of the part-of-speech combination of each word with its adjacent word is similar to the method of calculating part-of-speech weights. Fig. 4 is a flowchart showing an example process of step S108 in Fig. 1.
First, in step S1082, the relative-position comparison matrix of each part of speech is constructed; that is, the importance of the part-of-speech combinations is compared pairwise and scored, and the part-of-speech combination comparison matrix is constructed from all the scores. Table 1 is also used for the scoring here.
Then, in step S1084, the eigenvector corresponding to the largest eigenvalue of the part-of-speech combination comparison matrix is calculated. Then, in step S1086, the eigenvector is normalized to obtain the weights of the part-of-speech combinations.
In one example, before the step of calculating the eigenvector corresponding to the largest eigenvalue of the part-of-speech combination comparison matrix, a step of determining whether the logic of the part-of-speech combination comparison matrix holds is also included (not shown in the figure).
In one example, whether the logic of the part-of-speech combination comparison matrix holds is determined by performing a consistency check on it.
The weights of the part-of-speech combinations can be calculated by the same method as the calculation of part-of-speech weights shown in Fig. 3.
An exemplary calculation procedure using the above methods to calculate part-of-speech weights and part-of-speech combination weights is described below in conjunction with Fig. 5. Fig. 5 is a hierarchy diagram showing parts of speech and part-of-speech combinations. Here, for illustration, a sentence is assumed to contain only four classes of parts of speech, namely nouns, verbs, adjectives, and adverbs.
A hierarchical structure as shown in Fig. 5 is first built based on the analytic hierarchy process (AHP); according to this hierarchical structure and Table 1 above, the part-of-speech comparison matrix A can be constructed as follows.
A = [ 1     1/3   1/7   1/3
      3     1     1/5   1
      7     5     1     3
      3     1     1/3   1   ]
According to the method described above, the consistency ratio of matrix A can be calculated as CR = (4.06 - 4)/(4 - 1) × (1/0.9) = 0.022 < 0.1, and the part-of-speech weights are thus calculated as [0.065, 0.163, 0.588, 0.183].
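As a check on these figures, the following pure-Python sketch computes the largest eigenvalue of a comparison matrix by power iteration, derives CI and CR from the usual average-random-consistency-index table, and normalizes the eigenvector into weights. It is an illustrative implementation, not the patent's own code.

```python
def ahp_weights(matrix, iters=200):
    """Return (weights, lambda_max, CR) for an m x m comparison matrix, m >= 2."""
    n = len(matrix)
    v = [1.0 / n] * n
    lam = float(n)
    for _ in range(iters):
        w = [sum(matrix[i][k] * v[k] for k in range(n)) for i in range(n)]
        lam = sum(w)              # A·v ~ lam·v once v has converged (sum of v is 1)
        v = [x / lam for x in w]  # renormalize so the entries sum to 1
    ri = {2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24}[n]  # Saaty RI table
    ci = (lam - n) / (n - 1)
    cr = ci / ri if ri else 0.0
    return v, lam, cr

A = [[1, 1/3, 1/7, 1/3],
     [3, 1,   1/5, 1  ],
     [7, 5,   1,   3  ],
     [3, 1,   1/3, 1  ]]
weights, lam, cr = ahp_weights(A)
# weights ~ [0.065, 0.163, 0.588, 0.183], cr ~ 0.022 (< 0.1, so acceptable)
```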
Similarly, the comparison matrix B1 for the combinations of nouns with the other parts of speech can be constructed:
B1 = [ 1     3     1/3   2
       1/3   1     1/5   1/3
       3     5     1     2
       1/2   3     1/2   1   ]
The comparison matrix B2 for the combinations of verbs with the other parts of speech:
B2 = [ 1     1/5   1/5   1
       5     1     1     5
       5     1     1     5
       1     1/5   1/5   1   ]
The comparison matrix B3 for the combinations of adjectives with the other parts of speech:
B3 = [ 1     1/7   1/5   1/2
       7     1     2     3
       5     1/3   1     2
       1     1/3   1/2   1   ]
The comparison matrix B4 for the combinations of adverbs with the other parts of speech:
B4 = [ 1     3     1/2   3
       1/3   1     1/5   1
       2     5     1     5
       1/3   1     1/5   1   ]
For the part-of-speech combination comparison matrices B1 to B4, the consistency ratios CR can be calculated respectively as:
CR(B1) = 0.04 < 0.1,
CR(B2) = 0 < 0.1,
CR(B3) = 0.06 < 0.1,
CR(B4) = 0.0015 < 0.1,
and their weights are thus calculated respectively as:
W(B1) = [0.25, 0.078, 0.48, 0.19],
W(B2) = [0.083, 0.42, 0.42, 0.083],
W(B3) = [0.070, 0.51, 0.28, 0.14],
W(B4) = [0.28, 0.099, 0.52, 0.099].
Finally, in step S110, the text feature of each word is extracted according to the importance of the word, the weight of its part of speech, and the weight of its part-of-speech combination.
In one example, the final text feature can be obtained by multiplying the term frequency-inverse document frequency of the word by the weight of its part of speech and then by the weight of its part-of-speech combination.
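A minimal sketch of this combination rule follows: the word's TF-IDF importance is multiplied by its part-of-speech weight and by its part-of-speech-combination weight. The weight tables, the word, and the TF-IDF value below are hypothetical stand-ins for AHP results, not figures from the patent.

```python
# Hypothetical weights for the four-part-of-speech example:
pos_weight = {"noun": 0.065, "verb": 0.163, "adjective": 0.588, "adverb": 0.183}
combo_weight = {("noun", "verb"): 0.078, ("adjective", "noun"): 0.51}

def text_feature(pos, combo, tfidf):
    """Importance x part-of-speech weight x part-of-speech-combination weight."""
    return tfidf * pos_weight[pos] * combo_weight[combo]

# A noun followed by a verb, with an assumed TF-IDF of 0.30:
f = text_feature("noun", ("noun", "verb"), tfidf=0.30)
# f == 0.30 * 0.065 * 0.078 ~ 0.00152
```

Because the weight tables are computed once from the comparison matrices, scoring each incoming word reduces to two dictionary lookups and two multiplications, which is the source of the speed advantage claimed for real-time data.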
Fig. 6 shows a system structure diagram of the method of extracting text features according to an embodiment of the present invention.
An illustrative method of extracting text features is described below in conjunction with Fig. 6.
First, the input text is segmented using the following equation (1):
X̂ = argmax_X P(X)P(Y|X)/P(Y) = argmax_X P(X)P(Y|X) = argmax_X P(x_1 x_2 ... x_n) P(y_1 y_2 ... y_m | x_1 x_2 ... x_n)    (1)
The words x_i are output together with their parts of speech and position information.
From the obtained words x_i, the term frequency-inverse document frequency can be calculated using the following equation (2):
TfIdf(x_i) = Tf_i × Idf_i = (n(x_i) / Σ_k n(x_k)) × log(|D| / (1 + |{j : x_i ∈ d_j}|))    (2)
In equation (2), Tf_i is the term frequency, which represents the frequency with which a given word occurs in the document: Tf_i = n(x_i) / Σ_k n(x_k), where n(x_i) is the number of times the word x_i occurs in the document, and Σ_k n(x_k) is the total number of occurrences of all words in the document.
Idf_i is the inverse document frequency, a measure of the general importance of a word; it can be obtained by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the quotient.
In formula (2), Idf_i = log(|D| / (1 + |{j : x_i ∈ d_j}|)), where |D| is the total number of documents in the corpus and |{j : x_i ∈ d_j}| is the number of documents containing the word.
Then, the product of Tf_i and Idf_i is calculated to obtain the term frequency-inverse document frequency TfIdf(x_i).
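Equation (2) can be sketched directly. The three-document corpus below is a made-up example; note that because of the 1 + df term in the denominator, a word occurring in many documents can receive a zero or even negative Idf under this formula.

```python
import math

def tf_idf(word, doc, corpus):
    """Equation (2): the word's share of occurrences in doc, times
    log(total documents / (1 + documents containing the word))."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in corpus if word in d)
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf

# Hypothetical three-document corpus, each document a list of words:
corpus = [["fox", "runs"], ["dog", "runs"], ["dog", "sleeps"]]
v = tf_idf("fox", corpus[0], corpus)
# tf = 1/2, idf = log(3/2), so v ~ 0.2027
```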
Then, based on the AHP model, the weight w_charc(x_i) of the part of speech and the weight w_position(x_i) of the part-of-speech combination are calculated respectively, using the methods of calculating part-of-speech weights and part-of-speech combination weights described above.
Finally, the final text feature f(x_i) can be calculated by equation (3):
f(x_i) = w_charc(x_i) × TfIdf(x_i) × w_position(x_i)    (3)
Those skilled in the art will understand that the formula for calculating the text feature f(x_i) is not limited to the above equation (3); for example, the following equation (4) or (5) may also be used:
f(x_i) = TfIdf(x_i) × (1/2) × (w_charc(x_i) + w_position(x_i))    (4)
f(x_i) = TfIdf(x_i) × (m · w_charc(x_i) + n · w_position(x_i))    (5)
where m and n are arbitrary integers.
Fig. 7 is a block diagram showing an example configuration of an apparatus 700 for extracting text features according to an embodiment of the present invention.
As shown in Fig. 7, the apparatus 700 for extracting text features includes a word segmentation unit 702, an importance calculation unit 704, a part-of-speech weight calculation unit 706, a part-of-speech combination weight calculation unit 708, and a text feature extraction unit 710.
The word segmentation unit 702 is configured to segment an input document to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word with its adjacent word.
The importance calculation unit 704 is configured to calculate the importance of each word.
The part-of-speech weight calculation unit 706 is configured to calculate the weight of the part of speech of each word.
The part-of-speech combination weight calculation unit 708 is configured to calculate the weight of the part-of-speech combination of each word with its adjacent word.
The text feature extraction unit 710 is configured to extract, for each word, the text feature of that word according to its importance, the weight of its part of speech, and the weight of its part-of-speech combination.
Fig. 8 is a block diagram showing an example configuration of the part-of-speech weight calculation unit 706 in Fig. 7.
As shown in Fig. 8, the part-of-speech weight calculation unit 706 includes a first comparison matrix construction subunit 7062, a first eigenvector construction subunit 7064, and a part-of-speech weight calculation subunit 7066.
The first comparison matrix construction subunit 7062 is configured to compare the importance of the parts of speech pairwise and score them, constructing a first comparison matrix.
The first eigenvector construction subunit 7064 is configured to calculate the first eigenvector corresponding to the largest eigenvalue of the first comparison matrix.
The part-of-speech weight calculation subunit 7066 is configured to normalize the first eigenvector to obtain the weights of the parts of speech.
Fig. 9 is a block diagram showing an example configuration of the part-of-speech combination weight calculation unit 708 in Fig. 7.
As shown in Fig. 9, the part-of-speech combination weight calculation unit 708 includes a second comparison matrix construction subunit 7082, a second eigenvector construction subunit 7084, and a part-of-speech combination weight calculation subunit 7086.
The second comparison matrix construction subunit 7082 is configured to compare the importance of the part-of-speech combinations pairwise and score them, constructing a second comparison matrix.
The second eigenvector construction subunit 7084 is configured to calculate the second eigenvector corresponding to the largest eigenvalue of the second comparison matrix.
The part-of-speech combination weight calculation subunit 7086 is configured to normalize the second eigenvector to obtain the weights of the part-of-speech combinations.
In one example, the part-of-speech weight calculation unit 706 also includes a first logic determination subunit (not shown in the figure). The first logic determination subunit is configured to determine whether the logic of the first comparison matrix holds.
In one example, the part-of-speech combination weight calculation unit 708 also includes a second logic determination subunit (not shown in the figure). The second logic determination subunit is configured to determine whether the logic of the second comparison matrix holds.
In one example, the first logic determination subunit is configured to determine whether the logic of the first comparison matrix holds by performing a consistency check on the first comparison matrix.
In one example, the second logic determination subunit is configured to determine whether the logic of the second comparison matrix holds by performing a consistency check on the second comparison matrix.
In one example, the word segmentation unit 702 is configured to carry out segmentation based on a hidden Markov model.
In one example, the first comparison matrix construction subunit 7062 is configured to construct the first comparison matrix based on the analytic hierarchy process.
In one example, the second comparison matrix construction subunit 7082 is configured to construct the second comparison matrix based on the analytic hierarchy process.
In one example, the importance calculation unit 704 is configured to calculate the term frequency-inverse document frequency of each word.
The text feature extraction unit 710 is configured to extract, for each word, the text feature of that word by multiplying the term frequency-inverse document frequency of the word by the weight of its part of speech and then by the weight of its part-of-speech combination.
Details of the operations and functions of the various parts of the apparatus 700 for extracting text features can be found in the embodiments of the method of extracting text features of the present invention described in conjunction with Figs. 1-6, and are not described in detail here.
It should be noted here that the structures of the apparatus 700 for extracting text features and its constituent units shown in Figs. 7-9 are merely exemplary, and those skilled in the art may modify the structural block diagrams shown in Figs. 7-9 as required.
The present invention proposes a text feature extraction method based on parts of speech, part-of-speech combinations, and the importance of words. The present invention has the following advantages:
(1) The positions of parts of speech and words (i.e., part-of-speech combinations) are used to reflect the features of the text, remedying the defect that term frequency and inverse document frequency (TF-IDF) features alone capture information insufficiently for information retrieval.
(2) The analytic hierarchy process is introduced, so that different languages, and different people's understanding of language, can be incorporated into the feature extraction process.
(3) The weights of parts of speech and part-of-speech combinations are precalculated, so real-time data is processed faster.
The basic principles of the present invention have been described above in conjunction with specific embodiments. However, it should be noted that those of ordinary skill in the art will understand that all or any steps or components of the method and apparatus of the present invention can be implemented in hardware, firmware, software, or a combination thereof, in any computing device (including processors, storage media, etc.) or network of computing devices; this can be achieved by those of ordinary skill in the art using their basic programming skills after having read the description of the present invention.
Therefore, the object of the present invention can also be realized by running a program or a set of programs on any computing device. The computing device may be a known general-purpose device. Therefore, the object of the present invention can also be realized merely by providing a program product containing program code that implements the method or apparatus. That is to say, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium can be any known storage medium or any storage medium developed in the future.
In the case where embodiments of the present invention are realized by software and/or firmware, a program constituting the software is installed from a storage medium or a network into a computer with a dedicated hardware structure, such as the general-purpose computer 1000 shown in Fig. 10, which, when various programs are installed, is capable of performing various functions and the like.
In Fig. 10, a central processing unit (CPU) 1001 performs various processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (RAM) 1003. In the RAM 1003, data required when the CPU 1001 performs various processes and the like is also stored as needed. The CPU 1001, the ROM 1002, and the RAM 1003 are linked to each other via a bus 1004. An input/output interface 1005 is also linked to the bus 1004.
The following components are linked to the input/output interface 1005: an input section 1006 (including a keyboard, a mouse, etc.), an output section 1007 (including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker, etc.), the storage section 1008 (including a hard disk, etc.), and a communication section 1009 (including a network interface card such as a LAN card, a modem, etc.). The communication section 1009 performs communication processing via a network such as the Internet. As required, a drive 1010 may also be linked to the input/output interface 1005. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed in the drive 1010 as required, so that a computer program read therefrom is installed into the storage section 1008 as needed.
In the case where the above series of processes is realized by software, a program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1011.
Those skilled in the art will understand that such a storage medium is not limited to the removable medium 1011 shown in Fig. 10, in which the program is stored and which is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1011 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1002, a hard disk contained in the storage section 1008, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
The present invention also proposes the program product that a kind of storage has the instruction code of machine-readable.Instruction generation When code is read by machine and performed, above-mentioned method according to embodiments of the present invention can be performed.
Correspondingly, for carrying the depositing of program product that above-mentioned storage has the instruction code of machine-readable Storage media is also included within disclosure of the invention.Storage medium includes but not limited to floppy disk, CD, magnetic CD, storage card, memory stick etc..
It should be appreciated by those skilled in the art that enumerated at this is exemplary, the present invention is also It is not limited to this.
In this manual, the statement such as " first ", " second " and " n-th " is in order to by institute The feature described distinguishes, so that the present invention is explicitly described on word.Therefore, should not serve to There is any determinate implication.
As an example, each step of said method and all modules of the said equipment and / or unit may be embodied as software, firmware, hardware or a combination thereof, and as in relevant device Point.In said apparatus, all modules, unit are by software, firmware, hardware or the side of a combination thereof When formula configures, spendable specific means or mode are well known to those skilled in the art, at this not Repeat again.
As an example, in the case of implementation by software or firmware, a program constituting the software may be installed from a storage medium or a network into a computer having a dedicated hardware structure (for example, the general-purpose computer 1000 shown in Figure 10), and the computer, when installed with various programs, is capable of performing various functions and the like.
In the above description of specific embodiments of the present invention, features described and/or shown for one embodiment may be used in one or more other embodiments in the same or a similar manner, combined with features in other embodiments, or substituted for features in other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, specifies the presence of features, elements, steps, or components, but does not preclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the methods of the present invention are not limited to being performed in the temporal order described in the specification; they may also be performed in other temporal orders, in parallel, or independently. Therefore, the execution order of the methods described in this specification does not limit the technical scope of the present invention.
It should be understood that various changes, substitutions, and alterations can be made to the present invention and its advantages without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present invention is not limited to the specific embodiments of the processes, devices, means, methods, and steps described in the specification. One of ordinary skill in the art will readily appreciate from the disclosure that processes, devices, means, methods, or steps presently existing or later to be developed that perform substantially the same function as, or achieve substantially the same result as, the corresponding embodiments herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include such processes, devices, means, methods, or steps within their scope.
Based on the above description, it can be seen that the present disclosure at least provides the following notes:
1. A device for extracting text features, comprising:
a word segmentation unit configured to segment an input document to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word and its adjacent word;
an importance calculation unit configured to calculate the degree of importance of each word;
a part-of-speech weight calculation unit configured to calculate the weight of the part of speech of each word;
a part-of-speech combination weight calculation unit configured to calculate the weight of the part-of-speech combination of each word and its adjacent word; and
a text feature extraction unit configured to extract, for each word, the text feature of the word according to its degree of importance, the weight of its part of speech, and the weight of its part-of-speech combination.
2. The device according to Note 1, wherein the part-of-speech weight calculation unit comprises:
a first comparison matrix construction subunit configured to compare the degrees of importance of the parts of speech pairwise and score them, thereby constructing a first comparison matrix;
a first eigenvector construction subunit configured to calculate the first eigenvector corresponding to the maximum eigenvalue of the first comparison matrix; and
a part-of-speech weight calculation subunit configured to normalize the first eigenvector to obtain the weights of the parts of speech.
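As an illustration of this pairwise-comparison step, here is a minimal Python sketch: build a comparison matrix, approximate the eigenvector of its maximum eigenvalue by power iteration, and normalize it into weights. The parts of speech and the 1-9 pairwise scores below are invented for illustration and do not come from the patent.

```python
def principal_eigenvector(matrix, iterations=100):
    """Approximate the eigenvector of the maximum eigenvalue by power iteration."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        top = max(w)
        v = [x / top for x in w]
    return v

# Hypothetical first comparison matrix over three parts of speech, scored
# pairwise on a 1-9 importance scale (reciprocal entries below the diagonal).
pos_tags = ["noun", "verb", "adjective"]
M = [
    [1.0,     3.0,     5.0],  # noun  vs. noun, verb, adjective
    [1 / 3.0, 1.0,     3.0],  # verb
    [1 / 5.0, 1 / 3.0, 1.0],  # adjective
]

v = principal_eigenvector(M)
weights = {tag: x / sum(v) for tag, x in zip(pos_tags, v)}  # sums to 1
```

Normalizing the principal eigenvector so that its entries sum to 1 is the standard analytic-hierarchy-process recipe, consistent with the AHP-based construction mentioned further below.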
3. The device according to Note 2, wherein the part-of-speech combination weight calculation unit comprises:
a second comparison matrix construction subunit configured to compare the degrees of importance of the part-of-speech combinations pairwise and score them, thereby constructing a second comparison matrix;
a second eigenvector construction subunit configured to calculate the second eigenvector corresponding to the maximum eigenvalue of the second comparison matrix; and
a part-of-speech combination weight calculation subunit configured to normalize the second eigenvector to obtain the weights of the part-of-speech combinations.
4. The device according to Note 3, wherein the part-of-speech weight calculation unit further comprises a first logic determination subunit configured to determine whether the logic of the first comparison matrix holds, and the part-of-speech combination weight calculation unit further comprises a second logic determination subunit configured to determine whether the logic of the second comparison matrix holds.
5. The device according to Note 4, wherein the first logic determination subunit is further configured to determine whether the logic of the first comparison matrix holds by performing a consistency check on the first comparison matrix, and the second logic determination subunit is further configured to determine whether the logic of the second comparison matrix holds by performing a consistency check on the second comparison matrix.
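One common way to decide whether the "logic" of such a matrix holds is the AHP consistency ratio: the consistency index of the matrix is divided by the average index of random reciprocal matrices of the same order, and the matrix passes when the ratio falls below a threshold (conventionally 0.1). The sketch below follows that convention; both example matrices are invented for illustration.

```python
def max_eigenvalue(matrix, iterations=100):
    """Estimate the maximum eigenvalue of a positive matrix by power iteration."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        top = max(w)
        v = [x / top for x in w]
    w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(w[i] / v[i] for i in range(n)) / n  # Rayleigh-style estimate

# Saaty's average consistency index of random reciprocal matrices of order n.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}

def is_consistent(matrix, threshold=0.1):
    """True if the consistency ratio CR = CI / RI is below the threshold."""
    n = len(matrix)
    if n <= 2:
        return True  # 1x1 and 2x2 reciprocal matrices are always consistent
    ci = (max_eigenvalue(matrix) - n) / (n - 1)  # consistency index
    return ci / RANDOM_INDEX[n] < threshold

# A nearly consistent matrix passes; a cyclic "A > B > C > A" matrix fails.
ok = is_consistent([[1.0, 3.0, 5.0], [1 / 3, 1.0, 3.0], [1 / 5, 1 / 3, 1.0]])
bad = is_consistent([[1.0, 9.0, 1 / 9], [1 / 9, 1.0, 9.0], [9.0, 1 / 9, 1.0]])
```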
6. The device according to Note 1, wherein the word segmentation unit is further configured to perform the word segmentation based on a hidden Markov model.
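A hidden Markov segmenter typically tags each character with one of the states B/M/E/S (begin, middle, or end of a word, or a single-character word) and decodes the best tag sequence with the Viterbi algorithm. The sketch below works in log probabilities; the tiny model parameters are invented so the example runs, whereas a real segmenter would estimate them from a tagged corpus.

```python
import math

STATES = ("B", "M", "E", "S")  # begin / middle / end of word, single-char word

def viterbi(chars, start_p, trans_p, emit_p):
    """Most likely BMES tag sequence for `chars` (log-probability Viterbi)."""
    floor = -1e9  # stand-in for log(0)
    prob = {s: start_p.get(s, floor) + emit_p[s].get(chars[0], floor)
            for s in STATES}
    path = {s: [s] for s in STATES}
    for ch in chars[1:]:
        new_prob, new_path = {}, {}
        for s in STATES:
            best, prev = max(
                (prob[p] + trans_p[p].get(s, floor) + emit_p[s].get(ch, floor), p)
                for p in STATES)
            new_prob[s], new_path[s] = best, path[prev] + [s]
        prob, path = new_prob, new_path
    return path[max(STATES, key=lambda s: prob[s])]

def tags_to_words(chars, tags):
    """Cut the character sequence into words at E/S boundaries."""
    words, start = [], 0
    for i, t in enumerate(tags):
        if t in ("E", "S"):
            words.append("".join(chars[start:i + 1]))
            start = i + 1
    return words

# Invented toy parameters: "xy" behaves like a two-character word, "z" stands alone.
lp = math.log
start_p = {"B": lp(0.6), "S": lp(0.4)}
trans_p = {"B": {"M": lp(0.3), "E": lp(0.7)},
           "M": {"M": lp(0.3), "E": lp(0.7)},
           "E": {"B": lp(0.5), "S": lp(0.5)},
           "S": {"B": lp(0.5), "S": lp(0.5)}}
emit_p = {"B": {"x": lp(0.8), "y": lp(0.1), "z": lp(0.1)},
          "M": {"x": lp(1 / 3), "y": lp(1 / 3), "z": lp(1 / 3)},
          "E": {"x": lp(0.1), "y": lp(0.8), "z": lp(0.1)},
          "S": {"x": lp(0.1), "y": lp(0.1), "z": lp(0.8)}}

tags = viterbi("xyz", start_p, trans_p, emit_p)
words = tags_to_words("xyz", tags)
```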
7. The device according to Note 3, wherein the first comparison matrix construction subunit is further configured to construct the first comparison matrix based on the analytic hierarchy process (AHP), and the second comparison matrix construction subunit is further configured to construct the second comparison matrix based on the analytic hierarchy process (AHP).
8. The device according to Note 1, wherein the importance calculation unit is further configured to calculate the term frequency-inverse document frequency of each word.
9. The device according to Note 8, wherein the text feature extraction unit is further configured to: for each word, extract the text feature of the word by multiplying the term frequency-inverse document frequency of the word by the weight of its part of speech and further by the weight of its part-of-speech combination.
10. A method for extracting text features, comprising:
segmenting an input document to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word and its adjacent word;
calculating the degree of importance of each word;
calculating the weight of the part of speech of each word;
calculating the weight of the part-of-speech combination of each word and its adjacent word; and
extracting, for each word, the text feature of the word according to its degree of importance, the weight of its part of speech, and the weight of its part-of-speech combination.
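The steps above can be sketched end to end. The segmentation output, the importance values, and both weight tables below are placeholders: in the described method they would come from an HMM segmenter, from TF-IDF, and from the AHP-derived weight vectors, respectively.

```python
def extract_features(words, tags, importance, pos_weight, pair_weight):
    """Score each word: importance x POS weight x POS-combination weight."""
    features = {}
    for i, (word, tag) in enumerate(zip(words, tags)):
        # POS combination of this word with its adjacent (next) word, if any.
        pair = (tag, tags[i + 1]) if i + 1 < len(tags) else None
        features[word] = (importance[word]
                          * pos_weight[tag]
                          * pair_weight.get(pair, 1.0))
    return features

# Hypothetical inputs for a three-word document.
words = ["server", "crashed", "badly"]
tags = ["noun", "verb", "adverb"]
importance = {"server": 0.40, "crashed": 0.30, "badly": 0.10}   # e.g. TF-IDF
pos_weight = {"noun": 0.6, "verb": 0.3, "adverb": 0.1}          # first matrix
pair_weight = {("noun", "verb"): 0.7, ("verb", "adverb"): 0.3}  # second matrix

features = extract_features(words, tags, importance, pos_weight, pair_weight)
```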
11. The method according to Note 10, wherein the weight of the part of speech is calculated by:
comparing the degrees of importance of the parts of speech pairwise and scoring them, thereby constructing a first comparison matrix;
calculating the first eigenvector corresponding to the maximum eigenvalue of the first comparison matrix; and
normalizing the first eigenvector to obtain the weight of the part of speech.
12. The method according to Note 11, wherein the weight of the part-of-speech combination is calculated by:
comparing the degrees of importance of the part-of-speech combinations pairwise and scoring them, thereby constructing a second comparison matrix;
calculating the second eigenvector corresponding to the maximum eigenvalue of the second comparison matrix; and
normalizing the second eigenvector to obtain the weight of the part-of-speech combination.
13. The method according to Note 12, wherein whether the logic of the first comparison matrix holds is determined before the first eigenvector is calculated from the first comparison matrix, and whether the logic of the second comparison matrix holds is determined before the second eigenvector is calculated from the second comparison matrix.
14. The method according to Note 13, wherein whether the logic of the first comparison matrix or the second comparison matrix holds is determined by performing a consistency check on the first comparison matrix or the second comparison matrix against a random matrix.
15. The method according to Note 10, wherein the word segmentation is performed based on a hidden Markov model.
16. The method according to Note 12, wherein the first comparison matrix and the second comparison matrix are constructed based on the analytic hierarchy process (AHP).
17. The method according to Note 10, wherein calculating the degree of importance of each word comprises calculating the term frequency-inverse document frequency of the word.
18. The method according to Note 17, wherein, for each word, the text feature of the word is obtained by multiplying the term frequency-inverse document frequency of the word by the weight of its part of speech and further by the weight of its part-of-speech combination.
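The importance measure above is plain term frequency-inverse document frequency. The sketch below uses one common formulation (one of several; the exact formula is not fixed here), with an invented miniature corpus.

```python
import math

def tf_idf(word, doc, corpus):
    """TF-IDF of `word` in `doc`, where `corpus` is a list of token lists."""
    tf = doc.count(word) / len(doc)           # term frequency in this document
    df = sum(1 for d in corpus if word in d)  # documents containing the word
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["weight", "of", "part", "of", "speech"],
          ["weight", "matrix"],
          ["speech", "feature"]]
score = tf_idf("of", corpus[0], corpus)  # frequent here, absent elsewhere
```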

Claims (10)

1. A device for extracting text features, comprising:
a word segmentation unit configured to segment an input document to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word and its adjacent word;
an importance calculation unit configured to calculate the degree of importance of each word;
a part-of-speech weight calculation unit configured to calculate the weight of the part of speech of each word;
a part-of-speech combination weight calculation unit configured to calculate the weight of the part-of-speech combination of each word and its adjacent word; and
a text feature extraction unit configured to extract, for each word, the text feature of the word according to its degree of importance, the weight of its part of speech, and the weight of its part-of-speech combination.
2. The device according to claim 1, wherein the part-of-speech weight calculation unit comprises:
a first comparison matrix construction subunit configured to compare the degrees of importance of the parts of speech pairwise and score them, thereby constructing a first comparison matrix;
a first eigenvector construction subunit configured to calculate the first eigenvector corresponding to the maximum eigenvalue of the first comparison matrix; and
a part-of-speech weight calculation subunit configured to normalize the first eigenvector to obtain the weights of the parts of speech.
3. The device according to claim 2, wherein the part-of-speech combination weight calculation unit comprises:
a second comparison matrix construction subunit configured to compare the degrees of importance of the part-of-speech combinations pairwise and score them, thereby constructing a second comparison matrix;
a second eigenvector construction subunit configured to calculate the second eigenvector corresponding to the maximum eigenvalue of the second comparison matrix; and
a part-of-speech combination weight calculation subunit configured to normalize the second eigenvector to obtain the weights of the part-of-speech combinations.
4. The device according to claim 3, wherein the part-of-speech weight calculation unit further comprises a first logic determination subunit configured to determine whether the logic of the first comparison matrix holds, and the part-of-speech combination weight calculation unit further comprises a second logic determination subunit configured to determine whether the logic of the second comparison matrix holds.
5. The device according to claim 4, wherein the first logic determination subunit is further configured to determine whether the logic of the first comparison matrix holds by performing a consistency check on the first comparison matrix, and the second logic determination subunit is further configured to determine whether the logic of the second comparison matrix holds by performing a consistency check on the second comparison matrix.
6. The device according to claim 1, wherein the word segmentation unit is further configured to perform the word segmentation based on a hidden Markov model.
7. The device according to claim 3, wherein the first comparison matrix construction subunit is further configured to construct the first comparison matrix based on the analytic hierarchy process (AHP), and the second comparison matrix construction subunit is further configured to construct the second comparison matrix based on the analytic hierarchy process (AHP).
8. The device according to claim 1, wherein the importance calculation unit is further configured to calculate the term frequency-inverse document frequency of each word.
9. The device according to claim 8, wherein the text feature extraction unit is further configured to: for each word, extract the text feature of the word by multiplying the term frequency-inverse document frequency of the word by the weight of its part of speech and further by the weight of its part-of-speech combination.
10. A method for extracting text features, comprising:
segmenting an input document to obtain a plurality of words, the part of speech of each word, and the part-of-speech combination of each word and its adjacent word;
calculating the degree of importance of each word;
calculating the weight of the part of speech of each word;
calculating the weight of the part-of-speech combination of each word and its adjacent word; and
extracting, for each word, the text feature of the word according to its degree of importance, the weight of its part of speech, and the weight of its part-of-speech combination.
CN201510193912.XA 2015-04-22 2015-04-22 Apparatus and method for extracting text features Pending CN106156196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510193912.XA CN106156196A (en) 2015-04-22 2015-04-22 Apparatus and method for extracting text features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510193912.XA CN106156196A (en) 2015-04-22 2015-04-22 Apparatus and method for extracting text features

Publications (1)

Publication Number Publication Date
CN106156196A true CN106156196A (en) 2016-11-23

Family

ID=57346298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510193912.XA Pending CN106156196A (en) 2015-04-22 2015-04-22 Extract the apparatus and method of text feature

Country Status (1)

Country Link
CN (1) CN106156196A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598945A * 2016-12-02 2017-04-26 北京小米移动软件有限公司 Template inspection method and device
CN108170668A * 2017-12-01 2018-06-15 厦门快商通信息技术有限公司 Character positioning method and computer-readable storage medium
CN108363790A * 2018-02-12 2018-08-03 百度在线网络技术(北京)有限公司 Method, apparatus, device and storage medium for evaluating comments
CN108733653A * 2018-05-18 2018-11-02 华中科技大学 Sentiment analysis method using a Skip-gram model fusing part-of-speech and semantic information
CN109190123A * 2018-09-14 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for outputting information
CN110147421A * 2019-05-10 2019-08-20 腾讯科技(深圳)有限公司 Target entity linking method, apparatus, device and storage medium
CN110413956A * 2018-04-28 2019-11-05 南京云问网络技术有限公司 Text similarity calculation method based on bootstrapping

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727487A * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Method and system for identifying opinion topics oriented to online comments
CN103123624A * 2011-11-18 2013-05-29 阿里巴巴集团控股有限公司 Method and device for determining a head word, and search method and device
CN104199811A * 2014-09-10 2014-12-10 携程计算机技术(上海)有限公司 Method and system for building a short-sentence parsing model
WO2015019723A1 * 2013-08-07 2015-02-12 シャープ株式会社 Information processing device, information processing method, information processing program, information processing system, and electronic device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727487A * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Method and system for identifying opinion topics oriented to online comments
CN103123624A * 2011-11-18 2013-05-29 阿里巴巴集团控股有限公司 Method and device for determining a head word, and search method and device
WO2015019723A1 * 2013-08-07 2015-02-12 シャープ株式会社 Information processing device, information processing method, information processing program, information processing system, and electronic device
CN104199811A * 2014-09-10 2014-12-10 携程计算机技术(上海)有限公司 Method and system for building a short-sentence parsing model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU Weisheng et al.: "Microblog sentiment classification based on feature extraction from part-of-speech tagging sequences", Journal of Computer Applications (《计算机应用》) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598945A * 2016-12-02 2017-04-26 北京小米移动软件有限公司 Template inspection method and device
CN106598945B * 2016-12-02 2019-08-06 北京小米移动软件有限公司 Template inspection method and device
CN108170668A * 2017-12-01 2018-06-15 厦门快商通信息技术有限公司 Character positioning method and computer-readable storage medium
CN108363790A * 2018-02-12 2018-08-03 百度在线网络技术(北京)有限公司 Method, apparatus, device and storage medium for evaluating comments
WO2019153737A1 * 2018-02-12 2019-08-15 百度在线网络技术(北京)有限公司 Comment assessing method, device, equipment and storage medium
US11403680B2 2018-02-12 2022-08-02 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus for evaluating review, device and storage medium
CN110413956A * 2018-04-28 2019-11-05 南京云问网络技术有限公司 Text similarity calculation method based on bootstrapping
CN110413956B * 2018-04-28 2023-08-01 南京云问网络技术有限公司 Text similarity calculation method based on bootstrapping
CN108733653A * 2018-05-18 2018-11-02 华中科技大学 Sentiment analysis method using a Skip-gram model fusing part-of-speech and semantic information
CN108733653B * 2018-05-18 2020-07-10 华中科技大学 Sentiment analysis method of Skip-gram model based on fusion of part-of-speech and semantic information
CN109190123B * 2018-09-14 2020-03-27 北京字节跳动网络技术有限公司 Method and apparatus for outputting information
CN109190123A * 2018-09-14 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for outputting information
CN110147421B * 2019-05-10 2022-06-21 腾讯科技(深圳)有限公司 Target entity linking method, device, equipment and storage medium
CN110147421A * 2019-05-10 2019-08-20 腾讯科技(深圳)有限公司 Target entity linking method, apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
CN106156196A (en) Apparatus and method for extracting text features
Khuc et al. Towards building large-scale distributed systems for twitter sentiment analysis
CN110378409A (en) A method for generating abstractive summaries of Chinese-Vietnamese news documents based on an element-association attention mechanism
CN106055623A (en) Cross-language recommendation method and system
Ferrández et al. Aligning FrameNet and WordNet based on Semantic Neighborhoods.
KR101717230B1 (en) Document summarization method using recursive autoencoder based sentence vector modeling and document summarization system
CN107092605A (en) Entity linking method and device
CN106202065A (en) Cross-language topic detection method and system
Wang et al. Representing document as dependency graph for document clustering
Alian et al. Arabic semantic similarity approaches-review
CN109063184A (en) Multilingual news text clustering method, storage medium and terminal device
Zhan et al. Survey on event extraction technology in information extraction research area
Khan et al. Genetic semantic graph approach for multi-document abstractive summarization
Utomo et al. New instances classification framework on Quran ontology applied to question answering system
CN103761225B (en) A data-driven method for computing semantic similarity of Chinese words
Emu et al. An efficient approach for keyphrase extraction from english document
Yajian et al. A short text classification algorithm based on semantic extension
Galitsky et al. Improving text retrieval efficiency with pattern structures on parse thickets
CN107784112A (en) Short text data enhancement method and system, and detection and authentication service platform
Korobkin et al. Prior art candidate search on base of statistical and semantic patent analysis
Prasad et al. Document summarization and information extraction for generation of presentation slides
Hu et al. Residual-duet network with tree dependency representation for chinese question-answering sentiment analysis
Volkovskiy et al. Mathematical model for automatic creation the semantic thesaurus for the scientific text
Tsumuraya et al. Semantic Search of Japanese Sentences Using Distributed Representations
CN103678355A (en) Text mining method and text mining device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161123

WD01 Invention patent application deemed withdrawn after publication