CN109408797A - Text sentence vector representation method and system - Google Patents
Text sentence vector representation method and system
- Publication number
- CN109408797A (application CN201710712075.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- lexical item
- weight
- parameter
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 40
- 238000004422 calculation algorithm Methods 0.000 claims description 19
- 238000004590 computer program Methods 0.000 claims description 10
- 230000001052 transient effect Effects 0.000 claims description 7
- 238000012545 processing Methods 0.000 abstract description 6
- 238000004364 calculation method Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 238000004891 communication Methods 0.000 description 2
- 238000010276 construction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000010365 information processing Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- 238000005303 weighing Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/157—Transformation using dictionaries or tables
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a text sentence vector representation method, comprising: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text; S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item. The text sentence vector representation method and system provided by the invention use preset relevance parameters between lexical items and text categories to measure the degree of correlation between each lexical item in a text and the category to which the text belongs, so that the weighted sentence vector information is complete and free of redundancy, improving text processing accuracy.
Description
Technical field
The present invention relates to the field of text information processing, and more particularly to a text sentence vector representation method and system.
Background art
With the rapid development of the Internet and mobile networks and the speed with which information spreads online, more and more users choose to communicate and share information with others through Internet platforms, and a large portion of the information on the network is text. How to process this text information effectively is a current research hotspot. Text representation is a key step that cannot be neglected in document information retrieval: it refers to the process of converting human-readable text into a data structure that a computer can recognize, and it is a basic problem in the field of text information processing. In general, the word content of a text is converted into a vector representation, so that natural language becomes computable in specific fields such as text classification, similarity calculation and pattern recognition.
Text information is generally represented by means of a sentence vector. The method used in the prior art is as follows: the text to be processed is first preprocessed, for example by word segmentation and stop-word removal; the words in the text are then converted into word vectors using the word2vec tool; finally, the word vectors are summed to obtain a single vector.
The prior art simply adds the word vectors to obtain a single vector. This approach fails to consider the relevance between lexical items and text categories, causing information loss and thereby degrading text processing performance to a certain extent.
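For orientation, the prior-art pipeline described above (segmentation, stop-word removal, word2vec lookup, unweighted summation) can be sketched roughly as follows. This is an illustrative sketch only, not code from the patent: it assumes the gensim implementation of word2vec and the jieba segmenter, and the stop-word list is a hypothetical placeholder.

```python
# Sketch of the prior-art baseline criticized above: segment the text, drop stop
# words, look up each word's word2vec vector, and sum the vectors without weights.
import numpy as np
import jieba                         # assumed Chinese word segmenter
from gensim.models import Word2Vec   # assumed word2vec implementation

STOP_WORDS = {"的", "了", "是"}       # hypothetical stop-word list

def baseline_sentence_vector(text: str, w2v: Word2Vec) -> np.ndarray:
    terms = [t for t in jieba.lcut(text) if t not in STOP_WORDS]
    vectors = [w2v.wv[t] for t in terms if t in w2v.wv]
    if not vectors:
        return np.zeros(w2v.vector_size)
    return np.sum(vectors, axis=0)   # plain unweighted sum, as in the prior art
```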
Summary of the invention
The present invention provides a text sentence vector representation method and system that overcome, or at least partially solve, the above problem.
According to an aspect of the present invention, a text sentence vector representation method is provided, comprising:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
Wherein, step S2 includes:
S21, calculating, based on the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
S22, calculating the weight in the text of each lexical item contained in the text based on the correlation degree value and the TF-IDF weighting algorithm.
Wherein, step S22 specifically includes calculating the weight from the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω_ik is the weight of the lexical item in the text.
Wherein, step S2 further comprises:
calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters.
Wherein, the preset part-of-speech parameters include:
a first part-of-speech parameter, a second part-of-speech parameter and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value range of the first part-of-speech parameter is greater than 0 and less than 1.
Wherein, calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
calculating, from the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
obtaining, based on the part of speech to which each lexical item contained in the text belongs, the corresponding part-of-speech parameter value;
calculating the weight in the text of each lexical item contained in the text based on the correlation degree value, the part-of-speech parameter value and the TF-IDF weighting algorithm.
Wherein, calculating the weight in the text of each lexical item contained in the text based on the degree of correlation, the part-of-speech parameter and the TF-IDF weighting algorithm specifically includes calculating the weight from the part-of-speech parameter value θ_k, the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω'_ik is the weight of the lexical item in the text.
According to a second aspect of the present invention, a text sentence vector representation system is provided, comprising:
an obtaining module, configured to obtain each lexical item contained in a text and the category to which the text belongs;
a computing module, configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
a determining module, configured to determine the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
According to a third aspect of the present invention, a computer program product is provided, comprising program code for executing the text sentence vector representation method described above.
According to a fourth aspect of the present invention, a non-transitory computer-readable storage medium is provided for storing the computer program described above.
The text sentence vector representation method and system provided by the present invention use preset relevance parameters between lexical items and text categories to measure the degree of correlation between each lexical item in a text and the category to which the text belongs, so that the weighted sentence vector information is complete and free of redundancy, improving text processing accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of a text sentence vector representation method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of a text sentence vector representation system provided by an embodiment of the present invention.
Specific embodiment
Specific embodiments of the present invention will now be described in further detail with reference to the accompanying drawings and examples. The following embodiments are intended to illustrate the present invention, not to limit its scope.
Fig. 1 is a flowchart of a text sentence vector representation method provided by an embodiment of the present invention. As shown in Fig. 1, the method comprises:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
In the prior art, words are converted into word vectors using the word2vec tool, and the word vectors are then weighted with a simple TF-IDF scheme to obtain the text sentence vector.
However, the TF-IDF method used in the prior art is only a statistical measure for assessing how important a lexical item (term) is to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the degree of relevance between documents and user queries.
It can be understood that the TF-IDF weighting algorithm used in the prior art judges the weight of a word in a text simply from the frequency with which the word appears in the text. This weighting method fails to consider the relevance between lexical items and text categories, which causes information loss to a certain extent and thus degrades the text processing effect.
In S1, word vectors for the lexical items are likewise obtained using the word2vec tool, but the embodiment of the present invention additionally labels the category of the text to be processed according to a preset text classification standard.
In S2, on the basis of the TF-IDF weighting algorithm, the embodiment of the present invention introduces a preset relevance parameter c between a lexical item and the text category to measure the degree of correlation between the lexical item and the text category: the higher the degree of correlation, the larger the weight assigned to the lexical item.
For example, when the text category is traffic, the lexical item "high-speed rail" is strongly associated with texts of the traffic category, so its c value is large and the lexical item "high-speed rail" can be judged to carry a high weight in traffic texts. Likewise, when the text category is sports, the degree of correlation between the lexical item "high-speed rail" and sports is low, so its c value is small and the lexical item "high-speed rail" can be judged to carry a small weight in sports texts.
In S3, after the weight in the text of each lexical item contained in the text has been calculated based on the preset relevance parameters between lexical items and text categories, the word vectors of the lexical items in the text are summed according to their respective weights to obtain the text sentence vector.
In the corresponding formula, S_i is the text sentence vector, H_i is the number of lexical items contained in sentence S_i, p_ik = (V_1k, V_2k, ..., V_dk, ..., V_Tk) denotes the word vector of the k-th word in the i-th sentence, and T denotes the dimension of the word vector.
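The weighted summation of S3 can be sketched as below. This assumes the sentence vector is the plain weighted sum of the H_i word vectors, S_i = Σ_k ω_ik · p_ik; the exact formula in the patent is not reproduced in this text and may, for example, additionally normalize the sum.

```python
import numpy as np

def sentence_vector(word_vectors: list[np.ndarray], weights: list[float]) -> np.ndarray:
    """Assumed form of S_i: the weighted sum of the word vectors p_ik with weights w_ik."""
    assert len(word_vectors) == len(weights)
    s = np.zeros(word_vectors[0].shape[0])   # T, the word-vector dimension
    for p_ik, w_ik in zip(word_vectors, weights):
        s += w_ik * p_ik
    return s
```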
It can be understood that the text sentence vector determined by the embodiment of the present invention uses weights determined from the relevance parameters between lexical items and text categories, which preserves the text information as completely as possible and thereby improves the accuracy and effect of text processing.
For example, for two texts S_v1 and S_v2, the sentence vector representation method provided by the embodiment of the present invention makes it possible to compute the cosine similarity between the two texts from their sentence vectors, so as to measure the correlation between the two texts.
The cosine similarity is calculated as sim(S_v1, S_v2) = cos(S_v1, S_v2).
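A minimal sketch of this cosine-similarity comparison, using NumPy:

```python
import numpy as np

def cosine_similarity(s_v1: np.ndarray, s_v2: np.ndarray) -> float:
    """sim(S_v1, S_v2) = cos(S_v1, S_v2) = (S_v1 . S_v2) / (||S_v1|| * ||S_v2||)."""
    denom = np.linalg.norm(s_v1) * np.linalg.norm(s_v2)
    return float(np.dot(s_v1, s_v2) / denom) if denom else 0.0
```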
The text sentence vector representation method provided by the embodiment of the present invention uses preset relevance parameters between lexical items and text categories to measure the degree of correlation between each lexical item in a text and the category to which the text belongs, so that the weighted sentence vector information is complete and free of redundancy, improving text processing accuracy.
On the basis of the above embodiment, step S2 includes:
S21, calculating, based on the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
S22, calculating the weight in the text of each lexical item contained in the text based on the correlation degree value and the TF-IDF weighting algorithm.
On the basis of the above embodiment, step S2 includes:
S21, calculating, based on the preset relevance parameters between lexical items and text categories, the degree of correlation between each lexical item contained in the text and the text category to which the text belongs;
S22, calculating the weight in the text of each lexical item contained in the text based on the degree of correlation and the TF-IDF weighting algorithm.
In S21, the embodiment of the present invention proposes the relevance parameter c between a lexical item and a text category to express the degree of correlation between the lexical item and the text category. The expression for c is defined in terms of the following quantities:
A denotes the number of texts of class C_i that contain lexical item t, plus 1; B denotes the number of texts not of class C_i that contain lexical item t, plus 1; C denotes the number of texts of class C_i that do not contain lexical item t, plus 1; D denotes the number of texts not of class C_i that do not contain lexical item t, plus 1; and N denotes the total number of texts.
Through this expression, the embodiment of the present invention establishes a relationship between lexical items and text categories via the value of c: the higher the value of c, the higher the degree of correlation between the lexical item and the text category; conversely, the lower the value of c, the lower the degree of correlation between the lexical item and the text category.
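The counts A, B, C and D can be gathered from a category-labeled corpus as in the sketch below. The exact expression that combines them into c is not reproduced in this text, so the log-odds-style combination at the end of the sketch is only a hypothetical placeholder chosen to grow when a lexical item co-occurs with the category more often than with other categories, as the surrounding description requires.

```python
import math

def correlation_degree(term: str, category: str, corpus: list[tuple[set[str], str]]) -> float:
    """corpus: list of (set of lexical items in a text, category label); N = len(corpus).
    A, B, C, D follow the definitions above, each incremented by 1."""
    A = 1 + sum(1 for terms, cat in corpus if cat == category and term in terms)
    B = 1 + sum(1 for terms, cat in corpus if cat != category and term in terms)
    C = 1 + sum(1 for terms, cat in corpus if cat == category and term not in terms)
    D = 1 + sum(1 for terms, cat in corpus if cat != category and term not in terms)
    # Hypothetical placeholder for the patent's expression for c: a log-odds-ratio
    # style combination that is larger for terms associated with the category.
    return math.log((A * D) / (B * C))
```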
Thus, from the value of the correlation degree c together with the prior-art TF-IDF weighting algorithm, the weight in the text of each lexical item contained in the text can be calculated.
In the TF-IDF weighting algorithm, tf_ik is the frequency with which lexical item k occurs in text i, N is the total number of texts, n_k is the number of texts containing lexical item k, and log(N/n_k) denotes the inverse document frequency, so that the TF-IDF weight of lexical item k in text i is tf_ik · log(N/n_k).
On the basis of the above embodiment, step S22 specifically includes calculating the weight from the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω_ik is the weight of the lexical item in the text.
It can be understood that the embodiment of the present invention combines the degree of correlation between lexical items and text categories with the TF-IDF weighting algorithm as a linear combination: the relevance parameter between lexical items and text categories is introduced into the TF-IDF weighting formula to form a new weight calculation formula, with k, n_k, N, ω_ik, c_k, tf_ik and log(N/n_k) defined as above.
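A sketch of the resulting weighting follows, assuming the new weight is the standard TF-IDF weight tf_ik · log(N/n_k) scaled by the correlation degree value c_k; since the formula itself is not reproduced in this text, this multiplicative form is an assumption.

```python
import math

def term_weight(tf_ik: float, n_k: int, N: int, c_k: float) -> float:
    """Assumed form of the improved weight: w_ik = c_k * tf_ik * log(N / n_k),
    i.e. the TF-IDF weight scaled by the term-category correlation degree c_k."""
    return c_k * tf_ik * math.log(N / n_k)   # log(N/n_k) is the inverse document frequency
```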
On the basis of the above embodiments, step S2 further comprises:
calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters.
It should be noted that calculating the weight of each lexical item from the relevance parameter between lexical items and text categories, as described above, already captures the relationship between lexical items and text categories well. Preferably, in addition to the relationship between lexical items and text categories, the embodiment of the present invention provides a part-of-speech parameter to further optimize the weight distribution.
Part of speech refers to the classification of words according to their grammatical nature; common parts of speech include nouns, verbs, adjectives, adverbs, prepositions and so on. Words of different parts of speech differ in their importance to a text: in general, nouns are relatively important for a text, so words of different parts of speech should carry different weights in the text. The scheme provided by the embodiment of the present invention therefore takes the contribution of part of speech to the text into account and proposes a part-of-speech parameter to optimize the weight calculation process.
On the basis of calculating the degree of correlation between lexical items and text categories, the embodiment of the present invention preferably introduces the part-of-speech parameter to further improve the precision of the weight calculation, thereby ensuring that the text information is complete without causing information redundancy.
On the basis of the above embodiments, the preset part-of-speech parameters include:
a first part-of-speech parameter, a second part-of-speech parameter and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value range of the first part-of-speech parameter is greater than 0 and less than 1.
It can be understood that parts of speech can preferably be grouped into three classes according to the importance they exhibit in a text, with a different part-of-speech parameter set for each class: the first, second and third part-of-speech parameters respectively. Since the first part-of-speech parameter is greater than the second and the second is greater than the third, words of the first part-of-speech class are the most important in the text, followed by the second class, and finally the third class.
Since these parameters are coefficient factors, their value range should lie between 0 and 1.
Preferably, in the embodiment of the present invention, nouns are assigned to the first part of speech; verbs, adjectives, adverbs and prepositions are assigned to the second part of speech; and all remaining parts of speech are assigned to the third part of speech. Accordingly, the preset part-of-speech parameter θ takes the value α (the first part-of-speech parameter) for words of the first part of speech, β (the second part-of-speech parameter) for words of the second part of speech, and η (the third part-of-speech parameter) for all other words.
Preferably, 0 ≤ η ≤ β ≤ α ≤ 1.
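The part-of-speech parameter assignment can be sketched directly from the grouping above; the POS tag strings and the numeric defaults below are illustrative assumptions only, since the patent does not fix concrete values for α, β and η.

```python
FIRST_POS = {"noun"}                                         # assigned parameter alpha
SECOND_POS = {"verb", "adjective", "adverb", "preposition"}  # assigned parameter beta

def pos_parameter(pos_tag: str, alpha: float = 0.9, beta: float = 0.6, eta: float = 0.3) -> float:
    """theta = alpha for nouns, beta for verbs/adjectives/adverbs/prepositions,
    and eta for all other parts of speech, with 0 <= eta <= beta <= alpha <= 1."""
    if pos_tag in FIRST_POS:
        return alpha
    if pos_tag in SECOND_POS:
        return beta
    return eta
```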
On the basis of the above embodiments, calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
calculating, from the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
obtaining, based on the part of speech to which each lexical item contained in the text belongs, the corresponding part-of-speech parameter value;
calculating the weight in the text of each lexical item contained in the text based on the correlation degree value, the part-of-speech parameter value and the TF-IDF weighting algorithm.
It can be understood that the embodiment of the present invention further calculates the weight by combining the part-of-speech parameter with the above scheme based on the degree of correlation and the TF-IDF weighting algorithm.
Each lexical item contained in the text is classified by part of speech and accordingly assigned to the first, second or third part-of-speech class; the weight is then determined according to the values of the first, second and third part-of-speech parameters.
It can therefore be understood that, since the first part-of-speech parameter is greater than the second part-of-speech parameter, the weight of a word corresponding to the first part-of-speech parameter is relatively large; conversely, the third part-of-speech parameter is the smallest, so the weight of a lexical item corresponding to the third part-of-speech parameter is the smallest.
On the basis of the above embodiments, calculating the weight in the text of each lexical item contained in the text based on the degree of correlation, the part-of-speech parameter and the TF-IDF weighting algorithm specifically includes calculating the weight from the part-of-speech parameter value θ_k, the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω'_ik is the weight of the lexical item in the text.
It can be understood that the embodiment of the present invention combines the degree of correlation between lexical items and text categories, the part-of-speech parameter and the TF-IDF weighting algorithm as a linear combination: the relevance parameter between lexical items and text categories and the part-of-speech parameter are introduced into the TF-IDF weighting formula to form a new weight calculation formula, with ω'_ik, θ_k, c_k, tf_ik and log(N/n_k) defined as above.
It should be noted that there is no ordering between the computation of θ_k and that of c_k; in the calculation the two act simultaneously as factors affecting the weight: the larger c_k is, the larger the weight, and likewise the larger θ_k is, the larger the weight.
Fig. 2 is a structural diagram of a text sentence vector representation system provided by an embodiment of the present invention. As shown in Fig. 2, the system comprises an obtaining module 1, a computing module 2 and a determining module 3, wherein:
the obtaining module 1 is configured to obtain each lexical item contained in a text and the category to which the text belongs;
the computing module 2 is configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
the determining module 3 is configured to determine the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
Specifically, the obtaining module 1 obtains each lexical item contained in the text and its word vector using the word2vec tool, and also obtains the category to which the text belongs.
The computing module 2 calculates the weight in the text of each lexical item contained in the text from the relevance parameters between lexical items and text categories; preferably, the computing module may also calculate the weight in the text of each lexical item contained in the text based on the relevance parameters between lexical items and text categories together with preset part-of-speech parameters.
The determining module 3 sums the word vectors of the words in the text according to their respective weights to determine the text sentence vector.
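A rough sketch of how the three modules of Fig. 2 might be composed is given below. The correlation function (a callable taking a term and a category, for example the earlier sketch with the corpus bound in), the part-of-speech function, the relative term frequency and the multiplicative weight are the same assumptions used in the earlier sketches, not the patent's own implementation.

```python
import numpy as np
from collections import Counter

class SentenceVectorSystem:
    """Illustrative composition of the obtaining, computing and determining modules."""

    def __init__(self, w2v, correlation, pos_param):
        self.w2v = w2v                  # word2vec model (used by the obtaining module)
        self.correlation = correlation  # term/category correlation function, c_k
        self.pos_param = pos_param      # part-of-speech parameter function, theta_k

    def represent(self, terms_with_pos, category, N, doc_freq):
        """terms_with_pos: list of (term, pos_tag); doc_freq maps a term to n_k."""
        counts = Counter(t for t, _ in terms_with_pos)
        vectors, weights = [], []
        for t, pos in terms_with_pos:
            if t not in self.w2v.wv:
                continue
            tf_ik = counts[t] / len(terms_with_pos)          # relative term frequency
            w = (self.pos_param(pos) * self.correlation(t, category)
                 * tf_ik * np.log(N / doc_freq[t]))          # assumed weight form
            vectors.append(self.w2v.wv[t])
            weights.append(w)
        if not vectors:
            return np.zeros(self.w2v.vector_size)
        return np.sum([w * v for w, v in zip(weights, vectors)], axis=0)
```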
The text sentence vector representation system provided by the embodiment of the present invention uses preset relevance parameters between lexical items and text categories to measure the degree of correlation between each lexical item in a text and the category to which the text belongs, so that the weighted sentence vector information is complete and free of redundancy, improving text processing accuracy.
An embodiment of the present invention provides a text sentence vector representation system, comprising: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method provided by each of the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text; S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
This embodiment discloses a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by each of the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text; S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
This embodiment provides a non-transitory computer-readable storage medium storing computer instructions which cause a computer to perform the method provided by each of the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text; S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks or optical disks.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solution, or the part of it that contributes over the prior art, may essentially be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the method described in each embodiment or in certain parts of the embodiments.
Finally, the method of the present application is only a preferred embodiment and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A text sentence vector representation method, characterized in that it comprises:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
2. The method according to claim 1, wherein step S2 comprises:
S21, calculating, based on the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
S22, calculating the weight in the text of each lexical item contained in the text based on the correlation degree value and the TF-IDF weighting algorithm.
3. The method according to claim 2, characterized in that step S22 specifically comprises calculating the weight from the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω_ik is the weight of the lexical item in the text.
4. The method according to claim 1, wherein step S2 further comprises:
calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters.
5. The method according to claim 4, characterized in that the preset part-of-speech parameters include:
a first part-of-speech parameter, a second part-of-speech parameter and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value range of the first part-of-speech parameter is greater than 0 and less than 1.
6. The method according to claim 5, characterized in that calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
calculating, from the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
obtaining, based on the part of speech to which each lexical item contained in the text belongs, the corresponding part-of-speech parameter value;
calculating the weight in the text of each lexical item contained in the text based on the correlation degree value, the part-of-speech parameter value and the TF-IDF weighting algorithm.
7. The method according to claim 6, characterized in that calculating the weight in the text of each lexical item contained in the text based on the degree of correlation, the part-of-speech parameter and the TF-IDF weighting algorithm specifically comprises calculating the weight from the part-of-speech parameter value θ_k, the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω'_ik is the weight of the lexical item in the text.
8. A text sentence vector representation system, characterized in that it comprises:
an obtaining module, configured to obtain each lexical item contained in a text and the category to which the text belongs;
a computing module, configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
a determining module, configured to determine the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
9. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions which cause a computer to perform the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710712075.6A CN109408797A (en) | 2017-08-18 | 2017-08-18 | Text sentence vector representation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710712075.6A CN109408797A (en) | 2017-08-18 | 2017-08-18 | Text sentence vector representation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109408797A true CN109408797A (en) | 2019-03-01 |
Family
ID=65463188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710712075.6A Pending CN109408797A (en) | | Text sentence vector representation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408797A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114398469A (en) * | 2021-12-10 | 2022-04-26 | 北京百度网讯科技有限公司 | Method and device for determining search term weight and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11143892A (en) * | 1997-11-07 | 1999-05-28 | Fujitsu Ltd | Device and method for keyword weight generation and program storage medium |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN105224695A (en) * | 2015-11-12 | 2016-01-06 | 中南大学 | A kind of text feature quantization method based on information entropy and device and file classification method and device |
JP2016001399A (en) * | 2014-06-11 | 2016-01-07 | 日本電信電話株式会社 | Relevance determination device, model learning device, method, and program |
- 2017-08-18: CN application CN201710712075.6A filed; published as CN109408797A (en), status Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11143892A (en) * | 1997-11-07 | 1999-05-28 | Fujitsu Ltd | Device and method for keyword weight generation and program storage medium |
JP2016001399A (en) * | 2014-06-11 | 2016-01-07 | 日本電信電話株式会社 | Relevance determination device, model learning device, method, and program |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN105224695A (en) * | 2015-11-12 | 2016-01-06 | 中南大学 | A kind of text feature quantization method based on information entropy and device and file classification method and device |
Non-Patent Citations (1)
Title |
---|
Zhang Xiaochuan et al.: "An improved text representation algorithm based on the vector space model", Journal of Chongqing University of Technology (Natural Science) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Integrating extractive and abstractive models for long text summarization | |
US10061766B2 (en) | Systems and methods for domain-specific machine-interpretation of input data | |
US9613024B1 (en) | System and methods for creating datasets representing words and objects | |
US10042896B2 (en) | Providing search recommendation | |
US8423546B2 (en) | Identifying key phrases within documents | |
US11288453B1 (en) | Key-word identification | |
US10496756B2 (en) | Sentence creation system | |
CN109299280B (en) | Short text clustering analysis method and device and terminal equipment | |
Gacitua et al. | Relevance-based abstraction identification: technique and evaluation | |
WO2024078105A1 (en) | Method for extracting technical problem in patent literature and related device | |
US20230282018A1 (en) | Generating weighted contextual themes to guide unsupervised keyphrase relevance models | |
WO2024015323A1 (en) | Methods and systems for improved document processing and information retrieval | |
Ma et al. | Stream-based live public opinion monitoring approach with adaptive probabilistic topic model | |
CN114997288A (en) | Design resource association method | |
CN109284389A (en) | A kind of information processing method of text data, device | |
Ertam et al. | Abstractive text summarization using deep learning with a new Turkish summarization benchmark dataset | |
CN113761875B (en) | Event extraction method and device, electronic equipment and storage medium | |
Zheng et al. | An adaptive LDA optimal topic number selection method in news topic identification | |
Phan et al. | Applying skip-gram word estimation and SVM-based classification for opinion mining Vietnamese food places text reviews | |
Quan et al. | Combine sentiment lexicon and dependency parsing for sentiment classification | |
Visser et al. | Sentiment and intent classification of in-text citations using bert | |
CN109408797A (en) | Text sentence vector representation method and system | |
CN110929513A (en) | Text-based label system construction method and device | |
Yuan et al. | Personalized sentence generation using generative adversarial networks with author-specific word usage | |
Kong et al. | Construction of microblog-specific chinese sentiment lexicon based on representation learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190301 |