CN109408797A - Text sentence vector representation method and system - Google Patents
Text sentence vector representation method and system
- Publication number
- CN109408797A (application CN201710712075.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- lexical item
- weight
- parameter
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 40
- 238000004422 calculation algorithm Methods 0.000 claims description 19
- 238000004590 computer program Methods 0.000 claims description 10
- 230000001052 transient effect Effects 0.000 claims description 7
- 238000012545 processing Methods 0.000 abstract description 6
- 238000004364 calculation method Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 238000004891 communication Methods 0.000 description 2
- 238000010276 construction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000010365 information processing Effects 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- 238000005303 weighing Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/157—Transformation using dictionaries or tables
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a text sentence vector representation method, comprising: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text; S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item. The text sentence vector representation method and system provided by the invention use preset relevance parameters between lexical items and text categories to measure the degree of correlation between each lexical item in a text and the category to which the text belongs, so that the weighted sentence vector information is complete and free of redundancy, improving text processing accuracy.
Description
Technical field
The present invention relates to the field of text information processing, and more particularly to a text sentence vector representation method and system.
Background art
With the rapid development of the Internet and mobile networks and the speed with which information spreads online, more and more users choose to communicate and share information with others through Internet platforms, and a large portion of the information on the network is text. How to process this text information effectively is a current research hotspot. Text representation is a key step that cannot be neglected in document information retrieval: it refers to the process of converting human-readable text into a data structure that a computer can recognize, and it is a basic problem in the field of text information processing. In general, the word content of a text is converted into a vector representation, so that natural language becomes computable in specific fields such as text classification, similarity calculation and pattern recognition.
Text information is generally represented by means of a sentence vector. The method used in the prior art is as follows: the text to be processed is first preprocessed, for example by word segmentation and stop-word removal; the words in the text are then converted into word vectors using the word2vec tool; finally, the word vectors are summed to obtain a single vector.
The prior art simply adds the word vectors to obtain a single vector. This approach fails to consider the relevance between lexical items and text categories, causing information loss and thereby degrading text processing performance to a certain extent.
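For orientation, the prior-art pipeline described above (segmentation, stop-word removal, word2vec lookup, unweighted summation) can be sketched roughly as follows. This is an illustrative sketch only, not code from the patent: it assumes the gensim implementation of word2vec and the jieba segmenter, and the stop-word list is a hypothetical placeholder.

```python
# Sketch of the prior-art baseline criticized above: segment the text, drop stop
# words, look up each word's word2vec vector, and sum the vectors without weights.
import numpy as np
import jieba                         # assumed Chinese word segmenter
from gensim.models import Word2Vec   # assumed word2vec implementation

STOP_WORDS = {"的", "了", "是"}       # hypothetical stop-word list

def baseline_sentence_vector(text: str, w2v: Word2Vec) -> np.ndarray:
    terms = [t for t in jieba.lcut(text) if t not in STOP_WORDS]
    vectors = [w2v.wv[t] for t in terms if t in w2v.wv]
    if not vectors:
        return np.zeros(w2v.vector_size)
    return np.sum(vectors, axis=0)   # plain unweighted sum, as in the prior art
```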
Summary of the invention
The present invention provides a text sentence vector representation method and system that overcome, or at least partially solve, the above problem.
According to an aspect of the present invention, a text sentence vector representation method is provided, comprising:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
Wherein, step S2 includes:
S21, calculating, based on the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
S22, calculating the weight in the text of each lexical item contained in the text based on the correlation degree value and the TF-IDF weighting algorithm.
Wherein, step S22 specifically includes calculating the weight from the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω_ik is the weight of the lexical item in the text.
Wherein, step S2 further comprises:
calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters.
Wherein, the preset part-of-speech parameters include:
a first part-of-speech parameter, a second part-of-speech parameter and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value range of the first part-of-speech parameter is greater than 0 and less than 1.
Wherein, calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
calculating, from the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
obtaining, based on the part of speech to which each lexical item contained in the text belongs, the corresponding part-of-speech parameter value;
calculating the weight in the text of each lexical item contained in the text based on the correlation degree value, the part-of-speech parameter value and the TF-IDF weighting algorithm.
Wherein, calculating the weight in the text of each lexical item contained in the text based on the degree of correlation, the part-of-speech parameter and the TF-IDF weighting algorithm specifically includes calculating the weight from the part-of-speech parameter value θ_k, the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω'_ik is the weight of the lexical item in the text.
According to a second aspect of the present invention, a text sentence vector representation system is provided, comprising:
an obtaining module, configured to obtain each lexical item contained in a text and the category to which the text belongs;
a computing module, configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
a determining module, configured to determine the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
According to a third aspect of the present invention, a computer program product is provided, comprising program code for executing the text sentence vector representation method described above.
According to a fourth aspect of the present invention, a non-transitory computer-readable storage medium is provided for storing the computer program described above.
The text sentence vector representation method and system provided by the present invention use preset relevance parameters between lexical items and text categories to measure the degree of correlation between each lexical item in a text and the category to which the text belongs, so that the weighted sentence vector information is complete and free of redundancy, improving text processing accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of a text sentence vector representation method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of a text sentence vector representation system provided by an embodiment of the present invention.
Specific embodiment
Specific embodiments of the present invention will now be described in further detail with reference to the accompanying drawings and examples. The following embodiments are intended to illustrate the present invention, not to limit its scope.
Fig. 1 is a flowchart of a text sentence vector representation method provided by an embodiment of the present invention. As shown in Fig. 1, the method comprises:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
In the prior art, words are converted into word vectors using the word2vec tool, and the word vectors are then weighted with a simple TF-IDF scheme to obtain the text sentence vector.
However, the TF-IDF method used in the prior art is only a statistical measure for assessing how important a lexical item (term) is to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the degree of relevance between documents and user queries.
It can be understood that the TF-IDF weighting algorithm used in the prior art judges the weight of a word in a text simply from the frequency with which the word appears in the text. This weighting method fails to consider the relevance between lexical items and text categories, which causes information loss to a certain extent and thus degrades the text processing effect.
In S1, word vectors for the lexical items are likewise obtained using the word2vec tool, but the embodiment of the present invention additionally labels the category of the text to be processed according to a preset text classification standard.
In S2, on the basis of the TF-IDF weighting algorithm, the embodiment of the present invention introduces a preset relevance parameter c between a lexical item and the text category to measure the degree of correlation between the lexical item and the text category: the higher the degree of correlation, the larger the weight assigned to the lexical item.
For example, when the text category is traffic, the lexical item "high-speed rail" is strongly associated with texts of the traffic category, so its c value is large and the lexical item "high-speed rail" can be judged to carry a high weight in traffic texts. Likewise, when the text category is sports, the degree of correlation between the lexical item "high-speed rail" and sports is low, so its c value is small and the lexical item "high-speed rail" can be judged to carry a small weight in sports texts.
In S3, after the weight in the text of each lexical item contained in the text has been calculated based on the preset relevance parameters between lexical items and text categories, the word vectors of the lexical items in the text are summed according to their respective weights to obtain the text sentence vector.
In the corresponding formula, S_i is the text sentence vector, H_i is the number of lexical items contained in sentence S_i, p_ik = (V_1k, V_2k, ..., V_dk, ..., V_Tk) denotes the word vector of the k-th word in the i-th sentence, and T denotes the dimension of the word vector.
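The weighted summation of S3 can be sketched as below. This assumes the sentence vector is the plain weighted sum of the H_i word vectors, S_i = Σ_k ω_ik · p_ik; the exact formula in the patent is not reproduced in this text and may, for example, additionally normalize the sum.

```python
import numpy as np

def sentence_vector(word_vectors: list[np.ndarray], weights: list[float]) -> np.ndarray:
    """Assumed form of S_i: the weighted sum of the word vectors p_ik with weights w_ik."""
    assert len(word_vectors) == len(weights)
    s = np.zeros(word_vectors[0].shape[0])   # T, the word-vector dimension
    for p_ik, w_ik in zip(word_vectors, weights):
        s += w_ik * p_ik
    return s
```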
It can be understood that the text sentence vector determined by the embodiment of the present invention uses weights determined from the relevance parameters between lexical items and text categories, which preserves the text information as completely as possible and thereby improves the accuracy and effect of text processing.
For example, for two texts S_v1 and S_v2, the sentence vector representation method provided by the embodiment of the present invention makes it possible to compute the cosine similarity between the two texts from their sentence vectors, so as to measure the correlation between the two texts.
The cosine similarity is calculated as sim(S_v1, S_v2) = cos(S_v1, S_v2).
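A minimal sketch of this cosine-similarity comparison, using NumPy:

```python
import numpy as np

def cosine_similarity(s_v1: np.ndarray, s_v2: np.ndarray) -> float:
    """sim(S_v1, S_v2) = cos(S_v1, S_v2) = (S_v1 . S_v2) / (||S_v1|| * ||S_v2||)."""
    denom = np.linalg.norm(s_v1) * np.linalg.norm(s_v2)
    return float(np.dot(s_v1, s_v2) / denom) if denom else 0.0
```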
The text sentence vector representation method provided by the embodiment of the present invention uses preset relevance parameters between lexical items and text categories to measure the degree of correlation between each lexical item in a text and the category to which the text belongs, so that the weighted sentence vector information is complete and free of redundancy, improving text processing accuracy.
On the basis of the above embodiment, step S2 includes:
S21, calculating, based on the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
S22, calculating the weight in the text of each lexical item contained in the text based on the correlation degree value and the TF-IDF weighting algorithm.
On the basis of the above embodiment, step S2 includes:
S21, calculating, based on the preset relevance parameters between lexical items and text categories, the degree of correlation between each lexical item contained in the text and the text category to which the text belongs;
S22, calculating the weight in the text of each lexical item contained in the text based on the degree of correlation and the TF-IDF weighting algorithm.
In S21, the embodiment of the present invention proposes the relevance parameter c between a lexical item and a text category to express the degree of correlation between the lexical item and the text category. The expression for c is defined in terms of the following quantities:
A denotes the number of texts of class C_i that contain lexical item t, plus 1; B denotes the number of texts not of class C_i that contain lexical item t, plus 1; C denotes the number of texts of class C_i that do not contain lexical item t, plus 1; D denotes the number of texts not of class C_i that do not contain lexical item t, plus 1; and N denotes the total number of texts.
Through this expression, the embodiment of the present invention establishes a relationship between lexical items and text categories via the value of c: the higher the value of c, the higher the degree of correlation between the lexical item and the text category; conversely, the lower the value of c, the lower the degree of correlation between the lexical item and the text category.
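The counts A, B, C and D can be gathered from a category-labeled corpus as in the sketch below. The exact expression that combines them into c is not reproduced in this text, so the log-odds-style combination at the end of the sketch is only a hypothetical placeholder chosen to grow when a lexical item co-occurs with the category more often than with other categories, as the surrounding description requires.

```python
import math

def correlation_degree(term: str, category: str, corpus: list[tuple[set[str], str]]) -> float:
    """corpus: list of (set of lexical items in a text, category label); N = len(corpus).
    A, B, C, D follow the definitions above, each incremented by 1."""
    A = 1 + sum(1 for terms, cat in corpus if cat == category and term in terms)
    B = 1 + sum(1 for terms, cat in corpus if cat != category and term in terms)
    C = 1 + sum(1 for terms, cat in corpus if cat == category and term not in terms)
    D = 1 + sum(1 for terms, cat in corpus if cat != category and term not in terms)
    # Hypothetical placeholder for the patent's expression for c: a log-odds-ratio
    # style combination that is larger for terms associated with the category.
    return math.log((A * D) / (B * C))
```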
Thus, from the value of the correlation degree c together with the prior-art TF-IDF weighting algorithm, the weight in the text of each lexical item contained in the text can be calculated.
In the TF-IDF weighting algorithm, tf_ik is the frequency with which lexical item k occurs in text i, N is the total number of texts, n_k is the number of texts containing lexical item k, and log(N/n_k) denotes the inverse document frequency, so that the TF-IDF weight of lexical item k in text i is tf_ik · log(N/n_k).
On the basis of the above embodiment, step S22 specifically includes calculating the weight from the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω_ik is the weight of the lexical item in the text.
It can be understood that the embodiment of the present invention combines the degree of correlation between lexical items and text categories with the TF-IDF weighting algorithm as a linear combination: the relevance parameter between lexical items and text categories is introduced into the TF-IDF weighting formula to form a new weight calculation formula, with k, n_k, N, ω_ik, c_k, tf_ik and log(N/n_k) defined as above.
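A sketch of the resulting weighting follows, assuming the new weight is the standard TF-IDF weight tf_ik · log(N/n_k) scaled by the correlation degree value c_k; since the formula itself is not reproduced in this text, this multiplicative form is an assumption.

```python
import math

def term_weight(tf_ik: float, n_k: int, N: int, c_k: float) -> float:
    """Assumed form of the improved weight: w_ik = c_k * tf_ik * log(N / n_k),
    i.e. the TF-IDF weight scaled by the term-category correlation degree c_k."""
    return c_k * tf_ik * math.log(N / n_k)   # log(N/n_k) is the inverse document frequency
```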
On the basis of the above embodiments, step S2 further comprises:
calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters.
It should be noted that calculating the weight of each lexical item from the relevance parameter between lexical items and text categories, as described above, already captures the relationship between lexical items and text categories well. Preferably, in addition to the relationship between lexical items and text categories, the embodiment of the present invention provides a part-of-speech parameter to further optimize the weight distribution.
Part of speech refers to the classification of words according to their grammatical nature; common parts of speech include nouns, verbs, adjectives, adverbs, prepositions and so on. Words of different parts of speech differ in their importance to a text: in general, nouns are relatively important for a text, so words of different parts of speech should carry different weights in the text. The scheme provided by the embodiment of the present invention therefore takes the contribution of part of speech to the text into account and proposes a part-of-speech parameter to optimize the weight calculation process.
On the basis of calculating the degree of correlation between lexical items and text categories, the embodiment of the present invention preferably introduces the part-of-speech parameter to further improve the precision of the weight calculation, thereby ensuring that the text information is complete without causing information redundancy.
On the basis of the above embodiments, the preset part-of-speech parameters include:
a first part-of-speech parameter, a second part-of-speech parameter and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value range of the first part-of-speech parameter is greater than 0 and less than 1.
It can be understood that parts of speech can preferably be grouped into three classes according to the importance they exhibit in a text, with a different part-of-speech parameter set for each class: the first, second and third part-of-speech parameters respectively. Since the first part-of-speech parameter is greater than the second and the second is greater than the third, words of the first part-of-speech class are the most important in the text, followed by the second class, and finally the third class.
Since these parameters are coefficient factors, their value range should lie between 0 and 1.
Preferably, in the embodiment of the present invention, nouns are assigned to the first part of speech; verbs, adjectives, adverbs and prepositions are assigned to the second part of speech; and all remaining parts of speech are assigned to the third part of speech. Accordingly, the preset part-of-speech parameter θ takes the value α (the first part-of-speech parameter) for words of the first part of speech, β (the second part-of-speech parameter) for words of the second part of speech, and η (the third part-of-speech parameter) for all other words.
Preferably, 0 ≤ η ≤ β ≤ α ≤ 1.
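The part-of-speech parameter assignment can be sketched directly from the grouping above; the POS tag strings and the numeric defaults below are illustrative assumptions only, since the patent does not fix concrete values for α, β and η.

```python
FIRST_POS = {"noun"}                                         # assigned parameter alpha
SECOND_POS = {"verb", "adjective", "adverb", "preposition"}  # assigned parameter beta

def pos_parameter(pos_tag: str, alpha: float = 0.9, beta: float = 0.6, eta: float = 0.3) -> float:
    """theta = alpha for nouns, beta for verbs/adjectives/adverbs/prepositions,
    and eta for all other parts of speech, with 0 <= eta <= beta <= alpha <= 1."""
    if pos_tag in FIRST_POS:
        return alpha
    if pos_tag in SECOND_POS:
        return beta
    return eta
```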
On the basis of the above embodiments, calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
calculating, from the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
obtaining, based on the part of speech to which each lexical item contained in the text belongs, the corresponding part-of-speech parameter value;
calculating the weight in the text of each lexical item contained in the text based on the correlation degree value, the part-of-speech parameter value and the TF-IDF weighting algorithm.
It can be understood that the embodiment of the present invention further calculates the weight by combining the part-of-speech parameter with the above scheme based on the degree of correlation and the TF-IDF weighting algorithm.
Each lexical item contained in the text is classified by part of speech and accordingly assigned to the first, second or third part-of-speech class; the weight is then determined according to the values of the first, second and third part-of-speech parameters.
It can therefore be understood that, since the first part-of-speech parameter is greater than the second part-of-speech parameter, the weight of a word corresponding to the first part-of-speech parameter is relatively large; conversely, the third part-of-speech parameter is the smallest, so the weight of a lexical item corresponding to the third part-of-speech parameter is the smallest.
On the basis of the above embodiments, calculating the weight in the text of each lexical item contained in the text based on the degree of correlation, the part-of-speech parameter and the TF-IDF weighting algorithm specifically includes calculating the weight from the part-of-speech parameter value θ_k, the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω'_ik is the weight of the lexical item in the text.
It can be understood that the embodiment of the present invention combines the degree of correlation between lexical items and text categories, the part-of-speech parameter and the TF-IDF weighting algorithm as a linear combination: the relevance parameter between lexical items and text categories and the part-of-speech parameter are introduced into the TF-IDF weighting formula to form a new weight calculation formula, with ω'_ik, θ_k, c_k, tf_ik and log(N/n_k) defined as above.
It should be noted that there is no ordering between the computation of θ_k and that of c_k; in the calculation the two act simultaneously as factors affecting the weight: the larger c_k is, the larger the weight, and likewise the larger θ_k is, the larger the weight.
Fig. 2 is a structural diagram of a text sentence vector representation system provided by an embodiment of the present invention. As shown in Fig. 2, the system comprises an obtaining module 1, a computing module 2 and a determining module 3, wherein:
the obtaining module 1 is configured to obtain each lexical item contained in a text and the category to which the text belongs;
the computing module 2 is configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
the determining module 3 is configured to determine the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
Specifically, the obtaining module 1 obtains each lexical item contained in the text and its word vector using the word2vec tool, and also obtains the category to which the text belongs.
The computing module 2 calculates the weight in the text of each lexical item contained in the text from the relevance parameters between lexical items and text categories; preferably, the computing module may also calculate the weight in the text of each lexical item contained in the text based on the relevance parameters between lexical items and text categories together with preset part-of-speech parameters.
The determining module 3 sums the word vectors of the words in the text according to their respective weights to determine the text sentence vector.
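A rough sketch of how the three modules of Fig. 2 might be composed is given below. The correlation function (a callable taking a term and a category, for example the earlier sketch with the corpus bound in), the part-of-speech function, the relative term frequency and the multiplicative weight are the same assumptions used in the earlier sketches, not the patent's own implementation.

```python
import numpy as np
from collections import Counter

class SentenceVectorSystem:
    """Illustrative composition of the obtaining, computing and determining modules."""

    def __init__(self, w2v, correlation, pos_param):
        self.w2v = w2v                  # word2vec model (used by the obtaining module)
        self.correlation = correlation  # term/category correlation function, c_k
        self.pos_param = pos_param      # part-of-speech parameter function, theta_k

    def represent(self, terms_with_pos, category, N, doc_freq):
        """terms_with_pos: list of (term, pos_tag); doc_freq maps a term to n_k."""
        counts = Counter(t for t, _ in terms_with_pos)
        vectors, weights = [], []
        for t, pos in terms_with_pos:
            if t not in self.w2v.wv:
                continue
            tf_ik = counts[t] / len(terms_with_pos)          # relative term frequency
            w = (self.pos_param(pos) * self.correlation(t, category)
                 * tf_ik * np.log(N / doc_freq[t]))          # assumed weight form
            vectors.append(self.w2v.wv[t])
            weights.append(w)
        if not vectors:
            return np.zeros(self.w2v.vector_size)
        return np.sum([w * v for w, v in zip(weights, vectors)], axis=0)
```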
The text sentence vector representation system provided by the embodiment of the present invention uses preset relevance parameters between lexical items and text categories to measure the degree of correlation between each lexical item in a text and the category to which the text belongs, so that the weighted sentence vector information is complete and free of redundancy, improving text processing accuracy.
An embodiment of the present invention provides a text sentence vector representation system, comprising: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method provided by each of the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text; S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
This embodiment discloses a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by each of the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text; S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
This embodiment provides a non-transitory computer-readable storage medium storing computer instructions which cause a computer to perform the method provided by each of the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text; S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks or optical disks.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solution, or the part of it that contributes over the prior art, may essentially be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the method described in each embodiment or in certain parts of the embodiments.
Finally, the method of the present application is only a preferred embodiment and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A text sentence vector representation method, characterized in that it comprises:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, calculating, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
S3, determining the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
2. The method according to claim 1, wherein step S2 comprises:
S21, calculating, based on the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
S22, calculating the weight in the text of each lexical item contained in the text based on the correlation degree value and the TF-IDF weighting algorithm.
3. The method according to claim 2, characterized in that step S22 specifically comprises calculating the weight from the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω_ik is the weight of the lexical item in the text.
4. The method according to claim 1, wherein step S2 further comprises:
calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters.
5. The method according to claim 4, characterized in that the preset part-of-speech parameters include:
a first part-of-speech parameter, a second part-of-speech parameter and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value range of the first part-of-speech parameter is greater than 0 and less than 1.
6. The method according to claim 5, characterized in that calculating the weight in the text of each lexical item contained in the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
calculating, from the preset relevance parameters between lexical items and the category set and according to the category to which the text belongs, the correlation degree value between each lexical item contained in the text and the category to which the text belongs;
obtaining, based on the part of speech to which each lexical item contained in the text belongs, the corresponding part-of-speech parameter value;
calculating the weight in the text of each lexical item contained in the text based on the correlation degree value, the part-of-speech parameter value and the TF-IDF weighting algorithm.
7. The method according to claim 6, characterized in that calculating the weight in the text of each lexical item contained in the text based on the degree of correlation, the part-of-speech parameter and the TF-IDF weighting algorithm specifically comprises calculating the weight from the part-of-speech parameter value θ_k, the correlation degree value c_k, the term frequency tf_ik and the inverse document frequency log(N/n_k), where k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, and ω'_ik is the weight of the lexical item in the text.
8. A text sentence vector representation system, characterized in that it comprises:
an obtaining module, configured to obtain each lexical item contained in a text and the category to which the text belongs;
a computing module, configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight in the text of each lexical item contained in the text;
a determining module, configured to determine the text sentence vector corresponding to the text based on each lexical item contained in the text and the weight in the text of each such lexical item.
9. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions which cause a computer to perform the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710712075.6A CN109408797A (en) | 2017-08-18 | 2017-08-18 | Text sentence vector representation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710712075.6A CN109408797A (en) | 2017-08-18 | 2017-08-18 | Text sentence vector representation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109408797A true CN109408797A (en) | 2019-03-01 |
Family
ID=65463188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710712075.6A Pending CN109408797A (en) | | Text sentence vector representation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109408797A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114398469A (en) * | 2021-12-10 | 2022-04-26 | 北京百度网讯科技有限公司 | Method and device for determining search term weight and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11143892A (en) * | 1997-11-07 | 1999-05-28 | Fujitsu Ltd | Device and method for keyword weight generation and program storage medium |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN105224695A (en) * | 2015-11-12 | 2016-01-06 | 中南大学 | A kind of text feature quantization method based on information entropy and device and file classification method and device |
JP2016001399A (en) * | 2014-06-11 | 2016-01-07 | 日本電信電話株式会社 | Relevance determination device, model learning device, method, and program |
- 2017-08-18: CN application CN201710712075.6A filed; published as CN109408797A (en), status Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11143892A (en) * | 1997-11-07 | 1999-05-28 | Fujitsu Ltd | Device and method for keyword weight generation and program storage medium |
JP2016001399A (en) * | 2014-06-11 | 2016-01-07 | 日本電信電話株式会社 | Relevance determination device, model learning device, method, and program |
CN104778158A (en) * | 2015-03-04 | 2015-07-15 | 新浪网技术(中国)有限公司 | Method and device for representing text |
CN105224695A (en) * | 2015-11-12 | 2016-01-06 | 中南大学 | A kind of text feature quantization method based on information entropy and device and file classification method and device |
Non-Patent Citations (1)
Title |
---|
Zhang Xiaochuan et al.: "An improved text representation algorithm based on the vector space model", Journal of Chongqing University of Technology (Natural Science) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Integrating extractive and abstractive models for long text summarization | |
US10061766B2 (en) | Systems and methods for domain-specific machine-interpretation of input data | |
US9613024B1 (en) | System and methods for creating datasets representing words and objects | |
US10042896B2 (en) | Providing search recommendation | |
US8423546B2 (en) | Identifying key phrases within documents | |
US11288453B1 (en) | Key-word identification | |
US10496756B2 (en) | Sentence creation system | |
CN109299280B (en) | Short text clustering analysis method and device and terminal equipment | |
Gacitua et al. | Relevance-based abstraction identification: technique and evaluation | |
WO2024078105A1 (en) | Method for extracting technical problem in patent literature and related device | |
US20230282018A1 (en) | Generating weighted contextual themes to guide unsupervised keyphrase relevance models | |
WO2024015323A1 (en) | Methods and systems for improved document processing and information retrieval | |
Ma et al. | Stream-based live public opinion monitoring approach with adaptive probabilistic topic model | |
CN114997288A (en) | Design resource association method | |
CN109284389A (en) | A kind of information processing method of text data, device | |
Ertam et al. | Abstractive text summarization using deep learning with a new Turkish summarization benchmark dataset | |
CN113761875B (en) | Event extraction method and device, electronic equipment and storage medium | |
Zheng et al. | An adaptive LDA optimal topic number selection method in news topic identification | |
Phan et al. | Applying skip-gram word estimation and SVM-based classification for opinion mining Vietnamese food places text reviews | |
Quan et al. | Combine sentiment lexicon and dependency parsing for sentiment classification | |
Visser et al. | Sentiment and intent classification of in-text citations using bert | |
CN109408797A (en) | Text sentence vector representation method and system | |
CN110929513A (en) | Text-based label system construction method and device | |
Yuan et al. | Personalized sentence generation using generative adversarial networks with author-specific word usage | |
Kong et al. | Construction of microblog-specific chinese sentiment lexicon based on representation learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190301 |