CN109408797A - Text sentence vector representation method and system - Google Patents

Text sentence vector representation method and system

Info

Publication number
CN109408797A
Authority
CN
China
Prior art keywords
text
lexical item
weight
parameter
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710712075.6A
Other languages
Chinese (zh)
Inventor
李广森
张春荣
赵琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Potevio Information Technology Co Ltd
Putian Information Technology Co Ltd
Original Assignee
Putian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Putian Information Technology Co Ltd
Priority to CN201710712075.6A priority Critical patent/CN109408797A/en
Publication of CN109408797A publication Critical patent/CN109408797A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/151 - Transformation
    • G06F 40/157 - Transformation using dictionaries or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text sentence vector representation method, comprising: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, based on preset relevance parameters between lexical items and a category set, and according to the category to which the text belongs, calculating the weight of each lexical item contained in the text within the text; S3, based on each lexical item contained in the text and the weight of each lexical item within the text, determining the text sentence vector corresponding to the text. With the text sentence vector representation method and system provided by the present invention, the degree of relevance between each lexical item in a text and the category to which the text belongs is discriminated by means of the preset relevance parameters between lexical items and text categories, so that the weighted sentence vector information is complete without being redundant, improving the accuracy of text processing.

Description

Text sentence vector representation method and system
Technical field
The present invention relates to the field of text information processing, and more particularly to a text sentence vector representation method and system.
Background art
With the rapid development of the Internet and mobile networks and the speed with which information spreads over the Internet, more and more users choose to communicate with others and share information through Internet platforms, and a large portion of the information on the network is text. How to process this text information effectively is a current research hotspot. Among other things, text representation is a key step in document information retrieval that cannot be neglected. Text representation refers to the process of converting readable text into a data structure that a computer can recognize, and it is a basic problem in the field of text information processing. In general, the word content of a text is converted into vectors for representation, so as to make natural language computable in specific fields such as text classification, similarity calculation, and pattern recognition.
Text information is generally represented by means of sentence vectors. The method used in the prior art is as follows: the text to be processed is first preprocessed by word segmentation, stop-word removal, and the like; the words in the text are then converted into word vectors using the word2vec tool; finally, the word vectors are summed to obtain a single vector.
Because the prior art simply sums the word vectors to obtain a single vector, it fails to take into account the relevance between lexical items and text categories, which causes information loss and thus affects the text processing result to a certain extent.
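As context for the improvement described below, the following is a minimal sketch of this prior-art baseline (segmentation, stop-word removal, word2vec, unweighted summation). It assumes the jieba segmenter and gensim's Word2Vec; the corpus and stop-word list are toy placeholders, not data from this patent.

```python
# Minimal sketch of the prior-art baseline: segment, remove stop words,
# convert words to vectors with word2vec, and sum the vectors.
import numpy as np
import jieba
from gensim.models import Word2Vec

stop_words = {"的", "了", "是"}                      # placeholder stop-word list

def preprocess(text):
    """Segment the text and drop stop words."""
    return [w for w in jieba.lcut(text) if w.strip() and w not in stop_words]

corpus = ["高铁是重要的交通工具", "他喜欢看体育比赛"]   # toy corpus
tokenized = [preprocess(t) for t in corpus]

# Train a small word2vec model so every lexical item has a word vector.
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, epochs=20)

def sentence_vector_baseline(tokens, model):
    """Prior-art representation: unweighted sum of the word vectors."""
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(sentence_vector_baseline(tokenized[0], w2v))
```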
Summary of the invention
The present invention provides a text sentence vector representation method and system that overcome the above problem, or at least partially solve the above problem.
According to one aspect of the present invention, a text sentence vector representation method is provided, comprising:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, based on preset relevance parameters between lexical items and a category set, and according to the category to which the text belongs, calculating the weight of each lexical item contained in the text within the text;
S3, based on each lexical item contained in the text and the weight of each lexical item within the text, determining the text sentence vector corresponding to the text.
Wherein, step S2 includes:
S21, based on the preset relevance parameters between lexical items and the category set, and according to the category to which the text belongs, calculating the relevance value between each lexical item contained in the text and the category to which the text belongs;
S22, based on the relevance value and the TF-IDF weighting algorithm, calculating the weight of each lexical item contained in the text within the text.
Wherein, step S22 specifically comprises:
In the formula, k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, ω_ik is the weight of the lexical item in the text, c_k is the relevance value, tf_ik is the frequency with which the lexical item occurs in the text, and log(N/n_k) is the inverse document frequency.
Wherein, step S2 further comprises:
Based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters, calculating the weight of each lexical item contained in the text within the text.
Wherein, the preset part-of-speech parameters include:
a first part-of-speech parameter, a second part-of-speech parameter, and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value of the first part-of-speech parameter is greater than 0 and less than 1.
Wherein, calculating the weight of each lexical item contained in the text within the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
based on the preset relevance parameters between lexical items and the category set, and according to the category to which the text belongs, calculating the relevance value between each lexical item contained in the text and the category to which the text belongs;
based on the part of speech to which each lexical item contained in the text belongs, obtaining the corresponding part-of-speech parameter value;
based on the relevance value, the part-of-speech parameter value, and the TF-IDF weighting algorithm, calculating the weight of each lexical item contained in the text within the text.
Wherein, calculating the weight of each lexical item contained in the text within the text based on the relevance value, the part-of-speech parameter, and the TF-IDF weighting algorithm specifically comprises:
In the formula, k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, ω'_ik is the weight of the lexical item in the text, θ_k is the part-of-speech parameter value, c_k is the relevance value, tf_ik is the frequency with which the lexical item occurs in the text, and log(N/n_k) is the inverse document frequency.
According to a second aspect of the present invention, a text sentence vector representation system is provided, comprising:
an obtaining module, configured to obtain each lexical item contained in a text and the category to which the text belongs;
a computing module, configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight of each lexical item contained in the text within the text;
a determining module, configured to determine, based on each lexical item contained in the text and the weight of each lexical item within the text, the text sentence vector corresponding to the text.
According to a third aspect of the present invention, a computer program product is provided, comprising program code for executing the text sentence vector representation method described above.
According to a fourth aspect of the present invention, a non-transitory computer-readable storage medium is provided for storing the computer program described above.
With the text sentence vector representation method and system provided by the present invention, the degree of relevance between each lexical item in a text and the category to which the text belongs is discriminated by means of the preset relevance parameters between lexical items and text categories, so that the weighted sentence vector information is complete without being redundant, improving the accuracy of text processing.
Brief description of the drawings
Fig. 1 is a flowchart of a text sentence vector representation method provided in an embodiment of the present invention;
Fig. 2 is a structural diagram of a text sentence vector representation system provided in an embodiment of the present invention.
Specific embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
Fig. 1 is a flowchart of a text sentence vector representation method provided in an embodiment of the present invention. As shown in Fig. 1, the method comprises:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, based on preset relevance parameters between lexical items and a category set, and according to the category to which the text belongs, calculating the weight of each lexical item contained in the text within the text;
S3, based on each lexical item contained in the text and the weight of each lexical item within the text, determining the text sentence vector corresponding to the text.
In the prior art, words are converted into word vectors using the word2vec tool, and the word vectors are then simply weighted with the TF-IDF method to obtain the text sentence vector.
However, the TF-IDF method used in the prior art is only a statistical method for assessing how important a lexical item (term) is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or ranking of the relevance between documents and a user query.
It can be understood that the TF-IDF weighting algorithm used in the prior art simply judges the weight of a word in a text according to its frequency of occurrence. This weighting method fails to take into account the relevance between lexical items and text categories, which causes information loss to a certain extent and thus affects the text processing result.
In S1, the present invention also obtains the word vector of each lexical item using the word2vec tool, but the embodiment of the present invention additionally labels the category of the text to be processed based on a preset text classification standard.
In S2, on the basis of the TF-IDF weighting algorithm, the embodiment of the present invention introduces a preset relevance parameter c between a lexical item and a text category to express the degree of relevance between the lexical item and the text category; the higher the degree of relevance, the larger the weight assigned to the lexical item.
For example, when the text category is transportation, the lexical item "high-speed rail" is strongly associated with the transportation category, so the value of c is large, and it can be determined that the lexical item "high-speed rail" carries a high weight in a transportation text. Likewise, when the text category is sports, the degree of relevance between the lexical item "high-speed rail" and sports is low, so the value of c is small, and it can be determined that the lexical item "high-speed rail" carries a low weight in a sports text.
In S3, after the weight of each lexical item contained in the text within the text has been calculated based on the preset relevance parameters between lexical items and text categories, the word vectors of the lexical items in the text are summed according to their respective weights to obtain the text sentence vector.
The specific formula is as follows:
In the formula, S_i is the text sentence vector, H_i is the number of lexical items contained in sentence S_i, p_ik = (V_1k, V_2k, ..., V_dk, ..., V_Tk) denotes the word vector of the k-th word in the i-th sentence, and T denotes the dimension of the word vector.
It can be understood that the text sentence vector determined by the embodiment of the present invention derives its weights from the relevance parameters between lexical items and text categories, which preserves the completeness of the text information as far as possible and thereby improves the accuracy and effect of text processing.
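A minimal sketch of step S3 follows: each word vector p_ik is scaled by its weight ω_ik and the scaled vectors are summed to form the sentence vector S_i. The printed formula is not reproduced in this text, so whether an additional normalization by the number of lexical items H_i is applied is unknown and omitted here.

```python
# Sketch of S3: the word vectors p_ik of the lexical items in the text are summed
# according to their respective weights omega_ik to give the sentence vector S_i.
# No normalization by H_i is applied, since that detail is not recoverable from the text.
import numpy as np

def sentence_vector(word_vectors, weights):
    """word_vectors: list of T-dimensional vectors p_ik; weights: matching list of omega_ik."""
    s = np.zeros(len(word_vectors[0]))
    for p_ik, w in zip(word_vectors, weights):
        s += w * np.asarray(p_ik, dtype=float)
    return s
```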
For example, for two texts S_v1 and S_v2, with the sentence vector representation method provided by the embodiment of the present invention, the cosine similarity between the two texts can be further calculated from their sentence vectors in order to judge the correlation between the two texts.
The cosine similarity is calculated as follows:
sim(S_v1, S_v2) = cos(S_v1, S_v2).
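A small sketch of the cosine similarity computation referenced above, applied to two sentence vectors:

```python
# Cosine similarity between two sentence vectors, used to judge how correlated two texts are.
import numpy as np

def cosine_similarity(sv1, sv2):
    sv1, sv2 = np.asarray(sv1, dtype=float), np.asarray(sv2, dtype=float)
    denom = np.linalg.norm(sv1) * np.linalg.norm(sv2)
    return float(np.dot(sv1, sv2) / denom) if denom else 0.0
```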
With the text sentence vector representation method provided by the embodiment of the present invention, the degree of relevance between each lexical item in a text and the category to which the text belongs is discriminated by means of the preset relevance parameters between lexical items and text categories, so that the weighted sentence vector information is complete without being redundant, improving the accuracy of text processing.
On the basis of the above embodiments, step S2 includes:
S21, based on the preset relevance parameters between lexical items and text categories, calculating the degree of relevance between each lexical item contained in the text and the text category to which the text belongs;
S22, based on the degree of relevance and the TF-IDF weighting algorithm, calculating the weight of each lexical item contained in the text within the text.
In S21, the embodiment of the present invention proposes the relevance parameter c between a lexical item and a text category to express the degree of relevance between the lexical item and the text category. In the expression for c:
A denotes the number of texts of class Ci that contain lexical item t, plus 1; B denotes the number of texts not of class Ci that contain lexical item t, plus 1; C denotes the number of texts of class Ci that do not contain lexical item t, plus 1; D denotes the number of texts not of class Ci that do not contain lexical item t, plus 1; and N denotes the total number of texts.
Through the above expression, the embodiment of the present invention establishes a relationship between a lexical item and a text category by means of the value of c: the higher the value of c, the higher the degree of relevance between the lexical item and the text category; conversely, the lower the value of c, the lower the degree of relevance between them.
Then, from the value of the relevance parameter c and the TF-IDF weighting algorithm of the prior art, the weight of each lexical item contained in the text within the text can be calculated.
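The following sketch shows how the counts A, B, C, and D defined above can be collected from a labelled corpus. The printed expression for c is not reproduced in this text, so the log odds-ratio of the smoothed counts used below is only an illustrative stand-in for the relevance parameter, not the exact patented formula.

```python
# Sketch of the relevance parameter c between a lexical item t and a category Ci,
# built from the smoothed counts A, B, C, D defined in the text. The log odds-ratio
# below is an illustrative association score, not the formula printed in the patent.
import math

def relevance_c(texts, labels, term, category):
    """texts: list of token lists; labels: matching list of category labels."""
    A = sum(1 for toks, y in zip(texts, labels) if y == category and term in toks) + 1
    B = sum(1 for toks, y in zip(texts, labels) if y != category and term in toks) + 1
    C = sum(1 for toks, y in zip(texts, labels) if y == category and term not in toks) + 1
    D = sum(1 for toks, y in zip(texts, labels) if y != category and term not in toks) + 1
    return math.log((A * D) / (B * C))
```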
The above TF-IDF weighting algorithm computes ω_ik = tf_ik · log(N/n_k),
where tf_ik is the frequency with which lexical item k occurs in text i, N is the total number of texts, n_k is the number of texts containing lexical item k, and log(N/n_k) denotes the inverse document frequency.
On the basis of the above embodiments, step S22 specifically comprises:
In the formula, k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, ω_ik is the weight of the lexical item in the text, c_k is the relevance value, tf_ik is the frequency with which the lexical item occurs in the text, and log(N/n_k) is the inverse document frequency.
It can be understood that the embodiment of the present invention combines the relevance between a lexical item and the text category with the TF-IDF algorithm by introducing the relevance parameter between the lexical item and the text category into the TF-IDF weighting formula, forming a new weight calculation formula as shown below:
In the formula, k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, ω_ik is the weight of the lexical item in the text, c_k is the relevance value, tf_ik is the frequency with which the lexical item occurs in the text, and log(N/n_k) is the inverse document frequency.
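A minimal sketch of the relevance-weighted TF-IDF described above follows. Since the printed formula is not reproduced in this text, the product form ω_ik = c_k · tf_ik · log(N/n_k), that is, the TF-IDF weight scaled by the relevance value, is an assumption about how the relevance parameter is introduced.

```python
# Sketch of relevance-weighted TF-IDF. The product form (TF-IDF scaled by c_k)
# is an assumption; the exact formula is not reproduced in this text.
import math

def tfidf_weight(term, tokens, all_texts):
    tf = tokens.count(term) / len(tokens)                 # tf_ik: frequency of the term in the text
    n_k = sum(1 for t in all_texts if term in t)          # number of texts containing the term
    idf = math.log(len(all_texts) / n_k) if n_k else 0.0  # log(N / n_k)
    return tf * idf

def relevance_weight(term, tokens, all_texts, c_k):
    """Assumed combined weight: omega_ik = c_k * tf_ik * log(N / n_k)."""
    return c_k * tfidf_weight(term, tokens, all_texts)
```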
On the basis of the above embodiments, step S2 further comprises:
Based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters, calculating the weight of each lexical item contained in the text within the text.
It should be noted that calculating the weight of each lexical item contained in the text within the text according to the relevance parameters between lexical items and text categories, as described above, already captures the relationship between lexical items and text categories well. Preferably, in addition to the relationship between lexical items and text categories, the embodiment of the present invention further provides a part-of-speech parameter to further optimize the weight distribution.
Part of speech refers to the classification of words according to their grammatical nature; word classes generally include nouns, verbs, adjectives, adverbs, prepositions, and so on. Different parts of speech differ in their importance to a text. In general, nouns are relatively important to a text, so words of different parts of speech should carry different weights in the text. The scheme provided by the embodiment of the present invention therefore takes the contribution of part of speech to the text into account and proposes a part-of-speech parameter to optimize the weight calculation.
The embodiment of the present invention introduces the part-of-speech parameter on top of the calculation of the relevance between lexical items and text categories, so as to further improve the accuracy of the weight calculation and thereby ensure that the text information is complete without causing information redundancy.
On the basis of the above embodiments, the preset part-of-speech parameters include:
a first part-of-speech parameter, a second part-of-speech parameter, and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value of the first part-of-speech parameter is greater than 0 and less than 1.
It can be understood that parts of speech can preferably be divided into three classes according to the importance they exhibit in a text, and a different part-of-speech parameter is set for each class, namely the first part-of-speech parameter, the second part-of-speech parameter, and the third part-of-speech parameter, where the first part-of-speech parameter is greater than the second and the second is greater than the third. It follows that words of the first part-of-speech class are the most important in the text, followed by those of the second class, and finally those of the third class.
Since these parameters are coefficient factors, their values should lie between 0 and 1.
Preferably, in the embodiment of the present invention, nouns are assigned to the first part-of-speech class; verbs, adjectives, adverbs, and prepositions are assigned to the second part-of-speech class; and all remaining parts of speech are assigned to the third part-of-speech class, as shown below:
where θ is the preset part-of-speech parameter, α is the first part-of-speech parameter, β is the second part-of-speech parameter, and η is the third part-of-speech parameter.
Preferably, 0 ≤ η ≤ β ≤ α ≤ 1.
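A small sketch of the preset part-of-speech parameter θ described above: nouns receive the first parameter α; verbs, adjectives, adverbs, and prepositions receive the second parameter β; all remaining parts of speech receive the third parameter η, with 0 ≤ η ≤ β ≤ α ≤ 1. The concrete values used below are placeholders, not values specified in this text.

```python
# Sketch of the part-of-speech parameter theta_k. The numeric values are placeholders
# chosen only to satisfy 0 <= eta <= beta <= alpha <= 1.
ALPHA, BETA, ETA = 1.0, 0.7, 0.4

FIRST_CLASS = {"noun"}
SECOND_CLASS = {"verb", "adjective", "adverb", "preposition"}

def pos_parameter(pos_tag):
    """Return theta_k for a lexical item given its part-of-speech tag."""
    if pos_tag in FIRST_CLASS:
        return ALPHA
    if pos_tag in SECOND_CLASS:
        return BETA
    return ETA
```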
On the basis of the above embodiments, calculating the weight of each lexical item contained in the text within the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
based on the preset relevance parameters between lexical items and the category set, and according to the category to which the text belongs, calculating the relevance value between each lexical item contained in the text and the category to which the text belongs;
based on the part of speech to which each lexical item contained in the text belongs, obtaining the corresponding part-of-speech parameter value;
based on the relevance value, the part-of-speech parameter value, and the TF-IDF weighting algorithm, calculating the weight of each lexical item contained in the text within the text.
It can be understood that, on the basis of the relevance value and the TF-IDF weighting algorithm described above, the embodiment of the present invention further calculates the weight in combination with the part-of-speech parameter.
Each lexical item contained in the text is classified by part of speech into the first, second, or third part-of-speech class, and the weight is then determined according to the values of the first, second, and third part-of-speech parameters.
It can thus be understood that, since the first part-of-speech parameter is greater than the second part-of-speech parameter, a lexical item corresponding to the first part-of-speech parameter carries a larger weight; conversely, the third part-of-speech parameter is the smallest, so a lexical item corresponding to the third part-of-speech parameter carries the smallest weight.
On the basis of the above embodiments, calculating the weight of each lexical item contained in the text within the text based on the relevance value, the part-of-speech parameter, and the TF-IDF weighting algorithm specifically comprises:
In the formula, k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, ω'_ik is the weight of the lexical item in the text, θ_k is the part-of-speech parameter value, c_k is the relevance value, tf_ik is the frequency with which the lexical item occurs in the text, and log(N/n_k) is the inverse document frequency.
It can be understood that the embodiment of the present invention combines the relevance between a lexical item and the text category, the part-of-speech parameter, and the TF-IDF weighting algorithm by introducing the relevance parameter between the lexical item and the text category and the part-of-speech parameter into the TF-IDF weighting formula, forming a new weight calculation formula as shown below:
In the formula, ω'_ik is the weight of the lexical item in the text, θ_k is the part-of-speech parameter, c_k is the relevance value, tf_ik is the frequency with which the lexical item occurs in the text, and log(N/n_k) is the inverse document frequency.
It should be noted that there is no required order between computing θ_k and c_k; both act simultaneously as factors in the weight calculation: the larger c_k is, the larger the weight, and likewise, the larger θ_k is, the larger the weight.
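A short sketch of the final weight combining the part-of-speech parameter, the relevance value, and TF-IDF follows. The printed formula is not reproduced in this text; the product form ω'_ik = θ_k · c_k · tf_ik · log(N/n_k) is an assumption consistent with the variables defined above.

```python
# Sketch of the assumed combined weight: omega'_ik = theta_k * c_k * tf_ik * log(N / n_k).
import math

def combined_weight(tf_ik, n_k, N, c_k, theta_k):
    idf = math.log(N / n_k) if n_k else 0.0
    return theta_k * c_k * tf_ik * idf
```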
Fig. 2 is a structural diagram of a text sentence vector representation system provided in an embodiment of the present invention. As shown in Fig. 2, the system comprises an obtaining module 1, a computing module 2, and a determining module 3, in which:
the obtaining module 1 is configured to obtain each lexical item contained in a text and the category to which the text belongs;
the computing module 2 is configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight of each lexical item contained in the text within the text;
the determining module 3 is configured to determine, based on each lexical item contained in the text and the weight of each lexical item within the text, the text sentence vector corresponding to the text.
Specifically, the obtaining module 1 obtains each lexical item contained in the text and the word vector of each lexical item using the word2vec tool, and also obtains the category to which the text belongs.
The computing module 2 calculates the weight of each lexical item contained in the text within the text using the relevance parameters between lexical items and text categories. Preferably, the computing module may also calculate the weight of each lexical item contained in the text within the text based on the relevance parameters between lexical items and text categories together with the preset part-of-speech parameters.
The determining module 3 sums the word vectors of the words in the text according to their respective weights and thereby determines the text sentence vector.
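An illustrative sketch tying the three modules of Fig. 2 together as one pipeline: an obtaining module, a computing module, and a determining module. The class and method names are illustrative only, not taken from the patent, and the weighting function is assumed to implement the (θ, c, TF-IDF) weighting sketched earlier.

```python
# Illustrative pipeline for the system of Fig. 2 (names are assumptions, not from the patent).
import numpy as np

class SentenceVectorSystem:
    def __init__(self, word_vectors, weight_fn):
        self.word_vectors = word_vectors  # mapping: lexical item -> word vector (dict or gensim KeyedVectors)
        self.weight_fn = weight_fn        # mapping: (lexical item, tokens, category) -> weight

    def obtain(self, tokens, category):
        """Obtaining module: lexical items of the text plus the text's category."""
        return [t for t in tokens if t in self.word_vectors], category

    def compute(self, tokens, category):
        """Computing module: weight of each lexical item within the text."""
        return [self.weight_fn(t, tokens, category) for t in tokens]

    def determine(self, tokens, weights):
        """Determining module: weighted sum of word vectors gives the sentence vector."""
        return np.sum([w * np.asarray(self.word_vectors[t]) for t, w in zip(tokens, weights)], axis=0)

    def represent(self, tokens, category):
        toks, cat = self.obtain(tokens, category)
        return self.determine(toks, self.compute(toks, cat))
```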
With the text sentence vector representation system provided by the embodiment of the present invention, the degree of relevance between each lexical item in a text and the category to which the text belongs is discriminated by means of the preset relevance parameters between lexical items and text categories, so that the weighted sentence vector information is complete without being redundant, improving the accuracy of text processing.
An embodiment of the present invention provides a text sentence vector representation system, comprising: at least one processor; and at least one memory communicatively connected to the processor, in which:
the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the methods provided by the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, based on preset relevance parameters between lexical items and a category set, and according to the category to which the text belongs, calculating the weight of each lexical item contained in the text within the text; S3, based on each lexical item contained in the text and the weight of each lexical item within the text, determining the text sentence vector corresponding to the text.
This embodiment discloses a computer program product. The computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, and the computer program comprises program instructions which, when executed by a computer, enable the computer to carry out the methods provided by the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, based on preset relevance parameters between lexical items and a category set, and according to the category to which the text belongs, calculating the weight of each lexical item contained in the text within the text; S3, based on each lexical item contained in the text and the weight of each lexical item within the text, determining the text sentence vector corresponding to the text.
This embodiment provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions which cause the computer to execute the methods provided by the above method embodiments, for example: S1, obtaining each lexical item contained in a text and the category to which the text belongs; S2, based on preset relevance parameters between lexical items and a category set, and according to the category to which the text belongs, calculating the weight of each lexical item contained in the text within the text; S3, based on each lexical item contained in the text and the weight of each lexical item within the text, determining the text sentence vector corresponding to the text.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are carried out. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by means of software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware. Based on this understanding, the essence of the above technical solution, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, the methods of the present application are only preferred embodiments and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A text sentence vector representation method, characterized by comprising:
S1, obtaining each lexical item contained in a text and the category to which the text belongs;
S2, based on preset relevance parameters between lexical items and a category set, and according to the category to which the text belongs, calculating the weight of each lexical item contained in the text within the text;
S3, based on each lexical item contained in the text and the weight of each lexical item within the text, determining the text sentence vector corresponding to the text.
2. The method according to claim 1, characterized in that step S2 comprises:
S21, based on the preset relevance parameters between lexical items and the category set, and according to the category to which the text belongs, calculating the relevance value between each lexical item contained in the text and the category to which the text belongs;
S22, based on the relevance value and the TF-IDF weighting algorithm, calculating the weight of each lexical item contained in the text within the text.
3. The method according to claim 2, characterized in that step S22 specifically comprises:
wherein k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, ω_ik is the weight of the lexical item in the text, c_k is the relevance value, tf_ik is the frequency with which the lexical item occurs in the text, and log(N/n_k) is the inverse document frequency.
4. The method according to claim 1, characterized in that step S2 further comprises:
based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and preset part-of-speech parameters, calculating the weight of each lexical item contained in the text within the text.
5. The method according to claim 4, characterized in that the preset part-of-speech parameters comprise:
a first part-of-speech parameter, a second part-of-speech parameter, and a third part-of-speech parameter;
wherein the first part-of-speech parameter is greater than the second part-of-speech parameter, the second part-of-speech parameter is greater than the third part-of-speech parameter, and the value of the first part-of-speech parameter is greater than 0 and less than 1.
6. The method according to claim 5, characterized in that calculating the weight of each lexical item contained in the text within the text based on the preset relevance parameters between lexical items and the category set, the category to which the text belongs, and the preset part-of-speech parameters comprises:
based on the preset relevance parameters between lexical items and the category set, and according to the category to which the text belongs, calculating the relevance value between each lexical item contained in the text and the category to which the text belongs;
based on the part of speech to which each lexical item contained in the text belongs, obtaining the corresponding part-of-speech parameter value;
based on the relevance value, the part-of-speech parameter value, and the TF-IDF weighting algorithm, calculating the weight of each lexical item contained in the text within the text.
7. The method according to claim 6, characterized in that calculating the weight of each lexical item contained in the text within the text based on the relevance value, the part-of-speech parameter, and the TF-IDF weighting algorithm specifically comprises:
wherein k is a lexical item, n_k is the number of texts containing lexical item k, N is the total number of texts, ω'_ik is the weight of the lexical item in the text, θ_k is the part-of-speech parameter value, c_k is the relevance value, tf_ik is the frequency with which the lexical item occurs in the text, and log(N/n_k) is the inverse document frequency.
8. A text sentence vector representation system, characterized by comprising:
an obtaining module, configured to obtain each lexical item contained in a text and the category to which the text belongs;
a computing module, configured to calculate, based on preset relevance parameters between lexical items and a category set and according to the category to which the text belongs, the weight of each lexical item contained in the text within the text;
a determining module, configured to determine, based on each lexical item contained in the text and the weight of each lexical item within the text, the text sentence vector corresponding to the text.
9. A computer program product, characterized in that the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to execute the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions which cause the computer to execute the method according to any one of claims 1 to 7.
CN201710712075.6A 2017-08-18 2017-08-18 Text sentence vector representation method and system Pending CN109408797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710712075.6A CN109408797A (en) 2017-08-18 2017-08-18 Text sentence vector representation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710712075.6A CN109408797A (en) 2017-08-18 2017-08-18 Text sentence vector representation method and system

Publications (1)

Publication Number Publication Date
CN109408797A true CN109408797A (en) 2019-03-01

Family

ID=65463188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710712075.6A Pending CN109408797A (en) 2017-08-18 2017-08-18 A kind of text sentence vector expression method and system

Country Status (1)

Country Link
CN (1) CN109408797A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398469A (en) * 2021-12-10 2022-04-26 北京百度网讯科技有限公司 Method and device for determining search term weight and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11143892A (en) * 1997-11-07 1999-05-28 Fujitsu Ltd Device and method for keyword weight generation and program storage medium
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text
CN105224695A (en) * 2015-11-12 2016-01-06 中南大学 A kind of text feature quantization method based on information entropy and device and file classification method and device
JP2016001399A (en) * 2014-06-11 2016-01-07 日本電信電話株式会社 Relevance determination device, model learning device, method, and program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11143892A (en) * 1997-11-07 1999-05-28 Fujitsu Ltd Device and method for keyword weight generation and program storage medium
JP2016001399A (en) * 2014-06-11 2016-01-07 日本電信電話株式会社 Relevance determination device, model learning device, method, and program
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text
CN105224695A (en) * 2015-11-12 2016-01-06 中南大学 A kind of text feature quantization method based on information entropy and device and file classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张小川 et al., "A text representation algorithm based on an improved vector space model" (一种改进的向量空间模型的文本表示算法), Journal of Chongqing University of Technology (Natural Science) (《重庆理工大学学报(自然科学)》) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398469A (en) * 2021-12-10 2022-04-26 北京百度网讯科技有限公司 Method and device for determining search term weight and electronic equipment

Similar Documents

Publication Publication Date Title
Wang et al. Integrating extractive and abstractive models for long text summarization
US10061766B2 (en) Systems and methods for domain-specific machine-interpretation of input data
US9613024B1 (en) System and methods for creating datasets representing words and objects
US10042896B2 (en) Providing search recommendation
US8423546B2 (en) Identifying key phrases within documents
US11288453B1 (en) Key-word identification
US10496756B2 (en) Sentence creation system
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
Gacitua et al. Relevance-based abstraction identification: technique and evaluation
WO2024078105A1 (en) Method for extracting technical problem in patent literature and related device
US20230282018A1 (en) Generating weighted contextual themes to guide unsupervised keyphrase relevance models
WO2024015323A1 (en) Methods and systems for improved document processing and information retrieval
Ma et al. Stream-based live public opinion monitoring approach with adaptive probabilistic topic model
CN114997288A (en) Design resource association method
CN109284389A (en) A kind of information processing method of text data, device
Ertam et al. Abstractive text summarization using deep learning with a new Turkish summarization benchmark dataset
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
Zheng et al. An adaptive LDA optimal topic number selection method in news topic identification
Phan et al. Applying skip-gram word estimation and SVM-based classification for opinion mining Vietnamese food places text reviews
Quan et al. Combine sentiment lexicon and dependency parsing for sentiment classification
Visser et al. Sentiment and intent classification of in-text citations using bert
CN109408797A (en) A kind of text sentence vector expression method and system
CN110929513A (en) Text-based label system construction method and device
Yuan et al. Personalized sentence generation using generative adversarial networks with author-specific word usage
Kong et al. Construction of microblog-specific chinese sentiment lexicon based on representation learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20190301