CN110362815A - Text vector generation method and device - Google Patents

Text vector generation method and device

Info

Publication number
CN110362815A
CN110362815A (application CN201810321444.3A)
Authority
CN
China
Prior art keywords
text
feature words
vector
cluster category
word vector
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN201810321444.3A
Other languages
Chinese (zh)
Inventor
王硕
Current Assignee (the listed assignees may be inaccurate)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201810321444.3A priority Critical patent/CN110362815A/en
Publication of CN110362815A publication Critical patent/CN110362815A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a text vector generation method and device, relating to the field of computer technology. The method comprises: determining a word vector for each feature word of a text corpus based on a word2vec model; clustering the word vectors to obtain a cluster category for each feature word; determining the weight of each feature word within its text based on a TextRank algorithm or a TF-IDF algorithm; and generating the corresponding text vector from the cluster categories and weights of all feature words of the same text. Through the above steps, the accuracy of text representation can be improved, which greatly benefits subsequent text mining.

Description

Text vector generation method and device
Technical field
The present invention relates to the field of computer technology, and more particularly to a text vector generation method and device.
Background technique
In the prior art, there are four main text vector representation methods: the bag-of-words model (Bag of Words, BOW), the LDA (Latent Dirichlet Allocation) model, the word2vec model, and the doc2vec model.
The bag-of-words model treats a text as a collection of words. Within this collection, the words are mutually independent, and information such as word order, grammar, and semantics is ignored. The bag-of-words model represents a text as a vector whose dimensionality equals that of the training vocabulary; the value at each position in the vector may be the TF-IDF (term frequency-inverse document frequency) value, in the text, of the word represented by that position. Moreover, as the vocabulary grows, the dimensionality of the text vector grows with it.
The LDA model is an unsupervised machine learning technique that can be used to identify latent topic information in a large-scale text collection or corpus. The LDA model also adopts the bag-of-words approach, treating each text as a word-frequency vector, thereby converting textual information into numerical information that is easy to model.
The word2vec model uses the distributed representation (Distributed Representation) of word vectors. It is an efficient tool for representing words as real-valued vectors; by training on a corpus, it reduces the processing of text content to vector operations in an n-dimensional vector space.
The doc2vec model adds a paragraph identifier (Paragraph id) to the input of the word2vec model, i.e., the Paragraph id serves as a representative of the corresponding text. During training, for each text, the Paragraph id participates in the training of every sliding window of that text, which associates the Paragraph id with all the words of the text. After training, each Paragraph id has a corresponding vector, which is taken as the vector representation of the corresponding text.
In the course of realizing the present invention, the inventor found that the prior art has at least the following problems:
1. In the bag-of-words model, the dimensionality of the text vector equals the number of words occurring in the training set, so the "curse of dimensionality" easily arises. An ordinary text contains around 1,000 words, while the word vector can reach 100,000 dimensions, a utilization rate of only 1%; the BOW-based text vector is therefore very sparse, which is unfavorable for text mining tasks. In addition, because the bag-of-words model assumes that words are mutually independent and ignores the relationships between words, it suffers from the semantic gap problem and cannot represent the semantics of a text well.
2. The LDA model does not consider the contextual relations within a text and simplifies away its actual semantics, so the generated vector diverges from the actual meaning.
3. The word2vec model uses the average of the word vectors as the text vector representation; it considers neither the importance of words within the text nor the sentence structure and word-order information of the text, so the generated text vector is inaccurate and is especially unsuitable for representing long texts.
4. The doc2vec model treats the Paragraph id as equally related to all the words in the text; it considers neither the importance of words within the text nor the influence of each word on the whole text, so the generated text vector is inaccurate and is especially unsuitable for representing long texts.
Summary of the invention
In view of this, the present invention provides a text vector generation method and device that can improve the accuracy of text representation and greatly benefit subsequent text mining.
To achieve the above object, according to one aspect of the invention, a text vector generation method is provided.
The text vector generation method of the invention comprises: determining a word vector for each feature word of a text corpus based on a word2vec model; clustering the word vectors to obtain a cluster category for each feature word; determining the weight of each feature word within its text based on the TextRank algorithm or the TF-IDF algorithm; and generating the corresponding text vector from the cluster categories and weights of all feature words of the same text.
Optionally, the step of generating the corresponding text vector from the cluster categories and weights of all feature words of the same text comprises: determining the number of dimensions of the text vector from the cluster categories of all feature words of the same text, and taking the sum of the weights of the feature words belonging to the same cluster category in that text as the value of the text vector in the corresponding dimension.
Optionally, the step of determining a word vector for each feature word of the text corpus based on the word2vec model comprises: preprocessing the text corpus, then feeding the resulting feature words into the word2vec model to obtain a word vector for each feature word of the corpus; the preprocessing includes word segmentation.
Optionally, the step of clustering the word vectors to obtain a cluster category for each feature word comprises: clustering the word vectors based on the k-means algorithm to obtain the cluster category of each feature word.
To achieve the above object, according to another aspect of the invention, a text vector generating device is provided.
The text vector generating device of the invention comprises: a first determining module for determining a word vector for each feature word of a text corpus based on a word2vec model; a clustering module for clustering the word vectors to obtain a cluster category for each feature word; a second determining module for determining the weight of each feature word within its text based on the TextRank algorithm or the TF-IDF algorithm; and a generating module for generating the corresponding text vector from the cluster categories and weights of all feature words of the same text.
Optionally, the generating module generating the corresponding text vector from the cluster categories and weights of all feature words of the same text comprises: the generating module determines the number of dimensions of the text vector from the cluster categories of all feature words of the same text, and takes the sum of the weights of the feature words belonging to the same cluster category in that text as the value of the text vector in the corresponding dimension.
Optionally, the first determining module determining a word vector for each feature word of the text corpus based on the word2vec model comprises: the first determining module preprocesses the text corpus, then feeds the resulting feature words into the word2vec model to obtain a word vector for each feature word of the corpus; the preprocessing includes word segmentation.
Optionally, the clustering module clustering the word vectors to obtain the cluster category of each feature word comprises: the clustering module clusters the word vectors based on the k-means algorithm to obtain the cluster category of each feature word.
To achieve the above object, according to a further aspect of the invention, an electronic device is provided.
The electronic device of the invention comprises: one or more processors; and a storage device for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the text vector generation method of the invention.
To achieve the above object, according to a further aspect of the invention, a computer-readable medium is provided.
The computer-readable medium of the invention has a computer program stored thereon which, when executed by a processor, implements the text vector generation method of the invention.
One embodiment of the above invention has the following advantages or beneficial effects: by determining a word vector for each feature word of the corpus based on the word2vec model, determining the weight of each feature word within its text based on the TextRank or TF-IDF algorithm, clustering the word vectors to obtain the cluster category of each feature word, and generating the corresponding text vector from the cluster categories and weights of all feature words of the same text, the accuracy of text representation can be improved, which greatly benefits subsequent text mining.
Further effects of the above optional implementations are explained below in conjunction with the specific embodiments.
Detailed description of the invention
The accompanying drawings are provided for a better understanding of the invention and do not constitute an undue limitation thereof. In the drawings:
Fig. 1 is a schematic diagram of the main flow of a text vector generation method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the main flow of a text vector generation method according to another embodiment of the invention;
Fig. 3 is a schematic diagram of the result of clustering the word vectors in an embodiment of the invention;
Fig. 4 is a schematic diagram of the main modules of a text vector generating device according to an embodiment of the invention;
Fig. 5 is a diagram of an exemplary system architecture to which an embodiment of the invention can be applied;
Fig. 6 is a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the invention.
Specific embodiment
Exemplary embodiments of the invention are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; they should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.
It should be noted that, unless they conflict, the embodiments of the invention and the features within the embodiments may be combined with each other.
Fig. 1 is a schematic diagram of the main flow of a text vector generation method according to an embodiment of the invention. As shown in Fig. 1, the text vector generation method of the embodiment of the invention comprises:
Step S101: determine a word vector for each feature word of the text corpus based on a word2vec model.
Here, the text corpus can be understood as a collection of multiple texts. A text may be a sentence, a paragraph, or a document; for example, the corpus may be a collection of one million documents.
Exemplarily, step S101 may include: preprocessing the corpus, then feeding the resulting feature words into the word2vec model to obtain a word vector for each feature word of the corpus. The preprocessing may include word segmentation.
In specific implementation, if the corpus is a collection of Chinese texts, a Chinese word segmentation tool (for example, the jieba segmentation package) can be used for segmentation. Further, after segmentation, the preprocessing may also include filtering out the stop words in each text.
In the embodiment of the invention, by training the feature words of the corpus with the word2vec model, each feature word can be expressed as a vector of uniform dimensionality in a unified vector space. Word vectors generated in this way can better represent the semantic information of the words themselves, so that semantically similar feature words are also close to each other in the vector space.
Step S102: cluster the word vectors to obtain a cluster category for each feature word.
Exemplarily, in this step, a clustering algorithm such as the k-means algorithm or the k-medoids algorithm may be used to cluster the word vectors obtained in step S101.
In the embodiment of the invention, the word vectors obtained in step S101 are clustered in step S102, so that the similarity between word vectors within the same cluster category is high, while the similarity between word vectors in different categories is low. After clustering, the set of feature words contained in each cluster category can be regarded as a "concept". For example, the feature words of 500 cluster categories can be regarded as 500 "concepts", and these 500 "concepts" constitute a 500-dimensional "concept space".
Step S103: determine the weight of each feature word within its text based on the TextRank algorithm or the TF-IDF algorithm.
TextRank is a graph-based ranking algorithm for text. Its basic idea comes from Google's PageRank algorithm: the text is split into several units, a graph model is built, and a voting mechanism is used to rank the important components of the text.
TF-IDF (term frequency-inverse document frequency) is a statistical method used to assess the importance of a given word to a text within a corpus. In the TF-IDF algorithm, the importance of a word increases in proportion to the number of times it appears in the text, but decreases in inverse proportion to the frequency with which it appears across the corpus.
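For reference, one standard formulation of such a weight (the invention does not fix a specific formula) is
w(t, d) = tf(t, d) × log(N / df(t)),
where tf(t, d) is the number of occurrences of word t in text d, N is the number of texts in the corpus, and df(t) is the number of texts in which t occurs.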
Step S104: generate the corresponding text vector from the cluster categories and weights of all feature words of the same text.
It should be noted that in the embodiment of the invention, step S101 may be executed before step S103, or step S103 before step S101. In addition, to improve the efficiency of generating text vectors, steps S101 and S103 may also be executed in parallel.
In the embodiment of the invention, the above steps not only represent each text in a large collection as a relatively low-dimensional, fixed-length real-valued feature vector, but also allow the generated text vector to retain, as far as possible, the importance information of the words within the text and the actual semantic information of the text, which helps to improve the accuracy of text representation and greatly benefits subsequent text mining.
Fig. 2 is a schematic diagram of the main flow of a text vector generation method according to another embodiment of the invention. As shown in Fig. 2, the text vector generation method of the embodiment of the invention comprises:
Step S201: preprocess the text corpus, the preprocessing including word segmentation.
Here, the text corpus can be understood as a collection of multiple texts. A text may be a sentence, a paragraph, or a document; for example, the corpus may be a collection of one million documents.
In specific implementation, if the corpus is a collection of Chinese texts, a Chinese word segmentation tool (such as the jieba segmentation package) can be used for segmentation in step S201. Further, after segmentation, the preprocessing may also include filtering out stop words. For example, for the text "I love my motherland's beautiful rivers and mountains", the preprocessed result may be the space-separated string "I love motherland beautiful-rivers-and-mountains", in which "I", "love", "motherland", and "beautiful rivers and mountains" are all feature words.
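By way of illustration only, a minimal Python sketch of this preprocessing is given below, assuming the jieba package; the stop-word list and the sample texts are illustrative assumptions, not part of the invention.

    import jieba

    # Illustrative stop-word list (an assumption, not part of the invention).
    STOP_WORDS = {"的", "了", "是"}

    def preprocess(text):
        """Segment a Chinese text with jieba and filter out stop words."""
        return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

    # Illustrative two-text corpus; "我爱祖国的大好河山" segments into
    # feature words such as "我", "爱", "祖国", "大好", "河山".
    corpus = ["我爱祖国的大好河山", "文本向量生成方法和装置"]
    tokenized = [preprocess(t) for t in corpus]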
Step S202: feed the feature words obtained by the preprocessing into the word2vec model to obtain a word vector for each feature word of the corpus.
The word2vec model has two training modes: the CBOW model and the skip-gram model. The CBOW model predicts the current word from its context, while the skip-gram model predicts the context from the current word. In step S202, the CBOW model may be used to train on the feature words to obtain their word vectors. In specific implementation, the dimensionality of the word vectors can be set flexibly; for example, it can be set to 200.
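By way of illustration only, the following minimal sketch shows such training under the settings just described, assuming the gensim library (4.x API) and the tokenized corpus from the preprocessing sketch above; the window and min_count settings are illustrative assumptions.

    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=tokenized,   # token lists from the preprocessing sketch
        vector_size=200,       # 200-dimensional word vectors, as in the example
        sg=0,                  # sg=0 selects the CBOW training mode
        window=5,              # context window size (an assumption)
        min_count=1,           # keep every feature word (an assumption)
    )
    # Map each feature word to its trained word vector.
    word_vectors = {w: model.wv[w] for w in model.wv.index_to_key}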
In the embodiment of the invention, step S202 expresses each feature word as a vector of uniform dimensionality in a unified vector space. Word vectors generated in this way can better represent the semantic information of the words themselves, so that semantically similar feature words are also close to each other in the vector space.
Step S203: cluster the word vectors based on the k-means algorithm to obtain the cluster category of each feature word.
Specifically, step S203 comprises the following steps (a code sketch follows the list):
1) Determine an initial cluster center for each cluster, i.e., set the initial values of the k cluster center points, for example c[0] = data[0], ..., c[k-1] = data[k-1], where c[0] to c[k-1] are the k cluster centers and data[0] to data[k-1] are k arbitrarily selected word vectors; in other words, the k cluster centers are initialized to data[0] through data[k-1]. In specific implementation, the value of k can be set flexibly; for example, it can be set to 500.
2) Assign each sample in the sample set to the nearest cluster category according to the minimum-distance principle. Exemplarily, the Euclidean distance between each word vector (data[0] to data[n]) and each of the k cluster centers is computed, and each word vector is assigned to the cluster category of the center with the smallest Euclidean distance.
3) Use the mean of the samples in each cluster category as the new cluster center. Exemplarily, the sample mean of the i-th (i = 1, ..., k) cluster category can be obtained as the sum of all word vectors in the i-th cluster divided by the number of word vectors in the i-th cluster.
4) Repeat steps 2) and 3) until the cluster centers no longer change, or until the change of the cluster centers is below a preset threshold.
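By way of illustration only, the following minimal sketch shows this clustering step, assuming scikit-learn and the word_vectors mapping from the word2vec sketch above; k = 500 follows the example in the text, and a smaller k would be appropriate for a small corpus.

    import numpy as np
    from sklearn.cluster import KMeans

    words = list(word_vectors)
    X = np.stack([word_vectors[w] for w in words])

    k = 500  # number of cluster categories ("concepts"), as in the example
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # Map each feature word to its cluster category.
    cluster_of = {w: int(c) for w, c in zip(words, kmeans.labels_)}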
In the embodiment of the invention, step S202 embeds semantically similar words in adjacent regions of the vector space, and step S203 clusters the word vectors obtained in step S202 so that the similarity between word vectors within the same cluster category is high and the similarity between word vectors in different categories is low. After clustering, the set of feature words contained in each cluster category can be regarded as a "concept". For example, the feature words of 500 cluster categories can be regarded as 500 "concepts", and these 500 "concepts" constitute a 500-dimensional "concept space".
Step S204: compute the weight of each feature word within its text based on the TextRank algorithm.
TextRank is a graph-based ranking algorithm for text, whose basic idea comes from Google's PageRank algorithm: the text is split into several units, a graph model is built, and a voting mechanism is used to rank the important components of the text. Specifically, the TextRank algorithm mainly comprises the following steps:
A. Segment the text with a segmentation tool. Exemplarily, supposing the given text T is a paragraph, T can be split into complete sentences; each sentence in T is then segmented and part-of-speech tagged, stop words are filtered out, and only words of specified parts of speech (such as nouns, verbs, and adjectives) are retained, yielding the candidate keywords (also called "feature words") of the text.
B. Construct a directed weighted graph G = (V, E), where V is the set of nodes, consisting of the candidate keywords obtained in step A, and E is the set of edges between nodes, constructed from co-occurrence relations: an edge exists between two nodes only if the corresponding candidate keywords co-occur within a window of length K.
C. Iteratively propagate the weight of each node according to the following formula until convergence:
WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)
where WS(V_i) denotes the weight of node V_i; WS(V_j) denotes the weight of node V_j; w_ji is the weight of the edge between nodes V_j and V_i; w_jk is the weight of the edge between nodes V_j and V_k; In(V_i) denotes the set of nodes pointing to node V_i; Out(V_j) is the set of nodes pointed to by node V_j; and d is the damping coefficient, representing the probability of jumping from a given node in the graph to any other node, with a value range of 0 to 1, typically taken as 0.85.
Specifically, in step S204, the weight of the feature words within their text can be computed by calling the following Python function: jieba.analyse.textrank(content, k, withWeight=True). The return value of this function is a list of (feature word, weight) pairs for the text. In this call, the parameter k indicates the number of feature words to output, and withWeight=True indicates that the weights of the feature words should be output.
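By way of illustration only, a minimal usage sketch of the call named above follows, assuming the jieba package; the sample text and k = 5 are illustrative.

    import jieba.analyse

    content = "我爱祖国的大好河山"  # one text of the corpus (illustrative)
    # Returns a list of (feature word, weight) pairs for the text.
    weights = dict(jieba.analyse.textrank(content, 5, withWeight=True))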
Step S205: determine the number of dimensions of the text vector from the cluster categories of all feature words of the same text, and take the sum of the weights of the feature words belonging to the same cluster category in that text as the value of the text vector in the corresponding dimension.
Exemplarily, suppose step S203 yields 3 cluster categories (in other words, 3 "concepts"); the text vector then has 3 dimensions. Suppose further that text a has two feature words belonging to cluster category 1 (i.e., "concept 1"), whose weights in text a are 1.3 and 1.9 respectively; then the value of text a's vector in the "concept 1" dimension is 3.2. Proceeding in the same way yields the text vector of text a shown in Table 1.
Table 1

             concept 1    concept 2    concept 3
    text a      3.2          1.4          1.7
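By way of illustration only, a minimal sketch of this aggregation step follows, assuming the cluster_of mapping and the per-text weights produced by the sketches above.

    import numpy as np

    def text_vector(weights, cluster_of, k):
        """Sum the weights of a text's feature words per cluster category."""
        v = np.zeros(k)
        for word, weight in weights.items():
            if word in cluster_of:
                v[cluster_of[word]] += weight
        return v

    # With k = 3 concepts and two words of concept 1 weighted 1.3 and 1.9,
    # the first component of the resulting vector is 3.2, as in Table 1.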
In the embodiment of the invention, by combining the word2vec model, a clustering algorithm, and the TextRank algorithm, and by mapping texts into the "concept space", each text in a large collection can be quickly and efficiently represented as a relatively low-dimensional, fixed-length real-valued feature vector. Moreover, the generated text vector retains, as far as possible, the importance information of the feature words within the text and the actual semantic information of the text, which helps to improve the accuracy of text representation and greatly benefits subsequent text mining. In addition, the text vectors generated by the above steps overcome the non-interpretability of the text vectors generated in the prior art, possess the representational advantages of distributional representation methods (such as the bag-of-words model and the LDA model), and have a non-sparsity that such methods lack.
Fig. 3 is a schematic diagram of the result of clustering the word vectors in an embodiment of the invention. As shown in Fig. 3, suppose the feature words of a certain corpus are grouped into 3 classes according to step S203; each class of feature words can be regarded as a "concept", yielding a "concept space" with 3 "concepts". In the subsequent steps, texts can be mapped into this "concept space" to generate text vectors. Since each "concept" contains feature words with similar meanings or common contexts, the text vectors generated by mapping into the "concept space" are intuitively interpretable; they not only overcome the non-interpretability of the text vectors generated in the prior art, but also combine the representational advantages noted above with non-sparsity.
Fig. 4 is a schematic diagram of the main modules of a text vector generating device according to an embodiment of the invention. As shown in Fig. 4, the text vector generating device 400 of the embodiment of the invention comprises: a first determining module 401, a clustering module 402, a second determining module 403, and a generating module 404.
The first determining module 401 is configured to determine a word vector for each feature word of the text corpus based on a word2vec model.
Here, the text corpus can be understood as a collection of multiple texts. A text may be a sentence, a paragraph, or a document; for example, the corpus may be a collection of one million documents.
Exemplarily, the first determining module 401 determining the word vector of each feature word of the corpus may include: the first determining module 401 preprocesses the corpus, then feeds the resulting feature words into the word2vec model to obtain a word vector for each feature word of the corpus. The preprocessing may include word segmentation.
In specific implementation, if the corpus is a collection of Chinese texts, the first determining module 401 can use a Chinese word segmentation tool (such as the jieba segmentation package) for segmentation. Further, after segmentation, the preprocessing performed by the first determining module 401 may also include filtering out the stop words in each text.
The word2vec model has two training modes: the CBOW model and the skip-gram model. The CBOW model predicts the current word from its context, while the skip-gram model predicts the context from the current word. In the embodiment of the invention, the first determining module 401 may use the CBOW model to train on the feature words to obtain their word vectors. In specific implementation, the dimensionality of the word vectors can be set flexibly; for example, it can be set to 200.
In the embodiment of the invention, by training the feature words of the corpus with the word2vec model, each feature word can be expressed as a vector of uniform dimensionality in a unified vector space. Word vectors generated in this way can better represent the semantic information of the words themselves, so that semantically similar feature words are also close to each other in the vector space.
The clustering module 402 is configured to cluster the word vectors to obtain the cluster category of each feature word.
Exemplarily, the clustering module 402 may use a clustering algorithm such as the k-means algorithm or the k-medoids algorithm to cluster the word vectors. For the detailed steps of clustering word vectors with the k-means algorithm in the embodiment of the invention, refer to the description of the embodiment shown in Fig. 2.
The second determining module 403 is configured to determine the weight of each feature word within its text.
Exemplarily, the second determining module 403 may use the TextRank algorithm or the TF-IDF (term frequency-inverse document frequency) algorithm to compute the weight of each feature word within its text. For the detailed steps of computing word weights with the TextRank algorithm in the embodiment of the invention, refer to the description of the embodiment shown in Fig. 2.
The generating module 404 is configured to generate the corresponding text vector from the cluster categories and weights of all feature words of the same text.
Exemplarily, the generating module 404 generating the corresponding text vector from the cluster categories and weights of all feature words of the same text may include: the generating module 404 determines the number of dimensions of the text vector from the cluster categories of all feature words of the same text, and takes the sum of the weights of the feature words belonging to the same cluster category in that text as the value of the text vector in the corresponding dimension.
For example, suppose the clustering module 402 yields 3 cluster categories (in other words, 3 "concepts"); the text vector then has 3 dimensions. Suppose further that text a has two feature words belonging to cluster category 1 (i.e., "concept 1"), whose weights in text a are 1.3 and 1.9 respectively; then the value of text a's vector in the "concept 1" dimension is 3.2. Proceeding in the same way, the text vector of text a is obtained as (3.2, 1.4, 1.7).
It should be noted that in the embodiment of the invention, the first determining module 401 may be executed before the second determining module 403, or the second determining module 403 before the first determining module 401. In addition, to improve the efficiency of generating text vectors, the first determining module 401 and the second determining module 403 may also be executed in parallel.
In the device of the embodiment of the invention, the first determining module determines a word vector for each feature word of the corpus based on the word2vec model; the clustering module clusters the word vectors to obtain the cluster category of each feature word; the second determining module determines the weight of each feature word within its text based on the TextRank or TF-IDF algorithm; and the generating module generates the corresponding text vector from the cluster categories and weights of all feature words of the same text. In this way, each text in a large collection can not only be quickly and efficiently represented as a relatively low-dimensional, fixed-length real-valued feature vector, but the generated text vector can also retain, as far as possible, the importance information of the words within the text and the actual semantic information of the text, which helps to improve the accuracy of text representation and greatly benefits subsequent text mining.
Fig. 5 shows an exemplary system architecture 500 to which the text vector generation method or the text vector generating device of the embodiment of the invention can be applied.
As shown in Fig. 5, the system architecture 500 may include terminal devices 501, 502, and 503, a network 504, and a server 505. The network 504 serves as the medium providing communication links between the terminal devices 501, 502, 503 and the server 505, and may include various connection types, such as wired or wireless communication links or fiber optic cables.
Users may use the terminal devices 501, 502, 503 to interact with the server 505 over the network 504 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 501, 502, 503, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 501, 502, 503 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.
The server 505 may be a server providing various services, for example a back-end management server supporting websites that users browse with the terminal devices 501, 502, 503 and that provide a text vector generation service. The back-end management server may analyze and otherwise process received data such as text vector generation requests, and feed the processing results (such as the generated text vectors) back to the terminal devices.
It should be noted that the text vector generation method provided by the embodiment of the invention is generally executed by the server 505; accordingly, the text vector generating device is generally disposed in the server 505.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 5 are merely illustrative; according to implementation needs, there may be any number of terminal devices, networks, and servers.
Fig. 6 shows a schematic structural diagram of a computer system 600 suitable for implementing the electronic device of an embodiment of the invention. The electronic device shown in Fig. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the invention.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including, for example, a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to the disclosed embodiments of the invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment disclosed by the invention includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-described functions defined in the system of the invention are executed.
It should be noted that the computer-readable medium shown in the invention may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the invention may be implemented in software or in hardware. The described modules may also be disposed in a processor; for example, it may be described as: a processor includes a first determining module, a clustering module, a second determining module, and a generating module. The names of these modules do not, in some cases, limit the modules themselves; for example, the first determining module may also be described as "a module for determining the word vectors of the feature words of a text corpus".
As another aspect, the invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The above computer-readable medium carries one or more programs which, when executed by the device, cause the device to perform the following flow: determining a word vector for each feature word of a text corpus based on a word2vec model; clustering the word vectors to obtain a cluster category for each feature word; determining the weight of each feature word within its text based on the TextRank algorithm or the TF-IDF algorithm; and generating the corresponding text vector from the cluster categories and weights of all feature words of the same text.
The above specific embodiments do not constitute a limitation on the scope of protection of the invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the invention shall be included within the scope of protection of the invention.

Claims (10)

1. A text vector generation method, characterized in that the method comprises:
determining a word vector for each feature word of a text corpus based on a word2vec model;
clustering the word vectors to obtain a cluster category for each feature word;
determining the weight of each feature word within its text based on a TextRank algorithm or a TF-IDF algorithm; and
generating a corresponding text vector from the cluster categories and weights of all feature words of the same text.
2. The method according to claim 1, characterized in that the step of generating the corresponding text vector from the cluster categories and weights of all feature words of the same text comprises:
determining the number of dimensions of the text vector from the cluster categories of all feature words of the same text, and taking the sum of the weights of the feature words belonging to the same cluster category in that text as the value of the text vector in the corresponding dimension.
3. The method according to claim 1, characterized in that the step of determining a word vector for each feature word of the text corpus based on the word2vec model comprises:
preprocessing the text corpus, then feeding the resulting feature words into the word2vec model to obtain a word vector for each feature word of the corpus, wherein the preprocessing includes word segmentation.
4. The method according to claim 1, characterized in that the step of clustering the word vectors to obtain a cluster category for each feature word comprises:
clustering the word vectors based on a k-means algorithm to obtain the cluster category of each feature word.
5. A text vector generating device, characterized in that the device comprises:
a first determining module for determining a word vector for each feature word of a text corpus based on a word2vec model;
a clustering module for clustering the word vectors to obtain a cluster category for each feature word;
a second determining module for determining the weight of each feature word within its text based on a TextRank algorithm or a TF-IDF algorithm; and
a generating module for generating a corresponding text vector from the cluster categories and weights of all feature words of the same text.
6. The device according to claim 5, characterized in that the generating module generating the corresponding text vector from the cluster categories and weights of all feature words of the same text comprises:
the generating module determines the number of dimensions of the text vector from the cluster categories of all feature words of the same text, and takes the sum of the weights of the feature words belonging to the same cluster category in that text as the value of the text vector in the corresponding dimension.
7. The device according to claim 5, characterized in that the first determining module determining a word vector for each feature word of the text corpus based on the word2vec model comprises:
the first determining module preprocesses the text corpus, then feeds the resulting feature words into the word2vec model to obtain a word vector for each feature word of the corpus, wherein the preprocessing includes word segmentation.
8. The device according to claim 5, characterized in that the clustering module clustering the word vectors to obtain the cluster category of each feature word comprises:
the clustering module clusters the word vectors based on a k-means algorithm to obtain the cluster category of each feature word.
9. An electronic device, characterized by comprising:
one or more processors; and
a storage device for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 4.
10. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 4.
CN201810321444.3A 2018-04-11 2018-04-11 Text vector generation method and device Pending CN110362815A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810321444.3A CN110362815A (en) 2018-04-11 2018-04-11 Text vector generation method and device

Publications (1)

Publication Number Publication Date
CN110362815A true CN110362815A (en) 2019-10-22

Family

ID=68214275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810321444.3A Pending CN110362815A (en) 2018-04-11 2018-04-11 Text vector generation method and device

Country Status (1)

Country Link
CN (1) CN110362815A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778158A (en) * 2015-03-04 2015-07-15 新浪网技术(中国)有限公司 Method and device for representing text
CN106095996A (en) * 2016-06-22 2016-11-09 量子云未来(北京)信息科技有限公司 Method for text classification
CN106599072A (en) * 2016-11-21 2017-04-26 东软集团股份有限公司 Text clustering method and device
CN107357895A (en) * 2017-01-05 2017-11-17 大连理工大学 A kind of processing method of the text representation based on bag of words

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112825078A (en) * 2019-11-21 2021-05-21 北京沃东天骏信息技术有限公司 Information processing method and device
CN111008281A (en) * 2019-12-06 2020-04-14 浙江大搜车软件技术有限公司 Text classification method and device, computer equipment and storage medium
CN111008281B (en) * 2019-12-06 2021-09-21 浙江大搜车软件技术有限公司 Text classification method and device, computer equipment and storage medium
CN111177365A (en) * 2019-12-20 2020-05-19 山东科技大学 Unsupervised automatic abstract extraction method based on graph model
CN111177365B (en) * 2019-12-20 2022-08-02 山东科技大学 Unsupervised automatic abstract extraction method based on graph model
CN113807073A (en) * 2020-06-16 2021-12-17 中国电信股份有限公司 Text content abnormity detection method, device and storage medium
CN113807073B (en) * 2020-06-16 2023-11-14 中国电信股份有限公司 Text content anomaly detection method, device and storage medium
CN113761905A (en) * 2020-07-01 2021-12-07 北京沃东天骏信息技术有限公司 Method and device for constructing domain modeling vocabulary
CN113064979A (en) * 2021-03-10 2021-07-02 国网河北省电力有限公司 Keyword retrieval-based method for judging construction period and price reasonability
CN113377607A (en) * 2021-05-13 2021-09-10 长沙理工大学 Method and device for detecting log abnormity based on Word2Vec and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination