CN112686043A - Word vector-based classification method for emerging industries to which enterprises belong - Google Patents
- Publication number
- CN112686043A (CN202110034145.3A)
- Authority
- CN
- China
- Prior art keywords
- enterprise
- word
- emerging
- emerging industry
- obtaining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention provides a word vector-based classification method for emerging industries to which enterprises belong. The method acquires the input emerging industries and retrieves related information from the Internet according to their names; obtains candidate keywords from the related information by using the TextRank algorithm; clusters the candidate keywords by using the K-means algorithm to obtain emerging industry clustering keywords; acquires enterprise operation ranges from an official website and obtains an enterprise operation word bank from them; expands the emerging industry clustering keywords according to the enterprise operation word bank to obtain an emerging industry keyword word bank; obtains the inverse document frequency weight of each word according to the enterprise operation word bank; obtains, in sequence, a basic evaluation score, a comprehensive evaluation score and an enterprise classification score according to the operation range of the enterprise to be classified and the emerging industry keyword word bank; and obtains the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score. The method requires no manual labeling or training, has high accuracy, and can classify newly added emerging industries.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a word vector-based classification method for emerging industries to which enterprises belong.
Background
When analyzing the connections between enterprises and industries, manually assigning enterprises to their corresponding industries is time-consuming and labor-intensive, especially when there are many enterprises to classify and the emerging industries lack prior classification experience. Every enterprise has an operation range, which reflects the industry the enterprise belongs to and can therefore be used to classify the enterprise by industry. Both operation ranges and industries are described by words, and words within the same operation range or the same industry are similar, so the distance between word vectors can serve as a measure of word similarity.
In algorithm research, it has been found that appropriately introducing external parameters, including industry description information as supplementary knowledge and word inverse document frequency and part of speech as word weights, yields better classification results. In addition, unsupervised algorithms save the time and labor cost of labeling large numbers of samples. In summary, to speed up enterprise classification and improve analysis efficiency, the invention provides a word vector-based classification method for emerging industries to which enterprises belong.
In the prior art, for example, the patent application with publication number CN110019769 discloses an intelligent enterprise classification method that uses a supervised classification method based on an SVM (support vector machine). That method has the following shortcomings: a large number of samples must be manually labeled in advance, and the model takes time to train. It cannot classify emerging industries, requires substantial retraining time whenever new labels appear, and needs a pre-trained network to be deployed at use time, so its computing power requirements are high.
Disclosure of Invention
The invention aims to classify enterprises into emerging industries rapidly and accurately when the association between enterprises and industries is analyzed. Because existing methods need large amounts of labeled data and long model training times and cannot be extended to emerging industries, the invention provides a word vector-based classification method for emerging industries to which enterprises belong that requires no manual labeling or training and offers strong adaptability, high accuracy and strong extensibility.
The technical scheme adopted by the invention is as follows: a word vector-based classification method for emerging industries of enterprises is characterized by comprising the following steps:
step 1: acquiring the input emerging industries, and acquiring related information on the Internet according to the names of the emerging industries; obtaining candidate keywords of the emerging industries by using the TextRank algorithm according to the related information; clustering the candidate keywords by using the K-means algorithm to obtain emerging industry clustering keywords;
step 2: acquiring enterprise operation ranges from an official website, and obtaining an enterprise operation word bank according to the enterprise operation ranges; expanding the emerging industry clustering keywords according to the enterprise operation word bank to obtain an emerging industry keyword word bank;
step 3: obtaining the inverse document frequency weight of each word according to the enterprise operation word bank;
step 4: obtaining a basic evaluation score according to the operation range of the enterprise to be classified and the emerging industry keyword word bank; obtaining a comprehensive evaluation score according to the basic evaluation score; obtaining an enterprise classification score according to the comprehensive evaluation score; obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score;
preferably, the emerging industry in step 1 is:
$Ind_p,\quad p \in [1, M]$
wherein $Ind_p$ represents the name of the p-th emerging industry, and $M$ represents the number of emerging industries;
step 1, obtaining related information on the Internet according to the name of the emerging industry:
using $Ind_p$ as a keyword, information related to $Ind_p$ is automatically retrieved from the Internet using crawler technology and recorded as the information related to the p-th emerging industry;
step 1, obtaining candidate keywords of the emerging industry by using the TextRank algorithm according to the related information:
keywords are extracted from the information related to the p-th emerging industry by using the TextRank algorithm to obtain the candidate keywords of $Ind_p$, recorded as:
$key_p = [w_{p,1}, w_{p,2}, \dots, w_{p,D}]$
wherein $key_p$ represents the candidate keywords of the p-th emerging industry, $w_{p,d}$ represents the d-th candidate keyword of the p-th emerging industry, $d \in [1, D]$, and $D$ represents the number of candidate keywords;
step 1, clustering the candidate keywords by using the K-means algorithm to obtain emerging industry clustering keywords:
all words in $key_p$ are mapped to a multi-dimensional word vector space by using word2vec technology:
$w2v(key_p) = [w2v(w_{p,1}), w2v(w_{p,2}), \dots, w2v(w_{p,D})]$
wherein $w2v(\cdot)$ represents the function that converts words into word vectors, $key_p$ represents the candidate keywords of the p-th emerging industry, and $w_{p,d}$ represents the d-th candidate keyword of the p-th emerging industry;
K-means is used to cluster $w2v(key_p)$ to obtain the emerging industry clustering keywords, wherein the number of clusters $K_p$ is determined from $Len(key_p)$ with rounding down;
wherein $K_p$ represents the number of clusters of the p-th emerging industry, $\lfloor \cdot \rfloor$ denotes the rounding-down symbol, and $Len(key_p)$ represents the total number of candidate keywords of the p-th emerging industry;
the emerging industry clustering keywords are:
$D_{p,q}[k]$
$p \in [1, M],\ q \in [1, K_p],\ k \in [1, L_{p,q}]$
wherein $D_{p,q}[k]$ represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, $M$ represents the number of emerging industries, $K_p$ represents the total number of clusters of the p-th emerging industry, and $L_{p,q}$ represents the total number of keywords in the q-th clustering result of the p-th emerging industry.
Preferably, in step 2, the enterprise operation ranges are acquired from an official website, and the enterprise operation word bank is obtained according to the enterprise operation ranges:
the enterprise operation range in step 2 is recorded as:
$S_g$
$1 \le g \le N$
wherein $S_g$ represents the operation range information of the g-th enterprise, and $N$ represents the total number of enterprises;
the enterprise operation word bank in step 2 is obtained by segmenting the enterprise operation ranges into words and removing stop words, and is recorded as:
$F = [Split(S_1), Split(S_2), \dots, Split(S_N)]$
$Split(S_g) = [x_{g,1}, x_{g,2}, \dots, x_{g,h}]$
wherein $F$ represents the enterprise operation word bank, $Split(\cdot)$ represents the function that segments words and removes stop words, and $x_{g,h}$ represents the h-th word obtained after the g-th enterprise operation range is segmented and its stop words are removed, that is, the h-th word in the g-th enterprise operation word bank;
step 2, expanding the emerging industry clustering keywords according to the enterprise operation word bank:
for each emerging industry clustering keyword, the 3 words with the highest cosine similarity are searched for in the enterprise operation word bank:
$cossim(w2v(x_{g,h}), w2v(D_{p,q}[k]))$
wherein $cossim(\cdot,\cdot)$ represents the function that calculates cosine similarity, $w2v(\cdot)$ represents the function that converts words into word vectors, $x_{g,h}$ represents the h-th word in the g-th enterprise operation word bank, and $D_{p,q}[k]$ represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry;
the $L$ words in $F$ with the highest similarity to $D_{p,q}[k]$ are supplemented to the emerging industry clustering keywords to obtain the emerging industry keyword word bank, recorded as:
$A_{p,q} = [D_{p,q}[1], D_{p,q}[1]_1, D_{p,q}[1]_2, \dots, D_{p,q}[1]_L, \dots, D_{p,q}[k], D_{p,q}[k]_1, D_{p,q}[k]_2, \dots, D_{p,q}[k]_L]$
wherein $A_{p,q}$ represents the q-th auxiliary keyword array of the p-th emerging industry, $D_{p,q}[k]$ represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, $D_{p,q}[k]_l$ denotes the word in $F$ with the l-th highest similarity to $D_{p,q}[k]$, and $L$ represents the number of top-ranked words taken in order of similarity.
Preferably, in step 3, the inverse document frequency weight of each word is obtained according to the enterprise operation word bank:
the inverse document frequency of every word is calculated from the distribution of the words in the enterprise operation word bank and recorded as:
$idf_{won}(x_{g,h}),\quad 1 \le g \le G,\ 1 \le h \le G_g$
wherein $idf_{won}(x_{g,h})$ is the inverse document frequency of the h-th word in the g-th enterprise operation word bank, $R$ is the total number of operation ranges, $Num(x_{g,h})$ represents the total number of operation ranges containing the h-th word in the g-th enterprise operation word bank, $G$ is the total number of enterprise operation word banks, and $G_g$ is the total number of words in the g-th enterprise operation word bank;
the normalized inverse document frequency is obtained from the inverse document frequency by using a normalization algorithm and recorded as:
$idf_{norm}(x_{g,h})$
wherein $idf_{norm}(x_{g,h})$ is the normalized inverse document frequency of the h-th word in the g-th enterprise operation word bank, $idf_{won}(x_{g,h})$ is the inverse document frequency of the h-th word in the g-th enterprise operation word bank, $idf_{won}min$ is the minimum inverse document frequency over all enterprise operation word banks, and $idf_{won}max$ is the maximum inverse document frequency over all enterprise operation word banks;
the inverse document frequency weight is obtained from the normalized inverse document frequency and recorded as:
$idf(word)$
wherein $idf(\cdot)$ is the function that calculates the inverse document frequency weight, $word$ is any word, and $F$ is the enterprise operation word bank;
preferably, in step 4, the basic evaluation score is obtained according to the operation range of the enterprise to be classified and the emerging industry keyword word bank:
the enterprise to be classified is recorded as:
$C_e$
wherein $C_e$ represents the e-th enterprise to be classified;
the operation range of the enterprise to be classified is recorded as:
$Scope_e$
wherein $Scope_e$ represents the operation range of the e-th enterprise to be classified;
$Scope_e$ is segmented into words and its stop words are removed to obtain the operation range word segmentation of the enterprise, recorded as:
$query_e = [y_{e,1}, y_{e,2}, \dots, y_{e,r}]$
wherein $query_e$ represents the operation range word segmentation of the e-th enterprise to be classified, and $query_e[r] = y_{e,r}$ represents the r-th word of the operation range word segmentation of the e-th enterprise to be classified;
the cosine similarity is obtained from the operation range word segmentation of the enterprise to be classified and the emerging industry keyword word bank, and is recorded as:
$cossim(w2v(query_e[r]), w2v(A_{p,q}[t]))$
wherein $cossim(\cdot,\cdot)$ represents the function that calculates cosine similarity, $w2v(\cdot)$ represents the function that converts words into word vectors, $query_e[r]$ represents the r-th word of the operation range word segmentation of the e-th enterprise to be classified, and $A_{p,q}[t]$ represents the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the word similarity is calculated from the cosine similarity and recorded as:
$sim(query_e[r], A_{p,q}[t]) = cossim(w2v(query_e[r]), w2v(A_{p,q}[t]))$
wherein $sim(\cdot,\cdot)$ represents the function that calculates word similarity, $cossim(\cdot,\cdot)$ represents the function that calculates cosine similarity, $query_e[r]$ represents the r-th word of the operation range word segmentation of the e-th enterprise to be classified, and $A_{p,q}[t]$ represents the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the basic evaluation score is calculated from the word similarity and recorded as:
$base(query_e, A_{p,q})$
wherein $base(query_e, A_{p,q})$ represents the basic evaluation score of the operation range of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, $query_e$ represents the operation range word segmentation of the e-th enterprise, $query_e[i]$ represents the i-th word of the operation range word segmentation of the e-th enterprise to be classified, $A_{p,q}$ represents the q-th auxiliary keyword array of the p-th emerging industry, $A_{p,q}[j]$ represents the j-th word in the q-th auxiliary keyword array of the p-th emerging industry, $idf(\cdot)$ is the function that calculates the idf weight, $n$ represents the total number of words in the operation range word segmentation of the e-th enterprise, and $m$ represents the total number of words in the q-th auxiliary keyword array of the p-th emerging industry;
step 4, obtaining the comprehensive evaluation score according to the basic evaluation score:
word part-of-speech weights are introduced on top of the basic evaluation score to calculate the comprehensive evaluation score, recorded as:
$score(query_e, A_{p,q})$
wherein $score(query_e, A_{p,q})$ represents the comprehensive evaluation score of the operation range of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, $base(query_e, A_{p,q})$ represents the basic evaluation score of the operation range of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, $query_e$ represents the operation range word segmentation of the e-th enterprise, $A_{p,q}$ represents the q-th auxiliary keyword array of the p-th emerging industry, $query\_n_e$ is the array of nouns in $query_e$, $n\_n$ is the length of $query\_n_e$, $query\_v_e$ is the array of verbs in $query_e$, $n\_v$ is the length of $query\_v_e$, $c$ is a weight parameter, $m\_n$ is the length of the array of nouns in $A_{p,q}$, and $m\_v$ is the length of the array of verbs in $A_{p,q}$;
step 4, obtaining the enterprise classification score according to the comprehensive evaluation score:
the enterprise classification score is obtained from the comprehensive evaluation scores and recorded as:
$classify(C_e, Ind_p)$
wherein $classify(C_e, Ind_p)$ is the classification score of the e-th enterprise to be classified and the p-th emerging industry, $score(query_e, A_{p,i})$ represents the comprehensive evaluation score of the operation range word segmentation of the e-th enterprise to be classified and the i-th auxiliary keyword array of the p-th emerging industry, and $Q_p$ is the total number of auxiliary keyword arrays of the p-th emerging industry;
step 4, obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score:
the classification result of the emerging industry to which the enterprise belongs is obtained from the enterprise classification scores and recorded as:
$Ind_T = \arg\max_{p \in [1, M]} classify(C_e, Ind_p)$
wherein $Ind_T$ is the emerging industry that maximizes the enterprise classification score of the e-th enterprise to be classified, and $classify(C_e, Ind_p)$ is the enterprise classification score of the e-th enterprise to be classified and the p-th emerging industry.
The method has the advantages of no need of manual marking and training, strong adaptability and high accuracy, and can classify newly-added emerging industries.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 is a diagram comparing the effect of the method of the embodiment of the invention with other methods.
FIG. 3 shows example results of the embodiment of the invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings and examples. The embodiments described herein are merely illustrative and explanatory of the present invention and do not restrict it.
The emerging industries are compiled by the inventors from the emerging industries mentioned in government reports and from the development of industries in recent years, and all emerging industries are denoted Ind; the enterprises in the enterprise operation word bank are obtained from the industrial and commercial registration database, and the whole enterprise operation word bank is denoted F; the official website is queried by the name of each enterprise to be classified, and all enterprises to be classified are denoted C.
Referring to FIG. 1, the present invention provides a word vector-based classification method for emerging industries to which enterprises belong, comprising the following steps:
step 1: acquiring the input emerging industries, and acquiring related information on the Internet according to the names of the emerging industries; obtaining candidate keywords of the emerging industries by using the TextRank algorithm according to the related information; clustering the candidate keywords by using the K-means algorithm to obtain emerging industry clustering keywords;
the emerging industry in step 1 is:
$Ind_p,\quad p \in [1, M]$
wherein $Ind_p$ represents the name of the p-th emerging industry, and $M = 212$ represents the number of emerging industries;
step 1, obtaining related information on the Internet according to the name of the emerging industry:
using $Ind_p$ as a keyword, information related to $Ind_p$ is automatically retrieved from the Internet using crawler technology and recorded as the information related to the p-th emerging industry;
step 1, obtaining candidate keywords of the emerging industry by using the TextRank algorithm according to the related information:
keywords are extracted from the information related to the p-th emerging industry by using the TextRank algorithm to obtain the candidate keywords of $Ind_p$, recorded as:
$key_p = [w_{p,1}, w_{p,2}, \dots, w_{p,D}]$
wherein $key_p$ represents the candidate keywords of the p-th emerging industry, $w_{p,d}$ represents the d-th candidate keyword of the p-th emerging industry, $d \in [1, D]$, and $D = 18625$ represents the number of candidate keywords;
step 1, clustering the candidate keywords by using the K-means algorithm to obtain emerging industry clustering keywords:
all words in $key_p$ are mapped to a multi-dimensional word vector space by using word2vec technology:
$w2v(key_p) = [w2v(w_{p,1}), w2v(w_{p,2}), \dots, w2v(w_{p,D})]$
wherein $w2v(\cdot)$ represents the function that converts words into word vectors, $key_p$ represents the candidate keywords of the p-th emerging industry, and $w_{p,d}$ represents the d-th candidate keyword of the p-th emerging industry;
K-means is used to cluster $w2v(key_p)$ to obtain the emerging industry clustering keywords, wherein the number of clusters $K_p$ is determined from $Len(key_p)$ with rounding down;
wherein $K_p$ represents the number of clusters of the p-th emerging industry, $\lfloor \cdot \rfloor$ denotes the rounding-down symbol, and $Len(key_p)$ represents the total number of candidate keywords of the p-th emerging industry;
the emerging industry clustering keywords are:
$D_{p,q}[k]$
$p \in [1, M],\ q \in [1, K_p],\ k \in [1, L_{p,q}]$
wherein $D_{p,q}[k]$ represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, $M$ represents the number of emerging industries, $K_p$ represents the total number of clusters of the p-th emerging industry, and $L_{p,q}$ represents the total number of keywords in the q-th clustering result of the p-th emerging industry.
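The clustering part of step 1 can be illustrated with a minimal Python sketch. It assumes hypothetical two-dimensional vectors in place of real word2vec embeddings, a fixed cluster count of 2 in place of $K_p$, and centroids seeded from the first vectors; the helper name `kmeans` and the toy words are illustrative assumptions, not the implementation of the invention.

```python
import math

def kmeans(vectors, k, iters=20):
    """Plain Lloyd's algorithm; seeding from the first k vectors is an assumption."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid (Euclidean).
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-D "word vectors" standing in for w2v(key_p).
toy_vectors = {
    "battery": (0.9, 0.1), "charging": (0.8, 0.2),
    "sensor": (0.1, 0.9), "lidar": (0.2, 0.8),
}
words = list(toy_vectors)
labels = kmeans([toy_vectors[w] for w in words], k=2)

# Group the words by cluster label, mirroring the keyword arrays D_{p,q}.
clusters = {}
for w, lab in zip(words, labels):
    clusters.setdefault(lab, []).append(w)
print(clusters)
```

In practice the vectors would come from a trained word2vec model and the cluster count from the rule for $K_p$ given in the text.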
step 2: acquiring enterprise operation ranges from an official website, and obtaining an enterprise operation word bank according to the enterprise operation ranges; expanding the emerging industry clustering keywords according to the enterprise operation word bank to obtain an emerging industry keyword word bank;
step 2, acquiring the enterprise operation ranges from the official website and obtaining the enterprise operation word bank according to the enterprise operation ranges:
the enterprise operation range in step 2 is recorded as:
$S_g$
$1 \le g \le N$
wherein $S_g$ represents the operation range information of the g-th enterprise, and $N = 100000$ represents the total number of enterprises;
the enterprise operation word bank in step 2 is obtained by segmenting the enterprise operation ranges into words and removing stop words, and is recorded as:
$F = [Split(S_1), Split(S_2), \dots, Split(S_N)]$
$Split(S_g) = [x_{g,1}, x_{g,2}, \dots, x_{g,h}]$
wherein $F$ represents the enterprise operation word bank, $Split(\cdot)$ represents the function that segments words and removes stop words, and $x_{g,h}$ represents the h-th word obtained after the g-th enterprise operation range is segmented and its stop words are removed, that is, the h-th word in the g-th enterprise operation word bank;
step 2, expanding the emerging industry clustering keywords according to the enterprise operation word bank:
for each emerging industry clustering keyword, the 3 words with the highest cosine similarity are searched for in the enterprise operation word bank:
$cossim(w2v(x_{g,h}), w2v(D_{p,q}[k]))$
wherein $cossim(\cdot,\cdot)$ represents the function that calculates cosine similarity, $w2v(\cdot)$ represents the function that converts words into word vectors, $x_{g,h}$ represents the h-th word in the g-th enterprise operation word bank, and $D_{p,q}[k]$ represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry;
the $L$ words in $F$ with the highest similarity to $D_{p,q}[k]$ are supplemented to the emerging industry clustering keywords to obtain the emerging industry keyword word bank, recorded as:
$A_{p,q} = [D_{p,q}[1], D_{p,q}[1]_1, D_{p,q}[1]_2, \dots, D_{p,q}[1]_L, \dots, D_{p,q}[k], D_{p,q}[k]_1, D_{p,q}[k]_2, \dots, D_{p,q}[k]_L]$
wherein $A_{p,q}$ represents the q-th auxiliary keyword array of the p-th emerging industry, $D_{p,q}[k]$ represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, $D_{p,q}[k]_l$ denotes the word in $F$ with the l-th highest similarity to $D_{p,q}[k]$, and $L$ represents the number of top-ranked words taken in order of similarity.
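The keyword expansion of step 2 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the vectors are hypothetical two-dimensional stand-ins for word2vec embeddings, the word bank is a flat list of already segmented words, and the names `cossim` and `expand_cluster` are chosen here for illustration only.

```python
import math

def cossim(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_cluster(cluster, word_bank, vectors, top_l=3):
    """Follow each clustering keyword with its top_l most similar word-bank words."""
    expanded = []
    for keyword in cluster:
        expanded.append(keyword)
        candidates = [w for w in word_bank if w != keyword and w in vectors]
        candidates.sort(key=lambda w: cossim(vectors[keyword], vectors[w]),
                        reverse=True)
        expanded.extend(candidates[:top_l])
    return expanded

# Hypothetical toy vectors and word bank (assumptions for illustration).
vectors = {
    "battery": (0.9, 0.1), "charging": (0.8, 0.3),
    "storage": (0.7, 0.2), "catering": (0.0, 1.0),
}
word_bank = ["charging", "storage", "catering"]
auxiliary_array = expand_cluster(["battery"], word_bank, vectors, top_l=2)
print(auxiliary_array)
```

The returned list has the same layout as $A_{p,q}$: each clustering keyword followed by its most similar words from $F$.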
step 3: obtaining the inverse document frequency weight of each word according to the enterprise operation word bank;
step 3, obtaining the inverse document frequency weight of each word according to the enterprise operation word bank:
the inverse document frequency of every word is calculated from the distribution of the words in the enterprise operation word bank and recorded as:
$idf_{won}(x_{g,h}),\quad 1 \le g \le G,\ 1 \le h \le G_g$
wherein $idf_{won}(x_{g,h})$ is the inverse document frequency of the h-th word in the g-th enterprise operation word bank, $R$ is the total number of operation ranges, $Num(x_{g,h})$ represents the total number of operation ranges containing the h-th word in the g-th enterprise operation word bank, $G$ is the total number of enterprise operation word banks, and $G_g$ is the total number of words in the g-th enterprise operation word bank;
the normalized inverse document frequency is obtained from the inverse document frequency by using a normalization algorithm and recorded as:
$idf_{norm}(x_{g,h})$
wherein $idf_{norm}(x_{g,h})$ is the normalized inverse document frequency of the h-th word in the g-th enterprise operation word bank, $idf_{won}(x_{g,h})$ is the inverse document frequency of the h-th word in the g-th enterprise operation word bank, $idf_{won}min$ is the minimum inverse document frequency over all enterprise operation word banks, and $idf_{won}max$ is the maximum inverse document frequency over all enterprise operation word banks;
the inverse document frequency weight is obtained from the normalized inverse document frequency and recorded as:
$idf(word)$
wherein $idf(\cdot)$ is the function that calculates the inverse document frequency weight, $word$ is any word, and $F$ is the enterprise operation word bank;
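A minimal Python sketch of step 3 is given below. The exact formulas are not reproduced in this text, so the sketch assumes the common form $\log(R / Num(x))$ for the inverse document frequency and min-max scaling for the normalization; both choices, the function name `idf_weights` and the toy word bank are assumptions for illustration.

```python
import math

def idf_weights(word_bank):
    """word_bank: one list of words per enterprise operation range (F in the text)."""
    total = len(word_bank)  # R: total number of operation ranges
    doc_count = {}          # Num(x): operation ranges containing each word
    for words in word_bank:
        for word in set(words):
            doc_count[word] = doc_count.get(word, 0) + 1
    # Assumed IDF form: log(R / Num(x)).
    raw = {word: math.log(total / n) for word, n in doc_count.items()}
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    # Assumed normalization: min-max scaling into [0, 1].
    return {word: (val - lo) / span if span else 0.0 for word, val in raw.items()}

# Hypothetical segmented operation ranges.
F = [
    ["software", "development", "sales"],
    ["catering", "sales"],
    ["software", "sales"],
]
weights = idf_weights(F)
print(weights)
```

Words that appear in every operation range, such as "sales" here, receive the lowest weight, while words confined to a single operation range receive the highest. How words outside $F$ are weighted is not covered by this sketch.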
step 4: obtaining a basic evaluation score according to the operation range of the enterprise to be classified and the emerging industry keyword word bank; obtaining a comprehensive evaluation score according to the basic evaluation score; obtaining an enterprise classification score according to the comprehensive evaluation score; obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score;
step 4, obtaining the basic evaluation score according to the operation range of the enterprise to be classified and the emerging industry keyword word bank:
the enterprise to be classified is recorded as:
$C_e$
wherein $C_e$ represents the e-th enterprise to be classified;
the operation range of the enterprise to be classified is recorded as:
$Scope_e$
wherein $Scope_e$ represents the operation range of the e-th enterprise to be classified;
$Scope_e$ is segmented into words and its stop words are removed to obtain the operation range word segmentation of the enterprise, recorded as:
$query_e = [y_{e,1}, y_{e,2}, \dots, y_{e,r}]$
wherein $query_e$ represents the operation range word segmentation of the e-th enterprise to be classified, and $query_e[r] = y_{e,r}$ represents the r-th word of the operation range word segmentation of the e-th enterprise to be classified;
the cosine similarity is obtained from the operation range word segmentation of the enterprise to be classified and the emerging industry keyword word bank, and is recorded as:
$cossim(w2v(query_e[r]), w2v(A_{p,q}[t]))$
wherein $cossim(\cdot,\cdot)$ represents the function that calculates cosine similarity, $w2v(\cdot)$ represents the function that converts words into word vectors, $query_e[r]$ represents the r-th word of the operation range word segmentation of the e-th enterprise to be classified, and $A_{p,q}[t]$ represents the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the word similarity is calculated from the cosine similarity and recorded as:
$sim(query_e[r], A_{p,q}[t]) = cossim(w2v(query_e[r]), w2v(A_{p,q}[t]))$
wherein $sim(\cdot,\cdot)$ represents the function that calculates word similarity, $cossim(\cdot,\cdot)$ represents the function that calculates cosine similarity, $query_e[r]$ represents the r-th word of the operation range word segmentation of the e-th enterprise to be classified, and $A_{p,q}[t]$ represents the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the basic evaluation score is calculated from the word similarity and recorded as:
$base(query_e, A_{p,q})$
wherein $base(query_e, A_{p,q})$ represents the basic evaluation score of the operation range of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, $query_e$ represents the operation range word segmentation of the e-th enterprise, $query_e[i]$ represents the i-th word of the operation range word segmentation of the e-th enterprise to be classified, $A_{p,q}$ represents the q-th auxiliary keyword array of the p-th emerging industry, $A_{p,q}[j]$ represents the j-th word in the q-th auxiliary keyword array of the p-th emerging industry, $idf(\cdot)$ is the function that calculates the idf weight, $n$ represents the total number of words in the operation range word segmentation of the e-th enterprise, and $m$ represents the total number of words in the q-th auxiliary keyword array of the p-th emerging industry;
and 4, obtaining a comprehensive evaluation score according to the basic evaluation score:
introducing word part-of-speech weight according to the basic evaluation score, calculating a comprehensive evaluation score, and recording as:
wherein, score (query)e,Ap,q) Representing the comprehensive evaluation score, base (query), of the q-th auxiliary keyword array of the mth enterprise to be classified and the pth emerging industrye,Ap,q) Expressing the basic evaluation score, query, of the q-th auxiliary keyword array of the e-th enterprise to be classified in the operation range and the p-th emerging industryeExpress the e-th business operation area word, Ap,qRepresents the number of q auxiliary keywords of the p emerging industrySet, query _ neIs queryeAn array composed of Chinese nouns, n _ n is query _ neLength of (1), query _ veIs queryeAn array composed of medium verbs, n _ v is query _ veC is a weight parameter,is Ap,qAn array of Chinese nouns, m _ n isLength of (d);is Ap,qAn array of medium verbs, m _ v beingLength of (d);
step 4, obtaining the enterprise classification score according to the comprehensive evaluation score:
the enterprise classification score is obtained from the comprehensive evaluation scores, recorded as:
where $\mathrm{classify}(C_e, Ind_p)$ is the classification score of the $e$-th enterprise to be classified with respect to the $p$-th emerging industry, $\mathrm{score}(query_e, A_{p,i})$ denotes the comprehensive evaluation score between the operation-range word segmentation of the $e$-th enterprise to be classified and the $i$-th auxiliary keyword array of the $p$-th emerging industry, and $Q_p$ is the total number of auxiliary keyword arrays of the $p$-th emerging industry;
step 4, obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score:
the classification result of the emerging industry to which the enterprise belongs is obtained from the enterprise classification scores, recorded as:
$Ind_T = \arg\max_{i \in [1, M]} \mathrm{classify}(C_e, Ind_i)$
where $Ind_T$ is the emerging industry that maximizes the enterprise classification score of the $e$-th enterprise to be classified, and $\mathrm{classify}(C_e, Ind_p)$ is the classification score of the $e$-th enterprise to be classified with respect to the $p$-th emerging industry;
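The final decision rule above is a plain argmax over the per-industry classification scores. A minimal Python sketch follows; the industry names and score values are hypothetical, and the tie-breaking behaviour (first industry wins) is an assumption, since the text does not say how ties are handled.

```python
from typing import Dict


def classify_enterprise(classification_scores: Dict[str, float]) -> str:
    """Return the industry whose classification score is highest.

    Ties go to the industry that appears first in the mapping
    (an assumption; the text does not specify tie handling).
    """
    if not classification_scores:
        raise ValueError("no emerging industries to choose from")
    return max(classification_scores, key=classification_scores.__getitem__)


# Hypothetical classify(C_e, Ind_p) values for a single enterprise.
scores = {"new energy": 0.62, "integrated circuits": 0.81, "biomedicine": 0.34}
best = classify_enterprise(scores)
```

The scores themselves would be produced by the basic, comprehensive and classification scoring stages of step 4, which this sketch treats as given.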
step 5: for an industry that already exists in the emerging industry keyword word bank, the existing word bank is expanded by supplementing keywords, and this approach is also effective for classifying the traditional industries to which enterprises belong; for an emerging industry that does not yet exist in the emerging industry keyword word bank, a new industry category is created and an Internet crawler is used to search for related information, from which the corresponding emerging industry keyword word bank is built, thereby achieving dynamic expansion of the emerging industries.
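The two branches of step 5 can be sketched as a single update routine. This is a simplified illustration under stated assumptions: the word bank is modelled as a flat keyword list per industry rather than as the auxiliary keyword arrays, and `build_keywords` is a hypothetical callback standing in for the crawl-and-extract procedure of steps 1 and 2.

```python
from typing import Callable, Dict, List


def update_lexicon(
    lexicon: Dict[str, List[str]],
    industry: str,
    extra_keywords: List[str],
    build_keywords: Callable[[str], List[str]],
) -> None:
    """Supplement an existing industry or create a new one, in place."""
    if industry not in lexicon:
        # New industry: build its keyword list from scratch
        # (hypothetical callback, e.g. crawl the web and extract keywords).
        lexicon[industry] = list(build_keywords(industry))
    for keyword in extra_keywords:
        if keyword not in lexicon[industry]:
            lexicon[industry].append(keyword)


lexicon = {"new energy": ["solar", "wind power"]}
# Existing industry: supplement keywords, skipping duplicates.
update_lexicon(lexicon, "new energy", ["hydrogen", "solar"], lambda name: [])
# Unknown industry: create its entry through the callback.
update_lexicon(lexicon, "quantum computing", [], lambda name: ["qubit"])
```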
Finally, to illustrate the experimental effect of the invention, the invention was compared with other methods; the experimental results are shown in Fig. 2 and demonstrate the feasibility and accuracy of the invention. An example of the classification result is shown in Fig. 3.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (5)
1. A word vector-based method for classifying the emerging industries to which enterprises belong, characterized by comprising the following steps:
step 1: acquiring the input emerging industries, and acquiring related information on the Internet according to the names of the emerging industries; extracting candidate keywords of the emerging industries from the related information by using the TextRank algorithm; clustering the candidate keywords by using the K-means algorithm to obtain emerging industry clustering keywords;
step 2: acquiring the enterprise operation ranges from an official website, and obtaining an enterprise operation word bank according to the enterprise operation ranges; expanding the emerging industry clustering keywords according to the enterprise operation word bank to obtain an emerging industry keyword word bank;
step 3: obtaining the inverse document frequency weight of words according to the enterprise operation word bank;
step 4: obtaining a basic evaluation score according to the operation range of the enterprise to be classified and the emerging industry keyword word bank; obtaining a comprehensive evaluation score according to the basic evaluation score; obtaining an enterprise classification score according to the comprehensive evaluation score; and obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score.
2. The method of claim 1, wherein the method comprises the following steps:
the emerging industries in step 1 are:
$Ind_p$
$p \in [1, M]$
where $Ind_p$ denotes the name of the $p$-th emerging industry, and $M$ denotes the number of emerging industries;
step 1, acquiring related information on the Internet according to the names of the emerging industries:
taking $Ind_p$ as the keyword, information related to $Ind_p$ is automatically retrieved from the Internet by using crawler technology and recorded as the information related to the $p$-th emerging industry;
step 1, obtaining the candidate keywords of the emerging industry from the related information by using the TextRank algorithm:
keywords are extracted from the information related to the $p$-th emerging industry by using the TextRank algorithm to obtain the candidate keywords of $Ind_p$, recorded as:
$key_p = [w_{p,1}, w_{p,2}, \ldots, w_{p,D}]$
where $key_p$ denotes the candidate keywords of the $p$-th emerging industry, $w_{p,d}$ denotes the $d$-th candidate keyword of the $p$-th emerging industry, $d \in [1, D]$, and $D$ denotes the number of candidate keywords;
step 1, clustering the candidate keywords by using the K-means algorithm to obtain the emerging industry clustering keywords:
all words in $key_p$ are mapped into a multi-dimensional word vector space by using word2vec:
$\mathrm{w2v}(key_p) = [\mathrm{w2v}(w_{p,1}), \mathrm{w2v}(w_{p,2}), \ldots, \mathrm{w2v}(w_{p,D})]$
where $\mathrm{w2v}(\cdot)$ denotes the function that converts words into word vectors, $key_p$ denotes the candidate keywords of the $p$-th emerging industry, and $w_{p,d}$ denotes the $d$-th candidate keyword of the $p$-th emerging industry;
K-means is applied to $\mathrm{w2v}(key_p)$ to obtain the emerging industry clustering keywords, where the number of clusters is:
where $K_p$ denotes the number of clusters for the $p$-th emerging industry, $\lfloor\cdot\rfloor$ denotes rounding down, and $\mathrm{Len}(key_p)$ denotes the total number of candidate keywords of the $p$-th emerging industry;
the emerging industry clustering keywords are:
$D_{p,q}[k]$
$p \in [1, M],\ q \in [1, K_p],\ k \in [1, L_{p,q}]$
where $D_{p,q}[k]$ denotes the $k$-th keyword in the keyword array formed by the $q$-th clustering result of the $p$-th emerging industry, $M$ denotes the number of emerging industries, $K_p$ denotes the total number of clusters of the $p$-th emerging industry, and $L_{p,q}$ denotes the total number of keywords in the $q$-th clustering result of the $p$-th emerging industry.
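The clustering stage of claim 2 can be illustrated with a small, dependency-free K-means over word vectors. This is a sketch under stated assumptions, not the claimed implementation: the number of clusters `k` is passed in directly rather than derived from the candidate keyword count, the first `k` points are used as initial centers for determinism, and the words and vectors are hypothetical.

```python
from typing import Dict, List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_labels(points: List[Sequence[float]], k: int, max_iter: int = 50) -> List[int]:
    """Assign each point to one of k clusters with Lloyd's algorithm."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [list(p) for p in points[:k]]  # assumption: deterministic init
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_keywords(
    words: List[str], w2v: Dict[str, Sequence[float]], k: int
) -> List[List[str]]:
    """Group candidate keywords into k keyword arrays (one per cluster)."""
    labels = kmeans_labels([w2v[w] for w in words], k)
    clusters: List[List[str]] = [[] for _ in range(k)]
    for word, label in zip(words, labels):
        clusters[label].append(word)
    return clusters


# Hypothetical candidate keywords and 2-dimensional word vectors.
toy_w2v = {"solar": [1.0, 0.0], "chip": [0.0, 1.0],
           "wind": [0.9, 0.1], "wafer": [0.1, 0.9]}
clusters = cluster_keywords(["solar", "chip", "wind", "wafer"], toy_w2v, k=2)
```

A production version would more likely use a library K-means with randomized initialization; the hand-rolled loop is used here only to keep the sketch self-contained.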
3. The method of claim 1, wherein the method comprises the following steps:
step 2, acquiring the enterprise operation ranges from the official website, and obtaining the enterprise operation word bank according to the enterprise operation ranges:
the enterprise operation range in step 2 is recorded as $S_g$
$1 \le g \le N$
where $S_g$ denotes the operation range information of the $g$-th enterprise, and $N$ denotes the total number of enterprises;
the enterprise operation word bank in step 2 is obtained by removing stop words from the enterprise operation ranges and segmenting them into words, and is recorded as:
$F = [\mathrm{Split}(S_1), \mathrm{Split}(S_2), \ldots, \mathrm{Split}(S_N)]$
$\mathrm{Split}(S_g) = [x_{g,1}, x_{g,2}, \ldots, x_{g,h}]$
where $F$ denotes the enterprise operation word bank, $\mathrm{Split}(\cdot)$ denotes the function that removes stop words and segments words, and $x_{g,h}$ denotes the $h$-th word obtained after stop-word removal and word segmentation of the $g$-th enterprise operation range, namely the $h$-th word in the $g$-th enterprise operation word bank;
step 2, expanding the emerging industry clustering keywords according to the enterprise operation word bank:
for each emerging industry clustering keyword, the 3 words with the highest similarity are searched for by cosine similarity:
where $\mathrm{cossim}(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $\mathrm{w2v}(\cdot)$ denotes the function that converts words into word vectors, $x_{g,h}$ denotes the $h$-th word in the $g$-th enterprise operation word bank, and $D_{p,q}[k]$ denotes the $k$-th keyword in the keyword array formed by the $q$-th clustering result of the $p$-th emerging industry;
the $L$ words in $F$ with the highest similarity to $D_{p,q}[k]$ are supplemented to the emerging industry clustering keywords to obtain the emerging industry keyword word bank, recorded as:
$A_{p,q} = [D_{p,q}[1], D_{p,q}[1]_1, D_{p,q}[1]_2, \ldots, D_{p,q}[1]_L, \ldots, D_{p,q}[k], D_{p,q}[k]_1, D_{p,q}[k]_2, \ldots, D_{p,q}[k]_L]$
where $A_{p,q}$ denotes the $q$-th auxiliary keyword array of the $p$-th emerging industry, $D_{p,q}[k]$ denotes the $k$-th keyword in the keyword array formed by the $q$-th clustering result of the $p$-th emerging industry, $D_{p,q}[k]_l$ denotes the word in $F$ with the $l$-th highest similarity to $D_{p,q}[k]$, and $L$ denotes the number of words taken from the top of the similarity ranking.
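The expansion in claim 3 amounts to a top-L nearest-neighbour lookup by cosine similarity, with each clustering keyword followed by its L most similar words from the enterprise operation word bank. A minimal Python sketch, assuming a hypothetical vector table `toy_w2v`; excluding the keyword itself and skipping words without a vector are assumptions the text does not state.

```python
import math
from typing import Dict, List, Sequence


def cossim(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_cluster(
    cluster: List[str],
    lexicon_words: List[str],
    w2v: Dict[str, Sequence[float]],
    top_l: int,
) -> List[str]:
    """Build an auxiliary keyword array: each keyword, then its top_l neighbours."""
    expanded: List[str] = []
    for keyword in cluster:
        expanded.append(keyword)
        # Assumption: the keyword itself and out-of-vocabulary words are skipped.
        candidates = [w for w in lexicon_words if w != keyword and w in w2v]
        ranked = sorted(
            candidates,
            key=lambda w: cossim(w2v[keyword], w2v[w]),
            reverse=True,
        )
        expanded.extend(ranked[:top_l])
    return expanded


# Hypothetical word vectors and enterprise operation word bank.
toy_w2v = {
    "solar": [1.0, 0.0],
    "photovoltaic": [0.9, 0.1],
    "panel": [0.8, 0.3],
    "catering": [0.0, 1.0],
}
auxiliary = expand_cluster(
    ["solar"], ["photovoltaic", "panel", "catering", "solar"], toy_w2v, top_l=2
)
```

With a trained gensim model, the ranking step could instead be delegated to the model's own nearest-neighbour query, restricted to words that occur in the word bank.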
4. The method of claim 1, wherein the method comprises the following steps:
step 3, obtaining the inverse document frequency weight of words according to the enterprise operation word bank:
the inverse document frequency of every word is calculated according to the distribution of words in the enterprise operation word bank, recorded as:
where $\mathrm{idf}_{won}(x_{g,h})$ is the inverse document frequency of the $h$-th word in the $g$-th enterprise operation word bank, $R$ is the total number of operation ranges, $\mathrm{Num}(x_{g,h})$ denotes the number of operation ranges that contain the $h$-th word of the $g$-th enterprise operation word bank, $G$ is the total number of enterprise operation word banks, and $G_g$ is the total number of words in the $g$-th enterprise operation word bank;
the normalized inverse document frequency is obtained from the inverse document frequency by using a normalization algorithm, recorded as:
where $\mathrm{idf}_{norm}(x_{g,h})$ is the normalized inverse document frequency of the $h$-th word in the $g$-th enterprise operation word bank, $\mathrm{idf}_{won}(x_{g,h})$ is the inverse document frequency of the $h$-th word in the $g$-th enterprise operation word bank, $\mathrm{idf}_{won}\mathrm{min}$ is the minimum inverse document frequency over all enterprise operation word banks, and $\mathrm{idf}_{won}\mathrm{max}$ is the maximum inverse document frequency over all enterprise operation word banks;
the inverse document frequency weight is obtained according to the normalized inverse document frequency, recorded as:
where $\mathrm{idf}(\cdot)$ is the function that calculates the inverse document frequency weight, $word$ is any word, and $F$ is the enterprise operation word bank.
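The three stages of claim 4 (raw inverse document frequency, normalization, weight lookup) can be sketched as follows. The exact formulas are not reproduced in this text, so everything below is an assumption chosen for illustration: the textbook form log(R / Num) for the raw value, min-max scaling for the normalization, and a hypothetical `default` weight for words that do not occur in the word bank.

```python
import math
from typing import Dict, List


def inverse_document_frequency(lexicon: List[List[str]]) -> Dict[str, float]:
    """Raw idf per word, assuming the textbook form log(R / Num(word))."""
    total = len(lexicon)  # R: number of operation ranges
    containing: Dict[str, int] = {}
    for scope_words in lexicon:
        for word in set(scope_words):  # count each range at most once
            containing[word] = containing.get(word, 0) + 1
    return {w: math.log(total / n) for w, n in containing.items()}


def min_max_normalize(idf: Dict[str, float]) -> Dict[str, float]:
    """Scale idf values to [0, 1]; min-max scaling is an assumption."""
    if not idf:
        return {}
    lo, hi = min(idf.values()), max(idf.values())
    if hi == lo:
        return {w: 1.0 for w in idf}
    return {w: (v - lo) / (hi - lo) for w, v in idf.items()}


def idf_weight(word: str, idf_norm: Dict[str, float], default: float = 1.0) -> float:
    """Weight lookup; `default` for unseen words is a hypothetical choice."""
    return idf_norm.get(word, default)


# Hypothetical enterprise operation word bank: one word list per enterprise.
F = [["software", "development"], ["software", "sales"], ["software", "catering"]]
idf_raw = inverse_document_frequency(F)
idf_norm = min_max_normalize(idf_raw)
```

Under these assumptions a word that appears in every operation range, such as "software" here, receives the lowest weight, which matches the usual motivation for idf weighting.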
5. The method of claim 1, wherein the method comprises the following steps:
step 4, obtaining the basic evaluation score according to the operation range of the enterprise to be classified and the emerging industry keyword word bank:
the enterprise to be classified is recorded as:
$C_e$
where $C_e$ denotes the $e$-th enterprise to be classified;
the operation range of the enterprise to be classified is recorded as:
$Scope_e$
where $Scope_e$ denotes the operation range of the $e$-th enterprise to be classified;
$Scope_e$ is segmented into words and stop words are removed to obtain the enterprise operation-range word segmentation, recorded as:
$query_e = [y_{e,1}, y_{e,2}, \ldots, y_{e,r}]$
where $query_e$ denotes the operation-range word segmentation of the $e$-th enterprise to be classified, and $query_e[r] = y_{e,r}$ denotes the $r$-th word of that segmentation;
the cosine similarity is obtained according to the operation-range word segmentation of the enterprise to be classified and the emerging industry keyword word bank, recorded as:
where $\mathrm{cossim}(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $\mathrm{w2v}(\cdot)$ denotes the function that converts words into word vectors, $query_e[r]$ denotes the $r$-th word of the operation-range word segmentation of the $e$-th enterprise to be classified, and $A_{p,q}[t]$ denotes the $t$-th word in the $q$-th auxiliary keyword array of the $p$-th emerging industry;
the word similarity is calculated according to the cosine similarity, recorded as:
$\mathrm{sim}(query_e[r], A_{p,q}[t]) = \mathrm{cossim}(\mathrm{w2v}(query_e[r]), \mathrm{w2v}(A_{p,q}[t]))$
where $\mathrm{sim}(\cdot,\cdot)$ denotes the function that calculates word similarity, $\mathrm{cossim}(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $query_e[r]$ denotes the $r$-th word of the operation-range word segmentation of the $e$-th enterprise to be classified, and $A_{p,q}[t]$ denotes the $t$-th word in the $q$-th auxiliary keyword array of the $p$-th emerging industry;
the basic evaluation score is calculated according to the word similarity, recorded as:
where $\mathrm{base}(query_e, A_{p,q})$ denotes the basic evaluation score between the operation-range word segmentation of the $e$-th enterprise to be classified and the $q$-th auxiliary keyword array of the $p$-th emerging industry, $query_e$ denotes the operation-range word segmentation of the $e$-th enterprise, $query_e[i]$ denotes the $i$-th word of that segmentation, $A_{p,q}$ denotes the $q$-th auxiliary keyword array of the $p$-th emerging industry, $A_{p,q}[t]$ denotes the $t$-th word in that array, $\mathrm{idf}(\cdot)$ is the function that calculates the idf weight, $n$ denotes the total number of words in the operation-range word segmentation of the $e$-th enterprise, and $m$ denotes the total number of words in the $q$-th auxiliary keyword array of the $p$-th emerging industry;
step 4, obtaining the comprehensive evaluation score according to the basic evaluation score:
part-of-speech weights are introduced on the basis of the basic evaluation score to calculate the comprehensive evaluation score, recorded as:
where $\mathrm{score}(query_e, A_{p,q})$ denotes the comprehensive evaluation score between the $e$-th enterprise to be classified and the $q$-th auxiliary keyword array of the $p$-th emerging industry, $\mathrm{base}(query_e, A_{p,q})$ denotes the basic evaluation score between the operation-range word segmentation of the $e$-th enterprise to be classified and the $q$-th auxiliary keyword array of the $p$-th emerging industry, $query_e$ denotes the operation-range word segmentation of the $e$-th enterprise, $A_{p,q}$ denotes the $q$-th auxiliary keyword array of the $p$-th emerging industry, $query\_n_e$ is the array of nouns in $query_e$, $n\_n$ is the length of $query\_n_e$, $query\_v_e$ is the array of verbs in $query_e$, $n\_v$ is the length of $query\_v_e$, $c$ is a weight parameter, $m\_n$ is the length of the array of nouns in $A_{p,q}$, and $m\_v$ is the length of the array of verbs in $A_{p,q}$;
step 4, obtaining the enterprise classification score according to the comprehensive evaluation score:
the enterprise classification score is obtained from the comprehensive evaluation scores, recorded as:
where $\mathrm{classify}(C_e, Ind_p)$ is the classification score of the $e$-th enterprise to be classified with respect to the $p$-th emerging industry, $\mathrm{score}(query_e, A_{p,i})$ denotes the comprehensive evaluation score between the operation-range word segmentation of the $e$-th enterprise to be classified and the $i$-th auxiliary keyword array of the $p$-th emerging industry, and $Q_p$ is the total number of auxiliary keyword arrays of the $p$-th emerging industry;
step 4, obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score:
the classification result of the emerging industry to which the enterprise belongs is obtained from the enterprise classification scores, recorded as:
$Ind_T = \arg\max_{i \in [1, M]} \mathrm{classify}(C_e, Ind_i)$
where $Ind_T$ is the emerging industry that maximizes the enterprise classification score of the $e$-th enterprise to be classified, and $\mathrm{classify}(C_e, Ind_p)$ is the classification score of the $e$-th enterprise to be classified with respect to the $p$-th emerging industry.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110034145.3A CN112686043B (en) | 2021-01-12 | 2021-01-12 | Word vector-based classification method for emerging industries of enterprises |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112686043A true CN112686043A (en) | 2021-04-20 |
CN112686043B CN112686043B (en) | 2024-02-06 |
Family
ID=75457447
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110034145.3A Active CN112686043B (en) | 2021-01-12 | 2021-01-12 | Word vector-based classification method for emerging industries of enterprises |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112686043B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105893410A (en) * | 2015-11-18 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Keyword extraction method and apparatus |
WO2017084267A1 (en) * | 2015-11-18 | 2017-05-26 | 乐视控股(北京)有限公司 | Method and device for keyphrase extraction |
WO2019134554A1 (en) * | 2018-01-08 | 2019-07-11 | 阿里巴巴集团控股有限公司 | Content recommendation method and apparatus |
CN109086375A (en) * | 2018-07-24 | 2018-12-25 | 武汉大学 | A kind of short text subject extraction method based on term vector enhancing |
CN112182223A (en) * | 2020-10-12 | 2021-01-05 | 浙江工业大学 | Enterprise industry classification method and system based on domain ontology |
Non-Patent Citations (2)
Title |
---|
夏天: "词向量聚类加权TextRank的关键词抽取", 数据分析与知识发现, vol. 1, no. 2 * |
彭敏;张泰玮;黄佳佳;朱佳晖;黄济民;: "基于回归模型与谱聚类的微博突发话题检测方法", 计算机工程, no. 12 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113360652A (en) * | 2021-06-07 | 2021-09-07 | 深圳供电局有限公司 | Enterprise-level power user intelligent classification method and device |
CN113360652B (en) * | 2021-06-07 | 2024-03-01 | 深圳供电局有限公司 | Enterprise-level power user intelligent classification method and device |
CN114492308A (en) * | 2021-12-29 | 2022-05-13 | 北京航天智造科技发展有限公司 | Industrial information indexing method and system combining knowledge discovery and text mining |
CN114492308B (en) * | 2021-12-29 | 2023-11-24 | 北京航天智造科技发展有限公司 | Industry information indexing method and system combining knowledge discovery and text mining |
CN115879901A (en) * | 2023-02-22 | 2023-03-31 | 陕西湘秦衡兴科技集团股份有限公司 | Intelligent personnel self-service platform |
CN115879901B (en) * | 2023-02-22 | 2023-07-28 | 陕西湘秦衡兴科技集团股份有限公司 | Intelligent personnel self-service platform |
Also Published As
Publication number | Publication date |
---|---|
CN112686043B (en) | 2024-02-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||