CN112686043A - Word vector-based classification method for emerging industries to which enterprises belong - Google Patents

Info

Publication number
CN112686043A
Authority
CN
China
Prior art keywords
enterprise
word
emerging
emerging industry
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110034145.3A
Other languages
Chinese (zh)
Other versions
CN112686043B (en)
Inventor
彭敏
徐文杰
胡刚
贾旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202110034145.3A priority Critical patent/CN112686043B/en
Publication of CN112686043A publication Critical patent/CN112686043A/en
Application granted granted Critical
Publication of CN112686043B publication Critical patent/CN112686043B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a word vector-based method for classifying enterprises into the emerging industries to which they belong. The method acquires the input emerging industries and retrieves related information from the Internet by industry name; extracts candidate keywords from that information with the TextRank algorithm; clusters the candidate keywords with the K-means algorithm to obtain emerging industry clustering keywords; acquires enterprise business scopes from an official website and builds an enterprise operation word bank from them; expands the clustering keywords with the enterprise operation word bank to obtain an emerging industry keyword word bank; computes the inverse document frequency weights of words from the enterprise operation word bank; computes, in order, a basic evaluation score, a comprehensive evaluation score, and an enterprise classification score from the business scope of the enterprise to be classified and the emerging industry keyword word bank; and obtains the emerging industry to which the enterprise belongs from the enterprise classification score. The method requires no manual labeling or training, achieves high accuracy, and can classify newly added emerging industries.

Description

Word vector-based classification method for emerging industries to which enterprises belong
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a word vector-based classification method for emerging industries to which enterprises belong.
Background
When analyzing the relationship between enterprises and industries, manually assigning enterprises to their corresponding industries is time-consuming and labor-intensive, especially when there are many enterprises to classify and the emerging industries lack prior classification experience. Every enterprise has a business scope, which reflects the industry the enterprise belongs to, so business scopes can be used to analyze industry classification. Both business scopes and industries are described by words, and words within the same business scope or the same industry are similar, so the distance between word vectors can serve as a measure of word similarity.
Algorithmic research shows that suitably introducing external information, including industry description information as supplementary knowledge and word inverse document frequency and part of speech as word weights, yields better classification results. In addition, unsupervised algorithms save the time and labor of labeling large numbers of samples. In view of this, and in order to speed up enterprise classification and improve analysis efficiency, the invention provides a word vector-based classification method for emerging industries to which enterprises belong.
In the prior art, for example, the patent application with publication number CN110019769 discloses an intelligent enterprise classification method that uses supervised classification based on an SVM (support vector machine). That method has the following shortcomings: a large number of samples must be labeled manually in advance, and the model must be trained for some time. It cannot classify emerging industries, retraining takes a long time whenever a new label appears, and a pre-trained network must be deployed for use, so the computing power requirement is high.
Disclosure of Invention
The invention aims to classify enterprises into emerging industries rapidly and accurately when the relationship between enterprises and industries is analyzed. Because existing methods require labeling a large amount of data, spend a long time on model training, and cannot be extended to emerging industries, the invention provides a word vector-based method for classifying the emerging industries to which enterprises belong that needs no manual labeling or training and offers strong adaptability, high accuracy, and strong extensibility.
The technical scheme adopted by the invention is as follows: a word vector-based classification method for emerging industries of enterprises is characterized by comprising the following steps:
Step 1: acquiring the input emerging industries, and obtaining related information on the Internet according to the names of the emerging industries; obtaining candidate keywords of the emerging industries by using the TextRank algorithm according to the related information; clustering the candidate keywords by using the K-means algorithm to obtain emerging industry clustering keywords;
Step 2: acquiring enterprise business scopes from an official website, and obtaining an enterprise operation word bank according to the business scopes; expanding the emerging industry clustering keywords according to the enterprise operation word bank to obtain an emerging industry keyword word bank;
Step 3: obtaining the inverse document frequency weights of words according to the enterprise operation word bank;
Step 4: obtaining a basic evaluation score according to the business scope of the enterprise to be classified and the emerging industry keyword word bank; obtaining a comprehensive evaluation score according to the basic evaluation score; obtaining an enterprise classification score according to the comprehensive evaluation score; obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score;
Preferably, the emerging industries in step 1 are:
$Ind_p, \quad p \in [1, M]$
wherein $Ind_p$ denotes the name of the p-th emerging industry, and $M$ denotes the number of emerging industries;
In step 1, the related information is obtained on the Internet according to the name of the emerging industry as follows:
using $Ind_p$ as a keyword, the information related to $Ind_p$ is automatically retrieved from the Internet with crawler technology and recorded as
[formula image BDA0002893513890000021]
which denotes the information related to the p-th emerging industry;
In step 1, the candidate keywords of the emerging industry are obtained with the TextRank algorithm according to the related information as follows:
keywords are extracted with the TextRank algorithm from
[formula image BDA0002893513890000022]
to obtain the candidate keywords of $Ind_p$, recorded as:
$key_p = [w_{p,1}, w_{p,2}, \dots, w_{p,D}]$
wherein $key_p$ denotes the candidate keywords of the p-th emerging industry, $w_{p,d}$ denotes the d-th candidate keyword of the p-th emerging industry, $d \in [1, D]$, and $D$ denotes the number of candidate keywords;
In step 1, the candidate keywords are clustered with the K-means algorithm to obtain the emerging industry clustering keywords as follows:
all words in $key_p$ are mapped into a multi-dimensional word vector space using word2vec:
$w2v(key_p) = [w2v(w_{p,1}), w2v(w_{p,2}), \dots, w2v(w_{p,D})]$
wherein $w2v(\cdot)$ denotes the function that converts a word into a word vector, $key_p$ denotes the candidate keywords of the p-th emerging industry, and $w_{p,d}$ denotes the d-th candidate keyword of the p-th emerging industry;
K-means is applied to $w2v(key_p)$ to obtain the emerging industry clustering keywords, with the number of clusters given by:
[formula image BDA0002893513890000031]
wherein $K_p$ denotes the number of clusters for the p-th emerging industry, $\lfloor \cdot \rfloor$ denotes rounding down, and $Len(key_p)$ denotes the total number of candidate keywords of the p-th emerging industry;
the emerging industry clustering keywords are:
$D_{p,q}[k], \quad p \in [1, M],\ q \in [1, K_p],\ k \in [1, L_{p,q}]$
wherein $D_{p,q}[k]$ denotes the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, $M$ denotes the number of emerging industries, $K_p$ denotes the total number of clusters for the p-th emerging industry, and $L_{p,q}$ denotes the total number of keywords in the q-th clustering result of the p-th emerging industry.
Preferably, in step 2, the enterprise business scopes are acquired from the official website and the enterprise operation word bank is obtained from them as follows:
the enterprise business scope in step 2 is recorded as
$S_g, \quad 1 \le g \le N$
wherein $S_g$ denotes the business scope information of the g-th enterprise, and $N$ denotes the total number of enterprises;
the enterprise operation word bank in step 2 is obtained by segmenting the business scopes into words and removing stop words, and is recorded as:
$F = [Split(S_1), Split(S_2), \dots, Split(S_N)]$
$Split(S_g) = [x_{g,1}, x_{g,2}, \dots, x_{g,h}]$
wherein $F$ denotes the enterprise operation word bank, $Split(\cdot)$ denotes the function that segments words and removes stop words, and $x_{g,h}$ denotes the h-th word obtained from the business scope of the g-th enterprise after word segmentation and stop word removal, that is, the h-th word in the g-th enterprise operation word bank;
In step 2, the emerging industry clustering keywords are expanded according to the enterprise operation word bank as follows:
for each emerging industry clustering keyword, the 3 words with the highest cosine similarity are searched for:
[formula image BDA0002893513890000041]
wherein $cossim(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $w2v(\cdot)$ denotes the function that converts a word into a word vector, $x_{g,h}$ denotes the h-th word in the g-th enterprise operation word bank, and $D_{p,q}[k]$ denotes the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry;
the $L$ words in $F$ with the highest similarity to $D_{p,q}[k]$ are added to the emerging industry clustering keywords to obtain the emerging industry keyword word bank, recorded as:
$A_{p,q} = [D_{p,q}[1], D_{p,q}[1]_1, D_{p,q}[1]_2, \dots, D_{p,q}[1]_L, \dots, D_{p,q}[k], D_{p,q}[k]_1, D_{p,q}[k]_2, \dots, D_{p,q}[k]_L]$
wherein $A_{p,q}$ denotes the q-th auxiliary keyword array of the p-th emerging industry, $D_{p,q}[k]$ denotes the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, $D_{p,q}[k]_l$ denotes the word in $F$ with the l-th highest similarity to $D_{p,q}[k]$, and $L$ denotes the number of top-ranked words taken in order of similarity.
Preferably, in step 3, the inverse document frequency weights of words are obtained according to the enterprise operation word bank as follows:
the inverse document frequency of every word is calculated from the distribution of words in the enterprise operation word bank and recorded as:
[formula image BDA0002893513890000042]
$1 \le g \le G, \quad 1 \le h \le G_g$
wherein $idf_{won}(x_{g,h})$ is the inverse document frequency of the h-th word in the g-th enterprise operation word bank, $R$ is the total number of business scopes, $Num(x_{g,h})$ is the number of business scopes that contain the h-th word of the g-th enterprise operation word bank, $G$ is the total number of enterprise operation word banks, and $G_g$ is the total number of words in the g-th enterprise operation word bank;
the normalized inverse document frequency is obtained from the inverse document frequency with a normalization algorithm and recorded as:
[formula image BDA0002893513890000051]
wherein $idf_{norm}(x_{g,h})$ is the normalized inverse document frequency of the h-th word in the g-th enterprise operation word bank, $idf_{won}(x_{g,h})$ is the inverse document frequency of that word, $idf_{won}^{min}$ is the minimum inverse document frequency over all enterprise operation word banks, and $idf_{won}^{max}$ is the maximum inverse document frequency over all enterprise operation word banks;
the inverse document frequency weight is obtained from the normalized inverse document frequency and recorded as:
[formula image BDA0002893513890000052]
wherein $idf(\cdot)$ is the function that calculates the inverse document frequency weight, $word$ is any word, and $F$ is the enterprise operation word bank;
Preferably, in step 4, the basic evaluation score is obtained from the business scope of the enterprise to be classified and the emerging industry keyword word bank as follows:
the enterprise to be classified is denoted:
$C_e$
wherein $C_e$ denotes the e-th enterprise to be classified;
the business scope of the enterprise to be classified is denoted:
$Scope_e$
wherein $Scope_e$ denotes the business scope of the e-th enterprise to be classified;
$Scope_e$ is segmented into words and stop words are removed to obtain the business scope word segmentation, recorded as:
$query_e = [y_{e,1}, y_{e,2}, \dots, y_{e,r}]$
wherein $query_e$ denotes the business scope word segmentation of the e-th enterprise to be classified, and $query_e[r] = y_{e,r}$ denotes its r-th word;
the cosine similarity between the business scope word segmentation of the enterprise to be classified and the emerging industry keyword word bank is recorded as:
[formula image BDA0002893513890000053]
wherein $cossim(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $w2v(\cdot)$ denotes the function that converts a word into a word vector, $query_e[r]$ denotes the r-th word of the business scope word segmentation of the e-th enterprise to be classified, and $A_{p,q}[t]$ denotes the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the word similarity is calculated from the cosine similarity and recorded as:
$sim(query_e[r], A_{p,q}[t]) = cossim(w2v(query_e[r]), w2v(A_{p,q}[t]))$
wherein $sim(\cdot,\cdot)$ denotes the function that calculates word similarity, $cossim(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $query_e[r]$ denotes the r-th word of the business scope word segmentation of the e-th enterprise to be classified, and $A_{p,q}[t]$ denotes the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the basic evaluation score is calculated from the word similarity and recorded as:
[formula image BDA0002893513890000061]
wherein $base(query_e, A_{p,q})$ denotes the basic evaluation score between the business scope of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, $query_e$ denotes the business scope word segmentation of the e-th enterprise, $query_e[i]$ denotes the i-th word of that word segmentation, $A_{p,q}$ denotes the q-th auxiliary keyword array of the p-th emerging industry, $A_{p,q}[j]$ denotes the j-th word in that array, $idf(\cdot)$ is the function that calculates the idf weight, $n$ denotes the total number of words in the business scope word segmentation of the e-th enterprise, and $m$ denotes the total number of words in the q-th auxiliary keyword array of the p-th emerging industry;
In step 4, the comprehensive evaluation score is obtained from the basic evaluation score as follows:
part-of-speech weights are introduced on top of the basic evaluation score to calculate the comprehensive evaluation score, recorded as:
[formula image BDA0002893513890000062]
wherein $score(query_e, A_{p,q})$ denotes the comprehensive evaluation score between the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, $base(query_e, A_{p,q})$ denotes the corresponding basic evaluation score, $query_e$ denotes the business scope word segmentation of the e-th enterprise, $A_{p,q}$ denotes the q-th auxiliary keyword array of the p-th emerging industry, $query\_n_e$ is the array of nouns in $query_e$, $n\_n$ is the length of $query\_n_e$, $query\_v_e$ is the array of verbs in $query_e$, $n\_v$ is the length of $query\_v_e$, and $c$ is a weight parameter;
[formula image BDA0002893513890000071]
is the array of nouns in $A_{p,q}$, and $m\_n$ is the length of
[formula image BDA0002893513890000072]
;
[formula image BDA0002893513890000073]
is the array of verbs in $A_{p,q}$, and $m\_v$ is the length of
[formula image BDA0002893513890000074]
;
In step 4, the enterprise classification score is obtained from the comprehensive evaluation score and recorded as:
[formula image BDA0002893513890000075]
wherein $classify(C_e, Ind_p)$ is the classification score between the e-th enterprise to be classified and the p-th emerging industry, $score(query_e, A_{p,i})$ denotes the comprehensive evaluation score between the business scope word segmentation of the e-th enterprise to be classified and the i-th auxiliary keyword array of the p-th emerging industry, and $Q_p$ is the total number of auxiliary keyword arrays of the p-th emerging industry;
In step 4, the classification result of the emerging industry to which the enterprise belongs is obtained from the enterprise classification score and recorded as:
$Ind_T = \arg\max_{Ind_i} classify(C_e, Ind_i)$
wherein $Ind_T$ is the emerging industry that maximizes the enterprise classification score of the e-th enterprise to be classified, and $classify(C_e, Ind_p)$ is the classification score between the e-th enterprise to be classified and the p-th emerging industry.
The method has the advantages of no need of manual marking and training, strong adaptability and high accuracy, and can classify newly-added emerging industries.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
FIG. 2 is a comparison of the results of the method of the embodiment of the invention.
FIG. 3 shows the results of the embodiment of the invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and examples. The embodiments described herein merely illustrate and explain the invention and do not limit it.
The emerging industries were compiled by the inventors from the emerging industries mentioned in government reports and from industry developments in recent years; the set of all emerging industries is denoted Ind. The enterprises in the enterprise operation word bank are taken from the industrial and commercial registration database; the whole enterprise operation word bank is denoted F. For each enterprise to be classified, the official website of the corresponding enterprise is queried by enterprise name; the set of all enterprises to be classified is denoted C.
Referring to FIG. 1, the present invention provides a word vector-based classification method for emerging industries to which enterprises belong, comprising the following steps:
Step 1: acquiring the input emerging industries, and obtaining related information on the Internet according to the names of the emerging industries; obtaining candidate keywords of the emerging industries by using the TextRank algorithm according to the related information; clustering the candidate keywords by using the K-means algorithm to obtain emerging industry clustering keywords;
The emerging industries in step 1 are:
$Ind_p, \quad p \in [1, M]$
wherein $Ind_p$ denotes the name of the p-th emerging industry, and $M = 212$ denotes the number of emerging industries;
In step 1, the related information is obtained on the Internet according to the name of the emerging industry as follows:
using $Ind_p$ as a keyword, the information related to $Ind_p$ is automatically retrieved from the Internet with crawler technology and recorded as
[formula image BDA0002893513890000081]
which denotes the information related to the p-th emerging industry;
In step 1, the candidate keywords of the emerging industry are obtained with the TextRank algorithm according to the related information as follows:
keywords are extracted with the TextRank algorithm from
[formula image BDA0002893513890000082]
to obtain the candidate keywords of $Ind_p$, recorded as:
$key_p = [w_{p,1}, w_{p,2}, \dots, w_{p,D}]$
wherein $key_p$ denotes the candidate keywords of the p-th emerging industry, $w_{p,d}$ denotes the d-th candidate keyword of the p-th emerging industry, $d \in [1, D]$, and $D = 18625$ denotes the number of candidate keywords;
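The TextRank extraction above can be illustrated with a minimal sketch. It assumes the crawled text has already been segmented into a token list; the function name `textrank_keywords`, the co-occurrence window, the damping factor, and the iteration count are illustrative assumptions and are not taken from the patent.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph.

    All parameter defaults are assumptions chosen for illustration.
    """
    # Link every word to the words that follow it inside the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update a fixed number of times.
    rank = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        rank = {
            word: (1 - damping)
            + damping * sum(rank[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(rank, key=lambda w: (-rank[w], w))[:top_k]
```

Calling `textrank_keywords(tokens, top_k=D)` on the segmented text of one industry would yield a list playing the role of $key_p$; a production system would more likely rely on an existing TextRank implementation.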
In step 1, the candidate keywords are clustered with the K-means algorithm to obtain the emerging industry clustering keywords as follows:
all words in $key_p$ are mapped into a multi-dimensional word vector space using word2vec:
$w2v(key_p) = [w2v(w_{p,1}), w2v(w_{p,2}), \dots, w2v(w_{p,D})]$
wherein $w2v(\cdot)$ denotes the function that converts a word into a word vector, $key_p$ denotes the candidate keywords of the p-th emerging industry, and $w_{p,d}$ denotes the d-th candidate keyword of the p-th emerging industry;
K-means is applied to $w2v(key_p)$ to obtain the emerging industry clustering keywords, with the number of clusters given by:
[formula image BDA0002893513890000091]
wherein $K_p$ denotes the number of clusters for the p-th emerging industry, $\lfloor \cdot \rfloor$ denotes rounding down, and $Len(key_p)$ denotes the total number of candidate keywords of the p-th emerging industry;
the emerging industry clustering keywords are:
$D_{p,q}[k], \quad p \in [1, M],\ q \in [1, K_p],\ k \in [1, L_{p,q}]$
wherein $D_{p,q}[k]$ denotes the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, $M$ denotes the number of emerging industries, $K_p$ denotes the total number of clusters for the p-th emerging industry, and $L_{p,q}$ denotes the total number of keywords in the q-th clustering result of the p-th emerging industry.
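The clustering of keyword vectors can be sketched as follows. This is an illustration under stated assumptions: word vectors are supplied as a plain dictionary (standing in for a trained word2vec model), the number of clusters `k` is passed in directly because the formula for $K_p$ is not reproduced here, and the farthest-point initialisation is a choice made for determinism, not something the patent specifies.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j])) for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_keywords(word_vectors, k):
    """Group keywords into k lists, one list per cluster (the role of D_{p,q})."""
    words = list(word_vectors)
    labels = kmeans([word_vectors[w] for w in words], k)
    clusters = [[] for _ in range(k)]
    for word, label in zip(words, labels):
        clusters[label].append(word)
    return clusters
```

Each returned list corresponds to one keyword array $D_{p,q}$ of a single industry; the toy two-dimensional vectors one might test with are far smaller than real word2vec embeddings.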
Step 2: acquiring enterprise business scopes from an official website, and obtaining an enterprise operation word bank according to the business scopes; expanding the emerging industry clustering keywords according to the enterprise operation word bank to obtain an emerging industry keyword word bank;
In step 2, the enterprise business scopes are acquired from the official website and the enterprise operation word bank is obtained from them as follows:
the enterprise business scope in step 2 is recorded as
$S_g, \quad 1 \le g \le N$
wherein $S_g$ denotes the business scope information of the g-th enterprise, and $N = 100000$ denotes the total number of enterprises;
the enterprise operation word bank in step 2 is obtained by segmenting the business scopes into words and removing stop words, and is recorded as:
$F = [Split(S_1), Split(S_2), \dots, Split(S_N)]$
$Split(S_g) = [x_{g,1}, x_{g,2}, \dots, x_{g,h}]$
wherein $F$ denotes the enterprise operation word bank, $Split(\cdot)$ denotes the function that segments words and removes stop words, and $x_{g,h}$ denotes the h-th word obtained from the business scope of the g-th enterprise after word segmentation and stop word removal, that is, the h-th word in the g-th enterprise operation word bank;
In step 2, the emerging industry clustering keywords are expanded according to the enterprise operation word bank as follows:
for each emerging industry clustering keyword, the 3 words with the highest cosine similarity are searched for:
[formula image BDA0002893513890000101]
wherein $cossim(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $w2v(\cdot)$ denotes the function that converts a word into a word vector, $x_{g,h}$ denotes the h-th word in the g-th enterprise operation word bank, and $D_{p,q}[k]$ denotes the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry;
the $L$ words in $F$ with the highest similarity to $D_{p,q}[k]$ are added to the emerging industry clustering keywords to obtain the emerging industry keyword word bank, recorded as:
$A_{p,q} = [D_{p,q}[1], D_{p,q}[1]_1, D_{p,q}[1]_2, \dots, D_{p,q}[1]_L, \dots, D_{p,q}[k], D_{p,q}[k]_1, D_{p,q}[k]_2, \dots, D_{p,q}[k]_L]$
wherein $A_{p,q}$ denotes the q-th auxiliary keyword array of the p-th emerging industry, $D_{p,q}[k]$ denotes the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, $D_{p,q}[k]_l$ denotes the word in $F$ with the l-th highest similarity to $D_{p,q}[k]$, and $L$ denotes the number of top-ranked words taken in order of similarity.
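The keyword expansion in step 2 can be sketched as below. The sketch assumes word vectors are available in a dictionary and that a keyword is never added as its own neighbour; the patent does not state how that case is handled, and the names `cosine` and `expand_cluster` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_cluster(cluster, lexicon_words, vectors, top_l=3):
    """Build an auxiliary keyword array: each keyword followed by its top_l
    most similar words from the enterprise operation word bank."""
    expanded = []
    for keyword in cluster:
        expanded.append(keyword)
        # Assumption: skip the keyword itself and words without a vector.
        candidates = [w for w in lexicon_words if w != keyword and w in vectors]
        candidates.sort(key=lambda w: cosine(vectors[keyword], vectors[w]), reverse=True)
        expanded.extend(candidates[:top_l])
    return expanded
```

The output list has the same layout as $A_{p,q}$: every clustering keyword is immediately followed by its nearest words from $F$. A linear scan over the whole word bank is used only for clarity; a large vocabulary would call for an indexed nearest-neighbour search.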
Step 3: obtaining the inverse document frequency weights of words according to the enterprise operation word bank;
In step 3, the inverse document frequency weights of words are obtained according to the enterprise operation word bank as follows:
the inverse document frequency of every word is calculated from the distribution of words in the enterprise operation word bank and recorded as:
[formula image BDA0002893513890000102]
$1 \le g \le G, \quad 1 \le h \le G_g$
wherein $idf_{won}(x_{g,h})$ is the inverse document frequency of the h-th word in the g-th enterprise operation word bank, $R$ is the total number of business scopes, $Num(x_{g,h})$ is the number of business scopes that contain the h-th word of the g-th enterprise operation word bank, $G$ is the total number of enterprise operation word banks, and $G_g$ is the total number of words in the g-th enterprise operation word bank;
the normalized inverse document frequency is obtained from the inverse document frequency with a normalization algorithm and recorded as:
[formula image BDA0002893513890000111]
wherein $idf_{norm}(x_{g,h})$ is the normalized inverse document frequency of the h-th word in the g-th enterprise operation word bank, $idf_{won}(x_{g,h})$ is the inverse document frequency of that word, $idf_{won}^{min}$ is the minimum inverse document frequency over all enterprise operation word banks, and $idf_{won}^{max}$ is the maximum inverse document frequency over all enterprise operation word banks;
the inverse document frequency weight is obtained from the normalized inverse document frequency and recorded as:
[formula image BDA0002893513890000112]
wherein $idf(\cdot)$ is the function that calculates the inverse document frequency weight, $word$ is any word, and $F$ is the enterprise operation word bank;
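Because the formula images for step 3 are not reproduced, the following is only a sketch under explicit assumptions: the raw inverse document frequency is taken to be the common form $\log(R / Num(x))$, and the normalization is taken to be min-max scaling between the smallest and largest raw values, which matches the symbols defined above but may differ from the patented formulas. The weight assigned to words outside $F$ is not covered.

```python
import math


def idf_weights(business_scopes):
    """Min-max normalised IDF per word.

    business_scopes: list of token lists, one per enterprise (the role of F).
    ASSUMPTION: raw idf = log(R / Num(word)); the patent's exact formula
    is in an image that is not available here.
    """
    total = len(business_scopes)  # R, the number of business scopes
    doc_count = {}
    for scope in business_scopes:
        for word in set(scope):  # count each word once per scope
            doc_count[word] = doc_count.get(word, 0) + 1

    raw = {word: math.log(total / n) for word, n in doc_count.items()}
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    # If every word has the same raw idf, fall back to a uniform weight of 1.0.
    return {word: (value - lo) / span if span else 1.0 for word, value in raw.items()}
```

Under these assumptions a word that appears in every business scope receives weight 0 and the rarest words receive weight 1, so generic terms contribute little to the later similarity scores.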
Step 4: obtaining a basic evaluation score according to the business scope of the enterprise to be classified and the emerging industry keyword word bank; obtaining a comprehensive evaluation score according to the basic evaluation score; obtaining an enterprise classification score according to the comprehensive evaluation score; obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score;
In step 4, the basic evaluation score is obtained from the business scope of the enterprise to be classified and the emerging industry keyword word bank as follows:
the enterprise to be classified is denoted:
$C_e$
wherein $C_e$ denotes the e-th enterprise to be classified;
the business scope of the enterprise to be classified is denoted:
$Scope_e$
wherein $Scope_e$ denotes the business scope of the e-th enterprise to be classified;
$Scope_e$ is segmented into words and stop words are removed to obtain the business scope word segmentation, recorded as:
$query_e = [y_{e,1}, y_{e,2}, \dots, y_{e,r}]$
wherein $query_e$ denotes the business scope word segmentation of the e-th enterprise to be classified, and $query_e[r] = y_{e,r}$ denotes its r-th word;
the cosine similarity between the business scope word segmentation of the enterprise to be classified and the emerging industry keyword word bank is recorded as:
[formula image BDA0002893513890000121]
wherein $cossim(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $w2v(\cdot)$ denotes the function that converts a word into a word vector, $query_e[r]$ denotes the r-th word of the business scope word segmentation of the e-th enterprise to be classified, and $A_{p,q}[t]$ denotes the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the word similarity is calculated from the cosine similarity and recorded as:
$sim(query_e[r], A_{p,q}[t]) = cossim(w2v(query_e[r]), w2v(A_{p,q}[t]))$
wherein $sim(\cdot,\cdot)$ denotes the function that calculates word similarity, $cossim(\cdot,\cdot)$ denotes the function that calculates cosine similarity, $query_e[r]$ denotes the r-th word of the business scope word segmentation of the e-th enterprise to be classified, and $A_{p,q}[t]$ denotes the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the basic evaluation score is calculated from the word similarity and recorded as:
[formula image BDA0002893513890000122]
wherein $base(query_e, A_{p,q})$ denotes the basic evaluation score between the business scope of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, $query_e$ denotes the business scope word segmentation of the e-th enterprise, $query_e[i]$ denotes the i-th word of that word segmentation, $A_{p,q}$ denotes the q-th auxiliary keyword array of the p-th emerging industry, $A_{p,q}[j]$ denotes the j-th word in that array, $idf(\cdot)$ is the function that calculates the idf weight, $n$ denotes the total number of words in the business scope word segmentation of the e-th enterprise, and $m$ denotes the total number of words in the q-th auxiliary keyword array of the p-th emerging industry;
and 4, obtaining a comprehensive evaluation score according to the basic evaluation score:
introducing word part-of-speech weight according to the basic evaluation score, calculating a comprehensive evaluation score, and recording as:
Figure BDA0002893513890000123
wherein score(query_e, A_{p,q}) represents the comprehensive evaluation score between the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, base(query_e, A_{p,q}) represents the basic evaluation score between the operation range word segmentation of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, query_e represents the operation range word segmentation of the e-th enterprise to be classified, A_{p,q} represents the q-th auxiliary keyword array of the p-th emerging industry, query_n_e is the array composed of the nouns in query_e, n_n is the length of query_n_e, query_v_e is the array composed of the verbs in query_e, n_v is the length of query_v_e, and c is a weight parameter;
Figure BDA0002893513890000131 is the array composed of the nouns in A_{p,q}, and m_n is the length of Figure BDA0002893513890000132;
Figure BDA0002893513890000133 is the array composed of the verbs in A_{p,q}, and m_v is the length of Figure BDA0002893513890000134;
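The noun and verb arrays used by the comprehensive evaluation score can be built as in the following sketch. It assumes the words have already been tagged by some part-of-speech tagger and that noun tags start with "n" and verb tags start with "v"; both the tag convention and the sample data are hypothetical. How the score formula combines these arrays with the weight c is given in a figure that is not reproduced here, so that part is deliberately left out.

```python
from typing import List, Tuple


def split_by_pos(tagged: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (word, tag) pairs into a noun array and a verb array."""
    nouns = [word for word, tag in tagged if tag.startswith("n")]
    verbs = [word for word, tag in tagged if tag.startswith("v")]
    return nouns, verbs


# Hypothetical tagged operation range of one enterprise.
tagged_query = [
    ("software", "n"),
    ("develop", "v"),
    ("sales", "n"),
    ("and", "c"),
]

query_n, query_v = split_by_pos(tagged_query)
n_n = len(query_n)  # length of the noun array
n_v = len(query_v)  # length of the verb array
```

The same function can be applied to an auxiliary keyword array to obtain its noun and verb arrays and their lengths m_n and m_v.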
In step 4, the enterprise classification score is obtained according to the comprehensive evaluation score and recorded as:
Figure BDA0002893513890000135
wherein classify(C_e, Ind_p) is the classification score between the e-th enterprise to be classified and the p-th emerging industry, score(query_e, A_{p,i}) represents the comprehensive evaluation score between the operation range word segmentation of the e-th enterprise to be classified and the i-th auxiliary keyword array of the p-th emerging industry, and Q_p is the total number of auxiliary keyword arrays of the p-th emerging industry;
In step 4, the classification result of the emerging industry to which the enterprise belongs is obtained according to the enterprise classification score and recorded as:
Ind_T = argmax_{p∈[1,M]} classify(C_e, Ind_p)
wherein Ind_T is the emerging industry that maximizes the enterprise classification score of the e-th enterprise to be classified, and classify(C_e, Ind_p) is the classification score between the e-th enterprise to be classified and the p-th emerging industry;
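The final selection is an argmax over the per-industry classification scores, which can be sketched as below. The industry names and score values are hypothetical; the scores are assumed to have been computed already by the preceding steps.

```python
from typing import Dict


def classify_enterprise(scores: Dict[str, float]) -> str:
    """Return the emerging industry with the highest classification score."""
    if not scores:
        raise ValueError("at least one emerging industry score is required")
    # On ties, max() keeps the first industry in insertion order.
    return max(scores, key=lambda industry: scores[industry])


# Hypothetical classification scores for one enterprise to be classified.
example_scores = {
    "artificial intelligence": 0.62,
    "new energy vehicles": 0.35,
    "biomedicine": 0.18,
}
predicted = classify_enterprise(example_scores)
```

Because only the relative order of the scores matters, the method needs no decision threshold for this step.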
Step 5: for an industry that already exists in the emerging industry keyword word bank, the existing word bank is expanded by supplementing keywords; this approach is also effective for classifying the traditional industries to which enterprises belong. For an emerging industry that does not yet exist in the emerging industry keyword word bank, a new industry category is created and an Internet crawler is used to retrieve the information from which the corresponding emerging industry keyword word bank is built, thereby achieving dynamic expansion of the emerging industries.
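The two branches of step 5 can be sketched as a single update function. This is a simplified illustration: the word bank is modelled as one flat keyword list per industry rather than as several auxiliary keyword arrays, and build_keywords is a hypothetical callable standing in for the crawl, TextRank and clustering pipeline of steps 1 and 2.

```python
from typing import Callable, Dict, Iterable, List, Optional

Lexicon = Dict[str, List[str]]


def update_lexicon(
    lexicon: Lexicon,
    industry: str,
    new_keywords: Iterable[str] = (),
    build_keywords: Optional[Callable[[str], Iterable[str]]] = None,
) -> Lexicon:
    """Supplement an existing industry or create the entry for a new one."""
    if industry in lexicon:
        existing = lexicon[industry]
        for keyword in new_keywords:
            if keyword not in existing:  # supplement without duplicates
                existing.append(keyword)
    else:
        if build_keywords is None:
            raise ValueError("a keyword builder is required for a new industry")
        lexicon[industry] = list(build_keywords(industry))
    return lexicon


# Hypothetical data for illustration.
lexicon: Lexicon = {"artificial intelligence": ["machine learning"]}
update_lexicon(
    lexicon, "artificial intelligence", ["deep learning", "machine learning"]
)
update_lexicon(
    lexicon, "blockchain", build_keywords=lambda name: ["distributed ledger"]
)
```

Existing entries are never rebuilt, so adding a new industry leaves the already classified industries unchanged.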
Finally, to illustrate the experimental effect of the invention, the invention was compared with other methods; the experimental results, shown in Fig. 2, demonstrate its feasibility and accuracy. An example of the classification result is shown in Fig. 3.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A word vector-based method for classifying the emerging industries to which enterprises belong, characterized by comprising the following steps:
step 1: acquiring the input emerging industries, and acquiring related information on the Internet according to the names of the emerging industries; obtaining candidate keywords of the emerging industries by using the TextRank algorithm according to the related information of the emerging industries; clustering by using the K-means algorithm according to the candidate keywords to obtain emerging industry clustering keywords;
step 2: acquiring enterprise operation ranges from an official website, and obtaining an enterprise operation word bank according to the enterprise operation ranges; expanding the emerging industry clustering keywords according to the enterprise operation word bank to obtain an emerging industry keyword word bank;
step 3: obtaining the inverse document frequency weights of words according to the enterprise operation word bank;
step 4: obtaining a basic evaluation score according to the operation range of the enterprise to be classified and the emerging industry keyword word bank; obtaining a comprehensive evaluation score according to the basic evaluation score; obtaining an enterprise classification score according to the comprehensive evaluation score; and obtaining the classification result of the emerging industry to which the enterprise belongs according to the enterprise classification score.
2. The method of claim 1, wherein the method comprises the following steps:
the emerging industries in step 1 are:
Ind_p, p∈[1,M]
wherein Ind_p is the name of the p-th emerging industry, and M represents the number of emerging industries;
in step 1, the related information is acquired on the Internet according to the names of the emerging industries as follows:
Ind_p is used as a keyword, and crawler technology is used to automatically retrieve the related information of Ind_p on the Internet, which is recorded as Figure FDA0002893513880000011, the related information of the p-th emerging industry;
in step 1, the candidate keywords of the emerging industries are obtained by using the TextRank algorithm according to the related information of the emerging industries as follows:
keywords are extracted from Figure FDA0002893513880000012 by using the TextRank algorithm to obtain the candidate keywords of Ind_p, recorded as:
key_p = [w_{p,1}, w_{p,2}, …, w_{p,D}]
wherein key_p represents the candidate keywords of the p-th emerging industry, w_{p,d} represents the d-th candidate keyword of the p-th emerging industry, d∈[1,D], and D represents the number of candidate keywords;
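A simplified TextRank keyword extraction is sketched below to make the step concrete. It assumes the retrieved text has already been segmented into tokens and filtered; the window size, damping factor and iteration count are illustrative defaults and are not values taken from this document.

```python
from collections import defaultdict
from typing import Dict, List, Set


def textrank_keywords(
    tokens: List[str],
    top_d: int,
    window: int = 2,
    damping: float = 0.85,
    iterations: int = 50,
) -> List[str]:
    """Rank words on an undirected co-occurrence graph and return the top D."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    rank = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_rank = {}
        for word in neighbors:
            incoming = sum(rank[u] / len(neighbors[u]) for u in neighbors[word])
            new_rank[word] = (1.0 - damping) + damping * incoming
        rank = new_rank
    # Highest rank first; ties are broken alphabetically for determinism.
    ordered = sorted(rank, key=lambda w: (-rank[w], w))
    return ordered[:top_d]


# Hypothetical pre-segmented text about one emerging industry.
tokens = ["chip", "design", "chip", "manufacturing", "chip", "testing", "service"]
key_p = textrank_keywords(tokens, top_d=2)
```

Words that co-occur with many different neighbours, such as "chip" in this toy input, receive the highest rank and become candidate keywords.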
in step 1, clustering is performed by using the K-means algorithm according to the candidate keywords to obtain the emerging industry clustering keywords as follows:
all words in key_p are mapped into a multi-dimensional word vector space by using word2vec:
w2v(key_p) = [w2v(w_{p,1}), w2v(w_{p,2}), …, w2v(w_{p,D})]
wherein w2v(·) represents the function that converts a word into a word vector, key_p represents the candidate keywords of the p-th emerging industry, and w_{p,d} represents the d-th candidate keyword of the p-th emerging industry;
w2v(key_p) is clustered by using K-means to obtain the emerging industry clustering keywords, wherein the number of clusters is:
Figure FDA0002893513880000021
wherein K_p represents the number of clusters of the p-th emerging industry, Figure FDA0002893513880000022 denotes the round-down (floor) symbol, and Len(key_p) represents the total number of candidate keywords of the p-th emerging industry;
the emerging industry clustering keywords are:
D_{p,q}[k], p∈[1,M], q∈[1,K_p], k∈[1,L_{p,q}]
wherein D_{p,q}[k] represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, M represents the number of emerging industries, K_p represents the total number of clusters of the p-th emerging industry, and L_{p,q} represents the total number of keywords in the q-th clustering result of the p-th emerging industry.
3. The method of claim 1, wherein the method comprises the following steps:
in step 2, the enterprise operation ranges are acquired from the official website and the enterprise operation word bank is obtained according to the enterprise operation ranges as follows:
the enterprise operation ranges in step 2 are recorded as S_g, 1≤g≤N,
wherein S_g represents the operation range information of the g-th enterprise, and N represents the total number of enterprises;
the enterprise operation word bank in step 2 is obtained by performing stop word removal and word segmentation on the enterprise operation ranges, and is recorded as:
F = [Split(S_1), Split(S_2), …, Split(S_N)]
Split(S_g) = [x_{g,1}, x_{g,2}, …, x_{g,h}]
wherein F represents the enterprise operation word bank, Split(·) represents the function that performs stop word removal and word segmentation, and x_{g,h} represents the h-th word obtained after stop word removal and word segmentation of the operation range of the g-th enterprise, that is, the h-th word in the operation word bank of the g-th enterprise;
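The construction of the enterprise operation word bank can be sketched as follows. Real operation ranges are Chinese text and would need a proper word segmenter; to keep the example self-contained, it uses whitespace-separated English text and a small hypothetical stop word list, so the names STOP_WORDS, split and build_word_bank are illustrative only.

```python
from typing import Iterable, List, Set

# Hypothetical stop word list for the toy English input.
STOP_WORDS: Set[str] = {"and", "of", "the", "other"}


def split(scope: str, stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Segment one operation range into words and drop stop words."""
    tokens = scope.lower().replace(",", " ").replace(";", " ").split()
    return [token for token in tokens if token not in stop_words]


def build_word_bank(scopes: Iterable[str]) -> List[List[str]]:
    """Build F = [Split(S_1), ..., Split(S_N)] from the operation ranges."""
    return [split(scope) for scope in scopes]


# Hypothetical operation ranges of two enterprises.
F = build_word_bank(["Chip design; chip testing", "Catering services"])
```

Each inner list of F corresponds to Split(S_g) for one enterprise, and repeated words are kept because the later frequency statistics are computed over these lists.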
in step 2, the emerging industry clustering keywords are expanded according to the enterprise operation word bank as follows:
for each emerging industry clustering keyword, the 3 words with the highest cosine similarity are searched for in the enterprise operation word bank:
Figure FDA0002893513880000031
wherein cossim(·,·) represents the function for calculating cosine similarity, w2v(·) represents the function that converts a word into a word vector, x_{g,h} represents the h-th word in the operation word bank of the g-th enterprise, and D_{p,q}[k] represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry;
the L words in F with the highest similarity to D_{p,q}[k] are supplemented to the emerging industry clustering keywords to obtain the emerging industry keyword word bank, recorded as:
A_{p,q} = [D_{p,q}[1], D_{p,q}[1]_1, D_{p,q}[1]_2, …, D_{p,q}[1]_L, …, D_{p,q}[k], D_{p,q}[k]_1, D_{p,q}[k]_2, …, D_{p,q}[k]_L]
wherein A_{p,q} represents the q-th auxiliary keyword array of the p-th emerging industry, D_{p,q}[k] represents the k-th keyword in the keyword array formed by the q-th clustering result of the p-th emerging industry, D_{p,q}[k]_l denotes the word in F with the l-th highest similarity to D_{p,q}[k], and L represents the number of top-ranked words taken in order of similarity.
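The expansion of one keyword array can be sketched as a top-L nearest-neighbour search by cosine similarity. The vectors are hypothetical stand-ins for word2vec embeddings, and excluding the keyword itself from its own candidate list is an assumption made here, not something stated in the text.

```python
import math
from typing import Dict, List, Sequence


def cossim(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_cluster(
    cluster: List[str],
    vocabulary: List[str],
    vectors: Dict[str, Sequence[float]],
    top_l: int,
) -> List[str]:
    """Follow each keyword with its top_l most similar vocabulary words."""
    expanded: List[str] = []
    for keyword in cluster:
        expanded.append(keyword)
        candidates = [word for word in vocabulary if word != keyword]
        candidates.sort(
            key=lambda word: cossim(vectors[keyword], vectors[word]),
            reverse=True,
        )
        expanded.extend(candidates[:top_l])
    return expanded


# Hypothetical toy vectors; the vocabulary stands for the distinct words of F.
vectors = {
    "chip": [1.0, 0.0],
    "wafer": [0.9, 0.1],
    "semiconductor": [0.8, 0.3],
    "restaurant": [0.0, 1.0],
}
A_pq = expand_cluster(["chip"], ["restaurant", "semiconductor", "wafer"], vectors, top_l=2)
```

The output keeps the layout of the auxiliary keyword array described above: each original keyword is immediately followed by its L supplements.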
4. The method of claim 1, wherein the method comprises the following steps:
in step 3, the inverse document frequency weights of words are obtained according to the enterprise operation word bank as follows:
the inverse document frequency of every word is calculated according to the distribution of the words in the enterprise operation word bank, and is recorded as:
Figure FDA0002893513880000041
wherein idf_won(x_{g,h}) is the inverse document frequency of the h-th word in the operation word bank of the g-th enterprise, R is the total number of operation ranges, Num(x_{g,h}) represents the total number of operation ranges that contain the h-th word of the operation word bank of the g-th enterprise, G is the total number of enterprise operation word banks, and G_g is the total number of words in the operation word bank of the g-th enterprise;
the normalized inverse document frequency is obtained from the inverse document frequency by using a normalization algorithm, and is recorded as:
Figure FDA0002893513880000042
wherein idf_norm(x_{g,h}) is the normalized inverse document frequency of the h-th word in the operation word bank of the g-th enterprise, idf_won(x_{g,h}) is the inverse document frequency of the h-th word in the operation word bank of the g-th enterprise, idf_won^min is the minimum inverse document frequency over all enterprise operation word banks, and idf_won^max is the maximum inverse document frequency over all enterprise operation word banks;
the inverse document frequency weight is obtained according to the normalized inverse document frequency, and is recorded as:
Figure FDA0002893513880000043
wherein idf(·) is the function for calculating the inverse document frequency weight, word is any word, and F is the enterprise operation word bank.
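The three formulas of this step are images that are not reproduced in this text, so the sketch below relies on common textbook forms and should be read as an assumption: the inverse document frequency is taken as log(R / Num(x)), the normalization as min-max scaling, and words absent from the word bank receive a hypothetical default weight of 1.0.

```python
import math
from typing import Callable, Dict, List


def inverse_document_frequency(word_bank: List[List[str]]) -> Dict[str, float]:
    """Assumed form: idf(x) = log(R / Num(x)) over the operation ranges."""
    total = len(word_bank)  # R: total number of operation ranges
    containing: Dict[str, int] = {}
    for words in word_bank:
        for word in set(words):  # count each operation range once per word
            containing[word] = containing.get(word, 0) + 1
    return {word: math.log(total / num) for word, num in containing.items()}


def min_max_normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Scale values linearly to [0, 1]; all zeros if they are identical."""
    if not values:
        return {}
    low, high = min(values.values()), max(values.values())
    if high == low:
        return {word: 0.0 for word in values}
    return {word: (value - low) / (high - low) for word, value in values.items()}


def make_idf(word_bank: List[List[str]], default: float = 1.0) -> Callable[[str], float]:
    """Build idf(word); the default for unseen words is a hypothetical choice."""
    normalized = min_max_normalize(inverse_document_frequency(word_bank))
    return lambda word: normalized.get(word, default)


# Hypothetical enterprise operation word bank with three enterprises.
F = [["chip", "design"], ["chip", "sales"], ["chip", "design", "catering"]]
idf = make_idf(F)
```

A word that appears in every operation range, such as "chip" here, gets weight 0, while words confined to a single enterprise get the maximum weight.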
5. The method of claim 1, wherein the method comprises the following steps:
in step 4, the basic evaluation score is obtained according to the operation range of the enterprise to be classified and the emerging industry keyword word bank as follows:
the enterprise to be classified is recorded as C_e,
wherein C_e represents the e-th enterprise to be classified;
the operation range of the enterprise to be classified is recorded as Scope_e,
wherein Scope_e represents the operation range of the e-th enterprise to be classified;
word segmentation and stop word removal are performed on Scope_e to obtain the enterprise operation range word segmentation, recorded as:
query_e = [y_{e,1}, y_{e,2}, …, y_{e,r}]
wherein query_e represents the operation range word segmentation of the e-th enterprise to be classified, and query_e[r] = y_{e,r} represents the r-th word of that word segmentation;
the cosine similarity is obtained according to the operation range word segmentation of the enterprise to be classified and the emerging industry keyword word bank, and is recorded as:
Figure FDA0002893513880000051
wherein cossim(·,·) represents the function for calculating cosine similarity, w2v(·) represents the function that converts a word into a word vector, query_e[r] represents the r-th word of the operation range word segmentation of the e-th enterprise to be classified, and A_{p,q}[t] represents the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
the word similarity is calculated according to the cosine similarity and recorded as:
sim(query_e[r], A_{p,q}[t]) = cossim(w2v(query_e[r]), w2v(A_{p,q}[t]))
wherein sim(·,·) represents the function for calculating word similarity, cossim(·,·) represents the function for calculating cosine similarity, query_e[r] represents the r-th word of the operation range word segmentation of the e-th enterprise to be classified, and A_{p,q}[t] represents the t-th word in the q-th auxiliary keyword array of the p-th emerging industry;
a basic evaluation score is calculated according to the word similarity and recorded as:
Figure FDA0002893513880000052
wherein base(query_e, A_{p,q}) represents the basic evaluation score between the operation range word segmentation of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, query_e represents the operation range word segmentation of the e-th enterprise to be classified, query_e[i] represents the i-th word of that word segmentation, A_{p,q} represents the q-th auxiliary keyword array of the p-th emerging industry, A_{p,q}[t] represents the t-th word in that array, idf(·) is the function for calculating the idf weight, n represents the total number of words in the operation range word segmentation of the e-th enterprise, and m represents the total number of words in the q-th auxiliary keyword array of the p-th emerging industry;
in step 4, the comprehensive evaluation score is obtained according to the basic evaluation score as follows:
part-of-speech weights are introduced on the basis of the basic evaluation score, and a comprehensive evaluation score is calculated and recorded as:
Figure FDA0002893513880000061
wherein score(query_e, A_{p,q}) represents the comprehensive evaluation score between the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, base(query_e, A_{p,q}) represents the basic evaluation score between the operation range word segmentation of the e-th enterprise to be classified and the q-th auxiliary keyword array of the p-th emerging industry, query_e represents the operation range word segmentation of the e-th enterprise to be classified, A_{p,q} represents the q-th auxiliary keyword array of the p-th emerging industry, query_n_e is the array composed of the nouns in query_e, n_n is the length of query_n_e, query_v_e is the array composed of the verbs in query_e, n_v is the length of query_v_e, and c is a weight parameter;
Figure FDA0002893513880000062 is the array composed of the nouns in A_{p,q}, and m_n is the length of Figure FDA0002893513880000063;
Figure FDA0002893513880000064 is the array composed of the verbs in A_{p,q}, and m_v is the length of Figure FDA0002893513880000065;
in step 4, the enterprise classification score is obtained according to the comprehensive evaluation score and recorded as:
Figure FDA0002893513880000066
wherein classify(C_e, Ind_p) is the classification score between the e-th enterprise to be classified and the p-th emerging industry, score(query_e, A_{p,i}) represents the comprehensive evaluation score between the operation range word segmentation of the e-th enterprise to be classified and the i-th auxiliary keyword array of the p-th emerging industry, and Q_p is the total number of auxiliary keyword arrays of the p-th emerging industry;
in step 4, the classification result of the emerging industry to which the enterprise belongs is obtained according to the enterprise classification score and recorded as:
Ind_T = argmax_{p∈[1,M]} classify(C_e, Ind_p)
wherein Ind_T is the emerging industry that maximizes the enterprise classification score of the e-th enterprise to be classified, and classify(C_e, Ind_p) is the classification score between the e-th enterprise to be classified and the p-th emerging industry.
CN202110034145.3A 2021-01-12 2021-01-12 Word vector-based classification method for emerging industries of enterprises Active CN112686043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110034145.3A CN112686043B (en) 2021-01-12 2021-01-12 Word vector-based classification method for emerging industries of enterprises

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110034145.3A CN112686043B (en) 2021-01-12 2021-01-12 Word vector-based classification method for emerging industries of enterprises

Publications (2)

Publication Number Publication Date
CN112686043A true CN112686043A (en) 2021-04-20
CN112686043B CN112686043B (en) 2024-02-06

Family

ID=75457447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110034145.3A Active CN112686043B (en) 2021-01-12 2021-01-12 Word vector-based classification method for emerging industries of enterprises

Country Status (1)

Country Link
CN (1) CN112686043B (en)

Also Published As

Publication number Publication date
CN112686043B (en) 2024-02-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant