CN112580332A - Enterprise portrait method based on label layering and deepening modeling - Google Patents

Enterprise portrait method based on label layering and deepening modeling

Info

Publication number
CN112580332A
Authority
CN
China
Prior art keywords
enterprise
deepening
label
len
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011303829.0A
Other languages
Chinese (zh)
Other versions
CN112580332B (en)
Inventor
李翔
丁行硕
王媛媛
朱全银
高尚兵
王留洋
马甲林
张柯文
成洁怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yami Technology Guangzhou Co ltd
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202011303829.0A priority Critical patent/CN112580332B/en
Publication of CN112580332A publication Critical patent/CN112580332A/en
Application granted granted Critical
Publication of CN112580332B publication Critical patent/CN112580332B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an enterprise portrait method based on label layered deepening modeling. First, the fuzzy labels of enterprises are counted and screened: labels that cannot completely summarize enterprise characteristics, such as wholesale industry and retail industry, are screened out, and a Bert model classifies and deepens the screened labels according to the enterprise operation range and the enterprise labels. Next, the enterprise name, enterprise introduction and operation range information are integrated, the features are expanded based on a pre-established enterprise lexicon, keywords are extracted from the integrated information using TextRank, TF-IDF and an LDA topic model respectively, and the processed keywords serve as deeper enterprise extension labels. Finally, the modeling method is applied to an enterprise portrait system to improve how accurately the labels summarize an enterprise. The method is generally applicable to label deepening modeling and label extraction problems, fully considers the hierarchical relationship of label deepening, and can effectively improve the accuracy of the labels and of the enterprise portrait system.

Description

Enterprise portrait method based on label layering and deepening modeling
Technical Field
The invention belongs to the technical field of enterprise portrait and natural language processing, and particularly relates to an enterprise portrait method based on label layered deepening modeling.
Background
The layered extension of labels in the invention is of important function and significance for portrait technology. When facing the portrait label problem, researchers usually choose classification matching, but that approach has obvious defects: it neglects the shallow-to-deep hierarchical relationship of labels, the labels cannot accurately summarize enterprise characteristics, and no further deepening modeling can be performed on the labels. Combining neural networks with natural language processing can address the label deepening modeling problem well, thereby improving the accuracy of the labels and of the portrait system.
The existing research bases of Li Xiang, Zhu Quanyin and others include: X. Li, Z. Wang, S. Gao, R. Hu, Q. Zhu and L. Wang, "An Intelligent Context-Aware Management Framework for Cold Chain Logistics Distribution," in IEEE Transactions on Intelligent Transportation Systems, doi: 10.1109/TITS.2018.2889069; X. Li, Z. Wang, L. Wang, R. Hu and Q. Zhu, "A Multi-Dimensional Context-Aware Recommendation Approach Based on Improved Random Forest Algorithm," in IEEE Access, vol. 6, pp. 45071-45085, 2018, doi: 10.1109/ACCESS.2018.2865436; Li, X., Wang, Z., Hu, R. et al., Recommendation algorithm based on improved spectral clustering and transfer learning, Pattern Anal Applic 22, 633-647 (2019); Li Xiang, Zhu Quanyin, Collaborative filtering recommendation based on co-clustering and rating-matrix sharing [J], Journal of Frontiers of Computer Science and Technology, 2014, 8(6): 751-; and the patents applied for, published and granted to Li Xiang, Zhu Quanyin and others: invention patent No. ZL 2017100546758.1, 2020.02.07; Zhu Quanyin, Li Xiang, Hu Rongling et al., an incremental-learning multi-level binary classification method for science and technology news, invention patent No. ZL 201510642902.X, 2018.08.10; Zhu Quanyin, Shao Wujie, Li Xiang et al., a multi-layer multi-classification method for science and technology news titles, invention patent No. ZL 201610114278.0, 2019.04.19; invention patent No. ZL201210325368.6, 2016.06.08; invention patent No. ZL 201610565749.X, 2019.06.11.
Enterprise portrait:
The enterprise portrait is a product of the big data era and derives from the user portrait: a labeled enterprise model is extracted from the basic information of an enterprise, and the enterprise information is displayed comprehensively in chart form. Enterprise portrait labels are built from the most basic statistical labels and from the rule-based labels generated by enterprise user behavior; data mining is then used to predict certain attributes of the enterprise and to uncover potentially valuable information, and together these labels form the enterprise portrait label system. An enterprise portrait can vividly show the comprehensive strength of an enterprise, and the portrait information can serve as an important basis when enterprises cooperate on projects. It can also reduce competition among enterprises and help them pursue advantages and avoid harm. For governments, knowing the enterprise information facilitates enterprise supervision.
Yang Lingyun, Yang Wenfeng, a method and system for providing an enterprise portrait: CN111666377A, 2020.06.03. That invention collects identification information of enterprises and analyzes and processes label data to establish an enterprise portrait; although it provides a construction method for the enterprise portrait, it carries out no deeper research on the labels. CN108572967A, 2018.09.25, provides a method and system for creating an enterprise portrait, which acquires enterprise portrait data, classifies it, and then matches the classified data with enterprise information; although that invention can divide enterprise labels, the summarizing ability of the classified labels is limited and the characteristics of enterprises cannot be accurately described. A method for establishing an enterprise portrait based on a regression model: CN105512245A, 2016.4.20. That invention establishes an enterprise portrait based on a regression model and makes full use of the latent semantic information of the text to make up for the deficiencies of the traditional enterprise portrait, but it does not consider the shallow-to-deep progressive relationship of the labels and only expands and extracts feature words.
The above inventions are effective in their respective fields, but traditional enterprise portraits have the following problems: 1. the label definitions of traditional enterprise portraits are fuzzy and cannot fully describe the characteristics of the enterprise, which reduces label accuracy; 2. traditional enterprise portraits do not perform shallow-to-deep deepening modeling on the labels, so keywords better suited to the enterprise characteristics cannot be extracted. To address these problems, the invention provides an enterprise portrait method and an enterprise portrait system based on label layered deepening modeling. The method first counts and screens the fuzzy labels of enterprises, screens out the labels that cannot completely summarize enterprise characteristics, and classifies and deepens the screened labels with a Bert model according to the enterprise operation range and the counted labels; it then integrates the information, expands the features based on a pre-established enterprise lexicon, and extracts keywords with several algorithms to serve as deeper enterprise extension labels. The method is generally applicable to label deepening modeling and label extraction problems, fully considers the hierarchical relationship of label deepening, and can effectively improve the accuracy of the labels and of the enterprise portrait system.
Disclosure of Invention
Purpose of the invention: to address the problems in the prior art, the invention provides an enterprise portrait method based on label layered deepening modeling, which can accurately depict the characteristics of an enterprise, make up for the deficiencies of the traditional enterprise portrait, and improve efficiency in practical application.
The technical scheme is as follows: in order to solve the technical problem, the invention provides an enterprise portrait method based on label layered deepening modeling, which comprises the following specific steps:
(1) deduplicating and removing null entries from the enterprise label data set D and the enterprise multi-source data set D1, and cleaning them to obtain enterprise data sets D2 and D3;
(2) counting and screening a data set D2, screening a tag data set which cannot completely summarize the characteristics of the enterprise, defining the tag data set as D4, and counting all tag sets as a deepening basis;
(3) constructing a Bert model, taking a data set D4 as an input of the model, and after semantic learning, performing classification and deepening of a first layer of labels by using a softmax layer;
(4) integrating the enterprise name, enterprise introduction and operation range information in the D3 data set, respectively extracting keywords by using TextRank, TF-IDF and LDA topic models, then processing the extracted keywords, and taking the processed words as next-layer deep tags;
(5) based on the label deepening method, the method is applied to an enterprise portrait system, and the accuracy of the label and the enterprise portrait system is improved.
Further, the specific method for obtaining the enterprise data sets D2 and D3 in the step (1) is as follows:
(1.1) defining Text as a single multi-source information set to be cleaned, and defining id, content1, content2 and content3 as the enterprise serial number, enterprise name, enterprise introduction and enterprise operation range respectively, satisfying the relation
$\mathrm{Text} = \{id, content1, content2, content3\}$;
(1.2) defining Text1 as a single enterprise operation range information set to be cleaned, and defining id, content3 and label as the enterprise serial number, enterprise operation range and enterprise label respectively, satisfying the relation $\mathrm{Text1} = \{id, content3, label\}$;
(1.3) defining D as the data set to be cleaned for first-layer label deepening and D1 as the data set to be cleaned for next-layer label deepening, where $D = \{\mathrm{Text1}_1, \mathrm{Text1}_2, \dots, \mathrm{Text1}_a, \dots, \mathrm{Text1}_{\mathrm{len}(D)}\}$, $\mathrm{Text1}_a$ is the a-th enterprise label data to be cleaned in D, $D1 = \{\mathrm{Text}_1, \mathrm{Text}_2, \dots, \mathrm{Text}_{a1}, \dots, \mathrm{Text}_{\mathrm{len}(D1)}\}$, $\mathrm{Text}_{a1}$ is the a1-th enterprise multi-source data to be cleaned in D1, len(D) is the number of texts in D with $a \in [1, \mathrm{len}(D)]$, and len(D1) is the number of texts in D1 with $a1 \in [1, \mathrm{len}(D1)]$;
(1.4) after deduplicating the texts in data set D and removing null entries, obtaining the cleaned first-layer enterprise data set $D2 = \{T_1, T_2, \dots, T_b, \dots, T_{\mathrm{len}(D2)}\}$, where $T_b$ is the b-th enterprise label data to be processed in D2, len(D2) is the number of texts in D2, and $b \in [1, \mathrm{len}(D2)]$;
(1.5) after deduplicating the texts in data set D1 and removing null entries, obtaining the cleaned next-layer enterprise data set $D3 = \{T_1, T_2, \dots, T_{b1}, \dots, T_{\mathrm{len}(D3)}\}$, where $T_{b1}$ is the b1-th enterprise multi-source data to be processed in D3, len(D3) is the number of texts in D3, and $b1 \in [1, \mathrm{len}(D3)]$.
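As an illustration only, the cleaning in steps (1.1) to (1.5) might be sketched in Python as follows. The helper name `clean`, the sample records, and the choice to treat two records as duplicates only when all listed fields match are assumptions made for this sketch; the source does not specify them.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def clean(records: Iterable[Record], fields: Tuple[str, ...]) -> List[Record]:
    """Drop records with an empty field, then drop exact duplicates.

    Hypothetical helper: order of first occurrence is preserved.
    """
    seen = set()
    cleaned: List[Record] = []
    for rec in records:
        values = tuple((rec.get(f) or "").strip() for f in fields)
        if any(v == "" for v in values):
            continue  # null removal: a required field is missing or blank
        if values in seen:
            continue  # duplicate removal: identical record already kept
        seen.add(values)
        cleaned.append(dict(zip(fields, values)))
    return cleaned


# D holds label records of the form Text1 = {id, content3, label} (sample data).
D = [
    {"id": "1", "content3": "wholesale of building materials", "label": "wholesale industry"},
    {"id": "1", "content3": "wholesale of building materials", "label": "wholesale industry"},
    {"id": "2", "content3": "", "label": "retail industry"},
    {"id": "3", "content3": "software development", "label": "software industry"},
]
D2 = clean(D, ("id", "content3", "label"))
```

The same helper could be applied to D1 with the fields `("id", "content1", "content2", "content3")` to obtain D3.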
Further, the specific method in step (2) of screening out the label data set that cannot completely summarize enterprise characteristics, defining it as D4, and counting all label sets as the basis for deepening is as follows:
(2.1) screening the D2 data set to pick out the data whose labels, such as wholesale industry and retail industry, cannot completely summarize enterprise characteristics but can be deepened by other labels, and defining it as $D4 = \{T_1, T_2, \dots, T_c, \dots, T_{\mathrm{len}(D4)}\}$; $D5 = \{T_1, T_2, \dots, T_d, \dots, T_{\mathrm{len}(D5)}\}$ represents the remaining data, n is the number of label categories in D4, and list4 represents the label set of D4;
(2.2) counting the D5 data set and collecting all of its labels as the basis for deepening, where m is the number of label categories in D5 and list5 is the label set of D5;
(2.3) using the labels in list5 as the target labels for label classification deepening;
(2.4) using the first-layer data set D5 as the training set, and classifying and deepening the D4 data set according to the list5 label set.
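The screening in steps (2.1) to (2.4) can be sketched as a simple partition. This is a minimal sketch under stated assumptions: the set `VAGUE_LABELS` is hand-picked here for illustration (the source names only wholesale industry and retail industry as examples), and `split_by_label` and the sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]

# Assumed for illustration: labels considered too coarse to describe an enterprise.
VAGUE_LABELS: Set[str] = {"wholesale industry", "retail industry"}


def split_by_label(
    d2: List[Record], vague: Set[str]
) -> Tuple[List[Record], List[Record], List[str], List[str]]:
    """Partition D2 into D4 (vague labels) and D5 (the rest) and list their label sets."""
    d4 = [r for r in d2 if r["label"] in vague]
    d5 = [r for r in d2 if r["label"] not in vague]
    list4 = sorted({r["label"] for r in d4})
    list5 = sorted({r["label"] for r in d5})
    return d4, d5, list4, list5


D2 = [
    {"id": "1", "content3": "wholesale of chips", "label": "wholesale industry"},
    {"id": "3", "content3": "software development", "label": "software industry"},
    {"id": "4", "content3": "retail of sensors", "label": "retail industry"},
    {"id": "5", "content3": "chip design", "label": "semiconductor industry"},
    {"id": "6", "content3": "cloud software", "label": "software industry"},
]
D4, D5, list4, list5 = split_by_label(D2, VAGUE_LABELS)
n, m = len(list4), len(list5)  # label category counts of D4 and D5
label_counts = Counter(r["label"] for r in D5)  # per-label statistics of the training set
```

D5 then serves as the training set, and the entries of `list5` are the classes into which the records of D4 are reclassified.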
Further, the specific method for performing classification deepening on the first-layer labels by using the softmax layer in the step (3) is as follows:
(3.1) building a Bert model and training it with the D5 training set;
(3.2) processing the data set D4 so that each text content $T_c$ to be processed is fixed to a uniform length $L_{max}$;
(3.3) defining a loop variable i and assigning it the initial value 1;
(3.4) if $i \le \mathrm{len}(D4)$, jumping to step (3.5); otherwise, jumping to step (3.9);
(3.5) defining $\mathrm{len}(T_i)$ as the length of the i-th text; if $\mathrm{len}(T_i) + 2 \le L_{max}$, padding the text with 0 and jumping to the next step; otherwise truncating the text to its first $L_{max}$ units and jumping to the next step;
(3.6) $i = i + 1$;
(3.7) feeding each text into the Token Embeddings layer of the BERT model, with the output denoted V1, while the Segment Embeddings layer and the Position Embeddings layer extract the text information and position information, with the outputs denoted V2 and V3;
(3.8) adding the three outputs V1, V2 and V3 to obtain the result V, using the vector V as the input of the BERT model, and obtaining in the last layer of neurons the word vector sequence $s_i = \{V(W_1), V(W_2), \dots, V(W_f), \dots, V(W_{L_{max}})\}$, where $V(W_f)$ is the f-th vector representation combining the text information;
(3.9) ending the loop and outputting the word vector sequences $S = \{s_1, s_2, s_3, \dots, s_f, \dots, s_{\mathrm{len}(D3)}\}$;
(3.10) performing document classification prediction on the vector sequences with the softmax function to obtain the classification probability prediction vector $P = \{p_1, p_2, \dots, p_g, \dots, p_h\}$, where $p_g$ is the probability that the text belongs to the g-th class and h is the total number of classes;
(3.11) finding the maximum value in the vector P and outputting the result corresponding to it, namely the label classification deepening result y.
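Two parts of this procedure can be illustrated without a trained model: the length normalization of step (3.5) and the softmax decision of steps (3.10) and (3.11). The sketch below is illustrative only. It assumes that the "+2" in step (3.5) reserves room for the [CLS] and [SEP] tokens, uses 101 and 102 as placeholder ids for them, and truncates so that the total length including those two tokens equals L_max; none of these details are stated in the source.

```python
import math
from typing import List, Sequence, Tuple


def fix_length(ids: Sequence[int], l_max: int,
               cls_id: int = 101, sep_id: int = 102, pad_id: int = 0) -> List[int]:
    """Pad with 0 or truncate so that every sequence has exactly l_max entries."""
    if l_max < 2:
        raise ValueError("l_max must leave room for the two special tokens")
    if len(ids) + 2 <= l_max:
        body = list(ids)                 # short text: keep everything
    else:
        body = list(ids[: l_max - 2])    # long text: keep only the front part
    out = [cls_id] + body + [sep_id]
    return out + [pad_id] * (l_max - len(out))


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax producing the probability vector P."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, List[float]]:
    """Return the label y whose probability p_g is the largest, together with P."""
    p = softmax(logits)
    g = max(range(len(p)), key=p.__getitem__)
    return labels[g], p


padded = fix_length([5, 6, 7], l_max=8)
truncated = fix_length(list(range(1, 21)), l_max=8)
y, P = predict([0.1, 2.0, -1.0], ["retail of food", "software industry", "logistics"])
```

In a full implementation the logits would come from a classification head on top of the trained Bert model; here they are supplied by hand.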
Further, a specific method for using the processed word as a next-layer deep label in the step (4) is as follows:
(4.1) the cleaned data set from step (1.5) is $D3 = \{T_1, T_2, \dots, T_{b1}, \dots, T_{\mathrm{len}(D3)}\}$ with $T = \{id, content1, content2, content3\}$, where id, content1, content2 and content3 are the enterprise serial number, enterprise name, enterprise introduction and enterprise operation range respectively;
(4.2) defining D6 as the data set to be integrated and len(D6) as the number of texts to be integrated in D6, $D6 = \{T_1, T_2, \dots, T_a, \dots, T_{\mathrm{len}(D6)}\}$;
(4.3) integrating the enterprise name, enterprise introduction and operation range information into the integrated enterprise text content4, satisfying $T1 = \{id, content4\}$ and $D7 = \{T1_1, T1_2, \dots, T1_a, \dots, T1_{\mathrm{len}(D7)}\}$, where T1 is a single integrated text and D7 is the integrated enterprise data set;
(4.4) counting the words that distort the extraction result and establishing a stop word dictionary;
(4.5) establishing an enterprise dictionary by collecting professional vocabulary of the enterprise field;
(4.6) performing keyword extraction on all nouns in the integrated enterprise data set D7 using TextRank to obtain the extraction result set K1;
(4.7) performing keyword extraction on all nouns in the integrated enterprise data set D7 using TF-IDF to obtain the extraction result set K2;
(4.8) finally, performing keyword extraction on all nouns in the integrated enterprise data set D7 using an LDA topic model to obtain the extraction result set K3;
(4.9) sorting and merging the extracted keyword sets K1, K2 and K3 to obtain the keyword set $K = \{W_1, W_2, \dots, W_i, \dots, W_{\mathrm{len}(D7)}\}$, where $W_i$ is the keyword set of a single enterprise and $i < \mathrm{len}(D7)$;
(4.10) using the extracted keywords $W_i$ as further extended deepening labels;
(4.11) counting the obtained labels and marking all labels for the enterprises according to the hierarchical relationship.
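The TF-IDF extraction of step (4.7) and the merge of step (4.9) might look like the following sketch. Several choices are assumptions rather than part of the source: the texts are taken to be already segmented into nouns, the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is one common variant, and the merge ranks keywords by how many extractors proposed them, which is only one possible reading of "sorting and merging". The function names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def tfidf_keywords(docs: Sequence[Sequence[str]], stop_words: Set[str],
                   top_k: int = 3) -> List[List[str]]:
    """Return the top_k TF-IDF keywords of each pre-tokenized document."""
    filtered = [[w for w in d if w not in stop_words] for d in docs]
    n = len(filtered)
    df = Counter(w for d in filtered for w in set(d))  # document frequency
    results: List[List[str]] = []
    for d in filtered:
        tf = Counter(d)
        scores: Dict[str, float] = {
            w: (c / len(d)) * (math.log((1 + n) / (1 + df[w])) + 1)
            for w, c in tf.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_k])
    return results


def merge_keywords(*keyword_lists: Sequence[str]) -> List[str]:
    """Merge K1, K2, K3: more votes first, ties broken by first appearance."""
    votes: Counter = Counter()
    first_seen: Dict[str, int] = {}
    for kws in keyword_lists:
        for w in dict.fromkeys(kws):  # ignore repeats inside one list
            votes[w] += 1
            first_seen.setdefault(w, len(first_seen))
    return sorted(votes, key=lambda w: (-votes[w], first_seen[w]))


docs = [
    ["chip", "chip", "design", "company"],
    ["retail", "store", "company"],
    ["chip", "sensor", "company"],
]
K2 = tfidf_keywords(docs, stop_words={"company"}, top_k=2)
W = merge_keywords(["chip", "sensor"], ["sensor", "robot"], ["sensor", "chip"])
```

In practice the K1 and K3 inputs to `merge_keywords` would come from TextRank and an LDA topic model; here they are written out by hand.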
Further, the specific method in step (5) of applying the label deepening method to an enterprise portrait system to improve the accuracy of the labels and of the enterprise portrait system is as follows:
(5.1) the enterprise portrait system comprises a preprocessing module, a label classification deepening module, a keyword extraction deepening module, a label integration module and a portrait display module;
(5.2) inputting the text of the enterprise to be deepened, and preprocessing the text in the preprocessing module to remove noise;
(5.3) passing the preprocessed enterprise text to the label classification deepening module to perform classification deepening of the labels;
(5.4) integrating the enterprise name, enterprise introduction and operation range information, and further enriching the label content in the keyword extraction deepening module;
(5.5) integrating all the extended labels in the label integration module and marking all the labels for the enterprise;
(5.6) generating the enterprise portrait information and displaying the label information through the portrait display module.
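The five modules of steps (5.1) to (5.6) can be wired together as plain functions. The sketch below only shows the data flow; the names `Portrait`, `preprocess`, `build_portrait` and `render` are hypothetical, and the classifier and keyword extractor are trivial stand-ins for the trained Bert model and the TextRank, TF-IDF and LDA extractors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Portrait:
    enterprise_id: str
    first_layer_label: str
    extended_labels: List[str] = field(default_factory=list)


def preprocess(record: Dict[str, str]) -> Dict[str, str]:
    """Preprocessing module: whitespace normalization as a stand-in for noise removal."""
    return {k: " ".join(v.split()) for k, v in record.items()}


def build_portrait(record: Dict[str, str],
                   classify: Callable[[str], str],
                   extract: Callable[[str], List[str]]) -> Portrait:
    """Run classification deepening, keyword deepening and label integration."""
    clean = preprocess(record)
    label = classify(clean["content3"])  # label classification deepening module
    integrated = " ".join(clean[k] for k in ("content1", "content2", "content3"))
    keywords = extract(integrated)       # keyword extraction deepening module
    return Portrait(clean["id"], label, keywords)  # label integration module


def render(p: Portrait) -> str:
    """Portrait display module: one line per enterprise, shallow label first."""
    return f"{p.enterprise_id}: {p.first_layer_label} > {', '.join(p.extended_labels)}"


def stub_classify(text: str) -> str:
    # Stand-in for the trained classifier (assumption for this sketch).
    return "software industry" if "software" in text else "other"


def stub_extract(text: str) -> List[str]:
    # Stand-in for the merged keyword extractors (assumption for this sketch).
    return [w for w in ("software", "cloud") if w in text]


record = {"id": "3", "content1": " Acme  Soft ",
          "content2": "cloud platform vendor", "content3": "software development"}
portrait = build_portrait(record, stub_classify, stub_extract)
line = render(portrait)
```

Passing the classifier and extractor in as callables keeps the modules independent, so either one can be replaced without touching the rest of the pipeline.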
By adopting the technical scheme, the invention has the following beneficial effects:
Based on existing enterprise text label data sets, the invention uses Bert and keyword extraction to perform label layering and deepening modeling. Specifically, a Bert model first performs first-layer classification deepening on the enterprise operation range data set; the integrated data set is then further extracted and deepened by combining several extraction algorithms; finally, through label integration, the labels can accurately depict enterprise characteristics. At the same time, the label modeling speed is optimized, the working time of practitioners is shortened, and the operating efficiency of the enterprise portrait system is improved.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a flow diagram of data cleansing in an exemplary embodiment;
FIG. 3 is a flow chart of statistical screening data in an embodiment;
FIG. 4 is a flowchart of Bert model classification deepening in an exemplary embodiment;
FIG. 5 is a flowchart illustrating keyword extraction deepening in an exemplary embodiment;
FIG. 6 is a flow diagram illustrating an exemplary implementation of an enterprise representation system.
Detailed Description
The present invention is further illustrated below with reference to specific embodiments in conjunction with the national standards of engineering. It should be understood that these embodiments serve only to illustrate the invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims.
As shown in FIGS. 1-6, the enterprise portrait method based on label layered deepening modeling according to the present invention includes the following steps:
Step 1: deduplicating and removing null entries from the enterprise label data set D and the enterprise multi-source data set D1, and cleaning them to obtain enterprise data sets D2 and D3, specifically as follows:
Step 1.1: defining Text as a single multi-source information set to be cleaned, and defining id, content1, content2 and content3 as the enterprise serial number, enterprise name, enterprise introduction and enterprise operation range respectively, satisfying the relation $\mathrm{Text} = \{id, content1, content2, content3\}$;
Step 1.2: defining Text1 as a single enterprise operation range information set to be cleaned, and defining id, content3 and label as the enterprise serial number, enterprise operation range and enterprise label respectively, satisfying the relation $\mathrm{Text1} = \{id, content3, label\}$;
Step 1.3: defining D as the data set to be cleaned for first-layer label deepening and D1 as the data set to be cleaned for next-layer label deepening, where $D = \{\mathrm{Text1}_1, \mathrm{Text1}_2, \dots, \mathrm{Text1}_a, \dots, \mathrm{Text1}_{\mathrm{len}(D)}\}$, $\mathrm{Text1}_a$ is the a-th enterprise label data to be cleaned in D, $D1 = \{\mathrm{Text}_1, \mathrm{Text}_2, \dots, \mathrm{Text}_{a1}, \dots, \mathrm{Text}_{\mathrm{len}(D1)}\}$, $\mathrm{Text}_{a1}$ is the a1-th enterprise multi-source data to be cleaned in D1, len(D) is the number of texts in D with $a \in [1, \mathrm{len}(D)]$, and len(D1) is the number of texts in D1 with $a1 \in [1, \mathrm{len}(D1)]$;
Step 1.4: after deduplicating the texts in data set D and removing null entries, obtaining the cleaned first-layer enterprise data set $D2 = \{T_1, T_2, \dots, T_b, \dots, T_{\mathrm{len}(D2)}\}$, where $T_b$ is the b-th enterprise label data to be processed in D2, len(D2) is the number of texts in D2, and $b \in [1, \mathrm{len}(D2)]$;
Step 1.5: after deduplicating the texts in data set D1 and removing null entries, obtaining the cleaned next-layer enterprise data set $D3 = \{T_1, T_2, \dots, T_{b1}, \dots, T_{\mathrm{len}(D3)}\}$, where $T_{b1}$ is the b1-th enterprise multi-source data to be processed in D3, len(D3) is the number of texts in D3, and $b1 \in [1, \mathrm{len}(D3)]$.
Step 2: counting and screening the data set D2, screening out the label data set that cannot completely summarize enterprise characteristics and defining it as D4, and counting all label sets as the basis for deepening, specifically as follows:
Step 2.1: screening the D2 data set to pick out the data whose labels, such as wholesale industry and retail industry, cannot completely summarize enterprise characteristics but can be deepened by other labels, and defining it as $D4 = \{T_1, T_2, \dots, T_c, \dots, T_{\mathrm{len}(D4)}\}$; $D5 = \{T_1, T_2, \dots, T_d, \dots, T_{\mathrm{len}(D5)}\}$ represents the remaining data, n is the number of label categories in D4, and list4 represents the label set of D4;
Step 2.2: counting the D5 data set and collecting all of its labels as the basis for deepening, where m is the number of label categories in D5 and list5 is the label set of D5;
Step 2.3: using the labels in list5 as the target labels for label classification deepening;
Step 2.4: using the first-layer data set D5 as the training set, and classifying and deepening the D4 data set according to the list5 label set.
Step 3: constructing a Bert model, taking the data set D4 as the input of the model, and, after semantic learning, performing classification deepening of the first-layer labels with a softmax layer, specifically as follows:
Step 3.1: building a Bert model and training it with the D5 training set;
Step 3.2: processing the data set D4 so that each text content $T_c$ to be processed is fixed to a uniform length $L_{max}$;
Step 3.3: defining a loop variable i and assigning it the initial value 1;
Step 3.4: if $i \le \mathrm{len}(D4)$, jumping to step 3.5; otherwise, jumping to step 3.9;
Step 3.5: defining $\mathrm{len}(T_i)$ as the length of the i-th text; if $\mathrm{len}(T_i) + 2 \le L_{max}$, padding the text with 0 and jumping to the next step; otherwise truncating the text to its first $L_{max}$ units and jumping to the next step;
Step 3.6: $i = i + 1$;
Step 3.7: feeding each text into the Token Embeddings layer of the BERT model, with the output denoted V1, while the Segment Embeddings layer and the Position Embeddings layer extract the text information and position information, with the outputs denoted V2 and V3;
Step 3.8: adding the three outputs V1, V2 and V3 to obtain the result V, using the vector V as the input of the BERT model, and obtaining in the last layer of neurons the word vector sequence $s_i = \{V(W_1), V(W_2), \dots, V(W_f), \dots, V(W_{L_{max}})\}$, where $V(W_f)$ is the f-th vector representation combining the text information;
Step 3.9: ending the loop and outputting the word vector sequences $S = \{s_1, s_2, s_3, \dots, s_f, \dots, s_{\mathrm{len}(D3)}\}$;
Step 3.10: performing document classification prediction on the vector sequences with the softmax function to obtain the classification probability prediction vector $P = \{p_1, p_2, \dots, p_g, \dots, p_h\}$, where $p_g$ is the probability that the text belongs to the g-th class and h is the total number of classes;
Step 3.11: finding the maximum value in the vector P and outputting the result corresponding to it, namely the label classification deepening result y.
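The element-wise sum of steps 3.7 and 3.8 can be made concrete with small lookup tables. This is a toy sketch under stated assumptions: the tables are randomly initialized rather than learned, the sizes are tiny, and the names `make_table` and `embed` are hypothetical. Standard BERT implementations also apply layer normalization and dropout to the sum, which is omitted here.

```python
import random
from typing import List, Sequence

Vector = List[float]


def make_table(rows: int, dim: int, seed: int) -> List[Vector]:
    """Random lookup table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids: Sequence[int], segment_ids: Sequence[int],
          token_table: List[Vector], segment_table: List[Vector],
          position_table: List[Vector]) -> List[Vector]:
    """Compute V = V1 + V2 + V3 for every position of one text."""
    out: List[Vector] = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        v1 = token_table[tok]        # token embedding
        v2 = segment_table[seg]      # segment embedding
        v3 = position_table[pos]     # position embedding
        out.append([a + b + c for a, b, c in zip(v1, v2, v3)])
    return out


DIM, VOCAB, L_MAX = 4, 10, 6  # toy sizes chosen for the example
token_table = make_table(VOCAB, DIM, seed=0)
segment_table = make_table(2, DIM, seed=1)
position_table = make_table(L_MAX, DIM, seed=2)

token_ids = [1, 5, 6, 2, 0, 0]    # already padded to L_MAX
segment_ids = [0, 0, 0, 0, 0, 0]  # single-sentence input
V = embed(token_ids, segment_ids, token_table, segment_table, position_table)
```

The two padded positions share the same token and segment vectors and differ only through their position embeddings.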
Step 4: integrating the enterprise name, enterprise introduction and operation range information in the D3 data set, extracting keywords using TextRank, TF-IDF and an LDA topic model respectively, processing the extracted keywords, and using the processed words as the next layer of deepening labels, specifically as follows:
Step 4.1: the cleaned data set from step 1.5 is $D3 = \{T_1, T_2, \dots, T_{b1}, \dots, T_{\mathrm{len}(D3)}\}$ with $T = \{id, content1, content2, content3\}$, where id, content1, content2 and content3 are the enterprise serial number, enterprise name, enterprise introduction and enterprise operation range respectively;
Step 4.2: defining D6 as the data set to be integrated and len(D6) as the number of texts to be integrated in D6, $D6 = \{T_1, T_2, \dots, T_a, \dots, T_{\mathrm{len}(D6)}\}$;
Step 4.3: integrating the enterprise name, enterprise introduction and operation range information into the integrated enterprise text content4, satisfying $T1 = \{id, content4\}$ and $D7 = \{T1_1, T1_2, \dots, T1_a, \dots, T1_{\mathrm{len}(D7)}\}$, where T1 is a single integrated text and D7 is the integrated enterprise data set;
Step 4.4: counting the words that distort the extraction result and establishing a stop word dictionary;
Step 4.5: establishing an enterprise dictionary by collecting professional vocabulary of the enterprise field;
Step 4.6: performing keyword extraction on all nouns in the integrated enterprise data set D7 using TextRank to obtain the extraction result set K1;
Step 4.7: performing keyword extraction on all nouns in the integrated enterprise data set D7 using TF-IDF to obtain the extraction result set K2;
Step 4.8: finally, performing keyword extraction on all nouns in the integrated enterprise data set D7 using an LDA topic model to obtain the extraction result set K3;
Step 4.9: sorting and merging the extracted keyword sets K1, K2 and K3 to obtain the keyword set $K = \{W_1, W_2, \dots, W_i, \dots, W_{\mathrm{len}(D7)}\}$, where $W_i$ is the keyword set of a single enterprise and $i < \mathrm{len}(D7)$;
Step 4.10: using the extracted keywords $W_i$ as further extended deepening labels;
Step 4.11: counting the obtained labels and marking all labels for the enterprise according to the hierarchical relationship.
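For step 4.6, a bare-bones TextRank over a word co-occurrence graph might look like the sketch below. It is an illustration, not the patent's implementation: the window size, the damping factor 0.85, the fixed iteration count and the unweighted edges are conventional choices assumed here, and the input is taken to be a list of already extracted nouns.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def textrank(words: Sequence[str], window: int = 2, damping: float = 0.85,
             iterations: int = 50, top_k: int = 3) -> List[str]:
    """Rank words by PageRank-style scores on an undirected co-occurrence graph."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for i, w in enumerate(words):
        # Link each word to the next `window` words that follow it.
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)

    nodes = list(dict.fromkeys(words))  # unique words in order of first appearance
    score = {w: 1.0 for w in nodes}
    for _ in range(iterations):
        updated = {}
        for w in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            rank = sum(score[u] / len(graph[u]) for u in graph[w])
            updated[w] = (1 - damping) + damping * rank
        score = updated

    order = {w: k for k, w in enumerate(nodes)}
    return sorted(nodes, key=lambda w: (-score[w], order[w]))[:top_k]


nouns = ["sensor", "chip", "sensor", "robot", "sensor", "design"]
K1 = textrank(nouns, window=1, top_k=2)
```

With `window=1` the sample forms a star graph centred on "sensor", so that word receives the highest score; the remaining words tie and are ordered by first appearance.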
Step 5: applying the label deepening method to an enterprise portrait system to improve the accuracy of the labels and of the enterprise portrait system, specifically as follows:
Step 5.1: the enterprise portrait system comprises a preprocessing module, a label classification and deepening module, a keyword extraction and deepening module, a label integration module, and a portrait display module;
Step 5.2: inputting the text of the enterprise to be deepened, and preprocessing the text in the preprocessing module to remove noise;
Step 5.3: passing the preprocessed enterprise text to the label classification and deepening module to perform classification deepening of the labels;
Step 5.4: integrating the enterprise name, enterprise introduction, and business scope information, and further enriching the label content in the keyword extraction and deepening module;
Step 5.5: integrating all of the deepened labels in the label integration module and attaching all of them to the enterprise;
Step 5.6: generating the enterprise portrait information and displaying the label information through the portrait display module.
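The five-module flow of step 5 can be pictured as a simple function pipeline. The sketch below is only illustrative: every function name is hypothetical, the classifier is replaced by a substring rule table standing in for the trained BERT model, and keyword extraction is replaced by a vocabulary lookup.

```python
def preprocess(text):
    """Preprocessing module: collapse whitespace (placeholder noise removal)."""
    return " ".join(text.split())

def classify_label(text, rules):
    """Label classification module: rule table standing in for the classifier."""
    for trigger, label in rules.items():
        if trigger in text:
            return label
    return "unclassified"

def extract_keywords(text, vocabulary):
    """Keyword extraction module: keep vocabulary terms found in the text."""
    return [term for term in vocabulary if term in text]

def integrate_labels(name, first_layer, next_layer):
    """Label integration module: attach both label layers to the enterprise."""
    return {"enterprise": name, "first_layer": first_layer, "next_layer": next_layer}

def display(portrait):
    """Portrait display module: render the label hierarchy as indented text."""
    lines = [portrait["enterprise"], "  " + portrait["first_layer"]]
    lines += ["    " + kw for kw in portrait["next_layer"]]
    return "\n".join(lines)

def build_portrait(name, intro, scope, rules, vocabulary):
    """Run the modules in the order of steps 5.2 to 5.5."""
    name, intro, scope = (preprocess(t) for t in (name, intro, scope))
    first_layer = classify_label(scope, rules)
    merged_text = " ".join([name, intro, scope])
    next_layer = extract_keywords(merged_text, vocabulary)
    return integrate_labels(name, first_layer, next_layer)
```

Keeping each module as a separate function mirrors the module boundaries named in step 5.1, so a real classifier or keyword extractor could replace a stand-in without touching the rest of the pipeline.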
Table 1: Description of variables
[Table 1 appears as images in the original publication (Figures BDA0002787662650000081, BDA0002787662650000091, and BDA0002787662650000101) and is not reproduced here.]
The above description is only an example of the present invention and is not intended to limit the present invention. All equivalents which come within the spirit of the invention are therefore intended to be embraced therein. Details not described herein are well within the skill of those in the art.

Claims (6)

1. An enterprise portrait method based on label layering and deepening modeling is characterized by comprising the following specific steps:
(1) deduplicating and removing null values from the enterprise label dataset D and the enterprise multi-source dataset D1, and cleaning them to obtain the enterprise datasets D2 and D3;
(2) compiling statistics on and screening the dataset D2, screening out the label dataset that cannot fully summarize enterprise characteristics and defining it as D4, and collecting all label sets as the deepening basis;
(3) constructing a BERT model, taking the dataset D4 as the model input, and, after semantic learning, performing classification deepening of the first layer of labels using a softmax layer;
(4) integrating the enterprise name, enterprise introduction, and business scope information in the dataset D3, extracting keywords with TextRank, TF-IDF, and an LDA topic model respectively, then processing the extracted keywords and using the processed words as the deepened labels of the next layer;
(5) applying the label deepening method to an enterprise portrait system to improve the accuracy of the labels and of the enterprise portrait system.
2. The enterprise portrait method based on label layering and deepening modeling according to claim 1, wherein the specific method for obtaining the enterprise datasets D2 and D3 in step (1) is as follows:
(1.1) defining Text as a single multi-source information set to be cleaned, and defining id, content1, content2, and content3 as the enterprise serial number, enterprise name, enterprise introduction, and enterprise business scope respectively, satisfying the relation
Text = {id, content1, content2, content3};
(1.2) defining Text1 as a single enterprise business scope information set to be cleaned, and defining id, content3, and label as the enterprise serial number, enterprise business scope, and enterprise label respectively, satisfying the relation Text1 = {id, content3, label};
(1.3) defining D as the dataset to be cleaned for first-layer label deepening and D1 as the dataset to be cleaned for next-layer label deepening, where D = {Text1_1, Text1_2, …, Text1_a, …, Text1_{len(D)}}, Text1_a is the a-th enterprise label record to be cleaned in D, D1 = {Text_1, Text_2, …, Text_{a1}, …, Text_{len(D1)}}, and Text_{a1} is the a1-th enterprise multi-source record to be cleaned in D1; len(D) is the number of texts in D, with a ∈ [1, len(D)], and len(D1) is the number of texts in D1, with a1 ∈ [1, len(D1)];
(1.4) deduplicating and removing null values from the texts in the dataset D to obtain the cleaned first-layer enterprise dataset D2 = {T_1, T_2, …, T_b, …, T_{len(D2)}}, where T_b is the b-th enterprise label record to be processed in D2, len(D2) is the number of texts in D2, and b ∈ [1, len(D2)];
(1.5) deduplicating and removing null values from the texts in the dataset D1 to obtain the cleaned next-layer enterprise dataset D3 = {T_1, T_2, …, T_{b1}, …, T_{len(D3)}}, where T_{b1} is the b1-th enterprise multi-source record to be processed in D3, len(D3) is the number of texts in D3, and b1 ∈ [1, len(D3)].
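The cleaning described in steps (1.4) and (1.5) might look like the sketch below. It is an assumption-laden illustration, not the claimed procedure: records are modeled as dictionaries, a record counts as null when any required field is missing or blank, and duplicates are detected on the content fields rather than on id. The name `clean_records` is hypothetical.

```python
def clean_records(records, required_fields):
    """Drop records with empty required fields, then drop content duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Treat None and whitespace-only strings as null values.
        values = tuple((record.get(f) or "").strip() for f in required_fields)
        if any(v == "" for v in values):
            continue
        # Two records are duplicates if all required fields match.
        if values in seen:
            continue
        seen.add(values)
        cleaned.append(record)
    return cleaned
```

Under these assumptions, calling `clean_records(D, ["content3", "label"])` would play the role of producing D2 from D, and the same function with `["content1", "content2", "content3"]` would produce D3 from D1.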
3. The enterprise portrait method based on label layering and deepening modeling according to claim 1, wherein the specific method in step (2) for screening out the label dataset that cannot fully summarize enterprise characteristics, defining it as D4, and collecting all label sets as the deepening basis is as follows:
(2.1) screening the D2 dataset to select the data whose labels cannot fully summarize enterprise characteristics but can be deepened into other labels, such as the wholesale industry and the retail industry, and defining it as D4 = {T_1, T_2, …, T_c, …, T_{len(D4)}}; D5 = {T_1, T_2, …, T_d, …, T_{len(D5)}} denotes the remaining data; the number of label categories in D4 is n, and list4 denotes the label set of D4;
(2.2) compiling statistics on the D5 dataset and collecting all of its labels as the deepening basis, where m is the number of label categories in D5 and list5 is the label set of D5;
(2.3) using the label set list5 as the target labels for label classification deepening;
(2.4) using the first-layer dataset D5 as the training set, and performing classification deepening on the D4 dataset according to the label set list5.
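The split in steps (2.1) and (2.2) can be sketched as a partition of D2 by label. In this hedged example the set of labels judged too coarse (`coarse_labels`) is assumed to be supplied by hand, since the claim does not state how it is chosen; the function name `split_by_label` is hypothetical.

```python
def split_by_label(d2, coarse_labels):
    """Partition D2 into D4 (coarse labels to deepen) and D5 (the rest)."""
    d4 = [r for r in d2 if r["label"] in coarse_labels]
    d5 = [r for r in d2 if r["label"] not in coarse_labels]
    # list4 and list5 are the distinct label sets; n and m are their sizes.
    list4 = sorted({r["label"] for r in d4})
    list5 = sorted({r["label"] for r in d5})
    return d4, d5, list4, list5
```

Here `len(list4)` corresponds to n and `len(list5)` to m, and D5 with its labels in list5 would serve as the training set for re-labeling the records in D4.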
4. The enterprise portrait method based on label layering and deepening modeling according to claim 1, wherein the specific method in step (3) for performing classification deepening of the first layer of labels using a softmax layer is as follows:
(3.1) building a BERT model and training it on the D5 training set;
(3.2) processing the dataset D4 so that each text content T_c to be processed is fixed to a uniform length L_max;
(3.3) defining a loop variable i with an initial value of 1;
(3.4) if i ≤ len(D4), proceeding to step (3.5); otherwise, proceeding to step (3.9);
(3.5) defining len(T_i) as the length of the i-th text; if len(T_i) + 2 ≤ L_max, padding with 0 and proceeding to the next step; otherwise, truncating the text to its first L_max units and proceeding to the next step;
(3.6) i = i + 1;
(3.7) feeding each text into the Token Embeddings layer of the BERT model, with the output denoted V1, and simultaneously extracting text information and position information through the Segment Embeddings layer and the Position Embeddings layer, with the outputs denoted V2 and V3;
(3.8) adding the three outputs V1, V2, and V3 to obtain V, using the vector V as the input to the BERT model, and obtaining the word vector sequence s_i = {V(W_1), V(W_2), …, V(W_f), …, V(W_{Lmax})} from the last layer of neurons, where V(W_f) is the f-th vector representation of the combined text information;
(3.9) ending the loop and outputting the word vector sequence S = {s_1, s_2, s_3, …, s_f, …, s_{len(D3)}};
(3.10) performing document classification prediction on the vector sequence using the softmax function to obtain the classification probability prediction vector P = {p_1, p_2, …, p_g, …, p_h}, where p_g is the probability that the text belongs to class g and h is the total number of classes;
(3.11) finding the maximum value in the vector P and outputting the corresponding result as the label classification deepening result y.
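Steps (3.5), (3.10), and (3.11) can be illustrated without a trained model. The sketch below assumes the "+2" in step (3.5) reserves room for the two special tokens that BERT places around a sequence, so truncation here keeps L_max units in total including those tokens; that reading, the placeholder ids `CLS_ID`, `SEP_ID`, and `PAD_ID`, and both function names are assumptions rather than details from the claim.

```python
import math

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # placeholder ids, assumed for illustration

def pad_or_truncate(token_ids, l_max):
    """Fix a token id sequence to exactly l_max units, as in step (3.5)."""
    if len(token_ids) + 2 <= l_max:
        # Short text: add the special tokens, then pad with zeros.
        seq = [CLS_ID] + token_ids + [SEP_ID]
        return seq + [PAD_ID] * (l_max - len(seq))
    # Long text: keep the leading tokens so the total length is l_max.
    return [CLS_ID] + token_ids[: l_max - 2] + [SEP_ID]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels):
    """Return the label with the highest probability and the vector P."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

In a full system the `logits` would come from a classification layer on top of the BERT output and `labels` would be list5; here they are plain lists so the selection of y in step (3.11) can be followed in isolation.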
5. The enterprise portrait method based on label layering and deepening modeling according to claim 1, wherein the specific method in step (4) for using the processed words as the deepened labels of the next layer is as follows:
(4.1) taking the cleaned dataset from step (1.5), D3 = {T_1, T_2, …, T_{b1}, …, T_{len(D3)}}, with T = {id, content1, content2, content3}, where id, content1, content2, and content3 are the enterprise serial number, enterprise name, enterprise introduction, and enterprise business scope respectively;
(4.2) defining D6 as the dataset to be integrated and len(D6) as the number of texts to be integrated in D6, where D6 = {T_1, T_2, …, T_a, …, T_{len(D6)}};
(4.3) integrating the enterprise name, enterprise introduction, and business scope information, wherein the integrated enterprise text is content4, satisfying T1 = {id, content4} and D7 = {T1_1, T1_2, …, T1_a, …, T1_{len(D7)}}, where T1 is a single integrated text and D7 is the integrated enterprise dataset;
(4.4) collecting the words that adversely affect the extraction result and building a stop-word dictionary;
(4.5) building an enterprise dictionary by collecting domain-specific vocabulary from the enterprise field;
(4.6) performing keyword extraction on all nouns in the integrated enterprise dataset D7 using TextRank to obtain the extraction result set K1;
(4.7) performing keyword extraction on all nouns in the integrated enterprise dataset D7 using TF-IDF to obtain the extraction result set K2;
(4.8) performing keyword extraction on all nouns in the integrated enterprise dataset D7 using an LDA topic model to obtain the extraction result set K3;
(4.9) sorting and merging the extracted keyword sets K1, K2, and K3 to obtain the keyword set K = {W_1, W_2, …, W_i, …, W_{len(D7)}}, where W_i is the keyword set of a single enterprise and i < len(D7);
(4.10) using the extracted keywords W_i as the labels of the next, further deepened layer;
(4.11) collecting the resulting labels and attaching all of them to the enterprise according to their hierarchical relationship.
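For step (4.6), a bare-bones TextRank over one enterprise's noun tokens could be sketched as follows. This is a simplified, unweighted variant written for illustration: the co-occurrence window, damping factor, and iteration count are assumed defaults, and the function name `textrank_keywords` is hypothetical.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank-style scores on a co-occurrence graph."""
    # Link each word to the words that follow it within the window.
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    # Sort by descending score, then alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

Words that co-occur with many different neighbors accumulate the highest scores, which is why a term repeated throughout a business scope description tends to surface as a keyword.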
6. The enterprise portrait method based on label layering and deepening modeling according to claim 1, wherein the specific method in step (5) for applying the label deepening method to the enterprise portrait system to improve the accuracy of the labels and of the enterprise portrait system is as follows:
(5.1) the enterprise portrait system comprises a preprocessing module, a label classification and deepening module, a keyword extraction and deepening module, a label integration module, and a portrait display module;
(5.2) inputting the text of the enterprise to be deepened, and preprocessing the text in the preprocessing module to remove noise;
(5.3) passing the preprocessed enterprise text to the label classification and deepening module to perform classification deepening of the labels;
(5.4) integrating the enterprise name, enterprise introduction, and business scope information, and further enriching the label content in the keyword extraction and deepening module;
(5.5) integrating all of the deepened labels in the label integration module and attaching all of them to the enterprise;
(5.6) generating the enterprise portrait information and displaying the label information through the portrait display module.
CN202011303829.0A 2020-11-19 2020-11-19 Enterprise portrait method based on label layering and deepening modeling Active CN112580332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011303829.0A CN112580332B (en) 2020-11-19 2020-11-19 Enterprise portrait method based on label layering and deepening modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011303829.0A CN112580332B (en) 2020-11-19 2020-11-19 Enterprise portrait method based on label layering and deepening modeling

Publications (2)

Publication Number Publication Date
CN112580332A true CN112580332A (en) 2021-03-30
CN112580332B CN112580332B (en) 2022-07-12

Family

ID=75122937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011303829.0A Active CN112580332B (en) 2020-11-19 2020-11-19 Enterprise portrait method based on label layering and deepening modeling

Country Status (1)

Country Link
CN (1) CN112580332B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512245A (en) * 2015-11-30 2016-04-20 青岛智能产业技术研究院 Enterprise figure building method based on regression model
CN110135901A (en) * 2019-05-10 2019-08-16 重庆天蓬网络有限公司 A kind of enterprise customer draws a portrait construction method, system, medium and electronic equipment
CN110705855A (en) * 2019-09-23 2020-01-17 清华苏州环境创新研究院 Enterprise environment portrait evaluation method and system
CN111191001A (en) * 2019-12-23 2020-05-22 浙江大胜达包装股份有限公司 Enterprise multi-element label identification method for paper package and related industries thereof
CN111950932A (en) * 2020-08-26 2020-11-17 北京信息科技大学 Multi-source information fusion-based comprehensive quality portrait method for small and medium-sized micro enterprises

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LACASSE, P.M ET.AL: "A hierarchical, fuzzy inference approach to data filtration and feature prioritization in the connected manufacturing enterprise", 《JOURNAL OF BIG DATA》 *
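丁行硕 et al.: "Enterprise portrait construction method based on label layering and deepening modeling" (基于标签分层延深建模的企业画像构建方法), 《计算机应用》 (Journal of Computer Applications) *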

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836903A (en) * 2021-08-17 2021-12-24 淮阴工学院 Method and device for extracting enterprise portrait label based on situation embedding and knowledge distillation
CN113836903B (en) * 2021-08-17 2023-07-18 淮阴工学院 Enterprise portrait tag extraction method and device based on situation embedding and knowledge distillation
CN114218380A (en) * 2021-12-03 2022-03-22 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN114398485A (en) * 2021-12-29 2022-04-26 淮阴工学院 Expert portrait construction method and device based on multi-view fusion

Also Published As

Publication number Publication date
CN112580332B (en) 2022-07-12

Similar Documents

Publication Publication Date Title
Hidayat et al. Sentiment analysis of twitter data related to Rinca Island development using Doc2Vec and SVM and logistic regression as classifier
CN112580332B (en) Enterprise portrait method based on label layering and deepening modeling
CN112434151A (en) Patent recommendation method and device, computer equipment and storage medium
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
Rashid et al. Feature level opinion mining of educational student feedback data using sequential pattern mining and association rule mining
CN111353050A (en) Word stock construction method and tool in vertical field of telecommunication customer service
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN112836509A (en) Expert system knowledge base construction method and system
TWI828928B (en) Highly scalable, multi-label text classification methods and devices
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
CN116468460A (en) Consumer finance customer image recognition system and method based on artificial intelligence
CN111754208A (en) Automatic screening method for recruitment resumes
CN114942974A (en) E-commerce platform commodity user evaluation emotional tendency classification method
CN114239828A (en) Supply chain affair map construction method based on causal relationship
Swami et al. Resume classifier and summarizer
CN107908749A (en) A kind of personage's searching system and method based on search engine
Schmitt et al. Outlier detection on semantic space for sentiment analysis with convolutional neural networks
CN114817533A (en) Bullet screen emotion analysis method based on time characteristics
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN111667306A (en) Customized production-oriented customer demand identification method, system and terminal
Yang et al. Graph convolutional networks with dependency parser towards multiview representation learning for sentiment analysis
Verma et al. Predicting Sentiment from Movie Reviews Using Machine Learning Approach
Yadao et al. A semantically enhanced deep neural network framework for reputation system in web mining for Covid-19 Twitter dataset
Cook Learning context-aware representations of subtrees

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230512

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Patentee after: Yami Technology (Guangzhou) Co.,Ltd.

Address before: 223005 Jiangsu Huaian economic and Technological Development Zone, 1 East Road.

Patentee before: HUAIYIN INSTITUTE OF TECHNOLOGY