CN113947079A - Method and device for generating enterprise industry label - Google Patents
- Publication number
- CN113947079A (application number CN202111266471.3A)
- Authority
- CN
- China
- Prior art keywords
- target
- industry
- enterprise
- probability
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
Abstract
The application discloses a method and a device for generating enterprise industry labels, wherein the method comprises the following steps: acquiring the name and the operation range information of a target enterprise; extracting target feature words from the name of the target enterprise; determining the industries corresponding to the target feature words as industry labels of the target enterprise based on the correspondence between each feature word and an industry; if the operation range information of the target enterprise is obtained, processing the operation range information by utilizing a grouping model in a pre-trained industry classification model to obtain the probability that the target enterprise belongs to each group; selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the models corresponding to the target groups respectively to obtain the primary industries to which the target enterprise belongs; selecting at least one primary industry from the primary industries to which the target enterprise belongs, and determining it as an industry label of the target enterprise; and summarizing and outputting all the determined industry labels of the target enterprise.
Description
Technical Field
The application relates to the technical field of industry division, in particular to a method and a device for generating enterprise industry labels.
Background
Nowadays, many business scenarios require each enterprise to be labeled with its related industries, so that enterprises can be associated with one another, or so that business can be conducted with an enterprise on the basis of its industry label.
At present, an enterprise's industry label is usually generated manually: based on the enterprise's related information, a worker relies on personal experience or on records of previously generated labels to determine the industry of the enterprise in question, and then enters the corresponding industry label.
However, since an enterprise does not necessarily belong to only one industry, and some industries are relatively obscure, manual labeling is relatively inaccurate and inefficient.
Disclosure of Invention
In view of the above defects of the prior art, the application provides a method and a device for generating an enterprise industry label, so as to solve the relatively low accuracy and efficiency of existing ways of generating industry labels.
In order to achieve the above object, the present application provides the following technical solutions:
the first aspect of the present application provides a method for generating an enterprise industry tag, including:
acquiring the name and the operation range information of a target enterprise;
if the name of the target enterprise is obtained, extracting a target feature word from the name of the target enterprise;
determining the industries corresponding to the target feature words as industry labels of the target enterprise based on a predetermined correspondence between each feature word and an industry; wherein the industry corresponding to a feature word is the industry with the maximum value among the probabilities that the feature word belongs to each industry, and the probability that a feature word belongs to each industry is calculated from the statistically obtained word frequency of the feature word in each industry;
if the operation range information of the target enterprise is obtained, processing the operation range information by utilizing a grouping model in a pre-trained industry classification model to obtain the probability that the target enterprise belongs to each group;
selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the models corresponding to the target groups respectively to obtain the primary industries to which the target enterprise belongs; wherein the models corresponding to the groups form a second-layer network of the industry classification model;
selecting at least one primary industry from the primary industries to which the target enterprise belongs, and determining the selected primary industry as an industry label of the target enterprise;
and summarizing and outputting the determined industry labels of all the target enterprises.
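The feature-word step above can be illustrated with a minimal sketch. It assumes the probability that a word belongs to an industry is estimated as the word's relative frequency across industries; the source only states that the probability is calculated from statistically obtained word frequencies, so this estimator, the function names and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict

def build_word_to_industry(labeled_names):
    """Map each feature word to its most probable industry.

    labeled_names: iterable of (words, industry) pairs. This training
    format is an assumption made for illustration only.
    """
    freq = defaultdict(Counter)  # word -> Counter(industry -> word frequency)
    for words, industry in labeled_names:
        for word in words:
            freq[word][industry] += 1

    mapping = {}
    for word, counts in freq.items():
        total = sum(counts.values())
        # Assumed estimator: relative word frequency as the probability.
        probs = {industry: c / total for industry, c in counts.items()}
        # The industry with the maximum probability is the word's industry.
        mapping[word] = max(probs, key=probs.get)
    return mapping

def labels_from_name(name_words, mapping):
    """Return the industries of all feature words found in an enterprise name."""
    return sorted({mapping[w] for w in name_words if w in mapping})

corpus = [
    (["steel", "trading"], "metals"),
    (["steel", "works"], "metals"),
    (["steel", "logistics"], "transport"),
    (["logistics"], "transport"),
]
word_to_industry = build_word_to_industry(corpus)
name_labels = labels_from_name(["acme", "steel", "logistics"], word_to_industry)
print(name_labels)
```

Words that never appear in the correspondence (here "acme") simply yield no label, which matches the case where no target feature word is extracted from the name.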
Optionally, in the foregoing method, the selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the models corresponding to the target groups, respectively, to obtain the primary industry to which the target enterprise belongs includes:
if the target enterprise belongs to a small-scale enterprise and the target characteristic word is not extracted from the name of the target enterprise, taking the group with the maximum probability to which the target enterprise belongs as a target group;
processing the operation range information through a model corresponding to the target group to obtain the probability that the target enterprise belongs to a plurality of industries;
and determining the industry corresponding to the maximum value in the probability that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
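A minimal sketch of this branch: the grouping model scores the groups, the single most probable group is taken as the target group, and that group's model picks the most probable industry. The models are passed in as plain callables returning probability dictionaries; these callables, the toy group and industry names, and the function name are hypothetical stand-ins for the trained two-layer network.

```python
def classify_top_group(scope_text, grouping_model, group_models):
    """Pick the most probable group, then the most probable industry in it."""
    group_probs = grouping_model(scope_text)              # group -> probability
    target_group = max(group_probs, key=group_probs.get)  # maximum-probability group
    industry_probs = group_models[target_group](scope_text)
    primary_industry = max(industry_probs, key=industry_probs.get)
    return target_group, primary_industry

# Toy stand-ins for the trained models (assumptions for illustration only).
def toy_grouping_model(text):
    if "repair" in text:
        return {"services": 0.7, "manufacturing": 0.3}
    return {"services": 0.2, "manufacturing": 0.8}

toy_group_models = {
    "services": lambda text: {"vehicle repair": 0.6, "catering": 0.4},
    "manufacturing": lambda text: {"auto parts": 0.9, "textiles": 0.1},
}

result = classify_top_group("car repair and maintenance",
                            toy_grouping_model, toy_group_models)
print(result)
```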
Optionally, in the foregoing method, the selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the models corresponding to the target groups, respectively, to obtain the primary industry to which the target enterprise belongs includes:
if the target enterprise does not belong to a small-scale enterprise and the target feature words are extracted from the name of the target enterprise, judging whether the difference value between the maximum probability and the second maximum probability among the probabilities that the target enterprise belongs to each group is smaller than a first threshold value; wherein the second maximum probability is the largest probability value below the maximum probability;
if the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to the groups is smaller than a first threshold value, determining the group corresponding to the maximum probability and the group corresponding to the second maximum probability as a target group;
for each target group, processing the operation range information through the model corresponding to that target group to obtain the probabilities that the target enterprise belongs to a plurality of industries;
for each target group, determining the industry corresponding to the maximum value among those probabilities as a primary industry to which the target enterprise belongs;
if the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to each group is judged to be not smaller than a first threshold value, determining the group corresponding to the maximum probability as a target group;
processing the operation range information through a model corresponding to the target group to obtain the probability that the target enterprise belongs to a plurality of industries;
and determining the industry corresponding to the maximum value in the probability that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
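The threshold test in this branch can be sketched as follows: when the top two group probabilities are closer than the first threshold, both groups become target groups, otherwise only the top group does. The function name and the example probabilities and threshold are hypothetical; the source does not give concrete values.

```python
def select_target_groups(group_probs, first_threshold):
    """Return one or two target groups depending on the top-two probability gap."""
    ranked = sorted(group_probs.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] < first_threshold:
        # The two leading groups are too close to call: keep both.
        return [ranked[0][0], ranked[1][0]]
    # The leading group is clearly ahead: keep it alone.
    return [ranked[0][0]]

close_call = select_target_groups({"A": 0.45, "B": 0.40, "C": 0.15}, 0.1)
clear_win = select_target_groups({"A": 0.70, "B": 0.20, "C": 0.10}, 0.1)
print(close_call, clear_win)
```

Each returned target group would then be passed to its own second-layer model, and the maximum-probability industry of each model taken as a primary industry.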
Optionally, in the foregoing method, the selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the models corresponding to the target groups, respectively, to obtain the primary industry to which the target enterprise belongs includes:
if the target enterprise does not belong to a small-scale enterprise and the target feature word is not extracted from the name of the target enterprise, judging whether the difference value between the maximum probability and the second maximum probability is smaller than a second threshold value;
if the difference value between the maximum probability and the second maximum probability is smaller than a second threshold value, determining the group corresponding to the maximum probability and the group corresponding to the second maximum probability as target groups;
for each target group, processing the operation range information through the model corresponding to that target group to obtain the probabilities that the target enterprise belongs to a plurality of industries;
and for each target group, determining the two industries ranked first when those probabilities are sorted in descending order as primary industries to which the target enterprise belongs.
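Taking the two most probable industries from one target group's model output can be sketched with the standard-library `heapq.nlargest`. The function name and the example probabilities are hypothetical.

```python
import heapq

def top_two_industries(industry_probs):
    """Return the two (industry, probability) pairs with the highest probability."""
    return heapq.nlargest(2, industry_probs.items(), key=lambda kv: kv[1])

top_two = top_two_industries({"retail": 0.5, "wholesale": 0.3, "catering": 0.2})
print(top_two)
```

With two target groups this yields up to four primary industries, two of which always come from the same group's model, so the later screening step has candidates from a shared group to compare.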
Optionally, in the above method, the selecting at least one primary industry from the primary industries to which the target enterprise belongs, and determining the selected at least one primary industry as the industry label of the target enterprise includes:
if each primary industry to which the target enterprise belongs is obtained by the model corresponding to a different target group, determining each primary industry to which the target enterprise belongs as an industry label of the target enterprise;
if any two primary industries to which the target enterprise belongs are obtained from the model corresponding to the same target group, sorting those primary industries in descending order of the probability that the target enterprise belongs to them;
determining the primary industry ranked first and all primary industries meeting a preset condition as industry labels of the target enterprise; wherein the preset condition is that, among the current primary industry and all primary industries ranked before it, the difference between the probabilities that the target enterprise belongs to any two adjacent primary industries is smaller than a third threshold value.
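The screening of primary industries that come from the same group's model can be sketched as below. The sketch reads the preset condition as a bound on the gap between adjacent probabilities in the sorted list, which is an interpretation of the text rather than something it states outright; the function name and example values are hypothetical.

```python
def screen_same_group(candidates, third_threshold):
    """Keep the top industry plus every following one reached without a large gap.

    candidates: list of (industry, probability) pairs from one group's model.
    """
    ranked = sorted(candidates, key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    labels = [ranked[0][0]]  # the first-ranked primary industry is always kept
    for (_, prev_p), (industry, p) in zip(ranked, ranked[1:]):
        if prev_p - p >= third_threshold:
            break  # a gap this large disqualifies this and all later industries
        labels.append(industry)
    return labels

kept = screen_same_group([("a", 0.40), ("c", 0.10), ("b", 0.35)], 0.1)
print(kept)
```

Stopping at the first large gap enforces the requirement that every adjacent pair up to and including the current industry stays within the threshold.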
Optionally, in the above method, the method further includes:
acquiring a text corpus;
extracting a plurality of candidate words from the text corpus;
calculating the corresponding score of each candidate word;
sorting all the candidate words in descending order of score to obtain a ranking result;
taking the candidate words ranked in the top N of the ranking result as target candidate words;
for each target candidate word, if the target candidate word does not exist in a word list, determining the target candidate word as a new word; wherein the word list stores the keywords already determined for each industry.
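The new-word steps above can be sketched as follows. The source does not say how a candidate word's score is computed, so the scoring function is passed in as a parameter, and the example uses raw frequency purely as a placeholder; all names and data are hypothetical.

```python
from collections import Counter

def discover_new_words(candidates, score, word_list, top_n):
    """Rank candidate words by score and return top-N words missing from word_list."""
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(set(candidates), key=lambda w: (-score(w), w))
    return [w for w in ranked[:top_n] if w not in word_list]

tokens = (["livestream"] * 4) + (["photovoltaic"] * 3) + (["steel"] * 2) + ["misc"]
counts = Counter(tokens)  # placeholder score: raw frequency (an assumption)
known_keywords = {"steel"}
new_words = discover_new_words(tokens, lambda w: counts[w], known_keywords, top_n=3)
print(new_words)
```

Here "steel" ranks in the top three but is already in the word list, so only the two unseen words are reported as new.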
Optionally, in the above method, the method further includes:
aiming at the determined industry label of each target enterprise, taking the industry corresponding to the industry label as a target industry;
determining representative enterprises in the target industry;
acquiring comparison information of the representative enterprise in the target industry and comparison information of the target enterprise; wherein the comparison information of an enterprise is the operation range information of the enterprise's upstream transaction counterparties;
if the semantic similarity between the comparison information of the representative enterprise in the target industry and the comparison information of the target enterprise is smaller than a preset similarity, calculating the semantic similarity between the comparison information of the target enterprise and the comparison information of the representative enterprises in other industries;
and replacing the target industry with the industry corresponding to the maximum of those semantic similarities, and returning, for the replaced target industry, to the step of determining the representative enterprise in the target industry, until the semantic similarity between the comparison information of the representative enterprise in the target industry and the comparison information of the target enterprise is not less than the preset similarity.
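The label-optimization loop above can be sketched as follows. Two things here are assumptions rather than part of the source: token-level Jaccard overlap is used as a simple stand-in for semantic similarity, and a `visited` guard is added so the loop stops if no industry ever reaches the preset similarity. All names and example data are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity, used only as a stand-in for semantic similarity."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def refine_industry(label, target_info, rep_info_by_industry, preset,
                    similarity=jaccard):
    """Replace the label until the representative enterprise is similar enough."""
    visited = set()  # added guard: avoids cycling between industries forever
    while label not in visited:
        visited.add(label)
        if similarity(rep_info_by_industry[label], target_info) >= preset:
            return label  # the current label is consistent with the upstream data
        others = {ind: similarity(info, target_info)
                  for ind, info in rep_info_by_industry.items() if ind != label}
        if not others:
            break
        label = max(others, key=others.get)  # most similar other industry
    return label

reps = {
    "catering": "flour vegetables meat supply",
    "construction": "cement steel sand supply",
}
refined = refine_industry("catering", "steel cement gravel supply", reps, preset=0.5)
print(refined)
```

In the example the upstream counterparties of the target enterprise look far more like those of the construction representative, so the initial "catering" label is replaced.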
The second aspect of the present application provides an apparatus for generating an enterprise industry tag, including:
the first acquisition unit is used for acquiring the name and the operation range information of a target enterprise;
the first extraction unit is used for extracting a target feature word from the name of the target enterprise if the name of the target enterprise is obtained;
the matching unit is used for determining the industries corresponding to the target feature words as industry labels of the target enterprise based on a predetermined correspondence between each feature word and an industry; wherein the industry corresponding to a feature word is the industry with the maximum value among the probabilities that the feature word belongs to each industry, and the probability that a feature word belongs to each industry is calculated from the statistically obtained word frequency of the feature word in each industry;
the grouping model is used for processing the operation range information by utilizing a grouping model in a pre-trained industry classification model if the operation range information of the target enterprise is obtained, so as to obtain the probability that the target enterprise belongs to each group;
the primary selection unit is used for selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the models corresponding to the target groups respectively to obtain the primary industries to which the target enterprise belongs; wherein the models corresponding to the groups form a second-layer network of the industry classification model;
the screening unit is used for selecting at least one primary industry from the primary industry to which the target enterprise belongs and determining the primary industry as the industry label of the target enterprise;
and the summarizing unit is used for summarizing and outputting the determined industry labels of all the target enterprises.
Optionally, in the above apparatus, when the target enterprise belongs to a small-scale enterprise and no target feature word is extracted from the name of the target enterprise, the primary selection unit, when selecting at least one target group based on the probability that the target enterprise belongs to each group and processing the operation range information through the models corresponding to the target groups respectively to obtain the primary industry to which the target enterprise belongs, is used for:
taking the group with the maximum probability to which the target enterprise belongs as a target group;
processing the operation range information through a model corresponding to the target group to obtain the probability that the target enterprise belongs to a plurality of industries;
and determining the industry corresponding to the maximum value in the probability that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
Optionally, in the above apparatus, when the target enterprise does not belong to a small-scale enterprise and a target feature word is extracted from the name of the target enterprise, the primary selection unit, when selecting at least one target group based on the probability that the target enterprise belongs to each group and processing the operation range information through the models corresponding to the target groups respectively to obtain the primary industry to which the target enterprise belongs, is used for:
judging whether the difference value between the maximum probability and the second maximum probability among the probabilities that the target enterprise belongs to each group is smaller than a first threshold value; wherein the second maximum probability is the largest probability value below the maximum probability;
if the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to the groups is smaller than a first threshold value, determining the group corresponding to the maximum probability and the group corresponding to the second maximum probability as a target group;
for each target group, processing the operation range information through the model corresponding to that target group to obtain the probabilities that the target enterprise belongs to a plurality of industries;
for each target group, determining the industry corresponding to the maximum value among those probabilities as a primary industry to which the target enterprise belongs;
if the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to each group is judged to be not smaller than a first threshold value, determining the group corresponding to the maximum probability as a target group;
processing the operation range information through a model corresponding to the target group to obtain the probability that the target enterprise belongs to a plurality of industries;
and determining the industry corresponding to the maximum value in the probability that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
Optionally, in the above apparatus, when the target enterprise does not belong to a small-scale enterprise and no target feature word is extracted from the name of the target enterprise, the primary selection unit, when selecting at least one target group based on the probability that the target enterprise belongs to each group and processing the operation range information through the models corresponding to the target groups respectively to obtain the primary industry to which the target enterprise belongs, is used for:
judging whether the difference value between the maximum probability and the second maximum probability is smaller than a second threshold value;
if the difference value between the maximum probability and the second maximum probability is smaller than a second threshold value, determining the group corresponding to the maximum probability and the group corresponding to the second maximum probability as target groups;
for each target group, processing the operation range information through the model corresponding to that target group to obtain the probabilities that the target enterprise belongs to a plurality of industries;
and for each target group, determining the two industries ranked first when those probabilities are sorted in descending order as primary industries to which the target enterprise belongs.
Optionally, in the above apparatus, the screening unit includes:
the first determining unit is used for determining each primary industry to which the target enterprise belongs as an industry label of the target enterprise if each primary industry to which the target enterprise belongs is obtained by the model corresponding to a different target group;
the first sequencing unit is used for sorting the primary industries in descending order of the probability that the target enterprise belongs to them if any two primary industries to which the target enterprise belongs are obtained from the model corresponding to the same target group;
the second determining unit is used for determining the primary industry ranked first and all primary industries meeting a preset condition as industry labels of the target enterprise; wherein the preset condition is that, among the current primary industry and all primary industries ranked before it, the difference between the probabilities that the target enterprise belongs to any two adjacent primary industries is smaller than a third threshold value.
Optionally, in the above apparatus, further comprising:
the second acquiring unit is used for acquiring text corpora;
the second extraction unit is used for extracting a plurality of candidate words from the text corpus;
the scoring unit is used for calculating a score corresponding to each candidate word;
the second sorting unit is used for sorting the candidate words in descending order of score to obtain a ranking result;
the selecting unit is used for taking the candidate words ranked in the top N of the ranking result as target candidate words;
the new word determining unit is used for determining, for each target candidate word, the target candidate word as a new word if the target candidate word does not exist in a word list; wherein the word list stores the keywords already determined for each industry.
Optionally, in the above apparatus, further comprising:
the target industry determining unit is used for regarding the determined industry label of each target enterprise, and taking the industry corresponding to the industry label as a target industry;
the representative determining unit is used for determining representative enterprises in the target industry;
the third acquisition unit is used for acquiring comparison information of the representative enterprise in the target industry and comparison information of the target enterprise; wherein the comparison information of an enterprise is the operation range information of the enterprise's upstream transaction counterparties;
the calculating unit is used for calculating the semantic similarity between the comparison information of the target enterprise and the comparison information of the representative enterprises in other industries if the semantic similarity between the comparison information of the representative enterprise in the target industry and the comparison information of the target enterprise is smaller than a preset similarity;
and the replacing unit is used for replacing the target industry with the industry corresponding to the maximum of those semantic similarities, and returning to the representative determining unit to determine, for the replaced target industry, the representative enterprise in the target industry, until the semantic similarity between the comparison information of the representative enterprise in the target industry and the comparison information of the target enterprise is not less than the preset similarity.
In the method for generating an enterprise industry label provided by the application, the industry label is generated from the enterprise name and the operation range information. Specifically, the name and the operation range information of the target enterprise are obtained. When the name of the target enterprise is obtained, target feature words are extracted from the name, and the industries corresponding to the target feature words are determined as industry labels of the target enterprise based on the predetermined correspondence between each feature word and an industry. The industry corresponding to a feature word is the industry with the maximum value among the probabilities that the feature word belongs to each industry, and those probabilities are calculated from the statistically obtained word frequency of the feature word in each industry. When the operation range information of the target enterprise is obtained, it is processed by the grouping model in the pre-trained industry classification model to obtain the probability that the target enterprise belongs to each group. At least one target group is then selected based on those probabilities, the operation range information is processed through the model corresponding to each target group to obtain the primary industries to which the target enterprise belongs, and at least one of those primary industries is selected and determined as an industry label of the target enterprise.
Finally, the determined industry labels of all target enterprises are collected and output, so that the industry labels of the enterprises are automatically generated based on the enterprise names and the operation range information, the accuracy and the comprehensiveness of the industry labels are effectively guaranteed, and the efficiency of labeling the industry labels is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for generating an enterprise industry label according to an embodiment of the present application;
Fig. 2 is a block diagram of a grouping model according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for acquiring a primary industry to which a target enterprise belongs according to an embodiment of the present application;
Fig. 4 is a flowchart of another method for acquiring a primary industry to which a target enterprise belongs according to an embodiment of the present application;
Fig. 5 is a flowchart of another method for acquiring a primary industry to which a target enterprise belongs according to an embodiment of the present application;
Fig. 6 is a flowchart of a method for optimizing an industry label according to an embodiment of the present application;
Fig. 7 is a flowchart of a method for discovering new business keywords according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an apparatus for generating an enterprise industry label according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In this application, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiment of the application provides a method for generating an enterprise industry label, as shown in fig. 1, the method includes:
S101, acquiring the name and the operation range information of the target enterprise.
The operation range information may specifically include an enterprise profile, the operation range description in the industrial and commercial registration, project descriptions, bid-winning descriptions, and the like.
It should be noted that the industries to which an enterprise belongs can be determined more comprehensively from its operation range information. However, the operation range information of an enterprise cannot always be obtained, whereas the enterprise name is usually available and can also reflect the industry to which the enterprise belongs. Therefore, in the embodiment of the present application, both the name and the operation range information of the target enterprise are acquired.
If both kinds of information of the target enterprise can be acquired, the industry labels of the target enterprise are determined based on each of them respectively. That is, when the name of the target enterprise is acquired in step S101, step S102 is executed; and when it is determined in step S104 that the operation range information of the target enterprise has been acquired, step S105 and the subsequent steps are executed.
S102, extracting target feature words from the name of the target enterprise.
It should be noted that, in the embodiments of the present application, feature words are words that are indicative of an industry. Target feature words are feature words that are contained in the name of the target enterprise and have a determined correspondence with an industry. Optionally, the correspondence between each feature word and an industry may be stored in a feature word table, in which case the target feature words are the words that are contained in the name of the target enterprise and stored in the feature word table.
Optionally, the name of the target enterprise may be segmented into words, and the target feature words may be extracted by comparing each segmented word with the feature words in the feature word table. Optionally, since feature words are usually among the last three words of an enterprise name, only the last three words of the name of the target enterprise may be compared in order to determine the target feature words.
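The lookup described above can be sketched as follows. This is only an illustrative sketch: it assumes the name has already been segmented into words by some segmentation tool, and the function name, the table contents, and the example name are hypothetical.

```python
def extract_target_feature_words(name_tokens, feature_table, tail=3):
    """Return the feature words found among the last `tail` words of a segmented name.

    name_tokens   -- the enterprise name, already segmented into words (assumed input)
    feature_table -- mapping from feature word to industry (hypothetical contents)
    """
    candidates = name_tokens[-tail:]
    return [word for word in candidates if word in feature_table]


# Hypothetical feature word table and segmented enterprise name.
feature_table = {"logistics": "road freight", "pharmaceutical": "medicine manufacturing"}
tokens = ["Shenzhen", "Hengtong", "logistics", "Co.", "Ltd."]
labels = [feature_table[w] for w in extract_target_feature_words(tokens, feature_table)]
```

If the returned list is empty, no target feature word was extracted and the name contributes no industry label.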
It should be noted that, after the target feature words are extracted in step S102, step S103 is performed. If no target feature word can be extracted, step S104 is performed directly.
S103, determining the industries corresponding to the target feature words as industry labels of the target enterprise based on the predetermined correspondence between each feature word and an industry.
The industry corresponding to a feature word is the industry for which the probability that the feature word belongs to it is highest. The probability that a feature word belongs to each industry is calculated from the statistically obtained word frequency of the feature word in each industry.
In the embodiment of the present application, a word segmentation tool is used in advance to segment the set of names of enterprises of known industries in the database, dividing each enterprise name into a plurality of words. The word frequency of the feature words in the corresponding industries is then counted. Specifically, the feature words of most enterprises are located among the last three words of the enterprise name. All name records in the set are scanned until every record has been processed. For each enterprise name, if a word is one of the last three words of the name, the word frequency of that word in the industry of the enterprise is increased by one. Then, from the counted word frequency of each feature word in each industry, the total word frequency of the feature word over all industries is calculated, the probability of the word in each industry is calculated, and the industry with the highest probability is selected to establish the correspondence with the word. Optionally, the correspondences between all feature words and industries may be screened manually and useless words filtered out to form the industry feature word table.
Therefore, after the target feature words are extracted, the industries corresponding to the target feature words can be determined as industry labels of the target enterprise.
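The word-frequency statistics of step S103 can be illustrated with the following sketch. It assumes that the labelled names are already segmented, estimates the probability of a word in an industry as its word frequency in that industry divided by its total word frequency over all industries, and counts a word at most once per name; the function name and the sample data are hypothetical, and manual screening is not modelled.

```python
from collections import Counter, defaultdict


def build_feature_word_table(labeled_names, tail=3):
    """Build a mapping: feature word -> (most probable industry, probability).

    labeled_names -- iterable of (segmented_name_tokens, industry) pairs
    """
    freq = defaultdict(Counter)  # word -> Counter(industry -> word frequency)
    for tokens, industry in labeled_names:
        # Only the last `tail` words of a name are counted; a word is counted
        # once per name (an assumption of this sketch).
        for word in set(tokens[-tail:]):
            freq[word][industry] += 1

    table = {}
    for word, by_industry in freq.items():
        total = sum(by_industry.values())
        industry, count = max(by_industry.items(), key=lambda kv: kv[1])
        table[word] = (industry, count / total)
    return table


# Hypothetical labelled sample.
sample = [
    (["A", "tech", "Co"], "software"),
    (["B", "tech", "Co"], "software"),
    (["C", "tech", "Co"], "hardware"),
]
table = build_feature_word_table(sample)
```

In this sample the word "Co" also receives an entry, which is the kind of useless word that the optional manual screening would remove.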
S104, judging whether the operation range information of the target enterprise has been acquired.
It should be noted that step S104 need not be performed after step S103. The execution order in the embodiment of the present application is only one optional manner, and step S104 only needs to be performed after step S101.
If it is determined that the operation range information of the target enterprise has been acquired, step S105 is performed.
S105, processing the operation range information by using the grouping model in the pre-trained industry classification model to obtain the probability that the target enterprise belongs to each group.
In the embodiment of the present application, the industry classification model includes two layers. The first layer is the grouping model, which is mainly used to group, that is, to coarsely classify, enterprises. The second layer includes a plurality of models, one for each group, and the model corresponding to a group is used to determine the industry to which an enterprise belongs based on the operation range information of the enterprise.
It should be noted that the industry classification model adopts a two-layer network mainly for two reasons. First, there are too many industry categories, about 1281, and if all industries were classified together, enterprises in similar industries would easily be assigned to overlapping categories. Second, the samples in the training data set are unevenly distributed; some industries have few training samples, and if all samples were used together to train one model, these samples would easily be overwhelmed by samples of other categories.
In the grouping model, all industries are divided into groups, for example into 40 groups. Industries may be grouped according to the industry chain, the number and scale of samples, and the like. Grouping is usually based on the number of samples: industries with larger numbers of samples are placed in the larger groups, and industries with smaller numbers of samples are placed in the smaller groups. Because the industry categories in each group have different numbers of samples, the models corresponding to the groups may use different network models, such as fastText, TextCNN, TextRNN, RCNN, a hierarchical attention network, a Seq2Seq model with an attention mechanism (Seq2Seq attn), a Transformer model, a dynamic memory network, an entity network (EntityNet), and traditional machine learning algorithms.
Optionally, after the performance of a plurality of models is compared, the TextCNN model is preferably selected as the grouping model in the embodiment of the present application. The framework of the grouping model may be as shown in fig. 2. Referring to fig. 2, when grouping a sentence, the grouping model first performs a convolution operation on the input matrix. Because the input is text data, the filter slides only downward over the rows of the matrix rather than horizontally, thereby extracting local correlations between adjacent characters. After the convolution feature vectors are obtained, a max pooling operation is performed on each vector, and the pooled feature values are concatenated to obtain the feature representation of the sentence. The sentence feature vector is then used for grouping to obtain the probability of each group.
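The convolution, max pooling, and concatenation steps described for fig. 2 can be illustrated with a small pure-Python forward pass. This is a didactic sketch rather than the trained grouping model: the input matrix, kernels, weights, and bias are hypothetical values, the final linear layer with softmax is an assumption about how the group probabilities are produced, and a real implementation would normally rely on a deep learning framework.

```python
import math


def conv_over_rows(matrix, kernel):
    """Slide a kernel that spans the full row width downward over the matrix rows."""
    height = len(kernel)
    out = []
    for top in range(len(matrix) - height + 1):
        s = 0.0
        for i in range(height):
            for a, b in zip(matrix[top + i], kernel[i]):
                s += a * b
        out.append(s)
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def group_probabilities(matrix, kernels, weights, bias):
    """matrix: one row per character (embedding); returns one probability per group.

    Assumes the matrix has at least as many rows as the tallest kernel.
    """
    # Max pooling keeps one value per kernel; the pooled values are concatenated.
    pooled = [max(conv_over_rows(matrix, k)) for k in kernels]
    # Assumed output layer: one linear score per group, normalised by softmax.
    scores = [sum(w * f for w, f in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)
```

Kernels of different heights correspond to filters that look at different numbers of adjacent characters.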
Optionally, the model used for each group may be chosen according to the sample size in the group and the classification performance of different models on that group. For example, if the sample size of the industries in a group is large enough, TextCNN may be used for classification; if the sample size is particularly small, other methods such as a Bayesian network, SVM, or GBDT may be used.
S106, selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the model corresponding to each target group to obtain the primary industries to which the target enterprise belongs.
The models corresponding to the groups form the second-layer network of the industry classification model.
Specifically, the target groups may be selected in descending order of the probability that the target enterprise belongs to each group. One or more groups may be selected, so that primary industries to which the target enterprise belongs can be output by different models.
It should be noted that a primary industry output by the model corresponding to a target group is not yet a finally determined industry with which the target enterprise is to be labeled. It is only an industry, output by that model, to which the target enterprise belongs with a relatively high probability that meets the requirement. Step S107 therefore needs to be performed to screen out the final industries.
It should be further noted that the model corresponding to each target group yields at least one primary industry to which the target enterprise belongs. The model obtains the probabilities that the target enterprise belongs to a plurality of industries, and one or more industries are then selected in descending order of probability as the primary industries output by the model, so that the number of primary industries output by the model can be changed dynamically as required.
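The two-layer dispatch of steps S105 and S106 can be sketched as below. The sketch assumes that the grouping model and each group's model are available as callables that return a probability per label; these callables, their names, and the example probabilities are hypothetical stand-ins for trained models.

```python
def top_k(prob_by_label, k):
    """Return the k (label, probability) pairs with the highest probability."""
    return sorted(prob_by_label.items(), key=lambda kv: kv[1], reverse=True)[:k]


def primary_industries(text, grouping_model, group_models, n_groups=1, n_industries=1):
    """First layer picks target groups; second layer picks industries per group."""
    group_probs = grouping_model(text)
    result = []
    for group, _ in top_k(group_probs, n_groups):
        industry_probs = group_models[group](text)
        result.extend(top_k(industry_probs, n_industries))
    return result


# Hypothetical stand-ins for the trained models.
grouping = lambda text: {"g1": 0.6, "g2": 0.3, "g3": 0.1}
models = {
    "g1": lambda text: {"a": 0.7, "b": 0.3},
    "g2": lambda text: {"c": 0.9, "d": 0.1},
    "g3": lambda text: {"e": 1.0},
}
picked = primary_industries("operation range text", grouping, models, n_groups=2)
```

Changing `n_groups` and `n_industries` corresponds to varying how many target groups and how many primary industries per group are taken.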
Optionally, in another embodiment of the present application, before step S106 is performed, it is further determined whether the target enterprise is a small-scale enterprise and whether a target feature word has been extracted from the name of the target enterprise.
In this embodiment, if it is determined that the target enterprise is a small-scale enterprise and no target feature word has been extracted from its name, a specific implementation of step S106, as shown in fig. 3, includes:
S301, taking the group with the highest probability to which the target enterprise belongs as the target group.
It should be noted that the operation range of a small-scale enterprise is relatively narrow and usually involves only a single industry, and its name can accurately reflect the industry to which it belongs. Therefore, if the target enterprise is a small-scale enterprise and a target feature word has been extracted from its name, the industry label of the target enterprise is obtained from the target feature word alone and need not be obtained from the operation range information; that is, step S105 and the subsequent steps are not performed.
If no target feature word has been extracted from the name of the target enterprise, the industry label of the target enterprise needs to be obtained from the operation range information. Since a small-scale enterprise is involved in few industries, only the group with the highest probability is selected as the target group in the embodiment of the present application.
S302, processing the operation range information through the model corresponding to the target group to obtain the probabilities that the target enterprise belongs to multiple industries.
S303, determining the industry with the highest probability among the probabilities that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
Since a small-scale enterprise is involved in few industries, only one industry is selected as the primary industry of the target enterprise in the embodiment of the present application.
If it is determined that the target enterprise is not a small-scale enterprise, that is, the target enterprise is a medium or large enterprise, and a target feature word has been extracted from its name, a specific implementation of step S106, as shown in fig. 4, includes:
S401, judging whether the difference between the maximum probability and the second maximum probability among the probabilities that the target enterprise belongs to each group is smaller than a first threshold.
The second maximum probability is the probability ranked second when the probabilities are sorted in descending order.
Optionally, the first threshold may be obtained by a maximum entropy method.
It should be noted that, if the difference between the maximum probability and the second maximum probability is smaller than the first threshold, the probabilities that the target enterprise belongs to the two top-ranked groups are close, so step S402 is executed. If the difference is not smaller than the first threshold, the probabilities for the two top-ranked groups differ considerably, and an industry label of the target enterprise can also be obtained from its name, so step S405 is executed.
S402, determining the group corresponding to the maximum probability and the group corresponding to the second maximum probability as target groups.
S403, for each target group, processing the operation range information through the model corresponding to the target group to obtain the probabilities that the target enterprise belongs to multiple industries.
S404, for each target group, determining the industry with the highest probability among the probabilities that the target enterprise belongs to the multiple industries as a primary industry to which the target enterprise belongs.
Since two target groups are selected, two primary industries to which the target enterprise belongs are obtained after step S404.
S405, determining the group corresponding to the maximum probability as the target group.
S406, processing the operation range information through the model corresponding to the target group to obtain the probabilities that the target enterprise belongs to multiple industries.
S407, determining the industry with the highest probability among the probabilities that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
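The decision in steps S401, S402, and S405 reduces to comparing the gap between the two largest group probabilities with a threshold. A minimal sketch follows; the function name, the example probabilities, and the threshold values are hypothetical, and the derivation of the threshold (for example by a maximum entropy method) is not shown.

```python
def select_target_groups(group_probs, threshold):
    """Return the top group, plus the runner-up when the two probabilities are close."""
    ranked = sorted(group_probs.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] < threshold:
        return [ranked[0][0], ranked[1][0]]
    return [ranked[0][0]]


# Hypothetical grouping-model output.
probs = {"g1": 0.45, "g2": 0.40, "g3": 0.15}
close = select_target_groups(probs, threshold=0.1)     # gap 0.05 is below 0.1
distinct = select_target_groups(probs, threshold=0.01)  # gap 0.05 is not below 0.01
```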
If the target enterprise is not a small-scale enterprise and no target feature word has been extracted from its name, a specific implementation of step S106, as shown in fig. 5, includes:
S501, judging whether the difference between the maximum probability and the second maximum probability is smaller than a second threshold.
It should be noted that, since no target feature word is obtained, no industry label of the target enterprise can be determined from a target feature word. More industry labels therefore need to be obtained from the models corresponding to the groups, which makes it desirable to select more than one group as target groups, and the second threshold is therefore usually greater than the first threshold.
Optionally, the second threshold may be obtained by a maximum entropy method.
If the difference between the maximum probability and the second maximum probability is smaller than the second threshold, step S502 is executed. If the difference is not smaller than the second threshold, the group corresponding to the maximum probability may be determined as the target group.
S502, determining the group corresponding to the maximum probability and the group corresponding to the second maximum probability as target groups.
S503, for each target group, processing the operation range information through the model corresponding to the target group to obtain the probabilities that the target enterprise belongs to multiple industries.
S504, for each target group, sorting the probabilities that the target enterprise belongs to the multiple industries in descending order and determining the industries corresponding to the first two probabilities as primary industries to which the target enterprise belongs.
That is, the model corresponding to each of the two target groups obtains the probabilities that the target enterprise belongs to multiple industries, and the two industries with the highest probabilities are taken from the result of each model as primary industries of the target enterprise, so that four primary industries are obtained in total.
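The three scenarios of figs. 3 to 5, together with the case in which the group models are skipped, differ only in which threshold is applied and how many industries are taken per target group. The following sketch summarises that choice; the function name and the return convention are hypothetical and serve only to make the branching explicit.

```python
def selection_plan(is_small_scale, has_feature_word, first_threshold, second_threshold):
    """Return (gap_threshold, industries_per_group), or None when the models are skipped.

    A gap_threshold of None means that only the most probable group is used.
    """
    if is_small_scale:
        if has_feature_word:
            return None                   # label from the name only; S105 onward skipped
        return (None, 1)                  # fig. 3: top group, top industry
    if has_feature_word:
        return (first_threshold, 1)       # fig. 4: up to two groups, one industry each
    return (second_threshold, 2)          # fig. 5: up to two groups, two industries each


plan = selection_plan(is_small_scale=False, has_feature_word=False,
                      first_threshold=0.1, second_threshold=0.2)
```

The threshold values used here are placeholders; the text only states that the second threshold is usually greater than the first.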
S107, selecting at least one primary industry from the primary industries to which the target enterprise belongs and determining it as an industry label of the target enterprise.
It should be noted that, when there is only one primary industry to which the target enterprise belongs, that primary industry is directly determined as the industry label of the target enterprise. If there are several primary industries, all or some of them may be selected as industry labels of the target enterprise. Specifically, the selection is based on the probability that the target enterprise belongs to each primary industry, with primary industries of higher probability selected first.
Optionally, in another embodiment of the present application, a specific implementation of step S107 includes:
If the primary industries to which the target enterprise belongs are each obtained from the model of a different target group, determining every primary industry to which the target enterprise belongs as an industry label of the target enterprise.
That is, for the scenario corresponding to the embodiment shown in fig. 3 and the scenario corresponding to the embodiment shown in fig. 4, all obtained primary industries to which the target enterprise belongs are determined as industry labels of the target enterprise.
If any two of the primary industries to which the target enterprise belongs are obtained from the model corresponding to the same target group, sorting the primary industries in descending order of the probability that the target enterprise belongs to them, and determining the primary industry ranked first and each primary industry that meets a preset condition as industry labels of the target enterprise.
The preset condition is that, among the current primary industry and all primary industries ranked before it, the difference between the probabilities that the target enterprise belongs to any two adjacent primary industries is smaller than a third threshold.
Specifically, at least one industry label is needed, so the primary industry ranked first, which has the highest probability, is determined as an industry label of the target enterprise. Each subsequent primary industry is then considered in ranking order. If the difference between the probabilities of the current primary industry and the preceding primary industry is small, the current primary industry is selected as an industry label of the target enterprise. If the difference is large, then the current primary industry and all subsequent primary industries differ too much in probability from the preceding ones, and the selection stops.
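The screening rule of step S107 for primary industries obtained from the same group model can be sketched as a single pass over the ranked list that stops at the first large gap. The function name, the example probabilities, and the threshold value are hypothetical, and the input is assumed to be non-empty.

```python
def select_labels(industry_probs, third_threshold):
    """Keep the top industry, then each next one while adjacent gaps stay small."""
    ranked = sorted(industry_probs.items(), key=lambda kv: kv[1], reverse=True)
    labels = [ranked[0][0]]  # at least one label is always output
    for (_, prev_p), (industry, p) in zip(ranked, ranked[1:]):
        if prev_p - p >= third_threshold:
            break  # this and all later industries are too far behind
        labels.append(industry)
    return labels


# Hypothetical probabilities: the gap between "b" and "c" stops the selection,
# even though "c" and "d" are close to each other.
chosen = select_labels({"a": 0.5, "b": 0.45, "c": 0.2, "d": 0.18}, third_threshold=0.1)
```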
S108, summarizing and outputting all determined industry labels of the target enterprise.
Specifically, the industry labels of the target enterprise determined in steps S103 and S107 are summarized, identical industry labels are merged, and the resulting labels are output, so that an association is established between the target enterprise and the output industry labels, thereby labeling the target enterprise.
Optionally, in another embodiment of the present application, after step S108 is executed, the determined industry labels of the target enterprise may be further optimized. As shown in fig. 6, the method for optimizing an industry label according to the embodiment of the present application includes:
S601, for each determined industry label of the target enterprise, taking the industry corresponding to the industry label as the target industry.
S602, determining representative enterprises in the target industry.
Optionally, for each enterprise in the target industry in the training data set, the number of upstream transaction counterparty enterprises of the enterprise is calculated, and the enterprises ranking in the top 25% by number of upstream transaction counterparty enterprises are taken as the representative enterprises of the industry.
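Selecting the representative enterprises of an industry can be sketched as a simple ranking. The function name and the example counts are hypothetical; rounding the 25% cut-off up and keeping at least one enterprise are assumptions of this sketch, since the text does not say how fractional counts are handled.

```python
import math


def representative_enterprises(upstream_counts, fraction=0.25):
    """Return the enterprises in the top `fraction` by number of upstream counterparties.

    upstream_counts -- mapping: enterprise -> number of upstream counterparty enterprises
    """
    ranked = sorted(upstream_counts, key=upstream_counts.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))  # assumed rounding rule
    return ranked[:keep]


# Hypothetical counts for eight enterprises in one industry.
counts = {"e1": 40, "e2": 5, "e3": 22, "e4": 9, "e5": 1, "e6": 3, "e7": 17, "e8": 8}
reps = representative_enterprises(counts)
```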
S603, acquiring the comparison information of the representative enterprises in the target industry and the comparison information of the target enterprise.
The comparison information of an enterprise is the operation range information of its upstream transaction counterparty enterprises.
Optionally, for each representative enterprise in the target industry, the proportion of the transaction amount and the proportion of the number of transactions between the enterprise and each of its upstream transaction counterparty enterprises are calculated, and the operation range information of the upstream transaction counterparty enterprises for which both proportions are greater than a preset value is acquired as the comparison information of the representative enterprise. The comparison information of the target enterprise is determined in the same way.
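The selection of comparison information can be sketched as follows. The sketch assumes that a single preset value applies to both proportions and that the enterprise has at least one upstream transaction with a positive amount; the function name, data layout, and example figures are hypothetical.

```python
def comparison_info(transactions, scopes, preset_value):
    """Collect the operation range text of significant upstream counterparties.

    transactions -- mapping: counterparty -> (transaction amount, number of transactions)
    scopes       -- mapping: counterparty -> operation range information (text)
    """
    total_amount = sum(amount for amount, _ in transactions.values())
    total_count = sum(count for _, count in transactions.values())
    selected = []
    for counterparty, (amount, count) in transactions.items():
        # Both the amount share and the transaction-count share must exceed the preset value.
        if amount / total_amount > preset_value and count / total_count > preset_value:
            selected.append(scopes[counterparty])
    return selected


# Hypothetical upstream transactions of one enterprise.
tx = {"u1": (600, 6), "u2": (300, 3), "u3": (100, 1)}
scope_text = {"u1": "steel wholesale", "u2": "machine parts", "u3": "office supplies"}
info = comparison_info(tx, scope_text, preset_value=0.25)
```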
S604, judging whether the semantic similarity between the comparison information of the representative enterprises in the target industry and the comparison information of the target enterprise is smaller than a preset similarity.
If the semantic similarity is smaller than the preset similarity, the labeled target industry is not accurate, so step S605 is executed. If the semantic similarity is not smaller than the preset similarity, step S607 is executed.
S605, calculating the semantic similarity between the comparison information of the target enterprise and the comparison information of the representative enterprises in each of the other industries.
The comparison information of the representative enterprises in each industry is acquired in the same way as that of the representative enterprises in the target industry.
S606, replacing the target industry with the industry, among the other industries, whose representative-enterprise comparison information has the greatest semantic similarity to the comparison information of the target enterprise.
That is, among the other industries excluding the target industry, the industry whose comparison information is semantically most similar to the comparison information of the target enterprise replaces the current target industry and is taken as the new target industry corresponding to the target enterprise.
It should be noted that, after the target industry is replaced and a new target industry is obtained, step S602 needs to be executed again for the new target industry to determine whether further optimization is needed.
S607, finishing the optimization of the target industry.
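The optimization loop of steps S601 to S607 can be sketched as below. The semantic similarity measure is not specified in the text, so a word-overlap (Jaccard) score is used here purely as a stand-in; excluding industries that have already been tried, so that the loop always terminates, is likewise an assumption of this sketch. All names and example texts are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity: overlap of whitespace-separated words (not the patent's measure)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def refine_industry(target_info, target_industry, industry_info, similarity, min_similarity):
    """Replace the target industry until its comparison information is similar enough.

    industry_info -- mapping: industry -> comparison information of its representative enterprises
    """
    current = target_industry
    tried = {current}
    while similarity(industry_info[current], target_info) < min_similarity:
        others = {industry: similarity(info, target_info)
                  for industry, info in industry_info.items() if industry not in tried}
        if not others:
            break  # nothing left to try; the last candidate is returned
        current = max(others, key=others.get)
        tried.add(current)
    return current


# Hypothetical comparison information.
industry_info = {
    "retail": "clothing shoes wholesale",
    "software": "software development cloud services",
    "catering": "food beverage",
}
result = refine_industry("cloud software services", "retail", industry_info, jaccard, 0.5)
```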
Optionally, another embodiment of the present application further includes a method for discovering new industry keywords, which, as shown in fig. 7, includes:
S701, acquiring a text corpus.
The text corpus may be the operation range information of a plurality of enterprises.
S702, extracting a plurality of candidate words from the text corpus.
Specifically, the text may be split into a plurality of sentences at punctuation marks, and then the 2-tuples, 3-tuples, ..., m-tuples of each sentence may be extracted as candidate words.
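Candidate extraction in step S702 can be sketched as follows. The sketch treats each tuple as a run of consecutive characters, which suits Chinese text, and splits sentences on a small, assumed set of punctuation marks and whitespace; the function name and the example text are hypothetical.

```python
import re


def candidate_words(text, max_n):
    """Return all 2-tuples up to max_n-tuples of consecutive characters in each sentence."""
    # Assumed sentence delimiters: common Chinese and Western punctuation, and whitespace.
    sentences = [s for s in re.split(r"[，。；、！？,.;!?\s]+", text) if s]
    candidates = []
    for sentence in sentences:
        for n in range(2, max_n + 1):
            for start in range(len(sentence) - n + 1):
                candidates.append(sentence[start:start + n])
    return candidates


words = candidate_words("abcd,xyz", max_n=3)
```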
S703, calculating the score corresponding to each candidate word.
Optionally, the sum of the degree of freedom and the degree of aggregation of a candidate word may be used as the score of that candidate word.
S704, sorting the candidate words in descending order of score to obtain a sorting result.
S705, taking the candidate words ranked in the top N of the sorting result as target candidate words.
S706, for each target candidate word, if the target candidate word does not exist in the word list, determining the target candidate word as a new word.
The word list stores the determined keywords of each industry; that is, the word list may be the industry feature word table in step S103.
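Steps S703 to S706 can be illustrated with the sketch below. The text does not define the degree of freedom or the degree of aggregation, so two commonly used definitions are assumed here: the degree of freedom is the smaller of the left and right neighbour entropies, and the degree of aggregation is the minimum pointwise mutual information over all two-way splits of the candidate. Probabilities are estimated by dividing counts by the total number of characters, which is a simplification. All names are hypothetical.

```python
import math
from collections import Counter


def entropy(counter):
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0


def score_candidates(sentences, max_n):
    """Score every 2..max_n character tuple as freedom + aggregation (assumed definitions)."""
    grams, left, right = Counter(), {}, {}
    for s in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(s) - n + 1):
                g = s[i:i + n]
                grams[g] += 1
                if i > 0:
                    left.setdefault(g, Counter())[s[i - 1]] += 1
                if i + n < len(s):
                    right.setdefault(g, Counter())[s[i + n]] += 1
    total_chars = sum(c for g, c in grams.items() if len(g) == 1)

    def prob(g):
        return grams[g] / total_chars

    scores = {}
    for g in grams:
        if len(g) < 2:
            continue
        freedom = min(entropy(left.get(g, Counter())), entropy(right.get(g, Counter())))
        aggregation = min(math.log(prob(g) / (prob(g[:k]) * prob(g[k:])))
                          for k in range(1, len(g)))
        scores[g] = freedom + aggregation
    return scores


def discover_new_words(sentences, word_list, max_n=3, top_n=10):
    """Rank candidates by score, keep the top N, and drop words already in the word list."""
    scores = score_candidates(sentences, max_n)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [w for w in ranked if w not in word_list]
```

A candidate that appears with many different neighbours and whose parts rarely occur apart receives a high score under these assumed definitions.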
Optionally, after new words are obtained, texts similar to the new words may be selected by a semantic similarity method to serve as the training sample corpus of a new industry; the corpus is added to the training sample data set, and the industry classification model is retrained.
The embodiment of the present application provides a method for generating an enterprise industry label, in which the industry label is generated from the enterprise name and the operation range information. Specifically, the name and the operation range information of the target enterprise are obtained. When the name of the target enterprise is obtained, target feature words are extracted from the name, and the industries corresponding to the target feature words are determined as industry labels of the target enterprise based on the predetermined correspondence between each feature word and an industry. The industry corresponding to a feature word is the industry for which the probability that the feature word belongs to it is highest, and these probabilities are calculated from the statistically obtained word frequency of the feature word in each industry. When the operation range information of the target enterprise is obtained, it is processed by the grouping model in a pre-trained industry classification model to obtain the probability that the target enterprise belongs to each group. At least one target group is then selected based on these probabilities, the operation range information is processed by the model corresponding to each target group to obtain the primary industries to which the target enterprise belongs, and at least one of these primary industries is selected and determined as an industry label of the target enterprise.
Finally, all determined industry labels of the target enterprise are summarized and output. The industry labels of an enterprise are thus generated automatically from the enterprise name and the operation range information, which effectively ensures the accuracy and comprehensiveness of the industry labels and improves the efficiency of industry labeling.
Another embodiment of the present application provides an apparatus for generating an enterprise industry tag, as shown in fig. 8, including:
The first obtaining unit 801 is configured to obtain the name and the operation range information of a target enterprise.
The first extracting unit 802 is configured to, if the name of the target enterprise is obtained, extract the target feature word from the name of the target enterprise.
The matching unit 803 is configured to determine, based on the predetermined correspondence between each feature word and an industry, an industry corresponding to the target feature word as an industry tag of the target enterprise.
The industry corresponding to a feature word is the industry with the highest probability among the probabilities that the feature word belongs to each industry. The probability that a feature word belongs to each industry is calculated from the statistically obtained word frequency of the feature word in each industry.
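The correspondence between feature words and industries can be illustrated with a minimal sketch. The text only states that the probability is calculated from word frequency, so the relative-frequency formula used here, the function name `build_feature_word_index`, and the input layout are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_feature_word_index(labelled_names):
    """Map each feature word to its most probable industry.

    labelled_names: iterable of (feature_words, industry) pairs, for
    example feature words extracted from names of already-labelled
    enterprises (hypothetical input format).
    """
    # word -> Counter mapping industry -> number of occurrences
    freq = defaultdict(Counter)
    for words, industry in labelled_names:
        for word in words:
            freq[word][industry] += 1

    index = {}
    for word, counts in freq.items():
        total = sum(counts.values())
        # Assumption: probability = share of the word's occurrences
        # that fall in the given industry.
        probs = {ind: n / total for ind, n in counts.items()}
        best = max(probs, key=probs.get)
        index[word] = (best, probs[best])
    return index


samples = [
    (["hotpot"], "catering"),
    (["hotpot", "supply"], "catering"),
    (["hotpot"], "retail"),
    (["supply"], "logistics"),
]
index = build_feature_word_index(samples)
print(index["hotpot"])
```

Looking up a target feature word in `index` then yields the industry that would be used as the label in this sketch.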
The grouping unit 804 is configured to, if the operation range information of the target enterprise is obtained, process the operation range information by using the grouping model in a pre-trained industry classification model to obtain the probability that the target enterprise belongs to each group.
The primary selection unit 805 is configured to select at least one target group based on the probability that the target enterprise belongs to each group, and to process the operation range information through the model corresponding to each target group to obtain the primary selection industries to which the target enterprise belongs. The models corresponding to the groups form the second-layer network of the industry classification model.
The screening unit 806 is configured to select at least one primary selection industry from the primary selection industries to which the target enterprise belongs, and to determine the selected primary selection industry as an industry label of the target enterprise.
The summarizing unit 807 is configured to summarize and output the determined industry labels of all the target enterprises.
Optionally, in the apparatus for generating an enterprise industry label provided in another embodiment, when the target enterprise is a small-scale enterprise and no target feature word is extracted from the name of the target enterprise, the primary selection unit, in selecting at least one target group based on the probability that the target enterprise belongs to each group and processing the operation range information through the model corresponding to each target group to obtain the primary selection industry to which the target enterprise belongs, is configured to:
take the group to which the target enterprise belongs with the highest probability as the target group;
process the operation range information through the model corresponding to the target group to obtain the probabilities that the target enterprise belongs to a plurality of industries; and
determine the industry with the highest of these probabilities as the primary selection industry to which the target enterprise belongs.
Optionally, in the apparatus for generating an enterprise industry label provided in another embodiment, when the target enterprise is not a small-scale enterprise and a target feature word is extracted from the name of the target enterprise, the primary selection unit, in selecting at least one target group based on the probability that the target enterprise belongs to each group and processing the operation range information through the model corresponding to each target group to obtain the primary selection industry to which the target enterprise belongs, is configured to:
judge whether the difference between the maximum probability and the second maximum probability among the probabilities that the target enterprise belongs to each group is smaller than a first threshold, where the second maximum probability is the probability value second only to the maximum probability;
if the difference is smaller than the first threshold, determine the group corresponding to the maximum probability and the group corresponding to the second maximum probability as target groups;
for each target group, process the operation range information through the model corresponding to that target group to obtain the probabilities that the target enterprise belongs to a plurality of industries;
for each target group, determine the industry with the highest of these probabilities as a primary selection industry to which the target enterprise belongs;
if the difference is not smaller than the first threshold, determine the group corresponding to the maximum probability as the target group;
process the operation range information through the model corresponding to the target group to obtain the probabilities that the target enterprise belongs to a plurality of industries; and
determine the industry with the highest of these probabilities as the primary selection industry to which the target enterprise belongs.
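The first-threshold branch described above can be sketched as follows. The group models are represented here by plain callables returning a dictionary of industry probabilities; this representation, the function names, and the example numbers are assumptions for illustration and do not come from the application.

```python
def select_target_groups(group_probs, threshold):
    """Return the most probable group, plus the runner-up when the gap
    between the two probabilities is smaller than the threshold."""
    ranked = sorted(group_probs.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] < threshold:
        return [ranked[0][0], ranked[1][0]]
    return [ranked[0][0]]


def primary_industries_top1(scope_text, group_probs, group_models, threshold):
    """For each target group, keep the single most probable industry."""
    results = []
    for group in select_target_groups(group_probs, threshold):
        # Each group model is assumed to map text -> {industry: probability}.
        industry_probs = group_models[group](scope_text)
        best = max(industry_probs, key=industry_probs.get)
        results.append((group, best, industry_probs[best]))
    return results


# Hypothetical stand-ins for the second-layer models.
group_models = {
    "A": lambda text: {"a1": 0.7, "a2": 0.3},
    "B": lambda text: {"b1": 0.6, "b2": 0.4},
    "C": lambda text: {"c1": 0.9, "c2": 0.1},
}
group_probs = {"A": 0.45, "B": 0.40, "C": 0.15}

print(primary_industries_top1("wholesale of food", group_probs, group_models, 0.1))
print(primary_industries_top1("wholesale of food", group_probs, group_models, 0.01))
```

With the larger threshold the two leading groups are both consulted; with the smaller one only the leading group is used, which also matches the single-group behaviour described for small-scale enterprises.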
Optionally, in the apparatus for generating an enterprise industry label provided in another embodiment, when the target enterprise is not a small-scale enterprise and no target feature word is extracted from the name of the target enterprise, the primary selection unit, in selecting at least one target group based on the probability that the target enterprise belongs to each group and processing the operation range information through the model corresponding to each target group to obtain the primary selection industry to which the target enterprise belongs, is configured to:
judge whether the difference between the maximum probability and the second maximum probability is smaller than a second threshold;
if the difference between the maximum probability and the second maximum probability is smaller than the second threshold, determine the group corresponding to the maximum probability and the group corresponding to the second maximum probability as target groups;
for each target group, process the operation range information through the model corresponding to that target group to obtain the probabilities that the target enterprise belongs to a plurality of industries; and
for each target group, sort these probabilities in descending order and determine the industries corresponding to the first two probabilities as primary selection industries to which the target enterprise belongs.
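Taking the two most probable industries from each target group can be sketched as below. The target groups are passed in already selected, and the group models are again hypothetical callables returning industry probabilities; neither the function name nor the data layout is taken from the application.

```python
import heapq


def top_two_industries(scope_text, target_groups, group_models):
    """Collect, for every target group, the two industries with the
    highest probabilities, in descending order of probability."""
    picks = []
    for group in target_groups:
        # Assumed interface: text -> {industry: probability}.
        industry_probs = group_models[group](scope_text)
        top2 = heapq.nlargest(2, industry_probs.items(), key=lambda kv: kv[1])
        picks.extend((group, industry, prob) for industry, prob in top2)
    return picks


# Hypothetical stand-ins for the second-layer models.
group_models = {
    "A": lambda text: {"a1": 0.5, "a2": 0.3, "a3": 0.2},
    "B": lambda text: {"b1": 0.6, "b2": 0.4},
}

print(top_two_industries("software development", ["A", "B"], group_models))
```

Each entry keeps the group it came from, because the later screening step distinguishes between candidates from the same group and candidates from different groups.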
Optionally, in the apparatus for generating an enterprise industry label provided in another embodiment, the screening unit includes:
The first determining unit, configured to determine each primary selection industry to which the target enterprise belongs as an industry label of the target enterprise if the primary selection industries are obtained from models corresponding to different target groups.
The first sequencing unit, configured to sort the primary selection industries in descending order of the probability that the target enterprise belongs to each primary selection industry if any two primary selection industries to which the target enterprise belongs are obtained from the model corresponding to the same target group.
The second determining unit, configured to determine the primary selection industry ranked first and all primary selection industries that meet a preset condition as industry labels of the target enterprise. The preset condition is that, among the current primary selection industry and all primary selection industries ranked before it, the difference between the probabilities that the target enterprise belongs to any two adjacent primary selection industries is smaller than a third threshold.
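One possible reading of the screening rule is sketched below. It assumes that the preset condition compares the difference between the probabilities of adjacent industries in the ranking, so that labels are kept until the first gap that reaches the third threshold. The tuple layout `(group, industry, probability)` and the function name are hypothetical.

```python
def screen_labels(candidates, third_threshold):
    """Select industry labels from primary selection industries.

    candidates: list of (group, industry, probability) tuples.
    """
    groups = [group for group, _, _ in candidates]
    # Every candidate comes from a different group: keep all of them.
    if len(set(groups)) == len(groups):
        return [industry for _, industry, _ in candidates]

    # Otherwise rank by probability and keep the leader plus every
    # following candidate until the first gap >= third_threshold.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    labels = [ranked[0][1]]
    for previous, current in zip(ranked, ranked[1:]):
        if previous[2] - current[2] >= third_threshold:
            break
        labels.append(current[1])
    return labels


print(screen_labels([("A", "a1", 0.7), ("B", "b1", 0.6)], 0.1))
print(screen_labels([("A", "a1", 0.5), ("A", "a2", 0.45)], 0.1))
print(screen_labels([("A", "a1", 0.5), ("A", "a2", 0.45)], 0.01))
```

Stopping at the first large gap is what makes the condition hold for the current industry and for every industry ranked before it.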
Optionally, in the apparatus for generating an enterprise industry label provided in another embodiment, the apparatus further includes:
The second acquisition unit, configured to acquire a text corpus.
The second extraction unit, configured to extract a plurality of candidate words from the text corpus.
The scoring unit, configured to calculate the score corresponding to each candidate word.
The second sorting unit, configured to sort the candidate words in descending order of score to obtain a sorting result.
The selecting unit, configured to take the top N candidate words in the sorting result as target candidate words.
The new word determining unit, configured to, for each target candidate word, determine the target candidate word as a new word if it does not exist in the word list. The word list stores the determined keywords of each industry.
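The new word discovery steps can be sketched as follows. The scoring method is not specified in this passage, so the sketch takes precomputed scores as input; the function name, the parameter names, and the sample words are assumptions for illustration.

```python
def find_new_words(candidate_scores, word_list, top_n):
    """Rank candidate words by score, keep the top N, and report those
    that are not yet present in the industry word list.

    candidate_scores: {candidate_word: score}, computed elsewhere.
    word_list: set of keywords already determined for each industry.
    """
    ranked = sorted(candidate_scores.items(), key=lambda kv: kv[1], reverse=True)
    target_candidates = [word for word, _ in ranked[:top_n]]
    return [word for word in target_candidates if word not in word_list]


scores = {"livestream": 0.9, "catering": 0.8, "metaverse": 0.7, "misc": 0.1}
print(find_new_words(scores, {"catering"}, top_n=3))
```

Words found this way could then be reviewed and added to the word list, which keeps the feature word correspondence current.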
Optionally, in the apparatus for generating an enterprise industry label provided in another embodiment, the apparatus further includes:
The target industry determining unit, configured to take, for the determined industry label of each target enterprise, the industry corresponding to the industry label as the target industry.
The representative determining unit, configured to determine a representative enterprise in the target industry.
The third acquisition unit, configured to acquire the comparison information of the representative enterprise in the target industry and the comparison information of the target enterprise. The comparison information of an enterprise is the operation range information of the upstream transaction counterparty enterprises of that enterprise.
The calculating unit, configured to calculate the semantic similarity between the comparison information of the target enterprise and the comparison information of the representative enterprises in other industries if the semantic similarity between the comparison information of the representative enterprise in the target industry and the comparison information of the target enterprise is smaller than a preset similarity.
The replacing unit, configured to replace the target industry with the industry whose representative enterprise has the highest semantic similarity between its comparison information and the comparison information of the target enterprise, and to return to the representative determining unit to determine the representative enterprise in the replaced target industry, until the semantic similarity between the comparison information of the representative enterprise in the target industry and the comparison information of the target enterprise is not less than the preset similarity.
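The verification loop based on upstream counterparty information can be sketched as below. The semantic similarity measure is not defined in this passage, so a simple token-overlap (Jaccard) score stands in for it. The `visited` guard and the fallback to the original label when no industry passes are additions of this sketch to guarantee termination, not something stated in the application.

```python
def jaccard(a, b):
    """Token-overlap similarity, used only as a stand-in for a real
    semantic similarity model."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def refine_industry(target_info, industry, representative_info,
                    similarity, min_similarity):
    """Replace the labelled industry until the target enterprise's
    comparison information is similar enough to the representative's.

    representative_info: {industry: comparison information of that
    industry's representative enterprise} (hypothetical layout).
    """
    original = industry
    visited = set()
    while True:
        if similarity(representative_info[industry], target_info) >= min_similarity:
            return industry
        visited.add(industry)
        others = {
            ind: similarity(info, target_info)
            for ind, info in representative_info.items()
            if ind not in visited
        }
        if not others:
            # Assumption: keep the original label if nothing passes.
            return original
        industry = max(others, key=others.get)


representatives = {
    "catering": "vegetables meat seasoning wholesale",
    "software": "servers cloud hosting licences",
    "logistics": "fuel trucks tyres",
}
print(refine_industry("cloud hosting servers", "catering",
                      representatives, jaccard, 0.5))
```

In this example the initial label fails the check, and the label is replaced by the industry whose representative has the most similar upstream counterparties.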
It should be noted that, for the specific working processes of the units provided in the foregoing embodiments of the present application, reference may be made to the corresponding steps in the foregoing method embodiments, which are not described herein again.
Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for generating an enterprise industry label is characterized by comprising the following steps:
acquiring the name and the operation range information of a target enterprise;
if the name of the target enterprise is obtained, extracting a target feature word from the name of the target enterprise;
determining industries corresponding to the target feature words as industry labels of the target enterprises based on the predetermined corresponding relation between each feature word and the industries; the industry corresponding to one feature word is the industry corresponding to the maximum value in the probability that the feature word belongs to each industry; the probability that the characteristic words belong to each industry is obtained by calculating the word frequency of the characteristic words in each industry, wherein the word frequency is obtained by statistics;
if the operation range information of the target enterprise is obtained, processing the operation range information by utilizing a grouping model in a pre-trained industry classification model to obtain the probability that the target enterprise belongs to each group;
selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the model corresponding to each target group respectively to obtain the primary industry to which the target enterprise belongs; the models corresponding to the small groups form a second-layer network of the industry classification model;
selecting at least one primary industry from the primary industry to which the target enterprise belongs, and determining the primary industry as an industry label of the target enterprise;
and summarizing and outputting the determined industry labels of all the target enterprises.
2. The method according to claim 1, wherein the selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through a model corresponding to each target group to obtain a primary industry to which the target enterprise belongs comprises:
if the target enterprise belongs to a small-scale enterprise and the target characteristic word is not extracted from the name of the target enterprise, taking the group with the maximum probability to which the target enterprise belongs as a target group;
processing the operation range information through a model corresponding to the target group to obtain the probability that the target enterprise belongs to a plurality of industries;
and determining the industry corresponding to the maximum value in the probability that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
3. The method according to claim 2, wherein the selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through a model corresponding to each target group to obtain the primary industry to which the target enterprise belongs comprises:
if the target enterprise does not belong to a small-scale enterprise and the target feature word is extracted from the name of the target enterprise, judging whether the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to each group is smaller than a first threshold value; wherein the second maximum probability is the probability value second only to the maximum probability;
if the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to the groups is smaller than a first threshold value, determining the group corresponding to the maximum probability and the group corresponding to the second maximum probability as a target group;
respectively processing the operation range information through a model corresponding to each target group aiming at each target group to obtain the probability that the target enterprises corresponding to the target groups belong to a plurality of industries;
determining an industry corresponding to the maximum value in the probability that the target enterprise corresponding to each target group belongs to a plurality of industries as a primary industry to which the target enterprise belongs;
if the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to each group is judged to be not smaller than a first threshold value, determining the group corresponding to the maximum probability as a target group;
processing the operation range information through a model corresponding to the target group to obtain the probability that the target enterprise belongs to a plurality of industries;
and determining the industry corresponding to the maximum value in the probability that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
4. The method according to claim 3, wherein the selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through a model corresponding to each target group to obtain the primary industry to which the target enterprise belongs comprises:
if the target enterprise does not belong to a small-scale enterprise and the target feature word is not extracted from the name of the target enterprise, judging whether the difference value between the maximum probability and the second maximum probability is smaller than a second threshold value;
if the difference value between the maximum probability and the second maximum probability is smaller than a second threshold value, determining the subgroup corresponding to the maximum probability and the subgroup corresponding to the second maximum probability as a target subgroup;
respectively processing the operation range information through a model corresponding to each target group aiming at each target group to obtain the probability that the target enterprises corresponding to the target groups belong to a plurality of industries;
and determining the first two corresponding industries in the probability that the target enterprise corresponding to each target group belongs to a plurality of industries according to the descending order as the primary industry to which the target enterprise belongs.
5. The method according to any one of claims 1 to 4, wherein the selecting at least one primary industry from the primary industries to which the target enterprise belongs and determining the industry label of the target enterprise comprises:
if the primary industries to which the target enterprise belongs are obtained from models corresponding to different target groups, determining each primary industry to which the target enterprise belongs as an industry label of the target enterprise;
if any two primary industries to which the target enterprise belongs are obtained from the model corresponding to the same target group, sequencing the primary industries in descending order of the probability that the target enterprise belongs to each primary industry;
determining the primary industry ranked first and all the primary industries meeting a preset condition as industry labels of the target enterprise; the preset condition is that, among the current primary industry and all primary industries ranked before the current primary industry, the difference between the probabilities that the target enterprise belongs to any two adjacent primary industries is smaller than a third threshold value.
6. The method of claim 1, further comprising:
acquiring a text corpus;
extracting a plurality of candidate words from the text corpus;
calculating the corresponding score of each candidate word;
sequencing all the candidate words in descending order of score to obtain a sequencing result;
taking the candidate words ranked at the top N in the ranking result as target candidate words;
for each target candidate word, if the target candidate word does not exist in the word list, determining the target candidate word as a new word; and the determined keywords of each industry are stored in the word list.
7. The method of claim 1, further comprising:
aiming at the determined industry label of each target enterprise, taking the industry corresponding to the industry label as a target industry;
determining representative enterprises in the target industry;
acquiring comparison information of the representative enterprise in the target industry and comparison information of the target enterprise; the comparison information of an enterprise is the operation range information of the upstream transaction counterparty enterprises of that enterprise;
if the semantic similarity between the comparison information of the representative enterprise under the target industry and the comparison information of the target enterprise is smaller than the preset similarity, calculating the semantic similarity between the comparison information of the target enterprise and the comparison information of the representative enterprises under other industries;
and replacing the target industry with the industry corresponding to the maximum value in the semantic similarity between the comparison information of the representative enterprises in other industries and the comparison information of the target enterprise, and returning and executing the determination of the representative enterprises in the target industry aiming at the replaced target industry until the semantic similarity between the comparison information of the representative enterprises in the target industry and the comparison information of the target enterprise is not less than the preset similarity.
8. An apparatus for generating an enterprise industry label, comprising:
the first acquisition unit is used for acquiring the name and the operation range information of a target enterprise;
the first extraction unit is used for extracting a target feature word from the name of the target enterprise if the name of the target enterprise is obtained;
the matching unit is used for determining industries corresponding to the target characteristic words as the industry labels of the target enterprises based on the predetermined corresponding relation between each characteristic word and the industries; the industry corresponding to one feature word is the industry corresponding to the maximum value in the probability that the feature word belongs to each industry; the probability that the characteristic words belong to each industry is obtained by calculating the word frequency of the characteristic words in each industry, wherein the word frequency is obtained by statistics;
the grouping unit is used for processing the operation range information by utilizing a grouping model in a pre-trained industry classification model if the operation range information of the target enterprise is obtained, so as to obtain the probability that the target enterprise belongs to each group;
the primary selection unit is used for selecting at least one target group based on the probability that the target enterprise belongs to each group, and processing the operation range information through the model corresponding to each target group to obtain the primary selection industry to which the target enterprise belongs; the models corresponding to the small groups form a second-layer network of the industry classification model;
the screening unit is used for selecting at least one primary industry from the primary industry to which the target enterprise belongs and determining the primary industry as the industry label of the target enterprise;
and the summarizing unit is used for summarizing and outputting the determined industry labels of all the target enterprises.
9. The apparatus according to claim 8, wherein when the target enterprise belongs to a small-scale enterprise and a target feature word is not extracted from the name of the target enterprise, the primary selection unit, in performing the selection of at least one target group based on the probability that the target enterprise belongs to each group and the processing of the operation range information through the model corresponding to each target group to obtain the primary industry to which the target enterprise belongs, is used for:
taking the group with the maximum probability to which the target enterprise belongs as a target group;
processing the operation range information through a model corresponding to the target group to obtain the probability that the target enterprise belongs to a plurality of industries;
and determining the industry corresponding to the maximum value in the probability that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
10. The apparatus according to claim 9, wherein when the target enterprise does not belong to a small-scale enterprise and a target feature word is extracted from the name of the target enterprise, the primary selection unit, in performing the selection of at least one target group based on the probability that the target enterprise belongs to each group and the processing of the operation range information through the model corresponding to each target group to obtain the primary industry to which the target enterprise belongs, is used for:
judging whether the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to each group is smaller than a first threshold value; wherein the second maximum probability is the probability value second only to the maximum probability;
if the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to the groups is smaller than a first threshold value, determining the group corresponding to the maximum probability and the group corresponding to the second maximum probability as a target group;
respectively processing the operation range information through a model corresponding to each target group aiming at each target group to obtain the probability that the target enterprises corresponding to the target groups belong to a plurality of industries;
determining an industry corresponding to the maximum value in the probability that the target enterprise corresponding to each target group belongs to a plurality of industries as a primary industry to which the target enterprise belongs;
if the difference value between the maximum probability and the second maximum probability in the probabilities that the target enterprise belongs to each group is judged to be not smaller than a first threshold value, determining the group corresponding to the maximum probability as a target group;
processing the operation range information through a model corresponding to the target group to obtain the probability that the target enterprise belongs to a plurality of industries;
and determining the industry corresponding to the maximum value in the probability that the target enterprise belongs to the multiple industries as the primary industry to which the target enterprise belongs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111266471.3A CN113947079A (en) | 2021-10-28 | 2021-10-28 | Method and device for generating enterprise industry label |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113947079A true CN113947079A (en) | 2022-01-18 |
Family
ID=79337167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111266471.3A Pending CN113947079A (en) | 2021-10-28 | 2021-10-28 | Method and device for generating enterprise industry label |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113947079A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115358309A (en) * | 2022-08-15 | 2022-11-18 | 江苏苏宁银行股份有限公司 | Industry code selection method based on Bayesian classification |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109783818A (en) * | 2019-01-17 | 2019-05-21 | 上海三零卫士信息安全有限公司 | A kind of enterprises ' industry multi-tag classification method |
CN112163153A (en) * | 2020-09-30 | 2021-01-01 | 深圳前海微众银行股份有限公司 | Industry label determination method, device, equipment and storage medium |
CN112487794A (en) * | 2019-08-21 | 2021-03-12 | 顺丰科技有限公司 | Industry classification method and device, terminal equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||