WO2022068297A1 - Method, apparatus, device and storage medium for determining an industry label - Google Patents

Method, apparatus, device and storage medium for determining an industry label

Info

Publication number
WO2022068297A1
Authority
WO
WIPO (PCT)
Prior art keywords
category
business
sub
enterprise
target
Prior art date
Application number
PCT/CN2021/103262
Other languages
English (en)
French (fr)
Inventor
唐圳
刘博�
郑文琛
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2022068297A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services

Definitions

  • The present disclosure relates to the technical field of character recognition, and in particular to a method, apparatus, device and storage medium for determining an industry label.
  • The main purpose of the present disclosure is to provide a method, apparatus, device and storage medium for determining an industry label. For an enterprise whose industry label is unclear, a clear industry label is matched automatically according to its business content. The resulting label is highly accurate and better reflects the enterprise's actual business, which provides a good foundation for subsequently determining the enterprise portrait.
  • an embodiment of the present disclosure provides a method for determining an industry label, the method including:
  • acquiring the business scope of a target enterprise among existing users, wherein the type of the industry label of the target enterprise is an unknown label type;
  • for each sub-category under a target category, acquiring the business scope of each enterprise among the existing users corresponding to the sub-category, and generating the sub-category business content of the sub-category according to the business scope of each enterprise, wherein the target category is the category or major category to which the industry label of the target enterprise belongs, and the type of the industry label of the sub-category is a known label type;
  • determining, according to the business scope of the target enterprise and each sub-category business content, the sub-category business content that matches the business scope of the target enterprise as the matching business content of the target enterprise; and determining the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
  • the sub-category business content of the sub-category is generated according to the business scope of each enterprise, including:
  • the method further includes:
  • performing word segmentation on the business scope of the target enterprise to obtain the target business scope word segments of the target enterprise.
  • Determining, according to the business scope of the target enterprise and each sub-category business content, the sub-category business content that matches the business scope of the target enterprise as the matching business content of the target enterprise includes:
  • for each sub-category business content, calculating the matching degree between the sub-category business content and the business scope of the target enterprise according to each enterprise business scope word segment of the sub-category business content and each target business scope word segment; and determining the sub-category business content with the highest matching degree as the matching business content of the target enterprise.
  • Calculating, for each sub-category business content, the matching degree between the sub-category business content and the business scope of the target enterprise according to each enterprise business scope word segment of the sub-category business content and each target business scope word segment includes:
  • determining the total business content of the target category; for each sub-category business content, calculating, based on the term frequency-inverse document frequency technique and with the total business content as the document set, a first score of each enterprise business scope word segment of the sub-category business content in the total business content; and for each sub-category business content, determining the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content.
  • Determining the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content includes:
  • for each target business scope word segment, when the target business scope word segment matches the current enterprise business scope word segment, determining the first score of the current enterprise business scope word segment as the target score of the target business scope word segment; and determining the matching degree between the sub-category business content and the business scope of the target enterprise according to the target scores of the target business scope word segments.
  • Determining the matching degree between the sub-category business content and the business scope of the target enterprise includes:
  • determining, according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content, a second score of the business scope of the target enterprise corresponding to the sub-category; and determining, according to the second score and the vector distance, the matching degree between the sub-category business content of the sub-category and the business scope of the target enterprise.
  • Calculating the word vector of each enterprise business scope word segment includes:
  • the word vector of each enterprise business scope word segment is calculated.
  • Calculating the word vector of each target business scope word segment of the target enterprise includes:
  • the word vector of each target business scope word segment of the target enterprise is calculated.
  • an embodiment of the present disclosure further provides a device for determining an industry label, including:
  • a data acquisition module used for acquiring the business scope of the target enterprise of the existing users, wherein the type of the industry label of the target enterprise is an unknown label type;
  • a sub-category business content determination module, configured to acquire, for each sub-category under a target category, the business scope of each enterprise among the existing users corresponding to the sub-category, and to generate the sub-category business content of the sub-category according to the business scope of each enterprise, wherein the target category is the category or major category to which the industry label of the target enterprise belongs, and the type of the industry label of the sub-category is a known label type;
  • a content matching module configured to determine, according to the business scope of the target enterprise and each of the sub-category business contents, the sub-category business contents that match the business scope of the target enterprise as the matching business contents of the target enterprise;
  • the industry label determination module is configured to determine the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
  • an embodiment of the present disclosure further provides a device for determining an industry label, the device including: a memory, a processor, and an industry label determination program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the method for determining an industry label provided by any embodiment corresponding to the first aspect of the present disclosure.
  • an embodiment of the present disclosure further provides a computer-readable storage medium storing an industry label determination program which, when executed by a processor, implements the steps of the method for determining an industry label provided by any embodiment corresponding to the first aspect of the present disclosure.
  • The method, apparatus, device and storage medium for determining an industry label provided by the embodiments of the present disclosure address target enterprises among existing users whose industry labels are unclear. The business scope of the target enterprise is compared with the business content of each sub-category of a known label type under the category or major category corresponding to the target enterprise, where the business content of a sub-category is determined by the business scopes of the enterprises corresponding to that sub-category. The sub-category business content that matches the business scope of the target enterprise is determined, and the industry label of the corresponding sub-category is determined as the industry label of the target enterprise. This automatically matches clear industry labels to enterprises with unclear industry labels with high matching accuracy, provides a good foundation for determining the enterprise portrait, makes it easier to provide the enterprise with high-quality services that suit its business, and improves user experience.
  • FIG. 1 is an application scenario diagram of a method for determining an industry label provided by an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for determining an industry label provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a method for determining an industry label provided by another embodiment of the present disclosure.
  • FIG. 4 is a flowchart of step S306 in the embodiment shown in FIG. 3 of the present disclosure.
  • FIG. 5 is a flowchart of a method for determining an industry label provided by another embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of an apparatus for determining an industry label provided by an embodiment of the present disclosure
  • FIG. 7 is a schematic structural diagram of a device for determining an industry label according to an embodiment of the present disclosure.
  • FIG. 1 is an application scenario diagram of the method for determining an industry label provided by an embodiment of the present disclosure.
  • A corresponding industry label can be determined for each enterprise according to the national economic industry classification. From the broadest level to the narrowest, the classification levels are the category, the major category, the middle category and the minor category (sub-category).
  • The service enterprise 110 needs to determine the enterprise portrait of the target enterprise 120 that it serves according to the industry label 121 of the target enterprise 120, so as to provide the target enterprise 120 with high-quality services based on that portrait.
  • When the industry label 121 of the target enterprise 120 is a sub-category label of an unknown label type, for example sub-category code 5199, other unlisted wholesale, the enterprise portrait of the target enterprise 120 is too coarse-grained. Such a portrait cannot correctly describe the needs of the target enterprise 120, and so the target enterprise 120 cannot be offered a service strategy that meets its needs.
  • the embodiments of the present disclosure provide a method for automatically determining a clear industry label for an enterprise with an unclear industry label.
  • The main idea of the method is as follows: for a target enterprise, use its business scope together with the business scopes of the enterprises corresponding to each clearly labeled sub-category in the same category or major category to find the sub-category whose business content matches the business scope of the target enterprise, and determine the industry label of that sub-category as the industry label of the target enterprise. A suitable, clear industry label is thereby matched to the target enterprise, so that a clear enterprise portrait can be generated from it, the needs of the target enterprise can be described correctly and appropriately, and high-quality services can be provided.
  • FIG. 2 is a flowchart of a method for determining an industry label provided by an embodiment of the present disclosure. As shown in FIG. 2 , the method for determining an industry label includes the following steps:
  • Step S201 acquiring the business scope of the target enterprise of the existing users.
  • the type of the industry label of the target enterprise is an unknown label type.
  • the industry label usually refers to the category name of the subclass in the "National Economic Industry Classification".
  • An industry label of an unknown label type is one whose sub-category name does not describe the category clearly, such as "other agriculture", "other animal husbandry", "other unspecified wholesale business" and "other unspecified manufacturing industry".
  • Existing users refer to users who use the provided services, usually referring to existing customers.
  • The business scope is data describing the business activities of the enterprise, and can be expressed as keywords or sentences.
  • the business scope of the target enterprise may be: the business scope is wholesale and retail of steel and clothing.
  • the number of target enterprises may be one or more.
  • Because enterprises do not describe their business scope in a uniform format, the business scope of the target enterprise may be converted into a preset format.
  • For example, if the business scope of target enterprise C1 is "the company's main business: wholesale and retail of various stationery, jewelry, beverages and tobacco", then after conversion to the preset format it becomes "the business scope is: wholesale and retail of various stationery, jewelry, beverages and tobacco".
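The source does not define the preset format beyond the single example above, so the following is only a minimal sketch under that assumption: any leading phrase up to the first colon is replaced with a fixed prefix. The names `PRESET_PREFIX` and `to_preset_format` are hypothetical.

```python
import re

# Assumed preset prefix, taken from the single example in the text.
PRESET_PREFIX = "the business scope is: "

# Everything up to and including the first ASCII or full-width colon.
_LEADING_PHRASE = re.compile(r"^[^:：]*[:：]\s*")


def to_preset_format(scope: str) -> str:
    """Replace any leading 'xxx:' phrase with the preset prefix."""
    body = _LEADING_PHRASE.sub("", scope.strip(), count=1)
    return PRESET_PREFIX + body


raw = ("the company's main business: wholesale and retail of "
       "various stationery, jewelry, beverages and tobacco")
converted = to_preset_format(raw)
```

A business scope that contains no colon is kept whole and simply receives the prefix.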
  • Step S202 for each subcategory under the target category, obtain the business scope of each enterprise of the existing users corresponding to the subcategory, and generate subcategory business content of the subcategory according to the business scope of each enterprise.
  • the target category is the category or major category to which the industry label of the target enterprise belongs, and the type of the industry label of the sub-category is a known label type.
  • The known label type is the opposite of the unknown label type described above: it indicates that the industry label of the enterprise is clear and explicit. Such a label does not contain keywords such as "unlisted"; examples are "fruit and vegetable wholesale (5123)" and "apparel wholesale (5132)".
  • Specifically, the category or major category to which the industry label of the target enterprise belongs is obtained, and the business scope of each enterprise among the existing users is acquired for each sub-category under that category or major category. The business scopes of the enterprises in a sub-category are then integrated to obtain the sub-category business content of that sub-category.
  • Further, content in parentheses may be removed from each business scope, and enterprises whose business scope is an abnormal value, for example an empty value, may be excluded.
  • Further, keywords may be extracted from the business scope of each enterprise, and the sub-category business content of the sub-category is then composed of the keywords of each enterprise.
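The cleaning described above (removing parenthesised content and discarding empty business scopes) might be sketched as follows. The function name `clean_scopes` and the sample strings are hypothetical, and treating `None` and whitespace-only strings as "abnormal" is an assumption.

```python
import re

# Parenthesised text in either ASCII or full-width (Chinese) brackets.
_PAREN = re.compile(r"\([^()]*\)|（[^（）]*）")


def clean_scopes(scopes):
    """Strip parenthetical content and drop empty business scopes."""
    cleaned = []
    for scope in scopes:
        if scope is None:
            continue  # missing value: treated as abnormal
        text = _PAREN.sub("", scope).strip()
        if text:  # skip scopes that are empty after cleaning
            cleaned.append(text)
    return cleaned


sample = ["wholesale of steel (excluding hazardous goods)", "", None, "   "]
result = clean_scopes(sample)
```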
  • For example, suppose the major category to which the target enterprise belongs is the wholesale industry, with code 51, and that among the existing users there are enterprises in two sub-categories of known label types under the wholesale industry: building materials wholesale (sub-category code 5165) and wholesale of textiles, knitwear and raw materials (sub-category code 5131). Enterprises C2 and C3 belong to the building materials wholesale sub-category, and enterprises C4, C5 and C6 belong to the wholesale of textiles, knitwear and raw materials sub-category. The business scopes of C2 and C3 are integrated to obtain the business content of the building materials wholesale sub-category, and the business scopes of C4, C5 and C6 are integrated to obtain the business content of the wholesale of textiles, knitwear and raw materials sub-category.
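As an illustration of the integration step in this example, the sketch below groups enterprise business scopes by sub-category code. The placeholder scope strings and the function name `build_subcategory_content` are hypothetical; the source does not say how the scopes are stored.

```python
from collections import defaultdict


def build_subcategory_content(enterprises):
    """Group business scopes by sub-category code.

    `enterprises` is an iterable of (sub_category_code, business_scope).
    """
    content = defaultdict(list)
    for code, scope in enterprises:
        content[code].append(scope)
    return dict(content)


# Placeholder scopes for enterprises C2..C6 from the example above.
enterprises = [
    ("5165", "scope of C2"),
    ("5165", "scope of C3"),
    ("5131", "scope of C4"),
    ("5131", "scope of C5"),
    ("5131", "scope of C6"),
]
content = build_subcategory_content(enterprises)
```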
  • Step S203 according to the business scope of the target enterprise and each of the sub-category business contents, determine the sub-category business contents matching the business scope of the target enterprise as the matching business contents of the target enterprise.
  • Specifically, each keyword of the business scope of the target enterprise can be matched against each keyword of the sub-category business content to obtain the matching degree of the target enterprise for that sub-category. The sub-category business content with the highest matching degree is then determined as the matching business content of the target enterprise.
  • Further, a weight value can be set in advance for each keyword of the sub-category business content. When a keyword of the target enterprise's business scope is identical to or matches a keyword of the sub-category business content, the weight value of the matched keyword is obtained, and the weight values of all matched keywords are summed to obtain the matching degree for that sub-category.
  • the weight value of the keyword of the sub-category business content may be determined based on the frequency of occurrence of the keyword.
  • For example, if the keywords and weights of the sub-category business content are "wholesale 0.1, retail 0.1, steel 0.4 and lumber 0.4" and the keywords of the target enterprise's business scope are "wholesale, steel and clothing", then the matched keywords are wholesale and steel, and the matching degree of the target enterprise for this sub-category is 0.1 + 0.4 = 0.5.
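The weighted keyword match in this example can be checked with a short sketch. It assumes exact string equality as the matching rule (the text also allows looser matching), and the function name `keyword_matching_degree` is hypothetical.

```python
def keyword_matching_degree(weights, target_keywords):
    """Sum the weights of sub-category keywords found in the target scope."""
    # Unmatched target keywords (e.g. "clothing") contribute nothing.
    return sum(weights.get(keyword, 0.0) for keyword in set(target_keywords))


weights = {"wholesale": 0.1, "retail": 0.1, "steel": 0.4, "lumber": 0.4}
degree = keyword_matching_degree(weights, ["wholesale", "steel", "clothing"])
# wholesale (0.1) and steel (0.4) match, giving approximately 0.5
```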
  • Step S204 Determine the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
  • Specifically, after the matching business content of the target enterprise is determined, the clear industry label of the sub-category corresponding to the matching business content is obtained and determined as the industry label of the target enterprise, so that a clear industry label is set for the target enterprise automatically.
  • In the method provided by this embodiment, for a target enterprise among existing users whose industry label is unclear, the business scope of the target enterprise is compared with the business content of each sub-category of a known label type under the category or major category corresponding to the target enterprise, where the business content of a sub-category is determined by the business scopes of the enterprises corresponding to that sub-category. The sub-category business content that matches the business scope of the target enterprise is determined, and the industry label of its sub-category is determined as the industry label of the target enterprise. This automatically matches clear industry labels to enterprises with unclear industry labels with high matching accuracy, provides a good basis for determining the enterprise portrait, makes it easier to provide the enterprise with high-quality services that suit its business, and improves user experience.
  • FIG. 3 is a flowchart of a method for determining an industry label provided by another embodiment of the present disclosure. This embodiment is based on the embodiment shown in FIG. 2, further refines steps S202 and S203, and adds a step of performing word segmentation on the business scope of the target enterprise. As shown in FIG. 3, the method for determining an industry label provided by this embodiment includes the following steps:
  • Step S301 acquiring the business scope of the target enterprise of the existing users.
  • the type of the industry label of the target enterprise is an unknown label type.
  • Step S302 Perform word segmentation processing on the business scope of the target enterprise to obtain word segmentation for each target business scope of the business scope of the target enterprise.
  • Word segmentation is the process of recombining a continuous character sequence into a sequence of words according to a given specification.
  • the business scope of the enterprises involved in this disclosure may be described in Chinese or in English.
  • the word segmentation algorithm may be a word segmentation algorithm based on string matching, a word segmentation algorithm based on a Hidden Markov Model (HMM), a word segmentation algorithm based on a conditional random field, or other word segmentation algorithms.
  • word segmentation processing can also be performed on the business scope of the target enterprise and the business scope of each subsequent enterprise in the sub-category.
  • For example, if the business scope of the target enterprise is "The company's business scope is: wholesale and retail of grains and oils, food, beverages and tobacco products", the target business scope word segments obtained after word segmentation are: grain and oil, beverages, tobacco products, wholesale and retail.
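Of the algorithm families mentioned above, string matching is the simplest to illustrate. The sketch below uses forward maximum matching, one common string-matching segmenter; it is not necessarily the variant used in the disclosure. The vocabulary and the Chinese input string are hypothetical.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Greedy longest-match-first segmentation against a known vocabulary."""
    if max_len is None:
        max_len = max((len(word) for word in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary: grain and oil, food, beverages,
# tobacco products, wholesale, retail.
vocabulary = {"粮油", "食品", "饮料", "烟草制品", "批发", "零售"}
tokens = forward_max_match("粮油食品饮料烟草制品批发零售", vocabulary)
```

Characters not covered by the vocabulary fall back to single-character tokens, which a later stop-word step could filter out.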
  • Step S303 for each sub-category under the target category, obtain the business scope of each enterprise of the existing users corresponding to the sub-category.
  • Step S304 for each enterprise of each sub-category, perform word segmentation processing on the business scope of the enterprise, so as to obtain word segmentation of each enterprise business scope of the enterprise.
  • Specifically, for each enterprise of each sub-category, word segmentation is performed on its business scope to obtain the enterprise business scope word segments of each enterprise in each sub-category. The word segmentation algorithm is similar to that in step S302 and is not repeated here.
  • Step S305 for each sub-category, perform de-duplication processing and stop-word removal processing on the word segmentation of the business scope of each enterprise of the sub-category, so as to obtain the sub-category business content of the sub-category.
  • a stop word set may be predetermined, and the stop word set is composed of each stop word. Further, based on the set of stop words, the operation of removing stop words may be performed on the word segmentation of the business scope of each enterprise of the sub-category. Furthermore, the sub-category business content is composed of the business scope word segmentation of each enterprise after deduplication and removal of stop words.
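Step S305 might be sketched as below: the word segments of all enterprises in a sub-category are merged, with duplicates and stop words dropped. The function name, the sample tokens and the stop-word set are hypothetical; preserving first-seen order is a design choice of this sketch, not something the source specifies.

```python
def subcategory_business_content(enterprise_tokens, stop_words):
    """Merge per-enterprise tokens, dropping duplicates and stop words."""
    seen, content = set(), []
    for tokens in enterprise_tokens:
        for token in tokens:
            if token in stop_words or token in seen:
                continue
            seen.add(token)
            content.append(token)  # keep first-seen order
    return content


enterprise_tokens = [
    ["wholesale", "of", "steel"],
    ["retail", "of", "steel", "lumber"],
]
content = subcategory_business_content(enterprise_tokens, stop_words={"of"})
```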
  • Step S306 for each sub-category business content, calculate the matching degree between the sub-category business content and the business scope of the target enterprise according to each enterprise business scope word segment of the sub-category business content and each target business scope word segment.
  • Specifically, the weight value of an enterprise business scope word segment can be determined according to its frequency in the sub-category business content. When a target business scope word segment matches an enterprise business scope word segment, the weight value of the enterprise business scope word segment is determined as the score of the target business scope word segment.
  • FIG. 4 is a flowchart of step S306 in the embodiment shown in FIG. 3 of the present disclosure. As shown in FIG. 4 , step S306 includes the following steps:
  • Step S3061 Determine the total business content of the target category according to the word segmentation of the business scope of the enterprise in each sub-category of business content.
  • the target category is the category or major category to which the industry label of the target enterprise belongs.
  • Step S3062 for each sub-category business content, based on the term frequency-inverse document frequency technique and with the total business content as the document set, calculate the first score of each enterprise business scope word segment of the sub-category business content in the total business content.
  • Term frequency-inverse document frequency (TF-IDF) is a technique for evaluating how important a word is to a document within a document set or corpus.
  • The weight of a word is determined mainly by how frequently the word occurs.
  • The first score is the TF-IDF value of each enterprise business scope word segment, with the total business content as the document set.
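  • Term frequency (TF) is the frequency with which a given word occurs in a given document. In its standard form, $tf_{term} = T_{term} / N_T$, where $tf_{term}$ is the term frequency of the given word $term$, $T_{term}$ is the number of times the given word occurs in the given document, and $N_T$ is the total number of words in the given document.
  • Inverse document frequency (IDF) measures how rare a word is across the document set. In its standard form, $idf_{term} = \log(N_D / D_{term})$, where $idf_{term}$ is the inverse document frequency of the given word $term$, $D_{term}$ is the number of documents containing the given word, and $N_D$ is the total number of documents in the corpus.
  • For each word, its TF-IDF value, namely the above-mentioned first score, is obtained by multiplying its term frequency by its inverse document frequency: $tfidf_{term} = tf_{term} \times idf_{term}$.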
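A minimal sketch of the first-score calculation follows, treating each sub-category business content as one document of the document set. It assumes the unsmoothed standard forms tf = count / document length and idf = log(N_D / D_term); the disclosure may use a different variant. The function name and sample tokens are hypothetical.

```python
import math


def tf_idf_scores(documents):
    """Return {doc_name: {token: tf * idf}} for a dict of token lists."""
    n_docs = len(documents)
    # D_term: number of documents that contain each token.
    doc_freq = {}
    for tokens in documents.values():
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    scores = {}
    for name, tokens in documents.items():
        total = len(tokens)  # N_T for this document
        scores[name] = {}
        for token in set(tokens):
            tf = tokens.count(token) / total
            idf = math.log(n_docs / doc_freq[token])
            scores[name][token] = tf * idf
    return scores


# Hypothetical sub-category business contents keyed by sub-category code.
documents = {
    "5165": ["wholesale", "steel", "lumber"],
    "5131": ["wholesale", "textiles"],
}
first_scores = tf_idf_scores(documents)
```

Under this unsmoothed form, a word that appears in every sub-category (here "wholesale") scores zero, so only words that distinguish sub-categories contribute to the match.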
  • Step S3063 for each sub-category business content, determine the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content.
  • Specifically, the first scores of the enterprise business scope word segments that match the target business scope word segments can be summed to obtain the matching degree between the sub-category business content and the business scope of the target enterprise.
  • For example, if the sub-category business content includes Word1, Word2, Word3 and Word4 with first scores of 0.48, 0.24, 0.01 and 0.05 respectively, and the target business scope word segments include Word2 and Word3, then Word2 and Word3 are the matched words, and their first scores are added to obtain the matching degree: 0.24 + 0.01 = 0.25.
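The worked example above can be reproduced with a few lines, assuming matching means exact equality of word segments. The function name `score_matching_degree` is hypothetical.

```python
def score_matching_degree(first_scores, target_segments):
    """Sum the first scores of sub-category words that the target contains."""
    targets = set(target_segments)
    return sum(score for word, score in first_scores.items() if word in targets)


first_scores = {"Word1": 0.48, "Word2": 0.24, "Word3": 0.01, "Word4": 0.05}
degree = score_matching_degree(first_scores, ["Word2", "Word3"])
# Word2 and Word3 match, so the degree is approximately 0.24 + 0.01
```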
  • Optionally, determining the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content includes:
  • for each target business scope word segment, when the target business scope word segment matches the current enterprise business scope word segment, determining the first score of the current enterprise business scope word segment as the target score of the target business scope word segment; and determining the matching degree between the sub-category business content and the business scope of the target enterprise according to the target scores of the target business scope word segments.
  • The current enterprise business scope word segment is any enterprise business scope word segment in the sub-category business content.
  • A target business scope word segment matches the current enterprise business scope word segment when the two are the same or similar.
  • Specifically, the target scores of the target business scope word segments can be summed to obtain the matching degree between the sub-category business content and the business scope of the target enterprise.
  • Further, the number of enterprises corresponding to each sub-category can also be obtained, and a sub-category weight value determined for each sub-category from that number. The matching degree between the sub-category business content and the business scope of the target enterprise can then be determined from the sub-category weight value and the target scores of the target business scope word segments.
  • the sub-category weight value is determined by the ratio of the number of enterprises corresponding to the sub-category to the total number of enterprises corresponding to the target category.
  • For example, suppose the existing users include enterprises in two sub-categories with clear industry labels under one manufacturing category, namely candy and chocolate manufacturing, and dairy product manufacturing. If the candy and chocolate manufacturing sub-category corresponds to 7 enterprises and the dairy product manufacturing sub-category corresponds to 3 enterprises, then by the ratio above the sub-category weight value of candy and chocolate manufacturing is 7/10 = 0.7 and that of dairy product manufacturing is 3/10 = 0.3.
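Applying the stated ratio (enterprises in the sub-category divided by enterprises in the target category) to counts of 7 and 3 can be sketched as follows. The function name is hypothetical, and how the weight is combined with the target scores is left out because the text does not specify it.

```python
def subcategory_weights(enterprise_counts):
    """Weight of each sub-category = its enterprise count / total count."""
    total = sum(enterprise_counts.values())
    if total == 0:
        # No enterprises at all: every weight is zero.
        return {name: 0.0 for name in enterprise_counts}
    return {name: count / total for name, count in enterprise_counts.items()}


counts = {
    "candy and chocolate manufacturing": 7,
    "dairy product manufacturing": 3,
}
weights = subcategory_weights(counts)
```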
  • Step S307 determining the sub-category business content with the highest matching degree as the matching business content of the target enterprise.
  • Step S308 Determine the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
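Steps S307 and S308 amount to selecting the sub-category with the highest matching degree and reusing its label. A minimal sketch, with hypothetical matching degrees and a hypothetical function name:

```python
def assign_industry_label(matching_degrees, labels):
    """Return the label of the sub-category with the highest matching degree."""
    if not matching_degrees:
        raise ValueError("no sub-category to match against")
    # On ties, max() keeps the first sub-category encountered.
    best_code = max(matching_degrees, key=matching_degrees.get)
    return labels[best_code]


# Hypothetical matching degrees for the two wholesale sub-categories.
matching_degrees = {"5165": 0.25, "5131": 0.05}
labels = {
    "5165": "building materials wholesale",
    "5131": "wholesale of textiles, knitwear and raw materials",
}
label = assign_industry_label(matching_degrees, labels)
```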
  • In the method provided by this embodiment, the business scope of each enterprise is processed by word segmentation; for each sub-category, the sub-category business content is obtained by de-duplicating and removing stop words from the word segments of the enterprises in the sub-category; based on the TF-IDF technique, with the total business content of the category or major category as the document set, the first score of each word segment of each sub-category is calculated; and through word segment matching and the first scores, the matching degree between the target enterprise and each sub-category is determined, so as to obtain the sub-category business content that best matches the business scope of the target enterprise. The industry label of that sub-category is then determined as the industry label of the target enterprise. This automatically matches clear industry labels to enterprises with unclear industry labels with high matching accuracy, provides a good foundation for determining the enterprise portrait, makes it easier to provide the enterprise with high-quality services that suit its business, and improves user experience.
  • Fig. 5 is a flowchart of a method for determining an industry label provided by another embodiment of the present disclosure. This embodiment builds on the embodiment shown in Fig. 3, with further steps added after step S303. As shown in Fig. 5, the method for determining the industry label provided by this embodiment includes the following steps:
  • Step S501 acquiring the business scope of the target enterprise of the existing users.
  • the type of the industry label of the target enterprise is an unknown label type.
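  • let the major category or category to which the industry label of the target enterprise belongs be $F$, under which the existing users have $n$ sub-category industries $F_i$ ($i = 1, 2, 3, \ldots, n$); suppose $F_1$ is the sub-category industry with an unclear industry label and corresponds to $m_1$ target enterprises, so the business scope of each of these target enterprises needs to be obtained.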
  • Step S502 performing word segmentation processing on the business scope of the target enterprise to obtain word segmentation for each target business scope of the business scope of the target enterprise.
  • Step S503 for each sub-category under the target category, obtain the business scope of each enterprise of the existing users corresponding to the sub-category.
  • the target category is the above-mentioned major category or category $F$; the business scopes of the enterprises of each sub-category industry $F_i$ ($i = 2, 3, \ldots, n$) are obtained, where $m_i$ denotes the number of enterprises in the $i$-th sub-category industry.
  • Step S504 for each enterprise, perform word segmentation processing on the business scope of the enterprise to obtain word segmentation of the business scope of the enterprise.
  • Step S505 for each sub-category, perform deduplication processing and stop-word removal processing on the word segmentation of the business scope of each enterprise of the sub-category, so as to obtain the sub-category business content of the sub-category.
  • Step S506 determine the total business content of the target category according to the enterprise business scope word segments of each sub-category business content.
  • Step S507 for each sub-category business content, based on the term frequency-inverse document frequency (TF-IDF) technique and with the total business content as the document set, calculate the first score, in the total business content, of each enterprise business scope word segment of the sub-category business content.
  • with the total business content as the document set and each sub-category business content as a document, the TF-IDF score of each enterprise business scope word segment in the sub-category business content is calculated; this is the first score above.
  • Step S508 for each enterprise of each sub-category, calculate the word vector of each enterprise business scope word segmentation of the enterprise, and determine the enterprise business scope sentence vector of the enterprise according to the word vector of each enterprise business scope word segmentation.
  • the word vector of the word segmentation of each business scope of the enterprise is calculated, and then the sentence vector of the enterprise business scope of the enterprise is obtained.
  • calculating the word vector of each enterprise business scope word segment of the enterprise includes:
  • calculating the word vector of each enterprise business scope word segment of the enterprise based on a text vectorization model and a preset Chinese word vector dictionary.
  • the text vectorization (word to vector, word2vec) model is a tool for converting words into numerical vectors.
  • the preset Chinese word vector dictionary is a word vector dictionary trained based on a large number of Chinese word corpora.
  • Step S509 for each sub-category, determine the business scope center vector of the sub-category according to the enterprise business-scope sentence vectors of each enterprise in the sub-category.
  • the vector summation of the enterprise business scope sentence vectors of each enterprise in the sub-category can be performed to obtain the business scope center vector of the sub-category.
  • Step S510 Calculate the word vector of each target business scope word segmentation of the target enterprise, and determine the target business scope sentence vector of the target enterprise according to the word vector of each target business scope word segmentation.
  • calculating the word vector of each target business scope word segment of the target enterprise includes:
  • calculating the word vector of each target business scope word segment of the target enterprise based on the text vectorization model and the preset Chinese word vector dictionary.
  • the method of calculating the word vectors of the target business scope word segments and the target business scope sentence vector is the same as the method of calculating the word vectors and the enterprise business scope sentence vector in step S508, except that the object is the target enterprise instead of an enterprise of a sub-category.
  • Step S511 Calculate the vector distance between the target business scope sentence vector and the business scope center vector of each of the sub-categories.
  • the vector distance is the Euclidean distance of two vectors, that is, the Euclidean distance between the target business scope sentence vector and the business scope center vector of the subclass.
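Steps S508 to S511 can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the text does not say how a sentence vector is derived from word vectors, so averaging is an assumption here, and the two-dimensional word vectors are invented values standing in for lookups in a word vector dictionary. The function names are hypothetical.

```python
import math


def sentence_vector(word_vectors):
    """Combine the word vectors of one business scope into a sentence vector.

    Averaging is an assumption; the source only says the sentence vector
    is determined "according to" the word vectors.
    """
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(v[k] for v in word_vectors) / count for k in range(dim)]


def center_vector(sentence_vectors):
    """Sum the sentence vectors of a sub-category's enterprises (step S509)."""
    dim = len(sentence_vectors[0])
    return [sum(v[k] for v in sentence_vectors) for k in range(dim)]


def euclidean_distance(a, b):
    """Euclidean distance between two vectors of equal length (step S511)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented word vectors for two enterprises of one sub-category.
enterprise_a = sentence_vector([[1.0, 0.0], [0.0, 1.0]])
enterprise_b = sentence_vector([[1.0, 1.0]])
center = center_vector([enterprise_a, enterprise_b])

# Invented word vector for the target enterprise.
target = sentence_vector([[1.5, 5.5]])
distance = euclidean_distance(target, center)
```

A smaller distance indicates that the target enterprise's business scope lies closer to the sub-category as a whole.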
  • Step S512 for each sub-category, determine the second score of the business scope of the target enterprise with respect to the sub-category business content, according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content.
  • the second score for a sub-category is the sum of the first scores of those enterprise business scope word segments of the sub-category that match the target business scope word segments of the target enterprise.
  • Step S513 Determine, according to the second score and the vector distance, the degree of matching between the sub-category business content of the sub-category and the business scope of the target enterprise.
  • where the weight coefficient in the matching-degree expression takes a negative value.
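Steps S512 and S513 might be combined as in the sketch below. The matching-degree expression itself did not survive in this text, so the linear form `score + weight * distance` is an assumption, chosen only because a negative weight then lowers the match as the vector distance grows. Word matching is simplified to exact equality, the scores are invented, and the function names are hypothetical.

```python
def second_score(target_words, first_scores):
    """Sum the first scores of sub-category words that the target also has."""
    return sum(score for word, score in first_scores.items()
               if word in target_words)


def matching_degree(score, distance, weight=-0.1):
    """Assumed linear combination of the second score and the vector distance.

    The weight is negative, so a larger distance reduces the match.
    """
    return score + weight * distance


# Invented first scores for one sub-category business content.
first_scores = {"wholesale": 0.05, "steel": 0.40, "timber": 0.30}
target_words = {"wholesale", "steel", "clothing"}

score = second_score(target_words, first_scores)
degree = matching_degree(score, distance=2.0)
```

The sub-category with the largest resulting value would then be selected in step S514.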
  • Step S514 determining the sub-category business content with the highest matching degree as the matching business content of the target enterprise.
  • Step S515 Determine the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
  • in this embodiment, the matching degree between the sub-category business content of a sub-category and the business scope of the target enterprise is determined along multiple dimensions. At the word level, TF-IDF is used to calculate how well the words of the two match; at the overall level, sentence vectors produced by the text vectorization model are used to calculate the overall match. Combining the two improves the accuracy of the matching-degree calculation. The industry label of the sub-category with the highest matching degree is determined as the industry label of the target enterprise, so that enterprises with unclear industry labels are automatically matched to clear industry labels with high accuracy, which provides a good basis for determining the enterprise portrait, makes it easier to provide services suited to the enterprise's business situation, and improves the user experience.
  • FIG. 6 is a schematic structural diagram of a device for determining an industry label provided by an embodiment of the present disclosure.
  • the device for determining an industry label includes: a data acquisition module 610, a sub-category business content determination module 620, a content matching module 630 and an industry label determination module 640.
  • the data acquisition module 610 is used to acquire the business scope of the target enterprise of the existing users, wherein the type of the industry label of the target enterprise is an unknown label type;
  • the sub-category business content determination module 620 is used to, for each sub-category under the target category, obtain the business scope of each enterprise of the existing users corresponding to the sub-category, and generate the sub-category business content of the sub-category according to the business scope of each enterprise, wherein the target category is the category or major category to which the industry label of the target enterprise belongs, and the type of the industry label of the sub-category is a known label type;
  • the content matching module 630 is used to determine, according to the business scope of the target enterprise and each sub-category business content, the sub-category business content that matches the business scope of the target enterprise as the matching business content of the target enterprise;
  • the industry label determination module 640 is used to determine the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
  • the sub-category business content determination module 620 includes:
  • the business scope obtaining unit is used to obtain, for each sub-category under the target category, the business scope of each enterprise of the existing users corresponding to the sub-category;
  • the first word segmentation processing unit is configured to, for each enterprise, perform word segmentation processing on the business scope of the enterprise, so as to obtain the enterprise business scope word segments of the enterprise;
  • the sub-category business content determination unit is used to, for each sub-category, perform de-duplication processing and stop-word removal processing on the enterprise business scope word segments of the enterprises of the sub-category, so as to obtain the sub-category business content of the sub-category.
  • the device for determining the industry label further includes:
  • the second word segmentation processing unit is configured to perform word segmentation processing on the business scope of the target enterprise, so as to obtain word segmentations for each target business scope of the business scope of the target enterprise.
  • the content matching module 630 includes:
  • the matching degree calculation unit is configured to, for each sub-category business content, calculate the matching degree between the sub-category business content and the business scope of the target enterprise according to the enterprise business scope word segments of the sub-category business content and the target business scope word segments;
  • the matching business content determining unit is used to determine the sub-category business content with the highest matching degree as the matching business content of the target enterprise.
  • the matching degree calculation unit includes:
  • the total business content determination subunit is used to determine the total business content of the target category according to the enterprise business scope word segments of each sub-category business content;
  • the first score calculation subunit is used to, for each sub-category business content, based on the term frequency-inverse document frequency technique and with the total business content as the document set, calculate the first score, in the total business content, of each enterprise business scope word segment of the sub-category business content;
  • the matching degree calculation subunit is used to, for each sub-category business content, determine the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content.
  • the matching degree calculation subunit is specifically used for:
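  • for each target business scope word segment, when the target business scope word segment matches the current enterprise business scope word segment of the sub-category business content, determine the first score of the current enterprise business scope word segment as the target score of the target business scope word segment; according to the target scores of the target business scope word segments, determine the matching degree between the sub-category business content and the business scope of the target enterprise.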
  • the device for determining the industry label further includes:
  • the enterprise business scope sentence vector determination module is used to, for each enterprise of each sub-category, calculate the word vector of each enterprise business scope word segment of the enterprise, and determine the enterprise business scope sentence vector of the enterprise according to those word vectors;
  • the business scope center vector determination module is used to, for each sub-category, determine the business scope center vector of the sub-category according to the enterprise business scope sentence vectors of the enterprises in the sub-category;
  • the target business scope sentence vector determination module is used to calculate the word vector of each target business scope word segment of the target enterprise, and determine the target business scope sentence vector of the target enterprise according to those word vectors;
  • the vector distance calculation module is used to calculate the vector distance between the target business scope sentence vector and the business scope center vector of each sub-category.
  • the matching degree calculation subunit is specifically used for:
  • according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content, determine the second score of the business scope of the target enterprise with respect to the sub-category business content of the sub-category; according to the second score and the vector distance, determine the matching degree between the sub-category business content of the sub-category and the business scope of the target enterprise.
  • calculating the word vector of each enterprise business scope word segment of the enterprise includes: calculating the word vector of each enterprise business scope word segment of the enterprise based on the text vectorization model and the preset Chinese word vector dictionary.
  • calculating the word vector of each target business scope word segment of the target enterprise includes: calculating the word vector of each target business scope word segment of the target enterprise based on the text vectorization model and the preset Chinese word vector dictionary.
  • the apparatus for determining an industry label provided by the embodiment of the present disclosure can execute the method for determining an industry label provided by any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method.
  • FIG. 7 is a schematic structural diagram of a device for determining an industry label provided by an embodiment of the present disclosure.
  • the device for determining an industry label includes: a memory 710 , a processor 720 and a computer program.
  • the computer program is stored in the memory 710 and configured to be executed by the processor 720 to implement the method for determining an industry label provided by any of the embodiments corresponding to FIGS. 2-5 of the present disclosure.
  • the memory 710 and the processor 720 are connected through a bus 730 .
  • An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the method for determining an industry label provided by any of the embodiments corresponding to FIG. 2 to FIG. 5 of the present disclosure.
  • the computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of the modules is only a division by logical function. In actual implementation there may be other ways of dividing them; for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or modules, and may be in electrical, mechanical or other forms.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present disclosure may be integrated in one processing unit, or each module may exist physically alone, or two or more modules may be integrated in one unit.
  • the units formed by the above modules can be implemented in the form of hardware, or can be implemented in the form of hardware plus software functional units.
  • the above-mentioned integrated modules implemented in the form of software functional modules can be stored in a computer-readable storage medium.
  • the above-mentioned software function modules are stored in a storage medium, and include several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods of the various embodiments of the present disclosure.
  • the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), and so on.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the invention can be directly embodied as executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
  • the memory may include high-speed RAM, and may also include non-volatile memory (NVM), such as at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, and the like.
  • the bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the buses in the drawings of the present disclosure are not limited to only one bus or one type of bus.
  • the above-mentioned storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • a storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and the storage medium may be located in Application Specific Integrated Circuits (ASIC for short).
  • the processor and the storage medium may also exist in the electronic device or the host device as discrete components.
  • the terms "comprising", "including" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a" does not preclude the presence of additional identical elements in the process, method, article or device that includes the element.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

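The present disclosure discloses a method, apparatus, device and storage medium for determining an industry label. The method includes: acquiring the business scope of a target enterprise of existing users, where the type of the industry label of the target enterprise is an unknown label type; for each sub-category under a target category, acquiring the business scope of each enterprise of the existing users corresponding to the sub-category, and generating the sub-category business content of the sub-category from those business scopes, where the target category is the category or major category to which the industry label of the target enterprise belongs and the type of the industry label of the sub-category is a known label type; determining, according to the business scope of the target enterprise and each sub-category business content, the sub-category business content that matches the business scope of the target enterprise as the matching business content of the target enterprise; and determining the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise. A clear industry label is thereby determined automatically for an enterprise from its business scope, with high accuracy.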

Description

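Method, apparatus, device and storage medium for determining an industry label
The present disclosure claims priority to Chinese patent application No. 202011060599.X, filed with the China Patent Office on September 30, 2020 and entitled "Method, apparatus, device and storage medium for determining an industry label", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the technical field of text recognition, and in particular to a method, apparatus, device and storage medium for determining an industry label.
Background
As enterprises develop more comprehensively, it is increasingly common for a single enterprise to span multiple industries, and more and more enterprises carry an industry classification label that does not identify a clear industry, that is, the type of the industry label is an unknown label type, such as "Other wholesale not elsewhere listed (5199)" or "Other agriculture (0190)". Labels of this kind cannot clearly describe what the enterprise does.
When an enterprise's industry label is of the above unknown label type, the enterprise portrait cannot be determined precisely, and so high-quality services cannot be provided to the enterprise.
Summary
The main purpose of the present disclosure is to provide a method, apparatus, device and storage medium for determining an industry label which, for an enterprise with an unclear industry label, automatically matches a clear industry label to it according to its business content. The label determination is highly accurate and fits the enterprise's business situation more closely, providing a good foundation for subsequently determining the enterprise portrait.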
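To achieve the above purpose, in a first aspect, an embodiment of the present disclosure provides a method for determining an industry label, the method including:
acquiring the business scope of a target enterprise of existing users, where the type of the industry label of the target enterprise is an unknown label type; for each sub-category under a target category, acquiring the business scope of each enterprise of the existing users corresponding to the sub-category, and generating the sub-category business content of the sub-category according to the business scopes of the enterprises, where the target category is the category or major category to which the industry label of the target enterprise belongs, and the type of the industry label of the sub-category is a known label type; determining, according to the business scope of the target enterprise and each sub-category business content, the sub-category business content that matches the business scope of the target enterprise as the matching business content of the target enterprise; and determining the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
Optionally, generating the sub-category business content of the sub-category according to the business scopes of the enterprises includes:
for each enterprise of each sub-category, performing word segmentation processing on the business scope of the enterprise to obtain the enterprise business scope word segments of the enterprise; and for each sub-category, performing de-duplication processing and stop-word removal processing on the enterprise business scope word segments of the enterprises of the sub-category to obtain the sub-category business content of the sub-category.
Optionally, after acquiring the business scope of the target enterprise of the existing users, the method further includes:
performing word segmentation processing on the business scope of the target enterprise to obtain the target business scope word segments of the business scope of the target enterprise.
Correspondingly, determining, according to the business scope of the target enterprise and each sub-category business content, the sub-category business content that matches the business scope of the target enterprise as the matching business content of the target enterprise includes:
for each sub-category business content, calculating the matching degree between the sub-category business content and the business scope of the target enterprise according to the enterprise business scope word segments of the sub-category business content and the target business scope word segments; and determining the sub-category business content with the highest matching degree as the matching business content of the target enterprise.
Optionally, for each sub-category business content, calculating the matching degree between the sub-category business content and the business scope of the target enterprise according to the enterprise business scope word segments of the sub-category business content and the target business scope word segments includes:
determining the total business content of the target category according to the enterprise business scope word segments of each sub-category business content; for each sub-category business content, based on the term frequency-inverse document frequency technique and with the total business content as the document set, calculating the first score, in the total business content, of each enterprise business scope word segment of the sub-category business content; and for each sub-category business content, determining the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content.
Optionally, determining the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content includes:
for each target business scope word segment, when the target business scope word segment matches the current enterprise business scope word segment of the sub-category business content, determining the first score of the current enterprise business scope word segment as the target score of the target business scope word segment; and determining the matching degree between the sub-category business content and the business scope of the target enterprise according to the target scores of the target business scope word segments.
Optionally, the method further includes:
for each enterprise of each sub-category, calculating the word vector of each enterprise business scope word segment of the enterprise, and determining the enterprise business scope sentence vector of the enterprise according to those word vectors; for each sub-category, determining the business scope center vector of the sub-category according to the enterprise business scope sentence vectors of the enterprises of the sub-category; calculating the word vector of each target business scope word segment of the target enterprise, and determining the target business scope sentence vector of the target enterprise according to those word vectors; and calculating the vector distance between the target business scope sentence vector and the business scope center vector of each sub-category.
Correspondingly, determining the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content includes:
determining, according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content, the second score of the business scope of the target enterprise with respect to the sub-category business content of the sub-category; and determining the matching degree between the sub-category business content of the sub-category and the business scope of the target enterprise according to the second score and the vector distance.
Optionally, calculating the word vector of each enterprise business scope word segment of the enterprise includes:
calculating the word vector of each enterprise business scope word segment of the enterprise based on a text vectorization model and a preset Chinese word vector dictionary.
Correspondingly, calculating the word vector of each target business scope word segment of the target enterprise includes:
calculating the word vector of each target business scope word segment of the target enterprise based on the text vectorization model and the preset Chinese word vector dictionary.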
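In a second aspect, an embodiment of the present disclosure further provides an apparatus for determining an industry label, including:
a data acquisition module, used to acquire the business scope of a target enterprise of existing users, where the type of the industry label of the target enterprise is an unknown label type;
a sub-category business content determination module, used to, for each sub-category under a target category, acquire the business scope of each enterprise of the existing users corresponding to the sub-category, and generate the sub-category business content of the sub-category according to the business scopes of the enterprises, where the target category is the category or major category to which the industry label of the target enterprise belongs, and the type of the industry label of the sub-category is a known label type;
a content matching module, used to determine, according to the business scope of the target enterprise and each sub-category business content, the sub-category business content that matches the business scope of the target enterprise as the matching business content of the target enterprise;
an industry label determination module, used to determine the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
In a third aspect, an embodiment of the present disclosure further provides a device for determining an industry label, the device including: a memory, a processor, and an industry label determination program stored in the memory and executable on the processor, where the industry label determination program, when executed by the processor, implements the steps of the method for determining an industry label provided by any embodiment corresponding to the first aspect of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which an industry label determination program is stored, where the industry label determination program, when executed by a processor, implements the steps of the method for determining an industry label provided by any embodiment corresponding to the first aspect of the present disclosure.
With the method, apparatus, device and storage medium for determining an industry label provided by the embodiments of the present disclosure, for a target enterprise of the existing users whose industry label is unclear, the sub-category business content that matches the business scope of the target enterprise is determined from the business scope of the target enterprise and the business content of each sub-category, of known label type, under the category or major category of the target enterprise, where the business content of a sub-category is determined from the business scopes of the enterprises corresponding to that sub-category; the industry label of the sub-category of that sub-category business content is then determined as the industry label of the target enterprise. Enterprises with unclear industry labels are thus automatically matched to clear industry labels with high accuracy, which provides a good foundation for determining the enterprise portrait, makes it easier to provide services suited to the enterprise's business situation, and improves the user experience.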
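Brief Description of the Drawings
Fig. 1 is an application scenario diagram of the method for determining an industry label provided by an embodiment of the present disclosure;
Fig. 2 is a flowchart of the method for determining an industry label provided by an embodiment of the present disclosure;
Fig. 3 is a flowchart of the method for determining an industry label provided by another embodiment of the present disclosure;
Fig. 4 is a flowchart of step S306 in the embodiment shown in Fig. 3 of the present disclosure;
Fig. 5 is a flowchart of the method for determining an industry label provided by another embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of the apparatus for determining an industry label provided by an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of the device for determining an industry label provided by an embodiment of the present disclosure.
The realization of the purpose, the functional features and the advantages of the present disclosure are further described in connection with the embodiments and with reference to the accompanying drawings.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.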
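The application scenario of the embodiments of the present disclosure is explained below:
Fig. 1 is an application scenario diagram of the method for determining an industry label provided by an embodiment of the present disclosure. As shown in Fig. 1, a corresponding industry label can be determined for each enterprise according to the national economic industry classification, in which the industry label codes run, from coarsest to finest, through category, major category, medium category and sub-category. The service enterprise 110 needs to determine the enterprise portrait of the target enterprise 120 it serves from the industry label 121 of the target enterprise 120, so as to provide high-quality services to the target enterprise 120 according to the enterprise portrait.
When the industry label 121 of the target enterprise 120 is a sub-category industry label of unknown label type, such as "Other wholesale not elsewhere listed" with sub-category code 5199, the enterprise portrait of the target enterprise 120 becomes too coarse to describe the needs of the target enterprise 120 correctly, so a service strategy that meets those needs cannot be provided.
To make the enterprise portrait of an enterprise with an unclear industry label clearer, the embodiments of the present disclosure provide a method that automatically determines a clear industry label for such an enterprise. The main idea of the method is as follows: according to the business scope of the enterprise, and the business scopes of the enterprises corresponding to each clearly typed sub-category that shares the same major category or category with the enterprise, determine the sub-category business scope that matches the business scope of the target enterprise, and determine the industry label of that sub-category as the industry label of the target enterprise. A suitable, clear industry label is thereby matched to the target enterprise, a clear enterprise portrait can be generated from it, and the needs of the target enterprise can be described correctly and aptly on the basis of that portrait, so that high-quality services can be provided.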
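Fig. 2 is a flowchart of the method for determining an industry label provided by an embodiment of the present disclosure. As shown in Fig. 2, the method includes the following steps:
Step S201: acquire the business scope of a target enterprise of the existing users.
The type of the industry label of the target enterprise is an unknown label type. An industry label usually refers to the name of a sub-category in the Industrial Classification for National Economic Activities (《国民经济行业分类》); an industry label of unknown label type is one whose sub-category name contains "other" or otherwise expresses the class unclearly, such as "Other agriculture", "Other animal husbandry", "Other wholesale not elsewhere listed" or "Other manufacturing not elsewhere listed". Existing users are users of the services provided, usually the existing customers. The business scope is data describing the range of an enterprise's business operations and can be expressed as keywords or sentences.
For example, taking the industry label "Other wholesale not elsewhere listed", the business scope of the target enterprise may be: "Business scope: wholesale and retail of steel and clothing".
Specifically, there may be one target enterprise or several.
Further, after acquiring the business scope of the target enterprise, the method also includes:
clearing the industry label of the target enterprise; converting the business scope into a business scope in a preset format; and performing word segmentation processing on the business scope in the preset format to obtain the target business scope word segments corresponding to the business scope of the target enterprise.
Specifically, in order to assign a new industry label to a target enterprise whose industry label is unclear, its existing industry label needs to be cleared.
Specifically, because the business scope of a target enterprise is usually entered or filled in by hand, its format is not uniform; to make data processing easier, the business scope of the target enterprise is converted into a business scope in a preset format.
For example, suppose the business scope of target enterprise C1 is "Main business of this company: stationery of all kinds, jewelry, beverages and tobacco, wholesale and retail". After conversion to the preset format, the business scope of target enterprise C1 is "The business scope is: wholesale and retail of stationery of all kinds, jewelry, beverages and tobacco".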
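Step S202: for each sub-category under the target category, acquire the business scope of each enterprise of the existing users corresponding to the sub-category, and generate the sub-category business content of the sub-category according to the business scopes of the enterprises.
The target category is the category or major category to which the industry label of the target enterprise belongs, and the type of the industry label of the sub-category is a known label type. The known label type is the opposite of the unknown label type described above and means that the enterprise's industry label is clear or explicit; it may be an industry label that does not contain the keyword "not elsewhere listed" (未列明), such as "Fruit and vegetable wholesale (5123)" or "Clothing wholesale (5132)".
Specifically, the category or major category to which the industry label of the target enterprise belongs is obtained, together with the business scopes of the existing users' enterprises in each sub-category under that category or major category, that is, the business scopes of the enterprises corresponding to each sub-category under the target category; the business scopes of the enterprises are then integrated to obtain the sub-category business content of the sub-category.
Further, for the business scope of each enterprise, the content inside brackets in the business scope can be removed, and business scopes with abnormal values, such as an empty business scope, can be discarded.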
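The clean-up described above might look like the following sketch. It assumes that "brackets" covers both ASCII and full-width parentheses and that an abnormal value is an empty or missing string; the function name and sample strings are hypothetical.

```python
import re

# Matches non-nested content in ASCII "()" or full-width "（）" parentheses.
_BRACKETED = re.compile(r"[（(][^（()）]*[)）]")


def clean_business_scopes(scopes):
    """Strip bracketed content and drop empty or missing business scopes."""
    cleaned = []
    for scope in scopes:
        if scope is None:
            continue  # missing value: treated as abnormal
        text = _BRACKETED.sub("", scope).strip()
        if text:  # skip scopes that are empty after cleaning
            cleaned.append(text)
    return cleaned


scopes = [
    "building materials wholesale (excluding hazardous chemicals)",
    "",
    None,
    "clothing wholesale（retail excluded）",
]
result = clean_business_scopes(scopes)
```

Other kinds of abnormal value, if any, would need their own checks.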
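Further, after the business scopes of the enterprises of the sub-category are obtained, the keywords of each enterprise's business scope can be extracted, and the keywords of the enterprises then make up the sub-category business content of the sub-category.
For example, suppose the major category of the target enterprise is "Wholesale", with major category code 51, and the existing users have enterprise customers in 2 sub-categories of known label type under Wholesale: building materials wholesale (sub-category code 5165) and wholesale of textiles, knitwear and raw materials (sub-category code 5131). Enterprises C2 and C3 belong to the building materials wholesale sub-category, and enterprises C4, C5 and C6 belong to the wholesale of textiles, knitwear and raw materials sub-category. The business scopes of C2 and C3 are integrated to obtain the business content of the building materials wholesale sub-category, and the business scopes of C4, C5 and C6 are integrated to obtain the business content of the wholesale of textiles, knitwear and raw materials sub-category.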
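Step S203: according to the business scope of the target enterprise and each sub-category business content, determine the sub-category business content that matches the business scope of the target enterprise as the matching business content of the target enterprise.
Specifically, the keywords of the business scope of the target enterprise can be matched against the keywords of the sub-category business content to obtain the matching degree of the target enterprise for that sub-category, and the sub-category business content of the sub-category with the highest matching degree is determined as the matching business content of the target enterprise.
Further, a weight value can be set in advance for each keyword of the sub-category business content; when a keyword of the business scope of the target enterprise is identical to or matches a keyword of the sub-category business content, the weight value of the matched keyword is obtained, and the weight values of all matched keywords are added together to give the matching degree for that sub-category.
Specifically, the weight value of a keyword of the sub-category business content can be determined from how frequently the keyword appears.
For example, suppose the keywords and weight values of the sub-category business content of a sub-category are "wholesale 0.1, retail 0.1, steel 0.4 and timber 0.4", and the keywords of the business content of the target enterprise are "wholesale, steel and clothing"; the matching degree of the target enterprise for this sub-category is then 0.5.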
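The keyword weighting and matching described above can be sketched as follows. The text only says that a keyword's weight can be based on how often it appears, so using the relative frequency across the sub-category's enterprises is an assumption; the keyword lists and function names are invented for illustration, and matching is reduced to exact equality.

```python
from collections import Counter


def keyword_weights(enterprise_keywords):
    """Weight each keyword by its relative frequency (an assumed scheme)."""
    counts = Counter(k for keywords in enterprise_keywords for k in keywords)
    total = sum(counts.values())
    return {keyword: count / total for keyword, count in counts.items()}


def keyword_matching_degree(target_keywords, weights):
    """Add up the weights of sub-category keywords the target also has."""
    return sum(w for keyword, w in weights.items() if keyword in target_keywords)


# Invented keywords for two enterprises of one sub-category.
weights = keyword_weights([
    ["wholesale", "steel"],
    ["wholesale", "timber"],
])
degree = keyword_matching_degree({"wholesale", "steel", "clothing"}, weights)
```

Here "wholesale" accounts for half of all keyword occurrences and "steel" for a quarter, so the target enterprise's matching degree is their sum.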
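Step S204: determine the industry label of the sub-category corresponding to the matching business content as the industry label of the target enterprise.
Specifically, once the matching business content that best matches the business scope of the target enterprise has been determined from the sub-category business contents, the clear industry label of the sub-category corresponding to that matching business content is obtained and determined as the industry label of the target enterprise, so that a clear industry label is set for the target enterprise automatically.
In this embodiment, for a target enterprise of the existing users whose industry label is unclear, the sub-category business content that matches the business scope of the target enterprise is determined from the business scope of the target enterprise and the business content of each sub-category, of known label type, under the category or major category of the target enterprise, where the business content of a sub-category is determined from the business scopes of the enterprises corresponding to that sub-category; the industry label of the sub-category of that sub-category business content is then determined as the industry label of the target enterprise. Enterprises with unclear industry labels are thus automatically matched to clear industry labels with high accuracy, which provides a good foundation for determining the enterprise portrait, makes it easier to provide services suited to the enterprise's business situation, and improves the user experience.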
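Fig. 3 is a flowchart of the method for determining an industry label provided by another embodiment of the present disclosure. This embodiment builds on the embodiment shown in Fig. 2 by further refining steps S202 and S203 and by adding, after step S201, a step that performs word segmentation on the business scope of the target enterprise. As shown in Fig. 3, the method provided by this embodiment includes the following steps:
Step S301: acquire the business scope of a target enterprise of the existing users.
The type of the industry label of the target enterprise is an unknown label type.
Step S302: perform word segmentation processing on the business scope of the target enterprise to obtain the target business scope word segments of the business scope of the target enterprise.
Specifically, word segmentation is the process of recombining a continuous sentence into a sequence of words according to certain rules. The business scopes of the enterprises involved in the present disclosure may be written in Chinese or in English. The segmentation algorithm may be one based on string matching, one based on a hidden Markov model (HMM), one based on conditional random fields, or another segmentation algorithm.
Further, the Python Chinese word segmentation component jieba can also be used to segment the business scope of the target enterprise and, subsequently, the business scopes of the enterprises of the sub-categories.
For example, suppose the business scope of the target enterprise is "The business content of this company is: wholesale and retail of grain and oil, food, beverages and tobacco products". The content before the colon is removed first, then the stop words "以及", "的" and "和" (roughly "as well as", "of" and "and") and the punctuation marks are removed, and the target business scope word segments obtained after segmentation are: grain and oil, beverages, tobacco products, wholesale and retail.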
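Step S303: for each sub-category under the target category, acquire the business scope of each enterprise of the existing users corresponding to the sub-category.
Step S304: for each enterprise of each sub-category, perform word segmentation processing on the business scope of the enterprise to obtain the enterprise business scope word segments of the enterprise.
Specifically, the business scope of each enterprise with a clear industry label is segmented; the segmentation algorithm is similar to that in step S302 and is not repeated here. This yields the enterprise business scope word segments of each enterprise of each sub-category.
Step S305: for each sub-category, perform de-duplication processing and stop-word removal processing on the enterprise business scope word segments of the enterprises of the sub-category to obtain the sub-category business content of the sub-category.
Specifically, a stop-word set made up of individual stop words can be determined in advance, and stop words can then be removed from the enterprise business scope word segments of the enterprises of the sub-category on the basis of that set. The sub-category business content consists of the enterprise business scope word segments of the enterprises after de-duplication and stop-word removal.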
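Step S305 can be sketched as below, assuming the word segments of each enterprise are already available as lists. The stop-word set, the sample segments and the function name are hypothetical; first-seen order is kept so that the result is deterministic, which the source does not require.

```python
STOP_WORDS = {"and", "of", "the"}  # hypothetical stop-word set


def build_subcategory_content(enterprise_segments, stop_words=STOP_WORDS):
    """Merge the enterprises' word segments, dropping stop words and repeats."""
    seen = set()
    content = []
    for segments in enterprise_segments:
        for segment in segments:
            if segment in stop_words or segment in seen:
                continue
            seen.add(segment)
            content.append(segment)
    return content


content = build_subcategory_content([
    ["wholesale", "of", "steel"],
    ["steel", "and", "timber", "wholesale"],
])
```

Because duplicates are removed, each word appears once in a sub-category business content regardless of how many enterprises use it.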
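Step S306: for each sub-category business content, calculate the matching degree between the sub-category business content and the business scope of the target enterprise according to the enterprise business scope word segments of the sub-category business content and the target business scope word segments.
Specifically, the weight value of an enterprise business scope word segment can be determined from how many times it appears in the sub-category business content; when a target business scope word segment matches that enterprise business scope word segment, the weight value of the enterprise business scope word segment is determined as the segment score of the target business scope word segment, and the segment scores of the target enterprise are added together to give the matching degree of the target enterprise's business scope for that sub-category business content.
Optionally, Fig. 4 is a flowchart of step S306 in the embodiment shown in Fig. 3 of the present disclosure. As shown in Fig. 4, step S306 includes the following steps:
Step S3061: determine the total business content of the target category according to the enterprise business scope word segments of each sub-category business content.
Specifically, the total business content corresponding to the target category is obtained by integrating the enterprise business scope word segments of all enterprises of all sub-category business contents. The target category is the category or major category to which the industry label of the target enterprise belongs.
Step S3062: for each sub-category business content, based on the term frequency-inverse document frequency technique and with the total business content as the document set, calculate the first score, in the total business content, of each enterprise business scope word segment of the sub-category business content.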
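The term frequency-inverse document frequency (TF-IDF) technique evaluates how important a word is to a particular document in a document set or corpus, mainly by determining the word's weight from how often it appears. The first score is the TF-IDF value of each enterprise business scope word segment in the document set formed by the total business content.
Specifically, the term frequency (TF) measures how often a given word appears, and is expressed as:
$$\mathrm{Tf}_{term} = \frac{T_{term}}{N_T}$$
where $\mathrm{Tf}_{term}$ is the term frequency of the given word $term$; $T_{term}$ is the number of times the given word $term$ appears in the given document or article; and $N_T$ is the total number of words in the given document or article.
Specifically, the inverse document frequency (IDF) is a parameter describing the general importance of a given word and is inversely related to how common the word is. In its standard form it is expressed as:
$$\mathrm{Idf}_{term} = \log\frac{N_D}{D_{term}}$$
where $\mathrm{Idf}_{term}$ is the inverse document frequency of the given word $term$; $D_{term}$ is the number of documents that contain the given word $term$; and $N_D$ is the total number of documents in the corpus.
For each given word, multiplying its term frequency by its inverse document frequency then gives its TF-IDF value, that is, the first score above.
Specifically, with the total business content as the document set, the term frequency and inverse document frequency of each enterprise business scope word segment in a sub-category business content are calculated using the TF-IDF technique, which gives the TF-IDF value, that is, the first score, of each enterprise business scope word segment.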
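The first-score calculation can be sketched as follows, treating each sub-category business content as one document of the document set. The unsmoothed logarithmic IDF used here is the textbook form and is an assumption about the disclosure's exact expression; the sample data and the function name are invented.

```python
import math


def tf_idf_scores(documents):
    """Return {document name: {word: TF-IDF}} for a dict of word lists."""
    n_docs = len(documents)
    # Number of documents that contain each word.
    doc_freq = {}
    for words in documents.values():
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    scores = {}
    for name, words in documents.items():
        total = len(words)
        scores[name] = {
            word: (words.count(word) / total) * math.log(n_docs / doc_freq[word])
            for word in set(words)
        }
    return scores


# Invented sub-category business contents under one major category.
documents = {
    "building materials wholesale": ["wholesale", "steel", "timber", "cement"],
    "textile wholesale": ["wholesale", "fabric", "yarn", "knitwear"],
}
scores = tf_idf_scores(documents)
steel_score = scores["building materials wholesale"]["steel"]
```

With this form of IDF, a word such as "wholesale" that appears in every sub-category scores zero, so only words that distinguish one sub-category from the others contribute to the match.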
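Step S3063: for each sub-category business content, determine the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content.
Specifically, the first scores of those enterprise business scope word segments of the sub-category business content that match target business scope word segments are added together to give the matching degree between the sub-category business content and the business scope of the target enterprise.
For example, suppose the sub-category business content includes Word1, Word2, Word3 and Word4, with first scores of 0.48, 0.24, 0.01 and 0.05 respectively, and the target business scope word segments include Word2 and Word3. Word2 and Word3 in the sub-category business content are then the matching words, and adding their first scores gives the matching degree: 0.24 + 0.01 = 0.25.
Optionally, determining the matching degree between the sub-category business content and the business scope of the target enterprise according to each target business scope word segment and the first score of each enterprise business scope word segment of the sub-category business content includes:
for each target business scope word segment, when the target business scope word segment matches the current enterprise business scope word segment of the sub-category business content, determining the first score of the current enterprise business scope word segment as the target score of the target business scope word segment; and determining the matching degree between the sub-category business content and the business scope of the target enterprise according to the target scores of the target business scope word segments.
The current enterprise business scope word segment is any one of the enterprise business scope word segments in the sub-category business content.
Specifically, a target business scope word segment matching the current enterprise business scope word segment can mean that the two are identical or similar.
Specifically, the sum of the target scores of the target business scope word segments can be calculated to obtain the matching degree between the sub-category business content and the business scope of the target enterprise.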
进一步地,针对每个小类,还可以获取该小类对应的企业数量,根据企业数量确定各个小类的小类权重值,进而根据小类权重值以及各个目标经营范围分词的目标分数,确定小类经营内容与目标企业的经营范围的匹配度。
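Specifically, the subclass weight is determined by the ratio of the number of enterprises in the subclass to the total number of enterprises in the target class. The subclass weight keeps differences in the number of enterprises between subclasses from distorting the matching degree.
For example, suppose the target enterprise belongs to the manufacturing section, and under this section the existing users have two subclasses with known industry labels: candy and chocolate manufacturing, and dairy product manufacturing. If the candy and chocolate manufacturing subclass has 7 enterprises and the dairy product manufacturing subclass has 3, the subclass weight of candy and chocolate manufacturing is 0.3 and that of dairy product manufacturing is 0.7. In this example the weight falls as the subclass's share of enterprises rises (1 − 7/10 = 0.3 and 1 − 3/10 = 0.7).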
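The arithmetic of this example can be checked with a short sketch. The text states only that the weight is determined by the ratio of enterprise counts; taking the weight as one minus the subclass's share is an interpretation that happens to reproduce the example's numbers, not a formula given in the disclosure. The function name and the English subclass keys are assumptions.

```python
def subclass_shares(enterprise_counts):
    """Share of the target class's enterprises that falls in each subclass."""
    total = sum(enterprise_counts.values())
    return {name: count / total for name, count in enterprise_counts.items()}


counts = {"confectionery": 7, "dairy": 3}
shares = subclass_shares(counts)
# Assumed interpretation: the weight is one minus the share, which matches the example.
weights = {name: 1.0 - share for name, share in shares.items()}
print(shares, weights)
```

With more than two subclasses these complements no longer sum to one, so a real implementation would need to fix a normalization rule.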
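Step S307: determine the subclass business content with the highest matching degree as the matching business content of the target enterprise.
Step S308: determine the industry label of the subclass corresponding to the matching business content as the industry label of the target enterprise.
In this embodiment, for a target enterprise of the existing users whose industry label is unclear, the method obtains the target enterprise's business scope together with the business scopes of the enterprises in each subclass that has a known industry label and belongs to the same section or division as the target enterprise, and performs word segmentation on every business scope. For each subclass, the tokens of its enterprises are deduplicated and stripped of stop words to assemble the subclass business content. Using TF-IDF with the total business content of the section or division as the document set, the first score of each subclass's tokens is computed. Token matching and the first scores then give the matching degree between the target enterprise and each subclass, which identifies the subclass business content that best matches the target enterprise's business scope; the industry label of that subclass becomes the target enterprise's industry label. Enterprises with unclear industry labels are thus matched to definite labels automatically and with high accuracy, which provides a sound basis for building the enterprise's profile, makes it easier to offer services suited to the enterprise's actual business, and improves the user experience.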
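Figure 5 is a flowchart of a method for determining an industry label provided by another embodiment of the present disclosure. This embodiment builds on the embodiment shown in Figure 3 and adds steps after step S303. As shown in Figure 5, the method provided by this embodiment includes the following steps:
Step S501: obtain the business scope of a target enterprise of the existing users.
Here, the type of the target enterprise's industry label is an unknown label type.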
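Specifically, let $F$ be the division or section to which the target enterprise's industry label belongs. Under $F$ the existing users have $n$ subclass industries, $F_i$ ($i = 1, 2, 3, \dots, n$). Suppose $F_1$ is the subclass industry whose industry label is unclear and that $F_1$ corresponds to $m_1$ target enterprises [Figure PCTCN2021103262-appb-000004]. The business scopes of these target enterprises [Figure PCTCN2021103262-appb-000005] of the existing users then need to be obtained.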
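Step S502: perform word segmentation on the target enterprise's business scope to obtain the target business-scope tokens of that business scope.
Step S503: for each subclass under the target class, obtain the business scopes of the existing users' enterprises that belong to that subclass.
Specifically, the target class is the division or section $F$ above. The business scopes of the enterprises of each subclass industry $F_i$ ($i = 2, 3, \dots, n$) are obtained, that is, the business scopes of the enterprises [Figure PCTCN2021103262-appb-000006], where $m_i$ is the number of enterprises in the $i$-th subclass industry.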
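Step S504: for each enterprise, perform word segmentation on the enterprise's business scope to obtain that enterprise's business-scope tokens.
Step S505: for each subclass, deduplicate the enterprise business-scope tokens of the subclass's enterprises and remove stop words, to obtain the subclass business content of that subclass.
Specifically, the business scopes of the enterprises [Figure PCTCN2021103262-appb-000007] are segmented, deduplicated and stripped of stop words, and the results are then grouped by subclass industry to obtain the subclass business contents $E_i$ ($i = 2, 3, \dots, n$).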
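Step S506: determine the total business content of the target class from the enterprise business-scope tokens of all subclass business contents.
Step S507: for each subclass business content, using term frequency-inverse document frequency with the total business content as the document set, compute the first score of each enterprise business-scope token of the subclass business content within the total business content.
Specifically, with the total business content $E$ as the "document set" and the subclass business content $E_i$ ($i = 2, 3, \dots, n$) of each subclass industry as an "article", the TF-IDF score of every enterprise business-scope token in the subclass business content is computed; this is the first score described above.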
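Step S508: for each enterprise of each subclass, compute the word vector of each of the enterprise's business-scope tokens, and determine the enterprise's business-scope sentence vector from those word vectors.
Specifically, for each enterprise [Figure PCTCN2021103262-appb-000008] of each subclass, the word vectors of the enterprise's business-scope tokens are computed with a preset word-vector algorithm, and the enterprise's business-scope sentence vector is then derived from them.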
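Optionally, computing the word vector of each of the enterprise's business-scope tokens includes:
computing the word vector of each of the enterprise's business-scope tokens based on a text vectorization model and a preset Chinese word-vector dictionary.
Here, the text vectorization model (Word to Vector, word2vec) is a tool that converts words into numeric vectors. The preset Chinese word-vector dictionary is a word-vector dictionary trained on a large corpus of Chinese words.
Step S509: for each subclass, determine the business-scope center vector of the subclass from the business-scope sentence vectors of the subclass's enterprises.
Specifically, the business-scope sentence vectors of the subclass's enterprises can be summed as vectors to obtain the business-scope center vector of the subclass.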
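Steps S508 and S509 can be illustrated with plain Python lists. The text does not say how word vectors are combined into a sentence vector, so averaging them is an assumption made here; the center vector is the component-wise sum, as described. The tiny two-dimensional `word_vectors` dictionary stands in for a pretrained Chinese word-vector dictionary, and both function names are hypothetical.

```python
def sentence_vector(tokens, word_vectors):
    """Average the vectors of the tokens found in the dictionary (averaging is assumed)."""
    found = [word_vectors[token] for token in tokens if token in word_vectors]
    if not found:
        raise ValueError("no token has a word vector")
    dim = len(found[0])
    return [sum(vec[k] for vec in found) / len(found) for k in range(dim)]


def center_vector(sentence_vectors):
    """Component-wise sum of the sentence vectors of one subclass."""
    return [sum(column) for column in zip(*sentence_vectors)]


# Toy stand-in for a pretrained word-vector dictionary.
word_vectors = {"糖果": [1.0, 0.0], "制造": [0.0, 1.0], "巧克力": [1.0, 1.0]}
s1 = sentence_vector(["糖果", "制造"], word_vectors)
s2 = sentence_vector(["巧克力", "制造"], word_vectors)
center = center_vector([s1, s2])
print(s1, s2, center)
```

Because the center is a sum rather than a mean, its length grows with the number of enterprises in the subclass; dividing by that number would give a centroid that is comparable across subclasses of different sizes.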
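Step S510: compute the word vector of each of the target enterprise's target business-scope tokens, and determine the target enterprise's target business-scope sentence vector from those word vectors.
Optionally, computing the word vector of each of the target enterprise's target business-scope tokens includes:
computing the word vector of each of the target enterprise's target business-scope tokens based on a text vectorization model and a preset Chinese word-vector dictionary.
Note that the word vectors of the target business-scope tokens and the target business-scope sentence vector are computed in the same way as the word vectors and enterprise business-scope sentence vector in step S508; only the object changes from an enterprise of a subclass to the target enterprise.
Step S511: compute the vector distance between the target business-scope sentence vector and the business-scope center vector of each subclass.
Specifically, the vector distance is the Euclidean distance between the two vectors, that is, between the target business-scope sentence vector and the subclass's business-scope center vector.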
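Step S512: determine the second score of the target enterprise's business scope with respect to the subclass business content of the subclass, from the target business-scope tokens and the first scores of the enterprise business-scope tokens in the subclass business content.
Specifically, for each subclass, the second score is the sum of the first scores of those enterprise business-scope tokens of the subclass that match a target business-scope token of the target enterprise.
Step S513: determine the matching degree between the subclass business content of the subclass and the target enterprise's business scope from the second score and the vector distance.
Specifically, if the second score of subclass $F_i$ ($i = 2, 3, \dots, n$) is $S_i$ and its vector distance is $D_i$, the matching degree $P_i$ of subclass $F_i$ is:
$$P_i = S_i + \lambda D_i$$
where $\lambda$ is a weight coefficient that takes a negative value.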
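The combined score can be sketched as below, using `math.dist` from the Python standard library for the Euclidean distance. The value λ = −0.1, the function name `matching_degree` and the sample scores and vectors are illustrative assumptions; the disclosure only requires λ to be negative.

```python
import math


def matching_degree(second_score, target_vec, center_vec, lam=-0.1):
    """P_i = S_i + lam * D_i, where D_i is the Euclidean distance and lam is negative."""
    if lam >= 0:
        raise ValueError("lam must be negative so that a larger distance lowers the score")
    return second_score + lam * math.dist(target_vec, center_vec)


target = [1.0, 1.0]
# Hypothetical subclasses: (second score S_i, business-scope center vector).
subclasses = {
    "confectionery": (0.25, [1.0, 1.0]),
    "dairy": (0.25, [4.0, 5.0]),
}
degrees = {
    name: matching_degree(score, target, center)
    for name, (score, center) in subclasses.items()
}
best = max(degrees, key=degrees.get)
print(degrees, best)
```

With equal second scores, the subclass whose center vector lies closer to the target's sentence vector wins, which is the effect the negative coefficient is meant to produce.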
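Step S514: determine the subclass business content with the highest matching degree as the matching business content of the target enterprise.
Step S515: determine the industry label of the subclass corresponding to the matching business content as the industry label of the target enterprise.
In this embodiment, for a target enterprise whose industry label is unclear, the matching degree between a subclass's business content and the target enterprise's business scope is determined along several dimensions: TF-IDF measures how well the two match at the level of individual tokens, and the text vectorization model measures how well they match as a whole, that is, at the level of sentence vectors. Combining the two makes the matching degree more accurate. The industry label of the subclass industry with the highest matching degree becomes the industry label of the target enterprise. Enterprises with unclear industry labels are thus matched to definite labels automatically and with high accuracy, which provides a sound basis for building the enterprise's profile, makes it easier to offer services suited to the enterprise's actual business, and improves the user experience.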
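Figure 6 is a schematic structural diagram of the apparatus for determining an industry label provided by an embodiment of the present disclosure. As shown in Figure 6, the apparatus includes a data acquisition module 610, a subclass business content determination module 620, a content matching module 630 and an industry label determination module 640.
The data acquisition module 610 is configured to obtain the business scope of a target enterprise of existing users, where the type of the target enterprise's industry label is an unknown label type. The subclass business content determination module 620 is configured to obtain, for each subclass under the target class, the business scopes of the existing users' enterprises corresponding to the subclass, and to generate the subclass business content of the subclass from those business scopes, where the target class is the section or division to which the target enterprise's industry label belongs and the type of the subclass's industry label is a known label type. The content matching module 630 is configured to determine, from the target enterprise's business scope and the subclass business contents, the subclass business content that matches the target enterprise's business scope as the matching business content of the target enterprise. The industry label determination module 640 is configured to determine the industry label of the subclass corresponding to the matching business content as the industry label of the target enterprise.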
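Optionally, the subclass business content determination module 620 includes:
a business scope acquisition unit, configured to obtain, for each subclass under the target class, the business scopes of the existing users' enterprises corresponding to the subclass; a first word segmentation unit, configured to perform, for each enterprise, word segmentation on the enterprise's business scope to obtain the enterprise's business-scope tokens; and a subclass business content determination unit, configured to deduplicate, for each subclass, the enterprise business-scope tokens of the subclass's enterprises and remove stop words, to obtain the subclass business content of the subclass.
Optionally, the apparatus for determining an industry label further includes:
a second word segmentation unit, configured to perform word segmentation on the target enterprise's business scope to obtain the target business-scope tokens of that business scope.
Correspondingly, the content matching module 630 includes:
a matching degree calculation unit, configured to compute, for each subclass business content, the matching degree between the subclass business content and the target enterprise's business scope from the enterprise business-scope tokens of the subclass business content and the target business-scope tokens; and a matching business content determination unit, configured to determine the subclass business content with the highest matching degree as the matching business content of the target enterprise.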
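Optionally, the matching degree calculation unit includes:
a total business content determination subunit, configured to determine the total business content of the target class from the enterprise business-scope tokens of all subclass business contents; a first score calculation subunit, configured to compute, for each subclass business content, using term frequency-inverse document frequency with the total business content as the document set, the first score of each enterprise business-scope token of the subclass business content within the total business content; and a matching degree calculation subunit, configured to determine, for each subclass business content, the matching degree between the subclass business content and the target enterprise's business scope from the target business-scope tokens and the first scores of the enterprise business-scope tokens in the subclass business content.
Optionally, the matching degree calculation subunit is specifically configured to:
for each target business-scope token, when the target business-scope token matches the current enterprise business-scope token of the subclass business content, take the first score of the current enterprise business-scope token as the target score of the target business-scope token; and determine the matching degree between the subclass business content and the target enterprise's business scope from the target scores of the target business-scope tokens.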
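Optionally, the apparatus for determining an industry label further includes:
an enterprise business-scope sentence vector determination module, configured to compute, for each enterprise of each subclass, the word vector of each of the enterprise's business-scope tokens and to determine the enterprise's business-scope sentence vector from those word vectors; a business-scope center vector determination module, configured to determine, for each subclass, the business-scope center vector of the subclass from the business-scope sentence vectors of the subclass's enterprises; a target business-scope sentence vector determination module, configured to compute the word vector of each of the target enterprise's target business-scope tokens and to determine the target enterprise's target business-scope sentence vector from those word vectors; and a vector distance calculation module, configured to compute the vector distance between the target business-scope sentence vector and the business-scope center vector of each subclass.
Correspondingly, the matching degree calculation subunit is specifically configured to:
determine the second score of the target enterprise's business scope with respect to the subclass business content of the subclass, from the target business-scope tokens and the first scores of the enterprise business-scope tokens in the subclass business content; and determine the matching degree between the subclass business content of the subclass and the target enterprise's business scope from the second score and the vector distance.
Optionally, computing the word vector of each of the enterprise's business-scope tokens includes:
computing the word vector of each of the enterprise's business-scope tokens based on a text vectorization model and a preset Chinese word-vector dictionary; correspondingly, computing the word vector of each of the target enterprise's target business-scope tokens includes: computing the word vector of each of the target enterprise's target business-scope tokens based on a text vectorization model and a preset Chinese word-vector dictionary.
The apparatus for determining an industry label provided by the embodiments of the present disclosure can execute the method for determining an industry label provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to the method.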
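Figure 7 is a schematic structural diagram of a device for determining an industry label provided by an embodiment of the present disclosure. As shown in Figure 7, the device includes a memory 710, a processor 720 and a computer program.
The computer program is stored in the memory 710 and is configured to be executed by the processor 720 to implement the method for determining an industry label provided by any of the embodiments corresponding to Figures 2 to 5 of the present disclosure.
The memory 710 and the processor 720 are connected through a bus 730.
For related details, refer to the descriptions and effects of the steps of Figures 2 to 5, which are not repeated here.
An embodiment of the present disclosure provides a computer-readable storage medium storing a computer program, the computer program being executed by a processor to implement the method for determining an industry label provided by any of the embodiments corresponding to Figures 2 to 5 of the present disclosure.
The computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.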
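In the several embodiments provided by the present disclosure, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation. Multiple modules may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or modules, and may be electrical, mechanical or of another form.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present disclosure may be integrated into one processing unit, each module may exist physically on its own, or two or more modules may be integrated into one unit. A unit formed from integrated modules may be implemented in hardware, or in hardware combined with software functional units.
An integrated module implemented as a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present disclosure.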
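It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the invention may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory may include high-speed RAM and may also include non-volatile memory (NVM), for example at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disc, or the like.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. Buses can be divided into address buses, data buses, control buses and so on. For ease of illustration, the bus in the drawings of the present disclosure is not limited to a single bus or a single type of bus.
The storage medium may be implemented by any type of volatile or non-volatile storage device or a combination of them, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc. The storage medium may be any available medium that a general-purpose or special-purpose computer can access.
An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. The storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an application-specific integrated circuit (ASIC), or they may exist as discrete components in an electronic device or a master control device. Note that in this document the terms "include", "comprise" and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that includes the element.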
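The serial numbers of the embodiments of the present disclosure are for description only and do not indicate that one embodiment is better than another.
From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present disclosure, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present disclosure.
The above are only preferred embodiments of the present disclosure and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present disclosure, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present disclosure.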

Claims (20)

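  1. A method for determining an industry label, characterized by comprising:
    obtaining the business scope of a target enterprise of existing users, wherein the type of the target enterprise's industry label is an unknown label type;
    for each subclass under a target class, obtaining the business scopes of the existing users' enterprises corresponding to the subclass, and generating the subclass business content of the subclass from the business scopes of those enterprises, wherein the target class is the section or division to which the target enterprise's industry label belongs, and the type of the subclass's industry label is a known label type;
    determining, from the target enterprise's business scope and the subclass business contents, the subclass business content that matches the target enterprise's business scope as the matching business content of the target enterprise;
    determining the industry label of the subclass corresponding to the matching business content as the industry label of the target enterprise.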
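  2. The method according to claim 1, characterized in that generating the subclass business content of the subclass from the business scopes of the enterprises comprises:
    extracting keywords from the business scope of each enterprise;
    forming the subclass business content of the subclass from the keywords of the enterprises.
  3. The method according to claim 2, characterized in that determining, from the target enterprise's business scope and the subclass business contents, the subclass business content that matches the target enterprise's business scope as the matching business content of the target enterprise comprises:
    for each subclass, matching the keywords of the target enterprise's business scope against the keywords of the subclass business content of the subclass, to obtain the matching degree between the subclass and the target enterprise;
    determining the subclass business content of the subclass with the highest matching degree as the matching business content of the target enterprise.
  4. The method according to claim 3, characterized in that matching the keywords of the target enterprise's business scope against the keywords of the subclass business content of the subclass, to obtain the matching degree between the subclass and the target enterprise, comprises:
    setting a weight for each keyword of the subclass business content of the subclass;
    when a keyword of the target enterprise's business scope is identical to a keyword of the subclass business content, obtaining the weight of the matched keyword;
    summing the weights of the matched keywords to obtain the matching degree between the subclass business content of the subclass and the target enterprise's business scope.
  5. The method according to claim 4, characterized in that setting a weight for each keyword of the subclass business content of the subclass comprises:
    determining the weight of each keyword of the subclass business content of the subclass according to the frequency with which the keyword occurs.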
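  6. The method according to claim 1, characterized in that generating the subclass business content of the subclass from the business scopes of the enterprises comprises:
    for each enterprise of each subclass, performing word segmentation on the enterprise's business scope to obtain the enterprise's business-scope tokens;
    for each subclass, deduplicating the enterprise business-scope tokens of the subclass's enterprises and removing stop words, to obtain the subclass business content of the subclass.
  7. The method according to claim 6, characterized in that performing word segmentation on the enterprise's business scope comprises:
    performing word segmentation on the enterprise's business scope with jieba, a Chinese word segmentation component for Python.
  8. The method according to claim 6 or 7, characterized in that, after obtaining the business scope of the target enterprise of existing users, the method further comprises:
    performing word segmentation on the target enterprise's business scope to obtain the target business-scope tokens of that business scope;
    correspondingly, determining, from the target enterprise's business scope and the subclass business contents, the subclass business content that matches the target enterprise's business scope as the matching business content of the target enterprise comprises:
    for each subclass business content, computing the matching degree between the subclass business content and the target enterprise's business scope from the enterprise business-scope tokens of the subclass business content and the target business-scope tokens;
    determining the subclass business content with the highest matching degree as the matching business content of the target enterprise.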
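  9. The method according to claim 8, characterized in that, for each subclass business content, computing the matching degree between the subclass business content and the target enterprise's business scope from the enterprise business-scope tokens of the subclass business content and the target business-scope tokens comprises:
    determining the total business content of the target class from the enterprise business-scope tokens of all subclass business contents;
    for each subclass business content, using term frequency-inverse document frequency with the total business content as the document set, computing the first score of each enterprise business-scope token of the subclass business content within the total business content;
    for each subclass business content, determining the matching degree between the subclass business content and the target enterprise's business scope from the target business-scope tokens and the first scores of the enterprise business-scope tokens in the subclass business content.
  10. The method according to claim 9, characterized in that determining the matching degree between the subclass business content and the target enterprise's business scope from the target business-scope tokens and the first scores of the enterprise business-scope tokens in the subclass business content comprises:
    for each target business-scope token, when the target business-scope token matches the current enterprise business-scope token of the subclass business content, taking the first score of the current enterprise business-scope token as the target score of the target business-scope token;
    determining the matching degree between the subclass business content and the target enterprise's business scope from the target scores of the target business-scope tokens.
  11. The method according to claim 10, characterized in that determining the matching degree between the subclass business content and the target enterprise's business scope from the target scores of the target business-scope tokens comprises:
    obtaining the number of enterprises corresponding to the subclass;
    determining the subclass weight of the subclass according to the number of enterprises;
    determining the matching degree between the subclass business content and the target enterprise's business scope from the subclass weight and the target scores of the target business-scope tokens.
  12. The method according to claim 11, characterized in that determining the subclass weight of the subclass according to the number of enterprises comprises:
    determining the subclass weight of the subclass according to the ratio of the number of enterprises in the subclass to the total number of enterprises in the target class.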
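  13. The method according to claim 9, characterized by further comprising:
    for each enterprise of each subclass, computing the word vector of each of the enterprise's business-scope tokens, and determining the enterprise's business-scope sentence vector from those word vectors;
    for each subclass, determining the business-scope center vector of the subclass from the business-scope sentence vectors of the subclass's enterprises;
    computing the word vector of each of the target enterprise's target business-scope tokens, and determining the target enterprise's target business-scope sentence vector from those word vectors;
    computing the vector distance between the target business-scope sentence vector and the business-scope center vector of each subclass;
    correspondingly, determining the matching degree between the subclass business content and the target enterprise's business scope from the target business-scope tokens and the first scores of the enterprise business-scope tokens in the subclass business content comprises:
    determining the second score of the target enterprise's business scope with respect to the subclass business content of the subclass, from the target business-scope tokens and the first scores of the enterprise business-scope tokens in the subclass business content;
    determining the matching degree between the subclass business content of the subclass and the target enterprise's business scope from the second score and the vector distance.
  14. The method according to claim 13, characterized in that computing the word vector of each of the enterprise's business-scope tokens comprises:
    computing the word vector of each of the enterprise's business-scope tokens based on a text vectorization model and a preset Chinese word-vector dictionary;
    correspondingly, computing the word vector of each of the target enterprise's target business-scope tokens comprises:
    computing the word vector of each of the target enterprise's target business-scope tokens based on a text vectorization model and a preset Chinese word-vector dictionary.
  15. The method according to claim 13 or 14, characterized in that determining the second score of the target enterprise's business scope with respect to the subclass business content of the subclass, from the target business-scope tokens and the first scores of the enterprise business-scope tokens in the subclass business content, comprises:
    determining the sum of the first scores of those enterprise business-scope tokens of the subclass business content that match a target business-scope token as the second score of the target enterprise's business scope with respect to the subclass business content of the subclass.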
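  16. The method according to any one of claims 13 to 15, characterized in that determining the matching degree between the subclass business content of the subclass and the target enterprise's business scope from the second score and the vector distance comprises:
    determining the matching degree between the subclass business content of the subclass and the target enterprise's business scope from the second score, the vector distance and the following expression:
    $$P_i = S_i + \lambda D_i$$
    wherein $P_i$ is the matching degree of subclass $F_i$ ($i = 2, 3, \dots, n$); $S_i$ is the second score of subclass $F_i$; $D_i$ is the vector distance of subclass $F_i$; and $\lambda$ is a weight coefficient that takes a negative value.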
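  17. An apparatus for determining an industry label, characterized by comprising:
    a data acquisition module, configured to obtain the business scope of a target enterprise of existing users, wherein the type of the target enterprise's industry label is an unknown label type;
    a subclass business content determination module, configured to obtain, for each subclass under a target class, the business scopes of the existing users' enterprises corresponding to the subclass, and to generate the subclass business content of the subclass from the business scopes of those enterprises, wherein the target class is the section or division to which the target enterprise's industry label belongs, and the type of the subclass's industry label is a known label type;
    a content matching module, configured to determine, from the target enterprise's business scope and the subclass business contents, the subclass business content that matches the target enterprise's business scope as the matching business content of the target enterprise;
    an industry label determination module, configured to determine the industry label of the subclass corresponding to the matching business content as the industry label of the target enterprise.
  18. The apparatus according to claim 17, characterized in that the subclass business content determination module comprises:
    a business scope acquisition unit, configured to obtain, for each subclass under the target class, the business scopes of the existing users' enterprises corresponding to the subclass; a first word segmentation unit, configured to perform, for each enterprise, word segmentation on the enterprise's business scope to obtain the enterprise's business-scope tokens; and a subclass business content determination unit, configured to deduplicate, for each subclass, the enterprise business-scope tokens of the subclass's enterprises and remove stop words, to obtain the subclass business content of the subclass.
  19. A device for determining an industry label, characterized in that the device comprises a memory, a processor, and a program for determining an industry label that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the method for determining an industry label according to any one of claims 1 to 16.
  20. A computer-readable storage medium, characterized in that a program for determining an industry label is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the steps of the method for determining an industry label according to any one of claims 1 to 16.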
PCT/CN2021/103262 2020-09-30 2021-06-29 Method, apparatus, device and storage medium for determining an industry label WO2022068297A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011060599.X 2020-09-30
CN202011060599.XA CN112163153B (zh) 2020-09-30 2020-09-30 Method, apparatus, device and storage medium for determining an industry label

Publications (1)

Publication Number Publication Date
WO2022068297A1 (zh) 2022-04-07

Family

ID=73860835

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/103262 WO2022068297A1 (zh) 2020-09-30 2021-06-29 Method, apparatus, device and storage medium for determining an industry label

Country Status (2)

Country Link
CN (1) CN112163153B (zh)
WO (1) WO2022068297A1 (zh)

Also Published As

Publication number Publication date
CN112163153A (zh) 2021-01-01
CN112163153B (zh) 2024-05-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21873959

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 070723)

122 Ep: pct application non-entry in european phase

Ref document number: 21873959

Country of ref document: EP

Kind code of ref document: A1