CN115391519A - NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium - Google Patents

NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium

Info

Publication number
CN115391519A
CN115391519A CN202210859622.4A CN202210859622A CN115391519A CN 115391519 A CN115391519 A CN 115391519A CN 202210859622 A CN202210859622 A CN 202210859622A CN 115391519 A CN115391519 A CN 115391519A
Authority
CN
China
Prior art keywords
model
enterprise
data
training
data source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210859622.4A
Other languages
Chinese (zh)
Inventor
张巍元
陈作星
孙宇
姜艳萍
吕海玉
葛振兴
王艳彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin Province Jilin Xiangyun Information Technology Co ltd
Original Assignee
Jilin Province Jilin Xiangyun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin Province Jilin Xiangyun Information Technology Co ltd
Priority to CN202210859622.4A
Publication of CN115391519A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An NLP-based method, system, equipment and storage medium for generating an enterprise automatic labeling model, belonging to the technical field of artificial intelligence. The invention addresses the problems of existing labeling approaches, which depend on manual work: low efficiency, low accuracy, high labor cost, and an excessive influence of experts' subjective judgment. The method comprises the following steps: S1, capturing Internet enterprise information to form a basic data source; S2, processing the basic data source, and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology; S3, performing model training on the enterprise key information and the label data, in combination with the enterprise's original label data; S4, based on the model training results, adjusting model parameters and changing input data, and iterating the model multiple times to generate a trained model; and S5, supplementing model rules according to actual conditions to generate the automatic labeling model.

Description

NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an enterprise automatic labeling model generation method, system, equipment and storage medium based on NLP technology.
Background
At present, the classification and labeling of enterprises generally relies on traditional manual selection, in which business experts assign labels based on their experience. This approach suffers from low efficiency, high labor cost, and a heavy influence of experts' subjective judgment. Over time, more and more enterprises come to carry multiple labels, and manual selection is prone to omissions and misjudgments. Moreover, the volume of enterprise data that needs to be labeled keeps growing, which poses great difficulty for traditional manual labeling.
In summary, because existing labeling methods depend on manual work, they suffer from low efficiency, low accuracy, high labor cost, and an excessive influence of experts' subjective judgment.
Disclosure of Invention
The invention solves the problems that arise because existing labeling methods depend on manual work: low efficiency, low accuracy, high labor cost, and an excessive influence of experts' subjective judgment.
The invention provides an enterprise automatic labeling model generation method based on NLP technology, which comprises the following steps:
S1, capturing Internet enterprise information to form a basic data source;
S2, processing the basic data source, and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
S3, performing model training on the enterprise key information and the label data, in combination with the enterprise's original label data;
S4, based on the model training results, adjusting model parameters and changing input data, and iterating the model multiple times to generate a trained model;
and S5, supplementing model rules according to actual conditions to generate the automatic labeling model.
Further, in an embodiment of the present invention, in step S1, the Internet enterprise information is obtained from web crawler collection and from historical enterprise tag library data.
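The combination of the two sources in step S1 can be illustrated with a minimal sketch. The patent does not specify a record layout or a join key, so the `EnterpriseRecord` type, the `enterprise_id` key and the sample data below are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class EnterpriseRecord:
    """One entry of the basic data source (hypothetical layout)."""
    enterprise_id: str
    text: str                                        # crawled basic information
    labels: List[str] = field(default_factory=list)  # historical labels, if any


def build_base_data_source(
    crawled: Dict[str, str],
    historical_labels: Dict[str, Set[str]],
) -> List[EnterpriseRecord]:
    """Join crawled text with historical tag-library labels by enterprise id."""
    records = []
    for enterprise_id, text in crawled.items():
        # Enterprises without historical labels get an empty label list.
        labels = sorted(historical_labels.get(enterprise_id, set()))
        records.append(EnterpriseRecord(enterprise_id, text, labels))
    return records


# Illustrative data only.
crawled = {
    "E001": "Develops cloud software for hospitals.",
    "E002": "Produces automotive parts.",
}
historical = {"E001": {"software", "healthcare"}}
base = build_base_data_source(crawled, historical)
```

Records that carry historical labels can later serve as training examples, while unlabeled records are candidates for automatic labeling.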
Further, in an embodiment of the present invention, in step S2, processing the basic data source includes the following steps:
step S201, cleaning the data in the basic data source to remove interference items from the data;
step S202, performing word segmentation on the cleaned data in the basic data source;
and step S203, managing and supplementing the professional vocabulary and the stop-word vocabulary according to the word segmentation result of step S202.
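Steps S202 and S203 can be sketched as follows. The patent does not name a segmentation algorithm, so forward maximum matching is used here as a stand-in; the vocabularies, the `segment` and `remove_stop_words` helpers and the sample sentence are all assumptions for illustration.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: at each position take the longest
    vocabulary entry, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop tokens listed in the stop-word vocabulary."""
    return [t for t in tokens if t not in stop_words]


text = "吉林祥云信息技术有限公司提供云计算服务"
vocabulary = {"吉林", "祥云", "信息技术", "有限公司", "提供", "服务"}
stop_words = {"提供", "有限公司"}

# First pass: the professional term "云计算" is missing from the
# vocabulary, so it is broken into single characters.
first_pass = remove_stop_words(segment(text, vocabulary), stop_words)

# S203: supplement the professional vocabulary, then segment again.
vocabulary.add("云计算")
second_pass = remove_stop_words(segment(text, vocabulary), stop_words)
```

Inspecting `first_pass` is what reveals the missing term; this review-and-supplement loop is the iterative part of the process.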
Further, in an embodiment of the present invention, in step S2, the weights of some of the professional vocabulary in the enterprise key information extracted from the processed basic data source using NLP technology are adjusted.
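One way to picture the weight adjustment is a TF-IDF vectorization in which selected professional terms are multiplied by a boost factor. The patent does not state which vectorization algorithm or weighting formula is used, so the smoothed TF-IDF formula, the `weighted_tfidf` function and the weight values below are assumptions.

```python
import math
from collections import Counter


def weighted_tfidf(docs, term_weights):
    """Vectorize tokenized documents with TF-IDF, scaling each term by an
    optional professional-vocabulary weight (default 1.0)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = []
        for term in vocab:
            tf = counts[term] / len(doc) if doc else 0.0
            # Smoothed inverse document frequency.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vec.append(tf * idf * term_weights.get(term, 1.0))
        vectors.append(vec)
    return vocab, vectors


docs = [["cloud", "software", "service"], ["auto", "parts", "service"]]
vocab, plain_vectors = weighted_tfidf(docs, {})
_, boosted_vectors = weighted_tfidf(docs, {"cloud": 2.0})
```

With a weight of 2.0, the "cloud" component of the first document doubles while all other components stay unchanged, so the downstream classifier sees the professional term as more prominent.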
Further, in an embodiment of the present invention, in step S3, the XGBoost algorithm is used for the model training.
Further, in an embodiment of the present invention, in step S3, the model training on the enterprise tag data includes the following steps:
step S301, taking the enterprise tag data as the result set, and extracting vectorized data for the enterprise tag data using NLP technology;
and step S302, splitting off a training set, a validation set and a cross-validation set in combination with the result set, and then training the model.
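The split in step S302 can be sketched under one possible reading: a held-out validation set plus k folds over the remaining training data for cross-validation. The patent does not give proportions or a fold count, so `val_fraction`, `n_folds` and the function name are assumptions.

```python
import random


def split_dataset(n_samples, val_fraction=0.2, n_folds=5, seed=42):
    """Return (training indices, validation indices, cross-validation folds)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    n_val = int(n_samples * val_fraction)
    validation = indices[:n_val]
    training = indices[n_val:]
    # Deal the training indices into n_folds disjoint folds.
    folds = [training[i::n_folds] for i in range(n_folds)]
    return training, validation, folds


training, validation, folds = split_dataset(100)
```

Each fold can in turn be held out while the model is fitted on the other folds, and the separate validation set is kept for comparing model iterations.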
The invention also provides an enterprise automatic labeling model generation system based on NLP technology, which comprises the following modules:
a capturing module, used for capturing Internet enterprise information to form a basic data source;
a processing module, used for processing the basic data source and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
a model module, used for performing model training on the enterprise key information and the label data, in combination with the enterprise's original label data;
an iteration module, used for adjusting model parameters and changing input data based on the model training results, and iterating the model multiple times to generate a trained model;
and a generating module, used for supplementing model rules according to actual conditions to generate the automatic labeling model.
The electronic equipment according to the present invention comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method steps of any of the above methods when executing the program stored in the memory.
A computer-readable storage medium according to the present invention stores a computer program which, when executed by a processor, carries out the method steps of any of the above-mentioned methods.
The invention solves the problems that arise because existing labeling methods depend on manual work: low efficiency, low accuracy, high labor cost, and an excessive influence of experts' subjective judgment. The specific beneficial effects are as follows:
1. In the enterprise automatic labeling model generation method based on NLP technology, an enterprise basic information database is first formed by capturing enterprise basic information; key data are extracted through data cleaning and iterative word segmentation; and professional vocabulary weighting is introduced before Chinese text vectorization, so that the data model calculation is more accurate. In addition, the best-performing model calculation method is adopted, the data model is trained repeatedly in an iterative manner, and a business rule model is finally added, providing an enterprise automatic labeling service that matches business requirements more accurately. This effectively solves the problems of low efficiency, low accuracy, high labor cost, and excessive influence of experts' subjective judgment in existing manual labeling.
2. In the enterprise automatic labeling model generation method based on NLP technology, the enterprise basic information data is cleaned, interference items in the data are removed, and data fields that are not suitable for use in the model are deleted, which improves the accuracy of the data.
3. In the enterprise automatic labeling model generation method based on NLP technology, a rule model is established by combining business data and expert suggestions and is used to supplement the generated trained model, so that the results output by the model meet the relevant business requirements.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of an enterprise automatic labeling model generation method based on NLP technology according to an embodiment.
FIG. 2 is a diagram of a basic data block according to an embodiment.
FIG. 3 is a flowchart of enterprise basic information data processing according to an embodiment.
Detailed Description
Various embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The embodiments described by referring to the drawings are exemplary and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The method for generating an enterprise automatic labeling model based on NLP technology according to this embodiment comprises the following steps:
S1, capturing Internet enterprise information to form a basic data source;
S2, processing the basic data source, and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
S3, performing model training on the enterprise key information and the label data, in combination with the enterprise's original label data;
S4, based on the model training results, adjusting model parameters and changing input data, and iterating the model multiple times to generate a trained model;
and S5, supplementing model rules according to actual conditions to generate the automatic labeling model.
In this embodiment, in step S1, the Internet enterprise information is obtained from web crawler collection and from historical enterprise tag library data.
In this embodiment, in step S2, processing the basic data source includes the following steps:
step S201, cleaning the data in the basic data source to remove interference items from the data;
step S202, performing word segmentation on the cleaned data in the basic data source;
and step S203, managing and supplementing the professional vocabulary and the stop-word vocabulary according to the word segmentation result of step S202.
In this embodiment, in step S2, the weights of some of the professional vocabulary in the enterprise key information extracted from the processed basic data source using NLP technology are adjusted.
In this embodiment, in step S3, the XGBoost algorithm is used for the model training.
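A hedged sketch of what multi-class training with the XGBoost Python package might look like follows. The patent gives no hyperparameters, so the values in `build_xgb_params`, the number of boosting rounds and the early-stopping setting are assumptions, and the function names are hypothetical. The `multi:softprob` objective models a single label per sample; enterprises that carry several labels would need a different arrangement (for example one model per label), which the patent does not describe.

```python
def build_xgb_params(num_classes, max_depth=6, eta=0.1):
    """Parameter dictionary for multi-class training (illustrative values)."""
    return {
        "objective": "multi:softprob",  # per-class probabilities
        "num_class": num_classes,
        "max_depth": max_depth,
        "eta": eta,                     # learning rate
        "eval_metric": "mlogloss",
    }


def train_label_model(train_x, train_y, val_x, val_y, num_classes, rounds=200):
    """Train a booster; inputs are assumed to be 2-D NumPy feature arrays
    and integer class ids in the range [0, num_classes)."""
    import xgboost as xgb  # third-party package, imported lazily

    dtrain = xgb.DMatrix(train_x, label=train_y)
    dval = xgb.DMatrix(val_x, label=val_y)
    return xgb.train(
        build_xgb_params(num_classes),
        dtrain,
        num_boost_round=rounds,
        evals=[(dval, "validation")],
        early_stopping_rounds=20,  # stop when validation loss stalls
    )


params = build_xgb_params(num_classes=5)
```

Keeping the parameters in a separate function makes them easy to vary between model iterations.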
In this embodiment, in step S3, the model training on the enterprise tag data includes the following steps:
step S301, taking the enterprise tag data as the result set, and extracting vectorized data for the enterprise tag data using NLP technology;
and step S302, splitting off a training set, a validation set and a cross-validation set in combination with the result set, and then training the model.
The system for generating an enterprise automatic labeling model based on NLP technology according to this embodiment comprises the following modules:
a capturing module, used for capturing Internet enterprise information to form a basic data source;
a processing module, used for processing the basic data source and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
a model module, used for performing model training on the enterprise key information and the label data, in combination with the enterprise's original label data;
an iteration module, used for adjusting model parameters and changing input data based on the model training results, and iterating the model multiple times to generate a trained model;
and a generating module, used for supplementing model rules according to actual conditions to generate the automatic labeling model.
The electronic device according to this embodiment includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method steps of any of the above embodiments when executing the program stored in the memory.
A computer-readable storage medium according to this embodiment stores a computer program which, when executed by a processor, implements the method steps of any of the above embodiments.
This embodiment is based on the NLP-based enterprise automatic labeling model generation method described above and can be better understood with reference to FIG. 1. A practical embodiment is provided:
step S1: establishing a basic data source: capturing Internet enterprise information to form a basic data source;
step S2: extracting key information: extracting enterprise key information using NLP technology;
step S3: initial model training: performing model training in combination with the label data;
step S4: iterating the model: iterating the model according to the model parameters and the state of the data;
step S5: supplementing model rules: supplementing model rules in combination with business expert suggestions;
step S6: generating the final automatic labeling model.
The basic data is mainly divided into two parts, as shown in FIG. 2: enterprise basic information data collected by the web crawler, and historical enterprise tag library data. Word segmentation, key information extraction and vectorization are then performed on the enterprise basic information using NLP technology, and the key information and label data are trained into a model in combination with the company's original enterprise label data.
As shown in FIG. 3, data cleaning is performed first to remove interference items from the data, and data fields that are not suitable for use in the model are deleted, which improves the accuracy of the data. Word segmentation is then performed; this is an iterative process in which the professional vocabulary and the stop-word vocabulary are managed and supplemented according to the segmentation results. Key information for each industry is then extracted using NLP technology. Next, the weights of some professional vocabulary are adjusted as appropriate to make the data better suited to model calculation, and the Chinese text is then vectorized using a suitable algorithm.
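The cleaning stage of FIG. 3 can be illustrated with a short sketch. The patent does not list which interference items or fields are removed, so the HTML-residue handling, the `DROPPED_FIELDS` set and the sample record below are assumptions chosen for illustration.

```python
import html
import re

# Hypothetical fields judged unsuitable for the model.
DROPPED_FIELDS = {"crawl_time", "source_url"}


def clean_record(record):
    """Remove markup residue and unsuitable fields from one crawled record."""
    cleaned = {}
    for key, value in record.items():
        if key in DROPPED_FIELDS:
            continue
        text = html.unescape(str(value))          # decode entities such as &amp;
        text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        if text:                                  # skip fields left empty
            cleaned[key] = text
    return cleaned


record = {
    "name": " Acme &amp; Sons <b>Ltd</b> ",
    "scope": "<p>Software   development</p>",
    "source_url": "https://example.com/acme",
    "remark": "   ",
}
cleaned = clean_record(record)
```

The cleaned text fields are what the word segmentation step would then consume.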
Enterprise information labeling is essentially a multi-class classification task, so the XGBoost algorithm is adopted for model training. The enterprise label data is taken as the result set; using the vectorized data extracted by the NLP module together with the result set, a training set, a validation set and a cross-validation set are split off, and the model is then trained. Based on the model training results, the model is iterated by adjusting parameters as appropriate and changing the input data, generating a trained model.
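The iteration over parameters can be sketched as an exhaustive grid search. The patent only says that parameters are adjusted and input data changed between iterations, so grid search, the `iterate_model` function and the toy scoring function are assumptions; in practice `train_and_score` would train a model and return a validation metric.

```python
from itertools import product


def iterate_model(train_and_score, param_grid):
    """Try every parameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)  # higher is better
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Toy stand-in for "train a model and return its validation score".
def toy_score(params):
    return -abs(params["max_depth"] - 6) - abs(params["eta"] - 0.1)


grid = {"max_depth": [4, 6, 8], "eta": [0.05, 0.1, 0.3]}
best_score, best_params = iterate_model(toy_score, grid)
```

Changing the input data between iterations (for example after supplementing the vocabularies) would amount to calling the same loop with a scoring function bound to the new data.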
A rule model is established by combining business data and expert suggestions and is used to supplement the trained model, so that the results output by the model meet the relevant business requirements. Finally, a model service is provided whose input is the enterprise's basic information and whose output is the enterprise's labels, thereby completing the automatic labeling of the enterprise.
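How a rule model might supplement the trained model's output can be sketched as a post-processing layer. The patent does not describe the rule format, so the keyword-based rules, the `apply_rules` function and the sample labels are assumptions.

```python
def apply_rules(enterprise_text, model_labels, rules):
    """Adjust model-predicted labels with expert-defined keyword rules."""
    labels = set(model_labels)
    for rule in rules:
        if rule["keyword"] in enterprise_text:
            labels |= set(rule.get("add", ()))     # labels the rule forces on
            labels -= set(rule.get("remove", ()))  # labels the rule vetoes
    return sorted(labels)


# Hypothetical rules of the kind a business expert might supply.
rules = [
    {"keyword": "hospital", "add": ["healthcare"]},
    {"keyword": "wholesale", "remove": ["manufacturing"]},
]
final_labels = apply_rules(
    "Cloud software for hospital management",
    ["software"],
    rules,
)
```

Keeping the rules as data rather than code lets business experts revise them without retraining the model.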
In summary, the invention first captures enterprise basic information to form an enterprise basic information database; key data are extracted using NLP technology through data cleaning and iterative word segmentation, and professional vocabulary weighting is introduced before Chinese text vectorization, so that the data model calculation is more accurate. In addition, the best-performing model calculation method is adopted, and the data model is trained repeatedly in an iterative manner; finally, a business rule model is added, providing an enterprise automatic labeling service that matches business requirements more accurately.
The method, system, equipment and storage medium for generating an enterprise automatic labeling model based on NLP technology have been described in detail above. A specific example is used herein to explain the principle and implementation of the invention, and the description of the embodiment is intended only to help in understanding the method and core idea of the invention. For a person skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (9)

1. An enterprise automatic labeling model generation method based on NLP technology, characterized by comprising the following steps:
S1, capturing Internet enterprise information to form a basic data source;
S2, processing the basic data source, and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
S3, performing model training on the enterprise key information and the label data, in combination with the enterprise's original label data;
S4, based on the model training results, adjusting model parameters and changing input data, and iterating the model multiple times to generate a trained model;
and S5, supplementing model rules according to actual conditions to generate the automatic labeling model.
2. The method for generating an enterprise automatic labeling model based on NLP technology according to claim 1, wherein in step S1, the Internet enterprise information is obtained from web crawler collection and from historical enterprise tag library data.
3. The method for generating an enterprise automatic labeling model based on NLP technology according to claim 1, wherein in step S2, processing the basic data source includes the following steps:
step S201, cleaning the data in the basic data source to remove interference items from the data;
step S202, performing word segmentation on the cleaned data in the basic data source;
and step S203, managing and supplementing the professional vocabulary and the stop-word vocabulary according to the word segmentation result of step S202.
4. The method for generating an enterprise automatic labeling model based on NLP technology according to claim 1, wherein in step S2, the weights of some of the professional vocabulary in the enterprise key information extracted from the processed basic data source using NLP technology are adjusted.
5. The method for generating an enterprise automatic labeling model based on NLP technology according to claim 1, wherein in step S3, the XGBoost algorithm is used for the model training.
6. The method according to claim 3, wherein in step S3, the model training on the enterprise tag data includes the following steps:
step S301, taking the enterprise tag data as the result set, and extracting vectorized data for the enterprise tag data using NLP technology;
and step S302, splitting off a training set, a validation set and a cross-validation set in combination with the result set, and then training the model.
7. An enterprise automatic labeling model generation system based on NLP technology, characterized by comprising the following modules:
a capturing module, used for capturing Internet enterprise information to form a basic data source;
a processing module, used for processing the basic data source and extracting enterprise key information from the processed basic data source using NLP (natural language processing) technology;
a model module, used for performing model training on the enterprise key information and the label data, in combination with the enterprise's original label data;
an iteration module, used for adjusting model parameters and changing input data based on the model training results, and iterating the model multiple times to generate a trained model;
and a generating module, used for supplementing model rules according to actual conditions to generate the automatic labeling model.
8. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing the method steps of any one of claims 1 to 6 when executing the program stored in the memory.
9. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, carries out the method steps of any one of claims 1 to 6.
CN202210859622.4A 2022-07-21 2022-07-21 NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium Pending CN115391519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210859622.4A CN115391519A (en) 2022-07-21 2022-07-21 NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210859622.4A CN115391519A (en) 2022-07-21 2022-07-21 NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115391519A (en) 2022-11-25

Family

ID=84116815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210859622.4A Pending CN115391519A (en) 2022-07-21 2022-07-21 NLP technology-based enterprise automatic labeling model generation method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115391519A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109783818A (en) * 2019-01-17 2019-05-21 上海三零卫士信息安全有限公司 A kind of enterprises ' industry multi-tag classification method
CN112287075A (en) * 2020-12-25 2021-01-29 北京智源人工智能研究院 Method and device for automatically acquiring enterprise multi-level classification training data
CN112632980A (en) * 2020-12-30 2021-04-09 广州友圈科技有限公司 Enterprise classification method and system based on big data deep learning and electronic equipment
CN113312476A (en) * 2021-02-03 2021-08-27 珠海卓邦科技有限公司 Automatic text labeling method and device and terminal
CN114491209A (en) * 2022-01-24 2022-05-13 南京中新赛克科技有限责任公司 Method and system for mining enterprise business label based on internet information capture

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599965A (en) * 2022-12-13 2023-01-13 山东中慧强企信息科技有限公司(Cn) Data economic informatization management system
CN115599965B (en) * 2022-12-13 2023-08-11 山东中慧强企信息科技有限公司 Data economy informatization management system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination