WO2021128721A1 - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
WO2021128721A1
WO2021128721A1 (PCT/CN2020/092099)
Authority
WO
WIPO (PCT)
Prior art keywords
text
model
training
corpus
classified
Prior art date
Application number
PCT/CN2020/092099
Other languages
English (en)
Chinese (zh)
Inventor
张禄
及洪泉
姚晓明
胡彩娥
丁屹峰
王培祎
马龙飞
陆斯悦
王健
徐蕙
Original Assignee
国网北京市电力公司
国家电网有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国网北京市电力公司, 国家电网有限公司
Publication of WO2021128721A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • The present invention relates to the field of text classification, and in particular to a text classification processing method and device.
  • As an important part of the ubiquitous power Internet of Things application, the 95598 customer service system has registered massive amounts of customer information. At present, work order analysis relies mainly on manual statistics, which leads to problems such as insufficient efficiency. Because the volume of customer demand data in the 95598 system is large, manual classification is inefficient and cannot achieve accurate, efficient classification.
  • The embodiments of the present invention provide a text classification processing method and device to at least solve the technical problem that, in the prior art, text is classified manually.
  • According to one aspect of the embodiments of the present invention, a text classification processing method is provided, including: obtaining the text to be classified; inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; using the output obtained from the model as the category corresponding to the text to be classified; and saving the text to be classified and its corresponding category.
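As a non-authoritative illustration only (the publication defines no code), the four claimed steps map onto a routine like the following; the model object, its `predict` method, and the in-memory store are hypothetical names, not part of the publication:

```python
# Minimal sketch of the claimed method: obtain, classify, output, save.
# The model and its predict() method are assumptions for illustration.
def classify_and_save(model, text: str, store: list) -> str:
    category = model.predict(text)                       # output of the trained model
    store.append({"text": text, "category": category})  # save text with its category
    return category
```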
  • Before acquiring the text to be classified, the method further includes: using multiple sets of training data to train through machine learning to obtain the model.
  • Training to obtain the model through machine learning includes: using a first corpus to perform pre-training to obtain a first model, and using a second corpus to perform iterative training on the first model to obtain the model, where the second corpus includes multiple sets of data and each set of data includes a text and the category corresponding to the text.
  • Using the first corpus to perform pre-training to obtain the first model includes: using the first corpus to train through BERT, where part of the content of each piece of corpus is masked during the training, and the training is used to predict the masked content.
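For orientation, masked pre-training of the kind described here can be sketched with the HuggingFace transformers library; the checkpoint name, the example sentence, and the 15% masking rate below are assumptions, not details fixed by the publication:

```python
# Sketch of BERT-style pre-training: hide part of each piece of corpus,
# then train the model to predict the concealed tokens.
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("用户反映小区突然停电")])   # mask part of the corpus entry
loss = model(input_ids=batch["input_ids"],
             labels=batch["labels"]).loss              # loss on the masked positions
loss.backward()                                        # one pre-training step
```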
  • The text includes a work order text.
  • The category includes a type of the work order, where the type includes at least one category.
  • According to another aspect of the embodiments of the present invention, a text classification processing device is provided, including: an acquisition module for acquiring the text to be classified; an input module for inputting the text to be classified into a model, where the model is obtained through machine learning training using training data; an output module for using the output obtained from the model as the category corresponding to the text to be classified; and a storage module for saving the text to be classified and its corresponding category.
  • The device further includes a training module configured to use multiple sets of training data to train through machine learning to obtain the model.
  • The training module includes: a first training unit for pre-training using a first corpus to obtain the first model; and a second training unit for iteratively training the first model using a second corpus to obtain the model, where the second corpus includes multiple sets of data and each set of data includes a text and the category corresponding to the text.
  • The first training unit is configured to use the first corpus to train through BERT to obtain the first model, where part of the content of each piece of corpus is masked during the training, and the training is used to predict the masked content.
  • The text includes a work order text.
  • The category includes a type of the work order, where the type includes at least one category.
  • According to another aspect of the embodiments of the present invention, a storage medium is provided, including a stored program, where, when the program runs, the device where the storage medium is located is controlled to execute any one of the above text classification processing methods.
  • According to another aspect of the embodiments of the present invention, a processor is provided, configured to run a program, where any one of the text classification processing methods described above is executed when the program runs.
  • In the embodiments of the present invention, the text to be classified is obtained; the text to be classified is input into a model, where the model is obtained through machine learning training using training data; the output obtained from the model is used as the category corresponding to the text to be classified; and the text to be classified and its corresponding category are saved. In this way, the model obtained through machine learning training recognizes the category corresponding to the text to be classified and saves it, achieving the purpose of fast and accurate classification and the technical effect of improving the efficiency of text classification, thereby solving the technical problem that text is classified manually in the prior art.
  • Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention;
  • Fig. 2 is a flowchart of the training of a classification model according to an optional embodiment of the present invention; and
  • Fig. 3 is a schematic diagram of a text classification processing device according to an embodiment of the present invention.
  • An embodiment of a text classification processing method is provided. It should be noted that the steps shown in the flowchart of the accompanying drawings can be executed in a computer system, such as a set of computer-executable instructions, and, although a logical sequence is shown in the flowchart, in some cases the steps shown or described may be performed in a different order than described here.
  • Fig. 1 is a flowchart of a text classification processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
  • Step S102: obtain the text to be classified.
  • The above-mentioned text to be classified includes, but is not limited to, a work order.
  • The text to be classified can be obtained in a variety of ways, for example, using crawling software, manual entry, and so on.
  • Using multiple methods to obtain the text to be classified expands the sources of the text, making the method suitable for a variety of application scenarios.
  • Step S104: input the text to be classified into the model, where the model is obtained through machine learning training using training data.
  • The text to be classified can thus be processed through the model.
  • The model is a work order classification model. It should be noted that the above model is obtained through machine learning training using training data and can realize automatic text classification.
  • Step S106: use the output obtained from the model as the category corresponding to the text to be classified.
  • In this way, the text to be classified that is input can be matched in the output with its corresponding category.
  • The model can effectively improve classification accuracy and the efficiency of text classification.
  • Step S108: save the text to be classified and its corresponding category.
  • The text to be classified and its corresponding category can be saved in a predetermined format, where the predetermined format includes a text attribute and a category attribute: the text to be classified is saved in the location of the text attribute, and the category corresponding to the text to be classified is saved in the location of the category attribute. It should be noted that the specific implementation is not limited to this method.
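One concrete realisation of the "predetermined format" (purely an illustrative choice; the publication fixes no serialisation) is a JSON line with a text attribute and a category attribute:

```python
import json

def save_classified(path: str, text: str, category: str) -> None:
    # One record per line: the text attribute holds the classified text,
    # the category attribute holds the category output by the model.
    record = {"text": text, "category": category}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```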
  • Through the above steps, the model obtained by machine learning training can be used to identify the category corresponding to the text to be classified and to save it, so as to achieve fast and accurate classification, thereby improving the efficiency of text classification and solving the prior-art problem of classifying text manually.
  • Before obtaining the text to be classified, the method further includes: using multiple sets of training data to train through machine learning to obtain a model.
  • Using multiple sets of training data means using a large amount of training data; a model obtained through machine learning training on a large amount of training data therefore has a better recognition or prediction effect, which greatly improves classification accuracy.
  • The attention mechanism in the Transformer can be used to replace the original RNN. When an RNN is trained, the calculation at the current step depends on the hidden state of the previous step; that is, training is a sequential process, and each calculation must wait for the previous one to finish before it can proceed.
  • The Transformer does not use an RNN, and all calculations can be performed in parallel, thereby increasing the speed of training.
  • In an RNN, the data of the first frame must be passed through frames 2, 3, 4, 5 ... 9 in turn before it reaches the tenth frame and an interaction between the two is produced.
  • By that point, the data of the first frame may already be biased, so neither the speed nor the accuracy of this interaction is guaranteed.
  • In the Transformer, because of self-attention, there is a direct link, and thus a direct dependency, between any two frames, no matter how far apart they are; this can improve the accuracy of training.
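The "direct link between any two frames" can be made concrete with scaled dot-product self-attention; the sketch below omits the learned query/key/value projections of a full Transformer and is illustrative rather than taken from the publication:

```python
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Simplified self-attention over a (seq_len, dim) sequence.

    One matrix product scores every pair of positions, so frame 1 and
    frame 10 interact directly instead of passing through frames 2..9.
    """
    scores = x @ x.T / x.size(-1) ** 0.5        # (seq_len, seq_len) pairwise links
    return torch.softmax(scores, dim=-1) @ x    # weighted mix of all frames at once

out = self_attention(torch.randn(10, 64))       # ten "frames", computed in parallel
```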
  • Training to obtain a model through machine learning includes: using a first corpus to perform pre-training to obtain a first model, and using a second corpus to perform iterative training on the first model to obtain the model, where the second corpus includes multiple sets of data and each set of data includes a text and the category corresponding to the text.
  • That is, the first model can be pre-trained and then iteratively trained, through the first corpus and the second corpus respectively, to obtain the final model.
  • Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and a category corresponding to the text.
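The second, supervised stage can be pictured as attaching a classification head to the pre-trained encoder and iterating over (text, category) pairs; the checkpoint name, label count, learning rate, and sample pair below are assumptions for illustration:

```python
# Sketch of iterative training on the first model with the second corpus.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese",
                                                      num_labels=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

second_corpus = [("用户咨询电费账单", 2)]       # (text, category id) pairs
for text, label in second_corpus:              # iterate over the labelled sets
    inputs = tokenizer(text, return_tensors="pt")
    loss = model(**inputs, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```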
  • Using the first corpus to perform pre-training to obtain the first model includes: using the first corpus to train through BERT, where part of the content of each piece of corpus is masked during training and the training is used to predict the masked content.
  • The above-mentioned BERT includes a Transformer encoder. When it is used to predict masked content, all the tokens corresponding to a masked word are masked together; at the same time, with the overall masking rate unchanged, the first model predicts the token of each masked word independently.
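This masking scheme corresponds to what is commonly called whole-word masking; a toy illustration with a hypothetical sub-token grouping follows:

```python
import random

def whole_word_mask(words, rate=0.15, mask_token="[MASK]"):
    """Mask all sub-tokens of a chosen word together (whole-word masking)."""
    out = []
    for sub_tokens in words:                       # each word = list of sub-tokens
        if random.random() < rate:
            out += [mask_token] * len(sub_tokens)  # every token of the word is masked
        else:
            out += sub_tokens
    return out

# E.g. ["play", "##ing"] is either fully masked or fully kept, never split.
print(whole_word_mask([["play", "##ing"], ["chess"]], rate=0.5))
```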
  • The text includes the work order text.
  • The category includes the type of the work order, where the type includes at least one category.
  • The aforementioned work order text may include, but is not limited to, 95598 work orders, where the types of work orders can be divided according to application requirements; for example, different work order types can be distinguished by distance, entry time, and work order level.
  • Figure 2 is a flowchart of the training of the classification model according to an optional embodiment of the present invention.
  • When a customer service call is taken, the agent manually enters the content of the work order in two parts, a category and a text. After the corresponding cleaning and proofreading work is done on the category and the text, the text content enters the already trained classification model. The predictions of the classification model are then compared with the manually entered categories, and the evaluation index of the current model is obtained to assess its performance.
  • The current model performance is used to determine whether the new comparison results and text content need to be used to continue to tune and update the model. This ensures the real-time effectiveness of the model, avoids uncertain model deviations, and gives the model the possibility of continuous use and optimization.
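The monitoring decision described above might look like the sketch below; the accuracy metric, the `predict` method, and the 0.9 threshold are assumptions, since the publication does not name a specific evaluation index:

```python
def evaluate_and_flag(model, records, threshold=0.9):
    """Compare model predictions with manually entered categories.

    Returns the current evaluation index (accuracy here) and whether the
    comparison results should be used to continue tuning the model.
    """
    hits = sum(model.predict(text) == label for text, label in records)
    accuracy = hits / len(records)
    return accuracy, accuracy < threshold   # True -> re-tune with new data
```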
  • In this way, an automatic text-based classification function can be provided for 95598 work orders; real-time monitoring and display of model performance facilitate model maintenance; the model can be continuously updated and optimized during the actual business process; and it has a certain adaptability to trend changes in the text work orders, in keeping with the way the model is used in the actual business process.
  • FIG. 3 is a schematic diagram of the text classification processing apparatus according to an embodiment of the present invention. As shown in FIG. 3, the text classification processing device includes: an acquisition module 302, an input module 304, an output module 306, and a storage module 308. The device is described in detail below.
  • The acquisition module 302 is used to obtain the text to be classified.
  • The input module 304, connected to the above-mentioned acquisition module 302, is used to input the text to be classified into the model, where the model is obtained through machine learning training using training data.
  • The output module 306, connected to the above-mentioned input module 304, is used to take the output obtained from the model as the category corresponding to the text to be classified.
  • The saving module 308, connected to the aforementioned output module 306, is used to save the text to be classified and its corresponding category.
  • The above device can recognize, through the model obtained by machine learning training, the category corresponding to the text to be classified and save it, so as to achieve fast and accurate classification, thereby improving the efficiency of text classification and solving the prior-art technical problem of relying on manual methods to classify text.
  • The above-mentioned acquisition module 302, input module 304, output module 306, and saving module 308 correspond to steps S102 to S108 in Embodiment 1.
  • The above modules and the corresponding steps implement the same examples and application scenarios, but are not limited to the content disclosed in Embodiment 1 above. It should be noted that, as part of the device, the above-mentioned modules can be executed in a computer system, such as a set of computer-executable instructions.
  • The device further includes a training module configured to use multiple sets of training data to train through machine learning to obtain the model.
  • Using multiple sets of training data means using a large amount of training data; a model obtained through machine learning training on a large amount of training data therefore has a better recognition or prediction effect, which greatly improves classification accuracy.
  • The attention mechanism in the Transformer can be used to replace the original RNN. When an RNN is trained, the calculation at the current step depends on the hidden state of the previous step; that is, training is a sequential process, and each calculation must wait for the previous one to finish before it can proceed.
  • The Transformer does not use an RNN, and all calculations can be performed in parallel, thereby increasing the speed of training.
  • In an RNN, the data of the first frame must be passed through frames 2, 3, 4, 5 ... 9 in turn before it reaches the tenth frame and an interaction between the two is produced.
  • By that point, the data of the first frame may already be biased, so neither the speed nor the accuracy of this interaction is guaranteed.
  • In the Transformer, because of self-attention, there is a direct link, and thus a direct dependency, between any two frames, no matter how far apart they are; this can improve the accuracy of training.
  • The training module includes: a first training unit for pre-training using the first corpus to obtain the first model; and a second training unit for iteratively training the first model using the second corpus to obtain the model, where the second corpus includes multiple sets of data and each set of data includes a text and the category corresponding to the text.
  • That is, the first model can be pre-trained and then iteratively trained through the first corpus and the second corpus to obtain the final model.
  • Both the first corpus and the second corpus include multiple sets of data, and each set of data includes a text and a category corresponding to the text.
  • The first training unit is used to: use the first corpus to train through BERT to obtain the first model, where part of the content of each piece of corpus is masked in the training, and the training is used to predict the masked content.
  • The above-mentioned BERT includes a Transformer encoder. When it is used to predict masked content, all the tokens corresponding to a masked word are masked together; at the same time, with the overall masking rate unchanged, the first model predicts the token of each masked word independently.
  • The text includes the work order text.
  • The category includes the type of the work order, where the type includes at least one category.
  • The aforementioned work order text may include, but is not limited to, 95598 work orders, where the types of work orders can be divided according to application requirements; for example, different work order types can be distinguished by distance, entry time, and work order level.
  • A storage medium is provided that includes a stored program, where the device where the storage medium is located is controlled to execute any one of the above-mentioned text classification processing methods when the program runs.
  • A processor is provided that is configured to run a program, where any one of the text classification processing methods described above is executed when the program runs.
  • The disclosed technical content can be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division of the units may be a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or a communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
  • The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • The functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above-mentioned integrated unit can be implemented in the form of hardware or of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of the present invention in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium.
  • It includes several instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment.
  • The aforementioned storage media include: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Text classification method and device. The method comprises the steps of: acquiring a text to be classified (S102); inputting the text into a model, the model being obtained through machine learning training using training data (S104); using the output acquired from the model as the category corresponding to the text (S106); and storing the text and its corresponding category (S108). This solves the technical problem that text classification in the prior art relies on a manual approach.
PCT/CN2020/092099 2019-12-25 2020-05-25 Text classification method and device WO2021128721A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911360673.7 2019-12-25
CN201911360673.7A CN111209394A (zh) 2019-12-25 2019-12-25 文本分类处理方法和装置

Publications (1)

Publication Number Publication Date
WO2021128721A1 (fr)

Family

ID=70786462

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092099 WO2021128721A1 (fr) 2019-12-25 2020-05-25 Text classification method and device

Country Status (2)

Country Link
CN (1) CN111209394A (fr)
WO (1) WO2021128721A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111861201A (zh) * 2020-07-17 2020-10-30 南京汇宁桀信息科技有限公司 一种基于大数据分类算法的政务智能派单的方法
CN112949674A (zh) * 2020-08-22 2021-06-11 上海昌投网络科技有限公司 一种多模型融合的语料生成方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213860A (zh) * 2018-07-26 2019-01-15 中国科学院自动化研究所 融合用户信息的文本情感分类方法及装置
CN109670167A (zh) * 2018-10-24 2019-04-23 国网浙江省电力有限公司 一种基于Word2Vec的电力客服工单情感量化分析方法
CN109710825A (zh) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 一种基于机器学习的网页有害信息识别方法
US10354203B1 (en) * 2018-01-31 2019-07-16 Sentio Software, Llc Systems and methods for continuous active machine learning with document review quality monitoring
CN110489521A (zh) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 文本类别检测方法、装置、电子设备和计算机可读介质

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032644A (zh) * 2019-04-03 2019-07-19 人立方智能科技有限公司 语言模型预训练方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354203B1 (en) * 2018-01-31 2019-07-16 Sentio Software, Llc Systems and methods for continuous active machine learning with document review quality monitoring
CN109213860A (zh) * 2018-07-26 2019-01-15 中国科学院自动化研究所 融合用户信息的文本情感分类方法及装置
CN109670167A (zh) * 2018-10-24 2019-04-23 国网浙江省电力有限公司 一种基于Word2Vec的电力客服工单情感量化分析方法
CN109710825A (zh) * 2018-11-02 2019-05-03 成都三零凯天通信实业有限公司 一种基于机器学习的网页有害信息识别方法
CN110489521A (zh) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 文本类别检测方法、装置、电子设备和计算机可读介质

Also Published As

Publication number Publication date
CN111209394A (zh) 2020-05-29

Similar Documents

Publication Publication Date Title
CN109635117B (zh) 一种基于知识图谱识别用户意图方法及装置
CN110516067B (zh) 基于话题检测的舆情监控方法、系统及存储介质
WO2020125445A1 (fr) Procédé d'entraînement de modèle de classification, procédé de classification, dispositif et support
Xie et al. Detecting duplicate bug reports with convolutional neural networks
WO2021051517A1 (fr) Procédé de récupération d'informations basé sur un réseau neuronal convolutif, et dispositif associé
CN112070138B (zh) 多标签混合分类模型的构建方法、新闻分类方法及系统
US11741094B2 (en) Method and system for identifying core product terms
KR20200127020A (ko) 의미 텍스트 데이터를 태그와 매칭시키는 방법, 장치 및 명령을 저장하는 컴퓨터 판독 가능한 기억 매체
US20220277005A1 (en) Semantic parsing of natural language query
CN107205016A (zh) 物联网设备的检索方法
CN108108426A (zh) 自然语言提问的理解方法、装置及电子设备
CN110866799A (zh) 使用人工智能监视在线零售平台的系统和方法
WO2023065642A1 (fr) Procédé d'examen minutieux de corpus, procédé d'optimisation de modèle de reconnaissance d'intention, dispositif et support de stockage
WO2021128721A1 (fr) Procédé et dispositif de classification de texte
CN112966089A (zh) 基于知识库的问题处理方法、装置、设备、介质和产品
US20220100967A1 (en) Lifecycle management for customized natural language processing
KR20210063882A (ko) 효율적 문서 분류 처리를 지원하는 지식 그래프 기반 마케팅 정보 분석 서비스 제공 방법 및 그 장치
CN107480270A (zh) 一种基于用户反馈数据流的实时个性化推荐方法及系统
KR20210063878A (ko) 지식 그래프 기반 마케팅 정보 분석 챗봇 서비스 제공 방법 및 그 장치
CN113553431A (zh) 用户标签提取方法、装置、设备及介质
WO2023093116A1 (fr) Procédé et appareil pour déterminer un noeud de chaîne industrielle d'une entreprise, et terminal et support de stockage
Lo et al. An emperical study on application of big data analytics to automate service desk business process
CN116090450A (zh) 一种文本处理方法及计算设备
CN115438658A (zh) 一种实体识别方法、识别模型的训练方法和相关装置
US11295091B2 (en) Systems and method for intent messaging

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20905055

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20905055

Country of ref document: EP

Kind code of ref document: A1