WO2021051560A1 - Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium - Google Patents

Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium

Publication number
WO2021051560A1
Authority
WO
WIPO (PCT)
Prior art keywords: word, training, text, vector, label
Application number
PCT/CN2019/117647
Inventor
郑立颖
徐亮
阮晓雯
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051560A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of artificial intelligence and discloses a text classification method and apparatus. The method comprises: performing word segmentation on a text to be classified to obtain a segmented word set corresponding to the text; vectorizing the segmented word set according to a preset word vector dictionary to obtain a word vector set corresponding to the text, the word vector dictionary fusing the fast text (FastText) vector and the word embedding vector corresponding to each segmented word; predicting a category label for the word vector set by means of a preset label prediction model, the label prediction model being obtained by training jointly on a training set and a test set, with the test set used to correct erroneous data in the training set; and obtaining the prediction result output by the label prediction model, the prediction result corresponding to the text category of the text. The present application can greatly improve the accuracy of text classification.

Description

Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium
Technical Field
This application claims priority to Chinese patent application No. 201910877110.9, filed on September 17, 2019 and entitled "Text classification method and apparatus, electronic device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
This application relates to the technical field of artificial intelligence, and in particular to a text classification method and apparatus, an electronic device, and a computer non-volatile readable storage medium.
Background
With the rapid development of network technology, there is a growing demand for organizing and managing electronic text information effectively and for retrieving relevant information from it quickly and comprehensively. Text classification, an important research direction in information processing, is a common approach to text information discovery.
The inventors realized that text classification is a technique that automatically classifies natural-language sentences according to a given classification system or standard and marks them with the corresponding categories. The processing is roughly divided into stages such as text preprocessing, text feature extraction, and classification model construction. Because the process is complicated, some common errors can easily prevent natural-language sentences from being classified accurately.
Technical Problem
Therefore, how to improve the accuracy of text classification is a technical problem that those skilled in the art need to keep studying.
Technical Solution
To solve the above technical problem, this application provides a text classification method and apparatus, an electronic device, and a computer non-volatile readable storage medium.
The technical solutions adopted in this application are as follows:
In one aspect, a text classification method comprises: performing word segmentation on a text to be classified to obtain a segmented word set corresponding to the text to be classified; vectorizing the segmented word set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, the word vector dictionary fusing the fast text (FastText) vector and the word embedding vector corresponding to each segmented word; performing category label prediction on the word vector set corresponding to the text to be classified by means of a preset label prediction model, the label prediction model being obtained by training jointly on a training set and a test set, with the test set used to correct erroneous data in the training set; and obtaining the prediction result output by the label prediction model, the prediction result corresponding to the text category of the text to be classified.
In another aspect, a text classification apparatus comprises: a word segmentation processor configured to perform word segmentation on a text to be classified to obtain a segmented word set corresponding to the text to be classified; a vectorization processor configured to vectorize the segmented word set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, the word vector dictionary fusing the fast text vector and the word embedding vector corresponding to each segmented word; a label predictor configured to perform category label prediction on the word vector set corresponding to the text to be classified by means of a preset label prediction model, the label prediction model being obtained by training jointly on a training set and a test set, with the test set configured to correct erroneous data in the training set; and a category obtainer configured to obtain the prediction result output by the label prediction model, the prediction result corresponding to the text category of the text to be classified.
In another aspect, an electronic device comprises a processor and a memory, the memory storing computer-readable instructions that, when executed by the processor, implement the text classification method described above.
In another aspect, a computer non-volatile readable storage medium stores a computer program that, when executed by a processor, implements the text classification method described above.
Beneficial Effects
The technical solutions provided by the embodiments of this application may have the following beneficial effects:
In the above technical solution, after word segmentation is performed on the text to be classified to obtain a segmented word set, the segmented word set is first vectorized according to the word vector dictionary to obtain the word vector set corresponding to the text to be classified, and category label prediction is then performed on the word vector set by the label prediction model. Because the word vector dictionary fuses the fast text vector and the word embedding vector corresponding to each segmented word, it is tolerant of out-of-vocabulary words and typos in the text to be classified, which makes the vectorization of the segmented words more accurate. In addition, because the label prediction model is trained jointly on the training set and the test set, whereas a conventional label prediction model is trained on the training set alone, erroneous data in the training set can be corrected automatically according to the test set during training, which improves the accuracy of the trained label prediction model. With more accurate word vectors and a more accurate label prediction model, the accuracy of text classification can therefore be greatly improved.
Description of the Drawings
The accompanying drawings are incorporated into and constitute a part of this specification, illustrate embodiments consistent with this application, and together with the specification serve to explain the principles of this application.
Fig. 1 is a schematic diagram of an implementation environment involved in this application according to an exemplary embodiment;
Fig. 2 is a hardware block diagram of a server according to an exemplary embodiment;
Fig. 3 is a flowchart of a text classification method according to an exemplary embodiment;
Fig. 4 is a flowchart of a text classification method according to another exemplary embodiment;
Fig. 5 is a flowchart of a text classification method according to another exemplary embodiment;
Fig. 6 is a flowchart of one embodiment of step 550 shown in Fig. 5;
Fig. 7 is a block diagram of a text classification apparatus according to an exemplary embodiment.
The above drawings show specific embodiments of this application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the concept of this application in any way, but to explain the concept of this application to those skilled in the art by reference to specific embodiments.
Embodiments of the Invention
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
Fig. 1 is a schematic diagram of an implementation environment involved in this application according to an exemplary embodiment. As shown in Fig. 1, the implementation environment includes a text acquisition client 100 and a text processing server 200 (referred to below as the text server 200).
A wired or wireless network connection is established in advance between the text acquisition client 100 and the text server 200 to enable interaction between the two.
The text acquisition client 100 is used to acquire text information and to transmit the acquired text information to the text server 200 for processing. For example, in an intelligent interview scenario, the text acquisition client 100 is an intelligent interview terminal that not only presents interview questions to the interviewee but also acquires the text information entered by the interviewee. When the interviewee's input is speech, the terminal applies speech recognition to convert the input speech into input text.
By way of example, the text acquisition client 100 may be an electronic device such as a smartphone, a tablet computer, a notebook computer, or a desktop computer, and the number of clients is not limited (Fig. 1 shows only two).
The text server 200 is used to process the text information transmitted by the text acquisition client 100 so as to implement the functions corresponding to the text acquisition client 100. For example, in the intelligent interview scenario above, the text server 200 scores the interviewee's performance according to the text information transmitted by the text acquisition client 100, thereby evaluating the interview result automatically.
When processing text information, the text server 200 inevitably needs to classify the received text information. In this implementation environment, the classification of the text to be classified is therefore performed by the text server 200.
By way of example, the text server 200 may be a single server or a server cluster composed of several servers, which is not limited here.
Fig. 2 is a block diagram of a server according to an exemplary embodiment. The server may be implemented as the text server 200 in the implementation environment shown in Fig. 1.
It should be noted that this server is merely one example suited to this application and should not be regarded as imposing any limitation on the scope of use of this application. Nor should the server be interpreted as needing to depend on, or necessarily having, one or more of the components of the exemplary server shown in Fig. 2.
The hardware structure of the server may vary considerably with configuration or performance. As shown in Fig. 2, the server includes a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU) 270.
The power supply 210 provides the operating voltage for the hardware devices on the server. The interface 230 includes at least one wired or wireless network interface 231, at least one serial-to-parallel conversion interface 233, at least one input/output interface 235, at least one USB interface 237, and so on, for communicating with external devices. The memory 250, as the carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like. The resources stored on it include an operating system 251, application programs 253, data 255, and so on, and the storage may be transient or permanent.
The operating system 251 manages and controls the hardware devices and the application programs 253 on the server so that the central processing unit 270 can compute on and process the massive data 255. It may be Windows Server™, Mac OS X™, Unix™, Linux™, or the like. An application program 253 is a computer program that performs at least one specific task on top of the operating system 251. It may include at least one module (not shown in Fig. 2), and each module may contain a series of computer-readable instructions for the server. The data 255 may be interface metadata or the like stored on a disk.
The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 over a bus in order to compute on and process the massive data 255 in the memory 250.
As described in detail above, a server to which this application applies carries out the text classification method described in the following embodiments by having the central processing unit 270 read a series of computer-readable instructions stored in the memory 250.
In addition, this application can equally be implemented by hardware circuits or by hardware circuits combined with software instructions. Implementation of this application is therefore not limited to any specific hardware circuit, software, or combination of the two.
Fig. 3 is a flowchart of a text classification method according to an exemplary embodiment. The method is applicable to the text server 200 in the implementation environment shown in Fig. 1 to classify input text. As shown in Fig. 3, the text classification method includes at least the following steps:
Step 310: perform word segmentation on the text to be classified to obtain the segmented word set corresponding to the text to be classified.
As mentioned above, text classification is the process of automatically classifying and labeling the text to be classified according to a given classification system, and the whole process is executed automatically by a computer device. During automatic classification, the computer device cannot handle some common errors. For example, when the text to be classified contains out-of-vocabulary words or typos, the computer device cannot accurately determine the meaning of the text, so its classification accuracy is low.
To solve this problem, this embodiment provides a text classification method that is highly tolerant of out-of-vocabulary words and typos in the text to be classified, thereby improving the accuracy of classifying that text.
It should be understood that an out-of-vocabulary word is a word in the text to be classified that cannot be found directly in the trained word vector dictionary. For example, "知识库" ("knowledge base") is a new word that emerged as computer technology developed and cannot be found directly in an ordinary word vector dictionary.
Word segmentation of the text to be classified is carried out by a Chinese word segmentation algorithm, which divides the text into a number of segmented words and thereby yields the segmented word set corresponding to the text.
By way of example, the Chinese word segmentation algorithm may be a vocabulary-based algorithm, such as the forward maximum matching algorithm (FMM), the backward maximum matching algorithm (BMM), or the bidirectional maximum matching algorithm (BM); a statistical-model-based algorithm, such as one based on an N-gram language model; or a sequence-labeling-based algorithm, such as one based on a hidden Markov model (HMM), a conditional random field (CRF), or an end-to-end deep learning model. The specific type of Chinese word segmentation algorithm is not limited here.
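Of the vocabulary-based options listed above, forward maximum matching is the simplest to illustrate. The following is a minimal sketch only: the function name `forward_max_match`, the `max_len` parameter, and the toy vocabulary are assumptions made for illustration and do not come from this application.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, for illustration only.
vocab = {"文本", "分类", "方法", "知识"}
print(forward_max_match("文本分类方法", vocab))  # ['文本', '分类', '方法']
print(forward_max_match("知识库", vocab))        # ['知识', '库']
```

The second call shows how a word missing from the vocabulary falls apart into smaller pieces, which is the situation the later vectorization step has to tolerate.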
It should be noted that word segmentation does not remove out-of-vocabulary words or typos from the text to be classified. If the text itself contains out-of-vocabulary words or typos, the corresponding segmented word set will contain them as well.
Step 330: vectorize the segmented word set according to a preset word vector dictionary to obtain the word vector set corresponding to the text to be classified, the word vector dictionary fusing the fast text vector and the word embedding vector corresponding to each segmented word.
The word vector dictionary used in this embodiment is obtained in advance through special training, so that vectorizing the segmented word set of the text to be classified with this dictionary is tolerant of out-of-vocabulary words and typos in the segmented word set.
Vectorizing the segmented word set according to the word vector dictionary means looking up, for each segmented word in the set, the corresponding word vector in the dictionary. The word vectors obtained from these lookups form the word vector set corresponding to the text to be classified.
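The lookup described above can be sketched as a plain dictionary query. This is an illustrative sketch under stated assumptions: `vectorize`, the two-dimensional toy vectors, and the zero-vector placeholder for words missing from the dictionary are all hypothetical; in the method itself, missing words are meant to be covered by the subword training described for the word vector dictionary rather than by a placeholder.

```python
def vectorize(tokens, word_vectors, dim):
    """Return one vector per token, in order, by dictionary lookup."""
    # Assumption: a zero vector stands in for any token absent from the dictionary.
    zero = [0.0] * dim
    return [word_vectors.get(tok, zero) for tok in tokens]


# Hypothetical two-dimensional word vector dictionary.
word_vectors = {"文本": [0.1, 0.3], "分类": [0.7, 0.2]}
vectors = vectorize(["文本", "分类", "方法"], word_vectors, dim=2)
print(vectors)  # [[0.1, 0.3], [0.7, 0.2], [0.0, 0.0]]
```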
The fast text vector fused into the word vector dictionary is the vector obtained by training a segmented word with the skip-gram mode of the fast text model (that is, the FastText model). In this embodiment, the subword length parameter (subword) of the skip-gram mode needs to be set to 1-2, so that when the fast text model vectorizes a segmented word, it splits the word into pieces of one or two characters for word vector training.
For an out-of-vocabulary word, word vector training with the fast text model splits the word into pieces of one or two characters, and the word vector of the out-of-vocabulary word can be obtained accurately by splicing the vectors of the split pieces. For example, when "知识库" ("knowledge base") is trained, it is split into "知识" ("knowledge") and "库" ("base"), each piece is trained, and the resulting vectors are spliced to obtain the word vector of "知识库". The trained word vector dictionary can therefore return an accurate word vector for an out-of-vocabulary word, which is what makes it tolerant of such words.
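The subword idea can be sketched by enumerating the one- and two-character pieces of a word and composing a vector from the pieces that have known vectors. This is a sketch under assumptions, not the training procedure of this application: `char_ngrams` and `compose_vector` are hypothetical names, the toy subword vectors are invented, and averaging is used as the composition rule (a common way to combine subword vectors) in place of the splicing the text speaks of.

```python
def char_ngrams(word, min_n=1, max_n=2):
    """List every substring of the word whose length is between min_n and max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams


def compose_vector(word, subword_vectors, dim):
    """Average the vectors of the word's known subwords (assumed composition rule)."""
    known = [subword_vectors[g] for g in char_ngrams(word) if g in subword_vectors]
    if not known:
        return [0.0] * dim  # assumption: nothing known, fall back to zeros
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


# Hypothetical subword vectors learned from other words.
subword_vectors = {"知识": [1.0, 0.0], "库": [0.0, 1.0]}
print(char_ngrams("知识库"))                          # ['知', '识', '库', '知识', '识库']
print(compose_vector("知识库", subword_vectors, 2))   # [0.5, 0.5]
```

Even though "知识库" itself has no entry, the pieces "知识" and "库" do, so a usable vector can still be produced.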
For typos, the subwords obtained by splitting the segmented words overlap, so a correctly written word and its misspelled form share most of their subwords and are given similar vector representations. The trained word vector dictionary can therefore compensate for typos. Correspondingly, the word embedding vector is the vector obtained by training a segmented word with the word embedding model (that is, the word2vec model).
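Why a typo ends up close to the intended word can be illustrated by measuring how many one- and two-character subwords the two strings share. The sketch below is illustrative only: `ngram_set` and `jaccard` are hypothetical helpers, the example strings (including the misspelling "知识裤") are invented, and set overlap is used here as a simple stand-in for similarity between trained vectors.

```python
def ngram_set(word, min_n=1, max_n=2):
    """Set of all substrings of the word with length between min_n and max_n."""
    return {word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Share of subwords common to both words, between 0.0 and 1.0."""
    sa, sb = ngram_set(a), ngram_set(b)
    return len(sa & sb) / len(sa | sb)


# "知识裤" is a hypothetical misspelling of "知识库"; "面试官" is an unrelated word.
print(jaccard("知识库", "知识裤"))  # 3 shared subwords out of 7 in total
print(jaccard("知识库", "面试官"))  # 0.0
```

The misspelling shares "知", "识", and "知识" with the correct word, while the unrelated word shares nothing.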
The network structure of the word embedding model contains a hidden layer. For segmented words that occur in text with a complex structure, vectorization training has to take full account of the word order information between segmented words in order to produce accurate word vectors, so the word embedding model can accurately produce the word vectors of segmented words in complex sentences.
This embodiment therefore vectorizes the segmented word set of the text to be classified with a word vector dictionary trained using both the fast text model and the word embedding model, which safeguards the accuracy of the resulting word vector set.
Step 350: perform category label prediction on the word vector set corresponding to the text to be classified by means of a preset label prediction model, the label prediction model being obtained by training jointly on a training set and a test set.
The label prediction model that performs category label prediction on the word vector set is likewise obtained through a special training procedure, so that it can predict labels accurately for the word vector set of the input text to be classified.
In ordinary label prediction model training, the training set is a data set containing a large number of training samples, which are used to train the model until it meets the required conditions. The test set is a data set containing a large number of test samples, which are used to test the trained label prediction model and do not take part in training.
In this embodiment, by contrast, both the training set and the test set are used to train the label prediction model. Because erroneous data in the training set degrades the accuracy of the trained model, the test set is used during training to correct the erroneous data in the training set automatically, and the corrected training set is then used to train the label prediction model. This greatly improves the training process and yields a more accurate label prediction model. By way of example, the erroneous data in the training set includes training samples annotated with the wrong category label.
It should be noted that this embodiment does not limit the specific type of the label prediction model. When training the label prediction model, the initial model can be chosen to suit the specific application scenario. By way of example, when the amount of data to be trained on is below a set threshold, a traditional machine learning model, such as an SVM (Support Vector Machine) model, may be chosen as the initial label prediction model. If the amount of data exceeds the set threshold, a deep learning model, such as a CNN (Convolutional Neural Network) model or an LSTM (Long Short-Term Memory) model, may be chosen as the initial label prediction model to be trained.
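The threshold rule for picking the initial model can be written as a small selection function. This is a sketch under assumptions: the name `choose_initial_model`, the default threshold of 10,000 samples, and the treatment of a data volume exactly equal to the threshold (which the text does not specify) are all hypothetical.

```python
def choose_initial_model(num_samples, threshold=10_000, deep_model="cnn"):
    """Pick the kind of initial label prediction model from the data volume."""
    if deep_model not in ("cnn", "lstm"):
        raise ValueError("deep_model must be 'cnn' or 'lstm'")
    # Below the threshold: a traditional machine learning model (SVM).
    # Assumption: a volume equal to the threshold is treated like a large one.
    return "svm" if num_samples < threshold else deep_model


print(choose_initial_model(2_000))                        # svm
print(choose_initial_model(500_000))                      # cnn
print(choose_initial_model(500_000, deep_model="lstm"))   # lstm
```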
Step 370: obtain the prediction result output by the label prediction model, the prediction result corresponding to the text category of the text to be classified.
The prediction result output by the label prediction model includes several text categories to which the text to be classified may belong, together with a probability value for each text category. The probability value indicates how likely the text to be classified is to belong to that category.
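A prediction result of this shape, candidate categories each paired with a probability, can be sketched as follows. The sketch assumes the model produces one raw score per category and that a softmax turns the scores into probabilities; the text does not say how the probabilities are computed, and the names `softmax` and `rank_categories`, the scores, and the category labels are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_categories(scores, labels):
    """Pair each category with its probability, most likely category first."""
    probs = softmax(scores)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical raw scores for three hypothetical categories.
result = rank_categories([2.0, 1.0, 0.1], ["technical", "behavioral", "other"])
for label, prob in result:
    print(f"{label}: {prob:.3f}")
```

The first entry of the ranked list is the category that would be reported as the text category of the text to be classified.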
Thus, the method provided by this embodiment copes with out-of-vocabulary words and wrongly written characters in the text to be classified, as well as with erroneous data in the training set that would otherwise make the trained label prediction model inaccurate. It can therefore accurately predict the text category of the text to be classified.
FIG. 4 is a flowchart of a text classification method according to another exemplary embodiment. As shown in FIG. 4, before step 310, the text classification method further includes the following steps:
Step 410: Obtain the corpus word segmentation lexicon on which word vector training is to be performed.
The corpus word segmentation lexicon is a large collection of segmented words prepared in advance. Word vector training is performed on each segmented word in the lexicon to obtain its word vector, and the segmented words together with their word vectors form the word vector dictionary.
It should be noted that the source of the corpus word segmentation lexicon differs between application scenarios. For example, in the intelligent interview scenario described above, the lexicon can be obtained by segmenting interview guides and interview questions found on the Internet, or by segmenting corpus data provided directly by the interview business party.
Step 430: For each segmented word in the corpus word segmentation lexicon, perform word vector training with the skip-gram mode of the fast text (fastText) model and with a word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the segmented word.
As described above, when word vector training is performed on the segmented words in the corpus word segmentation lexicon with the skip-gram mode of the fast text model, the sub-word length parameter (i.e., subword) in skip-gram mode needs to be changed from the default value of 3-6 to 1-2, so that the word vector dictionary trained in this embodiment is tolerant of out-of-vocabulary words and typos in the text to be classified.
It should be noted that, for a segmented word in the corpus word segmentation lexicon, if word vector training with the sub-word length parameter set to 1-2 yields multiple word vectors, the word vectors of the sub-words are concatenated in the order in which the segmented word was split into sub-words to obtain the word vector of the segmented word.
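The effect of the sub-word length setting can be sketched as follows. The helper below is a simplified, hypothetical stand-in for sub-word extraction: it lists the character n-grams of a segmented word for a given length range, and it omits the word-boundary markers that the fastText library adds around each word.

```python
def char_ngrams(word, min_n=1, max_n=2):
    """List every character n-gram of `word` with min_n <= n <= max_n.

    N-grams are emitted shortest first and, within one length,
    in left-to-right order.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


# With the 1-2 setting, a three-character word is split into single
# characters and adjacent character pairs.
subwords = char_ngrams("面试题")
```

In this simplified view, `char_ngrams("面试", 3, 6)` returns an empty list because the word has only two characters, whereas the 1-2 setting yields both characters and the pair. This is one way to see why shorter sub-words help with short Chinese words that are missing from the dictionary or contain a wrong character.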
Performing word vector training on each segmented word in the corpus word segmentation lexicon with the word embedding model, by contrast, takes the word order information between segmented words into account and yields accurate word vectors.
In other words, performing word vector training on each segmented word in the corpus word segmentation lexicon according to the method of this embodiment yields one fast text vector and one word embedding vector for that segmented word.
Step 450: Compute the average of the fast text vector and the word embedding vector corresponding to the segmented word, and take this average vector as the vector representation of the segmented word.
To make the word vector of each segmented word in the word vector dictionary represent that segmented word accurately, the word vector needs to fuse the fast text vector and the word embedding vector obtained in step 430.
In this embodiment, fusing the fast text vector and the word embedding vector into the word vector of a segmented word means adding the two vectors and then averaging the resulting sum. The result is the vector representation of the segmented word, which serves as the word vector corresponding to that segmented word in the word vector dictionary.
Step 470: Collect the vector representation of each segmented word in the corpus word segmentation lexicon to form the word vector dictionary.
Through steps 430 and 450, the vector representation of every segmented word in the corpus word segmentation lexicon can be obtained, so each segmented word together with its vector representation forms the word vector dictionary.
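Steps 450 and 470 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, vectors are plain lists of floats, and both vectors of a word are assumed to have the same dimension.

```python
def fuse_vectors(fast_text_vec, word_embedding_vec):
    """Step 450: element-wise average of the two vectors of one segmented word."""
    if len(fast_text_vec) != len(word_embedding_vec):
        raise ValueError("both vectors must have the same dimension")
    return [(a + b) / 2.0 for a, b in zip(fast_text_vec, word_embedding_vec)]


def build_word_vector_dictionary(fast_text_vectors, word_embedding_vectors):
    """Step 470: map each segmented word to its fused vector representation.

    Both arguments map a segmented word to its vector; only words present
    in both mappings are kept (an assumption of this sketch).
    """
    return {
        word: fuse_vectors(vec, word_embedding_vectors[word])
        for word, vec in fast_text_vectors.items()
        if word in word_embedding_vectors
    }


# Hypothetical two-dimensional vectors for one segmented word.
dictionary = build_word_vector_dictionary({"面试": [1.0, 3.0]}, {"面试": [3.0, 5.0]})
```

Looking up a segmented word in `dictionary` then returns the fused vector that is used when the text to be classified is vectorized.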
As described above, when the segmented word set of the text to be classified is vectorized, the word vector dictionary trained in this embodiment makes it possible to accurately look up the word vector of each segmented word in the set and thus accurately obtain the word vector set of the text to be classified.
FIG. 5 is a flowchart of a text classification method according to another exemplary embodiment. As shown in FIG. 5, before step 310, the text classification method further includes the following steps:
Step 510: Divide the annotated corpus on which the label prediction model is to be trained into a training set and a test set according to a set ratio; the annotated corpus carries annotated category labels.
The annotated corpus is a collection of texts annotated with category labels; each text annotated with a category label is also called a sample.
The annotated corpus also corresponds to the corpus word segmentation lexicon obtained in step 410. For example, in the application scenario described for step 410, the annotated corpus includes not only interview guides and interview questions from the Internet but also corpus data provided directly by the interview business party. Segmenting the annotated corpus yields the corresponding corpus word segmentation lexicon.
The ratio in which the annotated corpus is divided into the training set and the test set is preset. For example, the ratio of training set to test set can be 7:3, although the ratio is not limited here. It should be noted, however, that the training set should generally account for a larger share than the test set, because a training set with more data is more conducive to obtaining an accurate label prediction model.
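The division in step 510 can be illustrated with a short sketch. The function name, the shuffling, and the fixed random seed are assumptions of this example; the embodiment only requires that the annotated corpus be divided according to a set ratio such as 7:3.

```python
import random


def split_corpus(samples, train_ratio=0.7, seed=0):
    """Shuffle annotated samples and split them into training and test sets.

    `samples` is a list of (text, category_label) pairs.
    """
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical annotated samples split 7:3.
corpus = [(f"text {i}", "label") for i in range(10)]
training_set, test_set = split_corpus(corpus)
```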
Step 530: Perform initial training of the label prediction model to be trained using the training set.
As described above, the label prediction model used for initial training can be chosen according to the application scenario. For example, when the amount of data in the training set is below the set threshold, an SVM model can be used for initial training; when the amount of data in the training set exceeds the set threshold, a CNN model or an LSTM model can be used for initial training.
It should be noted that the purpose of initial training on the training set is to obtain an initial label prediction model. However, because the category labels annotated on training samples in the training set may be wrong, the category label predictions made by the label prediction model obtained from this initial training will be biased.
It is therefore necessary to automatically correct the incorrectly annotated category labels in the training set and then iteratively train the label prediction model on the corrected training set, so as to obtain a label prediction model with higher accuracy.
Step 550: Perform combined training of the label prediction model obtained by initial training using the training set and the test set, and correct the incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model.
After the initial label prediction model is obtained through initial training, combined training is performed on it using the training set and the test set. Combined training here means feeding the training set and the test set in turn into the initial label prediction model, to obtain the prediction results output by the model for each training sample in the training set and for each test sample in the test set.
Because the training set and the test set are both drawn from the annotated corpus, every training sample and test sample carries a category label annotated in advance. Comparing the prediction results output by the label prediction model with these pre-annotated category labels gives the accuracy of the model's label predictions on the training set and on the test set, respectively.
The accuracy for the training set is the proportion of training samples whose prediction result matches the pre-annotated category label, out of the total number of training samples. The accuracy for the test set is defined in the same way and is not repeated here.
The accuracies for the training set and the test set indicate how well the label prediction model obtained by initial training predicts. For example, if the accuracy for the training set is above 90% and the accuracy for the test set is above 85%, the label prediction model obtained by initial training predicts well; otherwise, the current label prediction model cannot achieve good prediction performance.
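The accuracy definition and the example check can be written out as a short sketch. The function names are hypothetical, and the 90% and 85% defaults are simply the example values given above rather than required settings.

```python
def accuracy(predicted_labels, annotated_labels):
    """Share of samples whose predicted label equals the annotated label."""
    if len(predicted_labels) != len(annotated_labels):
        raise ValueError("label lists must have the same length")
    if not annotated_labels:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted_labels, annotated_labels))
    return matches / len(annotated_labels)


def predicts_well(train_accuracy, test_accuracy,
                  train_threshold=0.90, test_threshold=0.85):
    """Apply the example rule: both accuracies must exceed their thresholds."""
    return train_accuracy > train_threshold and test_accuracy > test_threshold


# Three of four hypothetical predictions match the annotated labels.
train_acc = accuracy(["a", "b", "a", "c"], ["a", "b", "b", "c"])
```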
As described above, one possible cause of poor performance in the label prediction model obtained by initial training is that some training samples in the training set carry wrong pre-annotated category labels. The incorrectly annotated category labels in the training set therefore need to be corrected to obtain a correct training set.
Step 570: Update the training set with the corrected category labels, and iteratively perform the training process of the label prediction model using the test set and the updated training set until the label prediction model converges.
Iteratively performing the training process using the test set and the updated training set means that, after the updated training set is obtained, the operations described in steps 530 and 550 are repeated. That is, the label prediction model obtained by initial training is first trained again on the updated training set; combined training is then performed on the resulting model using the test set and the updated training set, and the prediction performance of the current label prediction model is evaluated. If the performance is poor, the correction of wrong category labels in the training set and the retraining of the label prediction model continue until the label prediction model converges. Convergence here means that the category predictions made by the label prediction model reach the set prediction accuracy.
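The overall loop of steps 530 to 570 can be sketched as follows. This is a simplified sketch under stated assumptions, not the embodiment's implementation: `fit` and `predict` are hypothetical callables supplied by the caller, convergence is checked on the test-set accuracy only, and label correction is reduced to overwriting each training label with the model's prediction.

```python
def set_accuracy(model, samples, predict):
    """Share of (features, label) samples whose prediction matches the label.

    `samples` is assumed to be non-empty.
    """
    hits = sum(predict(model, x) == y for x, y in samples)
    return hits / len(samples)


def train_until_converged(train_set, test_set, fit, predict,
                          target_accuracy=0.85, max_rounds=10):
    """Retrain and relabel until the test accuracy reaches the target.

    fit(samples) -> model and predict(model, features) -> label are
    hypothetical callables; max_rounds bounds the loop if the target
    accuracy is never reached.
    """
    train_set = list(train_set)
    model = fit(train_set)  # step 530: initial training
    for _ in range(max_rounds):
        if set_accuracy(model, test_set, predict) >= target_accuracy:
            break  # converged: the set prediction accuracy is reached
        # Step 550, simplified: replace each training label with the prediction.
        train_set = [(x, predict(model, x)) for x, _ in train_set]
        model = fit(train_set)  # step 570: retrain on the updated training set
    return model, train_set
```

A fuller version would route each disputed label through the probability check and manual review described for steps 555 to 559 instead of always trusting the model.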
Thus, the method provided by this embodiment can train a label prediction model with high prediction accuracy. In an actual application scenario, the label prediction model predicts on the word vector set of the text to be classified and obtains an accurate prediction result.
FIG. 6 is a flowchart of step 550 shown in FIG. 5 in an exemplary embodiment. As shown in FIG. 6, the process of correcting the incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model includes the following steps:
Step 551: Based on the results output by the label prediction model, compute the accuracy of the model's label predictions for the training set and for the test set, respectively.
As described above, the accuracy of the label prediction model's predictions for the training set is the proportion of training samples whose prediction result matches the pre-annotated category label, out of the total number of training samples. Accordingly, the accuracy is obtained by counting the training samples whose prediction result matches the pre-annotated category label and dividing that count by the total number of training samples in the training set.
The accuracy of the label prediction model's predictions for the test set is computed in the same way and is not repeated here.
Step 553: When the accuracies for the training set and the test set are both below the set accuracy thresholds, select from the training set the training samples whose prediction result is inconsistent with the annotated category label, to form a training sample set.
The accuracy thresholds set for the training set and the test set may be the same or different. In general, because the current label prediction model was obtained by initial training on the training set, its accuracy on the training set is higher, so the corresponding accuracy threshold should also be larger.
The accuracy threshold can be determined from the samples annotated with category labels. For example, for the prediction results output by the current label prediction model for the training set, the probability values of all correctly predicted category labels (these probability values are output directly by the label prediction model) are collected into a probability value set, and statistical analysis is performed on that set. In one embodiment, the statistical analysis consists of finding the probability value at the 50% quantile of the probability value set and taking that probability value as the accuracy threshold.
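The quantile-based threshold can be sketched with the Python standard library. The function name and the input layout are assumptions of this example; the 50% quantile is computed as the median of the probability values of the correctly predicted samples.

```python
import statistics


def accuracy_threshold(predictions, annotated_labels):
    """Return the median probability over correctly predicted samples.

    `predictions` is a list of (predicted_label, probability) pairs aligned
    with `annotated_labels`.
    """
    correct = [
        prob
        for (label, prob), gold in zip(predictions, annotated_labels)
        if label == gold
    ]
    if not correct:
        raise ValueError("no correctly predicted samples to analyze")
    return statistics.median(correct)  # the 50% quantile


# Hypothetical predictions: the first three are correct, the last is not.
threshold = accuracy_threshold(
    [("a", 0.9), ("b", 0.6), ("a", 0.8), ("b", 0.7)],
    ["a", "b", "a", "a"],
)
```

Note that `statistics.median` averages the two middle values when the number of correct predictions is even, so the threshold may not coincide with any single observed probability value.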
Step 555: Obtain the predicted probability value corresponding to the training sample set by calculating the probability that, for the training sample set, the prediction result is correct and the annotated category label is wrong.
The predicted probability value corresponding to the training sample set indicates the probability that the corresponding training samples carry incorrectly annotated category labels. When the predicted probability value is higher than the set probability threshold, the training samples are very likely mislabeled, and the process jumps to step 557. When the predicted probability value is lower than the set probability threshold, the training samples are less likely to be mislabeled, and the process jumps to step 559.
Step 557: Correct the category labels of the training samples in the training sample set to match the prediction results output by the label prediction model.
Step 559: Obtain manually entered category labels and use them to correct the category labels of the training samples in the training sample set.
When the probability that training samples are mislabeled is small, human judgment is needed to determine whether the category labels of the training samples in the training sample set are correct and to correct the mislabeled ones. By obtaining the correct category label entered manually and using it to replace the wrong category label annotated on the training sample, the category labels of the training samples in the training sample set are corrected.
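The routing between steps 557 and 559 can be sketched as a single function. All names are hypothetical: `ask_human` stands in for whatever mechanism collects the manually entered label, and sending a probability exactly equal to the threshold to manual review is an assumption, since the embodiment only describes the higher and lower cases.

```python
def correct_label(annotated, predicted, mislabel_probability,
                  probability_threshold, ask_human):
    """Return the category label to keep for one training sample.

    `mislabel_probability` is the predicted probability value from step 555.
    `ask_human(annotated, predicted)` is a hypothetical callback that returns
    the manually entered category label.
    """
    if predicted == annotated:
        return annotated  # prediction and annotation agree, nothing to correct
    if mislabel_probability > probability_threshold:
        return predicted  # step 557: adopt the model's prediction
    return ask_human(annotated, predicted)  # step 559: manual correction


# Hypothetical reviewer who keeps the original annotation.
label = correct_label("a", "b", 0.2, 0.5, lambda annotated, predicted: annotated)
```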
The method provided by this embodiment automatically corrects the incorrectly annotated category labels of training samples, thereby obtaining an accurate label prediction model.
FIG. 7 is a block diagram of a text classification apparatus according to an exemplary embodiment. As shown in FIG. 7, the apparatus includes a word segmentation processor 610, a vectorization processor 630, a label predictor 650, and a category obtainer 670.
The word segmentation processor 610 is configured to perform word segmentation on the text to be classified to obtain the segmented word set corresponding to the text to be classified. The vectorization processor 630 is configured to vectorize the segmented word set according to a preset word vector dictionary to obtain the word vector set corresponding to the text to be classified, where the word vector dictionary fuses the fast text vectors and word embedding vectors corresponding to segmented words. The label predictor 650 is configured to perform category label prediction on the word vector set corresponding to the text to be classified by means of a preset label prediction model, where the label prediction model is obtained by training jointly on a training set and a test set, and the test set is used to correct erroneous data in the training set. The category obtainer 670 is configured to obtain the prediction result output by the label prediction model, where the prediction result corresponds to the text category of the text to be classified.
In an exemplary embodiment, the text classification apparatus further includes a corpus word segmentation lexicon obtainer, a word vector trainer, a vector representation fuser, and a word vector dictionary obtainer (not shown in FIG. 7). The corpus word segmentation lexicon obtainer is configured to obtain the corpus word segmentation lexicon on which word vector training is to be performed. The word vector trainer is configured to perform word vector training on each segmented word in the corpus word segmentation lexicon with the skip-gram mode of the fast text model and with a word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the segmented word. The vector representation fuser is configured to compute the average of the fast text vector and the word embedding vector corresponding to the segmented word and to take the average vector as the vector representation of the segmented word. The word vector dictionary obtainer is configured to collect the vector representation of each segmented word in the corpus word segmentation lexicon to form the word vector dictionary.
In an exemplary embodiment, the text classification apparatus further includes an annotated corpus allocator, an initial model trainer, a category label corrector, and an iterative model trainer. The annotated corpus allocator is configured to divide the annotated corpus on which the label prediction model is to be trained into a training set and a test set according to a set ratio, where the annotated corpus carries annotated category labels. The initial model trainer is configured to perform initial training of the label prediction model to be trained using the training set. The category label corrector is configured to perform combined training of the label prediction model obtained by initial training using the training set and the test set, and to correct the incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model. The iterative model trainer is configured to update the training set with the corrected category labels and to iteratively perform the training process of the label prediction model using the test set and the updated training set until the label prediction model converges.
It should be noted that the apparatus provided in the above embodiments and the method provided in the above embodiments belong to the same concept. The specific manner in which each component performs its operations has been described in detail in the method embodiments and is not repeated here.
In an exemplary embodiment, the present application further provides an electronic device, which includes a processor and a memory. The memory stores computer-readable instructions that, when executed by the processor, implement the text classification method described above.
In an exemplary embodiment, the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the text classification method described above.
It should be understood that the present application is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (20)

  1. A text classification method, comprising:
    performing word segmentation on a text to be classified to obtain a segmented word set corresponding to the text to be classified;
    vectorizing the segmented word set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, wherein the word vector dictionary fuses the fast text vectors and word embedding vectors corresponding to segmented words;
    performing category label prediction on the word vector set corresponding to the text to be classified by means of a preset label prediction model, wherein the label prediction model is obtained by training jointly on the training set and the test set, and the test set is used to correct erroneous data in the training set;
    obtaining the prediction result output by the label prediction model, wherein the prediction result corresponds to the text category of the text to be classified.
  2. The method according to claim 1, wherein, before the performing word segmentation on the text to be classified to obtain the segmented word set of the text to be classified, the method further comprises:
    obtaining a corpus word segmentation lexicon on which word vector training is to be performed;
    performing, for each segmented word in the corpus word segmentation lexicon, word vector training with the skip-gram mode of a fast text model and with a word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the segmented word;
    computing the average vector of the fast text vector and the word embedding vector corresponding to the segmented word, and taking the average vector as the vector representation of the segmented word;
    collecting the vector representation of each segmented word in the corpus word segmentation lexicon to form the word vector dictionary.
  3. The method according to claim 2, wherein the sub-word length parameter in the skip-gram mode indicates that the segmented word is split into sub-words of one character or two characters for the word vector training.
  4. The method according to claim 1, wherein, before the performing word segmentation on the text to be classified to obtain the segmented word set of the text to be classified, the method further comprises:
    dividing, according to a set ratio, an annotated corpus on which the label prediction model is to be trained into a training set and a test set, wherein the annotated corpus carries annotated category labels;
    performing initial training of the label prediction model to be trained using the training set;
    performing combined training of the label prediction model obtained by initial training using the training set and the test set, and correcting the incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model;
    updating the training set with the corrected category labels, and iteratively performing the training process of the label prediction model using the test set and the updated training set until the label prediction model converges.
  5. The method according to claim 4, wherein correcting the incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model comprises:
    calculating, according to the results output by the label prediction model, the accuracy of the label predictions made by the label prediction model for the training set and for the test set respectively;
    when the accuracies corresponding to the training set and the test set are both lower than a set accuracy threshold, selecting the set of training samples in the training set whose predicted labels are inconsistent with their annotated category labels;
    obtaining a prediction probability value corresponding to the set of training samples by calculating the probability that, in the set of training samples, the prediction result is correct and the category label is incorrectly annotated;
    when the prediction probability value is lower than a set probability threshold, obtaining manually input category labels to correct the category labels annotated on the training samples in the set of training samples.
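As a non-limiting illustration of the correction steps of claim 5, the Python sketch below computes the two accuracies, selects the training samples whose predictions disagree with their annotations, and requests manual labels when the estimated probability falls below the threshold. The callables `estimate_prob` and `ask_human` are hypothetical placeholders: the claim does not state how the probability is computed or how manual input is collected. Leaving the labels untouched when the probability is at or above the threshold is likewise an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (text, annotated category label)


def accuracy(predicted: Sequence[str], labeled: Sequence[str]) -> float:
    """Fraction of samples whose predicted label equals the annotated one."""
    if not labeled:
        return 0.0
    return sum(p == y for p, y in zip(predicted, labeled)) / len(labeled)


def correct_training_labels(
    train: List[Sample], train_pred: Sequence[str],
    test: List[Sample], test_pred: Sequence[str],
    acc_threshold: float, prob_threshold: float,
    estimate_prob: Callable[[List[Tuple[Sample, str]]], float],
    ask_human: Callable[[str], str],
) -> Tuple[List[Sample], int]:
    """Return the (possibly) corrected training set and the number of changes."""
    train_acc = accuracy(train_pred, [y for _, y in train])
    test_acc = accuracy(test_pred, [y for _, y in test])
    # Correction is attempted only when both accuracies are below the threshold.
    if not (train_acc < acc_threshold and test_acc < acc_threshold):
        return list(train), 0
    # Training samples whose prediction disagrees with the annotation.
    suspect = [i for i, ((_, y), p) in enumerate(zip(train, train_pred)) if p != y]
    if not suspect:
        return list(train), 0
    # Hypothetical estimate of "prediction correct and annotation wrong".
    prob = estimate_prob([(train[i], train_pred[i]) for i in suspect])
    if prob >= prob_threshold:
        return list(train), 0  # assumption: the claim does not cover this branch
    updated, changed = list(train), 0
    for i in suspect:
        text, old = updated[i]
        new = ask_human(text)  # manually input category label
        if new != old:
            updated[i] = (text, new)
            changed += 1
    return updated, changed
```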
  6. A text classification apparatus, comprising:
    a word segmentation processor, configured to obtain a word segmentation set corresponding to a text to be classified by performing word segmentation on the text to be classified;
    a vectorization processor, configured to vectorize the word segmentation set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, the word vector dictionary fusing the fast text vector and the word embedding vector corresponding to each segmented word;
    a label predictor, configured to perform category label prediction on the word vector set corresponding to the text to be classified by means of a preset label prediction model, the label prediction model being obtained by training jointly on the training set and the test set, the test set being configured to correct erroneous data in the training set;
    a category obtainer, configured to obtain the prediction result output by the label prediction model, the prediction result corresponding to the text category of the text to be classified.
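As a non-limiting illustration of how the components of claim 6 fit together, the Python sketch below chains segmentation, dictionary lookup and label prediction. The callables `segment` and `predict` are hypothetical stand-ins for the word segmentation processor and the label prediction model, and skipping words that are absent from the dictionary is an assumption not stated in the claim.

```python
from typing import Callable, Dict, List

Vector = List[float]


def vectorize(tokens: List[str], dictionary: Dict[str, Vector]) -> List[Vector]:
    """Look up each segmented word in the preset word vector dictionary."""
    # Assumption: words missing from the dictionary are simply skipped.
    return [dictionary[t] for t in tokens if t in dictionary]


def classify(text: str,
             segment: Callable[[str], List[str]],
             dictionary: Dict[str, Vector],
             predict: Callable[[List[Vector]], str]) -> str:
    """Segment the text, vectorize the segments, and predict a category label."""
    tokens = segment(text)                   # word segmentation set
    vectors = vectorize(tokens, dictionary)  # word vector set
    return predict(vectors)                  # predicted text category
```

A toy call could pass `str.split` as `segment` and a small rule as `predict`; a real system would substitute a proper word segmenter and the trained model.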
  7. The apparatus according to claim 6, wherein the apparatus further comprises:
    a corpus word segmentation lexicon obtainer, configured to obtain a corpus word segmentation lexicon on which word vector training is to be performed;
    a word vector trainer, configured to perform word vector training on each segmented word in the corpus word segmentation lexicon using the continuous skip-gram mode of a fast text (fastText) model and using a word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the segmented word;
    a vector expression fuser, configured to calculate the average vector of the fast text vector and the word embedding vector corresponding to a segmented word, and to take the average vector as the vector expression corresponding to the segmented word;
    a word vector dictionary obtainer, configured to obtain the vector expression corresponding to each segmented word in the corpus word segmentation lexicon to form the word vector dictionary.
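As a non-limiting illustration of the fusion step of claim 7, the Python sketch below averages the two vectors of each segmented word element by element and collects the results into a dictionary. The names `fuse` and `build_dictionary` are hypothetical, and the sketch assumes that both vectors of a word have the same dimension.

```python
from typing import Dict, Iterable, List

Vector = List[float]


def fuse(fast_vec: Vector, embed_vec: Vector) -> Vector:
    """Element-wise average of a fast text vector and a word embedding vector."""
    if len(fast_vec) != len(embed_vec):
        # Assumption: both models were trained with the same vector dimension.
        raise ValueError("vectors must have the same dimension")
    return [(a + b) / 2 for a, b in zip(fast_vec, embed_vec)]


def build_dictionary(vocab: Iterable[str],
                     fast_vectors: Dict[str, Vector],
                     embed_vectors: Dict[str, Vector]) -> Dict[str, Vector]:
    """Map every segmented word in the lexicon to its fused vector expression."""
    return {w: fuse(fast_vectors[w], embed_vectors[w]) for w in vocab}
```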
  8. The apparatus according to claim 7, wherein a subword length parameter in the continuous skip-gram mode is configured to indicate that the segmented word is split into units of 1 character or 2 characters for the word vector training.
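As a non-limiting illustration of the subword length parameter of claim 8, the Python sketch below enumerates the 1-character and 2-character units of a segmented word. The function name `char_ngrams` and its parameters are hypothetical, and the sketch omits any word-boundary markers. In the open-source fastText library, comparable bounds are exposed as the `minn` and `maxn` options; whether that library is the one used here is an assumption.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 1, max_n: int = 2) -> List[str]:
    """List all character n-grams of the word with min_n <= n <= max_n."""
    return [word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]


# Splitting a four-character segmented word into 1- and 2-character units.
units = char_ngrams("文本分类")
```

Here `units` holds the four single characters followed by the three adjacent character pairs.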
  9. The apparatus according to claim 6, wherein the apparatus further comprises:
    a labeled corpus distributor, configured to divide, according to a set ratio, a labeled corpus to be used for training the label prediction model into a training set and a test set, the labeled corpus containing annotated category labels;
    a model initial trainer, configured to perform initial training on the label prediction model to be trained according to the training set;
    a category label corrector, configured to perform combined training on the initially trained label prediction model using the training set and the test set respectively, and to correct incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model;
    a model iterative trainer, configured to update the training set according to the corrected category labels, and to iteratively perform the training process of the label prediction model using the test set and the updated training set until the label prediction model converges.
  10. The apparatus according to claim 9, wherein the category label corrector comprises:
    an accuracy calculator, configured to calculate, according to the results output by the label prediction model, the accuracy of the label predictions made by the label prediction model for the training set and for the test set respectively;
    a sample filter, configured to select, when the accuracies corresponding to the training set and the test set are both lower than a set accuracy threshold, the set of training samples in the training set whose predicted labels are inconsistent with their annotated category labels;
    a prediction probability obtainer, configured to obtain a prediction probability value corresponding to the set of training samples by calculating the probability that, in the set of training samples, the prediction result is correct and the category label is incorrectly annotated;
    a label corrector, configured to obtain, when the prediction probability value is lower than a set probability threshold, manually input category labels to correct the category labels annotated on the training samples in the set of training samples.
  11. An electronic device, comprising:
    a processor;
    and a memory on which computer-readable instructions are stored, wherein, when the computer-readable instructions are executed by the processor, the processor is configured to implement the following steps:
    obtaining a word segmentation set corresponding to a text to be classified by performing word segmentation on the text to be classified;
    vectorizing the word segmentation set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, the word vector dictionary fusing the fast text vector and the word embedding vector corresponding to each segmented word;
    performing category label prediction on the word vector set corresponding to the text to be classified by means of a preset label prediction model, the label prediction model being obtained by training jointly on the training set and the test set, the test set being configured to correct erroneous data in the training set;
    obtaining the prediction result output by the label prediction model, the prediction result corresponding to the text category of the text to be classified.
  12. The electronic device according to claim 11, wherein, before obtaining the word segmentation set of the text to be classified by performing word segmentation on the text to be classified, the processor is configured to implement the following steps:
    obtaining a corpus word segmentation lexicon on which word vector training is to be performed;
    performing word vector training on each segmented word in the corpus word segmentation lexicon using the continuous skip-gram mode of a fast text (fastText) model and using a word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the segmented word;
    calculating the average vector of the fast text vector and the word embedding vector corresponding to a segmented word, and taking the average vector as the vector expression corresponding to the segmented word;
    obtaining the vector expression corresponding to each segmented word in the corpus word segmentation lexicon to form the word vector dictionary.
  13. The electronic device according to claim 12, wherein a subword length parameter in the continuous skip-gram mode is configured to indicate that the segmented word is split into units of 1 character or 2 characters for the word vector training.
  14. The electronic device according to claim 11, wherein, before obtaining the word segmentation set of the text to be classified by performing word segmentation on the text to be classified, the processor is configured to implement the following steps:
    dividing, according to a set ratio, a labeled corpus to be used for training the label prediction model into a training set and a test set, the labeled corpus containing annotated category labels;
    performing initial training on the label prediction model to be trained according to the training set;
    performing combined training on the initially trained label prediction model using the training set and the test set respectively, and correcting incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model;
    updating the training set according to the corrected category labels, and iteratively performing the training process of the label prediction model using the test set and the updated training set until the label prediction model converges.
  15. The electronic device according to claim 14, wherein, to correct the incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model, the processor is configured to implement the following steps:
    calculating, according to the results output by the label prediction model, the accuracy of the label predictions made by the label prediction model for the training set and for the test set respectively;
    when the accuracies corresponding to the training set and the test set are both lower than a set accuracy threshold, selecting the set of training samples in the training set whose predicted labels are inconsistent with their annotated category labels;
    obtaining a prediction probability value corresponding to the set of training samples by calculating the probability that, in the set of training samples, the prediction result is correct and the category label is incorrectly annotated;
    when the prediction probability value is lower than a set probability threshold, obtaining manually input category labels to correct the category labels annotated on the training samples in the set of training samples.
  16. A computer non-volatile readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the processor is configured to implement the following steps:
    obtaining a word segmentation set corresponding to a text to be classified by performing word segmentation on the text to be classified;
    vectorizing the word segmentation set according to a preset word vector dictionary to obtain a word vector set corresponding to the text to be classified, the word vector dictionary fusing the fast text vector and the word embedding vector corresponding to each segmented word;
    performing category label prediction on the word vector set corresponding to the text to be classified by means of a preset label prediction model, the label prediction model being obtained by training jointly on the training set and the test set, the test set being configured to correct erroneous data in the training set;
    obtaining the prediction result output by the label prediction model, the prediction result corresponding to the text category of the text to be classified.
  17. The computer non-volatile readable storage medium according to claim 16, wherein, before obtaining the word segmentation set of the text to be classified by performing word segmentation on the text to be classified, the processor is configured to implement the following steps:
    obtaining a corpus word segmentation lexicon on which word vector training is to be performed;
    performing word vector training on each segmented word in the corpus word segmentation lexicon using the continuous skip-gram mode of a fast text (fastText) model and using a word embedding model, respectively, to obtain the fast text vector and the word embedding vector corresponding to the segmented word;
    calculating the average vector of the fast text vector and the word embedding vector corresponding to a segmented word, and taking the average vector as the vector expression corresponding to the segmented word;
    obtaining the vector expression corresponding to each segmented word in the corpus word segmentation lexicon to form the word vector dictionary.
  18. The computer non-volatile readable storage medium according to claim 17, wherein a subword length parameter in the continuous skip-gram mode is configured to indicate that the segmented word is split into units of 1 character or 2 characters for the word vector training.
  19. The computer non-volatile readable storage medium according to claim 16, wherein, before obtaining the word segmentation set of the text to be classified by performing word segmentation on the text to be classified, the processor is configured to implement the following steps:
    dividing, according to a set ratio, a labeled corpus to be used for training the label prediction model into a training set and a test set, the labeled corpus containing annotated category labels;
    performing initial training on the label prediction model to be trained according to the training set;
    performing combined training on the initially trained label prediction model using the training set and the test set respectively, and correcting incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model;
    updating the training set according to the corrected category labels, and iteratively performing the training process of the label prediction model using the test set and the updated training set until the label prediction model converges.
  20. The computer non-volatile readable storage medium according to claim 19, wherein, to correct the incorrectly annotated category labels in the training set according to the prediction results output by the label prediction model, the processor is configured to implement the following steps:
    calculating, according to the results output by the label prediction model, the accuracy of the label predictions made by the label prediction model for the training set and for the test set respectively;
    when the accuracies corresponding to the training set and the test set are both lower than a set accuracy threshold, selecting the set of training samples in the training set whose predicted labels are inconsistent with their annotated category labels;
    obtaining a prediction probability value corresponding to the set of training samples by calculating the probability that, in the set of training samples, the prediction result is correct and the category label is incorrectly annotated;
    when the prediction probability value is lower than a set probability threshold, obtaining manually input category labels to correct the category labels annotated on the training samples in the set of training samples.
PCT/CN2019/117647 2019-09-17 2019-11-12 Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium WO2021051560A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910877110.9A CN110717039B (en) 2019-09-17 2019-09-17 Text classification method and apparatus, electronic device, and computer-readable storage medium
CN201910877110.9 2019-09-17

Publications (1)

Publication Number Publication Date
WO2021051560A1 (en) 2021-03-25

Family

ID=69209890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117647 WO2021051560A1 (en) 2019-09-17 2019-11-12 Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium

Country Status (2)

Country Link
CN (1) CN110717039B (en)
WO (1) WO2021051560A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139053A (en) * 2021-04-15 2021-07-20 广东工业大学 Text classification method based on self-supervision contrast learning
CN113688244A (en) * 2021-08-31 2021-11-23 中国平安人寿保险股份有限公司 Text classification method, system, device and storage medium based on neural network
CN113704073A (en) * 2021-09-02 2021-11-26 交通运输部公路科学研究所 Method for detecting abnormal data of automobile maintenance record library
CN113821589A (en) * 2021-06-10 2021-12-21 腾讯科技(深圳)有限公司 Text label determination method and device, computer equipment and storage medium
CN113822074A (en) * 2021-06-21 2021-12-21 腾讯科技(深圳)有限公司 Content classification method and device, electronic equipment and storage medium
CN114020877A (en) * 2021-11-18 2022-02-08 中科雨辰科技有限公司 Data processing system for labeling text
CN114139531A (en) * 2021-11-30 2022-03-04 哈尔滨理工大学 Medical entity prediction method and system based on deep learning
CN114817526A (en) * 2022-02-21 2022-07-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal
CN114861650A (en) * 2022-04-13 2022-08-05 大箴(杭州)科技有限公司 Method and device for cleaning noise data, storage medium and electronic equipment
CN115495314A (en) * 2022-09-30 2022-12-20 中国电信股份有限公司 Log template identification method and device, electronic equipment and readable medium
CN116541705A (en) * 2023-05-06 2023-08-04 石家庄铁道大学 Training method of text classification model and text classification method
CN113822074B (en) * 2021-06-21 2024-05-10 腾讯科技(深圳)有限公司 Content classification method, device, electronic equipment and storage medium

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259658B (en) * 2020-02-05 2022-08-19 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation
CN113111897A (en) * 2020-02-13 2021-07-13 北京明亿科技有限公司 Alarm receiving and warning condition type determining method and device based on support vector machine
CN111309912B (en) * 2020-02-24 2024-02-13 深圳市华云中盛科技股份有限公司 Text classification method, apparatus, computer device and storage medium
CN111291564B (en) * 2020-03-03 2023-10-31 腾讯科技(深圳)有限公司 Model training method, device and storage medium for word vector acquisition
CN111382271B (en) * 2020-03-09 2023-05-23 支付宝(杭州)信息技术有限公司 Training method and device of text classification model, text classification method and device
CN111444340B (en) * 2020-03-10 2023-08-11 腾讯科技(深圳)有限公司 Text classification method, device, equipment and storage medium
CN111401066B (en) * 2020-03-12 2022-04-12 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device
CN111460148A (en) * 2020-03-27 2020-07-28 深圳价值在线信息科技股份有限公司 Text classification method and device, terminal equipment and storage medium
CN111460101B (en) * 2020-03-30 2023-09-15 广州视源电子科技股份有限公司 Knowledge point type identification method, knowledge point type identification device and knowledge point type identification processor
CN111539209B (en) * 2020-04-15 2023-09-15 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN111209377B (en) * 2020-04-23 2020-08-04 腾讯科技(深圳)有限公司 Text processing method, device, equipment and medium based on deep learning
CN111666407A (en) * 2020-04-24 2020-09-15 苏宁云计算有限公司 Text classification method and device
CN111597334A (en) * 2020-04-30 2020-08-28 陈韬文 Method, system, device and medium for classifying text of electrical drawings
CN111680803B (en) * 2020-06-02 2023-09-01 中国电力科学研究院有限公司 Operation checking work ticket generation system
CN111651601B (en) * 2020-06-02 2023-04-18 全球能源互联网研究院有限公司 Training method and classification method for fault classification model of power information system
CN111680804B (en) * 2020-06-02 2023-09-01 中国电力科学研究院有限公司 Method, equipment and computer readable medium for generating operation checking work ticket
CN112819023B (en) * 2020-06-11 2024-02-02 腾讯科技(深圳)有限公司 Sample set acquisition method, device, computer equipment and storage medium
CN111695052A (en) * 2020-06-12 2020-09-22 上海智臻智能网络科技股份有限公司 Label classification method, data processing device and readable storage medium
CN111708888B (en) * 2020-06-16 2023-10-24 腾讯科技(深圳)有限公司 Classification method, device, terminal and storage medium based on artificial intelligence
CN111813941A (en) * 2020-07-23 2020-10-23 北京来也网络科技有限公司 Text classification method, device, equipment and medium combining RPA and AI
CN112749557A (en) * 2020-08-06 2021-05-04 腾讯科技(深圳)有限公司 Text processing model construction method and text processing method
CN111930943B (en) * 2020-08-12 2022-09-02 中国科学技术大学 Method and device for detecting pivot bullet screen
CN112052356B (en) * 2020-08-14 2023-11-24 腾讯科技(深圳)有限公司 Multimedia classification method, apparatus and computer readable storage medium
CN112289398A (en) * 2020-08-17 2021-01-29 上海柯林布瑞信息技术有限公司 Pathological report analysis method and device, storage medium and terminal
CN112084334B (en) * 2020-09-04 2023-11-21 中国平安财产保险股份有限公司 Label classification method and device for corpus, computer equipment and storage medium
CN113761184A (en) * 2020-09-29 2021-12-07 北京沃东天骏信息技术有限公司 Text data classification method, equipment and storage medium
CN112307752A (en) * 2020-10-30 2021-02-02 平安科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN112307209B (en) * 2020-11-05 2024-04-26 江西高创保安服务技术有限公司 Short text classification method and system based on character vector
CN112100385B (en) * 2020-11-11 2021-02-09 震坤行网络技术(南京)有限公司 Single label text classification method, computing device and computer readable storage medium
CN112434165B (en) * 2020-12-17 2023-11-07 广州视源电子科技股份有限公司 Ancient poetry classification method, device, terminal equipment and storage medium
CN112767022B (en) * 2021-01-13 2024-02-27 湖南天添汇见企业管理咨询服务有限责任公司 Mobile application function evolution trend prediction method and device and computer equipment
CN112800226A (en) * 2021-01-29 2021-05-14 上海明略人工智能(集团)有限公司 Method for obtaining text classification model, method, device and equipment for text classification
CN112801425B (en) * 2021-03-31 2021-07-02 腾讯科技(深圳)有限公司 Method and device for determining information click rate, computer equipment and storage medium
CN113807096A (en) * 2021-04-09 2021-12-17 京东科技控股股份有限公司 Text data processing method and device, computer equipment and storage medium
CN113159921A (en) * 2021-04-23 2021-07-23 上海晓途网络科技有限公司 Overdue prediction method and device, electronic equipment and storage medium
CN113011533B (en) * 2021-04-30 2023-10-24 平安科技(深圳)有限公司 Text classification method, apparatus, computer device and storage medium
CN113268979B (en) * 2021-04-30 2023-06-27 清华大学 Artificial intelligent text analysis method and related equipment based on double dictionary model
CN113297379A (en) * 2021-05-25 2021-08-24 善诊(上海)信息技术有限公司 Text data multi-label classification method and device
CN113127607A (en) * 2021-06-18 2021-07-16 贝壳找房(北京)科技有限公司 Text data labeling method and device, electronic equipment and readable storage medium
CN113434675A (en) * 2021-06-25 2021-09-24 竹间智能科技(上海)有限公司 Label correction method and system
CN113761938B (en) * 2021-09-06 2023-12-08 上海明略人工智能(集团)有限公司 Method and device for training NLP model, electronic equipment and storage medium
CN113722493B (en) * 2021-09-09 2023-10-13 北京百度网讯科技有限公司 Text classification data processing method, apparatus and storage medium
CN114254588B (en) * 2021-12-16 2023-10-13 马上消费金融股份有限公司 Data tag processing method and device
CN114661990A (en) * 2022-03-23 2022-06-24 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for data prediction and model training

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357531A1 (en) * 2015-11-27 2018-12-13 Devanathan GIRIDHARI Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof
US20190065589A1 (en) * 2016-03-25 2019-02-28 Quad Analytix Llc Systems and methods for multi-modal automated categorization
CN110019792A (en) * 2017-10-30 2019-07-16 阿里巴巴集团控股有限公司 File classification method and device and sorter model training method
CN110188199A (en) * 2019-05-21 2019-08-30 北京鸿联九五信息产业有限公司 A kind of file classification method for intelligent sound interaction

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092596B (en) * 2017-04-24 2020-08-04 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR
CN107943911A (en) * 2017-11-20 2018-04-20 北京大学深圳研究院 Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing
CN109948140B (en) * 2017-12-20 2023-06-23 普天信息技术有限公司 Word vector embedding method and device
CN108334605B (en) * 2018-02-01 2020-06-16 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN108897829B (en) * 2018-06-22 2020-08-04 广州多益网络股份有限公司 Data label correction method, device and storage medium
CN109918497A (en) * 2018-12-21 2019-06-21 厦门市美亚柏科信息股份有限公司 A kind of file classification method, device and storage medium based on improvement textCNN model

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139053B (en) * 2021-04-15 2024-03-05 广东工业大学 Text classification method based on self-supervision contrast learning
CN113139053A (en) * 2021-04-15 2021-07-20 广东工业大学 Text classification method based on self-supervision contrast learning
CN113821589A (en) * 2021-06-10 2021-12-21 腾讯科技(深圳)有限公司 Text label determination method and device, computer equipment and storage medium
CN113822074A (en) * 2021-06-21 2021-12-21 腾讯科技(深圳)有限公司 Content classification method and device, electronic equipment and storage medium
CN113822074B (en) * 2021-06-21 2024-05-10 腾讯科技(深圳)有限公司 Content classification method, device, electronic equipment and storage medium
CN113688244A (en) * 2021-08-31 2021-11-23 中国平安人寿保险股份有限公司 Text classification method, system, device and storage medium based on neural network
CN113704073A (en) * 2021-09-02 2021-11-26 交通运输部公路科学研究所 Method for detecting abnormal data of automobile maintenance record library
CN114020877A (en) * 2021-11-18 2022-02-08 中科雨辰科技有限公司 Data processing system for labeling text
CN114020877B (en) * 2021-11-18 2024-05-10 中科雨辰科技有限公司 Data processing system for labeling text
CN114139531A (en) * 2021-11-30 2022-03-04 哈尔滨理工大学 Medical entity prediction method and system based on deep learning
CN114139531B (en) * 2021-11-30 2024-05-14 哈尔滨理工大学 Medical entity prediction method and system based on deep learning
CN114817526B (en) * 2022-02-21 2024-03-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal
CN114817526A (en) * 2022-02-21 2022-07-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal
CN114861650B (en) * 2022-04-13 2024-04-26 大箴(杭州)科技有限公司 Noise data cleaning method and device, storage medium and electronic equipment
CN114861650A (en) * 2022-04-13 2022-08-05 大箴(杭州)科技有限公司 Method and device for cleaning noise data, storage medium and electronic equipment
CN115495314A (en) * 2022-09-30 2022-12-20 中国电信股份有限公司 Log template identification method and device, electronic equipment and readable medium
CN116541705A (en) * 2023-05-06 2023-08-04 石家庄铁道大学 Training method of text classification model and text classification method

Also Published As

Publication number Publication date
CN110717039B (en) 2023-10-13
CN110717039A (en) 2020-01-21

Similar Documents

Publication Publication Date Title
WO2021051560A1 (en) Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium
CN111309915B (en) Method, system, device and storage medium for training natural language of joint learning
US10747962B1 (en) Artificial intelligence system using phrase tables to evaluate and improve neural network based machine translation
US9373075B2 (en) Applying a genetic algorithm to compositional semantics sentiment analysis to improve performance and accelerate domain adaptation
US20180075368A1 (en) System and Method of Advising Human Verification of Often-Confused Class Predictions
CN107273356B (en) Artificial intelligence based word segmentation method, device, server and storage medium
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
US11443209B2 (en) Method and system for unlabeled data selection using failed case analysis
CN110795938B (en) Text sequence word segmentation method, device and storage medium
US11763203B2 (en) Methods and arrangements to adjust communications
US11551437B2 (en) Collaborative information extraction
WO2020215456A1 (en) Text labeling method and device based on teacher forcing
US11615241B2 (en) Method and system for determining sentiment of natural language text content
WO2023197613A1 (en) Small sample fine-tuning method and system and related apparatus
WO2022222300A1 (en) Open relationship extraction method and apparatus, electronic device, and storage medium
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
US11934781B2 (en) Systems and methods for controllable text summarization
US10147020B1 (en) System and method for computational disambiguation and prediction of dynamic hierarchical data structures
CN110807086A (en) Text data labeling method and device, storage medium and electronic equipment
US20220351634A1 (en) Question answering systems
CN114792089A (en) Method, apparatus and program product for managing computer system
US11126797B2 (en) Toxic vector mapping across languages
WO2021174814A1 (en) Answer verification method and apparatus for crowdsourcing task, computer device, and storage medium
CN115906854A (en) Multi-level confrontation-based cross-language named entity recognition model training method
WO2022271369A1 (en) Training of an object linking model

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19945513; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19945513; Country of ref document: EP; Kind code of ref document: A1)