CN117216272A - Training method of text classification model, text classification method, device and equipment


Info

Publication number
CN117216272A
CN117216272A (application number CN202311182896.5A)
Authority
CN
China
Prior art keywords: text, target, word, determining, named entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311182896.5A
Other languages
Chinese (zh)
Inventor
Tang Haihao (唐海浩)
Shi Dongsheng (石东升)
Jiang Tao (姜涛)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311182896.5A priority Critical patent/CN117216272A/en
Publication of CN117216272A publication Critical patent/CN117216272A/en
Pending legal-status Critical Current


Abstract

The disclosure provides a training method of a text classification model, a text classification method, a device and equipment, relating to the technical field of artificial intelligence, and in particular to the technical fields of natural language processing and deep learning. The training method includes the following steps: acquiring an original text, a first label corresponding to the original text, and a target named entity, and determining a first text segment from the original text based on the target named entity, the first text segment including a first number of words; determining whether the first text segment includes a plurality of target reference words whose reference is consistent with the target named entity; in response to determining that the first text segment includes a plurality of target reference words whose reference is consistent with the target named entity, splicing the clauses in which the plurality of target reference words are respectively located to obtain a first text sample; and training the text classification model based on the first text sample and the first label.

Description

Training method of text classification model, text classification method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of natural language processing and deep learning, and specifically to a training method of a text classification model, a text classification method, a training apparatus of a text classification model, a text classification apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make a computer mimic certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning), and it involves techniques at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include natural language processing, computer vision, speech recognition, machine learning/deep learning, big data processing, and knowledge graph technologies.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a training method of a text classification model, a text classification method, a training apparatus of a text classification model, a text classification apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a training method of a text classification model, including: acquiring an original text, a first label corresponding to the original text, and a target named entity, and determining a first text segment from the original text based on the target named entity, the first text segment including a first number of words; determining whether the first text segment includes a plurality of target reference words whose reference is consistent with the target named entity; in response to determining that the first text segment includes a plurality of target reference words whose reference is consistent with the target named entity, splicing the clauses in which the plurality of target reference words are respectively located to obtain a first text sample; and training the text classification model based on the first text sample and the first label.
According to another aspect of the present disclosure, there is provided a text classification method, including: acquiring a text to be predicted and a target named entity, and determining a second text segment from the text to be predicted based on the target named entity, the second text segment including a first number of words; determining whether the second text segment includes a plurality of target reference words whose reference is consistent with the target named entity; in response to determining that the second text segment includes a plurality of target reference words whose reference is consistent with the target named entity, splicing the clauses in which the plurality of target reference words are respectively located to obtain a first text to be classified; and acquiring a text classification result output by a text classification model based on the first text to be classified.
According to another aspect of the present disclosure, there is provided a training apparatus of a text classification model, including: a first acquisition unit configured to acquire an original text, a first label corresponding to the original text, and a target named entity, and determine a first text segment from the original text based on the target named entity, the first text segment including a first number of words; a first determination unit configured to determine whether the first text segment includes a plurality of target reference words whose reference is consistent with the target named entity; a first splicing unit configured to, in response to determining that the first text segment includes a plurality of target reference words whose reference is consistent with the target named entity, splice the clauses in which the plurality of target reference words are respectively located to obtain a first text sample; and a training unit configured to train the text classification model based on the first text sample and the first label.
According to another aspect of the present disclosure, there is provided a text classification apparatus, including: a second acquisition unit configured to acquire a text to be predicted and a target named entity, and determine a second text segment from the text to be predicted based on the target named entity, the second text segment including a first number of words; a second determination unit configured to determine whether the second text segment includes a plurality of target reference words whose reference is consistent with the target named entity; a second splicing unit configured to, in response to determining that the second text segment includes a plurality of target reference words whose reference is consistent with the target named entity, splice the clauses in which the plurality of target reference words are respectively located to obtain a first text to be classified; and a third acquisition unit configured to acquire a text classification result output by the text classification model based on the first text to be classified.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described method.
According to one or more embodiments of the present disclosure, the present disclosure determines a text segment related to a target named entity in an original text, extracts from the text segment the clauses that include target reference words consistent with the target named entity reference, and splices these clauses to obtain a text sample for training a text classification model. In this way, the obtained text sample is guaranteed to carry sufficient context information related to the target named entity while irrelevant noise is removed, which improves the accuracy of the trained text classification model. In addition, the method requires no manual labeling and has low computational cost, so a large number of samples can be generated quickly to train the text classification model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a training method of a text classification model according to an embodiment of the disclosure;
FIG. 3 illustrates a flow diagram for determining a first text segment from original text based on a target named entity according to an embodiment of the present disclosure;
FIG. 4 illustrates a flowchart of determining whether a first text segment includes a plurality of target reference words consistent with the target named entity reference, in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates a flowchart of determining whether a first text segment has a target coreference cluster that satisfies a preset condition according to an embodiment of the present disclosure;
FIG. 6 illustrates a flowchart of determining whether each of at least one coreference cluster satisfies a preset condition according to an embodiment of the present disclosure;
FIG. 7 illustrates a flow chart of a training method of a text classification model according to an embodiment of the disclosure;
FIG. 8 illustrates a flow chart of a training method of a text classification model according to an embodiment of the disclosure;
FIG. 9 illustrates a flow chart of a text classification method according to an embodiment of the disclosure;
FIG. 10 illustrates a flow diagram for determining a second text segment from text to be predicted based on a target named entity according to an embodiment of the present disclosure;
FIG. 11 illustrates a flowchart of determining whether a second text segment includes a plurality of target reference words consistent with the target named entity reference, in accordance with an embodiment of the present disclosure;
FIG. 12 illustrates a flowchart of determining whether a second text segment has a target coreference cluster that satisfies a preset condition, according to an embodiment of the present disclosure;
FIG. 13 illustrates a flowchart of determining whether each of at least one coreference cluster satisfies a preset condition according to an embodiment of the present disclosure;
FIG. 14 illustrates a flow chart of a text classification method according to an embodiment of the disclosure;
FIG. 15 illustrates a flow chart of a text classification method according to an embodiment of the disclosure;
FIG. 16 illustrates an operational flow diagram of a text classification system according to an embodiment of the disclosure;
FIG. 17 shows a block diagram of a training device of a text classification model according to an embodiment of the disclosure;
FIG. 18 shows a block diagram of a text classification device according to an embodiment of the disclosure; and
fig. 19 shows a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the related art, classification models face two difficulties with long texts: if the text is processed at sentence granularity, context information is lacking and the model invocation cost is high; if the text is analyzed at chapter-level granularity, a large amount of irrelevant noise is present and the manual labeling cost is high.
To solve the above problems, the present disclosure determines a text segment related to a target named entity in an original text, extracts from the text segment the clauses that include target reference words consistent with the target named entity reference, and splices these clauses to obtain a text sample for training a text classification model. In this way, the obtained text sample is guaranteed to carry sufficient context information related to the target named entity while irrelevant noise is removed, which improves the accuracy of the trained text classification model. In addition, the method requires no manual labeling and has low computational cost, so a large number of samples can be generated quickly to train the text classification model. Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, server 120 may run one or more services or software applications that enable execution of the training method and/or text classification method of the text classification model of the present disclosure.
In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 to determine text to predict. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface, e.g., may output text classification results to the user. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the defects of high management difficulty and weak service expansibility in traditional physical host and virtual private server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
According to one aspect of the present disclosure, a method of training a text classification model is provided. As shown in fig. 2, the training method includes: step S201, acquiring an original text, a first label corresponding to the original text, and a target named entity, and determining a first text segment from the original text based on the target named entity, the first text segment including a first number of words; step S202, determining whether the first text segment includes a plurality of target reference words consistent with the target named entity reference; step S203, in response to determining that the first text segment includes a plurality of target reference words consistent with the target named entity reference, splicing the clauses in which the plurality of target reference words are respectively located to obtain a first text sample; and step S204, training the text classification model based on the first text sample and the first label.
Thus, by the above method, the obtained text sample is guaranteed to carry sufficient context information related to the target named entity while irrelevant noise is removed, which improves the accuracy of the trained text classification model. In addition, the method requires no manual labeling and has low computational cost, so a large number of samples can be generated quickly to train the text classification model.
According to some embodiments, the text classification model may be trained to classify the tendency of a text, in particular its tendency toward specific individuals, things, or events (collectively referred to as named entities in this disclosure) mentioned in the text. The technical solution of the present disclosure will mainly be described below taking the text tendency classification task as an example, which is not intended to limit the scope of the present disclosure in any way. It will be appreciated that the technical solution of the present disclosure may also be used for other text classification tasks, which are not limited herein.
In step S201, an original text, a first label corresponding to the original text, and a target named entity are acquired, and a first text segment is determined from the original text based on the target named entity, the first text segment including a first number of words.
According to some embodiments, the original text may include news text and social media text. The original text may also be papers, books, magazines or other text, not limited herein.
According to some embodiments, the original text may be obtained by one of the following: web crawlers, databases, data interfaces. In this way, a large amount of original text can be obtained at low cost. It will be appreciated that the original text may be obtained in other ways than those described above.
In some embodiments, the text classification model is trained for tendency analysis of text; accordingly, the first label may indicate a tendency classification result of the original text, e.g., characterizing whether the article itself is positive or negative.
The first label may be determined from the source of the original text. Different publishers often differ in tendency, while different content published by the same publisher tends to share a similar tendency, so the first label characterizing the tendency of the original text can be determined directly based on its source. In one exemplary embodiment, the first label of original text from website A, whose content is generally positive, may be set to positive, and the first label of original text from website B, whose content is generally negative, may be set to negative, as in the sketch below.
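As an illustration of this source-based weak labeling, the following minimal sketch assumes a hand-maintained mapping from publisher identifiers to tendency labels; the identifiers and label names are hypothetical and not part of the disclosure.

```python
from typing import Optional

# Hypothetical publisher-to-tendency mapping; entries are illustrative only.
SOURCE_TENDENCY = {
    "website-a.example": "positive",
    "website-b.example": "negative",
}

def first_label_from_source(source: str) -> Optional[str]:
    """Determine the first label of an original text directly from its source."""
    return SOURCE_TENDENCY.get(source)
```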
In some embodiments, the target named entity may be selected from a library of predetermined named entities, or may be determined by performing recognition or extraction on the original text. The target named entity may also be selected or specified by a configuration center, as will be described below.
According to some embodiments, as shown in fig. 3, step S201 of acquiring the original text, the first label corresponding to the original text, and the target named entity, and determining the first text segment from the original text based on the target named entity may include: step S301, determining a text window whose length is the first number based on the position where the target named entity first appears in the original text; and step S302, cutting the original text based on the text window to obtain the first text segment. In this way, a context of moderate length that is closely associated with the target named entity can be obtained.
According to some embodiments, the text window may be centered on the position where the target named entity first appears in the original text. For example, the text window may include text content that is half the first number in length before the position where the target named entity first appears, and text content that is half the first number in length after that position. In this way, it can be ensured that both the preceding and following context of the target named entity are fully covered.
According to some embodiments, the first number may be 1000. In one exemplary embodiment, the original text may be cut with a text window of 1000 words (e.g., about 500 words before and about 500 words after the target named entity) to obtain the first text segment.
It will be appreciated that the first number may be set to other values, and the first text segment associated with the target named entity may be determined in the original text in other manners, which are not limited herein.
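The window cutting of steps S301 and S302 can be sketched as follows, assuming one word per character (as in Chinese text) and a window centered on the first occurrence of the entity; the function name is illustrative.

```python
from typing import Optional

FIRST_NUMBER = 1000  # window length from the exemplary embodiment

def cut_text_segment(text: str, entity: str,
                     window: int = FIRST_NUMBER) -> Optional[str]:
    """Cut a segment of about `window` words centered on the entity's first
    occurrence; works for the original text or the text to be predicted."""
    pos = text.find(entity)
    if pos == -1:
        return None                      # entity absent, so no segment is determined
    half = window // 2                   # about 500 words before and after
    start = max(0, pos - half)
    end = min(len(text), pos + len(entity) + half)
    return text[start:end]
```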
In some embodiments, a plurality of different target named entities may be determined in the original text, a plurality of first text segments respectively corresponding to the plurality of target named entities may be determined from the original text, and a corresponding first text sample may further be determined for each first text segment, so as to obtain a plurality of training samples. It should be noted that since these training samples originate from the same original text, they may share the same first label.
In step S202, it is determined whether the first text segment includes a plurality of target reference words consistent with the target named entity reference.
In some embodiments, the first text segment may include a plurality of clauses and a plurality of word segments, which may be derived based on corresponding rules or by processing the first text segment with corresponding tools, as will be described below.
In everyday language, the following text often uses an abbreviation or a pronoun in place of a word that has already appeared above; in linguistics this is called the "reference phenomenon", or simply "reference". Reference avoids the bloated statements and redundant descriptions caused by repeated occurrences of the same word, but such omission also raises the problem of an unclear referent.
In this disclosure, a reference word (including a target reference word) is a word segment that can be used to refer to an entity such as a specific individual, thing, or event. Each of the plurality of target reference words is able to refer to the target named entity. If the first text segment includes such a plurality of target reference words, the clauses in which the target reference words are located may be spliced to obtain a first text sample corresponding to the target named entity as a corresponding training sample, as will be described below.
In one exemplary embodiment, the obtained target named entity may be the player "Zhang San", and the corresponding first text segment may be: "Late on the night of day X, the player Zhang San announced his retirement via personal social media, ending an XX-year career. With age, the form of Da Zhang, who is about to turn XX, slipped in recent seasons' tournaments. In other words, the outside world was already somewhat mentally prepared for his retirement. Yet when the moment really came, fans were still filled with emotion."
It can be seen that in this first text segment, the three reference words, namely the player's name "Zhang San", the nickname "Da Zhang", and the "his" in "the outside world was already somewhat mentally prepared for his retirement", can each refer to the player Zhang San, so it may be determined that the first text segment includes three target reference words consistent in reference with the target named entity, the player "Zhang San".
Formally, the process of partitioning the different references that represent the same entity into one equivalence set (i.e., a coreference cluster) is referred to as coreference resolution. Coreference resolution can effectively solve the problem of unclear referents in text; it is a fundamental research topic in the field of natural language processing and plays an important role in tasks such as machine reading comprehension, information extraction, and multi-turn dialogue.
According to some embodiments, as shown in fig. 4, determining in step S202 whether the first text segment includes a plurality of target reference words consistent with the target named entity reference may include: step S401, determining whether the first text segment has a target coreference cluster satisfying a preset condition, the preset condition indicating that the plurality of reference words with identical reference included in the corresponding coreference cluster include a reference word identical to the target named entity; and step S402, in response to determining that the first text segment has a target coreference cluster satisfying the preset condition, determining the plurality of reference words in the target coreference cluster as the plurality of target reference words consistent with the target named entity reference.
Thus, by determining whether the first text segment has a coreference cluster that is an equivalence set of different references to the target named entity and that includes a reference word identical to the target named entity, it can be determined whether the first text segment includes a plurality of target reference words consistent with the target named entity reference.
According to some embodiments, as shown in fig. 5, determining in step S401 whether the first text segment has a target coreference cluster satisfying the preset condition may include: step S501, determining, based on a coreference resolution tool, whether the first text segment has at least one coreference cluster, each of the at least one coreference cluster including a plurality of reference words in the first text segment with identical reference; step S502, in response to determining that the first text segment has at least one coreference cluster, determining whether each of the at least one coreference cluster satisfies the preset condition; and step S503, determining a coreference cluster satisfying the preset condition as the target coreference cluster.
The coreference resolution tool can analyze the text and directly produce the plurality of reference words with identical reference that make up a coreference cluster. Thus, by using the coreference resolution tool, the one or more coreference clusters of the first text segment can be quickly obtained (the first text segment may also have no coreference cluster), and it can be determined whether an obtained coreference cluster is consistent in reference with the target named entity. If so, the clauses in which the plurality of reference words indicated by that coreference cluster are respectively located can be acquired and spliced, as will be described below.
As mentioned above, the first text segment may have no coreference cluster, in which case the first text segment may be considered not to include a plurality of target reference words consistent with the target named entity reference; the handling of this case will be described below.
In some embodiments, the coreference resolution tool may directly output the plurality of reference tokens of a coreference cluster, so that whether the coreference cluster is consistent in reference with the target named entity can be determined from the relationship between the reference words in the cluster and the target named entity. For the exemplary embodiment given above, in step S501 the coreference resolution tool may directly give the coreference cluster, i.e., the set comprising the three reference words "Zhang San", "Da Zhang", and "his". Since the reference word "Zhang San" in this coreference cluster is identical to the target named entity, the player "Zhang San", it may be determined in step S502 that the coreference cluster satisfies the preset condition, and it may then be determined as the target coreference cluster in step S503.
According to some embodiments, in step S502, it may be determined iteratively whether each of the plurality of reference words in a coreference cluster is identical to the target named entity; if one reference word in the coreference cluster is identical to the target named entity, the coreference cluster may be considered consistent in reference with the target named entity. As shown in fig. 6, in response to determining that the first text segment has at least one coreference cluster, determining whether each of the at least one coreference cluster satisfies the preset condition may include: step S601, for each of the at least one coreference cluster, traversing the plurality of reference words included in the coreference cluster and determining one by one whether each of the plurality of reference words is identical to the target named entity; and step S602, in response to determining that the coreference cluster includes a reference word identical to the target named entity, determining that the coreference cluster satisfies the preset condition.
In some embodiments, if the first text segment has at least one coreference cluster but every coreference cluster differs in reference from the target named entity, the first text segment may be considered not to include a plurality of target reference words consistent with the target named entity reference; the handling of this case will be described below.
It is understood that in step S401 it may also be determined in other manners whether the first text segment has a target coreference cluster satisfying the preset condition, which is not limited herein.
In step S402, when it is determined that the first text segment has a target coreference cluster satisfying the preset condition, the plurality of reference words in the target coreference cluster may be determined as the plurality of target reference words consistent with the target named entity reference.
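The preset-condition check of steps S501 to S503 and S601 to S602 can be sketched as below, assuming the coreference resolution tool returns its clusters as lists of mention strings; real tools expose richer structures (e.g., token spans), so this interface is an assumption.

```python
from typing import List, Optional

def find_target_coreference_cluster(clusters: List[List[str]],
                                    entity: str) -> Optional[List[str]]:
    """Return the first coreference cluster containing a reference word
    identical to the target named entity, or None if no cluster qualifies."""
    for cluster in clusters:              # each cluster: reference words with identical reference
        for reference_word in cluster:    # traverse reference words one by one (step S601)
            if reference_word == entity:  # preset condition satisfied (step S602)
                return cluster            # the target coreference cluster (step S503)
    return None                           # no cluster is consistent with the entity's reference
```

For the example above, `find_target_coreference_cluster([["Zhang San", "Da Zhang", "his"]], "Zhang San")` returns the whole cluster.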
Returning to fig. 2. In step S203, in response to determining that the first text segment includes a plurality of target reference words consistent with the target named entity reference, the clauses in which the plurality of target reference words are respectively located are spliced to obtain a first text sample.
For the exemplary embodiment given above, the clauses in which the target reference words "Zhang San", "Da Zhang", and "his" are respectively located may be obtained, i.e., "Late on the night of day X, the player [Zhang San] announced his retirement via personal social media, ending an XX-year career", "With age, the form of [Da Zhang], who is about to turn XX, slipped in recent seasons' tournaments", and "In other words, the outside world was already somewhat mentally prepared for [his] retirement". Further, these clauses may be spliced to obtain the first text sample.
According to some embodiments, the plurality of words in the first text segment may each have a corresponding word index. As shown in fig. 7, the training method may further include: step S702, splitting the first text segment into clauses based on the punctuation marks in the first text segment, to obtain a plurality of clauses and a clause index of each of the plurality of clauses; step S703, establishing a sentence mapping relationship between the clause index of each of the plurality of clauses and the word indexes of the plurality of words, based on the membership between the plurality of words and the plurality of clauses; step S704, performing word segmentation on the first text segment based on a word segmentation tool, to obtain a plurality of word segments and a word segment index of each of the plurality of word segments; and step S705, establishing a word segmentation mapping relationship between the word segment indexes of the plurality of word segments and the word indexes of the plurality of words, based on the membership between the plurality of words and the plurality of word segments. It is understood that the operations of steps S701, S706, and S708 in fig. 7 are similar to the corresponding operations of steps S201 to S204 in fig. 2, and are not described here again.
Thus, by establishing a sentence mapping relationship (sentence map) between the clause indexes and the word indexes and a word segmentation mapping relationship (token map) between the word segment indexes and the word indexes, the clauses in which the target reference words are located can be quickly obtained once the plurality of target reference words are obtained.
In some embodiments, in step S702, the first text segment may be split into clauses using Chinese punctuation rules, and a clause index may be assigned to each resulting clause. It will be appreciated that other rules or other means may be employed to split the first text segment into clauses.
In some embodiments, in step S703, a sentence mapping relationship may be established between the word indexes of the plurality of words included in each clause and the clause index of that clause.
In some embodiments, in step S704, word segmentation preprocessing may be performed on the first text segment using a word segmentation tool, and a word segment index may be assigned to each resulting word segment. It will be appreciated that other ways of segmenting the first text segment into words may be employed.
In some embodiments, in step S705, a word segmentation mapping relationship may be established between the word indexes of the plurality of words included in each word segment and the word segment index of that word segment.
In some embodiments, in step S707, in response to determining that the first text segment includes a plurality of target reference words consistent with the target named entity reference, splicing the clauses in which the plurality of target reference words are respectively located to obtain the first text sample may include: acquiring, based on the sentence mapping relationship and the word segmentation mapping relationship, the clauses in which the plurality of target reference words are respectively located. Thus, by using the sentence mapping relationship and the word segmentation mapping relationship, the corresponding clauses can be obtained directly from the target reference words, which improves text processing efficiency.
According to some embodiments, acquiring and splicing, based on the sentence mapping relationship and the word segmentation mapping relationship, the clauses in which the plurality of target reference words are respectively located may include: for each of the plurality of target reference words, determining the word indexes of the words corresponding to that target reference word based on its word segment index and the word segmentation mapping relationship; determining the clause index of the clause in which that target reference word is located based on the word indexes of its corresponding words and the sentence mapping relationship; and acquiring, based on the corresponding clause indexes, the clauses in which the plurality of target reference words are respectively located, and splicing them.
Thus, by using the word indexes as an intermediate medium, the clause in which each target reference word is located can be quickly obtained based on the word segmentation mapping relationship and the sentence mapping relationship, which improves text processing efficiency.
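A compact sketch of the clause splitting, the mappings, and the splicing of steps S702 to S705 and S707 follows. For brevity it collapses the word indexes into character offsets (a reasonable simplification for Chinese text); the clause-ending punctuation set and the mention spans are assumptions.

```python
from typing import Dict, List, Tuple

def build_sentence_map(segment: str) -> Tuple[Dict[int, int], List[str]]:
    """Split the segment into clauses and map each word (character) index
    to the index of the clause it belongs to (steps S702 and S703)."""
    sentence_map: Dict[int, int] = {}
    clauses: List[str] = []
    clause_idx, clause_start = 0, 0
    for i, ch in enumerate(segment):
        sentence_map[i] = clause_idx
        if ch in "。！？；":                      # assumed clause-ending punctuation
            clauses.append(segment[clause_start:i + 1])
            clause_idx += 1
            clause_start = i + 1
    if clause_start < len(segment):               # trailing clause without punctuation
        clauses.append(segment[clause_start:])
    return sentence_map, clauses

def splice_clauses(segment: str,
                   mention_spans: List[Tuple[int, int]]) -> str:
    """Given the (start, end) word-index spans of the target reference words
    (the role played by the word segmentation map), splice the de-duplicated
    clauses in which they occur, preserving text order (step S707)."""
    sentence_map, clauses = build_sentence_map(segment)
    clause_ids = sorted({sentence_map[start] for start, _ in mention_spans})
    return "".join(clauses[i] for i in clause_ids)
```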
Returning to fig. 2. In step S204, a text classification model is trained based on the first text sample and the first label.
In some embodiments, the first text sample may be input into the text classification model to obtain a text classification prediction output by the text classification model, and the parameters of the text classification model may be adjusted based on the prediction and the first label, thereby training the text classification model. The text classification prediction output by the text classification model can characterize the tendency classification result of the original text toward the target named entity. It will be appreciated that other training methods or training techniques may be used to train the text classification model when performing the training methods of the present disclosure, which are not limited herein.
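The parameter adjustment described here can be sketched as a single supervised update; the encoder-style model and tokenizer interface below are placeholder assumptions, not the architecture specified by the disclosure.

```python
import torch
import torch.nn.functional as F

def train_step(model, tokenizer, optimizer,
               text_sample: str, label: int) -> float:
    """One update of the text classification model on a
    (first text sample, first label) pair."""
    inputs = tokenizer(text_sample, truncation=True, return_tensors="pt")
    logits = model(**inputs).logits      # text classification prediction
    loss = F.cross_entropy(logits, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()                      # adjust model parameters from the loss
    optimizer.step()
    return loss.item()
```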
According to some embodiments, as shown in fig. 8, the training method may further include: step S805, in response to determining that the first text segment does not have a plurality of target reference words consistent with the target named entity reference, splicing the clauses of the first text segment that contain the target named entity to obtain a second text sample; and step S806, training the text classification model based on the second text sample and the first label. It is understood that the operations of steps S801 to S804 in fig. 8 are similar to those of steps S201 to S204 in fig. 2, respectively, and the operations of steps S805 and S806 may refer to steps S203 and S204, which are not described here again.
Thus, in the case where the first text segment does not have a plurality of target reference words consistent with the target named entity reference, the clauses containing the target named entity in the first text segment can be spliced so that context information related to the target named entity is still obtained while irrelevant noise is removed.
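Under the same assumptions as the clause-splitting sketch above (whose `build_sentence_map` it reuses), the fallback of step S805 reduces to keeping only the clauses that literally contain the target named entity:

```python
def splice_entity_clauses(segment: str, entity: str) -> str:
    """Fallback when no coreference cluster matches: splice the clauses of
    the text segment that contain the target named entity itself (step S805)."""
    _, clauses = build_sentence_map(segment)   # clause splitter from the sketch above
    return "".join(clause for clause in clauses if entity in clause)
```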
According to another aspect of the present disclosure, a text classification method is provided. As shown in fig. 9, the method includes: step S901, acquiring a text to be predicted and a target named entity, and determining a second text segment from the text to be predicted based on the target named entity, the second text segment including a first number of words; step S902, determining whether the second text segment includes a plurality of target reference words consistent with the target named entity reference; step S903, in response to determining that the second text segment includes a plurality of target reference words consistent with the target named entity reference, splicing the clauses in which the plurality of target reference words are respectively located to obtain a first text to be classified; and step S904, acquiring a text classification result output by the text classification model based on the first text to be classified.
It is to be understood that the operations of step S901 to step S904 in fig. 9 are similar to the operations of step S201 to step S204 in fig. 2, respectively, and are not described herein. Further, the text classification model used in step S904 may be trained using the training method of the text classification model described above.
Thus, by the above method, the text input into the text classification model is guaranteed to carry sufficient context information related to the target named entity while irrelevant noise is removed, which improves the accuracy of the classification results generated by the text classification model.
In step S901, a text to be predicted and a target named entity are obtained, and a second text segment is determined from the text to be predicted based on the target named entity, the second text segment including a first number of words.
In some embodiments, the text to be predicted may be any text obtained in any manner. In a scenario where a large number of texts need to be screened and classified, a configuration center may, for example, pre-configure the corresponding keywords and target named entities. Before the text classification method is performed, the texts may be filtered using the keywords (e.g., keywords of a specific event), and the filtered texts may be used as the texts to be predicted. Further, the text classification method described above may be performed based on the target named entity (e.g., a specific person).
According to some embodiments, as shown in fig. 10, step S901 of acquiring the text to be predicted and the target named entity and determining the second text segment from the text to be predicted based on the target named entity may include: step S1001, determining a text window whose length is the first number based on the position where the target named entity first appears in the text to be predicted; and step S1002, cutting the text to be predicted based on the text window to obtain the second text segment. In this way, a context of moderate length that is closely associated with the target named entity can be obtained.
According to some embodiments, the text window may be centered on the position where the target named entity first appears in the text to be predicted. For example, the text window may include text content that is half the first number in length before the position where the target named entity first appears, and text content that is half the first number in length after that position. In this way, it can be ensured that both the preceding and following context of the target named entity are fully covered.
According to some embodiments, the first number may be 1000. In one exemplary embodiment, the text to be predicted may be cut with a text window of 1000 words (e.g., about 500 words before and about 500 words after the target named entity) to obtain the second text segment.
In step S902, it is determined whether the second text segment includes a plurality of target reference words consistent with the target named entity reference.
According to some embodiments, as shown in fig. 11, determining whether the second text segment includes a plurality of target reference words consistent with the target named entity reference may include: step S1101, determining whether the second text segment has a target coreference cluster satisfying a preset condition, the preset condition indicating that the plurality of reference words with identical reference included in the corresponding coreference cluster include a reference word identical to the target named entity; and step S1102, in response to determining that the second text segment has a target coreference cluster satisfying the preset condition, determining the plurality of reference words in the target coreference cluster as the plurality of target reference words consistent with the target named entity reference.
Thus, by determining whether the second text segment has a coreference cluster that is an equivalence set of different references to the target named entity and that includes a reference word identical to the target named entity, it can be determined whether the second text segment includes a plurality of target reference words consistent with the target named entity reference.
According to some embodiments, as shown in fig. 12, step S1101 of determining whether the second text segment has a target coreference cluster satisfying the preset condition may include: step S1201, determining, based on the coreference resolution tool, whether the second text segment has at least one coreference cluster, each of the at least one coreference cluster including a plurality of reference words in the second text segment with identical reference; step S1202, in response to determining that the second text segment has at least one coreference cluster, determining whether each of the at least one coreference cluster satisfies the preset condition; and step S1203, determining a coreference cluster satisfying the preset condition as the target coreference cluster.
Thus, by using the coreference resolution tool, the one or more coreference clusters of the second text segment can be quickly obtained (the second text segment may also have no coreference cluster), and it can be determined whether an obtained coreference cluster is consistent in reference with the target named entity.
According to some embodiments, as shown in fig. 13, in response to determining that the second text segment has at least one coreference cluster, determining whether each of the at least one coreference cluster satisfies the preset condition may include: step S1301, for each of the at least one coreference cluster, traversing the plurality of reference words included in the coreference cluster and determining one by one whether each of the plurality of reference words is identical to the target named entity; and step S1302, in response to determining that the coreference cluster includes a reference word identical to the target named entity, determining that the coreference cluster satisfies the preset condition.
Returning to fig. 9. In step S903, in response to determining that the second text segment includes a plurality of target reference words consistent with the target named entity reference, the clauses in which the plurality of target reference words are respectively located are spliced to obtain the first text to be classified.
According to some embodiments, the plurality of words in the second text segment each have a corresponding word index. As shown in fig. 14, the text classification method may further include: step S1402, splitting the second text segment into clauses based on the punctuation marks in the second text segment, to obtain a plurality of clauses and a clause index of each of the plurality of clauses; step S1403, establishing a sentence mapping relationship between the clause index of each of the plurality of clauses and the word indexes of the plurality of words, based on the membership between the plurality of words and the plurality of clauses; step S1404, performing word segmentation on the second text segment based on the word segmentation tool, to obtain a plurality of word segments and a word segment index of each of the plurality of word segments; and step S1405, establishing a word segmentation mapping relationship between the word segment indexes of the plurality of word segments and the word indexes of the plurality of words, based on the membership between the plurality of words and the plurality of word segments. It is understood that the operations of step S1401 and steps S1406 to S1408 in fig. 14 are similar to those of steps S901 to S904 in fig. 9, and are not described here again.
In some embodiments, in step S1407, in response to determining that the second text segment includes a plurality of target-reference-word fragments that are consistent with the target-named-entity reference, concatenating the clauses in which the plurality of target-reference-word fragments are located to obtain the first text to be classified may include: based on the sentence mapping relation and the word segmentation mapping relation, directly acquiring the clauses where the target index word segments are respectively located.
Therefore, by using the sentence mapping relation and the word segmentation mapping relation, the corresponding clauses can be obtained directly from the target reference words, which improves text processing efficiency.
According to some embodiments, acquiring and splicing the clauses where the plurality of target reference words are respectively located based on the sentence mapping relation and the word segmentation mapping relation may include: for each target reference word in the plurality of target reference words, determining the word index of the word corresponding to the target reference word based on the word segmentation index of the target reference word and the word segmentation mapping relation; determining the clause index of the clause where the target reference word is located based on the word index of the word corresponding to the target reference word and the sentence mapping relation; and acquiring the clauses where the plurality of target reference words are respectively located based on the corresponding clause indexes, and splicing the clauses.
Therefore, by using the word indexes as an intermediate medium, the clause where each target reference word is located can be obtained quickly based on the word segmentation mapping relation and the sentence mapping relation, which improves text processing efficiency.
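The two-hop lookup just described can be sketched as follows, reusing the mappings from the previous sketch: each target reference word's token index is mapped through the word segmentation mapping to word indexes, then through the sentence mapping to a clause index, and the deduplicated clauses are spliced in document order.

```python
from typing import Dict, Iterable, List

def splice_clauses(
    target_token_indexes: Iterable[int],  # token indexes of the target reference words
    token_to_words: Dict[int, List[int]],
    word_to_clause: Dict[int, int],
    clauses: List[str],
) -> str:
    clause_indexes = set()
    for t_idx in target_token_indexes:
        # Word segmentation mapping: token index -> covered word indexes.
        for w_idx in token_to_words[t_idx]:
            # Sentence mapping: word index -> clause index.
            clause_indexes.add(word_to_clause[w_idx])
    # Splice in original order so the result reads as coherent context.
    return "".join(clauses[c] for c in sorted(clause_indexes))
```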
Returning to fig. 9. In step S904, a text classification result output by the text classification model based on the first text to be classified is acquired.
In some embodiments, the first text to be classified may be input into a text classification model to obtain a text classification result output by the text classification model. The text classification result may characterize a tendency classification result of the text to be predicted for the target named entity.
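As a hedged illustration only: if the trained model were exported as a sequence classification checkpoint usable with the Hugging Face pipeline API (the disclosure fixes no particular architecture or toolkit), inference might look as follows; the checkpoint path and labels are placeholders.

```python
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; any sequence classification model
# trained as described above would fit here.
classifier = pipeline("text-classification", model="path/to/tendency-classifier")

first_text_to_classify = "..."  # splice result produced in step S903
print(classifier(first_text_to_classify))
# e.g. [{'label': 'POSITIVE', 'score': 0.93}], labels depending on the first tags
```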
According to some embodiments, as shown in fig. 15, the text classification method may further include: step S1505, in response to determining that the second text segment does not include a plurality of target reference words consistent with the target named entity reference, splicing the clauses containing the target named entity in the second text segment to obtain a second text to be classified; and step S1506, acquiring a text classification result output by the text classification model based on the second text to be classified. It is to be understood that the operations of steps S1501 to S1504 in fig. 15 are similar to those of steps S901 to S904 in fig. 9, respectively, and the operations of steps S1505 and S1506 may refer to steps S903 and S904, and are not described here again.
Therefore, in the case that the second text segment does not include a plurality of target reference words consistent with the target named entity reference, the clauses containing the target named entity in the second text segment can still be spliced, so that the context information related to the target named entity is retained and irrelevant noise is removed.
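A one-function sketch of this fallback path, reusing the clause list from the earlier mapping sketch: only the clauses that literally contain the target named entity are kept and spliced.

```python
from typing import List

def splice_entity_clauses(clauses: List[str], target_entity: str) -> str:
    # Keep only the clauses that mention the target named entity verbatim.
    return "".join(c for c in clauses if target_entity in c)
```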
The operation of a text classification system capable of performing the training method of the text classification model of the present disclosure, and of performing text classification using the trained text classification model, will now be described with reference to fig. 16 by way of an exemplary embodiment.
In step S1601, it is determined whether the system is in the training phase. If so, training text is obtained from a data source 1602 such as a web crawler, a database, or a data interface; if not, the corresponding text 1603 to be predicted is obtained.
In step S1604, the tags of the training data are determined according to the tendency classification of their data sources.
In step S1605, the text to be predicted is filtered according to the event keywords 1622 output by the configuration center 1620. If the text to be predicted does not include the corresponding event keywords, processing of the text to be predicted may be skipped.
In step S1606, a text window around the target person in the training text or the text to be predicted may be cut out according to the name keyword 1624 output by the configuration center 1620, as sketched below.
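A minimal sketch of one way this window cut could work, assuming (as claims 8-9 and 21-22 below describe for some embodiments) a window of a fixed number of words centered on the first occurrence of the name keyword; the function name and default size are illustrative.

```python
def cut_window(text: str, keyword: str, window_size: int = 1000) -> str:
    """Cut a window of `window_size` words (characters, for Chinese text)
    centered on the first occurrence of `keyword` in `text`."""
    pos = text.find(keyword)
    if pos < 0:
        return ""  # keyword absent; upstream filtering normally prevents this
    center = pos + len(keyword) // 2
    start = max(0, center - window_size // 2)
    return text[start : start + window_size]
```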
In step S1607, sentence segmentation may be performed on the text content in the text window, and a sentence mapping relationship and a word segmentation mapping relationship may be established.
In step S1608, the coreference resolution tool may process the text content within the text window to generate identical reference clusters, and the name keyword 1624 output by the configuration center 1620 may be received.
In step S1609, it may be determined whether the text content in the text window has an identical reference cluster matching the name keyword 1624. If so, step S1610 is executed to splice the clauses where the plurality of reference words in that cluster are located; if not, step S1611 is executed to splice the clauses where the name keyword 1624 is located. Either branch yields a splice result.
In step S1612, the splice result may be input into the text classification model to obtain the result output by the text classification model.
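Putting the pieces together, the following sketch composes the helpers from the previous sketches into the prediction path of fig. 16; `resolve_coref`, `tokenize`, and `classifier` remain the assumed tool interfaces introduced earlier, and matching cluster mentions to token positions by exact string equality is a simplification of step S1610.

```python
from typing import Callable, Iterable, List, Optional, Tuple

def classify_for_entity(
    text: str,
    entity: str,
    event_keywords: List[str],
    resolve_coref: Callable[[str], List[List[str]]],
    tokenize: Callable[[str], Iterable[Tuple[str, int, int]]],
    classifier: Callable[[str], object],
) -> Optional[object]:
    # Step S1605: skip texts without any relevant event keyword.
    if not any(k in text for k in event_keywords):
        return None
    # Step S1606: cut the text window around the target person.
    segment = cut_window(text, entity, window_size=1000)
    # Step S1607: build the sentence and word segmentation mappings.
    clauses, word_to_clause, tokens, token_to_words = build_mappings(segment, tokenize)
    # Steps S1608-S1609: look for an identical reference cluster matching the name keyword.
    cluster = find_target_cluster(segment, entity, resolve_coref)
    if cluster is not None:
        # Step S1610: splice the clauses of all mentions in the matched cluster.
        mentions = set(cluster)
        target_idx = [i for i, tok in enumerate(tokens) if tok in mentions]
        spliced = splice_clauses(target_idx, token_to_words, word_to_clause, clauses)
    else:
        # Step S1611: fall back to the clauses containing the name keyword itself.
        spliced = splice_entity_clauses(clauses, entity)
    # Step S1612: feed the splice result to the trained text classification model.
    return classifier(spliced)
```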
According to another aspect of the present disclosure, a training apparatus for a text classification model is provided. As shown in fig. 17, the apparatus 1700 includes: a first obtaining unit 1710 configured to obtain an original text, a first tag corresponding to the original text, and a target named entity, and determine a first text segment from the original text based on the target named entity, the first text segment including a first number of words; a first determining unit 1720 configured to determine whether the first text segment includes a plurality of target reference words consistent with the target named entity reference; a first splicing unit 1730 configured to, in response to determining that the first text segment includes a plurality of target reference words consistent with the target named entity reference, splice the clauses where the plurality of target reference words are respectively located to obtain a first text sample; and a training unit 1740 configured to train a text classification model based on the first text sample and the first tag. It is to be understood that the operations of the units 1710 to 1740 in the apparatus 1700 are similar to the operations of steps S201 to S204 in fig. 2, respectively, and are not repeated here.
According to another aspect of the present disclosure, a text classification device is provided. As shown in fig. 18, the apparatus 1800 includes: a second obtaining unit 1810 configured to obtain the text to be predicted and the target named entity, and determine a second text segment from the text to be predicted based on the target named entity, the second text segment including the first number of words; a second determining unit 1820 configured to determine whether the second text segment includes a plurality of target reference words consistent with the target named entity reference; a second splicing unit 1830 configured to, in response to determining that the second text segment includes a plurality of target reference words consistent with the target named entity reference, splice the clauses where the plurality of target reference words are respectively located to obtain a first text to be classified; and a third acquiring unit 1840 configured to acquire a text classification result output by the text classification model based on the first text to be classified. It is to be understood that the operations of the units 1810 to 1840 in the apparatus 1800 are similar to the operations of steps S901 to S904 in fig. 9, respectively, and are not described here again.
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of users all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
With reference to fig. 19, a block diagram of an electronic device 1900, which can be a server or a client of the present disclosure and is an example of a hardware device that can be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 19, the electronic device 1900 includes a computing unit 1901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1902 or a computer program loaded from a storage unit 1908 into a Random Access Memory (RAM) 1903. In the RAM 1903, various programs and data required for operation of the electronic device 1900 may also be stored. The computing unit 1901, ROM 1902, and RAM 1903 are connected to each other via a bus 1904. An input/output (I/O) interface 1905 is also connected to bus 1904.
Various components in electronic device 1900 are connected to I/O interface 1905, including: an input unit 1906, an output unit 1907, a storage unit 1908, and a communication unit 1909. The input unit 1906 may be any type of device capable of inputting information to the electronic device 1900; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 1907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. Storage unit 1908 may include, but is not limited to, magnetic disks and optical disks. The communication unit 1909 allows the electronic device 1900 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, 802.11 devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.
The computing unit 1901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1901 performs the various methods and processes described above, such as a training method of a text classification model and/or a text classification method. For example, in some embodiments, the training method of the text classification model and/or the text classification method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1908. In some embodiments, some or all of the computer programs may be loaded and/or installed onto electronic device 1900 via ROM 1902 and/or communication unit 1909. When the computer program is loaded into RAM 1903 and executed by computing unit 1901, one or more steps of the training method of the text classification model and/or the text classification method described above may be performed. Alternatively, in other embodiments, the computing unit 1901 may be configured to perform the training method of the text classification model and/or the text classification method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples, but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It is to be understood that, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (28)

1. A method of training a text classification model, comprising:
acquiring an original text, a first tag corresponding to the original text, and a target named entity, and determining a first text segment from the original text based on the target named entity, wherein the first text segment comprises a first number of words;
determining whether the first text segment includes a plurality of target reference words consistent with the target named entity reference;
in response to determining that the first text segment includes a plurality of target reference words consistent with the target named entity reference, splicing the clauses where the plurality of target reference words are respectively located, so as to obtain a first text sample; and
training the text classification model based on the first text sample and the first tag.
2. The method of claim 1, wherein determining whether the first text segment includes a plurality of target reference words consistent with the target named entity reference comprises:
determining whether the first text segment has a target identical reference cluster meeting a preset condition, wherein the preset condition indicates that a plurality of reference words with identical references included in the corresponding identical reference cluster comprise reference words identical to the target named entity; and
in response to determining that the first text segment has the target identical reference cluster meeting the preset condition, determining the plurality of reference words in the target identical reference cluster as the plurality of target reference words consistent with the target named entity reference.
3. The method of claim 2, wherein determining whether the first text segment has a target identical reference cluster that satisfies a preset condition comprises:
determining, based on a co-reference resolution tool, whether the first text segment has at least one identical reference cluster, each identical reference cluster of the at least one identical reference cluster comprising a plurality of reference words in the first text segment that are identical in reference;
in response to determining that the first text segment has at least one identical reference cluster, determining whether each identical reference cluster in the at least one identical reference cluster satisfies the preset condition; and
determining the identical reference cluster meeting the preset condition as the target identical reference cluster.
4. The method of claim 3, wherein, in response to determining that the first text segment has at least one identical reference cluster, determining whether each of the at least one identical reference cluster satisfies the preset condition comprises:
for each identical reference cluster in the at least one identical reference cluster, traversing the plurality of reference words included in the identical reference cluster, and determining one by one whether each reference word in the plurality of reference words is identical to the target named entity; and
in response to determining that the identical reference cluster includes a reference word identical to the target named entity, determining that the identical reference cluster satisfies the preset condition.
5. The method of any of claims 1-4, further comprising:
in response to determining that the first text segment does not have a plurality of target reference words consistent with the target named entity reference, concatenating the clauses of the first text segment that include the target named entity to obtain a second text sample; and
training the text classification model based on the second text sample and the first tag.
6. The method of any of claims 1-4, wherein the plurality of words in the first text segment each have a respective word index, the method further comprising:
performing clause segmentation on the first text segment based on punctuation marks in the first text segment to obtain a plurality of clauses and a clause index of each clause in the plurality of clauses;
establishing a sentence mapping relation between the clause indexes of the plurality of clauses and the word indexes of the plurality of words based on the subordinate relations between the plurality of words and the plurality of clauses;
performing word segmentation processing on the first text segment based on a word segmentation tool to obtain a plurality of word segments and a word segmentation index of each word segment in the plurality of word segments; and
establishing a word segmentation mapping relation between the word segmentation indexes of the plurality of word segments and the word indexes of the plurality of words based on the subordinate relations between the plurality of words and the plurality of word segments,
wherein, in response to determining that the first text segment includes a plurality of target reference words consistent with the target named entity reference, splicing the clauses where the plurality of target reference words are respectively located to obtain a first text sample comprises:
acquiring and splicing, based on the sentence mapping relation and the word segmentation mapping relation, the clauses where the plurality of target reference words are respectively located.
7. The method of claim 6, wherein acquiring and splicing the clauses where the plurality of target reference words are respectively located based on the sentence mapping relation and the word segmentation mapping relation comprises:
determining, for each target reference word in the plurality of target reference words, the word index of the word corresponding to the target reference word based on the word segmentation index of the target reference word and the word segmentation mapping relation;
determining the clause index of the clause where the target reference word is located based on the word index of the word corresponding to the target reference word and the sentence mapping relation; and
acquiring and splicing the clauses where the plurality of target reference words are respectively located based on the corresponding clause indexes.
8. The method of any of claims 1-4, wherein obtaining an original text, a first tag corresponding to the original text, and a target named entity, and determining a first text segment from the original text based on the target named entity comprises:
determining a text window whose length is the first number of words based on a position where the target named entity first appears in the original text; and
cutting the original text based on the text window to obtain the first text segment.
9. The method of claim 8, wherein the text window is centered on a location in the original text where the target named entity first appears.
10. The method of any of claims 1-4, wherein the first number is 1000.
11. The method of any of claims 1-4, wherein the first tag indicates a tendency classification result for the original text, and the first tag is determined according to a source of the original text.
12. The method of claim 11, wherein the original text comprises news text and social media text.
13. The method of claim 11, wherein the original text is obtained by one of: web crawlers, databases, data interfaces.
14. A text classification method, comprising:
acquiring a text to be predicted and a target named entity, and determining a second text segment from the text to be predicted based on the target named entity, wherein the second text segment comprises a first number of words;
determining whether the second text segment includes a plurality of target reference words consistent with the target named entity reference;
in response to determining that the second text segment includes a plurality of target reference words consistent with the target named entity reference, splicing the clauses where the plurality of target reference words are respectively located, so as to obtain a first text to be classified; and
acquiring a text classification result output by a text classification model based on the first text to be classified.
15. The method of claim 14, wherein determining whether the second text segment includes a plurality of target reference words consistent with the target named entity reference comprises:
determining whether the second text segment has a target identical reference cluster meeting a preset condition, wherein the preset condition indicates that a plurality of reference words with identical references included in the corresponding identical reference cluster comprise reference words identical to the target named entity; and
in response to determining that the second text segment has the target identical reference cluster meeting the preset condition, determining the plurality of reference words in the target identical reference cluster as the plurality of target reference words consistent with the target named entity reference.
16. The method of claim 15, wherein determining whether the second text segment has a target identical reference cluster that satisfies a preset condition comprises:
determining, based on a co-reference resolution tool, whether the second text segment has at least one identical reference cluster, each identical reference cluster of the at least one identical reference cluster comprising a plurality of reference words in the second text segment that are identical in reference;
In response to determining that the second text segment has at least one identical reference cluster, determining whether each identical reference cluster in the at least one identical reference cluster satisfies the preset condition; and
determining the identical reference cluster meeting the preset condition as the target identical reference cluster.
17. The method of claim 16, wherein in response to determining that the second text segment has at least one identical reference cluster, determining whether each of the at least one identical reference cluster satisfies the preset condition comprises:
for each identical reference cluster in the at least one identical reference cluster, traversing the plurality of reference words included in the identical reference cluster, and determining one by one whether each reference word in the plurality of reference words is identical to the target named entity; and
in response to determining that the identical reference cluster includes a reference word identical to the target named entity, determining that the identical reference cluster satisfies the preset condition.
18. The method of any of claims 14-17, further comprising:
in response to determining that the second text segment does not have a plurality of target reference words consistent with the target named entity reference, splicing the clauses containing the target named entity in the second text segment to obtain a second text to be classified; and
acquiring a text classification result output by the text classification model based on the second text to be classified.
19. The method of any of claims 14-17, wherein the plurality of words in the second text segment each have a respective word index, the method further comprising:
performing clause segmentation on the second text segment based on punctuation marks in the second text segment to obtain a plurality of clauses and a clause index of each clause in the plurality of clauses;
establishing a sentence mapping relation between the clause indexes of the plurality of clauses and the word indexes of the plurality of words based on the subordinate relations between the plurality of words and the plurality of clauses;
performing word segmentation processing on the second text segment based on a word segmentation tool to obtain a plurality of word segments and a word segmentation index of each word segment in the plurality of word segments; and
establishing a word segmentation mapping relation between the word segmentation indexes of the plurality of word segments and the word indexes of the plurality of words based on the subordinate relations between the plurality of words and the plurality of word segments,
wherein, in response to determining that the second text segment includes a plurality of target reference words consistent with the target named entity reference, splicing the clauses where the plurality of target reference words are respectively located to obtain a first text to be classified comprises:
directly acquiring and splicing, based on the sentence mapping relation and the word segmentation mapping relation, the clauses where the plurality of target reference words are respectively located.
20. The method of claim 19, wherein acquiring and splicing the clauses where the plurality of target reference words are respectively located based on the sentence mapping relation and the word segmentation mapping relation comprises:
determining, for each target reference word in the plurality of target reference words, the word index of the word corresponding to the target reference word based on the word segmentation index of the target reference word and the word segmentation mapping relation;
determining the clause index of the clause where the target reference word is located based on the word index of the word corresponding to the target reference word and the sentence mapping relation; and
acquiring and splicing the clauses where the plurality of target reference words are respectively located based on the corresponding clause indexes.
21. The method of any of claims 14-17, wherein obtaining text to be predicted and a target named entity, and determining a second text segment from the text to be predicted based on the target named entity comprises:
determining a text window whose length is the first number of words based on a position where the target named entity first appears in the text to be predicted; and
cutting the text to be predicted based on the text window to obtain the second text segment.
22. The method of claim 21, wherein the text window is centered on a location in the text to be predicted where the target named entity first appears.
23. The method of any of claims 14-17, wherein the first number is 1000.
24. A training device for a text classification model, comprising:
a first obtaining unit configured to obtain an original text, a first tag corresponding to the original text, and a target named entity, and determine a first text segment from the original text based on the target named entity, the first text segment including a first number of words;
a first determining unit configured to determine whether the first text segment includes a plurality of target reference words consistent with the target named entity reference;
a first splicing unit configured to, in response to determining that the first text segment comprises a plurality of target reference words consistent with the target named entity reference, splice the clauses where the plurality of target reference words are respectively located, so as to obtain a first text sample; and
a training unit configured to train the text classification model based on the first text sample and the first tag.
25. A text classification device, comprising:
a second obtaining unit configured to obtain a text to be predicted and a target named entity, and determine a second text segment from the text to be predicted based on the target named entity, the second text segment including a first number of words;
a second determining unit configured to determine whether the second text segment includes a plurality of target reference words consistent with the target named entity reference;
a second splicing unit configured to, in response to determining that the second text segment comprises a plurality of target reference words consistent with the target named entity reference, splice the clauses where the plurality of target reference words are respectively located, so as to obtain a first text to be classified; and
a third acquiring unit configured to acquire a text classification result output by the text classification model based on the first text to be classified.
26. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.
27. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-13.
28. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-13.