CN117076667A - Classification label determining method and device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN117076667A
Authority
CN
China
Prior art keywords
text
data
cluster
text data
main
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310878248.7A
Other languages
Chinese (zh)
Inventor
曹靖楠
王智君
魏一雄
杨仁杰
王聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202310878248.7A
Publication of CN117076667A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G06F16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The specification discloses a method, a device, a storage medium and electronic equipment for determining classification labels. In the determining process of the classification label, the terminal device may first classify each text data in a clustering manner, determine, for each text cluster, a candidate keyword corresponding to the text cluster, determine a main keyword from the candidate keywords corresponding to the text cluster, and determine the classification label according to the main keyword of the text cluster, so as to classify the text data to be classified according to the classification label.

Description

Classification label determining method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for determining a classification label, a storage medium, and an electronic device.
Background
With the development of technology, the ways of surveying and collecting data keep multiplying. Collecting survey data over a network has become popular because it is convenient and fast, which has greatly promoted the development of modern society.
After survey data has been collected over the network, the collected data needs to be sorted. For example, an organization may collect, over the network, problems in society reported by the public. Sorting and classifying the collected problems currently has to be done manually, which consumes a lot of manpower and material resources. In addition, the classification labels in the classification system used for manual sorting are not fine-grained enough (i.e., the same piece of collected problem data may belong both to the problem type corresponding to one classification label and to the problem type corresponding to another classification label), and the classification labels in the classification system cannot completely cover all problem types, so new classification labels need to be determined. At present, however, new classification labels have to be determined manually, which is time-consuming and labor-intensive.
Therefore, how to determine new classification labels flexibly and efficiently has become an urgent problem to be solved.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a storage medium, and an electronic device for determining a classification label, so as to partially solve the foregoing problems in the prior art.
The technical scheme adopted in the specification is as follows:
The specification provides a method for determining a classification label, comprising the following steps:
obtaining data to be classified, wherein the data to be classified comprises pieces of text data;
for each piece of text data, extracting topic sentence data from the text data, and determining a topic sentence vector corresponding to the text data according to the topic sentence data corresponding to the text data;
clustering the text data according to the topic sentence vector corresponding to each piece of text data to obtain text clusters;
for each text cluster, extracting candidate keywords from the topic sentences corresponding to the text data contained in the text cluster, and determining main keywords corresponding to the text cluster from the candidate keywords;
and determining a classification label according to the main keywords corresponding to each text cluster, so as to classify the text data to be classified according to the classification label.
Optionally, the method further comprises:
and determining a summary sentence corresponding to each text cluster according to the main keyword corresponding to the text cluster, wherein the summary sentence corresponding to the text cluster is used for representing the summary of the semantics of the main keyword corresponding to the text cluster.
Optionally, after acquiring the data to be classified, the method further comprises:
performing a data update on the text data, and re-determining the updated text data as the text data in the data to be classified;
wherein performing the data update on the text data specifically comprises:
for each piece of text data, segmenting the text corresponding to the text data into short words by a preset word segmentation tool, and determining the short words obtained by segmentation as the short words corresponding to the text data;
for each piece of text data, deleting target words from the short words corresponding to the text data, and re-determining the short words remaining after the deletion as the short words corresponding to the text data, wherein the target words comprise at least stop words, garbled words and privacy words, a stop word being a preset short word that is meaningless for classification, a garbled word being a short word consisting of garbled characters, and a privacy word being a short word related to private information;
and, for each piece of text data, updating the text data according to the short words corresponding to the text data to determine the updated text data.
Optionally, determining the main keywords corresponding to the text cluster from the candidate keywords specifically includes:
determining the probability that each candidate keyword is a main keyword according to the number of occurrences of each candidate keyword in the text cluster and the preset weight of each candidate keyword;
and sorting the candidate keywords in descending order of probability, and determining the candidate keywords at the set ranks as the main keywords corresponding to the text cluster.
Optionally, clustering the text data according to the topic sentence vector corresponding to each piece of text data to obtain text clusters specifically includes:
clustering the topic sentence vectors corresponding to the pieces of text data to obtain topic sentence vector clusters;
for each topic sentence vector cluster, determining the text data corresponding to the topic sentence vectors contained in the topic sentence vector cluster as the text cluster corresponding to the topic sentence vector cluster;
and determining the text clusters according to the text cluster corresponding to each topic sentence vector cluster.
Optionally, for each piece of text data, extracting the topic sentence data from the text data specifically includes:
for each piece of text data, inputting the text data into a preset text summary generation model, so as to extract the topic sentence data from the text data through the preset text summary generation model.
Optionally, for each text cluster, determining the summary sentence corresponding to the text cluster according to the main keywords corresponding to the text cluster specifically includes:
for each text cluster, inputting the main keywords corresponding to the text cluster into a preset text generation model, so as to determine the summary sentence corresponding to the text cluster through the preset text generation model.
The specification provides a device for determining a classification label, comprising:
an acquisition module, used for obtaining data to be classified, wherein the data to be classified comprises pieces of text data;
an extraction module, used for extracting, for each piece of text data, topic sentence data from the text data, and determining a topic sentence vector corresponding to the text data according to the topic sentence data corresponding to the text data;
a clustering module, used for clustering the text data according to the topic sentence vector corresponding to each piece of text data to obtain text clusters;
a determining module, used for extracting, for each text cluster, candidate keywords from the topic sentences corresponding to the text data contained in the text cluster, and determining main keywords corresponding to the text cluster from the candidate keywords;
and a classification module, used for determining a classification label according to the main keywords corresponding to each text cluster, so as to classify the text data to be classified according to the classification label.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for determining a classification label described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method for determining a classification label described above when executing the program.
At least one of the above technical schemes adopted in this specification can achieve the following beneficial effects:
In the method for determining a classification label provided in this specification, data to be classified is obtained, the data to be classified comprising pieces of text data. For each piece of text data, topic sentence data is extracted from the text data, and a topic sentence vector corresponding to the text data is determined according to the topic sentence data. The text data is clustered according to the topic sentence vector corresponding to each piece of text data to obtain text clusters. For each text cluster, candidate keywords are extracted from the topic sentences corresponding to the text data contained in the text cluster, and main keywords corresponding to the text cluster are determined from the candidate keywords. A classification label is then determined according to the main keywords corresponding to each text cluster, so that the text data to be classified is classified according to the classification label.
In the above method, in the determining process of the classification label, the terminal device may first classify each text data in a clustering manner, determine, for each text cluster, a candidate keyword corresponding to the text cluster, determine a main keyword from the candidate keywords corresponding to the text cluster, and determine the classification label according to the main keyword of the text cluster, so as to classify the text data to be classified according to the classification label.
Drawings
The accompanying drawings are included to provide a further understanding of the specification. The exemplary embodiments of the present specification and their description are intended to explain the specification and are not intended to limit it unduly. In the drawings:
FIG. 1 is a flow chart of a method for determining a classification label provided in the present specification;
FIG. 2 is a schematic diagram of the classification label determination logic provided in the present specification;
FIG. 3 is a schematic structural diagram of a device for determining a classification label provided in the present specification;
FIG. 4 is a schematic structural diagram of the electronic device corresponding to FIG. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flow chart of a method for determining a classification label provided in the present specification, which includes the following steps:
S101: Obtain data to be classified, wherein the data to be classified comprises pieces of text data.
S102: For each piece of text data, extract topic sentence data from the text data, and determine the topic sentence vector corresponding to the text data according to the topic sentence data corresponding to the text data.
The execution subject of the method for determining a classification label in the present specification may be a terminal device such as a desktop computer or a notebook computer, or may be a server. The method in the embodiments of the present specification is described below taking the terminal device as the execution subject by way of example.
In a specific implementation of the present specification, the terminal device first obtains the data to be classified, where the data to be classified may include pieces of text data.
For example, an organization may collect, over a network, articles in which people report problems they encounter in social life and need the organization to solve. Each article collected by the organization is a piece of the text data mentioned above, and together these pieces of text data form the data to be classified.
Once the data to be classified is obtained, the terminal device can perform a data update on the text data and re-determine the updated text data as the text data in the data to be classified.
The specific process of data updating of the text data may be:
For each piece of text data, the terminal device first segments the text corresponding to the text data into short words through a preset word segmentation tool (which may be the jieba word segmentation tool), and determines the short words obtained by segmentation as the short words corresponding to the text data. Then, for each piece of text data, the terminal device deletes the target words from the short words corresponding to the text data, and re-determines the short words remaining after the deletion as the short words corresponding to the text data.
The target words include at least stop words, garbled words and privacy words. A stop word is a preset short word that is meaningless for classification, a garbled word is a short word consisting of garbled characters, and a privacy word is a short word related to private information. Privacy word deletion includes deletion of personal privacy information (such as address information) and deletion of digits (such as contact numbers).
The stop words here may be short words such as "question" and "etc.", and the privacy words may be short words representing addresses, contact details and the like.
Then, for each piece of text data, the terminal device can update the text data according to the short words corresponding to the text data to determine the updated text data.
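The data update described above can be illustrated with a minimal sketch. The word lists, the garbled-word heuristic and the whitespace tokenizer are assumptions made only for illustration; the specification names jieba as one possible segmentation tool but does not give the contents of the stop word or privacy word lists.

```python
# Hypothetical word lists; the specification does not define their contents.
STOP_WORDS = {"the", "of", "is", "etc", "question"}
PRIVACY_WORDS = {"address", "phone"}


def is_garbled(word: str) -> bool:
    """Assumed heuristic: a word is garbled if it contains the Unicode
    replacement character or has no letter or digit at all."""
    if "\ufffd" in word:
        return True
    return not any(ch.isalnum() for ch in word)


def is_private(word: str) -> bool:
    """Listed privacy words and any word containing a digit count as private."""
    return word.lower() in PRIVACY_WORDS or any(ch.isdigit() for ch in word)


def update_text(text: str, tokenize=str.split) -> str:
    """Segment the text, delete the target words, and rebuild the text.

    `tokenize` stands in for the preset word segmentation tool; whitespace
    splitting is used here only so the sketch needs no third-party package.
    """
    words = tokenize(text)
    kept = [
        w for w in words
        if w.lower() not in STOP_WORDS and not is_garbled(w) and not is_private(w)
    ]
    return " ".join(kept)


print(update_text("the road lamp of district 5 is broken ### phone 13800000000"))
```

For Chinese text, a segmentation function from a tool such as jieba could be passed as `tokenize` instead of `str.split`.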
For each piece of text data, the terminal device may extract topic sentence data from the text data, and determine the topic sentence vector corresponding to the text data based on the topic sentence data corresponding to the text data.
The topic sentence data may be extracted as follows: for each piece of text data, the terminal device inputs the text data into a preset text summary generation model, so as to extract the topic sentence data from the text data through the preset text summary generation model.
The topic sentence vector may be determined as follows: for each piece of topic sentence data, the terminal device inputs the topic sentence data into a preset topic sentence vector determination model (which may be a Sentence-BERT model) and encodes the topic sentence data through that model, and the encoded topic sentence data is taken as the topic sentence vector. It should be noted that each piece of text data corresponds to only one topic sentence vector.
The distance between topic sentence vectors can be used to represent the semantic similarity between the texts of the corresponding text data. Specifically, the smaller the distance between two topic sentence vectors, the higher the semantic similarity between the two corresponding texts, that is, the closer the semantic content expressed by the two texts. Conversely, the larger the distance between two topic sentence vectors, the lower the semantic similarity between the two corresponding texts, that is, the less close the semantic content expressed by the two texts.
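The relationship between vector distance and semantic similarity can be sketched as follows. The specification does not say which distance is used, so cosine distance is an assumption here, and the two-dimensional vectors are made-up stand-ins for real topic sentence vectors.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors: a and b point in nearly the same direction, c does not.
a = [1.0, 0.0]
b = [0.9, 0.1]
c = [0.0, 1.0]

# A smaller distance is read as higher semantic similarity.
print(cosine_distance(a, b) < cosine_distance(a, c))
```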
S103: Cluster the text data according to the topic sentence vector corresponding to each piece of text data to obtain text clusters.
S104: For each text cluster, extract candidate keywords from the topic sentences corresponding to the text data contained in the text cluster, and determine the main keywords corresponding to the text cluster from the candidate keywords.
Once the topic sentence vector corresponding to each piece of text data has been determined, the terminal device may input the topic sentence vectors into a preset k-means clustering model, so as to cluster the topic sentence vectors through the preset k-means clustering model and obtain topic sentence vector clusters. Then, for each topic sentence vector cluster, the terminal device may determine the text data corresponding to the topic sentence vectors contained in the topic sentence vector cluster as the text cluster corresponding to that topic sentence vector cluster.
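The clustering step can be sketched with a small pure-Python k-means. This is only an illustration under stated assumptions: the number of clusters, the iteration count, the random seed and the toy vectors are all hypothetical, and a production system would more likely use an existing library implementation.

```python
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, v in enumerate(vectors):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def text_clusters(texts, vectors, k):
    """Map each topic sentence vector cluster back to its text data."""
    assignment, _ = kmeans(vectors, k)
    clusters = {}
    for text, c in zip(texts, assignment):
        clusters.setdefault(c, []).append(text)
    return list(clusters.values())


texts = ["road A", "road B", "noise A", "noise B"]
vectors = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
print(text_clusters(texts, vectors, k=2))
```

Because each piece of text data has exactly one topic sentence vector, the cluster index of a vector can be used directly as the cluster index of its text.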
Then, for each text cluster, the terminal device may extract candidate keywords from the topic sentences corresponding to the text data contained in the text cluster, and determine the main keywords corresponding to the text cluster from the candidate keywords.
Specifically, for each text cluster, the terminal device may segment the topic sentences corresponding to the text data contained in the text cluster into short words through the preset word segmentation tool, and determine the short words obtained by segmentation as the short words corresponding to the text cluster.
Then, for each text cluster, the terminal device may delete the target words from the short words corresponding to the text cluster, and determine the short words remaining after the deletion as the candidate keywords.
The terminal device may determine the main keywords corresponding to each text cluster from the candidate keywords of the text cluster as follows: the terminal device inputs the candidate keywords of the text cluster into a preset document topic generation model (Latent Dirichlet Allocation, LDA); through this model, the probability that each candidate keyword is a main keyword is determined according to the number of occurrences of each candidate keyword in the text cluster and the preset weight of each candidate keyword; the candidate keywords are then sorted in descending order of probability, and the candidate keywords at the set ranks are determined as the main keywords corresponding to the text cluster.
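The ranking by occurrence count and preset weight can be illustrated with a simplified sketch. It is not an LDA implementation: it only normalizes count times weight into a probability-like score, and the weights, the default weight and the number of ranks kept (`top_n`) are hypothetical values.

```python
from collections import Counter


def main_keywords(candidates, weights, top_n=2, default_weight=1.0):
    """Score each candidate by count * weight, normalize, and keep the top ranks.

    Returns (ranked main keywords, probability per candidate keyword).
    """
    counts = Counter(candidates)
    if not counts:
        return [], {}
    scores = {w: counts[w] * weights.get(w, default_weight) for w in counts}
    total = sum(scores.values())
    probs = {w: s / total for w, s in scores.items()}
    # Sort from large to small by probability; ties are broken alphabetically.
    ranked = sorted(probs, key=lambda w: (-probs[w], w))
    return ranked[:top_n], probs


candidates = ["road", "road", "lamp", "repair", "road", "lamp"]
weights = {"repair": 4.0}  # hypothetical preset weight
keywords, probs = main_keywords(candidates, weights)
print(keywords)
```

With these toy inputs, "repair" appears once but carries a higher weight, so it outranks "road", which appears three times with the default weight.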
S105: Determine a classification label according to the main keywords corresponding to each text cluster, so as to classify the text data to be classified according to the classification label.
Once the main keywords corresponding to each text cluster have been determined, the terminal device can determine, for each text cluster, a summary sentence corresponding to the text cluster according to the main keywords corresponding to the text cluster, where the summary sentence corresponding to the text cluster represents a summary of the semantics of the main keywords corresponding to the text cluster.
Specifically, for each text cluster, the terminal device may input the main keywords corresponding to the text cluster into a preset text generation model, so as to determine the summary sentence corresponding to the text cluster through the preset text generation model.
The preset text generation model is a trained text generation model. During training, preset main keywords whose summary sentence has already been determined are input into the text generation model, and the text generation model is then trained with minimizing the deviation between the summary sentence it generates and the determined summary sentence as the optimization objective.
Then, the terminal device can determine a classification label according to the main keywords corresponding to each text cluster, so as to classify the text data to be classified according to the classification label.
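The specification does not state how a label is formed from the main keywords or how text is matched to a label, so the following is only one possible sketch under explicit assumptions: the label is the main keywords joined with "/", and a text is assigned to the label whose main keywords overlap most with the words of the text. Both `make_label` and `classify` are hypothetical helpers.

```python
def make_label(keywords):
    """Assumed convention: a label is the main keywords joined with '/'."""
    return "/".join(keywords)


def classify(text_words, label_keywords):
    """Return the label whose main keywords overlap most with the text.

    `label_keywords` maps each label to its set of main keywords.
    Returns None when no label shares any keyword with the text.
    """
    best_label, best_overlap = None, 0
    words = set(text_words)
    for label, keywords in label_keywords.items():
        overlap = len(words & set(keywords))
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


labels = {
    make_label(["road", "repair"]): {"road", "repair"},
    make_label(["noise", "night"]): {"noise", "night"},
}
print(classify(["night", "noise", "construction"], labels))
```

A text that matches no label returns `None`, which could serve as a signal that a new classification label is needed.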
Fig. 2 is a schematic diagram of the classification label determination logic provided in the present specification.
As shown in Fig. 2, after obtaining each piece of text data in the data to be classified, the terminal device may perform a data update on the text data and re-determine the updated text data as the text data in the data to be classified. The target word deletion in the data update includes stop word deletion, garbled word deletion and privacy word deletion, where the privacy word deletion includes personal privacy information deletion and digit deletion; these are described in detail above and are not repeated here.
Then, for each piece of text data, the terminal device may extract the topic sentence data from the text data through a preset text summary generation model.
The terminal device can determine the topic sentence vector corresponding to the text data according to the topic sentence data corresponding to the text data through a preset topic sentence vector determination model.
The terminal device can cluster the text data according to the topic sentence vector corresponding to each piece of text data through a k-means clustering model to obtain text clusters.
For each text cluster, the terminal device can determine the candidate keywords corresponding to the text cluster from the topic sentence data of the text cluster through a preset word segmentation tool, and can determine the main keywords corresponding to the text cluster from those candidate keywords through a preset document topic generation model (Latent Dirichlet Allocation, LDA).
Moreover, the terminal device can also determine, through a preset text generation model, a summary sentence corresponding to each text cluster according to the main keywords corresponding to the text cluster. The "model generation" in Fig. 2 refers to determining the topic sentence data, the topic sentence vectors, the text clusters, the main keywords and the summary sentences by means of the models described above.
In the above method, in the determining process of the classification label, the terminal device may first classify each text data in a clustering manner, determine, for each text cluster, a candidate keyword corresponding to the text cluster, determine a main keyword from the candidate keywords corresponding to the text cluster, and determine the classification label according to the main keyword of the text cluster, so as to classify the text data to be classified according to the classification label.
The foregoing is a method implemented by one or more of the embodiments of the present specification, and based on the same concept, the present specification further provides a corresponding apparatus for determining a class label, as shown in fig. 3.
Fig. 3 is a schematic diagram of an apparatus for determining a classification label provided in the present specification, including:
an obtaining module 301, configured to obtain data to be classified, where the data to be classified includes pieces of text data;
an extracting module 302, configured to extract, for each piece of text data, topic sentence data from the text data, and determine a topic sentence vector corresponding to the text data according to the topic sentence data corresponding to the text data;
a clustering module 303, configured to cluster the text data according to the topic sentence vector corresponding to each piece of text data, so as to obtain text clusters;
a determining module 304, configured to extract, for each text cluster, candidate keywords from the topic sentences corresponding to the text data contained in the text cluster, and determine main keywords corresponding to the text cluster from the candidate keywords;
a classification module 305, configured to determine a classification label according to the main keywords corresponding to each text cluster, so as to classify the text data to be classified according to the classification label.
Optionally, the apparatus further comprises:
a summarization module 306, configured to determine, for each text cluster, a summary sentence corresponding to the text cluster according to the main keywords corresponding to the text cluster, where the summary sentence corresponding to the text cluster represents a summary of the semantics of the main keywords corresponding to the text cluster.
Optionally, the apparatus further comprises:
an updating module 307, configured to perform a data update on the text data after the data to be classified is obtained, and re-determine the updated text data as the text data in the data to be classified. Performing the data update on the text data specifically comprises: for each piece of text data, segmenting the text corresponding to the text data into short words by a preset word segmentation tool, and determining the short words obtained by segmentation as the short words corresponding to the text data; for each piece of text data, deleting target words from the short words corresponding to the text data, and re-determining the short words remaining after the deletion as the short words corresponding to the text data, wherein the target words comprise at least stop words, garbled words and privacy words, a stop word being a preset short word that is meaningless for classification, a garbled word being a short word consisting of garbled characters, and a privacy word being a short word related to private information; and, for each piece of text data, updating the text data according to the short words corresponding to the text data to determine the updated text data.
Optionally, the classifying module 305 is specifically configured to determine, according to the number of occurrences of each candidate keyword in the text cluster and the preset weight of each candidate keyword, the probability that each candidate keyword is a main keyword; and sort the candidate keywords in descending order of probability, and determine the candidate keywords at the set ranks as the main keywords corresponding to the text cluster.
Optionally, the clustering module 303 is specifically configured to cluster the topic sentence vectors corresponding to the pieces of text data to obtain topic sentence vector clusters; for each topic sentence vector cluster, determine the text data corresponding to the topic sentence vectors contained in the topic sentence vector cluster as the text cluster corresponding to the topic sentence vector cluster; and determine the text clusters according to the text cluster corresponding to each topic sentence vector cluster.
Optionally, the extracting module 302 is specifically configured to, for each piece of text data, input the text data into a preset text summary generation model, so as to extract the topic sentence data from the text data through the preset text summary generation model.
Optionally, the summarization module 306 is specifically configured to, for each text cluster, input the main keywords corresponding to the text cluster into a preset text generation model, so as to determine, through the preset text generation model, the summary sentence corresponding to the text cluster.
The present specification also provides a computer-readable storage medium storing a computer program operable to perform the method for determining a classification label provided in Fig. 1 above.
The present specification also provides, in Fig. 4, a schematic structural diagram of an electronic device corresponding to Fig. 1. At the hardware level, as shown in Fig. 4, the electronic device includes a processor, an internal bus, a network interface, a memory and a non-volatile storage, and may of course include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the method for determining a classification label described above with respect to Fig. 1.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely performing a little logic programming of the method flow in one of the hardware description languages described above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing the various functions may even be regarded as both software modules implementing the method and structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units. Of course, when implementing the present specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
This specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing describes merely embodiments of the present specification and is not intended to limit the specification. Various modifications and alterations to this specification will be apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like made within the spirit and principles of the present specification are intended to be included within the scope of the claims of the present specification.

Claims (10)

1. A classification label determination method, comprising:
acquiring data to be classified, wherein the data to be classified comprises a plurality of pieces of text data;
for each text data, extracting topic sentence data from the text data, and determining a topic sentence vector corresponding to the text data according to the topic sentence data corresponding to the text data;
clustering the text data according to the topic sentence vector corresponding to each text data to obtain text clusters;
for each text cluster, extracting candidate keywords from the topic sentences corresponding to the text data contained in the text cluster, and determining main keywords corresponding to the text cluster from the candidate keywords;
and determining a classification label according to the main keywords corresponding to each text cluster, so as to classify the text data to be classified according to the classification label.
2. The method of claim 1, wherein the method further comprises:
for each text cluster, determining a summary sentence corresponding to the text cluster according to the main keywords corresponding to the text cluster, wherein the summary sentence corresponding to the text cluster is used to summarize the semantics of the main keywords corresponding to the text cluster.
3. The method of claim 1, wherein after acquiring the data to be classified, the method further comprises:
updating the text data, and re-determining the updated text data as the text data in the data to be classified;
wherein updating the text data specifically comprises:
for each text data, segmenting the text corresponding to the text data into short words with a preset word segmentation tool, and determining the segmented short words as the short words corresponding to the text data;
for each text data, deleting target words from the short words corresponding to the text data, and re-determining the short words remaining after the deletion as the short words corresponding to the text data, wherein the target words comprise at least stop words, garbled words, and privacy words, a stop word being a preset short word that is meaningless for classification, a garbled word being a short word consisting of garbled characters, and a privacy word being a short word related to private information;
and for each text data, updating the text data according to the short words corresponding to the text data, and determining the updated text data.
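As a rough illustration of the data update in claim 3 (not part of the claims), the segment, filter, and rebuild steps might look like the sketch below. The whitespace tokenizer, the stop-word set, the garbled-word heuristic, and the privacy pattern are all hypothetical stand-ins; the claim only requires a preset word segmentation tool and preset target words.

```python
import re

# Hypothetical preset stop words that carry no meaning for classification.
STOP_WORDS = {"the", "a", "is", "of"}

# Hypothetical privacy pattern: an 11-digit number or an e-mail address.
PRIVACY_PATTERN = re.compile(r"^(\d{11}|[\w.+-]+@[\w-]+\.[\w.]+)$")


def is_garbled(word):
    """Heuristic (an assumption): a word is garbled if it contains the
    Unicode replacement character or has no alphanumeric character."""
    return "\ufffd" in word or not any(ch.isalnum() for ch in word)


def update_text(text):
    """Segment the text, delete target words, and rebuild the text."""
    # Whitespace splitting stands in for the preset word segmentation tool.
    words = text.split()
    kept = [
        w for w in words
        if w.lower() not in STOP_WORDS
        and not is_garbled(w)
        and not PRIVACY_PATTERN.match(w)
    ]
    return " ".join(kept)


cleaned = update_text("the refund of order 13800000000 is delayed \ufffd\ufffd")
```

For Chinese text, a dedicated word segmentation library would replace `text.split()`; the filtering step would be unchanged.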
4. The method of claim 1, wherein determining the main keywords corresponding to the text cluster from the candidate keywords specifically comprises:
determining, for each candidate keyword, the probability that the candidate keyword is a main keyword according to the number of occurrences of the candidate keyword in the text cluster and a preset weight of the candidate keyword;
and sorting the candidate keywords in descending order of probability, and determining the candidate keywords at the set ranks as the main keywords corresponding to the text cluster.
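As an illustration of the ranking in claim 4 (not part of the claims), one possible reading is sketched below. The claim does not state how occurrence count and preset weight are combined, so the formula used here, count multiplied by weight and normalized over the cluster, is an assumption, as are the names `main_keywords`, `top_n`, and `default_weight`.

```python
from collections import Counter


def main_keywords(candidates, weights, top_n=2, default_weight=1.0):
    """Rank candidate keywords by an assumed probability.

    Assumption: probability = count * weight, normalized so that the
    probabilities over all candidates in the cluster sum to 1.
    """
    counts = Counter(candidates)
    scores = {w: c * weights.get(w, default_weight) for w, c in counts.items()}
    total = sum(scores.values())
    probs = {w: s / total for w, s in scores.items()} if total else {}
    # Sort by descending probability; break ties alphabetically.
    ranked = sorted(probs, key=lambda w: (-probs[w], w))
    return ranked[:top_n], probs


# Toy candidate keywords extracted from one text cluster.
candidates = ["refund", "order", "refund", "delay", "order", "refund"]
weights = {"refund": 1.0, "order": 0.5, "delay": 2.0}
top, probs = main_keywords(candidates, weights)
```

Here `top_n` plays the role of the "set ranks" in the claim: the candidates occupying the first `top_n` positions become the main keywords.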
5. The method of claim 1, wherein clustering the text data according to the topic sentence vector corresponding to each text data to obtain text clusters specifically comprises:
clustering the topic sentence vectors corresponding to the text data to obtain topic sentence vector clusters;
for each topic sentence vector cluster, determining the text data corresponding to the topic sentence vectors contained in the topic sentence vector cluster as the text cluster corresponding to the topic sentence vector cluster;
and determining the text clusters according to the text cluster corresponding to each topic sentence vector cluster.
6. The method of claim 1, wherein, for each text data, extracting the topic sentence data from the text data specifically comprises:
for each text data, inputting the text data into a preset text summary generation model, so as to extract the topic sentence data from the text data through the preset text summary generation model.
7. The method of claim 2, wherein, for each text cluster, determining the summary sentence corresponding to the text cluster according to the main keywords corresponding to the text cluster specifically comprises:
for each text cluster, inputting the main keywords corresponding to the text cluster into a preset text generation model, so as to determine the summary sentence corresponding to the text cluster through the preset text generation model.
8. A classification label determination apparatus, comprising:
an acquisition module, configured to acquire data to be classified, wherein the data to be classified comprises a plurality of pieces of text data;
an extraction module, configured to, for each text data, extract topic sentence data from the text data, and determine a topic sentence vector corresponding to the text data according to the topic sentence data corresponding to the text data;
a clustering module, configured to cluster the text data according to the topic sentence vector corresponding to each text data to obtain text clusters;
a determining module, configured to, for each text cluster, extract candidate keywords from the topic sentences corresponding to the text data contained in the text cluster, and determine main keywords corresponding to the text cluster from the candidate keywords;
and a classification module, configured to determine a classification label according to the main keywords corresponding to each text cluster, so as to classify the text data to be classified according to the classification label.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-7 when executing the program.
CN202310878248.7A 2023-07-17 2023-07-17 Classification label determining method and device, storage medium and electronic equipment Pending CN117076667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310878248.7A CN117076667A (en) 2023-07-17 2023-07-17 Classification label determining method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310878248.7A CN117076667A (en) 2023-07-17 2023-07-17 Classification label determining method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117076667A true CN117076667A (en) 2023-11-17

Family

ID=88714291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310878248.7A Pending CN117076667A (en) 2023-07-17 2023-07-17 Classification label determining method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117076667A (en)

Similar Documents

Publication Publication Date Title
CN111488426B (en) Query intention determining method, device and processing equipment
CN117235226A (en) Question response method and device based on large language model
US20170344822A1 (en) Semantic representation of the content of an image
US20180046721A1 (en) Systems and Methods for Automatic Customization of Content Filtering
US20220342950A1 (en) System and method for searching based on text blocks and associated search operators
CN110765247A (en) Input prompting method and device for question-answering robot
CN111401062B (en) Text risk identification method, device and equipment
CN110674297B (en) Public opinion text classification model construction method, public opinion text classification device and public opinion text classification equipment
CN117591661B (en) Question-answer data construction method and device based on large language model
CN112417093B (en) Model training method and device
CN117076650B (en) Intelligent dialogue method, device, medium and equipment based on large language model
CN111859079A (en) Information searching method and device, computer equipment and storage medium
CN116662657A (en) Model training and information recommending method, device, storage medium and equipment
CN110119442A (en) A kind of dynamic searching method, device, equipment and medium
CN113032566B (en) Public opinion clustering method, device and equipment
CN117076667A (en) Classification label determining method and device, storage medium and electronic equipment
CN115686355A (en) Partition namespace solid state disk region allocation method, device and storage medium
CN111310069A (en) Evaluation method and device for timeliness search
CN114676257A (en) Conversation theme determining method and device
CN110968691B (en) Judicial hotspot determination method and device
CN113641766A (en) Relationship identification method and device, storage medium and electronic equipment
CN117807961B (en) Training method and device of text generation model, medium and electronic equipment
CN116340469B (en) Synonym mining method and device, storage medium and electronic equipment
CN111723567B (en) Text selection data processing method, device and equipment
CN116089577A (en) Keyword labeling method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination