CN117992611A - Label discovery method, label discovery device, storage medium and electronic device - Google Patents


Info

Publication number
CN117992611A
Authority
CN
China
Prior art keywords
text
classification
clustering
texts
screened
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410161771.2A
Other languages
Chinese (zh)
Inventor
习雨璇
刘克松
张磊
马呈芳
刘芳
侯政旭
Current Assignee
Ali Health Technology Hangzhou Co ltd
Original Assignee
Ali Health Technology Hangzhou Co ltd
Priority date
Filing date
Publication date
Application filed by Ali Health Technology Hangzhou Co ltd filed Critical Ali Health Technology Hangzhou Co ltd
Priority to CN202410161771.2A
Publication of CN117992611A


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this specification provide a tag discovery method, a tag discovery apparatus, a storage medium, and an electronic device. The tag discovery method includes: performing text classification on the texts in a text set to be screened using a text classification model to obtain classification results; screening the text set according to the classification results to obtain a text candidate set, where the classification category of each text in the candidate set does not belong to any existing label; clustering the text candidate set into K clusters, where K is an integer greater than 1; and selecting texts to be labeled from each of the K clusters, so that the selected texts can be labeled to obtain new labels. The tag discovery method can automatically screen out the texts to be labeled, thereby greatly reducing the workload, lowering labor costs, and improving the efficiency of building a new label system.

Description

Label discovery method, label discovery device, storage medium and electronic device
Technical Field
Embodiments in the present specification relate to the field of text labeling, and in particular, to a tag discovery method, a tag discovery apparatus, a storage medium, and an electronic device.
Background
Currently, when building a label system for commodity categories, the labels of different categories may be inconsistent, so new labels need to be defined for each new category. During new-label discovery, a large number of texts must be sampled, and the labels of those texts must then be defined manually.
However, because categories are numerous, manually sampling texts and defining labels one by one consumes a great deal of annotation labor; moreover, because the texts contain many semantic duplicates, manual review wastes a great deal of time on repeated viewing.
Disclosure of Invention
The embodiments in this specification provide a tag discovery method, a tag discovery apparatus, a storage medium, and an electronic device, which can automatically screen out the texts to be labeled, thereby greatly reducing the workload, lowering labor costs, and improving the efficiency of building a new label system.
One embodiment of this specification provides a tag discovery method, including: performing text classification on the texts in a text set to be screened using a text classification model to obtain classification results; screening the text set according to the classification results to obtain a text candidate set, where the classification category of each text in the candidate set does not belong to any existing label; clustering the text candidate set into K clusters, where K is an integer greater than 1; and selecting texts to be labeled from each of the K clusters, so that the selected texts can be labeled to obtain new labels.
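The four steps above can be summarized as a pipeline. The sketch below is a minimal, hypothetical illustration; the `classify` and `cluster` callables and the `k_samples` parameter are assumptions for illustration, not names from the specification:

```python
import random

def discover_labels(texts, classify, cluster, k_samples=3):
    """Hypothetical sketch of the four claimed steps: classify, screen,
    cluster, then sample texts per cluster for manual labeling."""
    # Step 1: classify every text against the existing label system.
    results = {t: classify(t) for t in texts}
    # Step 2: keep texts whose category is not an existing label
    # (modeled here as the classifier returning None).
    candidates = [t for t in texts if results[t] is None]
    # Step 3: group the candidates into K clusters.
    clusters = cluster(candidates)
    # Step 4: sample a few texts from each cluster for annotators.
    return [random.sample(c, min(k_samples, len(c))) for c in clusters]
```

In a real deployment, `classify` would be the trained text classification model and `cluster` a text clustering algorithm applied to encoder representations.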
In some embodiments, the tag discovery method further includes: clustering the texts in the text set to be screened using a text clustering algorithm to obtain a clustering result. In that case, screening the text set according to the classification results includes: screening the text set according to both the classification results and the clustering result to obtain the text candidate set.
In some embodiments, the clustering result includes M clusters, where M is an integer greater than 1, and screening the text set according to the classification results and the clustering result includes: for each of the M clusters, obtaining the entropy corresponding to the classification result of each text in the cluster; obtaining the average entropy of each cluster; and screening the text set according to the average entropy of each cluster to obtain the text candidate set.
In some embodiments, screening the text set according to the average entropy of each cluster includes: selecting, from the M clusters, N clusters whose average entropy is greater than a preset threshold, where N is an integer less than M; and taking the texts of the N clusters as the text candidate set.
In some embodiments, the tag discovery method further includes: training the text classification model with labeled texts, where the labeled texts are obtained by labeling the texts to be labeled.
In some embodiments, performing text classification on the texts in the text set to be screened using the text classification model includes: obtaining a text representation of each text using the text classification model, and deriving the probabilities over the label dimensions from the text representation as the classification result.
In some embodiments, the new label includes a user somatosensory label, which includes at least one of battery life (endurance), noise, and measurement accuracy.
One embodiment of this specification provides a tag discovery apparatus, including: a classification module, configured to perform text classification on the texts in a text set to be screened using a text classification model to obtain classification results; a screening module, configured to screen the text set according to the classification results to obtain a text candidate set, where the classification category of each text in the candidate set does not belong to any existing label; a clustering module, configured to cluster the text candidate set into K clusters, where K is an integer greater than 1; and a selection module, configured to select texts to be labeled from each of the K clusters, so that the selected texts can be labeled to obtain new labels.
An embodiment of the present specification provides an electronic device, including: a processor; a memory for storing the processor-executable instructions; the processor is configured to execute the tag discovery method according to any one of the foregoing embodiments.
An embodiment of the present specification provides a computer-readable storage medium having stored thereon computer-executable instructions, wherein the executable instructions, when executed by a processor, implement the tag discovery method according to any of the above embodiments.
An embodiment of the present specification provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the tag discovery method of any of the embodiments described above.
According to the embodiments provided in this specification, text classification is performed on the texts in a text set to be screened using a text classification model to obtain classification results; the text set is screened according to the classification results to obtain a text candidate set whose texts' classification categories do not belong to any existing label; the text candidate set is clustered into K clusters, where K is an integer greater than 1; and texts to be labeled are selected from each of the K clusters for labeling, thereby obtaining new labels. The technical solutions of these embodiments can automatically screen out semantically distinct texts that may require new labels, and only those texts are labeled manually, which greatly reduces the workload, lowers labor costs, and improves the efficiency of building a new label system.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present description, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a tag discovery method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a label discovery process according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of a text classification model according to an embodiment of the present disclosure.
Fig. 4 is a schematic flow chart of a tag discovery method according to another embodiment of the present disclosure.
Fig. 5 is a schematic diagram of a candidate text set screening procedure according to an embodiment of the present disclosure.
Fig. 6 is a schematic flow chart of text filtering a text set to be filtered according to a classification result and a clustering result according to an embodiment of the present disclosure.
Fig. 7 is a schematic flow chart of a tag discovery apparatus according to an embodiment of the present disclosure.
Fig. 8 is a schematic flow chart of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions of the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present specification, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
SUMMARY
To help merchants better facilitate purchases, customers' needs must be analyzed in depth. For example, in the e-commerce field, when analyzing the "decision factors" that influence customers' purchasing behavior, the questions users ask customer service need to be marked (annotated) with labels, so that commodity features along commodity dimensions can be formed from those labels.
To manage goods better, they are typically classified, and a label system is built for each category. Because the label systems of different categories may differ, texts need to be sampled and their labels defined manually while the label system of each category is being built. However, because categories are numerous, manually sampling texts and defining labels one by one consumes a great deal of annotation labor; moreover, because the texts contain many semantic duplicates, manual review wastes a great deal of time on repeated viewing.
Therefore, it is necessary to provide a tag discovery method, a tag discovery apparatus, a storage medium, and an electronic device, which can automatically screen a text to be labeled, thereby greatly reducing workload, reducing labor cost, and improving construction efficiency of a new tag system.
Example method
One embodiment of the present specification provides a tag discovery method. The tag discovery method may be applied to a server. As shown in fig. 1, the tag discovery method may include the following steps.
Step S101: and carrying out text classification on the texts in the text set to be screened by using the text classification model to obtain classification results.
The texts in the text set to be screened may be questions that users ask customer service. For example, as shown in fig. 2, user questions in the set may include "Can the elderly at home use it?", "Is it easy to use?", "Why are the two measurements different?", "The measurement is inaccurate", and so on; it should be understood that this specification does not limit the specific content of the texts.
It should be noted that all categories may share the same text classification model, and a text classification model trained on the annotated data of existing categories may be used to predict the labels of the texts to be screened. In some implementations, the text classification model may be used to obtain a text representation of each text, and the probabilities over the label dimensions may then be derived from the text representation as the classification result.
As shown in fig. 3, the text classification model (or classifier) may include an encoder (Encoder) and a multi-layer perceptron (Multilayer Perceptron, MLP).
In particular, the text may be encoded with an encoder to obtain a text representation, wherein the text is typically represented in the form of vectors. It should be appreciated that the encoder may be a variety of pre-trained models, such as Bert, albert models, etc., as the embodiments of the present disclosure are not limited in detail.
Then, the text representation may be mapped by the MLP layer into an m-dimensional space, where m is the number of existing labels in the label system, and normalized by softmax to obtain the probabilities P1, P2, …, Pm over the label dimensions.
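As a concrete illustration of this classification head, the sketch below applies a single linear layer followed by softmax to a text representation using NumPy; the parameter shapes and values are illustrative and untrained, not taken from the specification:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_representation(h, W, b):
    """Map a text representation h of shape (d,) to probabilities over
    m existing labels via one linear (MLP) layer plus softmax.
    W has shape (m, d) and b shape (m,); both are illustrative."""
    logits = W @ h + b         # project into the m-dimensional label space
    return softmax(logits)     # P1 ... Pm, summing to 1
```

Any encoder output of dimension d can be fed through such a head; the softmax guarantees the m outputs form a valid probability distribution.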
Step S102: and carrying out text screening on the text set to be screened according to the classification result to obtain a text candidate set, wherein the classification category of the text in the text candidate set does not belong to the existing label.
In this embodiment, texts that the text classification model cannot identify (whose predicted label is empty), or texts whose classification results are highly ambiguous (no label dimension clearly dominates), may be screened out as the text candidate set. If the model cannot identify a text, or its classification result is ambiguous, the text is not well characterized under the existing label system and may carry a new label; it can therefore be taken as a candidate text.
Step S103: and carrying out text clustering on the text candidate set to obtain K clustering clusters, wherein K is an integer greater than 1.
Text clustering refers to cluster analysis performed on texts: similar texts are assigned to the same cluster and dissimilar texts to different clusters, thereby partitioning the text set. The text candidate set may be clustered using a text clustering algorithm such as DBSCAN; the type of clustering algorithm is not specifically limited in this specification. Specifically, the encoder in the text classification model may be used to encode each text into a text representation, and the clustering algorithm may then be applied to those representations.
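For illustration, the following is a minimal, educational sketch of density-based clustering in the spirit of DBSCAN over text representations; in practice a library implementation (e.g., scikit-learn's DBSCAN) would be used, and the `eps` and `min_pts` values here are arbitrary assumptions:

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Minimal DBSCAN-style clustering over the rows of X.
    Returns one label per point; -1 means noise/unassigned.
    Educational sketch, not a tuned implementation."""
    n = len(X)
    # Pairwise Euclidean distances and eps-neighborhoods (self included).
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        # Skip points already assigned, and non-core points.
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:                    # expand the cluster from core points
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if len(neighbors[j]) >= min_pts:
                    frontier.extend(neighbors[j])
        cluster_id += 1
    return labels
```

Points in dense neighborhoods form clusters, and isolated points stay labeled -1 (noise) — convenient here, since outlier texts are exactly the ones that may carry new labels.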
Step S104: and selecting texts to be marked from the K clusters respectively so as to be convenient for marking the texts to be marked and obtaining new labels.
Because the labels of most texts within a cluster are consistent, several texts can be randomly selected from each cluster for labeling, thereby obtaining new labels and expanding the existing annotated data. As more categories are processed, the label system gradually becomes complete and converges: fewer and fewer new labels are added, and less and less data needs to be labeled. Compared with purely manual sampling and labeling, this greatly reduces annotation cost.
In some embodiments, the new label includes a user somatosensory label, which includes at least one of battery life (endurance), noise, and measurement accuracy; it should be understood that this specification does not specifically limit the content of the new label. For example, as shown in fig. 2, the label of the text "How long does one charge last?" may be "battery life", the label of "It is loud when in use" may be "noise", the label of "This measures inaccurately" may be "measurement accuracy", and so on.
According to the technical solutions provided by the embodiments of this specification, text classification is performed on the texts in a text set to be screened using a text classification model to obtain classification results; the text set is screened according to the classification results to obtain a text candidate set whose texts' classification categories do not belong to any existing label; the text candidate set is clustered into K clusters, where K is an integer greater than 1; and texts to be labeled are selected from each of the K clusters for labeling, thereby obtaining new labels.
In other embodiments, the tag discovery method further comprises: training the text classification model by using the labeling text, wherein the labeling text is obtained by labeling the text to be labeled.
Specifically, training may continue from the encoder parameters of the original text classification model using both the new and the existing annotated texts (with the MLP parameters resized according to the new number of labels), which reduces the number of training iterations of the text classification model and lowers the training cost.
Labeling the screened texts that the model could not classify and then iterating the model is a form of active learning; this iterative approach lets the model acquire new knowledge faster, so it converges with less annotated data. In addition, because each iteration of the text classification model starts from the previous model's parameters, it is a form of incremental learning, which further reduces the training cost.
One embodiment of the present specification provides a tag discovery method. The tag discovery method may be applied to a server. As shown in fig. 4, the tag discovery method may include the following steps.
Step S401: and carrying out text classification on the texts in the text set to be screened by using the text classification model to obtain classification results.
The texts in the text set to be screened may be questions that users ask customer service. For example, user questions in the set may include "Can the elderly at home use it?", "Is it easy to use?", "Why are the two measurements different?", "The measurement is inaccurate", and so on; it should be understood that this specification does not limit the specific content of the texts.
It should be noted that, each category may share the same text classification model, and the text classification model trained by the labeling data of the existing category may be used to predict the label of the text to be screened.
In some implementations, the text classification model (or classifier) may include an encoder (Encoder) and a multi-layer perceptron (Multilayer Perceptron, MLP).
In particular, the text may be encoded with an encoder to obtain a text representation, wherein the text is typically represented in the form of vectors. It should be appreciated that the encoder may be a variety of pre-trained models, such as Bert, albert models, etc., as the embodiments of the present disclosure are not limited in detail.
Then, the text representation may be mapped by the MLP layer into an m-dimensional space, where m is the number of existing labels in the label system, and normalized by softmax to obtain the probabilities P1, P2, …, Pm over the label dimensions.
Step S402: and carrying out text clustering on texts in the text set to be screened by using a text clustering algorithm to obtain a clustering result.
Text clustering may be performed on the texts in the text set to be screened using a text clustering algorithm, such as DBSCAN; the type of clustering algorithm is not specifically limited in this specification. Specifically, the encoder in the text classification model may be used to encode each text into a text representation, and the clustering algorithm may then be applied to those representations.
Step S403: and carrying out text screening on the text set to be screened according to the classification result and the clustering result to obtain a text candidate set, wherein the classification category of the text in the text candidate set does not belong to the existing label.
In other words, as shown in fig. 5, the text set to be screened may be filtered according to the classification result and the clustering result to obtain a text candidate set for manual labeling. In some embodiments, texts that the text classification model cannot identify (whose predicted label is empty), or texts whose classification results are highly ambiguous, may be screened out as the text candidate set. Specifically, the clustering result includes M clusters, M being an integer greater than 1. As shown in fig. 6, step S403 may include steps S601 to S603.
Step S601: and aiming at M clustering clusters, respectively acquiring entropy corresponding to the classification result of each text in each clustering cluster.
Entropy measures the degree of disorder in a system. In machine learning, entropy characterizes how spread out the distribution of a random variable is: the more uniform the distribution, the greater the entropy. For example, if the probability distribution of a sample's classification result is 25%, 25%, 25%, 25%, the classifier is maximally uncertain among the label dimensions, and the corresponding entropy is very large. Likewise, entropy is very large for outlier texts that the text classification model cannot predict. The higher the entropy, the less well the text is characterized under the existing label system, and the more likely it is to carry a new label.
The entropy can be calculated as:
H = -Σ_{i=1}^{m} p_i · log(p_i)
where m is the number of existing labels in the existing label system and p_i is the predicted probability of the i-th label.
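The entropy formula can be sketched directly; this is a plain illustration, not code from the specification:

```python
import math

def entropy(probs):
    """H = -sum(p_i * log(p_i)) over the existing-label probabilities.
    Terms with p_i == 0 contribute nothing, by the usual convention."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A one-hot distribution (the classifier is certain) gives entropy 0, while the uniform distribution over m labels gives the maximum value log(m).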
Step S602: and respectively obtaining the average entropy corresponding to each cluster.
Step S603: and carrying out text screening on the text set to be screened according to the average entropy corresponding to each cluster, and obtaining a text candidate set.
Specifically, for each of the M clusters, the entropy of each text's classification probability distribution may be computed, and the mean of those entropies taken as the cluster's average entropy.
Computing the average entropy of a cluster indicates how confident the predictions within it are: the lower the average entropy, the more confident the cluster's overall result; the higher the average entropy, the less confident the overall result, meaning the texts in the cluster are poorly characterized under the existing label system and are more likely to carry new labels. Thus, in some embodiments, N clusters whose average entropy is greater than a preset threshold may be selected from the M clusters, where N is an integer less than M, and the texts of those N clusters taken as the text candidate set.
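Steps S601 to S603 can be sketched as follows, assuming each cluster is given as a list of per-text probability distributions; the `threshold` value is an assumption to be tuned in practice:

```python
import math

def entropy(probs):
    # H = -sum(p_i * log(p_i)); zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)

def screen_clusters(clusters, threshold):
    """clusters: list of clusters, each a list of per-text probability
    distributions over the existing labels. Keep the clusters whose
    average per-text entropy exceeds the threshold -- i.e. the clusters
    the existing classifier is least confident about, which are the
    likeliest sources of new labels."""
    kept = []
    for dists in clusters:
        avg_h = sum(entropy(d) for d in dists) / len(dists)
        if avg_h > threshold:
            kept.append(dists)
    return kept
```

The texts of the kept clusters form the text candidate set handed to annotators.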
In the embodiments of this specification, entropy serves both as the measure of confidence in the model's prediction results and, when the classification and clustering results are combined, as the criterion for screening the samples to be labeled. This approach can filter out roughly 80% of the texts that already carry existing labels, with a roughly estimated recall of 85%; most of the texts that are not recalled are long-tail texts whose labels account for only a small proportion, so they are negligible.
Step S404: and carrying out semantic clustering on the text candidate set to obtain K clustering clusters, wherein K is an integer greater than 1.
Text clustering may be performed on the text candidate set using a text clustering algorithm, such as a DBSCAN algorithm, and it should be understood that the type of text clustering algorithm is not particularly limited in this specification. Specifically, an encoder in the text classification model may be used to encode the text to obtain a text representation, and then a text clustering algorithm may be used to cluster the text representation.
Step S405: and selecting texts to be marked from the K clusters respectively so as to be convenient for marking the texts to be marked and obtaining new labels.
Because the labels of most texts within a cluster are consistent, several texts can be randomly selected from each cluster for labeling, thereby obtaining new labels and expanding the existing annotated data. In some embodiments, training may continue from the encoder parameters of the original text classification model using both the new and the existing annotated texts (with the MLP parameters resized according to the new number of labels), which reduces the number of training iterations of the text classification model and lowers the training cost.
In some embodiments, the new label includes a user somatosensory label, which includes at least one of battery life (endurance), noise, and measurement accuracy; it should be understood that this specification does not specifically limit the content of the new label. For example, as shown in fig. 2, the label of the text "How long does one charge last?" may be "battery life", the label of "It is loud when in use" may be "noise", the label of "This measures inaccurately" may be "measurement accuracy", and so on.
When covering hundreds of categories, if every category required traversing all sampled texts for labeling, the sample size to be labeled would be enormous. According to the technical solutions provided by the embodiments of this specification, however, all categories can share the same text classification model. When a new category is added, screening the text set according to the classification and clustering results automatically surfaces semantically distinct texts that may require new labels, so each category only needs a few dozen texts to be labeled manually to form the new category's label system. This greatly reduces the workload and improves both the efficiency of building a new label system and the efficiency of model training. Experimental results show that the clustering is semantically accurate and its recall can reach 90%, so duplicate samples can be greatly reduced.
Example apparatus, electronic device, storage Medium, and software
One embodiment of the present specification also provides a tag discovery apparatus. As shown in fig. 7, the tag discovery apparatus 700 may include a classification module 701, a screening module 702, a first clustering module 703, and a selection module 704.
The classification module 701 is configured to perform text classification on the texts in the text set to be screened using the text classification model to obtain classification results.
The screening module 702 is configured to screen the text set according to the classification results to obtain a text candidate set, where the classification category of each text in the candidate set does not belong to any existing label.
The first clustering module 703 is configured to cluster the text candidate set into K clusters, where K is an integer greater than 1.
The selection module 704 is configured to select texts to be labeled from each of the K clusters, so that the selected texts can be labeled to obtain new labels.
According to the technical solutions provided by the embodiments of this specification, text classification is performed on the texts in a text set to be screened using a text classification model to obtain classification results; the text set is screened according to the classification results to obtain a text candidate set whose texts' classification categories do not belong to any existing label; the text candidate set is clustered into K clusters, where K is an integer greater than 1; and texts to be labeled are selected from each of the K clusters for labeling, thereby obtaining new labels.
In some embodiments, the tag discovery apparatus further includes a second clustering module 705, configured to perform text clustering on the texts in the text set to be screened by using a text clustering algorithm, to obtain a clustering result; the filtering module 702 is configured to perform text filtering on the text set to be filtered according to the classification result and the clustering result, and obtain a text candidate set.
In some embodiments, the clustering result includes M clustering clusters, where M is an integer greater than 1, and the screening module 702 is configured to obtain, for the M clustering clusters, entropy corresponding to a classification result of each text in each clustering cluster; respectively obtaining the average entropy corresponding to each cluster; and carrying out text screening on the text set to be screened according to the average entropy corresponding to each cluster, and obtaining a text candidate set.
In some embodiments, the screening module 702 is configured to select N clusters with average entropy greater than a preset threshold from M clusters, where N is an integer less than M; and taking the texts of the N clusters as a text candidate set.
In some embodiments, the tag discovery apparatus further includes a training module 706 for training the text classification model with labeling text, where the labeling text is obtained by labeling text to be labeled.
In some embodiments, the classification module 701 is configured to obtain a text representation of each text in the text set to be screened by using the text classification model, and to derive probabilities over a plurality of label dimensions from the text representation as the classification result.
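Mapping a text representation to one independent probability per label dimension is typically done with a sigmoid output layer (rather than a softmax, so several labels can be likely at once). A minimal sketch, with random placeholder weights and hypothetical label names not taken from this specification:

```python
# Minimal multi-label classification head: a text representation is
# projected to one probability per label dimension via a sigmoid.
# The weights are random placeholders; a real model would be trained.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MultiLabelHead:
    def __init__(self, repr_dim, labels):
        self.labels = labels
        self.W = rng.normal(size=(repr_dim, len(labels)))
        self.b = np.zeros(len(labels))

    def classify(self, representation):
        """representation: (repr_dim,) vector -> {label: probability}.
        Sigmoid keeps each label's probability independent of the others."""
        probs = sigmoid(representation @ self.W + self.b)
        return dict(zip(self.labels, probs))
```

A text whose probabilities are all low (or near-uniform) under such a head is exactly the kind of text the screening step routes into the candidate set.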
In some embodiments, the new label comprises a user-perceived experience label, which comprises at least one of battery life, noise, and measurement accuracy.
For the implementation of the functions and roles of each module of the apparatus shown in the embodiment of fig. 7, reference may be made to the implementation of the corresponding steps of the method in the embodiment of fig. 1; details are not repeated here.
Fig. 8 is a block diagram of an electronic device 800 according to an embodiment of the present disclosure.
Referring to fig. 8, the electronic device 800 includes a processing component 810, which further includes one or more processors, and memory resources represented by a memory 820 for storing instructions, such as application programs, executable by the processing component 810. The application program stored in the memory 820 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 810 is configured to execute the instructions to perform a tag discovery method including: performing text classification on texts in a text set to be screened by using a text classification model to obtain classification results; performing text screening on the text set to be screened according to the classification results to obtain a text candidate set, wherein the classification categories of the texts in the text candidate set do not belong to any existing label; performing text clustering on the text candidate set to obtain K clusters, where K is an integer greater than 1; and selecting texts to be annotated from the K clusters respectively, so that the texts to be annotated can be annotated to obtain new labels.
The electronic device 800 may also include a power component configured to perform power management of the electronic device 800, a wired or wireless network interface configured to connect the electronic device 800 to a network, and an input/output (I/O) interface. The electronic device 800 may operate based on an operating system stored in the memory 820, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The present specification embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a computer, causes the computer to perform the tag discovery method of any of the above embodiments.
The present description also provides a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the tag discovery method of any of the above embodiments.
It will be appreciated that the specific examples herein are intended only to help those skilled in the art better understand the embodiments of the present disclosure, and are not intended to limit the scope of the present disclosure.
It should be understood that, in the various embodiments of the present disclosure, the sequence numbers of the processes do not imply an order of execution; the order in which the processes are executed should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure.
It will be appreciated that the various embodiments described in this specification may be implemented either alone or in combination, and are not limited in this regard.
Unless defined otherwise, all technical and scientific terms used in the embodiments of this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this specification belongs. The terminology used in the description is for the purpose of describing particular embodiments only and is not intended to limit the scope of the description. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be appreciated that the processor of the embodiments of this specification may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of this specification. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of this specification may be embodied as being performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
It will be appreciated that the memory in the embodiments of this specification may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an erasable programmable ROM (erasable PROM, EPROM), an Electrically Erasable Programmable ROM (EEPROM), or a flash memory, among others. The volatile memory may be Random Access Memory (RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present specification.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and unit may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present specification may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of this specification, the part contributing to the prior art, or a part of the technical solutions may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this specification. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The foregoing is merely specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto; any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed herein shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A tag discovery method, comprising:
performing text classification on texts in a text set to be screened by using a text classification model to obtain a classification result;
performing text screening on the text set to be screened according to the classification result to obtain a text candidate set, wherein classification categories of texts in the text candidate set do not belong to any existing label;
performing text clustering on the text candidate set to obtain K clusters, wherein K is an integer greater than 1; and
selecting texts to be annotated from the K clusters respectively, so that the texts to be annotated can be annotated to obtain new labels.
2. The method as recited in claim 1, further comprising:
performing text clustering on the texts in the text set to be screened by using a text clustering algorithm to obtain a clustering result;
wherein the performing text screening on the text set to be screened according to the classification result to obtain a text candidate set comprises:
performing text screening on the text set to be screened according to the classification result and the clustering result to obtain the text candidate set.
3. The method according to claim 2, wherein the clustering result includes M clusters, M is an integer greater than 1, and wherein the text filtering the text set to be filtered according to the classification result and the clustering result, to obtain the text candidate set, includes:
obtaining, for each of the M clusters, the entropy corresponding to the classification result of each text in the cluster;
obtaining the average entropy corresponding to each cluster; and
performing text screening on the text set to be screened according to the average entropy corresponding to each cluster to obtain the text candidate set.
4. The method of claim 3, wherein the text filtering the text set to be filtered according to the average entropy corresponding to each cluster, to obtain the text candidate set, includes:
selecting, from the M clusters, N clusters whose average entropy is greater than a preset threshold, wherein N is an integer less than M; and
taking the texts of the N clusters as the text candidate set.
5. The method according to any one of claims 1 to 4, further comprising:
training the text classification model with annotated texts, wherein the annotated texts are obtained by annotating the texts to be annotated.
6. The method according to any one of claims 1 to 4, wherein performing text classification on the text in the text set to be screened by using the text classification model to obtain a classification result includes:
obtaining a text representation of each text in the text set to be screened by using the text classification model; and
obtaining probabilities over a plurality of label dimensions according to the text representation as the classification result.
7. The method of any one of claims 1 to 4, wherein the new label comprises a user-perceived experience label, the user-perceived experience label comprising at least one of battery life, noise, and measurement accuracy.
8. A tag discovery apparatus, comprising:
a classification module, configured to perform text classification on texts in a text set to be screened by using a text classification model to obtain a classification result;
a screening module, configured to perform text screening on the text set to be screened according to the classification result to obtain a text candidate set, wherein classification categories of texts in the text candidate set do not belong to any existing label;
a first clustering module, configured to perform text clustering on the text candidate set to obtain K clusters, wherein K is an integer greater than 1; and
a selecting module, configured to select texts to be annotated from the K clusters respectively, so that the texts to be annotated can be annotated to obtain new labels.
9. A computer readable storage medium storing a computer program for executing the method of any one of the preceding claims 1 to 7.
10. An electronic device, the electronic device comprising:
A processor;
A memory for storing the processor-executable instructions;
the processor being adapted to perform the method of any of the preceding claims 1 to 7.
CN202410161771.2A 2024-02-04 2024-02-04 Label discovery method, label discovery device, storage medium and electronic device Pending CN117992611A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410161771.2A CN117992611A (en) 2024-02-04 2024-02-04 Label discovery method, label discovery device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN117992611A true CN117992611A (en) 2024-05-07

Family

ID=90892180

Country Status (1)

Country Link
CN (1) CN117992611A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination