CN114266255A - Corpus classification method, apparatus, device and storage medium based on clustering model - Google Patents


Info

Publication number
CN114266255A
Authority
CN
China
Prior art keywords
corpus
information
vector
named entity
clustering
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210189341.2A
Other languages
Chinese (zh)
Other versions
CN114266255B (en
Inventor
邹倩霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen One Ledger Science And Technology Service Co ltd
Original Assignee
Shenzhen One Ledger Science And Technology Service Co ltd
Application filed by Shenzhen One Ledger Science And Technology Service Co ltd
Priority to CN202210189341.2A
Publication of CN114266255A
Application granted
Publication of CN114266255B
Legal status: Active

Abstract

The invention relates to the technical field of artificial intelligence, and discloses a corpus classification method, apparatus, device and storage medium based on a clustering model. The corpus classification method comprises the following steps: acquiring at least one corpus information; carrying out named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information; performing text vectorization processing on the corpus information to obtain a corpus vector; adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector, to obtain a sentence vector of the corpus vector; and inputting the sentence vectors into a preset clustering model, and performing a clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors. By clustering sentence vectors in which the entity vectors of the named entity words are highlighted, the invention achieves the technical effect of classifying the corpus information according to the target classification expectation of the development end.

Description

Corpus classification method, apparatus, device and storage medium based on clustering model
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a corpus classification method, a corpus classification device, corpus classification equipment and a storage medium based on a clustering model.
Background
Clustering is an unsupervised learning method that is applied in practice to tasks such as customer group classification and animal and plant classification. Clustering groups similar data into one cluster and assigns dissimilar data to different clusters according to the characteristics of the data. Clustering algorithms are also commonly used for text classification in the field of natural language processing (NLP): they can classify texts effectively without supervision and give a clear picture of the information categories expressed by the texts and of the central meaning of each category.
However, the inventor has realized that current clustering algorithms perform the clustering operation directly on the vectors corresponding to the corpus, which makes it difficult to cluster the corpus according to the classification expectation of the development end (for example, classifying the corpus according to a user's interest in a certain product). As a result, current clustering algorithms have difficulty outputting the classification result required by the development end.
Disclosure of Invention
The invention aims to provide a corpus classification method, apparatus, device and storage medium based on a clustering model, so as to solve the problem in the prior art that a corpus is difficult to cluster according to the classification expectation of the development end, and that the classification result required by the development end is therefore difficult to obtain.
In order to achieve the above object, the present invention provides a corpus classification method based on a clustering model, which comprises:
acquiring at least one corpus information, wherein the corpus information has at least one named entity information;
carrying out named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information;
performing text vectorization processing on the corpus information to obtain a corpus vector; adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;
and inputting the sentence vectors into a preset clustering model, and performing clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors.
In the foregoing solution, before the obtaining of the at least one corpus information, the method further includes:
receiving information to be classified, and judging the information type of the information to be classified;
if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus;
and if the information type is text information, storing the information to be classified as corpus information in the corpus.
In the foregoing solution, the obtaining at least one corpus information includes:
receiving user information sent by a user side; wherein, the user information refers to the identity information of the corpus information sender;
acquiring at least one corpus information corresponding to the user information from a preset corpus; the corpus is used for storing corpus information associated with user information.
In the foregoing solution, the performing named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information includes:
obtaining a corpus text corresponding to the corpus information, and performing word segmentation on the corpus text to obtain at least one corpus word;
comparing the corpus words with a preset named entity dictionary, and setting the corpus words belonging to the named entity dictionary as the named entity words; wherein the named entity dictionary has at least one named entity therein.
In the foregoing solution, the adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector includes:
performing term frequency-inverse document frequency (TF-IDF) calculation on the corpus vector to obtain TF-IDF values reflecting the importance of each corpus word in the corpus information, and adjusting the corpus vector, with the TF-IDF values as the weights of the word vectors corresponding to the corpus words, to obtain a semantic vector;
modifying the word vector corresponding to the named entity in the semantic vector through a preset lifting coefficient to obtain an entity vector, thereby converting the semantic vector into the sentence vector; or
setting the word vector corresponding to the named entity as an entity vector, and modifying the other word vectors except the entity vector in the semantic vector through a preset reduction coefficient, thereby converting the semantic vector into the sentence vector.
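The weighting and adjustment steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the smoothed IDF formula, the weighted sum used to collapse word vectors into one sentence vector, and the names `tf_idf_weights`, `sentence_vector`, `lift_coeff` and `reduce_coeff` are all assumptions, since the text does not fix these details.

```python
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus_tokens):
    """Return {word: tf-idf} for one tokenized document against a tokenized corpus."""
    n_docs = len(corpus_tokens)
    weights = {}
    for word, count in Counter(doc_tokens).items():
        tf = count / len(doc_tokens)
        df = sum(1 for doc in corpus_tokens if word in doc)
        # Smoothed IDF (an assumption); it stays positive for every word.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = tf * idf
    return weights


def sentence_vector(doc_tokens, corpus_tokens, word_vectors, entity_words,
                    lift_coeff=1.0, reduce_coeff=1.0):
    """Weighted sum of word vectors with named entity words highlighted.

    Use lift_coeff > 1 to raise entity vectors, or reduce_coeff < 1 to
    lower every other word vector; the text presents these as alternatives.
    """
    weights = tf_idf_weights(doc_tokens, corpus_tokens)
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word, weight in weights.items():
        coeff = lift_coeff if word in entity_words else reduce_coeff
        for i, component in enumerate(word_vectors[word]):
            total[i] += weight * coeff * component
    return total


# Toy data: two-dimensional word vectors chosen only for illustration.
toy_vectors = {"bank": [1.0, 0.0], "good": [0.0, 1.0]}
toy_corpus = [["bank", "good"], ["good"]]
plain = sentence_vector(["bank", "good"], toy_corpus, toy_vectors, {"bank"})
boosted = sentence_vector(["bank", "good"], toy_corpus, toy_vectors, {"bank"},
                          lift_coeff=3.0)
```

In the toy data, lifting scales only the dimension carried by the entity word, so two sentences that mention the same entity move closer together relative to sentences that do not.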
In the foregoing solution, when the clustering model is a K-MEANS clustering model, the clustering operation is performed on the entered sentence vectors through the clustering model to classify the corpus information corresponding to the entered sentence vectors, including:
constructing an object representing the corpus information in the clustering model according to the sentence vector, and dividing at least one group in the clustering model; wherein the group characterizes the class to which the object belongs;
operating the clustering model to perform k-means clustering operation on the objects in each group to obtain clusters and clustering centers of the clusters; wherein the cluster is a set constructed by at least one object belonging to the group;
and extracting the central corpus information of the object corresponding to the clustering center, extracting the named entity of the central corpus information, and using the named entity as the category information of the corpus information corresponding to all the objects of the cluster where the clustering center is located so as to realize the classification of the corpus information corresponding to the sentence vector.
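A minimal sketch of this K-means variant is given below, under stated assumptions: plain Lloyd iterations seeded with the first `k` points, Euclidean distance, and the "object corresponding to the clustering center" taken to be the member nearest to the computed center. The function names are hypothetical, and `k` is assumed not to exceed the number of points.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centers (an assumption)."""
    centers = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


def label_by_kmeans(points, entities, k):
    """Name each cluster after the named entity of its most central object."""
    assignment, centers = kmeans(points, k)
    names = {}
    for c in range(k):
        member_ids = [i for i, a in enumerate(assignment) if a == c]
        if not member_ids:
            continue
        central = min(member_ids, key=lambda i: math.dist(points[i], centers[c]))
        names[c] = entities[central]
    return [names[a] for a in assignment]


# Toy sentence vectors and the named entity extracted from each corpus item.
toy_points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0]]
toy_entities = ["fund", "bank", "insurance", "bank", "fund"]
categories = label_by_kmeans(toy_points, toy_entities, k=2)
```

Every corpus item inherits the entity of its cluster's central item, so the third toy item ends up in the "fund" category even though its own entity differs.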
In the foregoing solution, when the clustering model is a density clustering model, the clustering operation is performed on the entered sentence vectors through the clustering model to classify the corpus information corresponding to the entered sentence vectors, including:
constructing clustering points representing the corpus information in the clustering model according to the sentence vectors, and operating the clustering model to perform density clustering operation on the clustering points to obtain at least one cluster and a clustering center thereof; wherein the cluster is a set of at least one of the cluster points;
and extracting the central corpus information of the clustering points corresponding to the clustering centers, extracting the named entities of the central corpus information, and using the named entities as the category information of the corpus information corresponding to all the clustering points of the cluster where the clustering centers are located so as to realize the classification of the corpus information corresponding to the sentence vectors.
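The density-based variant can be sketched in the same way. The text does not name a specific density clustering algorithm, so DBSCAN is used here as an assumed stand-in; because DBSCAN defines no cluster center of its own, the member nearest to the cluster mean is taken as the center, which is also an assumption. Points that belong to no cluster receive no category.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep expanding
    return labels


def label_by_density(points, entities, eps, min_pts):
    """Name each cluster after the entity of the member nearest to the cluster mean."""
    labels = dbscan(points, eps, min_pts)
    names = {}
    for c in set(labels) - {-1}:
        ids = [i for i, label in enumerate(labels) if label == c]
        mean = [sum(col) / len(ids) for col in zip(*(points[i] for i in ids))]
        central = min(ids, key=lambda i: math.dist(points[i], mean))
        names[c] = entities[central]
    return [names.get(label) for label in labels]


toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]]
toy_entities = ["fund", "insurance", "loan", "bank", "bank", "card", "misc"]
categories = label_by_density(toy_points, toy_entities, eps=1.5, min_pts=3)
```

Unlike K-means, this variant does not need the number of clusters in advance, at the cost of choosing `eps` and `min_pts`.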
In order to achieve the above object, the present invention further provides a corpus classifying device based on a clustering model, including:
the information input module is used for acquiring at least one corpus information, wherein the corpus information is provided with at least one named entity information;
the entity identification module is used for carrying out named entity identification on the corpus information to obtain a named entity word corresponding to the named entity information;
the vector conversion adjusting module is used for performing text vectorization on the corpus information to obtain a corpus vector, adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;
and the corpus classification module is used for inputting the sentence vectors into a preset clustering model, and performing clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors.
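One way to picture how the four modules fit together is a small composition of callables, sketched below. The class name `CorpusClassifier`, the field names and the callable signatures are hypothetical; the text only names the modules and their responsibilities.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class CorpusClassifier:
    # Information input module: user id -> corpus texts.
    load_corpus: Callable[[str], List[str]]
    # Entity recognition module: text -> named entity words.
    find_entities: Callable[[str], List[str]]
    # Vector conversion and adjustment module: text + entities -> sentence vector.
    to_sentence_vector: Callable[[str, List[str]], Vector]
    # Corpus classification module: sentence vectors -> cluster ids.
    cluster: Callable[[List[Vector]], List[int]]

    def classify(self, user_id: str) -> List[Tuple[str, int]]:
        texts = self.load_corpus(user_id)
        vectors = [self.to_sentence_vector(t, self.find_entities(t)) for t in texts]
        return list(zip(texts, self.cluster(vectors)))


# Toy stand-ins for each module, for illustration only.
toy = CorpusClassifier(
    load_corpus=lambda uid: {"u1": ["buy fund", "fund fee", "bank loan"]}[uid],
    find_entities=lambda t: [w for w in t.split() if w in {"fund", "bank"}],
    to_sentence_vector=lambda t, ents: [
        1.0 if "fund" in ents else 0.0,
        1.0 if "bank" in ents else 0.0,
    ],
    cluster=lambda vectors: [0 if v[0] > 0.5 else 1 for v in vectors],
)
result = toy.classify("u1")
```

Keeping each module behind a plain callable makes it easy to swap the clustering model (K-means or density clustering) without touching the other stages.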
In order to achieve the above object, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor of the computer device implements the steps of the clustering model-based corpus classification method when executing the computer program.
In order to achieve the above object, the present invention further provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program stored on the computer-readable storage medium, when being executed by a processor, implements the steps of the clustering model-based corpus classification method.
According to the corpus classification method, apparatus, device and storage medium based on a clustering model provided by the invention, entity recognition is performed on the corpus information to obtain named entity words reflecting the objects the corpus information describes, so that subsequent clustering follows the objects the user is concerned with, which facilitates corpus classification based on the target classification expectation of the development end. Text vectorization is performed on the corpus information to obtain corpus vectors, that is, vector data that the clustering model can recognize. The sentence vector is then obtained either by adjusting the entity vectors corresponding to the named entity words in the corpus vector to raise their weights, or by lowering the weights of the other word vectors in the corpus vector; in both cases the named entity words are highlighted in the sentence vector, so that the subsequent clustering model can classify the corpus information accurately by named entity word, and information in the corpus information that is irrelevant to the classification expectation does not interfere with the expected classification result. Because the clustering model classifies sentence vectors in which the entity vectors of the named entity words are highlighted, and the named entity words represent the target classification expectation of the development end, the corpus information is classified according to the target classification expectation of the development end.
Drawings
FIG. 1 is a flowchart of a first embodiment of a corpus classification method based on a clustering model according to the present invention;
FIG. 2 is a schematic diagram of an environmental application of a corpus classification method based on a clustering model according to a second embodiment of the present invention;
FIG. 3 is a flowchart illustrating a concrete method of a corpus classification method based on a clustering model according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a third embodiment of the apparatus for classifying corpus based on clustering models according to the present invention;
FIG. 5 is a schematic diagram of a hardware structure of a computer device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a corpus classification method, a corpus classification device and a storage medium based on a clustering model, which are suitable for the technical field of artificial intelligence and are based on an information input module, an entity recognition module, a vector conversion adjustment module and a corpus classification module. According to the invention, the corpus information is obtained; performing entity recognition on the corpus information to obtain named entity words reflecting the corpus information description objects; performing text vectorization on the corpus information to obtain corpus vectors, and adjusting entity vectors corresponding to the named entity words in the corpus vectors to obtain sentence vectors; and inputting the sentence vectors into a preset clustering model, and performing clustering operation on the received sentence vectors through the clustering model to classify the corpus information corresponding to the sentence vectors.
It should be noted that a named entity is a person name, an organization name, a place name, or any other entity identified by a name. In a broader sense, entities also include numbers, dates, currencies, addresses, and the like.
Embodiment one:
Referring to FIG. 1, the corpus classification method based on a clustering model according to this embodiment includes:
S102: acquiring at least one corpus information, wherein the corpus information has at least one named entity information;
S103: carrying out named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information;
S104: performing text vectorization processing on the corpus information to obtain a corpus vector; adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;
S105: and inputting the sentence vectors into a preset clustering model, and performing clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors.
In an exemplary embodiment, user information sent by the development end is received, and the corpus information corresponding to the user information is acquired from the corpus. The corpus information is thus obtained in a targeted manner, so that a user portrait for the user information can subsequently be constructed from the classification and clustering of the corpus information.
Performing entity recognition on the corpus information yields named entity words that reflect the objects described by the corpus information, so that subsequent clustering follows the objects the user is concerned with; this facilitates corpus classification based on the target classification expectation of the development end. The named entity words obtained by the entity recognition can be set according to the requirements of the development end.
Text vectorization is performed on the corpus information to obtain corpus vectors, that is, vector data that the clustering model can recognize. The sentence vector is obtained either by adjusting the entity vectors corresponding to the named entity words in the corpus vector to raise their weights, or by lowering the weights of the other word vectors in the corpus vector. In both cases the named entity words are highlighted in the sentence vector, so that the subsequent clustering model can classify the corpus information accurately by named entity word, and information in the corpus information that is irrelevant to the classification expectation does not interfere with the expected classification result.
Because the clustering model classifies sentence vectors in which the entity vectors of the named entity words are highlighted, and the named entity words represent the target classification expectation of the development end, the corpus information is classified according to the target classification expectation of the development end.
Embodiment two:
This embodiment is a specific application scenario of the first embodiment, through which the method provided by the present invention can be explained more clearly and specifically.
Next, the method provided in this embodiment is specifically described by taking an example in which text vectorization processing is performed on corpus information in a server running a corpus classification method based on a clustering model to obtain corpus vectors, entity vectors corresponding to named entity words in the corpus vectors are adjusted to obtain sentence vectors, and then the corpus information is classified according to the sentence vectors. It should be noted that the present embodiment is only exemplary, and does not limit the protection scope of the embodiments of the present invention.
Fig. 2 is a schematic diagram illustrating an environmental application of the corpus classification method based on the clustering model according to the second embodiment of the present application.
In an exemplary embodiment, the server 2 in which the corpus classification method based on the clustering model is located is respectively connected with the development end 3 and the client 4 through a network; the server 2 may provide services through one or more networks, which may include various network devices, such as routers, switches, multiplexers, hubs, modems, bridges, repeaters, firewalls, proxy devices, and/or the like. The network may include physical links, such as coaxial cable links, twisted pair cable links, fiber optic links, combinations thereof, and/or the like. The network may include wireless links, such as cellular links, satellite links, Wi-Fi links, and/or the like; the development end 3 and the client end 4 can be respectively a computer device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like.
Fig. 3 is a flowchart of a concrete method for clustering-model-based corpus classification according to an embodiment of the present invention, and the method specifically includes steps S201 to S205.
S201: receiving information to be classified, and judging the information type of the information to be classified;
if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus;
and if the information type is text information, the information to be classified is used as corpus information and is stored in the corpus.
Generally, the information to be classified sent by a user side may be text information produced by character input or voice information produced by voice input. The present application therefore converts voice information into text-type conversion information before storing it in the corpus, and stores text information in the corpus directly. Both voice and text corpora can thus be classified, which expands the application range of the present application.
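The type check described above can be sketched as a small dispatch function. The `Message` type, the `kind` values and the `transcribe` callable are hypothetical; in particular, `transcribe` stands in for whatever speech recognition service is used, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Message:
    kind: str                    # "text" or "voice" (assumed type tags)
    payload: Union[str, bytes]   # text content or raw audio


def to_corpus_text(message: Message, transcribe: Callable[[bytes], str]) -> str:
    """Return the text to store in the corpus for one piece of information."""
    if message.kind == "text":
        return message.payload
    if message.kind == "voice":
        # Voice information is converted to text before it is stored.
        return transcribe(message.payload)
    raise ValueError(f"unsupported information type: {message.kind}")


def fake_transcribe(audio: bytes) -> str:
    """Stand-in recognizer that treats the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


stored = [
    to_corpus_text(Message("text", "what is the fund fee"), fake_transcribe),
    to_corpus_text(Message("voice", b"open a bank account"), fake_transcribe),
]
```

Rejecting unknown types explicitly keeps unsupported inputs out of the corpus instead of storing them unconverted.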
Further, user information corresponding to the information to be classified is obtained, wherein the user information refers to identity information of a sender of the information to be classified;
and inputting the user information into the corpus, and associating the user information with the corpus information corresponding to the information to be classified.
In this embodiment, the user information includes a code of the terminal used by the sender and/or a phone number of the terminal, together with the sender's registered account information, and/or ID information, and/or identification number.
The user information is associated with the corpus information either by taking the user information as a label of the information to be classified, or by constructing a mapping table reflecting the mapping relation between the user information and the corpus information.
The user information and the corpus information are stored in the corpus as key-value pairs. Specifically, the corpus information sharing the same user information is gathered into a corpus set, and a key-value pair is constructed in the corpus with the user information as the key and the corpus set as the value. When the corpus is classified later, the corpus information sent by a sender can then be retrieved quickly by sender, which improves the efficiency of information acquisition.
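A minimal in-memory sketch of this key-value layout follows. The `UserInfo` fields and the `Corpus` class are hypothetical names; a real deployment would presumably use a persistent key-value store, which the text does not specify.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional


class UserInfo(NamedTuple):
    """Identity of a sender; combining account and terminal keeps the key unique."""
    account_id: str
    terminal_code: Optional[str] = None


class Corpus:
    """Key-value store: user information is the key, the corpus set is the value."""

    def __init__(self) -> None:
        self._store = defaultdict(list)

    def add(self, user: UserInfo, text: str) -> None:
        self._store[user].append(text)

    def get(self, user: UserInfo) -> List[str]:
        # Return a copy so that callers cannot mutate the stored corpus set.
        return list(self._store.get(user, []))


corpus = Corpus()
alice = UserInfo("alice", "T-001")
corpus.add(alice, "what is the fund fee")
corpus.add(alice, "open a bank account")
corpus.add(UserInfo("bob", "T-002"), "loan rates")
```

Looking up one sender is then a single dictionary access, independent of how many other senders are stored.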
In fig. 3, the S201 is shown by the following labels:
S201-1: receiving information to be classified, and judging the information type of the information to be classified;
S201-2: if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus;
S201-3: and if the information type is text information, storing the information to be classified as corpus information in the corpus.
S202: acquiring at least one corpus information, wherein the corpus information has at least one named entity information.
In this step, user information sent by the development end is received, and the corpus information corresponding to the user information is acquired from the corpus. The corpus information is thus obtained in a targeted manner, so that a user portrait for the user information can subsequently be constructed from the classification and clustering of the corpus information.
In a preferred embodiment, the obtaining the corpus information includes:
S21: receiving user information sent by a user side, wherein the user information refers to the identity information of the sender of the corpus information;
In this step, the user information includes a code of the terminal used by the sender and/or a phone number of the terminal, together with the sender's registered account information, and/or ID information, and/or identification number.
The method and the device can simultaneously determine the identity of the sender and the identity of the terminal commonly used by the sender, so that the uniqueness of user information is ensured, and the problem that the finally obtained user portrait is inaccurate due to the fact that the sender uses other terminals to send the corpus information or other people use the terminal of the sender to send the corpus information in the follow-up process is avoided.
S22: acquiring at least one corpus information corresponding to the user information from a preset corpus; the corpus is used for storing corpus information associated with user information.
In this step, the user information is associated with the corpus information either by taking the user information as a label of the information to be classified, or by constructing a mapping table reflecting the mapping relation between the user information and the corpus information.
The user information and the corpus information are stored in the corpus as key-value pairs. Specifically, the corpus information sharing the same user information is gathered into a corpus set, and a key-value pair is constructed in the corpus with the user information as the key and the corpus set as the value. When the corpus is classified later, the corpus information sent by a sender can then be retrieved quickly by sender, which improves the efficiency of information acquisition.
S203: and carrying out named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information.
In order to identify the objects the user is concerned with in the corpus information, entity recognition is performed on the corpus information to obtain named entity words that reflect the objects it describes. Subsequent clustering then follows the objects the user is concerned with, which facilitates corpus classification based on the target classification expectation of the development end. The named entity words obtained by the entity recognition can be set according to the requirements of the development end.
In a preferred embodiment, performing entity recognition on corpus information to obtain named entity words reflecting semantics of the corpus information, includes:
S31: obtaining a corpus text corresponding to the corpus information, and performing word segmentation on the corpus text to obtain at least one corpus word.
In this step, the corpus information is computer data stored in the corpus as a message or in machine language/assembly language; the corresponding corpus text is obtained so that a word segmentation tool can recognize and segment it. The corpus text is segmented by a word segmentation tool, namely jieba, and/or THULAC, and/or SnowNLP, and/or pynlpir, and/or CoreNLP, and/or pyLTP, to obtain the corpus words.
jieba is a popular open source Chinese word segmentation package that is fast, accurate and extensible, and currently mainly supports Python.
THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit that provides Chinese word segmentation and part-of-speech tagging. It is capable, accurate and fast.
SnowNLP is a class library written in Python that can be used to process Chinese text.
pynlpir is a Chinese word segmentation system package developed by the Chinese Academy of Sciences for segmenting Chinese text.
CoreNLP is a natural language processing kit. It integrates many very practical functions including word segmentation, part-of-speech tagging, syntactic analysis, and so on.
pyLTP is Python encapsulation of LTP, providing functions of word segmentation, part of speech tagging, named entity recognition, dependency syntactic analysis, and semantic role tagging.
S32: comparing the corpus words with a preset named entity dictionary, and setting the corpus words belonging to the named entity dictionary as the named entity words; wherein the named entity dictionary has at least one named entity word therein.
In this step, the named entity dictionary includes a general dictionary and a user-defined dictionary, and is installed on the server that runs the clustering-model-based corpus classification method;
the general dictionary records the named entities of the server's application scenario, including company and organization names (a certain bank, a certain insurance company), financial product names (No. 3 product issued every day of the gold), and the like.
The user-defined dictionary records the abbreviated names and/or aliases of the named entities in the general dictionary. For example: named entity: product No. 3 is issued every day, abbreviated name: Jinli No. 3; named entity: Bank of China, abbreviated name: "Zhonghang" (the short form of the bank's Chinese name), and so on.
In this embodiment, the named entity dictionary may be set according to the target classification expectation of the development end. If the development end needs to classify by the products the user is interested in, an entity dictionary containing the named entity words corresponding to those products is constructed; if the development end needs to classify by the enterprises the user is interested in, an entity dictionary containing the named entity words corresponding to those enterprises is constructed.
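The dictionary comparison of step S32 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the dictionary entries, the function name `find_named_entities`, and the choice to map an abbreviated name back to its full named entity are all assumptions made for the example.

```python
# Hypothetical dictionaries; the entries are placeholders, not real data.
GENERAL_DICT = {"Product A", "Product B", "Bank of China"}
# User-defined dictionary: abbreviated name or alias -> full named entity.
USER_DICT = {"A": "Product A", "BOC": "Bank of China"}


def find_named_entities(corpus_words):
    """Return the named entity for every corpus word found in a dictionary."""
    entities = []
    for word in corpus_words:
        if word in GENERAL_DICT:
            entities.append(word)
        elif word in USER_DICT:
            # Assumption: an alias is resolved to its full named entity.
            entities.append(USER_DICT[word])
    return entities


words = ["BOC", "Product A", "how much money"]
print(find_named_entities(words))
```

Resolving aliases to one canonical name means that corpus information using different spellings of the same product would later be treated as the same named entity; whether to do so is a design choice that the embodiment leaves open.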
S204: performing text vectorization processing on the corpus information to obtain a corpus vector; and adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector.
So that the clustering model classifies according to the target classification expectation of the development end, and to prevent information in the corpus information that is irrelevant to that expectation from interfering with the classification result, this step first performs text vectorization processing on the corpus information to obtain corpus vectors, that is, vector data the clustering model can recognize. It then obtains sentence vectors in which the named entity words are highlighted, either by adjusting the entity vectors corresponding to the named entity words in the corpus vectors to increase their weights, or by reducing the weights of the word vectors other than the entity vectors. A subsequent clustering model can then accurately classify the corpus information by named entity word on the basis of the sentence vectors.
In this embodiment, the corpus information is subjected to text vectorization processing to obtain a corpus vector, where the corpus vector includes word vectors corresponding to each corpus word in the corpus information, and the text vectorization processing is a process of representing a text as a series of vectors capable of expressing text semantics to serve as input information of a clustering model.
Further, a vectorization tool based on word2vec, and/or NNLM, and/or C&W is used to perform text vectorization processing on the corpus information to obtain the corpus vectors.
Here, word2vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2vec is a text vectorization method based on the bag-of-words (Bag of Words) model that uses the word as its processing unit.
NNLM is a Neural Network Language Model; it differs from conventional methods in that the n-gram conditional probabilities are estimated directly through a neural network structure. Because the NNLM model uses low-dimensional dense word vectors to express the context, it alleviates problems such as data sparseness and semantic gaps caused by the bag-of-words model.
C&W (context & word, that is, context and target word) uses word vectors to complete NLP tasks such as part-of-speech tagging, named entity recognition, phrase recognition, and semantic role tagging.
In a preferred embodiment, the performing text vectorization processing on the corpus information to obtain a corpus vector, and modifying an entity vector corresponding to the named entity word in the corpus vector to obtain a sentence vector includes:
S41: performing word frequency inverse document (TF-IDF) calculation on the corpus vectors to obtain word frequency inverse document values reflecting the importance of each corpus word in the corpus information, and adjusting the corpus vectors, with the word frequency inverse document values as the weights of the word vectors corresponding to the corpus words, to obtain semantic vectors.
In this step, the word frequency inverse document calculation is performed with the TF-IDF algorithm, a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus.
The word vector corresponding to each corpus word in the corpus vector is adjusted by its word frequency inverse document value, raising the vector values of important corpus words and lowering those of unimportant corpus words, to obtain a semantic vector that highlights the semantics of the corpus information.
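The TF-IDF weighting described in this step can be sketched as below. It is a minimal sketch under stated assumptions: the function name `tf_idf` and the toy four-sentence corpus are hypothetical, TF is taken as the share of a word among the words of its sentence, and IDF is taken as the natural logarithm of the number of sentences divided by the number of sentences containing the word, which matches the ln-based arithmetic used in the worked example of this embodiment.

```python
import math


def tf_idf(documents):
    """Compute a TF-IDF weight for each word of each segmented document.

    TF  = occurrences of the word in the document / words in the document
    IDF = ln(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    weights = []
    for doc in documents:
        doc_weights = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            doc_weights[word] = tf * idf
        weights.append(doc_weights)
    return weights


# Hypothetical toy corpus, already segmented into corpus words.
docs = [
    ["product A", "price"],
    ["product A", "guarantee"],
    ["product A", "interest"],
    ["product B", "price"],
]
weights = tf_idf(docs)
print(weights[0]["product A"])  # 0.5 * ln(4/3)
print(weights[0]["price"])      # 0.5 * ln(4/2)
```

A word that occurs in most sentences, such as "product A" here, receives a small weight, which is why the later lifting or reduction step is needed to make the named entity dominate the sentence vector.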
S42: modifying, by a preset lifting coefficient, the word vector corresponding to the named entity word in the semantic vector to obtain an entity vector, so that the semantic vector is converted into the sentence vector; or
S43: setting the word vector corresponding to the named entity word as an entity vector, and modifying, by a preset reduction coefficient, the word vectors other than the entity vector in the semantic vector, so that the semantic vector is converted into the sentence vector.
In these steps, the lifting coefficient is a preset parameter for raising the element values of a word vector: multiplying the entity vector by the lifting coefficient raises the element values of the entity vector. The reduction coefficient is a preset parameter for lowering the element values of a word vector: multiplying the other word vectors by the reduction coefficient lowers their element values.
Illustratively, the corpus of user M includes the following corpus information. Corpus information 1: "how much money is product A", for which word segmentation yields the corpus words: product A / how much money. Corpus information 2: "product A, guarantee", for which word segmentation yields: product A / guarantee. Corpus information 3: "how much interest does product A pay", for which word segmentation yields: product A / how much interest. Corpus information 4: "how much money is product B", for which word segmentation yields: product B / how much money. The named entity words "product A" and "product B" are extracted by the above method.
By the TF-IDF algorithm, the TF value of "product A" in corpus information 1 is TF1-1 = 0.5 and its IDF value is IDF1-1 = ln(4/3) = 0.29, so the TF-IDF value of "product A" in corpus information 1 is TF-IDF1-1 = 0.5 × 0.29 = 0.145. The TF value of "how much money" in corpus information 1 is TF1-2 = 0.5 and its IDF value is IDF1-2 = ln(4/2) = 0.693, so the TF-IDF value of "how much money" in corpus information 1 is TF-IDF1-2 = 0.5 × 0.693 = 0.3465.
Assuming that the word vectors of corpus information 1 are (0, 1) and (1, 0), so that its corpus vector is (0, 1, 1, 0), each word vector is multiplied by its TF-IDF value and the results are concatenated, giving semantic vector 1 of corpus information 1 as (0, 0.145, 0.3465, 0).
Suppose that the word vectors of corpus information 2 are (0, 1) and (2, 0), those of corpus information 3 are (0, 1) and (3, 0), and those of corpus information 4 are (0, 3) and (1, 0). The semantic vectors of corpus information 2 to 4 are then as follows: semantic vector 2 of corpus information 2 is (0, 0.145, 1.386, 0), semantic vector 3 of corpus information 3 is (0, 0.145, 2.079, 0), and semantic vector 4 of corpus information 4 is (0, 2.079, 0.3465, 0).
Assuming that the lifting coefficient is 10, the resulting sentence vector 1 is (0, 1.45, 0.3465, 0), sentence vector 2 is (0, 1.45, 1.386, 0), sentence vector 3 is (0, 1.45, 2.079, 0), and sentence vector 4 is (0, 20.79, 0.3465, 0);
assuming that the reduction coefficient is 0.1, the resulting sentence vector 1 is (0, 0.145, 0.03465, 0), sentence vector 2 is (0, 0.145, 0.1386, 0), sentence vector 3 is (0, 0.145, 0.2079, 0), and sentence vector 4 is (0, 2.079, 0.03465, 0).
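Steps S42 and S43 applied to the worked example can be sketched as follows. The function name `to_sentence_vector`, its `mode` parameter, and the use of element positions to mark the entity vector are assumptions made for illustration. The third element of semantic vector 1 is taken as 0.5 × 0.693 = 0.3465, the TF-IDF value computed above for "how much money".

```python
def to_sentence_vector(semantic_vector, entity_positions, coefficient, mode):
    """Turn a semantic vector into a sentence vector.

    mode="lift":   multiply the entity elements by the coefficient (S42).
    mode="reduce": multiply all other elements by the coefficient (S43).
    """
    if mode not in ("lift", "reduce"):
        raise ValueError("mode must be 'lift' or 'reduce'")
    result = []
    for i, value in enumerate(semantic_vector):
        is_entity = i in entity_positions
        scale = (mode == "lift" and is_entity) or (mode == "reduce" and not is_entity)
        result.append(value * coefficient if scale else value)
    return result


# Semantic vector 1: the first two elements belong to the named entity
# word "product A", the last two to "how much money".
semantic_1 = [0, 0.145, 0.3465, 0]
entity_positions = {0, 1}

lifted = to_sentence_vector(semantic_1, entity_positions, 10, "lift")
reduced = to_sentence_vector(semantic_1, entity_positions, 0.1, "reduce")
print([round(v, 5) for v in lifted])
print([round(v, 5) for v in reduced])
```

Both modes change only the ratio between the entity elements and the remaining elements, so either one makes the distance between two sentence vectors depend mainly on their named entity words.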
S205: and inputting the sentence vectors into a preset clustering model, and performing clustering operation on the received sentence vectors through the clustering model to classify the corpus information corresponding to the sentence vectors.
In this step, the clustering model may be a K-MEANS clustering model or a density clustering model. The K-MEANS clustering model runs the k-means clustering algorithm, an iteratively solved cluster analysis algorithm; the density clustering model runs the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, a representative density-based clustering algorithm.
In this step, the clustering model classifies sentence vectors in which the entity vectors corresponding to the named entity words are highlighted. Because the named entity words represent the target classification expectation of the development end, this achieves the technical effect of classifying the corpus information according to that expectation.
In a preferred embodiment, when the clustering model is a K-MEANS clustering model, the performing a clustering operation on the sentence vectors received by the clustering model by using the clustering model to classify the corpus information corresponding to the sentence vectors includes:
S51: constructing an object representing the corpus information in the clustering model according to the sentence vector, and dividing at least one group in the clustering model; wherein the group characterizes the class to which the object belongs;
S52: operating the clustering model to perform k-means clustering operation on the objects in each group to obtain clusters and clustering centers of the clusters; wherein the cluster is a set constructed by at least one object belonging to the group;
S53: extracting the central corpus information of the object corresponding to the clustering center, extracting named entity words of the central corpus information, and using the named entity words as the category information of the corpus information corresponding to all the objects of the cluster where the clustering center is located so as to realize the classification of the corpus information corresponding to the sentence vectors.
Specifically, the k-means clustering operation includes: dividing the data into K groups in advance through the clustering model, randomly selecting K objects as the initial cluster centers, calculating the distance between each object and each cluster center, and assigning each object to its nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. Each time a sample is assigned, the cluster center of the cluster is recalculated from the objects currently in the cluster. This process is repeated until a termination condition is met. The termination condition may be that no objects (or fewer than a minimum number) are reassigned to different clusters, that no cluster centers (or fewer than a minimum number) change, or that the sum of squared errors reaches a local minimum.
For example, if an enterprise has developed 10 products in total, 10 groups can be constructed and the corpus information received by the enterprise clustered into them. Because the word vectors corresponding to the named entity words in the sentence vectors are adjusted in the present application, the enterprise can directly learn which product each piece of corpus information is directed at, which helps the enterprise learn users' market feedback on each product in a timely manner.
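Steps S51 to S53 can be sketched in plain Python as below. This is a simplified illustration, not the embodiment's implementation: the function names `k_means` and `label_clusters`, the fixed random seed, and the four sample vectors (modeled on the lifted sentence vectors of the worked example) are assumptions. The sketch also recomputes all cluster centers once per pass over the objects, the common batch form of k-means, rather than after every single assignment.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Plain k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # K random objects
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign each object to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # no object was reassigned: stop
            break
        labels = new_labels
        # Recompute each center as the mean of the objects assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def label_clusters(points, entities, labels, centers):
    """Name each cluster after the entity of the object nearest its center."""
    names = {}
    for c, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            nearest = min(members, key=lambda i: math.dist(points[i], center))
            names[c] = entities[nearest]
    return names


# Illustrative sentence vectors and the named entity word of each one.
sentence_vectors = [
    (0, 1.45, 0.3465, 0),
    (0, 1.45, 1.386, 0),
    (0, 1.45, 2.079, 0),
    (0, 20.79, 0.3465, 0),
]
entities = ["product A", "product A", "product A", "product B"]

labels, centers = k_means(sentence_vectors, k=2)
names = label_clusters(sentence_vectors, entities, labels, centers)
print([names[lab] for lab in labels])
```

Because the entity elements were lifted, the three "product A" vectors lie far from the "product B" vector, so the two clusters separate by product even though the remaining words differ.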
In a preferred embodiment, when the clustering model is a density clustering model, the clustering operation performed on the sentence vectors received by the clustering model through the clustering model to classify the corpus information corresponding to the sentence vectors includes:
S54: constructing clustering points representing the corpus information in the clustering model according to the sentence vectors, and operating the clustering model to perform density clustering operation on the clustering points to obtain at least one cluster and a clustering center thereof; wherein the cluster is a set of at least one of the cluster points;
S55: extracting the central corpus information of the clustering points corresponding to the clustering centers, extracting named entity words of the central corpus information, and using the named entity words as category information of the corpus information corresponding to all clustering points of the cluster where the clustering centers are located so as to realize the classification of the corpus information corresponding to the sentence vectors.
Specifically, the density clustering operation is to define clusters as the maximum set of densely connected clustering points, and to divide an area having a sufficiently high density into clusters, with the aim of finding the maximum set of densely connected objects.
Exemplarily, assume the radius ε = 3 and MinPts = 3. The ε-neighborhood of cluster point p contains the cluster points {m, p, p1, p2, o}; that of cluster point m contains {m, q, p, m1, m2}; that of cluster point q contains {q, m}; that of cluster point o contains {o, p, s}; and that of cluster point s contains {o, s, s1}. The core objects are then p, m, o, and s (q is not a core object, because its ε-neighborhood contains 2 cluster points, fewer than MinPts = 3);
cluster point m is directly density-reachable from cluster point p, because m is in the ε-neighborhood of p and p is a core object; cluster point q is density-reachable from cluster point p, because q is directly density-reachable from m and m is directly density-reachable from p; cluster point q is density-connected to cluster point s, because both q and s are density-reachable from p.
Note: ε-neighborhood: the region within radius ε of a given object is called the ε-neighborhood of the object;
core object: if the number of sample cluster points in the ε-neighborhood of a given object is greater than or equal to MinPts, the object is called a core object;
directly density-reachable: for a sample set D, if sample cluster point q is within the ε-neighborhood of p and p is a core object, then object q is directly density-reachable from object p.
Density-reachable: for a sample set D, given a chain of sample cluster points p1, p2, …, pn with p = p1 and q = pn, object q is density-reachable from object p if each object pi is directly density-reachable from pi-1.
Density-connected: if there is a cluster point o in the sample set D such that both object p and object q are density-reachable from o, then p and q are density-connected.
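The definitions above can be checked against the example neighborhoods with a short sketch. The names `NEIGHBORHOODS`, `is_core`, `density_reachable_from`, and `density_connected` are hypothetical, and the sketch assumes that the points whose neighborhoods the example does not list (p1, p2, m1, m2, s1) are not core objects.

```python
MIN_PTS = 3

# Neighborhoods of radius 3 exactly as listed in the example.
NEIGHBORHOODS = {
    "p": {"m", "p", "p1", "p2", "o"},
    "m": {"m", "q", "p", "m1", "m2"},
    "q": {"q", "m"},
    "o": {"o", "p", "s"},
    "s": {"o", "s", "s1"},
}


def is_core(point):
    """A core object has at least MIN_PTS points in its neighborhood."""
    # Assumption: points without a listed neighborhood are not core objects.
    return len(NEIGHBORHOODS.get(point, set())) >= MIN_PTS


def density_reachable_from(start):
    """All points density-reachable from start (empty if start is not core)."""
    reached = set()
    frontier = [start]
    while frontier:
        current = frontier.pop()
        if not is_core(current):
            continue  # only core objects pass reachability on
        for neighbor in NEIGHBORHOODS[current]:
            if neighbor not in reached:
                reached.add(neighbor)
                frontier.append(neighbor)
    return reached


def density_connected(a, b):
    """True if some core object reaches both a and b."""
    return any(
        {a, b} <= density_reachable_from(o) for o in NEIGHBORHOODS if is_core(o)
    )


print(sorted(pt for pt in NEIGHBORHOODS if is_core(pt)))
print("q" in density_reachable_from("p"))
print(density_reachable_from("q"))
print(density_connected("q", "s"))
```

The third line of output illustrates that density-reachability is not symmetric: q is reachable from p, but nothing is reachable from q because q is not a core object. Density-connectedness, by contrast, is symmetric, which is why it is the relation used to define a cluster.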
Therefore, the attention degree of the user to each product can be identified according to the corpus information through the density clustering algorithm, and the product which is most concerned by the user is obtained.
Example three:
referring to fig. 4, a corpus classifying device 1 based on a clustering model according to the present embodiment includes:
the information input module 12 is configured to obtain at least one corpus information, where the corpus information has at least one named entity information;
the entity identification module 13 is configured to perform named entity identification on the corpus information to obtain a named entity word corresponding to the named entity information;
a vector conversion adjustment module 14, configured to perform text vectorization on the corpus information to obtain a corpus vector, adjust a named entity vector corresponding to the named entity word in the corpus vector, or adjust other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;
and the corpus classification module 15 is configured to record the sentence vectors into a preset clustering model, and perform clustering operation on the recorded sentence vectors through the clustering model, so as to classify corpus information corresponding to the recorded sentence vectors.
Optionally, the corpus classifying device 1 based on the clustering model further includes:
the classification processing module 11 is configured to receive information to be classified, and determine an information type of the information to be classified; if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus; and if the information type is text information, the information to be classified is used as corpus information and is stored in the corpus.
Optionally, the information input module 12 includes:
an information receiving unit 121, configured to receive user information sent by a user; wherein, the user information refers to the identity information of the corpus information sender;
a corpus obtaining unit 122, configured to obtain at least one corpus information corresponding to the user information from a preset corpus; the corpus is used for storing corpus information associated with user information.
Optionally, the entity identifying module 13 includes:
a corpus word segmentation unit 131, configured to obtain a corpus text corresponding to the corpus information, and perform word segmentation on the corpus text to obtain at least one corpus word;
an entity recognizing unit 132, configured to compare the corpus words with a preset named entity dictionary, and set the corpus words belonging to the named entity dictionary as the named entity words; wherein the named entity dictionary has at least one named entity word therein.
Optionally, the vector conversion adjusting module 14 further includes:
the weight adjusting unit 141 is configured to perform word frequency inverse document calculation on the corpus vectors to obtain word frequency inverse document values reflecting the importance degrees of the corpus words in the corpus information, and adjust the corpus vectors to obtain semantic vectors by using the word frequency inverse document values as weights of the word vectors corresponding to the corpus words;
an entity lifting unit 142, configured to modify, through a preset lifting coefficient, a word vector corresponding to the named entity word in the semantic vector to obtain an entity vector, so that the semantic vector is converted into the sentence vector;
the entity reducing unit 143 is configured to set the word vector corresponding to the named entity word as an entity vector, and modify, through a preset reduction coefficient, the word vectors except the entity vector in the semantic vector, so that the semantic vector is converted into the sentence vector.
Optionally, the corpus classifying module 15 further includes:
an object group unit 151, configured to construct an object representing the corpus information according to the sentence vector in the clustering model, and divide at least one group in the clustering model; wherein the group characterizes the class to which the object belongs;
a cluster construction unit 152, configured to run the clustering model to perform k-means clustering operation on the objects in each group, so as to obtain clusters and clustering centers of each group; wherein the cluster is a set constructed by at least one object belonging to the group;
the mean classification unit 153 is configured to extract the center corpus information of the object corresponding to the clustering center, extract the named entity words of the center corpus information, and use the named entity words as the category information of the corpus information corresponding to all the objects in the cluster where the clustering center is located, so as to implement classification of the corpus information corresponding to the sentence vectors.
Optionally, the corpus classifying module 15 further includes:
a clustering operation unit 154, configured to construct a clustering point representing the corpus information in the clustering model according to the sentence vector, and operate the clustering model to perform density clustering operation on the clustering point to obtain at least one cluster and a clustering center thereof; wherein the cluster is a set of at least one of the cluster points;
and the density classification unit 155 is configured to extract the center corpus information of the clustering point corresponding to the clustering center, extract the named entity word of the center corpus information, and use the named entity word as category information of the corpus information corresponding to all clustering points of the cluster where the clustering center is located, so as to classify the corpus information corresponding to the sentence vector.
This technical solution is applied to the field of intelligent decision-making in artificial intelligence. Corpus information is obtained, and entity recognition is performed on it to obtain named entity words that reflect the objects the corpus information describes; text vectorization processing is performed on the corpus information to obtain corpus vectors, and the entity vectors corresponding to the named entity words in the corpus vectors are adjusted to obtain sentence vectors; the sentence vectors are input into a preset clustering model, which performs a clustering operation on the received sentence vectors to classify the corresponding corpus information, thereby achieving the technical effect of using the clustering model as a classification model for the corpus information.
Example four:
To achieve the above object, the present invention further provides a computer device 5. The components of the clustering-model-based corpus classifying device of the third embodiment may be distributed across different computer devices, and the computer device 5 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack-mounted server, a blade server, or a tower server (including an independent server or a server cluster composed of multiple application servers) that executes a program, or the like. The computer device of this embodiment includes at least, but is not limited to, a memory 51 and a processor 52, which may be communicatively coupled to each other via a system bus, as shown in FIG. 5. It should be noted that FIG. 5 only shows a computer device with components, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
In this embodiment, the memory 51 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 51 may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory 51 may be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device. Of course, the memory 51 may also include both internal and external storage devices of the computer device. In this embodiment, the memory 51 is generally used to store an operating system and various types of application software installed on the computer device, for example, the program code of the corpus classifying device based on the clustering model in the third embodiment. Further, the memory 51 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 52 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 52 is typically used to control the overall operation of the computer device. In this embodiment, the processor 52 is configured to run a program code stored in the memory 51 or process data, for example, run a corpus classifying device based on a clustering model, so as to implement the corpus classifying method based on a clustering model in the first embodiment and the second embodiment.
Example five:
to achieve the above objects, the present invention also provides a computer readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor 52, implements corresponding functions. The computer-readable storage medium of this embodiment is used for storing a computer program for implementing the clustering model-based corpus classifying method, and when being executed by the processor 52, implements the clustering model-based corpus classifying method of the first embodiment and the second embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A corpus classification method based on a clustering model is characterized by comprising the following steps:
acquiring at least one corpus information, wherein the corpus information has at least one named entity information;
carrying out named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information;
performing text vectorization processing on the corpus information to obtain a corpus vector; adjusting a named entity vector corresponding to the named entity words in the corpus vector; or
Adjusting other word and phrase vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;
and inputting the sentence vectors into a preset clustering model, and performing clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors.
2. The corpus classifying method according to claim 1, wherein before the obtaining of the at least one corpus information, the method further comprises:
receiving information to be classified, and judging the information type of the information to be classified;
if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus;
and if the information type is text information, storing the information to be classified as corpus information in the corpus.
3. The corpus classifying method according to claim 1, wherein the obtaining at least one corpus information includes:
receiving user information sent by a user side, wherein the user information refers to identity information of a corpus information sender;
and acquiring at least one corpus information corresponding to the user information from a preset corpus, wherein the corpus is used for storing the corpus information related to the user information.
4. The corpus classifying method according to claim 1, wherein the named entity recognizing the corpus information to obtain a named entity word corresponding to the named entity information comprises:
obtaining a corpus text corresponding to the corpus information, and performing word segmentation on the corpus text to obtain at least one corpus word;
and comparing the corpus words with a preset named entity dictionary, and setting the corpus words belonging to the named entity dictionary as the named entity words, wherein the named entity dictionary is provided with at least one named entity.
5. The corpus classification method according to claim 1, wherein said adjusting a named entity vector corresponding to the named entity word in the corpus vector or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector comprises:
performing word frequency inverse document calculation on the corpus vectors to obtain word frequency inverse document values reflecting the importance degrees of the corpus words in the corpus information, and adjusting the corpus vectors to obtain semantic vectors by taking the word frequency inverse document values as the weights of the word vectors corresponding to the corpus words;
modifying the word vector corresponding to the named entity word in the semantic vector through a preset lifting coefficient to obtain an entity vector, and converting the semantic vector into the sentence vector; or
And setting the word vector corresponding to the named entity word as an entity vector, and modifying other word vectors except the entity vector in the semantic vector through a preset reduction coefficient to convert the semantic vector into the sentence vector.
6. The corpus classification method according to claim 1, wherein when the clustering model is a K-MEANS clustering model, the clustering operation is performed on the entered sentence vectors through the clustering model to classify the corpus information corresponding to the entered sentence vectors, including:
constructing an object representing the corpus information in the clustering model according to the sentence vector, and dividing at least one group in the clustering model; wherein the group characterizes the class to which the object belongs;
operating the clustering model to perform k-means clustering operation on the objects in each group to obtain clusters and clustering centers of the clusters; wherein the cluster is a set constructed by at least one object belonging to the group;
and extracting the central corpus information of the object corresponding to the clustering center, extracting named entity words of the central corpus information, and using the named entity words as the category information of the corpus information corresponding to all the objects of the cluster where the clustering center is located so as to realize the classification of the corpus information corresponding to the sentence vectors.
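The steps of claim 6 can be sketched roughly as follows. This is a plain k-means loop with a deterministic initialization chosen only to keep the example reproducible. The claim does not define "the object corresponding to the clustering center", so the member nearest to the centroid is assumed here; the function names and toy data are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means; the first k points seed the centers (sketch only)."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each object joins the group of its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers


def label_clusters(points, assign, centers, entity_words):
    """Give every object the entity words of its cluster's central object."""
    category = {}
    for c, center in enumerate(centers):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            # Assumption: the central object is the member nearest the centroid.
            central = min(members, key=lambda i: math.dist(points[i], center))
            category[c] = entity_words[central]
    return [category[a] for a in assign]


points = [[0.0, 1.0], [5.0, 5.0], [0.1, 0.9], [5.1, 4.9], [0.0, 0.8]]
entity_words = [["loan"], ["insurance"], ["loan"], ["insurance"],
                ["loan", "repayment"]]
assign, centers = kmeans(points, k=2)
labels = label_clusters(points, assign, centers, entity_words)
```

In this toy data the fifth sentence carries two entity words, but it still receives the category of the central object of its cluster, which is how the claim propagates one category to all objects of a cluster.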
7. The corpus classification method according to claim 1, wherein when the clustering model is a density clustering model, performing the clustering operation on the input sentence vectors through the clustering model to classify the corpus information corresponding to the input sentence vectors comprises:
constructing clustering points representing the corpus information in the clustering model according to the sentence vectors, and operating the clustering model to perform density clustering operation on the clustering points to obtain at least one cluster and a clustering center thereof; wherein the cluster is a set of at least one of the cluster points;
and extracting the central corpus information of the clustering points corresponding to the clustering centers, extracting named entity words of the central corpus information, and using the named entity words as category information of the corpus information corresponding to all clustering points of the cluster where the clustering centers are located so as to realize the classification of the corpus information corresponding to the sentence vectors.
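As a rough illustration of claim 7, the sketch below uses DBSCAN as one possible density clustering algorithm; the claim does not name a specific algorithm. Density clusters have no built-in center, so the mean of the cluster points is assumed as the clustering center and the nearest clustering point as its representative. The parameters `eps` and `min_pts`, the function names, and the toy data are all hypothetical.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster id per clustering point; -1 marks noise."""
    unseen, noise = None, -1
    labels = [unseen] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unseen:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unseen:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def cluster_categories(points, labels, entity_words):
    """Map each cluster id to the entity words of its central clustering point."""
    categories = {}
    for c in sorted(set(labels) - {-1}):
        members = [i for i, label in enumerate(labels) if label == c]
        center = [sum(col) / len(members)
                  for col in zip(*(points[i] for i in members))]
        central = min(members, key=lambda i: math.dist(points[i], center))
        categories[c] = entity_words[central]
    return categories


points = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
          [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [9.0, 9.0]]
entity_words = [["loan"]] * 3 + [["insurance"]] * 3 + [["card"]]
labels = dbscan(points, eps=0.3, min_pts=3)
categories = cluster_categories(points, labels, entity_words)
```

Unlike k-means, the number of clusters is not fixed in advance, and isolated clustering points such as the last one are left unclassified as noise rather than forced into a cluster.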
8. A corpus classification apparatus based on a clustering model, comprising:
the information input module is used for acquiring at least one piece of corpus information, wherein the corpus information contains at least one piece of named entity information;
the entity recognition module is used for performing named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information;
the vector conversion adjusting module is used for performing text vectorization on the corpus information to obtain a corpus vector, adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;
and the corpus classification module is used for inputting the sentence vectors into a preset clustering model, and performing clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor of the computer device implements the steps of the clustering model based corpus classification method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program stored in the computer-readable storage medium, when being executed by a processor, implements the steps of the clustering model based corpus classification method according to any one of claims 1 to 7.
CN202210189341.2A 2022-03-01 2022-03-01 Corpus classification method, apparatus, device and storage medium based on clustering model Active CN114266255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210189341.2A CN114266255B (en) 2022-03-01 2022-03-01 Corpus classification method, apparatus, device and storage medium based on clustering model

Publications (2)

Publication Number Publication Date
CN114266255A true CN114266255A (en) 2022-04-01
CN114266255B CN114266255B (en) 2022-05-17

Family

ID=80833831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210189341.2A Active CN114266255B (en) 2022-03-01 2022-03-01 Corpus classification method, apparatus, device and storage medium based on clustering model

Country Status (1)

Country Link
CN (1) CN114266255B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292505A (en) * 2022-10-09 2022-11-04 深圳市明源云科技有限公司 Public opinion-based market analysis method, device, equipment and readable storage medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160117589A1 (en) * 2012-12-04 2016-04-28 Msc Intellectual Properties B.V. System and method for automatic document classification in ediscovery, compliance and legacy information clean-up
US20170052967A1 (en) * 2011-04-11 2017-02-23 Groupon, Inc. System, method, and computer program product for automated discovery, curation and editing of online local content
CN109101479A (en) * 2018-06-07 2018-12-28 苏宁易购集团股份有限公司 A kind of clustering method and device for Chinese sentence
CN110347835A (en) * 2019-07-11 2019-10-18 招商局金融科技有限公司 Text Clustering Method, electronic device and storage medium
US20190340430A1 (en) * 2012-08-16 2019-11-07 Groupon, Inc. Method, apparatus, and computer program product for classification of documents
CN110516073A (en) * 2019-08-30 2019-11-29 北京百度网讯科技有限公司 A kind of file classification method, device, equipment and medium
CN111666373A (en) * 2020-05-07 2020-09-15 华东师范大学 Chinese news classification method based on Transformer
CN112183101A (en) * 2020-10-13 2021-01-05 深圳壹账通智能科技有限公司 Text intention recognition method and device, electronic equipment and storage medium
CN112597300A (en) * 2020-12-15 2021-04-02 中国平安人寿保险股份有限公司 Text clustering method and device, terminal equipment and storage medium
US20210103634A1 (en) * 2019-10-04 2021-04-08 Omilia Natural Language Solutions Ltd. Unsupervised induction of user intents from conversational customer service corpora
CN112650852A (en) * 2021-01-06 2021-04-13 广东泰迪智能科技股份有限公司 Event merging method based on named entity and AP clustering
CN113705692A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Emotion classification method and device based on artificial intelligence, electronic equipment and medium
CN113761942A (en) * 2021-09-14 2021-12-07 合众新能源汽车有限公司 Semantic analysis method and device based on deep learning model and storage medium
CN113962196A (en) * 2020-12-29 2022-01-21 深圳平安智汇企业信息管理有限公司 Resume processing method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
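RAGHAD M. HADI et al.: "Proposed Method to Enhance Text Document Clustering Using Improved Fuzzy C Mean Algorithm with Named Entity Tag", AL-MANSOUR JOURNAL *
Zhang Weiwei (张卫卫): "Research on Text Semantic Mining Based on Topic Models and Sentence Vectors", China Master's Theses Full-text Database, Information Science and Technology *
He Yiyi (贺依依): "Research on Organization Name Disambiguation Based on Parallel Clustering Methods", China Master's Theses Full-text Database, Information Science and Technology *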

Also Published As

Publication number Publication date
CN114266255B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN110347835B (en) Text clustering method, electronic device and storage medium
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
US20230222366A1 (en) Systems and methods for semantic analysis based on knowledge graph
WO2019153551A1 (en) Article classification method and apparatus, computer device and storage medium
CN109471944B (en) Training method and device of text classification model and readable storage medium
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
WO2020147409A1 (en) Text classification method and apparatus, computer device, and storage medium
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
CN113011889B (en) Account anomaly identification method, system, device, equipment and medium
CN112487149B (en) Text auditing method, model, equipment and storage medium
CN111611807A (en) Keyword extraction method and device based on neural network and electronic equipment
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN111062803A (en) Financial business query and review method and system
CN114416998A (en) Text label identification method and device, electronic equipment and storage medium
CN111680161A (en) Text processing method and device and computer readable storage medium
CN114266255B (en) Corpus classification method, apparatus, device and storage medium based on clustering model
CN112527969B (en) Incremental intention clustering method, device, equipment and storage medium
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
CN111737607B (en) Data processing method, device, electronic equipment and storage medium
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN115358340A (en) Credit credit collection short message distinguishing method, system, equipment and storage medium
CN114528378A (en) Text classification method and device, electronic equipment and storage medium
CN110633468B (en) Information processing method and device for object feature extraction
CN113779248A (en) Data classification model training method, data processing method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40070348; Country of ref document: HK)