CN114266255A

CN114266255A - Corpus classification method, apparatus, device and storage medium based on clustering model

Info

Publication number: CN114266255A
Application number: CN202210189341.2A
Authority: CN
Inventors: 邹倩霞
Original assignee: Shenzhen One Ledger Science And Technology Service Co ltd
Current assignee: Shenzhen One Ledger Science And Technology Service Co ltd
Priority date: 2022-03-01
Filing date: 2022-03-01
Publication date: 2022-04-01
Anticipated expiration: 2042-03-01
Also published as: CN114266255B

Abstract

The invention relates to the technical field of artificial intelligence, and discloses a corpus classification method, apparatus, device and storage medium based on a clustering model. The corpus classification method comprises the following steps: acquiring at least one piece of corpus information; performing named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information; performing text vectorization processing on the corpus information to obtain a corpus vector; adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting the word vectors other than the named entity vector in the corpus vector, to obtain a sentence vector of the corpus vector; and inputting the sentence vectors into a preset clustering model, and performing a clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors. By clustering sentence vectors in which the entity vectors corresponding to the named entity words are highlighted, the invention classifies the corpus information according to the target classification expectation of the development end.

Description

Corpus classification method, apparatus, device and storage medium based on clustering model

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a corpus classification method, a corpus classification device, corpus classification equipment and a storage medium based on a clustering model.

Background

Clustering is an unsupervised learning method that is applied in practice to tasks such as customer segmentation and the classification of animals and plants. Clustering groups similar data into one cluster and dissimilar data into different clusters according to the characteristics of the data. Clustering algorithms are also commonly used for text classification in the field of NLP. They can classify texts effectively without supervision and give a clear picture of the categories of information the texts express and the central meaning of each category.

However, the inventor realizes that current clustering algorithms perform the clustering operation directly on the vectors corresponding to the corpus, which makes it difficult to cluster the corpus according to the classification expectation of the development end (for example, classifying the corpus according to a user's interest in a certain product). As a result, current clustering algorithms have difficulty outputting the classification result that the development end requires.

Disclosure of Invention

The invention aims to provide a corpus classification method, apparatus, device and storage medium based on a clustering model, to solve the problem in the prior art that a corpus is difficult to cluster according to the classification expectation of a development end, so that the classification result required by the development end is difficult to obtain.

In order to achieve the above object, the present invention provides a corpus classification method based on a clustering model, which comprises:

acquiring at least one corpus information, wherein the corpus information has at least one named entity information;

carrying out named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information;

performing text vectorization processing on the corpus information to obtain a corpus vector; adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;

and inputting the sentence vectors into a preset clustering model, and performing clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors.

In the foregoing solution, before the obtaining of the at least one corpus information, the method further includes:

receiving information to be classified, and judging the information type of the information to be classified;

if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus;

and if the information type is text information, storing the information to be classified as corpus information in the corpus.

In the foregoing solution, the obtaining at least one corpus information includes:

receiving user information sent by a user side; wherein, the user information refers to the identity information of the corpus information sender;

acquiring at least one corpus information corresponding to the user information from a preset corpus; the corpus is used for storing corpus information associated with user information.

In the foregoing solution, the performing named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information includes:

obtaining a corpus text corresponding to the corpus information, and performing word segmentation on the corpus text to obtain at least one corpus word;

comparing the corpus words with a preset named entity dictionary, and setting the corpus words belonging to the named entity dictionary as the named entity words; wherein the named entity dictionary has at least one named entity therein.

In the foregoing solution, the adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector includes:

performing term frequency-inverse document frequency (TF-IDF) calculation on the corpus vectors to obtain TF-IDF values reflecting the importance of the corpus words in the corpus information, and adjusting the corpus vectors, with the TF-IDF values as the weights of the word vectors corresponding to the corpus words, to obtain semantic vectors;

modifying the word vector corresponding to the named entity in the semantic vector through a preset lifting coefficient to obtain an entity vector, and converting the semantic vector into the sentence vector; or

setting the word vector corresponding to the named entity as an entity vector, and modifying the word vectors other than the entity vector in the semantic vector through a preset reduction coefficient to convert the semantic vector into the sentence vector.
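The two alternatives above (raising the entity vectors with a lifting coefficient, or lowering the other word vectors with a reduction coefficient) can be illustrated with a minimal Python sketch. The function name, the parameter names, and the choice of summing the scaled word vectors into one sentence vector are assumptions made for illustration; the source does not specify how the semantic vector is converted into the sentence vector.

```python
from typing import Dict, List, Sequence, Set


def build_sentence_vector(
    words: Sequence[str],
    word_vectors: Dict[str, List[float]],
    tfidf: Dict[str, float],
    entity_words: Set[str],
    lift: float = 1.0,
    reduction: float = 1.0,
) -> List[float]:
    """Scale each TF-IDF-weighted word vector and sum the results.

    Entity words are multiplied by `lift`; all other words by `reduction`.
    Summation as the pooling step is an assumption, not taken from the source.
    """
    dim = len(next(iter(word_vectors.values())))
    sentence = [0.0] * dim
    for word in words:
        scale = tfidf[word] * (lift if word in entity_words else reduction)
        for i, component in enumerate(word_vectors[word]):
            sentence[i] += scale * component
    return sentence


# Hypothetical toy data: two words, two-dimensional word vectors.
vectors = {"buy": [1.0, 0.0], "fund": [0.0, 1.0]}
weights = {"buy": 0.5, "fund": 0.5}
boosted = build_sentence_vector(["buy", "fund"], vectors, weights, {"fund"}, lift=2.0)
damped = build_sentence_vector(["buy", "fund"], vectors, weights, {"fund"}, reduction=0.5)
```

In both calls the entity word "fund" ends up contributing twice as much to the sentence vector as "buy", so either alternative highlights the named entity relative to the other words.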

In the foregoing solution, when the clustering model is a K-MEANS clustering model, the clustering operation is performed on the entered sentence vectors through the clustering model to classify the corpus information corresponding to the entered sentence vectors, including:

constructing an object representing the corpus information in the clustering model according to the sentence vector, and dividing at least one group in the clustering model; wherein the group characterizes the class to which the object belongs;

operating the clustering model to perform k-means clustering operation on the objects in each group to obtain clusters and clustering centers of the clusters; wherein the cluster is a set constructed by at least one object belonging to the group;

and extracting the central corpus information of the object corresponding to the clustering center, extracting the named entity of the central corpus information, and using the named entity as the category information of the corpus information corresponding to all the objects of the cluster where the clustering center is located so as to realize the classification of the corpus information corresponding to the sentence vector.
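The K-MEANS variant above can be sketched as follows. This is an illustrative plain-Python k-means, not the patent's implementation: seeding the centers with the first k objects, taking the object nearest to each cluster center as the "central" object, and all function names are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def kmeans(points: Sequence[Sequence[float]], k: int, iterations: int = 20
           ) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means; centers are seeded with the first k points (assumption)."""
    centers = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every object to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


def classify_by_center_entity(points: Sequence[Sequence[float]],
                              entities: Sequence[str], k: int) -> List[str]:
    """Label every object with the named entity of its cluster's central object."""
    assignment, centers = kmeans(points, k)
    labels: Dict[int, str] = {}
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            central = min(members, key=lambda i: math.dist(points[i], centers[c]))
            labels[c] = entities[central]
    return [labels[a] for a in assignment]


# Hypothetical sentence vectors and the named entity found in each corpus item.
sentence_vectors = [[0, 0], [10, 10], [0, 1], [10, 11], [0, 2], [10, 12]]
named_entities = ["loan", "deposit", "fund", "bank", "loan", "deposit"]
categories = classify_by_center_entity(sentence_vectors, named_entities, k=2)
```

Here the objects around `[0, 1]` all receive the category of the central object ("fund"), and the objects around `[10, 11]` receive "bank", regardless of the named entity each individual item carries.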

In the foregoing solution, when the clustering model is a density clustering model, the clustering operation is performed on the entered sentence vectors through the clustering model to classify the corpus information corresponding to the entered sentence vectors, including:

constructing clustering points representing the corpus information in the clustering model according to the sentence vectors, and operating the clustering model to perform density clustering operation on the clustering points to obtain at least one cluster and a clustering center thereof; wherein the cluster is a set of at least one of the cluster points;

and extracting the central corpus information of the clustering points corresponding to the clustering centers, extracting the named entities of the central corpus information, and using the named entities as the category information of the corpus information corresponding to all the clustering points of the cluster where the clustering centers are located so as to realize the classification of the corpus information corresponding to the sentence vectors.
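The density clustering variant can be sketched in a similar way. The source does not name a specific density algorithm or define the clustering center of a density cluster, so this sketch assumes a DBSCAN-style procedure with hypothetical parameters `eps` and `min_pts`, and takes the medoid (the member with the smallest total distance to the other members) as the clustering center.

```python
import math
from typing import List, Optional, Sequence


def density_cluster(points: Sequence[Sequence[float]], eps: float,
                    min_pts: int) -> List[Optional[int]]:
    """DBSCAN-style clustering; returns a cluster id per point, -1 for noise."""
    labels: List[Optional[int]] = [None] * len(points)  # None means unvisited

    def neighbors(i: int) -> List[int]:
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise, may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep expanding
    return labels


def cluster_medoid(points: Sequence[Sequence[float]],
                   labels: Sequence[Optional[int]], cluster: int) -> int:
    """Index of the member closest to all other members (assumed 'center')."""
    members = [i for i, label in enumerate(labels) if label == cluster]
    return min(members,
               key=lambda i: sum(math.dist(points[i], points[j]) for j in members))


# Hypothetical clustering points built from sentence vectors.
cluster_points = [[0, 0], [0, 1], [0, 2], [10, 10], [10, 11], [10, 12], [50, 50]]
cluster_labels = density_cluster(cluster_points, eps=1.5, min_pts=2)
center_index = cluster_medoid(cluster_points, cluster_labels, 0)
```

The named entity of the corpus information at `center_index` would then serve as the category for every clustering point in that cluster; the isolated point is left as noise rather than forced into a class.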

In order to achieve the above object, the present invention further provides a corpus classifying device based on a clustering model, including:

the information input module is used for acquiring at least one corpus information, wherein the corpus information is provided with at least one named entity information;

the entity identification module is used for carrying out named entity identification on the corpus information to obtain a named entity word corresponding to the named entity information;

the vector conversion adjusting module is used for performing text vectorization on the corpus information to obtain a corpus vector, adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;

and the corpus classification module is used for inputting the sentence vectors into a preset clustering model, and performing clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors.

In order to achieve the above object, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor of the computer device implements the steps of the clustering model-based corpus classification method when executing the computer program.

In order to achieve the above object, the present invention further provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program stored on the computer-readable storage medium, when being executed by a processor, implements the steps of the clustering model-based corpus classification method.

According to the corpus classification method, apparatus, device and storage medium based on the clustering model provided by the invention, entity recognition is performed on the corpus information to obtain named entity words reflecting the objects that the corpus information describes, so that subsequent clustering is carried out according to the objects users are concerned with, which helps to classify the corpus based on the target classification expectation of the development end. Text vectorization processing is performed on the corpus information to obtain corpus vectors, that is, vector data that the clustering model can recognize. Sentence vectors in which the named entity words are highlighted are then obtained either by adjusting the entity vectors corresponding to the named entity words in the corpus vectors to raise their weights, or by lowering the weights of the word vectors other than the entity vectors. The subsequent clustering model can therefore classify the corpus information accurately by named entity word according to the sentence vectors, and other information in the corpus information that is irrelevant to the classification expectation is prevented from interfering with the expected classification result. Because the clustering model classifies sentence vectors in which the entity vectors corresponding to the named entity words are highlighted, and the named entity words represent the target classification expectation of the development end, the corpus information is classified according to the target classification expectation of the development end.

Drawings

FIG. 1 is a flowchart of a first embodiment of the corpus classification method based on a clustering model according to the present invention;

FIG. 2 is a schematic diagram of the application environment of the corpus classification method based on a clustering model according to a second embodiment of the present invention;

FIG. 3 is a flowchart of the specific method of the corpus classification method based on a clustering model according to the second embodiment of the present invention;

FIG. 4 is a schematic diagram of a third embodiment of the corpus classification apparatus based on a clustering model according to the present invention;

FIG. 5 is a schematic diagram of the hardware structure of a computer device according to a fourth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a corpus classification method, apparatus, device and storage medium based on a clustering model, which are suitable for the technical field of artificial intelligence and are based on an information input module, an entity recognition module, a vector conversion adjustment module and a corpus classification module. The invention acquires corpus information; performs entity recognition on the corpus information to obtain named entity words reflecting the objects that the corpus information describes; performs text vectorization on the corpus information to obtain corpus vectors, and adjusts the entity vectors corresponding to the named entity words in the corpus vectors to obtain sentence vectors; and inputs the sentence vectors into a preset clustering model, which performs a clustering operation on the received sentence vectors to classify the corpus information corresponding to the sentence vectors.

It should be noted that a named entity is the name of a person, an organization, or a place, or any other entity identified by a name. In a broader sense, named entities also include numbers, dates, currencies, addresses, and the like.

The first embodiment is as follows:

Referring to FIG. 1, a corpus classification method based on a clustering model according to the embodiment includes:

S102: acquiring at least one corpus information, wherein the corpus information has at least one named entity information;

S103: carrying out named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information;

S104: performing text vectorization processing on the corpus information to obtain a corpus vector; adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;

S105: inputting the sentence vectors into a preset clustering model, and performing clustering operation on the input sentence vectors through the clustering model so as to classify the corpus information corresponding to the input sentence vectors.

In an exemplary embodiment, user information sent by the development end is received, and corpus information corresponding to the user information is acquired from the corpus. The corpus information is thus obtained in a targeted manner, so that a user portrait for the user information can subsequently be constructed from the classification and clustering of the corpus information.

Entity recognition is performed on the corpus information to obtain named entity words reflecting the objects that the corpus information describes, so that subsequent clustering is carried out according to the objects users are concerned with, which helps to classify the corpus based on the target classification expectation of the development end. The named entity words obtained by the entity recognition can be set according to the requirements of the development end.

Text vectorization is performed on the corpus information to obtain corpus vectors, that is, vector data that the clustering model can recognize. Sentence vectors in which the named entity words are highlighted are then obtained either by adjusting the entity vectors corresponding to the named entity words in the corpus vectors to raise their weights, or by lowering the weights of the word vectors other than the entity vectors. The subsequent clustering model can therefore classify the corpus information accurately by named entity word according to the sentence vectors, and other information in the corpus information that is irrelevant to the classification expectation is prevented from interfering with the expected classification result.

The clustering model classifies sentence vectors in which the entity vectors corresponding to the named entity words are highlighted, and the named entity words represent the target classification expectation of the development end, so the corpus information is classified according to the target classification expectation of the development end.

The second embodiment is as follows:

This embodiment is a specific application scenario of the first embodiment, through which the method provided by the present invention can be explained more clearly and specifically.

Next, the method provided in this embodiment is specifically described by taking an example in which text vectorization processing is performed on corpus information in a server running a corpus classification method based on a clustering model to obtain corpus vectors, entity vectors corresponding to named entity words in the corpus vectors are adjusted to obtain sentence vectors, and then the corpus information is classified according to the sentence vectors. It should be noted that the present embodiment is only exemplary, and does not limit the protection scope of the embodiments of the present invention.

FIG. 2 is a schematic diagram of the application environment of the corpus classification method based on the clustering model according to the second embodiment of the present application.

In an exemplary embodiment, the server 2 in which the corpus classification method based on the clustering model is located is respectively connected with the development end 3 and the client 4 through a network; the server 2 may provide services through one or more networks, which may include various network devices, such as routers, switches, multiplexers, hubs, modems, bridges, repeaters, firewalls, proxy devices, and/or the like. The network may include physical links, such as coaxial cable links, twisted pair cable links, fiber optic links, combinations thereof, and/or the like. The network may include wireless links, such as cellular links, satellite links, Wi-Fi links, and/or the like; the development end 3 and the client end 4 can be respectively a computer device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like.

FIG. 3 is a flowchart of the specific method of the clustering-model-based corpus classification according to an embodiment of the present invention; the method specifically includes steps S201 to S205.

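S201: receiving information to be classified, and judging the information type of the information to be classified;

if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus;

and if the information type is text information, storing the information to be classified as corpus information in the corpus.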

Generally, the information to be classified sent by a user side may be text information generated by character input or voice information generated by voice input. To broaden its range of application, the present application converts information to be classified that is voice information into conversion information that is text information and stores the conversion information in the corpus, and stores information to be classified that is text information in the corpus directly, so that both voice and text corpora can be classified.

Further, user information corresponding to the information to be classified is obtained, wherein the user information refers to identity information of a sender of the information to be classified;

and inputting the user information into the corpus, and associating the user information with the corpus information corresponding to the information to be classified.

In this embodiment, the user information includes a code of a terminal used by the sender and/or a phone number of the terminal, and the registration account information, and/or ID information, and/or identification number of the sender.

And taking the user information as a label of the information to be classified, or constructing a mapping table reflecting the mapping relation between the user information and the corpus information, so that the user information and the corpus information are correlated.

The user information and the corpus information are stored in the corpus as key-value pairs. Specifically, the corpus information sharing the same user information is gathered into a corpus set, and a key-value pair is constructed in the corpus with the user information as the key and the corpus set as the value. In subsequent corpus classification, the corpus information sent by a sender can then be retrieved quickly by sender, which improves the efficiency of information acquisition.
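The key-value storage described here can be sketched with an in-memory dictionary. The class name `KeyValueCorpus` and its methods are hypothetical, and a real deployment would presumably use a persistent key-value store; the sketch only shows user information acting as the key and the corpus set as the value.

```python
from collections import defaultdict
from typing import DefaultDict, List


class KeyValueCorpus:
    """In-memory stand-in for the corpus: user information -> corpus set."""

    def __init__(self) -> None:
        self._store: DefaultDict[str, List[str]] = defaultdict(list)

    def add(self, user_info: str, corpus_info: str) -> None:
        # Corpus information with the same user information joins one corpus set.
        self._store[user_info].append(corpus_info)

    def get(self, user_info: str) -> List[str]:
        # Return a copy so callers cannot mutate the stored corpus set.
        return list(self._store.get(user_info, []))


# Hypothetical user identifiers and corpus items.
corpus = KeyValueCorpus()
corpus.add("user-001", "I would like to know more about the fund")
corpus.add("user-001", "What is the interest rate of the loan")
corpus.add("user-002", "How do I open an account")
```

Looking up `corpus.get("user-001")` then returns only that sender's two corpus items, which is the targeted retrieval the later acquisition step relies on.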

In FIG. 3, S201 is shown with the following labels:

S201-1: receiving information to be classified, and judging the information type of the information to be classified;

S201-2: if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus;

S201-3: if the information type is text information, storing the information to be classified as corpus information in the corpus.
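The branching on information type can be sketched as a small dispatch function. The function name `ingest`, the type strings "voice" and "text", and the `speech_to_text` callable are hypothetical; the source does not say which speech recognition component performs the conversion, so it is passed in as a parameter here.

```python
from typing import Callable, List, Union


def ingest(info_type: str, payload: Union[bytes, str], corpus: List[str],
           speech_to_text: Callable[[bytes], str]) -> None:
    """Store information to be classified in the corpus as text."""
    if info_type == "voice":
        # Voice information is first converted into text information.
        corpus.append(speech_to_text(payload))
    elif info_type == "text":
        # Text information is stored directly.
        corpus.append(payload)
    else:
        raise ValueError(f"unsupported information type: {info_type}")


def fake_transcriber(audio: bytes) -> str:
    """Placeholder for a real speech recognizer (assumption for the demo)."""
    return audio.decode("utf-8")


stored: List[str] = []
ingest("text", "tell me about the fund", stored, fake_transcriber)
ingest("voice", b"tell me about the loan", stored, fake_transcriber)
```

Keeping the converter as an injected callable means the storage logic stays the same whichever speech recognition service is eventually chosen.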

S202: acquiring at least one corpus information, wherein the corpus information has at least one named entity information.

In this step, user information sent by the development end is received, and corpus information corresponding to the user information is acquired from the corpus. The corpus information is thus obtained in a targeted manner, so that a user portrait for the user information can subsequently be constructed from the classification and clustering of the corpus information.

In a preferred embodiment, the obtaining the corpus information includes:

S21: receiving user information sent by a user side; wherein, the user information refers to the identity information of the corpus information sender;

In this step, the user information includes a code of a terminal used by the sender and/or a phone number of the terminal, and the registration account information, and/or ID information, and/or an identification number of the sender.

By determining both the identity of the sender and the identity of the terminal the sender commonly uses, the uniqueness of the user information is ensured. This avoids an inaccurate user portrait caused by the sender later using another terminal to send corpus information, or by other people using the sender's terminal to send corpus information.

S22: acquiring at least one corpus information corresponding to the user information from a preset corpus; the corpus is used for storing corpus information associated with user information.

In this step, the user information and the corpus information are associated with each other by using the user information as a label of the information to be classified or by constructing a mapping table reflecting a mapping relationship between the user information and the corpus information.

S203: carrying out named entity recognition on the corpus information to obtain a named entity word corresponding to the named entity information.

To identify the objects users are concerned with in the corpus information, entity recognition is performed on the corpus information to obtain named entity words reflecting the objects that the corpus information describes, so that subsequent clustering is carried out according to the objects users are concerned with, which helps to classify the corpus based on the target classification expectation of the development end. The named entity words obtained by the entity recognition can be set according to the requirements of the development end.

In a preferred embodiment, performing entity recognition on corpus information to obtain named entity words reflecting semantics of the corpus information, includes:

S31: obtaining a corpus text corresponding to the corpus information, and performing word segmentation on the corpus text to obtain at least one corpus word.

In this step, the corpus information is computer data stored in the corpus as a message or in machine language/assembly language; the corresponding corpus text is obtained so that a word segmentation tool can recognize and segment it. The corpus text is then segmented with a word segmentation tool, namely jieba, and/or THULAC, and/or SnowNLP, and/or pynlpir, and/or CoreNLP, and/or pyLTP, to obtain the corpus words.

jieba is a popular open-source Chinese word segmentation package that is fast, accurate and extensible, and at present mainly supports Python.

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit that provides Chinese word segmentation and part-of-speech tagging. It is capable, accurate and fast.

SnowNLP is a class library written in Python that can be used to process Chinese text content.

pynlpir is a package for the Chinese word segmentation system developed by the Chinese Academy of Sciences, used for segmenting Chinese text.

CoreNLP is a natural language processing toolkit. It integrates many practical functions, including word segmentation, part-of-speech tagging, syntactic analysis, and so on.

pyLTP is a Python wrapper for LTP that provides word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling.

S32: comparing the corpus words with a preset named entity dictionary, and setting the corpus words belonging to the named entity dictionary as the named entity words; wherein the named entity dictionary has at least one named entity word therein.

In this step, the named entity dictionary includes a general dictionary and a user-defined dictionary, and is installed in the server running the corpus classification method based on a clustering model.

The general dictionary records the named entities of the server's application scenario, including company and organization names (a certain bank, a certain insurance company), financial product names (No. 3 product issued every day of the gold), and the like.

The user-defined dictionary records simplified names and/or aliases of the named entities in the general dictionary. For example: named entity: product No. 3 is issued every day, simplified name: jinli No. 3; named entity: Bank of China (中国银行), simplified name: 中行; and so on.

In this embodiment, the named entity dictionary may be set according to the target classification expectation of the development end. If the development end needs to classify by the named entity words corresponding to the products a user is interested in, an entity dictionary containing the named entity words of those products is constructed; if the development end needs to classify by the enterprises a user is interested in, an entity dictionary containing the named entity words of those enterprises is constructed.
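The dictionary comparison in S32 can be sketched as a lookup against the two dictionaries. The entity names, the alias entries, and the choice of mapping an alias back to its full name in the general dictionary are assumptions for illustration; the source only states that corpus words belonging to the dictionary are set as named entity words.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical dictionaries; real entries would come from the development end.
GENERAL_DICTIONARY: Set[str] = {"Example Bank", "Example Fund No. 3"}
USER_DEFINED_DICTIONARY: Dict[str, str] = {
    "EB": "Example Bank",          # simplified name -> named entity
    "Fund 3": "Example Fund No. 3",
}


def recognize_entities(corpus_words: Sequence[str],
                       general: Set[str],
                       user_defined: Dict[str, str]) -> List[str]:
    """Return the named entity words found among the segmented corpus words."""
    found: List[str] = []
    for word in corpus_words:
        if word in general:
            found.append(word)
        elif word in user_defined:
            # Assumption: aliases are normalized to the full entity name.
            found.append(user_defined[word])
    return found


segmented = ["I", "want", "to", "buy", "Fund 3", "from", "Example Bank"]
entities = recognize_entities(segmented, GENERAL_DICTIONARY, USER_DEFINED_DICTIONARY)
```

Swapping in a product dictionary or an enterprise dictionary changes which words are treated as entities, which is how the development end steers the later clustering toward its own classification expectation.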

S204: performing text vectorization processing on the corpus information to obtain a corpus vector; and adjusting a named entity vector corresponding to the named entity word in the corpus vector, or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector.

To enable the clustering model to classify according to the target classification expectation of the development end, and to prevent other information in the corpus information that is irrelevant to the classification expectation from interfering with the expected classification result, this step performs text vectorization on the corpus information to obtain corpus vectors, that is, vector data that the clustering model can recognize. Sentence vectors in which the named entity words are highlighted are then obtained either by adjusting the entity vectors corresponding to the named entity words in the corpus vectors to raise their weights, or by lowering the weights of the word vectors other than the entity vectors, so that the subsequent clustering model can classify the corpus information accurately by named entity word according to the sentence vectors.

In this embodiment, the corpus information is subjected to text vectorization processing to obtain a corpus vector, where the corpus vector includes word vectors corresponding to each corpus word in the corpus information, and the text vectorization processing is a process of representing a text as a series of vectors capable of expressing text semantics to serve as input information of a clustering model.

Further, a vectorization tool based on word2vec, and/or NNLM, and/or C&W is adopted to perform text vectorization processing on the corpus information to obtain the corpus vectors.

Here, word2vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2vec is a text vectorization method based on a bag-of-words model that takes the word as its processing unit.

NNLM is a Neural Network Language Model, which differs from conventional methods in that the n-gram conditional probabilities are estimated directly through a neural network structure. Because the NNLM uses low-dimensional, dense word vectors to express the context, the problems of data sparseness, semantic gaps and the like caused by the bag-of-words model are solved.

C&W (context & word, that is, context and target word) is used to complete NLP tasks based on word vectors, such as part-of-speech tagging, named entity recognition, phrase recognition, and semantic role labeling.

In a preferred embodiment, the performing text vectorization processing on the corpus information to obtain a corpus vector, and modifying an entity vector corresponding to the named entity word in the corpus vector to obtain a sentence vector includes:

S41: performing word frequency inverse document (TF-IDF) calculation on the corpus vectors to obtain word frequency inverse document values reflecting the importance of each corpus word in the corpus information, and adjusting the corpus vectors, with the word frequency inverse document values as the weights of the word vectors corresponding to the corpus words, to obtain semantic vectors.

In this step, the word frequency inverse document calculation uses the TF-IDF algorithm, a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as the frequency with which it appears across the corpus increases.

The word vector corresponding to each corpus word in the corpus vector is adjusted by its word frequency inverse document value, raising the vector values of important corpus words and lowering the vector values of unimportant corpus words, to obtain a semantic vector that highlights the semantics of the corpus information.

S42: modifying the word vector corresponding to the named entity word in the semantic vector by a preset lifting coefficient to obtain an entity vector, so that the semantic vector is converted into the sentence vector; or

S43: setting the word vector corresponding to the named entity word as an entity vector, and modifying the word vectors other than the entity vector in the semantic vector by a preset reduction coefficient, so that the semantic vector is converted into the sentence vector.

In these steps, the lifting coefficient is a preset parameter for raising the element values of a word vector: multiplying the entity vector by the lifting coefficient raises the element values of the entity vector. The reduction coefficient is a preset parameter for lowering the element values of a word vector: multiplying the other word vectors by the reduction coefficient lowers their element values.

Illustratively, the corpus of user M includes the following corpus information. Corpus information 1: "how much money is product A", with the corpus words obtained by word segmentation being: product A / how much money. Corpus information 2: "product A, book keeping", with the corpus words obtained by word segmentation being: product A / warranty. Corpus information 3: "how much is the interest of product A", with the corpus words obtained by word segmentation being: product A / how much interest. Corpus information 4: "how much money is product B", with the corpus words obtained by word segmentation being: product B / how much money. The named entity words "product A" and "product B" are extracted by the above method.

Using the TF-IDF algorithm, the TF value of "product A" in corpus information 1 is TF1-1 = 0.5, and its IDF value is IDF1-1 = ln(4/3) = 0.29; therefore, the TF-IDF value of "product A" in corpus information 1 is TF-IDF1-1 = 0.5 × 0.29 = 0.145. The TF value of "how much money" in corpus information 1 is TF1-2 = 0.5, and its IDF value is IDF1-2 = ln(4/2) = 0.693; therefore, the TF-IDF value of "how much money" in corpus information 1 is TF-IDF1-2 = 0.5 × 0.693 = 0.3465.
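The TF-IDF figures above can be reproduced with a short sketch. It assumes the plain definitions TF = count / document length and IDF = ln(N / document frequency), which match the numbers in the text; the function name `tf_idf` and the token spellings are illustrative assumptions. Without intermediate rounding the first value is about 0.144, whereas the text rounds the IDF to 0.29 first and so reports 0.145.

```python
import math


def tf_idf(corpora):
    """corpora: list of token lists. Returns one {word: tf-idf} dict per item."""
    n_docs = len(corpora)
    # Document frequency: in how many corpus items each word occurs.
    doc_freq = {}
    for words in corpora:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    scores = []
    for words in corpora:
        item_scores = {}
        for word in set(words):
            tf = words.count(word) / len(words)
            idf = math.log(n_docs / doc_freq[word])
            item_scores[word] = tf * idf
        scores.append(item_scores)
    return scores


# Segmented corpus words of corpus information 1 to 4 (assumed spellings).
corpora = [
    ["product A", "how much money"],
    ["product A", "warranty"],
    ["product A", "how much interest"],
    ["product B", "how much money"],
]
scores = tf_idf(corpora)
print(round(scores[0]["product A"], 4))       # 0.1438
print(round(scores[0]["how much money"], 4))  # 0.3466
```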

Assuming that the word vectors of corpus information 1 are (0, 1) and (1, 0), so that its corpus vector is (0, 1, 1, 0), each TF-IDF value is multiplied by the corresponding word vector and the word vectors are combined, giving semantic vector 1 of corpus information 1 as (0, 0.145, 0.3465, 0).

Suppose that the word vectors of corpus information 2 are (0, 1) and (2, 0), those of corpus information 3 are (0, 1) and (3, 0), and those of corpus information 4 are (0, 3) and (1, 0). The combined semantic vectors of corpus information 2 to 4 are then as follows: semantic vector 2 of corpus information 2 is (0, 0.145, 1.386, 0), semantic vector 3 of corpus information 3 is (0, 0.145, 2.079, 0), and semantic vector 4 of corpus information 4 is (0, 2.079, 0.3465, 0).

Assuming that the lifting coefficient is 10, the resulting sentence vector 1 is (0, 1.45, 0.3465, 0), sentence vector 2 is (0, 1.45, 1.386, 0), sentence vector 3 is (0, 1.45, 2.079, 0), and sentence vector 4 is (0, 20.79, 0.3465, 0).

Assuming that the reduction coefficient is 0.1, the resulting sentence vector 1 is (0, 0.145, 0.03465, 0), sentence vector 2 is (0, 0.145, 0.1386, 0), sentence vector 3 is (0, 0.145, 0.2079, 0), and sentence vector 4 is (0, 2.079, 0.03465, 0).
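The lifting and reduction steps (S42 and S43) can be sketched as an element-wise scaling. This is an illustration under assumptions: the function name `to_sentence_vector` and the `entity_positions` argument are hypothetical, the named entity word is assumed to occupy the first two elements of the semantic vector as in the example, and the second word's weight is taken as 0.3465 (0.5 × 0.693, as computed in the TF-IDF step).

```python
def to_sentence_vector(semantic_vector, entity_positions, lift=None, reduction=None):
    """Scale entity elements by `lift`, or all other elements by `reduction`.

    Exactly one of `lift` (S42) or `reduction` (S43) must be given.
    """
    if (lift is None) == (reduction is None):
        raise ValueError("specify exactly one of lift or reduction")
    sentence_vector = []
    for index, value in enumerate(semantic_vector):
        is_entity = index in entity_positions
        if lift is not None and is_entity:
            value = value * lift          # raise the entity vector elements
        elif reduction is not None and not is_entity:
            value = value * reduction     # lower the non-entity elements
        sentence_vector.append(value)
    return sentence_vector


semantic_vector_1 = [0, 0.145, 0.3465, 0]
entity_positions = {0, 1}  # elements contributed by the named entity word

lifted = to_sentence_vector(semantic_vector_1, entity_positions, lift=10)
reduced = to_sentence_vector(semantic_vector_1, entity_positions, reduction=0.1)
print(lifted)
print(reduced)
```

Both variants change the ratio between entity and non-entity elements by the same factor of 10 here, so either one makes the named entity word dominate distance computations in the clustering step.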

S205: inputting the sentence vectors into a preset clustering model, and performing a clustering operation on the received sentence vectors through the clustering model to classify the corpus information corresponding to the sentence vectors.

In this step, the clustering model may be a K-MEANS clustering model or a density clustering model. The K-MEANS clustering model runs the k-means clustering algorithm, an iteratively solved cluster analysis algorithm; the density clustering model runs the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, a representative density-based clustering algorithm.

In this step, the clustering model classifies sentence vectors in which the entity vectors corresponding to the named entity words are highlighted. Because the named entity words represent the target classification expectation of the development end, the technical effect of classifying the corpus information according to the target classification expectation of the development end is achieved.

In a preferred embodiment, when the clustering model is a K-MEANS clustering model, the performing a clustering operation on the sentence vectors received by the clustering model by using the clustering model to classify the corpus information corresponding to the sentence vectors includes:

S51: constructing an object representing the corpus information in the clustering model according to the sentence vector, and dividing at least one group in the clustering model; wherein the group characterizes the class to which the object belongs;

S52: operating the clustering model to perform k-means clustering operation on the objects in each group to obtain clusters and clustering centers of the clusters; wherein the cluster is a set constructed by at least one object belonging to the group;

S53: extracting the central corpus information of the object corresponding to the clustering center, extracting named entity words of the central corpus information, and using the named entity words as the category information of the corpus information corresponding to all the objects of the cluster where the clustering center is located, so as to realize the classification of the corpus information corresponding to the sentence vectors.

Specifically, the k-means clustering operation includes: dividing the data into K groups in advance through the clustering model, randomly selecting K objects as the initial clustering centers, calculating the distance between each object and each clustering center, and assigning each object to the nearest clustering center. A clustering center and the objects assigned to it represent a cluster. Each time a sample is assigned, the clustering center of the cluster is recalculated based on the objects currently in the cluster. This process is repeated until a termination condition is met. The termination condition may be that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) clustering centers change, or that the sum of squared errors reaches a local minimum.
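The k-means operation can be sketched in plain Python as below. This is a minimal illustration, not the patented implementation: it recomputes the centers once per pass (a batch update) rather than after each individual assignment, uses squared Euclidean distance, and stops when no object is reassigned. The function names, the `seed` parameter, and the sample sentence vectors (taken from the lifting example with coefficient 10) are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(objects, k, max_iter=100, seed=0):
    """Return (assignment, centers) for a list of numeric vectors."""
    rng = random.Random(seed)
    # Randomly select K objects as the initial clustering centers.
    centers = [list(obj) for obj in rng.sample(objects, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to its nearest clustering center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(obj, centers[j]))
            for obj in objects
        ]
        if new_assignment == assignment:
            break  # termination: no object was reassigned
        assignment = new_assignment
        # Recompute each center as the mean of the objects assigned to it.
        for j in range(k):
            members = [obj for obj, a in zip(objects, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


sentence_vectors = [
    [0, 1.45, 0.3465, 0],
    [0, 1.45, 1.386, 0],
    [0, 1.45, 2.079, 0],
    [0, 20.79, 0.3465, 0],
]
assignment, centers = k_means(sentence_vectors, k=2)
print(assignment)
```

With the lifted entity elements, the three "product A" vectors fall into one cluster and the "product B" vector into the other, whichever objects are drawn as initial centers.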

For example, if an enterprise has developed 10 products in total, 10 groups can be constructed and the corpus information received by the enterprise clustered into them. Because the word vectors corresponding to the named entity words in the sentence vectors are adjusted in this application, the enterprise can directly learn which product each piece of corpus information is about, which helps the enterprise promptly understand users' market feedback on each product.
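Step S53 can be sketched as follows. Because a k-means center is a mean and need not coincide with any object, this sketch assumes that "the object corresponding to the clustering center" is the member nearest to the center; that reading, the function name `label_clusters`, and the hard-coded assignment and centers are illustrative assumptions.

```python
def label_clusters(sentence_vectors, entity_words, assignment, centers):
    """Give every object the named entity word of its cluster's central object.

    entity_words[i] is the named entity word extracted from corpus item i.
    """
    cluster_labels = {}
    for j, center in enumerate(centers):
        members = [i for i, a in enumerate(assignment) if a == j]
        if not members:
            continue
        # Assumed reading: the central object is the member nearest the center.
        central = min(
            members,
            key=lambda i: sum((x - c) ** 2 for x, c in zip(sentence_vectors[i], center)),
        )
        cluster_labels[j] = entity_words[central]
    return [cluster_labels[a] for a in assignment]


sentence_vectors = [
    [0, 1.45, 0.3465, 0],
    [0, 1.45, 1.386, 0],
    [0, 1.45, 2.079, 0],
    [0, 20.79, 0.3465, 0],
]
entity_words = ["product A", "product A", "product A", "product B"]
assignment = [0, 0, 0, 1]                               # example clustering result
centers = [[0, 1.45, 1.2705, 0], [0, 20.79, 0.3465, 0]]  # means of each cluster

categories = label_clusters(sentence_vectors, entity_words, assignment, centers)
print(categories)  # ['product A', 'product A', 'product A', 'product B']
```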

In a preferred embodiment, when the clustering model is a density clustering model, the clustering operation performed on the sentence vectors received by the clustering model through the clustering model to classify the corpus information corresponding to the sentence vectors includes:

S54: constructing clustering points representing the corpus information in the clustering model according to the sentence vectors, and operating the clustering model to perform density clustering operation on the clustering points to obtain at least one cluster and a clustering center thereof; wherein the cluster is a set of at least one of the cluster points;

S55: extracting the central corpus information of the clustering points corresponding to the clustering centers, extracting named entity words of the central corpus information, and using the named entity words as category information of the corpus information corresponding to all clustering points of the cluster where the clustering centers are located, so as to realize the classification of the corpus information corresponding to the sentence vectors.

Specifically, the density clustering operation is to define clusters as the maximum set of densely connected clustering points, and to divide an area having a sufficiently high density into clusters, with the aim of finding the maximum set of densely connected objects.

Exemplarily, assume that the radius ε = 3 and MinPts = 3, and that the ε-neighborhood of cluster point p contains the cluster points {m, p, p1, p2, o}, the ε-neighborhood of cluster point m contains {m, q, p, m1, m2}, the ε-neighborhood of cluster point q contains {q, m}, the ε-neighborhood of cluster point o contains {o, p, s}, and the ε-neighborhood of cluster point s contains {o, s, s1}. The core objects are then p, m, o, and s (q is not a core object because the number of cluster points in its ε-neighborhood is 2, which is less than MinPts = 3).

Cluster point m is directly density-reachable from cluster point p, because m is in the ε-neighborhood of p and p is a core object. Cluster point q is density-reachable from cluster point p, because q is directly density-reachable from m and m is directly density-reachable from p. Cluster point q is density-connected to cluster point s, because both q and s are density-reachable from cluster point p.

Note the following definitions.

ε-neighborhood: the region within radius ε of a given object is called the ε-neighborhood of that object.

Core object: if the number of sample cluster points in the ε-neighborhood of a given object is greater than or equal to MinPts, the object is called a core object.

Directly density-reachable: for a sample set D, if sample cluster point q is within the ε-neighborhood of p and p is a core object, then object q is directly density-reachable from object p.

Density-reachable: for a sample set D, given a chain of sample cluster points p1, p2, …, pn with p = p1 and q = pn, object q is density-reachable from object p if each object pi is directly density-reachable from pi-1.

Density-connected: if there is a cluster point o in the sample set D such that object p and object q are both density-reachable from o, then p and q are density-connected.
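The definitions can be checked against the example neighborhoods with a small sketch. The neighborhoods are entered directly as sets rather than computed from coordinates, and points whose neighborhoods the example does not list (p1, p2, m1, m2, s1) are treated as non-core; both choices, and the function names, are assumptions made for illustration.

```python
from collections import deque

MIN_PTS = 3

# Neighborhoods of the cluster points given in the example (radius 3).
neighborhoods = {
    "p": {"m", "p", "p1", "p2", "o"},
    "m": {"m", "q", "p", "m1", "m2"},
    "q": {"q", "m"},
    "o": {"o", "p", "s"},
    "s": {"o", "s", "s1"},
}


def is_core(point):
    """A core object has at least MIN_PTS points in its neighborhood."""
    return len(neighborhoods.get(point, ())) >= MIN_PTS


def density_reachable(source, target):
    """True if a chain of directly density-reachable steps leads source -> target."""
    seen, queue = {source}, deque([source])
    while queue:
        current = queue.popleft()
        if not is_core(current):
            continue  # only core objects can extend the chain
        for neighbor in neighborhoods[current]:
            if neighbor == target:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def density_connected(a, b):
    """True if some object o reaches both a and b by density."""
    return any(
        density_reachable(o, a) and density_reachable(o, b) for o in neighborhoods
    )


core_objects = sorted(point for point in neighborhoods if is_core(point))
print(core_objects)                  # ['m', 'o', 'p', 's']
print(density_reachable("p", "q"))   # True
print(density_reachable("q", "p"))   # False
print(density_connected("q", "s"))   # True
```

The asymmetric result for p and q shows that density-reachability is not symmetric when one endpoint is not a core object, which is why clusters are defined through density-connectedness instead.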

Therefore, the attention degree of the user to each product can be identified according to the corpus information through the density clustering algorithm, and the product which is most concerned by the user is obtained.

Example three:

referring to fig. 4, a corpus classifying device 1 based on a clustering model according to the present embodiment includes:

the information input module 12 is configured to obtain at least one corpus information, where the corpus information has at least one named entity information;

the entity identification module 13 is configured to perform named entity identification on the corpus information to obtain a named entity word corresponding to the named entity information;

a vector conversion adjustment module 14, configured to perform text vectorization on the corpus information to obtain a corpus vector, adjust a named entity vector corresponding to the named entity word in the corpus vector, or adjust other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;

and the corpus classification module 15 is configured to input the sentence vectors into a preset clustering model, and perform a clustering operation on the input sentence vectors through the clustering model, so as to classify the corpus information corresponding to the input sentence vectors.

Optionally, the corpus classifying device 1 based on the clustering model further includes:

the classification processing module 11 is configured to receive information to be classified, and determine an information type of the information to be classified; if the information type is voice information, converting the information to be classified into conversion information of which the information type is text information, and storing the conversion information serving as corpus information into a preset corpus; and if the information type is text information, the information to be classified is used as corpus information and is stored in the corpus.

Optionally, the information input module 12 includes:

an information receiving unit 121, configured to receive user information sent by a user; wherein, the user information refers to the identity information of the corpus information sender;

a corpus obtaining unit 122, configured to obtain at least one corpus information corresponding to the user information from a preset corpus; the corpus is used for storing corpus information associated with user information.

Optionally, the entity identifying module 13 includes:

a corpus word segmentation unit 131, configured to obtain a corpus text corresponding to the corpus information, and perform word segmentation on the corpus text to obtain at least one corpus word;

an entity recognizing unit 132, configured to compare the corpus words with a preset named entity dictionary, and set the corpus words belonging to the named entity dictionary as the named entity words; wherein the named entity dictionary has at least one named entity word therein.

Optionally, the vector conversion adjusting module 14 further includes:

the weight adjusting unit 141 is configured to perform word frequency inverse document calculation on the corpus vectors to obtain word frequency inverse document values reflecting the importance degrees of the corpus words in the corpus information, and adjust the corpus vectors to obtain semantic vectors by using the word frequency inverse document values as weights of the word vectors corresponding to the corpus words;

an entity lifting unit 142, configured to modify, through a preset lifting coefficient, a word vector corresponding to the named entity word in the semantic vector to obtain an entity vector, so that the semantic vector is converted into the sentence vector;

the entity reducing unit 143 is configured to set the word vector corresponding to the named entity word as an entity vector, and modify, through a preset reduction coefficient, the word vectors except the entity vector in the semantic vector, so that the semantic vector is converted into the sentence vector.

Optionally, the corpus classifying module 15 further includes:

an object group unit 151, configured to construct an object representing the corpus information according to the sentence vector in the clustering model, and divide at least one group in the clustering model; wherein the group characterizes the class to which the object belongs;

a cluster construction unit 152, configured to run the clustering model to perform k-means clustering operation on the objects in each group, so as to obtain clusters and clustering centers of each group; wherein the cluster is a set constructed by at least one object belonging to the group;

the mean classification unit 153 is configured to extract the center corpus information of the object corresponding to the clustering center, extract the named entity words of the center corpus information, and use the named entity words as the category information of the corpus information corresponding to all the objects in the cluster where the clustering center is located, so as to implement classification of the corpus information corresponding to the sentence vectors.

Optionally, the corpus classifying module 15 further includes:

a clustering operation unit 154, configured to construct a clustering point representing the corpus information in the clustering model according to the sentence vector, and operate the clustering model to perform density clustering operation on the clustering point to obtain at least one cluster and a clustering center thereof; wherein the cluster is a set of at least one of the cluster points;

and the density classification unit 155 is configured to extract the center corpus information of the clustering point corresponding to the clustering center, extract the named entity word of the center corpus information, and use the named entity word as category information of the corpus information corresponding to all clustering points of the cluster where the clustering center is located, so as to classify the corpus information corresponding to the sentence vector.

This technical scheme is applied to the field of artificial intelligence intelligent decision making. Corpus information is obtained, and entity recognition is performed on it to obtain named entity words reflecting the objects the corpus information describes. Text vectorization processing is performed on the corpus information to obtain corpus vectors, and the entity vectors corresponding to the named entity words in the corpus vectors are adjusted to obtain sentence vectors. The sentence vectors are input into a preset clustering model, and a clustering operation is performed on the received sentence vectors through the clustering model to classify the corpus information corresponding to the sentence vectors, so that the technical effect of using the clustering model as a classification model for the corpus information is achieved.

Example four:

In order to achieve the above object, the present invention further provides a computer device 5. Components of the corpus classifying device based on the clustering model according to the third embodiment may be distributed in different computer devices, and the computer device 5 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack-mounted server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple application servers) that executes a program, or the like. The computer device of this embodiment includes at least, but is not limited to, a memory 51 and a processor 52, which may be communicatively coupled to each other via a system bus, as shown in FIG. 5. It should be noted that FIG. 5 only shows a computer device with some components; it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead.

In this embodiment, the memory 51 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 51 may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory 51 may be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device. Of course, the memory 51 may also include both internal and external storage devices of the computer device. In this embodiment, the memory 51 is generally used to store an operating system and various types of application software installed on the computer device, for example, the program code of the corpus classifying device based on the clustering model in the third embodiment. Further, the memory 51 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 52 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 52 is typically used to control the overall operation of the computer device. In this embodiment, the processor 52 is configured to run a program code stored in the memory 51 or process data, for example, run a corpus classifying device based on a clustering model, so as to implement the corpus classifying method based on a clustering model in the first embodiment and the second embodiment.

Example five:

to achieve the above objects, the present invention also provides a computer readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor 52, implements corresponding functions. The computer-readable storage medium of this embodiment is used for storing a computer program for implementing the clustering model-based corpus classifying method, and when being executed by the processor 52, implements the clustering model-based corpus classifying method of the first embodiment and the second embodiment.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A corpus classification method based on a clustering model is characterized by comprising the following steps:

performing text vectorization processing on the corpus information to obtain a corpus vector; adjusting a named entity vector corresponding to the named entity words in the corpus vector; or

Adjusting other word and phrase vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector;

2. The corpus classifying method according to claim 1, wherein before the obtaining of the at least one corpus information, the method further comprises:

3. The corpus classifying method according to claim 1, wherein the obtaining at least one corpus information includes:

receiving user information sent by a user side, wherein the user information refers to identity information of a corpus information sender;

and acquiring at least one corpus information corresponding to the user information from a preset corpus, wherein the corpus is used for storing the corpus information related to the user information.

4. The corpus classifying method according to claim 1, wherein the named entity recognizing the corpus information to obtain a named entity word corresponding to the named entity information comprises:

and comparing the corpus words with a preset named entity dictionary, and setting the corpus words belonging to the named entity dictionary as the named entity words, wherein the named entity dictionary is provided with at least one named entity.

5. The corpus classification method according to claim 1, wherein said adjusting a named entity vector corresponding to the named entity word in the corpus vector or adjusting other word vectors except the named entity vector in the corpus vector to obtain a sentence vector of the corpus vector comprises:

modifying the word vector corresponding to the named entity word in the semantic vector through a preset lifting coefficient to obtain an entity vector, and converting the semantic vector into the sentence vector; or

And setting the word vector corresponding to the named entity word as an entity vector, and modifying other word vectors except the entity vector in the semantic vector through a preset reduction coefficient to convert the semantic vector into the sentence vector.

6. The corpus classification method according to claim 1, wherein when the clustering model is a K-MEANS clustering model, the clustering operation is performed on the entered sentence vectors through the clustering model to classify the corpus information corresponding to the entered sentence vectors, including:

and extracting the central corpus information of the object corresponding to the clustering center, extracting named entity words of the central corpus information, and using the named entity words as the category information of the corpus information corresponding to all the objects of the cluster where the clustering center is located so as to realize the classification of the corpus information corresponding to the sentence vectors.

7. The corpus classifying method according to claim 1, wherein when the clustering model is a density clustering model, the clustering operation is performed on the entered sentence vectors through the clustering model to classify the corpus information corresponding to the entered sentence vectors, including:

and extracting the central corpus information of the clustering points corresponding to the clustering centers, extracting named entity words of the central corpus information, and using the named entity words as category information of the corpus information corresponding to all clustering points of the cluster where the clustering centers are located so as to realize the classification of the corpus information corresponding to the sentence vectors.

8. A corpus classification device based on a clustering model, characterized by comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor of the computer device implements the steps of the clustering model based corpus classification method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program stored in the computer-readable storage medium, when being executed by a processor, implements the steps of the clustering model based corpus classification method according to any one of claims 1 to 7.