
CN110807097A - Method and device for analyzing data

Info

Publication number: CN110807097A
Application number: CN201810876412.XA
Authority: CN
Inventors: 李超
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2018-08-03
Filing date: 2018-08-03
Publication date: 2020-02-18

Abstract

The invention discloses a method and a device for analyzing data, and relates to the technical field of computers. One embodiment of the method comprises: performing word vector processing on analyzed data and data to be analyzed to obtain a word vector set; clustering the word vectors to be analyzed and the analyzed word vectors to obtain a clustering data set; and analyzing the data to be analyzed based on the clustering data set and the analysis model corresponding to the analyzed data to obtain an analysis result. This embodiment avoids a large amount of labeling, improves analysis efficiency, and reduces analysis cost.

Description

Method and device for analyzing data

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for analyzing data.

Background

Data analysis refers to the process of analyzing a large amount of collected data by using an appropriate statistical analysis method, extracting useful information and forming a conclusion to study and summarize the data in detail. Data analysis can help people make decisions in order to take appropriate action.

In the prior art, when data is analyzed, the data needs to be labeled first, and then the data needs to be analyzed based on some analysis models. For example, for the customer service dialogue data (i.e. the chat content between the customer service and the customer), when the customer service dialogue emotion model is used for analysis, the customer service dialogue data is manually labeled, that is, the chat sentences (characters, words or sentences) between the customer service and the customer are manually labeled with emotion labels, and then the labeled customer service dialogue data is input into the customer service dialogue emotion model for analysis.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

For data from different application scenarios, or for new data, the prior art requires a large amount of labeling before the data can be analyzed, so analysis efficiency is low and cost is high.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for analyzing data, which can avoid a large amount of labeling, improve analysis efficiency, and reduce analysis cost.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of analyzing data.

The method for analyzing data comprises the following steps: performing word vector processing on the analyzed data and the data to be analyzed to obtain a word vector set; the word vector set comprises a word vector to be analyzed and an analyzed word vector; clustering the word vector to be analyzed and the analyzed word vector to obtain a clustering data set; and analyzing the data to be analyzed based on the clustering data set and the analysis model corresponding to the analyzed data to obtain an analysis result.

Optionally, performing word vector processing on the analyzed data and the data to be analyzed to obtain a word vector set includes: constructing a word vector model of the analyzed data and the data to be analyzed by using a Word2vec algorithm model to obtain the word vector set.

Optionally, clustering the word vector to be analyzed and the analyzed word vector to obtain a clustered data set includes: clustering the word vectors to be analyzed and the analyzed word vectors by using a partition clustering algorithm to obtain word vector clusters; calculating the clustering centroid of the word vector cluster to obtain the cluster category of the word vector cluster; and establishing a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain a clustering data set.

Optionally, clustering the word vector to be analyzed and the analyzed word vector to obtain a clustering data set includes: calculating the edit distance between the word vector to be analyzed and the analyzed word vector; classifying the word vector to be analyzed and the analyzed word vector by using a K-nearest neighbor algorithm based on the edit distance to obtain word vector clusters; calculating the cluster centroid of each word vector cluster to obtain the clustering category of the word vector cluster; and establishing a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain the clustering data set.

Optionally, analyzing the data to be analyzed based on the clustering data set and the analysis model corresponding to the analyzed data includes: extracting the word vector to be analyzed and the clustering category corresponding to the word vector to be analyzed from the clustering data set, and inputting the extracted word vector to be analyzed and the clustering category into the analysis model corresponding to the analyzed data so as to analyze the data to be analyzed.

Optionally, before analyzing the data to be analyzed based on the clustered data set and the analysis model corresponding to the analyzed data, the method further includes: extracting the analyzed word vector corresponding to the analyzed data from the word vector set; extracting the cluster category corresponding to the analyzed word vector from the cluster data set; and performing optimization training on the analysis model by using the analyzed word vectors and the cluster categories corresponding to the analyzed word vectors.

To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an apparatus for analyzing data.

An apparatus for analyzing data according to an embodiment of the present invention includes: the processing module is used for carrying out word vector processing on the analyzed data and the data to be analyzed to obtain a word vector set; the word vector set comprises a word vector to be analyzed and an analyzed word vector; the clustering module is used for clustering the word vector to be analyzed and the analyzed word vector to obtain a clustering data set; and the analysis module is used for analyzing the data to be analyzed based on the clustering data set and the analysis model corresponding to the analyzed data to obtain an analysis result.

Optionally, the processing module is further configured to: construct a word vector model of the analyzed data and the data to be analyzed by using a Word2vec algorithm model to obtain the word vector set.

Optionally, the clustering module is further configured to: clustering the word vectors to be analyzed and the analyzed word vectors by using a partition clustering algorithm to obtain word vector clusters; calculating the clustering centroid of the word vector cluster to obtain the cluster category of the word vector cluster; and establishing a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain a clustering data set.

Optionally, the clustering module is further configured to: calculate the edit distance between the word vector to be analyzed and the analyzed word vector; classify the word vector to be analyzed and the analyzed word vector by using a K-nearest neighbor algorithm based on the edit distance to obtain word vector clusters; calculate the cluster centroid of each word vector cluster to obtain the clustering category of the word vector cluster; and establish a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain the clustering data set.

Optionally, the analysis module is further configured to: extract the word vector to be analyzed and the clustering category corresponding to the word vector to be analyzed from the clustering data set, and input the extracted word vector to be analyzed and the clustering category into the analysis model corresponding to the analyzed data so as to analyze the data to be analyzed.

Optionally, the apparatus further comprises a training module configured to: extracting the analyzed word vector corresponding to the analyzed data from the word vector set; extracting the cluster category corresponding to the analyzed word vector from the cluster data set; and performing optimization training on the analysis model by using the analyzed word vectors and the cluster categories corresponding to the analyzed word vectors.

To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided an electronic device for analyzing data.

An electronic device for analyzing data according to an embodiment of the present invention includes: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method of analyzing data of an embodiment of the present invention.

To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable storage medium.

A computer-readable storage medium of an embodiment of the present invention has stored thereon a computer program that, when executed by a processor, implements a method of analyzing data of an embodiment of the present invention.

One embodiment of the above invention has the following advantages or benefits: word vector processing is performed on the analyzed data and the data to be analyzed to obtain a word vector set; the word vectors to be analyzed and the analyzed word vectors are clustered to obtain a clustering data set; and, based on the similarity of the text structures of the analyzed data and the data to be analyzed, the trained analysis model is migrated to analyze the data to be analyzed. This solves the technical problems that a large amount of labeling is required before data can be analyzed, that analysis efficiency is low, and that cost is high, and achieves the technical effects of avoiding a large amount of labeling, improving analysis efficiency, and reducing analysis cost.

Further effects of the above optional implementations will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main steps of a method of analyzing data according to an embodiment of the invention;

FIG. 2 is a schematic diagram of an implementation framework of a method of analyzing data according to one referenceable embodiment of the invention;

FIG. 3 is a schematic diagram of the main blocks of an apparatus for analyzing data according to an embodiment of the present invention;

FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments of the present invention and the technical features of the embodiments may be combined with each other without conflict.

In the prior art, data must be labeled before it is analyzed with a given model, and similar data from a different application scenario must be labeled again. For example, to analyze the emotions of a customer and a customer service agent based on their chat data (i.e., the chat content), the words or sentences in the chat content must be labeled with emotion labels such as positive, neutral, or negative. Because a large amount of labeling has to be carried out on data from different application scenarios, or on new data, before the data can be analyzed, a large amount of computing or human resources is consumed, so analysis efficiency is low and cost is high.

Therefore, the embodiment of the present invention provides a method for analyzing data, which fuses analyzed data and data to be analyzed based on the text structure similarity between the analyzed data and the data to be analyzed, so as to achieve the consistency of the analyzed data and the data to be analyzed on the data structure, and further analyzes the data to be analyzed by using an analysis model used for training and analyzing the analyzed data, so as to obtain an analysis result. The trained analysis model is migrated to analyze the data to be analyzed, so that a large amount of labels can be avoided, the analysis efficiency is improved, and the analysis cost is reduced.

FIG. 1 is a schematic diagram of the main steps of a method of analyzing data according to an embodiment of the present invention.

As shown in fig. 1, the method for analyzing data according to the embodiment of the present invention mainly includes the following steps:

Step S101: performing word vector processing on the analyzed data and the data to be analyzed to obtain a word vector set.

In order to fuse the analyzed data and the data to be analyzed, word vector processing is first performed on both; the resulting word vector set comprises word vectors to be analyzed and analyzed word vectors. Word embedding is a general term for a set of language modeling and feature learning techniques in natural language processing (NLP), in which words or phrases from a vocabulary are mapped to vectors of real numbers. Performing word vector processing therefore means mapping the analyzed data and the data to be analyzed to vectors of real numbers, yielding the analyzed word vectors and the word vectors to be analyzed.

The analyzed data and the data to be analyzed in the embodiment of the invention can be chat data between customers and customer service, forum posts, blog information, user comment data, and the like, and the analyzed data and the data to be analyzed have the same or similar characteristics.

In the embodiment of the present invention, step S101 may be implemented by: constructing a word vector model of the analyzed data and the data to be analyzed by using a Word2vec algorithm model to obtain the word vector set.

The Word2vec algorithm model learns semantic knowledge from a large text corpus in an unsupervised manner. By learning from text, it represents the semantic information of words as word vectors in an embedding space, such that words with similar semantics lie close to each other in that space. The core of the Word2vec algorithm model is a neural network model with two network structures, the continuous bag-of-words model (CBOW) and the skip-gram model (Skip-Gram); both map words into the same coordinate system and output numerical vectors. CBOW predicts the current word from the surrounding words: its input is the word vectors of the surrounding words and its output is the word vector of the current word. Skip-Gram predicts the surrounding words from the current word: its input is the word vector of the current word and its output is the word vectors of the surrounding words. Using Skip-Gram, the analyzed data and the data to be analyzed can be mapped into analyzed word vectors and word vectors to be analyzed; similar word vectors to be analyzed and analyzed word vectors lie closer together in the word vector model, and the word vector set is thereby obtained.
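The difference between the two network structures can be illustrated by the training examples each one is given. The following is a minimal sketch only, not the implementation of the embodiment: the function names `skipgram_pairs` and `cbow_pairs`, the `window` parameter, and the toy token list are assumptions introduced for illustration, and the neural network training itself is omitted.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Build (current_word, surrounding_word) pairs: Skip-Gram predicts context from the current word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens: List[str], window: int = 1) -> List[Tuple[List[str], str]]:
    """Build (surrounding_words, current_word) pairs: CBOW predicts the current word from context."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            pairs.append((context, center))
    return pairs


# Hypothetical tokenized sentence drawn from either data source.
tokens = ["merchant", "quality", "good"]
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```

In practice both the analyzed data and the data to be analyzed would be tokenized and fed into the same model, so that their word vectors share one coordinate system.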

Step S102: clustering the word vector to be analyzed and the analyzed word vector to obtain a clustering data set.

In order to make the analyzed data and the data to be analyzed consistent in data structure, all word vectors to be analyzed corresponding to the data to be analyzed and all analyzed word vectors corresponding to the analyzed data are clustered together. Word vectors to be analyzed and analyzed word vectors of the same category are placed in the same clustering data set, which determines the clustering data set to which each word vector belongs. Each clustering data set corresponds to one category, so clustering determines which analyzed data and which data to be analyzed a given category includes.

In the embodiment of the present invention, the clustering data set may be obtained by a partition clustering algorithm, that is, step S102 may be implemented by: clustering the word vectors to be analyzed and the analyzed word vectors by using a partition clustering algorithm to obtain word vector clusters; calculating the clustering centroid of the word vector cluster to obtain the clustering category of the word vector cluster; and establishing a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain a clustering data set.

Partition clustering algorithms include the K-means algorithm (K-Means) and the K-medoids algorithm (K-MEDOIDS), among others. A partition clustering algorithm divides the word vectors to be analyzed and the analyzed word vectors into K groups (K being less than the total number of word vectors to be analyzed and analyzed word vectors), each group representing a word vector cluster, and changes the grouping through repeated iteration (i.e., repeated clustering) until the word vector clusters are relatively stable (i.e., the cluster centroids no longer change) or the number of iterations reaches a preset value. After the word vector clusters are obtained through the partition clustering algorithm, the cluster centroid of each word vector cluster is calculated; the cluster centroid represents the clustering category. A mapping relation between the word vectors to be analyzed and the analyzed word vectors on the one hand and the clustering categories on the other is then established, so that the clustering data set is obtained.
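The iteration described above can be sketched with a small K-means routine. This is an illustrative sketch under stated assumptions, not the embodiment's implementation: the two-dimensional toy vectors, the deterministic initialization from the first K vectors, and the names `kmeans` and `cluster_data_set` are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: Sequence[Vector]) -> Vector:
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def kmeans(vectors: Sequence[Vector], k: int, max_iter: int = 100) -> Tuple[List[Vector], List[int]]:
    """Partition vectors into k clusters; stop when centroids are stable or max_iter is reached."""
    centroids = list(vectors[:k])  # assumption: initialize from the first k vectors
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:  # centroids no longer change
            break
        centroids = new_centroids
    return centroids, assignment


# Hypothetical word vectors: analyzed data first, then data to be analyzed.
analyzed = {"good quality": (0.9, 0.1), "very good": (0.1, 0.9)}
to_be_analyzed = {"high quality": (1.0, 0.0), "first class": (0.0, 1.0)}

words = list(analyzed) + list(to_be_analyzed)
vectors = list(analyzed.values()) + list(to_be_analyzed.values())
centroids, assignment = kmeans(vectors, k=2)

# Mapping relation between each word vector and its clustering category.
cluster_data_set: Dict[str, int] = dict(zip(words, assignment))
print(cluster_data_set)
```

Here the clustering category is represented by the index of the cluster centroid; analyzed and to-be-analyzed entries that land in the same cluster share a category.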

In addition, the clustering data set can also be obtained by a K-nearest neighbor algorithm, that is, step S102 can also be implemented by: calculating the edit distance between the word vector to be analyzed and the analyzed word vector; classifying the word vector to be analyzed and the analyzed word vector by using a K-nearest neighbor algorithm based on the edit distance to obtain word vector clusters; calculating the cluster centroid of each word vector cluster to obtain the clustering category of the word vector cluster; and establishing a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain the clustering data set.

The edit distance between two character strings is the minimum number of character operations (such as insertions, deletions, and substitutions) required to convert one string into the other. The K-nearest neighbor algorithm is an instance-based classification method: it finds the K training samples closest to an unknown sample x and assigns x to the class to which most of those K samples belong. After the word vector clusters are obtained through the edit distance and the K-nearest neighbor algorithm, the cluster centroid of each word vector cluster is calculated; the cluster centroid represents the clustering category. A mapping relation between the word vectors to be analyzed and the analyzed word vectors on the one hand and the clustering categories on the other is then established, so that the clustering data set is obtained.
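The two building blocks of this alternative can be sketched as follows. The sketch makes several assumptions that are not stated in the source: the edit distance is computed on the underlying word strings (the usual domain of edit distance) rather than on numeric vectors, the operations counted are insertion, deletion, and substitution (Levenshtein distance), and the labelled samples and category names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def knn_classify(sample: str, labelled: List[Tuple[str, str]], k: int = 3) -> str:
    """Assign sample to the class held by most of its k nearest labelled neighbours."""
    nearest = sorted(labelled, key=lambda item: edit_distance(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical analyzed entries with hypothetical category names.
labelled = [
    ("good quality", "quality"),
    ("bad quality", "quality"),
    ("very good", "praise"),
    ("very nice", "praise"),
]
print(edit_distance("kitten", "sitting"))
print(knn_classify("high quality", labelled, k=3))
```

A to-be-analyzed entry is thus placed in the same group as the analyzed entries it most resembles, after which centroids and the category mapping can be computed as in the partition clustering case.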

Step S103: analyzing the data to be analyzed based on the clustering data set and the analysis model corresponding to the analyzed data to obtain an analysis result.

The data to be analyzed is classified according to the category of the analyzed data, and the analyzed data and the data to be analyzed have consistency on the data structure, so that the data to be analyzed can be analyzed by using the analysis model which is trained and used for analyzing the analyzed data, thereby avoiding a large amount of labels, improving the analysis efficiency and reducing the analysis cost.

In the embodiment of the present invention, when analyzing the data to be analyzed, the word vector to be analyzed corresponding to the data to be analyzed and the cluster category corresponding to the word vector to be analyzed may be extracted from the cluster data set obtained in step S102, and then the word vector to be analyzed and the cluster category corresponding to the word vector to be analyzed are input into the analysis model used for training and analyzing the analyzed data, so as to analyze the data to be analyzed. That is, step S103 can be implemented by: and extracting the word vector to be analyzed and the cluster category corresponding to the word vector to be analyzed from the cluster data set, and inputting the extracted word vector to be analyzed and the cluster category into an analysis model corresponding to the analyzed data so as to analyze the data to be analyzed.
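The extraction and inference step can be sketched as below. This is only an illustration under assumptions: the `ClusterEntry` record, the `analyzed` flag, and `toy_model` (a stand-in for the trained analysis model that simply looks up a label by clustering category) are hypothetical and not taken from the source.

```python
from typing import Callable, Dict, NamedTuple, Tuple

Vector = Tuple[float, ...]


class ClusterEntry(NamedTuple):
    vector: Vector
    category: int   # clustering category (e.g., index of the cluster centroid)
    analyzed: bool  # True for analyzed data, False for data to be analyzed


def analyze(cluster_data_set: Dict[str, ClusterEntry],
            model: Callable[[Vector, int], str]) -> Dict[str, str]:
    """Extract the to-be-analyzed vectors and their categories, then feed them to the model."""
    results = {}
    for word, entry in cluster_data_set.items():
        if not entry.analyzed:
            results[word] = model(entry.vector, entry.category)
    return results


def toy_model(vector: Vector, category: int) -> str:
    """Hypothetical stand-in for the analysis model trained on the analyzed data."""
    return {0: "positive", 1: "negative"}.get(category, "neutral")


cluster_data_set = {
    "good quality": ClusterEntry((0.9, 0.1), 0, True),
    "high quality": ClusterEntry((1.0, 0.0), 0, False),
    "poor quality": ClusterEntry((0.1, 0.9), 1, False),
}
results = analyze(cluster_data_set, toy_model)
print(results)
```

Only entries belonging to the data to be analyzed are passed to the model; the analyzed entries remain in the clustering data set solely to anchor the categories.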

In order to make the analysis result of the analysis model more accurate and further improve the analysis effect, the analysis model may be optimally trained by using the analyzed data (i.e., the analyzed word vectors and the cluster categories corresponding to the analyzed word vectors) processed in steps S101 and S102. In the embodiment of the present invention, before executing step S103, the following steps may also be executed: extracting analyzed word vectors corresponding to the analyzed data from the word vector set; extracting the clustering category corresponding to the analyzed word vector from the clustering data set; and performing optimization training on the analysis model by using the analyzed word vectors and the clustering categories corresponding to the analyzed word vectors.
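A minimal sketch of this optional training step follows. It assumes, beyond what the source states, that the analyzed data carries emotion labels and that the model can be approximated by a majority label per clustering category; the name `train_category_model` and the sample data are hypothetical, and the word vectors are carried along but ignored by this simplified stand-in.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Vector = Tuple[float, ...]


def train_category_model(samples: Iterable[Tuple[Vector, int, str]]) -> Dict[int, str]:
    """Learn the majority label for each clustering category from analyzed samples.

    Each sample is (analyzed word vector, clustering category, label).
    """
    votes = defaultdict(Counter)
    for _vector, category, label in samples:
        votes[category][label] += 1
    return {category: counter.most_common(1)[0][0] for category, counter in votes.items()}


# Hypothetical analyzed word vectors with their clustering categories and labels.
analyzed_samples = [
    ((0.9, 0.1), 0, "positive"),
    ((0.8, 0.2), 0, "positive"),
    ((0.7, 0.3), 0, "neutral"),
    ((0.1, 0.9), 1, "negative"),
]
model = train_category_model(analyzed_samples)
print(model)
```

A real analysis model would typically also use the vectors themselves; the point of the sketch is only that training consumes the analyzed word vectors together with their clustering categories before step S103 runs.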

According to the method for analyzing data of the embodiment of the invention, word vector processing is performed on the analyzed data and the data to be analyzed to obtain a word vector set; the word vectors to be analyzed and the analyzed word vectors are clustered to obtain a clustering data set; and, based on the similarity of the text structures of the analyzed data and the data to be analyzed, the trained analysis model is migrated to analyze the data to be analyzed. This solves the technical problems that a large amount of labeling is required before data can be analyzed, that analysis efficiency is low, and that cost is high, and achieves the technical effects of avoiding a large amount of labeling, improving analysis efficiency, and reducing analysis cost.

FIG. 2 is a schematic diagram of an implementation framework of a method of analyzing data according to one referenceable embodiment of the present invention.

As shown in fig. 2, first, word vector processing is performed, and word vector processing may be performed on data to be analyzed and analyzed data, respectively, to obtain a word vector to be analyzed corresponding to the data to be analyzed, and an analyzed word vector corresponding to the analyzed data, so as to obtain a word vector set. Then, all the word vectors (i.e. all the word vectors to be analyzed and the word vectors already analyzed) in the word vector set are clustered, and the word vectors to be analyzed and the word vectors already analyzed of the same category are put into the same clustering data set. And finally, extracting the word vector to be analyzed corresponding to the data to be analyzed and the cluster category corresponding to the word vector to be analyzed from the cluster data set, and inputting the word vector to be analyzed and the corresponding cluster category into an analysis model corresponding to the analyzed data, so that the data to be analyzed is analyzed to obtain an analysis result.

It should be noted that, if the analyzed word vectors in the clustering data set and the corresponding clustering categories are used to perform optimization training on the analysis model, the analysis result of the analysis model can be more accurate, and the analysis effect is further improved.

In order to further explain the technical idea of the embodiment of the present invention, the technical solution of the present invention is now described with reference to specific application scenarios.

Suppose that the comment content of the user is data to be analyzed, for example, the comment content of the user includes "merchant a, high quality, first class"; suppose that the chat content of the customer and customer service is analyzed data, for example, the chat content includes "merchant B, good quality, very good".

The emotion of the user is now analyzed based on the user's comment content, using the analysis model that was used to analyze the chat content of the customer and customer service. Specifically, word vector processing is first performed on the user's comment content and on the chat content of the customer and customer service. Then all word vectors are clustered, and word vectors of the same category are put into the same cluster: for example, "merchant A" and "merchant B" both denote merchants and fall in the same cluster; "high quality" and "good quality" have similar meanings and fall in the same cluster; and "first class" and "very good" have similar meanings and fall in the same cluster. Finally, the emotion of the user is analyzed with the analysis model based on the word vectors in the clusters that correspond to the user's comment content. Word vectors in the same cluster take the cluster centroid of that cluster as their category, so "merchant A, high quality, first class" and "merchant B, good quality, very good" have the same categories, and the analysis model used to analyze the chat content of the customer and customer service can be used directly.

Fig. 3 is a schematic diagram of main blocks of an apparatus for analyzing data according to an embodiment of the present invention.

As shown in fig. 3, an apparatus 300 for analyzing data according to an embodiment of the present invention includes: a processing module 301, a clustering module 302, and an analysis module 303.

Wherein:

the processing module 301 is configured to perform word vector processing on the analyzed data and the data to be analyzed to obtain a word vector set; the word vector set comprises a word vector to be analyzed and an analyzed word vector;

a clustering module 302, configured to cluster the word vector to be analyzed and the analyzed word vector to obtain a clustering data set;

an analysis module 303, configured to analyze the data to be analyzed based on the clustered data set and an analysis model corresponding to the analyzed data, so as to obtain an analysis result.

In this embodiment of the present invention, the processing module 301 may further be configured to: construct a word vector model of the analyzed data and the data to be analyzed by using a Word2vec algorithm model to obtain the word vector set.

In this embodiment of the present invention, the clustering module 302 may further be configured to: clustering the word vectors to be analyzed and the analyzed word vectors by using a partition clustering algorithm to obtain word vector clusters; calculating the clustering centroid of the word vector cluster to obtain the cluster category of the word vector cluster; and establishing a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain a clustering data set.

In this embodiment of the present invention, the clustering module 302 may further be configured to: calculate the edit distance between the word vector to be analyzed and the analyzed word vector; classify the word vector to be analyzed and the analyzed word vector by using a K-nearest neighbor algorithm based on the edit distance to obtain word vector clusters; calculate the cluster centroid of each word vector cluster to obtain the clustering category of the word vector cluster; and establish a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain the clustering data set.

In this embodiment of the present invention, the analysis module 303 may further be configured to: extract the word vector to be analyzed and the clustering category corresponding to the word vector to be analyzed from the clustering data set, and input the extracted word vector to be analyzed and the clustering category into the analysis model corresponding to the analyzed data so as to analyze the data to be analyzed.

Furthermore, the apparatus 300 may further include a training module (not shown), which is configured to: extracting the analyzed word vector corresponding to the analyzed data from the word vector set; extracting the cluster category corresponding to the analyzed word vector from the cluster data set; and performing optimization training on the analysis model by using the analyzed word vectors and the cluster categories corresponding to the analyzed word vectors.

According to the device for analyzing data of the embodiment of the invention, word vector processing is performed on the analyzed data and the data to be analyzed to obtain a word vector set; the word vectors to be analyzed and the analyzed word vectors are clustered to obtain a clustering data set; and, based on the similarity of the text structures of the analyzed data and the data to be analyzed, the trained analysis model is migrated to analyze the data to be analyzed. This solves the technical problems that a large amount of labeling is required before data can be analyzed, that analysis efficiency is low, and that cost is high, and achieves the technical effects of avoiding a large amount of labeling, improving analysis efficiency, and reducing analysis cost.

Fig. 4 illustrates an exemplary system architecture 400 to which the method of analyzing data or the apparatus for analyzing data of embodiments of the present invention may be applied.

As shown in FIG. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various types of connections, such as wired or wireless communication links, or fiber optic cables, among others.

A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have various communication client applications installed thereon, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.

The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 405 may be a server that provides various services, such as a background management server that supports shopping websites browsed by users using the terminal devices 401, 402, and 403. The background management server may analyze and otherwise process received data, such as a product information query request, and feed back a processing result (e.g., target push information or product information) to the terminal device.

It should be noted that the method for analyzing data provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the apparatus for analyzing data is generally disposed in the server 405.

It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read from it can be installed into the storage section 508 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a processing module, a clustering module, and an analysis module. The names of these modules do not form a limitation on the module itself under certain circumstances, for example, the clustering module may also be described as a module that clusters the word vector to be analyzed and the analyzed word vector to obtain a clustered data set.

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: step S101: performing word vector processing on the analyzed data and the data to be analyzed to obtain a word vector set; step S102: clustering the word vector to be analyzed and the analyzed word vector to obtain a clustering data set; step S103: analyzing the data to be analyzed based on the clustering data set and the analysis model corresponding to the analyzed data to obtain an analysis result.
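Purely as an illustrative sketch, steps S101 to S103 may be strung together in Python as shown below. Several points are assumptions of this sketch rather than features of the embodiment: the embedding is supplied by the caller as a hypothetical `embed` function (a Word2vec model is one option), k-means is used as one instance of a partition clustering algorithm, and a majority vote over the labels of the analyzed words in each cluster stands in for the trained analysis model. The names `kmeans` and `run_pipeline` are likewise hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def kmeans(vectors: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Plain k-means; centroids start at the first k vectors (deterministic)."""
    centroids = list(vectors[:k])
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment


def run_pipeline(
    analyzed: Dict[str, str],        # analyzed word -> label from prior analysis
    to_analyze: List[str],           # words still to be analyzed
    embed: Callable[[str], Vector],  # hypothetical embedding function
    k: int,
) -> Dict[str, str]:
    # S101: build the word vector set over both data sources.
    words = list(analyzed) + [w for w in to_analyze if w not in analyzed]
    word_vector_set = {w: embed(w) for w in words}
    # S102: cluster all word vectors and map each word to its category.
    categories = kmeans([word_vector_set[w] for w in words], k)
    cluster_data_set = dict(zip(words, categories))
    # S103: stand-in analysis model, majority label per cluster category.
    votes: Dict[int, Counter] = {}
    for word, label in analyzed.items():
        votes.setdefault(cluster_data_set[word], Counter())[label] += 1
    return {
        w: votes[cluster_data_set[w]].most_common(1)[0][0]
        for w in to_analyze
        if cluster_data_set[w] in votes
    }
```

Words to be analyzed that fall into a cluster containing no analyzed word receive no result in this sketch; an actual analysis model would not necessarily have this limitation.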

According to the technical solution of the embodiment of the present invention, word vector processing is performed on the analyzed data and the data to be analyzed to obtain a word vector set; the word vector to be analyzed and the analyzed word vector are clustered to obtain a clustering data set; and, based on the similarity of the text structures of the analyzed data and the data to be analyzed, the trained analysis model is migrated to analyze the data to be analyzed. This solves the technical problem that a large amount of labeling must be performed before data can be analyzed, which makes analysis inefficient and costly, and achieves the technical effects of avoiding a large amount of labeling, improving analysis efficiency, and reducing analysis cost.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of analyzing data, comprising:

performing word vector processing on the analyzed data and the data to be analyzed to obtain a word vector set; the word vector set comprises a word vector to be analyzed and an analyzed word vector;

clustering the word vector to be analyzed and the analyzed word vector to obtain a clustering data set;

and analyzing the data to be analyzed based on the clustering data set and the analysis model corresponding to the analyzed data to obtain an analysis result.

2. The method of claim 1, wherein performing word vector processing on the analyzed data and the data to be analyzed to obtain a set of word vectors comprises:

constructing a word vector model of the analyzed data and the data to be analyzed by using a Word2vec algorithm model to obtain the word vector set.

3. The method of claim 1, wherein clustering the word vector to be analyzed and the analyzed word vector to obtain a clustered data set comprises:

clustering the word vectors to be analyzed and the analyzed word vectors by using a partition clustering algorithm to obtain word vector clusters;

calculating the clustering centroid of the word vector cluster to obtain the cluster category of the word vector cluster;

and establishing a mapping relation between the word vector to be analyzed and the analyzed word vector and the clustering category to obtain a clustering data set.

4. The method of claim 1, wherein clustering the word vector to be analyzed and the analyzed word vector to obtain a clustered data set comprises:

calculating the edit distance between the word vector to be analyzed and the analyzed word vector;

classifying the word vector to be analyzed and the analyzed word vector by using a K-nearest neighbor algorithm based on the edit distance to obtain word vector clusters;

5. The method of claim 3 or 4, wherein analyzing the data to be analyzed based on the clustered data set and an analysis model corresponding to the analyzed data comprises:

extracting the word vector to be analyzed and its corresponding clustering category from the clustering data set, and inputting the extracted word vector and clustering category into the analysis model corresponding to the analyzed data, so as to analyze the data to be analyzed.

6. The method of claim 1, wherein analyzing the data to be analyzed based on the clustered data set and the analysis model corresponding to the analyzed data further comprises:

extracting the analyzed word vector corresponding to the analyzed data from the word vector set;

extracting the cluster category corresponding to the analyzed word vector from the cluster data set;

and performing optimization training on the analysis model by using the analyzed word vectors and the cluster categories corresponding to the analyzed word vectors.

7. An apparatus for analyzing data, comprising:

the processing module is used for carrying out word vector processing on the analyzed data and the data to be analyzed to obtain a word vector set; the word vector set comprises a word vector to be analyzed and an analyzed word vector;

the clustering module is used for clustering the word vector to be analyzed and the analyzed word vector to obtain a clustering data set;

and the analysis module is used for analyzing the data to be analyzed based on the clustering data set and the analysis model corresponding to the analyzed data to obtain an analysis result.

8. The apparatus of claim 7, wherein the processing module is further configured to:

9. The apparatus of claim 7, wherein the clustering module is further configured to:

10. The apparatus of claim 7, wherein the clustering module is further configured to:

11. The apparatus of claim 9 or 10, wherein the analysis module is further configured to:

12. The apparatus of claim 7, further comprising a training module to:

13. An electronic device for analyzing data, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.

14. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.