CN108153738A

CN108153738A - Chat record analysis method and device based on hierarchical clustering

Info

Publication number: CN108153738A
Application number: CN201810137784.0A
Authority: CN
Inventors: 许振兴; 朱留锋; 荣强; 田淑宁
Original assignee: Lighthouse Financial Information Ltd
Current assignee: Lighthouse Financial Information Ltd
Priority date: 2018-02-10
Filing date: 2018-02-10
Publication date: 2018-06-12

Abstract

The present invention relates to the field of computer technology and provides a chat record analysis method and device based on hierarchical clustering. The method includes: obtaining chat records and related data, and preprocessing the chat records for the DBSCAN clustering algorithm; clustering the preprocessed data with the DBSCAN clustering algorithm; and, for the result data of the DBSCAN clustering, extracting keywords as hot words with the TF-IDF algorithm, counting the number of occurrences of each hot word in the data entries, and taking the hot word with the most occurrences as the label of the chat records. By combining the performance characteristics of the DBSCAN clustering algorithm and the TF-IDF algorithm, the invention attaches characteristic labels to existing unorganized chat records, so that subsequent processing steps can use the chat records in a simplified manner.

Description

Chat record analysis method and device based on hierarchical clustering

【Technical field】

The present invention relates to the field of computer technology, and in particular to a chat record analysis method and device based on hierarchical clustering.

【Background technology】

With the rapid development of mobile Internet technology, people have become increasingly accustomed to communicating online. This has produced massive amounts of text data (such as chat records or question-and-answer data), and mining and analyzing these data often yields very rich information. Text data mining has become one of the research hotspots in the information field and is of great value for customer service and corporate decision-making.

However, unlike structured data, text data is highly unstructured and also highly ambiguous, which poses challenges for concrete analysis work.

In view of this, overcoming the defects of the prior art is a problem to be solved urgently in this technical field.

【Summary of the invention】

The technical problem to be solved by the present invention is as follows: text data mining has become one of the research hotspots in the information field and is of great value for customer service and corporate decision-making; however, unlike structured data, text data is highly unstructured and also highly ambiguous, which makes concrete analysis work difficult.

The present invention adopts the following technical solution:

In a first aspect, the present invention provides a chat record analysis method based on hierarchical clustering, including:

Obtaining chat records and related data, and preprocessing the chat records for the DBSCAN clustering algorithm;

Clustering the preprocessed data with the DBSCAN clustering algorithm;

For the result data of the DBSCAN clustering, extracting keywords as hot words with the TF-IDF algorithm, counting the number of occurrences of each hot word in the data entries, and taking the hot word with the most occurrences as the label of the chat records.

Preferably, the chat records include one or more of: customer question records extracted from system logs, chat records between customers, chat records between customers and experts, and replies to articles published by customers. The related data includes a financial-domain vocabulary, a Chinese stop-word list, and pre-trained word vector data.

Preferably, preprocessing the chat records for the DBSCAN clustering algorithm includes:

Replacing stock names and stock codes in the question data uniformly with a specified identifier, and then performing one or more of the following operations on the text data: traditional-to-simplified Chinese conversion, case conversion, and stop-word removal;

Converting the text data into a vector representation composed of its terms.

Preferably, clustering the preprocessed data with the DBSCAN clustering algorithm includes:

Setting the minimum number of data items per cluster to: data count / a, where a takes a value in the interval [100, 300];

Setting the maximum distance to a center point to: average data distance / b, where the average data distance may be estimated by random sampling and b takes a value in the interval [0.1, 0.3].
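By way of illustration only, the two parameter settings above can be sketched in Python as follows. The function names, the default values a = 200 and b = 0.2, the Euclidean metric, and the number of sampled pairs are assumptions made for this sketch and are not taken from the original text.

```python
import math
import random


def estimate_average_distance(vectors, sample_pairs=1000, seed=0):
    """Estimate the mean pairwise Euclidean distance by random sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(sample_pairs):
        u, v = rng.sample(vectors, 2)  # two distinct data items
        total += math.dist(u, v)
    return total / sample_pairs


def dbscan_parameters(vectors, a=200, b=0.2):
    """Return (min_pts, eps) following the two rules stated above.

    a is assumed to lie in [100, 300] and b in [0.1, 0.3].
    """
    min_pts = max(1, len(vectors) // a)            # data count / a
    eps = estimate_average_distance(vectors) / b   # average distance / b
    return min_pts, eps
```

For 1000 data items and a = 200, this sketch yields a minimum cluster size of 5. Sampling pairs avoids computing all pairwise distances, which grows quadratically with the number of records.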

Preferably, extracting keywords as hot words with the TF-IDF algorithm specifically includes:

Calculating the importance of each term in the result data one by one by the formula tf_{i,j} = n_{i,j} / \sum_k n_{k,j}, where the numerator n_{i,j} is the number of occurrences of the word in the chat record, and the denominator is the sum of the occurrences of all words in the chat record;

Calculating the general importance of a word by the formula idf_i = \log( |D| / |\{j : t_i \in d_j\}| ), where |D| is the total number of chat records in the corpus and the denominator is the number of chat records that contain the word;

Calculating the combined importance of each word according to the formula tfidf_{i,j} = tf_{i,j} × idf_i, screening out the terms whose combined importance is below a preset threshold, and taking the remaining keywords as hot words.

Preferably, the method further includes:

Identifying one or more user identifiers contained in the chat records, and assigning the label obtained by analyzing the chat records to the hobby/specialty information field of the corresponding user identifier;

Pushing, according to the labels recorded in the hobby/specialty information field of the user identifier, information matching those labels to the smart terminal on which the user identifier is logged in.

Preferably, the method further includes:

Analyzing the accuracy of the information contained in the chat sentences or entries of each user identifier in the chat records, and updating, according to that accuracy, the expert grade points of the corresponding user identifier under the label of the chat records;

The expert grade points are used as follows: when the server receives an expert consultation request message sent by user A, the server selects, from the user identifiers it manages, at least one user identifier that has the chat record label most similar to user A's question and whose expert grade matches user A's request, and establishes a chat window between the at least one user identifier and user A.

Preferably, after user A has obtained the corresponding reply, the method further includes:

Giving a corresponding reward to the account of the at least one user identifier according to user A's rating; and adding, according to the ratings given by past questioners, a reputation grade dimension to the expert grade of each user identifier, so that a user sending a question request to the server can set the required expert grade and/or reputation grade to select experts within a specified range to reply.

Preferably, the information includes one or more of stock code, stock price, stock trend, and peripheral information about listed companies, and updating the expert grade points of the corresponding user identifier under the label of the chat records according to the information accuracy specifically includes:

For the stock code, stock price, and stock trend, matching them against the real stock information for the corresponding time; if the matching error is less than a preset threshold, increasing the expert grade points of the corresponding user identifier under the label of the chat records, and otherwise decreasing them; where the expert grade points correspond to the individual expert grades;

For peripheral information about listed companies, setting a preset verification time; if, within that verification time, real-world information matching the peripheral information is obtained through big data, increasing the expert grade points of the corresponding user identifier under the label of the chat records, and otherwise decreasing them.
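The two point-update rules above can be sketched as follows. This is only an illustrative sketch: the relative-error measure, the 2% error threshold, the 30-day verification window, the one-point step, and both function names are assumptions and are not specified in the original text.

```python
def update_for_price_claim(points, claimed_price, actual_price,
                           error_threshold=0.02, step=1):
    """Compare a claimed stock price with the real price for the same time.

    Points go up when the relative matching error is below the threshold
    and down otherwise.
    """
    error = abs(claimed_price - actual_price) / actual_price
    return points + step if error < error_threshold else points - step


def update_for_peripheral_claim(points, confirmed_in_days,
                                verification_days=30, step=1):
    """Update points for a claim about a listed company's surroundings.

    confirmed_in_days is None when no matching real-world information
    was found; otherwise it is the number of days until confirmation.
    """
    confirmed = (confirmed_in_days is not None
                 and confirmed_in_days <= verification_days)
    return points + step if confirmed else points - step
```

A claim that is confirmed only after the verification window has closed is treated here the same as an unconfirmed claim, which is one possible reading of the rule.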

In a second aspect, the present invention further provides a chat record analysis device based on hierarchical clustering, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are configured by a program to execute the chat record analysis method based on hierarchical clustering of the first aspect.

In a third aspect, the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that, when executed by one or more processors, carry out the chat record analysis method based on hierarchical clustering of the first aspect.

The present invention proposes a chat record analysis method based on hierarchical clustering. By combining the performance characteristics of the DBSCAN clustering algorithm and the TF-IDF algorithm, it attaches characteristic labels to existing unorganized chat records, so that subsequent processing steps can use the chat records in a simplified manner.

【Description of the drawings】

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. Evidently, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow diagram of a chat record analysis method based on hierarchical clustering provided by an embodiment of the present invention;

Fig. 2 is a flow diagram of the preprocessing in a chat record analysis method based on hierarchical clustering provided by an embodiment of the present invention;

Fig. 3 is a flow diagram of the TF-IDF algorithm processing in a chat record analysis method based on hierarchical clustering provided by an embodiment of the present invention;

Fig. 4 is a flow diagram of a first application scenario of a chat record analysis method based on hierarchical clustering provided by an embodiment of the present invention;

Fig. 5 is a flow diagram of a second application scenario of a chat record analysis method based on hierarchical clustering provided by an embodiment of the present invention;

Fig. 6 is a structural diagram of a chat record analysis device based on hierarchical clustering provided by an embodiment of the present invention.

【Detailed description of the embodiments】

To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the present invention and do not limit it.

In the description of the present invention, orientation or position terms such as "inner", "outer", "longitudinal", "transverse", "upper", "lower", "top", and "bottom" are based on the orientations or positions shown in the drawings. They are used only for convenience of description and do not require the present invention to be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention.

In addition, the technical features involved in the embodiments of the present invention described below can be combined with one another as long as they do not conflict.

Embodiment 1:

Embodiment 1 of the present invention provides a chat record analysis method based on hierarchical clustering which, as shown in Fig. 1, includes:

In step 201, chat records and related data are obtained, and the chat records are preprocessed for the DBSCAN clustering algorithm.

In this embodiment, the chat records include one or more of: customer question records extracted from system logs, chat records between customers, chat records between customers and experts, and replies to articles published by customers. The related data includes a financial-domain vocabulary, a Chinese stop-word list, and pre-trained word vector data. The chat records and related data can be obtained by means such as web-wide data crawling and the word2vec tool.

In step 202, the preprocessed data is clustered with the DBSCAN clustering algorithm.

In step 203, for the result data of the DBSCAN clustering, keywords are extracted as hot words with the TF-IDF algorithm, the number of occurrences of each hot word in the data entries is counted, and the hot word with the most occurrences is taken as the label of the chat records.

Preferably, in combination with the embodiment of the present invention, after the hot words are extracted they serve as the distinguishing keywords of the category; the occurrences of each chat record content in the category are counted, and the chat record content that occurs most often is taken as the representative content of the category.
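The counting in step 203 and the choice of representative content can be sketched as follows. This is a minimal sketch that assumes each cluster is available both as tokenized records and as raw record texts; the function names and data layout are hypothetical.

```python
from collections import Counter


def choose_label(cluster_records, hot_words):
    """Count hot-word occurrences over a cluster's tokenized records and
    return the most frequent hot word, or None if no hot word occurs."""
    counts = Counter(
        token
        for record in cluster_records
        for token in record
        if token in hot_words
    )
    if not counts:
        return None
    label, _ = counts.most_common(1)[0]
    return label


def representative_content(raw_records):
    """Return the chat record text that occurs most often in the cluster."""
    text, _ = Counter(raw_records).most_common(1)[0]
    return text
```

For example, with the tokenized records `[["open", "account", "fee"], ["account", "transfer"], ["fee", "account"]]` and the hot words `{"account", "fee"}`, the label chosen by this sketch is `"account"`, which occurs three times against two for `"fee"`.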

The embodiment of the present invention proposes a chat record analysis method based on hierarchical clustering. By combining the performance characteristics of the DBSCAN clustering algorithm and the TF-IDF algorithm, it attaches characteristic labels to existing unorganized chat records, so that subsequent processing steps can use the chat records in a simplified manner.

In combination with the embodiment of the present invention, a specific implementation is also provided for the preprocessing of the chat records for the DBSCAN clustering algorithm involved in step 201. As shown in Fig. 2, it includes:

In step 2011, stock names and stock codes in the question data are uniformly replaced with a specified identifier, and then one or more of the following operations are performed on the text data: traditional-to-simplified Chinese conversion, case conversion, and stop-word removal.

In step 2012, the text data is converted into a vector representation composed of its terms. Specifically, each word in the text data is represented by a word vector, and the word vectors are then summed to obtain the vector representation of one data item.
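Steps 2011 and 2012 can be sketched as follows. This is a simplified sketch under several assumptions that are not in the original text: stock codes are taken to be six-digit numbers, the placeholder `<stock>` and the function names are hypothetical, tokens are split on whitespace (real Chinese text would need a word segmenter), and traditional-to-simplified conversion is omitted because it requires an external conversion table.

```python
import re

STOCK_TOKEN = "<stock>"  # hypothetical placeholder identifier


def preprocess(text, stock_names, stop_words):
    """Normalize one chat record and return its list of tokens."""
    text = text.lower()                                      # case conversion
    text = re.sub(r"(?<!\d)\d{6}(?!\d)", STOCK_TOKEN, text)  # stock codes
    for name in stock_names:
        text = text.replace(name.lower(), STOCK_TOKEN)       # stock names
    return [t for t in text.split() if t not in stop_words]  # stop words


def record_vector(tokens, word_vectors, dim):
    """Sum the word vectors of the tokens to get one vector per record."""
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:  # out-of-vocabulary tokens are skipped (assumption)
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total
```

Replacing every stock name and code with one identifier means that questions differing only in the stock they mention end up with the same vector, so they can fall into the same cluster.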

Several definitions of the DBSCAN algorithm used in the embodiment of the present invention are introduced first:

ε-neighborhood: the region within radius ε of a given object is called the ε-neighborhood of that object;

Core object: if the number of sample points in the ε-neighborhood of a given object is greater than or equal to MinPts, the object is called a core object;

Directly density-reachable: for a sample set D, if sample point q is in the ε-neighborhood of p and p is a core object, then object q is directly density-reachable from object p.

Density-reachable: for a sample set D, given a chain of sample points p1, p2, ..., pn with p = p1 and q = pn, if each object pi is directly density-reachable from pi-1, then object q is density-reachable from object p.

Density-connected: if there is a point o in the sample set D such that both object p and object q are density-reachable from o, then p and q are density-connected.

It can be seen that density-reachability is the transitive closure of direct density-reachability and that this relation is asymmetric, whereas density-connectedness is a symmetric relation. The purpose of DBSCAN is to find maximal sets of density-connected objects.

Example: assume radius ε = 3 and MinPts = 3. The ε-neighborhood of point p contains the points {m, p, p1, p2, o}; the ε-neighborhood of point m contains {m, q, p, m1, m2}; the ε-neighborhood of point q contains {q, m}; the ε-neighborhood of point o contains {o, p, s}; and the ε-neighborhood of point s contains {o, s, s1}.

The core objects are then p, m, o, and s (q is not a core object, because its ε-neighborhood contains 2 points, fewer than MinPts = 3);

Point m is directly density-reachable from point p, because m is in the ε-neighborhood of p and p is a core object;

Point q is density-reachable from point p, because q is directly density-reachable from m and m is directly density-reachable from p;

Point q is density-connected to point s, because q is density-reachable from p and s is density-reachable from p.
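The example can be checked mechanically with the short sketch below, which encodes the listed neighborhoods and the definitions. The function names are illustrative, and the points whose neighborhoods are not listed in the example (p1, p2, m1, m2, s1) are treated as non-core objects, which is an assumption.

```python
from collections import deque

MIN_PTS = 3
# Neighborhoods exactly as listed in the example above.
NEIGHBORHOODS = {
    "p": {"m", "p", "p1", "p2", "o"},
    "m": {"m", "q", "p", "m1", "m2"},
    "q": {"q", "m"},
    "o": {"o", "p", "s"},
    "s": {"o", "s", "s1"},
}


def is_core(x):
    """A core object has at least MIN_PTS points in its neighborhood."""
    return len(NEIGHBORHOODS.get(x, ())) >= MIN_PTS


def directly_reachable(q, p):
    """q is directly density-reachable from p."""
    return is_core(p) and q in NEIGHBORHOODS[p]


def density_reachable(q, p):
    """q is density-reachable from p through a chain of core objects."""
    seen, queue = {p}, deque([p])
    while queue:
        current = queue.popleft()
        if not is_core(current):
            continue  # only core objects extend the chain
        for neighbor in NEIGHBORHOODS[current]:
            if neighbor == q:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def density_connected(p, q):
    """p and q are both density-reachable from some common object o."""
    return any(density_reachable(p, o) and density_reachable(q, o)
               for o in NEIGHBORHOODS)
```

In this sketch `density_reachable("q", "p")` holds while `density_reachable("p", "q")` does not, because q is not a core object; this illustrates the asymmetry of density-reachability noted above.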

In combination with the embodiment of the present invention, a set of effective parameter settings is also provided for clustering the preprocessed data with the DBSCAN clustering algorithm, including:

In combination with the embodiment of the present invention, extracting keywords as hot words with the TF-IDF algorithm, as involved in step 203 and shown in Fig. 3, specifically includes:

In step 2031, the importance of each term in the result data is calculated one by one by formula (1): tf_{i,j} = n_{i,j} / \sum_k n_{k,j}, where the numerator n_{i,j} is the number of occurrences of the word in the chat record, and the denominator is the sum of the occurrences of all words in the chat record;

In step 2032, the general importance of a word is calculated by formula (2): idf_i = \log( |D| / |\{j : t_i \in d_j\}| ), where |D| is the total number of chat records in the corpus and |\{j : t_i \in d_j\}| is the number of records containing the word (that is, the number of records with n_{i,j} ≠ 0). If the word is not in the corpus, the denominator would be zero, so 1 + |\{j : t_i \in d_j\}| is generally used;

In step 2033, the combined importance of each word is calculated by formula (3): tfidf_{i,j} = tf_{i,j} × idf_i; the terms whose combined importance is below a preset threshold are screened out, and the remaining keywords are taken as hot words.
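Steps 2031 to 2033 can be sketched as follows. This is a minimal sketch: the function names and the threshold value are hypothetical, the natural logarithm is assumed, and the smoothed denominator (1 plus the number of records containing the word) follows the common convention for avoiding division by zero.

```python
import math
from collections import Counter


def tf_idf(records):
    """Return, for each record, a dict mapping word -> tf-idf score.

    records is a list of token lists, one per chat record.
    """
    doc_freq = Counter()
    for tokens in records:
        doc_freq.update(set(tokens))  # count each word once per record
    n_docs = len(records)

    scores = []
    for tokens in records:
        counts = Counter(tokens)
        total = sum(counts.values())
        scores.append({
            word: (count / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return scores


def hot_words(scores, threshold):
    """Keep words whose score reaches the threshold in at least one record."""
    return {word
            for record in scores
            for word, score in record.items()
            if score >= threshold}
```

With the smoothed denominator, a word that appears in every record receives a negative score, so very common words such as generic chat vocabulary are always screened out by a positive threshold.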

Based on the chat record labels proposed by the embodiment of the present invention, the embodiment also provides a method of using them. After step 203 in Embodiment 1 is performed, as shown in Fig. 4, the method further includes:

In step 301, one or more user identifiers contained in the chat records are identified, and the label obtained by analyzing the chat records is assigned to the hobby/specialty information field of the corresponding user identifier.

In step 302, according to the labels recorded in the hobby/specialty information field of the user identifier, information matching those labels is pushed to the smart terminal on which the user identifier is logged in.
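Steps 301 and 302 can be sketched as follows, with the hobby/specialty information field modeled as a set of labels per user identifier. The data structures and function names are hypothetical, and the actual push to a terminal is left out.

```python
def assign_label(profiles, user_ids, label):
    """Add the label to the hobby/specialty field of each user identifier."""
    for user_id in user_ids:
        profiles.setdefault(user_id, set()).add(label)


def select_push_items(profiles, user_id, items):
    """Return the contents whose label appears in the user's profile.

    items is a list of (label, content) pairs, a hypothetical structure.
    """
    labels = profiles.get(user_id, set())
    return [content for label, content in items if label in labels]
```

A user identifier with no recorded labels simply receives nothing in this sketch.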

Steps 301 and 302 above are only one application scenario (called the first application scenario) of the chat record labels obtained by the embodiment of the present invention. Fig. 5 shows another application scenario (called the second application scenario) obtained in combination with Embodiment 1. The second application scenario can also be implemented together with the first. The second application scenario is implemented as follows:

In step 401, the accuracy of the information contained in the chat sentences or entries of each user identifier in the chat records is analyzed, and the expert grade points of the corresponding user identifier under the label of the chat records are updated according to that accuracy.

Different expert grades correspond to different point thresholds; that is, once the points exceed the relevant threshold, the user moves up to the corresponding expert grade.

In step 402, the expert grade points are used as follows: when the server receives an expert consultation request message sent by user A, the server selects, from the user identifiers it manages, at least one user identifier that has the chat record label most similar to user A's question and whose expert grade matches user A's request, and establishes a chat window between the at least one user identifier and user A.
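The grade thresholds and the selection in step 402 can be sketched as follows. The threshold values, the Jaccard similarity between question tokens and label keywords, the data structures, and the function names are all assumptions made for illustration; the original text does not specify how similarity is measured.

```python
import bisect

# Hypothetical point thresholds at which grades 1 to 4 are reached.
GRADE_THRESHOLDS = [0, 100, 500, 2000]


def expert_grade(points):
    """Map points to a grade; 0 means no grade under this label."""
    return bisect.bisect_right(GRADE_THRESHOLDS, points)


def match_experts(question_tokens, label_keywords, expert_points, min_grade):
    """Pick the label most similar to the question, then the users whose
    grade under that label is at least min_grade.

    label_keywords: label -> set of keywords
    expert_points:  user identifier -> {label: points}
    """
    question = set(question_tokens)

    def jaccard(keywords):
        union = question | keywords
        return len(question & keywords) / len(union) if union else 0.0

    best_label = max(label_keywords,
                     key=lambda label: jaccard(label_keywords[label]))
    experts = [user_id
               for user_id, points in expert_points.items()
               if expert_grade(points.get(best_label, -1)) >= min_grade]
    return best_label, experts
```

Keeping points per label rather than per user means a user can hold a high grade under one label and none under another, which matches the per-label updating described in step 401.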

To further improve the practicality of the second application scenario, the at least one user identifier in it (that is, one or more other users who are rated as experts and can solve the problem raised by user A) needs both an incentive and supervision for the question-answering ecosystem described above to be maintained effectively. Therefore, preferably, after user A has obtained the corresponding reply, the method further includes:

In the embodiment of the present invention, the information includes one or more of stock code, stock price, stock trend, and peripheral information about listed companies, and updating the expert grade points of the corresponding user identifier under the label of the chat records according to the information accuracy specifically includes:

Embodiment 2:

Fig. 6 is an architecture diagram of the chat record analysis device based on hierarchical clustering of the embodiment of the present invention. The device of this embodiment includes one or more processors 21 and a memory 22. One processor 21 is taken as an example in Fig. 6.

The processor 21 and the memory 22 may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 6.

As a non-volatile computer-readable storage medium for the chat record analysis method and device based on hierarchical clustering, the memory 22 can store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the chat record analysis method based on hierarchical clustering in Embodiment 1. By running the non-volatile software programs, instructions, and modules stored in the memory 22, the processor 21 performs the various functional applications and data processing of the chat record analysis device based on hierarchical clustering, that is, implements the chat record analysis method based on hierarchical clustering of Embodiment 1.

The memory 22 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the memory 22 optionally includes memories located remotely from the processor 21, and these remote memories can be connected to the processor 21 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the chat record analysis method based on hierarchical clustering of Embodiment 1 above, for example the steps shown in Fig. 1 to Fig. 5 described above.

A person of ordinary skill in the art will understand that all or part of the steps of the methods of the embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A chat record analysis method based on hierarchical clustering, characterized by including:

Obtaining chat records and related data, and preprocessing the chat records for the DBSCAN clustering algorithm;

For the result data of the DBSCAN clustering, extracting keywords as hot words with the TF-IDF algorithm, counting the number of occurrences of each hot word in the data entries, and taking the hot word with the most occurrences as the label of the chat records.

2. The chat record analysis method based on hierarchical clustering according to claim 1, characterized in that the chat records include one or more of: customer question records extracted from system logs, chat records between customers, chat records between customers and experts, and replies to articles published by customers; and the related data includes a financial-domain vocabulary, a Chinese stop-word list, and pre-trained word vector data.

3. The chat record analysis method based on hierarchical clustering according to claim 1, characterized in that preprocessing the chat records for the DBSCAN clustering algorithm includes:

Replacing stock names and stock codes in the question data uniformly with a specified identifier, and then performing one or more of the following operations on the text data: traditional-to-simplified Chinese conversion, case conversion, and stop-word removal;

4. The chat record analysis method based on hierarchical clustering according to claim 1, characterized in that clustering the preprocessed data with the DBSCAN clustering algorithm includes:

Setting the maximum distance to a center point to: average data distance / b, where the average data distance may be estimated by random sampling and b takes a value in the interval [0.1, 0.3].

5. The chat record analysis method based on hierarchical clustering according to claim 1, characterized in that extracting keywords as hot words with the TF-IDF algorithm specifically includes:

Calculating the importance of each term in the result data one by one by the formula tf_{i,j} = n_{i,j} / \sum_k n_{k,j}, where the numerator n_{i,j} is the number of occurrences of the word in the chat record, and the denominator is the sum of the occurrences of all words in the chat record;

Calculating the general importance of a word by the formula idf_i = \log( |D| / |\{j : t_i \in d_j\}| ), where |D| is the total number of chat records in the corpus and the denominator is the number of chat records that contain the word;

Calculating the combined importance of each word according to the formula tfidf_{i,j} = tf_{i,j} × idf_i, screening out the terms whose combined importance is below a preset threshold, and taking the remaining keywords as hot words.

6. The chat record analysis method based on hierarchical clustering according to claim 1, characterized in that the method further includes:

Identifying one or more user identifiers contained in the chat records, and assigning the label obtained by analyzing the chat records to the hobby/specialty information field of the corresponding user identifier;

Pushing, according to the labels recorded in the hobby/specialty information field of the user identifier, information matching those labels to the smart terminal on which the user identifier is logged in.

7. The chat record analysis method based on hierarchical clustering according to claim 1, characterized in that the method further includes:

Analyzing the accuracy of the information contained in the chat sentences or entries of each user identifier in the chat records, and updating, according to that accuracy, the expert grade points of the corresponding user identifier under the label of the chat records;

The expert grade points are used as follows: when the server receives an expert consultation request message sent by user A, the server selects, from the user identifiers it manages, at least one user identifier that has the chat record label most similar to user A's question and whose expert grade matches user A's request, and establishes a chat window between the at least one user identifier and user A.

8. The chat record analysis method based on hierarchical clustering according to claim 7, characterized in that, after user A has obtained the corresponding reply, the method further includes:

Giving a corresponding reward to the account of the at least one user identifier according to user A's rating; and adding, according to the ratings given by past questioners, a reputation grade dimension to the expert grade of each user identifier, so that a user sending a question request to the server can set the required expert grade and/or reputation grade to select experts within a specified range to reply.

9. The chat record analysis method based on hierarchical clustering according to claim 7, characterized in that the information includes one or more of stock code, stock price, stock trend, and peripheral information about listed companies, and updating the expert grade points of the corresponding user identifier under the label of the chat records according to the information accuracy specifically includes:

For the stock code, stock price, and stock trend, matching them against the real stock information for the corresponding time; if the matching error is less than a preset threshold, increasing the expert grade points of the corresponding user identifier under the label of the chat records, and otherwise decreasing them; where the expert grade points correspond to the individual expert grades;

For peripheral information about listed companies, setting a preset verification time; if, within that verification time, real-world information matching the peripheral information is obtained through big data, increasing the expert grade points of the corresponding user identifier under the label of the chat records, and otherwise decreasing them.

10. A chat record analysis device based on hierarchical clustering, characterized by including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are configured by a program to execute the chat record analysis method based on hierarchical clustering according to any one of claims 1 to 9.