CN111401388B - Data mining method, device, server and readable storage medium - Google Patents


Info

Publication number
CN111401388B
Authority
CN
China
Prior art keywords
frequent
data
dialogue
sentence
word
Prior art date
Legal status
Active
Application number
CN201811526754.5A
Other languages
Chinese (zh)
Other versions
CN111401388A (en)
Inventor
吴康康
王鹏
柳俊宏
王杰
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811526754.5A priority Critical patent/CN111401388B/en
Publication of CN111401388A publication Critical patent/CN111401388A/en
Application granted granted Critical
Publication of CN111401388B publication Critical patent/CN111401388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a data mining method, a device, a server and a readable storage medium. Target dialogue data is acquired, problem data is extracted from the target dialogue data, and word segmentation is performed on the problem data to obtain a word segmentation result composed of a plurality of word segments; a corresponding frequent pattern tree is then constructed according to the word segmentation result, and a frequent item set is mined from the constructed frequent pattern tree, where the frequent item set includes a plurality of frequent items and each frequent item corresponds to a mined knowledge point. In this way, knowledge points in single-round dialogues can be mined accurately and comprehensively, which greatly improves the efficiency and quality of knowledge point mining, solves users' problems more effectively, and improves user satisfaction.

Description

Data mining method, device, server and readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data mining method, a data mining device, a server, and a readable storage medium.
Background
Currently, with the popularization of intelligent terminals, applications (APPs) providing life-convenience services are emerging one after another, offering users services for daily needs (such as travel services, takeaway services, etc.). In the process of using these services, users typically consult customer service about the problems they need solved, for example through multi-round dialogues, single-round dialogues (question answering, QA), and chit-chat robots in customer service systems. Among these, single-round dialogues play a major role in solving user problems and realizing intelligent services, and knowledge points are the most important part of a single-round dialogue. How to accurately and comprehensively mine the knowledge points in single-round dialogues, so as to solve users' problems more effectively and improve user satisfaction, is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
In view of the foregoing, an object of the embodiments of the present application is to provide a data mining method, apparatus, server and readable storage medium, so as to accurately and comprehensively mine knowledge points in single-round dialogues, thereby solving users' problems more effectively and improving user satisfaction.
According to one aspect of embodiments of the present application, an electronic device is provided that may include one or more storage media and one or more processors in communication with the storage media. One or more storage media store machine-readable instructions executable by a processor. When the electronic device is in operation, the processor and the storage medium communicate via a bus, and the processor executes the machine-readable instructions to perform a data mining method.
According to another aspect of the embodiments of the present application, a data mining method is provided, applied to a server, where the method may include:
acquiring target dialogue data and extracting problem data from the target dialogue data;
performing word segmentation on the problem data to obtain a word segmentation result composed of a plurality of word segments;
and constructing a corresponding frequent pattern tree according to the word segmentation result, and mining a frequent item set from the constructed frequent pattern tree, where the frequent item set includes a plurality of frequent items, and each frequent item corresponds to a mined knowledge point.
In one possible implementation manner, the step of acquiring the target dialogue data includes:
acquiring the human-machine historical dialogues of each dialogue scene from a historical database;
and acquiring, from the human-machine historical dialogues of each dialogue scene, the historical dialogue of the dialogue requester as the target dialogue data.
In one possible implementation manner, the step of extracting problem data from the target dialogue data includes:
matching each dialogue sentence in the target dialogue data against each keyword in a preset keyword list;
if the dialogue sentence matches any keyword in the preset keyword list, determining the dialogue sentence as a question sentence;
and obtaining the problem data according to the determined question sentences.
In one possible implementation manner, the step of extracting problem data from the target dialogue data includes:
judging, for each dialogue sentence in the target dialogue data, whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a question sentence;
and obtaining the problem data according to the determined question sentences.
In one possible implementation manner, the step of extracting problem data from the target dialogue data includes:
matching each dialogue sentence in the target dialogue data against each keyword in a preset keyword list;
if the dialogue sentence matches any keyword in the preset keyword list, judging whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a question sentence;
and obtaining the problem data according to the determined question sentences.
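The three extraction strategies above (keyword matching, length filtering, and their combination) can be sketched as follows; the function name, sample keywords and length bounds are illustrative assumptions rather than values from the patent:

```python
def extract_questions(dialogue_sentences, keywords, min_len=5, max_len=40):
    """Combined strategy: keep a sentence as a question sentence when it
    matches a preset keyword AND its length falls within a preset range.
    Dropping one of the two tests yields the keyword-only or
    length-only variants described above."""
    question_data = []
    for sentence in dialogue_sentences:
        # keyword matching against the preset keyword list
        if not any(k in sentence for k in keywords):
            continue
        # sentence-length check against the preset length range
        if min_len <= len(sentence) <= max_len:
            question_data.append(sentence)
    return question_data
```

Short chit-chat utterances ("ok thanks") fail the keyword test, while very long narrations fail the length test, which is why the combined variant tends to be the most precise.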
In one possible implementation manner, the step of performing word segmentation on the problem data to obtain a word segmentation result composed of a plurality of word segments includes:
performing word segmentation on the problem data according to a pre-configured scene lexicon table to obtain a word segmentation result composed of a plurality of word segments, where the scene lexicon table includes a plurality of special scene words related to the target service corresponding to the problem data; or
performing word segmentation on the problem data according to a pre-trained scene word discovery model to obtain a word segmentation result composed of a plurality of word segments.
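A lexicon-driven segmentation step can be illustrated with a greedy forward maximum-matching sketch; this is a stand-in for a production segmenter loaded with a custom scene dictionary, and the lexicon contents and window size below are hypothetical:

```python
def segment(sentence, scene_lexicon, max_word_len=5):
    """Forward maximum matching: at each position, take the longest
    substring found in the scene lexicon, falling back to a single
    character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in scene_lexicon:
                words.append(candidate)
                i += length
                break
    return words
```

With the scene lexicon {"修改", "登录密码"}, the sentence "修改登录密码" splits into ["修改", "登录密码"] instead of six single characters, which is the point of a scene-specific lexicon.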
In one possible implementation, the scene word discovery model is trained by:
configuring a conditional random field (CRF) model;
and taking the historical dialogue data of each dialogue scene as model input and the plurality of special scene words in that historical dialogue data as model output, iteratively training the CRF model to obtain the scene word discovery model.
In one possible implementation manner, the step of constructing a corresponding frequent pattern tree according to the word segmentation result and mining a frequent item set from the constructed frequent pattern tree includes:
counting the support of each word segment in the word segmentation result, where the support represents the frequency with which the word segment appears in the word segmentation result;
inserting each word segment, in descending order of support, into a tree with NULL as the root node, thereby constructing the frequent pattern tree, where the frequent pattern tree includes a NULL root node (an invalid value) and branch nodes, each branch node corresponding to one frequent item and its support;
and mining the frequent item set from the constructed frequent pattern tree.
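The support counting and descending-order insertion can be sketched as follows; the class and function names are illustrative, and ties in support are broken alphabetically only to make the insertion order deterministic:

```python
from collections import Counter

class FPNode:
    """One branch node of the frequent pattern tree; the root carries
    word=None, i.e. the NULL (invalid) value."""
    def __init__(self, word, parent):
        self.word = word
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(segmented_questions, min_support=1):
    # 1. Count the support of every word segment across all questions.
    support = Counter(w for q in segmented_questions for w in set(q))
    # 2. Keep frequent words and fix a global descending-support order.
    ranked = sorted((w for w, c in support.items() if c >= min_support),
                    key=lambda w: (-support[w], w))
    order = {w: i for i, w in enumerate(ranked)}
    # 3. Insert each question's frequent words, in that order,
    #    into a tree rooted at a NULL node, incrementing counts
    #    along shared prefixes.
    root = FPNode(None, None)
    for q in segmented_questions:
        node = root
        for w in sorted((w for w in set(q) if w in order), key=order.get):
            node = node.children.setdefault(w, FPNode(w, node))
            node.count += 1
    return root, support
```

Because every question is inserted in the same global order, questions sharing frequent word segments share tree prefixes, which is what makes the subsequent mining compact.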
In one possible implementation, the step of mining the frequent item set from the constructed frequent pattern tree includes:
constructing, for each frequent item in the constructed frequent pattern tree, a conditional pattern base of the frequent item, and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, where the conditional pattern base is the set of prefix paths that connect to the frequent item taken as a suffix item;
updating the frequent pattern tree based on each constructed conditional frequent pattern tree, and continuing to execute, based on the updated frequent pattern tree, the steps of constructing the conditional pattern base and the conditional frequent pattern tree for each frequent item and outputting the frequent item corresponding to the conditional frequent pattern tree, until a constructed conditional frequent pattern tree is empty or contains only one path, so as to obtain the frequent item set;
where, when a constructed conditional frequent pattern tree contains only one path, all combinations of the nodes on its prefix path are output as frequent items.
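The recursion over conditional pattern bases can be illustrated with a compact FP-Growth-style sketch; for brevity it operates on conditional databases directly instead of materializing each conditional frequent pattern tree, but it enumerates the same frequent item set with the same supports:

```python
from collections import Counter

def mine_frequent(transactions, min_support, suffix=()):
    """Each recursion level plays the role of one conditional pattern
    base: for every frequent item, restrict the database to the
    transactions containing it and recurse with the item appended
    to the suffix."""
    counts = Counter()
    for items in transactions:
        counts.update(set(items))
    frequent = {}
    for item, support in counts.items():
        if support < min_support:
            continue
        frequent[frozenset((item,) + suffix)] = support
        # Conditional database: transactions containing `item`,
        # restricted to items ordered before it, so that every
        # itemset is enumerated exactly once.
        conditional = [[i for i in items if i < item]
                       for items in transactions if item in items]
        frequent.update(mine_frequent(conditional, min_support,
                                      (item,) + suffix))
    return frequent
```

The fixed total order on items (here plain string order) is what prevents the same itemset from being emitted from two different recursion branches.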
In one possible implementation, after the step of updating the frequent pattern tree based on each constructed conditional frequent pattern tree, the method may further include:
and filtering out frequent items with the support degree lower than a preset support degree in the frequent pattern tree.
In one possible implementation manner, after the steps of constructing a corresponding frequent pattern tree according to the word segmentation result and mining a frequent item set from the constructed frequent pattern tree, the method may further include:
merging the frequent items with the same meaning in the frequent item set to obtain a merged frequent item set;
the step of merging the frequent items with the same meaning in the frequent item set to obtain a merged frequent item set includes:
generating a problem set related to each frequent item;
calculating sentence vectors of problem sets related to each frequent item;
according to the calculated sentence vectors of the problem sets related to each frequent item, calculating cosine distances between the sentence vectors of the problem sets related to any two frequent items, and taking the cosine distances as the similarity between the sentence vectors of the problem sets related to the any two frequent items;
judging whether the similarity is greater than a preset similarity, and if the similarity is greater than the preset similarity, deleting either one of the two corresponding frequent items.
In one possible implementation, the step of calculating the sentence vector of the question set related to each frequent item includes:
performing, for the problem set related to each frequent item, word segmentation on each question sentence in the set to obtain a plurality of word segments;
inputting each word segment into a pre-trained fastText word vector model to obtain the word vector of each word segment;
and obtaining the sentence vector of the problem set related to the frequent item according to the word vector of each word segment.
In a possible implementation manner, after the step of merging the frequent items with the same meaning in the frequent item set to obtain the merged frequent item set, the method may further include:
acquiring and storing problem solving information of each problem in the problem set of each frequent item;
when a preset problem sent by a service requester terminal is received, matching the preset problem with each problem in a problem set of each frequent item, and sending problem solving information of the problem matched with the preset problem to the service requester terminal.
According to another aspect of the embodiments of the present application, there is provided a data mining apparatus applied to a server, the apparatus including:
the first acquisition module is used for acquiring target dialogue data and extracting problem data from the target dialogue data;
the word segmentation module is used for segmenting the problem data to obtain word segmentation results composed of a plurality of word segments;
and the mining module is used for constructing a corresponding frequent pattern tree according to the word segmentation result, and mining a frequent item set from the constructed frequent pattern tree, where the frequent item set includes a plurality of frequent items, and each frequent item corresponds to a mined knowledge point.
According to another aspect of the embodiments of the present application, there is provided a readable storage medium having stored thereon a computer program which, when executed by a processor, can perform the steps of the data mining method described above.
Based on any one of the above aspects, in the embodiments of the present application, target dialogue data is acquired, problem data is extracted from the target dialogue data, and word segmentation is performed on the problem data to obtain a word segmentation result composed of a plurality of word segments; a corresponding frequent pattern tree is then constructed according to the word segmentation result, and a frequent item set is mined from the constructed frequent pattern tree, where the frequent item set includes a plurality of frequent items and each frequent item corresponds to a mined knowledge point. In this way, knowledge points in single-round dialogues can be mined accurately and comprehensively, which greatly improves the efficiency and quality of knowledge point mining, solves users' problems more effectively, and improves user satisfaction.
The foregoing objects, features and advantages of embodiments of the present application will be more readily apparent from the following detailed description of the embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates an interactive schematic block diagram of a data mining system provided by an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of exemplary hardware and software components of an electronic device that may implement the server, service requester terminal, service provider terminal of FIG. 1, as provided by embodiments of the present application;
FIG. 3 is a schematic flow chart of a data mining method according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of another method for data mining according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a sentence vector calculation method according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of another method for data mining according to an embodiment of the present disclosure;
FIG. 7 illustrates a functional block diagram of a data mining apparatus provided by an embodiment of the present application;
FIG. 8 illustrates another functional block diagram of a data mining apparatus provided by an embodiment of the present application;
fig. 9 shows another functional block diagram of a data mining apparatus provided by an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the accompanying drawings in the present application are only for the purpose of illustration and description, and are not intended to limit the protection scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this application, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art.
In addition, the described embodiments are only some, but not all, of the embodiments of the present application. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
In order to enable those skilled in the art to use the present disclosure, the following embodiments are presented in connection with a specific application scenario, an online ride-hailing scenario. It will be apparent to those having ordinary skill in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present application. Although the present application is primarily described in terms of an online ride-hailing scenario, it should be understood that this is but one exemplary embodiment. The present application may be applied to any other transportation type. For example, the present application may be applied to different transportation system environments, including land, sea, or air, among others, or any combination thereof. The means of transportation may include taxis, private cars, hitch rides, buses, trains, bullet trains, high-speed railways, subways, ships, airplanes, spacecraft, hot air balloons, or unmanned vehicles, etc., or any combination thereof. The present application may also include any service system beyond ride-hailing, for example, a system for sending and/or receiving express deliveries, or a service system for transactions between two parties. Applications of the systems or methods of the present application may include web pages, browser plug-ins, client terminals, customization systems, internal analysis systems, or artificial intelligence robots, etc., or any combination thereof.
It should be noted that the term "comprising" will be used in the embodiments of the present application to indicate the presence of the features stated hereinafter, but not to exclude the addition of other features.
The terms "passenger," "requestor," "attendant," "service requestor," and "customer" are used interchangeably herein to refer to a person, entity, or tool that may request or subscribe to a service. The terms "driver," "provider," "service provider," and "provider" are used interchangeably herein to refer to a person, entity, or tool that can provide a service. The term "user" in this application may refer to a person, entity, or tool requesting, subscribing to, providing, or facilitating the provision of a service. For example, the user may be a passenger, driver, operator, etc., or any combination thereof. In this application, "passenger" and "passenger terminal" may be used interchangeably, and "driver" and "driver terminal" may be used interchangeably.
As is known from the prior art, many existing knowledge points are not semantically unambiguous. Take the knowledge point "how to change a password" as an example: "how to change a password" does not have a single meaning. In some scenarios, many users consult both "how to modify the login password" and "how to modify the withdrawal password"; however, the two passwords are modified in different ways, and mapping both questions to the same knowledge point is not appropriate. Therefore, how to analyze and mine the historical dialogues of the relevant scenes according to the existing ambiguous knowledge points to obtain a plurality of knowledge points each with a single meaning, and to configure a corresponding answer or response script for each such knowledge point, thereby improving the accuracy of single-round dialogue answers and, in turn, user satisfaction, the intelligent resolution rate and the intelligent service share, is a major difficulty in the field.
The inventors of the present application found through research that current knowledge point mining is mainly done manually by operators. In manual knowledge point mining, a person generally reads the historical dialogue data related to a certain knowledge point and summarizes the users' problems, thereby forming a plurality of different knowledge points, which may or may not have existed before. However, manual knowledge point mining has many problems. First, different people understand the same user problem differently, so the users' problems cannot be summarized accurately, the quality of knowledge point splitting is low, and the mined knowledge points remain ambiguous. Second, manual knowledge point mining consumes a great deal of human resources and is quite inefficient.
To solve this problem, another approach is to cluster the users' questions in the relevant scenario using k-means clustering. For example, with the historical dialogue data of the relevant scenes as input, embedded representations of the user questions are obtained by training a deep learning model, and the representations are then clustered with the k-means algorithm. This scheme saves a large amount of human resources and improves working efficiency. However, for a given scenario, the semantics of different user questions in the relevant historical dialogue data are very similar, resulting in poor separability between the embedded representations of different questions; the k-means clustering method is therefore not ideal, and may even be less effective than manual knowledge point mining.
In view of the above technical problems, the embodiments of the present application provide a data mining method, apparatus, server and readable storage medium, which acquire target dialogue data, extract problem data from the target dialogue data, perform word segmentation on the problem data, construct a corresponding frequent pattern tree according to the word segmentation result, and mine a frequent item set from the constructed frequent pattern tree, so that knowledge points in single-round dialogues are mined accurately and comprehensively, users find the channel for solving their problems more quickly, and user experience is improved.
Fig. 1 is a schematic architecture diagram of a data mining system 100 according to an alternative embodiment of the present application. For example, the data mining system 100 may be an online transport service platform for a transport service such as a taxi service, a ride service, a express service, a carpool service, a bus service, a driver rental service, or a airliner service, or a combination service of any of the above. The data mining system 100 may include a server 110, a network 120, a service requester terminal 130, a service provider terminal 140, and a database 150, and a processor executing instruction operations may be included in the server 110. The data mining system 100 shown in FIG. 1 is only one possible example, and in other possible embodiments, the data mining system 100 may include only a portion of the components shown in FIG. 1 or may include other components as well.
In some embodiments, the server 110 may be a single server or a group of servers. The server farm may be centralized or distributed (e.g., server 110 may be a distributed system). In some embodiments, the server 110 may be local or remote to the terminal. For example, the server 110 may access information stored in the service requester terminal 130, the service provider terminal 140, and the database 150, or any combination thereof, via the network 120. As another example, the server 110 may be directly connected to at least one of the service requester terminal 130, the service provider terminal 140, and the database 150 to access information and/or data stored therein. In some embodiments, server 110 may be implemented on a cloud platform; for example only, the cloud platform may include a private cloud, public cloud, hybrid cloud, community cloud (community cloud), distributed cloud, inter-cloud (inter-cloud), multi-cloud (multi-cloud), and the like, or any combination thereof. In some embodiments, server 110 may be implemented on an electronic device 200 having one or more of the components shown in fig. 2 herein.
In some embodiments, server 110 may include a processor. The processor may process information and/or data related to the service request to perform one or more functions described herein. For example, in an express service, the processor may determine the target vehicle based on a service request obtained from the service requester terminal 130. The processor may include one or more processing cores (e.g., a single-core processor or a multi-core processor). By way of example only, the processor may include a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), an application-specific instruction-set processor (Application Specific Instruction-set Processor, ASIP), a graphics processing unit (Graphics Processing Unit, GPU), a physics processing unit (Physics Processing Unit, PPU), a digital signal processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), a programmable logic device (Programmable Logic Device, PLD), a controller, a microcontroller unit, a reduced instruction set computer (Reduced Instruction Set Computing, RISC), a microprocessor, or the like, or any combination thereof.
Network 120 may be used for the exchange of information and/or data. In some embodiments, one or more components in the data mining system 100 (e.g., the server 110, the service requester terminal 130, the service provider terminal 140, and the database 150) may send information and/or data to other components. For example, the server 110 may obtain a service request from the service requester terminal 130 via the network 120. In some embodiments, network 120 may be any type of wired or wireless network, or a combination thereof. By way of example only, the network 120 may include a wired network, a wireless network, a fiber optic network, a telecommunications network, an intranet, the internet, a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), a wireless local area network (Wireless Local Area Network, WLAN), a metropolitan area network (Metropolitan Area Network, MAN), a public switched telephone network (Public Switched Telephone Network, PSTN), a Bluetooth network, a ZigBee network, a near field communication (Near Field Communication, NFC) network, or the like, or any combination thereof. In some embodiments, network 120 may include one or more network access points. For example, network 120 may include wired or wireless network access points, such as base stations and/or network switching nodes, through which one or more components of data mining system 100 may connect to network 120 to exchange data and/or information.
In some embodiments, the user of the service requester terminal 130 may be a person other than the actual consumer of the service. For example, user A of the service requester terminal 130 may use the service requester terminal 130 to initiate a service request on behalf of actual service requester B (e.g., user A may request a ride for his friend B), or receive service information or instructions from server 110, etc. In some embodiments, the user of the service provider terminal 140 may be the actual service provider or may be a person other than the actual service provider. For example, user C of the service provider terminal 140 may use the service provider terminal 140 to receive a service request on behalf of actual service provider D (e.g., user C may accept orders on behalf of driver D, whom user C employs), and/or information or instructions from server 110. In some embodiments, "service requester" and "service requester terminal" may be used interchangeably, and "service provider" and "service provider terminal" may be used interchangeably.
In some embodiments, the service requester terminal 130 may include a mobile device, a tablet computer, a laptop computer, or a built-in device in a motor vehicle, or the like, or any combination thereof. In some embodiments, the mobile device may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, or an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device for a smart appliance, a smart monitoring device, a smart television, a smart video camera, or an intercom, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, smart footwear, smart glasses, a smart helmet, a smart watch, a smart garment, a smart backpack, a smart accessory, etc., or any combination thereof. In some embodiments, the smart mobile device may include a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a gaming device, a navigation device, or a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality eyepatch, an augmented reality helmet, augmented reality glasses, an augmented reality eyepatch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include various virtual reality products, and the like. In some embodiments, the built-in devices in the motor vehicle may include an on-board computer, an on-board television, and the like.
Database 150 may store data and/or instructions. In some embodiments, database 150 may store data obtained from service requester terminal 130 and/or service provider terminal 140. In some embodiments, database 150 may store data and/or instructions for the exemplary methods described in this application. In some embodiments, database 150 may include mass storage, removable storage, volatile read-write memory, or read-only memory (Read-Only Memory, ROM), or the like, or any combination thereof. By way of example, mass storage may include magnetic disks, optical disks, solid state drives, and the like; removable memory may include flash drives, floppy disks, optical disks, memory cards, zip disks, magnetic tape, and the like; the volatile read-write memory may include random access memory (Random Access Memory, RAM); the RAM may include dynamic RAM (Dynamic Random Access Memory, DRAM), double data rate synchronous dynamic RAM (DDR SDRAM), static random-access memory (Static Random-Access Memory, SRAM), thyristor RAM (T-RAM), zero-capacitor RAM (Z-RAM), etc. By way of example, ROM may include mask ROM (Mask Read-Only Memory, MROM), programmable ROM (Programmable Read-Only Memory, PROM), erasable programmable ROM (Programmable Erasable Read-Only Memory, PEROM), electrically erasable programmable ROM (Electrically Erasable Programmable Read-Only Memory, EEPROM), compact disc ROM (CD-ROM), digital versatile disc ROM, and the like. In some embodiments, database 150 may be implemented on a cloud platform. For example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, a cross-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, database 150 may be connected to network 120 to communicate with one or more components in data mining system 100 (e.g., server 110, service requester terminal 130, service provider terminal 140, etc.). One or more components in the data mining system 100 may access data or instructions stored in the database 150 via the network 120. In some embodiments, database 150 may be directly connected to one or more components in data mining system 100 (e.g., server 110, service requester terminal 130, service provider terminal 140, etc.); alternatively, in some embodiments, database 150 may also be part of server 110.
In some embodiments, one or more components in the data mining system 100 (e.g., server 110, service requester terminal 130, service provider terminal 140, etc.) may have access to the database 150. In some embodiments, one or more components in the data mining system 100 may read and/or modify information related to a service requester, a service provider, or the public, or any combination thereof, when certain conditions are met. For example, server 110 may read and/or modify information of one or more users after receiving a service request.
Fig. 2 shows a schematic diagram of exemplary hardware and software components of an electronic device 200 provided by some embodiments of the present application, which may implement the server 110, the service requester terminal 130, and the service provider terminal 140 embodying the concepts of the present application. For example, the processor 220 may be used on the electronic device 200 to perform the functions described herein.
The electronic device 200 may be a general purpose computer or a special purpose computer, both of which may be used to implement the data mining methods of the present application. Although only one computer is shown for convenience, the functionality described herein may be implemented in a distributed fashion across multiple similar platforms to balance processing loads.
For example, the electronic device 200 may include a network port 210 connected to a network, one or more processors 220 for executing program instructions, a communication bus 230, and various forms of storage media 240, such as magnetic disk, ROM, or RAM, or any combination thereof. By way of example, the computer platform may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The methods of the present application may be implemented in accordance with these program instructions. The electronic device 200 also includes an Input/Output (I/O) interface 250 between the computer and other Input/Output devices (e.g., keyboard, display screen).
For ease of illustration, only one processor is depicted in the electronic device 200. It should be noted, however, that the electronic device 200 in the present application may also include multiple processors, and thus steps described in the present application as performed by one processor may also be performed jointly by multiple processors or separately. For example, if the processor of the electronic device 200 performs steps A and B, it should be understood that steps A and B may also be performed by two different processors together or separately by one processor. For example, the first processor performs step A and the second processor performs step B, or the first processor and the second processor together perform steps A and B.
Fig. 3 is a flow chart illustrating a data mining method provided in some embodiments of the present application, which may be performed by the server 110 shown in fig. 1. It should be understood that, in other embodiments, the order of some steps in the data mining method of this embodiment may be interchanged according to actual needs, or some steps may be omitted. The detailed steps of the data mining method are described below.
Step S110, obtain the target dialogue data and extract the question data from the target dialogue data.
As a possible implementation manner, the human-machine history dialogs in each dialog scene may be obtained from the history database, and the history dialogs of the dialog requester may be obtained from the human-machine history dialogs of each dialog scene as the target dialog data. For example, under each dialogue scene (e.g., trip problem dialogue scene, takeaway problem dialogue scene, etc.), each time a dialogue requester (e.g., passenger, driver) performs a single-round dialogue with the server 110 through the service requester terminal 130, a man-machine history dialogue of the dialogue requester with the server 110 may be saved, and replies of the system and customer service may be filtered from the man-machine history dialogue, only the history dialogue of the dialogue requester may be reserved as target dialogue data.
In this way, the embodiment obtains the target dialogue data by selecting the history dialogue of the dialogue requester from the man-machine history dialogues without adopting all the man-machine history dialogues, so that the calculation amount in the data mining process can be effectively reduced.
In addition, the inventor has found through research that not all history dialogues of the dialogue requester are meaningful question sentences. In order to obtain truly meaningful question sentences, this embodiment also needs to filter out meaningless, uninformative dialogue sentences from the target dialogue data. How to extract the question data from the target dialogue data is described in detail below in conjunction with several examples.
For example, for each dialogue sentence in the target dialogue data, the dialogue sentence may be matched with each keyword in the preset keyword table, if the dialogue sentence is matched with any keyword in the preset keyword table, the dialogue sentence is determined to be a question sentence, and the question data is obtained according to the determined question sentence.
In this embodiment, the high-frequency words of the questions related to each dialog scene may be selected to form a preset keyword table. For example, for a question that most conversation requesters ask, the conversation sentence corresponding to the question may include high frequency words such as "how," "why," "forgotten," "password," "ask," "can not," etc. Thus, if a certain dialogue sentence includes any of the high-frequency words, the dialogue sentence can be determined as a question sentence.
For another example, for each dialogue sentence in the target dialogue data, it may be determined whether the sentence length of the dialogue sentence is within a preset length range, if the sentence length of the dialogue sentence is within the preset length range, the dialogue sentence is determined as a question sentence, and the question data is obtained according to the determined question sentence.
In this embodiment, the preset length range may be designed according to practical situations, for example, the preset length range may be limited to 4-20 characters. That is, if the dialogue sentence is smaller than 4 characters, the dialogue sentence is considered as an invalid sentence, and if the dialogue sentence is larger than 20 characters, it is generally considered that the sentence ideas are already very clear, and knowledge point mining may not be required. Thus, if the sentence length of a certain dialogue sentence is within the preset length range, the dialogue sentence can be determined as a question sentence.
It should be noted that, in actual implementation, the two exemplary ways of determining question sentences described above may be used alternatively or simultaneously. As another example, the two exemplary embodiments described above may also be combined to determine question sentences. That is, for each dialogue sentence in the target dialogue data, the dialogue sentence is matched with each keyword in the preset keyword table; if the dialogue sentence matches any keyword in the preset keyword table, it is then judged whether the sentence length of the dialogue sentence is within the preset length range. If the sentence length of the dialogue sentence is within the preset length range, the dialogue sentence is determined to be a question sentence, and finally the question data is obtained from the determined question sentences. Judging each dialogue sentence in the target dialogue data against both the preset keyword table and the preset length range can significantly improve the accuracy of question sentence extraction.
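The combined keyword-and-length filter described above can be sketched as follows. This is a minimal illustrative sketch: the keyword set and the 4–20 length bounds are taken from the examples in this embodiment, while the function and variable names are assumptions, not part of the patented implementation.

```python
# Illustrative keyword table; the embodiment builds it from high-frequency
# question words of each dialogue scene (e.g., "how", "why", "password").
KEYWORDS = {"how", "why", "forgotten", "password", "ask", "cannot"}
MIN_LEN, MAX_LEN = 4, 20  # preset length range from the embodiment

def extract_questions(dialogue_sentences):
    """Keep a dialogue sentence only if it matches a keyword from the
    preset keyword table AND its length is within the preset range."""
    questions = []
    for sentence in dialogue_sentences:
        has_keyword = any(k in sentence for k in KEYWORDS)
        length_ok = MIN_LEN <= len(sentence) <= MAX_LEN
        if has_keyword and length_ok:
            questions.append(sentence)
    return questions
```

Either condition can also be used alone, matching the two alternative embodiments described above.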
Step S120, word segmentation is carried out on the problem data, and word segmentation results formed by a plurality of word segments are obtained.
The inventor also found that a large number of special scene words exist in current question sentences. For example, many users may ask "why has my service score declined"; in this case, a conventional word segmentation method may produce the result (service, score, declined) after word segmentation. But "service score" is obviously a special scene word, and the correct word segmentation result should be (service score, declined). That is, for the special scene word "service score", a word segmentation algorithm will typically split it into two words, "service" and "score", because "service score" is a word unique to a private scenario rather than a generic word. A large number of such special scene words exist in the historical dialogue database; if a special scene word cannot form a single word during segmentation and is instead split into several words, the subsequent knowledge point mining effect is directly affected, and the classification analysis of knowledge points is further affected.
Based on the findings of the above technical problems, the present inventors have made a careful study to propose the following solutions to solve the above problems, and the following description will be made with reference to two exemplary embodiments.
Embodiment one: the problem data can be segmented according to a pre-configured scene word stock table, so as to obtain a word segmentation result composed of a plurality of segmentation words, wherein the scene word stock table comprises a plurality of special scene words related to a target service corresponding to the problem data.
Embodiment two: and segmenting the problem data according to a pre-trained scene word discovery model to obtain a segmentation result consisting of a plurality of segmentation words. The scene word discovery model is obtained through training in the following mode:
First, a conditional random field (Conditional Random Field, CRF) model is configured; then, the historical dialogue data of each dialogue scene is taken as model input, the plurality of special scene words in the historical dialogue data of each dialogue scene is taken as model output, and the CRF model is iteratively trained to obtain the scene word discovery model.
Therefore, by training the conditional random field (CRF) model offline on a large amount of historical dialogue data, a scene-specific word library table is generated for each dialogue scene, and Chinese word segmentation is then performed with reference to the scene word library table. This effectively prevents special scene words from being split during word segmentation and degrading the subsequent knowledge point mining effect.
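As an illustration of Embodiment One, segmentation against a scene word library table can be sketched with forward maximum matching, which keeps special scene words whole. This is a hedged sketch under stated assumptions: the embodiment does not specify the segmentation algorithm (maximum matching is one common choice), and the CRF-based scene word discovery model that would populate the lexicon is not reproduced here.

```python
def segment(text, scene_lexicon, max_word_len=6):
    """Greedy forward maximum matching against a scene word library table.
    Substrings found in `scene_lexicon` (e.g. special scene words such as
    "service score") are emitted as single tokens; anything unmatched
    falls back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrinking down to one character
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in scene_lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```

With "abc" in the lexicon, `segment("xabcy", {"abc"})` keeps "abc" as one token instead of splitting it, which is exactly the behavior wanted for special scene words.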
Step S130, constructing a corresponding frequent pattern tree according to the word segmentation result, and mining a frequent item set from the constructed frequent pattern tree.
As a possible implementation manner, the present embodiment first counts the support degree of each word in the word segmentation result, where the support degree represents the number of times the word segmentation occurs in the word segmentation result.
And sequentially inserting each word into a tree taking NULL as a root node according to the descending order of the support degree, and constructing a frequent pattern tree, wherein the frequent pattern tree comprises NULL root nodes and branch nodes, the NULL root nodes are invalid values, and the branch nodes correspond to one frequent item and the support degree thereof. Finally, the frequent item set is mined from the constructed frequent pattern tree.
Alternatively, mining frequent item sets from a built frequent pattern tree may be accomplished by:
first, for each frequent item in the constructed frequent pattern tree, constructing a conditional pattern base of the frequent item, and constructing a conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, wherein the conditional pattern base is a path set of a plurality of prefix paths taking the frequent item as a suffix item and connected with the suffix item.
And then updating the frequent pattern tree based on each constructed conditional frequent pattern tree, continuously executing the steps of constructing a conditional pattern base of each frequent item in the constructed frequent pattern tree based on the updated frequent pattern tree, constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, and outputting the frequent item corresponding to the conditional frequent pattern tree until the constructed conditional frequent pattern tree is empty or contains only one path, so as to obtain a frequent item set.
When the constructed conditional frequent pattern tree is empty, the prefix path of the conditional frequent pattern tree is determined as a frequent item; when the constructed conditional frequent pattern tree contains only one path, every combination of the nodes on that path, joined with the prefix path of the conditional frequent pattern tree, is taken as a frequent item.
Optionally, in order to reduce frequent items with too low support, the embodiment may further filter out frequent items with support lower than a preset support in the frequent pattern tree before updating the frequent pattern tree based on each constructed conditional frequent pattern tree each time.
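The steps above (support counting, descending-support insertion under a NULL root, conditional pattern bases, and the minimum-support filter) can be sketched as a compact FP-growth routine. This is an illustrative sketch, not the patented implementation; all names are assumptions, and each transaction here stands for the word segmentation result of one question sentence.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_support):
    # count support of each item; drop items below the preset support
    counts = defaultdict(int)
    for t in transactions:
        for item in set(t):
            counts[item] += 1
    keep = {i for i, c in counts.items() if c >= min_support}
    root = Node(None, None)          # NULL root node (invalid value)
    links = defaultdict(list)        # item -> nodes carrying that item
    for t in transactions:
        # insert surviving items in descending order of support
        ordered = sorted((i for i in set(t) if i in keep),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                links[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, links

def fpgrowth(transactions, min_support, suffix=()):
    """Return (itemset, support) pairs mined via conditional pattern bases."""
    _, links = build_tree(transactions, min_support)
    result = []
    for item, nodes in sorted(links.items()):
        support = sum(n.count for n in nodes)
        new_suffix = (item,) + suffix
        result.append((frozenset(new_suffix), support))
        # conditional pattern base: prefix paths ending at `item`
        cond_base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([path[::-1]] * n.count)
        result.extend(fpgrowth(cond_base, min_support, new_suffix))
    return result
```

For example, with the three segmented questions below and a minimum support of 2, the low-support words "how" and "why" are filtered out, while (service score, decline) survives as a frequent item with support 3.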
In this way, according to the data mining method provided by this embodiment, target dialogue data is obtained and question data is extracted from it; the question data is segmented to obtain a word segmentation result composed of a plurality of segmented words; a corresponding frequent pattern tree is constructed according to the word segmentation result; and a frequent item set is mined from the constructed frequent pattern tree, where the frequent item set includes a plurality of frequent items, each corresponding to a mined knowledge point. Knowledge points in a single-round dialogue can thus be mined accurately and comprehensively, greatly improving the efficiency and quality of knowledge point mining, so that user questions are resolved effectively and user satisfaction is improved.
The inventors of the present application have also found during the course of the study that the frequent item set mined on the above basis may include many redundant frequent items that express the same meaning as other frequent items, which is determined by the pattern of the question sentences. For example, when a user consults about a decline in service score, two questioning modes may occur:
question mode a: how does my service score drop? The frequent items corresponding to the method are as follows: (how, service score, decline);
Question mode B: Why did my service score decline? The corresponding frequent item is: (why, service score, decline).
Clearly, although question mode A and question mode B differ, they express the same meaning; only one of the corresponding frequent items needs to be retained, not both. Otherwise, redundant frequent items are generated, which reduces subsequent question recognition efficiency.
In order to solve the above problem, referring to fig. 4, after the step S130, the data mining method provided in this embodiment may further include the following steps:
step S140, merging the frequent items with the same meaning in the frequent item set to obtain a merged frequent item set.
As one possible implementation, first, a question set is generated for each frequent item. For example, in the application scenario shown in fig. 5, the problem set related to the frequent item (B, D, C) may include problem 1, problem 2, problem 3.
Next, the sentence vector of each frequent item's related question set is computed. Optionally, for each frequent item's related question set, word segmentation is performed on each question sentence in the set to obtain a plurality of segmented words; each segmented word is input into a pre-trained fasttext word vector model to obtain its word vector, and the sentence vector of the frequent item's related question set is obtained from the word vectors of the segmented words. For example, in the application scenario shown in fig. 5, after word segmentation of the question set related to the frequent item (B, D, C), the segmented words "I", "service score", and "have" can be obtained; inputting these segmented words into the pre-trained fasttext word vector model yields their respective word vectors, and adding these word vectors together yields the sentence vector of the question set related to the frequent item.
In the research process of the invention, it was found that the word vectors of semantically similar words are relatively close to each other, while the word vectors of semantically different words are relatively far apart. Based on this, the cosine distance between the sentence vectors of the question sets related to any two frequent items can be computed from the calculated sentence vectors and used as the similarity measure between those two frequent items (a smaller cosine distance indicates more similar meaning). On this basis, whether this value is greater than a preset threshold is judged; if it is not greater than the preset threshold, either one of the two corresponding frequent items is deleted. Otherwise, if it is greater than the preset threshold, both corresponding frequent items are retained.
Therefore, in the embodiment, the frequent items with the same meaning in the frequent item set are combined, so that redundant frequent items are reduced, and the subsequent problem identification efficiency is improved.
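The merging in step S140 can be sketched as follows, under stated assumptions: the fasttext word vectors are replaced by toy hand-made vectors, the function and variable names are hypothetical, and, as in the embodiment, the sentence vector is the sum of the word vectors and the cosine distance serves as the similarity measure (smaller distance = more similar).

```python
import math

def cosine_distance(u, v):
    # cosine distance = 1 - cosine similarity; smaller means more similar
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def sentence_vector(word_vectors, segmented_words):
    # sum the word vectors of the segmented words, as in the embodiment
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for w in segmented_words:
        if w in word_vectors:
            vec = [a + b for a, b in zip(vec, word_vectors[w])]
    return vec

def merge_frequent_items(item_vectors, threshold):
    """Keep a frequent item only if its sentence vector is farther than
    `threshold` (cosine distance) from every already-kept item; otherwise
    it is treated as redundant and dropped."""
    kept = []
    for item, vec in item_vectors:
        if all(cosine_distance(vec, kv) > threshold for _, kv in kept):
            kept.append((item, vec))
    return [item for item, _ in kept]
```

In the toy run below, items "A" and "B" point in nearly the same direction, so "B" is dropped as redundant, while the dissimilar "C" is kept.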
On the basis of the foregoing, referring to fig. 6, after the step S140, the data mining method provided in this embodiment may further include the following steps:
step S150, obtaining and storing problem solving information of each problem in the problem set of each frequent item.
Step S160, when a preset question sent by the service requester terminal 130 is received, match the preset question with each question in the question set of each frequent item, and send the question-resolution information of the question matched with the preset question to the service requester terminal 130.

The question set of each frequent item may be regarded as one potential knowledge point, so corresponding question-resolution information, such as question-answer information or scripted-response information, may be configured for each potential knowledge point, and each potential knowledge point may be stored in association with its question-resolution information. In this way, when a preset question is subsequently sent by the service requester terminal 130, it can be matched with each question in the question set of each frequent item, and the question-resolution information of the matched question can be sent to the service requester terminal 130. This improves the accuracy of single-round dialogue answers, further improving user satisfaction, the intelligent resolution rate, and the intelligent service ratio.
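A minimal sketch of the storage and matching in steps S150–S160, assuming a simple exact-match lookup (the embodiment does not specify the matching algorithm, and all names here are hypothetical):

```python
def store_knowledge_points(item_questions, solutions):
    """Associate each frequent item's question set with its
    question-resolution information (step S150)."""
    return {item: {"questions": qs, "solution": solutions[item]}
            for item, qs in item_questions.items()}

def answer(preset_question, knowledge_points):
    """Return the stored resolution info for the knowledge point whose
    question set matches the preset question, or None (step S160)."""
    for kp in knowledge_points.values():
        if preset_question in kp["questions"]:
            return kp["solution"]
    return None
```

In practice the matching would likely reuse the sentence-vector similarity above rather than exact string equality; the sketch only shows the association between knowledge points and their resolution information.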
Fig. 7 is a functional block diagram of a data mining apparatus 300 according to some embodiments of the present application; the functions implemented by the data mining apparatus 300 may correspond to the steps performed by the above-described method. The data mining apparatus 300 may be understood as the above-mentioned server 110 or a processor of the server 110, or as a component that is independent of the server 110 or its processor and is controlled by the server 110 to implement the functions of the present application. As shown in fig. 7, the data mining apparatus 300 may include a first obtaining module 310, a word segmentation module 320, and a mining module 330; the functions of these functional modules are described in detail below.
The first obtaining module 310 may be configured to obtain target dialogue data and extract problem data from the target dialogue data. It is understood that the first obtaining module 310 may be configured to perform the step S110, and the detailed implementation of the first obtaining module 310 may refer to the content related to the step S110.
The word segmentation module 320 may be configured to segment the question data to obtain a word segmentation result composed of a plurality of word segments. It is understood that the word segmentation module 320 may be used to perform the step S120, and reference may be made to the details of the implementation of the word segmentation module 320 related to the step S120.
The mining module 330 may be configured to construct a corresponding frequent pattern tree according to the word segmentation result, and mine a frequent item set from the constructed frequent pattern tree, where the frequent item set includes a plurality of frequent items, each of which corresponds to a knowledge point of data mining. It is understood that the mining module 330 may be used to perform the step S130 described above, and reference may be made to the details of the implementation of the mining module 330 regarding the step S130 described above.
In one possible implementation, the first obtaining module 310 may specifically obtain the target session data by:
Acquiring man-machine history conversations in each conversation scene from a history database;
the historical dialogue of the dialogue requester is obtained from the man-machine historical dialogue of each dialogue scene as target dialogue data.
In one possible implementation, the first obtaining module 310 may specifically extract the problem data by:
matching each dialogue sentence with each keyword in a preset keyword list aiming at each dialogue sentence in target dialogue data;
if the dialogue sentence is matched with any one keyword in the preset keyword list, determining the dialogue sentence as a question sentence;
and obtaining problem data according to the determined problem statement.
In one possible implementation, the first obtaining module 310 may specifically extract the problem data by:
for each dialogue sentence in the target dialogue data, judging whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a problem sentence;
and obtaining problem data according to the determined problem statement.
In one possible implementation, the first obtaining module 310 may specifically extract the problem data by:
Matching each dialogue sentence with each keyword in a preset keyword list aiming at each dialogue sentence in target dialogue data;
if the dialogue sentence is matched with any one keyword in the preset keyword list, judging whether the sentence length of the dialogue sentence is within the preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a problem sentence;
and obtaining problem data according to the determined problem statement.
In one possible implementation, the word segmentation module 320 may specifically obtain a word segmentation result composed of a plurality of words by:
performing word segmentation on the problem data according to a pre-configured scene word stock table to obtain word segmentation results composed of a plurality of word segments, wherein the scene word stock table comprises a plurality of special scene words related to a target service corresponding to the problem data; or alternatively
And segmenting the problem data according to a pre-trained scene word discovery model to obtain a segmentation result consisting of a plurality of segmentation words.
In one possible implementation, the scene word discovery model is trained by:
configuring a conditional random field algorithm CRF model;
And taking the historical dialogue data of each dialogue scene as a model input, taking a plurality of special scene words in the historical dialogue data of each dialogue scene as a model output, and iteratively training the CRF model to obtain a scene word discovery model.
In one possible implementation, the mining module 330 may specifically mine the frequent item set by:
counting the support degree of each word in the word segmentation result, wherein the support degree represents the occurrence times of the word in the word segmentation result;
sequentially inserting each word into a tree taking NULL as a root node according to a descending order of support degree, and constructing a frequent pattern tree, wherein the frequent pattern tree comprises NULL root nodes and branch nodes, the NULL root nodes are invalid values, and the branch nodes correspond to one frequent item and support degree thereof;
and mining the frequent item set from the constructed frequent pattern tree.
In one possible implementation, the mining module 330 may specifically mine the set of frequent items from the built frequent pattern tree by:
constructing a condition pattern base of each frequent item in the constructed frequent pattern tree, and constructing a condition frequent pattern tree of the frequent item based on the constructed condition pattern base, wherein the condition pattern base is a path set of a plurality of prefix paths taking the frequent item as a suffix item and connected with the suffix item;
updating the frequent pattern tree based on each constructed conditional frequent pattern tree, and repeating, based on the updated frequent pattern tree, the steps of constructing the conditional pattern base of each frequent item in the frequent pattern tree and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, and outputting the frequent items corresponding to the conditional frequent pattern trees, until the constructed conditional frequent pattern tree is empty or contains only one path, so as to obtain the frequent item set;
wherein, when a constructed conditional frequent pattern tree is empty, the prefix path of the conditional frequent pattern tree is determined as a frequent item, and when a constructed conditional frequent pattern tree contains only one path, each combination of nodes on the path is connected with the prefix path of the conditional frequent pattern tree to form a frequent item.
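The conditional pattern base described above, i.e. the set of prefix paths ending just above each occurrence of a frequent item, can be sketched as follows. The tiny nested-dict tree format and the example words are invented for this illustration.

```python
# Illustrative sketch: build a small FP-tree and collect the conditional
# pattern base of one frequent item as (prefix_path, count) pairs.

def insert(tree, words):
    node = tree
    for w in words:
        child = node["children"].setdefault(w, {"count": 0, "children": {}})
        child["count"] += 1
        node = child

def conditional_pattern_base(tree, item, prefix=()):
    """Collect (prefix_path, count) for every occurrence of `item`."""
    base = []
    for w, child in tree["children"].items():
        if w == item:
            if prefix:                      # an empty prefix carries no pattern
                base.append((prefix, child["count"]))
        else:
            base.extend(conditional_pattern_base(child, item, prefix + (w,)))
    return base

root = {"count": 0, "children": {}}         # the NULL root
for t in [["f", "c", "a", "m"], ["f", "c", "a", "b"], ["f", "b"]]:
    insert(root, t)

base = conditional_pattern_base(root, "a")
# -> [(("f", "c"), 2)]: "a" occurs twice under the prefix path f -> c
```

From such a base, a conditional frequent pattern tree for the item would be built and the recursion continued, as described in the text.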
In a possible implementation manner, the mining module 330 is specifically further configured to filter out frequent items with a support degree lower than a preset support degree in the frequent pattern tree.
In a possible implementation manner, referring to fig. 8, the data mining apparatus 300 may further include a merging module 340, where the merging module 340 may be configured to merge frequent items with the same meaning in the frequent item set to obtain the merged frequent item set. It is understood that the merging module 340 may be used to perform the above step S140, and for a detailed implementation of the merging module 340, reference may be made to the above description of step S140.
The merging module 340 may specifically obtain the merged frequent item set by:
generating a problem set related to each frequent item;
calculating sentence vectors of problem sets related to each frequent item;
according to the calculated sentence vectors of the problem sets related to each frequent item, calculating cosine distances between sentence vectors of the problem sets related to any two frequent items, and taking the cosine distances as similarity between sentence vectors of the problem sets related to any two frequent items;
judging whether the similarity is larger than a preset similarity, and deleting either one of the two corresponding frequent items if the similarity is not larger than the preset similarity.
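The merging decision can be sketched as below: compute the cosine distance between two items' sentence vectors and drop one item whenever the distance does not exceed a preset threshold (a small distance meaning the question sets are close in meaning). The vectors, item names, and threshold are invented for illustration.

```python
import math

# Hedged sketch of merging frequent items by cosine distance between the
# sentence vectors of their related question sets.

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def merge_frequent_items(items, vectors, max_distance=0.2):
    kept = []
    for item in items:
        # keep the item only if it is far from everything kept so far
        if all(cosine_distance(vectors[item], vectors[k]) > max_distance
               for k in kept):
            kept.append(item)
    return kept

vectors = {
    "cancel order": [1.0, 0.0],
    "order cancellation": [0.99, 0.05],
    "request refund": [0.0, 1.0],
}
kept = merge_frequent_items(list(vectors), vectors)
# "order cancellation" is dropped: its distance to "cancel order" is ~0.001
```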
In one possible implementation, the merging module 340 may specifically calculate the sentence vector of each frequent item-related question set by:
for the problem set related to each frequent item, performing word segmentation processing on each problem sentence in the problem set to obtain a plurality of segmented words;
inputting each segmented word into a pre-trained word vector model (a fasttext model) to obtain a word vector of each segmented word;
and obtaining the sentence vector of the problem set related to the frequent item according to the word vector of each segmented word.
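One common way to obtain a sentence vector from per-word vectors, as in the steps above, is simple averaging. The sketch below assumes averaging and uses invented toy 2-d vectors; the patent does not specify the exact aggregation.

```python
# Minimal sketch: average the word vectors of a segmented sentence to get
# a sentence vector, a typical choice when a fastText-style model supplies
# the per-word vectors.

def sentence_vector(words, word_vectors):
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for w in words:
        vec = word_vectors.get(w, [0.0] * dim)   # unknown words contribute zero
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(words) for t in total]

vec = sentence_vector(["cancel", "order"],
                      {"cancel": [1.0, 0.0], "order": [0.0, 1.0]})
# -> [0.5, 0.5]
```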
In one possible implementation, referring further to fig. 9, the data mining apparatus 300 may further include a second acquisition module 350 and a problem matching module 360.
The second obtaining module 350 may be configured to obtain and store problem resolution information of each problem in the problem set of each frequent item. It is understood that the second obtaining module 350 may be used to perform the step S150, and the detailed implementation of the second obtaining module 350 may refer to the content related to the step S150.
The problem matching module 360 may be configured to, when receiving a preset problem sent by the service requester terminal 130, match the preset problem with each problem in the problem set of each frequent item, and send problem resolution information of the problem matched with the preset problem to the service requester terminal 130. It is understood that the question matching module 360 may be used to perform the above step S160, and reference may be made to the above description of the step S160 for a detailed implementation of the question matching module 360.
The modules may be connected or communicate with each other via wired or wireless connections. The wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof. The wireless connection may include a connection through a LAN, a WAN, Bluetooth, ZigBee, or NFC, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the system and apparatus described above may refer to the corresponding procedures in the method embodiments, and are not described in detail in this application. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into modules is merely a logical function division, and there may be other divisions in actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of devices or modules through some communication interfaces, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or the like.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that can readily be conceived by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (26)

1. A data mining method, applied to a server, the method comprising:
acquiring target dialogue data and extracting problem data from the target dialogue data;
performing word segmentation on the problem data to obtain a word segmentation result composed of a plurality of segmented words;
counting the support degree of each word in the word segmentation result, sequentially inserting each word into a tree taking NULL as a root node according to the descending order of the support degree, constructing a frequent pattern tree, and mining a frequent item set from the constructed frequent pattern tree; the frequent pattern tree comprises a NULL root node and a branch node, wherein the NULL root node is an invalid value, the branch node corresponds to one frequent item and the support degree thereof, the support degree represents the frequency of occurrence of the word in the word segmentation result, the frequent item set comprises a plurality of frequent items, and each frequent item corresponds to a knowledge point of data mining;
merging the frequent items with the same meaning in the frequent item set based on cosine distances between sentence vectors of any two related problem sets of the frequent item set to obtain a merged frequent item set;
wherein one implementation of performing word segmentation on the problem data to obtain a word segmentation result composed of a plurality of segmented words comprises the following step: segmenting the problem data according to a pre-configured scene word stock table to obtain a word segmentation result composed of a plurality of segmented words, wherein the scene word stock table comprises a plurality of special scene words related to a target service corresponding to the problem data.
2. The data mining method according to claim 1, wherein the step of acquiring target dialogue data includes:
acquiring man-machine history conversations in each conversation scene from a history database;
and acquiring the history dialogue of the dialogue requester from the man-machine history dialogue of each dialogue scene as the target dialogue data.
3. The data mining method of claim 1, wherein the step of extracting problem data from the target session data comprises:
matching each dialogue sentence with each keyword in a preset keyword list aiming at each dialogue sentence in the target dialogue data;
if the dialogue sentence is matched with any one keyword in the preset keyword list, determining the dialogue sentence as a question sentence;
and obtaining the problem data according to the determined problem statement.
4. The data mining method of claim 1, wherein the step of extracting problem data from the target session data comprises:
judging whether the sentence length of each dialogue sentence is within a preset length range or not according to each dialogue sentence in the target dialogue data;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a problem sentence;
and obtaining the problem data according to the determined problem statement.
5. The data mining method of claim 1, wherein the step of extracting problem data from the target session data comprises:
matching each dialogue sentence with each keyword in a preset keyword list aiming at each dialogue sentence in the target dialogue data;
if the dialogue sentence is matched with any one keyword in the preset keyword list, judging whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a problem sentence;
and obtaining the problem data according to the determined problem statement.
6. The data mining method of claim 1, wherein the step of segmenting the problem data to obtain a word segmentation result composed of a plurality of segmented words comprises:
and segmenting the problem data according to a pre-trained scene word discovery model to obtain a segmentation result consisting of a plurality of segmentation words.
7. The data mining method of claim 6, wherein the scene word discovery model is trained by:
configuring a conditional random field algorithm CRF model;
and taking the historical dialogue data of each dialogue scene as a model input, taking a plurality of special scene words in the historical dialogue data of each dialogue scene as a model output, and iteratively training the CRF model to obtain the scene word discovery model.
8. The data mining method of claim 1, wherein the step of mining the set of frequent items from the constructed frequent pattern tree comprises:
constructing a condition pattern base of each frequent item in the constructed frequent pattern tree, and constructing a condition frequent pattern tree of the frequent item based on the constructed condition pattern base, wherein the condition pattern base is a path set of a plurality of prefix paths taking the frequent item as a suffix item and connected with the suffix item;
updating the frequent pattern tree based on each constructed conditional frequent pattern tree, continuously executing, based on the updated frequent pattern tree, the steps of constructing a conditional pattern base of each frequent item in the constructed frequent pattern tree and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, and outputting the frequent item corresponding to the conditional frequent pattern tree, until the constructed conditional frequent pattern tree is empty or contains only one path, so as to obtain a frequent item set;
and when the constructed conditional frequent pattern tree contains only one path, connecting each combination of nodes on the path with the prefix path of the conditional frequent pattern tree as a frequent item.
9. The data mining method of claim 8, wherein after the step of updating the frequent pattern tree based on each constructed conditional frequent pattern tree, the method further comprises:
and filtering out frequent items with the support degree lower than a preset support degree in the frequent pattern tree.
10. The data mining method of claim 1, wherein,
the step of merging the frequent items with the same meaning in the frequent item set based on the cosine distance between sentence vectors of the problem set related to any two frequent items in the frequent item set to obtain a merged frequent item set includes:
generating a problem set related to each frequent item;
calculating sentence vectors of problem sets related to each frequent item;
according to the calculated sentence vectors of the problem sets related to each frequent item, calculating cosine distances between the sentence vectors of the problem sets related to any two frequent items, and taking the cosine distances as the similarity between the sentence vectors of the problem sets related to the any two frequent items;
judging whether the similarity is larger than a preset similarity, and deleting any one of the two corresponding frequent items if the similarity is not larger than the preset similarity.
11. The data mining method of claim 10, wherein the step of computing sentence vectors for each frequent item-related problem set comprises:
aiming at each frequent item related problem set, respectively performing word segmentation processing on each problem sentence in the frequent item related problem set to obtain a plurality of segmented words;
inputting each word into a pre-trained word vector model fasttext model to obtain a word vector of each word;
and obtaining sentence vectors of the problem set related to the frequent item according to the word segmentation vector of each word segmentation.
12. The method according to claim 10, wherein after the step of merging frequent items with the same meaning in the set of frequent items to obtain the merged set of frequent items, the method further comprises:
acquiring and storing problem solving information of each problem in the problem set of each frequent item;
when a preset problem sent by a service requester terminal is received, matching the preset problem with each problem in a problem set of each frequent item, and sending problem solving information of the problem matched with the preset problem to the service requester terminal.
13. A data mining apparatus for application to a server, the apparatus comprising:
the first acquisition module is used for acquiring target dialogue data and extracting problem data from the target dialogue data;
the word segmentation module is used for segmenting the problem data to obtain a word segmentation result composed of a plurality of segmented words;
the mining module is used for counting the support degree of each word in the word segmentation result, sequentially inserting each word into a tree taking NULL as a root node according to the descending order of the support degree, constructing a frequent pattern tree, and mining a frequent item set from the constructed frequent pattern tree; the frequent pattern tree comprises a NULL root node and a branch node, wherein the NULL root node is an invalid value, the branch node corresponds to one frequent item and the support degree thereof, the support degree represents the frequency of occurrence of the word in the word segmentation result, the frequent item set comprises a plurality of frequent items, and each frequent item corresponds to a knowledge point of data mining;
the merging module is used for merging the frequent items with the same meaning in the frequent item set based on cosine distances between sentence vectors of any two frequent item related problem sets in the frequent item set to obtain a merged frequent item set;
The word segmentation module specifically obtains the word segmentation result composed of a plurality of segmented words by: segmenting the problem data according to a pre-configured scene word stock table to obtain a word segmentation result composed of a plurality of segmented words, wherein the scene word stock table comprises a plurality of special scene words related to a target service corresponding to the problem data.
14. The data mining apparatus of claim 13, wherein the first acquisition module acquires the target session data specifically by:
acquiring man-machine history conversations in each conversation scene from a history database;
and acquiring the history dialogue of the dialogue requester from the man-machine history dialogue of each dialogue scene as the target dialogue data.
15. The data mining apparatus of claim 13, wherein the first acquisition module extracts the problem data specifically by:
matching each dialogue sentence with each keyword in a preset keyword list aiming at each dialogue sentence in the target dialogue data;
if the dialogue sentence is matched with any one keyword in the preset keyword list, determining the dialogue sentence as a question sentence;
and obtaining the problem data according to the determined problem statement.
16. The data mining apparatus of claim 13, wherein the first acquisition module extracts the problem data specifically by:
judging whether the sentence length of each dialogue sentence is within a preset length range or not according to each dialogue sentence in the target dialogue data;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a problem sentence;
and obtaining the problem data according to the determined problem statement.
17. The data mining apparatus of claim 13, wherein the first acquisition module extracts the problem data specifically by:
matching each dialogue sentence with each keyword in a preset keyword list aiming at each dialogue sentence in the target dialogue data;
if the dialogue sentence is matched with any one keyword in the preset keyword list, judging whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a problem sentence;
and obtaining the problem data according to the determined problem statement.
18. The data mining apparatus of claim 13, wherein the word segmentation module may further obtain the word segmentation result composed of a plurality of segmented words in the following alternative way:
and segmenting the problem data according to a pre-trained scene word discovery model to obtain a segmentation result consisting of a plurality of segmentation words.
19. The data mining apparatus of claim 18, wherein the scene word discovery model is trained by:
configuring a conditional random field algorithm CRF model;
and taking the historical dialogue data of each dialogue scene as a model input, taking a plurality of special scene words in the historical dialogue data of each dialogue scene as a model output, and iteratively training the CRF model to obtain the scene word discovery model.
20. The data mining apparatus of claim 13, wherein the mining module specifically mines the set of frequent items from the constructed frequent pattern tree by:
constructing a condition pattern base of each frequent item in the constructed frequent pattern tree, and constructing a condition frequent pattern tree of the frequent item based on the constructed condition pattern base, wherein the condition pattern base is a path set of a plurality of prefix paths taking the frequent item as a suffix item and connected with the suffix item;
updating the frequent pattern tree based on each constructed conditional frequent pattern tree, continuously executing, based on the updated frequent pattern tree, the steps of constructing a conditional pattern base of each frequent item in the constructed frequent pattern tree and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, and outputting the frequent item corresponding to the conditional frequent pattern tree, until the constructed conditional frequent pattern tree is empty or contains only one path, so as to obtain a frequent item set;
and when the constructed conditional frequent pattern tree contains only one path, connecting each combination of nodes on the path with the prefix path of the conditional frequent pattern tree as a frequent item.
21. The data mining apparatus according to claim 20, wherein the mining module is further specifically configured to filter out frequent items in the frequent pattern tree with a support level lower than a preset support level.
22. The data mining apparatus according to claim 13, wherein the merging module obtains the merged frequent item set specifically by:
generating a problem set related to each frequent item;
calculating sentence vectors of problem sets related to each frequent item;
according to the calculated sentence vectors of the problem sets related to each frequent item, calculating cosine distances between the sentence vectors of the problem sets related to any two frequent items, and taking the cosine distances as the similarity between the sentence vectors of the problem sets related to the any two frequent items;
judging whether the similarity is larger than a preset similarity, and deleting any one of the two corresponding frequent items if the similarity is not larger than the preset similarity.
23. The data mining apparatus of claim 22, wherein the merging module calculates the sentence vector for each frequent item-related problem set by:
aiming at each frequent item related problem set, respectively performing word segmentation processing on each problem sentence in the frequent item related problem set to obtain a plurality of segmented words;
inputting each word into a pre-trained word vector model fasttext model to obtain a word vector of each word;
and obtaining sentence vectors of the problem set related to the frequent item according to the word segmentation vector of each word segmentation.
24. The data mining apparatus of claim 22, wherein the apparatus further comprises:
the second acquisition module is used for acquiring and storing the problem solving information of each problem in the problem set of each frequent item;
and the problem matching module is used for matching the preset problem with each problem in the problem set of each frequent item when receiving the preset problem sent by the service requester terminal, and sending the problem solving information of the problem matched with the preset problem to the service requester terminal.
25. A server, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, wherein, when the server runs, the processor communicates with the storage medium via the bus, and the processor executes the machine-readable instructions to perform the steps of the data mining method according to any one of claims 1-12.
26. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the data mining method according to any of claims 1-12.
CN201811526754.5A 2018-12-13 2018-12-13 Data mining method, device, server and readable storage medium Active CN111401388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811526754.5A CN111401388B (en) 2018-12-13 2018-12-13 Data mining method, device, server and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811526754.5A CN111401388B (en) 2018-12-13 2018-12-13 Data mining method, device, server and readable storage medium

Publications (2)

Publication Number Publication Date
CN111401388A CN111401388A (en) 2020-07-10
CN111401388B true CN111401388B (en) 2023-06-30

Family

ID=71428222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811526754.5A Active CN111401388B (en) 2018-12-13 2018-12-13 Data mining method, device, server and readable storage medium

Country Status (1)

Country Link
CN (1) CN111401388B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164401B (en) * 2020-09-18 2022-03-18 广州小鹏汽车科技有限公司 Voice interaction method, server and computer-readable storage medium
CN113157766A (en) * 2021-03-12 2021-07-23 Oppo广东移动通信有限公司 Application analysis method and device, electronic equipment and computer-readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480128A (en) * 2017-07-17 2017-12-15 广州特道信息科技有限公司 The segmenting method and device of Chinese text
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system
CN108132947A (en) * 2016-12-01 2018-06-08 百度在线网络技术(北京)有限公司 Entity digging system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649345A (en) * 2015-10-30 2017-05-10 微软技术许可有限责任公司 Automatic session creator for news

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108132947A (en) * 2016-12-01 2018-06-08 百度在线网络技术(北京)有限公司 Entity digging system and method
CN107480128A (en) * 2017-07-17 2017-12-15 广州特道信息科技有限公司 The segmenting method and device of Chinese text
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Huiping et al. Maximal frequent itemset mining algorithm based on FP-tree and support array. Systems Engineering and Electronics. 2005, Vol. 27, No. 9, pp. 1631-1635. *

Also Published As

Publication number Publication date
CN111401388A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN104102719B (en) The method for pushing and device of a kind of trace information
CN113657465A (en) Pre-training model generation method and device, electronic equipment and storage medium
CN111353092B (en) Service pushing method, device, server and readable storage medium
CN112487173B (en) Man-machine conversation method, device and storage medium
CN107194158A (en) A kind of disease aided diagnosis method based on image recognition
CN110009059B (en) Method and apparatus for generating a model
CN110555896B (en) Image generation method and device and storage medium
CN111753551B (en) Information generation method and device based on word vector generation model
CN113344089B (en) Model training method and device and electronic equipment
CN111401388B (en) Data mining method, device, server and readable storage medium
CN112037775B (en) Voice recognition method, device, equipment and storage medium
CN112418302A (en) Task prediction method and device
CN113408570A (en) Image category identification method and device based on model distillation, storage medium and terminal
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN111274348B (en) Service feature data extraction method and device and electronic equipment
CN110046571B (en) Method and device for identifying age
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN113766633A (en) Data processing method, data processing device, electronic equipment and storage medium
CN116363457B (en) Task processing, image classification and data processing method of task processing model
CN112860995A (en) Interaction method, device, client, server and storage medium
WO2024051146A1 (en) Methods, systems, and computer-readable media for recommending downstream operator
CN112989177A (en) Information processing method, information processing device, electronic equipment and computer storage medium
CN114969195B (en) Dialogue content mining method and dialogue content evaluation model generation method
CN116958033A (en) Abnormality detection method, model training method, device, equipment and medium
CN115239744A (en) Blood vessel segmentation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant