CN111401388A - Data mining method, device, server and readable storage medium - Google Patents


Info

Publication number
CN111401388A
Authority
CN
China
Prior art keywords
frequent
data
dialogue
sentence
frequent item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811526754.5A
Other languages
Chinese (zh)
Other versions
CN111401388B (en)
Inventor
吴康康
王鹏
柳俊宏
王杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811526754.5A priority Critical patent/CN111401388B/en
Publication of CN111401388A publication Critical patent/CN111401388A/en
Application granted granted Critical
Publication of CN111401388B publication Critical patent/CN111401388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The embodiments of the present application provide a data mining method, apparatus, server and readable storage medium. Target dialogue data is acquired, question data is extracted from the target dialogue data, and the question data is segmented to obtain a word segmentation result composed of a plurality of segmented words; a corresponding frequent pattern tree is then constructed according to the word segmentation result, and a frequent item set is mined from the constructed tree, wherein the frequent item set comprises a plurality of frequent items and each frequent item corresponds to a mined knowledge point. Knowledge points in single-turn dialogue can thus be mined accurately and comprehensively, greatly improving the efficiency and quality of knowledge point mining, solving user problems effectively and improving user satisfaction.

Description

Data mining method, device, server and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data mining method, apparatus, server, and readable storage medium.
Background
At present, with the popularization of intelligent terminals, applications (APPs) providing daily-life convenience services have proliferated, offering services such as ride-hailing and food delivery. When using these services, a user typically consults the robot in the customer service system through multi-turn dialogue, single-turn question answering (QA) or chit-chat to resolve a problem. Single-turn dialogue plays a major role in solving user problems and delivering intelligent services, and knowledge points are its most important component. How to mine the knowledge points in single-turn dialogue accurately and comprehensively, so as to solve user problems more effectively and improve user satisfaction, is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a data mining method, apparatus, server and readable storage medium that mine the knowledge points in single-turn dialogue accurately and comprehensively, so as to solve user problems more effectively and improve user satisfaction.
According to an aspect of the embodiments of the present application, there is provided an electronic device that may include one or more storage media and one or more processors in communication with the storage media. The one or more storage media store machine-readable instructions executable by the processor. When the electronic device operates, the processor communicates with the storage media through a bus and executes the machine-readable instructions to perform the data mining method.
According to another aspect of the embodiments of the present application, there is provided a data mining method applied to a server, where the method may include:
acquiring target dialogue data and extracting question data from the target dialogue data;
performing word segmentation on the question data to obtain a word segmentation result composed of a plurality of segmented words;
and constructing a corresponding frequent pattern tree according to the word segmentation result, and mining a frequent item set from the constructed frequent pattern tree, wherein the frequent item set comprises a plurality of frequent items, and each frequent item corresponds to a mined knowledge point.
In a possible implementation manner, the step of acquiring target dialogue data includes:
acquiring human-machine historical dialogues in each dialogue scenario from a history database;
and acquiring, from the human-machine historical dialogues of each dialogue scenario, the historical dialogue of the dialogue requester as the target dialogue data.
In a possible implementation, the step of extracting question data from the target dialogue data includes:
matching, for each dialogue sentence in the target dialogue data, the dialogue sentence against each keyword in a preset keyword table;
if the dialogue sentence matches any keyword in the preset keyword table, determining the dialogue sentence to be a question sentence;
and obtaining the question data from the determined question sentences.
In a possible implementation, the step of extracting question data from the target dialogue data includes:
judging, for each dialogue sentence in the target dialogue data, whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence to be a question sentence;
and obtaining the question data from the determined question sentences.
In a possible implementation, the step of extracting question data from the target dialogue data includes:
matching, for each dialogue sentence in the target dialogue data, the dialogue sentence against each keyword in a preset keyword table;
if the dialogue sentence matches any keyword in the preset keyword table, judging whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence to be a question sentence;
and obtaining the question data from the determined question sentences.
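The combined keyword-and-length filter described in the steps above can be sketched as follows. The keyword table and the length bounds are illustrative assumptions, not values taken from the application.

```python
# Sketch of the question-extraction step: a dialogue sentence is kept as
# a question only if it matches a keyword from a preset keyword table
# AND its length falls inside a preset range. Keywords and bounds here
# are invented for illustration.

KEYWORDS = {"how", "why", "cannot", "refund", "password"}
MIN_LEN, MAX_LEN = 2, 30  # preset length range, in tokens

def extract_questions(dialogue_sentences):
    """Return the sentences judged to be user questions."""
    questions = []
    for sentence in dialogue_sentences:
        tokens = sentence.lower().split()
        # Step 1: keyword match against the preset keyword table.
        if not any(tok.strip("?.,") in KEYWORDS for tok in tokens):
            continue
        # Step 2: sentence length must be within the preset range.
        if MIN_LEN <= len(tokens) <= MAX_LEN:
            questions.append(sentence)
    return questions

dialogue = [
    "Hello",
    "How do I change my login password?",
    "Thanks, that worked",
    "Why was my refund rejected?",
]
print(extract_questions(dialogue))
```

Either test (keyword match or length check) can also be used alone, as in the two simpler implementations above.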
In a possible implementation manner, the step of performing word segmentation on the question data to obtain a word segmentation result composed of a plurality of segmented words includes:
performing word segmentation on the question data according to a pre-configured scene lexicon table to obtain a word segmentation result composed of a plurality of segmented words, wherein the scene lexicon table comprises a plurality of scenario-specific words related to the target service corresponding to the question data; or
segmenting the question data according to a pre-trained scene word discovery model to obtain a word segmentation result composed of a plurality of segmented words.
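The lexicon-driven variant can be illustrated with simple forward maximum matching against the scene lexicon, so that multi-token scenario-specific words survive segmentation intact. The lexicon entries below are invented examples standing in for the pre-configured scene lexicon table.

```python
# Illustrative dictionary-driven word segmentation: forward maximum
# matching over a scenario lexicon. A production system would load the
# pre-configured scene lexicon described above.

SCENE_LEXICON = {"login password", "withdrawal password", "change", "reset"}
MAX_WORD_LEN = 2  # longest lexicon entry, in tokens

def segment(sentence):
    tokens = sentence.lower().split()
    result, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multi-token scene words
        # ("login password") win over their single-token pieces.
        for n in range(min(MAX_WORD_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in SCENE_LEXICON:
                result.append(candidate)
                i += n
                break
    return result

print(segment("change login password"))
```

Single tokens are always accepted as a fallback, so out-of-lexicon words pass through unsplit.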
In one possible implementation, the scene word discovery model is trained by:
configuring a conditional random field (CRF) model;
and taking the historical dialogue data of each dialogue scenario as the model input, taking a plurality of scenario-specific words in that historical dialogue data as the model output, and iteratively training the CRF model to obtain the scene word discovery model.
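Training such a CRF requires each historical dialogue sentence to be paired with a label sequence marking the scenario-specific words, typically in BIO form. The sketch below shows only this (pure-Python) data-preparation half; the labelled sequences would then feed a CRF trainer such as sklearn-crfsuite or CRF++. Sentences and scene words are invented examples.

```python
# Prepare BIO label sequences for CRF training: B marks the first token
# of a scene word, I marks its continuation, O marks everything else.

def bio_tags(tokens, scene_words):
    """Label each token B/I/O with respect to the known scene words."""
    tags = ["O"] * len(tokens)
    for word in scene_words:
        parts = word.split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                tags[i] = "B"
                for j in range(1, len(parts)):
                    tags[i + j] = "I"
    return tags

tokens = "how to reset withdrawal password".split()
print(bio_tags(tokens, ["withdrawal password"]))
```

At inference time the trained model predicts B/I spans on unseen sentences, and those spans become newly discovered scene words.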
In a possible implementation manner, the step of constructing a corresponding frequent pattern tree according to the word segmentation result, and mining a frequent item set from the constructed frequent pattern tree includes:
counting the support of each segmented word in the word segmentation result, wherein the support represents the number of times the word occurs in the word segmentation result;
inserting each segmented word, in descending order of support, into a tree whose root node is NULL, thereby constructing a frequent pattern tree, wherein the frequent pattern tree comprises the NULL root node (an empty value) and branch nodes, and each branch node corresponds to a frequent item and its support;
and mining a frequent item set from the constructed frequent pattern tree.
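The construction step can be sketched compactly: count support for each segmented word, discard infrequent ones, sort each transaction by descending support, and insert it into a tree rooted at a NULL node. The transactions and the minimum-support value are illustrative.

```python
# Minimal FP-tree construction sketch (data and min_support invented).
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support=2):
    support = Counter(t for tx in transactions for t in set(tx))
    root = Node(None, None)  # NULL root: holds no item
    for tx in transactions:
        # Keep frequent items only, ordered by descending support
        # (ties broken alphabetically for determinism).
        items = sorted((t for t in set(tx) if support[t] >= min_support),
                       key=lambda t: (-support[t], t))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, support

transactions = [
    ["change", "password"],
    ["reset", "password"],
    ["change", "password", "fail"],
]
root, support = build_fp_tree(transactions)
print(support["password"], root.children["password"].count)
```

Because every transaction is inserted in the same global order, transactions sharing a frequent prefix share a path, which is what keeps the tree compact.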
In a possible implementation, the step of mining a frequent item set from the constructed frequent pattern tree includes:
constructing, for each frequent item in the constructed frequent pattern tree, the conditional pattern base of that frequent item, and constructing the conditional frequent pattern tree of the frequent item from the constructed conditional pattern base, wherein the conditional pattern base is the set of prefix paths that end at the frequent item, the frequent item being their common suffix;
updating the frequent pattern tree based on each constructed conditional frequent pattern tree, and repeating, on the updated tree, the steps of constructing the conditional pattern base and the conditional frequent pattern tree of each frequent item, until a constructed conditional frequent pattern tree is empty or contains only a single path, and then outputting the frequent items corresponding to that conditional frequent pattern tree to obtain the frequent item set;
wherein, when a constructed conditional frequent pattern tree contains only a single path, every combination of the nodes on that path, concatenated with the suffix item, is output as a frequent item.
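The mining loop above can be sketched end to end. For brevity this sketch represents each conditional pattern base directly as a list of prefix transactions rather than materialising the conditional tree; the recursion bottoms out exactly when the conditional base becomes empty, and it yields the same frequent itemsets. Data and the minimum support are invented.

```python
# Compact FP-Growth-style mining via conditional pattern bases.
from collections import Counter

def fp_growth(transactions, min_support=2, suffix=frozenset()):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    support = Counter(t for tx in transactions for t in set(tx))
    results = {}
    for item, count in support.items():
        if count < min_support:
            continue
        itemset = frozenset(suffix | {item})
        results[itemset] = count
        # Conditional pattern base: the transactions containing `item`,
        # restricted to items ranked strictly before it in descending
        # support order (this avoids emitting the same itemset twice).
        cond = [[t for t in set(tx)
                 if support[t] > count or (support[t] == count and t < item)]
                for tx in transactions if item in tx]
        results.update(fp_growth(cond, min_support, itemset))
    return results

questions = [
    ["change", "password"],
    ["reset", "password"],
    ["change", "password", "fail"],
]
for itemset, sup in fp_growth(questions).items():
    print(sorted(itemset), sup)
```

Each recursive call plays the role of one conditional frequent pattern tree: its suffix accumulates the item stack, and its transactions are the prefix paths of the conditional pattern base.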
In a possible implementation, after the step of updating the frequent pattern tree based on each constructed conditional frequent pattern tree, the method may further include:
and filtering out, from the frequent pattern tree, frequent items whose support is lower than a preset support threshold.
In a possible implementation manner, after the step of constructing a corresponding frequent pattern tree according to the word segmentation result and mining a frequent item set from the constructed frequent pattern tree, the method may further include:
merging the frequent items with the same meaning in the frequent item set to obtain a merged frequent item set;
wherein the step of merging the frequent items with the same meaning in the frequent item set to obtain a merged frequent item set includes:
generating a question set related to each frequent item;
calculating a sentence vector of the question set related to each frequent item;
calculating, for any two frequent items, the cosine distance between the sentence vectors of their related question sets, and taking that cosine distance as the similarity between the two sentence vectors;
and judging whether the similarity is greater than a preset similarity threshold, and if the similarity is greater than the preset similarity threshold, deleting either one of the two corresponding frequent items.
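The merging rule can be sketched as follows: compute the cosine similarity between two frequent items' sentence vectors and drop one of the pair when the similarity exceeds the preset threshold. The vectors, item names and threshold are invented for illustration.

```python
# Deduplicate frequent items whose question-set sentence vectors are
# nearly identical (similarity above a preset threshold).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

THRESHOLD = 0.9  # preset similarity threshold; an assumed value

item_vectors = {
    "change login password": [0.90, 0.10, 0.00],
    "modify login password": [0.88, 0.12, 0.01],
    "withdrawal failed":     [0.05, 0.20, 0.95],
}

kept = []
for item, vec in item_vectors.items():
    # Keep the item only if it is not too similar to any item kept so far.
    if all(cosine_similarity(vec, item_vectors[k]) <= THRESHOLD for k in kept):
        kept.append(item)
print(kept)
```

Here "change login password" and "modify login password" collapse into one knowledge point, while "withdrawal failed" survives as a distinct one.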
In one possible embodiment, the step of calculating a sentence vector of the question set related to each frequent item includes:
performing, for the question set related to each frequent item, word segmentation on each question sentence in that set to obtain a plurality of segmented words;
inputting each segmented word into a pre-trained fastText word vector model to obtain a word vector for each segmented word;
and obtaining the sentence vector of the question set related to the frequent item from the word vectors of the segmented words.
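One common way to realise the last step is averaging the per-token word vectors over every question in the set; the tiny vector table below is an invented stand-in for a trained fastText model, and averaging is one plausible aggregation, not necessarily the one the application intends.

```python
# Sentence vector of a question set = mean of the word vectors of all
# in-vocabulary tokens across its questions. Vectors are invented.

WORD_VECTORS = {          # stand-in for a pre-trained fastText model
    "change":   [1.0, 0.0],
    "password": [0.0, 1.0],
    "reset":    [0.8, 0.2],
}

def sentence_vector(questions):
    """Average the word vectors of every token in every question."""
    vecs = [WORD_VECTORS[tok]
            for q in questions for tok in q.split()
            if tok in WORD_VECTORS]
    dim = len(next(iter(WORD_VECTORS.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(sentence_vector(["change password", "reset password"]))
```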
In a possible implementation manner, after the step of merging the frequent items with the same meaning in the frequent item set to obtain a merged frequent item set, the method may further include:
acquiring and storing solution information for each question in the question set of each frequent item;
and, when a question sent by a service requester terminal is received, matching that question against each question in the question set of each frequent item, and sending the solution information of the matched question to the service requester terminal.
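The serving step reduces to a lookup from a matched stored question to its stored solution information. Exact-match lookup below stands in for whatever matcher a deployment actually uses, and the questions and answers are invented.

```python
# Match an incoming user question against the stored question sets and
# return the stored solution information (illustrative data).

SOLUTIONS = {
    "how to change login password":
        "Open Settings > Account > Login password and follow the prompts.",
    "how to change withdrawal password":
        "Open Wallet > Security > Withdrawal password to reset it.",
}

FALLBACK = "Transferring you to a human agent."

def answer(user_question):
    """Return stored solution info for a matched question, else fall back."""
    return SOLUTIONS.get(user_question.strip().lower(), FALLBACK)

print(answer("How to change login password"))
```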
According to another aspect of the embodiments of the present application, there is provided a data mining apparatus applied to a server, the apparatus including:
the first acquisition module is configured to acquire target dialogue data and extract question data from the target dialogue data;
the word segmentation module is configured to segment the question data to obtain a word segmentation result composed of a plurality of segmented words;
and the mining module is configured to construct a corresponding frequent pattern tree according to the word segmentation result and mine a frequent item set from the constructed frequent pattern tree, wherein the frequent item set comprises a plurality of frequent items, and each frequent item corresponds to a mined knowledge point.
According to another aspect of embodiments of the present application, there is provided a readable storage medium, on which a computer program is stored, which, when executed by a processor, is capable of performing the steps of the data mining method described above.
Based on any one of the above aspects, in the embodiments of the present application, target dialogue data is acquired, question data is extracted from it, and the question data is segmented to obtain a word segmentation result composed of a plurality of segmented words; a corresponding frequent pattern tree is then constructed according to the word segmentation result, and a frequent item set is mined from the constructed tree, wherein the frequent item set comprises a plurality of frequent items and each frequent item corresponds to a mined knowledge point. Knowledge points in single-turn dialogue can thus be mined accurately and comprehensively, greatly improving the efficiency and quality of knowledge point mining, solving user problems effectively and improving user satisfaction.
In order to make the aforementioned objects, features and advantages of the embodiments of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 illustrates an interactive schematic block diagram of a data mining system provided by an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of exemplary hardware and software components of an electronic device that may implement the server, the service requester terminal, and the service provider terminal of FIG. 1 provided by an embodiment of the present application;
FIG. 3 is a flow chart illustrating a data mining method provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a data mining method provided by an embodiment of the present application;
fig. 5 is a flowchart illustrating a sentence vector calculation method provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart diagram illustrating a data mining method according to an embodiment of the present application;
FIG. 7 is a functional block diagram of a data mining device provided by an embodiment of the present application;
FIG. 8 is a block diagram of another functional module of a data mining device provided by an embodiment of the present application;
fig. 9 shows another functional block diagram of a data mining device provided in an embodiment of the present application.
Detailed Description
In order to make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. It should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application; additionally, the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of a flowchart may be performed out of order, and that steps without a logical dependency may be performed in reverse order or simultaneously. Moreover, under the guidance of this application, one skilled in the art may add one or more other operations to a flowchart or remove one or more operations from it.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
To enable those skilled in the art to utilize the present disclosure, the following embodiments are presented in conjunction with a specific application scenario, an online ride-hailing scenario. It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the application. Although the present application is described primarily in the context of an online ride-hailing scenario, it should be understood that this is merely one exemplary embodiment; the application can be applied to any other traffic type. For example, the present application may be applied to different transportation system environments, including land, sea or air environments, or any combination thereof. The vehicle of the transportation system may include a taxi, a private car, a hitch ride, a bus, a train, a bullet train, a high-speed rail, a subway, a ship, an airplane, a spacecraft, a hot air balloon, an unmanned vehicle, or the like, or any combination thereof. The application can also cover any service system other than online ride-hailing, for example, a system for sending and/or receiving express deliveries, or a service system for transactions between buyers and sellers. Applications of the system or method of the present application may include web pages, browser plug-ins, client terminals, customization systems, internal analysis systems, artificial intelligence robots, or the like, or any combination thereof.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
The terms "passenger," "requestor," "service person," "service requestor," and "customer" are used interchangeably in this application to refer to an individual, entity, or tool that can request or order a service. The terms "driver," "provider," "service provider," and "provider" are used interchangeably in this application to refer to an individual, entity, or tool that can provide a service. The term "user" in this application may refer to an individual, entity or tool that requests a service, subscribes to a service, provides a service, or facilitates the provision of a service. For example, the user may be a passenger, a driver, an operator, etc., or any combination thereof. In the present application, "passenger" and "passenger terminal" may be used interchangeably, and "driver" and "driver terminal" may be used interchangeably.
As noted in the background, many existing knowledge points do not carry a single meaning. Take the knowledge point "how to change the password" as an example: it is not a single-meaning knowledge point. In some scenarios, many users consult both "how to modify the login password" and "how to modify the withdrawal password"; the two passwords are not modified in the same way, so mapping both questions to the same knowledge point is inappropriate. Therefore, for existing knowledge points with ambiguous meaning, how to analyse and mine the historical dialogues of the related scenarios to obtain several knowledge points each with a single meaning, and to configure a corresponding answer or response script for each such knowledge point, so as to improve the accuracy of single-turn answers and in turn improve user satisfaction, the intelligent resolution rate and the share of intelligent service, is a major problem in the field.
The inventors of the present application found that knowledge point mining is currently performed mainly by human operators. In manual mining, a person reads the historical dialogue data related to a certain knowledge point and summarizes and generalizes the users' questions into several different knowledge points, which may or may not have existed before. Manual mining, however, suffers from several problems. First, different people understand the same user question differently, so in the manual process user questions cannot be summarized accurately, the quality of knowledge point splitting is low, and the mined knowledge points remain ambiguous. Second, manual mining consumes a large amount of human resources and is very inefficient.
To address this, another current approach attempts to cluster the user questions of the related scenarios with the k-means method: historical dialogue data of a relevant scenario is taken as input, an embedded representation of each user question is obtained by training a deep learning model, and the embeddings are finally clustered with the k-means algorithm. This scheme saves considerable human resources and improves working efficiency. However, for a given scenario, the semantics of different user questions in the related historical dialogue data are very similar, so their embedded representations are poorly separable; the k-means clustering result is therefore unsatisfactory, sometimes even worse than manual mining.
In view of the above technical problems, embodiments of the present application provide a data mining method, apparatus, server and readable storage medium, in which target dialogue data is acquired, question data is extracted from it and segmented into words, a corresponding frequent pattern tree is constructed from the word segmentation result, and a frequent item set is mined from the tree, each frequent item corresponding to a mined knowledge point. In this way, the knowledge points in single-turn dialogue are mined accurately and comprehensively, user problems are solved more effectively, and user satisfaction is improved.
FIG. 1 is an architectural diagram of a data mining system 100 provided in an alternative embodiment of the present application. For example, the data mining system 100 may be an online transportation service platform relied upon for transportation services such as taxi service, designated drive service, express service, carpool service, bus service, driver rental service, or regular service, or a combination of any of the above. The data mining system 100 may include a server 110, a network 120, a service requester terminal 130, a service provider terminal 140, and a database 150, and the server 110 may include a processor therein that performs instruction operations. The data mining system 100 shown in FIG. 1 is only one possible example, and in other possible embodiments, the data mining system 100 may include only some of the components shown in FIG. 1 or may include other components.
In some embodiments, the server 110 may be a single server or a group of servers. The set of servers can be centralized or distributed (e.g., the servers 110 can be a distributed system). In some embodiments, the server 110 may be local or remote to the terminal. For example, the server 110 may access information stored in the service requester terminal 130, the service provider terminal 140, and the database 150, or any combination thereof, via the network 120. As another example, the server 110 may be directly connected to at least one of the service requester terminal 130, the service provider terminal 140, and the database 150 to access information and/or data stored therein. In some embodiments, the server 110 may be implemented on a cloud platform; by way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud (community cloud), a distributed cloud, an inter-cloud, a multi-cloud, and the like, or any combination thereof. In some embodiments, the server 110 may be implemented on an electronic device 200 having one or more of the components shown in FIG. 2 in the present application.
In some embodiments, the server 110 may include a processor that processes information and/or data related to a service request to perform one or more of the functions described herein. For example, in an express service, the processor may determine the target vehicle based on a service request obtained from the service requester terminal 130. The processor may include one or more processing cores (e.g., a single-core or multi-core processor). Merely by way of example, the processor may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof.
In some embodiments, the network 120 may be any type of wired or wireless network, or a combination thereof. Merely by way of example, the network 120 may include a wired network, a wireless network, a fiber-optic network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof.
In some embodiments, the user of the service requestor terminal 130 may be someone other than the actual demander of the service. For example, the user a of the service requester terminal 130 may use the service requester terminal 130 to initiate a service request for the service actual demander B (for example, the user a may call a car for his friend B), or receive service information or instructions from the server 110. In some embodiments, the user of the service provider terminal 140 may be the actual provider of the service or may be another person than the actual provider of the service. For example, user C of the service provider terminal 140 may use the service provider terminal 140 to receive a service request serviced by the service provider entity D (e.g., user C may pick up an order for driver D employed by user C), and/or information or instructions from the server 110. In some embodiments, "service requester" and "service requester terminal" may be used interchangeably, and "service provider" and "service provider terminal" may be used interchangeably.
In some embodiments, the service requester terminal 130 may comprise a mobile device, a tablet computer, a laptop computer, or a built-in device in a motor vehicle, etc., or any combination thereof. In some embodiments, the mobile device may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device for a smart electrical device, a smart monitoring device, a smart television, a smart camera, or a walkie-talkie, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, smart footwear, smart glasses, a smart helmet, a smart watch, smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a Personal Digital Assistant (PDA), a gaming device, a navigation device, or a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality eyeshade, an augmented reality helmet, augmented reality glasses, an augmented reality eyeshade, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include various virtual reality products and the like. In some embodiments, the built-in devices in the motor vehicle may include an on-board computer, an on-board television, and the like.
The database 150 may store data and/or instructions. In some embodiments, the database 150 may store data obtained from the service requester terminal 130 and/or the service provider terminal 140. In some embodiments, the database 150 may store data and/or instructions for the exemplary methods described herein. In some embodiments, the database 150 may include mass storage, removable storage, volatile read-write memory, or Read-Only Memory (ROM), among others, or any combination thereof. By way of example, mass storage may include magnetic disks, optical disks, solid state drives, and the like; removable storage may include flash drives, floppy disks, optical disks, memory cards, zip disks, tapes, and the like; volatile read-write memory may include Random Access Memory (RAM); the RAM may include Dynamic RAM (DRAM), Double Data Rate Synchronous Dynamic RAM (DDR SDRAM), Static RAM (SRAM), Thyristor-based Random Access Memory (T-RAM), Zero-capacitor RAM (Z-RAM), and the like. By way of example, the ROM may include Mask ROM (MROM), Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), Compact Disc ROM (CD-ROM), Digital Versatile Disc ROM (DVD-ROM), and the like. In some embodiments, the database 150 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, a database 150 may be connected to the network 120 to communicate with one or more components in the data mining system 100 (e.g., the server 110, the service requester terminal 130, the service provider terminal 140, etc.). One or more components in the data mining system 100 may access data or instructions stored in the database 150 via the network 120. In some embodiments, the database 150 may be directly connected to one or more components in the data mining system 100 (e.g., the server 110, the service requestor terminal 130, the service provider terminal 140, etc.); alternatively, in some embodiments, database 150 may also be part of server 110.
In some embodiments, one or more components (e.g., server 110, service requestor terminal 130, service provider terminal 140, etc.) in the data mining system 100 may have access to the database 150. In some embodiments, one or more components in the data mining system 100 may read and/or modify information related to a service requestor, a service provider, or the public, or any combination thereof, when certain conditions are met. For example, server 110 may read and/or modify information for one or more users after receiving a service request.
Fig. 2 illustrates a schematic diagram of exemplary hardware and software components, provided by some embodiments of the present application, of an electronic device 200 that may implement the server 110, the service requester terminal 130, or the service provider terminal 140 and the concepts of the present application. For example, the processor 220 may be used in the electronic device 200 to perform the functions described herein.
The electronic device 200 may be a general purpose computer or a special purpose computer, both of which may be used to implement the data mining methods of the present application. Although only a single computer is shown, for convenience, the functions described herein may be implemented in a distributed fashion across multiple similar platforms to balance processing loads.
For example, the electronic device 200 may include a network port 210 connected to a network, one or more processors 220 for executing program instructions, a communication bus 230, and a different form of storage medium 240, such as a disk, ROM, or RAM, or any combination thereof. Illustratively, the computer platform may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The method of the present application may be implemented in accordance with these program instructions. The electronic device 200 also includes an Input/Output (I/O) interface 250 between the computer and other Input/Output devices (e.g., keyboard, display screen).
For ease of illustration, only one processor is depicted in the electronic device 200. However, it should be noted that the electronic device 200 in the present application may also comprise a plurality of processors, and thus steps described in the present application as performed by one processor may also be performed jointly or individually by a plurality of processors. For example, if the processor of the electronic device 200 executes steps A and B, it should be understood that steps A and B may also be executed by two different processors, either jointly or separately; for example, a first processor performs step A and a second processor performs step B, or the first processor and the second processor perform steps A and B together.
Fig. 3 illustrates a flow diagram of a data mining method provided by some embodiments of the present application, which may be performed by the server 110 shown in fig. 1. It should be understood that, in other embodiments, the order of some steps in the data mining method of this embodiment may be interchanged according to actual needs, or some steps may be omitted or deleted. The detailed steps of the data mining method are described below.
Step S110, acquiring target dialogue data and extracting question data from the target dialogue data.
As a possible implementation, the human-computer history dialog in each dialog scene may be obtained from the history database, and the history dialog of the dialog requester may be obtained from the human-computer history dialog in each dialog scene as the target dialog data. For example, in each dialog scenario (e.g., a trip question dialog scenario, a take-away question dialog scenario, etc.), each time a dialog requester (e.g., a passenger, a driver) makes a single round of dialog with the server 110 through the service requester terminal 130, a man-machine history dialog of the dialog requester with the server 110 may be saved, and replies of the system and the customer service may be filtered out from the man-machine history dialog, and only the history dialog of the dialog requester may be retained as target dialog data.
Thus, by selecting only the dialog requester's history from the human-machine history dialogs as the target dialog data, rather than using all of the human-machine history dialogs, the present embodiment can effectively reduce the amount of computation in the data mining process.
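As a minimal illustration of this filtering step (the role labels and the dialog data layout here are assumptions for the sketch, not part of the original system):

```python
# Keep only the dialog requester's utterances from human-machine history
# dialogs; system and customer-service replies are filtered out.
# The "role" labels used here are illustrative assumptions.
def extract_target_dialog_data(history_dialogs):
    return [
        turn["text"]
        for dialog in history_dialogs
        for turn in dialog
        if turn["role"] == "requester"
    ]

history = [
    [{"role": "requester", "text": "why did my service score drop"},
     {"role": "system", "text": "please wait"},
     {"role": "agent", "text": "let me check"}],
    [{"role": "requester", "text": "I forgot my password"}],
]
target = extract_target_dialog_data(history)
```

Filtering before mining, rather than after, keeps every later stage (segmentation, FP-tree construction) working on a much smaller corpus.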
In addition, the inventor found through careful study that not all of the dialog requester's historical dialogues are meaningful question sentences. To obtain truly meaningful question sentences, this embodiment also needs to filter meaningless or uninformative dialogue sentences out of the target dialogue data. How to extract question data from the target dialogue data will be described in detail below with reference to several examples.
For example, for each dialog statement in the target dialog data, the dialog statement may be matched with each keyword in the preset keyword table, and if the dialog statement is matched with any keyword in the preset keyword table, the dialog statement may be determined as a question statement, and question data may be obtained according to the determined question statement.
In this embodiment, high-frequency words of problems related to each dialogue scene may be selected to form the preset keyword table. For example, for questions asked by most dialogue requesters, the corresponding dialogue sentences may include high-frequency words such as "how", "why", "forgotten", "password", "ask" and "unable". Therefore, if a dialogue sentence contains any of these high-frequency words, the dialogue sentence can be determined as a question sentence.
For another example, it may be determined whether the term length of each dialogue term in the target dialogue data is within a preset length range, and if the term length of the dialogue term is within the preset length range, the dialogue term is determined as a question term, and question data is obtained from the determined question term.
In this embodiment, the preset length range may be designed according to the actual situation; for example, it may be limited to 4 to 20 characters. That is, if a dialogue sentence is shorter than 4 characters, it is considered an invalid sentence; if it is longer than 20 characters, the sentence is generally considered to express its meaning clearly enough that knowledge point mining is not required. Thus, if the sentence length of a dialogue sentence is within the preset length range, the dialogue sentence can be determined as a question sentence.
It should be noted that, in practical implementation, the above two exemplary ways of determining question sentences may be used alternatively or simultaneously. They may also be combined to determine question sentences: for each dialogue sentence in the target dialogue data, the dialogue sentence is matched against each keyword in the preset keyword table; if the dialogue sentence matches any keyword in the preset keyword table, it is then judged whether the sentence length of the dialogue sentence is within the preset length range. If the sentence length is within the preset length range, the dialogue sentence is determined as a question sentence, and the question data is finally obtained from the determined question sentences. Judging each dialogue sentence in the target dialogue data by both the preset keyword table and the preset length range can significantly improve the accuracy of question sentence extraction.
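The combined filter described above can be sketched as follows; the English keywords and the 4-20 character bounds are illustrative values taken from the examples in the text:

```python
# Combined question-sentence filter: keep a dialogue sentence only if it
# contains a preset keyword AND its length lies in the preset range.
# The keyword list and bounds are illustrative stand-ins.
KEYWORDS = ["how", "why", "forgot", "password", "ask", "unable"]
MIN_LEN, MAX_LEN = 4, 20  # characters, per the example range in the text

def is_question(sentence):
    has_keyword = any(kw in sentence for kw in KEYWORDS)
    return has_keyword and MIN_LEN <= len(sentence) <= MAX_LEN

def extract_question_data(sentences):
    return [s for s in sentences if is_question(s)]

questions = extract_question_data(
    ["why no pay", "ok", "forgot password", "thanks a lot"])
```

Here "ok" fails the length check and "thanks a lot" fails the keyword check, so only the two genuine questions survive.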
And step S120, performing word segmentation on the question data to obtain a word segmentation result consisting of a plurality of words.
The inventor of the present application also found that a large number of special scene vocabularies exist in current question sentences. For example, many users may ask "why did my service score drop". A traditional word segmentation method may produce the segmentation (service, score, drop); but "service score" is clearly a special scene word, and the correct segmentation result should be (service score, drop). That is, for the special scene word "service score", a generic word segmentation algorithm will usually split it into two words, "service" and "score", because "service score" is a word unique to a specific scenario rather than a common word. A large number of such special scene words exist in the historical dialogue database; if a special scene word fails to stay intact during word segmentation and is split into several words, the subsequent knowledge point mining effect is directly degraded, which in turn affects the classification analysis of knowledge points.
Based on the above technical problem, the inventors of the present application have made extensive studies and then have proposed the following solutions to solve the above problems, and the following description will be made in conjunction with two exemplary embodiments.
The first implementation: the question data may be segmented according to a pre-configured scene lexicon table to obtain a word segmentation result composed of a plurality of segmented words, wherein the scene lexicon table includes a plurality of special scene words related to the target service corresponding to the question data.
The second embodiment: and performing word segmentation on the problem data according to a pre-trained scene word discovery model to obtain a word segmentation result consisting of a plurality of words. The scene word discovery model is obtained by training in the following mode:
First, a Conditional Random Field (CRF) model is configured. Then, the historical dialogue data of each dialogue scene is taken as the model input, a plurality of special scene words in the historical dialogue data of each dialogue scene are taken as the model output, and the CRF model is iteratively trained to obtain the scene word discovery model.
Therefore, a scene lexicon table specific to each dialogue scene is generated by offline training a CRF model on a large amount of historical dialogue data, and Chinese word segmentation is then performed with reference to the scene lexicon table; this effectively avoids the situation where a special scene word is split apart during word segmentation and the subsequent knowledge point mining effect is impaired.
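As a greatly simplified stand-in for lexicon-aware segmentation (a greedy longest-match over a scene word table; the actual system trains a CRF-based scene word discovery model and segments Chinese text), the idea that multi-word scene terms must stay intact can be illustrated as:

```python
# Greedy longest-match segmentation against a scene lexicon, so that
# multi-word scene terms such as "service score" stay intact.
# This is a simplified illustration, not the CRF-based model itself.
SCENE_LEXICON = {"service score"}

def segment(sentence, lexicon=SCENE_LEXICON):
    tokens, words = [], sentence.split()
    i = 0
    while i < len(words):
        # try the longest multi-word lexicon match starting at position i
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if j - i > 1 and candidate in lexicon:
                tokens.append(candidate)
                i = j
                break
        else:
            tokens.append(words[i])
            i += 1
    return tokens
```

With this lexicon, `segment("why my service score drop")` keeps "service score" as one token instead of splitting it into "service" and "score".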
And S130, constructing a corresponding frequent pattern tree according to the word segmentation result, and mining a frequent item set from the constructed frequent pattern tree.
As a possible implementation manner, the present embodiment first counts the support degree of each participle in the participle result, wherein the support degree represents the number of times the participle appears in the participle result.
Then, each participle is sequentially inserted, in descending order of support degree, into a tree with NULL as the root node to construct a frequent pattern tree, wherein the frequent pattern tree comprises a NULL root node and branch nodes; the NULL root node is an invalid value, and each branch node corresponds to a frequent item and its support degree.
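The two construction steps above (counting support, then inserting each transaction in descending-support order under a NULL root) can be sketched as follows; the transaction layout and the alphabetical tie-breaking rule are assumptions of this sketch:

```python
from collections import Counter

class FPNode:
    """Node of the frequent pattern tree; item=None marks the NULL root."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support=1):
    # Step 1: count the support of each word across all segmentation results.
    support = Counter(w for t in transactions for w in t)
    # Step 2: insert each transaction with its words ordered by descending
    # support (ties broken alphabetically) under the NULL root.
    root = FPNode()
    for t in transactions:
        ordered = sorted((w for w in t if support[w] >= min_support),
                         key=lambda w: (-support[w], w))
        node = root
        for w in ordered:
            node = node.children.setdefault(w, FPNode(w))
            node.count += 1
    return root, support

transactions = [["service score", "drop", "why"],
                ["service score", "drop", "how"],
                ["password", "forgot"]]
root, support = build_fp_tree(transactions)
```

Because shared high-support prefixes collapse into one path, the first two transactions share the `drop -> service score` branch with count 2.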
Optionally, mining the frequent item set from the constructed frequent pattern tree may be implemented as follows:
First, for each frequent item in the constructed frequent pattern tree, a conditional pattern base of the frequent item is constructed, and the conditional frequent pattern tree of the frequent item is constructed based on the conditional pattern base, wherein the conditional pattern base is the path set of prefix paths that take the frequent item as the suffix item.
Then, the frequent pattern tree is updated based on each constructed conditional frequent pattern tree, and, based on the updated frequent pattern tree, the steps of constructing a conditional pattern base for each frequent item and constructing the conditional frequent pattern tree of that frequent item from the constructed conditional pattern base are repeated, until the constructed conditional frequent pattern tree is empty or contains only one path, at which point the frequent items corresponding to the conditional frequent pattern trees are output to obtain the frequent item set.
When a constructed conditional frequent pattern tree contains only one path, every combination of the nodes on that path, concatenated with the corresponding suffix item, is output as a frequent item.
Optionally, in order to remove frequent items whose support is too low, this embodiment may further filter out the frequent items whose support degree is lower than a preset support degree from the frequent pattern tree before updating the frequent pattern tree based on each constructed conditional frequent pattern tree.
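The conditional-pattern-base recursion described above is FP-growth; as a compact, easy-to-verify stand-in, the same frequent item set can be obtained by brute-force enumeration (suitable only for illustration, since it is far slower than tree-based mining on real data):

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=2):
    # Count every itemset occurring in the transactions and keep those
    # with at least min_support occurrences; FP-growth produces the same
    # result via the conditional-pattern-base recursion described above.
    counts = Counter()
    for t in transactions:
        unique = sorted(set(t))
        for k in range(1, len(unique) + 1):
            for combo in combinations(unique, k):
                counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

txns = [["service score", "drop", "why"],
        ["service score", "drop", "how"],
        ["service score", "why"]]
freq = frequent_itemsets(txns, min_support=2)
```

Each surviving itemset, such as `("drop", "service score")`, is one mined frequent item, i.e., one candidate knowledge point.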
Therefore, according to the data mining method provided by this embodiment, target dialogue data is acquired, question data is extracted from the target dialogue data, and the question data is segmented to obtain a word segmentation result composed of a plurality of segmented words; a corresponding frequent pattern tree is then constructed from the word segmentation result, and a frequent item set is mined from the constructed frequent pattern tree, where the frequent item set includes a plurality of frequent items and each frequent item corresponds to a data mining knowledge point. The method can accurately and comprehensively mine the knowledge points in single-turn dialogues and greatly improve the efficiency and quality of knowledge point mining, so as to solve user problems more effectively and improve user satisfaction.
During research, the inventor of the present application also found that a frequent item set mined as above may contain many redundant frequent items that express the same meaning as other frequent items; this is determined by the pattern of the question sentence. For example, when a user consults about a service score decline, the following two question-asking modes may occur:
question asking mode A: how did my service score decline? The corresponding frequent items are: (service score, how, drop);
question asking mode B: why did my service decline? The corresponding frequent items are: (why, service points, drops).
It is easy to see that although question-asking modes A and B differ, they express the same meaning; in this case only one of the two corresponding frequent items needs to be retained, otherwise redundant frequent items are generated and the subsequent problem identification efficiency is reduced.
In order to solve the above problem, please further refer to fig. 4, after the step S130, the data mining method provided in this embodiment may further include the following steps:
and step S140, merging the frequent items with the same meaning in the frequent item set to obtain a merged frequent item set.
As a possible implementation, first, a problem set is generated that is associated with each frequent item. For example, in the application scenario shown in fig. 5, the problem set related to the frequent items (B, D, C) may include problem 1, problem 2, problem 3, and problem N.
Next, a sentence vector is calculated for the problem set related to each frequent item. Optionally, for each problem set related to a frequent item, word segmentation is performed on each question sentence in the problem set to obtain a plurality of segmented words; each segmented word is then input into a pre-trained fasttext word vector model to obtain a word vector for each segmented word, and the sentence vector of the problem set is obtained from these word vectors. For example, in the application scenario shown in fig. 5, a question sentence in the problem set related to the frequent item (B, D, C), such as "why did my service score drop", may be segmented into words such as "why", "my", "service score", and "drop"; these segmented words are input into the pre-trained fasttext word vector model to obtain their respective word vectors, and the word vectors are added together to obtain the sentence vector of the problem set related to the frequent item.
In the research process, it was found that the word vectors of semantically similar words are close to each other, while the word vectors of semantically different words are far apart. Based on this, the cosine distance between the sentence vectors of the problem sets related to any two frequent items can be calculated from the computed sentence vectors, and this cosine distance is used as the similarity measure between the two problem sets. On this basis, it is judged whether the distance is greater than a preset threshold; if it is not greater than the preset threshold (i.e., the two problem sets are sufficiently similar), either one of the two corresponding frequent items is deleted. Otherwise, if the distance is greater than the preset threshold, both frequent items are retained.
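A minimal sketch of this merge decision, using toy two-dimensional word vectors in place of a pretrained fasttext model (all vectors and the threshold are illustrative assumptions):

```python
import math

# Toy word vectors standing in for a pretrained fasttext model.
WORD_VECS = {
    "why": [1.0, 0.0], "how": [0.9, 0.1],
    "service score": [0.0, 1.0], "drop": [0.5, 0.5],
}

def sentence_vector(tokens):
    # The sentence vector is the sum of the word vectors, as described above.
    vec = [0.0, 0.0]
    for tok in tokens:
        wv = WORD_VECS.get(tok, [0.0, 0.0])
        vec = [a + b for a, b in zip(vec, wv)]
    return vec

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

va = sentence_vector(["how", "service score", "drop"])
vb = sentence_vector(["why", "service score", "drop"])
d = cosine_distance(va, vb)
# A small distance means the two problem sets express the same meaning,
# so one of the two frequent items can be deleted (threshold is illustrative).
merge = d <= 0.1
```

Because "how" and "why" have nearby toy vectors, the two sentence vectors come out almost parallel and the distance falls below the threshold, triggering a merge.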
Therefore, the frequent items with the same meaning in the frequent item set are merged, so that redundant frequent items are reduced, and the subsequent problem identification efficiency is improved.
On the basis, please further refer to fig. 6, after the step S140, the data mining method provided in this embodiment may further include the following steps:
and step S150, acquiring and storing the problem solving information of each problem in the problem set of each frequent item.
Step S160, when receiving the preset problem sent by the service requester terminal 130, matches the preset problem with each problem in the problem set of each frequent item, and sends the problem solution information of the problem matched with the preset problem to the service requester terminal 130.
The problem set of each frequent item can be regarded as a potential knowledge point, so corresponding problem solution information, such as question-answering information or dialogue script information, can be configured for each potential knowledge point, and each potential knowledge point is stored in association with its problem solution information. In this way, when the service requester terminal 130 subsequently sends a preset problem, the preset problem is matched against each problem in the problem set of each frequent item, and the problem solution information of the matching problem is sent to the service requester terminal 130. Therefore, the accuracy of single-turn dialogue responses can be improved, and user satisfaction, the intelligent resolution rate, and the proportion of requests handled by intelligent service can be further improved.
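A minimal sketch of the storage-and-matching flow; the exact-match lookup and all stored contents are illustrative simplifications of the matching described above:

```python
# Each frequent item's problem set is stored with its solution information;
# all stored texts here are illustrative, and the exact-match lookup is a
# simplification of the matching step in the text.
KNOWLEDGE = {
    ("service score", "drop"): {
        "problems": ["why did my service score drop",
                     "how did my service score decline"],
        "solution": "Your service score is recalculated weekly ...",
    },
}

def answer(preset_question):
    for entry in KNOWLEDGE.values():
        if preset_question in entry["problems"]:
            return entry["solution"]
    return None  # no matching knowledge point
```

A production system would replace the exact string match with the sentence-vector similarity matching already described, but the store-then-lookup shape stays the same.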
Fig. 7 illustrates a functional block diagram of a data mining device 300 according to some embodiments of the present application; the functions implemented by the data mining device 300 may correspond to the steps performed by the above-described method. The data mining device 300 may be understood as the server 110 or a processor of the server 110, or as a component, independent of the server 110 or the processor, that implements the functions of the present application under the control of the server 110. As shown in fig. 7, the data mining device 300 may include a first obtaining module 310, a word segmentation module 320, and a mining module 330; the functions of these modules are described in detail below.
The first obtaining module 310 may be configured to obtain target session data and extract question data from the target session data. It is understood that the first obtaining module 310 may be configured to perform the step S110, and for a detailed implementation of the first obtaining module 310, reference may be made to the content related to the step S110.
The word segmentation module 320 may be configured to perform word segmentation on the question data to obtain a word segmentation result composed of a plurality of words. It is understood that the word segmentation module 320 can be used to perform the step S120, and the detailed implementation manner of the word segmentation module 320 can refer to the content related to the step S120.
The mining module 330 may be configured to construct a corresponding frequent pattern tree according to the word segmentation result, and mine a frequent item set from the constructed frequent pattern tree, where the frequent item set includes a plurality of frequent items, and each frequent item corresponds to a data mining knowledge point. It is understood that the mining module 330 may be configured to perform the step S130, and for the detailed implementation of the mining module 330, reference may be made to the content related to the step S130.
In a possible implementation manner, the first obtaining module 310 may specifically obtain the target dialog data by:
acquiring man-machine historical conversations in each conversation scene from a historical database;
and acquiring the historical dialogue of the dialogue requester from the man-machine historical dialogue of each dialogue scene as target dialogue data.
In a possible implementation, the first obtaining module 310 may specifically extract the problem data by:
aiming at each dialogue statement in the target dialogue data, matching the dialogue statement with each keyword in a preset keyword table;
if the dialogue statement is matched with any keyword in a preset keyword table, determining the dialogue statement as a question statement;
and obtaining the question data according to the determined question sentences.
In a possible implementation, the first obtaining module 310 may specifically extract the problem data by:
judging whether the sentence length of each dialogue sentence in the target dialogue data is within a preset length range or not according to each dialogue sentence in the target dialogue data;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a question sentence;
and obtaining the question data according to the determined question sentences.
In a possible implementation, the first obtaining module 310 may specifically extract the problem data by:
aiming at each dialogue statement in the target dialogue data, matching the dialogue statement with each keyword in a preset keyword table;
if the dialogue statement is matched with any keyword in a preset keyword table, judging whether the statement length of the dialogue statement is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a question sentence;
and obtaining the question data according to the determined question sentences.
In one possible embodiment, the word segmentation module 320 may obtain a word segmentation result composed of a plurality of word segmentations by:
performing word segmentation on the problem data according to a pre-configured scene word bank table to obtain a word segmentation result consisting of a plurality of words, wherein the scene word bank table comprises a plurality of special scene words related to target services corresponding to the problem data; or
And performing word segmentation on the problem data according to a pre-trained scene word discovery model to obtain a word segmentation result consisting of a plurality of words.
In one possible implementation, the scene word discovery model is trained by:
configuring a conditional random field algorithm CRF model;
and (3) taking the historical dialogue data of each dialogue scene as model input, taking a plurality of special scene words in the historical dialogue data of each dialogue scene as model output, and iteratively training the CRF model to obtain a scene word discovery model.
In a possible implementation, the mining module 330 may specifically mine the frequent item set by:
counting the support degree of each word in the word segmentation result, wherein the support degree represents the occurrence frequency of the word in the word segmentation result;
sequentially inserting each participle, in descending order of support degree, into a tree with NULL as the root node to construct a frequent pattern tree, wherein the frequent pattern tree comprises a NULL root node and branch nodes, the NULL root node is an invalid value, and each branch node corresponds to a frequent item and its support degree;
and mining a frequent item set from the constructed frequent pattern tree.
In a possible implementation, the mining module 330 may specifically mine the frequent item set from the constructed frequent pattern tree by:
constructing a conditional pattern base of each frequent item in the constructed frequent pattern tree, and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, wherein the conditional pattern base is the path set of prefix paths that take the frequent item as the suffix item;
updating the frequent pattern tree based on each constructed conditional frequent pattern tree, and, based on the updated frequent pattern tree, continuing to execute the steps of constructing a conditional pattern base for each frequent item in the constructed frequent pattern tree and constructing the conditional frequent pattern tree of that frequent item from the constructed conditional pattern base, until the constructed conditional frequent pattern tree is empty or contains only one path, at which point the frequent items corresponding to the conditional frequent pattern trees are output to obtain the frequent item set;
when a constructed conditional frequent pattern tree contains only one path, every combination of the nodes on that path, concatenated with the corresponding suffix item, is output as a frequent item.
In a possible implementation manner, the mining module 330 is further specifically configured to filter out frequent items in the frequent pattern tree whose support degree is lower than a preset support degree.
In a possible implementation manner, referring to fig. 8 further, the data mining apparatus 300 may further include a merging module 340, where the merging module 340 may be configured to merge frequent items with the same meaning in the frequent item set to obtain a merged frequent item set. It is understood that the merging module 340 can be used to perform the step S140, and for the detailed implementation of the merging module 340, reference can be made to the above description regarding the step S140.
The merging module 340 may specifically obtain the merged frequent item set by the following method:
generating a problem set related to each frequent item;
calculating a sentence vector of the problem set related to each frequent item;
calculating, according to the sentence vectors of the problem sets related to the frequent items, the cosine distance between the sentence vectors of the problem sets related to any two frequent items, and taking the cosine distance as the similarity between the sentence vectors of the two problem sets;
and judging whether the similarity is greater than the preset similarity, and if the similarity is greater than the preset similarity, deleting either one of the two corresponding frequent items.
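A non-limiting sketch of this merging step follows; the application leaves the preset similarity unspecified, so the threshold value, the function names and the toy vectors below are our own assumptions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_frequent_items(sentence_vectors, threshold=0.9):
    """Drop one of any two frequent items whose problem-set sentence
    vectors are more similar than the preset threshold; keep the rest."""
    kept = []
    for item in sentence_vectors:
        if all(cosine_similarity(sentence_vectors[item],
                                 sentence_vectors[k]) <= threshold
               for k in kept):
            kept.append(item)
    return kept
```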
In one possible implementation, the merging module 340 may specifically calculate a sentence vector of the problem set related to each frequent item by:
for the problem set related to each frequent item, performing word segmentation on each problem sentence in the set to obtain a plurality of word segments;
inputting each word segment into a pre-trained fastText word vector model to obtain a word segment vector for each word segment;
and obtaining the sentence vector of the problem set related to the frequent item from the word segment vectors.
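A minimal sketch of this sentence-vector computation, with a toy word-vector table standing in for the pre-trained fastText model (a real system would obtain each vector from the model's word-vector lookup instead):

```python
# Toy word vectors standing in for a pre-trained fastText model;
# the words and values are invented for illustration.
WORD_VECTORS = {
    "cancel": [0.9, 0.1],
    "order":  [0.2, 0.8],
    "fee":    [0.5, 0.5],
}

def sentence_vector(tokens, word_vectors=WORD_VECTORS, dim=2):
    """Average the word-segment vectors of a problem sentence to get
    its sentence vector; tokens without a vector are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(component) / len(vecs) for component in zip(*vecs)]
```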
In one possible implementation, referring further to fig. 9, the data mining device 300 may further include a second obtaining module 350 and a problem matching module 360.
The second obtaining module 350 may be configured to obtain and store problem solution information of each problem in the problem set of each frequent item. It is understood that the second obtaining module 350 may be configured to perform the step S150, and for a detailed implementation of the second obtaining module 350, reference may be made to the content related to the step S150.
The problem matching module 360 may be configured to, when receiving a preset problem sent by the service requester terminal 130, match the preset problem with each problem in the problem set of each frequent item, and send problem solution information of the problem matched with the preset problem to the service requester terminal 130. It is understood that the question matching module 360 can be used to execute the above step S160, and the detailed implementation of the question matching module 360 can refer to the above contents related to the step S160.
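A schematic (non-authoritative) sketch of the matching step performed by the problem matching module; the application does not fix the matching metric, so a simple token-overlap score and the example store below are purely illustrative:

```python
def match_question(preset_question, qa_store):
    """Return the stored problem solution information whose question
    best matches the preset question (token-overlap Jaccard score;
    the matching metric is an assumption, not from the patent)."""
    def score(a, b):
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / max(len(ta | tb), 1)
    best = max(qa_store, key=lambda q: score(preset_question, q))
    return qa_store[best] if score(preset_question, best) > 0 else None
```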
The wired connections may include connections in the form of LAN, WAN, Bluetooth, ZigBee, or NFC, or the like, or any combination thereof.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the method embodiments and are not described in detail in this application.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division into modules is only a logical division, and other divisions are possible in actual implementation; for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or modules through communication interfaces, and may be electrical, mechanical or in another form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto; any changes or substitutions that a person skilled in the art could easily conceive within the technical scope disclosed in the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (28)

1. A data mining method is applied to a server, and the method comprises the following steps:
acquiring target dialogue data and extracting problem data from the target dialogue data;
performing word segmentation on the problem data to obtain a word segmentation result consisting of a plurality of words;
and constructing a corresponding frequent pattern tree according to the word segmentation result, and mining a frequent item set from the constructed frequent pattern tree, wherein the frequent item set comprises a plurality of frequent items, and each frequent item corresponds to a data mining knowledge point.
2. The data mining method of claim 1, wherein the step of obtaining target dialogue data comprises:
acquiring man-machine historical conversations in each conversation scene from a historical database;
and acquiring the historical dialogue of the dialogue requester from the man-machine historical dialogue of each dialogue scene as the target dialogue data.
3. The data mining method of claim 1, wherein the step of extracting question data from the target dialogue data comprises:
for each dialogue sentence in the target dialogue data, matching the dialogue sentence against each keyword in a preset keyword table;
if the dialogue sentence matches any keyword in the preset keyword table, determining the dialogue sentence as a question sentence;
and obtaining the question data according to the determined question sentences.
4. The data mining method of claim 1, wherein the step of extracting question data from the target dialogue data comprises:
judging, for each dialogue sentence in the target dialogue data, whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a question sentence;
and obtaining the question data according to the determined question sentences.
5. The data mining method of claim 1, wherein the step of extracting question data from the target dialogue data comprises:
for each dialogue sentence in the target dialogue data, matching the dialogue sentence against each keyword in a preset keyword table;
if the dialogue sentence matches any keyword in the preset keyword table, judging whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a question sentence;
and obtaining the question data according to the determined question sentences.
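For illustration only (not part of the claims), the combined keyword-and-length test of this claim might look as follows in Python; the keyword table, length bounds and example sentences are invented:

```python
def extract_questions(dialogue, keywords, min_len=2, max_len=30):
    """Keep a dialogue sentence as a question sentence if it matches
    any keyword in the preset keyword table AND its sentence length
    falls within the preset length range."""
    questions = []
    for sentence in dialogue:
        if any(k in sentence for k in keywords):
            if min_len <= len(sentence) <= max_len:
                questions.append(sentence)
    return questions
```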
6. The data mining method of claim 1, wherein the step of performing word segmentation on the question data to obtain a word segmentation result consisting of a plurality of word segmentations comprises:
performing word segmentation on the question data according to a pre-configured scene word bank table to obtain a word segmentation result consisting of a plurality of words, wherein the scene word bank table comprises a plurality of special scene words related to the target service corresponding to the question data; or
performing word segmentation on the question data according to a pre-trained scene word discovery model to obtain a word segmentation result consisting of a plurality of words.
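An illustrative sketch of segmentation against a scene word bank table using greedy forward maximum matching; this is one plausible reading, since the claim does not prescribe a segmentation algorithm, and the lexicon and strings below are invented:

```python
def segment(text, scene_lexicon, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest substring found in the scene lexicon, falling back to a
    single character when nothing matches (a stand-in for a full
    segmenter loaded with a custom scene dictionary)."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in scene_lexicon:
                tokens.append(word)
                i += size
                break
    return tokens
```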
7. The data mining method of claim 6, wherein the scene word discovery model is trained by:
configuring a conditional random field (CRF) model;
and taking the historical dialogue data of each dialogue scene as model input, taking a plurality of special scene words in the historical dialogue data of each dialogue scene as model output, and iteratively training the CRF model to obtain the scene word discovery model.
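For illustration, the training data described in this claim (historical dialogues as model input, special scene words as model output) can be cast as a sequence-labelling corpus with BIO tags; the tag names and example are our own, and the CRF fitting itself (e.g. via a CRF library) is not shown:

```python
def bio_labels(tokens, scene_words):
    """Produce BIO tags for a tokenized dialogue: tokens belonging to
    a known special scene word are tagged B-SCENE/I-SCENE, the rest O.
    The resulting (tokens, labels) pairs are what an iteratively
    trained CRF would consume as input and target output."""
    labels = ["O"] * len(tokens)
    for word in scene_words:
        parts = word.split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                labels[i] = "B-SCENE"
                for j in range(1, len(parts)):
                    labels[i + j] = "I-SCENE"
    return labels
```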
8. The data mining method according to any one of claims 1 to 7, wherein the step of constructing a corresponding frequent pattern tree according to the word segmentation result and mining a frequent item set from the constructed frequent pattern tree comprises:
counting the support degree of each word in the word segmentation result, wherein the support degree represents the number of occurrences of the word in the word segmentation result;
sequentially inserting each participle, in descending order of support degree, into a tree with NULL as the root node to construct a frequent pattern tree, wherein the frequent pattern tree comprises a NULL root node and branch nodes, the NULL root node is an invalid value, and each branch node corresponds to a frequent item and its support degree;
and mining a frequent item set from the constructed frequent pattern tree.
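The support-counting and descending-order insertion preparation of this claim can be sketched as follows (illustrative only; the function name and example data are ours):

```python
from collections import Counter

def order_for_insertion(transactions, min_support=2):
    """Count each word's support across the segmentation results,
    drop infrequent words, and rewrite every transaction in
    descending support order -- the order in which words are
    sequentially inserted under the NULL root of the tree."""
    support = Counter()
    for t in transactions:
        support.update(set(t))
    return [
        sorted((w for w in set(t) if support[w] >= min_support),
               key=lambda w: (-support[w], w))
        for t in transactions
    ]
```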
9. The data mining method of claim 8, wherein the step of mining a frequent item set from the constructed frequent pattern tree comprises:
for each frequent item in the constructed frequent pattern tree, constructing a conditional pattern base of the frequent item, and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, wherein the conditional pattern base is a path set of a plurality of prefix paths which take the frequent item as a suffix item and are connected with the suffix item;
updating the frequent pattern tree based on each constructed conditional frequent pattern tree, continuing to execute the steps of constructing a conditional pattern base of each frequent item aiming at each frequent item in the constructed frequent pattern tree based on the updated frequent pattern tree, and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, until the constructed conditional frequent pattern tree is empty or only contains one path, outputting the frequent item corresponding to the conditional frequent pattern tree to obtain a frequent item set;
when the constructed conditional frequent pattern tree contains only one path, all combinations of the nodes on that path, each concatenated with the suffix item of the conditional frequent pattern tree, are taken as the frequent items.
10. The data mining method of claim 9, wherein after the step of updating the frequent pattern tree based on each constructed conditional frequent pattern tree, the method further comprises:
and filtering out frequent items with the support degree lower than a preset support degree in the frequent pattern tree.
11. The data mining method according to claim 1, wherein after the step of constructing a corresponding frequent pattern tree according to the word segmentation result and mining a frequent item set from the constructed frequent pattern tree, the method further comprises:
merging the frequent items with the same meaning in the frequent item set to obtain a merged frequent item set;
the step of combining the frequent items with the same meaning in the frequent item set to obtain a combined frequent item set includes:
generating a problem set related to each frequent item;
calculating a sentence vector of the problem set related to each frequent item;
calculating cosine distance between sentence vectors of the problem sets related to any two frequent items according to the sentence vectors of the problem sets related to each frequent item, and taking the cosine distance as similarity between the sentence vectors of the problem sets related to any two frequent items;
and judging whether the similarity is greater than a preset similarity, and if the similarity is greater than the preset similarity, deleting either one of the two corresponding frequent items.
12. The data mining method of claim 11, wherein the step of computing a sentence vector for each frequent item related problem set comprises:
for the problem set related to each frequent item, performing word segmentation on each problem sentence in the set to obtain a plurality of word segments;
inputting each word segment into a pre-trained fastText word vector model to obtain a word segment vector for each word segment;
and obtaining a sentence vector of the problem set related to the frequent item from the word segment vectors.
13. The data mining method of claim 11, wherein after the step of merging the frequent terms with the same meaning in the frequent item set to obtain a merged frequent item set, the method further comprises:
acquiring and storing problem solving information of each problem in the problem set of each frequent item;
when receiving a preset problem sent by a service requester terminal, matching the preset problem with each problem in the problem set of each frequent item, and sending problem solution information of the problem matched with the preset problem to the service requester terminal.
14. A data mining device, which is applied to a server, the device comprising:
the first acquisition module is used for acquiring target dialogue data and extracting problem data from the target dialogue data;
the word segmentation module is used for segmenting the question data to obtain a word segmentation result consisting of a plurality of words;
and the mining module is used for constructing a corresponding frequent pattern tree according to the word segmentation result and mining a frequent item set from the constructed frequent pattern tree, wherein the frequent item set comprises a plurality of frequent items, and each frequent item corresponds to a data mining knowledge point.
15. The data mining device of claim 14, wherein the first obtaining module obtains the target conversation data by:
acquiring man-machine historical conversations in each conversation scene from a historical database;
and acquiring the historical dialogue of the dialogue requester from the man-machine historical dialogue of each dialogue scene as the target dialogue data.
16. The data mining device of claim 14, wherein the first obtaining module extracts the problem data by:
for each dialogue sentence in the target dialogue data, matching the dialogue sentence against each keyword in a preset keyword table;
if the dialogue sentence matches any keyword in the preset keyword table, determining the dialogue sentence as a question sentence;
and obtaining the question data according to the determined question sentences.
17. The data mining device of claim 14, wherein the first obtaining module extracts the problem data by:
judging, for each dialogue sentence in the target dialogue data, whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a question sentence;
and obtaining the question data according to the determined question sentences.
18. The data mining device of claim 14, wherein the first obtaining module extracts the problem data by:
for each dialogue sentence in the target dialogue data, matching the dialogue sentence against each keyword in a preset keyword table;
if the dialogue sentence matches any keyword in the preset keyword table, judging whether the sentence length of the dialogue sentence is within a preset length range;
if the sentence length of the dialogue sentence is within the preset length range, determining the dialogue sentence as a question sentence;
and obtaining the question data according to the determined question sentences.
19. The data mining device of claim 14, wherein the segmentation module obtains a segmentation result consisting of a plurality of segmentation words by:
performing word segmentation on the problem data according to a pre-configured scene word bank table to obtain a word segmentation result consisting of a plurality of words, wherein the scene word bank table comprises a plurality of special scene words related to the target service corresponding to the problem data; or
performing word segmentation on the problem data according to a pre-trained scene word discovery model to obtain a word segmentation result consisting of a plurality of words.
20. The data mining device of claim 19, wherein the scene word discovery model is trained by:
configuring a conditional random field (CRF) model;
and taking the historical dialogue data of each dialogue scene as model input, taking a plurality of special scene words in the historical dialogue data of each dialogue scene as model output, and iteratively training the CRF model to obtain the scene word discovery model.
21. The data mining device of any one of claims 14-20, wherein the mining module is configured to mine the frequent item set by:
counting the support degree of each word in the word segmentation result, wherein the support degree represents the number of occurrences of the word in the word segmentation result;
sequentially inserting each participle, in descending order of support degree, into a tree with NULL as the root node to construct a frequent pattern tree, wherein the frequent pattern tree comprises a NULL root node and branch nodes, the NULL root node is an invalid value, and each branch node corresponds to a frequent item and its support degree;
and mining a frequent item set from the constructed frequent pattern tree.
22. The data mining device of claim 21, wherein the mining module is further configured to mine the frequent item set from the constructed frequent pattern tree by:
for each frequent item in the constructed frequent pattern tree, constructing a conditional pattern base of the frequent item, and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, wherein the conditional pattern base is a path set of a plurality of prefix paths which take the frequent item as a suffix item and are connected with the suffix item;
updating the frequent pattern tree based on each constructed conditional frequent pattern tree, continuing to execute the steps of constructing a conditional pattern base of each frequent item aiming at each frequent item in the constructed frequent pattern tree based on the updated frequent pattern tree, and constructing the conditional frequent pattern tree of the frequent item based on the constructed conditional pattern base, until the constructed conditional frequent pattern tree is empty or only contains one path, outputting the frequent item corresponding to the conditional frequent pattern tree to obtain a frequent item set;
when the constructed conditional frequent pattern tree contains only one path, all combinations of the nodes on that path, each concatenated with the suffix item of the conditional frequent pattern tree, are taken as the frequent items.
23. The data mining device of claim 22, wherein the mining module is further configured to filter out frequent items in the frequent pattern tree with a support degree lower than a preset support degree.
24. The data mining device of claim 14, wherein the device further comprises:
a merging module, configured to merge frequent items with the same meaning in the frequent item set to obtain a merged frequent item set;
the merging module obtains the merged frequent item set specifically by the following method:
generating a problem set related to each frequent item;
calculating a sentence vector of the problem set related to each frequent item;
calculating cosine distance between sentence vectors of the problem sets related to any two frequent items according to the sentence vectors of the problem sets related to each frequent item, and taking the cosine distance as similarity between the sentence vectors of the problem sets related to any two frequent items;
and judging whether the similarity is greater than a preset similarity, and if the similarity is greater than the preset similarity, deleting either one of the two corresponding frequent items.
25. The data mining device of claim 24, wherein the merging module calculates a sentence vector for each frequent item related problem set by:
for the problem set related to each frequent item, performing word segmentation on each problem sentence in the set to obtain a plurality of word segments;
inputting each word segment into a pre-trained fastText word vector model to obtain a word segment vector for each word segment;
and obtaining a sentence vector of the problem set related to the frequent item from the word segment vectors.
26. The data mining device of claim 24, wherein the device further comprises:
the second acquisition module is used for acquiring and storing the problem solution information of each problem in the problem set of each frequent item;
and the problem matching module is used for matching the preset problem with each problem in the problem set of each frequent item when receiving the preset problem sent by the service requester terminal, and sending the problem solving information of the problem matched with the preset problem to the service requester terminal.
27. A server, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, wherein, when the server runs, the processor and the storage medium communicate over the bus, and the processor executes the machine-readable instructions to perform the steps of the data mining method according to any one of claims 1-13.
28. A readable storage medium, having stored thereon a computer program which, when being executed by a processor, is adapted to carry out the steps of the data mining method according to any one of claims 1-13.
CN201811526754.5A 2018-12-13 2018-12-13 Data mining method, device, server and readable storage medium Active CN111401388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811526754.5A CN111401388B (en) 2018-12-13 2018-12-13 Data mining method, device, server and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811526754.5A CN111401388B (en) 2018-12-13 2018-12-13 Data mining method, device, server and readable storage medium

Publications (2)

Publication Number Publication Date
CN111401388A true CN111401388A (en) 2020-07-10
CN111401388B CN111401388B (en) 2023-06-30

Family

ID=71428222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811526754.5A Active CN111401388B (en) 2018-12-13 2018-12-13 Data mining method, device, server and readable storage medium

Country Status (1)

Country Link
CN (1) CN111401388B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164401A (en) * 2020-09-18 2021-01-01 广州小鹏汽车科技有限公司 Voice interaction method, server and computer-readable storage medium
CN112364128A (en) * 2020-11-06 2021-02-12 北京乐学帮网络技术有限公司 Information processing method and device, computer equipment and storage medium
CN113157766A (en) * 2021-03-12 2021-07-23 Oppo广东移动通信有限公司 Application analysis method and device, electronic equipment and computer-readable storage medium
CN112364128B (en) * 2020-11-06 2024-05-24 北京乐学帮网络技术有限公司 Information processing method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480128A (en) * 2017-07-17 2017-12-15 广州特道信息科技有限公司 The segmenting method and device of Chinese text
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system
CN108132947A (en) * 2016-12-01 2018-06-08 百度在线网络技术(北京)有限公司 Entity digging system and method
US20180322188A1 (en) * 2015-10-30 2018-11-08 Microsoft Technology Licensing, Llc Automatic conversation creator for news

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, Huiping et al.: "A maximal frequent itemset mining algorithm based on FP-tree and support arrays" *


Also Published As

Publication number Publication date
CN111401388B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
US10043514B2 (en) Intelligent contextually aware digital assistants
WO2019232772A1 (en) Systems and methods for content identification
US10496751B2 (en) Avoiding sentiment model overfitting in a machine language model
CN113407850B (en) Method and device for determining and acquiring virtual image and electronic equipment
CN112241715A (en) Model training method, expression recognition method, device, equipment and storage medium
US20200176019A1 (en) Method and system for recognizing emotion during call and utilizing recognized emotion
CN111353092A (en) Service pushing method, device, server and readable storage medium
CN112037775B (en) Voice recognition method, device, equipment and storage medium
CN111275470A (en) Service initiation probability prediction method and training method and device of model thereof
JP2020052463A (en) Information processing method and information processing apparatus
CN112818227A (en) Content recommendation method and device, electronic equipment and storage medium
CN113627536A (en) Model training method, video classification method, device, equipment and storage medium
CN111401388A (en) Data mining method, device, server and readable storage medium
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN114723949A (en) Three-dimensional scene segmentation method and method for training segmentation model
CN113158030B (en) Recommendation method and device for remote interest points, electronic equipment and storage medium
CN112860995A (en) Interaction method, device, client, server and storage medium
WO2024051146A1 (en) Methods, systems, and computer-readable media for recommending downstream operator
CN111680497B (en) Session recognition model training method and device
EP4174439A1 (en) Method and apparatus for processing map information, device, and storage medium
CN113110782B (en) Image recognition method and device, computer equipment and storage medium
CN113704256B (en) Data identification method, device, electronic equipment and storage medium
CN114490986A (en) Computer-implemented data mining method, computer-implemented data mining device, electronic device, and storage medium
CN113361363A (en) Training method, device and equipment for face image recognition model and storage medium
CN113554062A (en) Training method, device and storage medium of multi-classification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant