CN116578885A - Data processing method and device and electronic equipment - Google Patents

Data processing method and device and electronic equipment

Publication number
CN116578885A
Authority
CN
China
Prior art keywords
data
cluster
intention
data cluster
determining
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310339465.9A
Other languages
Chinese (zh)
Inventor
李彤
李让
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Nuodi Beijing Intelligent Technology Co ltd
Original Assignee
Lenovo Nuodi Beijing Intelligent Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Abstract

The disclosure provides a data processing method, a data processing device and electronic equipment. The method comprises the following steps: acquiring a data set to be processed, which comprises first data with unknown intention; carrying out semantic clustering on the data set to be processed and a data set of second data with known intention to obtain a plurality of first data clusters; determining, among the first data clusters, at least one third data cluster that is different from the second data clusters formed by the second data; carrying out entity identification on the first data in the third data cluster so as to determine a triplet corresponding to the first data in the third data cluster; and determining, based on an inference model, the intention corresponding to the third data cluster through the triplet corresponding to the first data in the third data cluster.

Description

Data processing method and device and electronic equipment
Technical Field
The disclosure relates to data intention recognition technology, and in particular relates to a data processing method, a data processing device and electronic equipment.
Background
In an intelligent customer service scenario, a traditional intent is typically a single category, whereas a structured intent splits intent understanding into parts for fine-grained intent recognition, so that the semantics of the user's voice data are understood in more detail. Identifying a structured intent usually requires a predefined outline (schema) system comprising three sub-taxonomy systems: subject, action, and specific content. Such a system cannot recognize new intents that fall outside it. A common solution is to find new intents through manual quality inspection of fed-back error data (cases). The problem with this approach is that, because structured intents are more complex, reasonable new intent labels have to be summarized from similar cases, which is labor-intensive and time-consuming when the intent system is large.
Disclosure of Invention
The disclosure provides a data processing method, a data processing device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a data processing method comprising:
acquiring a data set to be processed, wherein the data set to be processed comprises at least one first data with unknown intention, and the data is voice data or text data;
carrying out semantic clustering on the data set to be processed and a data set to obtain a plurality of first data clusters, wherein the data set comprises a plurality of second data with known intention, and the plurality of second data are clustered into at least one second data cluster based on the semantic clustering;
determining at least one third data cluster different from the second data cluster in the first data cluster, wherein the third data cluster comprises at least one first data;
entity identification is carried out on the first data in the third data cluster so as to determine a triplet corresponding to the first data in the third data cluster;
and determining the intention corresponding to the third data cluster through the triplet corresponding to the first data in the third data cluster based on an inference model.
In an embodiment, before determining the intention corresponding to the third data cluster, the method further includes:
and carrying out entity normalization processing on the triples corresponding to the first data in the third data cluster based on a specified knowledge graph to obtain candidate intents corresponding to the first data in the third data cluster.
In an embodiment, the determining the intent corresponding to the third data cluster includes:
converting candidate intentions corresponding to each first data in the third data cluster into semantic vectors based on an inference model, and determining semantic similarity among the candidate intentions based on the semantic vectors of the first data in the third data cluster;
and taking the candidate intention of which the semantic similarity meets the condition as the intention of the third data cluster.
In an embodiment, the method further comprises:
and taking more than two candidate intentions with the semantic similarity reaching a set threshold value as candidate intention groups, and determining that candidate intentions corresponding to the candidate intention group with the largest number of candidate intentions in the candidate intention groups meet the semantic similarity condition.
In an embodiment, the determining the triplet corresponding to the first data in the third data cluster includes:
and extracting the subject, predicate and object triples of the first data without supervision based on at least one of parts of speech, dependency syntax, semantic roles and clauses through an open information extraction sequence labeling method.
In an embodiment, after determining the intent corresponding to the third data cluster, the method further includes:
determining the intention of the first data in the third data cluster as the corresponding intention of the third data cluster;
wherein the first data is changed to data of known intent.
In an embodiment, the method further comprises:
and if the first data with unknown intention is clustered to a second data cluster based on semantics, taking the intention corresponding to the second data cluster as the intention of the first data.
In an embodiment, the method further comprises: the first data is added to the data set.
According to a second aspect of the present disclosure, there is provided a data processing apparatus comprising:
an acquisition unit for acquiring a data set to be processed; the data set to be processed comprises at least one first data with unknown intention, wherein the data is voice data or text data;
the clustering unit is used for carrying out semantic clustering on the data set to be processed and a data set to obtain a plurality of first data clusters, wherein the data set comprises a plurality of second data with known intention, and the plurality of second data are clustered into at least one second data cluster based on semantics;
a first determining unit configured to determine at least one third data cluster different from the second data cluster in the first data cluster, where the third data cluster includes at least one first data;
the second determining unit is used for carrying out entity identification on the first data in the third data cluster so as to determine a triplet corresponding to the first data in the third data cluster;
and the third determining unit is used for determining the intention corresponding to the third data cluster through the triplet corresponding to the first data in the third data cluster based on an inference model.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the data processing methods described in the present disclosure.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the steps of the data processing method described in the present disclosure.
According to the data processing method, the data processing device and the electronic equipment of the present disclosure, semantic clustering of the data to be processed together with the known data set makes it possible to determine the third data cluster, that is, the new data cluster, which comprises the first data, that is, the data with unknown intention. Entity recognition is performed on the data with unknown intention in the new data cluster to determine the corresponding triplets, and the intention corresponding to the new data cluster is then determined based on the inference model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 shows a schematic diagram of an implementation flow of a data processing method according to an embodiment of the disclosure;
FIG. 2 shows a second flowchart of an implementation of a data processing method according to an embodiment of the disclosure;
FIG. 3 shows a schematic diagram of the structured description of a statement according to an embodiment of the present disclosure;
FIG. 4 shows a schematic implementation of a data processing method of an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing the constitution of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of a composition structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, features and advantages of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure will be clearly described in conjunction with the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person skilled in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
Fig. 1 shows a schematic implementation flow diagram of a data processing method according to an embodiment of the disclosure, and as shown in fig. 1, the data processing method according to an embodiment of the disclosure includes the following processing steps:
Step 101, a data set to be processed is acquired.
In the embodiment of the disclosure, the data set to be processed may be a voice data set input by a user through a voice acquisition device, such as voice data input by the user to an automatic voice processing system through an instant messaging tool, or related text data input by the user through an interactive system, or the like. The data to be processed may be related data obtained by searching through a search engine, or data stored in a related database or a log system, etc. The set of these relevant data may be used as the data set to be processed.
The data set to be processed comprises at least one first data with unknown intention, and the data in the data set to be processed is voice data or text data. The voice data may be video data having voice, or the like.
Step 102, carrying out semantic clustering on the data set to be processed and the data set to obtain a plurality of first data clusters.
In an embodiment of the disclosure, the data set includes a plurality of second data with known intent, and the plurality of second data are clustered into at least one second data cluster based on semantics. The data set is a set of data with known intention: the intention of each data in the data set has been determined, so the data in the data set are clustered according to their intentions to obtain a plurality of second data clusters, all of which are data clusters with known intention. As an example, an automatic response system may use the data in the data set to identify the intention of voice data or text data input by a user and thereby interact with the user, for example by automatically answering the user's consultation.
In the embodiment of the disclosure, the data set to be processed and the data set are first semantically clustered together: the first data in the data set to be processed and the data in the data set are aggregated by semantics to obtain the first data clusters. The first data clusters comprise second data clusters and third data clusters. Some of the second data clusters may contain first data, that is, first data may be semantically clustered into a second data cluster with known intention. The remaining first data, which cannot be clustered into any second data cluster of the data set, are aggregated into new data clusters according to the semantic aggregation algorithm; these new data clusters are the third data clusters.
Step 103, determining at least one third data cluster different from the second data cluster in the first data cluster.
In an embodiment of the disclosure, the third data cluster includes at least one first data. The first data clusters are the result of semantically clustering the data set to be processed together with the data set, so the plurality of first data clusters comprise the second data clusters with known intention. Because the first data in the data set to be processed have unknown intention, there will generally be first data that cannot be semantically clustered into any second data cluster of the data set; these first data are semantically clustered into third data clusters. Thus, the plurality of first data clusters typically also include a third data cluster, which is a data cluster other than the second data clusters.
The embodiment of the disclosure is mainly concerned with determining the intention of a third data cluster, whose intention is unknown, among the first data clusters. The third data cluster that differs from the second data clusters is therefore identified first, and its intention is then determined by mining the data it contains.
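The separation of second and third data clusters described in steps 102 and 103 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Item` class, the `split_clusters` function, and the sample sentences and intent labels are all hypothetical, and it assumes a cluster counts as a second data cluster as soon as it contains any data with known intention.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Item:
    text: str
    intent: Optional[str] = None  # None marks "first data" (intention unknown)


def split_clusters(first_clusters: List[List[Item]]) -> Tuple[list, list]:
    """Split first data clusters into second (known) and third (new) clusters."""
    second, third = [], []
    for cluster in first_clusters:
        # Assumption: any known-intention member makes this a second data cluster.
        if any(item.intent is not None for item in cluster):
            second.append(cluster)
        else:
            third.append(cluster)
    return second, third


# Hypothetical clustering result: one mixed cluster, one cluster of unknown data only.
clusters = [
    [Item("how do I reset my password", "account/reset/password"),
     Item("forgot my password")],
    [Item("does the z 6 pro ship with office"),
     Item("is office pre-installed on thinkbook")],
]
second_clusters, third_clusters = split_clusters(clusters)
```

Only the clusters returned in `third_clusters` need the triplet extraction and intention inference of the later steps.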
Step 104, performing entity identification on the first data in the third data cluster to determine a triplet corresponding to the first data in the third data cluster.
In the embodiment of the disclosure, entity identification is performed on the first data in the third data cluster to determine the triplet corresponding to the first data. The triples here mainly include subjects, predicates, and objects. Specifically, determining the triplet corresponding to the first data in the third data cluster includes: extracting the subject, predicate and object triples of the first data in an unsupervised manner, based on at least one of parts of speech, dependency syntax, semantic roles and clauses, through an open information extraction sequence labeling method.
It should be understood by those skilled in the art that the foregoing triplet extraction is merely exemplary and is not a limitation of the technical solution of the present disclosure.
Step 105, determining the intention corresponding to the third data cluster through the triplet corresponding to the first data in the third data cluster based on an inference model.
In the embodiment of the disclosure, all subject-predicate-object triples in first data in a third data cluster are extracted unsupervised through an open information extraction (OpenIE) sequence labeling method, and combinations of all subjects, actions and specific contents are obtained and used as candidate intents; and screening the intention with the highest similarity with the original text from the candidate intentions as a new intention of the data through a BERT-NLI reasoning model.
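The step of combining all subjects, actions and specific contents into candidate intents can be illustrated with a small sketch. It assumes the triples have already been extracted (the OpenIE extraction itself is not shown), and the function name `candidate_intents` and the sample triples are hypothetical.

```python
from itertools import product


def candidate_intents(triples):
    """Build every (subject, action, content) combination from extracted triples."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    subjects = list(dict.fromkeys(s for s, _, _ in triples))
    actions = list(dict.fromkeys(p for _, p, _ in triples))
    contents = list(dict.fromkeys(o for _, _, o in triples))
    return list(product(subjects, actions, contents))


# Hypothetical triples; None stands for an empty "specific content" slot.
triples = [
    ("computer", "query pre-installed software", "office"),
    ("computer", "reinstall system", None),
]
candidates = candidate_intents(triples)
```

With one subject, two actions and two contents this yields four candidates, which are then screened by the inference model.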
Specifically, converting candidate intentions corresponding to each first data in the third data cluster into semantic vectors based on an inference model such as a BERT-NLI inference model, and determining semantic similarity among the candidate intentions based on the semantic vectors of the first data in the third data cluster; and taking the candidate intention of which the semantic similarity meets the condition as the intention of the third data cluster.
As an implementation manner, the candidate intentions corresponding to the data are converted into semantic vectors based on the inference model, and the Euclidean distance and/or cosine similarity between the semantic vectors of the first data in the third data cluster is determined. Semantic similarity between candidate intentions is then determined based on the Euclidean distance and/or the cosine similarity. For example, candidate intentions whose Euclidean distance is less than a first set threshold are determined to be semantically similar; alternatively, candidate intentions whose cosine similarity reaches 95% are determined to be semantically similar.
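The two similarity tests named above can be written down directly. The sketch below uses plain Python lists in place of the model's semantic vectors; the function names, the toy two-dimensional vectors, and the choice to accept a pair when any configured test passes are assumptions made for illustration.

```python
import math


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantically_similar(u, v, max_distance=None, min_cosine=None):
    """Return True if any configured test passes (assumed reading of 'and/or')."""
    checks = []
    if max_distance is not None:
        checks.append(euclidean_distance(u, v) < max_distance)
    if min_cosine is not None:
        checks.append(cosine_similarity(u, v) >= min_cosine)
    if not checks:
        raise ValueError("configure max_distance and/or min_cosine")
    return any(checks)


# Toy vectors standing in for semantic vectors of candidate intentions.
a, b, c = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
close_pair = semantically_similar(a, b, min_cosine=0.95)
far_pair = semantically_similar(a, c, min_cosine=0.95)
```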
The semantic similarity condition is satisfied, for example, when there is a data group whose semantic similarity reaches a set threshold. When only one group of data has a semantic similarity reaching 90%, the intention of the data in that group is taken as the intention of the third data cluster. When more than one group of data has a semantic similarity reaching 90%, the intention corresponding to the group containing the largest number of data is taken as the intention of the third data cluster. When no data group has a semantic similarity reaching 90%, the set threshold is lowered, the data group whose semantic similarity reaches the lowered threshold is determined in the manner described above, and the intention corresponding to the data in that group is taken as the intention of the third data cluster.
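The selection rule (largest group at the threshold, otherwise lower the threshold and retry) might be sketched as below. Several details are assumptions rather than statements of the source: groups are formed as connected components of the "similar enough" relation, a group needs at least two members, the threshold is lowered in fixed steps of 0.05 down to a floor, and the most frequent candidate in the winning group is used as the label. All names and the similarity values are hypothetical.

```python
from collections import Counter


def similarity_groups(n, sim, threshold, min_size=2):
    """Group indices whose pairwise similarity reaches the threshold (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= min_size]


def pick_cluster_intent(candidates, sim, threshold=0.90, step=0.05, floor=0.50):
    """Return (intent, threshold used), lowering the threshold until a group exists."""
    while threshold >= floor:
        groups = similarity_groups(len(candidates), sim, threshold)
        if groups:
            largest = max(groups, key=len)
            label, _ = Counter(candidates[i] for i in largest).most_common(1)[0]
            return label, threshold
        threshold = round(threshold - step, 10)
    return None, None


candidates = [
    "computer/query pre-installed software/office",
    "computer/query pre-installed software/office",
    "computer/reinstall system",
]
sim = [[1.00, 0.97, 0.40],
       [0.97, 1.00, 0.42],
       [0.40, 0.42, 1.00]]
intent, used = pick_cluster_intent(candidates, sim)

# No pair reaches 0.90 here, so the threshold is lowered once to 0.85.
intent_low, used_low = pick_cluster_intent(candidates[:2], [[1.0, 0.87], [0.87, 1.0]])
```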
According to the data processing method of the embodiments of the present disclosure, semantic clustering of the data to be processed together with the known data set makes it possible to determine the third data cluster, that is, the new data cluster, which comprises the first data, that is, the data with unknown intention. Entity recognition is performed on the data with unknown intention in the new data cluster to determine the corresponding triplets, and the intention corresponding to the new data cluster is then determined based on the inference model.
In an embodiment of the present disclosure, after determining the intention corresponding to the third data cluster, the method further includes: determining the intention of the first data in the third data cluster as the intention corresponding to the third data cluster. At this point, all the first data in the third data cluster become data with known intention, and the first data whose intention has been determined can be added to the data set. This continuously expands the known-intention data in the data set, enriches the intention library for voice or text data, and facilitates natural-language intention recognition based on the data set.
In the embodiment of the disclosure, for first data with unknown intention in the data to be processed, if the first data is semantically clustered into a second data cluster, the intention corresponding to that second data cluster is taken as the intention of the first data, and the first data thereby becomes data with known intention. Once the intention of the first data is determined, the first data can be added to the data set, which increases the variety of expressions covered by the data set and improves intention recognition based on it.
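The two labelling cases described above (inheriting the intention of a second data cluster, or receiving the newly inferred intention of a third data cluster) and the subsequent update of the data set can be sketched together. The function `label_cluster`, the majority vote used for mixed clusters, and the sample data are hypothetical choices for illustration.

```python
from collections import Counter


def label_cluster(cluster, new_intent=None):
    """cluster: list of (text, intent or None); returns fully labelled pairs."""
    known = [intent for _, intent in cluster if intent is not None]
    if known:
        # Second data cluster: reuse its known intention (majority vote, assumed).
        intent = Counter(known).most_common(1)[0][0]
    elif new_intent is not None:
        # Third data cluster: use the intention inferred for the whole cluster.
        intent = new_intent
    else:
        raise ValueError("a new cluster needs an inferred intention")
    return [(text, label if label is not None else intent) for text, label in cluster]


dataset = [("how do I reset my password", "account/reset/password")]
mixed = [("how do I reset my password", "account/reset/password"),
         ("forgot my password", None)]
new = [("is office pre-installed on thinkbook", None)]

labelled = label_cluster(mixed) + label_cluster(
    new, "computer/query pre-installed software/office")
known_texts = {text for text, _ in dataset}
dataset.extend(pair for pair in labelled if pair[0] not in known_texts)
```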
Those skilled in the art will appreciate that the BERT-NLI inference model is merely an exemplary illustration and is not a limitation on the technical solutions of the embodiments of the present disclosure.
Fig. 2 shows a second implementation flow chart of the data processing method according to the embodiment of the disclosure, and as shown in fig. 2, the data processing method according to the embodiment of the disclosure includes the following processing steps:
Step 201, a data set to be processed is acquired.
Step 202, carrying out semantic clustering on the data set to be processed and the data set to obtain a plurality of first data clusters.
Step 203, determining at least one third data cluster different from the second data cluster in the first data cluster.
Step 204, performing entity identification on the first data in the third data cluster to determine a triplet corresponding to the first data in the third data cluster.
The details of steps 201-204 are described above and will not be repeated here.
Step 205, performing entity normalization processing on the triples corresponding to the first data in the third data cluster based on the specified knowledge graph to obtain candidate intents corresponding to the first data in the third data cluster.
In the embodiment of the disclosure, the entities identified from the first data in the third data cluster are normalized using an externally constructed knowledge graph, so that the same kind of entity receives a unified expression. As an example, assume that the identified triples are: (z 6 pro, query pre-installed software, office), (computer, reinstall system, null), (thinkbook, query pre-installed software, office). After normalization they become: (computer, query pre-installed software, office), (computer, reinstall system, null), (computer, query pre-installed software, office). The normalization is given here by way of example only and is not limiting of the disclosed embodiments.
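A minimal sketch of this normalization step follows, assuming the knowledge graph can be reduced to an alias-to-entity lookup table. The table contents and the function `normalize_triple` are hypothetical; a real knowledge graph would be queried rather than hard-coded.

```python
# Hypothetical alias table standing in for the external knowledge graph.
ALIAS_TO_ENTITY = {
    "z 6 pro": "computer",
    "thinkbook": "computer",
}


def normalize_triple(triple, alias_to_entity=ALIAS_TO_ENTITY):
    """Map the subject and content of a triple to their unified entity names."""
    subject, action, content = triple

    def canon(name):
        return alias_to_entity.get(name, name) if name is not None else None

    return (canon(subject), action, canon(content))


identified = [
    ("z 6 pro", "query pre-installed software", "office"),
    ("computer", "reinstall system", None),
    ("thinkbook", "query pre-installed software", "office"),
]
normalized = [normalize_triple(t) for t in identified]
```

After normalization the first and third triples coincide, which is what lets the later similarity grouping count them as the same candidate intent.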
Step 206, determining the intention corresponding to the third data cluster through the triplet corresponding to the first data in the third data cluster based on an inference model.
In the embodiment of the disclosure, all subject-predicate-object triples in first data in a third data cluster are extracted unsupervised through an open information extraction (OpenIE) sequence labeling method, and combinations of all subjects, actions and specific contents are obtained and used as candidate intents; and screening the intention with the highest similarity with the original text from the candidate intentions as a new intention of the data through a BERT-NLI reasoning model.
Specifically, converting candidate intentions corresponding to each first data in the third data cluster into semantic vectors based on an inference model such as a BERT-NLI inference model, and determining semantic similarity among the candidate intentions based on the semantic vectors of the first data in the third data cluster; and taking the candidate intention of which the semantic similarity meets the condition as the intention of the third data cluster.
Here, the condition is satisfied, for example, when there is a data group whose semantic similarity reaches a set threshold. When only one group of data has a semantic similarity reaching 90%, the intention of the data in that group is taken as the intention of the third data cluster. When more than one group of data has a semantic similarity reaching 90%, the intention corresponding to the group containing the largest number of data is taken as the intention of the third data cluster. When no data group has a semantic similarity reaching 90%, the set threshold is lowered, the data group whose semantic similarity reaches the lowered threshold is determined in the manner described above, and the intention corresponding to the data in that group is taken as the intention of the third data cluster.
As one implementation, more than two candidate intentions with semantic similarity reaching a set threshold are taken as candidate intention groups, and the candidate intention corresponding to the candidate intention group with the largest candidate intention number in the candidate intention groups is determined to meet the semantic similarity condition. For example, candidate intentions with semantic similarity reaching 85% are used as candidate intention groups, and candidate intentions corresponding to the candidate intention groups with the largest number of candidate intentions are used as intentions of a third data cluster, namely, candidate intentions corresponding to the candidate intention groups with the largest number of candidate intentions in the candidate intention groups meet the semantic similarity condition.
The essence of the technical solution of the embodiments of the present disclosure is further elucidated below by means of specific examples.
In application scenarios such as intelligent customer service, a structured intention splits intention understanding into parts so that the user's intention is understood in more detail and interaction with the user is more accurate and friendly. FIG. 3 shows a schematic diagram of the structured description of a statement according to an embodiment of the present disclosure. As shown in FIG. 3, identifying a structured intention typically requires a predefined schema system consisting essentially of three sub-taxonomy systems: subject, action, and specific content.
The current intention recognition mode can only recognize intentions within the currently defined system; if the user changes the form of expression, or a new intention outside the system appears, the intention cannot be accurately recognized. New intentions outside the system therefore have to be found through manual quality inspection feedback. However, because structured intentions are more complex, reasonable new intention labels have to be summarized from similar cases; when the intention system is large, this is labor-intensive and time-consuming, and the summarized new intentions do not necessarily cover the related data.
According to the embodiment of the disclosure, an unsupervised new-intention extraction pipeline is constructed, and new structured intentions and labels are automatically extracted from error cases in combination with an external knowledge graph. This assists quality inspection, helps to understand the new intentions behind users' new questions, and improves the user experience.
Fig. 4 shows a schematic implementation diagram of a data processing method according to an embodiment of the present disclosure, and as shown in fig. 4, the data processing method according to an embodiment of the present disclosure includes:
Semantic clustering is carried out on the data set to be processed to obtain a plurality of cases with new intentions.
Error data (error cases) with low confidence and the existing training data D are semantically clustered through a Single-Pass algorithm to generate new data clusters, each of which represents a new intention. As shown in fig. 4, the low-confidence error cases are clustered together with the training data D in the system: data that can be clustered with training data join the data cluster with the same intention, and data that cannot be clustered into any existing data cluster form new data clusters. The Single-Pass algorithm is a streaming data clustering method. Data are processed one at a time in the order in which they arrive; according to how well the current data matches the existing classes, the data is either assigned to an existing class or used to create a new class, which realizes incremental, dynamic clustering of streaming data. When data is clustered into an existing data cluster, the intention of that data cluster is taken as the intention of the data; data that cannot be clustered with the training data form a new data cluster.
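The Single-Pass procedure can be sketched in a few lines. This is an illustrative version under stated assumptions, not the patent's implementation: clusters are represented by centroids, matching uses cosine similarity with a fixed threshold of 0.8, centroids are updated as running means, and the toy two-dimensional vectors stand in for sentence embeddings that the source does not specify.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(seed_clusters, stream, threshold=0.8):
    """Assign each incoming (text, vector) to the best cluster or open a new one."""
    clusters = [dict(c, members=list(c["members"])) for c in seed_clusters]
    for text, vec in stream:
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            n = len(best["members"])
            # Running mean keeps the centroid representative of all members.
            best["centroid"] = [(m * n + x) / (n + 1)
                                for m, x in zip(best["centroid"], vec)]
            best["members"].append(text)
        else:
            # No existing class matches well enough: start a new data cluster.
            clusters.append({"intent": None, "centroid": list(vec), "members": [text]})
    return clusters


# Seed cluster built from training data D (hypothetical), then low-confidence cases.
seed = [{"intent": "account/reset/password", "centroid": [1.0, 0.0],
         "members": ["reset my password"]}]
stream = [("forgot my password", [0.9, 0.1]),
          ("is office pre-installed", [0.1, 0.9]),
          ("does it ship with office", [0.0, 1.0])]
result = single_pass(seed, stream)
```

In this toy run the first case joins the seeded cluster and inherits its intention, while the other two open and then share a new cluster whose `intent` is still `None`.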
For each new data cluster determined by the semantic clustering, candidate intents are extracted through OpenIE-based unsupervised entity relation extraction. Specifically, for the data in each new data cluster, all subject-predicate-object (SPO) triples are extracted in an unsupervised manner through an open information extraction (OpenIE) sequence labeling method, and all combinations of one-hop and two-hop relations are obtained and used as candidate intents.
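The source does not define one-hop and two-hop relations further. One possible reading, borrowed from common knowledge-graph usage, is that a one-hop relation is a single SPO triple and a two-hop relation chains two triples where the object of the first is the subject of the second. The sketch below follows that assumed reading; the function name and the sample triples are hypothetical.

```python
def hop_candidates(triples):
    """Return (one_hop, two_hop) relation candidates from SPO triples."""
    one_hop = list(dict.fromkeys(triples))  # de-duplicate, keep order
    by_subject = {}
    for s, p, o in one_hop:
        by_subject.setdefault(s, []).append((p, o))
    two_hop = []
    for s, p, o in one_hop:
        # Chain s -p-> o -p2-> o2 when o also appears as a subject.
        for p2, o2 in by_subject.get(o, []):
            two_hop.append((s, p, o, p2, o2))
    return one_hop, two_hop


spo = [
    ("user", "bought", "thinkbook"),
    ("thinkbook", "pre-installs", "office"),
    ("user", "bought", "thinkbook"),
]
one_hop, two_hop = hop_candidates(spo)
```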
And carrying out entity normalization on the extracted subject-predicate-object (SPO) triples by using an external knowledge graph to obtain unified entity expression of the same kind of entity.
Candidate intents which can represent the new data cluster are screened out from the candidate intents through a BERT-NLI reasoning model and used as labels of the intents of the new data cluster. Specifically, based on a BERT-NLI reasoning model, converting candidate intentions corresponding to each data in a new data cluster into semantic vectors, and determining Euclidean distance and/or cosine similarity of each semantic vector; determining semantic similarity between candidate intents based on the euclidean distance and/or the cosine similarity; and taking more than two candidate intentions with semantic similarity reaching a set threshold as candidate intention groups, and determining candidate intentions corresponding to the candidate intention group with the largest number of candidate intentions in the candidate intention groups as intentions of a new data cluster.
With this data processing method, the third data cluster, i.e., the new data cluster, can be determined by semantically clustering the data to be processed with the known data set; the new data cluster contains the first data, i.e., data with unknown intention. Entity recognition is performed on the unknown-intention data in the new data cluster to determine the corresponding triples, and the intention corresponding to the new data cluster is then determined based on the inference model.
The data processing method can further add the determined intention of the new data cluster to the corresponding data set, so that the intentions in the data set are continuously enriched and, when data whose intention is to be recognized is recognized based on the data set, the recognition result is more accurate. The present disclosure constructs an unsupervised pipeline for extracting new intentions and new-intention data clusters. Through information extraction and intention reasoning combined with an external knowledge graph, it automatically induces the structured intention and label of each new data cluster, further understands new questions from different users and new application scenarios, and keeps identifying new intentions through continuous self-learning, thereby communicating better with users and improving the user experience.
Fig. 5 is a schematic diagram showing a composition structure of a data processing apparatus according to an embodiment of the present disclosure, and as shown in fig. 5, the data processing apparatus according to an embodiment of the present disclosure includes:
an acquisition unit 50 for acquiring a data set to be processed; the data set to be processed comprises at least one first data with unknown intention, wherein the data is voice data or text data;
a clustering unit 51, configured to semantically cluster the data set to be processed with a data set to obtain a plurality of first data clusters, where the data set includes a plurality of second data with known intent, and the plurality of second data is clustered into at least one second data cluster based on semantics;
a first determining unit 52, configured to determine at least one third data cluster different from the second data cluster in the first data cluster, where the third data cluster includes at least one first data;
a second determining unit 53, configured to perform entity identification on the first data in the third data cluster, so as to determine a triplet corresponding to the first data in the third data cluster;
and a third determining unit 54, configured to determine, based on an inference model, an intent corresponding to the third data cluster through a triplet corresponding to the first data in the third data cluster.
On the basis of the data processing apparatus shown in fig. 5, the data processing apparatus of the embodiment of the present disclosure further includes:
a normalization processing unit (not shown in fig. 5), configured to perform entity normalization processing on the triples corresponding to the first data in the third data cluster based on a specified knowledge graph, to obtain candidate intents corresponding to the first data in the third data cluster.
As an implementation manner, the third determining unit 54 is further configured to:
converting candidate intentions corresponding to each first data in the third data cluster into semantic vectors based on an inference model, and determining semantic similarity among the candidate intentions based on the semantic vectors of the first data in the third data cluster; and taking the candidate intention of which the semantic similarity meets the condition as the intention of the third data cluster.
Specifically, the third determining unit 54 is further configured to: take two or more candidate intentions whose semantic similarity reaches a set threshold as a candidate intention group, and determine that the candidate intention corresponding to the candidate intention group containing the largest number of candidate intentions meets the semantic similarity condition.
As an implementation, the second determining unit 53 is further configured to:
extract, without supervision, the subject-predicate-object triples of the first data based on at least one of parts of speech, dependency syntax, semantic roles, and clauses, through an open information extraction sequence labeling method.
The third determining unit 54 is further configured to determine, after determining the intention corresponding to the third data cluster, that the intention of the first data in the third data cluster is the intention corresponding to the third data cluster; if first data with unknown intentions are clustered into a second data cluster based on semantics, the intentions corresponding to the second data cluster are used as intentions of the first data; wherein the first data is changed to data of known intent.
On the basis of the data processing apparatus shown in fig. 5, the data processing apparatus of the embodiment of the present disclosure further includes:
an adding unit (not shown in fig. 5) for adding the first data to the data set.
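The behaviour of the third determining unit and the adding unit, taken together, amounts to labeling the first data with a cluster's intention and appending it to the data set. A minimal sketch follows; representing the data set as a mapping from intention to a list of texts, and the function name, are assumptions made for illustration.

```python
def absorb_first_data(data_set, cluster_members, cluster_intent):
    """Label every unknown-intention item in a cluster and add it to the data set.

    data_set maps an intention to the list of texts known to carry it.
    """
    data_set.setdefault(cluster_intent, []).extend(cluster_members)
    return data_set


# Hypothetical data set with one known intention.
data_set = {"check_order": ["where is my order"]}

# First data from a third (new) data cluster receives the newly determined intention.
absorb_first_data(
    data_set,
    ["cancel my subscription", "stop my membership"],
    "cancel_subscription",
)

# First data clustered into an existing second data cluster inherits its intention.
absorb_first_data(data_set, ["where is my parcel"], "check_order")
```

After both calls, every piece of first data has become data of known intention and is part of the data set used for later recognition.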
In an exemplary embodiment, the acquisition unit 50, the clustering unit 51, the first determination unit 52, the second determination unit 53, the third determination unit 54, the adding unit, etc. may be implemented by one or more central processing units (CPU, Central Processing Unit), graphics processors (GPU, Graphics Processing Unit), application specific integrated circuits (ASIC, Application Specific Integrated Circuit), digital signal processors (DSP, Digital Signal Processor), programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field programmable gate arrays (FPGA, Field-Programmable Gate Array), general purpose processors, controllers, microcontrollers (MCU, Micro Controller Unit), microprocessors (Microprocessor), or other electronic components.
The specific manner in which the various modules and units of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments and will not be repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.
Fig. 6 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the steps of the data processing method of embodiments of the present disclosure by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems-on-a-Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing describes merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope of the disclosure shall fall within the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A data processing method, comprising:
acquiring a data set to be processed, wherein the data set to be processed comprises at least one first data with unknown intention, and the data is voice data or text data;
carrying out semantic clustering on the data set to be processed and a data set to obtain a plurality of first data clusters, wherein the data set comprises a plurality of second data with known intention, and the plurality of second data are clustered into at least one second data cluster based on the semantic clustering;
determining at least one third data cluster different from the second data cluster in the first data cluster, wherein the third data cluster comprises at least one first data;
performing entity identification on the first data in the third data cluster so as to determine a triplet corresponding to the first data in the third data cluster;
and determining the intention corresponding to the third data cluster through the triplet corresponding to the first data in the third data cluster based on an inference model.
2. The data processing method of claim 1, further comprising, prior to determining the intent corresponding to the third data cluster:
performing entity normalization processing on the triples corresponding to the first data in the third data cluster based on a specified knowledge graph to obtain candidate intents corresponding to the first data in the third data cluster.
3. The data processing method according to claim 2, the determining the intent corresponding to the third data cluster, comprising:
converting candidate intentions corresponding to each first data in the third data cluster into semantic vectors based on an inference model, and determining semantic similarity among the candidate intentions based on the semantic vectors of the first data in the third data cluster;
and taking the candidate intention of which the semantic similarity meets the condition as the intention of the third data cluster.
4. The data processing method of claim 2, the method further comprising:
taking two or more candidate intentions whose semantic similarity reaches a set threshold as a candidate intention group, and determining that the candidate intention corresponding to the candidate intention group containing the largest number of candidate intentions meets the semantic similarity condition.
5. The method of claim 1, wherein the determining the triplet corresponding to the first data in the third data cluster comprises:
extracting, without supervision, the subject-predicate-object triples of the first data based on at least one of parts of speech, dependency syntax, semantic roles, and clauses, through an open information extraction sequence labeling method.
6. The method of any of claims 1 to 5, wherein after determining the intent corresponding to the third data cluster, the method further comprises:
determining the intention of the first data in the third data cluster as the corresponding intention of the third data cluster;
wherein the first data is changed to data of known intent.
7. The method according to claim 1, wherein the method further comprises:
if first data with unknown intentions are clustered into a second data cluster based on semantics, the intentions corresponding to the second data cluster are used as intentions of the first data;
wherein the first data is changed to data of known intent.
8. The method according to claim 6 or 7, further comprising: adding the first data to the data set.
9. A data processing apparatus comprising:
an acquisition unit for acquiring a data set to be processed; the data set to be processed comprises at least one first data with unknown intention, wherein the data is voice data or text data;
a clustering unit, configured to semantically cluster the data set to be processed with a data set to obtain a plurality of first data clusters, wherein the data set comprises a plurality of second data with known intention, and the plurality of second data are clustered into at least one second data cluster based on semantics;
a first determining unit configured to determine at least one third data cluster different from the second data cluster in the first data cluster, where the third data cluster includes at least one first data;
the second determining unit is used for carrying out entity identification on the first data in the third data cluster so as to determine a triplet corresponding to the first data in the third data cluster;
and the third determining unit is used for determining the intention corresponding to the third data cluster through the triplet corresponding to the first data in the third data cluster based on the reasoning model.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the data processing method of any one of claims 1 to 8.
CN202310339465.9A 2023-03-31 2023-03-31 Data processing method and device and electronic equipment Pending CN116578885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310339465.9A CN116578885A (en) 2023-03-31 2023-03-31 Data processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310339465.9A CN116578885A (en) 2023-03-31 2023-03-31 Data processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN116578885A true CN116578885A (en) 2023-08-11

Family

ID=87538470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310339465.9A Pending CN116578885A (en) 2023-03-31 2023-03-31 Data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116578885A (en)

Similar Documents

Publication Publication Date Title
US10664505B2 (en) Method for deducing entity relationships across corpora using cluster based dictionary vocabulary lexicon
US20220318275A1 (en) Search method, electronic device and storage medium
US20220058222A1 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
JP2022191412A (en) Method for training multi-target image-text matching model and image-text retrieval method and apparatus
CN111552788B (en) Database retrieval method, system and equipment based on entity attribute relationship
CN113032673B (en) Resource acquisition method and device, computer equipment and storage medium
CN114495143B (en) Text object recognition method and device, electronic equipment and storage medium
CN113590776A (en) Text processing method and device based on knowledge graph, electronic equipment and medium
CN112528641A (en) Method and device for establishing information extraction model, electronic equipment and readable storage medium
CN113988157A (en) Semantic retrieval network training method and device, electronic equipment and storage medium
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
CN113836316B (en) Processing method, training method, device, equipment and medium for ternary group data
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN112560480B (en) Task community discovery method, device, equipment and storage medium
CN116955856A (en) Information display method, device, electronic equipment and storage medium
CN115905497B (en) Method, device, electronic equipment and storage medium for determining reply sentence
CN116467461A (en) Data processing method, device, equipment and medium applied to power distribution network
CN113221566B (en) Entity relation extraction method, entity relation extraction device, electronic equipment and storage medium
CN115292506A (en) Knowledge graph ontology construction method and device applied to office field
CN116578885A (en) Data processing method and device and electronic equipment
CN114528378A (en) Text classification method and device, electronic equipment and storage medium
CN113590774A (en) Event query method, device and storage medium
CN115828915B (en) Entity disambiguation method, device, electronic equipment and storage medium
CN116484870B (en) Method, device, equipment and medium for extracting text information
CN113535958B (en) Production line aggregation method, device and system, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination