CN111723582A - Intelligent semantic classification method, device, equipment and storage medium - Google Patents

Intelligent semantic classification method, device, equipment and storage medium

Info

Publication number
CN111723582A
CN111723582A (application CN202010581247.2A)
Authority
CN
China
Prior art keywords: coarse-grained, intention, role, corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010581247.2A
Other languages
Chinese (zh)
Other versions
CN111723582B (en)
Inventor
马丹
勾震
曾增烽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202010581247.2A priority Critical patent/CN111723582B/en
Publication of CN111723582A publication Critical patent/CN111723582A/en
Application granted granted Critical
Publication of CN111723582B publication Critical patent/CN111723582B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The scheme relates to the field of artificial intelligence, is applied to semantic analysis, and provides an intelligent semantic classification method, device, equipment and storage medium. The method comprises the following steps: obtaining original text data; labeling the original text data through a preset intention role labeling model to obtain the coarse-grained speech segments of the original text data and the intention role corresponding to each coarse-grained speech segment; classifying the coarse-grained speech segments under their corresponding intention roles; clustering the coarse-grained speech segments in the set corresponding to each intention role to obtain the semantic families corresponding to that set; and naming the semantic families. According to the invention, speech segments can be semantically classified without labeled data, which improves data classification efficiency. In addition, the invention also relates to blockchain technology: the semantic families corresponding to the coarse-grained speech segment sets under each intention role can be stored in a blockchain.

Description

Intelligent semantic classification method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, is applied to semantic analysis, and particularly relates to an intelligent semantic classification method, device, equipment and storage medium.
Background
With the progress of society and the development of big data, spoken language understanding (SLU) technology plays a crucial role in the development of voice assistants, which currently draw wide attention in the industry. In particular, voice assistants in fields such as finance are often required to solve user problems across many scenes and domains, including highly professional ones. As the breadth of topics and domains covered by the corresponding corpora has grown rapidly, traditional spoken language understanding techniques have become unable to provide effective services.
Existing SLU methods need to perform intention classification and slot filling, both of which are fine-grained classification tasks. The classifiers required to accomplish these tasks therefore often need to handle many intention roles, which increases the difficulty of classification and reduces its effect. Meanwhile, the traditional SLU pipeline is a bottom-up process in which the format and content of the SLU output data are determined by a downstream function and its parameters. Such a design is highly limited: it can often be applied only to a single downstream task, and its migration capability to other business scenarios is poor.
Disclosure of Invention
The invention mainly aims to solve the technical problems that the classification of intention roles is difficult and its classification efficiency is low.
The invention provides an intelligent semantic classification method in a first aspect, which comprises the following steps:
acquiring original text data from a preset corpus;
labeling the original text data through a preset intention role labeling model to obtain the coarse-grained speech segments in the original text data and the intention role corresponding to each coarse-grained speech segment;
classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
and clustering each coarse-grained speech segment set respectively to obtain the semantic families corresponding to the coarse-grained speech segment set under each intention role, and naming the semantic families.
Optionally, in a first implementation manner of the first aspect of the present invention, before the labeling of the original text data through the preset intention role labeling model to obtain the coarse-grained speech segments in the original text data and the intention roles corresponding to the coarse-grained speech segments, the method further includes:
reading text corpora;
labeling the text corpus according to a BIO labeling format to obtain a labeled corpus of the text corpus;
and inputting the labeled corpus as a training set into a preset serialized labeling model for training, and outputting the intention role labeling model.
Optionally, in a second implementation manner of the first aspect of the present invention, the inputting of the labeled corpus as a training set into a preset serialized labeling model for training and the outputting of the intention role labeling model include:
inputting the labeled corpus into a preset serialization labeling model for pre-training, and performing sequence labeling on the labeled corpus through the serialization labeling model to obtain the prediction labeling results of a plurality of tasks;
calculating a model loss value according to the prediction labeling result;
reversely inputting the model loss value into the serialized annotation model, and judging whether the model loss value reaches a preset loss value or not;
if not, updating the parameters of the serialized annotation model according to the model loss value by adopting a back propagation algorithm;
processing the annotation corpus through a serialized annotation model after parameter updating to obtain the prediction annotation results of a plurality of tasks;
recalculating the model loss value based on the prediction labeling result;
and if the model loss value reaches a preset loss value, confirming model convergence, and taking the serialized annotation model after the parameters are updated as the finally trained intention role annotation model.
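The per-step training procedure above (predict, compute the model loss, compare it with the preset loss value, back-propagate, repeat) can be sketched as a generic loop. The quadratic toy loss below merely stands in for the serialized labeling model's loss; all names and numbers here are illustrative, not from the patent.

```python
def train_until_converged(compute_loss, update_params, params, target_loss, max_steps=1000):
    """Repeat: compute the model loss; if it has not reached the preset
    loss value, update the parameters (the back-propagation step) and retry."""
    for step in range(max_steps):
        loss = compute_loss(params)
        if loss <= target_loss:  # model convergence confirmed
            return params, loss, step
        params = update_params(params, loss)
    return params, compute_loss(params), max_steps

# Toy stand-in: minimize (p - 3)^2 by gradient descent with learning rate 0.1.
compute_loss = lambda p: (p - 3.0) ** 2
update_params = lambda p, loss: p - 0.1 * 2.0 * (p - 3.0)
params, loss, steps = train_until_converged(compute_loss, update_params, 0.0, 1e-4)
```

After the loop exits, `params` is the "serialized labeling model after the parameters are updated" of the text, and `loss` is at or below the preset loss value.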
Optionally, in a third implementation manner of the first aspect of the present invention, the labeling of the original text data through the preset intention role labeling model to obtain the coarse-grained speech segments in the original text data and the intention roles corresponding to the coarse-grained speech segments includes:
performing intention role labeling on the original text data through a preset intention role labeling model to obtain an intention role labeling result of the original text data;
determining the intention role corresponding to each character and punctuation mark in the original text data based on the intention role labeling result;
and determining the coarse-grained speech segments in the original text data and their corresponding intention roles based on the intention role corresponding to each character and punctuation mark.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the clustering of the coarse-grained speech segment sets respectively to obtain the semantic families corresponding to the coarse-grained speech segment sets under each intention role and naming the semantic families includes:
vectorizing each coarse-grained speech segment in the coarse-grained speech segment set corresponding to each intention role respectively to obtain the corresponding coarse-grained speech segment vectors;
respectively calculating the first cosine similarity between every two coarse-grained speech segment vectors based on a preset cosine similarity algorithm;
clustering the coarse-grained speech segments under each intention role based on the first cosine similarity to obtain a plurality of semantic families corresponding to each coarse-grained speech segment set under each intention role;
and naming the semantic families respectively, wherein one semantic family comprises a plurality of coarse-grained speech segments with similar semantics.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the clustering, based on the first cosine similarity, of the coarse-grained speech segment sets under each intention role to obtain a plurality of semantic families corresponding to the coarse-grained speech segment sets under each intention role includes:
setting the clustering number of the coarse-grained speech segments under each intention role to k, and randomly selecting k coarse-grained speech segments as the initial clustering centers;
classifying the coarse-grained speech segments in the coarse-grained speech segment set under each intention role to the semantic family corresponding to each initial clustering center respectively based on the first cosine similarity, until the classification of the coarse-grained speech segments is finished;
and determining the real clustering center of each semantic family to obtain a plurality of target semantic families corresponding to each coarse-grained speech segment set under each intention role.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the clustering of the coarse-grained speech segment sets respectively to obtain the semantic families corresponding to the coarse-grained speech segment sets under each intention role and naming the semantic families, the method further includes:
receiving a question of a user;
labeling the question of the user through the intention role labeling model to obtain labeled speech segments and the intention roles corresponding to the labeled speech segments;
vectorizing the labeled speech segments to obtain labeled speech segment vectors, and calculating the second cosine similarity between the labeled speech segment vectors and the clustering center of each semantic family;
and determining the semantic family to which a labeled speech segment belongs based on the second cosine similarity, and determining the real intention of the user based on the semantic family.
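The question-matching steps above (vectorize the labeled speech segment, compute the second cosine similarity against each cluster center, pick the most similar) might look roughly like this sketch; the family names and vectors are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_semantic_family(segment_vector, family_centers):
    """Return the semantic family whose cluster center is most similar
    to the labeled segment's vector (the 'second cosine similarity')."""
    return max(family_centers,
               key=lambda name: cosine_similarity(segment_vector, family_centers[name]))

# Hypothetical cluster centers for two families under the Slot role.
centers = {"survival-fund-like": [1.0, 0.1], "premium-like": [0.1, 1.0]}
family = assign_semantic_family([0.9, 0.2], centers)
print(family)  # survival-fund-like
```

The returned family then determines the real intention of the user, as the step above describes.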
The second aspect of the present invention provides an intelligent semantic classification device, including:
the acquisition module is used for acquiring original text data from a preset corpus;
the first labeling module is used for labeling the original text data through a preset intention role labeling model to obtain the coarse-grained speech segments in the original text data and the intention roles corresponding to the coarse-grained speech segments;
the classification module is used for classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
and the clustering module is used for respectively clustering the coarse-grained speech segment sets to obtain the semantic families corresponding to the coarse-grained speech segment sets under each intention role and naming the semantic families.
Optionally, the intelligent semantic classification device further includes:
the reading module is used for reading the text corpora;
the second labeling module is used for labeling the text corpus according to a BIO labeling format to obtain a labeled corpus of the text corpus;
and the training module is used for inputting the labeled corpus as a training set into a preset serialized labeling model for training and outputting the intention role labeling model.
Optionally, in a first implementation manner of the second aspect of the present invention, the training module is specifically configured to:
inputting the labeled corpus into a preset serialization labeling model for pre-training, and performing sequence labeling on the labeled corpus through the serialization labeling model to obtain the prediction labeling results of a plurality of tasks;
calculating a model loss value according to the prediction labeling result;
reversely inputting the model loss value into the serialized annotation model, and judging whether the model loss value reaches a preset loss value or not;
if not, updating the parameters of the serialized annotation model according to the model loss value by adopting a back propagation algorithm;
processing the annotation corpus through a serialized annotation model after parameter updating to obtain the prediction annotation results of a plurality of tasks;
recalculating the model loss value based on the prediction labeling result;
and if the model loss value reaches a preset loss value, confirming model convergence, and taking the serialized annotation model after the parameters are updated as the finally trained intention role annotation model.
Optionally, in a second implementation manner of the second aspect of the present invention, the first labeling module is specifically configured to:
performing intention role labeling on the original text data through a preset intention role labeling model to obtain an intention role labeling result of the original text data;
determining the intention role corresponding to each character and punctuation mark in the original text data based on the intention role labeling result;
and determining the coarse-grained speech segments in the original text data and their corresponding intention roles based on the intention role corresponding to each character and punctuation mark.
Optionally, in a third implementation manner of the second aspect of the present invention, the clustering module includes:
the processing unit is used for respectively carrying out vectorization processing on each coarse-grained speech segment in the coarse-grained speech segment set corresponding to each intention role to obtain a corresponding coarse-grained speech segment vector;
the calculating unit is used for respectively calculating the first cosine similarity between every two coarse-granularity speech segment vectors based on a preset cosine similarity calculation method;
a clustering unit, configured to cluster the coarse-grained speech segment sets under each intention role based on the first cosine similarity, so as to obtain a plurality of semantic families corresponding to each coarse-grained speech segment set under each intention role;
and the naming unit is used for naming the plurality of semantic families respectively, wherein one semantic family comprises a plurality of coarse-grained speech segments with similar semantics.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the clustering unit is specifically configured to:
setting the clustering number of the coarse-grained speech segments under each intention role to k, and randomly selecting k coarse-grained speech segments as the initial clustering centers;
classifying the coarse-grained speech segments in the coarse-grained speech segment set under each intention role to the semantic family corresponding to each initial clustering center respectively based on the first cosine similarity, until the classification of the coarse-grained speech segments is finished;
and determining the real clustering center of each semantic family to obtain a plurality of target semantic families corresponding to each coarse-grained speech segment set under each intention role.
Optionally, the intelligent semantic classification device further includes:
the receiving module is used for receiving a question of a user;
the third labeling module is used for labeling the question of the user through the intention role labeling model to obtain a labeled language segment and an intention role corresponding to the labeled language segment;
the processing module is used for vectorizing the labeled speech segments to obtain labeled speech segment vectors and calculating the second cosine similarity between the labeled speech segment vectors and the clustering center of each semantic family;
and the determining module is used for determining the semantic family to which a labeled speech segment belongs based on the second cosine similarity and determining the real intention of the user based on the semantic family.
A third aspect of the present invention provides an intelligent semantic classification device, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the intelligent semantic categorization apparatus to perform the intelligent semantic categorization method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the above-mentioned intelligent semantic classification method.
In the technical scheme provided by the invention, the original text data is labeled by a preset intention role labeling model to obtain the coarse-grained speech segments in the original text data and the intention role corresponding to each coarse-grained speech segment; each coarse-grained speech segment is classified under its corresponding intention role to obtain the coarse-grained speech segment set corresponding to each intention role; and each coarse-grained speech segment set is clustered respectively to obtain, and name, the semantic families corresponding to the coarse-grained speech segment sets under each intention role. The scheme can be applied to the field of artificial intelligence, thereby promoting social progress; it can intelligently and semantically classify speech segments without any labeled data, with high accuracy, its output data can be used directly by downstream tasks, and it improves data classification efficiency.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of the intelligent semantic classification method of the present invention;
FIG. 2 is a schematic diagram of a second embodiment of the intelligent semantic classification method of the present invention;
FIG. 3 is a schematic diagram of a third embodiment of the intelligent semantic classification method according to the present invention;
FIG. 4 is a schematic diagram of a fourth embodiment of the intelligent semantic classification method according to the present invention;
FIG. 5 is a schematic diagram of a fifth embodiment of the intelligent semantic classification method according to the present invention;
FIG. 6 is a schematic diagram of a first embodiment of the intelligent semantic classification device according to the present invention;
FIG. 7 is a schematic diagram of a second embodiment of the intelligent semantic classification device according to the present invention;
fig. 8 is a schematic diagram of an embodiment of the intelligent semantic classification device of the invention.
Detailed Description
The embodiment of the invention relates to artificial intelligence and provides an intelligent semantic classification method, device, equipment and storage medium. The scheme belongs to the field of artificial intelligence and can promote social progress and development; it can intelligently and semantically classify speech segments without labeled data, with high accuracy, its output data can be used directly by downstream tasks, and it improves classification efficiency.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For easy understanding, a detailed flow of the embodiment of the present invention is described below, and referring to fig. 1, a first embodiment of the intelligent semantic classification method of the present invention includes:
101. acquiring original text data from a preset corpus;
in this embodiment, the original text data is obtained from a preset corpus. The original text data refers to the corpus containing user questions, and in most cases it needs to be cleaned first. Data cleaning refers to the final procedure of finding and correcting recognizable errors in the original text, including checking data consistency and processing invalid and missing values. This process of re-examining and verifying the data aims to remove duplicate information, correct existing errors, and ensure data consistency. Data that do not meet the requirements are filtered out according to certain rules, the original text data is corrected, and labeling is then performed.
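A minimal sketch of the cleaning step just described, assuming the core rules are trimming whitespace, dropping empty entries, and removing exact duplicates (the patent does not specify its exact rule set):

```python
def clean_corpus(raw_lines):
    """Trim whitespace, drop empty entries, and remove exact duplicates
    while keeping the original order of the remaining corpus lines."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        text = line.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

raw = ["  transaction failed  ", "", "transaction failed", "how to pay premium"]
print(clean_corpus(raw))  # ['transaction failed', 'how to pay premium']
```

Real pipelines would add consistency checks and invalid-value handling on top of this skeleton.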
In this embodiment, an annotator needs to label each word segment of the original text data according to a serialized labeling format in the character-level (sub-word-level) BIO mode to obtain thousands of first corpora, consistent with BERT model training data. The labeling rule of the BIO mode is to label each element in the text to be labeled as "B-X", "I-X", or "O", where "B-X" indicates that the segment containing the element belongs to type X and the element is at the beginning of the segment, "I-X" indicates that the segment containing the element belongs to type X and the element is in the middle of the segment, and "O" indicates that the element does not belong to any type. For example, when labeling commodity names (cp), the three BIO labels are B-cp (beginning of a commodity name), I-cp (middle of a commodity name), and O (not part of a commodity name). For instance, for the sentence "Withdrawal prompts transaction failure, what should I do", the characters of "withdrawal" are labeled B-Action then I-Action, the characters of "prompts transaction failure" are labeled B-Problem followed by I-Problem, the comma is labeled O, and the characters of "what should I do" are labeled B-Question followed by I-Question.
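Decoding such BIO labels back into coarse-grained segments with their roles is mechanical. The sketch below uses an invented English tokenization of the example sentence, with role names matching the five roles listed later (Action, Problem, Question); it is illustrative, not the patent's implementation.

```python
def bio_to_spans(tokens, tags):
    """Group a BIO-tagged token sequence into (segment, role) spans:
    B-X starts a span of role X, I-X continues it, O closes any open span."""
    spans, current, role = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), role))
            current, role = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == role:
            current.append(token)
        else:  # "O", or an I- tag that does not match the open span
            if current:
                spans.append((" ".join(current), role))
            current, role = [], None
    if current:
        spans.append((" ".join(current), role))
    return spans

tokens = ["withdraw", "prompts", "transaction", "failure", ",", "what", "to", "do"]
tags = ["B-Action", "B-Problem", "I-Problem", "I-Problem", "O",
        "B-Question", "I-Question", "I-Question"]
print(bio_to_spans(tokens, tags))
# [('withdraw', 'Action'), ('prompts transaction failure', 'Problem'), ('what to do', 'Question')]
```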
102. Marking the original text data through a preset intention role marking model to obtain coarse grain language segments in the original text data and intention roles corresponding to the coarse grain language segments;
in this embodiment, a small amount of labeled text data is used for training to obtain an intention role labeling model, and this model is then used to tag intention roles on all the original text data to obtain the coarse-grained speech segments and the intention role corresponding to each of them.
In this embodiment, the intention roles refer to the following five: Slot, Background, Action, Problem, and Question.
In this embodiment, an intention role labeling model can dig out the speech segments carrying a designated intention role, and all the intention roles are then integrated to obtain a more accurate user intention. For example, for the question "Can the survival fund be used to pay the premium?", the speech segments organized according to the labeling result include "survival fund", "can", "pay", "premium", and "?", and the intention roles of these speech segments are Slot, Question, Action, Slot, and Question respectively. These results are organized directly from the character-level intention roles. Specifically, in the output format "BIO", B means Begin, I means Inside, and O means Outside; the part after the "-" in a B or I label indicates which intention role that character belongs to. For example, in FIG. 2, the three characters of "survival fund" are labeled B-Slot, I-Slot, and I-Slot, which indicates that together they form one coarse-grained speech segment whose intention role is Slot.
In this embodiment, a coarse-grained speech segment refers to a speech segment that may be a word or a phrase fragment; "granularity" is a metaphor comparing the length of a word or phrase to the coarseness of grit. For example, "prompts transaction failure" is a coarse-grained speech segment: it is relatively long and carries a relatively rich meaning. Traditional methods do not analyze such long fragments; the usual approach is to split them into fine-grained words with definite meanings, such as "prompt", "transaction", and "failure", and then analyze the three words separately.
103. Classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
in this embodiment, according to the intention role corresponding to each coarse-grained speech segment, all coarse-grained speech segments are classified into 5 classes under their corresponding intention roles. For example, labeling a batch of original text data yields 400 coarse-grained speech segments; according to their intention roles, the speech segments are stored under the five corresponding intention roles, namely Slot, Background, Action, Problem, and Question, and the coarse-grained speech segments under each intention role are then clustered respectively.
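The classification step above amounts to bucketing each labeled segment under its intention role; a minimal sketch with invented segments:

```python
from collections import defaultdict

def group_by_role(labeled_segments):
    """Collect coarse-grained speech segments into one set per intention role."""
    role_sets = defaultdict(list)
    for segment, role in labeled_segments:
        role_sets[role].append(segment)
    return dict(role_sets)

labeled = [
    ("survival fund", "Slot"),
    ("prompts transaction failure", "Problem"),
    ("pay", "Action"),
    ("premium", "Slot"),
    ("what to do", "Question"),
]
print(group_by_role(labeled))
```

Each value in the returned dict is one "coarse-grained speech segment set" ready for per-role clustering.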
104. And clustering the coarse-grained corpus respectively to obtain semantic clans corresponding to the coarse-grained corpus under each intention role and naming the semantic clans.
In this embodiment, the coarse-grained speech segments in the coarse-grained speech segment set under each intention role are clustered, and the semantic families corresponding to each intention role are generated and named.
In this embodiment, all the coarse-grained speech segments are clustered to generate semantic families, each composed of several coarse-grained speech segments with similar meanings. For example, labeling a batch of original text data yields 400 coarse-grained speech segments, which are stored under the five corresponding intention roles Slot, Background, Action, Problem, and Question. Suppose the Slot role contains 120 coarse-grained speech segments; clustering these 120 speech segments yields groups of coarse-grained speech segments with similar semantics, say 15 of them, each called a semantic family, and each family is named according to its semantic category and the number of coarse-grained speech segments it contains.
In this embodiment, clustering is a special classification process that divides uncertain sample data, for which prior knowledge is insufficient, into several classes. The division criterion is to place data records with high semantic similarity into the same semantic clan while maximizing the dissimilarity between records in different clans. Clustering is a statistical analysis method for studying classification problems (of samples or indexes). A cluster generated by clustering is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters.
In this embodiment, the coarse-grained speech segments are vectorized, and then clustered according to the cosine similarity between every two vectors to determine the clustering result. For example, suppose the original text data yields n coarse-grained speech segments M1, M2, M3, ..., Mn; vectorizing these segments produces n vectors V1, V2, V3, ..., Vn. The cosine similarity between every two vectors is calculated, vectors with higher similarity are gathered together, and clustering the vectors yields the clustering result, so that speech segments with similar meanings fall into the same class.
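The vectorize-then-compare flow above can be sketched as follows. The 2-D vectors and the 0.9 threshold are illustrative stand-ins for real language-model embeddings, and the greedy grouping is only a simplified placeholder for the clustering algorithms described in the later embodiments:

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def greedy_cluster(vectors, threshold=0.9):
    """Group vector indices whose cosine similarity to a cluster's first
    member exceeds the threshold (a simplified stand-in for clustering)."""
    clusters = []
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(vectors[c[0]], v) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy 2-D stand-ins for the segment vectors V1..Vn.
vecs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
print(greedy_cluster(vecs))  # → [[0, 1], [2, 3]]
```

Segments 0 and 1 (and likewise 2 and 3) point in nearly the same direction, so they fall into the same class, matching the intended behavior of the clustering step.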
In the embodiment of the invention, original text data is labeled by a preset intention role labeling model to obtain coarse-grained speech segments and the intention roles corresponding to them, and the coarse-grained speech segments are clustered to obtain semantic clans of segments with similar meanings; a corresponding semantic clan is defined according to the semantics of the coarse-grained speech segments, and a corresponding concept knowledge base is constructed. The scheme belongs to the field of artificial intelligence and can promote social progress and development; speech segments can be classified by intelligent semantics in a time- and labor-saving way without any labeled data, the output data and results can be used directly by downstream tasks, and data classification efficiency is improved. It should be emphasized that, to further ensure the privacy and security of the semantic clans corresponding to the coarse-grained speech segments under each intention role, these semantic clans may also be stored in a node of a blockchain.
Referring to fig. 2, a second embodiment of the intelligent semantic classification method of the present invention includes:
201. reading text corpora;
in this embodiment, to obtain a model, a batch of data first needs to be collected and used as the training data set, and training is performed on the features contained in the training data set to obtain a model with the desired function. The text data consists of user questions obtained from a service website or from a user data information base in the relevant field.
202. Labeling the text corpus according to a BIO labeling format to obtain a labeled corpus of the text corpus;
in this embodiment, the text corpus is labeled according to the BIO labeling format to obtain the corresponding labeled corpus. For example, for the question "Transferring out prompts transaction failure, what should I do?", the characters of "transfer out" are labeled "B-Action" and "I-Action", the characters of "prompt transaction failure" are labeled "B-Problem" followed by "I-Problem", the comma is labeled "O", and the characters of "what should I do" are labeled "B-Question" followed by "I-Question".
203. Inputting the labeled corpus into a preset serialization labeling model for pre-training, and performing sequence labeling on the labeled corpus through the serialization labeling model to obtain the prediction labeling results of a plurality of tasks;
in this embodiment, the labeled corpus is input into a preset serialization labeling model for training, and sequence labeling is performed on it through the model to obtain the predicted labeling results of multiple tasks. Labeling systems include the BIOES system, the BIO system, and so on; all of them encode the text to be labeled with single or discontinuous English character strings.
In this embodiment, BIO means labeling each element as "B-X", "I-X", or "O". "B-X" indicates that the segment containing the element is of type X and the element is at the beginning of the segment; "I-X" indicates that the segment containing the element is of type X and the element is in the middle of the segment; "O" indicates that the element does not belong to any type. For example, if X denotes a noun phrase (Noun Phrase, NP), the three BIO labels are: (1) B-NP: the beginning of a noun phrase; (2) I-NP: the middle of a noun phrase; (3) O: not a noun phrase. A dialog can therefore be divided as follows: the characters of "cancel" are labeled "B-Action" and "I-Action", filler characters such as "OK" are labeled "O", and the characters of "loan application" are labeled "B-Slot" followed by "I-Slot".
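The inverse operation, recovering (segment, role) pairs from a character-level BIO sequence, can be sketched as below; the example sentence and its tags are illustrative:

```python
def bio_to_segments(tokens, tags):
    """Merge character-level BIO tags into (segment, role) pairs.
    "B-X" starts a segment of role X, "I-X" continues it, "O" is skipped."""
    segments, cur_text, cur_role = [], "", None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_role is not None:
                segments.append((cur_text, cur_role))
            cur_text, cur_role = tok, tag[2:]
        elif tag.startswith("I-") and cur_role == tag[2:]:
            cur_text += tok
        else:  # "O" or an inconsistent tag ends the current segment
            if cur_role is not None:
                segments.append((cur_text, cur_role))
            cur_text, cur_role = "", None
    if cur_role is not None:
        segments.append((cur_text, cur_role))
    return segments

# "Transfer out prompts transaction failure, what to do", character by character.
tokens = list("转出提示交易失败，怎么办")
tags = ["B-Action", "I-Action",
        "B-Problem", "I-Problem", "I-Problem", "I-Problem", "I-Problem", "I-Problem",
        "O",
        "B-Question", "I-Question", "I-Question"]
print(bio_to_segments(tokens, tags))
# → [('转出', 'Action'), ('提示交易失败', 'Problem'), ('怎么办', 'Question')]
```

This is the same decoding that later steps rely on to turn a character-level labeling result into coarse-grained speech segments with intention roles.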
204. Calculating a model loss value according to the prediction labeling result;
in this embodiment, a corresponding loss function is obtained according to the prediction labeling result of each task, and the model loss value is calculated from the loss value of that function. A loss function (or cost function) maps a random event, or the value of a related random variable, to a non-negative real number representing the "risk" or "loss" of that event.
205. Reversely inputting the model loss value into the serialized annotation model, and judging whether the model loss value reaches a preset loss value or not;
in this embodiment, the model loss value is reversely input into the first serialization labeling model, whether the model loss value reaches the preset loss value is judged, and the parameters corresponding to the model are updated according to the model loss value, so as to obtain the optimized new model.
206. If not, updating the parameters of the serialized annotation model according to the model loss value by adopting a back propagation algorithm;
in this embodiment, if the model loss value does not reach the preset loss value, a back propagation algorithm is adopted, and the corresponding parameter corresponding to the first serialization labeling model is updated according to the model loss value.
The back propagation algorithm (BP algorithm) is a supervised learning algorithm suited to multilayer neuron networks and is based on gradient descent. The input-output relationship of a BP network is essentially a mapping: an n-input, m-output BP neural network performs a continuous mapping from n-dimensional Euclidean space to a finite field in m-dimensional Euclidean space, a mapping that is highly non-linear. Its information-processing ability comes from multiple compositions of simple non-linear functions, so it has a strong function-approximation ability.
207. Processing the annotation corpus through a serialized annotation model after parameter updating to obtain the prediction annotation results of a plurality of tasks;
in this embodiment, each training sample in the first training sample set is processed by the first serialized annotation model after the parameter update, so as to obtain a prediction annotation result corresponding to each training sample.
And after a prediction labeling result is obtained, updating parameters of the sequence labeling model according to a gradient descent algorithm to obtain a trained first intention role labeling model.
In this embodiment, the gradient of the loss function may be calculated by gradient descent to determine whether the parameters W and b of the first recurrent neural network layer, the parameter Wa of the attention layer, and the probability transition matrix A = [Aij] of the CRF layer in the sequence annotation model need to be updated; if the sequence annotation model includes a second recurrent neural network layer, the parameters W and b of that layer must also be updated. If the parameters of each network layer in the first intention role labeling model need to be updated, the prediction result is obtained and the loss function is calculated cyclically until the loss function reaches its minimum value. Finally, when the loss function meets a preset convergence condition, parameter updating stops and the trained first intention role labeling model is obtained.
208. Recalculating the model loss value based on the prediction labeling result;
in this embodiment, according to the prediction labeling result corresponding to each task, the corresponding model loss value is recalculated, and whether the model has converged or not is determined according to the size of the model loss value, so as to obtain a corresponding optimized model.
209. If the model loss value reaches a preset loss value, confirming model convergence, and taking the serialized annotation model after the parameter updating as an intention role annotation model obtained by final training;
in this embodiment, if the model loss value reaches the preset loss value, the model has converged, and the first serialized annotation model with updated parameters is taken as the final first intention role annotation model. It should be noted that the parameter updating algorithm may be set according to the actual situation, which is not specifically limited in this application; optionally, the parameters of the first serialization labeling model are updated based on a back propagation algorithm.
The convergence condition means that the loss function reaches a minimum value; specifically, the preset convergence condition may be a preset iteration count or a preset value set according to experience. That is, when the number of iterations reaches the preset count or the loss function reaches the preset value, parameter updating stops and the trained first serialization labeling model is obtained.
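The training loop of steps 204-209 (compute the loss, stop at a preset loss value or iteration cap, otherwise back-propagate and update the parameters) can be sketched with a one-parameter linear model standing in for the serialized labeling model; the learning rate and thresholds are illustrative:

```python
def train(xs, ys, lr=0.1, loss_target=1e-4, max_iters=1000):
    """Gradient-descent loop mirroring steps 204-209: compute the loss,
    stop when it reaches the preset value (or an iteration cap), and
    otherwise update the parameter and re-predict."""
    w = 0.0
    for it in range(max_iters):
        preds = [w * x for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        if loss <= loss_target:          # step 209: model has converged
            return w, loss, it
        # step 206: back-propagated gradient of the mean-squared loss
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        w -= lr * grad                   # parameter update
    return w, loss, max_iters

# Fit y = 2x; the weight converges toward 2 once the loss target is met.
w, loss, iters = train([1, 2, 3], [2, 4, 6])
print(w, loss, iters)
```

In the real model the single weight is replaced by the network-layer parameters and the CRF transition matrix, but the stop-or-update control flow is the same.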
210. Marking the original text data through a preset intention role marking model to obtain coarse grain language segments in the original text data and intention roles corresponding to the coarse grain language segments;
211. classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
212. clustering the coarse-grained corpus respectively to obtain semantic clans corresponding to the coarse-grained corpus under each intention role and naming the semantic clans;
213. receiving a question of a user;
in this embodiment, a user question is received. The user question is a query sentence from the user, such as "Hello, can the survival benefit offset the premium?", "Transferring out prompts transaction failure, what should I do?", or "Tomorrow is my credit card repayment date; what if the repayment fails?".
214. Marking the question of the user through the intention role marking model to obtain a marking language segment and an intention role corresponding to the marking language segment;
in this embodiment, the user question is labeled with the intention role labeling model to obtain the labeled speech segments and their corresponding intention roles. For example, the labeled speech segment "prompt transaction failure" corresponds to the intention role Problem.
215. Vectorizing the tagged corpus to obtain tagged corpus vectors, and calculating second cosine similarity between the tagged corpus vectors and the clustering center of each semantic clan;
in this embodiment, the labeled speech segment is vectorized to obtain a labeled speech segment vector, and the cosine value between this vector and the coarse-grained speech segment vector corresponding to the clustering center of each semantic clan is calculated; the semantic clan to which the labeled segment belongs is then determined according to the size of those cosine values. For example, suppose a batch of data, after clustering, yields a total of 200 semantic clans under the five types of intention roles. The clustering center of each semantic clan corresponds to one coarse-grained speech segment (whose word sense is defined as the concept of the clan); these 200 coarse-grained speech segments are vectorized, and the cosine value between each clustering-center vector and the labeled speech segment vector is calculated. The larger the cosine value, the more similar the two speech segments.
216. And determining the semantic clan to which the labeled language segment belongs based on the second cosine similarity, and determining the real intention of the user based on the semantic clan.
In this embodiment, the semantic clan with the largest cosine value between its clustering-center vector and the labeled speech segment vector is the semantic clan to which the labeled segment belongs. Further, the concept of the semantic clan is determined by the word sense of the coarse-grained speech segment at its clustering center; the meaning of the labeled segment is determined from that concept, and the real intention of the user is thereby determined. Through this step, the meaning of the labeled speech segment can be understood from coarse to fine and the user's problem can be solved.
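A minimal sketch of this assignment step, assuming toy 2-D vectors in place of real segment embeddings and hypothetical clan names:

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_clan(segment_vec, clan_centers):
    """Return the name of the semantic clan whose cluster-center vector
    has the largest cosine similarity to the labeled segment vector."""
    return max(clan_centers,
               key=lambda name: cosine(segment_vec, clan_centers[name]))

# Hypothetical clan names with toy 2-D center vectors.
centers = {"transaction-failure": (0.1, 1.0), "transfer-out": (1.0, 0.1)}
print(nearest_clan((0.2, 0.9), centers))  # → transaction-failure
```

The concept attached to the winning clan (here "transaction-failure") is then taken as the coarse-to-fine interpretation of the user's labeled segment.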
In the embodiment of the invention, the training process of the intention role labeling model is introduced in detail. The collected text corpora are serially labeled to obtain a labeled corpus, which is input into a preset sequence labeling model to obtain the intention role labeling model. Original text data is labeled by the intention role labeling model to obtain coarse-grained speech segments and their corresponding intention roles; each coarse-grained speech segment is classified to its corresponding intention role to obtain a coarse-grained speech segment set per intention role; and the coarse-grained speech segments are clustered to obtain and name the semantic clans under each intention role. The scheme belongs to the field of artificial intelligence and can promote social progress and development; speech segments can be classified by intelligent semantics without labeled data, with high accuracy, the output data can be used directly by downstream tasks, and data classification efficiency is improved.
Referring to fig. 3, a third embodiment of the intelligent semantic classification method of the present invention includes:
301. acquiring original text data from a preset corpus;
302. performing intention role labeling on the original text data through a preset intention role labeling model to obtain an intention role labeling result of the original text data;
in this embodiment, the original text data is labeled by using an intention role labeling model, so as to obtain coarse-grained language segments in the original text data and intention roles corresponding to the coarse-grained language segments.
In this embodiment, labeling means labeling each element in a sequence with an intention role. In general, the sequence is a sentence and the elements are the words in the sentence. For example, the information extraction problem may be regarded as a labeling problem, such as extracting meeting times, meeting places, and the like. Sequence annotation can generally be divided into two categories: raw labeling and joint segmentation and labeling. What is used herein is raw labeling, i.e., each element is labeled as an intention role.
Training a sequence marking model, namely an intention role marking model, by using the marked text, wherein an intention role is marked on other original texts by using the intention role marking model.
In this embodiment, the specific labeling manner is referred to as BIO labeling: each element is labeled "B-X", "I-X", or "O". Wherein "B-X" indicates that the fragment in which the element is located belongs to X type and the element is at the beginning of the fragment, "I-X" indicates that the fragment in which the element is located belongs to X type and the element is in the middle position of the fragment, and "O" indicates that the fragment does not belong to any type. For example, we denote X as a Noun Phrase (Noun Phrase, NP), then three labels of BIO are:
(1) B-NP: beginning of noun phrases
(2) I-NP: middle of noun phrase
(3) O: not noun phrases
X is defined herein as the five different types of intention roles. For example, the BIO labeling result for a section such as "Transferring out prompts transaction failure, what should I do?" is: the characters of "transfer out" are labeled "B-Action" and "I-Action", the characters of "prompt transaction failure" are labeled "B-Problem" followed by "I-Problem", the comma is labeled "O", and the characters of "what should I do" are labeled "B-Question" followed by "I-Question".
303. Determining an intention role corresponding to each character and punctuation marks in the original text data based on the intention role labeling result;
in this embodiment, according to the labeling result of the original text data, the intention role corresponding to each character and punctuation mark in the original text data is determined. For example, for "Transferring out prompts transaction failure, what should I do?", the characters of "transfer out" are labeled "B-Action" and "I-Action", the characters of "prompt transaction failure" are labeled "B-Problem" followed by "I-Problem", the comma is labeled "O", and the characters of "what should I do" are labeled "B-Question" followed by "I-Question".
304. Determining coarse grain language segments in the original text data and the corresponding intention roles of the coarse grain language segments based on each character and the corresponding intention role of the punctuation mark;
in this embodiment, the coarse-grained speech segments contained in the original text data are determined according to the intention role corresponding to each character and punctuation mark, and the intention role of each coarse-grained speech segment is determined from the intention roles of the characters within it. For example, in the labeled question "Transferring out prompts transaction failure, what should I do?", the coarse-grained speech segments are "transfer out", "prompt transaction failure", and "what should I do".
305. Classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
306. and clustering the coarse-grained corpus respectively to obtain semantic clans corresponding to the coarse-grained corpus under each intention role and naming the semantic clans.
The embodiment of the invention provides a detailed process for labeling original text data through an intention role labeling model to obtain coarse grain language segments in the original text data and intention roles corresponding to the coarse grain language segments. The scheme belongs to the field of artificial intelligence, social progress and development can be promoted through the scheme, language segments can be classified according to intelligent semantics without marking data, the accuracy rate is high, output data and results can be directly used for downstream tasks, and the data classification efficiency is high.
Referring to fig. 4, a fourth embodiment of the intelligent semantic classification method of the present invention includes:
401. acquiring original text data from a preset corpus;
402. marking the original text data through a preset intention role marking model to obtain coarse grain language segments in the original text data and intention roles corresponding to the coarse grain language segments;
403. classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
404. vectorizing each coarse-grained speech segment in the coarse-grained speech segment set corresponding to each intention role respectively to obtain a corresponding coarse-grained speech segment vector;
in this embodiment, the coarse-grained speech segments under each intention role classification are obtained. There are five general categories of intention role: Slot, Background, Action, Problem, and Question. In the coarse-grained speech segments generated after labeling the original text data, each segment corresponds to one intention role type, and the coarse-grained speech segments under each intention role are counted separately according to their corresponding intention roles. For each speech segment, the language model is used to obtain a corresponding vector.
In this embodiment, methods such as phrase2vec map the speech segments directly into the vector space; vectorization is performed on the segments directly, mapping each coarse-grained speech segment to a vector.
405. Respectively calculating first cosine similarity between every two coarse-granularity speech segment vectors based on a preset cosine similarity algorithm;
in this embodiment, a cosine similarity algorithm is adopted to calculate the cosine similarity between every two coarse-grained speech segment vectors under each intention role. A vector has a direction attribute in addition to its numerical attributes, and the cosine of the angle between the two vectors is taken as their similarity value. The formula is:

cos(θ) = ( Σ_{i=1}^{n} A_i · B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )

where A_i and B_i are the i-th components of vector A (representing coarse-grained speech segment a) and vector B (representing coarse-grained speech segment b), and n is the dimension of the vectors.
The similarity value obtained by the cosine similarity algorithm ranges from -1 to 1. A cosine value of 1 indicates that the two vectors point in the same direction; the higher the cosine value, the more similar the two speech segments.
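The formula can be checked term by term with a direct implementation; the sample vectors are illustrative:

```python
import math

def cosine_similarity(A, B):
    """cos(θ) = Σ A_i·B_i / (√Σ A_i² · √Σ B_i²), computed term by term."""
    num = sum(a * b for a, b in zip(A, B))
    den = math.sqrt(sum(a * a for a in A)) * math.sqrt(sum(b * b for b in B))
    return num / den

print(cosine_similarity((1, 2, 3), (2, 4, 6)))  # parallel vectors ≈ 1.0
print(cosine_similarity((1, 0), (0, 1)))        # orthogonal vectors ≈ 0.0
```

Parallel vectors score at the top of the [-1, 1] range and orthogonal ones score 0, matching the statement that a higher cosine value means more similar segments.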
406. Based on the first cosine similarity, clustering each coarse-grained corpus under each intention role to obtain a plurality of semantic clans corresponding to each coarse-grained corpus under each intention role;
in this embodiment, the cosine similarity values are obtained by calculation, and the coarse-grained speech segments under each intention role classification are clustered to obtain the multiple semantic clans corresponding to each coarse-grained speech segment set under each intention role. For example, the coarse-grained speech segments "bank card", "savings card", "bound card", "gold card", and "credit card" form one semantic clan; "asthma", "cancer", "infection", and "lupus" form another.
In this embodiment, clustering is a special classification process that divides uncertain sample data, for which prior knowledge is insufficient, into several classes; the division criterion is to place data records with high similarity into the same semantic clan while maximizing the dissimilarity between records in different clans. It is a statistical analysis method for studying classification problems (of samples or indexes). A cluster generated by clustering is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters.
The currently popular clustering algorithms include LPA (label propagation algorithm), minimum entropy, the k-means algorithm, the C-value algorithm, and the like. For example, suppose the coarse-grained speech segments include "identity card", "household register", "bank card", "credit card", and "passbook". With "identity card" and "bank card" set as fixed anchor words, the distance calculation shows that "household register" is close to "identity card", while "credit card" and "passbook" have high similarity to "bank card"; the five coarse-grained speech segments under the intention role Slot are therefore divided into two semantic clans, named "identity card" and "deposit card" respectively.
407. And naming the semantic families respectively, wherein one semantic family comprises a plurality of coarse-grained language segments with similar semantics.
In this embodiment, according to the clustering result, the semantic clans contained in the coarse-grained speech segments under each intention role are obtained, each clan containing several coarse-grained speech segments with similar meanings; for example, the segments "prompt transaction failure", "say the transaction cannot be completed", and "say this transaction cannot be completed" are clustered together. Taking the intention role Action as an example, all coarse-grained speech segments under Action are clustered: the cosine value between every two coarse-grained speech segment vectors is calculated, yielding several coarse-grained speech segment sets (semantic clans), each containing segments of a distinct meaning.
In the embodiment of the present invention, a process of clustering coarse-grained speech segments in each coarse-grained speech segment set under each intention role to obtain a semantic clan in the coarse-grained speech segment set corresponding to each intention role and naming the semantic clan is described in detail. The language segments can be classified according to semantics without marking data, the accuracy is high, the output data can be directly used for downstream tasks, and the data classification efficiency is high.
Referring to fig. 5, a fifth embodiment of the intelligent semantic classification method of the present invention includes:
501. acquiring original text data from a preset corpus;
502. marking the original text data through a preset intention role marking model to obtain coarse grain language segments in the original text data and intention roles corresponding to the coarse grain language segments;
503. classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
504. vectorizing each coarse-grained speech segment in the coarse-grained speech segment set corresponding to each intention role respectively to obtain a corresponding coarse-grained speech segment vector;
505. respectively calculating first cosine similarity between every two coarse-granularity speech segment vectors based on a preset cosine similarity algorithm;
506. setting the clustering number of the coarse-grained corpus to be k under each intention role, and randomly selecting k coarse-grained corpus as an initial clustering center;
in this embodiment, a clustering center is the center of a cluster; clustering divides the input coarse-grained speech segments into different parts of the vector space according to their features.
In this embodiment, clustering refers to the process of dividing a set of physical or abstract objects into several classes composed of similar objects; it is a statistical analysis method for studying classification problems (of samples or indexes). A cluster generated by clustering is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters.
In this embodiment, it is assumed that the coarse-grained speech segments corresponding to each intention role are divided into k semantic clans with different meanings, and k coarse-grained speech segment samples are randomly selected from the data as clustering centers. For example, suppose all the coarse-grained speech segments in one set can be divided into seven semantic clans A, B, C, D, E, F, and G, each representing one of seven concepts, where A through G are the clustering centers of the seven clans.
In this embodiment, the determination of the clustering center is divided into the initial case and the non-initial case. In the initial case, k samples are randomly selected from the sample data as initial clustering centers, expressed as M_p^(1) = (v_i1, v_i2, ..., v_ij), where p = 1, 2, ..., k and k is the number of clusters. After clustering is finished, the real clustering center of each semantic clan is found according to the clustering result.
It should be noted that, if other clustering algorithms are used in this scheme, the n speech segments A_1, A_2, A_3, ..., A_n belonging to semantic clan A can be found, the center V_m of their n vectors calculated directly, and then the vector V_mm nearest to V_m and its corresponding speech segment A_mm found, thereby determining the real clustering center of the semantic clan.
507. Classifying the coarse-grained speech segments in the coarse-grained speech segment set under each intention role to a semantic family group corresponding to each initial clustering center respectively based on the first cosine similarity until the classification of the coarse-grained speech segments is finished;
in this embodiment, according to the first cosine similarity, each coarse-grained speech segment in the coarse-grained speech segment set under each intention role is classified into the semantic group corresponding to each initial clustering center respectively until the classification of the coarse-grained speech segments is completed.
The cosine similarity obtained by the cosine similarity algorithm ranges from -1 to 1. A cosine similarity of 1 indicates that the two vectors point in the same direction; the higher the cosine similarity, the more similar the two speech segments.
508. Determining a real clustering center of each semantic clan group to obtain a plurality of target semantic clans corresponding to each coarse-grained corpus set under each intention role;
in this embodiment, after the coarse-grained speech segments in the coarse-grained speech segment set under each intention role are clustered to obtain corresponding semantic clans, the real clustering center of each semantic clan is determined to obtain a plurality of target semantic clans corresponding to each coarse-grained speech segment set under each intention role.
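Steps 506-508 can be sketched as a k-means variant that uses cosine similarity and reports, per clan, the member segment nearest the mean as the "real" clustering center; the toy 2-D vectors stand in for real segment embeddings:

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def kmeans_cosine(vectors, k, iters=20, seed=0):
    """Step 506: pick k random segments as initial centers; step 507:
    assign every segment to its most-similar center and recompute each
    center as the mean vector; step 508: report the 'real' center, i.e.
    the member segment closest to that mean."""
    rng = random.Random(seed)
    centers = [vectors[i] for i in rng.sample(range(len(vectors)), k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, v in enumerate(vectors):
            best = max(range(k), key=lambda c: cosine(v, centers[c]))
            clusters[best].append(idx)
        for c in range(k):
            if clusters[c]:
                dims = zip(*(vectors[i] for i in clusters[c]))
                centers[c] = tuple(sum(d) / len(clusters[c]) for d in dims)
    # real cluster center: the member vector nearest the mean center
    real = [max(cl, key=lambda i: cosine(vectors[i], centers[c])) if cl else None
            for c, cl in enumerate(clusters)]
    return clusters, real

vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
clusters, real_centers = kmeans_cosine(vecs, k=2)
print(sorted(sorted(c) for c in clusters))  # → [[0, 1], [2, 3]]
```

Because the mean of a clan is generally not itself a segment, the final step snaps the center back onto a member segment, whose word sense then names the clan's concept.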
509. And naming the semantic families respectively, wherein one semantic family comprises a plurality of coarse-grained language segments with similar semantics.
In this embodiment, the obtained multiple target semantic families are named according to the meanings of the coarse-grained language segments in the semantic families, respectively, where one semantic family includes multiple coarse-grained language segments with similar semantics.
On the basis of the previous embodiment, this embodiment adds the detailed process of clustering the coarse-grained speech segments in the coarse-grained speech segment set corresponding to each intention role: all coarse-grained speech segments in the set corresponding to each intention role are obtained and vectorized to produce the corresponding coarse-grained speech segment vectors; the cosine similarity between every two vectors is then calculated, and all coarse-grained speech segments under each intention role are clustered according to that similarity, yielding the semantic clans contained in each coarse-grained speech segment set under each intention role, which are then named. The scheme belongs to the field of artificial intelligence and can promote social progress and development: speech segments can be classified by semantics without labeled data, with high accuracy, and the output data can be used directly by downstream tasks, so data classification efficiency is high.
The intelligent semantic classification method according to the embodiment of the present invention is described above; the intelligent semantic classification device according to the embodiment of the present invention is described below. Referring to fig. 6, the first embodiment of the intelligent semantic classification device according to the present invention includes:
an obtaining module 601, configured to obtain original text data from a preset corpus;
a first labeling module 602, configured to label, by using a preset intention role labeling model, the original text data to obtain coarse-grained language segments in the original text data and intention roles corresponding to the coarse-grained language segments;
a classifying module 603, configured to classify each coarse-grained corpus into a corresponding intention role, respectively, to obtain a coarse-grained corpus corresponding to each intention role;
and a clustering module 604, configured to cluster the coarse-grained speech segment sets respectively to obtain the semantic clans corresponding to the coarse-grained speech segment set under each intention role, and to name the semantic clans.
Optionally, the first labeling module 602 is specifically configured to:
performing intention role labeling on the original text data through a preset intention role labeling model to obtain an intention role labeling result of the original text data;
determining an intention role corresponding to each character and punctuation marks in the original text data based on the intention role labeling result;
and determining the coarse-grained speech segments in the original text data and the intention roles corresponding to the coarse-grained speech segments based on the intention role corresponding to each character and punctuation mark.
Optionally, the clustering module 604 includes:
a processing unit 6041, configured to perform vectorization processing on each coarse-grained corpus in the coarse-grained corpus corresponding to each intended role, respectively, to obtain a corresponding coarse-grained corpus vector;
a calculating unit 6042, configured to calculate, based on a preset cosine similarity algorithm, first cosine similarities between every two coarse-grained corpus vectors respectively;
a clustering unit 6043, configured to cluster, based on the first cosine similarity, each coarse-grained corpus under each intention role to obtain a plurality of semantic clans corresponding to each coarse-grained corpus under each intention role;
a naming unit 6044, configured to name the semantic families, where a semantic family includes a plurality of coarse-grained linguistic segments with similar semantics.
Optionally, the clustering unit 6043 is specifically configured to:
setting the clustering number of the coarse-grained corpus to be k under each intention role, and randomly selecting k coarse-grained corpus as an initial clustering center;
classifying the coarse-grained speech segments in the coarse-grained speech segment set under each intention role to a semantic family group corresponding to each initial clustering center respectively based on the first cosine similarity until the classification of the coarse-grained speech segments is finished;
and determining the real clustering center of each semantic clan group to obtain a plurality of target semantic clans corresponding to each coarse-grained corpus set under each intention role.
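The three steps performed by the clustering unit — choose k random initial centers, assign each segment to the most cosine-similar center, then settle the clan centers — can be sketched as follows (a simplified illustration assuming the segment vectors already exist; the patented scheme may differ in detail):

```python
import random
import numpy as np

def cosine_kmeans(vectors, k, iters=10, seed=0):
    """Cluster segment vectors into k semantic clans: pick k random segments
    as initial centers, assign every segment to the center with the highest
    cosine similarity, recompute each center, and repeat."""
    rng = random.Random(seed)
    vecs = np.asarray(vectors, dtype=float)
    norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors
    centers = norm[rng.sample(range(len(vecs)), k)]            # random init
    for _ in range(iters):
        sims = norm @ centers.T           # cosine similarity to each center
        labels = sims.argmax(axis=1)      # assign to the most similar clan
        for j in range(k):
            members = norm[labels == j]
            if len(members):              # keep old center if clan emptied
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return labels
```

With unit-normalized vectors, maximizing the dot product is equivalent to maximizing cosine similarity, which is why the assignment step reduces to a matrix product.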
In the embodiment of the invention, the original text data is labeled by a preset intention role labeling model to obtain coarse-grained speech segments and the intention roles corresponding to the coarse-grained speech segments; the coarse-grained speech segments are clustered to obtain the semantic clans corresponding to coarse-grained speech segments with similar meanings, and each semantic clan is named. The scheme belongs to the field of artificial intelligence and can promote social progress and development: speech segments can be classified by semantics without labeled data, the output data can be used directly by downstream tasks, and data classification efficiency is improved.
Referring to fig. 7, a second embodiment of the intelligent semantic classifier of the present invention includes:
an obtaining module 701, configured to obtain original text data from a preset corpus;
a first labeling module 702, configured to label the original text data through a preset intention role labeling model to obtain coarse-grained speech segments in the original text data and intention roles corresponding to the coarse-grained speech segments;
a classifying module 703, configured to classify each coarse-grained corpus into a corresponding intention role, respectively, so as to obtain a coarse-grained corpus corresponding to each intention role;
a clustering module 704, configured to cluster the coarse-grained speech segment sets respectively to obtain the semantic clans corresponding to the coarse-grained speech segment set under each intention role, and to name the semantic clans;
a reading module 705, configured to read a text corpus;
the second labeling module 706 is configured to label the text corpus according to a BIO labeling format to obtain a labeled corpus of the text corpus;
a training module 707, configured to input the labeled corpus as a training set into a preset serialization labeling model for training, and output an intention role labeling model;
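For illustration, the BIO labeling format mentioned above tags each character as Beginning, Inside, or Outside a segment, with the intention role attached as a suffix (e.g. "B-X" starts a segment of role X). Decoding such tags back into (segment, role) pairs — the role names "X" and "Y" below are hypothetical — might look like this:

```python
def bio_to_segments(chars, tags):
    """Collect (segment, role) pairs from characters labeled in BIO format,
    where a tag like 'B-X' opens a segment with role 'X', 'I-X' continues
    it, and 'O' marks characters outside any segment."""
    segments, current, role = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:                              # close the previous segment
                segments.append(("".join(current), role))
            current, role = [ch], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(ch)
        else:                                        # 'O' ends any open segment
            if current:
                segments.append(("".join(current), role))
            current, role = [], None
    if current:
        segments.append(("".join(current), role))
    return segments
```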
optionally, the intelligent semantic classification device further includes:
a receiving module 708, configured to receive a question from a user;
a third labeling module 709, configured to label the user question through the intention role labeling model to obtain a labeled sentence segment and an intention role corresponding to the labeled sentence segment;
the processing module 710 is configured to perform vectorization processing on the labeled speech segment to obtain a labeled segment vector, and to calculate the second cosine similarity between the labeled segment vector and the clustering center of each semantic clan;
the determining module 711 is configured to determine, based on the second cosine similarity, the semantic clan to which the labeled speech segment belongs, and to determine the real intention of the user based on the semantic clan.
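The flow of modules 708–711 can be sketched as follows, assuming the labeled segment has already been vectorized and the clustering center vector of each semantic clan is available (the clan names and vectors below are hypothetical):

```python
import numpy as np

def match_semantic_clan(segment_vec, clan_centers):
    """Return the name of the semantic clan whose clustering center has the
    highest cosine similarity with the labeled segment vector, i.e. the
    clan taken to express the user's real intention."""
    v = np.asarray(segment_vec, dtype=float)
    best_name, best_sim = None, -2.0          # any cosine value beats -2
    for name, center in clan_centers.items():
        c = np.asarray(center, dtype=float)
        sim = v @ c / (np.linalg.norm(v) * np.linalg.norm(c))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```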
Optionally, the first labeling module 702 is specifically configured to:
performing intention role labeling on the original text data through a preset intention role labeling model to obtain an intention role labeling result of the original text data;
determining an intention role corresponding to each character and punctuation marks in the original text data based on the intention role labeling result;
and determining coarse grain language segments in the original text data and the corresponding intention roles of the coarse grain language segments based on each character and the corresponding intention role of the punctuation mark.
Optionally, the clustering module 704 includes:
a processing unit 7041, configured to perform vectorization processing on each coarse-grained speech segment in the coarse-grained speech segment set corresponding to each intended role, respectively, to obtain a corresponding coarse-grained speech segment vector;
a calculating unit 7042, configured to calculate, based on a preset cosine similarity algorithm, first cosine similarities between every two coarse-grained speech segment vectors respectively;
a clustering unit 7043, configured to cluster, based on the first cosine similarity, each coarse-grained corpus under each intended role to obtain a plurality of semantic clans corresponding to each coarse-grained corpus under each intended role;
a naming unit 7044, configured to name the semantic families, respectively, where a semantic family includes a plurality of coarse-grained speech segments with similar semantics.
Optionally, the clustering unit 7043 is specifically configured to:
setting the clustering number of the coarse-grained corpus to be k under each intention role, and randomly selecting k coarse-grained corpus as an initial clustering center;
classifying the coarse-grained speech segments in the coarse-grained speech segment set under each intention role to a semantic family group corresponding to each initial clustering center respectively based on the first cosine similarity until the classification of the coarse-grained speech segments is finished;
and determining the real clustering center of each semantic clan group to obtain a plurality of target semantic clans corresponding to each coarse-grained corpus set under each intention role.
Optionally, the training module 707 is specifically configured to:
inputting the labeled corpus into a preset serialization labeling model for pre-training, and performing sequence labeling on the labeled corpus through the serialization labeling model to obtain the prediction labeling results of a plurality of tasks;
calculating a model loss value according to the prediction labeling result;
reversely inputting the model loss value into the serialized annotation model, and judging whether the model loss value reaches a preset loss value or not;
if not, updating the parameters of the serialized annotation model according to the model loss value by adopting a back propagation algorithm;
processing the annotation corpus through a serialized annotation model after parameter updating to obtain the prediction annotation results of a plurality of tasks;
recalculating the model loss value based on the prediction labeling result;
and if the model loss value reaches a preset loss value, confirming model convergence, and taking the serialized annotation model after the parameters are updated as the finally trained intention role annotation model.
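The convergence loop described above — predict, compute the loss, compare it with the preset loss value, back-propagate and update the parameters, repeat — can be illustrated with a toy one-parameter least-squares model; this is only a sketch of the loop structure, not the serialized labeling model itself:

```python
import numpy as np

def train_until_loss_threshold(x, y, lr=0.1, loss_threshold=1e-4, max_steps=10000):
    """Fit y ~ w * x by gradient descent, mirroring the training loop above:
    compute the loss, stop once it reaches the preset loss value, otherwise
    back-propagate the gradient and update the parameter."""
    w = 0.0
    for _ in range(max_steps):
        pred = w * x
        loss = float(np.mean((pred - y) ** 2))
        if loss <= loss_threshold:                     # model has converged
            return w, loss
        grad = float(np.mean(2 * (pred - y) * x))      # gradient of the MSE loss
        w -= lr * grad                                 # parameter update
    return w, loss
```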
In the embodiment of the invention, the original text data is labeled by a preset intention role labeling model to obtain coarse-grained speech segments and the intention roles corresponding to the coarse-grained speech segments, and the coarse-grained speech segments are clustered to obtain the semantic clans corresponding to coarse-grained speech segments with similar meanings; the corresponding semantic clans are defined according to the semantics of the coarse-grained speech segments, and a corresponding concept knowledge base is constructed. The scheme belongs to the field of artificial intelligence and can promote social progress and development: speech segments can be classified by semantics without labeled data, the output data can be used directly by downstream tasks, and data classification efficiency is improved.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain (Blockchain) is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Fig. 6 and fig. 7 describe the intelligent semantic classification device in the embodiment of the present invention in detail from the perspective of the modular functional entity; the following describes the intelligent semantic classification device in the embodiment of the present invention in detail from the perspective of hardware processing.
Fig. 8 is a schematic structural diagram of an intelligent semantic classification device 800 according to an embodiment of the present invention. The intelligent semantic classification device 800 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 810 (e.g., one or more processors), a memory 820, and one or more storage media 830 (e.g., one or more mass storage devices) storing applications 833 or data 832. The memory 820 and the storage medium 830 may be transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the intelligent semantic classification device 800. Still further, the processor 810 may be configured to communicate with the storage medium 830 and execute the series of instruction operations in the storage medium 830 on the intelligent semantic classification device 800.
The intelligent semantic classification device 800 may also include one or more power supplies 840, one or more wired or wireless network interfaces 850, one or more input/output interfaces 860, and/or one or more operating systems 831, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the structure illustrated in FIG. 8 does not constitute a limitation of the intelligent semantic classification device, which may include more or fewer components than those illustrated, combine some components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the intelligent semantic classification method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An intelligent semantic classification method, characterized by comprising the steps of:
acquiring original text data from a preset corpus;
labeling the original text data through a preset intention role labeling model to obtain coarse-grained speech segments in the original text data and intention roles corresponding to the coarse-grained speech segments;
classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
and clustering the coarse-grained corpus respectively to obtain semantic clans corresponding to the coarse-grained corpus under each intention role and naming the semantic clans.
2. The intelligent semantic classification method according to claim 1, wherein before the labeling of the original text data through a preset intention role labeling model to obtain coarse-grained speech segments in the original text data and intention roles corresponding to the coarse-grained speech segments, the method further comprises:
reading text corpora;
labeling the text corpus according to a BIO labeling format to obtain a labeled corpus of the text corpus;
and inputting the labeling linguistic data serving as a training set into a preset serialization labeling model for training, and outputting an intention role labeling model.
3. The intelligent semantic classification method according to claim 2, wherein the input of the labeled corpus into a preset serialization labeling model for training as a training set, and the output of the intention role labeling model comprises:
inputting the labeled corpus into a preset serialization labeling model for pre-training, and performing sequence labeling on the labeled corpus through the serialization labeling model to obtain the prediction labeling results of a plurality of tasks;
calculating a model loss value according to the prediction labeling result;
reversely inputting the model loss value into the serialized annotation model, and judging whether the model loss value reaches a preset loss value or not;
if not, updating the parameters of the serialized annotation model according to the model loss value by adopting a back propagation algorithm;
processing the annotation corpus through a serialized annotation model after parameter updating to obtain the prediction annotation results of a plurality of tasks;
recalculating the model loss value based on the prediction labeling result;
and if the model loss value reaches a preset loss value, confirming model convergence, and taking the serialized annotation model after the parameters are updated as the finally trained intention role annotation model.
4. The intelligent semantic classification method according to claim 1, wherein the labeling of the original text data through a preset intention role labeling model to obtain coarse-grained speech segments in the original text data and intention roles corresponding to the coarse-grained speech segments comprises:
performing intention role labeling on the original text data through a preset intention role labeling model to obtain an intention role labeling result of the original text data;
determining an intention role corresponding to each character and punctuation marks in the original text data based on the intention role labeling result;
and determining the coarse-grained speech segments in the original text data and the intention roles corresponding to the coarse-grained speech segments based on the intention role corresponding to each character and punctuation mark.
5. The intelligent semantic classification method according to claim 1, wherein the clustering the coarse-grained corpus respectively to obtain semantic families corresponding to the coarse-grained corpus under each intended role and naming the semantic families comprises:
vectorizing each coarse-grained speech segment in the coarse-grained speech segment set corresponding to each intention role respectively to obtain a corresponding coarse-grained speech segment vector;
respectively calculating first cosine similarity between every two coarse-granularity speech segment vectors based on a preset cosine similarity algorithm;
based on the first cosine similarity, clustering each coarse-grained corpus under each intention role to obtain a plurality of semantic clans corresponding to each coarse-grained corpus under each intention role;
and naming the semantic families respectively, wherein one semantic family comprises a plurality of coarse-grained language segments with similar semantics.
6. The intelligent semantic classification method according to claim 5, wherein the clustering the coarse-grained corpus of each of the intended roles based on the first cosine similarity to obtain a plurality of semantic families corresponding to the coarse-grained corpus of each of the intended roles comprises:
setting the clustering number of the coarse-grained corpus to be k under each intention role, and randomly selecting k coarse-grained corpus as an initial clustering center;
classifying the coarse-grained speech segments in the coarse-grained speech segment set under each intention role to a semantic family group corresponding to each initial clustering center respectively based on the first cosine similarity until the classification of the coarse-grained speech segments is finished;
and determining the real clustering center of each semantic clan group to obtain a plurality of target semantic clans corresponding to each coarse-grained corpus set under each intention role.
7. The intelligent semantic classification method according to any one of claims 1 to 6, further comprising, after the clustering each coarse-grained corpus respectively to obtain and name a semantic clan corresponding to each coarse-grained corpus in each intended role:
receiving a question of a user;
marking the question of the user through the intention role marking model to obtain a marking language segment and an intention role corresponding to the marking language segment;
vectorizing the tagged corpus to obtain tagged corpus vectors, and calculating second cosine similarity between the tagged corpus vectors and the clustering center of each semantic clan;
and determining the semantic clan to which the labeled language segment belongs based on the second cosine similarity, and determining the real intention of the user based on the semantic clan.
8. An intelligent semantic classifier, comprising:
the acquisition module is used for acquiring original text data from a preset corpus;
the first labeling module is used for labeling the original text data through a preset intention role labeling model to obtain coarse grain language segments in the original text data and intention roles corresponding to the coarse grain language segments;
the classification module is used for classifying the coarse-grained speech segments to corresponding intention roles respectively to obtain a coarse-grained speech segment set corresponding to each intention role;
and the clustering module is used for respectively clustering the coarse-grained corpus sets to obtain semantic clans corresponding to the coarse-grained corpus sets under the intention roles and naming the semantic clans.
9. An intelligent semantic classification device, comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the intelligent semantic classification device to perform the intelligent semantic classification method of any one of claims 1-7.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when being executed by a processor performs the steps of the intelligent semantic categorization method of any of claims 1-7.
CN202010581247.2A 2020-06-23 2020-06-23 Intelligent semantic classification method, device, equipment and storage medium Active CN111723582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010581247.2A CN111723582B (en) 2020-06-23 2020-06-23 Intelligent semantic classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010581247.2A CN111723582B (en) 2020-06-23 2020-06-23 Intelligent semantic classification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111723582A true CN111723582A (en) 2020-09-29
CN111723582B CN111723582B (en) 2023-07-25

Family

ID=72568459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010581247.2A Active CN111723582B (en) 2020-06-23 2020-06-23 Intelligent semantic classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111723582B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000028091A1 (en) * 1998-11-12 2000-05-18 Scios Inc. Systems for the analysis of gene expression data
US8510308B1 (en) * 2009-06-16 2013-08-13 Google Inc. Extracting semantic classes and instances from text
US20150134336A1 (en) * 2007-12-27 2015-05-14 Fluential Llc Robust Information Extraction From Utterances
CN110309400A (en) * 2018-02-07 2019-10-08 鼎复数据科技(北京)有限公司 A kind of method and system that intelligent Understanding user query are intended to

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000028091A1 (en) * 1998-11-12 2000-05-18 Scios Inc. Systems for the analysis of gene expression data
US20150134336A1 (en) * 2007-12-27 2015-05-14 Fluential Llc Robust Information Extraction From Utterances
US8510308B1 (en) * 2009-06-16 2013-08-13 Google Inc. Extracting semantic classes and instances from text
CN110309400A (en) * 2018-02-07 2019-10-08 鼎复数据科技(北京)有限公司 A kind of method and system that intelligent Understanding user query are intended to

Also Published As

Publication number Publication date
CN111723582B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN110163478A (en) A kind of the risk checking method and device of contract terms
CN111221944B (en) Text intention recognition method, device, equipment and storage medium
CN111859983B (en) Natural language labeling method based on artificial intelligence and related equipment
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
WO2020114100A1 (en) Information processing method and apparatus, and computer storage medium
US11573994B2 (en) Encoding entity representations for cross-document coreference
CN112395875A (en) Keyword extraction method, device, terminal and storage medium
CN111325036A (en) Emerging technology prediction-oriented evidence fact extraction method and system
CN111859984B (en) Intention mining method, device, equipment and storage medium
CN113627182A (en) Data matching method and device, computer equipment and storage medium
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN115358817A (en) Intelligent product recommendation method, device, equipment and medium based on social data
Guadie et al. Amharic text summarization for news items posted on social media
CN111723582B (en) Intelligent semantic classification method, device, equipment and storage medium
CN111985217B (en) Keyword extraction method, computing device and readable storage medium
CN114580398A (en) Text information extraction model generation method, text information extraction method and device
Wang et al. Detecting coreferent entities in natural language requirements
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN112836014A (en) Multi-field interdisciplinary-oriented expert selection method
Desai et al. Analysis of Health Care Data Using Natural Language Processing
Olivo et al. CRFPOST: Part-of-Speech Tagger for Filipino Texts using Conditional Random Fields
CN114706927B (en) Data batch labeling method based on artificial intelligence and related equipment
Jo Table based KNN for categorizing words
Justnes Using Word Embeddings to Determine Concepts of Values In Insurance Claim Spreadsheets

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant