CN112084334B - Label classification method and device for corpus, computer equipment and storage medium - Google Patents


Info

Publication number
CN112084334B
CN112084334B
Authority
CN
China
Prior art keywords
text
data
word segmentation
bidirectional encoder
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010922970.2A
Other languages
Chinese (zh)
Other versions
CN112084334A (en)
Inventor
张惠玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd filed Critical Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202010922970.2A priority Critical patent/CN112084334B/en
Publication of CN112084334A publication Critical patent/CN112084334A/en
Application granted granted Critical
Publication of CN112084334B publication Critical patent/CN112084334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application relate to the field of artificial intelligence and provide a label classification method for corpora, comprising the following steps: performing word segmentation on the multiple segments of text data of multiple segments of corpus data to obtain the corresponding word segmentation results; inputting the word segmentation results into a probability model, and analyzing them through modeling by the probability model to obtain a plurality of K values; calculating the perplexity of the plurality of K values, and rounding the K value with the smallest perplexity to obtain the corresponding primary labels; inputting the corresponding word segmentation results into the transformer bidirectional encoder representation (BERT) model corresponding to each primary label, and obtaining the sub-labels under that primary label through the model. In addition, the application relates to blockchain technology: the multiple segments of text data may be stored in a blockchain. The application also provides a label classification device for corpora, a computer device, and a storage medium. The accuracy of corpus label classification is improved.

Description

Label classification method and device for corpus, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method and apparatus for classifying labels of corpus, a computer device, and a storage medium.
Background
Consultation and complaint handling are important customer-service touchpoints and directly influence customers' perception and evaluation of an enterprise's brand and services. Traditionally, after a customer chats with customer service, a human agent distils the chat record, labels it, and enters it into the system, so that subsequent problems can be analysed and resolved. On the one hand, manual labelling standards differ from agent to agent, so complaint points cannot be located accurately. On the other hand, after each chat the agent must spend time distilling and summarising the content before entering it, which is quite inefficient. Among existing algorithmic schemes, once a large number of labels appears the results are unsatisfactory and the labels cannot be classified accurately.
Disclosure of Invention
The embodiments of the present application aim to provide a label classification method and device for corpora, a computer device, and a storage medium, so as to solve the problem that labels cannot be classified accurately during recognition.
In order to solve the above technical problems, the embodiment of the present application provides a label classification method for corpus, which adopts the following technical scheme:
acquiring multiple segments of corpus data, and segmenting the corpus data into multiple segments of text data;
performing word segmentation on each segment of the text data to obtain the corresponding word segmentation results;
inputting the corresponding word segmentation results into a probability model, and analyzing the corresponding word segmentation results through modeling by the probability model to obtain a plurality of K values;
calculating the perplexity corresponding to each of the plurality of K values, selecting the K value with the smallest perplexity as a target K value, and rounding the target K value to obtain the primary labels corresponding to the target K value, wherein the K value represents the number of primary labels;
obtaining a trained transformer bidirectional encoder representation (BERT) model corresponding to the primary label;
inputting the corresponding word segmentation results into the trained transformer bidirectional encoder representation model, and obtaining the sub-labels under the primary label through the trained transformer bidirectional encoder representation model.
Further, the corpus data includes voice data, and the step of segmenting the corpus data into multiple segments of text data specifically includes:
extracting the voice data of the corpus data, and dividing the voice data into user voice data and staff voice data;
converting the user voice data and the staff voice data into a user text and a staff text respectively;
performing sentence breaking on the user text and the staff text respectively, to obtain a sentence-broken user text and a sentence-broken staff text;
and arranging the sentence-broken user text and the sentence-broken staff text according to the order of the texts to obtain multiple segments of text data.
Further, before the step of obtaining the trained transformer bidirectional encoder representation model corresponding to the primary label, the method further includes:
acquiring multiple pieces of training data and the annotation labels corresponding to the training data;
inputting the training data and the corresponding annotation labels into an initial transformer bidirectional encoder representation model;
training the initial transformer bidirectional encoder representation model under multiple sets of neural network model parameters through a training function to obtain multiple transformer bidirectional encoder representation models;
calculating the loss function values of the multiple transformer bidirectional encoder representation models, and taking the model with the smallest loss function value as the target transformer bidirectional encoder representation model;
and deploying the target transformer bidirectional encoder representation model to obtain the trained transformer bidirectional encoder representation model.
Further, the step of calculating the loss function values of the multiple transformer bidirectional encoder representation models and taking the model with the smallest loss function value as the target transformer bidirectional encoder representation model specifically includes:
acquiring a plurality of test samples;
inputting the plurality of test samples into the multiple transformer bidirectional encoder representation models;
and calculating the loss function values of the multiple transformer bidirectional encoder representation models through a loss function, and taking the model with the smallest loss function value as the target transformer bidirectional encoder representation model.
Further, after the step of performing word segmentation on each segment of the text data to obtain the corresponding word segmentation results, the method further includes:
presetting a plurality of domain professional words, and adding the domain professional words into a word stock;
presetting a plurality of stop words, and adding the stop words into the word stock;
optimizing the corresponding word segmentation results through the word stock.
Further, the step of inputting the corresponding word segmentation results into a probability model and analyzing them through modeling by the probability model to obtain a plurality of K values specifically includes:
inputting the corresponding word segmentation results into the probability model, and analyzing them through modeling by the probability model to obtain the topic probability distribution of the word segmentation results;
and performing topic clustering or text classification on the word segmentation results based on the topic probability distribution to obtain the topics of the word segmentation results and the number of those topics, wherein the number of the topics constitutes the plurality of K values.
Further, after the step of acquiring multiple segments of corpus data and segmenting the corpus data into multiple segments of text data, the method further includes:
the multi-segment text data is stored in a blockchain network.
In order to solve the above technical problems, the embodiment of the present application further provides a label classification device for corpus, which adopts the following technical scheme:
a label classification device for corpus comprises an acquisition module, a word segmentation module, an algorithm analysis module, a tuning module, a model acquisition module and a classification module.
The corpus acquisition module is used for acquiring multiple segments of corpus data and segmenting the corpus data into multiple segments of text data;
the word segmentation module is used for performing word segmentation on each segment of the text data to obtain the corresponding word segmentation results;
the algorithm analysis module is used for inputting the corresponding word segmentation results into a probability model, and analyzing them through modeling by the probability model to obtain a plurality of K values;
the tuning module is used for calculating the perplexity corresponding to the plurality of K values, selecting the K value with the smallest perplexity as a target K value, and rounding the target K value to obtain the corresponding primary labels, wherein the K value represents the number of primary labels;
the model acquisition module is used for acquiring a trained transformer bidirectional encoder representation model corresponding to the primary label;
and the classification module is used for inputting the corresponding word segmentation results into the trained transformer bidirectional encoder representation model, and obtaining the sub-labels under the primary label through the trained transformer bidirectional encoder representation model.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device, comprising a processor, a memory, and a network interface that are connected to one another, wherein the memory is configured to store computer readable instructions, and the processor is configured to invoke the computer readable instructions in the memory to perform the steps of the label classification method for corpora described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the label classification method for corpora described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
the present application provides a new label classification method based on a standard corpus. The probability model gives the algorithm good interpretability; when the number of labels is large, the transformer bidirectional encoder representation model is used to further obtain the remaining labels under each primary label, so that a good classification effect is still achieved with multi-level labels, and the combination of the probability model and the transformer bidirectional encoder representation model improves the accuracy of label classification.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2-1 is a flow chart of one embodiment of a label classification method for corpora according to the present application;
FIG. 2-2 is a schematic diagram of the BERT model in a label classification method for corpora according to the present application;
FIG. 3 is a schematic diagram of one embodiment of a label classification device for corpora in accordance with the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the method for classifying the labels of the corpus provided by the embodiment of the application is generally executed by a server/terminal device, and correspondingly, the device for classifying the labels of the corpus is generally arranged in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2-1, a flow chart of one embodiment of a label classification method for corpora according to the present application is shown. The label classification method for corpora comprises the following steps:
Step 201: acquiring multiple segments of corpus data, and segmenting the corpus data into multiple segments of text data.
In this embodiment, the electronic device (for example, the server/terminal device shown in FIG. 1) on which the corpus label classification method runs may receive the user request through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra wideband) connections, and other now known or later developed wireless connection means.
In this embodiment, the corpus data used in the present application may be entirely text data, entirely voice data, or partly voice data plus partly text data. If the corpus data contains voice data, the voice data can be converted into text data through mature speech-to-text technology before use. Because translating call audio into call text in real time after complaint voice data is obtained is too expensive, the call is generally recorded offline as a dual-channel recording file, which reduces both the translation cost and the cost of keeping the recording files.
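As an illustration of handling the dual-channel recording described above, the following minimal sketch splits the two channels of an offline recording so that each speaker's audio can be transcribed separately; it assumes the pydub library and an illustrative file name, neither of which is named by the patent.

    from pydub import AudioSegment

    # Load the dual-channel (stereo) offline recording of the complaint call.
    recording = AudioSegment.from_wav("complaint_call.wav")
    # One channel per speaker: split into two mono segments.
    user_channel, staff_channel = recording.split_to_mono()
    user_channel.export("user.wav", format="wav")
    staff_channel.export("staff.wav", format="wav")
    # Each mono file is then passed to a mature speech-to-text service.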
Step 202: performing word segmentation on each segment of the text data to obtain the corresponding word segmentation results.
In this embodiment, lexical analysis (word frequency analysis, text clustering, etc.) is performed on the text data for topic mining; after segmentation, a piece of text data can be divided into a number of words, yielding the word segmentation results. After the word segmentation results are obtained, keywords can be extracted and the interference of useless words eliminated; a self-built dictionary and a custom stop-word list can be loaded to improve the segmentation effect. For example, the sentence "the apple is delicious" may be segmented into "apple / is / delicious"; of these three tokens, the stop word "is" can be discarded so that only "apple" and "delicious" are input.
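The following minimal sketch performs this step with jieba, the segmenter the embodiment names later for its user-dictionary support; the stop-word set and the sample sentence are illustrative assumptions, not values from the patent.

    import jieba

    # Hypothetical stop-word set; in practice this is the self-built
    # stop-word list mentioned above.
    STOP_WORDS = {"很", "的", "了"}

    def segment(text: str) -> list[str]:
        # Cut the text into tokens, drop whitespace tokens and stop words.
        return [tok for tok in jieba.cut(text)
                if tok.strip() and tok not in STOP_WORDS]

    # "苹果很好吃" -> ['苹果', '好吃'] once the useless word "很" is filtered out.
    print(segment("苹果很好吃"))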
Step 203: inputting the corresponding word segmentation results into a probability model, and analyzing the corresponding word segmentation results through modeling by the probability model to obtain a plurality of K values.
In this embodiment, the K value indicates the number of primary labels. A three-layer Bayesian probability model, Latent Dirichlet Allocation (LDA), performs the topic modeling and outputs the topic corresponding to each segment of chat text data together with the top keywords of that topic. The input of the LDA algorithm is the text set formed by the word segmentation results, an initial topic number, and the hyperparameters alpha and beta (generally left unset, i.e. at their defaults); the output is the text-topic probability distribution and the topic-word distribution, where the topics are the primary labels. LDA plays a very important role among topic models and is often used for text classification. Proposed in 2003, LDA infers the topic distribution of the word segmentation results: it gives the topic of each word in the segmentation results in the form of a probability distribution, so that after analyzing some segmentation results and extracting their topic distribution, topic clustering or text classification can be performed according to that distribution. The hyperparameters alpha and beta generally take values between 0 and 1. The text-topic probability distribution and the topic-word distribution are then converted into a classification result. The specific generative process is as follows: select a word segmentation result d_i according to the prior probability p(d_i); sample the topic distribution theta_i of d_i from the Dirichlet distribution with hyperparameter alpha; sample the topic z_{i,j} of the j-th word of d_i from the multinomial topic distribution theta_i, which yields the topics of the words and hence the number of topics, i.e. the topic number K; sample the word distribution phi_{z_{i,j}} corresponding to topic z_{i,j} from the Dirichlet distribution with parameter beta; finally sample each word from the multinomial word distribution phi_{z_{i,j}}, and repeat this training process.
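To make the LDA modelling step concrete, the following sketch uses gensim's LdaModel, which implements the variational Bayes training described above; the toy corpus, the candidate topic number, and leaving alpha/eta at their defaults are illustrative assumptions.

    from gensim import corpora
    from gensim.models import LdaModel

    # Toy word-segmentation results (one token list per text segment).
    segmented_docs = [["claim", "amount", "slow"],
                      ["claim", "contact", "no", "reply"],
                      ["policy", "renewal", "price"]]
    dictionary = corpora.Dictionary(segmented_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in segmented_docs]

    # alpha and eta are the Dirichlet hyperparameters (the text's alpha and
    # beta); they are left at gensim's defaults, as the embodiment suggests.
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3, passes=10)

    for bow in bow_corpus:
        print(lda.get_document_topics(bow))  # text-topic probability distribution
    print(lda.show_topics())                 # topic-word distributions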
Step 204: calculating the perplexity corresponding to the plurality of K values, selecting the K value with the smallest perplexity as the target K value, and rounding the target K value to obtain the primary labels corresponding to the target K value, wherein the K value represents the number of primary labels.
In this embodiment, business experience is combined with the perplexity at which the model performs best to determine the optimal K value, the corpus is classified into K primary labels, and each customer's complaint is output against the corresponding primary label that the business cares about, predicting the primary label among the multiple labels; based on the lexical analysis, the primary-label classification effect is good. The optimal K value therefore better matches the actual business model: for example, if the K value computed under certain parameters is 10 but this does not fit the actual business, it can be rejected. The perplexity generally evaluated is word perplexity. Concretely, given a sentence S = w_1, w_2, ..., w_N, its probability can be expressed as P(S) = P(w_1) * P(w_2 | w_1) * ... * P(w_N | w_1, w_2, ..., w_{N-1}), and the perplexity is obtained as PP(S) = P(w_1, w_2, ..., w_N)^(-1/N). In information theory, perplexity measures how well a probability distribution or probability model predicts a sample; it can also be used to compare two probability distributions or probability models (comparing how well each predicts the samples). A probability distribution or model with low perplexity predicts the samples better: the smaller the perplexity, the larger the sentence probability and the better the language model. For example, if 10.9, 6.9, 8.9 and 4.9 are the K values calculated by the LDA model, the perplexity of each K value is calculated with the formula above, and the K value with the smallest perplexity is selected as the target K value; if the perplexity corresponding to the K value 4.9 is the smallest, 4.9 is selected as the target K value. Because the number of labels cannot be a decimal, 4.9 is rounded, and the number of labels may be adjusted to 3, 4, 5 or 6 according to the actual business situation, by measuring the data under 3, 4, 5 and 6 labels and selecting the better solution.
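Continuing the gensim sketch above, perplexity-based selection of the target K might look as follows; gensim's log_perplexity returns a per-word likelihood bound, and exp(-bound) gives the conventional perplexity, where smaller is better. The candidate values here are illustrative.

    import math
    from gensim.models import LdaModel

    def perplexity(model: LdaModel, corpus) -> float:
        # log_perplexity returns a per-word bound; exp(-bound) is the perplexity.
        return math.exp(-model.log_perplexity(corpus))

    candidates = {k: LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=k)
                  for k in (3, 4, 5, 6)}
    target_k = min(candidates, key=lambda k: perplexity(candidates[k], bow_corpus))
    print("number of primary labels:", target_k)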
Step 205: acquiring a trained transformer bidirectional encoder representation model corresponding to the primary label.
In this embodiment, a schematic diagram of the transformer bidirectional encoder representation model (Bidirectional Encoder Representations from Transformers, BERT) is shown in FIG. 2-2. First, the representation of each token is obtained from the initial BERT, where "[CLS]" is a special symbol the model prepends to the sentence. After the sentence is input, the encoding (E1 ... En) of each token is obtained through BERT's multi-layer network, and finally the [CLS] position obtains the complete information of the sentence (T1 ... Tn); that is, the information of the sentence is captured and encoded inside the neural network. The encoding of [CLS] is then passed through a fully connected layer, a sigmoid activation function yields the probability of each class, and the category of the text is obtained according to a preset threshold. A binary cross-entropy loss function (binary cross entropy loss) is used in model training. Each primary label corresponds to one trained BERT model, and the sub-labels of each segment's word segmentation result are obtained through the trained BERT model, completing the multi-level classification, so that the algorithm achieves both a good effect and interpretability.
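A minimal sketch of such a classifier head follows, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (both assumptions; the patent names no library): the [CLS] encoding feeds a fully connected layer, sigmoid yields per-label probabilities, and a preset threshold selects the sub-labels.

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")
    num_sub_labels = 8  # illustrative sub-label count
    fc = torch.nn.Linear(bert.config.hidden_size, num_sub_labels)

    inputs = tokenizer("案件定损金额半个月未决", return_tensors="pt")
    with torch.no_grad():
        cls_encoding = bert(**inputs).last_hidden_state[:, 0]  # the [CLS] token
        probs = torch.sigmoid(fc(cls_encoding))                # per-label probability
    sub_labels = (probs > 0.5).nonzero()  # preset threshold of 0.5 (assumed)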
Step 206: inputting the corresponding word segmentation results into the trained transformer bidirectional encoder representation model, and obtaining the sub-labels under the primary label through the trained model.
In this embodiment, a new label classification method based on a standard corpus is provided, in which the n most probable labels are selected from a plurality of subdivision labels. For example, for the customer complaint "the loss amount of the case has not been given for half a month, and there is still no result after repeated contacts", the secondary label "loss assessment slow#" is obtained under the primary label "service performance quality poor#". In this way multiple labels are handled, and multi-level labels are processed by combining the pre-trained BERT model with the LDA algorithm, so that the algorithm has a certain interpretability and a good classification effect.
In some optional implementations, the corpus data includes speech data, and the step of segmenting the corpus data into multiple segments of text data specifically includes:
extracting voice data of corpus data, and dividing the voice data into user voice data and staff voice data;
converting the user voice data and the staff voice data into a user text and a staff text respectively;
performing sentence breaking on the user text and the staff text respectively, to obtain a sentence-broken user text and a sentence-broken staff text;
and arranging the sentence-broken user text and the sentence-broken staff text according to the order of the texts to obtain multiple segments of text data.
The corpus data used in the present application may be entirely text data, entirely voice data, or partly voice data plus partly text data. If the corpus data contains voice data, the voice data is first divided into user voice data and staff voice data according to the parties on the call, and the voice data can then be converted into text data (user text and staff text) through mature speech-to-text technology.
In the above embodiment, sentence breaking may be performed by manual annotation, by a neural network model, or by a word segmentation method; the dialogue is restored mainly by judging the contextual order of the sentence-broken staff and user sentences, or restored by the BERT model. For example, if the user speaks first, the dialogue content may be restored by arranging the user's first sentence, the staff member's first sentence, the user's second sentence, the staff member's second sentence, ..., the user's n-th sentence, and the staff member's n-th sentence, as the sketch after this passage illustrates.
When the corpus data is partly voice data and partly text data, the text data in the corpus data can be divided according to the corpus objects, directly separating it into user text and staff text, which are then merged with the text data (user text and staff text) converted from the voice data to form the complete user text and staff text. Sentence breaking is finally performed on the complete user and staff texts, which are arranged according to the order of the texts to obtain multiple segments of text data. In other embodiments of the present application, when the corpus data is entirely text data, the text data is divided according to the corpus objects directly into user text and staff text, sentences are broken, and the texts are arranged in order to obtain multiple segments of text data.
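A sketch of the dialogue-restoration step under the alternating-turn assumption described above (real recordings would normally carry channel timestamps instead):

    from itertools import chain, zip_longest

    def restore_dialogue(user_sents: list[str], staff_sents: list[str]) -> list[str]:
        # user sentence 1, staff sentence 1, user sentence 2, staff sentence 2, ...
        interleaved = chain.from_iterable(zip_longest(user_sents, staff_sents))
        return [s for s in interleaved if s is not None]

    segments = restore_dialogue(
        ["The claim has not been settled for half a month.", "Nobody called back."],
        ["Let me check the case for you."])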
In some optional implementations, before the step of obtaining the trained transformer bidirectional encoder representation model corresponding to the primary label, the method further includes:
acquiring multiple pieces of training data and the annotation labels corresponding to the training data;
inputting the training data and the corresponding annotation labels into the initial transformer bidirectional encoder representation model;
training the initial transformer bidirectional encoder representation model under multiple sets of neural network model parameters through a training function to obtain multiple transformer bidirectional encoder representation models;
calculating the loss function values of the multiple transformer bidirectional encoder representation models, and taking the model with the smallest loss function value as the target transformer bidirectional encoder representation model;
and deploying the target transformer bidirectional encoder representation model to obtain the trained transformer bidirectional encoder representation model.
In the above embodiment, the training function is f_i^n = σ(W_k^n · f_i^{n-1} + b_k^n), where W_k^n denotes the weights obtained by training the k-th neuron in the n-th layer of the multi-layer perceptron of the target transformer bidirectional encoder representation model, b_k^n denotes the bias corresponding to W_k^n, f_i^n is the output of the n-th layer of the target model after the i-th piece of training data is input, f_i^{n-1} is the output of the (n-1)-th layer for the same input, i and k are arbitrary positive integers, and n is a natural number. The training data are the word segmentation results obtained after segmenting multiple pieces of text data, and the training labels are the labels corresponding to each piece of text data. The model is trained in this way, and the trained model is deployed on the client or the server and provided to users.
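A sketch of this training stage in PyTorch, fine-tuning one model per candidate parameter set with binary cross-entropy as the embodiment describes; the optimizer, learning rates, and data loader are illustrative assumptions.

    import torch

    def train_one(bert, fc, loader, lr: float, epochs: int):
        # One fine-tuning run under one set of neural-network parameters.
        opt = torch.optim.Adam(list(bert.parameters()) + list(fc.parameters()), lr=lr)
        loss_fn = torch.nn.BCEWithLogitsLoss()  # binary cross entropy
        for _ in range(epochs):
            for input_ids, attention_mask, labels in loader:
                cls = bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
                loss = loss_fn(fc(cls), labels.float())
                opt.zero_grad()
                loss.backward()
                opt.step()
        return bert, fc

    # Several candidate models, e.g. one per learning rate (fresh copies of
    # bert and fc would be created for each run in practice):
    # models = [train_one(bert, fc, loader, lr, epochs=3) for lr in (1e-5, 3e-5, 5e-5)]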
In some optional implementations, the step of calculating the loss function values of the multiple transformer bidirectional encoder representation models and taking the model with the smallest loss function value as the target transformer bidirectional encoder representation model specifically includes:
acquiring a plurality of test samples;
inputting the plurality of test samples into the multiple transformer bidirectional encoder representation models;
and calculating the loss function values of the multiple transformer bidirectional encoder representation models through a loss function, and taking the model with the smallest loss function value as the target transformer bidirectional encoder representation model.
In the above embodiment, the loss function is H_p(q) = -(1/N) * sum_{i=1}^{N} [ y_i * log(p(y_i)) + (1 - y_i) * log(1 - p(y_i)) ], where N is the number of test samples, y_i is the label of the i-th test sample, p is the output value produced when the sample is input to the target transformer bidirectional encoder representation model, and q indexes the q-th target model. The error of a model can be estimated through the value of the loss function: the smaller H_p(q), the smaller the error; the larger H_p(q), the larger the error. By evaluating H_p(q), the quality of the BERT model under multiple parameter settings can be assessed, and the optimal model is then selected for the user.
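A sketch of the selection step: the binary cross-entropy H_p(q) above is evaluated on the test samples for each trained model and the model with the smallest value wins. The predicted probabilities and labels below are illustrative stand-ins for real model outputs.

    import torch

    def bce(probs: torch.Tensor, labels: torch.Tensor) -> float:
        # H_p(q) = -(1/N) * sum(y_i * log p + (1 - y_i) * log(1 - p))
        return torch.nn.functional.binary_cross_entropy(probs, labels.float()).item()

    labels = torch.tensor([1.0, 0.0, 1.0])
    outputs = {"model_a": torch.tensor([0.9, 0.2, 0.7]),
               "model_b": torch.tensor([0.6, 0.5, 0.4])}
    losses = {name: bce(p, labels) for name, p in outputs.items()}
    target = min(losses, key=losses.get)  # model with the smallest loss value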
In some optional implementations, after the step of performing word segmentation on each segment of the text data to obtain the corresponding word segmentation results, the method further includes:
presetting a plurality of domain professional words, and adding the domain professional words into a word stock;
presetting a plurality of stop words, and adding the stop words into the word stock;
optimizing the corresponding word segmentation results through the word stock.
In the above embodiment, Python's jieba package supports loading user-defined dictionaries and stop-word lists. The user-defined dictionary is built from the terms identified in domain lexical analysis, and the stop-word list is built from a general-purpose stop-word list plus words whose frequency in lexical analysis is high but which are useless. Multi-label cases are common: a film can belong to the action and crime categories at the same time, a news item can belong to politics and law at the same time, and similar problems arise in gene function prediction, scene recognition and disease diagnosis in biology. This differs from traditional single-label classification, in which each sample has only one associated label. Take the word "PCB" as an example: it is a printed circuit board in the industrial field, a process control block in the computer field, and polychlorinated biphenyl in the chemical field, so it can be put into different professional word stocks according to the field actually required, preventing differences between fields from distorting the word segmentation. The same holds for stop words.
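A sketch of the word-stock optimisation with jieba's documented user-dictionary and stop-list support; the file names and contents are illustrative assumptions.

    import jieba

    # domain_words.txt holds one domain term per line, e.g. "PCB", with an
    # optional frequency and part-of-speech tag, as jieba's format allows.
    jieba.load_userdict("domain_words.txt")

    with open("stop_words.txt", encoding="utf-8") as f:
        stop_words = {line.strip() for line in f if line.strip()}

    def optimized_segment(text: str) -> list[str]:
        return [t for t in jieba.cut(text) if t.strip() and t not in stop_words]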
In some optional implementations, the step of inputting the corresponding word segmentation results into a probability model and analyzing them through modeling by the probability model to obtain a plurality of K values specifically includes:
inputting the corresponding word segmentation results into the probability model, and analyzing them through modeling by the probability model to obtain the topic probability distribution of the word segmentation results;
and performing topic clustering or text classification on the word segmentation results based on the topic probability distribution to obtain the topics of the word segmentation results and the number of those topics, wherein the number of the topics constitutes the plurality of K values.
In the above embodiment, LDA defines the following generation process for each document in the corpus: extract a topic from the topic distribution of each word segmentation result; extract a word from the word distribution corresponding to the extracted topic; repeat until every word in the document has been traversed. Each document corresponds to a multinomial distribution over T topics, denoted theta, and each topic in turn corresponds to a multinomial distribution over the V words of the vocabulary, denoted phi. The LDA process takes as input the dimensions of the global parameters beta (the word-topic probability matrix) and alpha (the Dirichlet allocation parameter) and the label number k; the transformed global parameters beta, alpha and K are then used as input for variational training (see the Dirichlet-distribution details in step 203); finally, the LDA incremental topic-training algorithm (see the description of the topic number K in step 203) is called in a loop, and the loop stops when the likelihood function value converges, completing the incremental training of K.
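A sketch of the incremental training loop under the convergence rule described above, reusing the gensim objects from the earlier sketches; the tolerance and round limit are illustrative assumptions.

    from gensim.models import LdaModel

    def train_until_converged(k: int, tol: float = 1e-3, max_rounds: int = 20) -> LdaModel:
        model = LdaModel(id2word=dictionary, num_topics=k)  # untrained model
        prev_bound = float("-inf")
        for _ in range(max_rounds):
            model.update(bow_corpus)                   # one incremental variational pass
            bound = model.log_perplexity(bow_corpus)   # likelihood bound on the corpus
            if abs(bound - prev_bound) < tol:          # stop once the bound converges
                break
            prev_bound = bound
        return model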
In some optional implementations, after the step of acquiring multiple segments of corpus data and segmenting the corpus data into multiple segments of text data, the method further includes:
the multi-segment text data is stored in a blockchain network.
In the above embodiment, the information stored in the blockchain is more secure.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by computer readable instructions stored in a computer readable storage medium that, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to FIG. 3, as an implementation of the method shown in FIG. 2-1, the present application provides an embodiment of a label classification device for corpora; this device embodiment corresponds to the method embodiment shown in FIG. 2-1, and the device may be applied to various electronic devices.
As shown in FIG. 3, the label classification device 300 for corpora according to the present embodiment includes: a corpus acquisition module 301, a word segmentation module 302, an algorithm analysis module 303, a tuning module 304, a model acquisition module 305 and a classification module 306. Wherein:
The corpus acquisition module 301 is configured to acquire multiple segments of corpus data and segment the corpus data into multiple segments of text data;
the word segmentation module 302 is configured to perform word segmentation on each segment of the text data to obtain the corresponding word segmentation results;
the algorithm analysis module 303 is configured to input the corresponding word segmentation results into a probability model, and analyze them through modeling by the probability model to obtain a plurality of K values;
the tuning module 304 is configured to calculate the perplexity corresponding to the plurality of K values, select the K value with the smallest perplexity as a target K value, and round the target K value to obtain the primary labels corresponding to the target K value, wherein the K value represents the number of primary labels;
the model acquisition module 305 is configured to acquire a trained transformer bidirectional encoder representation model corresponding to the primary label;
the classification module 306 is configured to input the corresponding word segmentation results into the trained transformer bidirectional encoder representation model, and obtain the sub-labels under the primary label through the trained transformer bidirectional encoder representation model.
In some optional implementations of this embodiment, the corpus acquisition module includes a segmentation sub-module, a text conversion sub-module, a sentence-breaking sub-module, and a restoration sub-module.
The segmentation sub-module is used for extracting the voice data of the corpus data and dividing the voice data into user voice data and staff voice data;
the text conversion sub-module is used for converting the user voice data and the staff voice data into user text and staff text respectively;
the sentence-breaking sub-module is used for performing sentence breaking on the user text and the staff text respectively to obtain a sentence-broken user text and a sentence-broken staff text;
and the restoration sub-module is used for arranging the sentence-broken user text and the sentence-broken staff text according to the order of the texts to obtain multiple segments of text data.
In some optional implementations of this embodiment, the apparatus 300 further includes a training module, which comprises a training data acquisition unit, a training data input unit, a training unit, a calculation unit and a deployment unit.
The training data acquisition unit is used for acquiring multiple pieces of training data and the annotation labels corresponding to the training data;
the training data input unit is used for inputting the training data and the corresponding annotation labels into the initial transformer bidirectional encoder representation model;
the training unit is used for training the initial transformer bidirectional encoder representation model under multiple sets of neural network model parameters through a training function to obtain multiple transformer bidirectional encoder representation models;
the calculation unit is used for calculating the loss function values of the multiple transformer bidirectional encoder representation models, and taking the model with the smallest loss function value as the target transformer bidirectional encoder representation model;
and the deployment unit is used for deploying the target transformer bidirectional encoder representation model to obtain the trained transformer bidirectional encoder representation model.
In some optional implementations of this embodiment, the apparatus 300 further includes a test module, which comprises a test data acquisition unit, a test data input unit and a test function calculation unit.
The test data acquisition unit is used for acquiring a plurality of test samples;
the test data input unit is used for inputting the plurality of test samples into the multiple transformer bidirectional encoder representation models;
and the test function calculation unit is used for calculating the loss function values of the multiple transformer bidirectional encoder representation models through a loss function, and taking the model with the smallest loss function value as the target transformer bidirectional encoder representation model.
In some optional implementations of this embodiment, the apparatus 300 further includes a word stock module, which comprises a professional word presetting unit, a stop word presetting unit and an optimization unit.
The professional word presetting unit is used for presetting a plurality of domain professional words and adding the domain professional words into a word stock;
the stop word presetting unit is used for presetting a plurality of stop words and adding the stop words into the word stock; and the optimization unit is used for optimizing the corresponding word segmentation results through the word stock.
Further, the algorithm analysis module further comprises a theme distribution calculation unit and a K value calculation unit.
The topic distribution calculation unit is used for inputting the corresponding word segmentation results into a probability model, and analyzing them through modeling by the probability model to obtain the topic probability distribution of the word segmentation results;
the K value calculation unit is used for performing topic clustering or text classification on the word segmentation results based on the topic probability distribution to obtain the topics of the word segmentation results and the number of those topics, wherein the number of the topics constitutes the plurality of K values.
Further, the label classification device for corpora further comprises a storage unit.
The storage unit is used for storing the multiple segments of text data in a blockchain network.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43, communicatively connected to each other via a system bus. It should be noted that only the computer device 4 with components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), embedded devices, etc.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is generally used to store the operating system installed on the computer device 4 and various kinds of application software, such as the computer readable instructions of the label classification method for corpora. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, such as computer readable instructions for executing a label classification method of the corpus.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of a label classification method of corpus as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some embodiments of the present application, but not all embodiments, and the preferred embodiments of the present application are shown in the drawings, which do not limit the scope of the patent claims. This application may be embodied in many different forms, but rather, embodiments are provided in order to provide a thorough and complete understanding of the present disclosure. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing description, or equivalents may be substituted for elements thereof. All equivalent structures made by the content of the specification and the drawings of the application are directly or indirectly applied to other related technical fields, and are also within the scope of the application.

Claims (7)

1. A label classification method for corpora, characterized by comprising the following steps:
acquiring multiple segments of corpus data, and segmenting the corpus data into multiple segments of text data;
performing word segmentation on each segment of the text data to obtain the corresponding word segmentation results;
inputting the corresponding word segmentation results into a probability model, and analyzing the corresponding word segmentation results through modeling by the probability model to obtain a plurality of K values;
calculating the perplexity corresponding to each of the plurality of K values, selecting the K value with the smallest perplexity as a target K value, and rounding the target K value to obtain the primary labels corresponding to the target K value, wherein the K value represents the number of primary labels;
obtaining a trained transformer bidirectional encoder representation model corresponding to the primary label;
inputting the corresponding word segmentation results into the trained transformer bidirectional encoder representation model, and obtaining the sub-labels under the primary label through the trained transformer bidirectional encoder representation model;
the step of segmenting the corpus data into a plurality of segments of text data specifically comprises the steps of:
extracting the voice data of the corpus data, and dividing the voice data into user voice data and staff voice data;
converting the user voice data and the staff voice data into a user text and a staff text respectively;
performing sentence breaking on the user text and the staff text respectively, to obtain a sentence-broken user text and a sentence-broken staff text;
arranging the sentence-broken user text and the sentence-broken staff text according to the order of the texts to obtain multiple segments of text data;
before the step of obtaining the trained transformer bidirectional encoder representation model corresponding to the primary label, the method further comprises the following steps:
acquiring multiple pieces of training data and the annotation labels corresponding to the training data;
inputting the training data and the corresponding annotation labels into an initial transformer bidirectional encoder representation model;
training the initial transformer bidirectional encoder representation model under multiple sets of neural network model parameters through a training function to obtain multiple transformer bidirectional encoder representation models;
calculating the loss function values of the multiple transformer bidirectional encoder representation models, and taking the model with the smallest loss function value as a target transformer bidirectional encoder representation model;
deploying the target transformer bidirectional encoder representation model to obtain the trained transformer bidirectional encoder representation model;
the step of inputting the corresponding multi-segment word segmentation results into a probability model and analyzing the corresponding multi-segment word segmentation results through modeling by the probability model to obtain a plurality of K values specifically comprises the following steps:
inputting the corresponding multi-segment word segmentation results into the probability model, and analyzing the corresponding multi-segment word segmentation results through modeling by the probability model to obtain a topic probability distribution of the multi-segment word segmentation results;
performing topic clustering or text classification on the multi-segment word segmentation results based on the topic probability distribution to obtain a plurality of topics of the multi-segment word segmentation results and the numbers of topics, wherein the numbers of the plurality of topics are the plurality of K values.
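For implementers of claim 1: in this family of methods the probability model is typically a latent Dirichlet allocation (LDA) topic model, and the perplexity criterion is the standard held-out perplexity of such models. Below is a minimal sketch under those assumptions, using gensim; all function and variable names are illustrative and not taken from the patent.

# Minimal sketch of selecting K (number of primary labels) by perplexity,
# assuming the probability model is LDA. Names are illustrative.
from gensim import corpora
from gensim.models import LdaModel

def select_k_by_perplexity(segmented_docs, candidate_ks=range(2, 11)):
    """Fit one LDA model per candidate K and return the K whose model
    attains the lowest perplexity on the corpus."""
    dictionary = corpora.Dictionary(segmented_docs)
    bow = [dictionary.doc2bow(doc) for doc in segmented_docs]
    best_k, best_ppl = None, float("inf")
    for k in candidate_ks:
        lda = LdaModel(bow, num_topics=k, id2word=dictionary,
                       passes=5, random_state=0)
        # gensim returns a per-word likelihood bound; perplexity = 2^(-bound)
        ppl = 2 ** (-lda.log_perplexity(bow))
        if ppl < best_ppl:
            best_k, best_ppl = k, ppl
    return best_k  # interpreted as the number of primary labels

# Usage: each document is one word segmentation result (a token list).
docs = [["理赔", "进度", "查询"], ["车险", "报案", "流程"], ["理赔", "材料", "提交"]]
print(select_k_by_perplexity(docs, candidate_ks=range(2, 4)))

Fitting and scoring on the same corpus keeps the sketch short; a held-out split would give a less biased perplexity estimate for choosing K.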
2. The label classification method of a corpus according to claim 1, wherein the step of calculating the loss function values of the plurality of deformed bidirectional encoder representation models and taking the deformed bidirectional encoder representation model with the smallest loss function value as the target deformed bidirectional encoder representation model specifically comprises:
acquiring a plurality of test samples;
inputting the plurality of test samples into the plurality of deformed bidirectional encoder representation models;
calculating, through a loss function, the loss function values of the plurality of deformed bidirectional encoder representation models, and taking the deformed bidirectional encoder representation model with the smallest loss function value as the target deformed bidirectional encoder representation model.
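A minimal sketch of the selection step recited in claim 2, assuming PyTorch-style classifier models and a cross-entropy loss; the loss choice, data loader, and all names are assumptions for illustration, not prescribed by the patent.

import torch
import torch.nn.functional as F

def select_target_model(models, test_loader, device="cpu"):
    """Evaluate each candidate model on the same test samples and return
    the one with the smallest average loss (here: cross-entropy)."""
    best_model, best_loss = None, float("inf")
    for model in models:
        model = model.to(device).eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for inputs, labels in test_loader:
                logits = model(inputs.to(device))
                total += F.cross_entropy(logits, labels.to(device),
                                         reduction="sum").item()
                count += labels.size(0)
        avg = total / count
        if avg < best_loss:
            best_model, best_loss = model, avg
    return best_model  # the "target" model to deploy

Evaluating every candidate on the same fixed test samples keeps the comparison fair across the differently parameterized models.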
3. The label classification method of a corpus according to claim 1, wherein after the step of performing word segmentation on each segment of the text data to obtain corresponding multi-segment word segmentation results, the method further comprises:
presetting a plurality of domain-specific terms, and adding the domain-specific terms to a word stock;
presetting a plurality of useless words, and adding the useless words to the word stock;
optimizing the corresponding multi-segment word segmentation results through the word stock.
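A minimal sketch of the word-stock optimization recited in claim 3, assuming the jieba segmenter; the domain-specific terms and useless words shown are hypothetical examples.

import jieba

DOMAIN_TERMS = ["车险理赔", "保单贷款"]    # hypothetical domain-specific terms
USELESS_WORDS = {"的", "了", "嗯", "啊"}   # hypothetical useless (stop) words

for term in DOMAIN_TERMS:
    jieba.add_word(term)  # keep each domain term as a single token

def segment(text):
    """Word-segment the text, then drop useless words from the result."""
    return [tok for tok in jieba.lcut(text) if tok not in USELESS_WORDS]

# Usage on one segment of text data (output is approximate):
print(segment("嗯我想问一下车险理赔的进度"))

Adding domain terms before segmentation prevents the segmenter from splitting them, and filtering useless words keeps the downstream topic model focused on content words.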
4. The label classification method of a corpus according to claim 1, wherein after the step of acquiring a plurality of segments of corpus data and segmenting the corpus data into multiple segments of text data, the method further comprises:
storing the multi-segment text data in a blockchain network.
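Claim 4 leaves the blockchain mechanism unspecified; the sketch below only illustrates the tamper-evidence property that such storage provides, with hypothetical names: each block's hash covers its text segments and chains to the previous block's hash.

import hashlib
import json
import time

def make_block(text_segments, prev_hash):
    """Package text segments into a block whose SHA-256 hash covers the
    payload and chains to the previous block, making stored corpus data
    tamper-evident."""
    payload = {"timestamp": time.time(),
               "segments": text_segments,
               "prev_hash": prev_hash}
    digest = hashlib.sha256(
        json.dumps(payload, ensure_ascii=False, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {**payload, "hash": digest}

# Usage: chain two blocks of text data.
genesis = make_block(["用户: 你好", "坐席: 您好，请问有什么可以帮您"], prev_hash="0" * 64)
block_2 = make_block(["用户: 我想查理赔进度"], prev_hash=genesis["hash"])

In the claimed arrangement a real blockchain network, rather than this local chain, would replicate the blocks across nodes.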
5. A label classification device for a corpus, characterized by comprising:
a corpus acquisition module, used for acquiring a plurality of segments of corpus data and segmenting the corpus data into multiple segments of text data;
a word segmentation module, used for performing word segmentation on each segment of the text data to obtain corresponding multi-segment word segmentation results;
an algorithm analysis module, used for inputting the corresponding multi-segment word segmentation results into a probability model and analyzing the corresponding multi-segment word segmentation results through modeling by the probability model to obtain a plurality of K values;
an optimizing module, used for calculating the perplexity corresponding to each of the plurality of K values, selecting the K value with the minimum perplexity as a target K value, and rounding the target K value to obtain the primary labels corresponding to the target K value, wherein the K value is used for representing the number of the primary labels;
a model acquisition module, used for acquiring a trained deformed bidirectional encoder representation model corresponding to the primary label;
a classification module, used for inputting the corresponding multi-segment word segmentation results into the trained deformed bidirectional encoder representation model and obtaining sub-labels under the primary label through the trained deformed bidirectional encoder representation model;
the corpus acquisition module comprises a segmentation submodule, a text conversion submodule, a sentence breaking submodule, and a restoration submodule;
the segmentation submodule is used for extracting voice data from the corpus data and dividing the voice data into user voice data and staff voice data;
the text conversion submodule is used for converting the user voice data and the staff voice data into a user text and a staff text, respectively;
the sentence breaking submodule is used for performing sentence breaking on the user text and the staff text, respectively, to obtain a sentence-broken user text and a sentence-broken staff text;
the restoration submodule is used for arranging the sentence-broken user text and the sentence-broken staff text in the original order of the text to obtain the multi-segment text data;
the label classification device of the corpus further comprises a training module, wherein the training module comprises a training data acquisition unit, a training data input unit, a training unit, a calculation unit, and a deployment unit;
the training data acquisition unit is used for acquiring a plurality of training data and annotation labels corresponding to the training data;
the training data input unit is used for inputting the training data and the corresponding annotation labels into an initial deformed bidirectional encoder representation model;
the training unit is used for training the initial deformed bidirectional encoder representation model under a plurality of neural network model parameters through a training function to obtain a plurality of deformed bidirectional encoder representation models;
the calculation unit is used for calculating loss function values of the plurality of deformed bidirectional encoder representation models and taking the deformed bidirectional encoder representation model with the smallest loss function value as a target deformed bidirectional encoder representation model;
the deployment unit is used for deploying the target deformed bidirectional encoder representation model to obtain the trained deformed bidirectional encoder representation model;
the algorithm analysis module further comprises a topic distribution calculation unit and a K value calculation unit;
the topic distribution calculation unit is used for inputting the corresponding multi-segment word segmentation results into the probability model and analyzing the corresponding multi-segment word segmentation results through modeling by the probability model to obtain a topic probability distribution of the multi-segment word segmentation results;
the K value calculation unit is used for performing topic clustering or text classification on the multi-segment word segmentation results based on the topic probability distribution to obtain a plurality of topics of the multi-segment word segmentation results and the numbers of topics, wherein the numbers of the plurality of topics are the plurality of K values.
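"Deformed bidirectional encoder representation model" as recited throughout is the customary machine translation of BERT (Bidirectional Encoder Representations from Transformers). Below is a minimal sketch of the model acquisition and classification modules of claim 5, assuming the Hugging Face transformers library and a pretrained Chinese checkpoint; the checkpoint name and helper functions are illustrative assumptions, not the patent's own implementation.

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

def build_sub_label_classifier(num_sub_labels, pretrained="bert-base-chinese"):
    """Load a pretrained Chinese BERT with a classification head sized to
    the number of sub-labels under one primary label (to be fine-tuned)."""
    tokenizer = BertTokenizerFast.from_pretrained(pretrained)
    model = BertForSequenceClassification.from_pretrained(
        pretrained, num_labels=num_sub_labels)
    return tokenizer, model

def predict_sub_label(tokenizer, model, word_segmentation_result):
    """Classify one word segmentation result into a sub-label index."""
    inputs = tokenizer(" ".join(word_segmentation_result),
                       return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

One such classifier would be trained per primary label, which is how the device routes a word segmentation result to a sub-label under its primary label.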
6. A computer device, comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the label classification method of a corpus according to any one of claims 1 to 4.
7. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the label classification method of a corpus according to any one of claims 1 to 4.
CN202010922970.2A 2020-09-04 2020-09-04 Label classification method and device for corpus, computer equipment and storage medium Active CN112084334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010922970.2A CN112084334B (en) 2020-09-04 2020-09-04 Label classification method and device for corpus, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010922970.2A CN112084334B (en) 2020-09-04 2020-09-04 Label classification method and device for corpus, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112084334A CN112084334A (en) 2020-12-15
CN112084334B (en) 2023-11-21

Family

ID=73732368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010922970.2A Active CN112084334B (en) 2020-09-04 2020-09-04 Label classification method and device for corpus, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112084334B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632278A (en) * 2020-12-18 2021-04-09 平安普惠企业管理有限公司 Labeling method, device, equipment and storage medium based on multi-label classification
CN112883183B (en) * 2021-03-22 2022-09-27 北京大学深圳研究院 Method for constructing multi-classification model, intelligent customer service method, and related device and system
CN113209636B (en) * 2021-05-21 2023-07-25 珠海金山数字网络科技有限公司 Game tag model training method and device, game tag generation method and device
CN115442832B (en) * 2021-06-03 2024-04-09 中国移动通信集团四川有限公司 Complaint problem positioning method and device and electronic equipment
CN116702775B (en) * 2023-08-07 2023-11-03 深圳市智慧城市科技发展集团有限公司 Text processing method, text processing device and computer readable storage medium
CN117149859B (en) * 2023-10-27 2024-02-23 中国市政工程华北设计研究总院有限公司 Urban waterlogging point information recommendation method based on government user portrait

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107894980A * 2017-12-06 2018-04-10 陈件 Multi-sentence text corpus classification method and classifier
CN108399228A (en) * 2018-02-12 2018-08-14 平安科技(深圳)有限公司 Article sorting technique, device, computer equipment and storage medium
CN110580287A * 2019-08-20 2019-12-17 北京亚鸿世纪科技发展有限公司 Emotion classification method based on transfer learning and ON-LSTM
CN110598213A (en) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10311092B2 (en) * 2016-06-28 2019-06-04 Microsoft Technology Licensing, Llc Leveraging corporal data for data parsing and predicting

Also Published As

Publication number Publication date
CN112084334A (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN112084334B (en) Label classification method and device for corpus, computer equipment and storage medium
CN109493977B (en) Text data processing method and device, electronic equipment and computer readable medium
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
CN111680159B (en) Data processing method and device and electronic equipment
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN111651996B (en) Digest generation method, digest generation device, electronic equipment and storage medium
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN111695338A (en) Interview content refining method, device, equipment and medium based on artificial intelligence
CN111694937A (en) Interviewing method and device based on artificial intelligence, computer equipment and storage medium
CN112084752B (en) Sentence marking method, device, equipment and storage medium based on natural language
CN111177186A (en) Question retrieval-based single sentence intention identification method, device and system
CN114528845A (en) Abnormal log analysis method and device and electronic equipment
CN112188312A (en) Method and apparatus for determining video material of news
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN117275466A (en) Business intention recognition method, device, equipment and storage medium thereof
CN113761895A (en) Text abstract generation method and device, electronic equipment and storage medium
CN115952854B (en) Training method of text desensitization model, text desensitization method and application
CN112100360A (en) Dialog response method, device and system based on vector retrieval
CN111783424A (en) Text clause dividing method and device
CN115238077A (en) Text analysis method, device and equipment based on artificial intelligence and storage medium
WO2023165111A1 (en) Method and system for identifying user intention trajectory in customer service hotline
CN116108181A (en) Client information processing method and device and electronic equipment
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
KR102215259B1 (en) Method of analyzing relationships of words or documents by subject and device implementing the same
CN113569741A (en) Answer generation method and device for image test questions, electronic equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant