CN113095073A - Corpus tag generation method and device, computer equipment and storage medium

Info

Publication number: CN113095073A
Application number: CN202110270401.9A
Authority: CN
Inventors: 周炬; 邵俊
Original assignee: Shenzhen Suoxinda Data Technology Co ltd
Current assignee: Shenzhen Suoxinda Data Technology Co ltd
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2021-07-09
Anticipated expiration: 2041-03-12
Also published as: CN113095073B

Abstract

The application relates to a corpus tag generation method and apparatus, a computer device, and a storage medium. The method comprises the following steps: performing word segmentation on each corpus sample based on a current entity vocabulary to obtain a plurality of corresponding word elements; counting the occurrences of each word element to obtain its word frequency; marking the word elements whose word frequency falls within a preset word frequency interval as entity words, and updating the current entity vocabulary; performing word segmentation on each corpus sample again according to the updated entity vocabulary, and determining the corpus keyword corresponding to each corpus sample; performing cluster analysis on the corpus keywords, and obtaining at least one corpus category according to the cluster analysis result; and, for each corpus category, calculating the feature value of each corpus keyword in that category and taking the corpus keyword whose feature value meets a condition as the corpus tag of that category. With this method, corpus tags can be generated conveniently, quickly and accurately.

Description

Corpus tag generation method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of intelligent robot technology, and in particular, to a corpus tag generation method, apparatus, computer device, and storage medium.

Background

With the development of AI (Artificial Intelligence) technology, chat robot technology has been widely applied in various commercial fields. At present, chat robots are mainly used in after-sales and marketing scenarios to automatically answer users' questions and handle routine tasks. The workflow of a chat robot is mainly to identify the user's real intention from the user's input and then execute the task flow corresponding to that intention. For example, user A asks a bank's chat robot "How do I modify my bank card transaction password?" The robot first recognizes that the real intention of the question is bank card password modification, and then activates the password modification flow: enter the card number, confirm identity, enter the original password, enter the new password, confirm and submit, modification successful.

To improve the accuracy with which a chat robot recognizes the user's true intention, user questions need to be labeled with intention tags. Current methods rely on manually understanding user questions and labeling their intentions. When the user questions are many and varied, considerable manpower and time must be spent manually identifying the category of each question and then labeling it with the corresponding tag, so tag labeling efficiency is low.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device and a storage medium for generating a corpus tag, which can conveniently, quickly and accurately generate the corpus tag.

A corpus tag generation method, characterized in that the method comprises:

based on a current entity vocabulary, performing word segmentation processing on each corpus sample in a corpus sample set respectively to obtain a plurality of word elements corresponding to the corpus sample set;

counting the occurrence frequency of each word element in the corpus sample set to obtain the word frequency corresponding to each word element;

marking the word elements with the word frequency in a preset word frequency interval as entity words, and updating the current entity vocabulary table based on the word elements with the entity word marks;

performing word segmentation processing on each corpus sample in the corpus sample set again according to the updated entity vocabulary, and determining corpus keywords corresponding to each corpus sample according to the number of words contained in each corpus sample;

performing cluster analysis on the corpus keywords corresponding to the corpus sample set, and classifying the plurality of corpus keywords according to the cluster analysis result to obtain at least one corpus category corresponding to the corpus sample set;

and, for each corpus category, calculating the feature value of each corpus keyword in that corpus category, and taking the corpus keyword whose feature value meets the condition as the corpus tag of that corpus category.

In one embodiment, the method further comprises: receiving a newly added corpus sample, and calculating the tag probability that the newly added corpus sample belongs to each of the at least one corpus category;

comparing each tag probability with a preset tag probability threshold, and, when a tag probability meeting a preset tag probability condition exists, assigning to the newly added corpus sample the corpus tag of the corpus category to which that tag probability belongs;

and when the tag probability meeting the preset tag probability condition does not exist, storing the newly added corpus sample into a newly added corpus sample set, updating the corpus sample set through the newly added sample set when the newly added sample set reaches the preset condition, and generating the corpus tag again based on the updated corpus sample set.

In one embodiment, before determining the corpus keyword corresponding to each corpus sample according to the number of words contained in each corpus sample, the method further includes:

respectively carrying out word attribute marking on each word obtained after word segmentation of each corpus sample, and carrying out attribute statistics on each word appearing in each corpus sample based on the word attributes;

for each corpus sample, when the number of words corresponding to each word attribute is smaller than a preset word number threshold value corresponding to the corresponding word attribute, marking the current corpus sample as a first structure corpus;

and when the number of words corresponding to any word attribute is larger than or equal to a preset word number threshold value corresponding to the word attribute, marking the current corpus sample as a second structure corpus.

In one embodiment, determining the corpus keywords corresponding to each corpus sample according to the number of words contained in each corpus sample includes:

when the current corpus sample belongs to a first structure corpus, determining corpus keywords corresponding to the current corpus sample based on word characteristics of words in the current corpus sample;

and when the current corpus sample belongs to a second structure corpus, performing semantic coding on the current corpus sample through a trained syntactic analysis model, and determining corpus keywords corresponding to the current corpus sample based on a coding result.

In one embodiment, counting the occurrence frequency of each word element in the corpus sample set to obtain a word frequency corresponding to each word element includes:

dividing the word elements into a plurality of word element groups, wherein each word element group comprises a plurality of word elements with the same word length;

counting the number of occurrences of each word element in the corpus sample set, and the total number of occurrences in the corpus sample set of all word elements in each word element group;

and determining the word frequency of each word element in the corpus sample set based on the number of occurrences of that word element and the total number of occurrences for its word element group.

In one embodiment, based on the current entity vocabulary, performing word segmentation processing on each corpus sample in the corpus sample set respectively, and acquiring a plurality of word elements corresponding to the corpus sample set, including:

for each corpus sample, removing non-Chinese characters in the current corpus sample to obtain corresponding corpus characters;

based on the current entity vocabulary, performing word segmentation processing on each corpus character to obtain a word combination corresponding to each corpus sample;

and summarizing all word combinations to obtain a word set corresponding to the plurality of corpus samples, and performing de-duplication processing on the word set to obtain a plurality of word elements corresponding to the corpus sample set.

In one embodiment, the method further comprises:

receiving, through a chat robot, a chat corpus sent by a chat object, and removing non-Chinese characters from the chat corpus to obtain corresponding chat characters; the chat robot is a customer service robot or a social robot;

based on a current entity vocabulary, performing word segmentation processing on the chat characters to obtain chat word combinations corresponding to the chat characters;

determining the chat keyword corresponding to the chat corpus from the chat word combination;

and determining a corpus tag corresponding to the chat corpus according to the corpus category to which the chat keyword belongs.

A corpus tag generation apparatus, the apparatus comprising:

the first word segmentation module is used for performing word segmentation on each corpus sample in the corpus sample set based on a current entity vocabulary to obtain a plurality of word elements corresponding to the corpus sample set;

the statistical module is used for counting the occurrence frequency of each word element in the corpus sample set to obtain the word frequency corresponding to each word element;

the marking module is used for marking the word elements with the word frequency in the preset word frequency interval as entity words and updating the current entity vocabulary table based on the word elements with the entity word marks;

the second word segmentation module is used for respectively carrying out word segmentation on each corpus sample in the corpus sample set again according to the updated entity vocabulary, and determining the corpus key words corresponding to each corpus sample according to the number of words contained in each corpus sample;

the clustering module is used for performing cluster analysis on the corpus keywords corresponding to the corpus sample set and classifying the plurality of corpus keywords according to the cluster analysis result to obtain at least one corpus category corresponding to the corpus sample set;

and the tag generation module is used for calculating, for each corpus category, the feature value of each corpus keyword in that corpus category and taking the corpus keyword whose feature value meets the condition as the corpus tag of that corpus category.

A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.

A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

According to the corpus tag generation method and apparatus, computer device, and storage medium described above, word segmentation is first performed on each corpus sample in the corpus sample set based on the current entity vocabulary to obtain a plurality of word elements corresponding to the corpus sample set; the occurrences of each word element in the corpus sample set are then counted, and the current entity vocabulary is updated according to the word frequency of each word element. Word segmentation is then performed on the corpus sample set again based on the updated entity vocabulary, and the corpus keyword corresponding to each corpus sample is determined from the segmentation result. Cluster analysis is performed on the corpus keywords corresponding to the corpus sample set, the corpus keywords are classified according to the clustering result, and one corpus keyword is determined as the corpus tag of each category. Through these steps, corpus tags can be generated accurately and objectively from the content of the corpus samples, which greatly saves manpower and time.

Drawings

FIG. 1 is a diagram of an application environment of the corpus tag generation method in one embodiment;

FIG. 2 is a flow chart illustrating a corpus tag generation method according to an embodiment;

FIG. 3 is a flowchart illustrating the corpus tag generation step in one embodiment;

FIG. 4 is a block diagram showing the structure of a corpus tag generation apparatus according to an embodiment;

FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The corpus tag generation method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The corpus samples are stored in a database, which may be located on the server 104 or exist independently of the server 104. Before receiving a corpus from the terminal 102, the server 104 first obtains a corpus sample set from the database and performs word segmentation on each corpus sample in the corpus sample set based on the current entity vocabulary to obtain a plurality of word elements. It then counts the occurrences of each word element in the corpus sample set and updates the current entity vocabulary. Next, the server 104 performs word segmentation on each corpus sample in the corpus sample set again based on the updated entity vocabulary, determines the corresponding corpus keywords from the segmentation results, performs cluster analysis on the corpus keywords corresponding to the corpus sample set, classifies the corpus keywords according to the clustering result, and determines one corpus keyword as the corpus tag of each category. Once the corpus tags have been obtained, the server 104 can use them to label the next corpus it receives. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers, which is not limited in this application.

Before describing the corpus tag generation method in the present application, the following explanations are first made for some terms involved in the embodiments of the present application:

Corpus sample: a chat sentence is a corpus sample. For example, "modify bank card password", "I want to modify mobile phone number", and "modify bound mobile phone number" are each one corpus sample.

Entity vocabulary: the entity vocabulary includes a plurality of entity words. For example, for a chat robot applied in banking, the corresponding entity vocabulary includes a plurality of banking words, such as: bank card, password, withdrawal, remittance, identity card, mobile phone number, etc.

Word elements: a word element is a word and is a constituent part of the corpus sample.

Corpus keywords: the keywords in each corpus sample that best match the real intention of the corpus, found based on the word segmentation result.

Corpus tag: key information for presenting the true intent of a type of corpus sample.

In one embodiment, as shown in FIG. 2, a corpus tag generation method is provided. The method is described as applied to the server 104 in FIG. 1, and includes the following steps:

Step S202, performing word segmentation on each corpus sample in the corpus sample set based on the current entity vocabulary to obtain a plurality of word elements corresponding to the corpus sample set.

Specifically, before generating corpus tags, the server obtains an existing entity vocabulary that contains a certain number of entity words. Based on this entity vocabulary, the server first performs word segmentation on each corpus sample in the corpus sample set, and obtains a plurality of word elements from the segmentation results.
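The embodiment does not name a particular segmentation algorithm. As a minimal sketch only, the following assumes forward maximum matching against the entity vocabulary, with characters that match no vocabulary entry emitted one by one, and it assumes that "Chinese characters" means the CJK Unified Ideographs block. The function names strip_non_chinese, segment and word_elements are hypothetical.

```python
import re


def strip_non_chinese(text):
    """Keep only characters in U+4E00..U+9FFF (assumed meaning of 'Chinese characters')."""
    return re.sub(r"[^\u4e00-\u9fff]", "", text)


def segment(text, vocabulary):
    """Forward maximum matching: at each position take the longest vocabulary word.

    A character that starts no vocabulary word is emitted as a one-character word.
    """
    max_len = max([len(word) for word in vocabulary] + [1])
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def word_elements(samples, vocabulary):
    """Segment every sample, pool the words, and de-duplicate them."""
    pooled = []
    for sample in samples:
        pooled.extend(segment(strip_non_chinese(sample), vocabulary))
    return sorted(set(pooled))


if __name__ == "__main__":
    vocab = {"银行卡", "密码"}
    print(segment("修改银行卡密码", vocab))
```

With only "银行卡" and "密码" in the vocabulary, the sample "修改银行卡密码" is split into the two single characters of "修改" followed by the two vocabulary words, which illustrates why a later vocabulary update can improve the segmentation.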

Step S204, counting the occurrence frequency of each word element in the corpus sample set to obtain the word frequency corresponding to each word element.

Specifically, the server counts the number of occurrences of each word element in the corpus sample set, and determines the word frequency of each word element in the corpus sample set from these counts and the total number of occurrences in the corpus sample set.

Step S206, the word elements with the word frequency in the preset word frequency interval are marked as entity words, and the current entity vocabulary is updated based on the word elements with the entity word marks.

Specifically, the word elements are numerous and may include words without entity meaning (for example, the greeting "hello", the subject "us", and the conjunction "and"). Before corpus tags are generated, these words need to be filtered out so that only words with entity meaning (for example, "bank card", "password", "ID card", and "mobile phone number") are selected. In this embodiment, a word frequency interval is preset based on how entity words behave in actual corpus samples, such that the word frequency of word elements that qualify as entity words falls within the interval. Word elements whose word frequency lies in the preset word frequency interval are marked as entity words, and the entity vocabulary is then updated with the marked entity words.
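Steps S204 and S206 can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: word frequency is taken here as a plain relative frequency over all occurrences, the interval is treated as closed, and the bounds low and high as well as the function name update_entity_vocabulary are hypothetical.

```python
from collections import Counter


def update_entity_vocabulary(segmented_samples, vocabulary, low, high):
    """Mark word elements whose relative frequency lies in [low, high] as entity words.

    segmented_samples: list of word lists, one list per corpus sample.
    Returns (updated vocabulary, newly marked entity words).
    """
    counts = Counter(word for sample in segmented_samples for word in sample)
    total = sum(counts.values())
    if total == 0:
        return set(vocabulary), set()
    entity_words = {
        word for word, count in counts.items() if low <= count / total <= high
    }
    return set(vocabulary) | entity_words, entity_words


if __name__ == "__main__":
    samples = [["你好", "修改", "密码"], ["你好", "密码"], ["你好", "银行卡"], ["你好"]]
    updated, marked = update_entity_vocabulary(samples, {"银行卡"}, 0.2, 0.4)
    print(marked)
```

In this toy data the greeting "你好" is too frequent (0.5) and the one-off words are too rare (0.125), so only "密码" (0.25) falls inside the assumed interval and is added to the vocabulary.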

Step S208, performing word segmentation on each corpus sample in the corpus sample set again according to the updated entity vocabulary, and determining the corpus keywords corresponding to each corpus sample according to the number of words contained in each corpus sample.

Specifically, after the entity vocabulary is updated, the server performs word segmentation again on each corpus sample in the corpus sample set based on the updated entity vocabulary. The result of this second segmentation is more accurate than the result of the first. From the segmentation result, the number of words contained in each corpus sample can be determined, and the corpus keywords of each corpus sample can then be determined according to that number.

Step S210, clustering analysis is carried out on the corpus key words corresponding to the corpus sample set, and a plurality of corpus key words are classified according to the clustering analysis result to obtain at least one corpus category corresponding to the corpus sample set.

Specifically, because the corpus samples differ, the corpus keywords corresponding to the corpus samples are not necessarily identical. In this embodiment, once the corpus keyword of each corpus sample has been obtained, the server performs cluster analysis on the corpus keywords to obtain a clustering model, and then classifies the corpus keywords according to the clustering model result to obtain one or more corpus categories. The corpus keywords in one corpus category are not necessarily identical, but they share the same clustering characteristics. In the clustering process, each corpus keyword can be converted into vector form for calculation to obtain the corresponding clustering result. The clustering algorithm may be a partitioning method, a hierarchical method, a density-based method, a grid-based method, or a model-based method; it may be any one of clustering algorithms such as K-means, K-medoids, CLARA, and CLARANS, or a combination of different clustering algorithms. This embodiment does not specifically limit it.
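As one possible illustration of step S210, the sketch below uses plain K-means, which is only one of the algorithms the embodiment allows. It assumes the corpus keywords have already been converted to numeric vectors by some embedding step that the text does not specify; the two-dimensional vectors in the example are made up by hand, and kmeans and group_keywords are hypothetical names. Centroids are initialised to the first k vectors purely to keep the example deterministic.

```python
def kmeans(vectors, k, max_iter=100):
    """Plain K-means. Returns one cluster index per input vector."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign every vector to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def group_keywords(keywords, vectors, k):
    """Cluster keyword vectors and return {cluster index: [keywords]}."""
    categories = {}
    for keyword, cluster in zip(keywords, kmeans(vectors, k)):
        categories.setdefault(cluster, []).append(keyword)
    return categories


if __name__ == "__main__":
    keywords = ["密码", "口令", "手机号", "号码"]
    vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]  # hand-made vectors
    print(group_keywords(keywords, vectors, 2))
```

The initialisation and the distance measure are design choices of this sketch; a production system would more likely rely on an established clustering library and a learned text embedding.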

Step S212, for each corpus category, respectively calculating a feature value of the corpus keyword in the corresponding corpus category, and using the corpus keyword with the feature value meeting the condition as the corpus tag of the corresponding corpus category.

Specifically, because the corpus keywords in one corpus category are not necessarily identical, the server further extracts a feature value for each corpus keyword. In this embodiment, the feature value of a corpus keyword is its TF-IDF (Term Frequency-Inverse Document Frequency) value, where TF is the term frequency and IDF is the inverse document frequency. The corpus keyword whose feature value meets a preset condition is then used as the tag of the corresponding corpus category, which helps ensure the accuracy of the corpus tag for each corpus category.
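The text does not spell out how TF-IDF is computed over categories or which condition the feature value must meet. The sketch below makes three assumptions: each corpus category is treated as one "document" made of its keywords, the IDF is the smoothed form ln((1 + N) / (1 + df)) + 1, and the "condition" is simply having the highest TF-IDF value in the category. The name category_labels is hypothetical, and every category is assumed to be non-empty.

```python
import math
from collections import Counter


def category_labels(categories):
    """Pick, for each category, the keyword with the highest TF-IDF value.

    categories: dict mapping category id -> non-empty list of corpus keywords
    (one keyword per corpus sample in that category).
    """
    n = len(categories)
    doc_freq = Counter()
    for keywords in categories.values():
        doc_freq.update(set(keywords))  # number of categories containing each keyword

    labels = {}
    for category, keywords in categories.items():
        counts = Counter(keywords)
        total = len(keywords)

        def tfidf(word):
            tf = counts[word] / total
            idf = math.log((1 + n) / (1 + doc_freq[word])) + 1  # smoothed IDF (assumed)
            return tf * idf

        labels[category] = max(counts, key=tfidf)
    return labels


if __name__ == "__main__":
    categories = {
        0: ["密码", "密码", "修改", "修改"],
        1: ["手机号", "手机号", "修改", "修改"],
    }
    print(category_labels(categories))
```

In the example, "修改" is as frequent as the other keyword inside each category, but because it appears in both categories its IDF is lower, so the category-specific keywords are chosen as tags.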

In the corpus tag generation method above, word segmentation is first performed on each corpus sample in the corpus sample set based on the current entity vocabulary to obtain a plurality of word elements corresponding to the corpus sample set; the occurrences of each word element in the corpus sample set are then counted, and the current entity vocabulary is updated according to the word frequency of each word element. Word segmentation is then performed on the corpus sample set again based on the updated entity vocabulary, and the corpus keyword corresponding to each corpus sample is determined from the segmentation result. Cluster analysis is performed on the corpus keywords corresponding to the corpus sample set, the corpus keywords are classified according to the clustering result, and one corpus keyword is determined as the corpus tag of each category. Through these steps, corpus tags can be generated accurately and objectively from the content of the corpus samples, which greatly saves manpower and time.

In one embodiment, the method further comprises: receiving a newly added corpus sample, and calculating the tag probability that the newly added corpus sample belongs to each of the at least one corpus category; comparing each tag probability with a preset tag probability threshold, and, when a tag probability meeting a preset tag probability condition exists, assigning to the newly added corpus sample the corpus tag of the corpus category to which that tag probability belongs; and, when no tag probability meets the preset tag probability condition, storing the newly added corpus sample in a newly added corpus sample set, updating the corpus sample set with the newly added sample set when the newly added sample set meets a preset condition, and generating the corpus tags again based on the updated corpus sample set.

Specifically, when the server receives a new corpus, on the one hand the newly received corpus needs to be labeled with the corpus tags already obtained, and on the other hand it needs to be added to the corpus sample set as a new corpus sample. For a newly received corpus sample, the server first determines the probability that the sample belongs to each existing corpus category (the tag probability), and then compares these tag probabilities with a preset tag threshold. When a tag probability meeting the preset tag probability condition exists, the newly received corpus is given the corresponding corpus tag and assigned to the corresponding corpus category. When no tag probability meets the preset tag probability condition, the server stores the sample separately in a newly added sample set. When the corpus samples in the newly added sample set that belong to none of the existing corpus categories reach a certain number, or the newly added sample set meets another preset condition, the server returns to step S202 and generates the corpus tags again based on all current corpus samples.

In the foregoing process, this embodiment preferentially compares the maximum tag probability with the preset tag threshold. In other embodiments, other tag probabilities may instead be compared with the preset tag threshold in order to reduce the influence of noise. As long as one of the calculated tag probabilities meets the tag threshold condition, the corpus tag of the corpus category to which that tag probability belongs may be assigned to the newly added corpus sample.
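The routing of a newly added corpus sample can be sketched as below. How the tag probabilities are computed is not specified in the text, so they are passed in as an input here. The sketch further assumes that the "preset tag probability condition" means the maximum probability is greater than or equal to the threshold, and that the "preset condition" for regeneration is a simple count of pending samples. route_new_sample, pending and retrain_size are hypothetical names.

```python
def route_new_sample(sample, probabilities, labels, threshold, pending, retrain_size):
    """Tag a new sample, or queue it when no category is likely enough.

    probabilities: dict category id -> tag probability for this sample (given as input).
    labels: dict category id -> corpus tag of that category.
    pending: list collecting samples that matched no category (mutated in place).
    Returns (tag or None, whether tag generation should be rerun).
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] >= threshold:
        return labels[best], False
    pending.append(sample)
    return None, len(pending) >= retrain_size


if __name__ == "__main__":
    labels = {0: "密码", 1: "手机号"}
    pending = []
    print(route_new_sample("修改银行卡密码", {0: 0.8, 1: 0.1}, labels, 0.6, pending, 2))
    print(route_new_sample("今天天气怎么样", {0: 0.2, 1: 0.3}, labels, 0.6, pending, 2))
```

When the second element of the returned pair is true, the caller would merge the pending samples into the corpus sample set and rerun the flow from step S202.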

In the above embodiment, whether the current corpus tags are complete can be checked in real time based on the newly added corpus samples, and new corpus tags are regenerated when the condition is met. This keeps the corpus tags valid over time and avoids the situation in which changes in the corpus samples leave the tags unable to label newly received corpora accurately.

In an embodiment, before determining the corpus keyword corresponding to each corpus sample according to the number of words contained in each corpus sample, the method further includes: respectively carrying out word attribute marking on each word obtained after word segmentation of each corpus sample, and carrying out attribute statistics on each word appearing in each corpus sample based on the word attributes; for each corpus sample, when the number of words corresponding to each word attribute is smaller than a preset word number threshold value corresponding to the corresponding word attribute, marking the current corpus sample as a first structure corpus; and when the number of words corresponding to any word attribute is larger than or equal to a preset word number threshold value corresponding to the word attribute, marking the current corpus sample as a second structure corpus.

Specifically, after the server performs word segmentation again on the corpus samples in the corpus sample set using the updated entity vocabulary, it further determines the structure of each corpus sample from the segmentation result. In this embodiment, the server marks the attribute of each word obtained by segmenting each corpus sample, and counts the number of words corresponding to each attribute. For example, if the word segmentation result of a corpus is "modify + password", then "modify" is a verb with a count of 1 and "password" is a noun with a count of 1. To judge the corpus structure accurately, this embodiment presets a threshold on the number of words for each word attribute. When the number of words of every word attribute is smaller than the preset word number threshold for that attribute, the current corpus sample is marked as a first structure corpus; when the number of words of any word attribute is greater than or equal to the preset word number threshold for that attribute, the current corpus sample is marked as a second structure corpus.

Assuming that the word number threshold preset for each word attribute is 2, a corpus whose word segmentation result is "modify + password + ID card" is a second structure corpus, because it contains two nouns and the noun count therefore reaches the threshold, whereas a corpus whose word segmentation result is "customer service" is a first structure corpus.
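The structure judgment just described can be sketched as follows. The word attributes are supplied by hand in this example because the text does not name a part-of-speech tagger; the function name corpus_structure, the attribute names "verb" and "noun", and the fallback default_threshold are assumptions made for illustration.

```python
from collections import Counter


def corpus_structure(tagged_words, thresholds, default_threshold=2):
    """Classify a segmented corpus sample as a 'first' or 'second' structure corpus.

    tagged_words: list of (word, attribute) pairs for one corpus sample.
    thresholds: dict attribute -> preset word number threshold.
    Returns "second" if any attribute count reaches its threshold, else "first".
    """
    counts = Counter(attribute for _, attribute in tagged_words)
    for attribute, count in counts.items():
        if count >= thresholds.get(attribute, default_threshold):
            return "second"
    return "first"


if __name__ == "__main__":
    thresholds = {"verb": 2, "noun": 2}
    print(corpus_structure([("修改", "verb"), ("密码", "noun")], thresholds))
    print(corpus_structure([("修改", "verb"), ("密码", "noun"), ("身份证", "noun")], thresholds))
```

With a threshold of 2 for every attribute, "modify + password" stays a first structure corpus, while "modify + password + ID card" becomes a second structure corpus because of its two nouns, matching the example in the text.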

In the above embodiment, by marking the word attributes and counting the number, and combining the word number threshold preset by the word attributes, the corpus structure of the corpus sample can be effectively judged, and a more accurate result can be obtained when a specific corpus structure is processed.

In one embodiment, determining the corpus keywords corresponding to each corpus sample according to the number of words contained in each corpus sample includes: when the current corpus sample belongs to the first structure corpus, determining corpus keywords corresponding to the current corpus sample based on word characteristics of all words in the current corpus sample; and when the current corpus sample belongs to the second structure corpus, performing semantic coding on the current corpus sample through the trained syntactic analysis model, and determining corpus keywords corresponding to the current corpus sample based on a coding result.

Specifically, for a first structure corpus, given its structural characteristics, the server may determine the corpus keyword of the current corpus sample from the word characteristics (for example, word position, part of speech, and word frequency) of each word in the sample. For example, for a first structure corpus whose word segmentation result is "customer service", "customer service" may be used directly as the corpus keyword; for a first structure corpus whose word segmentation result is "modify + password", "password" may be used as the corpus keyword. A corpus sample belonging to the second structure corpus has a different structure, and its corpus keyword cannot be obtained directly with the processing used for the first structure corpus, so the server processes it with the trained syntactic analysis model to obtain the corresponding corpus keyword. For example, for a corpus such as "Why did my bank card suddenly become unusable due to demagnetization? Was it logged off by your bank?", the word segmentation result is "bank card + demagnetization + logout". In this case the specific meaning of the corpus cannot be judged directly and the keyword cannot be determined directly; the syntactic analysis model is needed to determine that the meaning actually expressed is the problem of bank card demagnetization, from which the corresponding corpus keyword is determined.

In the above embodiment, by determining the corpus structure, the recognition efficiency and accuracy of the corpus keywords corresponding to each corpus sample can be effectively improved, and the corpus tagging efficiency of the server is further improved.

In one embodiment, counting the occurrence frequency of each word element in the corpus sample set to obtain a word frequency corresponding to each word element, includes: dividing a plurality of word elements into a plurality of word element groups, wherein each word element group comprises a plurality of word elements with the same word length; counting the occurrence frequency of each word element in the corpus sample set and the total occurrence frequency of each word element in each word element group in the corpus sample set; and determining the word frequency of each word element in the corpus sample set based on the times of each word element and the total times of the corresponding word element groups.

Specifically, in order to improve the accuracy of the entity vocabulary, in this embodiment the server further classifies the word elements by word length, and word elements with the same word length belong to the same class. For example, the class with a word length of 2 contains word elements composed of two characters (or two bytes); the class with a word length of 3 contains word elements composed of three characters (or three bytes); and so on for the other classes. The word frequency of each word element is then calculated as "the number of occurrences of the current word element / the total number of occurrences of all word elements of the same length". In this way, the frequency of the word elements within each word length class is preserved and is not diluted by the counts of word elements of other word lengths. Otherwise, on the one hand, the word frequency of each individual word element would be reduced, which makes the judgment more difficult; on the other hand, when words of other word lengths are too numerous, some entity words would be excluded because their frequency is too low.

In the above embodiment, the frequency of the word elements of each word length is calculated separately, so that the real weight of each word element can be effectively determined, the labeling quality of the entity vocabulary is improved, and more accurate corpus tags are generated.
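As a non-limiting illustration, the per-length word frequency calculation described above may be sketched in Python as follows. The function name `length_normalized_frequencies` and the sample data are hypothetical, and word length is assumed to be measured in characters.

```python
from collections import Counter, defaultdict


def length_normalized_frequencies(segmented_samples):
    """Return {word: occurrences / total occurrences of all words of the same length}."""
    # Count how often each word element occurs across the whole sample set.
    counts = Counter(word for sample in segmented_samples for word in sample)
    # Sum the occurrences per word-length group.
    totals = defaultdict(int)
    for word, n in counts.items():
        totals[len(word)] += n
    return {word: n / totals[len(word)] for word, n in counts.items()}


# Hypothetical segmentation results for three corpus samples.
samples = [["银行卡", "消磁"], ["修改", "密码"], ["银行卡", "密码"]]
freqs = length_normalized_frequencies(samples)
```

In this toy data the only three-character word accounts for all occurrences in its group, while each two-character word is measured only against the other two-character words.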

In one embodiment, based on the current entity vocabulary, performing word segmentation processing on each corpus sample in the corpus sample set respectively to obtain a plurality of word elements corresponding to the corpus sample set, including: for each corpus sample, removing non-Chinese characters in the current corpus sample to obtain corresponding corpus characters; based on the current entity vocabulary, performing word segmentation processing on each corpus character to obtain a word combination corresponding to each corpus sample; and summarizing all word combinations to obtain a word set corresponding to the plurality of corpus samples, and performing de-duplication processing on the word set to obtain a plurality of word elements corresponding to the corpus sample set.

Specifically, a given corpus sample may contain non-Chinese characters, such as punctuation marks, numbers, and English letters without substantial meaning. Before word segmentation is performed, these non-Chinese characters need to be removed, and word segmentation is then performed according to the current entity vocabulary. After the word segmentation result corresponding to each corpus sample is obtained, the results for all corpus samples in the corpus sample set are collected and deduplicated to obtain the plurality of word elements corresponding to the corpus sample set; that is, the word segmentation result of each corpus sample consists of one or more of these word elements.

In the above embodiment, non-Chinese characters are removed from the corpus samples, so that the number of irrelevant words in the word segmentation results can be effectively reduced, the proportion of entity words is increased, and the accuracy of labeling the entity words is further improved.
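As a non-limiting illustration, the removal of non-Chinese characters, the vocabulary-based word segmentation, and the de-duplication may be sketched in Python as follows. The segmentation algorithm is not specified in this application; forward maximum matching is used here purely as an assumption. The Unicode range `\u4e00-\u9fff` is likewise an assumed definition of "Chinese characters", and all function names are hypothetical.

```python
import re

# Assumption: "Chinese characters" means the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def strip_non_chinese(text):
    """Remove punctuation, digits, Latin letters and other non-Chinese characters."""
    return NON_CHINESE.sub("", text)


def segment(text, vocabulary, max_len=4):
    """Forward maximum matching; unmatched characters become one-character words."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def word_elements(samples, vocabulary):
    """Return the per-sample word combinations and the de-duplicated word elements."""
    combos = [segment(strip_non_chinese(s), vocabulary) for s in samples]
    return combos, {word for combo in combos for word in combo}


# Hypothetical entity vocabulary and corpus samples.
vocabulary = {"修改", "密码", "银行卡"}
combos, elements = word_elements(["修改密码123!", "银行卡 password 密码"], vocabulary)
```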

In one embodiment, the method further comprises: receiving a chat corpus sent by a chat object through a chat robot, removing non-Chinese characters in the chat corpus, and acquiring corresponding chat characters; the chat robot is a customer service robot or a social contact robot; based on the current entity vocabulary, carrying out word segmentation processing on the chat characters to obtain chat word combinations corresponding to the chat characters; determining chat keywords corresponding to the chat linguistic data through the chat word combination; and determining a corpus tag corresponding to the chat corpus according to the corpus category to which the chat keyword belongs.

Specifically, the server receives, through the chat robot, a chat corpus sent by a chat object, removes the non-Chinese characters in the chat corpus, performs word segmentation on the chat corpus based on the current entity vocabulary, determines the chat keywords of the chat corpus based on the word segmentation result, and then determines the corpus tag corresponding to the chat corpus according to the corpus category to which the chat keywords belong. According to the corpus tag, a response can be made to the current chat corpus, realizing real-time interaction between the chat robot and the chat object.

In the above embodiment, according to the corpus tag, the intention of the chat corpus can be automatically understood without manually labeling the chat corpus, and the chat robot can respond to the current chat corpus to realize real-time interaction between the chat robot and the chat object.

It should be understood that, although the steps in the flowchart of Fig. 2 are shown sequentially as indicated by the arrows, the steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in Fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Fig. 3 is a flowchart of corpus tag generation in an embodiment. The process comprises the following steps:

Step 1: Acquire a corpus sample set and segment words.

A set of historical user questions (corpus samples), i.e., the original corpus sample set, is preprocessed. In this embodiment, the non-Chinese characters in each question of the historical data set are first removed, and word segmentation is then performed based on the current entity vocabulary.

Step 2: Update the stop word list and the entity vocabulary.

Specifically, first, for the corpus (i.e., word segmentation result) preprocessed in step 1, all words appearing in the corpus sample set are classified according to word length. For example, words with a word length of 2 are a class, words with a word length of 3 are a class, and words with a word length greater than 4 are a class.

Then, the word frequency within each word length class is counted, and the stop word list and the entity vocabulary are updated according to the word frequency results of each word length class. The specific updating rules are as follows:

A word with a word length of 2 whose word frequency falls within the preset word frequency interval T_freq2 is added to the entity vocabulary; otherwise, it is added to the stop word list.

A word with a word length of 3 whose word frequency falls within the preset word frequency interval T_freq3 is added to the entity vocabulary; otherwise, it is added to the stop word list.

A word with a word length of more than 4 whose word frequency falls within the preset word frequency interval T_freq4 is added to the entity vocabulary; otherwise, it is added to the stop word list.
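As a non-limiting illustration, the updating rules above may be sketched in Python as follows. The interval bounds are hypothetical placeholders for the preset intervals T_freq2, T_freq3 and T_freq4, the intervals are assumed to be closed, and words whose length is not covered by the rules above are simply left untouched in this sketch.

```python
# Hypothetical (low, high) bounds standing in for T_freq2, T_freq3 and T_freq4.
T_FREQ2 = (0.1, 0.6)
T_FREQ3 = (0.1, 0.6)
T_FREQ4 = (0.05, 0.6)
intervals = {2: T_FREQ2, 3: T_FREQ3, "long": T_FREQ4}


def length_class(word):
    """Map a word to its word-length class, or None if no rule covers it."""
    n = len(word)
    if n in (2, 3):
        return n
    if n > 4:
        return "long"
    return None


def update_vocabularies(freqs, intervals, entity_vocab, stop_words):
    """Return updated copies of the entity vocabulary and the stop word list."""
    entity_vocab, stop_words = set(entity_vocab), set(stop_words)
    for word, freq in freqs.items():
        cls = length_class(word)
        if cls is None:
            continue  # length not covered by the rules in the text
        low, high = intervals[cls]
        if low <= freq <= high:
            entity_vocab.add(word)
        else:
            stop_words.add(word)
    return entity_vocab, stop_words
```

For example, with per-length frequencies `{"密码": 0.5, "我的": 0.9}`, the first word would enter the entity vocabulary and the second would enter the stop word list under the placeholder bounds above.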

Step 3: Re-segment words.

The user questions are re-segmented using the stop word list and the entity vocabulary updated in Step 2.

Step 4: Determine the structure of the user question.

According to the number of phrases obtained after the user questions are segmented in Step 3, this embodiment divides the user questions into two types: the first is the simple structure question A, mainly characterized by a short question description, few noun phrases, and clear semantics; the second is the complex structure question B, mainly characterized by a long question description, many noun phrases, and ambiguous wording.

The judgment rule of this embodiment is as follows: the noun phrases and verb phrases in a user question are marked by a syntactic analysis model; the numbers of noun phrases and verb phrases are then counted; if the number of noun phrases or the number of verb phrases is greater than the threshold T_struct1, the question is a complex structure question of type B; otherwise, it is a simple structure question of type A.
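As a non-limiting illustration, the judgment rule above may be sketched in Python as follows. The sketch assumes that the syntactic analysis model has already produced (phrase, tag) pairs; the tag names `"NP"` and `"VP"`, the function name, and the default threshold value are hypothetical.

```python
from collections import Counter


def classify_structure(tagged_phrases, t_struct1=2):
    """Return "B" (complex structure) or "A" (simple structure) for one question."""
    counts = Counter(tag for _, tag in tagged_phrases)
    # Type B if either the noun-phrase count or the verb-phrase count exceeds the threshold.
    if counts["NP"] > t_struct1 or counts["VP"] > t_struct1:
        return "B"
    return "A"


# Hypothetical tagged output for a short and a long question.
short_question = [("密码", "NP"), ("修改", "VP")]
long_question = [("银行卡", "NP"), ("消磁", "NP"), ("银行", "NP"), ("注销", "VP")]
```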

Step 5: Extract keywords.

In this embodiment, for the simple structure question A, keywords are extracted based on statistical features; that is, features such as the word position, part of speech, and word frequency of each word are determined, and the corresponding keywords are determined from these features. For the complex structure question B, the semantic information within the question needs to be considered, and a trained syntactic analysis model (for example, a time-series deep model) is used to extract the keywords; this is not described in detail in this embodiment.

Step 6: Cluster.

In the clustering process, the embodiment first performs word vectorization on the keywords extracted in step 5, and then clusters a plurality of corpus keywords by using a partition clustering model, thereby clustering all user questions into N classes, where each class is a corpus class.
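As a non-limiting illustration, the clustering of vectorized keywords may be sketched in Python as follows. This application only requires a partition clustering model; k-means is used here as one assumed example. The two-dimensional vectors are made-up stand-ins for real word vectors, and the first k vectors seed the centroids only to keep the sketch deterministic.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns a cluster index per vector and the final centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical keyword vectors (real word vectors would have many more dimensions).
keywords = ["密码", "银行卡", "口令", "信用卡"]
vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(vectors, k=2)
corpus_classes = dict(zip(keywords, labels))
```

In this toy data the two password-related keywords fall into one corpus class and the two card-related keywords into the other.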

Step 7: Generate a corpus tag library.

For the N classes finally obtained in Step 6, the TF-IDF value of each keyword in each class is further calculated, and the word with the highest TF-IDF value is used as the label of the questions of the corresponding class. Here, TF is the term frequency and IDF is the inverse document frequency. This is not described in detail in this embodiment.
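As a non-limiting illustration, selecting one label per class by TF-IDF may be sketched in Python as follows. The exact formula is not given in this application; the sketch assumes that each class is treated as one document, that TF is the share of a keyword among the keywords of the class, and that IDF is the natural logarithm of (number of classes / number of classes containing the keyword). The function name and data are hypothetical, and every class is assumed to be non-empty.

```python
import math
from collections import Counter


def class_labels(class_keywords):
    """Return {class: keyword with the highest TF-IDF value in that class}."""
    n_classes = len(class_keywords)
    # Document frequency: in how many classes does each keyword occur?
    doc_freq = Counter(w for words in class_keywords.values() for w in set(words))
    labels = {}
    for cls, words in class_keywords.items():
        tf = Counter(words)
        scores = {w: (c / len(words)) * math.log(n_classes / doc_freq[w])
                  for w, c in tf.items()}
        labels[cls] = max(scores, key=scores.get)
    return labels


# Hypothetical keywords collected for two corpus classes.
class_keywords = {
    0: ["密码", "密码", "修改", "银行卡"],
    1: ["银行卡", "银行卡", "消磁", "修改"],
}
labels = class_labels(class_keywords)
```

Under these assumptions a keyword that occurs in every class receives an IDF of zero, so the label of each class is a keyword that distinguishes it from the other classes.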

Step 8: Add to and update the tag library.

When a new user question Q is input, the clustering model of Step 6 is called to calculate the probability that question Q belongs to each of the N classes, and it is judged whether the highest probability is smaller than a threshold T_clu. If not, question Q belongs to the class corresponding to the highest probability, and question Q is labeled with the class label (corpus tag) of that class. Otherwise, question Q does not belong to the class corresponding to the highest probability and is put into a sample set Set (the newly added corpus sample set); when the number of samples in Set exceeds T_ratio of the whole sample set of the N classes, the process is triggered to return to Step 1 and retrain to generate a new tag library.
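As a non-limiting illustration, the handling of a new question Q may be sketched in Python as follows. How the class probabilities are computed is not specified in this application, so they are passed in as an input. T_ratio is assumed to be a fraction of the total number of existing samples; the default values of `t_clu` and `t_ratio` and all names are hypothetical.

```python
def route_question(question, probabilities, class_labels, pending, total_samples,
                   t_clu=0.6, t_ratio=0.1):
    """Return (tag or None, retrain flag) for one new question.

    probabilities maps each class to the probability that the question belongs to it;
    pending is the newly added corpus sample set and is modified in place.
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] >= t_clu:
        # Highest probability is not below the threshold: reuse that class's tag.
        return class_labels[best], False
    # Otherwise keep the question for later retraining.
    pending.append(question)
    return None, len(pending) > t_ratio * total_samples


tags = {0: "密码", 1: "银行卡"}
pending = []
first = route_question("Q1", {0: 0.8, 1: 0.2}, tags, pending, total_samples=10)
second = route_question("Q2", {0: 0.5, 1: 0.5}, tags, pending, total_samples=10)
third = route_question("Q3", {0: 0.4, 1: 0.6 - 0.1}, tags, pending, total_samples=10)
```

When the returned retrain flag is true, the caller would merge the pending samples into the corpus sample set and run the flow again from Step 1.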

The corpus tag generation method first performs word segmentation on each corpus sample in a corpus sample set based on a current entity vocabulary to obtain a plurality of word elements corresponding to the corpus sample set, then counts the frequency of each word element in the corpus sample set, and updates the current entity vocabulary according to the frequency corresponding to each word element; it then performs word segmentation on the corpus sample set again based on the updated entity vocabulary, and determines the corpus keywords corresponding to each corpus sample according to the word segmentation results; it performs cluster analysis on the corpus keywords corresponding to the corpus sample set, and classifies the plurality of corpus keywords according to the clustering results; and it determines one corpus keyword as the corpus tag of each class. Through these steps, the corresponding corpus tags can be generated accurately and objectively based on the content of each corpus sample, greatly saving manpower and time.

In one embodiment, as shown in Fig. 4, there is provided a corpus tag generation apparatus 400, including: a first segmentation module 402, a statistics module 404, a labeling module 406, a second segmentation module 408, a clustering module 410, and a label generation module 412, wherein:

the first segmentation module 402 is configured to perform segmentation processing on each corpus sample in the corpus sample set based on the current entity vocabulary, and obtain a plurality of word elements corresponding to the corpus sample set.

The statistics module 404 is configured to count the occurrence frequency of each word element in the corpus sample set, so as to obtain a word frequency corresponding to each word element.

The labeling module 406 is configured to label a word element whose word frequency is in a preset word frequency interval as an entity word, and update the current entity vocabulary based on the word elements labeled as entity words.

The second segmentation module 408 is configured to perform segmentation processing on each corpus sample in the corpus sample set again according to the updated entity vocabulary, and determine a corpus keyword corresponding to each corpus sample according to the number of words included in each corpus sample.

The clustering module 410 is configured to perform cluster analysis on the corpus keywords corresponding to the corpus sample set, and classify a plurality of corpus keywords according to a cluster analysis result to obtain at least one corpus category corresponding to the corpus sample set.

The tag generating module 412 is configured to calculate, for each corpus category, a feature value of a corpus keyword in the corresponding corpus category, and use the corpus keyword whose feature value meets the condition as a corpus tag of the corresponding corpus category.

The corpus tag generation apparatus first performs word segmentation on each corpus sample in the corpus sample set based on the current entity vocabulary to obtain a plurality of word elements corresponding to the corpus sample set, then counts the frequency of each word element in the corpus sample set, and updates the current entity vocabulary according to the frequency corresponding to each word element; it then performs word segmentation on the corpus sample set again based on the updated entity vocabulary, and determines the corpus keywords corresponding to each corpus sample according to the word segmentation results; it performs cluster analysis on the corpus keywords corresponding to the corpus sample set, and classifies the plurality of corpus keywords according to the clustering results; and it determines one corpus keyword as the corpus tag of each class. In this way, the corresponding corpus tags can be generated accurately and objectively based on the content of each corpus sample, greatly saving manpower and time.

In one embodiment, the apparatus is further configured to: receive a newly added corpus sample, and respectively calculate the label probability that the newly added corpus sample belongs to each category of the at least one corpus category; compare each label probability with a preset label probability threshold, and when a label probability meeting a preset label probability condition exists, assign to the newly added corpus sample the corpus tag corresponding to the corpus category to which that label probability belongs; and when no label probability meets the preset label probability condition, store the newly added corpus sample in a newly added corpus sample set, update the corpus sample set with the newly added corpus sample set when the newly added corpus sample set reaches a preset condition, and generate the corpus tags again based on the updated corpus sample set.

In one embodiment, the second segmentation module is further configured to: respectively perform word attribute marking on each word obtained after word segmentation of each corpus sample, and perform attribute statistics on the words appearing in each corpus sample based on the word attributes; for each corpus sample, when the number of words corresponding to each word attribute is smaller than the preset word number threshold corresponding to that word attribute, mark the current corpus sample as a first structure corpus; and when the number of words corresponding to any word attribute is greater than or equal to the preset word number threshold corresponding to that word attribute, mark the current corpus sample as a second structure corpus.

In one embodiment, the second segmentation module is further configured to: when the current corpus sample belongs to the first structure corpus, determine the corpus keywords corresponding to the current corpus sample based on the word characteristics of the words in the current corpus sample; and when the current corpus sample belongs to the second structure corpus, perform semantic coding on the current corpus sample through the trained syntactic analysis model, and determine the corpus keywords corresponding to the current corpus sample based on the coding result.

In one embodiment, the statistics module is further configured to: dividing a plurality of word elements into a plurality of word element groups, wherein each word element group comprises a plurality of word elements with the same word length; counting the occurrence frequency of each word element in the corpus sample set and the total occurrence frequency of each word element in each word element group in the corpus sample set; and determining the word frequency of each word element in the corpus sample set based on the times of each word element and the total times of the corresponding word element groups.

In one embodiment, the first segmentation module is further configured to: for each corpus sample, removing non-Chinese characters in the current corpus sample to obtain corresponding corpus characters; based on the current entity vocabulary, performing word segmentation processing on each corpus character to obtain a word combination corresponding to each corpus sample; and summarizing all word combinations to obtain a word set corresponding to the plurality of corpus samples, and performing de-duplication processing on the word set to obtain a plurality of word elements corresponding to the corpus sample set.

In one embodiment, the apparatus is further configured to: receiving a chat corpus sent by a chat object through a chat robot, removing non-Chinese characters in the chat corpus, and acquiring corresponding chat characters; the chat robot is a customer service robot or a social contact robot; based on the current entity vocabulary, carrying out word segmentation processing on the chat characters to obtain chat word combinations corresponding to the chat characters; determining chat keywords corresponding to the chat linguistic data through the chat word combination; and determining a corpus tag corresponding to the chat corpus according to the corpus category to which the chat keyword belongs.

For the specific limitations of the corpus tag generation apparatus, reference may be made to the above limitations of the corpus tag generation method, which are not described herein again. All or part of the modules in the corpus tag generation device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in Fig. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing corpus tag generation data and can also be used for storing a corpus sample set. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a corpus tag generation method.

Those skilled in the art will appreciate that the architecture shown in Fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of the technical features, they should be considered to be within the scope of the present specification.

The above examples only express several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A corpus tag generation method, characterized in that the method comprises:

2. The method of claim 1, further comprising:

receiving a newly added corpus sample, and respectively calculating the label probability of the newly added corpus sample belonging to each category of the at least one corpus category;

3. The method according to claim 1, wherein before determining the corpus keyword corresponding to each corpus sample according to the number of words contained in each corpus sample, the method further comprises:

4. The method according to claim 3, wherein determining the corpus keywords corresponding to each corpus sample according to the number of words contained in each corpus sample comprises:

5. The method according to claim 1, wherein the counting the number of times each word element appears in the corpus sample set to obtain a word frequency corresponding to each word element comprises:

6. The method according to claim 1, wherein the performing a word segmentation process on each corpus sample in a corpus sample set based on a current entity vocabulary to obtain a plurality of word elements corresponding to the corpus sample set comprises:

7. The method according to any one of claims 1-6, further comprising:

8. A corpus tag generation apparatus, the apparatus comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.