CN112699645A

CN112699645A - Corpus labeling method, apparatus and device

Info

Publication number: CN112699645A
Application number: CN202110318770.0A
Authority: CN
Inventors: 袁徐磊; 宋鑫; 肖鹏
Original assignee: Beijing Absolute Health Ltd
Current assignee: Beijing Shuidi Technology Group Co ltd
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2021-04-23
Anticipated expiration: 2041-03-25
Also published as: CN112699645B

Abstract

The application discloses a corpus labeling method, apparatus and device, relates to the technical field of artificial intelligence, and can generate text corpora of different violation types in batches, saving corpus labeling time. The method comprises the following steps: performing sentence-breaking processing on text data in different service scenarios, and storing the text corpora formed after the sentence-breaking processing in a corpus database; dividing a preset standard violation description into different violation categories by taking semantic points as units; establishing a keyword semantic rule according to the entity concepts contained in the semantic points and the logical relations between the entity concepts, wherein the keyword semantic rule is a violation expression mapped on different violation categories for the standard violation description; and matching target text corpora containing different violation categories from the corpus database by using the violation expressions, and labeling the target text corpora based on the violation categories.

Description

Corpus labeling method, apparatus and device

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a corpus tagging method, device, and apparatus.

Background

To promote sales, develop markets and improve customer satisfaction, enterprises widely use customer service centers to reach customers. This process generates a huge volume of call records and chat records, which are used to monitor customer service quality. Monitoring mainly consists of identifying whether customer service personnel use violating terms, for example, detecting whether they use standard terms and whether they promote the specified products.

Traditional manual quality inspection is inefficient and highly repetitive. At present, with the help of artificial intelligence, natural language processing technology can be used to pre-train a recognition model that assists in recognizing violating terms and greatly improves recognition efficiency. However, natural language processing requires a large amount of corpora to train the recognition model, and the more complex the semantics, the more corpora are required. In actual application scenarios, labeling a large amount of corpora consumes considerable labor, which increases the technical cost. For many complex business scenarios it is even difficult to provide sufficient corpora, so that the recognition results of the recognition model do not reach the standard and the recognition of violating terms cannot reach the theoretical accuracy.

Disclosure of Invention

In view of this, the present application provides a corpus tagging method, apparatus and device, and mainly aims to solve the problems of high labor cost in the corpus tagging process and insufficient corpus in a complex scene in the prior art.

According to a first aspect of the present application, there is provided a corpus annotation method, including:

sentence breaking processing is carried out on the text data under different service scenes, and text corpora formed after the sentence breaking processing are stored in a corpus database;

dividing preset standard violation description into different violation categories by taking semantic points as units;

establishing a keyword semantic rule according to the entity concepts contained in the semantic points and the logical relations between the entity concepts, wherein the keyword semantic rule is a violation expression mapped on different violation categories for the standard violation description;

and matching target text corpora containing different violation categories from the corpus database by using the violation expressions, and labeling the target text corpora based on the violation categories.

Further, the sentence-breaking processing on the text data in different service scenes specifically includes:

according to the time sequence of interactive initiation in the text data, splitting the text data under different service scenes by taking sentences as units to obtain text corpora corresponding to an interactive initiator;

and splitting and/or merging the text corpora according to the text length mapped by the text corpora corresponding to the interaction initiator.

Further, the splitting and/or merging the text corpus according to the text length mapped by the text corpus corresponding to the interaction initiator specifically includes:

comparing the text length of the text corpus mapping corresponding to the interaction initiator with a preset text length range;

splitting the text corpus aiming at the text corpus of which the text length is greater than the maximum value in a preset text length range;

and combining the text corpora aiming at the text corpora with the text length smaller than the minimum value in the preset text length range.

Further, the semantic point is a single sentence or a compound sentence including at least one entity concept, and the classifying the preset standard violation description into different violation categories by using the semantic point as a unit specifically includes:

taking semantic points as a unit, and extracting a single sentence or a compound sentence containing at least one entity concept from preset standard violation description;

calculating the violation feature degree of the single sentence or the compound sentence containing at least one entity concept mapped on different violation categories;

and dividing the single sentence or the compound sentence containing at least one entity concept into different violation categories according to the violation feature degree.

Further, the calculating the violation feature degree of the single sentence or the compound sentence including at least one entity concept mapped on different violation categories specifically includes:

aiming at the single sentence or the compound sentence containing at least one entity concept, extracting the entity concept and the logic relation between the entity concepts;

and calculating the violation feature degree of the single sentence or the compound sentence containing at least one entity concept mapped on different violation categories by matching the entity concepts and the logical relationship between the entity concepts with the violation features on different violation categories.

Further, the establishing a keyword semantic rule according to the entity concepts contained in the semantic point and the logical relationship between the entity concepts specifically includes:

determining logical operation conditions among keywords according to entity concepts contained in the semantic points and logical relations among the entity concepts;

and establishing a keyword semantic rule according to the logical operation condition among the keywords.

Further, the matching target text corpora containing different violation categories from the corpus database by using the violation expressions specifically includes:

mapping entity concepts and logic relations related in the keyword semantic rules into a corpus query expression by using the keyword semantic rules;

and matching target text corpora containing different violation categories from the corpus database according to the corpus query expression.

Further, after the classifying the preset standard violation description into different violation categories in units of semantic points, the method further includes:

constructing a recognition model for each violation category, wherein the recognition model is used for recognizing violation texts based on input interactive texts;

and forming sample data and test data from the labeled target text corpus, and training the identification model of each violation category by using the sample data and the test data.

According to a second aspect of the present application, there is provided a corpus annotation apparatus, comprising:

the processing unit is used for carrying out sentence breaking processing on the text data in different service scenes and storing text corpora formed after the sentence breaking processing into a corpus database;

the division unit is used for dividing preset standard violation description into different violation categories by taking the semantic point as a unit;

the establishing unit is used for establishing a keyword semantic rule according to the entity concepts contained in the semantic points and the logical relations between the entity concepts, wherein the keyword semantic rule is a violation expression mapped on different violation categories for the standard violation description;

and the labeling unit is used for matching target text corpora containing different violation categories from the corpus database by using the violation expressions, and labeling the target text corpora based on the violation categories.

Further, the processing unit includes:

the splitting module is used for splitting the text data under different service scenes by taking sentences as units according to the time sequence of interactive initiation in the text data to obtain a text corpus corresponding to an interactive initiator;

and the processing module is used for splitting and/or merging the text corpus according to the text length mapped by the text corpus corresponding to the interaction initiator.

Further, the processing module is specifically configured to compare a text length mapped by the text corpus corresponding to the interaction initiator with a preset text length range;

the processing module is specifically further configured to split the text corpus for text corpora of which the text length is greater than a maximum value in a preset text length range;

the processing module is specifically further configured to perform merging processing on the text corpus for which the text length is smaller than a minimum value in a preset text length range.

Further, the semantic point is a single sentence or a compound sentence containing at least one entity concept, and the dividing unit includes:

the extraction module is used for extracting a single sentence or a compound sentence containing at least one entity concept from preset standard violation description by taking the semantic point as a unit;

the calculation module is used for calculating the violation feature degree of the single sentence or the compound sentence containing at least one entity concept mapped on different violation categories;

and the dividing module is used for dividing the single sentence or the compound sentence containing at least one entity concept into different violation categories according to the violation feature degree.

Further, the calculation module includes:

the extraction submodule is used for extracting the entity concepts and the logic relation among the entity concepts aiming at the single sentence or the compound sentence containing at least one entity concept;

and the matching sub-module is used for matching the entity concepts and the logic relationship between the entity concepts with the violation characteristics on different violation categories, and calculating the violation characteristic degree of the single sentence or the compound sentence containing at least one entity concept mapped on different violation categories.

Further, the establishing unit includes:

the determining module is used for determining the logical operation conditions among the keywords according to the entity concepts contained in the semantic points and the logical relations among the entity concepts;

and the establishing module is used for establishing a keyword semantic rule according to the logical operation condition among the keywords.

Further, the labeling unit includes:

the mapping module is used for mapping entity concepts and logic relations related in the keyword semantic rules into a corpus query expression by using the keyword semantic rules;

and the matching module is used for matching target text corpora containing different violation categories from the corpus database according to the corpus query expression.

Further, the apparatus further comprises:

the construction unit is used for constructing an identification model aiming at each violation class after the preset standard violation description is divided into different violation classes by taking the semantic point as a unit, and the identification model is used for identifying the violation text based on the input interactive text;

and the training unit is used for forming the labeled target text corpus into sample data and test data, and training the identification model of each violation category by using the sample data and the test data.

According to a third aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the corpus tagging method described above.

According to a fourth aspect of the present application, there is provided a corpus tagging apparatus, including a storage medium, a processor, and a computer program stored on the storage medium and operable on the processor, wherein the processor implements the corpus tagging method when executing the program.

With the above technical solution, and in contrast to the existing approach of training a recognition model on a large number of manually labeled corpora, the corpus labeling method, apparatus and device provided by the present application perform sentence-breaking processing on the text data in different service scenarios and store the resulting text corpora in a corpus database; divide the preset standard violation description into different violation categories by taking semantic points as units; and establish a keyword semantic rule according to the entity concepts contained in the semantic points and the logical relations between the entity concepts, the keyword semantic rule being a violation expression mapped on different violation categories for the standard violation description. The violation expressions are then used to match target text corpora containing different violation categories from the corpus database, and the target text corpora are labeled based on the violation categories without manual labeling. Text corpora of different violation types can thus be generated in batches, which saves corpus labeling time. In addition, because the standard violation description is divided into finer categories by semantic point, the model's recognition of violating terms is optimized to a certain extent and semantic recognition accuracy is improved.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features and advantages of the present application are more readily apparent, specific embodiments of the present application are set out below.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic flow chart illustrating a corpus tagging method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart illustrating another corpus tagging method according to an embodiment of the present application;

FIG. 3 is a flow chart of a corpus tagging method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram illustrating a corpus tagging device according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram illustrating another corpus tagging device according to an embodiment of the present application.

Detailed Description

The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In the related art, with the help of artificial intelligence, natural language processing technology is used to pre-train a recognition model that assists in recognizing violating terms and greatly improves recognition efficiency. However, natural language processing requires a large amount of corpora to train the recognition model, and the more complex the semantics, the more corpora are required. In actual application scenarios, labeling a large amount of corpora consumes considerable labor, which increases the technical cost. For many complex scenarios it is even difficult to provide sufficient corpora, so that the recognition results of the recognition model do not reach the standard and the recognition of violating terms cannot reach the theoretical accuracy.

In order to solve the problem, this embodiment provides a corpus tagging method, as shown in fig. 1, where the method is applied to a server of an internet platform, and includes the following steps:

101. Perform sentence-breaking processing on the text data in different service scenarios, and store the text corpora formed after the sentence-breaking processing in a corpus database.

The different service scenarios here are customer service center scenarios. The method is particularly suitable for customer service centers that have standardized requirements on service quality, such as those in insurance, e-commerce, finance and medical care. To ensure service quality, the interaction information generated while the customer service center interacts with customers can be recorded as text data, and the text data is subjected to quality inspection to detect whether customer service personnel use standard expressions, for example, whether they provide help for the customer's questions and whether they use improper expressions.

Because customer service personnel and customers may interact through online chat, voice calls and other channels, the interaction information may take forms including but not limited to text, voice and pictures. Therefore, before sentence-breaking processing is performed on the text data in different service scenarios, the interaction information needs to be converted into text data. Interaction information in voice form can be converted with a speech-to-text tool or with translation software, and interaction information in picture form can be converted with OCR (optical character recognition) software or another character recognition tool; the conversion method is not limited here. During sentence-breaking processing, the text data can be split into sentence-form text corpora using effective punctuation marks as the sentence-breaking criterion, or using a combination of effective punctuation marks and the number of characters of the text; other factors can also be added as sentence-breaking criteria, which is not limited here.

The execution subject of the embodiment of the invention may be a corpus labeling apparatus, specifically a server of the customer service center, which can collect the text data generated by each customer service agent during interaction; only a small portion of this data contains non-standard phrases. To improve the service quality of the customer service center, recognizing non-standard terms is particularly important. In general, explicit non-standard terms are easily recognized by existing recognition models. However, the standards for violating phrases are not uniform. To recognize violating phrases under different standards, the text corpora formed after sentence-breaking processing are stored in the corpus database, so that text corpora containing violation categories in different complex service scenarios are accumulated in advance. These text corpora can serve as training corpora for the recognition model, providing ample training data for complex service scenarios. In the corpus labeling process, on the one hand the violation categories of the recognition model are divided more finely, and on the other hand the training of the recognition model is supplied with more sufficient text corpora, so that violating terms in complex service scenarios are recognized accurately.
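The punctuation-based sentence breaking described above can be illustrated with a minimal Python sketch. The set of "effective" punctuation marks and the function name `break_sentences` are assumptions made for illustration; the application does not specify them, and a production splitter would also need to handle cases such as decimal points and abbreviations.

```python
import re

# Assumed set of sentence-ending ("effective") punctuation marks,
# covering both Chinese and Western text. Not specified by the source.
SENTENCE_END = "。！？!?；;."


def break_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    escaped = re.escape(SENTENCE_END)
    # One or more non-terminator characters followed by any run of terminators.
    pattern = "[^" + escaped + "]+[" + escaped + "]*"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]
```

For instance, `break_sentences("你好。请问有什么问题？")` yields one entry per sentence, and text without any terminator is returned as a single sentence.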

102. And dividing preset standard violation descriptions into different violation categories by taking semantic points as units.

The semantic point is a single sentence or a compound sentence containing at least one entity concept. For example, "the user asks about the air quality of a certain place" contains three entity concepts: the user, the place and the air quality. The preset standard violation description is a description, preset by the customer service center, of what meets the violation standard. It is usually expressed as violation categories along different description dimensions: violation categories along the service product dimension, for example, product A is described incorrectly or the description of product A is omitted; violation categories along the text specification dimension, for example, uncivil phrases or incorrect descriptions are used; or violation categories along the customer emotion dimension, for example, the customer is identified as rather satisfied or as rather angry.

Specifically, taking the semantic point as a unit, the preset standard violation description can be broken into single sentences or compound sentences each containing at least one entity concept, and the standard violation description is divided into different violation categories according to the entity concepts in each semantic point and the logical relations between those concepts, which may relate to different violation categories. Different violation categories impose different violation requirements on the entity concepts. A violation requirement can concern whether a concept appears: for example, violation category 1 is that word A or word B is not allowed to appear, so a text in which the entity concept of word A or the entity concept of word B appears is classified into violation category 1. A violation requirement can also concern the appearance order of the concepts: for example, violation category 2 is that word B must appear whenever word A appears, so a text in which the entity concept of word A appears while the entity concept of word B does not is classified into violation category 2.
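The two example categories can be written as simple predicates. The following Python sketch is illustrative only: it treats an entity concept as a literal substring, which is an assumption, and the function name and category labels are hypothetical.

```python
def violation_categories(text, word_a, word_b):
    """Return the example violation categories that a text falls into.

    Assumption: an entity concept "appears" when its keyword occurs
    as a substring of the text.
    """
    categories = []
    # Category 1: word A or word B is not allowed to appear.
    if word_a in text or word_b in text:
        categories.append("category_1")
    # Category 2: word B must appear whenever word A appears.
    if word_a in text and word_b not in text:
        categories.append("category_2")
    return categories
```

In practice categories 1 and 2 would use different keyword pairs; the same pair is reused here only to keep the sketch short.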

103. And establishing a keyword semantic rule according to the entity concepts contained in the semantic points and the logic relation among the entity concepts.

The entity concepts contained in a semantic point are the specific objects that appear in the standard violation description. They are generally represented by concrete nouns, such as city, occupation or the Yangtze River, rather than by attributes of an object, such as good-looking or simple. The logical relations between concepts may be the logical connectives that link the entity concepts, such as AND, OR and NOT. The keyword semantic rule is a violation expression mapped on different violation categories for the standard violation description. The violation expression is a rule expression that satisfies logical operation conditions; such conditions may require that entity concepts be contained simultaneously, limit the order in which entity concepts appear, limit the number of words between entity concepts, and so on. For example, if A and B must both be contained, the rule expression is A AND B; if A or B must be contained, the rule expression is A OR B; and if the number of words between A and B must not exceed N, the rule expression is A SEPARATOR B SEPARATOR N.

Specifically, information extraction can be performed on the semantic point to obtain the entity concepts it contains and the logical relations between them. Information extraction mainly comprises entity extraction and relation extraction. Entity extraction yields discrete entity concepts and can be performed with an entity recognition tool. To obtain the corpus information, the logical relations between entities also need to be extracted from the semantic point; relation extraction yields the relations between entity concepts, for example causal, master-slave and selection relations. Specifically, the semantic information of the word sequence can be extracted through dependency syntax or semantic dependency analysis, and the logical relations between entity concepts can be obtained by screening the different dependency relations. The entity concepts contained in the semantic point are then associated using the logical relations between them, and the associated entity concepts constitute the keyword semantic rule. Because the keyword semantic rule captures the associations between entity concepts extracted from the preset standard violation description, it represents the characteristics of the standard violation description and is equivalent to a violation expression mapped on different violation categories for the standard violation description.
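As a rough illustration of how the three rule expressions could be evaluated against a text, consider the Python sketch below. The class names `And`, `Or` and `Near` are hypothetical. Two further assumptions are made: keywords are matched as literal substrings, and the "number of words between A and B" is counted in characters with A and B allowed in either order. The application does not fix these details.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class And:
    """Rule 'A AND B': both keywords must be contained."""
    a: str
    b: str

    def matches(self, text):
        return self.a in text and self.b in text


@dataclass(frozen=True)
class Or:
    """Rule 'A OR B': at least one keyword must be contained."""
    a: str
    b: str

    def matches(self, text):
        return self.a in text or self.b in text


@dataclass(frozen=True)
class Near:
    """Rule 'A SEPARATOR B SEPARATOR N': at most n characters between A and B."""
    a: str
    b: str
    n: int

    def matches(self, text):
        starts_a = [m.start() for m in re.finditer(re.escape(self.a), text)]
        starts_b = [m.start() for m in re.finditer(re.escape(self.b), text)]
        for i in starts_a:
            for j in starts_b:
                # Characters strictly between the two keywords, either order.
                if i < j:
                    gap = j - (i + len(self.a))
                else:
                    gap = i - (j + len(self.b))
                if 0 <= gap <= self.n:
                    return True
        return False
```

With the text "refund with a guarantee", `Near("refund", "guarantee", 10)` matches because eight characters separate the two keywords, while `Near("refund", "guarantee", 5)` does not.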

104. And matching target text corpora containing different violation categories from the corpus database by using the violation expressions, and labeling the target text corpora based on the violation categories.

The corpus database records text corpora formed from interaction information in various service scenarios. Some of these text corpora use standard wording and some use non-standard wording, and the non-standard wording may involve different violation categories. A violation expression acts as a string matching pattern for its violation category: if the corpus database contains text corpora of that violation category, they can be matched from the corpus database. In the invention, violation expressions for different violation categories are formed from the standard violation description, and with them the target text corpora of different violation categories can be matched from the corpus database in batches. The target text corpora matched for each violation category are text corpora of that category and can be labeled directly with it. The method is therefore well suited to batch corpus labeling and improves the efficiency of corpus labeling.

In contrast to the existing approach of training a recognition model on a large number of manually labeled corpora, the corpus labeling method provided by the embodiment of the application performs sentence-breaking processing on the text data in different service scenarios and stores the resulting text corpora in a corpus database; divides the preset standard violation description into different violation categories by taking semantic points as units; and establishes a keyword semantic rule according to the entity concepts contained in the semantic points and the logical relations between the entity concepts, the keyword semantic rule being a violation expression mapped on different violation categories for the standard violation description. The violation expressions are then used to match target text corpora containing different violation categories from the corpus database, and the target text corpora are labeled based on the violation categories without manual labeling. Text corpora of different violation types can thus be generated in batches, which saves corpus labeling time. In addition, because the standard violation description is divided into finer categories by semantic point, the model's recognition of violating terms is optimized to a certain extent and semantic recognition accuracy is improved.
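Batch matching and labeling can be sketched with Python's standard `sqlite3` module. Everything specific here is an assumption for illustration: the table name `corpus`, its columns `id` and `text`, the rule format (an operator plus a keyword list), and the use of SQL `LIKE` as the corpus query expression. Note that SQLite's `LIKE` is case-insensitive for ASCII by default and treats `%` and `_` in keywords as wildcards.

```python
import sqlite3


def label_corpus(conn, rules):
    """Match corpus rows against per-category keyword rules and label them.

    rules maps a violation category to (operator, keywords), where the
    operator is "AND" or "OR". Returns a list of (text, category) pairs.
    """
    labeled = []
    for category, (op, keywords) in rules.items():
        if op not in ("AND", "OR"):
            raise ValueError("unsupported operator: " + op)
        # Translate the keyword rule into a parameterized query expression.
        where = (" " + op + " ").join("text LIKE ?" for _ in keywords)
        params = ["%" + k + "%" for k in keywords]
        query = "SELECT text FROM corpus WHERE " + where + " ORDER BY id"
        for (text,) in conn.execute(query, params):
            labeled.append((text, category))
    return labeled
```

Each matched row is labeled with the category whose rule retrieved it, so a row that satisfies several rules is returned once per category.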

Further, as a refinement and an extension of the specific implementation of the foregoing embodiment, in order to fully describe the specific implementation process of the present embodiment, the present embodiment provides another corpus tagging method, as shown in fig. 2, the method includes:

201. According to the time sequence in which interactions are initiated in the text data, split the text data in different service scenarios by taking sentences as units, to obtain the text corpora corresponding to each interaction initiator.

The text data is the text content accumulated from an interactive session between at least two interaction initiators. The interactive session is directional and generally follows the time sequence in which interactions are initiated. For example, customer service A initiates "What questions do you have about the product?" and customer B then initiates "I want to ask about the performance parameters of the product"; the interaction initiators of this session are customer service A and customer B, and the direction of the session is from customer service A to customer B. Specifically, the text data in different service scenarios is split in time order by taking sentences as units, forming text sentences tagged with a time node and an interacting party. The text sentences are then grouped by interacting party and ordered by time node, giving the text corpora formed by each interacting party along the time direction, namely the text corpora corresponding to the interaction initiator.

Because an interactive session involves a large amount of text data and the definition of a violation differs between interaction initiators, and because quality inspection of violating terms is generally aimed at the customer service side, the customer service side can be taken as the main initiator here, and quality inspection is performed on the text corpora corresponding to the customer service side.
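The grouping by interaction initiator and time node might look like the following Python sketch. The input format, a sequence of `(timestamp, speaker, text)` tuples, and the function name are assumptions made for illustration, as is the simple punctuation-based sentence pattern.

```python
import re
from collections import defaultdict

# Assumed sentence pattern: text up to and including terminal punctuation.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def corpora_by_initiator(utterances):
    """Group sentences by speaker, in the time order of the utterances.

    utterances: iterable of (timestamp, speaker, text) tuples.
    Returns {speaker: [(timestamp, sentence), ...]}.
    """
    per_speaker = defaultdict(list)
    for timestamp, speaker, text in sorted(utterances, key=lambda u: u[0]):
        for sentence in _SENTENCE.findall(text):
            sentence = sentence.strip()
            if sentence:
                per_speaker[speaker].append((timestamp, sentence))
    return dict(per_speaker)
```

Selecting only the customer service side for quality inspection then amounts to reading one key from the returned dictionary.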

202. Split and/or merge the text corpus according to the text length mapped by the text corpus corresponding to the interaction initiator.

It can be understood that the length of the text corpus corresponding to the interaction initiator is not fixed: a text corpus may be a long sentence or a short sentence. A text corpus that is too long may contain several violation categories, each expressed to a similar degree, so that the violation category of the text corpus cannot be represented accurately; such a text corpus can be split. A text corpus that is too short may not exhibit any violation category, in which case no violation category needs to be identified, and such text corpora can be merged.

Specifically, the text length mapped by the text corpus corresponding to the interaction initiator can be compared with a preset text length range, which may be, for example, 100-128 bytes. A text corpus whose text length is greater than the maximum value of the preset range is too long for the violation category to be extracted accurately, and is split. A text corpus whose text length is smaller than the minimum value of the preset range is too short for the violation category to show clearly, and is merged with other text corpora.
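The comparison against the preset length range can be sketched as below. The present application does not state where an over-long corpus is cut or which neighbours a short corpus is merged with; this sketch assumes UTF-8 byte lengths, cuts on character boundaries, and merges a short corpus with the one that follows it as long as the result stays within the maximum. The function names are hypothetical.

```python
from typing import List

MIN_LEN, MAX_LEN = 100, 128  # preset text length range in bytes, as in the example above


def byte_len(text: str) -> int:
    return len(text.encode("utf-8"))


def split_long(text: str, max_len: int) -> List[str]:
    """Cut a corpus into chunks of at most max_len bytes, on character boundaries."""
    chunks, current = [], ""
    for ch in text:
        if current and byte_len(current + ch) > max_len:
            chunks.append(current)
            current = ch
        else:
            current += ch
    if current:
        chunks.append(current)
    return chunks


def merge_short(texts: List[str], min_len: int, max_len: int) -> List[str]:
    """Append a corpus to the previous one while the previous one is still too short."""
    merged: List[str] = []
    for text in texts:
        if merged and byte_len(merged[-1]) < min_len and byte_len(merged[-1] + text) <= max_len:
            merged[-1] += text
        else:
            merged.append(text)
    return merged


def normalize_lengths(corpora: List[str], min_len: int = MIN_LEN, max_len: int = MAX_LEN) -> List[str]:
    pieces: List[str] = []
    for text in corpora:
        pieces.extend(split_long(text, max_len))
    return merge_short(pieces, min_len, max_len)


sample = ["a" * 300, "b" * 40, "c" * 40, "d" * 40]
normalized = normalize_lengths(sample)
lengths = [byte_len(t) for t in normalized]
```

A trailing corpus with nothing left to merge into can remain shorter than the minimum; how such leftovers are handled is a design choice left open here.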

203. Taking semantic points as units, extract a single sentence or a compound sentence containing at least one entity concept from the preset standard violation description.

Because the standard violation description may contain several semantic points (that is, it may cover several violation categories, or describe several violation features of one violation category), a single sentence or a compound sentence containing at least one entity concept is extracted from the preset standard violation description in units of semantic points. The standard violation description is thereby split by semantic point, so that each extracted sentence carries the representative violation semantics of one violation category.

The preset standard violation description is equivalent to a text corpus that describes violation matters of different violation levels; the higher the violation level, the more serious the violation and the heavier the penalty that may follow.

204. Calculate the violation feature degree with which the single sentence or compound sentence containing at least one entity concept maps onto each violation category.

Specifically, the entity concepts and the logical relationships between them can be extracted from the single sentence or compound sentence containing at least one entity concept. The logical relationships between entity concepts are mainly expressed by logical conjunctions and/or negation words, and can embody several kinds of logical relationship. The violation feature degree with which the sentence maps onto each violation category is then calculated by matching the entity concepts and their logical relationships against the violation features of each violation category. The more clearly a violation category is expressed, the higher the violation feature degree. In one mode, the number of violation features of the violation category contained in the sentence is counted: the larger the number, the higher the violation feature degree. In another mode, the level of the violation features of the violation category contained in the sentence is evaluated: the higher the violation feature level, the higher the violation feature degree.
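The two modes of calculating the violation feature degree can be illustrated with a minimal sketch. It simplifies feature matching to substring tests and does not model the logical relationships between entity concepts. The category names, the feature table, the level values, and the use of the highest matched level in the second mode are all assumptions for this example.

```python
from typing import Dict

# Hypothetical violation features per category; each feature carries an assumed level.
VIOLATION_FEATURES: Dict[str, Dict[str, int]] = {
    "false_promise": {"guarantee": 3, "no risk": 2},
    "rude_language": {"stupid": 3, "shut up": 2},
}


def degree_by_count(sentence: str, features: Dict[str, int]) -> int:
    """Mode 1: the more violation features the sentence contains, the higher the degree."""
    return sum(1 for feature in features if feature in sentence)


def degree_by_level(sentence: str, features: Dict[str, int]) -> int:
    """Mode 2: the higher the level of a contained feature, the higher the degree."""
    return max((level for feature, level in features.items() if feature in sentence), default=0)


def violation_feature_degrees(sentence: str, mode: str = "count") -> Dict[str, int]:
    """Return the violation feature degree of the sentence on every category."""
    score = degree_by_count if mode == "count" else degree_by_level
    return {category: score(sentence, features) for category, features in VIOLATION_FEATURES.items()}


sentence = "we guarantee there is no risk at all"
by_count = violation_feature_degrees(sentence, mode="count")
by_level = violation_feature_degrees(sentence, mode="level")
```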

205. Divide the single sentence or compound sentence containing at least one entity concept into different violation categories according to the violation feature degree.

The violation feature degree reflects, to a certain extent, the probability that the sentence exhibits the violation features of the corresponding violation category. The higher the violation feature degree of a single sentence or compound sentence containing at least one entity concept on a violation category, the more likely the semantic point belongs to that violation category. The sentence can therefore be assigned to that violation category, which yields the semantic points of the different violation categories.
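One simple way to turn violation feature degrees into a category assignment is to pick the category with the highest degree. The sketch below assumes this rule together with a minimum degree below which no category is assigned; neither the rule nor the `min_degree` parameter is stated in the present application.

```python
from typing import Dict, Optional


def assign_category(degrees: Dict[str, float], min_degree: float = 1.0) -> Optional[str]:
    """Return the category with the highest violation feature degree, or None."""
    if not degrees:
        return None
    best = max(degrees, key=lambda category: degrees[category])
    # A sentence that matches no category strongly enough is left unassigned.
    return best if degrees[best] >= min_degree else None


category = assign_category({"false_promise": 2, "rude_language": 0})
```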

It should be noted that, in order to perform more detailed division on the violation categories, subdivision on different violation levels may be set for the violation categories, for example, for the violation category 1, the subordinate categories 11, 12, and 13 of the violation category 1 may be specifically set, and thus the violation categories in the text corpus may be identified more accurately.

206. Establish a keyword semantic rule according to the entity concepts contained in the semantic points and the logical relationships between the entity concepts.

Specifically, the logical operation conditions between keywords can be determined from the entity concepts contained in the semantic point and the logical relationships between them; a logical operation condition is the logical condition that the keywords must satisfy when the rule is evaluated. The keyword semantic rule is then established from the logical operation conditions between the keywords.
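As a rough illustration, a keyword semantic rule could be assembled from three groups of keywords: those that must all appear, those of which at least one must appear, and those that must not appear. The `build_rule` helper and its grouping are hypothetical; the operator symbols follow the notation used in this description.

```python
from typing import Sequence


def build_rule(
    all_of: Sequence[str] = (),
    any_of: Sequence[str] = (),
    none_of: Sequence[str] = (),
) -> str:
    """Combine keywords into a rule string using '&', '|', '!' and parentheses."""
    parts = list(all_of)
    if any_of:
        # Alternatives are grouped so that '|' does not mix with the surrounding '&'.
        parts.append("(" + "|".join(any_of) + ")")
    parts.extend("!" + keyword for keyword in none_of)
    if not parts:
        raise ValueError("a rule needs at least one keyword")
    return "&".join(parts)


rule = build_rule(all_of=["water drop"], any_of=["insurance", "policy"], none_of=["refund"])
```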

207. Using the keyword semantic rule, map the entity concepts and logical relationships involved in the rule into a corpus query expression.

It can be understood that the logical operation conditions in a keyword semantic rule are composed of logical operators, which indicate the logical operation conditions that the entity concepts must satisfy. Illustratively, the logical operator "&" represents a logical AND: the conditions before and after it must both be satisfied, so "water drop & insurance" matches a corpus containing both "water drop" and "insurance". The logical operator "|" represents a logical OR: at least one of the conditions before and after it must be satisfied, so "water drop | insurance" matches a corpus containing "water drop" or "insurance". The logical operator "!" represents a logical NOT: the condition after it must not be satisfied, so "!water drop" matches a corpus that does not contain "water drop". The symbols "(" and ")" are the left and right parentheses used for grouping. In addition, keywords that must appear in sequence are separated by the symbol "_", and the maximum number of words allowed between adjacent keywords is specified by a number; for example, "water drop_insurance_5" indicates that the two keywords "water drop" and "insurance" must appear in that order, with at most 5 words between them.
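The operator semantics described above can be checked against plain text with a small evaluator. This is a minimal sketch, not the query mapping used in the present application. It assumes that "&" binds more tightly than "|", that the gap in a "_" sequence is counted in characters, and that a sequence without a trailing number allows any gap. The function names are hypothetical.

```python
import re
from typing import List


def tokenize(rule: str) -> List[str]:
    """Operators and parentheses are one-character tokens; everything else is a keyword."""
    tokens = re.findall(r"[&|!()]|[^&|!()]+", rule)
    return [t.strip() for t in tokens if t.strip()]


def keyword_matches(keyword: str, text: str) -> bool:
    """Match a plain keyword, or an ordered sequence such as 'a_b_5'."""
    parts = keyword.split("_")
    if len(parts) == 1:
        return keyword in text
    gap = int(parts.pop()) if parts[-1].isdigit() else None
    joiner = ".*?" if gap is None else ".{0,%d}?" % gap
    pattern = joiner.join(re.escape(p) for p in parts)
    return re.search(pattern, text, flags=re.DOTALL) is not None


def evaluate(rule: str, text: str) -> bool:
    """Evaluate a keyword semantic rule against one text corpus."""
    tokens = tokenize(rule)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()  # always parse, so tokens are consumed
            value = value or right
        return value

    def parse_and() -> bool:
        nonlocal pos
        value = parse_not()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_not()
            value = value and right
        return value

    def parse_not() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of rule")
        token = tokens[pos]
        pos += 1
        if token == "!":
            return not parse_not()
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError("unexpected operator: " + token)
        return keyword_matches(token, text)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


hit = evaluate("water drop_insurance_5", "water drop insurance")
```

In a deployment backed by a search engine, the same parse tree would be translated into that engine's query language rather than evaluated in memory.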

208. Match target text corpora containing different violation categories from the corpus database according to the corpus query expression, and label the target text corpora based on the violation categories.

Specifically, when the target text corpora are labeled based on the violation categories, the target text corpora are the text corpora screened from the corpus database for each violation category, and each violation category has its corresponding violation features. The target text corpora can therefore be labeled directly with the label of the corresponding violation category. The target text corpora of each violation category can of course be further labeled with the subordinate categories, which improves the labeling accuracy of the text corpora across the different violation categories.
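The labeling step can be sketched as attaching the category, and optionally the subordinate category, of every rule that a text corpus matches. In this sketch each rule is represented by a hypothetical predicate function, and the record layout is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], bool]  # hypothetical: returns True when a corpus matches the rule


def label_corpora(corpora: List[str], rules: Dict[Tuple[str, str], Rule]) -> List[dict]:
    """Label every corpus with each (category, subordinate category) whose rule it matches."""
    labeled = []
    for text in corpora:
        for (category, subcategory), matches in rules.items():
            if matches(text):
                labeled.append({"text": text, "category": category, "subcategory": subcategory})
    return labeled


rules = {
    ("1", "11"): lambda text: "guarantee" in text,
    ("2", "21"): lambda text: "stupid" in text,
}
corpora = ["we guarantee returns", "have a nice day", "that is a stupid question"]
labeled = label_corpora(corpora, rules)
```

Corpora that match no rule are simply left out, so only target text corpora receive labels.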

209. For each violation category, a recognition model is constructed.

Because different violation categories have different violation features, different recognition models are needed to recognize them. Where a violation category is subdivided into violation levels, one recognition model is generally constructed for the violation category and applies to all of its subordinate categories. For example, subordinate categories 11, 12, and 13 set for violation category 1 all use recognition model A, and subordinate categories 21, 22, and 23 set for violation category 2 all use recognition model B.

210. Form sample data and test data from the labeled target text corpora, and train the recognition model of each violation category using the sample data and the test data.

Specifically, take customer service conversation quality inspection in a service scene as the actual application scene: a recognition model for violation terms needs to be constructed and trained with text corpora of the different violation categories. As shown in fig. 3, the whole process of corpus labeling and recognition model training can be realized as follows. First, conversation and chat records are converted into text, and the text is split into sentences and stored as corpora in an Elasticsearch text corpus. Next, the standard violation descriptions are sorted out and classified in units of semantic points, and a corresponding BERT model is established for each category. Keyword semantic rules are then established according to the concepts and logical relationships contained in the semantic points, and the keyword semantic rules are mapped into query expressions for the Elasticsearch text corpus. Text corpora of the different categories are matched from the Elasticsearch text corpus, with sentences as the unit of matching. Finally, the matched text corpora are labeled with the violation categories to form the training set and test set of the BERT model, and the BERT model is trained.
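Forming the training set and test set per violation category can be sketched with the standard library alone. The 80/20 ratio, the fixed random seed, and the `split_by_category` name are assumptions; the present application does not specify how the labeled corpora are divided.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_category(
    labeled: List[Tuple[str, str]], test_ratio: float = 0.2, seed: int = 0
) -> Dict[str, Dict[str, List[str]]]:
    """Split (text, category) pairs into a train and a test list for each category."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text, category in labeled:
        groups[category].append(text)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {}
    for category, texts in groups.items():
        texts = texts[:]
        rng.shuffle(texts)
        # Keep at least one test example when the category has more than one corpus.
        n_test = max(1, int(len(texts) * test_ratio)) if len(texts) > 1 else 0
        splits[category] = {"train": texts[n_test:], "test": texts[:n_test]}
    return splits


labeled_pairs = [("text %d" % i, "1") for i in range(10)] + [("only one", "2")]
splits = split_by_category(labeled_pairs)
```

Each category's split would then feed the recognition model built for that category.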

Further, as a specific implementation of the method in fig. 1-2, an embodiment of the present application provides a corpus tagging device, as shown in fig. 4, the device includes: a processing unit 31, a dividing unit 32, a building unit 33, and a labeling unit 34.

The processing unit 31 may be configured to perform sentence segmentation on the text data in different service scenarios, and store a text corpus formed after the sentence segmentation to a corpus database;

the dividing unit 32 may be configured to divide preset standard violation descriptions into different violation categories by taking a semantic point as a unit;

the establishing unit 33 may be configured to establish a keyword semantic rule according to the entity concepts included in the semantic point and the logical relationship between the entity concepts, where the keyword semantic rule is a violation expression mapped on different violation categories for the standard violation description;

the labeling unit 34 may be configured to match a target text corpus including different violation categories from the corpus database by using the violation expression, and label the target text corpus based on the violation categories.

Compared with the existing approach of training a recognition model on a large number of manually labeled corpora, the corpus labeling device provided by this embodiment performs sentence-breaking processing on the text data under different service scenes and stores the resulting text corpora in the corpus database. It divides the preset standard violation description into different violation categories in units of semantic points, and establishes a keyword semantic rule according to the entity concepts contained in the semantic points and the logical relationships between them, the keyword semantic rule being a violation expression that maps the standard violation description onto the different violation categories. It then uses the violation expression to match target text corpora containing different violation categories from the corpus database and labels the target text corpora based on the violation categories, so that the text corpora need not be labeled manually. Text corpora of different violation types can thus be generated in batches, which saves corpus labeling time. Moreover, because the standard violation description is classified more finely in units of semantic points, the model's recognition of violation terms is improved to a certain extent, and semantic recognition accuracy increases.

In a specific application scenario, as shown in fig. 5, the processing unit 31 includes:

the splitting module 311 may be configured to split the text data in different service scenes in units of sentences according to a time sequence of an interaction initiation in the text data, and obtain a text corpus corresponding to an interaction initiator;

the processing module 312 may be configured to split and/or merge the text corpus according to the text length mapped by the text corpus corresponding to the interaction initiator.

In a specific application scenario, the processing module 312 may be specifically configured to compare a text length mapped by a text corpus corresponding to the interaction initiator with a preset text length range;

the processing module 312 may be further configured to split the text corpus for a text corpus in which the text length is greater than a maximum value in a preset text length range;

the processing module 312 may be further specifically configured to perform merging processing on the text corpus in which the text length is smaller than the minimum value in the preset text length range.

In a specific application scenario, as shown in fig. 5, the semantic point is a single sentence or a compound sentence containing at least one entity concept, and the dividing unit 32 includes:

an extracting module 321, configured to extract a single sentence or a compound sentence including at least one entity concept from a preset standard violation description by using a semantic point as a unit;

a calculating module 322, configured to calculate violation feature degrees of the single sentence or the compound sentence including at least one entity concept mapped on different violation categories;

the dividing module 323 may be configured to divide the single sentence or the compound sentence including the at least one entity concept into different violation categories according to the violation feature degrees.

In a specific application scenario, as shown in fig. 5, the calculating module 322 includes:

the extracting sub-module 3221 may be configured to, for the single sentence or the compound sentence containing at least one entity concept, extract entity concepts and logical relationships between the entity concepts;

the matching sub-module 3222 may be configured to calculate the violation feature degree of the single sentence or the compound sentence including the at least one entity concept, which is mapped on different violation categories, by matching the entity concepts and the logical relationship between the entity concepts with the violation features on different violation categories.

In a specific application scenario, as shown in fig. 5, the establishing unit 33 includes:

the determining module 331, configured to determine a logical operation condition between the keywords according to the entity concepts contained in the semantic point and the logical relationship between the entity concepts;

the establishing module 332 may be configured to establish a keyword semantic rule according to a logical operation condition between the keywords.

In a specific application scenario, as shown in fig. 5, the labeling unit 34 includes:

the mapping module 341 may be configured to map, by using the keyword semantic rule, the entity concept and the logical relationship related to the keyword semantic rule into a corpus query expression;

the matching module 342 may be configured to match target text corpora containing different violation categories from the corpus database according to the corpus query expression.

In a specific application scenario, as shown in fig. 5, the apparatus further includes:

the constructing unit 35 may be configured to, after the preset standard violation description is divided into different violation categories by taking the semantic point as a unit, construct an identification model for each violation category, where the identification model is configured to identify a violation text based on an input interactive text;

the training unit 36 may be configured to form the labeled target text corpus into sample data and test data, and train the identification model of each violation category using the sample data and the test data.

It should be noted that other corresponding descriptions of the functional units related to the corpus tagging device applicable to the server side provided in this embodiment may refer to the corresponding descriptions in fig. 1 and fig. 2, and are not described herein again.

Based on the methods shown in fig. 1-2, correspondingly, an embodiment of the present application further provides a storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the corpus tagging method shown in fig. 1-2.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product. The software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present application.

Based on the method shown in fig. 1-2 and the virtual device embodiments shown in fig. 4-5, and in order to achieve the above object, an embodiment of the present application further provides a server entity device, which may specifically be a computer, a server, or another network device. The entity device includes a storage medium and a processor: the storage medium is used for storing a computer program, and the processor is used for executing the computer program to implement the corpus tagging method shown in fig. 1-2.

Optionally, the above entity devices may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and the like. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), etc.

Those skilled in the art will appreciate that the structure of the corpus labeling entity device provided in this embodiment does not limit the entity device, which may include more or fewer components, combine some components, or arrange the components differently.

The storage medium may further include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above entity device and supports the operation of the information processing program and other software and/or programs. The network communication module is used for communication among the components inside the storage medium, and for communication with other hardware and software in the entity device.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware. Compared with the prior art, the technical solution of the present application uses the violation expressions to match target text corpora containing different violation categories from the corpus database and labels the target text corpora based on the violation categories, so that the text corpora need not be labeled manually. Text corpora of different violation types can thus be generated in batches, which saves corpus labeling time. Moreover, because the standard violation description is classified more finely in units of semantic points, the model's recognition of violation terms is improved to a certain extent, and semantic recognition accuracy increases.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The serial numbers used in the present application are for description only and do not represent the superiority or inferiority of the implementation scenarios. The above discloses only a few specific implementation scenarios of the present application; the present application is not limited thereto, and any variation that a person skilled in the art can conceive of is intended to fall within the scope of the present application.

Claims

1. A corpus tagging method is characterized by comprising the following steps:

2. The method according to claim 1, wherein the sentence-breaking processing on the text data in different service scenarios specifically includes:

3. The method according to claim 2, wherein the splitting and/or merging the text corpus according to the text length mapped by the text corpus corresponding to the interaction initiator specifically includes:

4. The method according to claim 1, wherein the semantic point is a single sentence or a compound sentence including at least one entity concept, and the classifying the preset standard violation description into different violation categories by using the semantic point as a unit specifically includes:

5. The method according to claim 4, wherein the calculating the violation feature degree of the single sentence or the multiple sentence including the at least one entity concept mapped on different violation categories specifically comprises:

6. The method according to claim 1, wherein the establishing of the keyword semantic rule according to the entity concepts contained in the semantic point and the logical relationship between the entity concepts specifically comprises:

7. The method according to claim 1, wherein the matching of the target text corpora containing different violation categories from the corpus database using the violation expressions specifically comprises:

8. The method according to any one of claims 1-7, wherein after the classifying the preset standard violation descriptions into different violation categories in semantic point units, the method further comprises:

9. A corpus tagging device, comprising:

and the marking unit is used for matching target text corpora containing different violation categories from the corpus database by using the violation expressions and marking the target text corpora based on the violation categories.

10. A storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the corpus tagging method according to any one of claims 1 to 8.

11. A corpus tagging device comprising a storage medium, a processor and a computer program stored on the storage medium and operable on the processor, wherein the processor implements the corpus tagging method according to any one of claims 1 to 8 when executing the program.