CN112463969B

CN112463969B - Method, system, equipment and medium for detecting new words of cigarette brand and product rule words

Info

Publication number: CN112463969B
Application number: CN202011443318.9A
Authority: CN
Inventors: 张侃弘; 谭琛; 栾晓宇; 周欣然; 李敏刚; 刘璐珺; 王佩军
Original assignee: Shanghai Tobacco Group Co Ltd
Current assignee: Shanghai Tobacco Group Co Ltd
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2022-09-20
Anticipated expiration: 2040-12-08
Also published as: CN112463969A

Abstract

The invention provides a method, system, device, and medium for detecting new jargon words for cigarette brands and product specifications. The detection method comprises: obtaining a chat text data set of tobacco industry retailers and searching it for candidate new words; confirming whether the candidate new words include new jargon words for cigarette brands and product specifications; and, if such new jargon words are confirmed to exist, matching them with the corresponding formal words. The method, system, device, and medium can greatly improve the efficiency of discovering new jargon words in the WeChat groups of tobacco industry retailers and reduce manual workload.

Description

Method, system, equipment and medium for detecting new words of cigarette brand and product rule words

Technical Field

The invention belongs to the technical field of natural language processing and information extraction and relates to a detection method and system, in particular to a method, system, device, and medium for detecting new jargon words for cigarette brands and product specifications.

Background

Existing new word discovery methods fall mainly into unsupervised and supervised methods. An unsupervised method first segments and filters the text, then computes the information entropy of each word's left and right neighboring words and the pointwise mutual information within the word, filters by a threshold, and removes the words already present in an existing common word set to obtain new words. A supervised method requires labeling a large corpus containing new words and training a machine learning model on it so that the model can distinguish new words from non-new words in sentences; it is treated as a sequence labeling task, and a commonly used model is Bi-LSTM + CRF.

After new words are found in a text (e.g., the chat log of a WeChat group of tobacco industry retailers), existing methods rely on manual judgment to decide whether a new word is jargon for a particular cigarette brand or product specification.

Existing unsupervised new word discovery methods perform word segmentation first. For jargon words in the tobacco industry, existing word segmentation tools cannot segment them into independent, complete words, so these words cannot be found in the subsequent new word discovery step. Supervised new word discovery methods require a large labeled corpus, which limits their practical application.

In addition, existing new word discovery processes yield many irrelevant words besides the desired jargon words for cigarette brands and product specifications, and they cannot provide the formal words to which the jargon corresponds, which increases manual workload.

Therefore, how to provide a method, system, device, and medium for detecting new jargon words for cigarette brands and product specifications, so as to overcome the defects of the prior art (many irrelevant words are produced during new word discovery, the formal words corresponding to the new words cannot be provided, and manual workload increases), has become a technical problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above disadvantages of the prior art, an object of the present invention is to provide a method, system, device, and medium for detecting new jargon words for cigarette brands and product specifications, to solve the problem that the prior art produces many irrelevant words during new word discovery and cannot provide the formal words corresponding to the jargon, which increases manual workload.

To achieve the above and other related objects, the present invention provides a method for detecting new jargon words for cigarette brands and product specifications, comprising: obtaining a chat text data set of tobacco industry retailers and searching it for candidate new words; confirming whether the candidate new words include new jargon words for cigarette brands and product specifications; and, if such new jargon words are confirmed to exist, matching them with the corresponding formal words.

In an embodiment of the present invention, the step of obtaining a chat text data set of tobacco industry retailers and searching it for candidate new words includes: splitting the chat text data in the chat text data set into sentences at punctuation marks, and splitting each sentence into characters; grouping the characters into 2-tuples, 3-tuples, 4-tuples, and 5-tuples to form a word set; computing the left and right information entropy and the average mutual information of every word string, where the left and right information entropy measures the richness of the characters adjacent to the word string on the left and right, and the average mutual information measures the degree of cohesion inside the word string; constructing a candidate word score from the left and right information entropy and the average mutual information of every word string; segmenting the chat text data set with a plurality of word segmentation tools and taking the union of the segmentation results to obtain a union word result; screening out a screening word set whose candidate word scores exceed a first preset candidate word score threshold, and calculating the word overlap degree between the screening word set and the union word result; and adjusting the first preset candidate word score threshold to a second preset candidate word score threshold according to the word overlap degree, selecting the word segmentation result set corresponding to the second preset candidate word score threshold, comparing it with the union word result and a tobacco industry lexicon, and determining the newly appearing words as candidate new words.

In an embodiment of the present invention, the word overlap degree between the screening word set and the union word result equals the number of words in the intersection of the screening word set and the union word result divided by the number of words in the screening word set.

In an embodiment of the present invention, the step of confirming whether the candidate new words include new jargon words for cigarette brands and product specifications includes: performing binary classification on the chat text data set using a pre-stored deep transfer learning classification model to distinguish text data containing cigarette information from text data not containing cigarette information, and outputting the probability that cigarette information is contained and the probability that it is not; analyzing the sentences in the chat text data set that express a supply and demand relationship, extracting key descriptive words from them, summarizing these into supply and demand relationship feature words, and detecting whether the chat text data contains the supply and demand relationship feature words to form a supply and demand statistical result, which is either a result containing the feature words or a result not containing them; counting whether the parts of speech of the words before and after the candidate new word in the chat text data set are pronouns, numerals, adjectives, or verbs, to form a part-of-speech statistical result, which is either that the words before and after the candidate new word are pronouns, numerals, adjectives, or verbs, or that they are not; in each sentence containing a candidate new word, replacing the candidate new word with a specific cigarette name and calculating the language model score difference of the sentence before and after replacement, where the language model measures how reasonable and fluent a sentence is; inputting the supply and demand statistical result, the part-of-speech statistical result, and the language model score difference of the sentences before and after replacement into a classification model to train it and form a trained classification model; and classifying the candidate new words with the trained classification model to identify the new jargon words for cigarette brands and product specifications.

In an embodiment of the present invention, a supply and demand statistical result containing the supply and demand relationship feature words is represented by 1, and one not containing them is represented by 0; a part-of-speech statistical result in which the words before and after the candidate new word are pronouns, numerals, adjectives, or verbs is represented by 1, and one in which they are not is represented by 0.

In an embodiment of the present invention, the step of matching the confirmed new jargon words for cigarette brands and product specifications with the formal words includes: for known jargon words, searching for similar names among the formal cigarette brand and product specification names, sorting them by edit distance from the original word in ascending order, and combining each of the top specified number of words with the original word; obtaining the vector representations of the two words in each combination using the deep transfer learning classification model, and calculating the cosine similarity of the two words; extracting the price information of cigarette products from the chat text using a price template; if a price is mentioned in the price information, calculating the difference between that price and the wholesale price of the corresponding cigarette brand or product specification, and calculating the ratio of the absolute value of the difference to the wholesale price as the price ratio value; if no price is mentioned, directly setting the ratio value to a specified ratio value; using the cosine similarity and the price ratio value of the two words as the input of a pre-stored binary classification model, and training that model; for the confirmed new jargon words, searching for similar names among the formal cigarette brand and product specification names, sorting them by edit distance from the original word in ascending order, and combining each of the top specified number of words with the original word; using the cosine similarity and the price ratio value of the two words in each combination as the input of the pre-stored binary classification model, and calculating the degree of match between the candidate new word and each of the top specified number of formal cigarette brand and product specification words; and taking the formal word that has the highest degree of match, with a match score greater than a score threshold, as the word matched with the new word.

In an embodiment of the present invention, the price template includes: a price pre-extraction template and a price post-extraction template.

Another aspect of the present invention provides a system for detecting new jargon words for cigarette brands and product specifications, comprising: an acquisition module for obtaining a chat text data set of tobacco industry retailers; a search module for searching the chat text data set for candidate new words; a confirmation module for confirming whether the candidate new words include new jargon words for cigarette brands and product specifications; and a matching module for matching the confirmed new jargon words with the formal words if such new jargon words are confirmed to exist.

Yet another aspect of the present invention provides a medium storing a computer program which, when executed by a processor, implements the above method for detecting new jargon words for cigarette brands and product specifications.

A final aspect of the present invention provides a detection device comprising a processor and a memory; the memory stores a computer program, and the processor executes the computer program stored in the memory so that the detection device performs the above method for detecting new jargon words for cigarette brands and product specifications.

As described above, the method, system, device, and medium for detecting new jargon words for cigarette brands and product specifications have the following beneficial effects:

First, new words are discovered using character-level information entropy and mutual information, which significantly improves the recall of new jargon words for cigarette brands and product specifications. Second, the new words are confirmed by a classification model that combines cigarette brand and product specification text information, supply and demand information, part-of-speech features, and language model scores, which significantly improves the accuracy of the new jargon words found. Finally, the formal words corresponding to the new jargon words are matched using cigarette price information and word vector similarity, which raises the degree of automation of matching and reduces manual workload.

Drawings

Fig. 1 is a schematic flow chart of a method for detecting new jargon words for cigarette brands and product specifications according to an embodiment of the present invention.

Fig. 2 is a schematic flow chart of S11 according to the present invention.

Fig. 3 is a flow chart illustrating S12 according to the present invention.

Fig. 4 is a flow chart illustrating S13 according to the present invention.

Fig. 5 is a schematic structural diagram of a system for detecting new jargon words for cigarette brands and product specifications according to an embodiment of the present invention.

Description of the element reference numerals

1 detection system for new jargon words for cigarette brands and product specifications

51 acquisition module

52 search module

53 confirmation module

54 matching module

S11 to S13 steps

S111 to S117 steps

S121 to S126 steps

S131 to S137 steps

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

This embodiment provides a method for detecting new jargon words for cigarette brands and product specifications, comprising:

obtaining a chat text data set of tobacco industry retailers, and searching the chat text data set for candidate new words;

confirming whether the candidate new words include new jargon words for cigarette brands and product specifications;

and, if such new jargon words are confirmed to exist, matching the confirmed new jargon words with the formal words.

The method provided by this embodiment is described in detail below with reference to the drawings. Please refer to Fig. 1, which is a schematic flow chart of the method in an embodiment. As shown in Fig. 1, the method for detecting new jargon words for cigarette brands and product specifications specifically comprises the following steps:

S11: Obtain a chat text data set of tobacco industry retailers, and search the chat text data set for candidate new words.

In this embodiment, the chat text data set of the tobacco industry retailer includes a QQ chat text data set of the tobacco industry retailer, a WeChat chat text data set of the tobacco industry retailer, and so on.

Please refer to Fig. 2, which is a schematic flow chart of S11. S11 includes:

S111: Split the chat text data in the chat text data set into sentences at punctuation marks, and split each sentence into characters.

S112: Group the characters into 2-tuples, 3-tuples, 4-tuples, and 5-tuples to form a word set.

For example, for the text "700元收软中华" ("buying soft Zhonghua for 700 yuan"), the 2-tuples include "700元", "元收", "收软", "软中", and "中华", and the 3-tuples include "700元收", "元收软", "收软中", and "软中华".
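As an illustration of S111 and S112, the following minimal Python sketch splits text into sentences at punctuation marks and enumerates the 2- to 5-tuples. The function names, the punctuation set, and the choice to keep a run of digits as a single token are assumptions made for this sketch and are not details confirmed by the embodiment.

```python
import re
from collections import Counter

# Assumed sentence delimiters; the embodiment only says "punctuation marks".
SENTENCE_DELIMS = r"[，。！？；,.!?;\s]+"

def split_sentences(text):
    """Split chat text into sentences at punctuation marks."""
    return [s for s in re.split(SENTENCE_DELIMS, text) if s]

def tokenize_chars(sentence):
    """Split a sentence into characters, keeping a digit run as one token (assumption)."""
    return re.findall(r"\d+|\D", sentence)

def ngrams(tokens, n):
    """Return all contiguous n-tuples of tokens, joined back into strings."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_word_set(text, sizes=(2, 3, 4, 5)):
    """Count every 2- to 5-tuple word string found in the text."""
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = tokenize_chars(sentence)
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts
```

Under these assumptions, `ngrams(tokenize_chars("700元收软中华"), 2)` returns `['700元', '元收', '收软', '软中', '中华']`.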

S113: Compute the left and right information entropy and the average mutual information of every word string. The left and right information entropy measures the richness of the characters adjacent to the word string on the left and right, and the average mutual information measures the degree of cohesion inside the word string. In this embodiment, character-level information entropy and mutual information are used to find new words, which significantly improves the recall of new jargon words for cigarette brands and product specifications.

Specifically, the calculation formula of the left and right information entropy of the word string is as follows:

where W represents the set of left or right neighboring characters of the word string w.
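The left and right information entropy can be sketched in Python as follows. The sketch assumes the standard Shannon entropy over the distribution of neighboring characters, that is, E(w) = -Σ P(c|w) log P(c|w) summed over the neighbors c in W; this form and the function names are assumptions, and the embodiment's exact formula may differ.

```python
import math
from collections import Counter

def neighbor_entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-character frequency table."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())

def left_right_entropy(sentences, word):
    """Collect the characters adjacent to each occurrence of `word`, then score them."""
    left, right = Counter(), Counter()
    for s in sentences:
        start = s.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                left[s[start - 1]] += 1   # character immediately to the left
            if end < len(s):
                right[s[end]] += 1        # character immediately to the right
            start = s.find(word, start + 1)
    return neighbor_entropy(left), neighbor_entropy(right)
```

A word string that appears with many different neighbors on both sides gets high entropy on both sides, which suggests that it behaves as an independent word.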

The formula for calculating the mutual information of the 2-tuple word string ab is as follows:

The formula for calculating the mutual information of the 3-tuple word string abc is as follows:

The mutual information of the 4-tuple word string abcd and of the 5-tuple word string abcde is calculated in a similar manner.

Here p(a) is the number of occurrences of the single character a in the data set divided by the total number of single characters, and p(b) is defined likewise; p(ab) is the number of occurrences of the 2-tuple ab in the data set, such as "Zhonghua" (中华), divided by the total number of 2-tuples; and p(abc) is the number of occurrences of the 3-tuple abc in the data set, such as "soft Zhonghua" (软中华), divided by the total number of 3-tuples.
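A form commonly used for pointwise mutual information, consistent with the probability definitions above, is sketched below. The averaging over the two split points of the 3-tuple is an assumption suggested by the term "average mutual information"; the embodiment's exact formulas may differ.

```latex
\mathrm{PMI}(ab) = \log \frac{p(ab)}{p(a)\,p(b)}

\mathrm{PMI}(abc) = \frac{1}{2}\left[ \log \frac{p(abc)}{p(a)\,p(bc)} + \log \frac{p(abc)}{p(ab)\,p(c)} \right]
```

Under the same assumption, the 4-tuple and 5-tuple cases would average over three and four split points, respectively.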

S114: Construct candidate word scores based on the left and right information entropy and the average mutual information of every word string.

The candidate word scores include the candidate word score of a 2-tuple word string, of a 3-tuple word string, of a 4-tuple word string, and of a 5-tuple word string.

Specifically, the calculation formula of the candidate word score is as follows:

where PMI denotes the mutual information of the 2-tuple, 3-tuple, 4-tuple, or 5-tuple word string, respectively; E(w) denotes the information entropy of the word string; and (L) and (R) denote "left" and "right".

S115: Segment the chat text data set with a plurality of word segmentation tools to form segmentation results, and take the union of the segmentation results to obtain a union word result. The union word result contains most of the old words in the original data set; because the segmentation tools have low accuracy on cigarette jargon words, the union word result is unlikely to contain those jargon words.

In this embodiment, the chat text data set is segmented with three tools: jieba, HanLP, and THULAC.

S116: Screen out a screening word set whose candidate word scores exceed a first preset candidate word score threshold, and calculate the word overlap degree between the screening word set and the union word result.

In this embodiment, the word overlap degree between the screening word set and the union word result equals the number of words in the intersection of the screening word set and the union word result divided by the number of words in the screening word set.
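The word overlap degree defined above can be written as a short Python sketch. The function name and the handling of an empty screening word set are assumptions made for illustration.

```python
def overlap_degree(screened_words, union_words):
    """Share of screened words that also appear in the union of segmenter outputs."""
    screened = set(screened_words)
    if not screened:
        return 0.0  # assumed convention for an empty screening word set
    return len(screened & set(union_words)) / len(screened)
```

For instance, if two of three screened words also occur in the union word result, the overlap degree is 2/3.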

S117: Adjust the first preset candidate word score threshold to a second preset candidate word score threshold according to the word overlap degree, select the word segmentation result set corresponding to the second preset candidate word score threshold, compare it with the union word result and a tobacco industry lexicon, and determine the newly appearing words as candidate new words.
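The following Python sketch shows one possible reading of S117. The embodiment does not state how the threshold is adjusted, so the rule used here (take the lowest trial threshold whose screening word set reaches a target overlap degree) is purely an assumption, as are the function name and the `target_overlap` parameter.

```python
def select_candidates(scores, union_words, lexicon, thresholds, target_overlap=0.5):
    """Choose a score threshold, then keep screened words absent from known-word sets.

    `scores` maps each word string to its candidate word score.
    The threshold rule below is hypothetical; the patent does not specify it.
    """
    union = set(union_words)
    known = union | set(lexicon)
    chosen = max(thresholds)  # fall back to the strictest trial threshold
    for t in sorted(thresholds):
        screened = {w for w, s in scores.items() if s > t}
        if screened and len(screened & union) / len(screened) >= target_overlap:
            chosen = t
            break
    screened = {w for w, s in scores.items() if s > chosen}
    # Newly appearing words: in the screened set but in neither known-word set.
    return chosen, screened - known
```

The final set difference is the part taken directly from the step description: words that pass the score threshold but appear in neither the union word result nor the tobacco industry lexicon become candidate new words.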

S12: Confirm whether the candidate new words include new jargon words for cigarette brands and product specifications. Please refer to Fig. 3, which shows a flow chart of S12. As shown in Fig. 3, S12 specifically includes the following steps:

S121: Perform binary classification on the chat text data set using a pre-stored deep transfer learning classification model to distinguish text data containing cigarette information from text data not containing cigarette information, and output the probability that cigarette information is contained and the probability that it is not.

In this embodiment, the deep transfer learning classification model is based on transfer learning, a learning process that applies a model learned in a source domain to a new domain by exploiting similarities between data, tasks, or models. Transfer learning takes several main forms: sample-based transfer, feature-based transfer, model-based transfer, and relationship-based transfer. The basic idea of model-based transfer is to find parameter information shared between the source domain and the target domain in order to realize the transfer. Deep transfer learning is mainly model transfer, and its simplest and most common method is fine-tuning, that is, adapting a network trained by others to the target task. The BERT, GPT, and XLNet models that have become highly popular in recent years are pre-trained on large corpora and then fine-tuned on the target task.

The probability that cigarette information is contained and the probability that it is not contained are values obtained by the deep transfer learning classification model through matrix operations followed by normalization of the output.

S122: Analyze the sentences in the chat text data set that express a supply and demand relationship, extract key descriptive words from them, summarize these into supply and demand relationship feature words, and detect whether the chat text data contains the supply and demand relationship feature words to form a supply and demand statistical result. The supply and demand statistical result is either a result containing the supply and demand relationship feature words or a result not containing them.

The supply and demand relationship feature words are, for example, "go", "sell", "needed", "you want", "I also have", "wanted relation", and the like; the rule templates for the "demand" relationship include "buy", "receive", "have goods", "ask", "find", "need", "have contact", "have come", and so on.

In this embodiment, the supply and demand statistical result including the supply and demand relationship characteristic words is represented by 1; the supply and demand statistical result which does not contain the supply and demand relation characteristic words is represented by 0.
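The 0/1 supply and demand feature can be sketched in Python as a simple substring check. The Chinese feature words in the tuple below are hypothetical placeholders chosen for illustration; the actual list is the one summarized from the supply and demand sentences.

```python
# Hypothetical feature words for illustration only (not the patent's actual list).
SUPPLY_DEMAND_WORDS = ("出", "卖", "收", "求", "需要")

def supply_demand_feature(sentence, feature_words=SUPPLY_DEMAND_WORDS):
    """Return 1 if the sentence contains any supply/demand feature word, else 0."""
    return int(any(word in sentence for word in feature_words))
```

A rule template with more context (for example, a regular expression around the feature word) could replace the plain substring test without changing the 0/1 output.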

S123: Count whether the parts of speech of the words before and after the candidate new word in the chat text data set are pronouns, numerals, adjectives, or verbs. The part-of-speech statistical result is either that the words before and after the candidate new word are pronouns, numerals, adjectives, or verbs, or that they are not. In this embodiment, the former result is represented by 1 and the latter by 0.
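A minimal Python sketch of the part-of-speech feature follows. It assumes the sentence has already been segmented and tagged by some tagger, and it treats the feature as 1 when either adjacent word has a target part of speech; the tag names and this "either neighbor" reading are assumptions, since the embodiment does not spell them out.

```python
# Assumed tag names; real taggers use their own tag sets.
TARGET_POS = {"pronoun", "numeral", "adjective", "verb"}

def pos_context_feature(tagged_tokens, candidate):
    """Return 1 if a word directly before or after `candidate` has a target POS, else 0.

    `tagged_tokens` is a list of (word, pos) pairs for one sentence.
    """
    for i, (word, _) in enumerate(tagged_tokens):
        if word != candidate:
            continue
        # At most one token on each side of the candidate.
        neighbors = tagged_tokens[max(i - 1, 0):i] + tagged_tokens[i + 1:i + 2]
        if any(pos in TARGET_POS for _, pos in neighbors):
            return 1
    return 0
```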

S124: In each sentence containing a candidate new word, replace the candidate new word with a specific cigarette name, and calculate the language model score difference of the sentence before and after replacement. The language model measures how reasonable and fluent a sentence is. In this embodiment, the language model is a BERT language model, which can calculate an evaluation score for a sentence to judge whether it is reasonable and fluent. If a new word is jargon for a cigarette brand or product specification, the sentence as a whole remains reasonable and fluent after the new word is replaced with a formal cigarette name; otherwise, the score of the sentence drops.

The BERT language model calculates the evaluation score as follows. The sentence is input into a BERT encoder (in this embodiment, the BERT encoder is a structure that converts characters into numeric vectors, since only vectors can be used in the computations of a deep learning model). A classification layer is added on the output of the BERT encoder; the classification layer uses the encoder output to calculate, for each input position, the probability of each word in the vocabulary being selected. For example, if Chinese is assumed to have 20,000 words and these are used as the vocabulary, the classification layer calculates the probability that the first word is each of the 20,000 words in the vocabulary, the probability that the second word is each of them, and so on. The output vector is multiplied by an embedding matrix (used for the matrix operation with the output vector of the BERT encoder) to convert it to the dimension of the vocabulary, and a softmax operation is performed to obtain the probability of each word of the input sentence at its corresponding position. The product of the probabilities of all the words in the sentence is the score of the whole sentence.
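The scoring and comparison in S124 can be sketched in Python as below. The per-token probabilities are assumed to come from a language model that is not shown here, and the sketch sums log-probabilities instead of multiplying raw probabilities, which ranks sentences identically while avoiding numeric underflow on long sentences. All function names are hypothetical.

```python
import math

def replace_candidate(sentence, candidate, cigarette_name):
    """Substitute a formal cigarette name for the candidate new word."""
    return sentence.replace(candidate, cigarette_name)

def sentence_log_score(token_probs):
    """Log of the product of per-token probabilities (a sum of logs)."""
    return sum(math.log(p) for p in token_probs)

def score_difference(probs_original, probs_replaced):
    """Language model score difference between the original and replaced sentence."""
    return sentence_log_score(probs_original) - sentence_log_score(probs_replaced)
```

A difference close to zero would indicate that the sentence stays about as fluent after replacement, which is the behavior expected of a genuine cigarette jargon word.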

S125: Input the supply and demand statistical result (containing or not containing the supply and demand relationship feature words), the part-of-speech statistical result (whether or not the words before and after the candidate new word are pronouns, numerals, adjectives, or verbs), and the language model score difference of the sentences before and after replacement into a classification model as features, to train the classification model and form the trained classification model. In this embodiment, the classification model is a boosted tree model for classification (XGBoost).

S126: Classify the candidate new words with the trained classification model to identify the new jargon words for cigarette brands and product specifications.

S13: If new jargon words for cigarette brands and product specifications are confirmed to exist, match the confirmed new jargon words with the formal words. In this embodiment, the degree of match between a jargon word and a formal word can be represented by the probability that the classification model assigns to the "match" category.

Please refer to fig. 4, which shows a flowchart of S13. As shown in fig. 4, the S13 specifically includes the following steps:

S131: For known jargon words, search for similar names among the formal cigarette brand and product specification names, sort them by edit distance from the original word in ascending order, and combine each of the top specified number of words (10 in this embodiment) with the original word.

In the present embodiment, the edit distance refers to the minimum number of single-character editing operations (of three types: insertion, deletion and replacement) required to change character string A into character string B.
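The edit distance defined above and the ranking of S131 can be sketched as follows. This is a standard dynamic-programming formulation, given only as an illustration; the function names and the ASCII example strings are hypothetical stand-ins for jargon words and formal names.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    replacements needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb
        prev = curr
    return prev[-1]


def top_candidates(word, formal_names, k=10):
    """Pair the word with the k formal names closest in edit distance."""
    ranked = sorted(formal_names, key=lambda name: edit_distance(word, name))
    return [(word, name) for name in ranked[:k]]


pairs = top_candidates("abc", ["abd", "xyz", "abc"], k=2)
```

Each returned pair corresponds to one word combination that is later scored by the matching model.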

For example, if "soft center" is input, the retrieved similar names include "Chinese (soft)", "Chinese (hard)" and "Nanjing (soft)"; "soft center" and "Chinese (soft)" are a match, whereas "soft center" and "Chinese (hard)" are not a match, and "soft center" and "Nanjing (soft)" are not a match.

S132, obtaining the vector representations of the two words in each word combination by using the deep migration classification model, and calculating the cosine similarity of the two words.

In this embodiment, the cosine similarity is the cosine of the angle between the two vectors and represents how close the two vectors are (word similarity): the larger the cosine value, the smaller the angle, the closer the two vectors, and the more similar the words.
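A minimal sketch of the cosine similarity computation is given below. The word vectors are assumed to have been produced elsewhere (for example by the classification model); the toy vectors and the convention of returning 0.0 for a zero vector are assumptions made here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: no direction, no similarity
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction give a value close to 1, and orthogonal vectors give 0.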

S133, extracting the price information of cigarette products from the chat text by using the price template; if a price is mentioned in the price information, calculating the difference between that price and the wholesale price of the corresponding cigarette brand or product rule, and calculating the ratio of the absolute value of the difference to the wholesale price (the price ratio value); if no price is mentioned, directly setting the price ratio value to a specified value.
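The price ratio value of S133 can be written as |p − w| / w, where p is the price mentioned in the chat and w is the wholesale price. The sketch below illustrates this; the function name is hypothetical, and the fallback value passed as `default_ratio` is a placeholder because the embodiment does not state the specified value.

```python
def price_ratio(mentioned_price, wholesale_price, default_ratio):
    """Ratio of |mentioned price - wholesale price| to the wholesale price.

    mentioned_price: price extracted from the chat text, or None if the
                     text mentions no price.
    default_ratio:   the specified value used when no price is mentioned
                     (placeholder; not given in the embodiment).
    """
    if mentioned_price is None:
        return default_ratio
    return abs(mentioned_price - wholesale_price) / wholesale_price


with_price = price_ratio(700.0, 650.0, default_ratio=1.0)
without_price = price_ratio(None, 650.0, default_ratio=1.0)
```

A small ratio indicates that the price mentioned in the chat is close to the wholesale price of the candidate formal product.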

In this embodiment, the price template includes: a price pre-extraction template and a price post-extraction template.

Specifically, the price pre-extraction template covers the case where the price number and the price descriptor appear before the cigarette brand and product rule; a price descriptor is a word indicating that the number before it can be treated as a price.

For example: "700 out of the day". The regular expression [0-9]+ indicates that at least one integer digit appears, and \.[0-9]{1,2} indicates one or two decimal digits after the integer. The price descriptors include "buy", "want", "receive", etc., so the template can be summarized as the regular expression "([0-9]+)(\.[0-9]{1,2})?" followed by a price descriptor such as "receive".

Specifically, the price post-extraction template covers the case where the price number and the price descriptor appear after the cigarette brand and product rule.

For example: "each of the cloud 360". Similarly, the regular expression [0-9]+ indicates that at least one integer digit appears, and \.[0-9]{1,2} indicates one or two decimal digits after the integer. The price descriptors include "/bar" and "element bar", so the template can be summarized as the regular expression "([0-9]+)(\.[0-9]{1,2})?/bar".
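The two price templates can be sketched with Python's `re` module as shown below. The numeric part follows the pattern described above (an integer optionally followed by one or two decimal digits); the English descriptor lists, the optional whitespace before the descriptor, and the sample sentences are hypothetical stand-ins. The sketch only extracts the number next to a descriptor and does not check where the brand name appears.

```python
import re

# Integer part, optionally followed by one or two decimal digits.
NUMBER = r"([0-9]+)(\.[0-9]{1,2})?"

# Hypothetical English stand-ins for the price descriptors.
PRE_DESCRIPTORS = ["buy", "want", "receive"]
POST_DESCRIPTORS = ["/bar"]


def build_pattern(descriptors):
    """Compile NUMBER followed by any one of the given descriptors."""
    alternatives = "|".join(re.escape(d) for d in descriptors)
    return re.compile(NUMBER + r"\s*(?:" + alternatives + ")")


PRE_PATTERN = build_pattern(PRE_DESCRIPTORS)
POST_PATTERN = build_pattern(POST_DESCRIPTORS)


def extract_price(text, pattern):
    """Return the first price matched by the pattern, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return float(match.group(1) + (match.group(2) or ""))


pre_price = extract_price("700 receive brandx", PRE_PATTERN)
post_price = extract_price("brandx 360.5/bar", POST_PATTERN)
```

When neither pattern matches, `extract_price` returns `None`, which corresponds to the case where no price is mentioned.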

S134, using the cosine similarity and the price ratio value of the two words as the input of a pre-stored binary classification model, and training the pre-stored binary classification model.

The pre-stored binary classification model is used to predict whether the two words in each word combination match. The input of the model is the cosine similarity and the price ratio value of the two words, and the output of the model is the probability of matching or not matching. During training, training data is input to obtain the output of the model, the difference between the model output and the true label is measured by cross entropy, and training is completed when this difference has been minimized over multiple rounds of iteration.
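The training procedure described above (two input features, a match probability as output, cross entropy as the measure of error, several rounds of iteration) can be illustrated with a small logistic regression trained by gradient descent. The embodiment does not state which binary classification model is used, so logistic regression is an assumption made here, as are the toy training data, the learning rate and the number of epochs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross entropy between true labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def train(features, labels, lr=0.5, epochs=2000):
    """Fit weights for (cosine similarity, price ratio) by gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (cosine, ratio), y in zip(features, labels):
            error = sigmoid(w[0] * cosine + w[1] * ratio + b) - y
            grad_w[0] += error * cosine
            grad_w[1] += error * ratio
            grad_b += error
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


def predict_match_probability(w, b, cosine, ratio):
    """Probability that the two words of a combination match."""
    return sigmoid(w[0] * cosine + w[1] * ratio + b)


# Toy data: (cosine similarity, price ratio value) -> 1 = match, 0 = no match.
FEATURES = [(0.9, 0.05), (0.8, 0.1), (0.3, 0.6), (0.2, 0.9)]
LABELS = [1, 1, 0, 0]
weights, bias = train(FEATURES, LABELS)
```

In this toy setting a high cosine similarity together with a small price ratio value pushes the predicted probability towards "match".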

S135, for the confirmed new words of cigarette brand and product rule words, searching the formal cigarette brand and product rule names for similar names, sorting the retrieved names in ascending order of edit distance from the original word, and combining each of the top specified number of names (for example, the top 10) with the original word to form word combinations.

S136, using the cosine similarity and the price ratio value of the two words in each word combination as the input of the pre-stored binary classification model, and calculating the matching degree between the candidate new word and the formal cigarette brand and product rule name in each of the top specified number of word combinations.

S137, taking the formal word that has the highest matching degree and whose matching degree score is greater than the score threshold as the word matched with the new word.
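The selection rule of S137 can be sketched as follows, assuming the matching degrees have already been computed for each candidate formal word. The function name, the example names and the threshold value are hypothetical; the embodiment does not state the score threshold.

```python
def select_match(scored_candidates, score_threshold):
    """Return the formal word with the highest matching degree, provided
    that degree exceeds the threshold; otherwise return None.

    scored_candidates: list of (formal_word, matching_degree) pairs.
    """
    if not scored_candidates:
        return None
    best_word, best_score = max(scored_candidates, key=lambda pair: pair[1])
    return best_word if best_score > score_threshold else None


chosen = select_match([("formal A", 0.4), ("formal B", 0.9)], score_threshold=0.5)
rejected = select_match([("formal A", 0.4)], score_threshold=0.5)
```

Returning `None` when no candidate clears the threshold leaves the new word unmatched rather than forcing a poor match.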

For example, using the detection method provided by this embodiment, many cigarette brand and product rule words were found in the text and accurately matched with the corresponding formal cigarette names. For instance, the new word "five-color phoenix" was found and matched with the formal name "phoenix (exit of five-color ramuscule)", and the new word "golden peony short branch" was found and matched with the formal name "peony (golden short branch)".

In testing, 20,000 WeChat group chat records of tobacco industry retail customers were selected as a data set. According to the cigarette brand and product rule words contained in the text, the data set was divided into a training set (18000 items), a verification set (2000 items) and a test set (2000 items), such that the jargon words contained in the three sets do not overlap.

In the test of the detection method for new words of cigarette brand and product rule words, the new word discovery and new word confirmation steps achieved a recall rate of 87% and an accuracy rate of 91% for the cigarette brand and product rule words in the test set, whereas the unsupervised new word discovery method based on word segmentation achieved a recall rate of 35% and an accuracy rate of 64% on the same test set. The accuracy of matching new jargon words with formal words is 90%; before this scheme, new jargon words were matched with formal words mainly by hand.

In the method for detecting new words of cigarette brand and product rule words, character-level information entropy and mutual information are first used to discover new words, which remarkably improves the recall rate of new words of cigarette brand and product rule words; secondly, the new words are confirmed by a classification model that combines cigarette brand and product rule text information, supply and demand information, part-of-speech features and language model scores, which significantly improves the accuracy rate of the new words; and finally, the formal words corresponding to the new jargon words are matched by using cigarette price information and word vector similarity, which increases the degree of automation of matching and reduces the manual workload.

The present embodiment also provides a medium (also referred to as a computer-readable storage medium) on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for detecting new words of cigarette brand and product rule words.

One of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware associated with a computer program. The aforementioned computer program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.

Example two

This embodiment provides a detection system for new words of cigarette brand and product rule words, comprising:

an obtaining module, used for obtaining a chat text data set of a retailer in the tobacco industry;

a searching module, used for searching candidate new words from the chat text data set;

a confirming module, used for confirming whether the candidate new words include new words of cigarette brand and product rule words;

and a matching module, used for matching the confirmed new words of cigarette brand and product rule words with the formal words if it is confirmed that new words of cigarette brand and product rule words exist.

The detection system for new words of cigarette brand and product rule words provided by this embodiment is described in detail below with reference to the drawings. Please refer to fig. 5, which is a schematic structural diagram of the detection system for new words of cigarette brand and product rule words in an embodiment. As shown in fig. 5, the detection system 5 for new words of cigarette brand and product rule words includes an obtaining module 51, a searching module 52, a confirming module 53 and a matching module 54.

The obtaining module 51 is configured to obtain a chat text data set of a tobacco industry retailer.

The searching module 52 is configured to search the chat text data set for a candidate new word.

Specifically, the searching module 52 uses punctuation marks to split the chat text data in the chat text data set into sentences, and splits each sentence into characters; combines the split characters into 2-tuples, 3-tuples, 4-tuples and 5-tuples to form word strings; counts the left and right information entropies and the average mutual information of all word strings, where the left and right information entropies of a word string measure the richness of its left and right adjacent characters and the average mutual information measures the degree of cohesion inside the word string; constructs candidate word scores based on the left and right information entropies and the average mutual information of the word strings; segments the chat text data set with a plurality of word segmentation tools to form word segmentation results, and takes the union of the word segmentation results to obtain a union word result; screens out a screening word set whose candidate word scores are greater than a first preset candidate word score threshold, and calculates the word overlapping degree between the screening word set and the union word result; and adjusts the first preset candidate word score threshold to a second preset candidate word score threshold according to the word overlapping degree, selects the word segmentation result set corresponding to the second preset candidate word score threshold, compares it with the union word result and a tobacco industry word library, and determines the newly appearing words as candidate new words.
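The first part of this procedure (forming 2- to 5-character word strings and measuring the richness of their left and right neighbouring characters) can be sketched as below. Only the left and right information entropies are shown; the average mutual information, the candidate word score and the threshold adjustment are omitted. The function names are hypothetical, natural logarithms are an arbitrary choice, and the short ASCII strings stand in for sentences already split at punctuation.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(sentence, n_values=(2, 3, 4, 5)):
    """Yield (start index, word string) for every 2- to 5-character string."""
    for n in n_values:
        for i in range(len(sentence) - n + 1):
            yield i, sentence[i:i + n]


def entropy(counter):
    """Shannon entropy (natural log) of a counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def left_right_entropy(sentences):
    """Map each word string to (left entropy, right entropy)."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for sentence in sentences:
        for i, gram in char_ngrams(sentence):
            if i > 0:
                left[gram][sentence[i - 1]] += 1
            end = i + len(gram)
            if end < len(sentence):
                right[gram][sentence[end]] += 1
    grams = set(left) | set(right)
    return {g: (entropy(left[g]), entropy(right[g])) for g in grams}


entropies = left_right_entropy(["xaby", "zabw"])
```

In the toy input the string "ab" is preceded by two different characters and followed by two different characters, so both of its entropies equal ln 2, which marks it as a better word candidate than a string that always appears in the same context.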

The confirming module 53, connected to the obtaining module 51 and the searching module 52, is configured to confirm whether new words of cigarette brand and product rule words exist among the candidate new words.

Specifically, the confirming module 53 performs binary classification on the chat text data set by using a pre-stored deep migration classification model, distinguishes text data containing cigarette information from text data not containing cigarette information, and outputs a probability value of containing cigarette information and a probability value of not containing cigarette information; analyzes sentences related to supply and demand relations in the chat text data set, extracts key description words from the sentences, summarizes the key description words into supply and demand relation characteristic words, and detects whether the chat text data set contains the supply and demand relation characteristic words to form a supply and demand statistical result, the supply and demand statistical result comprising a result containing supply and demand relation characteristic words and a result not containing supply and demand relation characteristic words; counts whether the parts of speech of the words before and after the candidate new word in the chat text data set are pronouns, numerals, adjectives and verbs, the part-of-speech statistical result comprising a result that the words before and after the candidate new word are pronouns, numerals, adjectives and verbs and a result that they are not; replaces the candidate new word in a sentence containing it with a specific cigarette name, and calculates the language model score difference of the sentence before and after replacement, the language model being used for measuring the reasonability and fluency of sentences; inputs the probability value of containing cigarette information, the probability value of not containing cigarette information, the supply and demand statistical result, the part-of-speech statistical result and the language model score difference of the sentences before and after replacement as model inputs into a classification model, so as to train the classification model and form the trained classification model; and classifies the candidate new words by using the trained classification model so as to identify the new words of cigarette brand and product rule words.

The matching module 54, connected with the confirming module 53, is used for searching, for known jargon words, the formal cigarette brand and product rule names for similar names, sorting the retrieved names in ascending order of edit distance from the original word, and combining each of the top specified number of names with the original word to form word combinations;

obtaining the vector representations of the two words in each word combination by using the deep migration classification model, and calculating the cosine similarity of the two words;

extracting the price information of cigarette products from the chat text by using a price template; if a price is mentioned in the price information, calculating the difference between that price and the wholesale price of the corresponding cigarette brand or product rule, and calculating the ratio of the absolute value of the difference to the wholesale price (the price ratio value); if no price is mentioned, directly setting the price ratio value to a specified value;

using the cosine similarity and the price ratio value of the two words as the input of a pre-stored binary classification model, and training the pre-stored binary classification model; searching, for the confirmed new words of cigarette brand and product rule words, the formal cigarette brand and product rule names for similar names, sorting the retrieved names in ascending order of edit distance from the original word, and combining each of the top specified number of names with the original word to form word combinations; using the cosine similarity and the price ratio value of the two words in each word combination as the input of the pre-stored binary classification model, and calculating the matching degree between the candidate new word and the formal cigarette brand and product rule name in each of the top specified number of word combinations; and taking the formal word that has the highest matching degree and whose matching degree score is greater than the score threshold as the word matched with the new word.

It should be noted that the division of the above system into modules is only a division of logical functions; in actual implementation, the modules may be wholly or partially integrated into one physical entity, or may be physically separated. The modules may all be implemented in the form of software called by a processing element, or all in the form of hardware, or some modules may be implemented in the form of software called by a processing element and some in the form of hardware. For example, the x module may be a separately established processing element, or may be integrated in a chip of the system. In addition, the x module may be stored in the memory of the system in the form of program code, and be called by a processing element of the system to execute the functions of the x module. The other modules are implemented similarly. All or part of the modules may be integrated together or implemented independently. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each of the above modules may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software. The above modules may be one or more integrated circuits configured to implement the above method, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), one or more Field Programmable Gate Arrays (FPGAs), and the like. When a module is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. These modules may be integrated together and implemented in the form of a System-on-a-Chip (SOC).

Example three

The present embodiment provides detection equipment, including: a processor, a memory, a transceiver, a communication interface, and/or a system bus; the memory and the communication interface are connected with the processor and the transceiver through the system bus to communicate with one another; the memory is used for storing the computer program, the communication interface is used for communicating with other equipment, and the processor and the transceiver are used for running the computer program so that the detection equipment executes the steps of the detection method for new words of cigarette brand and product rule words.

The above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library and a read-only library). The Memory may include a Random Access Memory (RAM), and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The protection scope of the method for detecting new words of cigarette brand and product rule words is not limited to the execution order of the steps listed in this embodiment; all schemes obtained by adding, removing or replacing steps with prior art according to the principle of the invention are included in the protection scope of the invention.

The invention also provides a detection system for new words of cigarette brand and product rule words, which can implement the detection method for new words of cigarette brand and product rule words; however, the device for implementing the method includes but is not limited to the structure of the detection system listed in this embodiment, and all structural modifications and replacements of the prior art made according to the principle of the invention are included in the protection scope of the invention.

In conclusion, the method, system, equipment and medium for detecting new words of cigarette brand and product rule words can greatly improve the efficiency of finding new jargon words in the WeChat groups of tobacco industry retailers and reduce the workload of personnel. The invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A method for detecting new words of cigarette brand and product rule words is characterized by comprising the following steps:

confirming whether the candidate new words include new words of cigarette brand and product rule words; the step of confirming whether the candidate new words include new words of cigarette brand and product rule words comprises the following steps:

performing binary classification on the chat text data set by using a pre-stored deep migration classification model, distinguishing text data containing cigarette information from text data not containing cigarette information, and outputting a probability value of containing cigarette information and a probability value of not containing cigarette information;

analyzing sentences related to supply and demand relations in the chat text data set, extracting key description words from the sentences, summarizing the key description words into supply and demand relation characteristic words, and detecting whether the chat text data set contains the supply and demand relation characteristic words or not to form a supply and demand statistical result; the supply and demand statistical result comprises a supply and demand statistical result containing supply and demand relation characteristic words and a supply and demand statistical result not containing supply and demand relation characteristic words;

counting whether the parts of speech of the words before and after the candidate new word in the chat text data set are pronouns, numerals, adjectives and verbs; the part-of-speech statistical result comprises a result that the parts of speech of the words before and after the candidate new word are pronouns, numerals, adjectives and verbs, and a result that the parts of speech of the words before and after the candidate new word are not pronouns, numerals, adjectives and verbs;

replacing the candidate new word in a sentence containing the candidate new word with a specific cigarette name, and calculating the language model score difference of the sentence before and after replacement; the language model is used for measuring the reasonability and fluency of sentences;

inputting the supply and demand statistical result containing supply and demand relation characteristic words, the supply and demand statistical result not containing supply and demand relation characteristic words, the part-of-speech statistical result that the words before and after the candidate new word are pronouns, numerals, adjectives and verbs, the part-of-speech statistical result that the words before and after the candidate new word are not pronouns, numerals, adjectives and verbs, and the language model score difference of the sentences before and after replacement as model inputs into a classification model, so as to train the classification model and form the trained classification model;

classifying the candidate new words by using the trained classification model so as to identify the new words of cigarette brand and product rule words;

2. The method of claim 1, wherein the step of obtaining a chat text data set of a tobacco industry retailer and searching for candidate new words from the chat text data set comprises:

using punctuation marks to perform sentence segmentation on the chat text data in the chat text data set, and segmenting each sentence of text according to characters;

dividing the segmented characters into 2-tuple, 3-tuple, 4-tuple and 5-tuple to form word strings;

counting left and right information entropies and average mutual information of all word strings; the left and right information entropies of the word string are used for measuring the richness of left and right adjacent characters of the word string; the average mutual information is used for measuring the degree of cohesion inside the word string;

constructing candidate word scores based on left and right information entropies and average mutual information of the word strings;

utilizing a plurality of word segmentation tools to segment words of the chat text data set to form word segmentation results, and taking a union set of the word segmentation results to obtain a union set word result;

screening a screening word set with the candidate word score larger than a first preset candidate word score threshold value from the chat text data set, and calculating word overlapping degree of the screening word set and the union word result;

and adjusting the first preset candidate word score threshold value to a second preset candidate word score threshold value according to the word overlapping degree, selecting a word segmentation result set corresponding to the second preset candidate word score threshold value, comparing the word segmentation result set with a union word set and a word library of the tobacco industry, and determining the newly appeared words as candidate new words.

3. The method of claim 2, wherein the word overlapping degree of the screening word set and the union word result is equal to the number of words in the intersection of the screening word set and the union word result divided by the number of words in the screening word set.

4. The method for detecting new words of cigarette brand and product rule words according to claim 1, wherein

the supply and demand statistical result containing the supply and demand relation characteristic words is represented by 1;

the supply and demand statistical result which does not contain the supply and demand relation characteristic words is represented by 0;

the part-of-speech statistical result that the words before and after the candidate new word are pronouns, numerals, adjectives and verbs is represented by 1;

the part-of-speech statistical result that the words before and after the candidate new word are not pronouns, numerals, adjectives and verbs is represented by 0.

5. The method for detecting new words of cigarette brand and product rule words according to claim 1, wherein the step of matching the confirmed new words of cigarette brand and product rule words with the formal words, if it is confirmed that new words of cigarette brand and product rule words exist, comprises:

searching, for known jargon words, the formal cigarette brand and product rule names for similar names, sorting the retrieved names in ascending order of edit distance from the original word, and combining each of the top specified number of names with the original word to form word combinations;

obtaining the vector representations of the two words in each word combination by using a deep migration classification model, and calculating the cosine similarity of the two words;

using the cosine similarity and the price ratio value of the two words as the input of a pre-stored binary classification model, and training the pre-stored binary classification model;

searching, for the confirmed new words of cigarette brand and product rule words, the formal cigarette brand and product rule names for similar names, sorting the retrieved names in ascending order of edit distance from the original word, and combining each of the top specified number of names with the original word to form word combinations;

using the cosine similarity and the price ratio value of the two words in each word combination as the input of the pre-stored binary classification model, and calculating the matching degree between the candidate new word and the formal cigarette brand and product rule name in each of the top specified number of word combinations;

and taking the formal word that has the highest matching degree and whose matching degree score is greater than the score threshold as the word matched with the new word.

6. The method of claim 5, wherein the price template comprises: a price pre-extraction template and a price post-extraction template.

7. A detection system for new words of cigarette brand and product rule words is characterized by comprising the following components:

the confirming module is used for confirming whether the candidate new words include new words of cigarette brand and product rule words; the confirming module performs binary classification on the chat text data set by using a pre-stored deep migration classification model, distinguishes text data containing cigarette information from text data not containing cigarette information, and outputs a probability value of containing cigarette information and a probability value of not containing cigarette information; analyzes sentences related to supply and demand relations in the chat text data set, extracts key description words from the sentences, summarizes the key description words into supply and demand relation characteristic words, and detects whether the chat text data set contains the supply and demand relation characteristic words to form a supply and demand statistical result; the supply and demand statistical result comprises a supply and demand statistical result containing supply and demand relation characteristic words and a supply and demand statistical result not containing supply and demand relation characteristic words; counts whether the parts of speech of the words before and after the candidate new word in the chat text data set are pronouns, numerals, adjectives and verbs; the part-of-speech statistical result comprises a result that the parts of speech of the words before and after the candidate new word are pronouns, numerals, adjectives and verbs, and a result that the parts of speech of the words before and after the candidate new word are not pronouns, numerals, adjectives and verbs; replaces the candidate new word in a sentence containing the candidate new word with a specific cigarette name, and calculates the language model score difference of the sentence before and after replacement; the language model is used for measuring the reasonability and fluency of sentences; inputs the supply and demand statistical result containing supply and demand relation characteristic words, the supply and demand statistical result not containing supply and demand relation characteristic words, the part-of-speech statistical result that the words before and after the candidate new word are pronouns, numerals, adjectives and verbs, the part-of-speech statistical result that the words before and after the candidate new word are not pronouns, numerals, adjectives and verbs, and the language model score difference of the sentences before and after replacement as model inputs into a classification model, so as to train the classification model and form the trained classification model; and classifies the candidate new words by using the trained classification model so as to identify the new words of cigarette brand and product rule words;

8. A medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the method for detecting new words of cigarette brand and product rule words according to any one of claims 1 to 6.

9. A detection apparatus, comprising: a processor and a memory;

the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the detection equipment to execute the detection method of the new words of the cigarette brand and product rule words in any one of claims 1 to 6.