CN106897290B - Method and device for establishing keyword model

Info

Publication number: CN106897290B
Application number: CN201510956045.0A
Authority: CN
Inventors: 邱志贤; 唐敏华; 孙佳伟; 顾伟; 束俞; 林嘉
Original assignee: China Mobile Group Shanghai Co Ltd
Current assignee: China Mobile Group Shanghai Co Ltd
Priority date: 2015-12-17
Filing date: 2015-12-17
Publication date: 2020-04-24
Anticipated expiration: 2035-12-17
Also published as: CN106897290A

Abstract

The invention discloses a method and a device for establishing a keyword model. The method comprises: acquiring voice text data and a core keyword; retrieving the text sentences in the voice text data in which the core keyword is located; counting the words within a set range in those text sentences and the word frequencies of those words; sorting the counted words by word frequency; determining the words whose word-frequency rank is above a ranking threshold as auxiliary keywords; and logically combining the core keyword and the auxiliary keywords to establish the keyword model. Because the auxiliary keywords are obtained by screening the words within the set range of the text sentences containing the core keyword and ranking them by word frequency, the keywords for the keyword model are obtained and then logically combined into the model, which improves the efficiency and accuracy of semantic analysis.

Description

Method and device for establishing keyword model

Technical Field

The invention relates to the technical field of business support, in particular to a method and a device for establishing a keyword model.

Background

As mobile communication data mining is applied more and more widely, the large volume of recording data stored in customer service centers has become an important target for data mining. At present, the recording data of the customer service center has already been converted to text and can be analyzed semantically based on keywords.

When keyword-based semantic analysis is performed, the keywords have to be determined from human experience, and the keywords are then logically combined according to the logical relationships among them to form a semantic analysis model. Establishing such a model therefore depends mainly on the experience of the service personnel.

Disclosure of Invention

The embodiment of the invention provides a method and a device for establishing a keyword model, which are used for solving the problem of manually determining keywords in the prior art.

The method for establishing the keyword model provided by the embodiment of the invention comprises the following steps:

acquiring voice text data and core keywords;

retrieving the text sentence where the core keyword is located in the voice text data, and counting words and word frequencies of the words in a set range in the text sentence, wherein the words in the set range are the words in the text sentence located within a set range before and after the core keyword;

sorting the counted words according to the word frequency of the words, and determining the words whose word-frequency rank is above a ranking threshold as auxiliary keywords;

and carrying out logic relation combination on the core keywords and the auxiliary keywords to establish a keyword model.

Preferably, after counting words and word frequencies of the words in the set range in the text sentence, the method further includes:

determining the tone expressed by the text sentence where the core keyword is located according to the punctuation mark of the text sentence where the core keyword is located, and determining the tone expressed by the text sentence where the core keyword is located as the tone expressed by the keyword model; or

determining the tone expressed by the text sentence where each core keyword is located according to the punctuation marks of that text sentence; and, for the text sentence where each core keyword is located, determining the tone expressed by the keyword model according to the tone expressed by that text sentence and the tones expressed by its two adjacent text sentences.

Preferably, the determining the words whose word-frequency rank is above the ranking threshold as auxiliary keywords includes:

performing domain classification on the counted words, and, for each domain, determining the words whose word-frequency rank is above the ranking threshold in that domain as the auxiliary keywords.

Preferably, the logically combining the core keyword and the auxiliary keyword includes:

determining the core keyword and the auxiliary keyword as model keywords;

performing domain classification on the model keywords, and determining the logical relationship between model keywords in different classes as logical AND;

if the model keywords in the same class are synonyms, determining the logical relationship between the model keywords in the same class as logical OR;

and if the model keywords in the same class are not synonyms, determining the logical relationship between the model keywords in the same class as logical NOT.

Preferably, the obtaining the core keyword includes:

obtaining a core keyword determined according to the service type of the voice text data; or

acquiring the core keywords input by the user.

Correspondingly, an embodiment of the present invention provides an apparatus for building a keyword model, including:

an acquisition unit configured to acquire voice text data and a core keyword;

a statistical unit, configured to retrieve the text sentence in which the core keyword is located in the voice text data, and count words and word frequencies of the words in a set range in the text sentence, where the words in the set range are the words in the text sentence located within a set range before and after the core keyword;

a determining unit, configured to sort the counted words according to the word frequency of the words, and determine the words whose word-frequency rank is above a ranking threshold as auxiliary keywords;

and the establishing unit is used for carrying out logic relation combination on the core key words and the auxiliary key words to establish a key word model.

Preferably, the statistical unit is further configured to:

Preferably, the determining unit is specifically configured to:

Preferably, the establishing unit is specifically configured to:

determining the core keyword and the auxiliary keyword as model keywords;

Preferably, the obtaining unit is specifically configured to:

acquiring the core keywords input by the user.

As can be seen from the embodiment of the invention, voice text data and core keywords are acquired; the text sentences in the voice text data in which the core keywords are located are retrieved; the words within a set range in those text sentences, that is, the words located within a set range before and after the core keywords, are counted together with their word frequencies; the counted words are sorted by word frequency; the words whose word-frequency rank is above a ranking threshold are determined as auxiliary keywords; and the core keywords and the auxiliary keywords are logically combined to establish a keyword model. Because the auxiliary keywords are obtained by screening the words within the set range of the text sentences containing the core keywords and ranking them by word frequency, the keywords for the keyword model are obtained and then logically combined into the model, which improves the efficiency and accuracy of semantic analysis.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flowchart of a method for building a keyword model according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an apparatus for building a keyword model according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the embodiment of the invention, the voice text data is the text data after voice recognition, and the data can be subjected to semantic analysis.

Fig. 1 illustrates a process for building a keyword model according to an embodiment of the present invention, which may be performed by an apparatus for building a keyword model, where the apparatus may be located in a semantic analysis system.

As shown in fig. 1, the process specifically includes:

Step 101, acquiring voice text data and core keywords.

Step 102, retrieving the text sentence where the core keyword is located in the voice text data, and counting words and word frequencies of the words in a set range in the text sentence.

Step 103, sorting the counted words according to the word frequency of the words, and determining the words whose word-frequency rank is above the ranking threshold as auxiliary keywords.

Step 104, performing logical relationship combination on the core keywords and the auxiliary keywords to establish a keyword model.

In step 101, the core keyword may be a core keyword determined according to the service type of the voice text data, or a core keyword input by the user. The speech text data is for audio

The core keyword may be used to determine the specific content of the service. If the content of the voice text data is GPRS package complaints, a keyword model for GPRS package complaint analysis needs to be established, and the core keywords can be "GPRS" and "complaint". Core keywords input by the user, such as "88 package" and "mobile phone terminal", can also be obtained through an interface.

When the core keyword is obtained, it can also be expanded on the basis of the existing core keyword, where the expanded keywords are synonyms or near-synonyms. For example, after the core keyword is determined to be "GPRS", it may be extended with the core keywords "internet access", "traffic", and the like.
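The expansion described above can be sketched as a simple lookup. This is a minimal illustration only: the `SYNONYMS` table and the function name `expand_core_keywords` are hypothetical, and the text does not say where the synonyms come from.

```python
# Hypothetical synonym table; the source does not specify how synonyms are obtained.
SYNONYMS = {
    "GPRS": ["internet access", "traffic"],
}


def expand_core_keywords(core_keywords, synonyms=SYNONYMS):
    """Return the core keywords plus their synonyms, in order and without duplicates."""
    expanded = []
    for keyword in core_keywords:
        for candidate in [keyword, *synonyms.get(keyword, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_core_keywords(["GPRS", "complaint"]))
# ['GPRS', 'internet access', 'traffic', 'complaint']
```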

In step 102, after the voice text data and the core keywords are obtained in step 101, the text sentences in which the core keywords are located are retrieved from the voice text data, and the words within a set range in those text sentences and their word frequencies are counted. Here, the word frequency of a word refers to the number of times the word appears and how frequently it appears. The words within the set range are the words located within a set range before and after the core keyword in the text sentence; in other words, the context of the core keyword is searched. The range can be set as a number of words away from the core keyword, and the words before the core keyword and the words after the core keyword are counted. The set range may be chosen empirically.

For example, with the core keywords "data", "traffic", "internet access" and "GPRS", the 5 words before and the 8 words after each core keyword are counted. The statistics may then show that "complaint" appears n times before the core keywords and m times after them, where n and m are positive integers. The Chinese language words can be screened and eliminated.
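The window counting in this example can be sketched as follows. The sketch assumes the sentences have already been segmented into word lists (Chinese text would need a word segmenter first), and the function name `count_context_words` and the `stop_words` parameter are illustrative rather than taken from the source.

```python
from collections import Counter


def count_context_words(sentences, core_keywords, before=5, after=8, stop_words=()):
    """Count words within `before` words ahead of and `after` words behind
    each core-keyword occurrence.

    `sentences` is a list of tokenized sentences (lists of words).
    Returns two Counters: words seen before and words seen after a core keyword.
    """
    core = set(core_keywords)
    skip = core | set(stop_words)  # core keywords themselves are not counted
    before_counts, after_counts = Counter(), Counter()
    for words in sentences:
        for i, word in enumerate(words):
            if word not in core:
                continue
            left = words[max(0, i - before):i]
            right = words[i + 1:i + 1 + after]
            before_counts.update(w for w in left if w not in skip)
            after_counts.update(w for w in right if w not in skip)
    return before_counts, after_counts


sentences = [
    ["I", "have", "a", "complaint", "about", "GPRS", "charges"],
    ["my", "traffic", "was", "deducted", "wrongly", "so", "complaint"],
]
before_counts, after_counts = count_context_words(sentences, ["GPRS", "traffic"])
print(before_counts["complaint"], after_counts["complaint"])  # 1 1
```

Here `before_counts["complaint"]` plays the role of n and `after_counts["complaint"]` the role of m. Note that when two core keywords sit close together their windows overlap, so a word between them is counted once per keyword.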

After the words and the word frequencies of the words are counted, the tone expressed by the text sentence where the core keyword is located can be determined according to the punctuation marks of the text sentence where the core keyword is located, and then the tone expressed by the text sentence where the core keyword is located can be determined as the tone expressed by the keyword model.

Alternatively, the tone expressed by the text sentence where each core keyword is located is determined according to the punctuation mark of that text sentence, and then, for the text sentence where each core keyword is located, the tone expressed by the keyword model is determined according to the tone expressed by that text sentence and the tones expressed by its two adjacent text sentences.

For example, if the kth text sentence containing "traffic" ends with a question mark, it expresses a questioning tone, and if the ith text sentence containing "traffic" ends with an exclamation mark, it expresses an angry tone, where k and i are positive integers. Alternatively, starting from the tone expressed by the text sentence containing "traffic", if the two adjacent text sentences both express an angry tone, the tone expressed by the keyword model can be determined to be angry. When semantic analysis is performed on speech, the tone expressed by the keyword model can be used to analyze the tone of the sentences in the speech, so that staff can handle information with a strong tone.
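A minimal sketch of the punctuation-based tone check follows. The mapping `TONE_BY_MARK`, the "neutral" default, and the majority vote used to combine a sentence with its two neighbours are all assumptions made for illustration; the text does not state how the tones are combined.

```python
from collections import Counter

# Assumed mapping from final punctuation mark to tone; anything else is "neutral".
TONE_BY_MARK = {"?": "query", "？": "query", "!": "anger", "！": "anger"}


def sentence_tone(sentence):
    """Tone of one sentence, read from its final punctuation mark."""
    stripped = sentence.strip()
    if not stripped:
        return "neutral"
    return TONE_BY_MARK.get(stripped[-1], "neutral")


def model_tone(sentences, core_keywords):
    """Tone for the keyword model: for every sentence containing a core keyword,
    collect the tones of that sentence and its two neighbours, then return the
    most common non-neutral tone (a majority vote, which is an assumption)."""
    votes = Counter()
    for i, sentence in enumerate(sentences):
        if not any(keyword in sentence for keyword in core_keywords):
            continue
        for j in (i - 1, i, i + 1):
            if 0 <= j < len(sentences):
                tone = sentence_tone(sentences[j])
                if tone != "neutral":
                    votes[tone] += 1
    return votes.most_common(1)[0][0] if votes else "neutral"


sentences = [
    "Why is my traffic gone?",
    "This is outrageous!",
    "My traffic was deducted twice!",
    "Fix it now!",
]
print(sentence_tone(sentences[0]))         # query
print(model_tone(sentences, ["traffic"]))  # anger
```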

In step 103, the counted words are sorted by word frequency and then classified by domain; for each domain, the words whose word-frequency rank is above the ranking threshold in that domain are determined as auxiliary keywords. The ranking threshold can be set empirically. The auxiliary keywords further constrain the text sentences, and selecting them per domain means the keywords cover a wide range of domains. For example, the 5 words with the highest word frequency are selected from the vocabulary expressing the business, and the 5 words with the highest word frequency are selected from the vocabulary expressing attitude. The ranking thresholds of different domains may be the same or different; they can be set to the same value when the word frequencies counted in the different domains differ little.

In addition, when the statistics show that the word "complaint" occurs frequently, "complaint" can itself be taken as a core keyword, and the word frequencies of the words within the set range of "complaint" are then counted.
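The per-domain selection in step 103 can be sketched as below, reading "ranking above the threshold" as "among the top N by frequency in the domain". The `domain_of` lexicon and the function name `select_auxiliary_keywords` are hypothetical; words missing from the lexicon are simply skipped in this sketch.

```python
from collections import Counter, defaultdict


def select_auxiliary_keywords(word_counts, domain_of, top_n=5):
    """Pick the most frequent words in each domain.

    word_counts: mapping word -> frequency
    domain_of:   mapping word -> domain label (hypothetical lexicon)
    top_n:       one int for all domains, or a dict domain -> int
    """
    by_domain = defaultdict(Counter)
    for word, freq in word_counts.items():
        domain = domain_of.get(word)
        if domain is not None:
            by_domain[domain][word] = freq
    result = {}
    for domain, counts in by_domain.items():
        limit = top_n[domain] if isinstance(top_n, dict) else top_n
        result[domain] = [word for word, _ in counts.most_common(limit)]
    return result


counts = {"package": 9, "bill": 7, "roaming": 2, "wrong": 8, "angry": 5, "slow": 1}
domain_of = {
    "package": "business", "bill": "business", "roaming": "business",
    "wrong": "attitude", "angry": "attitude", "slow": "attitude",
}
print(select_auxiliary_keywords(counts, domain_of, top_n=2))
# {'business': ['package', 'bill'], 'attitude': ['wrong', 'angry']}
```

Passing a dict such as `{"business": 2, "attitude": 1}` for `top_n` gives each domain its own threshold, matching the remark that thresholds may differ between domains.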

In step 104, after the core keyword and the auxiliary keyword are obtained, the core keyword and the auxiliary keyword are determined as model keywords, and the model keywords are classified.

After the classification, the logical relationship between the model keywords of different classes is determined as a logical relationship AND.

If the model keywords in the same class are synonyms, the logical relationship between them is determined as logical OR; if the model keywords in the same class are not synonyms, the logical relationship between them is determined as logical NOT.

The "+" symbol may represent the logical relationship "AND", and the "|" symbol may represent the logical relationship "OR". For example: ("GPRS" + "data") | ("not to" | "cross-buckle" | "wrong").
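The combination rules of step 104 can be sketched as a small data structure: one OR-group of synonyms per class, with the groups joined by AND. The sketch renders AND as "+" and OR as "|", the symbols that appear in the example expression. The function names are illustrative, and the case of non-synonyms within one class is left out because its exact semantics are not spelled out here.

```python
def build_model(groups):
    """`groups` maps a class label to a list of synonymous model keywords.
    Returns the model as a list of OR-groups that are ANDed together."""
    return [list(words) for words in groups.values() if words]


def render(model):
    """Render the model with '+' for AND between classes and '|' for OR within a class."""
    return " + ".join(
        "(" + " | ".join(f'"{word}"' for word in group) + ")" for group in model
    )


def matches(model, sentence_words):
    """True if every class contributes at least one keyword to the sentence."""
    present = set(sentence_words)
    return all(any(word in present for word in group) for group in model)


model = build_model({
    "business": ["GPRS", "data"],
    "attitude": ["complaint", "wrong"],
})
print(render(model))  # ("GPRS" | "data") + ("complaint" | "wrong")
print(matches(model, ["my", "data", "bill", "is", "wrong"]))  # True
print(matches(model, ["my", "data", "bill"]))                 # False
```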

As the above embodiment shows, voice text data and core keywords are acquired; the text sentences in the voice text data in which the core keywords are located are retrieved; the words within a set range in those text sentences, that is, the words located within a set range before and after the core keywords, are counted together with their word frequencies; the counted words are sorted by word frequency; the words whose word-frequency rank is above a ranking threshold are determined as auxiliary keywords; and the core keywords and the auxiliary keywords are logically combined to establish a keyword model. Because the auxiliary keywords are obtained by screening the words within the set range of the text sentences containing the core keywords and ranking them by word frequency, the keywords for the keyword model are obtained and then logically combined into the model, which improves the efficiency and accuracy of semantic analysis.

Based on the same technical concept, fig. 2 illustrates an apparatus for building a keyword model according to an embodiment of the present invention, which may perform a process of building a keyword model.

As shown in fig. 2, the apparatus specifically includes:

an acquisition unit 201 for acquiring voice text data and core keywords;

a statistical unit 202, configured to retrieve the text sentence in which the core keyword is located in the voice text data, and count words and word frequencies of the words in a set range in the text sentence, where the words in the set range are the words in the text sentence located within a set range before and after the core keyword;

a determining unit 203, configured to sort the counted words according to the word frequencies of the words, and determine the words whose word-frequency rank is above a ranking threshold as auxiliary keywords;

the establishing unit 204 is configured to perform logical relationship combination on the core keyword and the auxiliary keyword to establish a keyword model.
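One way to picture how units 201 to 204 fit together is as four small classes chained by a wrapper. This is only a sketch under simplifying assumptions: the class and method names are invented for illustration, sentences are assumed to be pre-tokenized word lists, the ranking threshold is read as "top N", and the establishing unit just returns the two keyword groups rather than a full logical expression.

```python
from collections import Counter


class AcquisitionUnit:  # corresponds to unit 201
    def acquire(self, sentences, core_keywords):
        return [list(s) for s in sentences], list(core_keywords)


class StatisticalUnit:  # corresponds to unit 202
    def __init__(self, before=5, after=8):
        self.before, self.after = before, after

    def count(self, sentences, core_keywords):
        core, counts = set(core_keywords), Counter()
        for words in sentences:
            for i, word in enumerate(words):
                if word in core:
                    window = words[max(0, i - self.before):i] + words[i + 1:i + 1 + self.after]
                    counts.update(w for w in window if w not in core)
        return counts


class DeterminingUnit:  # corresponds to unit 203
    def __init__(self, top_n=1):
        self.top_n = top_n

    def determine(self, counts):
        return [word for word, _ in counts.most_common(self.top_n)]


class EstablishingUnit:  # corresponds to unit 204 (simplified placeholder)
    def establish(self, core_keywords, auxiliary_keywords):
        return {"core": list(core_keywords), "auxiliary": list(auxiliary_keywords)}


class KeywordModelApparatus:
    def __init__(self):
        self.acquisition = AcquisitionUnit()
        self.statistics = StatisticalUnit()
        self.determining = DeterminingUnit()
        self.establishing = EstablishingUnit()

    def run(self, sentences, core_keywords):
        sentences, core = self.acquisition.acquire(sentences, core_keywords)
        counts = self.statistics.count(sentences, core)
        auxiliary = self.determining.determine(counts)
        return self.establishing.establish(core, auxiliary)


apparatus = KeywordModelApparatus()
print(apparatus.run(
    [["GPRS", "fee", "complaint"], ["GPRS", "complaint", "again"], ["no", "keyword", "here"]],
    ["GPRS"],
))
# {'core': ['GPRS'], 'auxiliary': ['complaint']}
```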

Preferably, the statistical unit 202 is further configured to:

Preferably, the determining unit 203 is specifically configured to:

Preferably, the establishing unit 204 is specifically configured to:

determining the core keyword and the auxiliary keyword as model keywords;

Preferably, the obtaining unit 201 is specifically configured to:

acquiring the core keywords input by the user.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method for building a keyword model, comprising:

acquiring voice text data and core keywords;

performing logic relation combination on the core keywords and the auxiliary keywords to establish a keyword model;

after counting words and word frequencies of the words in the set range in the text sentence, the method further comprises the following steps:

determining the tone expressed by the text sentence where each core keyword is located according to the punctuation marks of that text sentence; and, for the text sentence where each core keyword is located, determining the tone expressed by the keyword model according to the tone expressed by that text sentence and the tones expressed by its two adjacent text sentences.

2. The method of claim 1, wherein determining the words whose word-frequency rank is above the ranking threshold as auxiliary keywords comprises:

3. The method of claim 1, wherein said logically combining said core keywords and said auxiliary keywords comprises:

determining the core keyword and the auxiliary keyword as model keywords;

classifying the model keywords, and determining the logical relationship between model keywords in different classes as logical AND;

4. The method of any of claims 1 to 3, wherein the obtaining core keywords comprises:

acquiring the core keywords input by the user.

5. An apparatus for building a keyword model, comprising:

an acquisition unit configured to acquire voice text data and a core keyword;

the establishing unit is used for carrying out logic relation combination on the core keywords and the auxiliary keywords and establishing a keyword model;

the statistical unit is further configured to:

6. The apparatus of claim 5, wherein the determination unit is specifically configured to:

7. The apparatus according to claim 5, wherein the establishing unit is specifically configured to:

determining the core keyword and the auxiliary keyword as model keywords;

8. The apparatus according to any one of claims 5 to 7, wherein the obtaining unit is specifically configured to:

acquiring the core keywords input by the user.