CN112148872A - Natural conversation topic analysis method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112148872A
Authority
CN
China
Prior art keywords
word segmentation
sequence
segmentation sequence
word
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011043378.1A
Other languages
Chinese (zh)
Other versions
CN112148872B (en)
Inventor
刁则鸣
夏致昊
周小敏
应鸿晖
石易
黄彦龙
黄晓青
莫凡
耿夏楠
罗海涛
傅强
阿曼太
徐涛
傅昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Branch Center Of National Computer Network And Information Security Management Center
Eversec Beijing Technology Co Ltd
Original Assignee
Guangzhou Branch Center Of National Computer Network And Information Security Management Center
Eversec Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Branch Center Of National Computer Network And Information Security Management Center, Eversec Beijing Technology Co Ltd filed Critical Guangzhou Branch Center Of National Computer Network And Information Security Management Center
Priority to CN202011043378.1A priority Critical patent/CN112148872B/en
Publication of CN112148872A publication Critical patent/CN112148872A/en
Application granted granted Critical
Publication of CN112148872B publication Critical patent/CN112148872B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure provide a natural conversation topic analysis method and device, an electronic device, and a storage medium, wherein the method includes the following steps: acquiring a plurality of natural dialogue texts, and segmenting and word-segmenting any natural dialogue text to obtain a word segmentation sequence; aggregating and grouping the word segmentation sequence set obtained from the natural dialogue texts into a plurality of word segmentation sequence subsets; extracting core keywords from any word segmentation sequence subset; calculating the Levenshtein distance between any two core keyword word sequence character strings in any word segmentation sequence subset to obtain the topic purity; and outputting an analysis result according to the topic purity of each word segmentation sequence subset contained in the word segmentation sequence set and the corresponding keyword set. The technical solution of this embodiment can analyze topics directly from batch or massive natural conversations without manual participation, and can improve topic analysis efficiency.

Description

Natural conversation topic analysis method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of natural language processing, in particular to a natural conversation topic analysis method and device, electronic equipment and a storage medium.
Background
Natural conversation refers to ordinary conversation behavior between natural persons. In a natural conversation, the two parties mainly express their intentions through information exchanged back and forth that the other party can understand; the elements of a dialogue are usually linked together (including the context and the dialogue as a whole), so that the dialogue intention can be fully reflected. In addition, in a conversation with a specific context, brief language is often used to express implicit meanings that are not directly stated in the conversation, and the same thing may be expressed with different words and in different ways. Dialogue intention extraction is a crucial link in realizing natural language understanding and constructing a dialogue system.
Currently, the industry usually uses topic models to perform intention extraction. Against the background of massive natural dialogues, the intention extraction approach based on the currently popular topic models requires accumulating a large amount of intention-labeled natural dialogue corpora, which implies a large engineering workload and heavy consumption of machine resources, time, and labor. Meanwhile, in specific scenarios people pay more attention to rare topics within massive natural conversations, and intention classification by means of a topic model is very likely to produce false alarms or missed reports for such topics.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for analyzing a natural conversation topic, so as to improve efficiency of topic analysis on a natural conversation.
Additional features and advantages of the disclosed embodiments will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosed embodiments.
In a first aspect, an embodiment of the present disclosure provides a natural conversation topic analysis method, including:
acquiring a plurality of natural conversation texts, segmenting any one of the natural conversation texts according to conversation roles to obtain one or more conversation units, and performing word segmentation processing on the text content of any one of the conversation units to obtain a word segmentation sequence;
aggregating and grouping the word segmentation sequence sets obtained according to the natural conversation texts into a plurality of word segmentation sequence subsets;
for any word segmentation sequence subset, extracting the segmentation with the document frequency greater than a specified frequency threshold value from the included segmentation as a core keyword to obtain a core keyword set corresponding to the word segmentation sequence subset;
for any word segmentation sequence in any word segmentation sequence subset, generating a core keyword word sequence character string corresponding to the word segmentation sequence according to the core keywords it contains and the order in which they appear, and respectively calculating the Levenshtein distance between any two core keyword word sequence character strings;
determining the topic similarity between any word segmentation sequence and the word segmentation sequence subset to which it belongs according to the Levenshtein distance between the core keyword word sequence character string corresponding to that word segmentation sequence and the core keyword word sequence character strings corresponding to the other word segmentation sequences in the subset, and determining the topic purity of any word segmentation sequence subset according to the topic similarity of each word segmentation sequence it contains;
and outputting an analysis result according to the topic purity of each word segmentation sequence subset contained in the word segmentation sequence set and the corresponding keyword set.
In an embodiment, grouping and aggregating the segmentation sequence sets obtained from the natural dialog texts into a plurality of segmentation sequence subsets includes:
the participle sequence set is divided into at least one initial set, and the following operations are executed on any initial set through an independent processing process:
newly building a word segmentation sequence subset for the initial set as an existing word segmentation sequence subset, and taking a word segmentation sequence from the initial set to add to the newly built word segmentation sequence subset;
traversing each existing word segmentation sequence subset of the initial set for any word segmentation sequence in the initial set;
if the number of words that the word segmentation sequence shares with each word segmentation sequence contained in any existing word segmentation sequence subset is smaller than or equal to a preset number threshold, traversing the next existing word segmentation sequence subset, and if the last existing word segmentation sequence subset has been traversed, newly building a word segmentation sequence subset as an existing word segmentation sequence subset and adding the word segmentation sequence to it;
if the number of words that the word segmentation sequence shares with each word segmentation sequence contained in any existing word segmentation sequence subset is greater than the preset number threshold, adding the word segmentation sequence to that existing word segmentation sequence subset;
and using the word segmentation sequence subset obtained according to the at least one initial set as the plurality of word segmentation sequence subsets.
In one embodiment, if the word segmentation sequence set is divided into a plurality of initial sets, then after the step of extracting, for any word segmentation sequence subset, the word segmentations with a document frequency greater than the specified frequency threshold from the contained word segmentations as core keywords to obtain the core keyword set corresponding to the word segmentation sequence subset, the method further comprises:
and carrying out merging judgment on any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword sets, and if the merging is determined according to the judgment result, merging the two word segmentation sequence subsets until the number of the word segmentation sequence subsets is not changed any more.
In an embodiment, merging and judging any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword set includes:
acquiring the average element number and the intersection element number of the core keyword set corresponding to the two word segmentation sequence subsets;
and determining whether the ratio of the number of intersection elements to the average number of elements is greater than a preset coincidence rate threshold, and if so, determining to merge the two word segmentation sequence subsets.
In one embodiment, for any word segmentation sequence subset, extracting the word segmentation with the document frequency greater than the specified frequency threshold from the included word segmentation as the core keyword comprises:
for any word segmentation sequence subset, extracting high-frequency word segmentation according to the word segmentation sequences contained in the word segmentation sequence subset to obtain a keyword word bag of the word segmentation sequence subset, and extracting keywords with document frequency greater than a specified frequency threshold value from the keyword word bag to serve as core keywords;
in one embodiment, the document frequency of any keyword in the keyword bag of the word segmentation sequence subset is calculated as follows:
determining the number of word segmentation sequences of the keyword appearing in the word segmentation sequence subset as a first number;
determining the number of word segmentation sequences of the word segmentation sequence subset as a second number;
and taking the ratio of the first quantity to the second quantity as the document frequency of the keyword.
In one embodiment, the acquiring of the plurality of natural dialogue texts includes:
acquiring batch or massive static natural dialogue texts, or acquiring natural dialogue texts generated in real time.
In one embodiment, the method further comprises:
after the word segmentation processing is carried out on the text content of any dialogue unit to obtain a word segmentation sequence, the method further comprises the following steps: deleting the participles belonging to the stop word list in the participle result according to a preset stop word list; and/or
After the word segmentation processing is carried out on the text content of any dialogue unit to obtain a word segmentation sequence, the method further comprises the following steps: deleting the word segmentation sequence with the text length smaller than a preset text length threshold value; and/or
After determining the topic similarity between any word segmentation sequence and the word segmentation sequence subset to which it belongs according to the Levenshtein distance between the core keyword word sequence character string corresponding to that word segmentation sequence and the core keyword word sequence character strings corresponding to the other word segmentation sequences in the subset, the method further comprises the following step: removing word segmentation sequences whose topic similarity is smaller than a preset similarity threshold from the word segmentation sequence subset; and/or
After determining the topic purity of the word segmentation sequence subset according to the topic similarity of each word segmentation sequence contained in any word segmentation sequence subset, the method further comprises the following steps: and removing word segmentation sequence subsets with the topic purity smaller than a preset topic purity threshold value from the word segmentation sequence sets.
In a second aspect, an embodiment of the present disclosure further provides a natural conversation topic analysis device, including:
the data acquisition and preprocessing module is used for acquiring a plurality of natural conversation texts, segmenting any one of the natural conversation texts according to a conversation role to obtain one or more conversation units, and performing word segmentation processing on the text content of any one of the conversation units to obtain a word segmentation sequence;
the topic aggregation module is used for aggregating and grouping the word segmentation sequence set obtained according to the natural conversation texts into a plurality of word segmentation sequence subsets;
the keyword extraction module is used for extracting the participles with the document frequency greater than a specified frequency threshold from the participles contained in any participle sequence subset as core keywords so as to obtain a core keyword set corresponding to the participle sequence subset;
the distance calculation module is used for generating, for any word segmentation sequence in any word segmentation sequence subset, the core keyword word sequence character string corresponding to the word segmentation sequence according to the core keywords it contains and the order in which they appear, and respectively calculating the Levenshtein distance between any two core keyword word sequence character strings;
the topic purity calculation module is used for determining the topic similarity between any word segmentation sequence and the word segmentation sequence subset to which it belongs according to the Levenshtein distance between the core keyword word sequence character string corresponding to that word segmentation sequence and the core keyword word sequence character strings corresponding to the other word segmentation sequences in the subset, and determining the topic purity of any word segmentation sequence subset according to the topic similarity of each word segmentation sequence it contains;
and the analysis result output module is used for outputting analysis results according to the topic purities of all the word segmentation sequence subsets contained in the word segmentation sequence set and the corresponding keyword set.
In one embodiment, the topic aggregation module is configured to:
the participle sequence set is divided into at least one initial set, and the following operations are executed on any initial set through an independent processing process:
newly building a word segmentation sequence subset for the initial set as an existing word segmentation sequence subset, and taking a word segmentation sequence from the initial set to add to the newly built word segmentation sequence subset;
traversing each existing word segmentation sequence subset of the initial set for any word segmentation sequence in the initial set;
if the number of words that the word segmentation sequence shares with each word segmentation sequence contained in any existing word segmentation sequence subset is smaller than or equal to a preset number threshold, traversing the next existing word segmentation sequence subset, and if the last existing word segmentation sequence subset has been traversed, newly building a word segmentation sequence subset as an existing word segmentation sequence subset and adding the word segmentation sequence to it;
if the number of words that the word segmentation sequence shares with each word segmentation sequence contained in any existing word segmentation sequence subset is greater than the preset number threshold, adding the word segmentation sequence to that existing word segmentation sequence subset;
and using the word segmentation sequence subset obtained according to the at least one initial set as the plurality of word segmentation sequence subsets.
In one embodiment, if the word segmentation sequence set is divided into a plurality of initial sets, the device further includes a topic iteration fusion module, configured to, after the word segmentations with a document frequency greater than the specified frequency threshold are extracted for any word segmentation sequence subset from the contained word segmentations as core keywords to obtain the core keyword set corresponding to the word segmentation sequence subset, perform the following:
and carrying out merging judgment on any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword sets, and if the merging is determined according to the judgment result, merging the two word segmentation sequence subsets until the number of the word segmentation sequence subsets is not changed any more.
In an embodiment, the merging and determining, by the topic iterative fusion module, any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword sets includes:
acquiring the average element number and the intersection element number of the core keyword set corresponding to the two word segmentation sequence subsets;
and determining whether the ratio of the number of intersection elements to the average number of elements is greater than a preset coincidence rate threshold, and if so, determining to merge the two word segmentation sequence subsets.
In one embodiment, the keyword extraction module is configured to:
for any word segmentation sequence subset, extracting high-frequency word segmentation according to the word segmentation sequences contained in the word segmentation sequence subset to obtain a keyword word bag of the word segmentation sequence subset, and extracting keywords with document frequency greater than a specified frequency threshold value from the keyword word bag to serve as core keywords;
in an embodiment, in the keyword extraction module, the document frequency of any keyword in the keyword bag of the word segmentation sequence subset is calculated as follows:
determining the number of word segmentation sequences of the keyword appearing in the word segmentation sequence subset as a first number;
determining the number of word segmentation sequences of the word segmentation sequence subset as a second number;
and taking the ratio of the first quantity to the second quantity as the document frequency of the keyword.
In one embodiment, when acquiring the plurality of natural dialogue texts, the data acquisition and preprocessing module is configured to:
acquire batch or massive static natural dialogue texts, or acquire natural dialogue texts generated in real time.
In one embodiment, the apparatus further comprises a first eliminating module, a second eliminating module, a third eliminating module, and/or a fourth eliminating module:
the first eliminating module is used for, after the word segmentation processing is performed on the text content of any dialogue unit to obtain a word segmentation sequence, deleting the word segmentations belonging to a preset stop word list from the word segmentation result;
the second eliminating module is used for, after the word segmentation processing is performed on the text content of any dialogue unit to obtain a word segmentation sequence, deleting the word segmentation sequences whose text length is smaller than a preset text length threshold;
the third eliminating module is used for, after the topic similarity between any word segmentation sequence and the word segmentation sequence subset to which it belongs is determined according to the Levenshtein distance between the core keyword word sequence character string corresponding to that word segmentation sequence and the core keyword word sequence character strings corresponding to the other word segmentation sequences in the subset, removing word segmentation sequences whose topic similarity is smaller than a preset similarity threshold from the word segmentation sequence subset;
the fourth eliminating module is used for, after the topic purity of any word segmentation sequence subset is determined according to the topic similarity of each word segmentation sequence it contains, removing word segmentation sequence subsets whose topic purity is smaller than a preset topic purity threshold from the word segmentation sequence set.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any implementation of the first aspect.
In a fourth aspect, the disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any one of the first aspect.
The method acquires a plurality of natural dialogue texts, and segments and word-segments any natural dialogue text to obtain a word segmentation sequence; aggregates and groups the word segmentation sequence set obtained from the natural dialogue texts into a plurality of word segmentation sequence subsets; extracts core keywords from any word segmentation sequence subset; calculates the Levenshtein distance between any two core keyword word sequence character strings in any word segmentation sequence subset to obtain the topic purity; and outputs an analysis result according to the topic purity of each word segmentation sequence subset contained in the word segmentation sequence set and the corresponding keyword set. The technical solution of this embodiment can analyze topics directly from batch or massive natural conversations without manual participation, and can improve topic analysis efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed to be used in the description of the embodiments of the present disclosure will be briefly described below, and it is obvious that the drawings in the following description are only a part of the embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the contents of the embodiments of the present disclosure and the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a natural conversation topic analysis method provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating another natural conversation topic analysis method provided by the disclosed embodiments;
FIG. 3 is a flow chart illustrating the processing of natural dialog text provided by an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of performing aggregate grouping processing according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart illustrating an iterative fusion process provided by an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of a subject purification process provided by an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an apparatus for analyzing a natural conversation topic provided by an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of another natural conversation topic analysis apparatus provided in the embodiment of the present disclosure;
FIG. 9 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
In order to make the technical problems solved, technical solutions adopted and technical effects achieved by the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described in further detail below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments, but not all embodiments, of the embodiments of the present disclosure. All other embodiments, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present disclosure, belong to the protection scope of the embodiments of the present disclosure.
It should be noted that the terms "system" and "network" are often used interchangeably in the embodiments of the present disclosure. Reference to "and/or" in embodiments of the present disclosure is meant to include any and all combinations of one or more of the associated listed items. The terms "first", "second", and the like in the description and claims of the present disclosure and in the drawings are used for distinguishing between different objects and not for limiting a particular order.
It should also be noted that, in the embodiments of the present disclosure, each of the following embodiments may be executed alone, or may be executed in combination with each other, and the embodiments of the present disclosure are not limited specifically.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The technical solutions of the embodiments of the present disclosure are further described by the following detailed description in conjunction with the accompanying drawings.
Fig. 1 is a flowchart illustrating a natural conversation topic analysis method provided in an embodiment of the present disclosure. This embodiment is applicable to analyzing conversation topics from batch or massive natural conversations, including batch or massive natural conversations acquired in real time, and the method may be executed by a natural conversation topic analysis apparatus configured in an electronic device. As shown in fig. 1, the natural conversation topic analysis method according to this embodiment includes:
in step S110, a plurality of natural conversation texts are obtained, any one of the natural conversation texts is segmented according to a conversation role to obtain one or more conversation units, and a word segmentation process is performed on text contents of any one of the conversation units to obtain a word segmentation sequence.
In this step, acquiring the plurality of natural dialogue texts may mean acquiring batch or massive static natural dialogue texts, or acquiring natural dialogue texts generated in real time.
In step S120, the word segmentation sequence sets obtained from the natural dialog texts are grouped into a plurality of word segmentation sequence subsets.
This step is used to split the word segmentation sequence set into a plurality of subsets with large topic differences, and it can be realized in various ways, for example by splitting the word segmentation sequence set according to similarity.
This embodiment illustrates a specific scheme of this step by taking one implementation as an example:
the participle sequence set is divided into at least one initial set, and the following operations are executed on any initial set through an independent processing process:
newly building a word segmentation sequence subset for the initial set as an existing word segmentation sequence subset, and taking a word segmentation sequence from the initial set to add to the newly built word segmentation sequence subset;
traversing each existing word segmentation sequence subset of the initial set for any word segmentation sequence in the initial set;
if the number of words that the word segmentation sequence shares with each word segmentation sequence contained in any existing word segmentation sequence subset is smaller than or equal to a preset number threshold, traversing the next existing word segmentation sequence subset, and if the last existing word segmentation sequence subset has been traversed, newly building a word segmentation sequence subset as an existing word segmentation sequence subset and adding the word segmentation sequence to it;
if the number of words that the word segmentation sequence shares with each word segmentation sequence contained in any existing word segmentation sequence subset is greater than the preset number threshold, adding the word segmentation sequence to that existing word segmentation sequence subset;
and using the word segmentation sequence subset obtained according to the at least one initial set as the plurality of word segmentation sequence subsets.
If the word segmentation sequence set is divided into a single initial set, this step can be completed by one process, so that the word segmentation sequence set obtained from the natural dialogue texts is aggregated and grouped into a plurality of word segmentation sequence subsets.
If the word segmentation sequence set is divided into a plurality of initial sets, this step is executed on each initial set by an independent process, and the one or more word segmentation sequence subsets obtained by each process are gathered to obtain the plurality of word segmentation sequence subsets of the word segmentation sequence set.
In the latter case, because the split into a plurality of initial sets is not performed according to topic differences, the word segmentation sequence subsets obtained by different processes may cover similar topics, and further iterative fusion is required.
For example, after step S130, the following process may be further performed: and carrying out merging judgment on any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword sets, and if the merging is determined according to the judgment result, merging the two word segmentation sequence subsets until the number of the word segmentation sequence subsets is not changed any more.
The merging judgment of any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword sets can be realized in the following way:
acquiring the average element number and the intersection element number of the core keyword set corresponding to the two word segmentation sequence subsets;
and determining whether the ratio of the number of intersection elements to the average number of elements is greater than a preset coincidence rate threshold, and if so, determining to merge the two word segmentation sequence subsets.
In step S130, for any word segmentation sequence subset, the word segmentation with the document frequency greater than the specified frequency threshold is extracted from the included word segmentation as the core keyword, so as to obtain the core keyword set corresponding to the word segmentation sequence subset.
For example, for any word segmentation sequence subset, extracting high-frequency words from the word segmentation sequences contained in the word segmentation sequence subset to obtain a keyword word bag of the word segmentation sequence subset, and extracting keywords with document frequency greater than a specified frequency threshold value from the keyword word bag as core keywords.
The document frequency of any keyword in the keyword bag of the word segmentation sequence subset is calculated in the following mode:
determining the number of word segmentation sequences of the keyword appearing in the word segmentation sequence subset as a first number;
determining the number of word segmentation sequences of the word segmentation sequence subset as a second number;
and taking the ratio of the first quantity to the second quantity as the document frequency of the keyword.
In step S140, for any word segmentation sequence in any word segmentation sequence subset, the core keyword word sequence character string corresponding to the word segmentation sequence is generated according to the core keywords it contains and the order in which they appear, and the Levenshtein distance between any two core keyword word sequence character strings is calculated respectively.
In step S150, the topic similarity between any word segmentation sequence and the word segmentation sequence subset to which it belongs is determined according to the Levenshtein distance between the core keyword word sequence character string corresponding to that word segmentation sequence and the core keyword word sequence character strings corresponding to the other word segmentation sequences in the subset, and the topic purity of any word segmentation sequence subset is determined according to the topic similarity of each word segmentation sequence it contains.
In step S160, an analysis result is output according to the topic purity of each word segmentation sequence subset included in the word segmentation sequence set and the corresponding keyword set.
In the above technical solution, a plurality of natural dialogue texts are acquired, and any natural dialogue text is segmented and word-segmented to obtain a word segmentation sequence; the word segmentation sequence set obtained from the natural dialogue texts is aggregated and grouped into a plurality of word segmentation sequence subsets; core keywords are extracted from any word segmentation sequence subset; the Levenshtein distance between any two core keyword word sequence character strings in any word segmentation sequence subset is calculated to obtain the topic purity; and an analysis result is output according to the topic purity of each word segmentation sequence subset contained in the word segmentation sequence set and the corresponding keyword set. The technical solution of this embodiment can analyze topics directly from batch or massive natural conversations without manual participation, and can improve topic analysis efficiency.
Fig. 2 is a schematic flow chart of another natural conversation topic analysis method provided in the embodiment of the present disclosure, and the embodiment is based on the foregoing embodiment and is optimized. As shown in fig. 2, the natural conversation topic analysis method according to the present embodiment includes:
in step S201, a plurality of natural conversation texts are obtained, any one of the natural conversation texts is segmented according to a conversation role to obtain one or more conversation units, and a word segmentation process is performed on text contents of any one of the conversation units to obtain a word segmentation sequence.
In step S202, the word segmentations belonging to a predetermined stop word list are deleted from the word segmentation result.
The step is an optional step, and the word segmentation quality can be further improved through the step.
In step S203, the segmentation sequence having a text length smaller than a predetermined text length threshold is deleted.
This step is optional; the purity of the word segmentation sequence set can be improved by deleting word segmentation sequences that carry little meaning.
Fig. 3 is an exemplary flowchart of the processing manner described in step S201 to step S203, and as shown in fig. 3, the processing manner includes:
in step S301, the natural conversation contents are segmented.
For example, the natural conversation content is read and segmented according to conversation roles: if there is only one conversation role, no processing is needed; if the conversation involves two or more speakers, the content belonging to each speaker is spliced into one piece of conversation content as a dialogue unit, and an independent dialogue ID is generated for each dialogue unit in turn.
For convenience of description, in the examples of this embodiment a dialogue ID is used to refer to the dialogue unit corresponding to that ID. In addition, each word segmentation sequence subset corresponds to one topic, and for convenience of description a topic is sometimes used to refer to the word segmentation sequence subset corresponding to that topic.
In step S302, word segmentation is performed using multiple processes.
In the step, the text content of each conversation ID is segmented by adopting a natural language segmentation module, continuous natural sentences in the text content are segmented into individual words, and the segmentation result replaces the original text content corresponding to the conversation ID.
In step S303, stop words are removed.
This step loads a preset stop word list, removes Chinese words that usually carry no definite meaning on their own, such as modal particles, adverbs, prepositions, and conjunctions, and removes non-Chinese characters.
In step S304, it is determined whether the text length is smaller than the threshold, if so, step S305 is executed, otherwise, step S306 is executed.
This step filters, from the output of the previous step, the texts whose remaining character length is lower than a specified preset threshold. Its purpose is to reduce the amount of data to be processed by the subsequent unsupervised topic clustering module and to output the data preprocessing result to the subsequent process.
In step S305, the line of text content is ignored.
In step S306, the normalized data is output, and the process ends.
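For illustration only, the preprocessing flow of steps S301 to S306 might be implemented roughly as in the following minimal Python sketch, assuming jieba as the word segmentation module; the function name, data layout, and length threshold are illustrative assumptions rather than part of the original description.

```python
# Minimal preprocessing sketch (steps S301-S306); jieba and all names/thresholds are assumptions.
import jieba

def preprocess_dialogues(dialogues, stop_words, min_length=5):
    """dialogues: list of conversations, each a list of (speaker, utterance) turns."""
    normalized = {}                              # dialogue ID -> word segmentation sequence
    next_id = 0
    for turns in dialogues:
        # Step S301: splice each speaker's turns into one dialogue unit per role.
        units = {}
        for speaker, utterance in turns:
            units.setdefault(speaker, []).append(utterance)
        for texts in units.values():
            content = "".join(texts)
            # Step S302: segment continuous sentences into individual words.
            tokens = jieba.lcut(content)
            # Step S303: drop stop words and tokens with no Chinese characters.
            tokens = [t for t in tokens
                      if t not in stop_words
                      and any('\u4e00' <= c <= '\u9fff' for c in t)]
            # Steps S304/S305: ignore units whose remaining length is below the threshold.
            if sum(len(t) for t in tokens) < min_length:
                continue
            # Step S306: keep the normalized sequence under a newly generated dialogue ID.
            normalized[next_id] = tokens
            next_id += 1
    return normalized
```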
In step S204, the word segmentation sequence sets obtained from the natural dialog texts are grouped and grouped into a plurality of word segmentation sequence subsets.
For the output of the data preprocessing, multi-process synchronous batch processing can be adopted. For the plurality of dialogue IDs and their corresponding text contents in each batch, aggregation can be performed in a plurality of ways. Fig. 4 is a flowchart of an exemplary implementation of this step; as shown in fig. 4, the aggregation grouping processing includes:
a new packet is initialized in step S401, and step S402 is executed.
For example, a session ID may be randomly drawn from the batch and divided into a new group.
In step S402, it is determined whether an unclassified participle sequence exists, if yes, step S403 is executed, otherwise, the process is ended.
An unclassified segmented word sequence is extracted in step S403, and step S404 is executed.
In step S404, the intersection of the word segmentation results is computed, and step S405 is performed.
A dialogue ID is randomly drawn from the batch and matched against the existing groups one by one: the word segmentation result list corresponding to the dialogue ID is intersected with the word segmentation result list of each member of the group, giving the number of intersection elements between the dialogue ID and each member of the group.
In step S405, it is determined whether the number of intersections satisfies the condition, if so, step S408 is executed, otherwise, step S406 is executed.
If the number of intersection elements between the dialogue ID and the word segmentation result list of each member in the group is greater than a specified threshold, the text word segmentation result corresponding to the dialogue ID is considered similar in intention topic to the group, and the dialogue ID is merged into the group; otherwise, the dialogue ID is placed into a new group.
In step S406, it is determined whether all groups have been traversed; if so, step S407 is executed, otherwise the process returns to step S404.
The process from step S402 to step S404 is repeated until all dialogue IDs in the batch have been put into groups.
In step S407, a new group is created, and step S409 is executed.
In step S408, the word segmentation sequence is added to the group, and step S409 is performed.
In step S409, classification is completed, and the process returns to step S402.
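For illustration only, the aggregation grouping of steps S401 to S409 for a single batch might look like the following sketch; the intersection threshold and all names are illustrative assumptions. A dialogue ID joins an existing group only when its word segmentation result shares more than the threshold number of words with every member of that group, mirroring the matching condition described above.

```python
# Aggregation grouping sketch (steps S401-S409); threshold and names are assumptions.
def aggregate_batch(batch, min_common_words=2):
    """batch: dict mapping dialogue ID -> word segmentation sequence (list of words)."""
    groups = []                                   # each group is a list of dialogue IDs
    for dialog_id, words in batch.items():
        word_set = set(words)
        placed = False
        for group in groups:
            # Join a group only if the intersection with every member exceeds the threshold.
            if all(len(word_set & set(batch[member])) > min_common_words
                   for member in group):
                group.append(dialog_id)           # step S408
                placed = True
                break
        if not placed:
            groups.append([dialog_id])            # step S407: open a new group
    return groups
```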
Because the same-keyword aggregation step in this embodiment adopts multi-process batch processing, overall operation efficiency is improved, but text contents that originally belong to the same topic may be divided into different batches, so similar topics may exist across batches. In this step, the topic results aggregated in different batches are fused by loop iteration, so that texts with relatively consistent keywords across the whole data set can be assigned to the same topic. The overall flow is shown in fig. 5, and the specific steps are as follows:
in step S501, the initial number of subjects is recorded, and step S502 is executed.
In step S502, the subject term bag is extracted, and step S503 is executed.
For the topics aggregated by the same-keyword step in different batches, the keyword sets corresponding to all dialogue IDs within each topic are collected to obtain the keyword bag of that topic.
In step S503, a core keyword is extracted, and step S504 is performed.
For each topic, each keyword in the topic's keyword bag is traversed and its document frequency df is calculated; keywords whose document frequency df is greater than a specified threshold are added to the core keyword list of the topic. The number of dialogue IDs in the topic that contain the keyword is counted and recorded as aim_topic; the total number of dialogue IDs in the topic is counted and recorded as all_topic; the document frequency df is calculated as follows:
df = aim_topic / all_topic
wherein df is the document frequency;
aim_topic is the number of dialogue IDs within the topic that contain the keyword;
all_topic is the total number of dialogue units (dialogue IDs) within the topic.
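As an illustrative sketch of steps S502 and S503, the keyword bag and the document frequency df = aim_topic / all_topic might be computed as follows; the frequency threshold and the names are assumptions.

```python
# Core keyword extraction sketch (steps S502-S503); the df threshold is an assumption.
from collections import Counter

def core_keywords(topic, df_threshold=0.5):
    """topic: dict mapping dialogue ID -> word segmentation sequence within one topic."""
    all_topic = len(topic)                 # total number of dialogue IDs in the topic
    counts = Counter()
    for words in topic.values():
        counts.update(set(words))          # count each keyword at most once per dialogue ID
    # Keep keywords whose document frequency df = aim_topic / all_topic exceeds the threshold.
    return [kw for kw, aim_topic in counts.items()
            if aim_topic / all_topic > df_threshold]
```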
In step S504, the overlap ratio is calculated, and step S505 is executed.
Topics are compared pairwise. Assuming the calculation is performed between topic A and topic B, the number of core keywords in the intersection of topic A and topic B is counted and recorded as set_kw; the numbers of core keywords of topic A and topic B are counted, and the average of the two is recorded as avg_kw; the coincidence rate of the core keywords of topic A and topic B is calculated as follows:
coincidence_per = set_kw / avg_kw
wherein set_kw is the number of core keywords in the intersection of the two topics;
avg_kw is the average number of core keywords of the two topics;
coincidence_per is the coincidence rate between the two topics.
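As an illustrative sketch of the coincidence-rate test used in steps S504 to S506, two topics would be merged when coincidence_per exceeds the threshold; the threshold value is an assumption.

```python
# Topic fusion test sketch (steps S504-S506); the coincidence threshold is an assumption.
def should_merge(core_kw_a, core_kw_b, threshold=0.6):
    set_kw = len(set(core_kw_a) & set(core_kw_b))        # intersection of core keywords
    avg_kw = (len(core_kw_a) + len(core_kw_b)) / 2       # average number of core keywords
    coincidence_per = set_kw / avg_kw if avg_kw else 0.0
    return coincidence_per > threshold
```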
In step S505, it is determined whether the overlap ratio is greater than a predetermined overlap ratio threshold, and if so, step S506 is performed.
In step S506, similar subjects are merged, and step S507 is performed.
Topics whose core keyword coincidence rate exceeds the specified threshold are combined into a new topic, and the dictionary of dialogue IDs contained under the corresponding topic ID is updated.
In step S507, the number of fused subjects is recorded, and step S508 is executed.
In step S508, it is determined whether the number of topics is the same, if yes, step S509 is executed, otherwise, the process returns to step S501.
And if the total number of the topics does not change, the iteration step of the topic loop is completed.
In step S509, the fusion result is output, and the process ends.
In step S205, for any word segmentation sequence subset, the word segmentation with the document frequency greater than the specified frequency threshold is extracted from the included word segmentation as the core keyword, so as to obtain the core keyword set corresponding to the word segmentation sequence subset.
In step S206, for any word segmentation sequence in any word segmentation sequence subset, the core keyword word sequence character string corresponding to the word segmentation sequence is generated according to the core keywords it contains and the order in which they appear, and the Levenshtein distance between any two core keyword word sequence character strings is calculated respectively.
In step S207, the topic similarity between any word segmentation sequence and the word segmentation sequence subset to which it belongs is determined according to the Levenshtein distance between the core keyword word sequence character string corresponding to that word segmentation sequence and the core keyword word sequence character strings corresponding to the other word segmentation sequences in the subset.
In step S208, the word segmentation sequences with topic similarity smaller than a predetermined similarity threshold are removed from the word segmentation sequence subset.
Fig. 6 is an exemplary flowchart of the processing manner from step S206 to step S208, and as shown in fig. 6, the processing manner includes:
in step S601, an order-preserving string is generated, and step S602 is executed.
For all topics output after topic fusion, the original text corresponding to each dialogue ID in each topic is processed against the topic's core keyword list, and the core keyword word sequence character string corresponding to each dialogue ID is generated according to the order in which the core keywords appear.
In step S602, the distance ratio average is calculated, and step S603 is executed.
For example, for each topic, the Levenshtein distance ratio between the core keyword word sequence character string of each dialogue ID and those of the remaining dialogue IDs is calculated as follows.
The Levenshtein distance is defined as follows. Let s and t be the two strings to be compared, with the length of s equal to n and the length of t equal to m. If n is 0, return m and exit; if m is 0, return n and exit. Otherwise, construct an array d[0..m, 0..n], where ".." means the intermediate integers are omitted (the same applies below).
Row 0 is initialized to 0..n and column 0 is initialized to 0..m.
Step 1. Examine each character of s in turn (i = 1..n).
Step 2. Examine each character of t in turn (j = 1..m).
Step 3. If s[i] equals t[j], cost is 0; if s[i] does not equal t[j], cost is 1.
d[i, j] is set to the minimum of three values:
Step 4. the value of the cell directly above the current cell plus one, i.e. d[i-1, j] + 1;
Step 5. the value of the cell directly to the left of the current cell plus one, i.e. d[i, j-1] + 1;
Step 6. the value of the cell above and to the left of the current cell plus cost, i.e. d[i-1, j-1] + cost.
Steps 3 to 6 are repeated until the loop finishes, and the output d[n, m] is the Levenshtein distance.
Levenshtein distance ratio:
Assume sum is the sum of the lengths of the two strings s and t to be compared, and ldist is the class edit distance (compared with the ordinary edit distance, in which each of the three operations counts +1, the class edit distance still counts deletion and insertion as +1 but counts substitution as +2). The Levenshtein distance ratio of s and t is then:
ratio = (sum - ldist) / sum
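As an illustrative sketch combining the distance computation above with the ratio formula, the following function uses the class edit distance (substitution counted as +2); it is a reading of the description rather than the original implementation.

```python
# Levenshtein distance ratio sketch; substitution costs 2 (class edit distance).
def levenshtein_ratio(s, t):
    n, m = len(s), len(t)
    if n == 0 or m == 0:
        return 1.0 if n == m else 0.0
    d = [[0] * (n + 1) for _ in range(m + 1)]     # d[j][i]: prefix t[:j] vs prefix s[:i]
    for i in range(n + 1):
        d[0][i] = i                               # delete i characters of s
    for j in range(m + 1):
        d[j][0] = j                               # insert j characters of t
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 2   # substitution counts +2
            d[j][i] = min(d[j][i - 1] + 1,            # deletion
                          d[j - 1][i] + 1,            # insertion
                          d[j - 1][i - 1] + cost)     # match or substitution
    ldist = d[m][n]
    total = n + m                                     # sum of the two string lengths
    return (total - ldist) / total
```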
in step S603, it is determined whether the average value is smaller than a predetermined similarity threshold, if so, step S604 is performed, otherwise, step S605 is performed.
In step S604, the word segmentation sequence is deleted, and the process ends.
The average of the Levenshtein distance ratios between each dialogue ID in the topic and the remaining dialogue IDs in the same topic is calculated as the topic similarity score of that dialogue ID; if the topic similarity score of a dialogue ID is lower than a specified threshold, the dialogue ID is removed from the topic.
In step S605, the subject purity score is calculated, and step S606 is executed.
The average of the topic similarity scores of the remaining dialogue IDs in each topic is calculated as the purity score of that topic. If the purity score of a topic is lower than a specified threshold, the texts corresponding to the dialogue IDs in the topic are considered impure, and the topic and its corresponding dialogue IDs are deleted.
In step S606, it is determined whether the topic purity score is less than a predetermined topic purity threshold, if yes, step S607 is executed, otherwise, step S608 is executed.
In step S607, the theme is deleted, and the process ends.
In step S608, the theme is retained, and the process ends.
The step is an optional step, and the purity of the word segmentation sequence subset can be improved.
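As an illustrative sketch of the purification flow in steps S601 to S608, reusing the levenshtein_ratio sketch above; both thresholds and all names are illustrative assumptions.

```python
# Topic purification sketch (steps S601-S608); thresholds are assumptions.
def purify_topic(topic, core_kws, sim_threshold=0.4, purity_threshold=0.5):
    """topic: dict mapping dialogue ID -> word segmentation sequence; core_kws: core keyword list."""
    core_set = set(core_kws)
    # Step S601: order-preserving core keyword string for each dialogue ID.
    strings = {did: "".join(w for w in words if w in core_set)
               for did, words in topic.items()}
    # Steps S602-S604: similarity score = mean Levenshtein ratio against the other dialogue IDs.
    kept = {}
    for did, s in strings.items():
        others = [levenshtein_ratio(s, t) for odid, t in strings.items() if odid != did]
        score = sum(others) / len(others) if others else 0.0
        if score >= sim_threshold:
            kept[did] = score
    # Steps S605-S608: topic purity = mean similarity score of the remaining dialogue IDs.
    purity = sum(kept.values()) / len(kept) if kept else 0.0
    return (kept, purity) if purity >= purity_threshold else (None, purity)
```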
In step S209, the topic purity of any of the word segmentation sequence subsets is determined according to the topic similarity of each word segmentation sequence included in the word segmentation sequence subset.
In step S210, a word segmentation sequence subset with a subject purity less than a predetermined subject purity threshold is removed from the word segmentation sequence set.
This step is an optional step to improve the purity of the set of word sequences.
In step S211, an analysis result is output according to the topic purity of each word segmentation sequence subset included in the word segmentation sequence set and the corresponding keyword set.
On the basis of the previous embodiment, some optional steps are added in the embodiment, so that the accuracy of the theme analysis result can be further improved.
As an implementation of the methods shown in the above drawings, the present application provides an embodiment of a natural conversation topic analysis apparatus, and fig. 7 shows a schematic structural diagram of a natural conversation topic analysis apparatus provided in this embodiment, where the embodiment of the apparatus corresponds to the method embodiments shown in fig. 1 to fig. 6, and the apparatus may be applied to various electronic devices. As shown in fig. 7, the natural conversation topic analysis apparatus according to the present embodiment includes a data acquisition and preprocessing module 710, a topic aggregation module 720, a keyword extraction module 730, a distance calculation module 740, a topic purity calculation module 750, and an analysis result output module 760.
The data obtaining and preprocessing module 710 is configured to obtain a plurality of natural dialog texts, segment any natural dialog text according to a dialog role to obtain one or more dialog units, and perform a word segmentation process on text contents of any dialog unit to obtain a word segmentation sequence.
The topic aggregation module 720 is configured to perform aggregation grouping on the segmentation sequence sets obtained from the natural dialog texts to divide the segmentation sequence sets into a plurality of segmentation sequence subsets.
The keyword extraction module 730 is configured to, for any participle sequence subset, extract participles with a document frequency greater than a specified frequency threshold from the contained participles as core keywords to obtain a core keyword set corresponding to the participle sequence subset.
The distance calculation module 740 is configured to, for any word segmentation sequence in any word segmentation sequence subset, generate the core keyword word sequence character string corresponding to the word segmentation sequence according to the core keywords it contains and the order in which they appear, and calculate the Levenshtein distance between any two core keyword word sequence character strings respectively.
The topic purity calculation module 750 is configured to determine the topic similarity between any word segmentation sequence and the word segmentation sequence subset to which it belongs according to the Levenshtein distance between the core keyword word sequence character string corresponding to that word segmentation sequence and the core keyword word sequence character strings corresponding to the other word segmentation sequences in the subset, and to determine the topic purity of any word segmentation sequence subset according to the topic similarity of each word segmentation sequence it contains.
The analysis result output module 760 is configured to output an analysis result according to the topic purity of each word segmentation sequence subset included in the word segmentation sequence set and the corresponding keyword set.
Further, the topic aggregation module 720 is configured to:
the participle sequence set is divided into at least one initial set, and the following operations are executed on any initial set through an independent processing process:
newly building a word segmentation sequence subset for the initial set as an existing word segmentation sequence subset, and taking a word segmentation sequence from the initial set to add to the newly built word segmentation sequence subset;
traversing each existing word segmentation sequence subset of the initial set for any word segmentation sequence in the initial set;
if the number of identical words between the word segmentation sequence and each word segmentation sequence contained in any existing word segmentation sequence subset is smaller than or equal to a preset number threshold, the next existing word segmentation sequence subset is traversed, and if the last existing word segmentation sequence subset has been traversed without a match, a new word segmentation sequence subset is created, added to the existing word segmentation sequence subsets, and the word segmentation sequence is added to the newly created subset;
if the number of identical words between the word segmentation sequence and each word segmentation sequence contained in any existing word segmentation sequence subset is greater than the preset number threshold, the word segmentation sequence is added to that existing word segmentation sequence subset;
and using the word segmentation sequence subsets obtained from the at least one initial set as the plurality of word segmentation sequence subsets (a sketch of this grouping procedure is given below).
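A minimal Python sketch of this grouping procedure for a single initial set; the membership test (more identical words than the threshold with every sequence already in the subset) and the threshold value are illustrative assumptions, since the embodiment leaves the exact comparison open.

def group_initial_set(initial_set, number_threshold=2):
    # initial_set: list of word segmentation sequences, each a list of words
    subsets = []
    for sequence in initial_set:
        words = set(sequence)
        placed = False
        for subset in subsets:
            # join a subset only if the sequence shares more than the threshold
            # number of words with every sequence already in that subset
            if all(len(words & set(member)) > number_threshold for member in subset):
                subset.append(sequence)
                placed = True
                break
        if not placed:
            # no existing subset matched: create a new subset for this sequence
            subsets.append([sequence])
    return subsets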
Further, the keyword extraction module 730 is configured to:
for any word segmentation sequence subset, extract high-frequency segmented words from the word segmentation sequences contained in the word segmentation sequence subset to obtain a keyword word bag of the word segmentation sequence subset, and extract keywords with a document frequency greater than the specified frequency threshold from the keyword word bag as the core keywords.
Further, the keyword extraction module 730 is further configured to:
determining the number of word segmentation sequences in the word segmentation sequence subset in which the keyword appears as a first number;
determining the total number of word segmentation sequences in the word segmentation sequence subset as a second number;
and taking the ratio of the first number to the second number as the document frequency of the keyword (see the sketch below).
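A minimal Python sketch of this document-frequency computation; representing a subset as a list of word segmentation sequences (each a list of words) is an assumption of the sketch.

def document_frequency(keyword, subset):
    # first number: word segmentation sequences of the subset in which the keyword appears
    first_number = sum(1 for sequence in subset if keyword in sequence)
    # second number: total word segmentation sequences in the subset
    second_number = len(subset)
    return first_number / second_number if second_number else 0.0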
Further, the data acquisition and preprocessing module 710 is configured to acquire the plurality of natural dialogue texts in real time.
The natural conversation topic analysis apparatus provided in this embodiment can execute the natural conversation topic analysis method provided in the method embodiments of the present disclosure, and has functional modules corresponding to the executed method and the corresponding beneficial effects.
Fig. 8 is a schematic structural diagram of another natural conversation topic analysis device provided in the embodiment of the present disclosure, and as shown in fig. 8, the natural conversation topic analysis device according to the embodiment includes: the system comprises a data acquisition and preprocessing module 801, a first rejection module 802, a second rejection module 803, a topic aggregation module 804, a keyword extraction module 805, a topic iteration fusion module 806, a distance calculation module 807, a third rejection module 808, a topic purity calculation module 809, a fourth rejection module 810 and an analysis result output module 811.
The data obtaining and preprocessing module 801 is configured to obtain a plurality of natural dialogue texts, segment any natural dialogue text according to a dialogue role to obtain one or more dialogue units, and perform word segmentation processing on text contents of any dialogue unit to obtain a word segmentation sequence.
The first rejection module 802 is configured to, after the word segmentation processing is performed on the text content of any dialogue unit to obtain a word segmentation sequence, delete from the word segmentation result, according to a preset stop word list, the segmented words that belong to the stop word list.
The second rejection module 803 is configured to, after the word segmentation processing is performed on the text content of any dialogue unit to obtain a word segmentation sequence, delete the word segmentation sequences whose text length is smaller than a preset text length threshold.
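A minimal Python sketch of the preprocessing performed by modules 801 to 803, assuming each dialogue unit is one line of the form "role: utterance" and using the jieba tokenizer purely as an example segmenter; the stop word list and text length threshold shown are illustrative.

import jieba  # third-party Chinese word segmentation library, used here only as an example

STOP_WORDS = {"的", "了", "啊"}   # illustrative stop word list
MIN_TEXT_LENGTH = 4               # illustrative text length threshold

def preprocess_dialog(natural_dialog_text):
    word_segmentation_sequences = []
    for line in natural_dialog_text.splitlines():
        # assumed layout: one dialogue unit per line, prefixed with its dialogue role
        _, _, content = line.partition(":")
        # word segmentation plus first rejection step: drop stop words
        words = [w for w in jieba.lcut(content.strip()) if w.strip() and w not in STOP_WORDS]
        # second rejection step: drop sequences whose text is shorter than the threshold
        if len("".join(words)) >= MIN_TEXT_LENGTH:
            word_segmentation_sequences.append(words)
    return word_segmentation_sequences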
The topic aggregation module 804 is configured to perform aggregation grouping on the segmentation sequence sets obtained according to the natural conversation texts to divide the segmentation sequence sets into a plurality of segmentation sequence subsets.
In particular, the topic aggregation module 804 is configured to:
the participle sequence set is divided into a plurality of initial sets, and the following operations are executed on any initial set through an independent processing process:
newly building a word segmentation sequence subset for the initial set as an existing word segmentation sequence subset, and taking a word segmentation sequence from the initial set to add to the newly built word segmentation sequence subset;
traversing each existing word segmentation sequence subset of the initial set for any word segmentation sequence in the initial set;
if the number of identical words between the word segmentation sequence and each word segmentation sequence contained in any existing word segmentation sequence subset is smaller than or equal to the preset number threshold, the next existing word segmentation sequence subset is traversed, and if the last existing word segmentation sequence subset has been traversed without a match, a new word segmentation sequence subset is created, added to the existing word segmentation sequence subsets, and the word segmentation sequence is added to the newly created subset;
if the number of identical words between the word segmentation sequence and each word segmentation sequence contained in any existing word segmentation sequence subset is greater than the preset number threshold, the word segmentation sequence is added to that existing word segmentation sequence subset;
and using the word segmentation sequence subsets obtained from the plurality of initial sets as the plurality of word segmentation sequence subsets.
The keyword extraction module 805 is configured to, for any word segmentation sequence subset, extract, as a core keyword, a word segmentation whose document frequency is greater than a specified frequency threshold from the included word segmentation, so as to obtain a core keyword set corresponding to the word segmentation sequence subset.
The topic iteration fusion module 806 is configured to perform merging judgment on any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword sets, and if merging is determined according to a judgment result, merge the two word segmentation sequence subsets until the number of the word segmentation sequence subsets does not change any more.
The distance calculating module 807 is configured to generate, for any participle sequence in any participle sequence subset, a core keyword word sequence character string corresponding to the participle sequence according to the included core keywords and the occurrence order of each core keyword, and to calculate the Levenshtein distance between any two core keyword word sequence character strings respectively.
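A minimal Python sketch of the word order character string construction and of the Levenshtein distance; treating each core keyword as a single symbol of the string (rather than comparing character by character) is an assumption of the sketch.

def keyword_order_string(sequence, core_keywords):
    # keep only core keywords, in their order of appearance in the word segmentation sequence
    return [word for word in sequence if word in core_keywords]

def levenshtein(a, b):
    # classic dynamic-programming edit distance between two symbol sequences
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]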
The third rejection module 808 is configured to, after the topic similarity between a participle sequence and the participle sequence subset to which it belongs is determined according to the Levenshtein distance between the core keyword word order character string corresponding to the participle sequence and the core keyword word order character strings corresponding to the other participle sequences in that subset, remove from the participle sequence subset the participle sequences whose topic similarity is smaller than a preset similarity threshold.
The topic purity calculation module 809 is configured to determine the topic similarity between a participle sequence and the participle sequence subset to which it belongs according to the Levenshtein distance between the core keyword word order character string corresponding to any participle sequence and the core keyword word order character strings corresponding to the other participle sequences in that subset, and to determine the topic purity of the participle sequence subset according to the topic similarity of each participle sequence included in it.
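A minimal Python sketch of one way module 809 could turn the pairwise Levenshtein distances into a topic similarity score; normalizing each distance by the longer string and averaging is an assumption, since the exact formula is not spelled out here. With the earlier sketch, distance_fn would be the levenshtein function.

def topic_similarity(target_string, other_strings, distance_fn):
    # target_string: core keyword word order string of one word segmentation sequence
    # other_strings: the corresponding strings of the other sequences in the same subset
    scores = []
    for other in other_strings:
        longest = max(len(target_string), len(other)) or 1
        scores.append(1.0 - distance_fn(target_string, other) / longest)
    return sum(scores) / len(scores) if scores else 0.0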
The fourth rejection module 810 is configured to, after the topic purity of each word segmentation sequence subset is determined according to the topic similarity of each word segmentation sequence included in it, remove from the word segmentation sequence set the word segmentation sequence subsets whose topic purity is smaller than a preset topic purity threshold.
The analysis result output module 811 is configured to output an analysis result according to the topic purity of each word segmentation sequence subset included in the word segmentation sequence set and the corresponding keyword set.
Further, the topic iteration fusion module 806 is configured to:
acquiring the average number of elements of the core keyword sets corresponding to the two word segmentation sequence subsets and the number of elements in their intersection;
and determining whether the number of intersection elements divided by the average number of elements is greater than a preset coincidence rate threshold, and if so, determining to merge the two word segmentation sequence subsets (a sketch of this check is given below).
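A minimal Python sketch of this coincidence-rate check; the 0.6 threshold is illustrative, and reading the coincidence rate as intersection size divided by average set size follows the wording above.

def should_merge(core_keywords_a, core_keywords_b, coincidence_threshold=0.6):
    # core_keywords_a / core_keywords_b: core keyword sets of the two word segmentation sequence subsets
    average_size = (len(core_keywords_a) + len(core_keywords_b)) / 2
    if average_size == 0:
        return False
    coincidence_rate = len(core_keywords_a & core_keywords_b) / average_size
    return coincidence_rate > coincidence_threshold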
Further, the keyword extraction module 805 is configured to:
for any word segmentation sequence subset, extracting high-frequency segmented words from the word segmentation sequences contained in the word segmentation sequence subset to obtain a keyword word bag of the word segmentation sequence subset, and extracting keywords with a document frequency greater than the specified frequency threshold from the keyword word bag as the core keywords.
Further, the keyword extraction module 805 is further configured to:
determining the number of word segmentation sequences in the word segmentation sequence subset in which the keyword appears as a first number;
determining the total number of word segmentation sequences in the word segmentation sequence subset as a second number;
and taking the ratio of the first number to the second number as the document frequency of the keyword.
The natural conversation topic analysis apparatus provided in this embodiment can execute the natural conversation topic analysis method provided in the method embodiments of the present disclosure, and has functional modules corresponding to the executed method and the corresponding beneficial effects.
Referring now to FIG. 9, shown is a schematic diagram of an electronic device 900 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 901 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are also stored. The processing apparatus 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication device 909 may allow the electronic apparatus 900 to perform wireless or wired communication with other apparatuses to exchange data. While fig. 9 illustrates an electronic device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing apparatus 901.
It should be noted that the computer readable medium described above in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the disclosed embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the disclosed embodiments, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquiring a plurality of natural conversation texts, segmenting any one of the natural conversation texts according to conversation roles to obtain one or more conversation units, and performing word segmentation processing on the text content of any one of the conversation units to obtain a word segmentation sequence;
aggregating and grouping the word segmentation sequence sets obtained according to the natural conversation texts into a plurality of word segmentation sequence subsets;
for any word segmentation sequence subset, extracting the segmentation with the document frequency greater than a specified frequency threshold value from the included segmentation as a core keyword to obtain a core keyword set corresponding to the word segmentation sequence subset;
for any participle sequence in any participle sequence subset, generating a core keyword word sequence character string corresponding to the participle sequence according to the contained core keywords and the order of appearance of the core keywords, and respectively calculating the Levenshtein distance between any two core keyword word sequence character strings;
determining the topic similarity of the participle sequence and the participle sequence subset to which it belongs according to the Levenshtein distance between the core keyword word sequence character string corresponding to any participle sequence and the core keyword word sequence character strings corresponding to other participle sequences in that subset, and determining the topic purity of the participle sequence subset according to the topic similarity of each participle sequence contained in any participle sequence subset;
and outputting an analysis result according to the topic purity of each word segmentation sequence subset contained in the word segmentation sequence set and the corresponding keyword set.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The foregoing description is only a description of preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure in the embodiments of the present disclosure is not limited to the particular combination of the above-described features, and also encompasses other technical solutions formed by any combination of the above-described features or their equivalents without departing from the scope of the present disclosure, for example technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (11)

1. A natural conversation topic analysis method, comprising:
acquiring a plurality of natural conversation texts, segmenting any one of the natural conversation texts according to conversation roles to obtain one or more conversation units, and performing word segmentation processing on the text content of any one of the conversation units to obtain a word segmentation sequence;
aggregating and grouping the word segmentation sequence sets obtained according to the natural conversation texts into a plurality of word segmentation sequence subsets;
for any word segmentation sequence subset, extracting the segmentation with the document frequency greater than a specified frequency threshold value from the included segmentation as a core keyword to obtain a core keyword set corresponding to the word segmentation sequence subset;
for any participle sequence in any participle sequence subset, generating a core keyword word sequence character string corresponding to the participle sequence according to the contained core keywords and the order of appearance of the core keywords, and respectively calculating the Levenshtein distance between any two core keyword word sequence character strings;
determining the topic similarity of the participle sequence and the participle sequence subset to which it belongs according to the Levenshtein distance between the core keyword word sequence character string corresponding to any participle sequence and the core keyword word sequence character strings corresponding to other participle sequences in that subset, and determining the topic purity of the participle sequence subset according to the topic similarity of each participle sequence contained in any participle sequence subset;
and outputting an analysis result according to the topic purity of each word segmentation sequence subset contained in the word segmentation sequence set and the corresponding keyword set.
2. The method of claim 1, wherein grouping a collection of participle sequences from the plurality of natural dialog texts into a plurality of subsets of participle sequences comprises:
the participle sequence set is divided into at least one initial set, and the following operations are executed on any initial set through an independent processing process:
newly building a word segmentation sequence subset for the initial set as an existing word segmentation sequence subset, and taking a word segmentation sequence from the initial set to add to the newly built word segmentation sequence subset;
traversing each existing word segmentation sequence subset of the initial set for any word segmentation sequence in the initial set;
if the number of identical words between the word segmentation sequence and each word segmentation sequence contained in any existing word segmentation sequence subset is smaller than or equal to a preset number threshold, the next existing word segmentation sequence subset is traversed, and if the last existing word segmentation sequence subset has been traversed without a match, a new word segmentation sequence subset is created, added to the existing word segmentation sequence subsets, and the word segmentation sequence is added to the newly created subset;
if the number of identical words between the word segmentation sequence and each word segmentation sequence contained in any existing word segmentation sequence subset is greater than the preset number threshold, the word segmentation sequence is added to that existing word segmentation sequence subset;
and using the word segmentation sequence subset obtained according to the at least one initial set as the plurality of word segmentation sequence subsets.
3. The method of claim 2, wherein, if the word segmentation sequence set is divided into a plurality of initial sets, after extracting, for any word segmentation sequence subset, the segmented words with a document frequency greater than the specified frequency threshold from the included segmented words as the core keywords to obtain the core keyword set corresponding to the word segmentation sequence subset, the method further comprises:
and carrying out merging judgment on any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword sets, and if the merging is determined according to the judgment result, merging the two word segmentation sequence subsets until the number of the word segmentation sequence subsets is not changed any more.
4. The method according to claim 3, wherein the merging judgment of any two word segmentation sequence subsets according to the coincidence rate of the corresponding core keyword sets comprises:
acquiring the average number of elements of the core keyword sets corresponding to the two word segmentation sequence subsets and the number of elements in their intersection;
and determining whether the number of intersection elements divided by the average number of elements is greater than a preset coincidence rate threshold, and if so, determining to merge the two word segmentation sequence subsets.
5. The method of claim 1, wherein for any word segmentation sequence subset, extracting the word segmentation with the document frequency greater than a specified frequency threshold from the included word segmentation as the core keyword comprises:
and for any word segmentation sequence subset, extracting high-frequency word segmentation according to the word segmentation sequences contained in the word segmentation sequence subset to obtain a keyword word bag of the word segmentation sequence subset, and extracting keywords with document frequency greater than a specified frequency threshold value from the keyword word bag to serve as core keywords.
6. The method of claim 5, wherein the document frequency for any keyword in the keyword bag of the subset of word sequences is calculated by:
determining the number of word segmentation sequences in the word segmentation sequence subset in which the keyword appears as a first number;
determining the total number of word segmentation sequences in the word segmentation sequence subset as a second number;
and taking the ratio of the first quantity to the second quantity as the document frequency of the keyword.
7. The method of claim 1, wherein obtaining the plurality of natural dialog texts comprises:
a plurality of natural dialog texts are obtained in real time.
8. The method of claim 1, further comprising:
after the word segmentation processing is carried out on the text content of any dialogue unit to obtain a word segmentation sequence, the method further comprises the following steps: deleting the participles belonging to the stop word list in the participle result according to a preset stop word list; and/or
After the word segmentation processing is carried out on the text content of any dialogue unit to obtain a word segmentation sequence, the method further comprises the following steps: deleting the word segmentation sequence with the text length smaller than a preset text length threshold value; and/or
After determining the topic similarity of the segmentation sequence and the segmentation sequence subset to which it belongs according to the Levenshtein distance between the core keyword word order character string corresponding to any segmentation sequence and the core keyword word order character strings corresponding to other segmentation sequences in that subset, the method further comprises: removing word segmentation sequences with a topic similarity smaller than a preset similarity threshold from the word segmentation sequence subset; and/or
After determining the topic purity of the word segmentation sequence subset according to the topic similarity of each word segmentation sequence contained in any word segmentation sequence subset, the method further comprises the following steps: and removing word segmentation sequence subsets with the topic purity smaller than a preset topic purity threshold value from the word segmentation sequence sets.
9. A natural conversation topic analysis device, comprising:
the data acquisition and preprocessing module is used for acquiring a plurality of natural conversation texts, segmenting any one of the natural conversation texts according to a conversation role to obtain one or more conversation units, and performing word segmentation processing on the text content of any one of the conversation units to obtain a word segmentation sequence;
the topic aggregation module is used for aggregating and grouping the word segmentation sequence set obtained according to the natural conversation texts into a plurality of word segmentation sequence subsets;
the keyword extraction module is used for extracting the participles with the document frequency greater than a specified frequency threshold from the participles contained in any participle sequence subset as core keywords so as to obtain a core keyword set corresponding to the participle sequence subset;
the distance calculation module is used for generating, for any word segmentation sequence in any word segmentation sequence subset, a core keyword word sequence character string corresponding to the word segmentation sequence according to the contained core keywords and the occurrence order of each core keyword, and calculating the Levenshtein distance between any two core keyword word sequence character strings respectively;
the topic purity calculation module is used for determining the topic similarity of a participle sequence and the participle sequence subset to which it belongs according to the Levenshtein distance between the core keyword word sequence character string corresponding to any participle sequence and the core keyword word sequence character strings corresponding to other participle sequences in that subset, and determining the topic purity of the participle sequence subset according to the topic similarity of each participle sequence contained in any participle sequence subset;
and the analysis result output module is used for outputting analysis results according to the topic purities of all the word segmentation sequence subsets contained in the word segmentation sequence set and the corresponding keyword set.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-8.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202011043378.1A 2020-09-28 2020-09-28 Natural dialogue topic analysis method, device, electronic equipment and storage medium Active CN112148872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011043378.1A CN112148872B (en) 2020-09-28 2020-09-28 Natural dialogue topic analysis method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112148872A true CN112148872A (en) 2020-12-29
CN112148872B CN112148872B (en) 2024-04-02

Family

ID=73895143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011043378.1A Active CN112148872B (en) 2020-09-28 2020-09-28 Natural dialogue topic analysis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112148872B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170084269A1 (en) * 2015-09-17 2017-03-23 Panasonic Intellectual Property Management Co., Ltd. Subject estimation system for estimating subject of dialog
CN106844344A (en) * 2017-02-06 2017-06-13 厦门快商通科技股份有限公司 For the contribution degree computational methods and subject extraction method and system talked with
US20190297039A1 (en) * 2017-06-15 2019-09-26 Google Llc Suggested items for use with embedded applications in chat conversations
CN111061877A (en) * 2019-12-10 2020-04-24 厦门市美亚柏科信息股份有限公司 Text theme extraction method and device
CN110874531A (en) * 2020-01-20 2020-03-10 湖南蚁坊软件股份有限公司 Topic analysis method and device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116610783A (en) * 2023-05-05 2023-08-18 衢州市艾思网络科技有限公司 Service optimization method based on artificial intelligent decision and digital online page system
CN116610783B (en) * 2023-05-05 2024-01-02 广东信佰工程监理有限公司 Service optimization method based on artificial intelligent decision and digital online page system

Also Published As

Publication number Publication date
CN112148872B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN109101620B (en) Similarity calculation method, clustering method, device, storage medium and electronic equipment
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
CN109684290B (en) Log storage method, device, equipment and computer readable storage medium
CN109933217B (en) Method and device for pushing sentences
CN111522967A (en) Knowledge graph construction method, device, equipment and storage medium
CN111767393A (en) Text core content extraction method and device
CN113282701B (en) Composition material generation method and device, electronic equipment and readable storage medium
CN112632285A (en) Text clustering method and device, electronic equipment and storage medium
WO2023115890A1 (en) Text quality cleaning method and apparatus, and medium
JP2022116231A (en) Training method of organism detection model, device, electronic apparatus and storage medium
CN112148872B (en) Natural dialogue topic analysis method, device, electronic equipment and storage medium
CN113011169A (en) Conference summary processing method, device, equipment and medium
CN111555960A (en) Method for generating information
CN113742332A (en) Data storage method, device, equipment and storage medium
CN116226533A (en) News associated recommendation method, device and medium based on association prediction model
CN113868508B (en) Writing material query method and device, electronic equipment and storage medium
CN111382258A (en) Method and device for determining electronic reading object chapter
CN116010545A (en) Data processing method, device and equipment
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN112819513B (en) Text chain generation method, device, equipment and medium
CN113342981A (en) Demand document classification method and device based on machine learning
CN110502630B (en) Information processing method and device
CN114218431A (en) Video searching method and device, electronic equipment and storage medium
CN112699687A (en) Content cataloging method and device and electronic equipment
CN111368553A (en) Intelligent word cloud picture data processing method, device, equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant