CN109657242B - Automatic eliminating system for Chinese redundancy meaning items - Google Patents


Info

Publication number
CN109657242B
Authority
CN
China
Prior art keywords
syn
fat
items
term
sense
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811542048.XA
Other languages
Chinese (zh)
Other versions
CN109657242A (en)
Inventor
符建辉
Current Assignee
Zhongke Guoli Zhenjiang Intelligent Technology Co ltd
Original Assignee
Zhongke Guoli Zhenjiang Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhongke Guoli Zhenjiang Intelligent Technology Co ltd filed Critical Zhongke Guoli Zhenjiang Intelligent Technology Co ltd
Priority to CN201811542048.XA priority Critical patent/CN109657242B/en
Publication of CN109657242A publication Critical patent/CN109657242A/en
Application granted granted Critical
Publication of CN109657242B publication Critical patent/CN109657242B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic elimination system for Chinese redundant sense items, comprising module A: labeling the sense items of the segmented training corpus TΓ and analyzing sense-item correlation; module B: eliminating redundant sense items by automatically detecting business-independent sense items; module C: eliminating redundant sense items by comparative analysis of multiple term near-classes; module D: eliminating redundant sense items by comparing term near-classes with term parent-classes. By means of artificial-intelligence techniques such as association analysis and statistical analysis, the invention provides an efficient system and method for automatically eliminating Chinese redundant sense items, thereby improving the accuracy and efficiency of Chinese sentence analysis.

Description

Automatic eliminating system for Chinese redundancy meaning items
Technical Field
The invention relates to the fields of Chinese language understanding, automatic text analysis, Chinese machine learning and the like, and in particular to a system for automatically eliminating Chinese redundant sense items.
Background
With the rapid development of artificial intelligence technology, industry demand for applications centered on natural language keeps growing stronger. In analyzing a natural language sentence there are two basic and important tasks: cutting the sentence into words, and labeling each resulting word with its sense items. The former task is called word segmentation for short, and the latter is called sense-item labeling.
In labeling the sense items of a natural language sentence (sentence for short) S, a common difficulty is how to label the sense items of the words in S accurately. The problem is more serious in a specific industry application, because in a specific industry most words carry several possible senses, compiled by different business personnel. For lack of a unified standard, it is quite common for one word in a sentence to be labeled with multiple sense items, some of which are irrelevant and thus become redundant sense items.
For example, the sentence S = "how to get a mobile phone card" yields two segmentation results: TS1, in which "mobile phone card" is segmented as a single word, and TS2, in which it is split into "mobile phone" and "card". Labeling their sense items may give: TS1 = mobile phone card{card near-class}{element parent-class}/how{how near-class}{query-word parent-class}/handle{handle near-class, office near-class}{}/; TS2 = mobile phone{mobile-phone near-class}{product parent-class, device parent-class, movie parent-class}/card{card near-class, cartoon near-class}{element parent-class}/how{how near-class}{query-word parent-class}/handle{handle near-class, office near-class}{}/, where "handle{handle near-class, office near-class}{}" means that the word "handle" is labeled with two sense items, the handle near-class and the office near-class. However, it is easy to determine that the sense of "handle" in TS1 and TS2 does not include the office near-class; the correct label should contain only the handle near-class. Redundant sense items not only reduce the analysis precision of Chinese sentences but also slow down their processing.
Although Chinese sense-item labeling and redundant-sense-item elimination have been studied for many years, existing methods still suffer from two closely related problems:
(1) Low precision of sense-item labeling: correct understanding of a Chinese sentence depends on its sense-item labels; if those labels are wrong, the sentence will be misunderstood.
(2) Redundant sense-item labeling: when labeling a sentence, irrelevant sense items are often attached to its words, leaving some words with redundant sense items; the cause of this problem is that these words appear in different semantic types.
Disclosure of Invention
The technical problem the invention aims to solve is as follows: aiming at the low precision of sense-item labeling of Chinese sentences, redundant sense-item labeling, and similar problems, the invention provides an efficient system for automatically eliminating Chinese redundant sense items by means of artificial-intelligence techniques such as association analysis and statistical analysis, thereby improving the precision and efficiency of Chinese sentence analysis.
To solve the above problems, the invention adopts the following technical scheme. The automatic Chinese redundant-sense-item elimination system comprises the following modules:
module A: labeling the sense items of the segmented training corpus TΓ and analyzing sense-item correlation;
module B: eliminating redundant sense items by automatically detecting business-independent sense items;
module C: eliminating redundant sense items by comparative analysis of multiple term near-classes;
module D: eliminating redundant sense items by comparing term near-classes with term parent-classes.
The implementation steps of module A are as follows: let the segmented training corpus be TΓ = {TS_1, TS_2, ..., TS_n}, where each TS_i (1 ≤ i ≤ n) has the form TS_i = t_i1{}{}/t_i2{}{}/.../t_ij{}{}/.../t_ik{}{}/; introduce a sense set sense_set, which is a set, initially empty; for each TS_i in TΓ and each t_ij{}{} in TS_i, perform the following steps:
Step A-1: look up in the near-class dictionary the term near-classes to which t_ij belongs, store them in the set t_ij_syn, and insert t_ij_syn into the first braces of t_ij{}{}, forming t_ij{t_ij_syn}{};
Step A-2: sense_set = sense_set ∪ t_ij_syn;
Step A-3: look up in the parent-class dictionary the term parent-classes to which t_ij belongs, store them in the set t_ij_fat, and insert t_ij_fat into the second braces of t_ij{t_ij_syn}{}, forming t_ij{t_ij_syn}{t_ij_fat};
Step A-4: sense_set = sense_set ∪ t_ij_fat.
The implementation steps of module B are as follows:
Step B-1: for any term near-class or term parent-class sf in sense_set, the support set of sf in Γ is computed and denoted supp_set(Γ, sf), i.e. supp_set(Γ, sf) = {S | S ∈ Γ and S contains at least one term of sf};
Step B-2: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [first support-set condition, given only as a formula image in the original] and [second support-set condition, also given as a formula image], then delete t_ij_syn_2 from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat}, i.e. t_ij_syn_2 is a redundant sense item of t_ij;
Step B-3: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_fat_1 and t_ij_fat_2 of t_ij_fat: if [the corresponding support-set conditions, given as formula images], then delete t_ij_fat_2 from t_ij_fat in t_ij{t_ij_syn}{t_ij_fat}, i.e. t_ij_fat_2 is a redundant sense item of t_ij.
The implementation steps of module C are as follows:
Step C-1: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [support-set condition, given only as a formula image in the original], then delete t_ij_syn_1 from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat}, i.e. t_ij_syn_1 is a redundant sense item of t_ij;
Step C-2: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [first condition, given as a formula image], then the following is performed: if [second condition, given as a formula image], then delete t_ij_syn_2 from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat} (i.e. t_ij_syn_2 is a redundant sense item of t_ij).
The implementation method of module D is as follows:
for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, for any element t_ij_syn_1 of t_ij_syn and any element t_ij_fat_1 of t_ij_fat: if [containment condition, given only as a formula image in the original] and [support-set condition, also given as a formula image], then delete t_ij_fat_1 from t_ij_fat in t_ij{t_ij_syn}{t_ij_fat}.
Beneficial effects: the invention provides a system and method for automatically eliminating Chinese redundant sense items. Five Chinese application scenarios were selected, including package consultation, activity consultation, fault consultation, weather consultation and flight consultation, and 10000 Chinese sentences were collected for a test of automatic elimination of Chinese redundant sense items. Then 1000 test results were checked to examine the effect of the invention on automatic elimination of redundant sense items. The results show that up to 88.1% of redundant sense items were accurately eliminated, i.e. the automatic elimination precision for redundant sense items reaches 88.1%.
Drawings
FIG. 1 is a workflow diagram of a system and method for automatic elimination of Chinese redundancy meaning items.
Detailed Description
To state the invention more clearly, several important terms are introduced and explained below:
(1) Term near-class, term parent-class, term sense item: generally, any term has synonyms or near-synonyms. In the invention a set is used to store the synonymous or near-sense terms of a term; for example, for the word "handle", the handle near-class = {handle, transact, ...} represents the terms synonymous with or close to "handle". As a concept, most terms also have lower (more specific) terms. For example, "product" is a term that denotes a concept, and its lower terms include mobile phone, refrigerator, washing machine, etc. For this reason, the invention uses a set to store the lower terms of a term: for "product", the product parent-class = {mobile phone, refrigerator, washing machine, ...} represents the lower terms of "product", and the local-bank parent-class = {Beijing Bank, Beijing City, Nanjing Bank, Nanjing, Ningbo Bank, Ningbo, ...} represents the lower terms of "local bank". "Term sense item" is the generic name covering term near-classes and term parent-classes; for example, the handle near-class is a term sense item, and the product parent-class is a term sense item.
(2) Near-class dictionary, parent-class dictionary: a near-class dictionary is a set of tuples (term, term near-class). For example, near-class dictionary = {(handle, {handle, transact, ...}), ...}. Similarly, a parent-class dictionary is a set of tuples (term, term parent-class). For example, parent-class dictionary = {(product, {mobile phone, refrigerator, washing machine, ...}), (local bank, {Beijing Bank, Beijing City, Nanjing Bank, Nanjing, Ningbo Bank, Ningbo, ...})}.
(3) Word-segmentation dictionary, word segmentation: a word-segmentation dictionary is the set of terms formed by the words appearing in the near-class dictionary and the parent-class dictionary. For the two dictionaries in (2) above, the word-segmentation dictionary they form = {handle, transact, product, mobile phone, refrigerator, washing machine, Beijing Bank, Beijing City, Nanjing Bank, Nanjing, Ningbo Bank, Ningbo, ...}. Word segmentation is the process of cutting a sentence S into words using the words of the word-segmentation dictionary. For example, S = "how to get a mobile phone card" is segmented into TS = "mobile phone card{}{}/how{}{}/handle{}{}/", where "{}" indicates that the sense items of the term are still undetermined and therefore an empty set, and "/" is the separator between segmented terms.
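The patent does not specify the matching strategy used for segmentation; a greedy longest-match scan over a word-segmentation dictionary is one common choice, sketched below with a toy dictionary (the sentence and entries are illustrative, not the patent's actual data):

```python
def segment(sentence, lexicon):
    """Greedy longest-match word segmentation: at each position, take the
    longest dictionary term that matches, then emit every token followed by
    two empty brace slots, as in the patent's t{}{}/ notation.
    Single characters not in the lexicon are emitted as-is (fallback)."""
    max_len = max(len(t) for t in lexicon)
    tokens, i = [], 0
    while i < len(sentence):
        for L in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + L]
            if cand in lexicon or L == 1:
                tokens.append(cand)
                i += L
                break
    return "/".join(f"{t}{{}}{{}}" for t in tokens) + "/"

# toy word-segmentation dictionary for "how to get a mobile phone card"
lexicon = {"手机卡", "怎么", "办理"}
print(segment("手机卡怎么办理", lexicon))  # 手机卡{}{}/怎么{}{}/办理{}{}/
```

Longest-match is used here so that "mobile phone card" wins over "mobile phone" + "card" when both are in the dictionary; the patent itself keeps both segmentations as separate candidates.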
(4) Training corpus, segmented training corpus: the training corpus Γ = {S_1, S_2, ..., S_n} is a set of Chinese sentences, where each S_i (1 ≤ i ≤ n) is a Chinese sentence. The segmented training corpus TΓ = {TS_1, TS_2, ..., TS_n} is the set obtained by word segmentation, in which TS_i (1 ≤ i ≤ n) is the segmented string of S_i.
(5) Set intersection, union, difference and cardinality: given two sets S_1 and S_2, the intersection of S_1 and S_2, denoted S_1 ∩ S_2, is the set of elements occurring in both S_1 and S_2; the union, denoted S_1 ∪ S_2, is the set of elements occurring in S_1 or in S_2; the difference, denoted S_1 \ S_2, is the set of elements occurring in S_1 but not in S_2. For a set S, |S| is its cardinality, whose value is the number of elements in S.
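In Python, the four operations defined above map directly onto the built-in set type (the bank names are illustrative values only):

```python
S1 = {"北京银行", "南京银行", "宁波银行"}
S2 = {"北京银行", "招商银行"}

inter = S1 & S2   # intersection S1 ∩ S2
union = S1 | S2   # union S1 ∪ S2
diff = S1 - S2    # difference S1 \ S2
card = len(S1)    # cardinality |S1|

print(inter)  # {'北京银行'}
```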
The invention is described in further detail below with reference to FIG. 1 and the detailed description. The system for automatically eliminating Chinese redundant sense items is divided into four modules, each realized through several specific method steps. The functions and core methods of the respective modules are explained in detail below.
Module A: sense-item labeling and sense-item correlation analysis for the segmented training corpus TΓ
Without loss of generality, assume the segmented training corpus TΓ = {TS_1, TS_2, ..., TS_n}, where each TS_i (1 ≤ i ≤ n) has the form TS_i = t_i1{}{}/t_i2{}{}/.../t_ik{}{}/.
A sense set sense_set is introduced; it is a set, initially empty, for storing the term near-classes and term parent-classes involved in TΓ.
For each TS_i in TΓ and each t_ij{}{} in TS_i, perform the following steps:
Step A-1: look up in the near-class dictionary the term near-classes to which t_ij belongs, store them in the set t_ij_syn, and insert t_ij_syn into the first braces of t_ij{}{}, forming t_ij{t_ij_syn}{}.
Step A-2: sense_set = sense_set ∪ t_ij_syn.
Step A-3: look up in the parent-class dictionary the term parent-classes to which t_ij belongs, store them in the set t_ij_fat, and insert t_ij_fat into the second braces of t_ij{t_ij_syn}{}, forming t_ij{t_ij_syn}{t_ij_fat}.
Step A-4: sense_set = sense_set ∪ t_ij_fat.
For example, for TS = "mobile phone card{}{}/how{}{}/handle{}{}/", after step A-1, TS = "mobile phone card{card near-class}{}/how{how near-class}{}/handle{handle near-class, office near-class}{}/"; after step A-3, TS = "mobile phone card{card near-class}{element parent-class}/how{how near-class}{query-word parent-class}/handle{handle near-class, office near-class}{}/". Note that {} means there is no corresponding term near-class or term parent-class; e.g. "handle{handle near-class, office near-class}{}" means "handle" belongs to two term near-classes, the handle near-class and the office near-class, but to no term parent-class.
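Steps A-1 through A-4 can be sketched directly in Python; the dictionary contents below are hypothetical stand-ins for a real near-class and parent-class dictionary, not data from the patent:

```python
def label_senses(tokens, syn_dict, fat_dict):
    """Module A sketch: attach to each segmented token its term near-classes
    (first brace slot) and term parent-classes (second slot), accumulating
    every sense item into sense_set (steps A-1 through A-4)."""
    sense_set = set()
    labeled = []
    for t in tokens:
        t_syn = frozenset(syn_dict.get(t, ()))  # A-1: near-classes of t
        sense_set |= t_syn                      # A-2: accumulate into sense_set
        t_fat = frozenset(fat_dict.get(t, ()))  # A-3: parent-classes of t
        sense_set |= t_fat                      # A-4: accumulate into sense_set
        labeled.append((t, t_syn, t_fat))
    return labeled, sense_set

# hypothetical dictionary fragments for the running example
syn_dict = {"办理": {"办理近类", "办公近类"}, "怎么": {"怎么近类"}, "手机卡": {"卡近类"}}
fat_dict = {"手机卡": {"元素父类"}, "怎么": {"疑问词父类"}}

labeled, sense_set = label_senses(["手机卡", "怎么", "办理"], syn_dict, fat_dict)
```

After this call, the token "办理" carries two near-classes and an empty parent-class slot, matching the handle{handle near-class, office near-class}{} pattern above.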
Module B: eliminating redundant sense items by automatically detecting business independent sense items
Denote the training corpus after processing by module A as TΓ = {TS_1, TS_2, ..., TS_n}, where TS_i = t_i1{t_i1_syn}{t_i1_fat}/t_i2{t_i2_syn}{t_i2_fat}/.../t_ij{t_ij_syn}{t_ij_fat}/.../t_ik{t_ik_syn}{t_ik_fat}/.
Since the corpus Γ is drawn from one or more related specific businesses, the specific business to which Γ belongs, such as mobile customer service, aviation customer service or financial service, can be determined from Γ. However, after module A generates TΓ, some business-independent sense items enter TΓ. For example, in TS = "mobile phone card{card near-class}{element parent-class}/how{how near-class}{query-word parent-class}/handle{handle near-class, office near-class}{}/" given by module A, the element parent-class and the office near-class are business-independent sense items and need to be deleted from TS.
After module A, sense_set contains the term near-classes and term parent-classes involved in TΓ; note that each term near-class or term parent-class is itself a set of terms.
The specific implementation method of module B is as follows:
Step B-1: for any term near-class or term parent-class sf in sense_set, compute supp_set(Γ, sf) = {S | S ∈ Γ and S contains at least one term of sf}.
Step B-2: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [first support-set condition, given only as a formula image in the original] (β is a parameter; experiments found that β = 0.001 achieves the best effect, so β = 0.001 is adopted in the invention) and [second support-set condition, also given as a formula image] (α is a parameter; experiments found that α = 0.3 achieves the best effect, so α = 0.3 is adopted in the invention), then delete t_ij_syn_2 from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat} (i.e. t_ij_syn_2 is a redundant sense item of t_ij).
Step B-3: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_fat_1 and t_ij_fat_2 of t_ij_fat: if [the corresponding support-set conditions, given as formula images], then delete t_ij_fat_2 from t_ij_fat in t_ij{t_ij_syn}{t_ij_fat} (i.e. t_ij_fat_2 is a redundant sense item of t_ij).
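Step B-1's support set, and a filter in the spirit of step B-2, can be sketched in Python. The exact inequalities of the patent survive only as formula images, so the rule below is an assumed reading (drop a sense item whose support is both rare relative to the corpus and much weaker than a competitor's); all data and thresholds in the example are hypothetical:

```python
def supp_set(corpus, sf):
    """Step B-1: the sentences of corpus containing at least one term of the
    sense item sf (a set of terms); substring matching, suitable for Chinese."""
    return {s for s in corpus if any(term in s for term in sf)}

def prune_rare_senses(sense_items, corpus, alpha=0.3, beta=0.001):
    """Assumed reading of step B-2: a sense item s2 is dropped when its
    relative support frequency is below beta AND its support is below alpha
    times that of a competing sense item s1. Not the patent's exact test."""
    kept = list(sense_items)  # each item: frozenset of terms
    for s1 in sense_items:
        for s2 in sense_items:
            if s1 == s2 or s2 not in kept or s1 not in kept:
                continue
            sup1 = len(supp_set(corpus, s1))
            sup2 = len(supp_set(corpus, s2))
            if sup1 > 0 and sup2 / len(corpus) < beta and sup2 / sup1 < alpha:
                kept.remove(s2)
    return kept

# toy corpus and competing near-classes for the word 办 (handle vs. office)
corpus = ["办理手机卡", "怎么办理", "办理套餐", "办公地点"]
handle_syn = frozenset(["办理"])
office_syn = frozenset(["办公"])
kept = prune_rare_senses([handle_syn, office_syn], corpus, alpha=0.6, beta=0.5)
```

With these toy thresholds the office near-class (support 1 of 4 sentences) is pruned while the handle near-class (support 3) survives; on a realistic 100000-sentence corpus the much smaller β = 0.001 would play the same role.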
Module C: eliminating redundant sense items by comparative analysis of multiple term proximity
Although module B can quickly delete some redundant sense items, some redundancy remains implicit between pairs of term near-classes. For this purpose, multiple term near-classes need to be compared and analyzed to eliminate redundant sense items.
The specific implementation method of module C is as follows:
Step C-1: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [support-set condition, given only as a formula image in the original] (δ is a parameter; experiments found that δ = 0.3 achieves the best effect, so δ = 0.3 is adopted in the invention), then delete t_ij_syn_1 from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat} (i.e. t_ij_syn_1 is a redundant sense item of t_ij).
Step C-2: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [first condition, given as a formula image] (γ is a parameter; experiments found that γ = 0.7 achieves the best effect, so γ = 0.7 is adopted in the invention), then the following is performed: if [second condition, given as a formula image] (α is a parameter; α = 0.3 is adopted as in module B), then delete t_ij_syn_2 from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat} (i.e. t_ij_syn_2 is a redundant sense item of t_ij).
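Module C's pairwise comparison can be illustrated with a small helper over precomputed support-set sizes. Since the patent's inequalities exist only as images, both the direction of each test and the single δ threshold below are assumptions, not the patent's actual conditions:

```python
def redundant_in_pair(sup1, sup2, delta=0.3):
    """Assumed reading of steps C-1/C-2: between two competing near-classes
    with support-set sizes sup1 and sup2, the one whose support is dominated
    by the other's (ratio below delta) is flagged as the redundant item."""
    if sup2 > 0 and sup1 / sup2 < delta:
        return "syn1"  # C-1 analogue: first near-class is redundant
    if sup1 > 0 and sup2 / sup1 < delta:
        return "syn2"  # C-2 analogue: second near-class is redundant
    return None        # comparable support: keep both, let module D decide

print(redundant_in_pair(2, 40))   # first item dominated
print(redundant_in_pair(40, 2))   # second item dominated
print(redundant_in_pair(30, 25))  # neither dominates
```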
Module D: eliminating redundant sense items by comparing term proximity to term parent
In practice it may happen that a term near-class is contained in a term parent-class, or that most members (e.g. 90%) of a term near-class belong to a term parent-class. For example, Beijing near-class = {Beijing, Beijing City} and local-bank parent-class = {Beijing Bank, Beijing City, Nanjing Bank, Nanjing, Ningbo Bank, Ningbo, ...}, where the city names Beijing, Nanjing, Ningbo, etc. are short for the corresponding Beijing Bank, Nanjing Bank and Ningbo Bank. In this case the Beijing near-class is largely contained in the local-bank parent-class (the precise relation is given as a formula image in the original).
For the above reasons, TS = "Beijing phone card" becomes, after word segmentation and processing by module A, TS = "Beijing{Beijing near-class}{local-bank parent-class}/phone card{card near-class}{element parent-class}/", but the local-bank parent-class in "Beijing{Beijing near-class}{local-bank parent-class}" is not a sense item of "Beijing".
The specific implementation method of module D is as follows: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, for any element t_ij_syn_1 of t_ij_syn and any element t_ij_fat_1 of t_ij_fat: if [containment condition, given only as a formula image in the original] (γ is a parameter; experiments found that γ = 0.7 achieves the best effect, so γ = 0.7 is adopted in the invention) and [support-set condition, also given as a formula image] (α is a parameter; α = 0.3 is adopted as in module B), then delete t_ij_fat_1 from t_ij_fat in t_ij{t_ij_syn}{t_ij_fat}.
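The containment test behind module D can be sketched as a membership ratio. The γ-threshold deletion rule below is an assumed reading of the image-only condition, and the sets are illustrative reconstructions of the Beijing example:

```python
def containment_ratio(syn, fat):
    """Fraction of the near-class syn whose members also occur in the
    parent-class fat; 1.0 means the near-class is fully contained."""
    return len(set(syn) & set(fat)) / len(syn) if syn else 0.0

beijing_syn = {"北京", "北京市"}
local_bank_fat = {"北京银行", "北京市", "北京", "南京银行", "南京", "宁波银行", "宁波"}

ratio = containment_ratio(beijing_syn, local_bank_fat)
# Assumed module-D rule: when the ratio exceeds gamma (0.7 in the patent),
# the parent-class is deleted from the token's label as a redundant sense item.
is_redundant = ratio >= 0.7
```

Here the Beijing near-class is fully contained in the local-bank parent-class (ratio 1.0), so the local-bank parent-class would be dropped from the label of "Beijing", matching the example in the description.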
Experimental effect
The invention provides a system and method for automatically eliminating Chinese redundant sense items. Five Chinese application scenarios were selected, including package consultation, activity consultation, fault consultation, weather consultation and flight consultation, and 100000 Chinese sentences were collected for the automatic-elimination test. The parameters α, γ and δ in modules B, C and D were tested in groups over the range 0.1 to 1, and the parameter β over the range 0.1 to 0.01 with a step of 0.00005. Then 5000 test results were inspected and analyzed manually. The results show that with α = 0.3, β = 0.001, γ = 0.7 and δ = 0.3 the invention accurately eliminates 88.1% of redundant sense items, i.e. the automatic elimination precision for redundant sense items reaches 88.1%. The invention therefore not only has important theoretical value but also plays an important role in practical Chinese sentence processing applications.

Claims (1)

1. An automatic elimination system for Chinese redundant sense items, characterized by comprising the following modules:
module A: labeling the sense items of the segmented training corpus TΓ and analyzing sense-item correlation;
module B: eliminating redundant sense items by automatically detecting business-independent sense items;
module C: eliminating redundant sense items by comparative analysis of multiple term near-classes;
module D: eliminating redundant sense items by comparing term near-classes with term parent-classes;
the implementation steps of module A are as follows: the segmented training corpus TΓ = {TS_1, TS_2, ..., TS_n}, where each TS_i (1 ≤ i ≤ n) has the form TS_i = t_i1{}{}/t_i2{}{}/.../t_ij{}{}/.../t_ik{}{}/ (1 ≤ j ≤ n); a sense set sense_set is introduced, which is a set, initially empty; for each TS_i in TΓ and for each t_ij{}{} in TS_i, the following steps are performed:
step A-1: look up in the near-class dictionary the term near-classes to which t_ij belongs, store them in the set t_ij_syn, and insert t_ij_syn into the first braces of t_ij{}{}, forming t_ij{t_ij_syn}{};
step A-2: sense_set = sense_set ∪ t_ij_syn;
step A-3: look up in the parent-class dictionary the term parent-classes to which t_ij belongs, store them in the set t_ij_fat, and insert t_ij_fat into the second braces of t_ij{t_ij_syn}{}, forming t_ij{t_ij_syn}{t_ij_fat};
step A-4: sense_set = sense_set ∪ t_ij_fat;
the implementation steps of module B are as follows:
step B-1: for any term near-class or term parent-class sf in sense_set, the support set of sf in Γ is computed and denoted supp_set(Γ, sf), i.e. supp_set(Γ, sf) = {S | S ∈ Γ and S contains at least one term of sf};
step B-2: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [first support-set condition, given only as a formula image in the original] and [second support-set condition, also given as a formula image], then t_ij_syn_2 is deleted from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat}, i.e. t_ij_syn_2 is a redundant sense item of t_ij;
step B-3: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_fat_1 and t_ij_fat_2 of t_ij_fat: if [the corresponding support-set conditions, given as formula images], then t_ij_fat_2 is deleted from t_ij_fat in t_ij{t_ij_syn}{t_ij_fat}, i.e. t_ij_fat_2 is a redundant sense item of t_ij;
the implementation steps of module C are as follows:
step C-1: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [support-set condition, given as a formula image], then t_ij_syn_1 is deleted from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat}, i.e. t_ij_syn_1 is a redundant sense item of t_ij;
step C-2: for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, and for any two elements t_ij_syn_1 and t_ij_syn_2 of t_ij_syn: if [first condition, given as a formula image], then the following is performed: if [second condition, given as a formula image], then t_ij_syn_2 is deleted from t_ij_syn in t_ij{t_ij_syn}{t_ij_fat};
the implementation method of module D is as follows:
for each TS_i in TΓ, for any t_ij{t_ij_syn}{t_ij_fat} in TS_i, for any element t_ij_syn_1 of t_ij_syn and any element t_ij_fat_1 of t_ij_fat: if [containment condition, given as a formula image] and [support-set condition, given as a formula image], then t_ij_fat_1 is deleted from t_ij_fat in t_ij{t_ij_syn}{t_ij_fat}.
CN201811542048.XA 2018-12-17 2018-12-17 Automatic eliminating system for Chinese redundancy meaning items Active CN109657242B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811542048.XA CN109657242B (en) 2018-12-17 2018-12-17 Automatic eliminating system for Chinese redundancy meaning items


Publications (2)

Publication Number Publication Date
CN109657242A CN109657242A (en) 2019-04-19
CN109657242B true CN109657242B (en) 2023-05-05

Family

ID=66113768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811542048.XA Active CN109657242B (en) 2018-12-17 2018-12-17 Automatic eliminating system for Chinese redundancy meaning items

Country Status (1)

Country Link
CN (1) CN109657242B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
CN108073570A (en) * 2018-01-04 2018-05-25 焦点科技股份有限公司 A kind of Word sense disambiguation method based on hidden Markov model
CN108256030A (en) * 2017-12-29 2018-07-06 北京理工大学 A kind of degree adaptive Concept Semantic Similarity computational methods based on ontology
CN108446269A (en) * 2018-03-05 2018-08-24 昆明理工大学 A kind of Word sense disambiguation method and device based on term vector
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070106657A1 (en) * 2005-11-10 2007-05-10 Brzeski Vadim V Word sense disambiguation


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Word vs. Class-Based Word Sense Disambiguation"; Ruben Izquierdo et al.; Journal of Artificial Intelligence Research; 2015-09-15; pp. 83-122 *
"Research on Semantic Disambiguation in Natural Language Processing" (自然语言处理中的语义消歧研究); Jia Yuanyuan (贾媛媛); Journal of Huainan Normal University (淮南师范学院学报); 2013-09-15; pp. 108-110 *

Also Published As

Publication number Publication date
CN109657242A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
US10657325B2 (en) Method for parsing query based on artificial intelligence and computer device
CN107766371B (en) Text information classification method and device
CN108228825B (en) A kind of station address data cleaning method based on participle
EP3051432A1 (en) Semantic information acquisition method, keyword expansion method thereof, and search method and system
CN110427487B (en) Data labeling method and device and storage medium
CN107194617B (en) App software engineer soft skill classification system and method
CN101446942A (en) Semantic character labeling method of natural language sentence
CN109471793A (en) A kind of webpage automatic test defect positioning method based on deep learning
CN109740159B (en) Processing method and device for named entity recognition
CN110717040A (en) Dictionary expansion method and device, electronic equipment and storage medium
CN105608113B (en) Judge the method and device of POI data in text
CN111274814A (en) Novel semi-supervised text entity information extraction method
CN114297987B (en) Document information extraction method and system based on text classification and reading understanding
CN110175585A (en) It is a kind of letter answer correct system and method automatically
CN109145071B (en) Automatic construction method and system for geophysical field knowledge graph
CN110909123A (en) Data extraction method and device, terminal equipment and storage medium
CN111159356A (en) Knowledge graph construction method based on teaching content
CN110263331A (en) A kind of English-Chinese semanteme of word similarity automatic testing method of Knowledge driving
CN113407644A (en) Enterprise industry secondary industry multi-label classifier based on deep learning algorithm
CN110377695A (en) A kind of public sentiment subject data clustering method, device and storage medium
CN111079384B (en) Identification method and system for forbidden language of intelligent quality inspection service
CN111177401A (en) Power grid free text knowledge extraction method
CN113010593B (en) Event extraction method, system and device for unstructured text
CN109657242B (en) Automatic eliminating system for Chinese redundancy meaning items
CN101071421A (en) Chinese word cutting method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant