CN113033187A

CN113033187A - Method for establishing iterative corpus

Info

Publication number: CN113033187A
Application number: CN201911355650.7A
Authority: CN
Inventors: 陈滨; 梁晓斌
Original assignee: Xiamen Armored Network Co ltd
Current assignee: Xiamen Armored Network Co ltd
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2021-06-25
Anticipated expiration: 2039-12-25
Also published as: CN113033187B

Abstract

The invention discloses a method for establishing an iterative corpus, comprising the following steps: establishing a lexicon for iterative use based on a Chinese dictionary; loading the text of existing popular novels, drama, frets and TV drama from a network and collecting corpus entries from it; performing word segmentation and part-of-speech judgment on the entries and, using the context categories of the words in the lexicon, forming multiple new entries by iterative combination, so that the corpus is established and supplemented automatically; reducing or expanding the sentences of the entries to form expanded entries; and opening the corpus to the public for a public beta test, then optimizing and deleting entries according to their selection frequency or response records, thereby establishing the iterative corpus. The invention uses an iteratively built lexicon to analyze and differentiate the corpus entries and then iteratively combines, replaces, expands, and extends them, which rapidly speeds up corpus collection and saves a large amount of manual processing.

Description

Method for establishing iterative corpus

Technical Field

The invention belongs to the technical field of corpus establishment, and in particular relates to a method for establishing an iterative corpus.

Background

A corpus is the basic resource for corpus linguistics research and the main resource for empirical approaches to language research. Corpora are applied in lexicography, language teaching, traditional language research, and statistical or example-based research in natural language processing. Three basic points characterize a corpus: it stores language material that has actually occurred in real language use; it is a basic resource that carries language knowledge with the electronic computer as its medium; and the raw language material must be processed (analyzed and processed) before it becomes a useful resource. There are many types of corpora, and the main basis for determining the type is the research purpose and use, which is usually reflected in the principles and manner of collection. Corpora have been divided into four categories: heterogeneous, in which no specific collection principle is applied and all kinds of material are widely collected and stored in original form; homogeneous, in which only material of the same type of content is collected; systematic, in which material is collected according to predetermined principles and proportions, so that it is balanced and systematic and can represent the language facts within a certain range; and specialized, in which only material for a particular purpose is collected.

When an existing corpus is established, a large amount of manual labor is needed to enter the corpus entries and to set the situation of each entry. This wastes manpower and takes a long time, and entries are easily missed or the corpus lacks richness, so its practical value is low.

Disclosure of Invention

The invention aims to provide a method for establishing an iterative corpus, so as to solve the problems described in the background: in the prior art, corpus entry consumes a large amount of manpower, takes a long time, easily misses entries, and yields a corpus that is not rich.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for establishing an iterative corpus comprises the following steps:

S1, establishing a lexicon for iterative use based on a Chinese dictionary, wherein near-synonyms, synonyms, antonyms, a brief definition, and a context category are established separately for each word in the lexicon;

S2, loading from the network the text of existing popular novels, dramas, stories, merchantry, TV series, and character chat content; identifying the dialogue content through paragraph, part-of-speech, and punctuation recognition; taking each sentence of the dialogue content as a corpus entry; and taking three sentences each of the preceding and following emotion and scene text as the situational response content of that entry;

S3, performing word segmentation and part-of-speech judgment on each corpus entry using a word segmentation method; identifying the adjectives, verbs, and nouns contained in the entry by their part of speech; then, according to the context category of each word in the S1 lexicon, looking up its near-synonyms, antonyms, and brief definition; and forming multiple new entries by iterative combination, so that corpus entries are established and supplemented automatically;

S4, processing the situational response content collected in S2 according to the procedure of S3, to obtain entries that match the situations of the entries produced in S3;

S5, sentence reduction and expansion: according to the sentence reduction and expansion rules of Chinese grammar, and using part-of-speech recognition, reducing or expanding the entries from S3 and S4; iteratively deleting interjections, adjectives, and place and time nouns from an entry to obtain a short reduced sentence; and iteratively selecting adjectives and place and time nouns to add to an entry to form an expanded entry;

S6, public beta application: after a large body of corpus content has been built through S3, S4, and S5, opening the corpus to the public for a public beta test; during the test, under the same context, iteratively retaining or deleting entries that are similar to, or respond to, the entries formed in S3, S4, and S5, according to how often each is selected or its recorded responses; and then continuing to complete the corpus iteratively during use, thereby establishing the iterative corpus.
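The iterative combination in S3 can be pictured as substituting lexicon alternatives for each word and taking every combination. The following is a minimal sketch under stated assumptions, not the patented implementation: the `LEXICON` contents and the function name `expand_corpus` are hypothetical, the input is assumed to be already segmented, and English tokens are used only for readability.

```python
from itertools import product

# Hypothetical miniature lexicon: word -> alternatives in the same context category.
LEXICON = {
    "happy": ["glad", "pleased"],
    "house": ["home"],
}

def expand_corpus(tokens, lexicon=LEXICON):
    """Return every sentence formed by substituting lexicon alternatives for each token."""
    # Each position offers the original token plus any alternatives found in the lexicon.
    choices = [[tok] + lexicon.get(tok, []) for tok in tokens]
    # The Cartesian product enumerates all combinations; the first one is the original sentence.
    return [" ".join(combo) for combo in product(*choices)]

variants = expand_corpus(["I", "am", "happy", "at", "my", "house"])
```

Because the number of variants is the product of the choices at each position, a real system would presumably cap or filter the output; the text leaves that filtering to the public test in S6.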

Preferably, in the process of building the lexicon in S1, the words are divided into major categories by part of speech, or into category lists by the number of characters in the word; when near-synonyms or synonyms are established, numbering is performed by combining lists, so that words with the same meaning share one overall number while each word also has its own unique number.
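One way to read this numbering scheme is that every word carries two identifiers: a number shared by all words of the same meaning, and a number unique to the word. The sketch below illustrates that reading; the class names, field names, and the `add_group` and `synonyms` methods are all hypothetical.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class LexiconEntry:
    word: str
    word_id: int     # unique to this word
    group_id: int    # shared by all words of the same meaning
    pos: str         # part-of-speech category
    context: str = ""

class Lexicon:
    def __init__(self):
        self._word_ids = count(1)
        self._group_ids = count(1)
        self.entries = {}

    def add_group(self, words, pos, context=""):
        """Register words of the same meaning under one shared group number."""
        group_id = next(self._group_ids)
        for w in words:
            self.entries[w] = LexiconEntry(w, next(self._word_ids), group_id, pos, context)
        return group_id

    def synonyms(self, word):
        """Return the other words that share this word's group number."""
        gid = self.entries[word].group_id
        return [e.word for e in self.entries.values() if e.group_id == gid and e.word != word]

lex = Lexicon()
lex.add_group(["happy", "glad"], "adjective")
lex.add_group(["sad"], "adjective")
```

With the shared number in place, looking up synonyms becomes a filter on `group_id` rather than a search over word lists.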

Preferably, in the collection process of S2, punctuation recognition includes recognizing colons and quotation marks and identifying the dialogue content expressed within them; in long dialogues, the character count of each paragraph is checked, and if three consecutive paragraphs of no more than 30 characters each appear, they are judged to be dialogue paragraphs and are collected as corpus entries.
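The two heuristics described here, quoted text after a colon and runs of three short paragraphs, can be sketched as follows. This is an illustration only: the regular expression, the function names, and the choice to count characters rather than words are assumptions, and real text would need more punctuation variants.

```python
import re

# Text inside Chinese or ASCII double quotes, optionally preceded by a colon.
QUOTE_RE = re.compile(r'[:：]?\s*[“"]([^”"]+)[”"]')

def quoted_dialogue(paragraph):
    """Return the quoted spans found in one paragraph."""
    return QUOTE_RE.findall(paragraph)

def short_run_dialogue(paragraphs, max_chars=30, run_length=3):
    """Return paragraphs belonging to a run of at least run_length consecutive short paragraphs."""
    result, run = [], []
    for p in list(paragraphs) + [None]:  # the None sentinel flushes the final run
        if p is not None and 0 < len(p) <= max_chars:
            run.append(p)
        else:
            if len(run) >= run_length:
                result.extend(run)
            run = []
    return result
```

For example, `short_run_dialogue(["Where?", "Here.", "Now?", "x" * 40, "Ok."])` keeps the first three paragraphs and discards the isolated short one at the end.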

Preferably, when part-of-speech recognition is performed in S3, attention is also paid to recognizing interjections, modal words, and auxiliary words, and these are used as expansion words in S5.

Preferably, in the processing of S5, the entry before sentence reduction is collected as an expanded entry of the reduced entry, and likewise the entry before sentence expansion is collected as an expanded entry of the entry after expansion.
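Sentence reduction with the original kept as the paired expansion might look like the sketch below. It assumes the entry has already been segmented and tagged; the tag names in `REMOVABLE` and both function names are hypothetical.

```python
# Part-of-speech tags assumed removable during sentence reduction (hypothetical tag names).
REMOVABLE = {"interjection", "adjective", "place", "time"}

def reduce_sentence(tagged):
    """Drop tokens whose tag is removable; tagged is a list of (word, tag) pairs."""
    return [(w, t) for w, t in tagged if t not in REMOVABLE]

def link_variants(tagged):
    """Return (reduced, original) so the original is stored as the expansion of the reduced form."""
    def join(tokens):
        return " ".join(w for w, _ in tokens)
    return join(reduce_sentence(tagged)), join(tagged)

tagged = [("Oh", "interjection"), ("the", "other"), ("red", "adjective"),
          ("car", "noun"), ("left", "verb"), ("yesterday", "time")]
reduced, original = link_variants(tagged)
```

Here `reduced` is `"the car left"` and `original` is the full sentence, so one collected sentence yields two linked entries.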

Preferably, the lexicon built in S1 also includes newly coined words, phrases, English words, and English abbreviations that are currently popular in various fields.

Preferably, during the public beta application of S6, a response ratio and a selection ratio are calculated for each entry that appears during iteration, using the following formulas:

P(response ratio) = T(number of responses) / T(number of iterations)

P(selection ratio) = T(number of selections) / T(number of iterations).
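Both ratios divide a per-entry counter by the number of iterations in which the entry appeared. A minimal sketch of that arithmetic follows; the `EntryStats` structure and its field names are hypothetical, and returning 0.0 for an entry that has never appeared is an assumption the text does not address.

```python
from dataclasses import dataclass

@dataclass
class EntryStats:
    """Counters for one corpus entry during the public test (hypothetical structure)."""
    iterations: int = 0   # times the entry appeared
    responses: int = 0    # times it drew a response
    selections: int = 0   # times it was selected

    def response_ratio(self):
        # P(response ratio) = T(responses) / T(iterations)
        return self.responses / self.iterations if self.iterations else 0.0

    def selection_ratio(self):
        # P(selection ratio) = T(selections) / T(iterations)
        return self.selections / self.iterations if self.iterations else 0.0

stats = EntryStats(iterations=20, responses=5, selections=8)
```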

Preferably, a ranking list program is also provided; according to the ranking, entries whose response ratio or selection ratio is above a set value are retained, and entries below that value are automatically deleted, which reduces the space the corpus occupies.
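The ranking and pruning step can be sketched as a threshold filter followed by a sort. The text does not specify the threshold values or the ranking key, so the defaults of 0.2 and the use of the larger of the two ratios as the key are assumptions; the function name `prune` is hypothetical.

```python
def prune(stats, min_response=0.2, min_selection=0.2):
    """Keep entries above either threshold and rank them best first.

    stats maps each entry to a (response_ratio, selection_ratio) pair.
    """
    kept = {
        entry: ratios
        for entry, ratios in stats.items()
        if ratios[0] > min_response or ratios[1] > min_selection
    }
    # Rank by the larger of the two ratios (an assumed ranking key).
    return sorted(kept, key=lambda entry: max(kept[entry]), reverse=True)

ranking = prune({"a": (0.5, 0.1), "b": (0.05, 0.02), "c": (0.1, 0.9)})
```

In this example entry `"b"` falls below both thresholds and is dropped, while `"c"` ranks ahead of `"a"`.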

The technical effects and advantages of the invention are as follows. Compared with the prior art, the method for establishing an iterative corpus provided by the invention has the following advantages:

1. The invention works iteratively: a basic iterable lexicon is established first; after a simple round of corpus collection, the entries are analyzed and differentiated, and then repeatedly combined, replaced, expanded, and extended, which rapidly speeds up corpus collection and saves a large amount of manual processing.

2. The invention applies a public beta test in an iterative manner, using the public testers to sort and judge the entries. The corpus can thus be organized and optimized quickly; entries with wrong grammar, unclear meaning, or a mismatched situation are deleted, which reduces the space the corpus occupies in formal operation and its error rate; and the iterable lexicon can continue to be used to complete the corpus during operation, which is very convenient.

Drawings

FIG. 1 is a schematic diagram of part of the structure of the lexicon after classification and arrangement according to the present invention;

FIG. 2 is a flowchart of a method for building an iterative corpus according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are only some of the embodiments of the present invention, not all of them. The specific embodiments described herein merely illustrate the invention and do not limit it. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.

The invention provides a method for establishing an iterative corpus as shown in figures 1-2, which comprises the following steps:

In the process of building the lexicon in S1, the words are divided into major categories by part of speech, or into lists by the number of characters in the word; when near-synonyms or synonyms are established, numbering is performed by combining lists, so that words with the same meaning share one overall number while each word also has its own unique number. The lexicon built in S1 also includes newly coined words, phrases, English words, and English abbreviations that are currently popular in various fields.
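The two classification options mentioned here, by part of speech or by word length, amount to grouping the same word list under different keys. A small sketch, with a hypothetical function name `classify` and hypothetical tag names:

```python
from collections import defaultdict

def classify(words_with_pos, by="pos"):
    """Group (word, pos) pairs by part of speech, or by character count when by='length'."""
    groups = defaultdict(list)
    for word, pos in words_with_pos:
        key = pos if by == "pos" else len(word)
        groups[key].append(word)
    return dict(groups)

words = [("高兴", "adjective"), ("跑", "verb"), ("快乐", "adjective")]
by_pos = classify(words)
by_length = classify(words, by="length")
```

Grouping by length puts the two-character words 高兴 and 快乐 in one list and the single-character 跑 in another.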

In the collection process of S2, punctuation recognition includes recognizing colons and quotation marks and identifying the dialogue content expressed within them; in long dialogues, the character count of each paragraph is checked, and if three consecutive paragraphs of no more than 30 characters each appear, they are judged to be dialogue paragraphs and are collected as corpus entries. When part-of-speech recognition is performed in S3, attention is also paid to recognizing interjections, modal words, and auxiliary words, and these are used as expansion words in S5. In the processing of S5, the entry before sentence reduction is collected as an expanded entry of the reduced entry, and likewise the entry before sentence expansion is collected as an expanded entry of the entry after expansion.
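Sentence expansion in S5 can be illustrated by inserting candidate modifiers into a tagged sentence, one candidate per new entry. The sketch below handles only adjectives placed before the first noun; the insertion rule, the tag names, and the function name `expand_sentence` are assumptions, since the text defers to the expansion rules of Chinese grammar without spelling them out.

```python
def expand_sentence(tagged, adjectives):
    """Return one expanded sentence per adjective, inserted before the first noun."""
    # Locate the first noun; sentences without a noun are left unexpanded.
    idx = next((i for i, (_, tag) in enumerate(tagged) if tag == "noun"), None)
    if idx is None:
        return []
    words = [w for w, _ in tagged]
    return [" ".join(words[:idx] + [adj] + words[idx:]) for adj in adjectives]

expanded = expand_sentence(
    [("the", "other"), ("car", "noun"), ("left", "verb")],
    ["red", "old"],
)
```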

During the public beta application of S6, a response ratio and a selection ratio are calculated for each entry that appears during iteration, using the following formulas:

P(response ratio) = T(number of responses) / T(number of iterations)

P(selection ratio) = T(number of selections) / T(number of iterations).

During the public beta test, a ranking list program is also provided; according to the ranking, entries whose response ratio or selection ratio is above a set value are retained after the test ends, and entries below that value are automatically deleted, which reduces the space the corpus occupies.

The invention works iteratively: a basic iterable lexicon is established first; after a simple round of corpus collection, the entries are analyzed and differentiated, and then repeatedly combined, replaced, expanded, and extended, which rapidly speeds up corpus collection and saves a large amount of manual processing. The invention also applies a public beta test in an iterative manner, using the public testers to sort and judge the entries. The corpus can thus be organized and optimized quickly; entries with wrong grammar, unclear meaning, or a mismatched situation are deleted, which reduces the space the corpus occupies in formal operation and its error rate; and the iterable lexicon can continue to be used to complete the corpus during operation, which is very convenient.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims

1. A method for establishing an iterative corpus is characterized by comprising the following steps:

2. The method for establishing an iterative corpus according to claim 1, wherein: in the process of building the lexicon in S1, the words are divided into major categories by part of speech, or into lists by the number of characters in the word; when near-synonyms or synonyms are established, numbering is performed by combining lists, so that words with the same meaning share one overall number while each word also has its own unique number.

3. The method for establishing an iterative corpus according to claim 1, wherein: in the collection process of S2, punctuation recognition includes recognizing colons and quotation marks and identifying the dialogue content expressed within them; in long dialogues, the character count of each paragraph is checked, and if three consecutive paragraphs of no more than 30 characters each appear, they are judged to be dialogue paragraphs and are collected as corpus entries.

4. The method for establishing an iterative corpus according to claim 1, wherein: when part-of-speech recognition is performed in S3, attention is also paid to recognizing interjections, modal words, and auxiliary words, and these are used as expansion words in S5.

5. The method for establishing an iterative corpus according to claim 1, wherein: in the processing of S5, the entry before sentence reduction is collected as an expanded entry of the reduced entry, and likewise the entry before sentence expansion is collected as an expanded entry of the entry after expansion.

6. The method for establishing an iterative corpus according to claim 1, wherein: the lexicon built in S1 also includes newly coined words, phrases, English words, and English abbreviations that are currently popular in various fields.

7. The method for establishing an iterative corpus according to claim 1, wherein: during the public beta application of S6, a response ratio and a selection ratio are calculated for each entry that appears during iteration, using the following formulas:

P(response ratio) = T(number of responses) / T(number of iterations)

P(selection ratio) = T(number of selections) / T(number of iterations).

8. The method for establishing an iterative corpus according to claim 7, wherein: a ranking list program is also provided; according to the ranking, entries whose response ratio or selection ratio is above a set value are retained, and entries below that value are automatically deleted, which reduces the space the corpus occupies.