CN113033187A - Method for establishing iterative corpus - Google Patents

Publication number
CN113033187A
Authority
CN
China
Prior art keywords
corpus
iterative
word
sentence
expansion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911355650.7A
Other languages
Chinese (zh)
Other versions
CN113033187B (en)
Inventor
陈滨
梁晓斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Armored Network Co ltd
Original Assignee
Xiamen Armored Network Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Armored Network Co ltd
Priority to CN201911355650.7A
Publication of CN113033187A
Application granted
Publication of CN113033187B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for establishing an iterative corpus, comprising the following steps: establishing a lexicon for iterative use based on a Chinese dictionary; loading from the network the text of existing popular novels, dramas, comedy sketches, and TV dramas, and collecting corpus entries from it; performing word segmentation and part-of-speech tagging on the corpus entries, and forming multiple corpus entries by iterative combination according to the context categories of the words in the lexicon, so that corpus entries are created and supplemented automatically; contracting or expanding the sentences of the corpus entries to form expanded corpus entries; and opening the corpus to the public for testing, then optimizing and deleting entries according to how often each is selected or chosen as a response, thereby establishing the iterative corpus. The invention uses an iteratively built lexicon to analyze and decompose the corpus and then iteratively recombine, substitute, and expand it (including sentence expansion), which rapidly speeds up corpus collection and saves a large amount of manual processing.

Description

Method for establishing iterative corpus
Technical Field
The invention belongs to the technical field of corpus establishment, and particularly relates to a method for establishing an iterative corpus.
Background
A corpus is the basic resource of corpus linguistics and the main resource of empirical language research. Corpora are applied in lexicography, language teaching, traditional linguistic research, and statistical or example-based research in natural language processing. Three basic points about corpora: a corpus stores language material that has actually occurred in real language use; a corpus is a basic resource that carries linguistic knowledge, with the electronic computer as its medium; and real language material must be processed (analyzed and annotated) before it becomes a useful resource. There are many types of corpora, and the main basis for determining the type is the research purpose and use, which is usually reflected in the principles and manner of corpus collection. Corpora have been divided into four categories: heterogeneous, in which no specific collection principle is set and all kinds of language material are widely collected and stored as they are; homogeneous, in which only material of the same type of content is collected; systematic, in which material is collected according to predetermined principles and proportions, so that the corpus is balanced and systematic and can represent the language facts within a certain range; and specialized, in which only material for a particular purpose is collected.
When an existing corpus is established, a large amount of manual labor is needed to enter each corpus entry and to set its context. This wastes manpower and takes a long time, and the resulting corpus is prone to omissions or is not rich enough, so its practical value is low.
Disclosure of Invention
The invention aims to provide a method for establishing an iterative corpus, so as to solve the problems identified in the Background: corpus entry in the prior art consumes a large amount of manpower and time, and the resulting corpus is prone to omissions and lacks richness.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for establishing an iterative corpus comprises the following steps:
S1. Establish a lexicon for iterative use based on a Chinese dictionary, in which each word is separately given its near-synonyms, synonyms, antonyms, a brief definition, and a context category;
S2. Load from the network the text of existing popular novels, dramas, stories, comedy sketches, TV dramas, and character chat content; identify the dialogue content through paragraph, part-of-speech, and punctuation recognition; take each sentence of dialogue as a corpus entry; and take the three sentences of context before and the three sentences after each corpus entry as that entry's situational response content;
S3. Perform word segmentation and part-of-speech tagging on each corpus entry; identify the adjectives, verbs, and nouns it contains; look up the near-synonyms, antonyms, and brief definitions of each such word according to the context category recorded for it in the S1 lexicon; and form multiple corpus entries by iterative combination, so that corpus entries are created and supplemented automatically;
S4. Process the situational response content of each corpus entry from S2 according to step S3, obtaining response entries suited to the situation of the corpus entries produced in S3;
S5. Sentence contraction or expansion: contract or expand the corpus entries from S3 and S4 using part-of-speech recognition, following the sentence-contraction and sentence-expansion principles of Chinese grammar; iteratively delete interjections, adjectives, and place and time nouns from an entry to obtain a short contracted sentence; and iteratively add selected adjectives and place and time nouns to an entry to form an expanded entry;
S6. Public testing: after a large body of corpus content has been established through S3, S4, and S5, open the corpus to the public for testing; during public testing, among the entries formed in S3, S4, and S5 that are similar to one another or that respond to the same context, iteratively retain or delete entries according to how often each is selected or chosen as a response; then continue to complete the corpus iteratively during use, thereby establishing the iterative corpus.
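The iterative combination in step S3 can be pictured as a Cartesian product over per-word substitution options. The following is only a minimal sketch under stated assumptions: the sentence is already segmented into tokens, English toy words stand in for Chinese ones, and `LEXICON` and `expand_by_substitution` are hypothetical names that do not come from the patent.

```python
from itertools import product

# Hypothetical mini-lexicon: word -> substitutes (near-synonyms or synonyms)
# that share the word's context category. The real lexicon comes from S1.
LEXICON = {
    "happy": ["glad", "cheerful"],
    "big": ["large"],
}

def expand_by_substitution(tokens, lexicon):
    """Return every sentence obtainable by swapping tokens for substitutes."""
    # Each position offers the original token plus any lexicon substitutes.
    options = [[tok] + lexicon.get(tok, []) for tok in tokens]
    # product() enumerates one choice per position, i.e. every combination.
    return [" ".join(choice) for choice in product(*options)]

variants = expand_by_substitution(["a", "happy", "big", "dog"], LEXICON)
for sentence in variants:
    print(sentence)
```

The number of variants grows multiplicatively with the number of substitutable words, which is presumably why the method relies on later public testing to prune the result.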
Preferably, in building the lexicon in S1, the words are grouped into major categories by part of speech, or into category lists by word length (number of characters); when near-synonyms or synonyms are recorded, numbering is done by combining lists, so that words with the same meaning share an overall number while each word also has its own unique number.
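One way to read this numbering scheme is a two-level identifier: a group number shared by all words of the same meaning, plus a unique number per word. The sketch below is an illustration only; `LexiconEntry`, `Lexicon`, and every field name are assumptions, not structures described in the patent.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class LexiconEntry:
    word_id: int            # unique number for this word
    group_id: int           # overall number shared by same-meaning words
    word: str
    part_of_speech: str
    antonyms: list = field(default_factory=list)
    definition: str = ""
    context_category: str = ""

class Lexicon:
    def __init__(self):
        self._word_ids = count(1)
        self._group_ids = count(1)
        self.entries = {}

    def add_synonym_group(self, words, part_of_speech, context_category=""):
        """Register words of the same meaning under one shared group number."""
        group_id = next(self._group_ids)
        for w in words:
            self.entries[w] = LexiconEntry(
                next(self._word_ids), group_id, w, part_of_speech,
                context_category=context_category,
            )
        return group_id

    def synonyms_of(self, word):
        """Look up the other words that carry the same group number."""
        gid = self.entries[word].group_id
        return [e.word for e in self.entries.values()
                if e.group_id == gid and e.word != word]

lexicon = Lexicon()
lexicon.add_synonym_group(["happy", "glad"], "adjective", "emotion")
print(lexicon.synonyms_of("happy"))
```

Sharing a group number makes synonym lookup a simple filter, while the per-word number still lets each entry be referenced individually.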
Preferably, in the collection process of S2, punctuation recognition includes recognizing colons and quotation marks, so that dialogue content enclosed in punctuation marks is identified; in long dialogues, the character count of each paragraph is checked, and if three consecutive paragraphs of 30 characters or fewer appear, they are judged to be dialogue paragraphs and are collected as corpus entries.
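The two heuristics in this paragraph (quoted speech, and runs of three short paragraphs) could be sketched as follows. This is a hedged illustration: the regular expression, the handling of both Chinese and ASCII quotation marks, and the function names are assumptions; only the 30-character limit and the run length of three come from the text.

```python
import re

MAX_DIALOGUE_LEN = 30   # characters, per the rule stated above
MIN_RUN = 3             # consecutive short paragraphs required

# Quoted speech in Chinese or ASCII double quotes, optionally after a colon.
QUOTED = re.compile(r'[:：]?\s*[“"]([^”"]+)[”"]')

def quoted_speech(paragraph):
    """Return the text found inside quotation marks in a paragraph."""
    return QUOTED.findall(paragraph)

def short_paragraph_runs(paragraphs):
    """Return indices of paragraphs lying in a run of at least MIN_RUN
    consecutive paragraphs, each at most MAX_DIALOGUE_LEN characters."""
    flagged, run = set(), []
    # The trailing None is a sentinel that flushes the final run.
    for i, p in enumerate(list(paragraphs) + [None]):
        if p is not None and 0 < len(p.strip()) <= MAX_DIALOGUE_LEN:
            run.append(i)
        else:
            if len(run) >= MIN_RUN:
                flagged.update(run)
            run = []
    return sorted(flagged)

paragraphs = [
    "Where are you going?",
    "To the market.",
    "Wait for me.",
    "x" * 50,            # a long narrative paragraph breaks the run
    "Short.",
]
print(short_paragraph_runs(paragraphs))
print(quoted_speech('He said: "Let us go."'))
```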
Preferably, during part-of-speech recognition in S3, particular attention is paid to recognizing interjections, modal particles, and auxiliary words, and these are used as expansion words in S5.
Preferably, in the corpus processing of S5, the entry before sentence contraction is retained as an additional entry associated with the contracted entry; similarly, the entry before sentence expansion is retained as an additional entry associated with the expanded entry.
Preferably, when compiling the lexicon in S1, currently popular new words and phrases from various fields, English words, and English abbreviations are also included.
Preferably, during the public testing of S6, a response ratio and a selection ratio are calculated for each corpus entry that appears during iteration, using the following formulas:
P(response ratio) = T(response count) / T(iteration count)
P(selection ratio) = T(selection count) / T(iteration count)
Preferably, a ranking program is also provided: according to the ranking, corpus entries whose response ratio or selection ratio is above a certain value are retained, and entries below that value are automatically deleted, reducing storage usage.
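The ratio calculation and the ranking-based deletion can be sketched together. The patent leaves the cutoff as "a certain value", so the 0.2 thresholds below are placeholder assumptions, as are the names `CorpusStats` and `prune`, the zero-iteration guard, and the choice to rank by the better of the two ratios.

```python
from dataclasses import dataclass

@dataclass
class CorpusStats:
    text: str
    iterations: int   # times the entry appeared during public testing
    responses: int    # times it was used as a response
    selections: int   # times it was selected

    @property
    def response_ratio(self):
        # P(response ratio) = T(response count) / T(iteration count)
        return self.responses / self.iterations if self.iterations else 0.0

    @property
    def selection_ratio(self):
        # P(selection ratio) = T(selection count) / T(iteration count)
        return self.selections / self.iterations if self.iterations else 0.0

def prune(entries, min_response=0.2, min_selection=0.2):
    """Keep entries whose response OR selection ratio reaches its threshold,
    ranked from best to worst; everything else is dropped."""
    kept = [e for e in entries
            if e.response_ratio >= min_response
            or e.selection_ratio >= min_selection]
    return sorted(kept,
                  key=lambda e: max(e.response_ratio, e.selection_ratio),
                  reverse=True)

entries = [
    CorpusStats("a", iterations=100, responses=50, selections=10),
    CorpusStats("b", iterations=100, responses=5, selections=5),
    CorpusStats("c", iterations=0, responses=0, selections=0),
    CorpusStats("d", iterations=10, responses=1, selections=9),
]
for e in prune(entries):
    print(e.text, e.response_ratio, e.selection_ratio)
```

Entries that were never shown (zero iterations) are treated here as having a ratio of zero and are therefore dropped; a real system might prefer to keep them until they have been tested.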
Technical effects and advantages of the invention: compared with the prior art, the method for establishing an iterative corpus provided by the invention has the following advantages:
1. The invention works iteratively: a basic iterable lexicon is established first; after a simple round of corpus collection, the corpus is analyzed and decomposed, and then repeatedly combined, substituted, and expanded (including sentence expansion), which rapidly speeds up corpus collection and saves a large amount of manual processing;
2. The invention applies public testing iteratively, using the testers to sort and judge the corpus entries, so that the corpus can be organized and optimized quickly; entries with grammatical errors, unclear meaning, or a mismatched context are deleted, which reduces the corpus's resource usage and error rate in production operation; during operation, the iterable lexicon can continue to be used to complete the corpus, which is very convenient.
Drawings
FIG. 1 is a schematic diagram of a partial structure of a word stock after classification and arrangement according to the present invention;
FIG. 2 is a flowchart of a method for building an iterative corpus according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are only some of the embodiments of the present invention, not all of them. The specific embodiments described herein merely illustrate the invention and do not limit it. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The invention provides a method for establishing an iterative corpus as shown in figures 1-2, which comprises the following steps:
S1. Establish a lexicon for iterative use based on a Chinese dictionary, in which each word is separately given its near-synonyms, synonyms, antonyms, a brief definition, and a context category;
S2. Load from the network the text of existing popular novels, dramas, stories, comedy sketches, TV dramas, and character chat content; identify the dialogue content through paragraph, part-of-speech, and punctuation recognition; take each sentence of dialogue as a corpus entry; and take the three sentences of context before and the three sentences after each corpus entry as that entry's situational response content;
S3. Perform word segmentation and part-of-speech tagging on each corpus entry; identify the adjectives, verbs, and nouns it contains; look up the near-synonyms, antonyms, and brief definitions of each such word according to the context category recorded for it in the S1 lexicon; and form multiple corpus entries by iterative combination, so that corpus entries are created and supplemented automatically;
S4. Process the situational response content of each corpus entry from S2 according to step S3, obtaining response entries suited to the situation of the corpus entries produced in S3;
S5. Sentence contraction or expansion: contract or expand the corpus entries from S3 and S4 using part-of-speech recognition, following the sentence-contraction and sentence-expansion principles of Chinese grammar; iteratively delete interjections, adjectives, and place and time nouns from an entry to obtain a short contracted sentence; and iteratively add selected adjectives and place and time nouns to an entry to form an expanded entry;
S6. Public testing: after a large body of corpus content has been established through S3, S4, and S5, open the corpus to the public for testing; during public testing, among the entries formed in S3, S4, and S5 that are similar to one another or that respond to the same context, iteratively retain or delete entries according to how often each is selected or chosen as a response; then continue to complete the corpus iteratively during use, thereby establishing the iterative corpus.
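Step S2 pairs each dialogue sentence with its surrounding context. One reading of that step, taking up to three sentences on each side, is sketched below; the function names, the dictionary layout, and the behavior at the start and end of a text (simply returning fewer sentences) are assumptions for illustration.

```python
def context_window(sentences, index, width=3):
    """Return up to `width` sentences before and after sentences[index]."""
    before = sentences[max(0, index - width):index]
    after = sentences[index + 1:index + 1 + width]
    return before, after

def build_entries(sentences, width=3):
    """Pair every sentence with its surrounding context window."""
    entries = []
    for i, sentence in enumerate(sentences):
        before, after = context_window(sentences, i, width)
        entries.append({"corpus": sentence, "before": before, "after": after})
    return entries

dialogue = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
print(context_window(dialogue, 4))
print(build_entries(dialogue)[0])
```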
In building the lexicon in S1, the words are grouped into major categories by part of speech, or into category lists by word length (number of characters); when near-synonyms or synonyms are recorded, numbering is done by combining lists, so that words with the same meaning share an overall number while each word also has its own unique number. When compiling the lexicon in S1, currently popular new words and phrases from various fields, English words, and English abbreviations are also included.
In the collection process of S2, punctuation recognition includes recognizing colons and quotation marks, so that dialogue content enclosed in punctuation marks is identified; in long dialogues, the character count of each paragraph is checked, and if three consecutive paragraphs of 30 characters or fewer appear, they are judged to be dialogue paragraphs and are collected as corpus entries. During part-of-speech recognition in S3, particular attention is paid to recognizing interjections, modal particles, and auxiliary words, and these are used as expansion words in S5. In the corpus processing of S5, the entry before sentence contraction is retained as an additional entry associated with the contracted entry; similarly, the entry before sentence expansion is retained as an additional entry associated with the expanded entry.
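Sentence contraction as described for S5 amounts to filtering tokens by part of speech while keeping the original sentence alongside the shortened one. The sketch below assumes the tokens already carry part-of-speech tags from some tagger; the tag names, `REMOVABLE`, and both function names are hypothetical, and English toy words stand in for Chinese ones.

```python
# Hypothetical tag names for the categories that S5 deletes when contracting.
REMOVABLE = {"interjection", "adjective", "place", "time"}

def contract(tagged_tokens, removable=REMOVABLE):
    """Drop every (word, tag) pair whose tag is in the removable set."""
    return [(word, tag) for word, tag in tagged_tokens if tag not in removable]

def contraction_pair(tagged_tokens):
    """Return the contracted sentence together with the original, so the
    pre-contraction sentence is kept as an associated entry."""
    short = contract(tagged_tokens)
    return {
        "contracted": " ".join(word for word, _ in short),
        "original": " ".join(word for word, _ in tagged_tokens),
    }

tagged = [
    ("oh", "interjection"),
    ("yesterday", "time"),
    ("the", "other"),
    ("small", "adjective"),
    ("dog", "noun"),
    ("barked", "verb"),
]
print(contraction_pair(tagged))
```

Sentence expansion would run in the opposite direction, inserting selected adjectives and place or time nouns; it is omitted here because choosing grammatical insertion points depends on rules the text does not spell out.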
During the public testing of S6, a response ratio and a selection ratio are calculated for each corpus entry that appears during iteration, using the following formulas:
P(response ratio) = T(response count) / T(iteration count)
P(selection ratio) = T(selection count) / T(iteration count)
During public testing, a ranking program is also provided: after public testing ends, corpus entries whose response ratio or selection ratio is above a certain value are retained according to the ranking, and entries below that value are automatically deleted, reducing storage usage.
The invention works iteratively: a basic iterable lexicon is established first; after a simple round of corpus collection, the corpus is analyzed and decomposed, and then repeatedly combined, substituted, and expanded (including sentence expansion), which rapidly speeds up corpus collection and saves a large amount of manual processing. The invention applies public testing iteratively, using the testers to sort and judge the corpus entries, so that the corpus can be organized and optimized quickly; entries with grammatical errors, unclear meaning, or a mismatched context are deleted, which reduces the corpus's resource usage and error rate in production operation; during operation, the iterable lexicon can continue to be used to complete the corpus, which is very convenient.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims (8)

1. A method for establishing an iterative corpus, characterized by comprising the following steps:
S1. Establish a lexicon for iterative use based on a Chinese dictionary, in which each word is separately given its near-synonyms, synonyms, antonyms, a brief definition, and a context category;
S2. Load from the network the text of existing popular novels, dramas, stories, comedy sketches, TV dramas, and character chat content; identify the dialogue content through paragraph, part-of-speech, and punctuation recognition; take each sentence of dialogue as a corpus entry; and take the three sentences of context before and the three sentences after each corpus entry as that entry's situational response content;
S3. Perform word segmentation and part-of-speech tagging on each corpus entry; identify the adjectives, verbs, and nouns it contains; look up the near-synonyms, antonyms, and brief definitions of each such word according to the context category recorded for it in the S1 lexicon; and form multiple corpus entries by iterative combination, so that corpus entries are created and supplemented automatically;
S4. Process the situational response content of each corpus entry from S2 according to step S3, obtaining response entries suited to the situation of the corpus entries produced in S3;
S5. Sentence contraction or expansion: contract or expand the corpus entries from S3 and S4 using part-of-speech recognition, following the sentence-contraction and sentence-expansion principles of Chinese grammar; iteratively delete interjections, adjectives, and place and time nouns from an entry to obtain a short contracted sentence; and iteratively add selected adjectives and place and time nouns to an entry to form an expanded entry;
S6. Public testing: after a large body of corpus content has been established through S3, S4, and S5, open the corpus to the public for testing; during public testing, among the entries formed in S3, S4, and S5 that are similar to one another or that respond to the same context, iteratively retain or delete entries according to how often each is selected or chosen as a response; then continue to complete the corpus iteratively during use, thereby establishing the iterative corpus.
2. The method for establishing an iterative corpus according to claim 1, wherein: in building the lexicon in S1, the words are grouped into major categories by part of speech, or into category lists by word length (number of characters); when near-synonyms or synonyms are recorded, numbering is done by combining lists, so that words with the same meaning share an overall number while each word also has its own unique number.
3. The method for establishing an iterative corpus according to claim 1, wherein: in the collection process of S2, punctuation recognition includes recognizing colons and quotation marks, so that dialogue content enclosed in punctuation marks is identified; in long dialogues, the character count of each paragraph is checked, and if three consecutive paragraphs of 30 characters or fewer appear, they are judged to be dialogue paragraphs and are collected as corpus entries.
4. The method for establishing an iterative corpus according to claim 1, wherein: during part-of-speech recognition in S3, particular attention is paid to recognizing interjections, modal particles, and auxiliary words, and these are used as expansion words in S5.
5. The method for establishing an iterative corpus according to claim 1, wherein: in the corpus processing of S5, the entry before sentence contraction is retained as an additional entry associated with the contracted entry; similarly, the entry before sentence expansion is retained as an additional entry associated with the expanded entry.
6. The method for establishing an iterative corpus according to claim 1, wherein: when compiling the lexicon in S1, currently popular new words and phrases from various fields, English words, and English abbreviations are also included.
7. The method for establishing an iterative corpus according to claim 1, wherein: during the public testing of S6, a response ratio and a selection ratio are calculated for each corpus entry that appears during iteration, using the following formulas:
P(response ratio) = T(response count) / T(iteration count)
P(selection ratio) = T(selection count) / T(iteration count)
8. The method for establishing an iterative corpus according to claim 7, wherein: a ranking program is also provided; according to the ranking, corpus entries whose response ratio or selection ratio is above a certain value are retained, and entries below that value are automatically deleted, reducing storage usage.
CN201911355650.7A 2019-12-25 2019-12-25 Method for establishing iterative corpus Active CN113033187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911355650.7A CN113033187B (en) 2019-12-25 2019-12-25 Method for establishing iterative corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911355650.7A CN113033187B (en) 2019-12-25 2019-12-25 Method for establishing iterative corpus

Publications (2)

Publication Number Publication Date
CN113033187A (en) 2021-06-25
CN113033187B (en) 2022-08-05

Family

ID=76458075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911355650.7A Active CN113033187B (en) 2019-12-25 2019-12-25 Method for establishing iterative corpus

Country Status (1)

Country Link
CN (1) CN113033187B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332217A1 (en) * 2009-06-29 2010-12-30 Shalom Wintner Method for text improvement via linguistic abstractions
CN105320960A (en) * 2015-10-14 2016-02-10 北京航空航天大学 Voting based classification method for cross-language subjective and objective sentiments
CN109614499A (en) * 2018-11-22 2019-04-12 阿里巴巴集团控股有限公司 A kind of dictionary generating method, new word discovery method, apparatus and electronic equipment
CN109829052A (en) * 2019-02-19 2019-05-31 田中瑶 A kind of open dialogue method and system based on human-computer interaction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蒋志鹏: "面向中文电子病历的词法语料标注研究", 《高技术通讯》 *

Also Published As

Publication number Publication date
CN113033187B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN109408642B (en) Domain entity attribute relation extraction method based on distance supervision
WO2016008452A1 (en) Highly effective input prediction method and device
EP0971294A2 (en) Method and apparatus for automated search and retrieval processing
WO2008098507A1 (en) An input method of combining words intelligently, input method system and renewing method
CN110717045A (en) Letter element automatic extraction method based on letter overview
JP2572314B2 (en) Keyword extraction device
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN107784123B (en) Topic-based search optimization method
CN106250367B (en) Method based on the improved Nivre algorithm building interdependent treebank of Vietnamese
CN109002540B (en) Method for automatically generating Chinese announcement document question answer pairs
CN113033187B (en) Method for establishing iterative corpus
CN106776590A (en) A kind of method and system for obtaining entry translation
CN110162615A (en) A kind of intelligent answer method, apparatus, electronic equipment and storage medium
US8862459B2 (en) Generating Chinese language banners
CN113553859A (en) English writing evaluation system based on deep semantic analysis
CN108763229B (en) Machine translation method and device based on characteristic sentence stem extraction
CN109325225B (en) Universal relevance-based part-of-speech tagging method
CN114444491A (en) New word recognition method and device
WO2008017188A1 (en) System and method for making teaching material of language class
Wang et al. A bootstrapping method for extracting sentiment words using degree adverb patterns
CN113722447B (en) Voice search method based on multi-strategy matching
CN113486155B (en) Chinese naming method fusing fixed phrase information
Yuan et al. Extraction of Emotional Words and Construction of Emotional Dictionary Based on Syntactic Dependency and Point Mutual Information
Flanagan et al. Automatic extraction and prediction of word order errors from language learning SNS
Khalfallah et al. Had, a platform to create a historical dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Method for Establishing an Iterative Corpus

Granted publication date: 20220805

Pledgee: Industrial Bank Limited by Share Ltd. Xiamen branch

Pledgor: Xiamen Armored Network Co.,Ltd.

Registration number: Y2024980001529