CN113033187A - Method for establishing iterative corpus - Google Patents
Method for establishing iterative corpus
- Publication number
- CN113033187A (application CN201911355650.7A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- iterative
- word
- sentence
- expansion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for establishing an iterative corpus, comprising the following steps: establishing a lexicon for iterative use based on a Chinese dictionary; loading from the network the text of existing popular novels, dramas, frets, and TV dramas to collect corpus entries; performing word segmentation and part-of-speech tagging on the entries and forming multiple entries by iterative combination according to the context categories of the words in the lexicon, so that the corpus is built and supplemented automatically; contracting or expanding the sentences of the entries to form an expanded corpus; and opening the corpus to the public for public testing, then optimizing and deleting entries according to how often they are selected or recorded as responses, thereby establishing the iterative corpus. Using an iteratively built lexicon, the invention analyzes and splits the corpus entries and then iteratively combines, replaces, and expands them, which greatly speeds up corpus collection and saves a large amount of manual processing.
Description
Technical Field
The invention belongs to the technical field of corpus establishment, and in particular relates to a method for establishing an iterative corpus.
Background
A corpus is the basic resource for research in corpus linguistics and the main resource for empirical approaches to language study. Corpora are applied in lexicography, language teaching, traditional language research, and statistical or example-based research in natural language processing. Three basic points hold for corpora: a corpus stores language material that has actually occurred in real language use; a corpus is a basic resource that carries linguistic knowledge with the computer as its medium; and real language material must be processed (analyzed and handled) before it becomes a useful resource. There are many types of corpora, and the main basis for determining the type is the research purpose and use, which is usually reflected in the principles and manner of corpus collection. Corpora have been divided into four categories. Heterogeneous: no specific collection principle is applied, and all kinds of material are collected widely and stored as-is. Homogeneous: only material of the same type of content is collected. Systematic: material is collected according to predetermined principles and proportions, so that the corpus is balanced and systematic and can represent the language facts within a certain range. Specialized: only material for a particular purpose is collected.
When an existing corpus is built, a large amount of manual labor is needed to enter the corpus entries and to set their contexts. This wastes manpower and takes a long time, and the resulting corpus is prone to omissions or lacks richness, so its practical value is low.
Disclosure of Invention
The invention aims to provide a method for establishing an iterative corpus, so as to solve the problems described in the Background: corpus entry in the prior art consumes a large amount of manpower, takes a long time, is prone to omissions, and yields a corpus that is not rich.
To achieve this purpose, the invention adopts the following technical solution:
A method for establishing an iterative corpus comprises the following steps:
S1, establishing a lexicon for iterative use based on a Chinese dictionary, wherein near-synonyms, synonyms, antonyms, a brief explanation, and a context category are established separately for each word in the lexicon;
S2, loading from the network the text of existing popular novels, dramas, stories, merchantry, TV plays, and character chat content; identifying the dialogue content through paragraph, part-of-speech, and punctuation-mark recognition; taking each sentence of the dialogue content as a corpus entry; and taking the three sentences of emotion and scene context immediately before and after the entry as the context-response content of that entry;
S3, performing word segmentation and part-of-speech tagging on each corpus entry using a word segmentation method; identifying the adjectives, verbs, and nouns in the entry by part of speech; looking up the near-synonyms, antonyms, and brief explanations of each word according to the context category of the word in the S1 lexicon; and forming multiple corpus entries by iterative combination, so that the corpus is built and supplemented automatically;
S4, processing the context-response content collected in S2 according to step S3 to obtain corpus entries adapted to the context of the entries produced in S3;
S5, sentence contraction or expansion: contracting or expanding the corpus entries from S3 and S4 using part-of-speech recognition, following the sentence contraction and expansion principles of Chinese grammar; iteratively deleting exclamatory words, adjectives, and place and time nouns from an entry to obtain a short contracted sentence; and iteratively adding selected adjectives and place and time nouns to an entry to form an expanded entry;
S6, public test application: after a large body of corpus content has been built through S3, S4, and S5, opening the corpus to the public for public testing; during the public test, under the same context, iteratively selecting or deleting corpus entries that are similar to, or respond to, the entries formed in S3, S4, and S5, according to how often they are selected or recorded as responses; and then gradually completing the corpus iteratively during use, thereby establishing the iterative corpus.
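As a non-limiting illustration of the iterative combination in S3, the sketch below generates every variant of an already segmented sentence by swapping words for alternatives from a lexicon. The mini-lexicon, the English toy tokens, and the function name `expand_sentence` are assumptions made for this example and do not come from the patent text.

```python
from itertools import product

# Hypothetical mini-lexicon: word -> alternatives from the same context category.
LEXICON = {
    "happy": ["glad", "cheerful"],
    "big": ["large"],
}

def expand_sentence(tokens, lexicon):
    """Return every sentence obtained by swapping lexicon words for alternatives."""
    # Each position offers the original token plus any alternatives it has.
    choices = [[tok] + lexicon.get(tok, []) for tok in tokens]
    return [" ".join(combo) for combo in product(*choices)]

variants = expand_sentence(["a", "happy", "big", "dog"], LEXICON)
for v in variants:
    print(v)
```

With three candidates for one word and two for another, the single input sentence yields six corpus entries, which is the kind of multiplication the iterative combination relies on.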
Preferably, when the lexicon is built in S1, the words are divided into major categories by part of speech, or arranged into lists by the number of characters in the word; when near-synonyms or synonyms are established, numbering is done by combining the lists, so that words with the same meaning share one overall number and each word additionally has its own unique number.
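One possible way to represent this numbering scheme is sketched below: every synonym set receives a shared group number and every word receives its own unique number. The `Entry` fields, the function `build_lexicon`, and the sample words are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Entry:
    word: str
    pos: str        # part-of-speech category, e.g. "adj"
    group_id: int   # overall number shared by words with the same meaning
    word_id: int    # number unique to this word

def build_lexicon(synonym_groups):
    """Assign one group number per synonym set and one unique number per word."""
    next_word_id = count(1)
    entries = {}
    for group_id, (pos, words) in enumerate(synonym_groups, start=1):
        for word in words:
            entries[word] = Entry(word, pos, group_id, next(next_word_id))
    return entries

lexicon = build_lexicon([("adj", ["happy", "glad"]), ("adj", ["big", "large"])])
print(lexicon["glad"])
```

Looking up any word then gives direct access to its synonym set through the shared group number.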
Preferably, in the collection process of S2, punctuation-mark recognition includes recognizing colons and quotation marks, and the content enclosed by these marks is identified as dialogue; in long dialogue passages, the character count of each paragraph is checked, and if three consecutive paragraphs of no more than 30 characters each appear, they are judged to be dialogue paragraphs and are collected as corpus entries.
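The two heuristics described here can be sketched as follows, as an illustration only. The regular expression, the function names, and the reading of "within 30 characters" as `len(p) <= 30` are assumptions, not details given in the patent.

```python
import re

# Text inside Chinese or ASCII double quotation marks, optionally after a colon.
QUOTED = re.compile(r'[:：]?\s*[“"]([^”"]+)[”"]')

def quoted_lines(paragraph):
    """Return the quoted spans found in one paragraph."""
    return QUOTED.findall(paragraph)

def short_runs(paragraphs, max_chars=30, min_run=3):
    """Return paragraphs that sit in runs of at least `min_run` short paragraphs."""
    collected, run = [], []
    for p in list(paragraphs) + [None]:      # the sentinel flushes the final run
        if p is not None and len(p) <= max_chars:
            run.append(p)
            continue
        if len(run) >= min_run:
            collected.extend(run)
        run = []
    return collected

paras = ["Where are you going?", "To the market.", "Wait for me.",
         "x" * 50, "Short one."]
dialogue = short_runs(paras)
spoken = quoted_lines('He said: "Come here."')
```

In this toy input the first three paragraphs form a qualifying run, while the single short paragraph after the long one does not.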
Preferably, during the part-of-speech recognition in S3, particular attention is paid to recognizing exclamatory words, modal words, and auxiliary words, and these auxiliary, modal, and exclamatory words are also used as the expansion words in S5.
Preferably, in the corpus processing of S5, the entry before sentence contraction is collected as an expanded entry of the contracted entry, and likewise the entry before sentence expansion is collected as an expanded entry of the entry after expansion.
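A minimal sketch of sentence contraction and of pairing the original with its contracted form is given below. It assumes the tokens have already been segmented and tagged; the tag names, the English toy sentence, and the function names are hypothetical.

```python
# Tags whose tokens are dropped during contraction; the tag names are illustrative.
REMOVABLE = {"interjection", "adjective", "place", "time"}

def contract(tagged_tokens):
    """Drop removable tokens and return the shortened token list."""
    return [tok for tok, tag in tagged_tokens if tag not in REMOVABLE]

def contraction_pair(tagged_tokens):
    """Return (contracted, original); the original is kept as an expansion."""
    original = " ".join(tok for tok, _ in tagged_tokens)
    return " ".join(contract(tagged_tokens)), original

sentence = [("oh", "interjection"), ("the", "det"), ("old", "adjective"),
            ("man", "noun"), ("left", "verb"), ("yesterday", "time")]
short, full = contraction_pair(sentence)
print(short, "<-", full)
```

Storing both members of the pair means a single collected sentence contributes two related corpus entries.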
Preferably, the lexicon built in S1 also includes new words, phrases, English words, and English abbreviations that are currently popular in various fields.
Preferably, during the public test application of S6, the response ratio and the selection ratio of each corpus entry that appears in the iterations are calculated with the following formulas:
$P(\text{response ratio}) = T(\text{number of responses}) / T(\text{number of iterations})$
$P(\text{selection ratio}) = T(\text{number of selections}) / T(\text{number of iterations})$
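For illustration, the two ratios can be computed as below. The counter names in `stats` and the guard against zero iterations are assumptions added for this sketch.

```python
def ratio(count, iterations):
    """Return count / iterations, or 0.0 if the entry has never been shown."""
    return count / iterations if iterations else 0.0

# Hypothetical counters recorded for one corpus entry during public testing.
stats = {"responses": 12, "selections": 30, "iterations": 60}

response_ratio = ratio(stats["responses"], stats["iterations"])
selection_ratio = ratio(stats["selections"], stats["iterations"])
print(response_ratio, selection_ratio)
```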
Preferably, a ranking list program is also provided: according to the ranking, corpus entries whose response ratio or selection ratio is higher than a certain value are retained, and entries below that value are automatically deleted, thereby reducing resource usage.
The technical effects and advantages of the invention are as follows. Compared with the prior art, the method for establishing an iterative corpus provided by the invention has the following advantages:
1. The invention works iteratively: a basic iterable lexicon is established first, and after a simple round of corpus collection the entries are analyzed and split, then repeatedly combined, replaced, and expanded, so that corpus collection is greatly accelerated and a large amount of manual processing is saved;
2. The invention applies public testing with iterative selection and uses the efforts of public testers to sort and judge the corpus entries, so the corpus can be organized and optimized quickly; entries with grammatical errors, unclear meaning, or mismatched context are deleted, which reduces the resource usage and error rate of the corpus in formal operation; and the iterable lexicon can continue to be used to complete the corpus during operation, which is very convenient.
Drawings
FIG. 1 is a schematic diagram of part of the structure of the lexicon after classification and arrangement according to the present invention;
FIG. 2 is a flowchart of a method for building an iterative corpus according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are obviously only some, not all, of the embodiments of the present invention. The specific embodiments described herein merely illustrate the invention and do not limit it. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The invention provides a method for establishing an iterative corpus, as shown in FIGS. 1-2, which comprises the following steps:
S1, establishing a lexicon for iterative use based on a Chinese dictionary, wherein near-synonyms, synonyms, antonyms, a brief explanation, and a context category are established separately for each word in the lexicon;
S2, loading from the network the text of existing popular novels, dramas, stories, merchantry, TV plays, and character chat content; identifying the dialogue content through paragraph, part-of-speech, and punctuation-mark recognition; taking each sentence of the dialogue content as a corpus entry; and taking the three sentences of emotion and scene context immediately before and after the entry as the context-response content of that entry;
S3, performing word segmentation and part-of-speech tagging on each corpus entry using a word segmentation method; identifying the adjectives, verbs, and nouns in the entry by part of speech; looking up the near-synonyms, antonyms, and brief explanations of each word according to the context category of the word in the S1 lexicon; and forming multiple corpus entries by iterative combination, so that the corpus is built and supplemented automatically;
S4, processing the context-response content collected in S2 according to step S3 to obtain corpus entries adapted to the context of the entries produced in S3;
S5, sentence contraction or expansion: contracting or expanding the corpus entries from S3 and S4 using part-of-speech recognition, following the sentence contraction and expansion principles of Chinese grammar; iteratively deleting exclamatory words, adjectives, and place and time nouns from an entry to obtain a short contracted sentence; and iteratively adding selected adjectives and place and time nouns to an entry to form an expanded entry;
S6, public test application: after a large body of corpus content has been built through S3, S4, and S5, opening the corpus to the public for public testing; during the public test, under the same context, iteratively selecting or deleting corpus entries that are similar to, or respond to, the entries formed in S3, S4, and S5, according to how often they are selected or recorded as responses; and then gradually completing the corpus iteratively during use, thereby establishing the iterative corpus.
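The context collection of S2 can be illustrated with the sketch below, which pairs each dialogue sentence with its neighbouring sentences. Reading the step as "up to three sentences before and three after" is an interpretation, and the function name and dictionary keys are hypothetical.

```python
def with_context(sentences, window=3):
    """Pair each sentence with up to `window` sentences before and after it."""
    items = []
    for i, sentence in enumerate(sentences):
        before = sentences[max(0, i - window):i]
        after = sentences[i + 1:i + 1 + window]
        items.append({"corpus": sentence, "context": before + after})
    return items

sents = list("abcdefgh")          # stand-ins for eight dialogue sentences
items = with_context(sents)
print(items[4])
```

Sentences near the start or end of a passage simply receive fewer context sentences.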
When the lexicon is built in S1, the words are divided into major categories by part of speech, or arranged into lists by the number of characters in the word; when near-synonyms or synonyms are established, numbering is done by combining the lists, so that words with the same meaning share one overall number and each word additionally has its own unique number. The lexicon built in S1 also includes new words, phrases, English words, and English abbreviations that are currently popular in various fields.
In the collection process of S2, punctuation-mark recognition includes recognizing colons and quotation marks, and the content enclosed by these marks is identified as dialogue; in long dialogue passages, the character count of each paragraph is checked, and if three consecutive paragraphs of no more than 30 characters each appear, they are judged to be dialogue paragraphs and are collected as corpus entries. During the part-of-speech recognition in S3, particular attention is paid to recognizing exclamatory words, modal words, and auxiliary words, and these auxiliary, modal, and exclamatory words are also used as the expansion words in S5. In the corpus processing of S5, the entry before sentence contraction is collected as an expanded entry of the contracted entry, and likewise the entry before sentence expansion is collected as an expanded entry of the entry after expansion.
During the public test application of S6, the response ratio and the selection ratio of each corpus entry that appears in the iterations are calculated with the following formulas:
$P(\text{response ratio}) = T(\text{number of responses}) / T(\text{number of iterations})$
$P(\text{selection ratio}) = T(\text{number of selections}) / T(\text{number of iterations})$
During public testing, a ranking list program is also provided: according to the ranking, after the public test ends, corpus entries whose response ratio or selection ratio is higher than a certain value are retained, and entries below that value are automatically deleted, thereby reducing resource usage.
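The ranking and pruning step might be sketched as follows. The threshold value, the use of a single threshold for both ratios, and the dictionary layout are assumptions; the patent only says "a certain value".

```python
def best_ratio(s):
    """Return the larger of the response ratio and the selection ratio."""
    if s["iterations"] == 0:
        return 0.0
    return max(s["responses"], s["selections"]) / s["iterations"]

def prune(stats, threshold=0.1):
    """Rank entries by their best ratio and keep those above `threshold`."""
    ranked = sorted(stats, key=lambda text: best_ratio(stats[text]), reverse=True)
    return [text for text in ranked if best_ratio(stats[text]) > threshold]

# Hypothetical public-test counters for three corpus entries.
stats = {
    "a": {"responses": 5, "selections": 1, "iterations": 10},
    "b": {"responses": 0, "selections": 0, "iterations": 10},
    "c": {"responses": 1, "selections": 9, "iterations": 10},
}
kept = prune(stats, threshold=0.1)
print(kept)
```

An entry survives if either ratio clears the threshold, which matches the "response ratio or selection ratio" wording; entries that testers never chose are dropped.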
The invention works iteratively: a basic iterable lexicon is established first, and after a simple round of corpus collection the entries are analyzed and split, then repeatedly combined, replaced, and expanded, so that corpus collection is greatly accelerated and a large amount of manual processing is saved. The invention also applies public testing with iterative selection and uses the efforts of public testers to sort and judge the corpus entries, so the corpus can be organized and optimized quickly; entries with grammatical errors, unclear meaning, or mismatched context are deleted, which reduces the resource usage and error rate of the corpus in formal operation; and the iterable lexicon can continue to be used to complete the corpus during operation, which is very convenient.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.
Claims (8)
1. A method for establishing an iterative corpus, characterized by comprising the following steps:
S1, establishing a lexicon for iterative use based on a Chinese dictionary, wherein near-synonyms, synonyms, antonyms, a brief explanation, and a context category are established separately for each word in the lexicon;
S2, loading from the network the text of existing popular novels, dramas, stories, merchantry, TV plays, and character chat content; identifying the dialogue content through paragraph, part-of-speech, and punctuation-mark recognition; taking each sentence of the dialogue content as a corpus entry; and taking the three sentences of emotion and scene context immediately before and after the entry as the context-response content of that entry;
S3, performing word segmentation and part-of-speech tagging on each corpus entry using a word segmentation method; identifying the adjectives, verbs, and nouns in the entry by part of speech; looking up the near-synonyms, antonyms, and brief explanations of each word according to the context category of the word in the S1 lexicon; and forming multiple corpus entries by iterative combination, so that the corpus is built and supplemented automatically;
S4, processing the context-response content collected in S2 according to step S3 to obtain corpus entries adapted to the context of the entries produced in S3;
S5, sentence contraction or expansion: contracting or expanding the corpus entries from S3 and S4 using part-of-speech recognition, following the sentence contraction and expansion principles of Chinese grammar; iteratively deleting exclamatory words, adjectives, and place and time nouns from an entry to obtain a short contracted sentence; and iteratively adding selected adjectives and place and time nouns to an entry to form an expanded entry;
S6, public test application: after a large body of corpus content has been built through S3, S4, and S5, opening the corpus to the public for public testing; during the public test, under the same context, iteratively selecting or deleting corpus entries that are similar to, or respond to, the entries formed in S3, S4, and S5, according to how often they are selected or recorded as responses; and then gradually completing the corpus iteratively during use, thereby establishing the iterative corpus.
2. The method for establishing an iterative corpus according to claim 1, wherein: when the lexicon is built in S1, the words are divided into major categories by part of speech, or arranged into lists by the number of characters in the word; when near-synonyms or synonyms are established, numbering is done by combining the lists, so that words with the same meaning share one overall number and each word additionally has its own unique number.
3. The method for establishing an iterative corpus according to claim 1, wherein: in the collection process of S2, punctuation-mark recognition includes recognizing colons and quotation marks, and the content enclosed by these marks is identified as dialogue; in long dialogue passages, the character count of each paragraph is checked, and if three consecutive paragraphs of no more than 30 characters each appear, they are judged to be dialogue paragraphs and are collected as corpus entries.
4. The method for establishing an iterative corpus according to claim 1, wherein: during the part-of-speech recognition in S3, particular attention is paid to recognizing exclamatory words, modal words, and auxiliary words, and these auxiliary, modal, and exclamatory words are also used as the expansion words in S5.
5. The method for establishing an iterative corpus according to claim 1, wherein: in the corpus processing of S5, the entry before sentence contraction is collected as an expanded entry of the contracted entry, and likewise the entry before sentence expansion is collected as an expanded entry of the entry after expansion.
6. The method for establishing an iterative corpus according to claim 1, wherein: the lexicon built in S1 also includes new words, phrases, English words, and English abbreviations that are currently popular in various fields.
7. The method for establishing an iterative corpus according to claim 1, wherein: during the public test application of S6, the response ratio and the selection ratio of each corpus entry that appears in the iterations are calculated with the following formulas:
$P(\text{response ratio}) = T(\text{number of responses}) / T(\text{number of iterations})$
$P(\text{selection ratio}) = T(\text{number of selections}) / T(\text{number of iterations})$
8. The method for establishing an iterative corpus according to claim 7, wherein: a ranking list program is also provided; according to the ranking, corpus entries whose response ratio or selection ratio is higher than a certain value are retained, and entries below that value are automatically deleted, thereby reducing resource usage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911355650.7A CN113033187B (en) | 2019-12-25 | 2019-12-25 | Method for establishing iterative corpus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911355650.7A CN113033187B (en) | 2019-12-25 | 2019-12-25 | Method for establishing iterative corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113033187A (en) | 2021-06-25 |
CN113033187B (en) | 2022-08-05 |
Family
ID=76458075
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911355650.7A (Active; CN113033187B (en)) | Method for establishing iterative corpus | 2019-12-25 | 2019-12-25 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113033187B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100332217A1 (en) * | 2009-06-29 | 2010-12-30 | Shalom Wintner | Method for text improvement via linguistic abstractions |
CN105320960A (en) * | 2015-10-14 | 2016-02-10 | 北京航空航天大学 | Voting based classification method for cross-language subjective and objective sentiments |
CN109614499A (en) * | 2018-11-22 | 2019-04-12 | 阿里巴巴集团控股有限公司 | A kind of dictionary generating method, new word discovery method, apparatus and electronic equipment |
CN109829052A (en) * | 2019-02-19 | 2019-05-31 | 田中瑶 | A kind of open dialogue method and system based on human-computer interaction |
Non-Patent Citations (1)
Title |
---|
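Jiang Zhipeng (蒋志鹏): "Research on lexical corpus annotation for Chinese electronic medical records" (面向中文电子病历的词法语料标注研究), 《高技术通讯》 (Chinese High Technology Letters) * |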
Also Published As
Publication number | Publication date |
---|---|
CN113033187B (en) | 2022-08-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
Denomination of invention: A Method for Establishing an Iterative Corpus; Granted publication date: 20220805; Pledgee: Industrial Bank Limited by Share Ltd. Xiamen branch; Pledgor: Xiamen Armored Network Co.,Ltd.; Registration number: Y2024980001529 |