CN114817517A - Corpus acquisition method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114817517A
Authority
CN
China
Prior art keywords
corpus
pool
screening condition
target
linguistic data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210598358.3A
Other languages
Chinese (zh)
Other versions
CN114817517B (en
Inventor
刘克峻
郝玉峰
黄宇凯
李科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Speechocean Technology Co ltd
Original Assignee
Beijing Speechocean Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Speechocean Technology Co ltd
Priority to CN202210598358.3A
Publication of CN114817517A
Application granted
Publication of CN114817517B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a corpus acquisition method and a corpus acquisition device, wherein the method comprises the following steps: obtaining the corpora to be selected in a corpus pool to be selected and a target screening condition; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions; and sequentially adding the corpus to be selected that has the smallest error value with respect to the target screening condition in the corpus pool to be selected into the selected corpus pool until the number of corpora in the selected corpus pool reaches a first quantity threshold, and taking the selected corpora in the selected corpus pool as the target corpus. The technical scheme disclosed by the embodiment of the invention realizes corpus screening under multiple sentence dimensions, improves corpus acquisition efficiency, avoids the corpus distortion and the resulting semantic loss caused by manual correction, and ensures the semantic accuracy of the corpus.

Description

Corpus acquisition method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to an artificial intelligence technology, in particular to a corpus acquisition method, a corpus acquisition device, electronic equipment and a storage medium.
Background
With the continuous progress of science and technology, the technical fields of speech synthesis, speech recognition, natural language processing and the like are rapidly developed, and the technical fields need to select a large amount of corpora meeting specific requirements from a corpus as corpus samples, so that the corpus screening and acquisition become more important.
In the prior art, in a multidimensional screening condition, a target dimension is usually used as a primary screening standard to obtain a corpus under the target dimension, and for the screening conditions of other dimensions, the obtained corpus is modified manually, so that the corpus also meets the requirements of the screening conditions in other dimensions.
However, in such a screening manner, the corpus acquisition efficiency is low, higher labor cost needs to be consumed, and meanwhile, the corpus is often difficult to be close to a real service scene due to manual modification, the matching degree with actual service requirements is low, even a semantic loss phenomenon occurs, and the corpus quality is poor.
Disclosure of Invention
The embodiment of the invention provides a corpus acquisition method, a corpus acquisition device, electronic equipment and a storage medium, which obtain the target corpus from a corpus pool to be selected according to the error value between each corpus to be selected and a target screening condition.
In a first aspect, an embodiment of the present invention provides a corpus acquisition method, including:
obtaining the corpora to be selected in a corpus pool to be selected and a target screening condition; wherein the corpus to be selected comprises a paragraph corpus or a sentence corpus; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions;
and sequentially adding the corpus to be selected that has the smallest error value with respect to the target screening condition in the corpus pool to be selected into the selected corpus pool until the number of corpora in the selected corpus pool reaches a first quantity threshold, and taking the selected corpora in the selected corpus pool as the target corpus.
In a second aspect, an embodiment of the present invention provides a corpus acquiring device, including:
the screening condition acquisition module is used for obtaining the corpora to be selected in the corpus pool to be selected and a target screening condition; wherein the corpus to be selected comprises a paragraph corpus or a sentence corpus; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions;
and the corpus adding and executing module is used for sequentially adding the corpus to be selected in the corpus pool to be selected, which has the minimum error value with the target screening condition, into the selected corpus pool until the corpus number in the selected corpus pool reaches a first number threshold value, and taking the selected corpus in the selected corpus pool as the target corpus.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the corpus acquiring method according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions, where the computer-executable instructions, when executed by a computer processor, implement the corpus acquiring method according to any embodiment of the present invention.
According to the technical scheme disclosed by the embodiment of the invention, after the corpora to be selected in the corpus pool to be selected and the target screening condition are obtained, the corpus to be selected that has the smallest error value with respect to the target screening condition is sequentially added into the selected corpus pool until the number of corpora in the selected corpus pool reaches the first quantity threshold, and the selected corpora in the selected corpus pool are taken as the target corpus. Corpus screening under multiple sentence dimensions is thereby realized and corpus acquisition efficiency is improved; at the same time, the corpus distortion caused by manual correction, and the semantic loss that follows from it, are avoided, so that the semantic accuracy of the corpus is ensured and corpus quality is improved.
Drawings
Fig. 1 is a flowchart of a corpus acquisition method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a corpus acquisition method according to a second embodiment of the present invention;
Fig. 3 is a flowchart of a corpus acquisition method according to a third embodiment of the present invention;
Fig. 4A is a flowchart of a corpus acquisition method according to a fourth embodiment of the present invention;
Fig. 4B is a schematic flow chart of a corpus acquisition method according to a fourth embodiment of the present invention;
Fig. 4C is a schematic flow chart of a corpus acquisition method according to a fourth embodiment of the present invention;
Fig. 5 is a block diagram of a corpus acquisition device according to a fifth embodiment of the present invention;
Fig. 6 is a block diagram of an electronic device according to a sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a corpus acquisition method according to a first embodiment of the present invention. This embodiment is applicable to selecting a target corpus from a corpus pool to be selected according to the error value between each corpus to be selected and a target screening condition. The method may be executed by the corpus acquisition device of the embodiments of the present invention; the device may be implemented in software and/or hardware and integrated in an electronic device, typically a terminal device or a server. The method specifically includes the following steps:
S110, obtaining the corpora to be selected in the corpus pool to be selected and a target screening condition; wherein the corpus to be selected comprises a paragraph corpus or a sentence corpus; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions.
The corpus pool to be selected may be a paragraph corpus pool, that is, the corpora to be selected in the pool are paragraph corpora. Each paragraph corpus consists of one or more sentences that are semantically related; for example, all sentences in a paragraph corpus describe information in the same application scenario (e.g., news, music, or geography). Each paragraph corpus therefore has a paragraph dimension attribute (e.g., an application scenario dimension). Each sentence in a paragraph corpus has a plurality of sentence dimension attributes, that is, it is described from a plurality of sentence dimensions. The sentence dimensions may include sentence pattern (e.g., question sentence, exclamation sentence, declarative sentence), emotion (e.g., happiness, sadness, anger, fear, surprise), whether the sentence contains an acronym, and whether the sentence contains a number; for example, the sentence dimension attributes of one sentence may be "question sentence", "happy", "no acronym", and "contains a number". The corpus pool to be selected may also be a sentence corpus pool, that is, the corpora to be selected in the pool are sentence corpora; each sentence corpus is an independent sentence, and each sentence likewise has a plurality of sentence dimension attributes.
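As an illustration of the attributes described above, the following minimal Python sketch represents one paragraph corpus and derives the share of each sentence type under each sentence dimension. The dictionary layout, the key names (`scene`, `pattern`, `emotion`, `acronym`, `number`) and the function `type_proportions` are hypothetical choices made for this example; the patent does not prescribe a data format.

```python
from collections import Counter

# Hypothetical layout: a paragraph corpus carries one paragraph-dimension
# attribute and a list of sentences, each tagged with its type under every
# sentence dimension.
paragraph = {
    "scene": "news",
    "sentences": [
        {"pattern": "question", "emotion": "happy", "acronym": "no", "number": "yes"},
        {"pattern": "declarative", "emotion": "happy", "acronym": "no", "number": "no"},
        {"pattern": "declarative", "emotion": "surprise", "acronym": "yes", "number": "no"},
        {"pattern": "exclamation", "emotion": "happy", "acronym": "no", "number": "yes"},
    ],
}


def type_proportions(sentences):
    """Return {dimension: {sentence_type: share}} for a list of tagged sentences.

    Assumes every sentence is tagged under the same set of dimensions.
    """
    proportions = {}
    dimensions = sentences[0].keys() if sentences else []
    for dim in dimensions:
        counts = Counter(sentence[dim] for sentence in sentences)
        total = sum(counts.values())
        proportions[dim] = {t: n / total for t, n in counts.items()}
    return proportions


props = type_proportions(paragraph["sentences"])
```

A sentence corpus can be handled the same way by treating it as a list containing a single tagged sentence.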
If the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition. The paragraph screening condition screens the paragraph corpora in the corpus pool to be selected from the paragraph dimension, and comprises the number of corpora to be selected and the proportion of each paragraph type under the current paragraph dimension. Taking the application scenario as the paragraph dimension, one paragraph screening condition is to obtain 1000 paragraph corpora, of which the news category accounts for 30%, the music category for 30%, and the geography category for 40%. The sentence screening condition screens the corpora from different sentence dimensions, and comprises the selection proportion of each sentence type under each sentence dimension. For example, one sentence screening condition is as follows: under the sentence pattern dimension, question sentences account for 30%, declarative sentences for 50%, and exclamation sentences for 20%; under the emotion dimension, "happy" sentences account for 80%, "surprise" sentences for 10%, and "anger" sentences for 10%; under the acronym dimension, sentences containing acronyms account for 30% and sentences without acronyms for 70%; under the number dimension, sentences containing numbers account for 40% and sentences without numbers for 60%.
If the corpus to be selected is a sentence corpus, the target screening condition is a sentence screening condition, and the sentence screening condition comprises the selection proportion of each sentence type under each sentence dimension and also comprises the corpus selection quantity; in the embodiment of the present invention, optionally, the paragraph dimension, the paragraph type in the paragraph dimension, the sentence dimension, and the sentence type in the sentence dimension are not specifically limited.
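One possible way to write down such a target screening condition is sketched below in Python, using the example figures from this embodiment. The nesting, the key names and the `validate` helper are assumptions made for illustration only; for a sentence corpus pool the `paragraph` entry would simply be omitted.

```python
import math

# Hypothetical encoding of the example target screening condition.
target_condition = {
    "count": 1000,  # corpus selection quantity
    "paragraph": {
        "scene": {"news": 0.3, "music": 0.3, "geography": 0.4},
    },
    "sentence": {
        "pattern": {"question": 0.3, "declarative": 0.5, "exclamation": 0.2},
        "emotion": {"happy": 0.8, "surprise": 0.1, "anger": 0.1},
        "acronym": {"yes": 0.3, "no": 0.7},
        "number": {"yes": 0.4, "no": 0.6},
    },
}


def validate(condition):
    """Check that the shares of every dimension are non-negative and sum to 1."""
    for level in ("paragraph", "sentence"):
        for dim, shares in condition.get(level, {}).items():
            if any(share < 0 for share in shares.values()):
                raise ValueError(f"negative share in dimension {dim!r}")
            if not math.isclose(sum(shares.values()), 1.0, abs_tol=1e-9):
                raise ValueError(f"shares of dimension {dim!r} do not sum to 1")
    return True
```

Validating the condition up front keeps later error values comparable across dimensions, since every dimension then describes a complete distribution.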
And S120, adding the corpus to be selected in the corpus pool to be selected, which has the smallest error value with the target screening condition, into the selected corpus pool in sequence, and taking the selected corpus in the selected corpus pool as the target corpus when the corpus quantity in the selected corpus pool reaches a first quantity threshold value.
The first quantity threshold is the corpus acquisition quantity defined in the target screening condition. If the corpus pool to be selected is a paragraph corpus pool and each paragraph corpus comprises a plurality of sentences, each corpus to be selected in the pool can be compared directly with the target screening condition, the corpora to be selected are sorted by the error value between the two, and the corpora to be selected with the smallest error values are added to the selected corpus pool in sequence. Taking the above technical solution as an example, the paragraph screening condition is to obtain 1000 paragraph corpora, of which the news category accounts for 30%, the music category for 30%, and the geography category for 40%. The sentence screening condition is as follows: under the sentence pattern dimension, question sentences account for 30%, declarative sentences for 50%, and exclamation sentences for 20%; under the emotion dimension, "happy" sentences account for 80%, "surprise" sentences for 10%, and "anger" sentences for 10%; under the acronym dimension, sentences containing acronyms account for 30% and sentences without acronyms for 70%; under the number dimension, sentences containing numbers account for 40% and sentences without numbers for 60%.
According to the target screening condition, the paragraph screening condition is first used as the primary screening basis; that is, from the proportion of each paragraph type under the paragraph dimension, it is calculated that 300 news paragraph corpora, 300 music paragraph corpora and 400 geography paragraph corpora need to be obtained. The paragraph corpora can then be obtained in any of the following orders. First, paragraph corpora can be obtained under each paragraph type in turn according to the proportions of the paragraph types; for example, the ratio of the news, music and geography categories in the paragraph screening condition is 3:3:4, so 3 news paragraph corpora, 3 music paragraph corpora and 4 geography paragraph corpora are selected in sequence, and this selection operation is repeated until 1000 paragraph corpora are obtained. Second, all paragraph corpora of each paragraph type can be obtained type by type according to the actual acquisition quantity of each paragraph type, that is, 300 news paragraph corpora are selected first, then 300 music paragraph corpora, and finally 400 geography paragraph corpora. Third, one paragraph corpus can be obtained under each paragraph type in turn, that is, each round selects one paragraph corpus from each of the news, music and geography paragraph corpora; once the number of paragraph corpora selected under a paragraph type reaches the acquisition quantity for that type, no further paragraph corpora are obtained from that type, and acquisition continues under the other paragraph types.
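The quota calculation and the three acquisition orders described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, the per-cycle counts passed to `cyclic_order` are assumed to be positive integers, and rounding each quota independently is a simplification that happens to be exact for the example figures.

```python
def quotas(total, shares):
    """Number of paragraph corpora to obtain for each paragraph type."""
    return {t: round(total * share) for t, share in shares.items()}


def cyclic_order(quota, per_cycle):
    """Take per_cycle[t] corpora of each type per cycle until quotas are met."""
    remaining = dict(quota)
    order = []
    while any(remaining.values()):
        for t, k in per_cycle.items():
            take = min(k, remaining[t])
            order.extend([t] * take)
            remaining[t] -= take
    return order


def block_order(quota):
    """Obtain all corpora of one type before moving on to the next type."""
    return [t for t, n in quota.items() for _ in range(n)]


def round_robin_order(quota):
    """Take one corpus per type each round, skipping types whose quota is met."""
    remaining = dict(quota)
    order = []
    while any(remaining.values()):
        for t in remaining:
            if remaining[t] > 0:
                order.append(t)
                remaining[t] -= 1
    return order


q = quotas(1000, {"news": 0.3, "music": 0.3, "geography": 0.4})
cyclic = cyclic_order(q, {"news": 3, "music": 3, "geography": 4})
blocks = block_order(q)
robin = round_robin_order(q)
```

Each function only produces the sequence of paragraph types to draw from; which paragraph corpus is drawn within a type is decided by the error value against the sentence screening condition.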
Then, when paragraph corpora are selected within each paragraph type, the sentence type proportions under each sentence dimension of each paragraph corpus are compared with the sentence type proportions under the same sentence dimension in the sentence screening condition, the proportion differences under each sentence dimension are calculated, and the differences are summed over the sentence dimensions to obtain the final error value; the paragraph corpus with the smallest error value is taken as the selected paragraph corpus. For example, in the current paragraph corpus, under the sentence pattern dimension, question sentences account for 40%, declarative sentences for 50%, and exclamation sentences for 10%; under the emotion dimension, "happy" sentences account for 50%, "surprise" sentences for 30%, and "anger" sentences for 20%; under the acronym dimension, sentences containing acronyms account for 60% and sentences without acronyms for 40%; under the number dimension, sentences containing numbers account for 20% and sentences without numbers for 80%. Therefore, under the sentence pattern dimension, the ratio of question, declarative and exclamation sentences is 4:5:1; under the emotion dimension, the ratio of "happy", "surprise" and "anger" sentences is 5:3:2; under the acronym dimension, the ratio of sentences with and without acronyms is 6:4; under the number dimension, the ratio of sentences with and without numbers is 1:4.
In particular, for convenience of calculation, the sentence type ratios under each sentence dimension may be normalized; for example, the above ratio values are all converted into values between 0 and 1 inclusive, that is, the largest value among the sentence type ratios under each sentence dimension is set to 1 and the remaining values are scaled down in the same proportion. Thus, in the current paragraph corpus, under the sentence pattern dimension, the ratio of question, declarative and exclamation sentences is 0.8:1:0.2; under the emotion dimension, the ratio of "happy", "surprise" and "anger" sentences is 1:0.6:0.4; under the acronym dimension, the ratio of sentences with and without acronyms is 1:0.67; under the number dimension, the ratio of sentences with and without numbers is 0.25:1.
In the target screening condition, under the sentence pattern dimension, the ratio of question, declarative and exclamation sentences is 3:5:2; under the emotion dimension, the ratio of "happy", "surprise" and "anger" sentences is 8:1:1; under the acronym dimension, the ratio of sentences with and without acronyms is 3:7; under the number dimension, the ratio of sentences with and without numbers is 4:6. After normalization, the ratio of question, declarative and exclamation sentences is 0.6:1:0.4; the ratio of "happy", "surprise" and "anger" sentences is 1:0.125:0.125; the ratio of sentences with and without acronyms is 0.43:1; the ratio of sentences with and without numbers is 0.67:1.
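The normalization described above, in which the largest value in each dimension becomes 1 and the others are scaled in the same proportion, can be sketched in a few lines of Python. The function name `normalize_by_max` is hypothetical, and returning zeros for an all-zero dimension is an added assumption to avoid division by zero.

```python
def normalize_by_max(shares):
    """Scale the shares of one dimension so that the largest value becomes 1."""
    peak = max(shares.values(), default=0)
    if peak == 0:
        # Assumption: an empty or all-zero dimension normalizes to zeros.
        return {t: 0.0 for t in shares}
    return {t: value / peak for t, value in shares.items()}


current_pattern = normalize_by_max(
    {"question": 0.4, "declarative": 0.5, "exclamation": 0.1}
)
target_emotion = normalize_by_max({"happy": 0.8, "surprise": 0.1, "anger": 0.1})
target_acronym = normalize_by_max({"yes": 0.3, "no": 0.7})
```

Applied to the example figures, this reproduces the ratios 0.8:1:0.2 for the current sentence pattern dimension, 1:0.125:0.125 for the target emotion dimension, and approximately 0.43:1 for the target acronym dimension.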
Next, the error value between the sentence type ratios under each sentence dimension of the current paragraph corpus and the sentence type ratios under the corresponding sentence dimension in the sentence screening condition is obtained; the error may be a true (signed) error, an absolute error, or a squared error. Taking the above technical solution as an example, the differences under the sentence pattern dimension are 0.2, 0 and -0.2, so the absolute error is 0.4; the differences under the emotion dimension are 0, 0.475 and 0.275, so the absolute error is 0.75; the differences under the acronym dimension are 0.57 and -0.33, so the absolute error is 0.9; the differences under the number dimension are -0.42 and 0, so the absolute error is 0.42. The sum of the absolute errors between the current paragraph corpus and the target screening condition is therefore 2.47. The one or more paragraph corpora with the smallest error values under each paragraph type are added to the selected corpus pool as selected corpora; once the paragraph corpora of each paragraph type reach their respective acquisition quantities, the number of corpora in the selected corpus pool reaches the first quantity threshold, and the selected corpora in the selected corpus pool are taken as the target corpus.
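The per-dimension error calculation can be sketched as follows, assuming max-normalization of both the current paragraph corpus and the target and a plain unweighted sum over dimensions. The function names and the `kind` parameter are hypothetical; sentence types that appear in the corpus but not in the target are ignored in this sketch. Evaluated from the example percentages without intermediate rounding, it yields per-dimension absolute errors of about 0.40, 0.75, 0.90 and 0.42, about 2.47 in total.

```python
def normalize_by_max(shares):
    """Scale the shares of one dimension so that the largest value becomes 1."""
    peak = max(shares.values(), default=0)
    return {t: (v / peak if peak else 0.0) for t, v in shares.items()}


def dimension_errors(current, target, kind="absolute"):
    """Error per sentence dimension between a corpus and the target condition.

    kind is "true" (signed sum), "absolute" or "squared".
    """
    errors = {}
    for dim, target_shares in target.items():
        cur = normalize_by_max(current[dim])
        tgt = normalize_by_max(target_shares)
        diffs = [cur.get(t, 0.0) - tgt[t] for t in tgt]
        if kind == "true":
            errors[dim] = sum(diffs)
        elif kind == "absolute":
            errors[dim] = sum(abs(d) for d in diffs)
        elif kind == "squared":
            errors[dim] = sum(d * d for d in diffs)
        else:
            raise ValueError(f"unknown error kind: {kind!r}")
    return errors


current = {
    "pattern": {"question": 0.4, "declarative": 0.5, "exclamation": 0.1},
    "emotion": {"happy": 0.5, "surprise": 0.3, "anger": 0.2},
    "acronym": {"yes": 0.6, "no": 0.4},
    "number": {"yes": 0.2, "no": 0.8},
}
target = {
    "pattern": {"question": 0.3, "declarative": 0.5, "exclamation": 0.2},
    "emotion": {"happy": 0.8, "surprise": 0.1, "anger": 0.1},
    "acronym": {"yes": 0.3, "no": 0.7},
    "number": {"yes": 0.4, "no": 0.6},
}

absolute = dimension_errors(current, target, kind="absolute")
total_absolute = sum(absolute.values())
```

Passing `kind="true"` gives the signed variant, in which positive and negative differences within a dimension can cancel.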
Optionally, in the embodiment of the present invention, sequentially adding the corpus to be selected that has the smallest error value with respect to the target screening condition in the corpus pool to be selected into the selected corpus pool includes: classifying the corpora to be selected according to the error value between each corpus to be selected and the target screening condition, so as to obtain a positive error corpus set, a negative error corpus set and a zero error corpus set; adding the corpora to be selected in the zero error corpus set into the selected corpus pool, and sorting the corpora to be selected in the positive error corpus set and in the negative error corpus set respectively; and extracting, in turn, the corpus to be selected with the smallest absolute error from the positive error corpus set and the negative error corpus set, and adding it into the selected corpus pool.
Specifically, the corpora to be selected can be classified according to the true (signed) error between each corpus to be selected and the target screening condition, forming a positive error corpus set, a negative error corpus set and a zero error corpus set. Taking the above technical solution as an example, the differences under the sentence pattern dimension are 0.2, 0 and -0.2, so the true error is 0; the differences under the emotion dimension are 0, 0.475 and 0.275, so the true error is 0.75; the differences under the acronym dimension are 0.57 and -0.33, so the true error is 0.24; the differences under the number dimension are -0.42 and 0, so the true error is -0.42. The sum of the true errors between the current paragraph corpus and the target screening condition is therefore 0.57, and the current paragraph corpus is assigned to the positive error corpus set. After all the corpora to be selected in the corpus pool to be selected have been assigned, the paragraph corpora in the zero error corpus set are added to the selected corpus pool; if the number of corpora in the selected corpus pool has not reached the first quantity threshold, the corpus to be selected with the smallest absolute error is extracted in turn from the positive error corpus set and the negative error corpus set and added to the selected corpus pool, until the number of corpora in the selected corpus pool reaches the first quantity threshold. Compared with directly sorting all the corpora to be selected, extracting the corpora to be selected in turn from the positive error side and the negative error side by means of the positive error, negative error and zero error corpus sets further improves the matching degree between the target corpus and the target screening condition and improves the acquisition accuracy of the target corpus.
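A minimal Python sketch of this classification and extraction is given below. It assumes that the signed total error of each corpus to be selected has already been computed, that "in turn" means strictly alternating between the positive and negative sets (starting with the positive set), and that a small tolerance `eps` decides what counts as zero error; these are interpretive assumptions, and the function name is hypothetical.

```python
def select_by_signed_error(errors, limit, eps=1e-12):
    """Pick up to `limit` corpus ids from {corpus_id: signed total error}.

    Zero-error corpora are taken first; the rest are drawn alternately from
    the positive and negative sets, smallest absolute error first.
    """
    zero = [c for c, e in errors.items() if abs(e) <= eps]
    positive = sorted((c for c, e in errors.items() if e > eps),
                      key=lambda c: errors[c])
    negative = sorted((c for c, e in errors.items() if e < -eps),
                      key=lambda c: -errors[c])

    selected = zero[:limit]
    pos_i = neg_i = 0
    take_positive = True
    while len(selected) < limit and (pos_i < len(positive) or neg_i < len(negative)):
        use_positive = (take_positive and pos_i < len(positive)) or neg_i >= len(negative)
        if use_positive:
            selected.append(positive[pos_i])
            pos_i += 1
        else:
            selected.append(negative[neg_i])
            neg_i += 1
        take_positive = not take_positive
    return selected


example_errors = {"a": 0.0, "b": 0.57, "c": -0.1, "d": 0.2, "e": -0.3}
picked = select_by_signed_error(example_errors, limit=4)
```

Alternating between the two sets tends to let positive and negative deviations offset each other in the selected corpus pool, which is one plausible reading of why the patent prefers this to a single sorted list.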
Optionally, in the embodiment of the present invention, the sequentially adding the corpus to be selected in the corpus pool to be selected, which has the smallest error value with the target screening condition, to the selected corpus pool until the corpus number in the selected corpus pool reaches the first number threshold, and taking the selected corpus in the selected corpus pool as the target corpus includes: obtaining a corpus subset to be selected according to the corpus pool to be selected; the corpus subset to be selected comprises a plurality of corpora to be selected; adding the specified to-be-selected corpus with the minimum error value with the target screening condition in the to-be-selected corpus subset into a selected corpus pool, and returning other to-be-selected corpora except the specified to-be-selected corpus in the to-be-selected corpus subset into the to-be-selected corpus pool; and continuously acquiring a corpus subset to be selected according to the corpus pool to be selected, and taking the selected corpus in the selected corpus pool as a target corpus when the corpus quantity in the selected corpus pool reaches a first quantity threshold.
Specifically, when corpora to be selected are obtained from the corpus pool to be selected, a certain number of corpora to be selected can be obtained at random to form a subset of corpora to be selected. Within this subset, the corpus to be selected that has the smallest error value with respect to the target screening condition is obtained and added to the selected corpus pool. The subset is then released, that is, the other corpora to be selected in the subset that were not chosen are returned to the corpus pool to be selected, and the next subset of corpora to be selected is constructed. This continues until the number of corpora in the selected corpus pool reaches the first quantity threshold, at which point the selected corpora in the selected corpus pool are taken as the target corpus. Compared with searching all the corpora to be selected in the corpus pool to be selected for every extraction, constructing a subset of corpora to be selected greatly reduces the screening range and improves the efficiency of obtaining the target corpus.
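The subset-based procedure above can be sketched as follows. This is a minimal Python illustration, assuming that the error of each corpus to be selected is given by a caller-supplied function `error_of` and that the subset size is fixed; both names, and the optional `seed` for reproducibility, are hypothetical.

```python
import random


def select_with_random_subsets(pool, error_of, limit, subset_size, seed=None):
    """Repeatedly sample a subset, keep its best corpus, return the rest."""
    rng = random.Random(seed)
    remaining = list(pool)   # corpus pool to be selected
    selected = []            # selected corpus pool
    while remaining and len(selected) < limit:
        subset = rng.sample(remaining, min(subset_size, len(remaining)))
        best = min(subset, key=error_of)
        selected.append(best)
        # Only the chosen corpus leaves the pool; the other members of the
        # subset stay available for later subsets.
        remaining.remove(best)
    return selected


pool = list(range(20))
chosen = select_with_random_subsets(
    pool, error_of=lambda x: abs(x - 10), limit=5, subset_size=3, seed=0
)
```

Each round evaluates only `subset_size` corpora instead of the whole pool, which is the efficiency gain the paragraph describes; the trade-off is that the globally best corpus is not guaranteed to be picked in any given round.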
According to the technical scheme disclosed by the embodiment of the invention, after the corpora to be selected in the corpus pool to be selected and the target screening condition are obtained, the corpus to be selected that has the smallest error value with respect to the target screening condition is sequentially added into the selected corpus pool until the number of corpora in the selected corpus pool reaches the first quantity threshold, and the selected corpora in the selected corpus pool are taken as the target corpus. Corpus screening under multiple sentence dimensions is thereby realized and corpus acquisition efficiency is improved; at the same time, the corpus distortion caused by manual correction, and the semantic loss that follows from it, are avoided, so that the semantic accuracy of the corpus is ensured and corpus quality is improved.
Example two
Fig. 2 is a flowchart of a corpus acquiring method according to a second embodiment of the present invention, which is embodied on the basis of the foregoing technical solution, in which a selected corpus in a selected corpus pool and a to-be-selected corpus in a to-be-selected corpus pool are respectively combined into combined corpuses, and corpuses are selected from the to-be-selected corpus pool according to an error value between each combined corpus and a target screening condition, and specifically, the method includes the following steps:
S210, obtaining the corpora to be selected in the corpus pool to be selected and a target screening condition; wherein the corpus to be selected comprises a paragraph corpus or a sentence corpus; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions.
S220, combining the selected corpus in the selected corpus pool and the to-be-selected corpus in the to-be-selected corpus pool to form combined corpuses respectively, and obtaining a first combined corpus with the minimum error value according to the error value of each combined corpus and the target screening condition so as to add the first to-be-selected corpus corresponding to the first combined corpus into the selected corpus pool.
Initially, the selected corpus pool is empty. In that case, a number of to-be-selected corpora equal to a first base threshold is randomly obtained from the to-be-selected corpus pool and added to the selected corpus pool; the first base threshold is small, typically much smaller than the first number threshold. Then, each to-be-selected corpus in the to-be-selected corpus pool is combined with all selected corpora in the selected corpus pool to form a combined corpus; the error values between the combined corpora and the target screening condition are compared to obtain the first combined corpus, which has the smallest error value; finally, the to-be-selected corpus contained in the first combined corpus (namely the first to-be-selected corpus) is added to the selected corpus pool.
S230, continuing to combine the selected corpora in the selected corpus pool with the to-be-selected corpora in the to-be-selected corpus pool to form combined corpora, until the number of corpora in the selected corpus pool reaches the first number threshold, and then taking the selected corpora in the selected corpus pool as the target corpora.
After the to-be-selected corpus in the combined corpus with the smallest error value has been added to the selected corpus pool, each remaining to-be-selected corpus is again combined with all selected corpora, and the to-be-selected corpus in the combined corpus with the smallest error value is again obtained; this repeats until the number of corpora in the selected corpus pool reaches the first number threshold, at which point the selected corpora in the selected corpus pool are taken as the target corpora. In particular, weight values may be set for the paragraph dimensions and the sentence dimensions respectively: the difference between the combined corpus and the target screening condition in each dimension is multiplied by that dimension's weight, the products are summed, and the sum is used as the final error value of the to-be-selected corpus with respect to the target screening condition, so that the target corpora reflect the differing importance of the dimensions.
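A minimal sketch of the combined-corpus selection with a weighted error is given below. It assumes, purely for illustration, that each corpus carries per-dimension counts, that the target screening condition is expressed as the desired share of each dimension, and that the per-dimension difference is an absolute difference; the names `combined_error`, `greedy_combined_selection` and `seed_ids` (standing in for the corpora drawn under the first base threshold) are hypothetical.

```python
def combined_error(selected_feats, candidate_feats, target, weights):
    """Weighted error between (all selected corpora + one candidate) and the target.

    Assumption: features are per-dimension counts and `target` gives the
    desired share of each dimension; neither detail is fixed by the source.
    """
    totals = {dim: candidate_feats.get(dim, 0) for dim in target}
    for feats in selected_feats:
        for dim in target:
            totals[dim] += feats.get(dim, 0)
    grand_total = sum(totals.values()) or 1   # avoid division by zero
    # Multiply each dimension's difference by its weight, then sum.
    return sum(
        weights.get(dim, 1.0) * abs(totals[dim] / grand_total - share)
        for dim, share in target.items()
    )


def greedy_combined_selection(candidates, target, weights,
                              first_number_threshold, seed_ids=()):
    """Repeatedly add the candidate whose combined corpus has the smallest error."""
    pool = dict(candidates)            # to-be-selected corpus pool
    selected = list(seed_ids)          # initial corpora (first base threshold)
    for cid in selected:
        pool.pop(cid)
    while pool and len(selected) < first_number_threshold:
        chosen_feats = [candidates[cid] for cid in selected]
        best = min(
            sorted(pool),
            key=lambda cid: combined_error(chosen_feats, pool[cid], target, weights),
        )
        selected.append(best)
        del pool[best]
    return selected
```

Because the error is computed on the combined corpus rather than on each candidate alone, a candidate that is individually far from the target can still be chosen when it corrects an imbalance already present in the selected corpus pool.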
Optionally, in this embodiment of the present invention, after the selected corpora in the selected corpus pool and the to-be-selected corpora in the to-be-selected corpus pool are combined into combined corpora, the method further includes: obtaining, in sequence, the error value between each combined corpus and the target screening condition; and if the obtained error value between a second combined corpus and the target screening condition is smaller than a first error threshold, adding the second to-be-selected corpus corresponding to the second combined corpus to the selected corpus pool. If the error value between an obtained combined corpus and the target screening condition is small, that is, smaller than the first error threshold, the combined corpus is already highly similar to the target screening condition. In that case, the remaining combined corpora need not be compared with the target screening condition; instead, the to-be-selected corpus (the second to-be-selected corpus) in that combined corpus (the second combined corpus) is put directly into the selected corpus pool, which simplifies the comparison process and further improves the efficiency of obtaining the target corpora.
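The early-exit rule can be illustrated with a short sketch. The function name `pick_candidate` and the `(id, error)` pair format are assumptions made for this example; the error values are taken as already computed for each combined corpus.

```python
def pick_candidate(scored_candidates, first_error_threshold):
    """Scan (candidate_id, error) pairs in order and stop early when possible.

    Returns (candidate_id, error, number_examined). If some error falls below
    the first error threshold, that candidate is accepted immediately;
    otherwise the candidate with the smallest error overall is returned.
    """
    best_id, best_err, examined = None, float("inf"), 0
    for cid, err in scored_candidates:
        examined += 1
        if err < first_error_threshold:
            return cid, err, examined      # good enough: skip the remaining ones
        if err < best_err:
            best_id, best_err = cid, err
    return best_id, best_err, examined
```

With a threshold of 0.1 and errors 0.4, 0.05, 0.01 arriving in that order, the second candidate is accepted after two comparisons even though the third has a smaller error; this is the accuracy given up in exchange for fewer comparisons.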
According to the technical solution disclosed in this embodiment of the invention, the selected corpora in the selected corpus pool are combined with each to-be-selected corpus in the to-be-selected corpus pool to form combined corpora; the first combined corpus with the smallest error value is obtained according to the error value between each combined corpus and the target screening condition, and the corresponding first to-be-selected corpus is added to the selected corpus pool, until the number of corpora in the selected corpus pool reaches the first number threshold, at which point the selected corpora in the selected corpus pool are taken as the target corpora. Compared with comparing the error between a single corpus and the target screening condition, obtaining the target corpora on the basis of combined corpora greatly reduces the error between the selected corpora and the target screening condition and further improves the accuracy of corpus acquisition.
Example three
Fig. 3 is a flowchart of a corpus acquiring method according to a third embodiment of the present invention, which is embodied on the basis of the foregoing technical solution, in the embodiment of the present invention, in the corpus screening process, if the dimensions of each sentence in the target screening condition are not changed and the screening value of at least one sentence dimension is changed, corpus screening may be continued based on the currently selected corpus, and specifically, the method includes the following steps:
S310, obtaining the to-be-selected corpora in the to-be-selected corpus pool and the target screening condition; wherein the to-be-selected corpora comprise paragraph corpora or sentence corpora; if the to-be-selected corpora are paragraph corpora, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the to-be-selected corpora are sentence corpora, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions; then S320 is performed.
S320, combining the selected corpus in the selected corpus pool and the to-be-selected corpus in the to-be-selected corpus pool into combined corpuses respectively, and obtaining a first combined corpus with the minimum error value according to the error value of each combined corpus and the target screening condition so as to add the first to-be-selected corpus corresponding to the first combined corpus into the selected corpus pool; s330 is performed.
S330, responding to the change information of the target screening condition, and acquiring the changed target screening condition; s340 is performed.
S340, if the dimensionality of each sentence in the changed target screening condition is not changed and the screening numerical value of at least one sentence dimensionality is changed, judging whether the corpus quantity in the selected corpus pool is less than or equal to a second quantity threshold value or not; wherein the second quantity threshold is less than the first quantity threshold; if not, executing S350; if yes, go to step S360.
When corpora are being screened according to the target screening condition and the target screening condition changes, the usual approach is to end the screening directly, empty the selected corpus pool, and screen again according to the changed target screening condition. However, if the sentence dimensions in the changed target screening condition are unchanged and only the screening values under one or more sentence dimensions have changed, the selected corpora were screened under the same dimensions as the changed target screening condition, which provides a dimensional basis for continuing the screening from the corpora selected before the change. In addition, if the number of selected corpora is small, that is, less than or equal to the second number threshold, there is still ample room for adjustment in the subsequent screening: by adjusting the number of sentences of each sentence type acquired afterwards, the final screening result can fully or approximately satisfy the changed target screening condition. For example, if the target screening conditions before and after the change both call for 1000 corpora and only 200 corpora had been selected before the change, the number of corpora of each sentence type can be adjusted while the remaining 800 corpora are screened, so that the final result fully or approximately satisfies the changed target screening condition. If the number of selected corpora is large, that is, greater than the second number threshold, little room for adjustment remains: the number of sentences acquired afterwards in each sentence dimension cannot be adjusted enough, so the final screening result cannot fully or approximately satisfy the changed target screening condition and a large screening error results. For example, if the target screening condition calls for 1000 corpora and 980 corpora had already been selected before the change, then even if the number of corpora of each sentence type is adjusted while the remaining 20 corpora are screened, the final result cannot approximately satisfy the changed target screening condition, and a large screening error results.
S350, emptying the selected corpus pool.
If the number of corpora in the selected corpus pool is greater than the second number threshold, the screening is ended, the selected corpus pool is emptied, and corpora are then screened again according to the changed target screening condition.
S360, combining the selected corpora in the selected corpus pool with the to-be-selected corpora in the to-be-selected corpus pool to form combined corpora, and obtaining the first combined corpus with the smallest error value according to the error values between the combined corpora and the changed target screening condition, so as to add the first to-be-selected corpus corresponding to the first combined corpus to the selected corpus pool; then S370 is performed.
If the corpus quantity in the selected corpus pool is less than or equal to the second quantity threshold value, the selected corpus can be used for continuously screening and obtaining the subsequent corpus according to the technical scheme.
S370, continuing to combine the selected corpora in the selected corpus pool with the to-be-selected corpora in the to-be-selected corpus pool to form combined corpora, until the number of corpora in the selected corpus pool reaches the first number threshold, and then taking the selected corpora in the selected corpus pool as the target corpora.
According to the technical scheme disclosed by the embodiment of the invention, when the corpus is screened according to the target screening condition, if the target screening condition is changed, the dimensionality of each sentence in the changed target screening condition is not changed, and the screening value under one or more sentence dimensionalities is changed, the selected corpus before the target screening condition is changed is utilized, the selected corpus and the corpus to be selected are continuously combined into the combined corpus, the corpus to be selected is continuously obtained according to the error value of each combined corpus and the target screening condition until the corpus number in the selected corpus pool reaches a first number threshold value, the selected corpus in the selected corpus pool is used as the target corpus, the selected corpus before the target screening condition is changed is fully utilized, the waste of the selected corpus under the same sentence dimensionality is avoided, and the obtaining efficiency of the target corpus is further improved.
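The decision made in S340 to S360 can be sketched as a small function. The representation of a screening condition as a dictionary from sentence dimension to screening value, the function name `handle_condition_change`, and the threshold value 500 used in the usage note are assumptions made only for illustration; the source does not give a value for the second number threshold.

```python
def handle_condition_change(old_target, new_target, selected_ids,
                            second_number_threshold):
    """Decide whether screening may continue after the target condition changes.

    old_target / new_target: dict mapping sentence dimension -> screening value
    (hypothetical layout). Returns ("continue", kept_ids) or ("restart", []).
    """
    # The sentence dimensions themselves must be unchanged; only values may differ.
    same_dimensions = set(old_target) == set(new_target)
    # Enough adjustment room remains only if few corpora were selected so far.
    few_enough = len(selected_ids) <= second_number_threshold
    if same_dimensions and few_enough:
        return "continue", list(selected_ids)   # S360: keep the selected corpus pool
    return "restart", []                        # S350: empty the selected corpus pool
```

Using the figures from the text with an assumed second number threshold of 500, a pool holding 200 selected corpora is kept and screening continues, while a pool holding 980 selected corpora is emptied and screening restarts.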
Example four
Fig. 4A is a flowchart of a corpus acquiring method according to a fourth embodiment of the present invention, which is embodied on the basis of the foregoing technical solution, and in the embodiment of the present invention, after acquiring a corpus to be selected in a corpus pool to be selected and a target screening condition, the method further includes determining whether a number of the corpus to be selected in the corpus pool to be selected is less than a third number threshold, specifically, the method includes the following steps:
S410, obtaining the to-be-selected corpora in the to-be-selected corpus pool and the target screening condition; wherein the to-be-selected corpora comprise paragraph corpora or sentence corpora; if the to-be-selected corpora are paragraph corpora, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the to-be-selected corpora are sentence corpora, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions; then S420 is performed.
S420, judging whether the number of the linguistic data to be selected in the linguistic data pool to be selected is smaller than a third number threshold value; wherein the third number threshold is greater than or equal to the first number threshold; if not, executing S430; if yes, go to step S440.
S430, adding, in turn, the to-be-selected corpus in the to-be-selected corpus pool that has the smallest error value relative to the target screening condition to the selected corpus pool, until the number of corpora in the selected corpus pool reaches the first number threshold, and then taking the selected corpora in the selected corpus pool as the target corpora.
S440, acquiring a fourth quantity threshold according to the quantity of the linguistic data to be selected in the linguistic data pool to be selected and the first proportional coefficient; s450 is performed.
S450, adding, in turn, the to-be-selected corpus in the to-be-selected corpus pool that has the smallest error value relative to the target screening condition to the selected corpus pool, until the number of corpora in the selected corpus pool reaches the fourth number threshold, and then waiting for the to-be-selected corpus pool to be expanded.
The number of to-be-selected corpora in the to-be-selected corpus pool may be too small to support the current screening (for example, 1000 corpora are to be acquired but the pool holds only 700), or it may support the screening only with a large error in the result (for example, 1000 corpora are to be acquired and the pool holds 1200; the requirement of 1000 corpora can be met, but the sample is too small and the screening error is large). In either case, a fourth number threshold is obtained from the number of to-be-selected corpora in the to-be-selected corpus pool and a first scale coefficient (for example, 50%). Continuing the second example, the fourth number threshold is 1200 × 50% = 600, so 600 corpora are obtained from the 1200 to-be-selected corpora and added to the selected corpus pool; subsequent screening is then suspended until new to-be-selected corpora are added to the to-be-selected corpus pool, and the 600 corpora already obtained are stored as historical data.
Optionally, in this embodiment of the present invention, the difference between the first number threshold and the fourth number threshold is used as a fifth number threshold, and the quotient of the fifth number threshold and the first scale coefficient is used as a sixth number threshold; when the number of to-be-selected corpora remaining in the to-be-selected corpus pool is determined to be greater than or equal to the sixth number threshold, the to-be-selected corpus with the smallest error value relative to the target screening condition is again added to the selected corpus pool in turn, until the number of corpora in the selected corpus pool reaches the first number threshold, and the selected corpora in the selected corpus pool are taken as the target corpora. Continuing the example above, 400 more corpora (1000 − 600, the fifth number threshold) must be acquired on top of the 600 already obtained, and according to the first scale coefficient the screening requirement can be met only once the to-be-selected corpus pool holds at least 400 / 50% = 800 to-be-selected corpora (the sixth number threshold). Therefore, the number of to-be-selected corpora in the to-be-selected corpus pool is monitored, and once the pool has been expanded and holds 800 to-be-selected corpora, the subsequent screening continues until the number of selected corpora satisfies the target screening condition.
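The threshold arithmetic of this embodiment can be checked with a few lines of code. The function names `staged_thresholds` and `can_resume` are hypothetical, and the rounding choices (truncating the fourth threshold, rounding the sixth up) are assumptions, since the source only gives an example in which every value is a whole number.

```python
import math


def staged_thresholds(pool_size, first_number_threshold, first_scale_coefficient):
    """Return the (fourth, fifth, sixth) number thresholds.

    fourth = pool size x first scale coefficient
    fifth  = first number threshold - fourth
    sixth  = fifth / first scale coefficient
    """
    fourth = int(pool_size * first_scale_coefficient)       # assumed: truncate
    fifth = first_number_threshold - fourth
    sixth = math.ceil(fifth / first_scale_coefficient)      # assumed: round up
    return fourth, fifth, sixth


def can_resume(remaining_in_pool, sixth):
    """Screening resumes once the expanded pool holds at least `sixth` corpora."""
    return remaining_in_pool >= sixth
```

For the example in the text (1200 to-be-selected corpora, 1000 corpora to acquire, a scale coefficient of 50%), `staged_thresholds(1200, 1000, 0.5)` gives 600, 400 and 800. After the first 600 corpora are taken, 600 remain in the pool, so screening stays suspended until the pool is expanded to at least 800.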
As shown in fig. 4B, when the history data is not included, the server obtains the target screening condition by obtaining the configuration file and analyzing the configuration file; the method comprises the steps that a server obtains corpus Identification (ID) and dimension data of each corpus to be selected through a corpus pool to be selected; the server combines the selected corpus and the to-be-selected corpus into a combined corpus, selects a combined corpus which enables the selected data (namely the combined corpus) to be closer to the target screening condition in each iteration by adopting a 'tentative-elimination' mechanism, selects the matched to-be-selected corpus from the to-be-selected corpus pool according to the identification of the to-be-selected corpus in the combined corpus, and then stores the selected to-be-selected corpus in an output file.
As shown in fig. 4C, when the history data is included, the server obtains the target screening condition by obtaining the configuration file and analyzing the configuration file; the method comprises the steps that a server obtains corpus Identification (ID) and dimension data of each corpus to be selected through a corpus pool to be selected; the server takes the historical data as the selected corpus and adds the selected corpus into the selected corpus pool; the server combines the selected corpus and the to-be-selected corpus into a combined corpus, selects a combined corpus which enables the selected data (namely the combined corpus) to be closer to the target screening condition in each iteration by adopting a 'tentative-elimination' mechanism, selects the matched to-be-selected corpus from the to-be-selected corpus pool according to the identification of the to-be-selected corpus in the combined corpus, and then stores the selected to-be-selected corpus in an output file.
According to the technical solution disclosed in this embodiment of the invention, when the number of to-be-selected corpora in the to-be-selected corpus pool is determined to be smaller than the third number threshold, the fourth number threshold is obtained from the number of to-be-selected corpora in the to-be-selected corpus pool and the first scale coefficient; the to-be-selected corpus with the smallest error value relative to the target screening condition is then added to the selected corpus pool in turn until the number of corpora in the selected corpus pool reaches the fourth number threshold, after which the method waits for the to-be-selected corpus pool to be expanded.
Example five
Fig. 5 is a block diagram of a corpus acquiring device according to a fifth embodiment of the present invention, where the corpus acquiring device specifically includes: a screening condition obtaining module 510 and a corpus adding execution module 520;
a screening condition obtaining module 510, configured to obtain a corpus to be selected and a target screening condition in a corpus pool to be selected; wherein, the corpus to be selected comprises paragraph corpus or sentence corpus; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the statement screening condition comprises a plurality of statement dimensions;
and a corpus adding execution module 520, configured to add, in turn, the to-be-selected corpus in the to-be-selected corpus pool that has the smallest error value relative to the target screening condition to the selected corpus pool, until the number of corpora in the selected corpus pool reaches a first number threshold, and then take the selected corpora in the selected corpus pool as the target corpora.
According to the technical solution disclosed in this embodiment of the invention, after the to-be-selected corpora in the to-be-selected corpus pool and the target screening condition are obtained, the to-be-selected corpus with the smallest error value relative to the target screening condition is added to the selected corpus pool in turn, until the number of corpora in the selected corpus pool reaches the first number threshold, and the selected corpora in the selected corpus pool are then taken as the target corpora. This realizes corpus screening across multiple sentence dimensions and improves corpus acquisition efficiency; at the same time, it avoids the corpus distortion, and the resulting semantic loss, that manual correction would cause, so the semantic accuracy of the corpus is preserved and corpus quality is improved.
Optionally, on the basis of the foregoing technical solution, the corpus adding execution module 520 specifically includes:
the error classification execution unit is used for classifying the linguistic data to be selected according to the error values of the linguistic data to be selected and the target screening condition so as to obtain a positive error linguistic data set, a negative error linguistic data set and a zero error linguistic data set;
the set sorting execution unit is used for adding the linguistic data to be selected in the zero-error corpus set into the selected linguistic data pool, and respectively sorting the linguistic data to be selected in the positive-error corpus set and the negative-error corpus set;
and the set extraction execution unit is used for extracting the corpus to be selected with the minimum error absolute value from the positive error corpus set and the negative error corpus set in sequence and adding the corpus to be selected into the selected corpus pool.
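The three units above can be illustrated together with a short sketch. It assumes that each to-be-selected corpus already has a single signed error value relative to the target screening condition and that the extraction alternates between the positive-error and negative-error sets; both points, and the name `select_by_signed_error`, are assumptions made for this example.

```python
def select_by_signed_error(signed_errors, first_number_threshold):
    """Classify candidates by the sign of their error, then extract alternately.

    signed_errors: dict mapping corpus id -> signed error value (hypothetical).
    """
    # Error classification: zero, positive and negative error corpus sets.
    zero = [cid for cid, err in signed_errors.items() if err == 0]
    positive = sorted((cid for cid, err in signed_errors.items() if err > 0),
                      key=lambda cid: signed_errors[cid])
    negative = sorted((cid for cid, err in signed_errors.items() if err < 0),
                      key=lambda cid: -signed_errors[cid])
    # Zero-error corpora go straight into the selected corpus pool.
    selected = zero[:first_number_threshold]
    queues = [positive, negative]
    turn = 0
    while len(selected) < first_number_threshold and (positive or negative):
        # Take the smallest absolute error from each set in turn, falling back
        # to the other set when the current one is exhausted.
        queue = queues[turn % 2] or queues[(turn + 1) % 2]
        selected.append(queue.pop(0))
        turn += 1
    return selected
```

Alternating between the two sets lets positive and negative errors offset each other in the selected corpus pool instead of accumulating in one direction.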
Optionally, on the basis of the foregoing technical solution, the corpus adding execution module 520 further includes:
a to-be-selected corpus subset obtaining unit, configured to obtain a to-be-selected corpus subset according to the to-be-selected corpus pool; the corpus subset to be selected comprises a plurality of corpora to be selected;
a subset extraction execution unit, configured to add the specified to-be-selected corpus, in the to-be-selected corpus subset, having the smallest error value with the target screening condition into a selected corpus pool, and return the to-be-selected corpus, in the to-be-selected corpus subset, other to-be-selected corpora except the specified to-be-selected corpus, to the to-be-selected corpus pool; and continuously acquiring a corpus subset to be selected according to the corpus pool to be selected, and taking the selected corpus in the selected corpus pool as a target corpus when the corpus quantity in the selected corpus pool reaches a first quantity threshold.
Optionally, on the basis of the foregoing technical solution, the corpus adding execution module 520 specifically further includes:
a combination extraction execution unit, configured to combine the selected corpus in the selected corpus pool and the to-be-selected corpus in the to-be-selected corpus pool into combined corpuses respectively, and obtain a first combined corpus with a smallest error value according to an error value between each combined corpus and the target screening condition, so as to add the first to-be-selected corpus corresponding to the first combined corpus into the selected corpus pool; and continuously combining the selected corpora in the selected corpus pool and the to-be-selected corpora in the to-be-selected corpus pool to form combined corpora respectively until the number of the corpora in the selected corpus pool reaches a first number threshold, and taking the selected corpora in the selected corpus pool as target corpora.
Optionally, on the basis of the foregoing technical solution, the corpus adding execution module 520 specifically further includes:
an error threshold comparison unit, configured to sequentially obtain an error value between each of the combined corpora and the target screening condition; and if the error value of the obtained second combined corpus and the target screening condition is smaller than a first error threshold value, adding a second to-be-selected corpus corresponding to the second combined corpus into the selected corpus pool.
Optionally, on the basis of the above technical solution, if the to-be-selected corpora are paragraph corpora, the combination extraction execution unit is further configured to: obtain the changed target screening condition in response to obtaining change information of the target screening condition; if the sentence dimensions in the changed target screening condition are unchanged and the screening value of at least one sentence dimension has changed, judge whether the number of corpora in the selected corpus pool is less than or equal to a second number threshold, wherein the second number threshold is less than the first number threshold; if the number of corpora in the selected corpus pool is greater than the second number threshold, empty the selected corpus pool; and if the number of corpora in the selected corpus pool is less than or equal to the second number threshold, combine the selected corpora in the selected corpus pool with the to-be-selected corpora in the to-be-selected corpus pool to form combined corpora, and obtain the first combined corpus with the smallest error value according to the error values between the combined corpora and the changed target screening condition, so as to add the first to-be-selected corpus corresponding to the first combined corpus to the selected corpus pool.
Optionally, on the basis of the foregoing technical solution, the corpus acquiring device further includes:
the corpus quantity judging module is used for judging whether the corpus quantity to be selected in the corpus pool to be selected is smaller than a third quantity threshold value or not; wherein the third number threshold is greater than or equal to the first number threshold;
and the fourth quantity threshold acquisition module is used for acquiring a fourth quantity threshold according to the number of the linguistic data to be selected in the linguistic data pool to be selected and the first proportion coefficient if the number of the linguistic data to be selected in the linguistic data pool to be selected is smaller than the third quantity threshold.
Optionally, on the basis of the foregoing technical solution, the corpus adding execution module 520 is further specifically configured to add, in turn, the to-be-selected corpus in the to-be-selected corpus pool that has the smallest error value relative to the target screening condition to the selected corpus pool, until the number of corpora in the selected corpus pool reaches a fourth number threshold, and then wait for the to-be-selected corpus pool to be expanded.
The device can execute the corpus acquisition method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the corpus obtaining method provided in any embodiment of the present invention.
Example six
Fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention. FIG. 6 illustrates a block diagram of an exemplary electronic device 12 suitable for use in implementing embodiments of the present invention. The electronic device 12 shown in fig. 6 is only an example and should not bring any limitation to the function and the scope of use of the embodiment of the present invention.
As shown in fig. 6, the electronic device 12 is in the form of a general purpose computer device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a memory 28, and a bus 18 that couples various system components including the memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, and commonly referred to as a "hard drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and performs data processing by running programs stored in the memory 28, for example implementing the corpus acquisition method provided by the embodiments of the present invention, namely: obtaining the corpora to be selected in a corpus pool to be selected, and a target screening condition; wherein the corpora to be selected comprise paragraph corpora or sentence corpora; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions; and sequentially adding the corpus to be selected that has the smallest error value relative to the target screening condition, from the corpus pool to be selected, into a selected corpus pool until the number of corpora in the selected corpus pool reaches a first number threshold, and taking the selected corpora in the selected corpus pool as target corpora.
EXAMPLE seven
The seventh embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the corpus acquisition method according to any embodiment of the present invention; the method comprises the following steps:
obtaining the corpora to be selected in a corpus pool to be selected, and a target screening condition; wherein the corpora to be selected comprise paragraph corpora or sentence corpora; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions;
and sequentially adding the corpus to be selected that has the smallest error value relative to the target screening condition, from the corpus pool to be selected, into a selected corpus pool until the number of corpora in the selected corpus pool reaches a first number threshold, and taking the selected corpora in the selected corpus pool as target corpora.
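As an illustration only, the two steps above can be sketched in Python as a greedy loop. This passage does not define how the error value is computed, so the `error_value` function (sum of absolute deviations over the sentence dimensions), the dictionary representation of a corpus, and every name below are assumptions made for this sketch rather than the method as claimed.

```python
from typing import Dict, List

# Hypothetical representation: sentence-dimension name -> numeric value.
Corpus = Dict[str, float]
Condition = Dict[str, float]


def error_value(corpus: Corpus, target: Condition) -> float:
    """Assumed error metric: sum of absolute deviations from the target values."""
    return sum(abs(corpus.get(dim, 0.0) - want) for dim, want in target.items())


def acquire_corpus(candidate_pool: List[Corpus],
                   target: Condition,
                   first_threshold: int) -> List[Corpus]:
    """Move the lowest-error candidate into the selected pool, one at a time,
    until the selected pool holds `first_threshold` corpora (or candidates run out)."""
    remaining = list(candidate_pool)   # the corpus pool to be selected (copied)
    selected: List[Corpus] = []        # the selected corpus pool
    while remaining and len(selected) < first_threshold:
        best = min(range(len(remaining)),
                   key=lambda i: error_value(remaining[i], target))
        selected.append(remaining.pop(best))
    return selected


pool = [{"length": 12}, {"length": 9}, {"length": 20}, {"length": 10}]
target_corpora = acquire_corpus(pool, {"length": 10}, first_threshold=2)
```

With the sample pool and a target length of 10, the two corpora closest to the target (lengths 10 and 9) end up in `target_corpora`.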
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A corpus acquisition method, characterized by comprising the following steps:
obtaining the corpora to be selected in a corpus pool to be selected, and a target screening condition; wherein the corpora to be selected comprise paragraph corpora or sentence corpora; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions;
and sequentially adding the corpus to be selected that has the smallest error value relative to the target screening condition, from the corpus pool to be selected, into a selected corpus pool until the number of corpora in the selected corpus pool reaches a first number threshold, and taking the selected corpora in the selected corpus pool as target corpora.
2. The method according to claim 1, wherein sequentially adding the corpus to be selected that has the smallest error value relative to the target screening condition, from the corpus pool to be selected, into the selected corpus pool comprises:
classifying the corpora to be selected according to their error values relative to the target screening condition, so as to obtain a positive-error corpus set, a negative-error corpus set and a zero-error corpus set;
adding the corpora to be selected in the zero-error corpus set into the selected corpus pool, and sorting the corpora to be selected in the positive-error corpus set and in the negative-error corpus set respectively;
and extracting, in turn, the corpus to be selected with the smallest absolute error value from the positive-error corpus set and from the negative-error corpus set, and adding it into the selected corpus pool.
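A minimal sketch of the classification step in claim 2, written in Python for illustration. It assumes a signed scalar error (sum of value minus target over the sentence dimensions) and reads "in turn" as strict alternation between the positive-error and negative-error sets; both readings, and all names, are assumptions of this sketch.

```python
from typing import Dict, List

Corpus = Dict[str, float]      # hypothetical: dimension name -> measured value
Condition = Dict[str, float]   # hypothetical: dimension name -> target value


def signed_error(corpus: Corpus, target: Condition) -> float:
    """Assumed signed error: positive when the corpus overshoots the target."""
    return sum(corpus.get(dim, 0.0) - want for dim, want in target.items())


def select_by_error_sets(candidates: List[Corpus],
                         target: Condition,
                         first_threshold: int) -> List[Corpus]:
    zero: List[Corpus] = []
    positive: List[Corpus] = []
    negative: List[Corpus] = []
    for corpus in candidates:
        err = signed_error(corpus, target)
        if err == 0:
            zero.append(corpus)
        elif err > 0:
            positive.append(corpus)
        else:
            negative.append(corpus)

    # Sort both sets so the smallest absolute error comes first.
    positive.sort(key=lambda c: signed_error(c, target))
    negative.sort(key=lambda c: -signed_error(c, target))

    selected = zero[:first_threshold]   # zero-error corpora go in first
    take_positive = True
    while len(selected) < first_threshold and (positive or negative):
        # Alternate between the sets; fall back to whichever is non-empty.
        if (take_positive and positive) or not negative:
            source = positive
        else:
            source = negative
        selected.append(source.pop(0))
        take_positive = not take_positive
    return selected


picked = select_by_error_sets(
    [{"length": 10}, {"length": 12}, {"length": 9}, {"length": 11}, {"length": 7}],
    {"length": 10},
    first_threshold=4,
)
```

Alternating between overshooting and undershooting corpora keeps the positive and negative errors roughly balanced in the selected pool, which is one plausible motivation for splitting the sets by sign.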
3. The method according to claim 1, wherein sequentially adding the corpus to be selected that has the smallest error value relative to the target screening condition, from the corpus pool to be selected, into the selected corpus pool until the number of corpora in the selected corpus pool reaches the first number threshold, and taking the selected corpora in the selected corpus pool as the target corpora, comprises:
obtaining a subset of corpora to be selected from the corpus pool to be selected; the subset comprises a plurality of corpora to be selected;
adding a specified corpus to be selected, namely the one in the subset that has the smallest error value relative to the target screening condition, into the selected corpus pool, and returning the other corpora to be selected in the subset to the corpus pool to be selected;
and continuing to obtain subsets of corpora to be selected from the corpus pool to be selected until the number of corpora in the selected corpus pool reaches the first number threshold, and taking the selected corpora in the selected corpus pool as the target corpora.
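The subset-based variant of claim 3 might look like the following Python sketch. The claim does not say how the subset is drawn, so uniform random sampling, the `subset_size` and `seed` parameters, and the absolute-deviation error metric are all assumptions introduced here.

```python
import random
from typing import Dict, List, Optional

Corpus = Dict[str, float]      # hypothetical: dimension name -> measured value
Condition = Dict[str, float]   # hypothetical: dimension name -> target value


def error_value(corpus: Corpus, target: Condition) -> float:
    """Assumed error metric: sum of absolute deviations from the target values."""
    return sum(abs(corpus.get(dim, 0.0) - want) for dim, want in target.items())


def select_via_subsets(candidate_pool: List[Corpus],
                       target: Condition,
                       first_threshold: int,
                       subset_size: int,
                       seed: Optional[int] = None) -> List[Corpus]:
    rng = random.Random(seed)
    pool = list(candidate_pool)
    selected: List[Corpus] = []
    while pool and len(selected) < first_threshold:
        # Draw a subset of candidates (by index) from the pool.
        k = min(subset_size, len(pool))
        subset = rng.sample(range(len(pool)), k)
        # Only the best candidate of the subset leaves the pool; the others
        # are never removed, which corresponds to returning them to the pool.
        best = min(subset, key=lambda i: error_value(pool[i], target))
        selected.append(pool.pop(best))
    return selected
```

Scoring only a subset per round trades some selection quality for fewer error computations when the pool is large; when `subset_size` covers the whole pool, the result matches the plain greedy selection.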
4. The method according to claim 1 or 3, wherein sequentially adding the corpus to be selected that has the smallest error value relative to the target screening condition, from the corpus pool to be selected, into the selected corpus pool until the number of corpora in the selected corpus pool reaches the first number threshold, and taking the selected corpora in the selected corpus pool as the target corpora, comprises:
combining the selected corpora in the selected corpus pool with each corpus to be selected in the corpus pool to be selected, respectively, to form combined corpora, and obtaining, according to the error value of each combined corpus relative to the target screening condition, a first combined corpus with the smallest error value, so as to add the first corpus to be selected, which corresponds to the first combined corpus, into the selected corpus pool;
and continuing to combine the selected corpora in the selected corpus pool with each corpus to be selected in the corpus pool to be selected to form combined corpora until the number of corpora in the selected corpus pool reaches the first number threshold, and taking the selected corpora in the selected corpus pool as the target corpora.
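The combination-based selection of claim 4 can be illustrated with a short Python sketch. How the error of a combined corpus is measured is not stated in the claim; the sketch assumes the screening condition targets the mean of each sentence dimension over the combined corpus, and all names are hypothetical.

```python
from typing import Dict, List

Corpus = Dict[str, float]      # hypothetical: dimension name -> measured value
Condition = Dict[str, float]   # hypothetical: dimension name -> target mean


def combined_error(corpora: List[Corpus], target: Condition) -> float:
    """Assumed metric: summed absolute gap between each dimension's mean and its target."""
    total = 0.0
    for dim, want in target.items():
        mean = sum(c.get(dim, 0.0) for c in corpora) / len(corpora)
        total += abs(mean - want)
    return total


def select_by_combination(candidate_pool: List[Corpus],
                          target: Condition,
                          first_threshold: int) -> List[Corpus]:
    pool = list(candidate_pool)
    selected: List[Corpus] = []
    while pool and len(selected) < first_threshold:
        # Score "selected pool + one candidate" for every candidate and keep
        # the candidate whose combined corpus is closest to the target.
        best = min(range(len(pool)),
                   key=lambda i: combined_error(selected + [pool[i]], target))
        selected.append(pool.pop(best))
    return selected


chosen = select_by_combination(
    [{"length": 6}, {"length": 14}, {"length": 10}, {"length": 30}],
    {"length": 10},
    first_threshold=3,
)
```

In this example the corpus of length 14 is picked third because it offsets the earlier pick of length 6, bringing the mean back to exactly 10. Scoring candidates in combination rather than in isolation is what allows this kind of compensation.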
5. The method according to claim 4, wherein after combining the selected corpora in the selected corpus pool with each corpus to be selected in the corpus pool to be selected to form combined corpora, the method further comprises:
sequentially obtaining the error value of each combined corpus relative to the target screening condition;
and if the obtained error value of a second combined corpus relative to the target screening condition is smaller than a first error threshold, adding the second corpus to be selected, which corresponds to the second combined corpus, into the selected corpus pool.
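One way to read claim 5 is as an early-exit shortcut: candidates are scored one by one, and the first one whose combined corpus is already within the first error threshold is accepted without scoring the rest. The Python sketch below follows that reading; the mean-based `combined_error` metric and the fallback to the overall minimum (taken from claim 4) are assumptions of this sketch.

```python
from typing import Dict, List, Optional

Corpus = Dict[str, float]      # hypothetical: dimension name -> measured value
Condition = Dict[str, float]   # hypothetical: dimension name -> target mean


def combined_error(corpora: List[Corpus], target: Condition) -> float:
    """Assumed metric: summed absolute gap between each dimension's mean and its target."""
    total = 0.0
    for dim, want in target.items():
        mean = sum(c.get(dim, 0.0) for c in corpora) / len(corpora)
        total += abs(mean - want)
    return total


def pick_with_early_accept(selected: List[Corpus],
                           pool: List[Corpus],
                           target: Condition,
                           first_error_threshold: float) -> Optional[int]:
    """Return the index in `pool` of the candidate to add next, or None if the pool is empty."""
    best_index: Optional[int] = None
    best_error = float("inf")
    for i, candidate in enumerate(pool):
        err = combined_error(selected + [candidate], target)
        if err < first_error_threshold:
            return i                      # good enough: stop scanning early
        if err < best_error:
            best_index, best_error = i, err
    return best_index                     # no early hit: fall back to the minimum
```

The early exit can return a candidate that is not the global minimum, so the threshold trades a little accuracy for fewer error computations per round.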
6. The method according to claim 4, wherein, if the corpus to be selected is a paragraph corpus, combining the selected corpora in the selected corpus pool with each corpus to be selected in the corpus pool to be selected, respectively, to form combined corpora, and obtaining, according to the error value of each combined corpus relative to the target screening condition, a first combined corpus with the smallest error value, so as to add the first corpus to be selected corresponding to the first combined corpus into the selected corpus pool, comprises:
obtaining a changed target screening condition in response to obtaining change information of the target screening condition;
if the sentence dimensions in the changed target screening condition are unchanged and the screening value of at least one sentence dimension has changed, judging whether the number of corpora in the selected corpus pool is less than or equal to a second number threshold; wherein the second number threshold is less than the first number threshold;
if the number of corpora in the selected corpus pool is greater than the second number threshold, emptying the selected corpus pool;
if the number of corpora in the selected corpus pool is less than or equal to the second number threshold, combining the selected corpora in the selected corpus pool with each corpus to be selected in the corpus pool to be selected, respectively, to form combined corpora, and obtaining, according to the error value of each combined corpus relative to the changed target screening condition, a first combined corpus with the smallest error value, so as to add the first corpus to be selected corresponding to the first combined corpus into the selected corpus pool.
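The branching in claim 6 can be summarized in a few lines of Python. This is a sketch under stated assumptions: screening conditions are modeled as dictionaries from sentence dimension to screening value, the function name is hypothetical, and the claim does not say what happens to corpora removed when the selected pool is emptied, so the sketch simply returns an empty list. Changes that add or remove dimensions are outside this claim and are left untouched here.

```python
from typing import Dict, List

Corpus = Dict[str, float]      # hypothetical: dimension name -> measured value
Condition = Dict[str, float]   # hypothetical: dimension name -> screening value


def on_condition_changed(selected: List[Corpus],
                         old_target: Condition,
                         new_target: Condition,
                         second_threshold: int) -> List[Corpus]:
    """Return the selected pool to continue with after the condition changes."""
    same_dimensions = set(old_target) == set(new_target)
    values_changed = same_dimensions and any(
        old_target[dim] != new_target[dim] for dim in new_target
    )
    if not values_changed:
        return selected            # case not covered by this claim
    if len(selected) > second_threshold:
        return []                  # too much was chosen under the old values: start over
    return selected                # few enough: keep them and continue with new_target
```

Keeping a small selected pool avoids redoing work, while a large pool chosen under outdated values is assumed to bias later combinations too strongly to be worth keeping.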
7. The method according to any one of claims 1 to 6, wherein after obtaining the corpora to be selected in the corpus pool to be selected and the target screening condition, the method further comprises:
judging whether the number of corpora to be selected in the corpus pool to be selected is smaller than a third number threshold; wherein the third number threshold is greater than or equal to the first number threshold;
if so, obtaining a fourth number threshold according to the number of corpora to be selected in the corpus pool to be selected and a first proportion coefficient;
and wherein sequentially adding the corpus to be selected that has the smallest error value relative to the target screening condition, from the corpus pool to be selected, into the selected corpus pool until the number of corpora in the selected corpus pool reaches the first number threshold, and taking the selected corpora in the selected corpus pool as the target corpora, comprises:
sequentially adding the corpus to be selected that has the smallest error value relative to the target screening condition into the selected corpus pool until the number of corpora in the selected corpus pool reaches the fourth number threshold, and then waiting for capacity expansion processing of the selected corpus pool.
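The threshold adjustment in claim 7 might be computed as in the Python sketch below. The claim only says the fourth number threshold is obtained "according to" the pool size and the first proportion coefficient; multiplying the two and rounding down are assumptions of this sketch, as are the function and parameter names.

```python
import math


def stopping_threshold(pool_size: int,
                       first_threshold: int,
                       third_threshold: int,
                       first_ratio: float) -> int:
    """Return how many corpora to select before stopping.

    When the pool to be selected holds fewer than `third_threshold` corpora,
    a reduced (fourth) threshold derived from the pool size is used in place
    of the first threshold.
    """
    if third_threshold < first_threshold:
        raise ValueError("third threshold must be >= first threshold")
    if pool_size < third_threshold:
        # Assumed formula: fourth threshold = floor(pool size * ratio).
        return math.floor(pool_size * first_ratio)
    return first_threshold
```

For example, with a first threshold of 100, a third threshold of 120 and a ratio of 0.5, a pool of 50 corpora stops at 25 selections, while a pool of 200 corpora keeps the original limit of 100.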
8. A corpus acquisition apparatus, comprising:
a screening condition acquisition module, configured to obtain the corpora to be selected in a corpus pool to be selected, and a target screening condition; wherein the corpora to be selected comprise paragraph corpora or sentence corpora; if the corpus to be selected is a paragraph corpus, the target screening condition comprises a paragraph screening condition and a sentence screening condition; if the corpus to be selected is a sentence corpus, the target screening condition comprises a sentence screening condition; the sentence screening condition comprises a plurality of sentence dimensions;
and a corpus adding execution module, configured to sequentially add the corpus to be selected that has the smallest error value relative to the target screening condition, from the corpus pool to be selected, into a selected corpus pool until the number of corpora in the selected corpus pool reaches a first number threshold, and to take the selected corpora in the selected corpus pool as target corpora.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the corpus acquisition method of any one of claims 1-7.
10. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the corpus acquisition method of any one of claims 1-7.
CN202210598358.3A 2022-05-30 2022-05-30 Corpus acquisition method and device, electronic equipment and storage medium Active CN114817517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210598358.3A CN114817517B (en) 2022-05-30 2022-05-30 Corpus acquisition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210598358.3A CN114817517B (en) 2022-05-30 2022-05-30 Corpus acquisition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114817517A (en) 2022-07-29
CN114817517B (en) 2022-12-20

Family

ID=82519386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210598358.3A Active CN114817517B (en) 2022-05-30 2022-05-30 Corpus acquisition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114817517B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948142A (en) * 2019-01-25 2019-06-28 北京海天瑞声科技股份有限公司 Corpus chooses processing method, device, equipment and computer readable storage medium
CN110728133A (en) * 2019-12-19 2020-01-24 北京海天瑞声科技股份有限公司 Individual corpus acquisition method and individual corpus acquisition device
US20210004603A1 (en) * 2019-07-02 2021-01-07 Baidu Usa Llc Method and apparatus for determining (raw) video materials for news
CN112668339A (en) * 2020-12-23 2021-04-16 北京有竹居网络技术有限公司 Corpus sample determination method and device, electronic equipment and storage medium
CN113627155A (en) * 2021-08-06 2021-11-09 上海浦东发展银行股份有限公司 Data screening method, device, equipment and storage medium
CN113807074A (en) * 2021-03-12 2021-12-17 京东科技控股股份有限公司 Similar statement generation method and device based on pre-training language model
CN113988047A (en) * 2021-09-26 2022-01-28 北京捷通华声科技股份有限公司 Corpus screening method and apparatus
CN114330285A (en) * 2021-11-30 2022-04-12 腾讯科技(深圳)有限公司 Corpus processing method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN114817517B (en) 2022-12-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant