CN117010412A - Corpus data enhancement method and device - Google Patents


Info

Publication number
CN117010412A
Authority
CN
China
Prior art keywords
corpus
templates
data
target
slots
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310800573.1A
Other languages
Chinese (zh)
Inventor
田羽慧
王月岭
孟卫明
刘敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd
Priority to CN202310800573.1A
Publication of CN117010412A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the application provide a corpus data enhancement method and device. An electronic device acquires a corpus template corresponding to an intention to be processed, wherein the corpus template comprises the necessary slots corresponding to the intention; inputs the corpus template into a data expansion model and acquires a plurality of expanded target corpus templates output by the data expansion model; and assigns values to the slots contained in the plurality of target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data. By expanding the existing corpus template through the data expansion model and then assigning values to the resulting target corpus templates, the electronic device improves the accuracy of corpus data enhancement, which in turn improves the interpretability, reliability and generalization of the semantic understanding model subsequently trained on the enhanced corpus data, consistent with trustworthiness characteristics.

Description

Corpus data enhancement method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for enhancing corpus data.
Background
In recent years, the development of artificial intelligence (AI) technology has driven progress in natural language processing (NLP), and the accuracy of semantic understanding keeps improving. Training a semantic understanding model requires a large amount of corpus data, and for some processing tasks the available corpus data is often insufficient. Corpus data enhancement is therefore needed to expand the existing corpus data into more corpus data.
Existing corpus data enhancement approaches mainly include Unsupervised Data Augmentation (UDA) and Easy Data Augmentation for Text Classification Tasks (EDA). The UDA method mainly relies on back translation: the corpus data is translated into another language and then translated back. It is not suitable for large-scale or complex corpus data. The EDA method mainly relies on synonym replacement, in which phrases in the corpus data are replaced with synonyms, but its accuracy is low and the cost of correcting its output is high.
Disclosure of Invention
The application provides a corpus data enhancement method and device, which are used to solve the problems that prior-art corpus data enhancement techniques are not suitable for large volumes of corpus data, have low accuracy and are costly.
In a first aspect, an embodiment of the present application provides a corpus data enhancement method, where the method includes:
acquiring a corpus template corresponding to an intention to be processed, wherein the corpus template comprises necessary slots corresponding to the intention;
inputting the corpus templates into a data expansion model, and acquiring a plurality of expanded target corpus templates output by the data expansion model;
and assigning values to the slots contained in the target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
In a second aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a processor, and the processor is configured to implement the steps of any one of the corpus data enhancement methods described above when executing a computer program stored in a memory.
In the embodiment of the application, the electronic device acquires a corpus template corresponding to the intention to be processed, wherein the corpus template comprises the necessary slots corresponding to the intention; inputs the corpus template into a data expansion model and acquires a plurality of expanded target corpus templates output by the data expansion model; and assigns values to the slots contained in the plurality of target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data. Because the electronic device expands the existing corpus template through the data expansion model to obtain a plurality of target corpus templates and then assigns values to those target corpus templates to obtain a plurality of corpus data, the accuracy of corpus data enhancement is improved, and the interpretability, reliability and generalization of the semantic understanding model subsequently trained on the enhanced corpus data are further improved.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a test result of corpus data enhanced by the UDA technology provided by the embodiment of the application;
FIG. 2 is a graph showing the test results of the EDA technology enhanced corpus data provided by the embodiment of the application;
FIG. 3 is a schematic diagram of a corpus data enhancement process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a data expansion model expanding a corpus template including [time] and [place] slots according to an embodiment of the present application;
FIG. 5 is a schematic view of 8 combinations obtained in the above embodiment according to an embodiment of the present application;
FIG. 6 is a schematic diagram of corpus data enhancement provided by an embodiment of the present application;
FIG. 7 is a schematic flow chart of generating corpus data according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a corpus data enhancement device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Training a semantic understanding model requires a large amount of corpus data, and for some processing tasks the available corpus data is often insufficient. Corpus data enhancement is therefore needed to expand the existing corpus data into more corpus data. Existing corpus data enhancement methods in NLP mainly include the unsupervised EDA method and the semi-supervised UDA method.
The UDA technology mainly relies on back translation: corpus data is translated into another language and then translated back by calling an existing translation interface, which makes it unsuitable for large-scale or complex corpus data. The EDA technology consists of four operations: synonym replacement, in which n words that do not belong to the stop word set are randomly selected from the corpus data and each is replaced with a randomly selected synonym; random insertion, in which a word that does not belong to the stop word set is randomly selected from the corpus data and a random synonym of that word is inserted at a random position in the corpus data; random exchange, in which two words in the corpus data are randomly selected and their positions are exchanged; and random deletion, in which each word in the corpus data is removed with probability p. In practical applications, the EDA technology does enhance the corpus data, but the enhanced corpus data has low accuracy and a high correction cost, so the enhancement effect is poor.
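To make the randomness of the EDA operations concrete, the following is a minimal Python sketch of two of them, random exchange and random deletion. The function names, the whitespace tokenization and the rule of keeping at least one word are assumptions made for illustration; synonym replacement and random insertion are omitted because they would need a synonym dictionary that is not described here.

```python
import random


def random_swap(words, rng):
    """Random exchange: pick two distinct positions and swap their words."""
    out = list(words)
    if len(out) < 2:
        return out
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def random_delete(words, p, rng):
    """Random deletion: drop each word independently with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Assumption for this sketch: never return an empty sentence.
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
sentence = "how many urban events exist today".split()
print(random_swap(sentence, rng))
print(random_delete(sentence, 0.2, rng))
```

Neither operation looks at what the words mean, which is why the output can lose a slot such as the place or the time that the intention depends on.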
FIG. 1 shows test results for corpus data enhanced by the UDA technology provided by the embodiment of the present application. As shown in FIG. 1, the technology works well on corpus data with simple sentence patterns and clear content: for example, "which industry administrative penalty is the most in Shandong province" becomes "is the most in Shandong industry" after enhancement. For complex corpus data, however, the accuracy of the enhanced corpus data is lower: as shown in FIG. 1, "which line is most heavily penalized by the administration of Shandong province" becomes "is most heavily penalized by the administration of Shandong which line" after enhancement.
FIG. 2 shows test results for corpus data enhanced by the EDA technology provided by the embodiment of the present application. As shown in FIG. 2, the EDA technology enhances corpus data through random exchange, insertion, deletion, synonym replacement and the like. The actual test results are mediocre: because the inserted, deleted and modified content is random, the quality of the resulting sentences suffers. For example, as shown in FIG. 2, "how many urban events exist in today's mountain area" is enhanced into "how many today's events exist in today's mountain area", "how many urban events exist in today's ink market", "how many urban events exist in today's mountain area", "how many urban events exist in today's sun, how much time in today's mountain area" or "how many urban events exist before today's mountain area".
In order to improve the corpus data enhancement accuracy and reduce the cost, the embodiment of the application provides a corpus data enhancement method and device.
In the embodiment of the application, the electronic equipment acquires a corpus template corresponding to the intention to be processed, wherein the corpus template comprises necessary slots corresponding to the intention; inputting the corpus templates into a data expansion model, and acquiring a plurality of expanded target corpus templates output by the data expansion model; and assigning values to the slots contained in the target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
Fig. 3 is a schematic diagram of a corpus data enhancement process according to an embodiment of the present application, where the process includes:
S301: Obtaining a corpus template corresponding to the intention to be processed, wherein the corpus template comprises the necessary slots corresponding to the intention.
The corpus data enhancement method provided by the embodiment of the application is applied to electronic equipment, and the electronic equipment can be a PC or a server.
A semantic understanding model is trained using samples corresponding to various intentions. Accordingly, in the embodiment of the present application, a corpus template corresponding to each intention is stored in the electronic device, and each corpus template includes the necessary slots of the intention to which it corresponds. For example, the example sentence corresponding to the "index query" intention is "how many … GDP is in Qingdao city", and the necessary slot corresponding to the "index query" intention is [index].
A necessary slot is a slot that must be present in the corpus data corresponding to an intention. For example, the necessary slot corresponding to the "index query" intention is [index], the necessary slot corresponding to the "time query" intention is [time], and so on.
S302: inputting the corpus templates into a data expansion model, and obtaining a plurality of expanded target corpus templates output by the data expansion model.
To save manpower while ensuring the richness and diversity of slot contents, in the embodiment of the application the electronic device may use a data expansion model to expand and generalize complex slots in the corpus template at fine granularity, generalizing large slots such as time and place into smaller slots.
Specifically, after determining the corpus template corresponding to the intention to be processed, the electronic device may input the corpus template into the data expansion model. The data expansion model expands the slots contained in the corpus template and outputs a plurality of target corpus templates in which those slots have been expanded.
The data expansion model is not a neural network model in the traditional sense, but a preconfigured algorithm for generalizing complex slots in a corpus template. Specifically, the data expansion model stores at least one preconfigured complex slot that can be generalized, together with a plurality of generalization results corresponding to each such complex slot. After receiving an input corpus template, the data expansion model identifies the target complex slot contained in the corpus template, acquires the stored generalization results corresponding to that target complex slot, and replaces the target complex slot in the corpus template with each generalization result in turn, thereby obtaining and outputting a plurality of target corpus templates.
For example, suppose the corpus template is "how much is the [place] [index]" and the complex slots stored in the data expansion model are [time] and [place]. The data expansion model determines that the corpus template contains the target complex slot [place], acquires the stored generalization results corresponding to [place], such as [city], [province] and [area], replaces [place] in the corpus template with each generalization result in turn, and determines and outputs a plurality of target corpus templates, such as "how much is the [city] [index]", "how much is the [province] [index]" and "how much is the [area] [index]".
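The lookup-and-replace behaviour described for the data expansion model can be sketched in Python as follows. This is only an illustration under stated assumptions: the table `GENERALIZATIONS`, its contents and the function name `expand_template` are hypothetical and are not taken from the application.

```python
# Hypothetical table: each generalizable complex slot maps to its stored
# generalization results (the contents are illustrative assumptions).
GENERALIZATIONS = {
    "[place]": ["[city]", "[province]", "[area]"],
    "[time]": ["[year]", "[month]"],
}


def expand_template(template, table=GENERALIZATIONS):
    """Replace every complex slot found in the template with each of its
    generalization results and return the resulting target templates."""
    targets = [template]
    for slot, results in table.items():
        if slot in template:
            targets = [t.replace(slot, r) for t in targets for r in results]
    # A template without any complex slot is returned unchanged.
    return targets


for target in expand_template("how much is the [place] [index]"):
    print(target)
```

In this sketch the necessary slot [index] is never touched because it does not appear in the table, and a template containing both [time] and [place] yields the cross product of their generalization results.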
In the embodiment of the application, since slots such as [time] cover many kinds of content, for example a point in time, a period of time and combinations of these, the generalizable complex slots stored in the data expansion model may be [time], [place] and the like, so as to improve the efficiency and accuracy of corpus data enhancement. The data expansion model divides such slots at fine granularity, for example dividing the [time] slot into finer time slots and the [place] slot into slots such as [province] and [city], and stores a plurality of generalization results for each complex slot. A combined slot such as [province] [city] can also be generalized to other slots of the same level, for example [province], [city], [district], [county], [street] and the like.
FIG. 4 is a schematic diagram of the data expansion model expanding a corpus template that contains [time] and [place] slots according to an embodiment of the present application, where the process includes:
S401: Receiving an input corpus template containing [time] and [place] slots.
S402: Traversing the [time] slot by slot level to obtain single-level, two-level and three-level target corpus templates.
For example, the [time] slot is expanded into a single-level target corpus template of [year], a two-level target corpus template of [year] [month], a three-level target corpus template, and the like.
S403: Traversing the [place] slot by slot level to obtain single-level, two-level and three-level target corpus templates.
For example, the [place] slot is expanded into a single-level target corpus template, a two-level target corpus template, a three-level target corpus template, and the like.
S404: Generalizing again the target corpus templates obtained in S402.
For example, the single-level target corpus template of [year] is expanded into single-level target corpus templates of [year], [month] and [week]; the two-level target corpus template of [year] [month] is expanded into two-level target corpus templates of [year] [quarter] and [month] [day]; the three-level target corpus template is expanded in the same way, and the like.
S405: Generalizing again the target corpus templates obtained in S403.
For example, the single-level target corpus template is expanded into a single-level target corpus template of [province]; the two-level target corpus template is expanded into two-level target corpus templates such as [city] [county] and [district] [street]; the three-level target corpus template is expanded into three-level target corpus templates such as [district] [street] [community], and the like.
S406: and outputting the target corpus templates obtained in S402, S403, S404 and S405.
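The level-by-level traversal and the second generalization pass for the [time] slot might be sketched as below. The hierarchy `TIME_LEVELS` (in particular [day] as the third level), the table `REGENERALIZED` and all function names are assumptions chosen for illustration; only the [year], [month], [week], [quarter] and [day] examples come from the steps above.

```python
# Assumed slot hierarchy for [time]; [day] as the third level is a guess.
TIME_LEVELS = ["[year]", "[month]", "[day]"]

# Assumed second-pass table: a template of a given level maps to other
# templates of the same level.
REGENERALIZED = {
    "[year]": ["[year]", "[month]", "[week]"],
    "[year][month]": ["[year][quarter]", "[month][day]"],
}


def level_templates(levels):
    """First pass (as in S402): single-level, two-level, three-level forms."""
    return ["".join(levels[:k]) for k in range(1, len(levels) + 1)]


def regeneralize(templates, table):
    """Second pass (as in S404): swap in same-level alternatives where known."""
    out = []
    for template in templates:
        out.extend(table.get(template, []))
    return out


def expand_slot(levels, table):
    """Union of both passes (as in S406), de-duplicated in first-seen order."""
    first = level_templates(levels)
    second = regeneralize(first, table)
    return list(dict.fromkeys(first + second))


print(expand_slot(TIME_LEVELS, REGENERALIZED))
```

The same three functions would apply to the [place] slot with a different hierarchy and table.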
S303: Assigning values to the slots contained in the plurality of target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
In this embodiment, the electronic device inputs the corpus template into a data expansion model, and the data expansion model expands slots included in the corpus template and outputs a plurality of target corpus templates obtained after the expansion.
Based on this, in the embodiment of the present application, after the electronic device obtains multiple target corpus templates output by the data expansion model, the electronic device may assign values to the slots included in the multiple target corpus templates according to the word bank stored in advance, so as to obtain multiple corpus data.
For example, if the target corpus template is "how much is the [city] [index]", the electronic device may assign values to the slots of the target corpus template according to the word stock to obtain corpus data such as "how much GDP in Qingdao city".
Specifically, in the embodiment of the application, after determining the plurality of target corpus templates, the electronic device marks the slots in each target corpus template with the pre-stored identifier corresponding to each slot. The electronic device also stores a slot content table, from which it can determine the word stock corresponding to each identifier. For each target corpus template, the electronic device determines each identifier in the template, determines the word stock corresponding to each identifier, and randomly selects a value from that word stock to obtain corpus data.
In addition, in the embodiment of the application, the electronic equipment can carry out multiple assignment on one target corpus template to obtain multiple corpus data. That is, the electronic device may also set the amount of corpus data generated by each target corpus template.
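The assignment step can be illustrated with the following Python sketch, which fills each slot identifier with a random entry from its word stock and repeats this a configurable number of times per template. The table contents (apart from "Qingdao city" and "GDP"), the identifier format and the function names are hypothetical.

```python
import random
import re

# Hypothetical slot content table: slot identifier -> word stock.
SLOT_CONTENT_TABLE = {
    "city": ["Qingdao city", "Jinan city"],
    "index": ["GDP", "population"],
}

# Slots are assumed to be marked as [identifier] inside a template.
SLOT_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def fill_template(template, table, rng):
    """Replace every [identifier] with a random entry from its word stock."""
    return SLOT_PATTERN.sub(lambda m: rng.choice(table[m.group(1)]), template)


def generate_corpus(templates, table, per_template=3, seed=0):
    """Assign each template `per_template` times to get several corpus data."""
    rng = random.Random(seed)
    return [
        fill_template(template, table, rng)
        for template in templates
        for _ in range(per_template)
    ]


for line in generate_corpus(["how much is the [city] [index]"], SLOT_CONTENT_TABLE):
    print(line)
```

In this sketch an identifier that is missing from the table raises a `KeyError`, and `per_template` plays the role of the configurable amount of corpus data generated per target corpus template.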
In the embodiment of the application, the electronic device expands the existing corpus template through the data expansion model to obtain a plurality of target corpus templates, and assigns values to those target corpus templates to obtain a plurality of corpus data, so that the accuracy of corpus data enhancement is improved, and the interpretability, reliability and generalization of the semantic understanding model subsequently trained on the enhanced corpus data are further improved. In addition, the data expansion model saves manpower and improves efficiency while also improving the accuracy of corpus data enhancement, and the corpus data generalized in this way can largely be used directly for training a semantic understanding model, which greatly reduces project cost.
In order to improve the corpus data enhancement effect, based on the above embodiment, in the embodiment of the present application, the determining process of the corpus template includes:
determining a necessary slot position corresponding to the intention according to a pre-stored corresponding relation between the intention and the necessary slot position;
and receiving input information carrying other slots, and performing unordered combination on at least one of the necessary slots and the other slots to obtain the corpus template.
In the embodiment of the application, when the electronic device expands the corpus template based on the data expansion model, the data expansion model generalizes only the non-necessary slots in the corpus template. For this reason, when generating the corpus template, the electronic device adds other slots to the corpus template so that the data expansion model has slots to expand.
Specifically, in the embodiment of the application, for the intention whose corpus data is to be enhanced, the electronic device determines the necessary slot corresponding to the intention according to the pre-stored correspondence between intentions and necessary slots, receives input information carrying other slots, and performs unordered combination on at least one of the necessary slots and the other slots to obtain the corpus template.
Specifically, the electronic device may form unordered combinations of the necessary slots and the other received slots to obtain a plurality of combinations, delete the combinations that do not include the necessary slots, and determine corpus templates from the remaining combinations.
For example, if the necessary slot corresponding to the "index query" intention is [index] and the other slots received by the electronic device are [time], [place] and [prefix word], the electronic device forms unordered combinations of the four slots to obtain 16 combinations (each slot is either present or absent, so 2^4 = 16, counting the empty combination). The electronic device then deletes the combinations that do not include the necessary slot [index], leaving 8 combinations, and determines the corpus templates based on these 8 combinations.
FIG. 5 is a schematic diagram of the 8 combinations obtained in the foregoing embodiment. As shown in FIG. 5, the 8 combinations are: [index]; [time] [index]; [place] [index]; [prefix word] [index]; [time] [place] [index]; [time] [prefix word] [index]; [place] [prefix word] [index]; and [time] [place] [prefix word] [index].
In the embodiment of the present application, the electronic device may determine a corpus template for each combination shown in FIG. 5. Specifically, the electronic device determines that the corpus templates corresponding to the above combinations are, respectively: "what is the [index]", "what is the [time] [index]", "what is the [place] [index]", "what is the [prefix word] [index]", "what is the [time] [place] [index]", "what is the [time] [prefix word] [index]", "what is the [place] [prefix word] [index]" and "what is the [time] [place] [prefix word] [index]".
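The count of 16 combinations reduced to 8 can be checked with a short Python sketch that enumerates every subset of the four slots and keeps those containing the necessary slot. The function name and the representation of a combination as a tuple are assumptions made for illustration.

```python
from itertools import combinations


def slot_combinations(necessary, others):
    """Enumerate every unordered combination of the slots (including the
    empty one) and keep only those that contain the necessary slot."""
    slots = list(others) + [necessary]
    every = [
        combo
        for size in range(len(slots) + 1)
        for combo in combinations(slots, size)
    ]
    kept = [combo for combo in every if necessary in combo]
    return every, kept


every, kept = slot_combinations("index", ["time", "place", "prefix word"])
print(len(every), len(kept))
for combo in kept:
    print(" ".join(f"[{slot}]" for slot in combo))
```

Each kept combination would then be turned into one or more corpus templates; how the slots are ordered and worded inside a template is not decided by this enumeration.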
In addition, in the embodiment of the application, a plurality of example sentences corresponding to the intention may be stored in the electronic device, and the electronic device performs slot extraction on each of them to obtain corpus templates. For example, if the sentence stored in the electronic device for the "index query" intention is "what is the GDP", the electronic device performs slot extraction and determines that the corpus template is "what is the [index]".
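One simple way to picture slot extraction is a dictionary lookup that replaces known word stock entries in a sentence with their slot identifiers. The Python sketch below takes that approach purely as an assumption; the application does not state how extraction is performed, and the word stock contents and function name are hypothetical.

```python
def extract_template(sentence, word_stock):
    """Replace each known word stock entry found in the sentence with its
    slot identifier, trying longer entries first."""
    entries = [
        (word, slot)
        for slot, words in word_stock.items()
        for word in words
    ]
    # Longest entries first so a short entry cannot split a longer one.
    entries.sort(key=lambda entry: len(entry[0]), reverse=True)
    template = sentence
    for word, slot in entries:
        template = template.replace(word, f"[{slot}]")
    return template


WORD_STOCK = {"index": ["GDP"], "city": ["Qingdao city"]}  # illustrative
print(extract_template("what is the GDP", WORD_STOCK))
```

Plain substring replacement is enough for this illustration, but it would misfire if a word stock entry also occurred inside an unrelated word or inside a slot identifier.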
FIG. 6 is a schematic diagram of corpus data enhancement provided by the embodiment of the present application. As shown in FIG. 6, the electronic device performs question-answering design to determine the intention to be processed and its necessary slots; the electronic device receives the input supplementary slots and generates slot sequences based on the supplementary slots and the necessary slots; the electronic device generates a small number of corpus templates according to the slot sequences; the electronic device marks the slots in the corpus templates and inputs the marked corpus templates into the data expansion model; the data expansion model generalizes the slots in the corpus templates and outputs a plurality of generalized target corpus templates; the electronic device assigns values to the slots contained in the plurality of target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data; and the electronic device outputs the plurality of corpus data.
In order to improve the corpus data enhancement effect, in the foregoing embodiments of the present application, before assigning the slots included in the multiple target corpus templates according to the pre-stored word stock to obtain multiple corpus data, the method further includes:
counting the total number of the corpus templates corresponding to the intentions and the target corpus templates;
assigning values to the slots contained in the target corpus templates according to a pre-stored word stock, and obtaining a plurality of corpus data comprises the following steps:
and assigning the slots contained in the target corpus templates according to a pre-stored word stock and the total number to obtain a plurality of corpus data.
When a certain intention has too many corpus templates and target corpus templates in total, the amount of corpus data formed for that intention is large; if at the same time the amount of corpus data for some other intentions is too small, the data is unbalanced, which affects the recognition accuracy of the semantic understanding model.
In addition, in the embodiment of the application, when generating corpus data for training the semantic understanding model, the electronic device may assign values to the target corpus templates obtained through generalization by the data expansion model to obtain corpus data. The electronic device may also assign values to the corpus templates as they were before generalization to obtain corpus data.
Based on the above, in the embodiment of the present application, before the electronic device generates the corpus data, the electronic device counts the total number of the corpus templates corresponding to the intention and the target corpus templates, and the electronic device assigns the slots included in the multiple target corpus templates according to the pre-stored word stock and the total number to obtain multiple corpus data.
In order to improve the corpus data enhancement effect, in the embodiment of the present application, assigning the slots included in the multiple target corpus templates according to the pre-stored word stock and the total number to obtain multiple corpus data includes:
if the total number is not smaller than a preset minimum threshold and not larger than a preset maximum threshold, assigning, according to the word stock, the slots contained in all of the total number of target corpus templates and corpus templates to obtain a plurality of corpus data.
In the embodiment of the application, the maximum threshold and the minimum threshold are stored in the electronic device. The maximum threshold is set because, when the semantic understanding model is trained on many intentions, an excessive number of corpus templates and target corpus templates for a certain intention produces a large amount of corpus data for that intention, while the amount of corpus data for some other intentions may be too small; the resulting imbalance affects the accuracy of the semantic understanding model. The minimum threshold is set to prevent the amount of corpus data for a certain intention from being too small, which would degrade the performance of the semantic understanding model.
In the embodiment of the application, if the total number of the corpus templates corresponding to the intention and the target corpus templates is not smaller than the preset minimum threshold and not larger than the preset maximum threshold, the electronic device assigns, according to the word stock, the slots contained in all of these target corpus templates and corpus templates to obtain a plurality of corpus data.
Specifically, let the maximum threshold corresponding to the intention be m1, the minimum threshold be m2, and the total number of the corresponding corpus templates and target corpus templates be z. If m2 ≤ z ≤ m1, the electronic device assigns, according to the word stock, the slots contained in all z target corpus templates and corpus templates to obtain a plurality of corpus data.
In order to improve the corpus data enhancement effect, in the embodiment of the present application, assigning the slots included in the multiple target corpus templates according to the pre-stored word stock and the total number to obtain multiple corpus data includes:
if the total number is larger than a preset maximum threshold, randomly extracting a number of corpus templates and target corpus templates equal to the maximum threshold from the corpus templates and target corpus templates corresponding to the intention;
and assigning the slots contained in the extracted corpus templates and target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
In the embodiment of the application, if the total number of the corpus templates corresponding to the intention and the target corpus templates is larger than the preset maximum threshold, the electronic device randomly extracts a number of templates equal to the maximum threshold from the corpus templates and target corpus templates corresponding to the intention. The electronic device then assigns the slots contained in the extracted corpus templates and target corpus templates according to the word stock to obtain a plurality of corpus data.
Specifically, let the maximum threshold corresponding to the intention be m1, the minimum threshold be m2, and the total number of the corresponding corpus templates and target corpus templates be z. If z > m1, the electronic device randomly extracts m1 templates from the corpus templates and target corpus templates corresponding to the intention, and assigns the slots contained in these m1 templates according to the word stock to obtain a plurality of corpus data.
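The random extraction for the case z > m1 might look like the sketch below. The function name `cap_templates` and the `seed` parameter are hypothetical, and drawing without replacement is an assumption; the text only says that m1 templates are randomly extracted.

```python
import random

def cap_templates(corpus_templates, target_templates, m1, seed=None):
    """Return at most m1 templates drawn at random from the combined pool."""
    pool = list(corpus_templates) + list(target_templates)
    if len(pool) <= m1:
        # z does not exceed the maximum threshold: nothing to extract.
        return pool
    # Assumption: sample without replacement, so no template is picked twice.
    return random.Random(seed).sample(pool, m1)

# Hypothetical pool of 10 templates capped at m1 = 4.
originals = [f"corpus_{i}" for i in range(3)]
expanded = [f"target_{i}" for i in range(7)]
kept = cap_templates(originals, expanded, m1=4, seed=1)
```

Sampling from the combined pool treats original and expanded templates alike, which matches the wording that both kinds are extracted together.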
In order to improve the corpus data enhancement effect, in the embodiment of the present application, assigning the slots included in the multiple target corpus templates according to the pre-stored word stock and the total number to obtain multiple corpus data includes:
acquiring a pre-stored number of corpus data generated by one target corpus template;
and assigning the slots contained in the plurality of target corpus templates according to the word stock, the number of corpus data generated by one target corpus template, and the total number to obtain a plurality of corpus data.
In the embodiment of the application, the electronic device may further be configured with the number of corpus data generated by one target corpus template. The electronic device may assign the slots contained in the plurality of target corpus templates according to the word stock, the number of corpus data generated by one target corpus template, and the total number, to obtain a plurality of corpus data.
In order to improve the corpus data enhancement effect, in the embodiment of the present application, assigning the slots included in the multiple target corpus templates according to the pre-stored word stock and the total number to obtain multiple corpus data includes:
if the total number is smaller than a preset minimum threshold, repeatedly sampling the corpus templates corresponding to the intention and the target corpus templates to obtain a preset number of corpus templates and target corpus templates;
and assigning the slots contained in the preset number of corpus templates and target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
In the embodiment of the application, if the total number of the corpus templates corresponding to the intention and the target corpus templates is smaller than the preset minimum threshold, the electronic device repeatedly samples the corpus templates corresponding to the intention and the target corpus templates to obtain a preset number of corpus templates and target corpus templates. The electronic device then assigns the slots contained in the preset number of corpus templates and target corpus templates according to the word stock to obtain a plurality of corpus data.
Specifically, let the maximum threshold corresponding to the intention be m1, the minimum threshold be m2, and the total number of the corresponding corpus templates and target corpus templates be z. If z < m2, the electronic device repeatedly samples the corpus templates corresponding to the intention and the target corpus templates to obtain a preset number of corpus templates and target corpus templates, and assigns the slots contained in them according to the word stock to obtain a plurality of corpus data. The preset number may be any number in the range from m2 to m1.
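One way to read "repeated sampling" is sampling with replacement, sketched below. The function name `oversample_templates`, the default of using m2 as the preset number, and the decision not to force every original template into the result are assumptions made for illustration.

```python
import random

def oversample_templates(pool, m2, m1, preset=None, seed=None):
    """If the pool holds fewer than m2 templates, resample it up to `preset` items."""
    pool = list(pool)
    if len(pool) >= m2:
        # z is not below the minimum threshold: leave the pool unchanged.
        return pool
    if not pool:
        raise ValueError("cannot sample from an empty template pool")
    if preset is None:
        preset = m2  # assumed default; the text allows any value from m2 to m1
    if not m2 <= preset <= m1:
        raise ValueError("preset number must lie between m2 and m1")
    # Sampling with replacement, so the same template may be drawn several times.
    return random.Random(seed).choices(pool, k=preset)

# Hypothetical intention with only 3 templates, m2 = 5, m1 = 8.
resampled = oversample_templates(["a", "b", "c"], m2=5, m1=8, preset=6, seed=0)
```

Because the draw is with replacement, a given original template can be absent from the result; an implementation that must keep every original could instead keep the pool and pad it with extra draws.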
In order to improve the corpus data enhancement effect, in the embodiment of the present application, assigning the slots included in the multiple target corpus templates according to the pre-stored word stock and the total number to obtain multiple corpus data includes:
if the total number is smaller than a preset minimum threshold value, acquiring a pre-stored weight and the number of corpus data generated by a target corpus template;
updating the quantity of the corpus data generated by the target corpus template according to the weight;
and assigning the slots contained in the target corpus templates according to the word stock, the quantity of corpus data generated by the updated target corpus templates and the total quantity to obtain a plurality of corpus data.
In the embodiment of the application, the electronic device may further be configured with the number n of corpus data generated by one target corpus template. On this basis, if the electronic device determines that the total number of the corpus templates corresponding to the intention and the target corpus templates is smaller than the minimum threshold, the electronic device need not sample each target corpus template or corpus template corresponding to the intention multiple times; instead, it updates the number of corpus data generated by one target corpus template to tn according to the pre-stored weight t, where t > 1.
The electronic device then assigns the slots contained in the plurality of target corpus templates according to the word stock, the updated number of corpus data generated by one target corpus template, and the total number, to obtain a plurality of corpus data.
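As a worked illustration of the weight update, suppose that each of the z templates for the intention generates the same number of corpus data. That uniformity is an assumption, and the values z = 3, m2 = 5, n = 10 and t = 2 are hypothetical, chosen only to make the arithmetic concrete:

```latex
% z is below the minimum threshold, so the per-template count n becomes t n
z = 3 < m_2 = 5, \qquad n' = t\,n = 2 \times 10 = 20
% corpus data generated for the intention before and after the update
z\,n = 3 \times 10 = 30 \quad\longrightarrow\quad z\,n' = 3 \times 20 = 60
```

Under these assumptions the under-represented intention ends up with twice as much corpus data without any template being sampled more than once.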
Fig. 7 is a schematic flow chart of generating corpus data according to an embodiment of the present application. As shown in Fig. 7, the process includes:
S701: determining the total number z of the corpus templates and target corpus templates corresponding to the intention to be processed.
S702: determining whether the total number z is larger than the maximum threshold m1; if yes, executing S703; if no, executing S704.
S703: extracting m1 templates from the corpus templates and target corpus templates corresponding to the intention, and executing S706.
S704: extracting all templates from the corpus templates and target corpus templates corresponding to the intention.
S705: determining whether the total number z is smaller than the minimum threshold m2; if no, executing S706; if yes, executing S707.
S706: setting the number of corpus data generated by one target corpus template to n, and executing S708.
S707: setting the number of corpus data generated by one target corpus template to tn, and executing S708.
S708: generating corpus data according to the number of corpus data generated by one target corpus template.
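The decision flow of S701 to S707 can be summarized in a short sketch. The function name `plan_generation` and its return shape are hypothetical, and S708 (the actual slot filling) is left out; the step labels in the comments map each branch back to the flow chart.

```python
import random

def plan_generation(templates, m1, m2, n, t, seed=None):
    """Return (selected_templates, per_template_count) following S701 to S707."""
    templates = list(templates)
    z = len(templates)                                          # S701
    if z > m1:                                                  # S702: yes
        selected = random.Random(seed).sample(templates, m1)    # S703
        return selected, n                                      # S706
    selected = templates                                        # S704
    if z < m2:                                                  # S705: yes
        return selected, t * n                                  # S707
    return selected, n                                          # S706

# Hypothetical thresholds: m1 = 6, m2 = 3, n = 4 sentences per template, weight t = 2.
many = [f"tpl_{i}" for i in range(10)]
few = ["tpl_a", "tpl_b"]
selected_many, count_many = plan_generation(many, m1=6, m2=3, n=4, t=2, seed=0)
selected_few, count_few = plan_generation(few, m1=6, m2=3, n=4, t=2)
```

In this reading an intention with too many templates is capped at m1 templates with n sentences each, while an intention with too few keeps all its templates and generates tn sentences from each.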
In order to improve the corpus data enhancement effect, based on the above embodiments, in the embodiment of the present application, the method for determining the minimum threshold includes:
determining, over the included intentions, the median of the total numbers corresponding to the respective intentions;
and determining the median as the minimum threshold.
In the embodiment of the present application, the setting of the minimum threshold may refer to the median value. Specifically, the electronic device determines the total number of the corpus templates corresponding to each intention and the target corpus templates, determines the median of the total number of each intention, and determines the median as a minimum threshold.
For example, the electronic device includes N intentions, and the electronic device sorts the total numbers of the intentions in ascending or descending order to obtain $X_{(1)}, \ldots, X_{(N)}$.
If N is odd, the minimum threshold is determined as $m_2 = X_{((N+1)/2)}$. If N is even, the minimum threshold is determined as $m_2 = \frac{X_{(N/2)} + X_{(N/2+1)}}{2}$.
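The median rule for the minimum threshold can be sketched with Python's standard library. The function name `minimum_threshold` and the dictionary of per-intention totals are hypothetical; `statistics.median` returns the middle value for an odd count and the mean of the two middle values for an even count.

```python
import statistics

def minimum_threshold(totals_per_intent):
    """Median of the per-intention template totals, used as the minimum threshold m2."""
    return statistics.median(totals_per_intent.values())

# Hypothetical totals (corpus templates plus target corpus templates) per intention.
odd_case = {"play_music": 4, "set_alarm": 10, "query_weather": 7}
even_case = {"play_music": 4, "set_alarm": 10, "query_weather": 7, "turn_on": 8}

m2_odd = minimum_threshold(odd_case)    # middle of 4, 7, 10
m2_even = minimum_threshold(even_case)  # mean of the two middle values 7 and 8
```

With an even number of intentions the threshold may be non-integer, which is harmless for a comparison such as z < m2.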
Fig. 8 is a schematic structural diagram of a corpus data enhancement device according to an embodiment of the present application, where the device includes:
an obtaining module 801, configured to obtain a corpus template corresponding to an intention to be processed, where the corpus template includes a necessary slot corresponding to the intention;
The expansion module 802 is configured to input the corpus template into a data expansion model, and obtain expanded multiple target corpus templates output by the data expansion model;
and the processing module 803 is configured to assign values to the slots included in the multiple target corpus templates according to a pre-stored word stock, so as to obtain multiple corpus data.
In a possible implementation manner, the processing module 803 is further configured to determine a necessary slot corresponding to the intention according to a pre-saved correspondence between the intention and the necessary slot; and receiving input information carrying other slots, and carrying out disordered arrangement on at least one of the necessary slots and the other slots to obtain the corpus template.
In a possible implementation manner, the processing module 803 is further configured to count the total number of corpus templates corresponding to the intent and the target corpus templates; and assigning the slots contained in the target corpus templates according to a pre-stored word stock and the total number to obtain a plurality of corpus data.
In a possible implementation manner, the processing module 803 is specifically configured to, if the total number is not smaller than a preset minimum threshold and not larger than a preset maximum threshold, assign, according to the word stock, the slots contained in all of the total number of target corpus templates and corpus templates to obtain a plurality of corpus data.
In a possible implementation manner, the processing module 803 is specifically configured to randomly extract the maximum threshold number of corpus templates and the target corpus template from the corpus templates and the target corpus templates corresponding to the intent if the total number is greater than a preset maximum threshold; and assigning the slots contained in the maximum threshold corpus templates and the target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
In a possible implementation manner, the processing module 803 is specifically configured to obtain the amount of corpus data generated by one pre-saved target corpus template; and assigning the slots contained in the target corpus templates according to the word stock, the quantity of corpus data generated by the target corpus templates and the total quantity to obtain a plurality of corpus data.
In a possible implementation manner, the processing module 803 is specifically configured to, if the total number is smaller than a preset minimum threshold, repeatedly sample the corpus templates corresponding to the intent and the target corpus templates, and obtain a preset number of corpus templates and target corpus templates; and assigning the slots contained in the preset number of corpus templates and the target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
In a possible implementation manner, the processing module 803 is specifically configured to obtain the pre-saved weight and the amount of corpus data generated by one target corpus template if the total amount is smaller than a pre-configured minimum threshold; updating the quantity of the corpus data generated by the target corpus template according to the weight; and assigning the slots contained in the target corpus templates according to the word stock, the quantity of corpus data generated by the updated target corpus templates and the total quantity to obtain a plurality of corpus data.
In a possible implementation manner, the processing module 803 is further configured to determine, over the included intentions, the median of the total numbers corresponding to the respective intentions; and determine the median as the minimum threshold.
On the basis of the foregoing embodiment, the embodiment of the present application further provides an electronic device, and fig. 9 is a schematic structural diagram of the electronic device provided by the embodiment of the present application, as shown in fig. 9, including: the processor 91, the communication interface 92, the memory 93 and the communication bus 94, wherein the processor 91, the communication interface 92 and the memory 93 complete communication with each other through the communication bus 94;
The memory 93 has stored therein a computer program which, when executed by the processor 91, causes the processor 91 to perform the steps of any of the corpus data enhancement methods described above.
Because the electronic device solves the problem on a principle similar to that of the corpus data enhancement method, its implementation can refer to the method embodiments, and repeated details are omitted.
The communication bus mentioned above for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus. The communication interface 92 is used for communication between the above-described electronic device and other devices. The memory may include Random Access Memory (RAM) or Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit, a field programmable gate array or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
On the basis of the above embodiments, the embodiments of the present application further provide a computer readable storage medium, in which a computer program executable by a processor is stored, which when executed on the processor causes the processor to perform the steps of any of the corpus data enhancement methods described above.
Since the above-mentioned computer readable storage medium solves the problem on a principle similar to that of the corpus data enhancement method, its implementation can refer to the method embodiments, and repeated details are omitted.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method for enhancing corpus data, the method comprising:
acquiring a corpus template corresponding to an intention to be processed, wherein the corpus template comprises necessary slots corresponding to the intention;
inputting the corpus templates into a data expansion model, and acquiring a plurality of expanded target corpus templates output by the data expansion model;
and assigning values to the slots contained in the target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
2. The method of claim 1, wherein the process of determining the corpus template comprises:
determining a necessary slot position corresponding to the intention according to a pre-stored corresponding relation between the intention and the necessary slot position;
and receiving input information carrying other slots, and carrying out disordered arrangement on at least one of the necessary slots and the other slots to obtain the corpus template.
3. The method of claim 1, wherein before assigning the slots included in the plurality of target corpus templates according to the pre-stored word stock to obtain a plurality of corpus data, the method further comprises:
counting the total number of the corpus templates corresponding to the intentions and the target corpus templates;
assigning values to the slots contained in the target corpus templates according to a pre-stored word stock, and obtaining a plurality of corpus data comprises the following steps:
and assigning the slots contained in the target corpus templates according to a pre-stored word stock and the total number to obtain a plurality of corpus data.
4. The method of claim 3, wherein assigning the slots included in the plurality of target corpus templates according to the pre-stored word stock and the total number to obtain a plurality of corpus data comprises:
if the total number is not smaller than a preset minimum threshold and not larger than a preset maximum threshold, assigning, according to the word stock, the slots contained in all of the total number of target corpus templates and corpus templates to obtain a plurality of corpus data.
5. The method of claim 3, wherein assigning the slots included in the plurality of target corpus templates according to the pre-stored word stock and the total number to obtain a plurality of corpus data comprises:
if the total number is larger than a preset maximum threshold, randomly extracting a number of corpus templates and target corpus templates equal to the maximum threshold from the corpus templates and target corpus templates corresponding to the intention;
and assigning the slots contained in the extracted corpus templates and target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
6. The method according to claim 4 or 5, wherein assigning the slots included in the plurality of target corpus templates according to the pre-stored word stock and the total number to obtain a plurality of corpus data includes:
acquiring the quantity of corpus data generated by a pre-stored target corpus template;
And assigning the slots contained in the target corpus templates according to the word stock, the quantity of corpus data generated by the target corpus templates and the total quantity to obtain a plurality of corpus data.
7. The method of claim 3, wherein assigning the slots included in the plurality of target corpus templates according to the pre-stored word stock and the total number to obtain a plurality of corpus data comprises:
if the total number is smaller than a preset minimum threshold, repeatedly sampling the corpus templates corresponding to the intentions and the target corpus templates to obtain a preset number of corpus templates and target corpus templates;
and assigning the slots contained in the preset number of corpus templates and the target corpus templates according to a pre-stored word stock to obtain a plurality of corpus data.
8. The method of claim 3, wherein assigning the slots included in the plurality of target corpus templates according to the pre-stored word stock and the total number to obtain a plurality of corpus data comprises:
if the total number is smaller than a preset minimum threshold value, acquiring a pre-stored weight and the number of corpus data generated by a target corpus template;
Updating the quantity of the corpus data generated by the target corpus template according to the weight;
and assigning the slots contained in the target corpus templates according to the word stock, the quantity of corpus data generated by the updated target corpus templates and the total quantity to obtain a plurality of corpus data.
9. The method of claim 3, wherein the method of determining the minimum threshold comprises:
determining, over the included intentions, the median of the total numbers corresponding to the respective intentions;
and determining the median as the minimum threshold.
10. An electronic device, characterized in that it comprises a processor for implementing the steps of the corpus data enhancement method according to any of claims 1-9 when executing a computer program stored in a memory.
CN202310800573.1A 2023-06-30 2023-06-30 Corpus data enhancement method and device Pending CN117010412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310800573.1A CN117010412A (en) 2023-06-30 2023-06-30 Corpus data enhancement method and device

Publications (1)

Publication Number Publication Date
CN117010412A true CN117010412A (en) 2023-11-07

Family

ID=88564539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310800573.1A Pending CN117010412A (en) 2023-06-30 2023-06-30 Corpus data enhancement method and device

Country Status (1)

Country Link
CN (1) CN117010412A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination