CN115495541B - Corpus database, corpus database maintenance method, apparatus, device and medium - Google Patents


Info

Publication number
CN115495541B
CN115495541B
Authority
CN
China
Prior art keywords
data
data set
task
unit
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211443162.3A
Other languages
Chinese (zh)
Other versions
CN115495541A (en)
Inventor
林余楚
古树桦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyi Information Technology Zhuhai Co ltd
Original Assignee
Shenyi Information Technology Zhuhai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyi Information Technology Zhuhai Co ltd
Priority to CN202211443162.3A
Publication of CN115495541A
Application granted
Publication of CN115495541B
Status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a corpus database, a corpus database maintenance method, a corpus database maintenance device and a corpus database maintenance medium, wherein the corpus database maintenance method comprises: performing fine-grained analysis on the basic data set written into the corpus database from different preset dimensions and determining the application type of the basic data set, wherein the application type comprises a universality type and a specific task type; based on the application type, carrying out an aggregation operation on the basic data set to obtain a training task corresponding to the basic data set; analyzing and training the basic data set by adopting a pre-training language model according to the training task to obtain a target data set; and, when a data interaction instruction is received, performing data interaction with each target data set. Analysis, aggregation and interaction of the written basic data set are thereby achieved, so that the written data set has strong adaptability to various tasks and the quality of the data set is improved.

Description

Corpus database, corpus database maintenance method, apparatus, device and medium
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a corpus database, a corpus database maintenance method, a corpus database maintenance device and a corpus database maintenance medium.
Background
Machine learning relies on data, and most of the data collected and processed by technicians gains its value within the artificial intelligence industry. Most existing data processing tools, however, are built on top of existing data rather than on how the data should be interpreted and manipulated. This over-focus on existing data makes data processing a passive approach and is very costly.
Interpreting data means that a tool is adapted to the grammar, features and other information of the original data, so tools built on the data (such as rule analysis, data labeling and classifiers) can only fit that data. Most of these tools are produced by data creators using their own techniques, without applying the corresponding natural language processing techniques. Without the prior knowledge of natural language processing, the analyzed data information cannot be used by machine learning. Manipulating data means that, having lost the interpretation and analysis methods of natural language processing knowledge, a tool built on the original data cannot actively adapt to the training mode of machine learning, which affects model training.
This approach not only reduces overall development efficiency, but also makes project development focus on data processing operations rather than on substantial improvements and on the creation of machine learning and artificial intelligence algorithms. It also hinders resource reusability: for example, to save the development cost of data processing, a data creator often adopts language-independent features and mixes languages during data preprocessing, so that subsequent real projects must spend a large amount of resources on data standardization.
Disclosure of Invention
The embodiment of the invention provides a corpus database, a corpus database maintenance method, a corpus database maintenance device, computer equipment and a storage medium, which are used for improving the quality of a natural language data set.
In order to solve the above technical problem, an embodiment of the present application provides a corpus database, where the corpus database includes a data analysis module and a data interaction module;
the data analysis module comprises a basic expression unit, a data table, an embedded expression unit, a deviation analysis unit, a cluster prediction unit and a prompt learning unit, wherein,
the basic expression unit is used for analyzing basic information of the data; the embedded expression unit is used for embedding data in a layered mode through a model and projecting the data to multiple dimensions, so that data set characteristics can be browsed in a visualized view; the deviation analysis unit is used for carrying out data error check according to the reference data set; the clustering prediction unit is used for predicting labels of a data set, wherein the labels of the data set comprise a classification task, a text generation task, a voice model probability task and a structured prediction task; the prompt learning unit is used for predicting the performance of the data set and the output score of the index so as to prompt a machine learning method of a subsequent task;
the data interaction module comprises a data standardization unit, a data editor, a preprocessing task unit, a data enhancement unit and a result feedback unit, wherein,
the preprocessing task unit is used for performing natural language processing tasks through a preprocessing model so as to provide a data set for task execution; and the data enhancement unit is used for performing data completion and augmentation on the data set.
Optionally, for a given data set, any one of the basic expression unit, the data table, the embedded expression unit, the deviation analysis unit, the cluster prediction unit, the prompt learning unit, the data normalization unit, the data editor, the preprocessing task unit, the data enhancement unit and the result feedback unit is adopted to perform processing individually, or two or more module units are combined to perform processing as a whole.
In order to solve the foregoing technical problem, an embodiment of the present application provides a method for maintaining a corpus database, where the method for maintaining the corpus database includes:
performing fine-grained analysis on a basic data set written into a corpus database from different dimensions, and determining an application type of the basic data set, wherein the dimensions are preset, and the application type comprises a universality type and a specific task type;
based on the application type, carrying out aggregation operation on the basic data set to obtain a training task corresponding to the basic data set;
analyzing and training the basic data set by adopting a pre-training language model according to the training task to obtain a target data set;
and when a data interaction instruction is received, performing data interaction by adopting each target data set.
Optionally, after the performing fine-grained analysis on the basic data set written into the corpus database from different dimensions to determine an application type of the basic data set, and before performing an aggregation operation on the basic data based on the application type to obtain a training task corresponding to the basic data set, the method further includes:
performing bias analysis on the base dataset, the bias analysis including random error bias calculation and system bias calculation; wherein,
the random error bias calculation is estimated based on a statistical method to ensure the complete implementation of the randomization principle in the sampling algorithm;
the system bias calculation determines the direction of data field adaptability, estimates the magnitude of bias by adopting a preset label, and performs matching and screening by adopting the information of a contrast group data set based on the magnitude of bias, wherein the contrast group data set is generated in advance according to a system bias calculation method.
Optionally, the data interaction instruction includes data preprocessing, data enhancement, and data searching.
Optionally, the data interaction instruction is data enhancement, and when the data interaction instruction is received, performing data interaction by using each target data set includes:
preprocessing the target data set to obtain a standard data set;
adopting different data deviation disturbances of a reference data set to carry out robustness test on the standard data set;
if the standard data set passes the robustness test, performing data enhancement processing by adopting a preset data enhancement mode, wherein the preset data enhancement mode comprises the following steps: named entity recognition replacement, masking operations, and unsupervised consistency replacement.
Optionally, the data interaction instruction is data search, and performing data interaction by using each target data set when the data interaction instruction is received includes:
receiving a query statement;
extracting specific terms from the query sentence in a natural language task processing mode, or supplementing the query sentence in a correction expansion mode by using Boolean matching to obtain the user intention;
according to the specific terms or the user intention, performing matching queries on each target data set, taking the successfully matched documents as target documents, acquiring documents of the same category as the target documents in a clustering mode as reference documents, and taking the reference documents and the target documents as search results;
alternatively,
vectorizing the specific term/the user intention and the target data set, constructing a matching model according to the cross characteristics of the specific term/the user intention and the target data set, distributing simulation parameters, and scoring the matching degree by a machine learning method to obtain retrieval and sequencing results.
In order to solve the foregoing technical problem, an embodiment of the present application provides a maintaining device for a corpus database, where the maintaining device for the corpus database includes:
the data set analysis module is used for performing fine-grained analysis on the basic data set written into the corpus database from different dimensions, and determining the application type of the basic data set, wherein the dimensions are preset, and the application type comprises a universality type and a specific task type;
the task determining module is used for carrying out aggregation operation on the basic data set based on the application type to obtain a training task corresponding to the basic data set;
the data training module is used for analyzing and training the basic data set by adopting a pre-training language model according to the training task to obtain a target data set;
and the data set interaction module is used for carrying out data interaction by adopting each target data set when a data interaction instruction is received.
Optionally, the apparatus for maintaining the corpus database further includes:
a bias analysis module for performing bias analysis on the base data set, the bias analysis including random error bias calculation and system bias calculation; wherein,
the random error bias calculation is estimated based on a statistical method to ensure the complete implementation of a randomization principle in a sampling algorithm;
the system bias calculation determines the direction of data field adaptability, estimates the magnitude of bias by adopting a preset label, and performs matching and screening by adopting the information of a contrast group data set based on the magnitude of bias, wherein the contrast group data set is generated in advance according to a system bias calculation method.
Optionally, the data interaction instruction is data enhancement, and the data set interaction module includes:
the data preprocessing unit is used for preprocessing the target data set to obtain a standard data set;
the robustness testing unit is used for carrying out robustness testing on the standard data set by adopting different data deviation disturbances of the reference data set;
a data augmentation unit, configured to perform data enhancement processing in a preset data augmentation mode if the standard data set passes a robustness test, where the preset data augmentation mode includes: named entity recognition replacement, masking operations, and unsupervised consistency replacement.
Optionally, the data interaction instruction is data search, and the data set interaction module includes:
a receiving unit for receiving a query statement;
the data extraction unit is used for extracting specific terms from the query statement in a natural language task processing mode, or supplementing the query statement in a correction expansion mode by using Boolean matching to obtain the user intention;
the first searching unit is used for performing matching queries on each target data set according to the specific term or the user intention, taking a successfully matched document as a target document, acquiring a document of the same category as the target document in a clustering mode to serve as a reference document, and taking the reference document and the target document as a search result; alternatively,
and the second searching unit is used for vectorizing the specific term/the user intention and the target data set, constructing a matching model according to the cross characteristics of the specific term/the user intention and the target data set, distributing simulation parameters, and scoring the matching degree by a machine learning method to obtain retrieval and sequencing results.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above corpus database maintenance method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above corpus database maintenance method.
The corpus database, the corpus database maintenance method and device, the computer equipment and the storage medium provided by the embodiments of the invention determine the application type of the basic data set by performing fine-grained analysis on the basic data set written into the corpus database from different preset dimensions, wherein the application type comprises a universality type and a specific task type; based on the application type, an aggregation operation is carried out on the basic data set to obtain a training task corresponding to the basic data set; the basic data set is analyzed and trained by adopting a pre-training language model according to the training task to obtain a target data set; and when a data interaction instruction is received, data interaction is performed with each target data set. Analysis, aggregation and interaction of the written basic data set are thereby achieved, so that the written data set has strong adaptability to various tasks and its data quality is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a schematic diagram of one embodiment of a corpus database of the present application;
FIG. 3 is a flow chart of one embodiment of a corpus database maintenance method of the present application;
FIG. 4 is a schematic diagram illustrating an embodiment of a corpus database maintenance apparatus according to the present application;
FIG. 5 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, as shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop portable computer, a desktop computer, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the corpus database maintenance method provided in the embodiment of the present application is executed by a server, and accordingly, the corpus database and the corpus database maintenance device are disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.
Referring to fig. 2, fig. 2 shows a corpus database applied to the server in fig. 1 according to an embodiment of the present invention, including a data analysis module and a data interaction module;
the data analysis module comprises a basic expression unit, a data table, an embedded expression unit, a deviation analysis unit, a cluster prediction unit and a prompt learning unit, wherein,
the basic expression unit is used for analyzing basic information of the data; the embedded expression unit is used for embedding the data in a layered mode through the model and projecting the data to multiple dimensions, so that data set characteristics can be browsed in a visualized view; the deviation analysis unit is used for performing a data error check based on the reference data set; the clustering prediction unit is used for predicting labels of a data set, wherein the labels of the data set comprise a classification task, a text generation task, a voice model probability task and a structured prediction task; the prompt learning unit is used for predicting the performance of the data set and the output scores of the indexes so as to prompt a machine learning method of a subsequent task;
the data interaction module comprises a data standardization unit, a data editor, a preprocessing task unit, a data enhancement unit and a result feedback unit, wherein,
the preprocessing task unit is used for performing natural language processing tasks through a preprocessing model so as to provide a data set for task execution; and the data enhancement unit is used for performing data completion and augmentation on the data set.
Optionally, for a given data set, any one of the basic expression unit, the data table, the embedded expression unit, the deviation analysis unit, the cluster prediction unit, the prompt learning unit, the data standardization unit, the data editor, the preprocessing task unit, the data enhancement unit and the result feedback unit is adopted to perform processing individually, or two or more module units are combined to perform processing as a whole.
Optionally, the data interaction module further comprises a data normalization unit for performing basic cleaning of impurities and noisy data on the data set and producing a structured data representation, and a data editor unit for providing editable detailed information on the selected data.
In this embodiment, an analysis library integrating natural language processing knowledge is constructed based on the data analysis module, and an interaction library integrating natural language processing knowledge is constructed by the data interaction module. The design of the analysis library and the interaction library focuses on high-level summarization of information beyond the data itself, rather than on a single data set, while still supporting careful and repeatable analysis and operations. Starting from global data set information, data set quality, the reusability of research code and development efficiency can be improved without paying extra attention to any single data set. The framework also integrates natural language processing knowledge, providing a certain amount of knowledge for deep learning during data analysis and interaction and increasing the value of the data set.
It should be noted that a large number of existing toolkits support data processing for various natural language processing tasks, with the goal of making it easier to construct composable natural language processing data workflows. Taking the Natural Language Toolkit (NLTK) as an example, NLTK is a suite of libraries and programs for statistical natural language processing that provides part-of-speech analysis, similar-word recognition and retrieval, classification tasks, semantic interpretation and metric evaluation for data processing operations. Such a tool can be used as an extension module in the analysis and interaction library to adapt to the processing requirements of future data sets, increasing extensibility and allowing corpora of currently and subsequently analyzed languages, scientific computing libraries, 2D data-visualization plotting libraries and network-structure libraries to be merged into the library.
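As a hedged illustration only, the following minimal Python sketch shows the kind of NLTK building blocks (tokenization and part-of-speech tagging) such an extension module might wrap; the resource names passed to nltk.download are common defaults and may vary across NLTK versions.

import nltk

# Download tokenizer and tagger resources (names may differ between NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The corpus database aggregates data sets for downstream NLP tasks."
tokens = nltk.word_tokenize(sentence)   # word segmentation
tags = nltk.pos_tag(tokens)             # part-of-speech tagging
print(tags)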
Referring to fig. 3, fig. 3 shows a corpus database maintenance method according to an embodiment of the present invention, which is described by taking the corpus database in fig. 2 as an example, and is detailed as follows:
s201: and performing fine-grained analysis on the basic data set written into the corpus database from different dimensions, and determining the application type of the basic data set, wherein the dimensions are preset, and the application type comprises a general type and a specific task type.
The fine-grained analysis refers to analysis in different dimensions, that is, the data set in use can be analyzed at the sample level or at the data set level. Samples are analyzed for general sample-level text length or corpus-level average length, or for data-set-level averages of a specific task (e.g., article abstract compression). The analysis method therefore not only provides rich sample-level and data-set-level characteristics, but also computes and stores this information in the database for easy viewing.
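A minimal sketch of such sample-level and data-set-level features is shown below; the field name "text" and the use of word count as the length measure are illustrative assumptions, not the patent's exact feature set.

from statistics import mean

dataset = [
    {"text": "Short example sentence."},
    {"text": "A somewhat longer example sentence written for the corpus."},
]

# Sample-level feature: text length per sample (here, word count).
sample_features = [{"length": len(s["text"].split())} for s in dataset]

# Data-set-level feature: corpus-level average length, stored for easy viewing.
dataset_features = {"avg_length": mean(f["length"] for f in sample_features)}

print(sample_features, dataset_features)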
S202: and based on the application type, carrying out aggregation operation on the basic data set to obtain a training task corresponding to the basic data set.
The aggregation operation is a statistical method computed at the corpus level: the main natural language processing tasks are aggregated, and the label distribution is analyzed to determine whether the data set serves a main task such as natural language generation or natural language inference.
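The sketch below illustrates reading a candidate training task off the label distribution; the threshold and the mapping from distribution to task type are assumptions for illustration, not the patent's exact criterion.

from collections import Counter

samples = [
    {"text": "A man is playing guitar.", "label": "entailment"},
    {"text": "A woman is sleeping.", "label": "contradiction"},
    {"text": "Two dogs run in a park.", "label": "entailment"},
]

label_distribution = Counter(s["label"] for s in samples)

# Heuristic: a small closed label set suggests a classification-style main task
# such as natural language inference; long free-form targets would instead
# point to natural language generation.
if len(label_distribution) <= 10:
    training_task = "natural_language_inference"
else:
    training_task = "natural_language_generation"

print(label_distribution, training_task)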
Optionally, after the step S201 and before the step S202, that is, after performing fine-grained analysis on the basic data set written into the corpus database from different dimensions to determine an application type of the basic data set, and before performing an aggregation operation on the basic data based on the application type to obtain a training task corresponding to the basic data set, the method further includes:
performing bias analysis on the basic data set, the bias analysis comprising random error bias calculation and system bias calculation; wherein,
the random error bias calculation is estimated based on a statistical method to ensure the complete implementation of a randomization principle in a sampling algorithm;
the system bias calculation determines the direction of the adaptability of the data field, the magnitude of the bias is evaluated by adopting a preset label, and based on the magnitude of the bias, the information of a contrast group dataset is adopted for matching and screening, and the contrast group dataset is generated in advance according to the system bias calculation method.
Data bias refers to the difference between the average of a large number of measured values (the values in a reference data set) and the true value of the written data; a unified analysis method for data bias is established to identify or prevent data bias problems more effectively. Data differences affect the systematic errors produced at each stage of natural language processing, and such errors cannot be avoided in practical applications. In this embodiment, bias analysis is treated as part of the natural language processing task: a bias problem in a data set may cause inconsistency between the data information and the data adaptation field, which affects the accuracy of subsequent processing. With the corpus database of this embodiment, the characteristics of each sample are pre-computed using internal tools in the analysis and interaction library so as to identify potential information in the data. This computation can be divided into random error bias calculation and system bias calculation. The random bias calculation strictly follows statistical methods for estimation so as to ensure the complete implementation of the randomization principle in the sampling algorithm; the system bias calculation determines the direction of data-domain fitness and uses the control-group data set information generated by the analysis method to match and screen according to the bias magnitude estimated with specific labels. Through bias analysis, this solution screens a more accurate data set for model training so that a more robust system can be designed.
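A hedged sketch of the two checks follows: a randomness check on sampling and a label-distribution comparison against a control-group (reference) data set. The chosen metrics and example values are assumptions for illustration.

import random
from collections import Counter

def random_error_bias(values, sample_size, trials=1000):
    """Compare the mean of repeated random samples against the population mean."""
    population_mean = sum(values) / len(values)
    sample_means = [
        sum(random.sample(values, sample_size)) / sample_size
        for _ in range(trials)
    ]
    return abs(sum(sample_means) / trials - population_mean)

def systematic_bias(labels, reference_labels):
    """Total variation distance between label distributions (assumed metric)."""
    p, q = Counter(labels), Counter(reference_labels)
    keys = set(p) | set(q)
    return 0.5 * sum(
        abs(p[k] / len(labels) - q[k] / len(reference_labels)) for k in keys
    )

lengths = [12, 15, 9, 22, 18, 11, 14, 20, 16, 13]
print(random_error_bias(lengths, sample_size=4))
print(systematic_bias(["pos", "neg", "pos"], ["pos", "neg", "neg", "neg"]))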
S203: and analyzing and training the basic data set by adopting a pre-training language model according to the training task to obtain a target data set.
In this embodiment, prompt-based learning is adopted, that is, a pre-trained language model is used to define prompt patterns in the data analysis process. The prompt patterns cover many aspects, such as features, metadata, attributes and even predicted data set performance. By setting different pre-training modes, a method is designed to help the data creator add information to the data.
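Purely as an illustration of attaching data-set metadata to a prompt, the template below is hypothetical; the field names and wording are not taken from the patent.

# Hypothetical prompt template combining data-set features and metadata.
PROMPT_TEMPLATE = (
    "Dataset: {name}\n"
    "Task type: {task}\n"
    "Average sample length: {avg_length}\n"
    "Predict the expected metric score for this dataset."
)

prompt = PROMPT_TEMPLATE.format(name="demo-corpus", task="classification", avg_length=18.4)
print(prompt)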
The pre-trained language model is one of the cores of NLP and plays a significant role in the pretrain-finetune stage of NLP development. The unsupervised training of a pre-trained language model makes it easy to acquire a large number of training samples, and the trained language model contains a great deal of semantic and grammatical knowledge, so the effect of downstream tasks can be significantly improved.
Pre-trained language models are best discussed starting from word vectors. Word vectors construct co-occurrence relations among words from text data: words that co-occur in a sentence are generally taken as positive samples, negative samples are constructed by random negative sampling, and training proceeds in the CBOW or Skip-Gram manner, so that frequently co-occurring words obtain similar vectorized representations. The essence is an NLP prior: two words that frequently co-occur in text tend to be semantically similar. However, the problem with word vectors is also obvious: the meaning of a word often differs in different contexts, while a word vector can only produce one fixed vector per word and cannot be adjusted with context information. With a pre-trained language model, the pre-trained model is fine-tuned directly on the downstream task, and different input or output layers are adopted and modified for different tasks, so that the downstream task is brought closer to the upstream pre-training. It is worth mentioning that subsequent optimizations such as prompting go a step further: the input and output logic of the downstream task is also changed to adapt to the upstream task.
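For the CBOW and Skip-Gram training mentioned above, the following gensim sketch is illustrative only; the toy corpus and hyperparameters are assumptions.

from gensim.models import Word2Vec

sentences = [
    ["corpus", "database", "stores", "annotated", "text"],
    ["language", "models", "learn", "from", "text", "corpora"],
]

# sg=0 selects CBOW, sg=1 selects Skip-Gram.
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(cbow.wv.most_similar("text", topn=3))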
Common pre-trained language models include, but are not limited to, ELMo, GPT, BERT and the GPT series.
S204: and when a data interaction instruction is received, performing data interaction by adopting each target data set.
Optionally, the data interaction instruction comprises data preprocessing, data enhancement and data searching.
Data preprocessing is an indispensable step in deep learning and machine learning model training, and the quality of a data set directly influences model learning.
Optionally, the data interaction instruction is data enhancement, and when the data interaction instruction is received, performing data interaction by using each target data set includes:
preprocessing a target data set to obtain a standard data set;
adopting different data deviation disturbances of the reference data set to carry out robustness test on the standard data set;
if the standard data set passes the robustness test, a preset data augmentation mode is adopted for data enhancement, wherein the preset data augmentation mode comprises the following steps: named entity recognition replacement, masking operations, and unsupervised consistency replacement.
Named entity recognition replacement is a subtask of locating and classifying named entities in unstructured text: named-entity expressions carrying proper-noun labeling information are generated from the unstructured text, and a machine algorithm replaces the named entities with suitable substitutes so as to achieve data enhancement.
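A hedged sketch of entity replacement using spaCy follows; the pipeline name "en_core_web_sm" and the replacement dictionary are assumptions rather than the patent's own algorithm.

import spacy

nlp = spacy.load("en_core_web_sm")

def replace_entities(text, replacements):
    """Swap recognized entities for same-type substitutes to augment data."""
    doc = nlp(text)
    out = text
    for ent in reversed(doc.ents):  # iterate in reverse so character offsets stay valid
        if ent.label_ in replacements:
            out = out[:ent.start_char] + replacements[ent.label_] + out[ent.end_char:]
    return out

print(replace_entities("Alice visited Paris in 2021.",
                       {"PERSON": "Bob", "GPE": "Berlin"}))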
The masking operation uses a pre-training model to randomly mask specific words during preprocessing so as to expand the text, so that the model learns to predict the masked words from context, which avoids over-fitting and gives the model better robustness.
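As one possible way to realize such masking, the sketch below uses the Hugging Face fill-mask pipeline; the model name "bert-base-uncased" is an assumption, not a choice prescribed by the patent.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

masked = "The corpus database supports data [MASK] for downstream tasks."
for prediction in fill_mask(masked, top_k=3):
    # Each prediction carries the candidate token and its probability score.
    print(prediction["token_str"], round(prediction["score"], 3))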
Unsupervised consistency replacement uses the weighting techniques of information retrieval and text mining to find words in the data set that carry little feature information; because such words provide no information for model training, they can be replaced without affecting the ground truth of the data set, thereby reducing useless information.
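Assuming a TF-IDF-style weighting (a common information-retrieval weighting; the patent does not name a specific scheme), the sketch below flags words that never receive a high weight in any document as replacement candidates; the threshold is arbitrary.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the corpus database stores text data",
    "the model learns from the text data",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Highest TF-IDF weight each word reaches across the corpus; words whose weight
# is low in every document carry little discriminative information.
max_weight = tfidf.toarray().max(axis=0)

low_information = [
    word for word, weight in zip(vectorizer.get_feature_names_out(), max_weight)
    if weight < 0.35
]
print(low_information)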
In this embodiment, a set of data operation flows conforming to engineering and natural language processing knowledge should be designed before data interaction, and the data operation flows include the following flows:
data preprocessing: the preprocessing process includes general function functions such as basic operations of size limitation, regular expression, data filtering, and the like. To promote the way in which preprocessing can be performed globally without loss of information, tools that compute sample-level features of a given text will be used in data preprocessing operations to upgrade and improve the functionality of generic functions.
Data enhancement: first, an adversarial evaluation is performed on the data set interpreted by the data analysis method: the data creator specifies a reference data set, and different data-deviation perturbations between the preprocessed data set samples and the reference data set samples are tested to test the robustness of the preprocessed data set; then the enhanced data set is constructed, and editing operations on the original data set can be performed, such as named entity recognition replacement, masking operations and unsupervised consistency replacement. The data analysis and interaction library of this patent provides a uniform data interaction interface, and a data creator can easily select tasks suitable for the data set to increase its value.
Data searching: data search is described using the research ideas of natural language processing. A retrieval system searches for specific information in the training set through semantic matching: specific terms are extracted by relying on natural language tasks such as word segmentation and named entity recognition; or Boolean matching is used to supplement the data creator's query statement through correction, expansion and similar means so as to understand the user intention, and when the target document is found, documents of the same category are returned with the result using a clustering method to improve recall; or data similarity calculation is used: the cross-feature representations of the data set and the target query data are vectorized as features, a suitable model is constructed, simulated parameters are assigned, and the matching degree of the two is scored as a potential function by a machine learning method to obtain retrieval and ranking results.
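For the similarity-based search path, the sketch below vectorizes the query and the corpus with TF-IDF and ranks documents by cosine similarity; the vectorizer and the scoring choice are assumptions standing in for the matching model described above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "corpus database maintenance method",
    "data enhancement by entity replacement",
    "search documents by semantic matching",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)

def search(query, top_k=2):
    """Score every document against the query and return the top-ranked ones."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = sorted(zip(scores, corpus), reverse=True)[:top_k]
    return ranked

print(search("semantic document search"))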
Optionally, the data interaction instruction is data search, and when the data interaction instruction is received, performing data interaction by using each target data set includes:
receiving a query statement;
extracting specific terms from the query sentences in a natural language task processing mode, or supplementing the query sentences in a correction and expansion mode by using Boolean matching to obtain the user intention;
according to the specific terms or the user intention, performing matching queries on each target data set, taking the successfully matched documents as target documents, acquiring documents of the same category as the target documents in a clustering mode as reference documents, and taking the reference documents and the target documents as search results;
alternatively,
vectorizing the specific terms/user intentions and the target data set, constructing a matching model according to the cross characteristics of the specific terms/user intentions and the target data set, distributing simulation parameters, and scoring the matching degree by a machine learning method to obtain retrieval and sequencing results.
In this embodiment, fine-grained analysis is performed from different preset dimensions on the basic data set written into the corpus database, and the application type of the basic data set is determined, wherein the application type comprises a general type and a specific task type; based on the application type, an aggregation operation is carried out on the basic data set to obtain a training task corresponding to the basic data set; according to the training task, the basic data set is analyzed and trained by adopting a pre-training language model to obtain a target data set; and when a data interaction instruction is received, data interaction is performed with each target data set. Analysis, aggregation and interaction of the written basic data set are thereby achieved, so that the written data set has strong adaptability to various tasks and its data quality is improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 4 is a schematic block diagram of a corpus database maintenance device corresponding to the corpus database in one-to-one correspondence with the above-described embodiment. As shown in fig. 4, the apparatus for maintaining the corpus database includes a data set analyzing module 31, a task determining module 32, a data training module 33, and a data set interacting module 34. The functional modules are explained in detail as follows:
the data set analysis module 31 is configured to perform fine-grained analysis on the basic data set written into the corpus database from different dimensions, and determine an application type of the basic data set, where the dimensions are preset, and the application type includes a general type and a specific task type;
the task determination module 32 is configured to perform aggregation operation on the basic data set based on the application type to obtain a training task corresponding to the basic data set;
the data training module 33 is configured to perform analysis training on the basic data set by using a pre-training language model according to a training task to obtain a target data set;
and the data set interaction module 34 is configured to perform data interaction with each target data set when receiving the data interaction instruction.
Optionally, the apparatus for maintaining a corpus database further includes:
the bias analysis module is used for carrying out bias analysis on the basic data set, and the bias analysis comprises random error bias calculation and system bias calculation; wherein,
the random error bias calculation is estimated based on a statistical method to ensure the complete implementation of a randomization principle in a sampling algorithm;
the system bias calculation determines the direction of the adaptability of the data field, the magnitude of the bias is evaluated by adopting a preset label, and based on the magnitude of the bias, the information of a contrast group dataset is adopted for matching and screening, and the contrast group dataset is generated in advance according to the system bias calculation method.
Optionally, the data interaction instruction is data enhancement, and the data set interaction module 34 includes:
the data preprocessing unit is used for preprocessing the target data set to obtain a standard data set;
the robustness testing unit is used for carrying out robustness testing on the standard data set by adopting different data deviation disturbances of the reference data set;
and the data augmentation unit is used for performing data enhancement processing by adopting a preset data augmentation mode if the standard data set passes the robustness test, wherein the preset data augmentation mode comprises the following steps: named entity recognition replacement, masking operations, and unsupervised consistency replacement.
Optionally, the data interaction instruction is a data search, and the data set interaction module 34 includes:
a receiving unit for receiving a query statement;
the data extraction unit is used for extracting specific terms from the query sentences in a natural language task processing mode, or supplementing the query sentences in a correction and expansion mode by using Boolean matching to obtain the user intention;
the first searching unit is used for performing matching queries on each target data set according to a specific term or user intention, taking the successfully matched document as a target document, acquiring the document of the same category as the target document in a clustering mode to serve as a reference document, and taking the reference document and the target document as a search result; alternatively,
and the second searching unit is used for vectorizing the specific terms/user intentions and the target data set, constructing a matching model according to the cross characteristics of the specific terms/user intentions and the target data set, distributing simulation parameters, and scoring the matching degree by a machine learning method to obtain retrieval and sequencing results.
For the specific definition of the corpus database maintenance device, reference may be made to the definition of the corpus database above, and details are not described herein again. All or part of each module in the corpus database maintenance device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 5, fig. 5 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the components memory 41, processor 42 and network interface 43 is shown, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may alternatively be implemented. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or D interface display memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as program codes for controlling electronic files. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run a program code stored in the memory 41 or process data, for example, a program code of a maintenance method for a corpus database.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing a communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium, wherein the computer-readable storage medium stores an interface display program, and the interface display program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the method for maintaining a corpus database as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not restrictive, of the broad invention, and that the appended drawings illustrate preferred embodiments of the invention and do not limit the scope of the invention. This application is capable of embodiments in many different forms and is provided for the purpose of enabling a thorough understanding of the disclosure of the application. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the present application may be practiced without modification or with equivalents of some of the features described in the foregoing embodiments. All equivalent structures made by using the contents of the specification and the drawings of the present application are directly or indirectly applied to other related technical fields and are within the protection scope of the present application.

Claims (8)

1. A maintenance method of a corpus database is characterized in that the maintenance method is applied to the corpus database, and the corpus database comprises a data analysis module and a data interaction module; the data analysis module comprises a basic expression unit, a data table, an embedded expression unit, a deviation analysis unit, a clustering prediction unit and a prompt learning unit, wherein the basic expression unit is used for analyzing basic information of data; the embedded expression unit is used for embedding data in a layered mode through a model and projecting the data to multiple dimensions so as to visualize view browsing data set characteristics; the deviation analysis unit is used for carrying out data error check according to the reference data set; the clustering prediction unit is used for predicting labels of a data set, wherein the labels of the data set comprise a classification task, a text generation task, a voice model probability task and a structured prediction task; the prompt learning unit is used for predicting the performance of the data set and the output score of the index so as to prompt a machine learning method of a subsequent task; the data interaction module comprises a data standardization unit, a data editor, a preprocessing task unit, a data enhancement unit and a result feedback unit, wherein the preprocessing task unit is used for performing natural language processing tasks through a preprocessing model so as to provide a data set for task execution; the data enhancement unit is used for complementing and augmenting data of the data set;
the maintenance method of the corpus database comprises the following steps:
performing fine-grained analysis on a basic data set written into a corpus database from different dimensions, and determining an application type of the basic data set, wherein the dimensions are preset, and the application type comprises a universality type and a specific task type;
based on the application type, carrying out aggregation operation on the basic data set to obtain a training task corresponding to the basic data set;
according to the training task, analyzing and training the basic data set by adopting a pre-training language model to obtain a target data set;
and when a data interaction instruction is received, performing data interaction by adopting each target data set.
2. The method for maintaining the corpus database according to claim 1, wherein after the fine-grained analysis is performed on the basic data set written into the corpus database from different dimensions to determine an application type of the basic data set, and before the aggregation operation is performed on the basic data based on the application type to obtain a training task corresponding to the basic data set, the method further comprises:
performing bias analysis on the base dataset, the bias analysis including random error bias calculation and system bias calculation; wherein,
the random error bias calculation is estimated based on a statistical method to ensure the complete implementation of a randomization principle in a sampling algorithm;
the system bias calculation determines the direction of adaptability of the data field, evaluates the magnitude of bias by adopting a preset label, and performs matching and screening by adopting information of a comparison group data set based on the magnitude of bias, wherein the comparison group data set is generated in advance according to a system bias calculation method.
3. The method according to claim 1 or 2, wherein the data interaction command comprises data preprocessing, data enhancement and data searching.
4. The method for maintaining the corpus database according to claim 3, wherein the data interaction command is data enhancement, and the performing data interaction using each of the target data sets when receiving the data interaction command comprises:
preprocessing the target data set to obtain a standard data set;
carrying out robustness test on the standard data set by adopting different data deviation disturbances of the reference data set;
if the standard data set passes the robustness test, performing data enhancement processing by adopting a preset data enhancement mode, wherein the preset data enhancement mode comprises the following steps: named entity recognition replacement, masking operations, and unsupervised consistency replacement.
5. The method for maintaining the corpus database according to claim 3, wherein the data interaction command is a data search, and the performing data interaction using each of the target data sets when receiving the data interaction command comprises:
receiving a query statement;
extracting specific terms from the query sentence in a natural language task processing mode, or supplementing the query sentence in a correction expansion mode by using Boolean matching to obtain the user intention;
performing matching queries on each target data set according to the specific terms or the user intention, taking successfully matched documents as target documents, obtaining documents of the same category as the target documents by clustering as reference documents, and taking the reference documents and the target documents as the search results;
or,
vectorizing both the specific terms and the target data set, or both the user intention and the target data set, constructing a matching model from their cross features, assigning simulation parameters, and scoring the degree of matching by a machine learning method to obtain retrieval and ranking results.
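(Illustrative note: claim 5 describes two retrieval paths, term-based Boolean matching and vectorized matching with a learned scoring model. The sketch below stands in a TF-IDF vectorizer with cosine scoring for the unspecified matching model and simulation parameters; these substitutions are assumptions made for illustration.)

```python
# Hedged sketch of the two claim-5 retrieval paths; TF-IDF plus cosine scoring is an
# assumed stand-in for the unspecified matching model and simulation parameters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

STOPWORDS = {"the", "a", "an", "of", "for"}

def extract_terms(query: str) -> list[str]:
    """Stand-in for the natural-language extraction of specific terms."""
    return [t for t in query.lower().split() if t not in STOPWORDS]

def boolean_match(terms: list[str], documents: list[str]) -> list[str]:
    """First path: Boolean term matching; successfully matched documents become target documents."""
    return [d for d in documents if all(t in d.lower() for t in terms)]

def ranked_search(query: str, documents: list[str], top_k: int = 5):
    """Second path: vectorize the query and documents, score the matching degree,
    and return retrieval-and-ranking results."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)[:top_k]

# usage
docs = ["corpus maintenance with a pre-trained language model", "data enhancement by masking"]
print(boolean_match(extract_terms("corpus maintenance"), docs))
print(ranked_search("corpus maintenance", docs))
```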
6. A maintenance device of a corpus database, characterized in that the maintenance device is applied to the corpus database, and the corpus database comprises a data analysis module and a data interaction module; the data analysis module comprises a basic expression unit, a data table, an embedded expression unit, a deviation analysis unit, a clustering prediction unit and a prompt learning unit, wherein the basic expression unit is used for analyzing basic information of the data; the embedded expression unit is used for hierarchically embedding the data through a model and projecting them into multiple dimensions, so that data set characteristics can be browsed in a visual view; the deviation analysis unit is used for checking the data for errors against a reference data set; the clustering prediction unit is used for predicting labels of a data set, wherein the labels of the data set comprise a classification task, a text generation task, a language model probability task and a structured prediction task; the prompt learning unit is used for predicting the performance of the data set and the output scores of its indicators, so as to suggest a machine learning method for subsequent tasks; the data interaction module comprises a data standardization unit, a data editor, a preprocessing task unit, a data enhancement unit and a result feedback unit, wherein the preprocessing task unit is used for performing natural language processing tasks through a preprocessing model so as to provide a data set for task execution; and the data enhancement unit is used for completing and augmenting the data of the data set;
the maintenance device of the corpus database comprises:
the data set analysis module is used for performing, from different preset dimensions, fine-grained analysis on the basic data set written into the corpus database, and determining the application type of the basic data set, wherein the application type comprises a general-purpose type and a task-specific type;
the task determination module is used for carrying out an aggregation operation on the basic data set based on the application type to obtain a training task corresponding to the basic data set;
the data training module is used for analyzing and training the basic data set with a pre-trained language model according to the training task to obtain a target data set;
and the data set interaction module is used for performing data interaction using each target data set when a data interaction instruction is received.
7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for maintaining a corpus database according to any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for maintaining the corpus database according to any one of claims 1 to 5.
CN202211443162.3A 2022-11-18 2022-11-18 Corpus database, corpus database maintenance method, apparatus, device and medium Active CN115495541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211443162.3A CN115495541B (en) 2022-11-18 2022-11-18 Corpus database, corpus database maintenance method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211443162.3A CN115495541B (en) 2022-11-18 2022-11-18 Corpus database, corpus database maintenance method, apparatus, device and medium

Publications (2)

Publication Number Publication Date
CN115495541A CN115495541A (en) 2022-12-20
CN115495541B true CN115495541B (en) 2023-04-07

Family

ID=85116186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211443162.3A Active CN115495541B (en) 2022-11-18 2022-11-18 Corpus database, corpus database maintenance method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN115495541B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114781611A (en) * 2022-04-21 2022-07-22 润联软件系统(深圳)有限公司 Natural language processing method, language model training method and related equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766540B (en) * 2018-12-10 2022-05-03 平安科技(深圳)有限公司 General text information extraction method and device, computer equipment and storage medium
KR102039138B1 (en) * 2019-04-02 2019-10-31 주식회사 루닛 Method for domain adaptation based on adversarial learning and apparatus thereof
CN111695674B (en) * 2020-05-14 2024-04-09 平安科技(深圳)有限公司 Federal learning method, federal learning device, federal learning computer device, and federal learning computer readable storage medium
CN113435582B (en) * 2021-06-30 2023-05-30 平安科技(深圳)有限公司 Text processing method and related equipment based on sentence vector pre-training model
CN115249043A (en) * 2022-07-26 2022-10-28 江苏保旺达软件技术有限公司 Data analysis method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115495541A (en) 2022-12-20

Similar Documents

Publication Publication Date Title
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN108304468B (en) Text classification method and text classification device
CN110232114A (en) Sentence intension recognizing method, device and computer readable storage medium
CN113822067A (en) Key information extraction method and device, computer equipment and storage medium
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN111814465A (en) Information extraction method and device based on machine learning, computer equipment and medium
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN112446209A (en) Method, equipment and device for setting intention label and storage medium
CN114881035A (en) Method, device, equipment and storage medium for augmenting training data
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN117807482B (en) Method, device, equipment and storage medium for classifying customs clearance notes
CN110309252B (en) Natural language processing method and device
CN113901836A (en) Word sense disambiguation method and device based on context semantics and related equipment
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
CN115169370B (en) Corpus data enhancement method and device, computer equipment and medium
CN114742058B (en) Named entity extraction method, named entity extraction device, computer equipment and storage medium
CN114385819B (en) Environment judicial domain ontology construction method and device and related equipment
CN115495541B (en) Corpus database, corpus database maintenance method, apparatus, device and medium
CN112364649B (en) Named entity identification method and device, computer equipment and storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN114091451A (en) Text classification method, device, equipment and storage medium
CN112287215A (en) Intelligent employment recommendation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant