CN116340548A - Data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116340548A
CN116340548A
Authority
CN
China
Prior art keywords
data
data set
word segmentation
entity
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310610114.7A
Other languages
Chinese (zh)
Inventor
孙基栩
司红星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siwei Chuangzhi Beijing Technology Development Co ltd
Original Assignee
Siwei Chuangzhi Beijing Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siwei Chuangzhi Beijing Technology Development Co ltd
Priority to CN202310610114.7A
Publication of CN116340548A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The invention discloses a data processing method and apparatus, an electronic device, and a storage medium, relating to the field of network security. The method comprises the following steps: constructing a domain knowledge graph based on the full domain knowledge of a target field; acquiring a training data set or fine-tuning data set required for training or fine-tuning a large language model of the target field; determining the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph; and screening each piece of data in the training data set or fine-tuning data set according to the data quality. By using the domain knowledge graph to screen the data in the training or fine-tuning data set, the technical scheme improves the quality of the data in the data set and prevents toxic data in the data set from contaminating the large language model.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of network security, and in particular to a data processing method, apparatus, electronic device, and storage medium.
Background
A large language model is a language model with hundreds of billions (or more) of parameters. When training or fine-tuning such a model, the higher the quality of the data set, the better the training or fine-tuning result; the network security field in particular places high demands on data quality. However, current data sets are mainly built manually or collected from web pages, so their quality is uneven and toxic data can easily contaminate the model.
Disclosure of Invention
The invention provides a data processing method, a data processing device, electronic equipment and a storage medium.
According to an aspect of the present invention, there is provided a data processing method including:
constructing a domain knowledge graph based on the full domain knowledge of a target field;
acquiring a training data set or fine-tuning data set required for training or fine-tuning a large language model of the target field;
determining the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph;
and screening each piece of data in the training data set or fine-tuning data set according to the data quality.
In an alternative implementation, determining the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph includes:
performing word segmentation on each piece of data in the training data set or fine-tuning data set to obtain a word segmentation list corresponding to each piece of data, wherein the word segmentation list comprises at least one word segment;
for any word segmentation list, determining the entity matching degree of the word segmentation list based on the domain knowledge graph, wherein the entity matching degree measures the proportion of word segments in the list that are target word segments belonging to the target field;
and determining the data quality of each piece of data in the training data set or fine-tuning data set according to the entity matching degree.
In an optional implementation, determining, for any word segmentation list, the entity matching degree of the word segmentation list based on the domain knowledge graph includes:
for each word segment in the word segmentation list, determining the similarity between the word segment and each entity name in the domain knowledge graph;
selecting the target word segments whose similarity is greater than a preset similarity threshold;
and taking the ratio of the number of target word segments to the total number of word segments in the word segmentation list as the entity matching degree of the word segmentation list.
In an alternative implementation, determining the data quality of each piece of data in the training data set or fine-tuning data set according to the entity matching degree includes:
for any piece of data in the training data set or fine-tuning data set, if the entity matching degree of its word segmentation list is smaller than a first threshold, determining that the data is low-quality data;
if the entity matching degree of its word segmentation list is greater than or equal to a second threshold, determining that the data is high-quality data;
and if the entity matching degree of its word segmentation list is greater than or equal to the first threshold and smaller than the second threshold, marking the data as data to be divided.
In an alternative implementation, the method further includes:
for the entities in the domain knowledge graph, performing association analysis using a community division technique based on a modularity algorithm to generate at least one entity community, wherein each entity community comprises at least one entity;
for any piece of data to be divided, determining the target word segments in its word segmentation list that belong to the target field, and determining the target entity communities in which those target word segments are located;
determining the proportion of the number of target entity communities to the total number of entity communities included in the domain knowledge graph;
if the proportion is smaller than a third threshold, determining that the data to be divided is high-quality data;
and if the proportion is greater than or equal to the third threshold, determining that the data to be divided is discrete data of the target field.
In an alternative implementation, screening each piece of data in the training data set or fine-tuning data set according to the data quality includes:
preserving the high-quality data in the training data set or fine-tuning data set, and deleting the low-quality data or discrete data therein.
In an alternative implementation, the method further includes:
establishing a hidden association relationship between at least two entities in the entity community that do not yet have an association relationship.
According to another aspect of the present invention, there is provided a data processing apparatus comprising:
a graph construction module, configured to construct a domain knowledge graph based on the full domain knowledge of a target field;
a data set acquisition module, configured to acquire a training data set or fine-tuning data set required for training or fine-tuning a large language model of the target field;
a quality analysis module, configured to determine the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph;
and a data screening module, configured to screen each piece of data in the training data set or fine-tuning data set according to the data quality.
In an alternative implementation, the quality analysis module includes:
a word segmentation processing unit, configured to perform word segmentation on each piece of data in the training data set or fine-tuning data set to obtain a word segmentation list corresponding to each piece of data, wherein the word segmentation list comprises at least one word segment;
an entity matching degree determining unit, configured to determine, for any word segmentation list, the entity matching degree of the word segmentation list based on the domain knowledge graph, wherein the entity matching degree measures the proportion of word segments in the list that are target word segments belonging to the target field;
and a quality analysis unit, configured to determine the data quality of each piece of data in the training data set or fine-tuning data set according to the entity matching degree.
In an alternative implementation, the entity matching degree determining unit is further configured to:
for each word segment in the word segmentation list, determine the similarity between the word segment and each entity name in the domain knowledge graph;
select the target word segments whose similarity is greater than a preset similarity threshold;
and take the ratio of the number of target word segments to the total number of word segments in the word segmentation list as the entity matching degree of the word segmentation list.
In an alternative implementation, the quality analysis unit is further configured to:
for any piece of data in the training data set or fine-tuning data set, if the entity matching degree of its word segmentation list is smaller than a first threshold, determine that the data is low-quality data;
if the entity matching degree of its word segmentation list is greater than or equal to a second threshold, determine that the data is high-quality data;
and if the entity matching degree of its word segmentation list is greater than or equal to the first threshold and smaller than the second threshold, mark the data as data to be divided.
In an alternative implementation, the apparatus further includes:
a community division module, configured to perform association analysis on the entities in the domain knowledge graph using a community division technique based on a modularity algorithm to generate at least one entity community, wherein each entity community comprises at least one entity;
a word segment attribution determining module, configured to determine, for any piece of data to be divided, the target word segments in its word segmentation list that belong to the target field, and determine the target entity communities in which those target word segments are located;
a proportion determining module, configured to determine the proportion of the number of target entity communities to the total number of entity communities included in the domain knowledge graph;
a first judging module, configured to determine that the data to be divided is high-quality data if the proportion is smaller than a third threshold;
and a second judging module, configured to determine that the data to be divided is discrete data of the target field if the proportion is greater than or equal to the third threshold.
In an alternative implementation, the data screening module is further configured to:
preserve the high-quality data in the training data set or fine-tuning data set, and delete the low-quality data or discrete data therein.
In an alternative implementation, the apparatus further includes:
a hidden association relationship construction module, configured to establish a hidden association relationship between at least two entities in the entity community that do not yet have an association relationship.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the data processing method according to the embodiment of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a data processing method according to an embodiment of the present invention.
According to the technical scheme provided by the embodiments of the present invention, the data in the training data set or fine-tuning data set is screened using the domain knowledge graph, which improves the quality of the data in the data set and prevents toxic data in the data set from contaminating the large language model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below depict only some embodiments of the present invention; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a data processing method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a data processing method according to a second embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to a third embodiment of the present invention;
FIG. 4 is a schematic diagram of a data processing apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device implementing a data processing method according to an embodiment of the present invention.
Detailed Description
So that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art without inventive effort shall fall within the scope of the present invention.
The terms involved in the present invention are explained first.
Knowledge graph: a database of knowledge stores triples (entities, concepts and attributes), each representing a fact. The knowledge graph can also be regarded as a graph, and the triples can be nodes in the knowledge graph.
Concept: refers to a collection of entities with the same characteristics, such as books, computers, etc.
Entity: refers to something that is distinguishable and independently present. Such as a person, a city, a plant, a commodity, etc. The entities are the most basic elements in the knowledge graph, and different relationships exist among different entities.
Attributes: features used to distinguish concepts; different concepts have different attributes. Different attribute value types correspond to different types of edges. If the attribute value is itself a concept or entity, the attribute describes a relationship between two entities and is called an object attribute; if the attribute value is a literal value, it is called a data attribute.
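As an illustration of the triple structure defined above, a knowledge graph can be held as a set of (head, relation, tail) tuples. The entities and relations below are invented network-security examples, not data from the patent:

```python
# A minimal sketch of a triple store: each tuple is one fact.
# Entity and relation names here are illustrative assumptions.
graph = {
    ("SQL injection", "is_a", "web attack"),
    ("SQL injection", "mitigated_by", "input validation"),
    ("firewall", "is_a", "security device"),
}

def facts_about(graph, entity):
    """Return every triple in which the entity appears as head or tail."""
    return {t for t in graph if entity in (t[0], t[2])}
```

For example, `facts_about(graph, "SQL injection")` returns the two facts whose head is "SQL injection".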
Example 1
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention. The method may be performed by a data processing apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device, for example integrated in a server. The method is applicable to scenarios in which the data in a data set is optimized using a knowledge graph.
As shown in fig. 1, the data processing method includes:
S101, constructing a domain knowledge graph based on full domain knowledge of the target domain.
In this embodiment, the target domain may be a network security domain, and the full domain knowledge of the target domain may be knowledge related to network security. The domain knowledge graph refers to a knowledge graph constructed based on knowledge data of a specific domain, and in this embodiment, the domain knowledge graph may be a knowledge graph of a network security domain.
The process of constructing the domain knowledge graph from the full domain knowledge of the target field is as follows. First, related vocabulary of the target field is acquired, optionally in response to vocabulary input by a user, or directly from an existing database. Next, the initial source data for the related vocabulary is determined; optionally, the initial source data may be encyclopedia page information or raw web page data. Finally, based on the initial source data, the domain knowledge graph of the target field is obtained through knowledge modeling, knowledge extraction, knowledge fusion, knowledge storage, and other processes.
It should be noted that knowledge modeling, that is, defining the knowledge model, mainly starts from the actual application scenario of the target field and the specific problem to be solved, and defines the hierarchy of concepts in the target field and the relationship types between them. Knowledge extraction optionally includes entity extraction, relationship extraction, and attribute extraction from the acquired data. Knowledge fusion mainly comprises concept fusion, entity fusion, and relationship fusion: concept fusion merges concept-level data, entity fusion merges entity-level data, and relationship fusion merges the relationships between concepts, between concepts and entities, and between entities. Knowledge storage mainly means persisting the result in a preset storage mode, such as a graph database.
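The construction process above can be sketched as a pipeline skeleton. Every stage here is a stub under invented assumptions (a "head|relation|tail" line format for source documents, lower-casing as a stand-in for fusion); real knowledge extraction and fusion are far more involved:

```python
# Skeleton of the construction pipeline: acquire source data for seed
# vocabulary, extract triples, fuse them, and store the result.
def extract_triples(doc: str) -> set:
    # Stub extraction: treat each "head|relation|tail" line as one fact.
    return {tuple(line.split("|")) for line in doc.splitlines()
            if line.count("|") == 2}

def fuse(triples: set) -> set:
    # Stub fusion: merge entities that differ only in letter case.
    return {(h.lower(), r.lower(), t.lower()) for h, r, t in triples}

def store(triples: set, db: list) -> None:
    # Stub storage: a real system would write to a graph database.
    db.append(triples)

def build_domain_kg(seed_vocabulary, fetch_source, db):
    triples = set()
    for word in seed_vocabulary:
        triples |= extract_triples(fetch_source(word))
    triples = fuse(triples)
    store(triples, db)
    return triples
```

`fetch_source` stands for whatever retrieval mechanism supplies encyclopedia or web-page text for a seed word; it is a placeholder, not an API from the patent.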
S102, acquiring a training data set or a fine tuning data set required by a large language model for training or fine tuning the target field.
In this embodiment, the training data set or fine-tuning data set required for training or fine-tuning the large language model of the target field is optionally constructed in advance, manually or automatically. Because toxic data may be present in this data set, training or fine-tuning the large language model with it directly may contaminate the model. Therefore, before training or fine-tuning, the pre-constructed data set is acquired and then, according to steps S103-S104, its data is screened to remove data unrelated to the target field.
S103, determining the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph.
In this embodiment, data quality measures how well a piece of data in the data set matches the domain knowledge graph of the target field: the higher the matching degree of a piece of data with the graph, the higher the probability that it belongs to the target field, and the data is high-quality data; conversely, the lower the matching degree, the lower that probability, and the data is low-quality data, that is, potentially toxic data.
S104, screening each piece of data in the training data set or fine-tuning data set according to the data quality.
Optionally, high-quality data in the training data set or fine-tuning data set is retained and low-quality data is deleted. This optimizes the data in the data set, improves its overall quality, and avoids contaminating the large language model with toxic data.
Example 2
Fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention. Referring to fig. 2, the method flow includes the steps of:
S201, constructing a domain knowledge graph based on the full domain knowledge of the target field.
In this embodiment, the target domain may be a network security domain, and the full domain knowledge of the target domain may be knowledge related to network security. The domain knowledge graph refers to a knowledge graph constructed based on knowledge data of a specific domain, and in this embodiment, the domain knowledge graph may be a knowledge graph of a network security domain.
S202, acquiring a training data set or a fine tuning data set required by a large language model for training or fine tuning the target field.
In this embodiment, the specific process of steps S201 to S202 may be referred to the description of the above embodiment, and will not be repeated here.
Based on the domain knowledge graph of the target field obtained through S201, and on the to-be-optimized training data set or fine-tuning data set acquired through S202, the process of determining the data quality of each piece of data in the data set based on the domain knowledge graph comprises steps S203-S205.
S203, performing word segmentation on each piece of data in the training data set or fine-tuning data set to obtain a word segmentation list corresponding to each piece of data.
In this embodiment, each piece of data in the training data set or fine-tuning data set (optionally a sequence of Chinese characters) is segmented using a Chinese word segmentation method, where Chinese word segmentation refers to cutting a sequence of Chinese characters into individual words. In an alternative embodiment, the jieba word segmentation method may be used: jieba builds a prefix dictionary from a statistical dictionary; the prefix dictionary is then used to scan the input sentence for all possible cuts, from which a directed acyclic graph of segmentation positions is constructed; finally, the maximum-probability path through this graph is computed by dynamic programming to obtain the final segmentation. After segmentation, each piece of data in the training data set or fine-tuning data set corresponds to one word segmentation list, and each list comprises at least one word segment.
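The prefix-dictionary / DAG / dynamic-programming procedure just described can be illustrated in miniature. This sketch works on space-separated English tokens rather than Chinese characters, and its dictionary and frequencies are invented; it shows the same idea as jieba but is not the jieba implementation:

```python
import math

# Toy statistical dictionary: phrase -> frequency (invented values).
FREQ = {"network": 50, "security": 40, "network security": 30,
        "data": 60, "set": 20, "data set": 25}
TOTAL = sum(FREQ.values())

def segment(tokens):
    n = len(tokens)
    # DAG: for each start index i, the end indices j of dictionary matches;
    # fall back to a single-token cut when nothing matches.
    dag = {i: [j for j in range(i + 1, n + 1)
               if " ".join(tokens[i:j]) in FREQ] or [i + 1]
           for i in range(n)}
    # Dynamic programming over log-probabilities, right to left:
    # route[i] = (best log-prob of segmenting tokens[i:], best cut point).
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(" ".join(tokens[i:j]), 1) / TOTAL)
             + route[j][0], j)
            for j in dag[i])
    # Walk the maximum-probability path to emit the segmentation.
    out, i = [], 0
    while i < n:
        i_next = route[i][1]
        out.append(" ".join(tokens[i:i_next]))
        i = i_next
    return out
```

With the toy dictionary, `segment("network security data set".split())` prefers the two multi-word entries over four single-word cuts.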
S204, for any word segmentation list, determining the entity matching degree of the word segmentation list based on the domain knowledge graph.
The entity matching degree measures the proportion of word segments in the word segmentation list that are target word segments belonging to the target field. In an optional implementation, determining the entity matching degree of any word segmentation list based on the domain knowledge graph proceeds as follows. First, determine the similarity between each word segment in the list and each entity name in the domain knowledge graph; optionally, compute the Euclidean distance between the word vector of the word segment and the word vector of the entity name and derive the similarity from that distance, or directly compute the cosine similarity between the two word vectors. Next, select the target word segments whose similarity is greater than a preset similarity threshold, optionally 0.9; as long as the similarity between a word segment and some entity name in the graph exceeds 0.9, the word segment is considered related to that entity name, that is, it belongs to the target field. Finally, take the ratio of the number of target word segments to the total number of word segments in the list as the entity matching degree of the list. For example, if a word segmentation list includes 5 word segments, of which 4 have a similarity greater than the preset threshold to some entity name in the domain knowledge graph, the entity matching degree of the list is 0.8.
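A minimal sketch of this computation, assuming word vectors are already available; the 2-d vectors used in the test are toys, not real embeddings, and 0.9 is the threshold named in the embodiment:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def entity_matching_degree(segment_vecs, entity_vecs, threshold=0.9):
    """Fraction of word segments whose best cosine similarity to any
    entity-name vector exceeds the threshold."""
    if not segment_vecs:
        return 0.0
    hits = sum(1 for v in segment_vecs
               if max(cosine(v, e) for e in entity_vecs) > threshold)
    return hits / len(segment_vecs)
```

With toy vectors, a list of 5 word segments of which 4 match reproduces the 0.8 example in the text.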
S205, determining the data quality of each piece of data in the training data set or fine-tuning data set according to the entity matching degree.
Optionally, the entity matching degree of the word segmentation list corresponding to each piece of data is used directly as the value of that data's quality, so whether the data is high-quality or low-quality can be determined from the entity matching degree of its word segmentation list.
In an alternative embodiment, for any piece of data in the training data set or fine-tuning data set, if the entity matching degree of its word segmentation list is smaller than a first threshold (for example, 0.3), the data is determined to be low-quality data; if the entity matching degree is greater than or equal to a second threshold (for example, 0.6), the data is determined to be high-quality data. If the entity matching degree is greater than or equal to the first threshold and smaller than the second threshold, the data is marked as data to be divided, that is, it must be further determined whether the data is high-quality data or discrete data of the target field; the specific determination process is described in a subsequent embodiment.
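The two-threshold rule can be written down directly; 0.3 and 0.6 are the example values given in the text, not fixed by the patent:

```python
def classify_by_matching_degree(degree, first_threshold=0.3,
                                second_threshold=0.6):
    """Map an entity matching degree to a quality label."""
    if degree < first_threshold:
        return "low-quality"
    if degree >= second_threshold:
        return "high-quality"
    # Between the thresholds: quality is decided later by the
    # community-based dispersion check of Example 3.
    return "to-be-divided"
```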
S206, screening each data in the training data set or the fine adjustment data set according to the data quality.
Optionally, high-quality data in the training data set or fine-tuning data set is preserved and low-quality data is deleted.
In this embodiment, by computing the similarity between each word segment in the word segmentation list and the entity names, the entity matching degree between the list and the knowledge graph can be determined accurately, which in turn ensures the accuracy of using the entity matching degree to determine data quality. Finally, deleting the low-quality data based on data quality avoids contaminating the large language model with low-quality, toxic data in the data set.
Example 3
Fig. 3 is a flow chart of a data processing method according to a third embodiment of the present invention. Referring to fig. 3, the method logic includes the following:
s301, constructing a domain knowledge graph based on full domain knowledge of the target domain.
In this embodiment, the target domain may be a network security domain, and the full domain knowledge of the target domain may be knowledge related to network security. The domain knowledge graph refers to a knowledge graph constructed based on knowledge data of a specific domain, and in this embodiment, the domain knowledge graph may be a knowledge graph of a network security domain.
S302, for the entities in the domain knowledge graph, performing association analysis using a community division technique based on a modularity algorithm to generate at least one entity community.
Optionally, when dividing entity communities, each entity may first be treated as its own entity community, and the communities are then merged step by step using the modularity-based community division technique until the modularity no longer increases. In this embodiment, each resulting entity community includes at least one entity. It should be noted that each entity can belong to only one entity community; that is, no entity belongs to multiple entity communities at the same time.
Further, for at least two entities in an entity community that have no existing association relationship, a hidden association relationship between them is established, so that subsequent data analysis can make use of these hidden associations between entities.
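A toy version of the merging procedure above: every entity starts as its own community, and the pair whose merge gives the largest modularity gain is merged, until no merge increases modularity. This brute-force sketch recomputes Newman modularity from scratch at every step; a production system would use an efficient method such as Louvain:

```python
from itertools import combinations

def modularity(edges, communities):
    """Newman modularity of a partition of an undirected graph."""
    m = len(edges)
    node2c = {n: i for i, c in enumerate(communities) for n in c}
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for i, c in enumerate(communities):
        internal = sum(1 for u, v in edges
                       if node2c[u] == i and node2c[v] == i)
        total_degree = sum(degree.get(n, 0) for n in c)
        q += internal / m - (total_degree / (2 * m)) ** 2
    return q

def greedy_communities(nodes, edges):
    communities = [{n} for n in nodes]
    while True:
        base = modularity(edges, communities)
        best = None
        for a, b in combinations(range(len(communities)), 2):
            trial = [c for i, c in enumerate(communities) if i not in (a, b)]
            trial.append(communities[a] | communities[b])
            gain = modularity(edges, trial) - base
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, trial)
        if best is None:        # modularity no longer increases: stop
            return communities
        communities = best[1]
```

On a graph made of two triangles joined by a single bridge edge, the procedure recovers the two triangles as communities.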
S303, acquiring a training data set or a fine tuning data set required by a large language model for training or fine tuning the target field.
S304, performing word segmentation on each piece of data in the training data set or fine-tuning data set to obtain a word segmentation list corresponding to each piece of data.
S305, for any word segmentation list, determining the entity matching degree of the word segmentation list based on the domain knowledge graph.
Optionally, for each word segment in the word segmentation list, determine the similarity between the word segment and each entity name in the domain knowledge graph; select the target word segments whose similarity is greater than a preset similarity threshold; and take the ratio of the number of target word segments to the total number of word segments in the list as the entity matching degree of the list.
S306, determining the data quality of each piece of data in the training data set or the fine-tuning data set according to the entity matching degree.
In an alternative embodiment, for any data in the training data set or the fine-tuning data set, if the entity matching degree of the word segmentation list of the data is less than a first threshold (for example, 0.3), the data is determined to be low-quality data; if the entity matching degree is greater than or equal to a second threshold (for example, 0.6), the data is determined to be high-quality data.
It should be noted that if the entity matching degree of the word segmentation list of the data is greater than or equal to the first threshold and less than the second threshold, the data is marked as data to be divided; that is, it must be further determined whether the data is high-quality data or discrete data of the target field. For details, see steps S307-S309.
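The three-way split of S306 can be sketched as below; the threshold values 0.3 and 0.6 are the example values from this embodiment:

```python
def classify_by_matching_degree(degree, first_threshold=0.3, second_threshold=0.6):
    """S306: map an entity matching degree to a data-quality label.
    Records between the two thresholds need the community check of S307-S309."""
    if degree < first_threshold:
        return "low-quality"
    if degree >= second_threshold:
        return "high-quality"
    return "to-be-divided"
```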
S307, determining the target word segments belonging to the target field in the word segmentation list corresponding to any data to be divided, and determining the target entity communities in which those target word segments are located.
S308, determining the ratio of the number of target entity communities to the total number of entity communities included in the domain knowledge graph.
S309, if the ratio is smaller than a third threshold, determining that the data to be divided is high-quality data; and if the ratio is greater than or equal to the third threshold, determining that the data to be divided is discrete data of the target field.
In this embodiment, the larger the ratio (for example, when the target word segments belong to many different entity communities), the more dispersed the target word segments in the word segmentation list are, the weaker the relevance among them, and the greater the probability that the corresponding data to be divided is discrete data of the target field. Conversely, the smaller the ratio (for example, when multiple target word segments belong to the same entity community), the more concentrated the target word segments in the word segmentation list are, the stronger the relevance among them, and the greater the probability that the corresponding data to be divided is high-quality data.
In an alternative embodiment, a third threshold (for example, 0.1) may be preset; if the ratio is smaller than the third threshold, the data to be divided is determined to be high-quality data, and if the ratio is greater than or equal to the third threshold, the data to be divided is determined to be discrete data of the target field.
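The decision of S307-S309 can be sketched as follows; the community ids and the total community count in the example are hypothetical, and 0.1 is the example third threshold above:

```python
def community_ratio_decision(target_communities, total_communities, third_threshold=0.1):
    """S308-S309: the ratio of distinct target entity communities to the total
    number of entity communities decides high-quality vs. discrete data."""
    ratio = len(set(target_communities)) / total_communities
    return "high-quality" if ratio < third_threshold else "discrete"
```

With 30 entity communities in the graph, three target word segments concentrated in two communities give a ratio of 2/30 (high-quality), while four segments spread over four communities give 4/30 (discrete).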
S310, screening each piece of data in the training data set or the fine-tuning data set according to the data quality.
Optionally, the high-quality data in the training data set or the fine-tuning data set is retained, and the low-quality data or discrete data in the training data set or the fine-tuning data set is deleted.
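A minimal sketch of the screening step S310, assuming each record has already been labelled by the preceding steps (the record contents and pairing structure are illustrative):

```python
def screen_dataset(labelled_records):
    """S310: keep high-quality records; delete low-quality and discrete ones."""
    return [data for data, label in labelled_records if label == "high-quality"]

kept = screen_dataset([
    ("attack log analysis text", "high-quality"),
    ("unrelated chit-chat", "low-quality"),
    ("loosely related mixture", "discrete"),
])
```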
In this embodiment, the data quality of the data to be divided is further confirmed through the ratio of the number of target entity communities containing its target word segments to the total number of entity communities included in the domain knowledge graph, and the discrete data so identified is deleted together with the low-quality data, which further improves the overall quality of the data in the training data set or the fine-tuning data set.
Example IV
Fig. 4 is a schematic structural diagram of a data processing apparatus according to a fourth embodiment of the present invention, where the present embodiment is applicable to a scenario in which data in a data set is optimized by using a knowledge graph. As shown in fig. 4, the apparatus includes:
The map construction module 401 is configured to construct a domain knowledge map based on full-scale domain knowledge of the target domain;
a data set acquisition module 402, configured to acquire a training data set or a fine-tuning data set required for training or fine-tuning a large language model in the target field;
a quality analysis module 403, configured to determine a data quality of each data in the training data set or the fine tuning data set based on the domain knowledge graph;
the data filtering module 404 is configured to filter each data in the training data set or the fine tuning data set according to the data quality.
In an alternative implementation, the quality analysis module includes:
the word segmentation processing unit is used for carrying out word segmentation processing on each piece of data in the training data set or the fine tuning data set to obtain a word segmentation list corresponding to each piece of data; wherein the word segmentation list comprises at least one word segment;
the entity matching degree determining unit, configured to determine, for any word segmentation list, the entity matching degree of the word segmentation list based on the domain knowledge graph; wherein the entity matching degree is used for measuring the proportion of target word segments belonging to the target field in the word segmentation list;
and the quality analysis unit is used for determining the data quality of each data in the training data set or the fine tuning data set according to the entity matching degree.
In an alternative implementation, the entity matching degree determining unit is further configured to:
aiming at each word in the word segmentation list, determining the similarity between the word and each entity name in the domain knowledge graph;
selecting target word segmentation with similarity larger than a preset similarity threshold value;
and taking the ratio of the number of target word segments to the total number of word segments in the word segmentation list as the entity matching degree of the word segmentation list.
In an alternative implementation, the quality analysis unit is further configured to:
for any data in the training data set or the fine tuning data set, if the entity matching degree of the word segmentation list of the data is smaller than a first threshold value, determining the data to be low-quality data;
if the entity matching degree of the word segmentation list of the data is greater than or equal to a second threshold value, determining that the data is high-quality data;
and if the entity matching degree of the word segmentation list of the data is greater than or equal to the first threshold and less than the second threshold, marking the data as data to be divided.
In an alternative implementation, the apparatus further includes:
the community dividing module, configured to perform, for the entities in the domain knowledge graph, association analysis using a community division technique based on a modularity algorithm to generate at least one entity community; wherein each entity community comprises at least one entity;
The word segmentation attribution determining module is used for determining target word segmentation belonging to the target field in a word segmentation list corresponding to any data to be divided, and determining a target entity community where the target word segmentation is located;
the ratio determining module, configured to determine the ratio of the number of target entity communities to the total number of entity communities included in the domain knowledge graph;
the first judging module, configured to determine that the data to be divided is high-quality data if the ratio is smaller than a third threshold;
and the second judging module, configured to determine that the data to be divided is discrete data of the target field if the ratio is greater than or equal to the third threshold.
In an alternative implementation, the data screening module is further configured to:
retain the high-quality data in the training data set or the fine-tuning data set, and delete the low-quality data or discrete data in the training data set or the fine-tuning data set.
In an alternative implementation, the apparatus further includes:
the hidden association relationship construction module, configured to establish, for at least two entities in the entity community that have no association relationship, a hidden association relationship between the at least two entities.
The data processing apparatus provided by the embodiment of the present invention can execute the data processing method provided by any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the executed method.
Example V
Fig. 5 shows a schematic structural diagram of an electronic device 10 that may be used to implement an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, for example, the data processing method.
In some embodiments, the data processing method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. One or more steps of the data processing method described above may be performed when the computer program is loaded into the RAM 13 and executed by the processor 11. Alternatively, in other embodiments, the processor 11 may be configured to perform the data processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that steps may be reordered, added, or deleted in the various forms of flows shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, which is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of data processing, comprising:
constructing a domain knowledge graph based on full domain knowledge of the target domain;
acquiring a training data set or a fine tuning data set required for training or fine tuning a large language model of the target field;
determining the data quality of each data in the training data set or the fine tuning data set based on the domain knowledge graph;
and screening each data in the training data set or the fine tuning data set according to the data quality.
2. The method of claim 1, wherein determining the data quality of each data in the training dataset or fine tuning dataset based on the domain knowledge-graph comprises:
performing word segmentation processing on each data in the training data set or the fine tuning data set to obtain a word segmentation list corresponding to each piece of data; wherein the word segmentation list comprises at least one word segment;
determining the entity matching degree of any word segmentation list based on the domain knowledge graph; the entity matching degree is used for measuring the proportion of target word segments belonging to the target field in the word segmentation list;
and determining the data quality of each data in the training data set or the fine tuning data set according to the entity matching degree.
3. The method of claim 2, wherein for any word segmentation list, determining the entity matching degree of the word segmentation list based on the domain knowledge graph comprises:
for each word in the word segmentation list, determining the similarity between the word and each entity name in the domain knowledge graph;
selecting target word segmentation with similarity larger than a preset similarity threshold value;
And taking the ratio of the number of the target word segments to the total number of word segments included in the word segmentation list as the entity matching degree of the word segmentation list.
4. The method of claim 2, wherein determining the data quality of each data in the training dataset or fine-tuning dataset based on the entity matching degree comprises:
for any data in the training data set or the fine tuning data set, if the entity matching degree of the word segmentation list of the data is smaller than a first threshold value, determining the data to be low-quality data;
if the entity matching degree of the word segmentation list of the data is greater than or equal to a second threshold value, determining that the data is high-quality data;
and if the entity matching degree of the word segmentation list of the data is greater than or equal to the first threshold and less than the second threshold, marking the data as data to be divided.
5. The method according to claim 4, wherein the method further comprises:
for the entities in the domain knowledge graph, performing association analysis using a community division technique based on a modularity algorithm to generate at least one entity community; wherein each entity community comprises at least one entity;
Determining target word segments belonging to the target field in a word segmentation list corresponding to any data to be divided, and determining a target entity community in which the target word segments are located;
determining the ratio of the number of the target entity communities to the total number of entity communities included in the domain knowledge graph;
if the ratio is smaller than a third threshold, determining that the data to be divided is high-quality data;
and if the ratio is greater than or equal to the third threshold, determining that the data to be divided is discrete data of the target field.
6. The method of claim 5, wherein filtering each data in the training dataset or fine tuning dataset according to the data quality comprises:
high quality data in the training data set or the fine tuning data set is preserved and low quality data or discrete data in the training data set or the fine tuning data set is deleted.
7. The method of claim 5, wherein the method further comprises:
and establishing a hidden association relationship between at least two entities in the entity community that have no association relationship.
8. A data processing apparatus, comprising:
the map construction module is used for constructing a domain knowledge map based on full domain knowledge of the target domain;
the data set acquisition module is used for acquiring a training data set or a fine tuning data set required by a large language model for training or fine tuning the target field;
the quality analysis module is used for determining the data quality of each data in the training data set or the fine tuning data set based on the domain knowledge graph;
and the data screening module is used for screening each data in the training data set or the fine adjustment data set according to the data quality.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of any one of claims 1-7.
CN202310610114.7A 2023-05-29 2023-05-29 Data processing method and device, electronic equipment and storage medium Pending CN116340548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310610114.7A CN116340548A (en) 2023-05-29 2023-05-29 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310610114.7A CN116340548A (en) 2023-05-29 2023-05-29 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116340548A true CN116340548A (en) 2023-06-27

Family

ID=86884454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310610114.7A Pending CN116340548A (en) 2023-05-29 2023-05-29 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116340548A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861928A (en) * 2023-07-07 2023-10-10 北京中关村科金技术有限公司 Method, device, equipment and medium for generating instruction fine tuning data
CN116915459A (en) * 2023-07-13 2023-10-20 上海戎磐网络科技有限公司 Network threat analysis method based on large language model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682172A (en) * 2016-12-28 2017-05-17 江苏大学 Keyword-based document research hotspot recommending method
CN110334272A (en) * 2019-05-29 2019-10-15 平安科技(深圳)有限公司 The intelligent answer method, apparatus and computer storage medium of knowledge based map
CN111488462A (en) * 2020-04-02 2020-08-04 中国移动通信集团江苏有限公司 Recommendation method, device, equipment and medium based on knowledge graph
US20220083874A1 (en) * 2020-11-24 2022-03-17 Beijing Baidu Netcom Science Technology Co., Ltd. Method and device for training search model, method for searching for target object, and storage medium
CN115422179A (en) * 2022-09-14 2022-12-02 冯秦海 AI training processing method based on big data cleaning and artificial intelligence training system
CN115862848A (en) * 2023-02-15 2023-03-28 之江实验室 Disease prediction system and device based on clinical data screening and medical knowledge map


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗森林 (LUO Senlin) et al.: "大数据分析理论与技术" [Big Data Analysis Theory and Technology], Beijing Institute of Technology Press, pages 182-184 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861928A (en) * 2023-07-07 2023-10-10 北京中关村科金技术有限公司 Method, device, equipment and medium for generating instruction fine tuning data
CN116861928B (en) * 2023-07-07 2023-11-17 北京中关村科金技术有限公司 Method, device, equipment and medium for generating instruction fine tuning data
CN116915459A (en) * 2023-07-13 2023-10-20 上海戎磐网络科技有限公司 Network threat analysis method based on large language model
CN116915459B (en) * 2023-07-13 2024-03-08 上海戎磐网络科技有限公司 Network threat analysis method based on large language model

Similar Documents

Publication Publication Date Title
CN116340548A (en) Data processing method and device, electronic equipment and storage medium
CN112559885B (en) Training model determining method and device for map interest points and electronic equipment
US20230018489A1 (en) Method for acquiring structured question-answering model, question-answering method and corresponding apparatus
CN112559631B (en) Data processing method and device of distributed graph database and electronic equipment
CN113836925A (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN114549058A (en) Address selection method and device, electronic equipment and readable storage medium
CN114244795A (en) Information pushing method, device, equipment and medium
CN113904943A (en) Account detection method and device, electronic equipment and storage medium
CN115186738B (en) Model training method, device and storage medium
CN116484215A (en) Diffusion model-based text generation model training and text generation method and device
CN114417974B (en) Model training method, information processing device, electronic equipment and medium
CN113032251B (en) Method, device and storage medium for determining service quality of application program
CN115248890B (en) User interest portrait generation method and device, electronic equipment and storage medium
US20220188292A1 (en) Data processing method, apparatus, electronic device and readable storage medium
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN114611625A (en) Language model training method, language model training device, language model data processing method, language model data processing device, language model data processing equipment, language model data processing medium and language model data processing product
CN113807390A (en) Model training method and device, electronic equipment and storage medium
CN116628167B (en) Response determination method and device, electronic equipment and storage medium
CN113822057B (en) Location information determination method, location information determination device, electronic device, and storage medium
CN113836244B (en) Sample acquisition method, model training method, relation prediction method and device
CN117033801B (en) Service recommendation method, device, equipment and storage medium
CN117272970B (en) Document generation method, device, equipment and storage medium
CN116167978A (en) Model updating method and device, electronic equipment and storage medium
CN114863207A (en) Pre-training method and device of target detection model and electronic equipment
CN117610508A (en) Text processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination