CN116340548A - Data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116340548A
CN116340548A
Authority
CN
China
Prior art keywords
data
data set
word segmentation
entity
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310610114.7A
Other languages
Chinese (zh)
Inventor
孙基栩
司红星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siwei Chuangzhi Beijing Technology Development Co ltd
Original Assignee
Siwei Chuangzhi Beijing Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siwei Chuangzhi Beijing Technology Development Co ltd
Priority to CN202310610114.7A
Publication of CN116340548A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The invention discloses a data processing method and apparatus, an electronic device, and a storage medium, relating to the field of network security. The method comprises the following steps: constructing a domain knowledge graph based on the full domain knowledge of a target field; acquiring a training data set or fine-tuning data set required for training or fine-tuning a large language model of the target field; determining the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph; and screening each piece of data in the training data set or fine-tuning data set according to the data quality. By using the domain knowledge graph to screen the data in the training or fine-tuning data set, the technical scheme improves the quality of the data in the data set and prevents toxic data in the data set from contaminating the large language model.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of network security, and in particular to a data processing method, apparatus, electronic device, and storage medium.
Background
A large language model is a language model with hundreds of billions (or more) of parameters. When training or fine-tuning such a model, the higher the quality of the data set, the better the training or fine-tuning result; the network security field in particular places high demands on data quality. However, current data sets are mainly built manually or collected from web pages, so their quality is uneven and toxic data can easily contaminate the model.
Disclosure of Invention
The invention provides a data processing method, a data processing device, electronic equipment and a storage medium.
According to an aspect of the present invention, there is provided a data processing method including:
constructing a domain knowledge graph based on the full domain knowledge of a target field;
acquiring a training data set or fine-tuning data set required for training or fine-tuning a large language model of the target field;
determining the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph;
and screening each piece of data in the training data set or fine-tuning data set according to the data quality.
In an alternative implementation, determining the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph includes:
performing word segmentation on each piece of data in the training data set or fine-tuning data set to obtain a word segmentation list corresponding to each piece of data, wherein the word segmentation list comprises at least one word segment;
for any word segmentation list, determining the entity matching degree of the word segmentation list based on the domain knowledge graph, wherein the entity matching degree measures the proportion of word segments in the list that are target word segments belonging to the target field;
and determining the data quality of each piece of data in the training data set or fine-tuning data set according to the entity matching degree.
In an optional implementation, determining, for any word segmentation list, the entity matching degree of the word segmentation list based on the domain knowledge graph includes:
for each word segment in the word segmentation list, determining the similarity between the word segment and each entity name in the domain knowledge graph;
selecting the target word segments whose similarity is greater than a preset similarity threshold;
and taking the ratio of the number of target word segments to the total number of word segments in the word segmentation list as the entity matching degree of the word segmentation list.
In an alternative implementation, determining the data quality of each piece of data in the training data set or fine-tuning data set according to the entity matching degree includes:
for any piece of data in the training data set or fine-tuning data set, if the entity matching degree of its word segmentation list is smaller than a first threshold, determining that the data is low-quality data;
if the entity matching degree of its word segmentation list is greater than or equal to a second threshold, determining that the data is high-quality data;
and if the entity matching degree of its word segmentation list is greater than or equal to the first threshold and smaller than the second threshold, marking the data as data to be divided.
In an alternative implementation, the method further includes:
for the entities in the domain knowledge graph, performing association analysis using a community division technique based on a modularity algorithm to generate at least one entity community, wherein each entity community comprises at least one entity;
for any piece of data to be divided, determining the target word segments in its word segmentation list that belong to the target field, and determining the target entity communities in which those target word segments are located;
determining the proportion of the number of target entity communities to the total number of entity communities included in the domain knowledge graph;
if the proportion is smaller than a third threshold, determining that the data to be divided is high-quality data;
and if the proportion is greater than or equal to the third threshold, determining that the data to be divided is discrete data of the target field.
In an alternative implementation, screening each piece of data in the training data set or fine-tuning data set according to the data quality includes:
preserving the high-quality data in the training data set or fine-tuning data set, and deleting the low-quality data or discrete data therein.
In an alternative implementation, the method further includes:
establishing a hidden association relationship between at least two entities in the entity community that do not yet have an association relationship.
According to another aspect of the present invention, there is provided a data processing apparatus comprising:
a graph construction module, configured to construct a domain knowledge graph based on the full domain knowledge of a target field;
a data set acquisition module, configured to acquire a training data set or fine-tuning data set required for training or fine-tuning a large language model of the target field;
a quality analysis module, configured to determine the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph;
and a data screening module, configured to screen each piece of data in the training data set or fine-tuning data set according to the data quality.
In an alternative implementation, the quality analysis module includes:
a word segmentation processing unit, configured to perform word segmentation on each piece of data in the training data set or fine-tuning data set to obtain a word segmentation list corresponding to each piece of data, wherein the word segmentation list comprises at least one word segment;
an entity matching degree determining unit, configured to determine, for any word segmentation list, the entity matching degree of the word segmentation list based on the domain knowledge graph, wherein the entity matching degree measures the proportion of word segments in the list that are target word segments belonging to the target field;
and a quality analysis unit, configured to determine the data quality of each piece of data in the training data set or fine-tuning data set according to the entity matching degree.
In an alternative implementation, the entity matching degree determining unit is further configured to:
for each word segment in the word segmentation list, determine the similarity between the word segment and each entity name in the domain knowledge graph;
select the target word segments whose similarity is greater than a preset similarity threshold;
and take the ratio of the number of target word segments to the total number of word segments in the word segmentation list as the entity matching degree of the word segmentation list.
In an alternative implementation, the quality analysis unit is further configured to:
for any piece of data in the training data set or fine-tuning data set, if the entity matching degree of its word segmentation list is smaller than a first threshold, determine that the data is low-quality data;
if the entity matching degree of its word segmentation list is greater than or equal to a second threshold, determine that the data is high-quality data;
and if the entity matching degree of its word segmentation list is greater than or equal to the first threshold and smaller than the second threshold, mark the data as data to be divided.
In an alternative implementation, the apparatus further includes:
a community division module, configured to perform association analysis on the entities in the domain knowledge graph using a community division technique based on a modularity algorithm to generate at least one entity community, wherein each entity community comprises at least one entity;
a word segment attribution determining module, configured to determine, for any piece of data to be divided, the target word segments in its word segmentation list that belong to the target field, and determine the target entity communities in which those target word segments are located;
a proportion determining module, configured to determine the proportion of the number of target entity communities to the total number of entity communities included in the domain knowledge graph;
a first judging module, configured to determine that the data to be divided is high-quality data if the proportion is smaller than a third threshold;
and a second judging module, configured to determine that the data to be divided is discrete data of the target field if the proportion is greater than or equal to the third threshold.
In an alternative implementation, the data screening module is further configured to:
preserve the high-quality data in the training data set or fine-tuning data set, and delete the low-quality data or discrete data therein.
In an alternative implementation, the apparatus further includes:
a hidden association relationship construction module, configured to establish a hidden association relationship between at least two entities in the entity community that do not yet have an association relationship.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the data processing method according to the embodiment of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a data processing method according to an embodiment of the present invention.
According to the technical scheme provided by the embodiments of the present invention, the data in the training data set or fine-tuning data set is screened using the domain knowledge graph, which improves the quality of the data in the data set and prevents toxic data in the data set from contaminating the large language model.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below depict only some embodiments of the present invention; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a data processing method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a data processing method according to a second embodiment of the present invention;
FIG. 3 is a flow chart of a data processing method according to a third embodiment of the present invention;
FIG. 4 is a schematic diagram of a data processing apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device implementing a data processing method according to an embodiment of the present invention.
Detailed Description
So that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art without inventive effort shall fall within the scope of the present invention.
The terms involved in the present invention are explained first.
Knowledge graph: a database of knowledge stores triples (entities, concepts and attributes), each representing a fact. The knowledge graph can also be regarded as a graph, and the triples can be nodes in the knowledge graph.
Concept: refers to a collection of entities with the same characteristics, such as books, computers, etc.
Entity: refers to something that is distinguishable and independently present. Such as a person, a city, a plant, a commodity, etc. The entities are the most basic elements in the knowledge graph, and different relationships exist among different entities.
Attributes: features used to distinguish concepts; different concepts have different attributes. Different attribute value types correspond to different types of edges. If the attribute value is itself a concept or entity, the attribute describes a relationship between two entities and is called an object attribute; if the attribute value is a literal value, it is called a data attribute.
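As an illustration of the triple structure defined above, a knowledge graph can be held as a set of (head, relation, tail) tuples. The entities and relations below are invented network-security examples, not data from the patent:

```python
# A minimal sketch of a triple store: each tuple is one fact.
# Entity and relation names here are illustrative assumptions.
graph = {
    ("SQL injection", "is_a", "web attack"),
    ("SQL injection", "mitigated_by", "input validation"),
    ("firewall", "is_a", "security device"),
}

def facts_about(graph, entity):
    """Return every triple in which the entity appears as head or tail."""
    return {t for t in graph if entity in (t[0], t[2])}
```

For example, `facts_about(graph, "SQL injection")` returns the two facts whose head is "SQL injection".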
Example 1
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention. The method may be performed by a data processing apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device, for example integrated in a server. The method is applicable to scenarios in which the data in a data set is optimized using a knowledge graph.
As shown in fig. 1, the data processing method includes:
S101, constructing a domain knowledge graph based on full domain knowledge of the target domain.
In this embodiment, the target domain may be a network security domain, and the full domain knowledge of the target domain may be knowledge related to network security. The domain knowledge graph refers to a knowledge graph constructed based on knowledge data of a specific domain, and in this embodiment, the domain knowledge graph may be a knowledge graph of a network security domain.
The process of constructing the domain knowledge graph from the full domain knowledge of the target field is as follows. First, related vocabulary of the target field is acquired, optionally in response to vocabulary input by a user, or directly from an existing database. Next, the initial source data for the related vocabulary is determined; optionally, the initial source data may be encyclopedia page information or raw web page data. Finally, based on the initial source data, the domain knowledge graph of the target field is obtained through knowledge modeling, knowledge extraction, knowledge fusion, knowledge storage, and other processes.
It should be noted that knowledge modeling, that is, defining the knowledge model, mainly starts from the actual application scenario of the target field and the specific problem to be solved, and defines the hierarchy of concepts in the target field and the relationship types between them. Knowledge extraction optionally includes entity extraction, relationship extraction, and attribute extraction from the acquired data. Knowledge fusion mainly comprises concept fusion, entity fusion, and relationship fusion: concept fusion merges concept-level data, entity fusion merges entity-level data, and relationship fusion merges the relationships between concepts, between concepts and entities, and between entities. Knowledge storage mainly means persisting the result in a preset storage mode, such as a graph database.
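The construction process above can be sketched as a pipeline skeleton. Every stage here is a stub under invented assumptions (a "head|relation|tail" line format for source documents, lower-casing as a stand-in for fusion); real knowledge extraction and fusion are far more involved:

```python
# Skeleton of the construction pipeline: acquire source data for seed
# vocabulary, extract triples, fuse them, and store the result.
def extract_triples(doc: str) -> set:
    # Stub extraction: treat each "head|relation|tail" line as one fact.
    return {tuple(line.split("|")) for line in doc.splitlines()
            if line.count("|") == 2}

def fuse(triples: set) -> set:
    # Stub fusion: merge entities that differ only in letter case.
    return {(h.lower(), r.lower(), t.lower()) for h, r, t in triples}

def store(triples: set, db: list) -> None:
    # Stub storage: a real system would write to a graph database.
    db.append(triples)

def build_domain_kg(seed_vocabulary, fetch_source, db):
    triples = set()
    for word in seed_vocabulary:
        triples |= extract_triples(fetch_source(word))
    triples = fuse(triples)
    store(triples, db)
    return triples
```

`fetch_source` stands for whatever retrieval mechanism supplies encyclopedia or web-page text for a seed word; it is a placeholder, not an API from the patent.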
S102, acquiring a training data set or a fine tuning data set required by a large language model for training or fine tuning the target field.
In this embodiment, the training data set or fine-tuning data set required for training or fine-tuning the large language model of the target field is optionally constructed in advance, manually or automatically. Because toxic data may be present in this data set, training or fine-tuning the large language model with it directly may contaminate the model. Therefore, before training or fine-tuning, the pre-constructed data set is acquired and then, according to steps S103-S104, its data is screened to remove data unrelated to the target field.
S103, determining the data quality of each piece of data in the training data set or fine-tuning data set based on the domain knowledge graph.
In this embodiment, data quality measures how well a piece of data in the data set matches the domain knowledge graph of the target field: the higher the matching degree of a piece of data with the graph, the higher the probability that it belongs to the target field, and the data is high-quality data; conversely, the lower the matching degree, the lower that probability, and the data is low-quality data, that is, potentially toxic data.
S104, screening each piece of data in the training data set or fine-tuning data set according to the data quality.
Optionally, high-quality data in the training data set or fine-tuning data set is retained and low-quality data is deleted. This optimizes the data in the data set, improves its overall quality, and avoids contaminating the large language model with toxic data.
Example 2
Fig. 2 is a flowchart of a data processing method according to a second embodiment of the present invention. Referring to fig. 2, the method flow includes the steps of:
S201, constructing a domain knowledge graph based on the full domain knowledge of the target field.
In this embodiment, the target domain may be a network security domain, and the full domain knowledge of the target domain may be knowledge related to network security. The domain knowledge graph refers to a knowledge graph constructed based on knowledge data of a specific domain, and in this embodiment, the domain knowledge graph may be a knowledge graph of a network security domain.
S202, acquiring a training data set or a fine tuning data set required by a large language model for training or fine tuning the target field.
In this embodiment, the specific process of steps S201 to S202 may be referred to the description of the above embodiment, and will not be repeated here.
Based on the domain knowledge graph of the target field obtained through S201, and on the to-be-optimized training data set or fine-tuning data set acquired through S202, the process of determining the data quality of each piece of data in the data set based on the domain knowledge graph comprises steps S203-S205.
S203, performing word segmentation on each piece of data in the training data set or fine-tuning data set to obtain a word segmentation list corresponding to each piece of data.
In this embodiment, each piece of data in the training data set or fine-tuning data set (optionally a sequence of Chinese characters) is segmented using a Chinese word segmentation method, where Chinese word segmentation refers to cutting a sequence of Chinese characters into individual words. In an alternative embodiment, the jieba word segmentation method may be used: jieba builds a prefix dictionary from a statistical dictionary; the prefix dictionary is then used to scan the input sentence for all possible cuts, from which a directed acyclic graph of segmentation positions is constructed; finally, the maximum-probability path through this graph is computed by dynamic programming to obtain the final segmentation. After segmentation, each piece of data in the training data set or fine-tuning data set corresponds to one word segmentation list, and each list comprises at least one word segment.
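The prefix-dictionary / DAG / dynamic-programming procedure just described can be illustrated in miniature. This sketch works on space-separated English tokens rather than Chinese characters, and its dictionary and frequencies are invented; it shows the same idea as jieba but is not the jieba implementation:

```python
import math

# Toy statistical dictionary: phrase -> frequency (invented values).
FREQ = {"network": 50, "security": 40, "network security": 30,
        "data": 60, "set": 20, "data set": 25}
TOTAL = sum(FREQ.values())

def segment(tokens):
    n = len(tokens)
    # DAG: for each start index i, the end indices j of dictionary matches;
    # fall back to a single-token cut when nothing matches.
    dag = {i: [j for j in range(i + 1, n + 1)
               if " ".join(tokens[i:j]) in FREQ] or [i + 1]
           for i in range(n)}
    # Dynamic programming over log-probabilities, right to left:
    # route[i] = (best log-prob of segmenting tokens[i:], best cut point).
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(" ".join(tokens[i:j]), 1) / TOTAL)
             + route[j][0], j)
            for j in dag[i])
    # Walk the maximum-probability path to emit the segmentation.
    out, i = [], 0
    while i < n:
        i_next = route[i][1]
        out.append(" ".join(tokens[i:i_next]))
        i = i_next
    return out
```

With the toy dictionary, `segment("network security data set".split())` prefers the two multi-word entries over four single-word cuts.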
S204, for any word segmentation list, determining the entity matching degree of the word segmentation list based on the domain knowledge graph.
The entity matching degree measures the proportion of word segments in the word segmentation list that are target word segments belonging to the target field. In an optional implementation, determining the entity matching degree of any word segmentation list based on the domain knowledge graph proceeds as follows. First, determine the similarity between each word segment in the list and each entity name in the domain knowledge graph; optionally, compute the Euclidean distance between the word vector of the word segment and the word vector of the entity name and derive the similarity from that distance, or directly compute the cosine similarity between the two word vectors. Next, select the target word segments whose similarity is greater than a preset similarity threshold, optionally 0.9; as long as the similarity between a word segment and some entity name in the graph exceeds 0.9, the word segment is considered related to that entity name, that is, it belongs to the target field. Finally, take the ratio of the number of target word segments to the total number of word segments in the list as the entity matching degree of the list. For example, if a word segmentation list includes 5 word segments, of which 4 have a similarity greater than the preset threshold to some entity name in the domain knowledge graph, the entity matching degree of the list is 0.8.
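A minimal sketch of this computation, assuming word vectors are already available; the 2-d vectors used in the test are toys, not real embeddings, and 0.9 is the threshold named in the embodiment:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def entity_matching_degree(segment_vecs, entity_vecs, threshold=0.9):
    """Fraction of word segments whose best cosine similarity to any
    entity-name vector exceeds the threshold."""
    if not segment_vecs:
        return 0.0
    hits = sum(1 for v in segment_vecs
               if max(cosine(v, e) for e in entity_vecs) > threshold)
    return hits / len(segment_vecs)
```

With toy vectors, a list of 5 word segments of which 4 match reproduces the 0.8 example in the text.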
S205, determining the data quality of each piece of data in the training data set or fine-tuning data set according to the entity matching degree.
Optionally, the entity matching degree of the word segmentation list corresponding to each piece of data is used directly as the value of that data's quality, so whether the data is high-quality or low-quality can be determined from the entity matching degree of its word segmentation list.
In an alternative embodiment, for any piece of data in the training data set or fine-tuning data set, if the entity matching degree of its word segmentation list is smaller than a first threshold (for example, 0.3), the data is determined to be low-quality data; if the entity matching degree is greater than or equal to a second threshold (for example, 0.6), the data is determined to be high-quality data. If the entity matching degree is greater than or equal to the first threshold and smaller than the second threshold, the data is marked as data to be divided, that is, it must be further determined whether the data is high-quality data or discrete data of the target field; the specific determination process is described in a subsequent embodiment.
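The two-threshold rule can be written down directly; 0.3 and 0.6 are the example values given in the text, not fixed by the patent:

```python
def classify_by_matching_degree(degree, first_threshold=0.3,
                                second_threshold=0.6):
    """Map an entity matching degree to a quality label."""
    if degree < first_threshold:
        return "low-quality"
    if degree >= second_threshold:
        return "high-quality"
    # Between the thresholds: quality is decided later by the
    # community-based dispersion check of Example 3.
    return "to-be-divided"
```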
S206, screening each data in the training data set or the fine adjustment data set according to the data quality.
Optionally, high-quality data in the training data set or fine-tuning data set is preserved and low-quality data is deleted.
In this embodiment, by computing the similarity between each word segment in the word segmentation list and the entity names, the entity matching degree between the list and the knowledge graph can be determined accurately, which in turn ensures the accuracy of using the entity matching degree to determine data quality. Finally, deleting the low-quality data based on data quality avoids contaminating the large language model with low-quality, toxic data in the data set.
Example 3
Fig. 3 is a flow chart of a data processing method according to a third embodiment of the present invention. Referring to fig. 3, the method logic includes the following:
s301, constructing a domain knowledge graph based on full domain knowledge of the target domain.
In this embodiment, the target domain may be a network security domain, and the full domain knowledge of the target domain may be knowledge related to network security. The domain knowledge graph refers to a knowledge graph constructed based on knowledge data of a specific domain, and in this embodiment, the domain knowledge graph may be a knowledge graph of a network security domain.
S302, for the entities in the domain knowledge graph, performing association analysis using a community division technique based on a modularity algorithm to generate at least one entity community.
Optionally, when dividing entity communities, each entity may first be treated as its own entity community, and the communities are then merged step by step using the modularity-based community division technique until the modularity no longer increases. In this embodiment, each resulting entity community includes at least one entity. It should be noted that each entity can belong to only one entity community; that is, no entity belongs to multiple entity communities at the same time.
Further, for at least two entities in an entity community that have no existing association relationship, a hidden association relationship between them is established, so that subsequent data analysis can make use of these hidden associations between entities.
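A toy version of the merging procedure above: every entity starts as its own community, and the pair whose merge gives the largest modularity gain is merged, until no merge increases modularity. This brute-force sketch recomputes Newman modularity from scratch at every step; a production system would use an efficient method such as Louvain:

```python
from itertools import combinations

def modularity(edges, communities):
    """Newman modularity of a partition of an undirected graph."""
    m = len(edges)
    node2c = {n: i for i, c in enumerate(communities) for n in c}
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for i, c in enumerate(communities):
        internal = sum(1 for u, v in edges
                       if node2c[u] == i and node2c[v] == i)
        total_degree = sum(degree.get(n, 0) for n in c)
        q += internal / m - (total_degree / (2 * m)) ** 2
    return q

def greedy_communities(nodes, edges):
    communities = [{n} for n in nodes]
    while True:
        base = modularity(edges, communities)
        best = None
        for a, b in combinations(range(len(communities)), 2):
            trial = [c for i, c in enumerate(communities) if i not in (a, b)]
            trial.append(communities[a] | communities[b])
            gain = modularity(edges, trial) - base
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, trial)
        if best is None:        # modularity no longer increases: stop
            return communities
        communities = best[1]
```

On a graph made of two triangles joined by a single bridge edge, the procedure recovers the two triangles as communities.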
S303, acquiring a training data set or a fine tuning data set required by a large language model for training or fine tuning the target field.
S304, performing word segmentation on each piece of data in the training data set or fine-tuning data set to obtain a word segmentation list corresponding to each piece of data.
S305, for any word segmentation list, determining the entity matching degree of the word segmentation list based on the domain knowledge graph.
Optionally, for each word segment in the word segmentation list, determine the similarity between the word segment and each entity name in the domain knowledge graph; select the target word segments whose similarity is greater than a preset similarity threshold; and take the ratio of the number of target word segments to the total number of word segments in the list as the entity matching degree of the list.
S306, determining the data quality of each piece of data in the training data set or the fine-tuning data set according to the entity matching degree.
In an alternative embodiment, for any data in the training data set or the fine-tuning data set, if the entity matching degree of the word segmentation list of the data is less than a first threshold (for example, 0.3), the data is determined to be low-quality data; if the entity matching degree is greater than or equal to a second threshold (for example, 0.6), the data is determined to be high-quality data.
It should be noted that if the entity matching degree of the word segmentation list of the data is greater than or equal to the first threshold and less than the second threshold, the data is marked as data to be divided; that is, it must be further determined whether the data is high-quality data or discrete data of the target field. For details, see steps S307-S309.
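The three-way split of S306 can be sketched as below; the threshold values 0.3 and 0.6 are the example values from this embodiment:

```python
def classify_by_matching_degree(degree, first_threshold=0.3, second_threshold=0.6):
    """S306: map an entity matching degree to a data-quality label.
    Records between the two thresholds need the community check of S307-S309."""
    if degree < first_threshold:
        return "low-quality"
    if degree >= second_threshold:
        return "high-quality"
    return "to-be-divided"
```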
S307, determining the target word segments belonging to the target field in the word segmentation list corresponding to any data to be divided, and determining the target entity communities in which those target word segments are located.
S308, determining the ratio of the number of target entity communities to the total number of entity communities included in the domain knowledge graph.
S309, if the ratio is smaller than a third threshold, determining that the data to be divided is high-quality data; and if the ratio is greater than or equal to the third threshold, determining that the data to be divided is discrete data of the target field.
In this embodiment, the larger the ratio (for example, when the target word segments belong to many different entity communities), the more dispersed the target word segments in the word segmentation list are, the weaker the relevance among them, and the greater the probability that the corresponding data to be divided is discrete data of the target field. Conversely, the smaller the ratio (for example, when multiple target word segments belong to the same entity community), the more concentrated the target word segments in the word segmentation list are, the stronger the relevance among them, and the greater the probability that the corresponding data to be divided is high-quality data.
In an alternative embodiment, a third threshold (for example, 0.1) may be preset; if the ratio is smaller than the third threshold, the data to be divided is determined to be high-quality data, and if the ratio is greater than or equal to the third threshold, the data to be divided is determined to be discrete data of the target field.
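The decision of S307-S309 can be sketched as follows; the community ids and the total community count in the example are hypothetical, and 0.1 is the example third threshold above:

```python
def community_ratio_decision(target_communities, total_communities, third_threshold=0.1):
    """S308-S309: the ratio of distinct target entity communities to the total
    number of entity communities decides high-quality vs. discrete data."""
    ratio = len(set(target_communities)) / total_communities
    return "high-quality" if ratio < third_threshold else "discrete"
```

With 30 entity communities in the graph, three target word segments concentrated in two communities give a ratio of 2/30 (high-quality), while four segments spread over four communities give 4/30 (discrete).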
S310, screening each piece of data in the training data set or the fine-tuning data set according to the data quality.
Optionally, the high-quality data in the training data set or the fine-tuning data set is retained, and the low-quality data or discrete data in the training data set or the fine-tuning data set is deleted.
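A minimal sketch of the screening step S310, assuming each record has already been labelled by the preceding steps (the record contents and pairing structure are illustrative):

```python
def screen_dataset(labelled_records):
    """S310: keep high-quality records; delete low-quality and discrete ones."""
    return [data for data, label in labelled_records if label == "high-quality"]

kept = screen_dataset([
    ("attack log analysis text", "high-quality"),
    ("unrelated chit-chat", "low-quality"),
    ("loosely related mixture", "discrete"),
])
```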
In this embodiment, the data quality of the data to be divided is further confirmed through the ratio of the number of target entity communities containing its target word segments to the total number of entity communities included in the domain knowledge graph, and the discrete data so identified is deleted together with the low-quality data, which further improves the overall quality of the data in the training data set or the fine-tuning data set.
Example IV
Fig. 4 is a schematic structural diagram of a data processing apparatus according to a fourth embodiment of the present invention, where the present embodiment is applicable to a scenario in which data in a data set is optimized by using a knowledge graph. As shown in fig. 4, the apparatus includes:
The map construction module 401 is configured to construct a domain knowledge map based on full-scale domain knowledge of the target domain;
a data set acquisition module 402, configured to acquire a training data set or a fine-tuning data set required for training or fine-tuning a large language model in the target field;
a quality analysis module 403, configured to determine a data quality of each data in the training data set or the fine tuning data set based on the domain knowledge graph;
the data filtering module 404 is configured to filter each data in the training data set or the fine tuning data set according to the data quality.
In an alternative implementation, the quality analysis module includes:
the word segmentation processing unit is used for carrying out word segmentation processing on each piece of data in the training data set or the fine tuning data set to obtain a word segmentation list corresponding to each piece of data; wherein the word segmentation list comprises at least one word segment;
the entity matching degree determining unit, configured to determine, for any word segmentation list, the entity matching degree of the word segmentation list based on the domain knowledge graph; wherein the entity matching degree is used for measuring the proportion of target word segments belonging to the target field in the word segmentation list;
and the quality analysis unit is used for determining the data quality of each data in the training data set or the fine tuning data set according to the entity matching degree.
In an alternative implementation, the entity matching degree determining unit is further configured to:
aiming at each word in the word segmentation list, determining the similarity between the word and each entity name in the domain knowledge graph;
selecting target word segmentation with similarity larger than a preset similarity threshold value;
and taking the ratio of the number of target word segments to the total number of word segments in the word segmentation list as the entity matching degree of the word segmentation list.
In an alternative implementation, the quality analysis unit is further configured to:
for any data in the training data set or the fine tuning data set, if the entity matching degree of the word segmentation list of the data is smaller than a first threshold value, determining the data to be low-quality data;
if the entity matching degree of the word segmentation list of the data is greater than or equal to a second threshold value, determining that the data is high-quality data;
and if the entity matching degree of the word segmentation list of the data is greater than or equal to the first threshold and less than the second threshold, marking the data as data to be divided.
In an alternative implementation, the apparatus further includes:
the community dividing module, configured to perform, for the entities in the domain knowledge graph, association analysis using a community division technique based on a modularity algorithm to generate at least one entity community; wherein each entity community comprises at least one entity;
The word segmentation attribution determining module is used for determining target word segmentation belonging to the target field in a word segmentation list corresponding to any data to be divided, and determining a target entity community where the target word segmentation is located;
the ratio determining module, configured to determine the ratio of the number of target entity communities to the total number of entity communities included in the domain knowledge graph;
the first judging module, configured to determine that the data to be divided is high-quality data if the ratio is smaller than a third threshold;
and the second judging module, configured to determine that the data to be divided is discrete data of the target field if the ratio is greater than or equal to the third threshold.
In an alternative implementation, the data screening module is further configured to:
retain the high-quality data in the training data set or the fine-tuning data set, and delete the low-quality data or discrete data in the training data set or the fine-tuning data set.
In an alternative implementation, the apparatus further includes:
the hidden association relationship construction module, configured to establish, for at least two entities in the entity community that have no association relationship, a hidden association relationship between the at least two entities.
The data processing apparatus provided by the embodiment of the present invention can execute the data processing method provided by any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the executed method.
Example V
Fig. 5 shows a schematic structural diagram of an electronic device 10 that may be used to implement an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, for example, the data processing method.
In some embodiments, the data processing method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. One or more steps of the data processing method described above may be performed when the computer program is loaded into the RAM 13 and executed by the processor 11. Alternatively, in other embodiments, the processor 11 may be configured to perform the data processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that steps may be reordered, added, or deleted in the various forms of flows shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, which is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of data processing, comprising:
constructing a domain knowledge graph based on full domain knowledge of the target domain;
acquiring a training data set or a fine tuning data set required for training or fine tuning a large language model of the target field;
determining the data quality of each data in the training data set or the fine tuning data set based on the domain knowledge graph;
and screening each data in the training data set or the fine tuning data set according to the data quality.
2. The method of claim 1, wherein determining the data quality of each data in the training dataset or fine tuning dataset based on the domain knowledge-graph comprises:
performing word segmentation processing on each data in the training data set or the fine tuning data set to obtain a word segmentation list corresponding to each piece of data; wherein the word segmentation list comprises at least one word segment;
determining the entity matching degree of any word segmentation list based on the domain knowledge graph; the entity matching degree is used for measuring the proportion of target word segments belonging to the target field in the word segmentation list;
and determining the data quality of each data in the training data set or the fine tuning data set according to the entity matching degree.
3. The method of claim 2, wherein for any word segmentation list, determining the entity matching degree of the word segmentation list based on the domain knowledge graph comprises:
for each word in the word segmentation list, determining the similarity between the word and each entity name in the domain knowledge graph;
selecting target word segmentation with similarity larger than a preset similarity threshold value;
And taking the ratio of the number of the target word segments to the total number of word segments included in the word segmentation list as the entity matching degree of the word segmentation list.
4. The method of claim 2, wherein determining the data quality of each data in the training dataset or fine-tuning dataset based on the entity matching degree comprises:
for any data in the training data set or the fine tuning data set, if the entity matching degree of the word segmentation list of the data is smaller than a first threshold value, determining the data to be low-quality data;
if the entity matching degree of the word segmentation list of the data is greater than or equal to a second threshold value, determining that the data is high-quality data;
and if the entity matching degree of the word segmentation list of the data is greater than or equal to the first threshold and less than the second threshold, marking the data as data to be divided.
5. The method according to claim 4, wherein the method further comprises:
for the entities in the domain knowledge graph, performing association analysis using a community division technique based on a modularity algorithm to generate at least one entity community; wherein each entity community comprises at least one entity;
Determining target word segments belonging to the target field in a word segmentation list corresponding to any data to be divided, and determining a target entity community in which the target word segments are located;
determining the ratio of the number of the target entity communities to the total number of entity communities included in the domain knowledge graph;
if the ratio is smaller than a third threshold, determining that the data to be divided is high-quality data;
and if the ratio is greater than or equal to the third threshold, determining that the data to be divided is discrete data of the target field.
6. The method of claim 5, wherein filtering each data in the training dataset or fine tuning dataset according to the data quality comprises:
high quality data in the training data set or the fine tuning data set is preserved and low quality data or discrete data in the training data set or the fine tuning data set is deleted.
7. The method of claim 5, wherein the method further comprises:
and establishing a hidden association relationship between at least two entities in the entity community that have no association relationship.
8. A data processing apparatus, comprising:
the map construction module is used for constructing a domain knowledge map based on full domain knowledge of the target domain;
the data set acquisition module is used for acquiring a training data set or a fine tuning data set required by a large language model for training or fine tuning the target field;
the quality analysis module is used for determining the data quality of each data in the training data set or the fine tuning data set based on the domain knowledge graph;
and the data screening module is used for screening each data in the training data set or the fine adjustment data set according to the data quality.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of any one of claims 1-7.
CN202310610114.7A 2023-05-29 2023-05-29 Data processing method and device, electronic equipment and storage medium Pending CN116340548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310610114.7A CN116340548A (en) 2023-05-29 2023-05-29 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310610114.7A CN116340548A (en) 2023-05-29 2023-05-29 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116340548A true CN116340548A (en) 2023-06-27

Family

ID=86884454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310610114.7A Pending CN116340548A (en) 2023-05-29 2023-05-29 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116340548A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861928A (en) * 2023-07-07 2023-10-10 北京中关村科金技术有限公司 Method, device, equipment and medium for generating instruction fine tuning data
CN116915459A (en) * 2023-07-13 2023-10-20 上海戎磐网络科技有限公司 Network threat analysis method based on large language model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682172A (en) * 2016-12-28 2017-05-17 江苏大学 Keyword-based document research hotspot recommending method
CN110334272A (en) * 2019-05-29 2019-10-15 平安科技(深圳)有限公司 The intelligent answer method, apparatus and computer storage medium of knowledge based map
CN111488462A (en) * 2020-04-02 2020-08-04 中国移动通信集团江苏有限公司 Recommendation method, device, equipment and medium based on knowledge graph
US20220083874A1 (en) * 2020-11-24 2022-03-17 Beijing Baidu Netcom Science Technology Co., Ltd. Method and device for training search model, method for searching for target object, and storage medium
CN115422179A (en) * 2022-09-14 2022-12-02 冯秦海 AI training processing method based on big data cleaning and artificial intelligence training system
CN115862848A (en) * 2023-02-15 2023-03-28 之江实验室 Disease prediction system and device based on clinical data screening and medical knowledge map


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗森林 (LUO Senlin) et al.: "大数据分析理论与技术" [Big Data Analysis Theory and Technology], Beijing Institute of Technology Press, pages 182-184 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861928A (en) * 2023-07-07 2023-10-10 北京中关村科金技术有限公司 Method, device, equipment and medium for generating instruction fine tuning data
CN116861928B (en) * 2023-07-07 2023-11-17 北京中关村科金技术有限公司 Method, device, equipment and medium for generating instruction fine tuning data
CN116915459A (en) * 2023-07-13 2023-10-20 上海戎磐网络科技有限公司 Network threat analysis method based on large language model
CN116915459B (en) * 2023-07-13 2024-03-08 上海戎磐网络科技有限公司 Network threat analysis method based on large language model

Similar Documents

Publication Publication Date Title
CN116340548A (en) Data processing method and device, electronic equipment and storage medium
CN112559885B (en) Training model determining method and device for map interest points and electronic equipment
US20230018489A1 (en) Method for acquiring structured question-answering model, question-answering method and corresponding apparatus
CN112559631B (en) Data processing method and device of distributed graph database and electronic equipment
CN113836925A (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN114549058A (en) Address selection method and device, electronic equipment and readable storage medium
CN114244795A (en) Information pushing method, device, equipment and medium
CN113904943A (en) Account detection method and device, electronic equipment and storage medium
CN115186738B (en) Model training method, device and storage medium
CN116484215A (en) Diffusion model-based text generation model training and text generation method and device
CN114417974B (en) Model training method, information processing device, electronic equipment and medium
CN113032251B (en) Method, device and storage medium for determining service quality of application program
CN115248890B (en) User interest portrait generation method and device, electronic equipment and storage medium
US20220188292A1 (en) Data processing method, apparatus, electronic device and readable storage medium
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN114611625A (en) Language model training method, language model training device, language model data processing method, language model data processing device, language model data processing equipment, language model data processing medium and language model data processing product
CN113807390A (en) Model training method and device, electronic equipment and storage medium
CN116628167B (en) Response determination method and device, electronic equipment and storage medium
CN113822057B (en) Location information determination method, location information determination device, electronic device, and storage medium
CN113836244B (en) Sample acquisition method, model training method, relation prediction method and device
CN117033801B (en) Service recommendation method, device, equipment and storage medium
CN117272970B (en) Document generation method, device, equipment and storage medium
CN116167978A (en) Model updating method and device, electronic equipment and storage medium
CN114863207A (en) Pre-training method and device of target detection model and electronic equipment
CN117610508A (en) Text processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination