CN113538075A

CN113538075A - Data processing method, model training method, device and equipment

Info

Publication number: CN113538075A
Application number: CN202010290718.4A
Authority: CN
Inventors: 袁博; 黄龙涛
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-04-14
Filing date: 2020-04-14
Publication date: 2021-10-22

Abstract

The embodiment of the invention provides a data processing method, a model training device and equipment. The method comprises the following steps: acquiring data to be processed; determining at least one first reference data and at least one second reference data corresponding to the data to be processed, wherein the similarity between the at least one first reference data and the data to be processed is greater than or equal to a preset threshold value, and the similarity between the at least one second reference data and the data to be processed is smaller than the preset threshold value; and determining a target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data. The technical scheme provided by the embodiment effectively realizes that the data object does not need to be recalled in advance, and can directly avoid the phenomena of mistaken recall, over-recall or no recall and the like of the data object; determining the target data object based on the first reference data and the second reference data not only lengthens the distance between the positive and negative example data objects, but also improves the accuracy and reliability of data processing.

Description

Data processing method, model training method, device and equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data processing method, a model training device, and a data processing apparatus.

Background

With the rapid development of science and technology, the application of data platforms is more and more extensive. Taking e-commerce platform as an example, in the application scenario of e-commerce, some merchants may issue some prohibited goods/services or non-qualified goods/services based on interest relations, and in order to prevent being penalized, various means may be adopted to avoid penalizing and off-shelf, but at the same time, it is desirable that the goods/services can be exposed. For example: there may be some implicit contents in the information content issued by the merchant, that is, intuitively, there is no violation or violation in issuing the information content, and in practice, the information may point to some prohibited goods/services or non-qualified goods/services.

For the information to be published including some hidden contents, the contents intuitively meet the requirements when the information to be published is examined. Therefore, the information corresponding to the forbidden goods/services or non-qualified goods/services is easily checked and issued, and further the justice and the principles of justice, justice and good-custom of market economy are influenced.

Disclosure of Invention

The embodiment of the invention provides a data processing method, a model training method, a device and equipment, which are used for solving the problem that some forbidden goods/services or unqualified goods/services are issued because the information to be issued containing some hidden contents cannot be checked in the prior art.

In a first aspect, an embodiment of the present invention provides a data processing method, including:

acquiring data to be processed;

determining at least one first reference data and at least one second reference data corresponding to the data to be processed, wherein the similarity between the at least one first reference data and the data to be processed is greater than or equal to a preset threshold value, and the similarity between the at least one second reference data and the data to be processed is less than the preset threshold value;

and determining a target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data.

In a second aspect, an embodiment of the present invention provides an apparatus for processing data, including:

the first acquisition module is used for acquiring data to be processed;

the first determining module is used for determining at least one first reference data and at least one second reference data corresponding to the data to be processed, wherein the similarity between the at least one first reference data and the data to be processed is greater than or equal to a preset threshold value, and the similarity between the at least one second reference data and the data to be processed is smaller than the preset threshold value;

and the first processing module is used for determining a target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to implement at least the method of processing data as described in the first aspect.

An embodiment of the present invention provides a non-transitory machine-readable storage medium, on which executable code is stored, and when the executable code is executed by a processor of an electronic device, the processor is enabled to at least implement the data processing method according to the first aspect.

In a fourth aspect, an embodiment of the present invention provides a method for training a model, including:

acquiring first data;

determining positive sample data, negative sample data and irrelevant sample data corresponding to the first data, wherein the similarity between the positive sample data and the first data is greater than or equal to a preset threshold, and the similarity between the negative sample data and the irrelevant sample data and the first data is less than the preset threshold;

and performing learning training on the first data, the positive sample data, the negative sample data and the irrelevant sample data to obtain a data processing model, wherein the data processing model is used for determining a data object corresponding to the data.

In a fifth aspect, an embodiment of the present invention provides a training apparatus for a model, including:

the second acquisition module is used for acquiring the first data;

a second determining module, configured to determine positive sample data, negative sample data, and irrelevant sample data corresponding to the first data, where a similarity between the positive sample data and the first data is greater than or equal to a preset threshold, and a similarity between the negative sample data and the irrelevant sample data and the first data is less than the preset threshold;

and the second processing module is used for performing learning training on the first data, the positive sample data, the negative sample data and the irrelevant sample data to obtain a data processing model, and the data processing model is used for determining a data object corresponding to the data.

In a sixth aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to implement at least the method of training a model according to the fourth aspect.

An embodiment of the present invention provides a non-transitory machine-readable storage medium, having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to implement at least the training method of the model according to the fourth aspect.

In the embodiment of the invention, by acquiring data to be processed, determining at least one first reference data and at least one second reference data corresponding to the data to be processed, and then determining a target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data, the data object does not need to be recalled in advance, so that the phenomena of mistaken recall, excessive recall or no recall and the like of the data object can be directly avoided; in addition, the target data object is determined based on the first reference data and the second reference data comprehensively, and the target data object is determined by effectively considering the limit relationship between the first reference data and the second reference data, so that the distance between the positive example data object and the negative example data object is effectively increased, the precision and the reliability of the target data object corresponding to the data to be processed can be effectively ensured, the practicability of the method is further improved, and the method is favorable for popularization and application in the market.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;

fig. 2 is a schematic view of a scenario of a data processing method according to an embodiment of the present invention;

fig. 3 is a schematic flow chart of another data processing method according to an embodiment of the present invention;

fig. 4 is a schematic flowchart of determining a target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data according to the embodiment of the present invention;

fig. 5 is a schematic flowchart of a process of replacing at least part of the data to be processed with the at least one first reference data to obtain at least one first data corresponding to the data to be processed according to an embodiment of the present invention;

fig. 6 is a schematic flowchart of a process of replacing at least part of the data to be processed with the at least one second reference data to obtain at least one second data corresponding to the data to be processed according to an embodiment of the present invention;

fig. 7 is a schematic flowchart of determining a target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data according to the embodiment of the present invention;

fig. 8 is a schematic flowchart of determining a target data object corresponding to the to-be-processed data in the at least one first data according to the first similarity, the second similarity, and the third similarity according to the embodiment of the present invention;

FIG. 9 is a flowchart illustrating a method for training a model according to an embodiment of the present invention;

FIG. 10 is a diagram illustrating a data processing method according to an embodiment of the present invention;

fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

fig. 12 is a schematic structural diagram of an electronic device corresponding to the data processing apparatus provided in the embodiment shown in fig. 11;

FIG. 13 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;

fig. 14 is a schematic structural diagram of an electronic device corresponding to the training apparatus for model provided in the embodiment shown in fig. 13.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.

The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a good/service or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good/service or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of additional like elements in a good/service or system that comprises the element.

In order to facilitate understanding of the technical solutions in the embodiments of the present application, the following briefly describes related technologies:

in the application scenario of the e-commerce platform, a merchant can issue some commodity/service related information (commodity link information/service link information); based on the interest relationship, some merchants may issue some related information about forbidden goods/services or non-qualified goods/services. In order to prevent from being penalized, various means are adopted to avoid the penalization and the off-shelf, but the service corresponding to the commodity/service link information corresponding to the commodity link information is expected to be exposed, so that various data issuing modes such as entity link, implicit entity link and the like are generated. The entity link refers to link reference in a text in a graph database to determine link information of an entity; implicit entity links are links where a reference to the entity does not appear explicitly in the text, but where the text also points to some explicit entity, and where the text content can be linked to the entity in the graph database. In specific application, implicit entity linking includes the following two cases:

in the first case: the commodity words/the service words are hidden in the text content.

In order to avoid data auditing and management and control of an e-commerce platform, some merchants hide commodity words/hidden service words in data links and list peripheral attributes of various commodities/services, so that the data links can be exposed, and the combined peripheral attributes of the commodities/services can point to unique commodities/services. For example: for a Virtual Private Network (VPN) product, the product title may be: the title of the commodity does not include commodity words, but includes a plurality of phrases related to peripheral attributes of the commodity, and the phrases can still point to a certain vpn commodity after being combined together.

In the second case: the text content contains hidden commodity words/hidden service words and attribute words.

In order to avoid data audit and control of an e-commerce platform, some merchants imply commodity words/implied service words and attribute words in data links, so that a wind control engine is avoided and the search engine cannot retrieve the commodity words/implied service words and the attribute words. But, the merchant can indirectly point to the deterministic goods/services by adding some goods description information/service description information, the exposure of the goods/services is completed by a recommendation engine, and the goods/services are recommended by the cooperation of the description of the goods/services and the direct exposure, so that the goods/services can be searched and seen by the user. For example, a certain VPN commodity is titled as: "free 24-hour service for free ladder of free trip in the world". The corresponding product is not visible from the literal view of the title. However, these descriptions may appear in the exposure descriptions of other VPN class commodities and thus are recommended at the time of collaborative recommendation.

The data linked by the two implicit entities are two evasive control commodity/service scenes which are difficult to prevent and control at present. In addition, the entity identification auditing operation for the e-commerce platform generally comprises the following steps:

(1) a candidate entity recall.

The method comprises the steps of obtaining keywords or identification information related or similar to entity words, and indirectly pointing to the entities through a plurality of keywords or identification information related or similar to the entity words so as to obtain a candidate entity set, wherein the candidate entity set comprises a plurality of candidate entities.

(2) And sorting the candidate entities.

And inputting all candidate entities in the candidate entity set into a preset sorting algorithm according to the corresponding attribute and relationship information and the self information of the commodities/services, and acquiring the sorting information of all the candidate entities in a mode of manual feature extraction or embedded feature combination. And then, in the sorted candidate entities, selecting the first or the first N ranked candidate entities as the entities referred by the commodity information/service information.

However, the above process has the following drawbacks:

(a) in the process of recalling the candidate entity, for the implicit entity link, no cue words or description of entity words of the entity appear. Thus, the accuracy and reliability of candidate entity recalls is greatly reduced.

(b) In the process of sorting candidate entities, even if a part of entities are recalled through surrounding attribute words and the like, since the information of the goods/services is less and is implicitly described, when the characteristics are extracted for entity sorting, the problem of long tail exists, and the general characteristics are not provided, so that the sorting accuracy is reduced. In addition, because the semantic difference between the goods/service description information and the structured information in the graph database is large, it is difficult to ensure that the ranking of a small part of accurate candidate entities which are recalled is advanced in the ranking stage.

(c) In the process of candidate entity sorting, since the constraint relationship between the positive case entity (the entity similar to the goods/service information) and the negative case entity (the entity dissimilar to the goods/service information) is not considered, the positive case entity and the negative case entity are difficult to be effectively distinguished in sorting due to extremely small measurement distance, and the reference and the accuracy of candidate entity sorting are further reduced.

In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.

Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention; fig. 2 is a schematic view of a scenario of a data processing method according to an embodiment of the present invention; referring to fig. 1-2, in order to solve the technical problem that data that does not explicitly refer to goods/services information but can point to a certain violation or non-qualification goods/services through various means cannot be effectively controlled, the present embodiment provides a data processing method, an execution subject of the data processing method may be a processing device, and it can be understood that the processing device may be implemented as software or a combination of software and hardware. Specifically, the method may include:

step S101: and acquiring data to be processed.

Step S102: determining at least one first reference data and at least one second reference data corresponding to the data to be processed, wherein the similarity between the at least one first reference data and the data to be processed is greater than or equal to a preset threshold value, and the similarity between the at least one second reference data and the data to be processed is smaller than the preset threshold value.

Step S103: and determining a target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data.

The above steps are described in detail below:

step S101: and acquiring data to be processed.

The data to be processed refers to text information that needs to be subjected to data recognition operation, and the data to be processed may include at least one of the following: description information of the data object, attribute information of the data object, or related information unrelated to the data object. It can be understood that the text information may identify different data objects in different application scenarios, for example, in an application scenario of an e-commerce platform, the text information may be to-be-released commodity information/service information, and the to-be-released commodity information may correspond to a commodity entity (data object); or, in an application scenario of the service platform, the text information may be service information to be published, and the service information to be published may correspond to a service item (data object), for example: a housekeeping service item, a cleaning service item, a moving service item, etc.

In addition, the specific acquiring method of the data to be processed is not limited in this embodiment, and a person skilled in the art may set the acquiring method according to specific application requirements and design requirements, for example: when the processing device is arranged on the preset data processing platform and the user transmits the data to be processed to the data processing platform, the processing device can directly acquire the data to be processed transmitted by the user. Or, when the processing device is in communication connection with the data processing platform, after the data processing platform acquires the to-be-processed data transmitted by the user, the data processing platform may send the to-be-processed data to the processing device. For example, a user may upload or edit data to be processed through the client, and after the client acquires the data to be processed, the client may transmit the data to be processed to the processing device, so that the processing device may stably acquire the data to be processed.

Of course, the obtaining method of the to-be-processed data is not limited to the above-mentioned exemplary method, and those skilled in the art may also obtain the to-be-processed data in other manners as long as the to-be-processed data can be accurately and effectively obtained, which is not described herein again.

After the data to be processed is acquired, the data to be processed may be analyzed to determine at least one first reference data and at least one second reference data corresponding to the data line to be processed. The similarity between the at least one first reference data and the data to be processed is greater than or equal to a preset threshold, so that the first reference data can be determined as positive sample data corresponding to the data to be processed, and the positive sample data comprises at least one of the following: the data processing method comprises a regular data object similar to data to be processed, description information of the regular data object and attribute information of the regular data object. Similarly, since the similarity between the at least one second reference data and the data to be processed is smaller than the preset threshold, the second reference data can be determined as non-positive sample data corresponding to the data to be processed. In a particular implementation, the at least one second reference datum comprises at least one of: the negative reference data and the irrelevant reference data, that is, the non-positive sample data may include the negative reference data corresponding to the data to be processed and/or the irrelevant reference data corresponding to the data to be processed, and the negative sample data includes at least one of the following: the negative example data object which is not similar to the data to be processed, the description information of the negative example data object and the attribute information of the negative example data object, and the irrelevant reference data comprises at least one of the following data: the data processing method comprises the following steps of a sample data object irrelevant to data to be processed, description information of the sample data object and attribute information of the sample data object.

In addition, the embodiment does not limit the specific implementation manner of determining the at least one first reference data and the at least one second reference data corresponding to the data to be processed, and a person skilled in the art may set the determination manner according to specific application requirements and design requirements. Specifically, as shown in fig. 2, a graph database is configured in advance, and the graph database includes a plurality of entity information and description information and attribute information corresponding to the entity information. After the data to be processed is obtained, data retrieval may be performed in the map database based on a preset data matching algorithm, so that first reference data similar to the data to be processed and second reference data dissimilar to the data to be processed may be determined, and it may be understood that the number of the first reference data and the number of the second reference data may be one or more.

Of course, the determination method of the first reference data and the second reference data is not limited to the above-mentioned exemplary method, and a person skilled in the art may also determine the first reference data and the second reference data in other manners as long as the first reference data and the second reference data can be accurately and effectively acquired, which is not described herein again.

After the at least one first reference data and the at least one second reference data are acquired, the at least one first reference data and the at least one second reference data may be analyzed to determine a target data object corresponding to the data to be processed. Specifically, the embodiment does not limit the specific implementation manner of determining the target data object, and a person skilled in the art may set the target data object according to a specific application scenario and a design requirement, for example: a pre-trained machine learning model may be configured to determine data objects corresponding to the data. In a specific operation, the obtained to-be-processed data, the at least one first reference data, and the at least one second reference data may be input into the machine learning model, so that a target data object corresponding to the to-be-processed data may be obtained, and it may be understood that, in different application scenarios, the target data object may include different entity information or service information. Or, a first similarity between any one of the at least one first reference data and the data to be processed, and a second similarity between any one of the at least one first reference data and any one of the at least one second reference data may be obtained, and then the target data object corresponding to the data to be processed is determined based on the obtained first similarity and the obtained second similarity, specifically, the data object corresponding to the positive sample data with the highest first similarity and the lowest second similarity may be determined as the target data object corresponding to the data to be processed.

Of course, the determining method of the target data object corresponding to the data to be processed is not limited to the above-mentioned exemplary method, and a person skilled in the art may also determine the target data object corresponding to the data to be processed in other manners as long as the target data object corresponding to the data to be processed can be accurately and effectively acquired, which is not described herein again.

In the data processing method provided by this embodiment, to-be-processed data is acquired, at least one first reference data and at least one second reference data corresponding to the to-be-processed data are determined, and then a target data object corresponding to the to-be-processed data is determined according to the at least one first reference data and the at least one second reference data, so that it is effectively achieved that a data object does not need to be recalled in advance, and thus, the phenomena of mistaken recall, over-recall, no recall and the like of the data object can be directly avoided; in addition, the target data object is determined based on the first reference data and the second reference data comprehensively, and the target data object is determined by effectively considering the limit relationship between the first reference data and the second reference data, so that the distance between the positive example data object and the negative example data object is effectively increased, the precision and the reliability of the target data object corresponding to the data to be processed can be effectively ensured, the practicability of the method is further improved, and the method is favorable for popularization and application in the market.

Fig. 3 is a schematic flow chart of another data processing method according to an embodiment of the present invention; on the basis of the foregoing embodiment, with continuing reference to fig. 3, after obtaining the data to be processed, the method in this embodiment may further include:

step S201: the data to be processed is divided into a plurality of data fragments.

Step S202: and marking each data fragment to obtain fragment identification information corresponding to the data fragment.

After the data to be processed is acquired, in order to implement accuracy of analyzing and identifying the data to be processed, the data to be processed may be divided into a plurality of data segments, specifically, the data to be processed may be divided by using a preset machine learning algorithm, so that the data to be processed is divided into a plurality of data segments, where the machine learning algorithm may include at least one of the following: a bert word granularity model, an e lmo model, a gpt model, and so on.

After a plurality of data segments corresponding to the data to be processed are acquired, each data segment may be marked, so that segment identification information corresponding to the data segment may be obtained, where the segment identification information may include at least one of: attribute identification information, identification information of the part of speech and the syntactic structure, and phrase identification information, wherein when each data segment is marked, the priority of the attribute identification information is higher than that of the identification information of the part of speech and the syntactic structure; the identification information of the part of speech and the syntactic structure is higher in priority than the phrase identification information. Specifically, marking each data segment, and obtaining segment identification information corresponding to the data segment may include:

step S2021: and when the data segment is marked as the attribute identification information, determining the segment identification information corresponding to the data segment as the attribute identification information. Alternatively, the first and second electrodes may be,

step S2022: and when the data segment mark is not the attribute identification information, splitting the data segment into data phrases and marking the data phrases.

After a plurality of data segments corresponding to data to be processed are acquired, each of the plurality of data segments may be marked, specifically, whether attribute identification information corresponding to the data segment exists may be searched for in a graph database, when the attribute identification information corresponding to the data segment is retrieved from the graph database, the data segment may be marked as the attribute identification information, and then the segment identification information corresponding to the data segment may be determined as the attribute identification information. When the attribute identification information corresponding to the data segment is not retrieved from the graph database, the data segment may be split again, that is, the data segment is split into data phrases, which are at least a part of the data segment, that is, one data segment may correspond to one or more data phrases. After the data word group is acquired, in order to acquire the segment identification information corresponding to the data segment, a marking operation may be performed on the data word group. Specifically, marking the data word group may include:

step S20221: and when the data phrase is marked as the identification information of the part of speech and the syntactic structure, determining the segment identification information corresponding to the data phrase as the identification information of the part of speech and the syntactic structure. Alternatively, the first and second electrodes may be,

step S20222: and when the data phrase mark is not the identification information of the part of speech and the syntactic structure, determining the segment identification information corresponding to the data phrase as phrase identification information.

After one or more data phrases corresponding to the data fragments are acquired, the data phrases may be labeled, specifically, whether identification information of parts of speech and a syntactic structure corresponding to the data phrases exists or not may be searched in the graph database, and when the identification information of the parts of speech and the syntactic structure corresponding to the data phrases is retrieved in the graph database, the data phrases may be labeled as the identification information of the parts of speech and the syntactic structure, and then it may be determined that the fragment identification information corresponding to the data phrases is the identification information of the parts of speech and the syntactic structure. When identification information of part of speech and syntactic structure corresponding to the data phrase is not retrieved from the graph database, the fragment identification information corresponding to the data phrase can be determined as the phrase identification information, so that the marking operation of each data fragment is realized, and the accuracy and reliability of acquiring the fragment identification information corresponding to the data fragment are effectively ensured.

In the embodiment, the data to be processed is divided into the plurality of data segments, and then each data segment is marked to obtain the segment identification information corresponding to the data segment, so that the segment identification information corresponding to the data segment is effectively determined, the data to be processed is conveniently analyzed based on the obtained segment identification information, and the accuracy and reliability of the analysis of the data to be processed are further improved.

Fig. 4 is a schematic flowchart of determining a target data object corresponding to data to be processed according to at least one first reference data and at least one second reference data according to an embodiment of the present invention; based on the foregoing embodiment, with reference to fig. 4, in this embodiment, a specific implementation manner of determining a target data object corresponding to data to be processed is not limited, and a person skilled in the art may set the target data object according to specific application requirements and design requirements, and preferably, in this embodiment, determining the target data object corresponding to the data to be processed according to at least one first reference data and at least one second reference data may include:

step S301: at least part of data of the data to be processed is replaced by utilizing the at least one first reference data, and at least one first data corresponding to the data to be processed is obtained.

After the at least one first reference data is acquired, at least part of the data to be processed may be replaced with the at least one first reference data, so that at least one first data corresponding to the data to be processed may be acquired. Specifically, as shown in fig. 5, replacing at least a part of the data to be processed with at least one first reference data, and obtaining at least one first data corresponding to the data to be processed may include:

step S3011: first replacement strategy information corresponding to data to be processed is acquired.

Step S3012: and replacing at least part of the data to be processed by utilizing at least one first reference data based on the first replacement strategy information to obtain at least one first data.

Wherein, a first replacement policy information corresponding to the data to be processed is configured in advance, and the first replacement policy information may include at least one of the following: the replacement priority corresponding to the attribute identification information is higher than the replacement priority corresponding to the identification information of the part of speech and the syntactic structure, and the replacement priority corresponding to the identification information of the part of speech and the syntactic structure is higher than the replacement priority corresponding to the phrase identification information and the proportion information of the replacement data in the data to be processed. After the first replacement policy information is acquired, at least part of the data to be processed may be replaced with at least one first reference data, and then at least one first data corresponding to the at least one first reference data may be acquired.

For the convenience of understanding, the example is given by taking the proportion information of the replacement data in the data to be processed as 30%, and for the convenience of expression, the identifier a is taken as attribute identification information, the identifier B is taken as identification information of a part of speech and a syntactic structure, and the identifier C is taken as phrase identification information.

Assuming that the data to be processed includes a data segment 1, a data segment 2, a data segment 3, a data segment 4, a data segment 5, and a data segment 6, segment identification information corresponding to the data segments is as follows: A. a, B, B, B and C, based on the above ratio information 30%, the data segment 1 and the data segment 2 corresponding to the segment identification information "a" in the above data to be processed can be replaced by the first reference data, so that the first data corresponding to the first reference data can be generated, and the first data at this time can include: first reference data, data segment 3, data segment 4, data segment 5, and data segment 6.

For another example, the segment identification information corresponding to the data segments is as follows: A. b, B, B, C and C, based on the above ratio information 30%, the first reference data may be used to replace any one of the data segment 1 corresponding to the segment identification information "a" and the data segment corresponding to the segment identification information "B" (taking data segment 3 as an example), so as to generate first data corresponding to the first reference data, where the first data may include: first reference data, data segment 2, first reference data, data segment 4, data segment 5, and data segment 6.

Step S302: and replacing at least part of the data to be processed by utilizing at least one second reference data to obtain at least one second data corresponding to the data to be processed.

After the at least one second reference data is acquired, at least part of the data to be processed may be replaced with the at least one second reference data, so that at least one second data corresponding to the data to be processed may be acquired. Specifically, as shown in fig. 6, replacing at least a part of the data to be processed with at least one second reference data, and obtaining at least one second data corresponding to the data to be processed may include:

step S3021: and acquiring second replacement strategy information corresponding to the data to be processed.

Step S3022: and replacing at least part of the data to be processed by utilizing at least one second reference data based on the second replacement strategy information to obtain at least one second data.

Wherein, second replacement policy information corresponding to the data to be processed is configured in advance, and the second replacement policy information may include at least one of the following: the replacement priority corresponding to the attribute identification information is higher than the replacement priority corresponding to the identification information of the part of speech and the syntactic structure, and the replacement priority corresponding to the identification information of the part of speech and the syntactic structure is higher than the replacement priority corresponding to the phrase identification information and the proportion information of the replacement data in the data to be processed. It is to be understood that the second replacement policy information in the present embodiment may be the same as or different from the first replacement policy information described above.

After the second replacement policy information is acquired, at least part of the data to be processed may be replaced with at least one second reference data, and then at least one second data corresponding to the at least one second reference data may be acquired. For example, the example is given by taking the proportion information of the replacement data in the data to be processed as 25%, and for convenience of description, the identifier a is taken as attribute identification information, the identifier B is taken as identification information of a part of speech and a syntactic structure, and the identifier C is taken as phrase identification information.

Suppose that the data to be processed includes a data segment 1, a data segment 2, a data segment 3, a data segment 4, a data segment 5, a data segment 6, a data segment 7, and a data segment 8, and the segment identification information corresponding to the data segments is as follows: A. a, A, B, B, B, C and C, based on the above proportion information of 25%, any two data fragments corresponding to the fragment identification information "a" in the above data to be processed can be replaced by the second reference data, for example: replacement data segments 1 and 2, replacement data segments 2 and 3, replacement data segments 1 and 3, and so on; second data corresponding to the second reference data may thus be generated, and the second data at this time may include: second reference data, data segment 3, data segment 4, data segment 5, data segment 6, data segment 7, and data segment 8.

Step S303: and determining a target data object corresponding to the data to be processed according to the at least one first data and the at least one second data.

After the at least one first data and the at least one second data are acquired, the first data and the second data may be analyzed to determine a target data object corresponding to the data to be processed. Specifically, the embodiment does not limit the specific implementation manner of determining the target data object, and a person skilled in the art may set the target data object according to a specific application scenario and a design requirement, for example: a first similarity between any one of the at least one first data and the data to be processed, and a second similarity between any one of the at least one first data and any one of the at least one second data may be obtained, and then a target data object corresponding to the data to be processed is determined based on the obtained first similarity and second similarity, specifically, a data object corresponding to a first reference data having a highest first similarity and a lowest second similarity may be determined as a target data object corresponding to the data to be processed.

In this embodiment, at least a part of data to be processed is replaced by at least one first reference data to obtain at least one first data corresponding to the data to be processed, at least a part of data to be processed is replaced by at least one second reference data to obtain at least one second data corresponding to the data to be processed, and then a target data object corresponding to the data to be processed is determined according to the at least one first data and the at least one second data, so as to effectively implement data reconstruction on the data to be processed and effectively fuse entity information corresponding to the first reference data and the second reference data into the data to be processed, thereby directly obtaining a similarity relation between the entity information and the data object corresponding to the data to be processed, and thus eliminating the need of manual feature extraction engineering in a sorting stage, and under the condition that the matching degree of the data to be processed and the entity information is not enough, a reasonable similarity measurement index can be obtained, so that the accuracy and reliability of target data object identification are improved, and the accuracy and reliability of auditing the data to be processed are also improved.

Fig. 7 is a schematic flowchart of determining a target data object corresponding to data to be processed according to at least one first reference data and at least one second reference data according to an embodiment of the present invention; on the basis of the foregoing embodiment, referring to fig. 7, in this embodiment, a specific implementation manner of determining a target data object corresponding to data to be processed according to at least one first reference data and at least one second reference data is not limited, and a person skilled in the art may set the target data object according to specific application requirements and design requirements, and preferably, in this embodiment, determining the target data object corresponding to the data to be processed according to at least one first reference data and at least one second reference data may include:

step S601: acquiring a first similarity between the data to be processed and at least one first data, a second similarity between the data to be processed and at least one second data, and a third similarity between the data to be processed and the at least one first data and the at least one second data.

Step S602: and determining a target data object corresponding to the data to be processed in the at least one first data according to the first similarity, the second similarity and the third similarity.

After the data to be processed, the at least one first data and the at least one second data are acquired, a first similarity between the data to be processed and any one of the at least one first data, a second similarity between the data to be processed and any one of the at least one second data, and a third similarity between the at least one first data and the at least one second data may be acquired by using a preset algorithm. After the first similarity, the second similarity, and the third similarity are obtained, a target data object corresponding to the data to be processed may be determined in the at least one first data based on the first similarity, the second similarity, and the third similarity. Specifically, referring to fig. 8, determining the target data object corresponding to the data to be processed in the at least one first data according to the first similarity, the second similarity, and the third similarity may include:

step S6021: and determining at least one alternative data, in the at least one first data, of which the first similarity is greater than or equal to a first preset threshold and the second similarity and the third similarity are both less than a second preset threshold, wherein the second preset threshold is less than the first preset threshold.

Step S6022: in the at least one alternative data, a target data object corresponding to the data to be processed is determined.

For example, a first preset threshold and a second preset threshold are preconfigured, wherein the first preset threshold is much larger than the second preset threshold, for example: the first preset threshold may be 90%, 95% or 98%, etc., and the second preset threshold may be 10%, 5% or 8%, etc. Assuming that the at least one first data comprises: data A1, data A2, data A3 and data A4, wherein the first similarities corresponding to the data A1, the data A2, the data A3 and the data A4 are S1, S2, S3 and S4 respectively, S3> S2> S4> S1, S1, S2 and S3 are all larger than or equal to a first preset threshold, and S4 is smaller than the first preset threshold; the data a1, the data a2, the data A3 and the data a4 correspond to second similarities of Sa1, Sa2, Sa3 and Sa4, wherein Sa1 and Sa2 are both smaller than a second preset threshold, and Sa3 and Sa4 are both larger than the second preset threshold; the third similarities corresponding to the data a1, the data a2, the data A3 and the data a4 are Sb1, Sb2, Sb3 and Sb4, respectively, wherein Sb1, Sb2 and Sb4 are all smaller than a second preset threshold, and Sb3 is larger than the second preset threshold.

As can be seen from the analysis processing of the first similarity, the second similarity, and the third similarity, since the first similarity between the data a1 and the data a2 is greater than or equal to a first preset threshold, and the second similarity and the third similarity are both less than a second preset threshold, it may be determined that the at least one candidate data includes the data a1 and the data a 2.

After determining that the at least one alternative data includes data A1 and data A2, then a target data object corresponding to the data to be processed may be determined in the at least one alternative data. Specifically, in the at least one candidate data, determining the target data object corresponding to the data to be processed may include:

step S60221: and determining the candidate data with the maximum first similarity as the target data object corresponding to the data to be processed in the at least one candidate data.

Specifically, after at least one candidate data is acquired, the candidate data with the largest first similarity may be determined as the target data object corresponding to the data to be processed. For example, when the at least one candidate data includes data a1 and data a2, since the first similarity S2 corresponding to data a2 is greater than the first similarity S1 corresponding to data a1, data a2 may be determined as a target data object corresponding to the data to be processed.

In the embodiment, by obtaining the first similarity between the data to be processed and the at least one first data, the second similarity between the data to be processed and the at least one second data, and the third similarity between the at least one first data and the at least one second data, then determining the at least one candidate data based on the first similarity, the second similarity, and the third similarity, and then determining the candidate data with the largest first similarity among the at least one candidate data as the target data object corresponding to the data to be processed, it is effectively achieved that the target data object corresponding to the data to be processed is determined based on the constraint relationship between the first reference data and the second reference data, so that not only is the accuracy and reliability of determining the target data object guaranteed, but also a larger metric distance exists between the first reference data and the second reference data, the quality and efficiency of data processing are further improved.

On the basis of any one of the above embodiments, after determining the target data object corresponding to the data to be processed, the method in this embodiment may further include:

step S104: and auditing the target data object to identify whether the data to be processed meet the preset requirement.

The preset requirement is a preset data auditing standard corresponding to different application scenes, after the target data object is obtained, the target data object can be audited according to a preset auditing strategy to identify whether the data to be processed meet the preset requirement, the data to be processed can be issued when the data to be processed meets the preset requirement, and the data to be processed can be prohibited from being issued when the data to be processed does not meet the preset requirement, so that the data which can meet the preset requirement and the data which cannot meet the preset requirement can be effectively audited, the quality and the efficiency of data issuing are ensured, and the justice and the fair and popular principle of market economy are further ensured.

FIG. 9 is a flowchart illustrating a method for training a model according to an embodiment of the present invention; referring to fig. 9, the embodiment provides a training method of a model, and an execution subject of the training method may be a training device, and it is understood that the training device may be implemented as software or a combination of software and hardware.

Specifically, the method may include:

step S801: first data is acquired.

Step S802: and determining positive sample data, negative sample data and irrelevant sample data corresponding to the first data, wherein the similarity between the positive sample data and the first data is greater than or equal to a preset threshold, and the similarity between the negative sample data and the irrelevant sample data and the first data is less than the preset threshold.

Step S803: and performing learning training on the first data, the positive sample data, the negative sample data and the irrelevant sample data to obtain a data processing model, wherein the data processing model is used for determining a data object corresponding to the data.

The above steps are described in detail below:

step S801: first data is acquired.

The first data may be description data, attribute data, or other types of data for a certain data object, and the first data corresponds to a standard data object. In addition, in order to enable accurate learning training of the data processing model, the number of the first data may be plural. In addition, the specific obtaining manner of the first data is not limited in this embodiment, and a person skilled in the art may set the obtaining manner according to specific application requirements and design requirements, for example: the first data can be stored in a preset area, and the first data can be acquired by accessing the preset area; or, the first data may be sent by other devices (clients) to the training apparatus, so that the training apparatus may acquire the first data to implement the training operation of the model based on the first data.

Of course, the manner of obtaining the first data is not limited to the above-mentioned exemplary manner, and those skilled in the art may also obtain the first data in other manners as long as the first data can be accurately and effectively obtained, which is not described herein again.

Mapping relations between the first data and the positive sample data, the negative sample data and the irrelevant sample data are configured in advance, so that after the first data is acquired, the positive sample data, the negative sample data and the irrelevant sample data corresponding to the first data can be determined. It can be understood that the obtained positive sample data, negative sample data and irrelevant sample data are all pre-configured standard sample data, and therefore, the learning and training operations of the model can be performed by using the positive sample data, the negative sample data and the irrelevant sample data.

After the first data, the positive sample data, the negative sample data and the irrelevant sample data are obtained, the first data can be processed based on the positive sample data, the negative sample data and the irrelevant sample data, so that learning training can be performed based on the processed data to obtain a data processing model. Specifically, the learning training of the first data, the positive sample data, the negative sample data, and the irrelevant sample data to obtain the data processing model may include:

step S8031: replacing at least part of data of the first data by using the positive sample data to obtain first training data corresponding to the first data;

specifically, after the positive sample data is acquired, at least part of the first data may be replaced by the positive sample data, so that first training data corresponding to the first data may be acquired, where when at least part of the first data is replaced, segment identification information corresponding to each data segment in the first data may be acquired first, where the segment identification information includes at least one of: attribute identification information, identification information of part of speech and syntactic structure, and phrase identification information; then, at least part of the first data is replaced based on the identification information of each segment, and the replacement principle may include: the replacement priority corresponding to the attribute identification information is higher than the replacement priority corresponding to the identification information of the part of speech and the syntactic structure; and the replacement priority corresponding to the identification information of the part of speech and the syntactic structure is higher than the replacement priority corresponding to the phrase identification information. Based on the above-mentioned alternative principle, first training data corresponding to the first data may then be obtained.

Step S8032: replacing at least part of data of the first data by using negative sample data to obtain second training data corresponding to the first data;

the implementation process, implementation manner, and implementation effect of this step in this embodiment are similar to the implementation process, implementation manner, and implementation effect of step S8031, and reference may be specifically made to the above statements, and details are not described here again.

Step S8033: replacing at least part of data of the first data by using irrelevant sample data to obtain third training data corresponding to the first data;

Step S8034: and performing learning training on the first training data, the second training data, the third training data and the first data to obtain a data processing model.

After the first training data, the second training data, and the third training data are acquired, the first training data, the second training data, the third training data, and the first data may be subjected to learning training, so that a data processing model may be obtained. When the learning training operation of the model is performed, the first training data, the second training data, the third training data and the incidence relation among the first data can be obtained, and then the learning training operation of the model is performed based on the incidence relation so as to ensure the learning training effect of the data processing model. In addition, the association relationship may include at least one of: a first similarity between the first data and the first training data is greater than or equal to a first preset threshold; the second similarity between the first data and the second training data, the third similarity between the first data and the third training data, the fourth similarity between the first training data and the second training data, the fifth similarity between the first training data and the third training data, and the sixth similarity between the first training data and the third training data are all smaller than a second preset threshold, wherein the second preset threshold is smaller than the first preset threshold.

According to the training method of the model provided by the embodiment, the first data is acquired, the positive sample data, the negative sample data and the irrelevant sample data corresponding to the first data are determined, and then the first data, the positive sample data, the negative sample data and the irrelevant sample data are subjected to learning training, so that the data processing model for determining the data object corresponding to the data is effectively obtained, then the data can be processed based on the data processing model to determine the data processing object of the data, and the quality and the efficiency of data processing are further improved.

In specific application, referring to fig. 10, the present application embodiment provides a data processing method, which takes to-be-issued commodity information as to-be-processed data for example, and the method can solve the problem that in the prior art, the to-be-issued commodity information for specifically referring to a commodity is checked, and specifically, the problem that the to-be-issued commodity information can still point to a certain illegal or non-qualified commodity in multiple ways and cannot be effectively controlled and controlled can be solved. The present application embodiment provides a data processing method, which may adopt an end-to-end direct link analysis method, and specifically, the method may include the following steps:

step 1: and acquiring original commodity information to be issued.

Step 2: and marking the original commodity information by word granularity.

Step 2.1: and segmenting the original commodity information to obtain a data fragment (such as an ngram language model fragment) corresponding to the original commodity information.

Specifically, the original commodity information can be marked and segmented by adopting a word granularity coding (embedding) algorithm, a word2vec algorithm or a bert algorithm.

Step 2.2: the data segment is marked so that segment identification information corresponding to the data segment can be obtained.

Wherein the segment identification information may include at least one of: attribute identification information, identification information of part of speech and syntactic structure (such as subject, animal subject combination, etc.), and phrase identification information. The marking priority of the attribute identification information is higher than that of the identification information of the part of speech and the syntactic structure; the tagging priority of the identification information of the part of speech and the syntactic structure is higher than that of the phrase identification information.

Step 2.21: when the data segment is marked as the attribute identification information, the segment identification information corresponding to the data segment can be determined to be the attribute identification information.

Step 2.22: when the data segment is not identified as the attribute identification information, the data segment can be segmented according to a mutual information (information entropy) mode to obtain a plurality of data phrases corresponding to the data segment;

step 2.23: when the data phrase is marked as the identification information of the part of speech and the syntactic structure, the segment identification information corresponding to the data phrase can be determined to be the identification information of the part of speech and the syntactic structure; otherwise, determining the segment identification information corresponding to the data phrase as phrase identification information.

And step 3: and performing information reconstruction based on the original commodity information to obtain first information, second information and third information corresponding to the original commodity information.

Searching the graph database for positive sample data (entity words of the positive sample), negative sample data (entity words of the negative sample) and irrelevant sample data (noise information carried in the original commodity information) corresponding to the original commodity information, wherein the number of the determined positive sample data can be one, the number of the negative sample data can be multiple, and the number of the irrelevant sample data can be multiple. After positive sample data, negative sample data and irrelevant sample data are obtained, replacing and reconstructing at least part of the original commodity information by using the positive sample data to obtain first information corresponding to the original commodity information; similarly, at least part of the original commodity information is replaced and reconstructed by using the negative sample data, and second information corresponding to the original commodity information is obtained; and replacing and reconstructing at least part of the original commodity information by using the irrelevant relevant data to obtain third information corresponding to the original commodity information.

When the replacement reconstruction operation is performed by using the positive sample data, the negative sample data and the irrelevant sample data, the replacement reconstruction operation can be performed according to a preset replacement principle: the replacement priority corresponding to the attribute identification information is higher than the replacement priority corresponding to the identification information of the part of speech and the syntactic structure; and the replacement priority corresponding to the identification information of the part of speech and the syntactic structure is higher than the replacement priority corresponding to the phrase identification information. For example, the coverage rate of the replacement occlusion is 10% to 30%, and after the original commodity information and the corresponding plurality of segment identification information are acquired, part of information to be replaced and reconstructed can be extracted from the original commodity information based on the replacement priority of the segment identification information. After the partial information needing to be replaced and reconstructed is obtained, the replacement and reconstruction operation can be performed by using the positive sample data, the negative sample data and the irrelevant sample data, so that the first information, the second information and the third information corresponding to the original commodity information can be obtained.

And 4, step 4: and grouping the original commodity information, the first information, the second information and the third information in pairs, and calculating the information similarity between any two pieces of information.

Wherein, the similarity between the original commodity information and the first information is identified as sim (o, e +), the similarity between the original commodity information and the second information is identified as sim (o, e-), the similarity between the original commodity information and the third information is identified as sim (o, unk), the similarity between the first information and the second information is identified as sim (e +, e-), the similarity between the first information and the third information is identified as sim (unk, e +), and the similarity between the second information and the third information is identified as sim (unk, e-). Specifically, the above information may be analyzed by using two preset encoders, so as to obtain similarity information, where a cosine distance between two pieces of information may be determined as a similarity between the two pieces of information; alternatively, other representations may be selected to determine the similarity between two pieces of information, such as: the Jaccard similarity between the two pieces of information, etc. can be obtained, and will not be described herein.

After the semantic similarity metric and the constraint metric based on the symmetry metric mechanism are obtained, the commodity entity corresponding to the original commodity information can be determined based on the semantic similarity metric and the constraint metric based on the symmetry metric mechanism.

And 5: and performing learning training based on the semantic similarity measurement and the constraint measurement of the symmetry measurement mechanism to obtain a machine learning model, wherein the machine learning model is used for determining a data object corresponding to the data.

In the process of performing the learning training of the model, the similarity sim (o, e +) between the original commodity information and the first information is made as large as possible, the similarity sim (o, e-) between the original commodity information and the second information, the similarity sim (o, unk) between the original commodity information and the third information, the similarity sim (e +, e-) between the first information and the second information, the similarity sim (unk, e +), between the first information and the third information, and the similarity sim (unk, e-) between the second information and the third information are made as small as possible, that is, the similarity between the original commodity information and the first information is ensured as large as possible and is not similar to other information, and meanwhile, the measurement distance between the positive entity and the negative entity is ensured to be far enough, that is, the measurement distance between the first information and the third information is ensured to be far enough. To summarize, similarity metrics based on semantic similarity need to ensure that sim (o, e +) is large enough, sim (o, e-) is small enough, sim (o, unk) is small enough; constraint metrics based on the symmetry metric mechanism need to ensure that sim (e +, e-) is small enough, sim (unk, e +) is small enough, sim (unk, e-) is small enough.

Since it is necessary to ensure that the similarity values sim (o, e +) are sufficiently large, in order to ensure the quality and efficiency of data processing, the similarity values may be normalized to achieve the same dimension. In specific implementation, a sigmod function may be used to perform normalization operation on the similarity constraint metric, that is, the similarity metric based on semantic similarity and the constraint metric based on a symmetric metric mechanism may be normalized to a [0,1] interval. The structure from which the loss function can be derived is:

loss＝

wherein α and β are preset parameters. After the loss function is obtained, the loss function can approach 0 as much as possible, then model parameters corresponding to the machine learning model can be determined, and finally the machine learning model can be generated through learning and training.

Step 6: after the machine learning model is acquired, the data to be recognized may be analyzed and processed by the machine learning model, and at least one positive entity data corresponding to the data to be recognized is determined.

And 7: and determining the positive entity data with the highest similarity with the data to be identified as target entity data in the at least one positive entity data.

And 8: and processing the target entity data, and detecting whether the data to be identified meets the preset requirements.

The data processing method provided by the application embodiment can achieve the following effects:

(1) by directly adopting an end-to-end data processing mode, the whole section of commodity information is taken as the reference information to carry out entity link without recalling in advance, so that the phenomena of mistaken recall, over-recall or no recall and the like can be directly avoided.

(2) Because the traditional entity linking method depends heavily on the matching degree of the commodity information and the entity information in the sequencing stage, in order to overcome the situation, the application embodiment can effectively fuse the positive example entity information and the negative example entity information into the commodity information through the reconstruction statement, and then can realize the reconstruction of the entity information and the commodity information, so that the similarity relation between the entity information and the original commodity information can be directly obtained, the manual feature extraction in the sequencing stage is not needed, and a reasonable similarity measurement index can be obtained under the condition that the matching degree of the original commodity information and the entity information is not enough.

(3) In the conventional method, only the relationship between the commodity information and the entity is considered, so that the positive example entity of the trained machine learning model is within the similar radius range of the same commodity, and the negative example commodity is outside the similar radius range of the same commodity. However, it may happen that the distance between the positive example entity and the merchandise is greater than the distance between the positive example entity and the negative example entity, which may result in the positive example entity and the negative example entity being very similar to each other with respect to the original merchandise information. For example, the original merchandise information includes "one before sleep, thunder is still". In the conventional entity linking scheme, the target data object corresponding to the original commodity information may include: "vitamin B6 health product", "hypnotic", and "spirits", and the health product and hypnotic may be ranked very close to each other and difficult to distinguish.

In view of the above problems, the present application embodiment provides a restriction mechanism based on symmetry measurement, and on the premise of considering the distance between the original commodity information and the positive example entity, the distance between the positive example entity and the negative example entity can be considered to be increased, so that the precision of recall in entity linking can be improved. For example, the original merchandise information includes "one before sleep, thunder is still". By processing the original commodity information through the method provided by the application embodiment, the target data object corresponding to the original commodity information can be ensured to comprise: the distance between the vitamin B6 health product, the hypnotic and the spirits is far enough, so that the prescription drugs such as the hypnotic can be accurately linked in the process of finally determining the commodity entity, and the illegal commodity is identified.

(4) The method provided by the application embodiment considers the relationship between the original commodity information and the positive case entity, the negative case entity and the irrelevant entity, and considers the restriction relationship between the entities, so that the method is favorable for measuring the link relationship between the original commodity information and different entities more accurately, and further realizes the identification operation of illegal commodities more reasonably and accurately.

Fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention; referring to fig. 11, the present embodiment provides a data processing apparatus, which may execute the data processing method shown in fig. 1, and the data processing apparatus may include: the device comprises a first acquisition module 11, a first determination module 12 and a first processing module 13. In particular, the method comprises the following steps of,

the first obtaining module 11 is configured to obtain data to be processed.

The first determining module 13 is configured to determine at least one first reference data and at least one second reference data corresponding to the data to be processed, where a similarity between the at least one first reference data and the data to be processed is greater than or equal to a preset threshold, and a similarity between the at least one second reference data and the data to be processed is less than the preset threshold.

A first processing module 14, configured to determine, according to the at least one first reference data and the at least one second reference data, a target data object corresponding to the data to be processed.

In some examples, after obtaining the data to be processed, the first processing module 14 in this embodiment is further configured to: dividing data to be processed into a plurality of data fragments; and marking each data fragment to obtain fragment identification information corresponding to the data fragment.

In some examples, the segment identification information includes at least one of: attribute identification information, identification information of part of speech and syntactic structure, and phrase identification information.

In some examples, the attribute identification information has a higher priority than the identification information of part-of-speech and syntactic structure; the identification information of the part of speech and the syntactic structure is higher in priority than the phrase identification information.

In some examples, when the first processing module 14 marks each data segment and obtains segment identification information corresponding to the data segment, the first processing module 14 may be configured to perform: when the data segment is marked as the attribute identification information, determining the segment identification information corresponding to the data segment as the attribute identification information; or when the data segment mark is not the attribute identification information, the data segment is divided into data phrases, and the data phrases are marked.

In some examples, when the first processing module 14 marks a data phrase, the first processing module 14 may be configured to perform: when the data phrase is marked as identification information of a part of speech and a syntactic structure, determining segment identification information corresponding to the data phrase as the identification information of the part of speech and the syntactic structure; or, when the data phrase mark is not the identification information of the part of speech and the syntactic structure, determining the segment identification information corresponding to the data phrase as the phrase identification information.

In some examples, the at least one second reference data comprises at least one of: negative reference data and irrelevant reference data.

In some examples, when the first processing module 14 determines the target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data, the first processing module 14 may be configured to perform: replacing at least part of data of the data to be processed by utilizing at least one first reference data to obtain at least one first data corresponding to the data to be processed; replacing at least part of data of the data to be processed by utilizing at least one second reference data to obtain at least one second data corresponding to the data to be processed; and determining a target data object corresponding to the data to be processed according to the at least one first data and the at least one second data.

In some examples, when the first processing module 14 replaces at least part of the data to be processed with the at least one first reference data to obtain at least one first data corresponding to the data to be processed, the first processing module 14 may be configured to perform: acquiring first replacement strategy information corresponding to data to be processed; and replacing at least part of the data to be processed by utilizing at least one first reference data based on the first replacement strategy information to obtain at least one first data.

In some examples, the first replacement policy information includes at least one of: the replacement priority corresponding to the attribute identification information is higher than the replacement priority corresponding to the identification information of the part of speech and the syntactic structure; the replacement priority corresponding to the identification information of the part of speech and the syntactic structure is higher than the replacement priority corresponding to the phrase identification information; and the proportion information of the replacement data in the data to be processed.

In some examples, when the first processing module 14 replaces at least part of the data to be processed with the at least one second reference data to obtain at least one second data corresponding to the data to be processed, the first processing module 14 may be configured to perform: acquiring second replacement strategy information corresponding to the data to be processed; and replacing at least part of the data to be processed by utilizing at least one second reference data based on the second replacement strategy information to obtain at least one second data.

In some examples, the second replacement policy information includes at least one of: the replacement priority corresponding to the attribute identification information is higher than the replacement priority corresponding to the identification information of the part of speech and the syntactic structure; the replacement priority corresponding to the identification information of the part of speech and the syntactic structure is higher than the replacement priority corresponding to the phrase identification information; and the proportion information of the replacement data in the data to be processed.

In some examples, when the first processing module 14 determines the target data object corresponding to the data to be processed according to the at least one first reference data and the at least one second reference data, the first processing module 14 may be configured to: acquiring a first similarity between the data to be processed and at least one first data, a second similarity between the data to be processed and at least one second data, and a third similarity between the data to be processed and the at least one first data and the at least one second data; and determining a target data object corresponding to the data to be processed in the at least one first data according to the first similarity, the second similarity and the third similarity.

In some examples, when the first processing module 14 determines the target data object corresponding to the data to be processed in the at least one first data according to the first similarity, the second similarity and the third similarity, the first processing module 14 may be configured to: determining at least one alternative data, in the at least one first data, of which the first similarity is greater than or equal to a first preset threshold and the second similarity and the third similarity are both less than a second preset threshold, wherein the second preset threshold is less than the first preset threshold; in the at least one alternative data, a target data object corresponding to the data to be processed is determined.

In some examples, when the first processing module 14 determines the target data object corresponding to the data to be processed in the at least one candidate data, the first processing module 14 may be configured to: and determining the candidate data with the maximum first similarity as the target data object corresponding to the data to be processed in the at least one candidate data.

In some examples, after determining the target data object corresponding to the data to be processed, the first processing module 14 in this embodiment may be configured to: and auditing the target data object to identify whether the data to be processed meet the preset requirement.

The apparatus shown in fig. 11 can perform the method of the embodiments shown in fig. 1-8 and 10, and the related description of the embodiments shown in fig. 1-8 and 10 can be referred to for the parts not described in detail in this embodiment. The implementation process and technical effect of the technical solution are described in the embodiments shown in fig. 1 to 8 and 10, and are not described again here.

In one possible design, the structure of the data processing apparatus shown in fig. 11 may be implemented as an electronic device, as shown in fig. 12, which may include: a first processor 21, a first memory 22. The first memory 22 stores executable codes thereon, and when the executable codes are executed by the first processor 21, the first processor 21 is enabled to at least implement the data processing method provided in the embodiments of fig. 1 to 8 and 10.

Optionally, the electronic device may further include a first communication interface 23 for communicating with other devices.

In addition, the present invention provides a non-transitory machine-readable storage medium, on which executable codes are stored, and when the executable codes are executed by a processor of an electronic device, the processor is enabled to at least implement the data processing method provided in the foregoing embodiments shown in fig. 1 to fig. 8 and fig. 10.

FIG. 13 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention; referring to fig. 13, the present embodiment provides a training apparatus for a model, which can perform the above-mentioned training method for the model shown in fig. 9, and the training apparatus for a model can include: a second obtaining module 31, a second determining module 32, and a second processing module 33. In particular, the method comprises the following steps of,

a second obtaining module 31, configured to obtain the first data;

a second determining module 32, configured to determine positive sample data, negative sample data, and irrelevant sample data corresponding to the first data, where a similarity between the positive sample data and the first data is greater than or equal to a preset threshold, and a similarity between the negative sample data and the irrelevant sample data and the first data is less than the preset threshold;

the second processing module 33 is configured to perform learning training on the first data, the positive sample data, the negative sample data, and the irrelevant sample data to obtain a data processing model, where the data processing model is used to determine a data object corresponding to the data.

In some examples, when the second processing module 33 performs learning training on the first data, the positive sample data, the negative sample data, and the irrelevant sample data to obtain the data processing model, the second processing module 33 may be configured to perform: replacing at least part of data of the first data by using the positive sample data to obtain first training data corresponding to the first data; replacing at least part of data of the first data by using negative sample data to obtain second training data corresponding to the first data; replacing at least part of data of the first data by using irrelevant sample data to obtain third training data corresponding to the first data; and performing learning training on the first training data, the second training data, the third training data and the first data to obtain a data processing model.

In some examples, a first similarity between the first data and the first training data is greater than or equal to a first preset threshold; the second similarity between the first data and the second training data, the third similarity between the first data and the third training data, the fourth similarity between the first training data and the second training data, the fifth similarity between the first training data and the third training data, and the sixth similarity between the first training data and the third training data are all smaller than a second preset threshold, wherein the second preset threshold is smaller than the first preset threshold.

The apparatus shown in fig. 13 can perform the method of the embodiment shown in fig. 9-10, and reference may be made to the related description of the embodiment shown in fig. 9-10 for parts of this embodiment that are not described in detail. The implementation process and technical effect of the technical solution are described in the embodiments shown in fig. 9 to 10, and are not described herein again.

In one possible design, the structure of the training apparatus of the model shown in fig. 13 may be implemented as an electronic device, as shown in fig. 14, which may include: a second processor 41, a second memory 42. Wherein the second memory 42 has stored thereon executable code, which when executed by the first processor 41, makes the second processor 41 at least implement the training method of the model as provided in the embodiments of fig. 9-10 described above.

Optionally, the electronic device may further include a second communication interface 43 for communicating with other devices.

In addition, the present invention provides a non-transitory machine-readable storage medium, on which executable code is stored, and when the executable code is executed by a processor of an electronic device, the processor is enabled to implement at least the training method of the model provided in the embodiments shown in fig. 9 to 10.

The above-described apparatus embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by adding a necessary general hardware platform, and of course, can also be implemented by a combination of hardware and software. With this understanding in mind, the above-described aspects and portions of the present technology which contribute substantially or in part to the prior art may be embodied in the form of a computer program product, which may be embodied on one or more computer-usable storage media having computer-usable program code embodied therein, including without limitation disk storage, CD-ROM, optical storage, and the like.

The data processing method and the model training method provided in the embodiments of the present invention may be executed by a program/software, where the program/software may be provided by a network side, and the program/software may be, for example, an instant messaging application program mentioned in the foregoing embodiments, and the user end or the client end mentioned in the foregoing embodiments may download the program/software into a local nonvolatile storage medium, and read the program/software into a memory by a CPU when it needs to execute the management method of the server, and then execute the program/software by the CPU to implement the data processing method and the model training method provided in the foregoing embodiments, and an execution process may refer to the schematic diagrams in fig. 1 to fig. 14.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for processing data, comprising:

acquiring data to be processed;

2. The method of claim 1, wherein after obtaining the data to be processed, the method further comprises:

dividing the data to be processed into a plurality of data fragments;

marking each data fragment to obtain fragment identification information corresponding to the data fragment.

3. The method of claim 2, wherein the segment identification information comprises at least one of: attribute identification information, identification information of part of speech and syntactic structure, and phrase identification information.

4. The method of claim 3,

the priority of the attribute identification information is higher than that of the identification information of the part of speech and the syntactic structure;

the identification information of the part of speech and the syntactic structure is higher than the priority of the phrase identification information.

5. The method of claim 2, wherein marking each data segment and obtaining segment identification information corresponding to the data segment comprises:

when the data segment is marked as attribute identification information, determining the segment identification information corresponding to the data segment as the attribute identification information; alternatively, the first and second electrodes may be,

and when the data fragment mark is not the attribute identification information, splitting the data fragment into data phrases and marking the data phrases.

6. The method of claim 5, wherein marking the data phrase comprises:

when the data phrase is marked as identification information of a part of speech and a syntactic structure, determining that the segment identification information corresponding to the data phrase is identification information of the part of speech and the syntactic structure; alternatively, the first and second electrodes may be,

and when the data phrase mark is not the identification information of the part of speech and the syntactic structure, determining the segment identification information corresponding to the data phrase as phrase identification information.

7. The method of claim 1, wherein the at least one second reference datum comprises at least one of: negative reference data and irrelevant reference data.

8. The method of claim 3, wherein determining a target data object corresponding to the data to be processed from the at least one first reference data and the at least one second reference data comprises:

replacing at least part of the data to be processed by utilizing the at least one first reference data to obtain at least one first data corresponding to the data to be processed;

replacing at least part of the data to be processed by using the at least one second reference data to obtain at least one second data corresponding to the data to be processed;

and determining a target data object corresponding to the data to be processed according to the at least one first data and the at least one second data.

9. The method according to claim 8, wherein replacing at least part of the data to be processed with the at least one first reference data, obtaining at least one first data corresponding to the data to be processed, comprises:

acquiring first replacement strategy information corresponding to the data to be processed;

and replacing at least part of the data to be processed by utilizing the at least one first reference data based on the first replacement strategy information to obtain the at least one first data.

10. The method of claim 9, wherein the first replacement policy information comprises at least one of:

the replacement priority corresponding to the attribute identification information is higher than the replacement priority corresponding to the identification information of the part of speech and the syntactic structure;

the replacement priority corresponding to the identification information of the part of speech and the syntactic structure is higher than the replacement priority corresponding to the phrase identification information;

and the proportion information of the replacement data in the data to be processed.

11. The method according to claim 8, wherein replacing at least part of the data to be processed with the at least one second reference data, and obtaining at least one second data corresponding to the data to be processed, comprises:

acquiring second replacement strategy information corresponding to the data to be processed;

and replacing at least part of the data to be processed by using the at least one second reference data based on the second replacement strategy information to obtain the at least one second data.

12. The method of claim 11, wherein the second replacement policy information comprises at least one of:

13. The method of claim 3, wherein determining a target data object corresponding to the data to be processed from the at least one first reference data and the at least one second reference data comprises:

acquiring a first similarity between the data to be processed and the at least one first data, a second similarity between the data to be processed and the at least one second data, and a third similarity between the data to be processed and the at least one first data and the at least one second data;

and determining a target data object corresponding to the data to be processed in the at least one piece of first data according to the first similarity, the second similarity and the third similarity.

14. The method of claim 13, wherein determining a target data object corresponding to the data to be processed in the at least one first data according to the first similarity, the second similarity, and the third similarity comprises:

determining at least one alternative data, in the at least one first data, where the first similarity is greater than or equal to a first preset threshold, and the second similarity and the third similarity are both less than a second preset threshold, where the second preset threshold is less than the first preset threshold;

and determining a target data object corresponding to the data to be processed in the at least one alternative data.

15. The method according to claim 14, wherein determining, among the at least one candidate data, a target data object corresponding to the data to be processed comprises:

and determining the candidate data with the maximum first similarity as the target data object corresponding to the data to be processed in the at least one candidate data.

16. The method according to any of claims 1-15, wherein after determining a target data object corresponding to the data to be processed, the method further comprises:

and auditing the target data object to identify whether the data to be processed meets preset requirements.

17. A method of training a model, comprising:

acquiring first data;

18. The method of claim 17, wherein learning training the first data, positive sample data, negative sample data, and irrelevant sample data to obtain a data processing model comprises:

replacing at least part of data of the first data by the positive sample data to obtain first training data corresponding to the first data;

replacing at least part of data of the first data by using the negative sample data to obtain second training data corresponding to the first data;

replacing at least part of data of the first data by the irrelevant sample data to obtain third training data corresponding to the first data;

and performing learning training on the first training data, the second training data, the third training data and the first data to obtain the data processing model.

19. The method of claim 18,

a first similarity between the first data and the first training data is greater than or equal to a first preset threshold;

a second similarity between the first data and the second training data, a third similarity between the first data and the third training data, a fourth similarity between the first training data and the second training data, a fifth similarity between the first training data and the third training data, and a sixth similarity between the second training data and the third training data are all smaller than a second preset threshold, wherein the second preset threshold is smaller than the first preset threshold.

20. An apparatus for processing data, comprising:

the first acquisition module is used for acquiring data to be processed;

21. An electronic device, comprising: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform a method of processing data as claimed in any one of claims 1 to 16.

22. An apparatus for training a model, comprising:

the second acquisition module is used for acquiring the first data;

23. An electronic device, comprising: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform a method of training a model according to any one of claims 17 to 19.