CN111552703B - Data processing method and device

Info

Publication number: CN111552703B
Application number: CN202010453465.8A
Authority: CN
Inventors: 李身宇
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-05-25
Filing date: 2020-05-25
Publication date: 2023-11-21
Anticipated expiration: 2040-05-25
Also published as: CN111552703A

Abstract

A data processing method and apparatus, the method comprising: acquiring supervision data to be processed, wherein the supervision data to be processed comprises supervision data obtained by performing data fusion on supervision data from a plurality of data sources, and the supervision data carries an identifier of a supervised object, attributes of the supervised object, and an update time of the supervision data; retrieving, from pre-stored historical supervision data, a plurality of pieces of historical supervision data whose carried attributes of the supervised object match the supervision data to be processed and whose update time is earlier than that of the supervision data to be processed; and determining the historical supervision data with the highest credibility score among the plurality of pieces of historical supervision data, and replacing the identifier of the supervised object carried in the supervision data to be processed with the identifier of the supervised object carried in the historical supervision data with the highest credibility score, wherein the credibility score indicates the credibility of the supervision data.

Description

Data processing method and device

Technical Field

The present disclosure relates to the field of computer applications, and in particular, to a data processing method and apparatus.

Background

With the improvement of informatization, computer systems are commonly used to supervise objects such as enterprises over a network. Specifically, the computer system can acquire various information about a supervised object from multiple data sources, fuse the information acquired from those sources, and store it in a supervision database for query and management by relevant personnel.

However, under the above scheme, information from different data sources may carry different identifiers, and fusion often generates new identifiers. As a result, the identifier corresponding to the records of the same supervised object in the supervision database may keep changing, which ultimately reduces query and management efficiency.

Disclosure of Invention

In view of this, the present specification discloses a data processing method and apparatus.

According to a first aspect of embodiments of the present specification, a data processing method is disclosed, the method comprising:

acquiring supervision data to be processed, wherein the supervision data to be processed comprises supervision data obtained by performing data fusion on supervision data from a plurality of data sources, and the supervision data carries an identifier of a supervised object, attributes of the supervised object, and an update time of the supervision data;

retrieving, from pre-stored historical supervision data, a plurality of pieces of historical supervision data whose carried attributes of the supervised object match the supervision data to be processed and whose update time is earlier than that of the supervision data to be processed;

determining the historical supervision data with the highest credibility score among the plurality of pieces of historical supervision data, and replacing the identifier of the supervised object carried in the supervision data to be processed with the identifier of the supervised object carried in the historical supervision data with the highest credibility score, wherein the credibility score indicates the credibility of the supervision data.

According to a second aspect of embodiments of the present specification, there is disclosed a data processing apparatus, the apparatus comprising:

an acquisition module, configured to acquire supervision data to be processed, wherein the supervision data to be processed comprises supervision data obtained by performing data fusion on supervision data from a plurality of data sources, and the supervision data carries an identifier of a supervised object, attributes of the supervised object, and an update time of the supervision data;

a first retrieval module, configured to retrieve, from pre-stored historical supervision data, a plurality of pieces of historical supervision data whose carried attributes of the supervised object match the supervision data to be processed and whose update time is earlier than that of the supervision data to be processed;

a first replacement module, configured to determine the historical supervision data with the highest credibility score among the plurality of pieces of historical supervision data, and to replace the identifier of the supervised object carried in the supervision data to be processed with the identifier of the supervised object carried in the historical supervision data with the highest credibility score, wherein the credibility score indicates the credibility of the supervision data.

In the above technical solution, on the one hand, the attributes of the supervised object from all of the pre-fusion supervision data are retained, which ensures full use of the data sources and improves the coverage of data acquisition for the supervised object.

On the other hand, since the pieces of historical supervision data retrieved in this solution match the supervision data to be processed in their carried attributes and have an earlier update time, replacing the identifier of the supervised object carried in the supervision data to be processed with the identifier carried by the most credible of that historical supervision data avoids changes of the identifier after data fusion, improves the credibility and stability of the identifier of the supervised object in the resulting supervision data, and thus improves the efficiency of subsequent query and management of the supervised object.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles.

FIG. 1 is an exemplary diagram of an application scenario of a data processing method described in the present specification;

FIG. 2 is a flowchart illustrating a data processing method described in the present specification;

FIG. 3 is an exemplary diagram of data fusion and splitting as described herein;

FIG. 4 is a diagram showing an example of the structure of a data processing apparatus described in the present specification;

FIG. 5 is a diagram showing an example of the structure of a computer device for data processing described in this specification.

Detailed Description

In order to better understand the technical solutions in one or more embodiments of the present specification, these technical solutions will be described clearly and completely below with reference to the accompanying drawings. It will be apparent that the described embodiments are only some embodiments and not all embodiments. All other embodiments obtained by those of ordinary skill in the art based on one or more embodiments of the present disclosure without inventive effort are intended to be within the scope of the present disclosure.

When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of systems and methods that are consistent with some aspects of the present description as detailed in the accompanying claims.

The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.

With the improvement of informatization, computer systems are commonly used to supervise objects such as enterprises over a network. Specifically, the computer system can acquire various data about a supervised object from multiple data sources, fuse the data acquired from those sources, and store it in a supervision database in the form of identifier-attribute pairs for query and management by relevant personnel.

For example, an enterprise supervision system may obtain the identifier (ID) of a supervised enterprise, together with attributes such as enterprise name, operating status, equity structure, and risk data, from data sources such as web crawlers, and store these corresponding data in a supervision database for query and management by relevant personnel.

However, under the above scheme, data from different data sources may carry different identifiers, and fusion often generates new identifiers. As a result, the identifier corresponding to the same supervised object in the supervision database may keep changing, which ultimately reduces query and management efficiency.

For example, an enterprise supervision system acquires supervision data A and supervision data B for the same enterprise from two data sources A and B, respectively; the two pieces of supervision data carry identifier A and identifier B, respectively, and fusing them generates supervision data C with identifier C. For an administrator, the enterprise then has three identifiers, A, B, and C, and the identifier is likely to change again with each further fusion, so multiple identifiers must be taken into account during management and query, which is inefficient.

If, in order to improve the efficiency of subsequent management and query, only data from a single data source were used, data fusion would be unnecessary, and the problem of identifiers changing due to data fusion would largely be avoided; however, compared with schemes that acquire data from multiple data sources, such an approach provides poor coverage of data acquisition for the supervised object.

In view of this, the present specification discloses a technical solution in which the identifier of the supervised object carried in the fused supervision data is replaced with the identifier of the supervised object carried by the most credible of the corresponding pre-fusion supervision data.

In implementation, the corresponding pre-fusion supervision data can be obtained by searching the pre-stored historical supervision data according to the carried attributes of the supervised object and the time information; the piece with the highest credibility score is then selected from the results, and the identifier of the supervised object it carries is used to replace the identifier in the fused supervision data.

By applying this scheme, the attributes of the supervised object from all of the pre-fusion supervision data are retained, which ensures full use of the data sources and improves the coverage of data acquisition for the supervised object.

The following describes the solution in the present specification through specific embodiments and with reference to specific application scenarios.

Referring to fig. 1, fig. 1 is an exemplary diagram of an application scenario of a data processing method shown in the present specification;

as shown in fig. 1, in this scenario, the supervisory system may include an interconnected supervisory database and data processing server; the data processing server is connected with a plurality of external data sources; it should be understood that the data processing server, the supervisory database and the data source may be separate devices that provide different functions, or may be multiple programs that execute different functions on the same device, which is not limited in this specification.

The supervision database may be any type of database, for example, a relational or non-relational database, a dedicated database or one shared with other functions, a single database server, a database server cluster, or a distributed database, none of which is particularly limited in this specification. The supervision database can store the identifiers of supervised objects and the corresponding attributes for supervision and management; in practice, any component capable of performing these functions can be regarded as the supervision database.

The data processing server is a server structurally positioned between a data source and a supervision database, and can functionally process source data acquired from the data source so as to store the source data into the supervision database; the specific form may be a stand-alone server device, a running server program, a group of servers capable of providing the functions described above, etc., and the specific implementation form is not specifically limited in this specification, and one skilled in the art may determine the specific implementation form according to specific requirements and related technical documents.

The data source is a component that is structurally connected to the data processing server in the supervision system and that functionally provides information about the objects to be supervised. It may be an entire data acquisition system that cooperates with the supervision system, or merely the last stage of the path along which information about the supervised object reaches the supervision system. In terms of implementation, it may be a program such as a web crawler, a forwarding program that obtains information from other programs such as web crawlers and forwards it to the data processing server, a preprocessing program that performs data mining and data cleaning and returns the preprocessed data to the data processing server, and so on. Those skilled in the art may implement the design according to specific needs, and the present disclosure is not limited in this respect.

It will be appreciated that the above supervision system may also include other components, such as components that perform supervision analysis by calling data in the supervision database; these have been omitted from FIG. 1 because the technical solutions disclosed below are not substantially related to their design. The data sources, the data processing server, and the supervision database may also provide other functions at the same time; for example, the data processing server may also be responsible for tasks such as fusing and splitting supervision data from multiple data sources before executing the data processing method disclosed in this specification. Therefore, the present disclosure does not limit the other components of the supervision system or the other functions that its components may perform, and those skilled in the art may design the other parts of the supervision system according to specific needs and related technical documents.

It may also be appreciated that the above scenario including the supervision system is only one possible example; in other scenarios where the same problem needs to be solved, those skilled in the art can adapt the design themselves. For example, if the data to be processed is supervision data that has completed data fusion but has not yet been stored in the supervision database, the method may be executed by the data processing server after data fusion is completed; if the supervision data has already been stored in the supervision database, the execution subject of the data processing method only needs to be connected to the supervision database and need not be the data processing server connected to both the data sources and the supervision database. Therefore, this specification does not limit this in detail.

Referring to FIG. 2, FIG. 2 is a flowchart of a data processing method according to an embodiment of the present disclosure. The method includes the following steps:

S201, acquiring supervision data to be processed, wherein the supervision data to be processed comprises supervision data obtained by performing data fusion on supervision data from a plurality of data sources, and the supervision data carries an identifier of a supervised object, attributes of the supervised object, and an update time of the supervision data;

S202, retrieving, from pre-stored historical supervision data, a plurality of pieces of historical supervision data whose carried attributes of the supervised object match the supervision data to be processed and whose update time is earlier than that of the supervision data to be processed;

S203, determining the historical supervision data with the highest credibility score among the plurality of pieces of historical supervision data, and replacing the identifier of the supervised object carried in the supervision data to be processed with the identifier of the supervised object carried in the historical supervision data with the highest credibility score, wherein the credibility score indicates the credibility of the supervision data.
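Steps S201 to S203 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Record` type, its field names, the exact-equality attribute match, and the attribute-count score are all hypothetical choices introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical supervision record: identifier, attributes, update time."""
    object_id: str
    attributes: Dict[str, str]
    updated_at: int  # any value that orders updates, e.g. a timestamp


def credibility(record: Record) -> float:
    # Assumption: the score grows with the number of carried attributes.
    return float(len(record.attributes))


def matches(history: Record, pending: Record) -> bool:
    # Assumption: a historical record matches when every attribute it carries
    # also appears, with the same value, in the pending record.
    return bool(history.attributes) and all(
        pending.attributes.get(k) == v for k, v in history.attributes.items()
    )


def stabilize_id(pending: Record, history: List[Record]) -> Record:
    # S202: retrieve matching historical records updated earlier than `pending`.
    candidates = [
        h for h in history
        if h.updated_at < pending.updated_at and matches(h, pending)
    ]
    if not candidates:
        return pending  # nothing to inherit; keep the current identifier
    # S203: take the identifier of the most credible candidate.
    best = max(candidates, key=credibility)
    return Record(best.object_id, pending.attributes, pending.updated_at)
```

In this sketch a record with no matching history simply keeps its identifier; how such a case is handled in practice is left open by the specification.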

The identifier of the supervised object comprises information that can uniquely identify the supervised object. For example, when the supervised object is an enterprise, the identifier may be an industry and commerce registration number, an enterprise standard number, or the like; when the supervised object is a personal asset, the identifier may be an asset account identification code or the like. It will be appreciated that data sources may also use custom identifier formats, such as a combination of asset account and ID number, and the present description is not intended to be limiting. In addition, since this scheme may involve pre-fusion supervision data from multiple different data sources, it may also involve multiple identifier formats; the identifiers of the same supervised object in data acquired from different data sources may therefore differ.

The attributes of the supervised object comprise information describing the characteristics or state of the corresponding supervised object. For example, when the supervised object is an enterprise, the attributes may include enterprise name, enterprise type, operating status, equity structure, risk analysis information, and so forth; when the supervised object is a personal asset, the attributes may include asset type, asset amount, expected appreciation analysis information, and so on. It will be appreciated that data sources may also use custom attribute formats, such as a "feature-value" combination format, and the present description is not intended to be limiting. Furthermore, since this scheme may involve multiple different data sources, and thus multiple attribute representations, the same attribute of the same supervised object may be represented differently in data acquired from different data sources.

The update time of the supervision data may, in the case where the supervision data to be processed is obtained by performing data fusion on supervision data from multiple data sources, indicate the time at which the corresponding data fusion took place. Specifically, it may take the form of a timestamp, or any other counting form capable of expressing temporal order, for example an incrementing magic number; this specification does not specifically limit this, and those skilled in the art may select an implementation according to specific requirements.
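Since only the ordering of update times matters for the later retrieval step, either representation can serve. A minimal sketch, assuming a hypothetical `UpdateClock` helper that is not part of the specification:

```python
import itertools
import time


class UpdateClock:
    """Hypothetical source of update markers; only their ordering matters."""

    def __init__(self, use_counter: bool = True) -> None:
        self._use_counter = use_counter
        self._counter = itertools.count(1)

    def now(self) -> int:
        if self._use_counter:
            # Incrementing sequence value: strictly increasing on every call.
            return next(self._counter)
        # Wall-clock timestamp in nanoseconds (may repeat on coarse clocks).
        return time.time_ns()


def is_earlier(history_mark: int, pending_mark: int) -> bool:
    # Retrieval condition: the historical record was updated before the pending one.
    return history_mark < pending_mark
```

A counter guarantees a strict order within one process, whereas timestamps are easier to compare across systems; which trade-off fits is an implementation decision.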

The data fusion can be a process of combining multiple pieces of supervision data into newly-built supervision data; for convenience of description, the supervisory data for the data fusion will be hereinafter referred to as pre-fusion supervisory data, and the product of the data fusion will be referred to as post-fusion supervisory data.

The credibility score comprises a score capable of indicating the credibility of the supervision data. Specifically, the score may be derived by combining multiple evaluation dimensions. For example, a weight may be preset for the data source from which the supervision data is derived: announcements from more credible government departments may be given a higher credibility weight, and less credible personal information-publishing accounts a lower one. As another example, since supervision data carrying richer attribute types of the supervised object generally has higher credibility, it can be given a higher credibility score; conversely, if a piece of supervision data carries few attributes of the supervised object, it probably does not come from an information-rich data source and can be given a lower credibility score. The specific calculation of the credibility score is therefore not limited in this specification and can be determined by those skilled in the art according to specific requirements.

In one illustrated embodiment, the calculation of the credibility score may take into account the number of attributes of the supervised object carried by the corresponding supervision data. The more attributes of the supervised object a piece of supervision data carries, the richer the attribute information it contains about the supervised object, so it can be considered more credible; that is, the credibility score is positively correlated with the number of attributes of the supervised object carried by the corresponding supervision data.
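One way to combine the two dimensions mentioned above, a preset data-source weight and the number of carried attributes, is sketched below. The weight table, the source names, and the multiplicative formula are illustrative assumptions; the specification does not prescribe a formula.

```python
from typing import Dict

# Hypothetical preset weights per data source (assumed values).
SOURCE_WEIGHTS: Dict[str, float] = {
    "government_announcement": 1.0,
    "personal_account": 0.2,
}
DEFAULT_WEIGHT = 0.5  # assumed weight for sources not listed above


def credibility_score(source: str, attributes: Dict[str, str]) -> float:
    """Score rises with the source weight and with the number of attributes."""
    weight = SOURCE_WEIGHTS.get(source, DEFAULT_WEIGHT)
    return weight * len(attributes)
```

Any function that increases with the attribute count would satisfy the positive correlation described above; the product is used here only because it is the simplest such choice.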

In the present specification, the acquisition channel of the supervision data to be processed is not necessarily limited; for example, in the case where the to-be-processed supervision data is supervision data obtained by data fusion of supervision data from a plurality of data sources, the to-be-processed supervision data may be acquired and subjected to a subsequent data processing operation in the data processing server before being stored in the supervision database, or may be additionally extracted from the supervision database after being stored in the supervision database, and subjected to a subsequent data processing operation.

In one embodiment, after the data processing step is performed, the processed supervision data may be stored in a supervision database; specifically, the interaction mode and strategy with the supervision database can be determined according to specific requirements and with reference to related technologies, and the specification is not required to be limited specifically; for example, the data may be stored in batches to achieve better operating efficiency, or may be stored in real time to achieve lower latency.

In this specification, the pre-fusion supervision data can be obtained by searching the pre-stored historical supervision data. As described above, the attributes of the supervised object in the fused supervision data come from multiple pieces of pre-fusion supervision data, so the retrieval condition may include that the carried attributes of the supervised object match the fused supervision data; and because the update time of the fused supervision data is refreshed at fusion, the retrieval condition may also include that the update time is earlier than that of the supervision data to be processed.

Specifically, the matching may be strict character-string equality or equality at the semantic level; for example, the company name attributes "Apple company" and "apple company" may be regarded as semantically identical. If the attributes of the supervised objects carried by two pieces of supervision data match, it is reasonable to consider that the two pieces actually refer to the same supervised object; further, if the attributes carried by pieces of historical supervision data match the fused supervision data, those pieces of historical supervision data can be identified as the pre-fusion supervision data.

It can be understood that, in the above process, the specific criteria for determining the matching and the specific algorithm for the matching process can be determined by those skilled in the art according to specific requirements, and reference is made to the related art, which is not required to be specifically limited in this specification.
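The two matching modes can be sketched as follows. Exact matching compares strings directly; for semantic matching, this sketch substitutes a simple normalization plus a hypothetical alias table, which is only a stand-in for whatever semantic comparison an implementer chooses.

```python
from typing import Dict

# Hypothetical alias table mapping name variants to one canonical form.
ALIASES: Dict[str, str] = {
    "apple company": "apple",
    "apple inc.": "apple",
}


def exact_match(a: str, b: str) -> bool:
    # Strict character-string correspondence.
    return a == b


def canonical(value: str) -> str:
    # Normalize case and surrounding whitespace, then apply known aliases.
    key = value.strip().lower()
    return ALIASES.get(key, key)


def semantic_match(a: str, b: str) -> bool:
    # Stand-in for semantic equality: same canonical form.
    return canonical(a) == canonical(b)
```

With these definitions, two spellings of the same company name fail the exact match but pass the semantic one, while a different company name fails both.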

In the present specification, the supervision data with the highest credibility score may be determined from the retrieved pieces of supervision data. As mentioned above, the credibility score indicates the credibility of the supervision data, so if a piece of supervision data has a higher credibility score, it is reasonable to consider that the identifier of the supervised object it carries is also more credible.

For example, suppose supervision data A from a little-known website and supervision data B from an announcement by a relevant government department are retrieved together, and the credibility score of B is significantly higher than that of A; adopting the identifier of the supervised object carried in B then ensures the credibility of the identifier to a greater extent.

In the specification, the identification of the monitored object carried in the fused monitoring data can be replaced by the identification of the monitored object carried in the monitoring data with the highest determined reliability score; the specific implementation manner can be determined by one skilled in the art according to specific requirements and related technologies, and the specification is not required to be limited specifically.

Referring to the upper half of FIG. 3, which is an exemplary diagram of data fusion as described herein. In this example, there are three pieces of pre-fusion supervision data: supervision data with identifier ID-01 and attribute "Apple company", supervision data with identifier ID-02 and attribute "apple company", and supervision data with identifier ID-03 and attribute "big Apple company". The fused supervision data accordingly carries all three attributes: "Apple company", "apple company", and "big Apple company".

After the fused supervision data is obtained, the three pieces of pre-fusion supervision data are retrieved from the historical supervision data by character-string matching. The pre-fusion supervision data corresponding to ID-02 has the highest credibility score, so the identifier of the supervised object in the final fused data can be replaced with ID-02, as shown by the dotted line in FIG. 3.
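The fusion example can be replayed in a few lines. The tuple layout, the numeric scores, the update times, the extra unrelated record, and the placeholder identifier `ID-NEW` are assumptions made for illustration; the only facts taken from the example are the three identifiers, the three name attributes, and that the ID-02 record scores highest.

```python
from typing import List, NamedTuple, Set


class Pre(NamedTuple):
    object_id: str
    name: str
    updated_at: int
    score: float  # assumed credibility score


history: List[Pre] = [
    Pre("ID-01", "Apple company", 1, 0.6),
    Pre("ID-02", "apple company", 2, 0.9),  # highest score, as in the example
    Pre("ID-03", "big Apple company", 3, 0.4),
    Pre("ID-99", "Unrelated company", 1, 1.0),  # extra record; must not match
]

# Fused record: new placeholder identifier, all three names, later update time.
fused_id = "ID-NEW"
fused_names: Set[str] = {"Apple company", "apple company", "big Apple company"}
fused_updated_at = 4

# Retrieve pre-fusion records by string match and earlier update time.
retrieved = [
    h for h in history
    if h.name in fused_names and h.updated_at < fused_updated_at
]
# Replace the fused identifier with that of the highest-scoring record.
fused_id = max(retrieved, key=lambda h: h.score).object_id
print(fused_id)  # ID-02
```

Note that the unrelated record is excluded by the attribute match even though its assumed score is the highest, which is why retrieval must precede score comparison.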

Through this replacement, the identifier of the supervised object carried in the supervision data before and after fusion does not change frequently, which improves the stability of the identifier. Meanwhile, because the replacement identifier comes from the supervision data with the highest credibility score, the credibility of the identifier is also improved. Compared with the related art, the identifier of the supervised object in the resulting supervision data is more stable and more credible, so the efficiency of querying and managing the supervision data can be improved.

In this specification, the supervision data to be processed may further include multiple pieces of supervision data obtained by splitting fused supervision data. For example, an upstream data source may find that a piece of fused supervision data actually contains information from several supervised objects and therefore split it, generating multiple pieces of split supervision data corresponding to different supervised objects. As for how to determine whether fused supervision data contains information from several supervised objects, and how to perform the splitting operation, those skilled in the art can refer to the related technical descriptions; this specification does not specifically limit them.

It can be understood that the updated time carried by the split supervision data may indicate the time when the corresponding splitting operation occurs.

In one embodiment, for the multiple pieces of split supervision data, the pre-split supervision data may be retrieved from the pre-stored historical supervision data, and the identifier of the supervised object carried by the split supervision data with the highest credibility score may then be replaced with the identifier of the supervised object carried by the retrieved pre-split supervision data.

Specifically, since the attributes of the supervised object carried by the split supervision data come from the pre-split supervision data, the retrieval condition may include that the carried attributes of the supervised object match the multiple pieces of split supervision data; and since the update time of the split supervision data is necessarily later than that of the pre-split supervision data, the retrieval condition may also include that the update time is earlier than that of the supervision data to be processed.

Determining the supervision data with the highest credibility score among the split supervision data can be done either by directly reading a pre-stored marker of the supervision data with the highest credibility score, or by obtaining the credibility score of each piece of supervision data and then selecting the highest one. The credibility score of each piece may be calculated in real time for the split supervision data, or read from a preset data structure; those skilled in the art can freely select the specific implementation, which is not specifically limited in this specification.

Referring to the bottom half of fig. 3, which is an exemplary diagram of a data splitting scenario described in this specification: in this example, it is assumed that two name variants rendered as "Apple company" are found to refer to the same company, whereas "big Apple company" refers to a different company. In other words, the fused attribute of the supervised object described above actually indicates more than one company, so the fused supervision data is split.

After splitting, two pieces of supervision data are obtained: one corresponding to "big Apple company" and one corresponding to "Apple company". Assuming that the reliability score of the latter is higher, the identifier of the supervised object in the supervision data corresponding to "Apple company" can be replaced with the identifier of the supervised object carried by the supervision data before fusion, namely ID-02. The supervision data corresponding to "big Apple company" cannot inherit ID-02 because its reliability score is not high enough; instead, its earlier identifier ID-03 can be found and reused through further mining of the historical supervision data, or the system can assign it a new identifier, which is not specifically limited in this specification.
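The identifier inheritance in the splitting scenario can be sketched as follows. This is only a minimal illustration under stated assumptions: the `Record` fields (`object_id`, `attributes`, `score`) and the `new_id` callback are hypothetical names introduced here, and the reliability scores are assumed to be already available; the specification does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    """Hypothetical shape of one piece of supervision data."""
    object_id: Optional[str]
    attributes: Dict[str, str] = field(default_factory=dict)
    score: float = 0.0  # reliability score, assumed to be precomputed


def inherit_id_after_split(pre_split: Record,
                           split_records: List[Record],
                           new_id: Callable[[], str]) -> None:
    """Give the pre-split identifier to the most reliable split record.

    Every other split record receives an identifier from new_id(), which
    stands in for either reusing an earlier identifier found in the history
    or allocating a fresh one.
    """
    if not split_records:
        return
    best = max(split_records, key=lambda r: r.score)
    for record in split_records:
        record.object_id = pre_split.object_id if record is best else new_id()


# Mirrors the example above: the more reliable record inherits ID-02.
pre_split = Record("ID-02", {"name": "Apple company"}, 0.9)
apple = Record(None, {"name": "Apple company"}, 0.8)
big_apple = Record(None, {"name": "big Apple company"}, 0.4)
inherit_id_after_split(pre_split, [apple, big_apple], new_id=lambda: "ID-03")
```

Passing the fallback as a callback keeps the sketch neutral on whether the remaining records reuse an earlier identifier or receive a newly allocated one, since the specification leaves that choice open.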

In one embodiment, the method may further check, based on a preset duplicate-checking policy, whether the processed supervision data contains data indicating the same supervised object; if so, a deduplication operation is performed on that data. Specifically, among the data indicating the same supervised object, the piece with the highest reliability score may be determined and retained, and the remaining pieces deleted.
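The deduplication step can be sketched as below. The sketch assumes that a duplicate check has already assigned each record a grouping key, and that each record is a hypothetical `(group_key, record_id, score)` tuple; neither the tuple layout nor the names come from the specification.

```python
from typing import Dict, Hashable, List, Tuple

# (group_key, record_id, score): group_key is assumed to be the outcome of
# whatever duplicate check was applied beforehand.
ScoredRecord = Tuple[Hashable, str, float]


def deduplicate(records: List[ScoredRecord]) -> List[ScoredRecord]:
    """Keep only the highest-scoring record in each group of duplicates."""
    best: Dict[Hashable, ScoredRecord] = {}
    for rec in records:
        key = rec[0]
        # Retain the record seen so far with the highest reliability score.
        if key not in best or rec[2] > best[key][2]:
            best[key] = rec
    return list(best.values())
```

When two duplicates have equal scores, this sketch keeps the one encountered first; the specification does not say how ties should be resolved.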

In one embodiment shown, the above check may be performed in any one or more of the following ways. Semantic recognition may be performed on the processed supervision data, so as to determine from the recognition result whether the processed supervision data contains a plurality of pieces of supervision data indicating the same supervised object; for example, in the above example, the two name variants of "Apple company" are semantically similar, so it can be determined that both indicate the same supervised object.

Character string matching may be performed on the processed supervision data, so as to determine from the matching result whether the processed supervision data contains a plurality of pieces of supervision data indicating the same supervised object; for example, "Apple company" and "Apple (Apple) company" overlap to a high degree at the character string level, so they can be considered to indicate the same supervised object.

Keywords may also be extracted from the processed supervision data and queried in a third-party database, so as to determine from the query result whether the processed supervision data contains a plurality of pieces of supervision data indicating the same supervised object; for example, if both name variants of "Apple company", used as keywords, return the same entry in the third-party database, it can be determined that the two indicate the same supervised object.

Those skilled in the art may select or combine the above implementations according to specific requirements.
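Of the three checks above, character string matching is the simplest to illustrate. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as one possible similarity measure; the function names and the 0.7 threshold are assumptions made for illustration and are not taken from the specification.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def looks_like_same_object(name_a: str, name_b: str,
                           threshold: float = 0.7) -> bool:
    """Treat two names as duplicates when their similarity ratio is high.

    The threshold is an illustrative assumption and would need tuning on
    real data.
    """
    ratio = SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()
    return ratio >= threshold
```

A purely character-based measure such as this one would also rate "big Apple company" as close to "Apple company", which is one reason a practical system might combine string matching with the semantic or third-party database checks described above.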

In one embodiment, abnormal data may also be repaired using the pre-stored historical supervision data. Specifically, a preset abnormality detection algorithm may first be invoked to determine whether the processed supervision data contains an abnormality; in general, supervision data may be regarded as abnormal if the attribute of the supervised object it carries is abnormal, if the identifier of the supervised object it carries is abnormal, or both. If abnormal supervision data is found, the historical supervision data corresponding to the abnormal supervision data can be obtained from the pre-stored historical supervision data, and the abnormal supervision data can be repaired according to the obtained historical supervision data.

The abnormality detection algorithm may be designed by a person skilled in the art according to specific requirements and with reference to the related art, and is not described in detail in this specification. The abnormal supervision data may be repaired by directly replacing the abnormal part with the corresponding normal part of the obtained historical supervision data, or in any other feasible manner; those skilled in the art may select a specific implementation with reference to the related art, which is not specifically limited in this specification.
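The direct-replacement repair mode can be sketched as follows. Because the specification leaves the abnormality detection algorithm open, the sketch substitutes a deliberately simple, hypothetical rule (a field is abnormal when it is missing or blank); the function and field names are likewise assumptions.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def is_abnormal(value: Optional[str]) -> bool:
    """Hypothetical rule: a field is abnormal when it is missing or blank."""
    return value is None or value.strip() == ""


def repair(record: Record, history: Record) -> Record:
    """Return a copy of record whose abnormal fields are taken from history.

    Only fields that are normal in the historical record are copied over,
    and the input record itself is left unmodified.
    """
    repaired = dict(record)
    for key, old_value in history.items():
        if is_abnormal(repaired.get(key)) and not is_abnormal(old_value):
            repaired[key] = old_value
    return repaired
```

Returning a copy rather than editing in place keeps the abnormal original available, which may be useful if the repair later needs to be audited or reverted.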

Referring to fig. 4, fig. 4 is a diagram showing an example of the structure of the data processing apparatus described above in the present specification; the data processing apparatus may comprise the following modules:

an acquisition module 401 for acquiring the supervision data to be processed; the to-be-processed supervision data comprises supervision data obtained by carrying out data fusion on the supervision data from a plurality of data sources; the supervision data carries the identification of the supervised object, the attribute of the supervised object and the updated time of the supervision data;

a first retrieval module 402 for retrieving, from pre-stored historical supervision data, a plurality of pieces of historical supervision data whose carried attribute of the supervised object matches the supervision data to be processed and whose updated time is earlier than that of the supervision data to be processed;

a first replacing module 403 for determining the historical supervision data with the highest reliability score among the plurality of pieces of historical supervision data, and replacing the identifier of the supervised object carried in the supervision data to be processed with the identifier of the supervised object carried in the historical supervision data with the highest reliability score; wherein the reliability score indicates the reliability of the supervision data.

In the present specification, the channel for acquiring the to-be-processed supervision data by the acquisition module 401 is not limited; for example, in the case where the to-be-processed supervision data is supervision data obtained by data fusion of supervision data from a plurality of data sources, the to-be-processed supervision data may be acquired and subjected to a subsequent data processing operation in the data processing server before being stored in the supervision database, or may be additionally extracted from the supervision database after being stored in the supervision database, and subjected to a subsequent data processing operation.

In one embodiment, the apparatus may further include a storage module configured to store the processed supervision data in a supervision database; specifically, the interaction mode and strategy with the supervision database can be determined according to specific requirements and with reference to related technologies, and the specification is not required to be limited specifically; for example, the data may be stored in batches to achieve better operating efficiency, or may be stored in real time to achieve lower latency.

In this specification, the first retrieval module 402 may search the pre-stored historical supervision data to obtain the pieces of supervision data that existed before fusion. As can be seen from the foregoing description of the attribute of the supervised object, the attributes of the supervised object in the fused supervision data come from the pieces of supervision data before fusion, so the retrieval condition may include that the carried attribute of the supervised object matches the fused supervision data. In addition, because the updated time of the fused supervision data is refreshed when fusion takes place, the retrieval condition may also include that the updated time is earlier than that of the supervision data to be processed.
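The two retrieval conditions can be sketched as a simple filter. The `Record` fields and the matching rule (at least one shared attribute key/value pair) are assumptions introduced for illustration; the specification does not define what counts as a match.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical shape of one piece of supervision data."""
    object_id: str
    attributes: Dict[str, str]
    updated_at: datetime


def attributes_match(candidate: Record, target: Record) -> bool:
    """Assumed rule: at least one attribute key/value pair is shared."""
    return any(target.attributes.get(key) == value
               for key, value in candidate.attributes.items())


def retrieve_history(history: List[Record], target: Record) -> List[Record]:
    """Select historical records that match the target and predate it."""
    return [h for h in history
            if attributes_match(h, target) and h.updated_at < target.updated_at]
```

In a real deployment these conditions would more likely be expressed as a query against the store that holds the historical supervision data; the in-memory filter is used here only to keep the sketch self-contained.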

In this specification, the first replacing module 403 may determine the supervision data with the highest reliability score from the plurality of pieces of supervision data retrieved above. As mentioned above, the reliability score indicates the reliability of the supervision data, so if a piece of supervision data has a higher reliability score, it is reasonable to consider that the identifier of the supervised object it carries is also more reliable.

In this specification, the first replacing module 403 may replace the identifier of the monitored object carried in the fused monitoring data with the identifier of the monitored object carried in the monitoring data with the highest determined reliability score; the specific implementation manner can be determined by one skilled in the art according to specific requirements and related technologies, and the specification is not required to be limited specifically.
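The selection and replacement performed by the first replacing module can be sketched as below. The scoring rule shown (counting non-empty attributes) is only one simple choice consistent with the statement in the claims that the score is positively correlated with the number of attributes carried; the tuple layout and function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (object_id, attributes) pairs stand in for pieces of supervision data.
Record = Tuple[str, Dict[str, str]]


def reliability_score(record: Record) -> int:
    """Assumed scoring rule: count the non-empty attributes."""
    return sum(1 for value in record[1].values() if value)


def replace_identifier(fused: Record, retrieved: List[Record]) -> Record:
    """Return the fused record carrying the most reliable historical identifier."""
    if not retrieved:
        return fused  # nothing to inherit from; keep the current identifier
    best = max(retrieved, key=reliability_score)
    return (best[0], fused[1])
```

The attributes of the fused record are left untouched; only the identifier is swapped, matching the replacement described above.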

In one embodiment, the data processing apparatus may further include a second retrieval module and a second replacing module. For the plurality of pieces of split supervision data, the second retrieval module may retrieve the supervision data before splitting from the pre-stored historical supervision data, and the second replacing module may replace the identifier of the supervised object carried by the split supervision data with the highest reliability score with the identifier of the supervised object carried by the retrieved supervision data before splitting.

In one embodiment, the data processing apparatus may further include a deduplication module, which may check, based on a preset duplicate-checking policy, whether the processed supervision data contains data indicating the same supervised object; if so, a deduplication operation is performed on that data. Specifically, among the data indicating the same supervised object, the piece with the highest reliability score may be determined and retained, and the remaining pieces deleted.

In one embodiment, the deduplication module may perform the check in any one or more of the following ways: semantic recognition may be performed on the processed supervision data, so as to determine from the recognition result whether the processed supervision data contains a plurality of pieces of supervision data indicating the same supervised object; character string matching may be performed on the processed supervision data, so as to determine from the matching result whether the processed supervision data contains a plurality of pieces of supervision data indicating the same supervised object; and keywords may be extracted from the processed supervision data and queried in a third-party database, so as to determine from the query result whether the processed supervision data contains a plurality of pieces of supervision data indicating the same supervised object.

In one embodiment, the data processing apparatus may further include a repair module, which may repair abnormal data using the pre-stored historical supervision data. Specifically, the repair module may first invoke a preset abnormality detection algorithm to determine whether the processed supervision data contains an abnormality; in general, supervision data may be regarded as abnormal if the attribute of the supervised object it carries is abnormal, if the identifier of the supervised object it carries is abnormal, or both. If abnormal supervision data is found, the historical supervision data corresponding to the abnormal supervision data can be obtained from the pre-stored historical supervision data, and the abnormal supervision data can be repaired according to the obtained historical supervision data.

The abnormality detection algorithm may be designed by a person skilled in the art according to specific requirements and with reference to the related art, and is not described in detail in this specification. The abnormal supervision data may be repaired by directly replacing the abnormal part with the corresponding normal part of the earlier supervision data, or in any other feasible manner; those skilled in the art may select a specific implementation with reference to the related art, which is not specifically limited in this specification.

The embodiments of the present disclosure also provide a computer device, which at least includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the foregoing data processing method when executing the program.

FIG. 5 illustrates a more specific hardware architecture diagram of a computing device provided by embodiments of the present description, which may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.

The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits, etc., for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.

The memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.

The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.

Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).

Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).

It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.

The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the foregoing data processing method.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

From the foregoing description of embodiments, it will be apparent to those skilled in the art that the present embodiments may be implemented in software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the embodiments, or in some parts of the embodiments, of the present specification.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.

In this specification, the embodiments are described in a progressive manner: identical and similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference is made to the description of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and the functions of the modules may be implemented in one or more pieces of software and/or hardware when implementing the embodiments of the present disclosure. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

The foregoing is merely a specific implementation of the embodiments of this disclosure. It should be noted that a person skilled in the art may make several improvements and modifications without departing from the principles of the embodiments of this disclosure, and these improvements and modifications should also be considered as falling within the protection scope of the embodiments of this disclosure.

Claims

1. A data processing method, comprising:

determining the historical supervision data with the highest reliability score among the plurality of historical supervision data, and replacing the identification of the supervised object carried in the supervision data to be processed with the identification of the supervised object carried in the historical supervision data with the highest reliability score; wherein the reliability score indicates the reliability of the supervision data.

2. The method of claim 1, wherein the to-be-processed supervision data further comprises a plurality of supervision data obtained after further splitting the supervision data obtained by data fusion of the supervision data from the plurality of data sources;

The method further comprises the steps of:

retrieving, from the pre-stored historical supervision data, historical supervision data whose carried attribute of the supervised object matches the supervision data to be processed and whose updated time is earlier than that of the supervision data to be processed;

and determining the supervision data with the highest reliability score among the plurality of supervision data, and replacing the identification of the supervised object carried in the supervision data with the highest reliability score with the identification of the supervised object carried in the retrieved historical supervision data.

3. The method of claim 1, the method further comprising:

based on a preset duplicate checking strategy, checking whether the processed supervision data contains a plurality of supervision data indicating the same supervised object;

if so, further determining the supervision data with the highest reliability score among the plurality of supervision data, and deleting the supervision data other than the supervision data with the highest reliability score.

4. A method according to claim 3, wherein the checking whether the processed supervision data contains several supervision data indicating the same supervised object based on a preset duplication checking policy includes any one or more of the following combinations:

Carrying out semantic recognition on the processed supervision data to determine whether the processed supervision data contains a plurality of supervision data indicating the same supervised object according to the semantic recognition result;

performing character string matching on the processed supervision data to determine whether the processed supervision data contains a plurality of supervision data indicating the same supervised object according to a character string matching result;

and extracting keywords from the processed supervision data, and inquiring in a third party database to determine whether the processed supervision data contains a plurality of supervision data indicating the same supervised object according to the inquiring result.

5. The method of claim 1, the method further comprising:

invoking a preset abnormality detection algorithm, and determining whether abnormality exists in the processed supervision data; the abnormality comprises an attribute abnormality of the supervised object and/or an identification abnormality of the supervised object;

if so, acquiring historical supervision data corresponding to the supervision data with the abnormality from the prestored historical supervision data, and repairing the supervision data with the abnormality according to the acquired historical supervision data.

6. The method according to any one of claims 1 to 3, wherein the reliability score is positively correlated with the number of attributes of the supervised object carried in the corresponding supervision data.

7. The method of claim 1, the method further comprising:

and storing the processed supervision data into a supervision database.

8. A data processing apparatus comprising:

9. The apparatus of claim 8, the to-be-processed supervision data further comprising a plurality of supervision data obtained after further splitting the supervision data obtained by data fusion of the supervision data from the plurality of data sources;

the apparatus further comprises:

the second retrieval module is used for retrieving, from the pre-stored historical supervision data, historical supervision data whose carried attribute of the supervised object matches the supervision data to be processed and whose updated time is earlier than that of the supervision data to be processed;

and the second replacement module is used for determining the supervision data with the highest reliability score among the plurality of supervision data and replacing the identification of the supervised object carried in the supervision data with the highest reliability score with the identification of the supervised object carried in the retrieved historical supervision data.

10. The apparatus of claim 8, the apparatus further comprising:

the deduplication module is used for checking, based on a preset duplicate checking strategy, whether the processed supervision data contains a plurality of supervision data indicating the same supervised object; if so, further determining the supervision data with the highest reliability score among the plurality of supervision data, and deleting the supervision data other than the supervision data with the highest reliability score.

11. The apparatus of claim 10, the deduplication module further to:

carrying out semantic recognition on the processed supervision data to determine whether the processed supervision data contains a plurality of supervision data indicating the same supervised object according to the semantic recognition result; and/or

Performing character string matching on the processed supervision data to determine whether the processed supervision data contains a plurality of supervision data indicating the same supervised object according to a character string matching result; and/or

12. The apparatus of claim 8, the apparatus further comprising:

the restoration module is used for calling a preset abnormality detection algorithm and determining whether abnormality exists in the processed supervision data; the abnormality comprises an attribute abnormality of the supervised object and/or an identification abnormality of the supervised object; if so, acquiring historical supervision data corresponding to the supervision data with the abnormality from the prestored historical supervision data, and repairing the supervision data with the abnormality according to the acquired historical supervision data.

13. The apparatus of any of claims 8 to 10, wherein the reliability score is positively correlated with the number of attributes of the supervised object carried in the corresponding supervision data.

14. The apparatus of claim 8, the apparatus further comprising:

and the storage module is used for storing the processed supervision data into a supervision database.

15. A computer device comprising at least a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-7 when the program is executed by the processor.