CN118093515A

CN118093515A - Data processing method, apparatus, device, medium, and program product

Info

Publication number: CN118093515A
Application number: CN202410301435.3A
Authority: CN
Inventors: 余超; 许洋; 向海川; 王岩
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2024-03-15
Filing date: 2024-03-15
Publication date: 2024-05-28

Abstract

The present disclosure provides a data processing method, apparatus, device, medium and program product, which may be applied in the technical fields of computer applications and financial technology. The method comprises: acquiring a first subject file and a second subject file, wherein the first subject file comprises M first subjects, each first subject comprises first subject identification information and first subject level information, the second subject file comprises N second subjects, and each second subject comprises second subject identification information and second subject level information; determining, from the second subject file and according to the first subject level information of a current subject to be mapped, a second subject corresponding to a parent subject of the current subject to be mapped, to obtain a target parent subject; determining child subjects of the target parent subject from the N second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; and determining, from the candidate subject set and according to the first subject identification information of the current subject to be mapped, a candidate subject corresponding to the current subject to be mapped.

Description

Data processing method, apparatus, device, medium, and program product

Technical Field

The present disclosure relates to the field of computer applications and financial technology, and more particularly to a data processing method, apparatus, device, medium and program product.

Background

In the investment monitoring business among financial institutions, to ensure an accurate and complete reconciliation function, the mapping relationship between the differently named subjects of the financial institutions needs to be confirmed before the actual reconciliation is carried out.

In the related art, matching of subject mapping relations is generally performed manually according to similarity between subject names.

However, since different financial institutions use different subject names and their subject systems are large, matching by subject name similarity as in the related art yields inaccurate subject matches and cannot align the subject hierarchies.

Disclosure of Invention

In view of the foregoing, the present disclosure provides a data processing method, apparatus, device, medium, and program product.

According to a first aspect of the present disclosure, there is provided a data processing method comprising: acquiring a first subject file and a second subject file, wherein the first subject file comprises M first subjects, each first subject comprises first subject identification information and first subject level information, the second subject file comprises N second subjects, each second subject comprises second subject identification information and second subject level information, M is a positive integer greater than or equal to 1, and N is a positive integer greater than or equal to 1; in a case where the i-th first subject in the first subject file is determined to be the current subject to be mapped, determining, from the second subject file and according to the first subject level information of the current subject to be mapped, a second subject corresponding to the parent subject of the current subject to be mapped, to obtain a target parent subject, wherein i is greater than or equal to 1 and less than or equal to M; determining child subjects of the target parent subject from the N second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; and determining, from the candidate subject set and according to the first subject identification information of the current subject to be mapped, a candidate subject corresponding to the current subject to be mapped, wherein the candidate subject is the mapping subject of the current subject to be mapped.

According to an embodiment of the present disclosure, the above method further includes: obtaining a mapping file, wherein the mapping file comprises first subjects and second subjects whose mapping relationship has been confirmed; wherein determining that the i-th first subject in the first subject file is the current subject to be mapped comprises: in a case where the mapping file includes the parent subject of the i-th first subject, determining the i-th first subject as the current subject to be mapped; and in a case where the mapping file does not include the parent subject of the i-th first subject, determining the parent subject of the i-th first subject as the current subject to be mapped.

According to an embodiment of the present disclosure, the above method further includes: writing the i-th first subject and the mapping subject of the i-th first subject into the mapping file.

According to an embodiment of the present disclosure, the above method further includes: building a first tree structure corresponding to the first subject file according to the first file structure of the first subject file; determining first subject identification information and first subject level information of M first subjects according to the first tree structure; building a second tree structure corresponding to the second subject file according to the second file structure of the second subject file; and determining second subject identification information and second subject level information of N second subjects according to the second tree structure.

According to an embodiment of the present disclosure, the first subject identification information includes at least one of a first subject code and a first subject name; the second subject identification information includes at least one of a second subject code and a second subject name.

According to an embodiment of the present disclosure, the first subject identification information includes a first subject name; the second subject identification information includes a second subject name; wherein determining, from the candidate subject set, a candidate subject corresponding to the current subject to be mapped according to the first subject identification information of the current subject to be mapped includes: determining the name similarity between the first subject name of the current subject to be mapped and the second subject name of each candidate subject in the candidate subject set; and determining the candidate subject corresponding to the current subject to be mapped according to the name similarity.

According to an embodiment of the present disclosure, determining the candidate subject corresponding to the current subject to be mapped according to the name similarity includes: selecting candidate subjects whose name similarity is greater than a preset value to obtain a first candidate subject subset; and determining the candidate subject corresponding to the current subject to be mapped from the first candidate subject subset.

According to an embodiment of the present disclosure, determining the candidate subject corresponding to the current subject to be mapped according to the name similarity includes: ranking the candidate subjects in the candidate subject set by name similarity from high to low; selecting the candidate subjects ranked ahead of a preset position to obtain a second candidate subject subset; and determining the candidate subject corresponding to the current subject to be mapped from the second candidate subject subset.

According to an embodiment of the present disclosure, the above method further includes: writing the current subject to be mapped and its mapping subject into a mapping result file in a preset format.

A second aspect of the present disclosure provides a data processing apparatus comprising: a first acquisition module, configured to acquire a first subject file and a second subject file, wherein the first subject file comprises M first subjects, each first subject comprises first subject identification information and first subject level information, the second subject file comprises N second subjects, each second subject comprises second subject identification information and second subject level information, M is a positive integer greater than or equal to 1, and N is a positive integer greater than or equal to 1; a first determining module, configured to determine, in a case where the i-th first subject in the first subject file is determined to be the current subject to be mapped, a second subject corresponding to the parent subject of the current subject to be mapped from the second subject file according to the first subject level information of the current subject to be mapped, to obtain a target parent subject, wherein i is greater than or equal to 1 and less than or equal to M; a second determining module, configured to determine child subjects of the target parent subject from the N second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; and a third determining module, configured to determine, from the candidate subject set, a candidate subject corresponding to the current subject to be mapped according to the first subject identification information of the current subject to be mapped, wherein the candidate subject is the mapping subject of the current subject to be mapped.

A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more computer programs, wherein the one or more processors execute the one or more computer programs to implement the steps of the data processing method.

A fourth aspect of the present disclosure also provides a computer readable storage medium having stored thereon a computer program for execution by a processor to perform the steps of the data processing method described above.

A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the data processing method described above.

According to the embodiment of the disclosure, a first subject file and a second subject file are acquired, wherein the first subject file includes M first subjects, each first subject includes first subject identification information and first subject level information, the second subject file includes N second subjects, and each second subject includes second subject identification information and second subject level information; a second subject corresponding to the parent subject of the current subject to be mapped is then determined from the second subject file according to the first subject level information of the current subject to be mapped, to obtain a target parent subject; child subjects of the target parent subject are determined from the N second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; and a candidate subject corresponding to the current subject to be mapped is determined from the candidate subject set according to the first subject identification information of the current subject to be mapped. For each current subject to be mapped, the corresponding target parent subject is found through its parent subject, and the mapping subject is then determined among the child subjects of the target parent subject, so that the position of the current subject to be mapped in the second subject file can be located accurately and a subject mapping relationship can be established. This addresses the problems that subject matching is inaccurate and that subject levels are difficult to align. The subject mapping confirmation work is automated, the accuracy of the mapping result is improved, and the efficiency of manual confirmation is improved.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates an application scenario diagram of a data processing method, apparatus, device, medium and program product according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a data processing method according to an embodiment of the disclosure;

FIG. 3 schematically illustrates a schematic view of a first tree structure in accordance with an embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow chart of a data processing method according to another embodiment of the present disclosure;

FIG. 5 schematically illustrates a block diagram of a data processing apparatus according to an embodiment of the present disclosure;

FIG. 6 schematically illustrates a block diagram of a data processing apparatus according to another embodiment of the present disclosure; and

Fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a data processing method according to an embodiment of the disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a convention should be interpreted in the sense in which one of skill in the art would generally understand it (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).

In the technical solution of the present disclosure, the related user information (including, but not limited to, user personal information, user image information, and user equipment information such as location information) and data (including, but not limited to, data for analysis, stored data, displayed data, etc.) are information and data authorized by the user or sufficiently authorized by each party; the collection, storage, use, processing, transmission, provision, disclosure and application of the related data comply with relevant laws, regulations and standards; necessary security measures are taken; public order and good morals are not violated; and corresponding operation entries are provided for the user to choose authorization or rejection.

In the scenario of using personal information to make an automated decision, the method, the device and the system provided by the embodiment of the disclosure provide corresponding operation entries for users, so that the users can choose to accept or reject the automated decision result; if the user chooses to reject, the expert decision flow is entered. The expression "automated decision" here refers to an activity of automatically analyzing and assessing the behavioral habits, hobbies, or economic, health and credit status of an individual by means of a computer program, and making a decision. The expression "expert decision" here refers to an activity of making a decision by a person who specializes in a certain field of work, has specialized experience, knowledge and skills, and has reached a certain level of expertise.

The embodiment of the disclosure provides a data processing method, which comprises: acquiring a first subject file and a second subject file, wherein the first subject file comprises M first subjects, each first subject comprises first subject identification information and first subject level information, the second subject file comprises N second subjects, each second subject comprises second subject identification information and second subject level information, M is a positive integer greater than or equal to 1, and N is a positive integer greater than or equal to 1; in a case where the i-th first subject in the first subject file is determined to be the current subject to be mapped, determining, from the second subject file and according to the first subject level information of the current subject to be mapped, a second subject corresponding to the parent subject of the current subject to be mapped, to obtain a target parent subject, wherein i is greater than or equal to 1 and less than or equal to M; determining child subjects of the target parent subject from the N second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; and determining, from the candidate subject set and according to the first subject identification information of the current subject to be mapped, a candidate subject corresponding to the current subject to be mapped, wherein the candidate subject is the mapping subject of the current subject to be mapped.

Fig. 1 schematically illustrates an application scenario diagram of a data processing method, apparatus, device, medium and program product according to an embodiment of the present disclosure.

As shown in fig. 1, an application scenario 100 according to this embodiment may include a terminal device 101, a terminal device 102, a network 103, and a server 104. The network 103 is a medium used to provide communication links between the terminal device 101, the terminal device 102, and the server 104. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables, among others.

A user may interact with server 104 via network 103 using terminal device 101, terminal device 102, to receive or send messages, etc. Various communication client applications may be installed on the terminal device 101, 102, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, and the like (just examples).

Terminal devices 101, 102 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

Terminal device 101 may be used to provide a first subject file and terminal device 102 may be used to provide a second subject file. Furthermore, the terminal device 101 may also be used to provide the first subject file and the second subject file simultaneously; the terminal device 102 may also be used to provide the first subject file and the second subject file simultaneously.

The server 104 may be a server providing various services, such as a background management server (merely an example) providing support for websites browsed by the user using the terminal device 101, 102. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.

The server 104 is configured to determine, according to the obtained first subject file and second subject file, the mapping subject in the second subject file that matches a subject to be mapped in the first subject file.

It should be noted that the data processing method provided in the embodiments of the present disclosure may be generally performed by the server 104. Accordingly, the data processing apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 104. The data processing method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 104 and is capable of communicating with the terminal device 101, the terminal device 102, and/or the server 104. Accordingly, the data processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster different from the server 104 and capable of communicating with the terminal device 101, the terminal device 102, and/or the server 104.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The data processing method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 4 based on the scenario described in fig. 1.

Fig. 2 schematically illustrates a flow chart of a data processing method according to an embodiment of the present disclosure.

As shown in fig. 2, the data processing method of this embodiment includes operations S210 to S240.

In operation S210, a first subject file and a second subject file are acquired, where the first subject file includes M first subjects, the first subjects include first subject identification information and first subject level information, the second subject file includes N second subjects, the second subjects include second subject identification information and second subject level information, where M is a positive integer greater than or equal to 1, and N is a positive integer greater than or equal to 1.

In operation S220, in a case where the i-th first subject in the first subject file is determined to be the current subject to be mapped, a second subject corresponding to the parent subject of the current subject to be mapped is determined from the second subject file according to the first subject level information of the current subject to be mapped, to obtain the target parent subject, wherein i is greater than or equal to 1 and less than or equal to M.

In operation S230, according to the second subject level information of the target parent subject, the child subjects of the target parent subject are determined from the N second subjects, and a candidate subject set is obtained.

In operation S240, according to the first subject identification information of the current subject to be mapped, a candidate subject corresponding to the current subject to be mapped is determined from the candidate subject set, wherein the candidate subject is a mapped subject of the current subject to be mapped.

In accordance with an embodiment of the present disclosure, the subject file may be a file of all investment subjects of a financial institution. Wherein the first subject file may include files of all investment subjects in the first financial institution and the second subject file may include files of all investment subjects in the second financial institution.

According to embodiments of the present disclosure, the first subject identification information may be information characterizing the identity of the first subject, such as a subject name, subject number, or subject code. The first subject identification information of each first subject is unique and is used for identifying that first subject. The second subject identification information may be information characterizing the identity of the second subject, such as a subject name, subject number, or subject code, and the second subject identification information of each second subject is unique and is used for identifying that second subject.

According to an embodiment of the present disclosure, the first subject level information may be information on the hierarchical relationship between different first subjects, and characterizes the level at which each first subject is located. The first subject level information may be a priority relationship or a dependency relationship, or may be a manually preset hierarchy. Similarly, the second subject level information may be information on the hierarchical relationship between different second subjects, and characterizes the level at which each second subject is located.

According to an embodiment of the present disclosure, the parent subject may be the higher-level subject of the current subject to be mapped, and is determined according to the first subject level information. For example, if one of the first subjects is a deposit subject and, in the preset hierarchy, the deposit subject belongs to the asset subjects, then the parent subject of that first subject may be the asset subject.

According to an embodiment of the present disclosure, the target parent subject may be the second subject in the second subject file that matches the parent subject from the first subject file. For example, if the parent subject determined above is an asset subject, the target parent subject matched in the second subject file may be the second subject corresponding to the asset subject.

According to the embodiment of the disclosure, the first subject is taken as the current subject to be mapped; the higher-level subject of the first subject is identified according to the first subject level information and taken as the parent subject; and, because the mapping relationship of the parent subject has already been confirmed, the mapping subject of the parent subject is determined from the second subject file and taken as the target parent subject.

According to an embodiment of the present disclosure, for example, the first subject file includes a first subject a, a first subject b, and a first subject c, wherein the first subject b is the parent subject of the first subject a; the second subject file includes a second subject a, a second subject b, and a second subject c, wherein the second subject b and the first subject b are in a mapping relationship. If the first subject a is the current subject to be mapped, determining, from the second subject file, the second subject corresponding to the parent subject of the current subject to be mapped according to the first subject level information of the current subject to be mapped, to obtain the target parent subject, may include: determining the parent subject of the first subject a, namely the first subject b, according to the first subject level information of the first subject a; and then determining, from the second subject file, the mapping subject of the first subject b, namely the second subject b, which is the target parent subject.
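The lookup in this example can be sketched in Python as follows. This is a minimal illustration rather than the disclosed implementation: it assumes that the first subject level information is available as a child-to-parent dictionary and that confirmed mappings are held in a dictionary, and the names `first_parent_of`, `confirmed_mapping` and `find_target_parent` are hypothetical.

```python
from typing import Dict, Optional

# Assumed encoding of the first subject level information: child -> parent.
first_parent_of: Dict[str, str] = {"first subject a": "first subject b"}

# Assumed store of confirmed mappings: first subject -> second subject.
confirmed_mapping: Dict[str, str] = {"first subject b": "second subject b"}


def find_target_parent(current: str,
                       parent_of: Dict[str, str],
                       mapping: Dict[str, str]) -> Optional[str]:
    """Return the second subject mapped to the parent of `current`, if known."""
    parent = parent_of.get(current)  # parent subject from the level information
    if parent is None:
        return None                  # no parent recorded for this subject
    return mapping.get(parent)       # None when the parent is not yet mapped


target_parent = find_target_parent(
    "first subject a", first_parent_of, confirmed_mapping
)
print(target_parent)  # second subject b
```

Returning `None` when the parent has no confirmed mapping is a design choice of this sketch; it leaves the caller free to map the parent first.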

According to the embodiment of the disclosure, it is to be noted that a subject file has the following characteristics. First, the subject system presents a standard tree structure, and the subjects become progressively more detailed and precise as the level deepens. Second, a parent subject usually manages a plurality of child subjects, so the subject system is stored in memory as a multi-way tree. In addition, based on previous mapping results and business experience, an important criterion is established: if two subjects correspond to each other, their parent subjects necessarily correspond as well. Therefore, in the process of establishing the relationship between the current subject to be mapped and its mapping subject, the target parent subject of the mapping subject is found through the parent subject of the current subject to be mapped.

According to an embodiment of the present disclosure, the candidate subjects may be child subjects of the target parent subject having a relatively high degree of matching with the current subject to be mapped, and may be characterized as the mapped subjects of the current subject to be mapped.

According to the embodiment of the disclosure, after the target parent subject is determined, the next-level subjects of the target parent subject are identified according to the second subject level information of the target parent subject, and these next-level subjects are taken as the candidate subjects.

According to an embodiment of the present disclosure, for example, if the above target parent subject is the second subject b and the child subjects of the second subject b include the second subject b1, the second subject b2 and the second subject b3, the candidate subject set may include the second subject b1, the second subject b2 and the second subject b3.
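As a hedged illustration of this step, the sketch below selects the child subjects of a target parent subject from a flat list of second subjects. It assumes that the second subject level information is stored as a parent code on each record; the `Subject` class, its fields and the codes are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Subject:
    code: str                   # subject identification information
    name: str
    parent_code: Optional[str]  # assumed encoding of the level information


# Hypothetical second subject file with N = 5 subjects.
second_subjects = [
    Subject("b", "second subject b", None),
    Subject("b1", "second subject b1", "b"),
    Subject("b2", "second subject b2", "b"),
    Subject("b3", "second subject b3", "b"),
    Subject("c", "second subject c", None),
]


def candidate_set(target_parent_code: str,
                  subjects: List[Subject]) -> List[Subject]:
    """Collect the direct child subjects of the target parent subject."""
    return [s for s in subjects if s.parent_code == target_parent_code]


candidates = candidate_set("b", second_subjects)
print([s.code for s in candidates])  # ['b1', 'b2', 'b3']
```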

According to the embodiment of the disclosure, similarity matching is performed according to the first subject identification information of the current subject to be mapped and the second subject identification information corresponding to the obtained candidate subjects, and according to the matching result, the mapping subject corresponding to the current subject to be mapped is determined from the candidate subjects.
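The disclosure does not fix a particular similarity measure. The sketch below assumes, purely for illustration, the ratio computed by Python's standard `difflib.SequenceMatcher`, and shows the two selection strategies described in the summary: keeping candidates whose similarity exceeds a preset value, and keeping candidates ranked ahead of a preset position. The subject names and the helper function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(current_name: str,
                    candidate_names: List[str]) -> List[Tuple[str, float]]:
    """Score every candidate name and sort from high to low similarity."""
    scored = [(n, name_similarity(current_name, n)) for n in candidate_names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def above_preset_value(ranked: List[Tuple[str, float]],
                       preset_value: float) -> List[str]:
    """First candidate subset: similarity greater than the preset value."""
    return [name for name, score in ranked if score > preset_value]


def ahead_of_preset_position(ranked: List[Tuple[str, float]],
                             preset_position: int) -> List[str]:
    """Second candidate subset: the top-ranked `preset_position` candidates."""
    return [name for name, _ in ranked[:preset_position]]


# Hypothetical names for the current subject and its candidate subjects.
ranked = rank_candidates(
    "demand deposits",
    ["fixed assets", "time deposits", "demand deposit"],
)
print(above_preset_value(ranked, 0.9))      # ['demand deposit']
print(ahead_of_preset_position(ranked, 2))
```

Either subset can then be passed on for confirmation, for example by taking its first element or by presenting it to a reviewer.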

According to the embodiment of the disclosure, it is to be noted that, because the candidate subject set is determined from the mapping relationship of the parent subject, the mapping relationship of the parent subject of the current subject to be mapped must already have been confirmed; that is, the scheme maps from higher levels to lower levels, one level at a time.

According to an embodiment of the present disclosure, the mapping file may record the first subjects and second subjects in the first subject file and the second subject file whose mapping relationship has previously been confirmed, wherein such a first subject serves as a parent subject in the first subject file and such a second subject serves as a parent subject in the second subject file.

According to an embodiment of the present disclosure, for example, the first subject A1 and the first subject A2 are included in the mapping file. A first subject B is then selected from the first subject file, and the parent subject of the first subject B is identified according to the first subject level information. If the mapping file includes the parent subject, e.g., the parent subject is the first subject A1, the first subject B may be taken as the current subject to be mapped. If the mapping file does not include the parent subject, e.g., the parent subject is the first subject C, the first subject C is considered instead, and the above steps are executed again to determine whether the first subject C can be taken as the current subject to be mapped.

According to the embodiment of the present disclosure, as above, when determining the first subject C as the current subject to be mapped, the mapped subject corresponding thereto is confirmed as the second subject C' from the second subject file. And storing the first subject C and the second subject C' into the mapping file, and updating the mapping file. At this time, the updated map file includes the first subject A1, the first subject A2, and the first subject C. When the first subject B is selected from the first subject files again, the parent subject of the first subject B belongs to the mapping file, and the first subject B is used as a new current subject to be mapped.
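The selection of the current subject to be mapped can be sketched as a walk up the parent chain. This is a minimal sketch under stated assumptions: the function name `pick_current_subject`, the `parent_of` dictionary, and the placement of the first subject C under the first subject A1 are hypothetical and chosen only to reproduce the example above.

```python
from typing import Dict, Optional


def pick_current_subject(subject: str,
                         parent_of: Dict[str, Optional[str]],
                         mapping: Dict[str, str]) -> str:
    """Walk up from `subject` until reaching a subject whose parent is already mapped.

    A subject without a parent is returned as-is (assumption: top-level subjects
    are handled directly).
    """
    current = subject
    while True:
        parent = parent_of.get(current)
        if parent is None or parent in mapping:
            return current
        current = parent


# Assumed hierarchy: B is a child of C, and C is a child of A1.
parent_of = {"A1": None, "A2": None, "C": "A1", "B": "C"}
mapping = {"A1": "A1'", "A2": "A2'"}  # confirmed mappings (the mapping file)

first_pick = pick_current_subject("B", parent_of, mapping)   # parent C is unmapped, so C is chosen
mapping[first_pick] = "C'"                                    # C is mapped and the mapping file is updated
second_pick = pick_current_subject("B", parent_of, mapping)  # parent C is now mapped, so B is chosen
```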

According to the embodiment of the disclosure, the first subjects and the second subjects which have been mapped are stored in the mapping file, so that subject mapping proceeds level by level according to the subject level, and the accuracy of the mapping is improved.

According to embodiments of the present disclosure, the first file structure may be the dependency relationships or subordinate relationships among the plurality of first subjects.

According to the embodiment of the disclosure, based on an actual business scenario, a dependency relationship or a subordinate relationship between the first subjects is defined, and the first tree structure corresponding to the first subject file is constructed by connecting the first subjects with lines.

According to the embodiment of the present disclosure, the second file structure is similar to the first file structure, and the construction method of the second tree structure is the same as the construction method of the first tree structure, which is not described herein again.

Fig. 3 schematically illustrates a schematic view of a first tree structure according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the first tree structure may be as shown in fig. 3, wherein the first subject level information of the first subject 1 is a first level, the first subject level information of the first subject 2 and the first subject 3 is a second level, and the first subject level information of the first subject 4 is a third level.
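As an illustration of how level information can be read off such a tree, the sketch below derives each subject's level from parent links. The `parent_of` dictionary is an assumed representation, and placing the first subject 4 under the first subject 2 is a guess, since the text only states that it is on the third level.

```python
from typing import Dict, Optional


def subject_levels(parent_of: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Compute the level of every subject: a root is level 1, its children level 2, and so on."""
    cache: Dict[str, int] = {}

    def level(node: str) -> int:
        if node not in cache:
            parent = parent_of[node]
            cache[node] = 1 if parent is None else level(parent) + 1
        return cache[node]

    return {node: level(node) for node in parent_of}


# Assumed parent links for the tree described above.
parent_of = {
    "first subject 1": None,
    "first subject 2": "first subject 1",
    "first subject 3": "first subject 1",
    "first subject 4": "first subject 2",  # assumption: could equally be under first subject 3
}
levels = subject_levels(parent_of)
```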

According to an embodiment of the present disclosure, the first subject identification information is used to distinguish different subjects. For example, if a first subject relates to asset-class business, the first subject name of the first subject may be the asset subject, or expressed as property in English; the code used for implementing business transactions may be the first subject code. At least one of the first subject name and the first subject code is selected as the first subject identification information.

According to the embodiment of the present disclosure, the second subject identification information is similar to the first subject identification information, and will not be described herein.

According to an embodiment of the present disclosure, determining the name similarity between the first subject name of the current subject to be mapped and the second subject names of the candidate subjects in the candidate subject set may include: determining the name similarity between the first subject name of the subject to be mapped and the second subject names of the candidate subjects in the candidate subject set by using a character string similarity algorithm.

According to an embodiment of the present disclosure, the string similarity algorithm may include the Levenshtein distance (lev, edit distance), the Jaccard similarity coefficient, and the cosine similarity.

According to an embodiment of the present disclosure, the Levenshtein distance (lev, edit distance) is the minimum number of editing operations (e.g., insert, delete, or replace operations) needed to convert the first subject name into the second subject name. The smaller the edit distance, the more similar the first subject name and the second subject name are. The edit distance can be expressed as formula (1).

$$
\operatorname{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j), & \text{if } \min(i,j)=0,\\[4pt]
\min\begin{cases}
\operatorname{lev}_{a,b}(i-1,j)+1\\
\operatorname{lev}_{a,b}(i,j-1)+1\\
\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}
\end{cases}, & \text{otherwise.}
\end{cases}
\tag{1}
$$

Where i refers to the first i characters of the first subject name a, and j refers to the first j characters of the second subject name b. The term $\operatorname{lev}_{a,b}(i-1,j)+1$ corresponds to deleting a character from the first subject name a on the way to the second subject name b; the term $\operatorname{lev}_{a,b}(i,j-1)+1$ corresponds to inserting a character so as to match the j-th character of the second subject name b; and the term $\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}$ corresponds to replacing a character. Here $1_{(a_i\neq b_j)}$ is an indicator function whose value is 1 when the i-th character of the first subject name a differs from the j-th character of the second subject name b, and 0 otherwise.

For example, let the first subject name be kitten and the second subject name be sitting. The first edit replaces "k" in kitten with "s", giving sitten; the second edit replaces "e" in sitten with "i", giving sittin; the third edit inserts "g" at the end of sittin, giving sitting, which completes the conversion of the first subject name into the second subject name. The conversion takes three edits, i.e., the edit distance is 3.

The similarity is then computed as similarity = 1 - edit distance / maximum number of characters of the two names. For the first subject name kitten and the second subject name sitting, similarity = 1 - 3/7 ≈ 0.571.
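The edit distance and the derived similarity can be sketched as below. This is a standard dynamic-programming formulation of the Levenshtein distance, offered as an illustration rather than the disclosed implementation; the function names are chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character
                prev[j - 1] + (ca != cb),  # replace (cost 0 if the characters match)
            ))
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """similarity = 1 - edit distance / maximum number of characters of the two names."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")    # 3
score = name_similarity("kitten", "sitting")   # 1 - 3/7
```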

According to embodiments of the present disclosure, a Jaccard similarity coefficient (Jaccard similarity coefficient) may also be utilized to compare the similarity and variability of the first subject name and the second subject name. Jaccard similarity coefficients are mostly used for comparing the similarity of texts, performing duplicate checking and duplicate removal on the texts, calculating the distance between objects for data clustering, measuring the similarity degree between limited sample sets, and the like.

The calculation using the Jaccard coefficient can be expressed as formula (2).

$$
J(A,B)=\frac{|A\cap B|}{|A\cup B|}
\tag{2}
$$

Where A represents the characters of the first subject name and B represents the characters of the second subject name. If the first subject name is kitten and the second subject name is sitting, i.e., A = {k, i, t, t, e, n} and B = {s, i, t, t, i, n, g}, with repeated characters counted separately, then |A ∩ B| = |{i, t, t, n}| = 4 and |A ∪ B| = |{k, s, i, t, t, e, i, n, g}| = 9, so that J(A, B) = 4/9 ≈ 0.444.

The value range of the Jaccard coefficient is [0, 1]: when A = B, the Jaccard coefficient is 1; when A and B do not intersect, the Jaccard coefficient is 0.

The Jaccard distance can be expressed as formula (3).

$$
d_J(A,B)=1-\frac{|A\cap B|}{|A\cup B|}=\frac{|A\cup B|-|A\cap B|}{|A\cup B|}
\tag{3}
$$

The Jaccard distance indicates the degree of dissimilarity between the first subject name and the second subject name: the greater the Jaccard distance, the lower the similarity. A disadvantage of using the Jaccard distance to describe dissimilarity is that it is applicable only to collections of binary data.
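The kitten/sitting example counts repeated characters separately, which corresponds to multiset intersection and union. The sketch below reproduces that counting with `collections.Counter`; treating names as character multisets is an interpretation of the worked example, and the function names are illustrative.

```python
from collections import Counter


def jaccard_coefficient(a: str, b: str) -> float:
    """Jaccard coefficient over character multisets (repeated characters counted separately)."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # per-character minimum counts
    union = sum((ca | cb).values())         # per-character maximum counts
    return 1.0 if union == 0 else intersection / union


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance = 1 - Jaccard coefficient."""
    return 1 - jaccard_coefficient(a, b)


coefficient = jaccard_coefficient("kitten", "sitting")  # 4 / 9
distance = jaccard_distance("kitten", "sitting")        # 5 / 9
```

With plain Python sets instead of multisets the same pair would give 3/7, so the choice of representation should match the counting convention actually intended.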

According to embodiments of the present disclosure, cosine similarity may also be used to perform the similarity calculation on the first subject name and the second subject name. The first subject name and the second subject name are each converted into a vector, and the cosine of the angle between the two vectors is calculated. For vectors with non-negative components, the value range of the cosine similarity is also 0 to 1, and the closer the result is to 1, the higher the similarity. The similarity calculation can be expressed as formula (4).

$$
\cos\theta=\frac{x_1y_1+x_2y_2}{\sqrt{x_1^2+x_2^2}\,\sqrt{y_1^2+y_2^2}}
\tag{4}
$$

Where $x_1$ and $x_2$ represent the feature values of the vectorized first subject name a, and $y_1$ and $y_2$ represent the feature values of the vectorized second subject name b.
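A cosine similarity calculation can be sketched as follows. The text does not say how the names are vectorized; the sketch assumes character-count vectors, which is only one possible choice, and the function name is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the character-count vectors of a and b (assumed vectorization)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(count * cb[ch] for ch, count in ca.items())  # cb[ch] is 0 for absent characters
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty names
    return dot / (norm_a * norm_b)


score = cosine_similarity("kitten", "sitting")  # 7 / sqrt(8 * 11)
```

Because character counts are never negative, the result stays between 0 and 1 under this vectorization.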

According to an embodiment of the present disclosure, at least one of the above algorithms is selected for the similarity calculation according to the field composition of the first subject name and the second subject name.

According to embodiments of the present disclosure, the preset value may be a specific value, for example, the preset value may be 0.8, 0.9, or the like. The preset numerical value is not limited in the disclosure, and can be selected according to actual needs.

According to an embodiment of the present disclosure, for example, if the preset value is 0.8, selecting the candidate subjects whose name similarity is greater than the preset value to obtain the first candidate subject subset includes: selecting, from the plurality of second subjects, the second subjects whose name similarity with the subject to be mapped is higher than 0.8, and taking these second subjects as the first candidate subject subset.

According to an embodiment of the present disclosure, the preset position may be any position in the ranking after ranking the candidate subjects from high to low in the numerical value of the name similarity. For example, the preset position may be the position ranked third in the ranking. The selection of the preset position is not limited herein, and may be selected according to actual needs.

According to an embodiment of the present disclosure, suppose that ranking the candidate subjects in the candidate subject set by name similarity from high to low gives: the first candidate subject, the second candidate subject, the third candidate subject, and the fourth candidate subject. Selecting the candidate subjects located before the third position in this ranking then yields the first candidate subject and the second candidate subject, which form the second candidate subject subset. Further, the candidate subject corresponding to the current subject to be mapped may be determined from the second candidate subject subset by manual selection or by another processing strategy.

According to embodiments of the present disclosure, subject mapping is essentially a problem of definitions and rules. Because subjects differ between businesses, manual intervention according to subject definitions and naming rules may be included, in addition to calculating the name similarity, to improve accuracy. That is, the candidate subject corresponding to the current subject to be mapped is determined from the second candidate subject subset based on the subject differences of each financial institution and an analysis of the results. Such further manual intervention therefore plays an important role in improving the accuracy of mapping confirmation.
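The two selection steps, filtering by a preset value and keeping the candidates ranked before a preset position, can be sketched as below. The function names, the sample scores, and the use of a dictionary from candidate name to similarity are assumptions made for illustration.

```python
from typing import Dict, List


def first_candidate_subset(scores: Dict[str, float], preset_value: float = 0.8) -> Dict[str, float]:
    """Keep candidates whose name similarity is greater than the preset value."""
    return {name: s for name, s in scores.items() if s > preset_value}


def second_candidate_subset(scores: Dict[str, float], preset_position: int = 3) -> List[str]:
    """Rank candidates from high to low similarity and keep those located before the preset position."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:preset_position - 1]


# Hypothetical similarity scores for four candidate subjects.
scores = {"candidate 1": 0.95, "candidate 2": 0.90, "candidate 3": 0.85, "candidate 4": 0.60}
above_threshold = first_candidate_subset(scores)   # candidates 1 to 3
shortlist = second_candidate_subset(scores)        # candidates 1 and 2
```

The shortlist would then be handed to manual selection or to another processing strategy.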

According to the embodiment of the disclosure, the preset format may be a fixed character format, which is used for normalizing the formats of the current subject to be mapped and the mapping subject, so as to facilitate analysis and interpretation.

According to an embodiment of the present disclosure, the mapping result file may be used to store the current subject to be mapped and the mapping subject, and is used to analyze the mapping result.
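Writing the mapping result file in a fixed format might look like the sketch below. The disclosure does not specify the preset format; CSV with a two-column header is assumed here purely for illustration, as are the function name and the column names.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_mapping_results(pairs: Iterable[Tuple[str, str]], out: TextIO) -> None:
    """Write (subject to be mapped, mapping subject) pairs in an assumed CSV format."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["first_subject_code", "second_subject_code"])  # hypothetical header
    writer.writerows(pairs)


buffer = io.StringIO()  # stands in for the mapping result file
write_mapping_results([("A1", "b1"), ("A2", "b2")], buffer)
result_text = buffer.getvalue()
```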

Fig. 4 schematically shows a flow chart of a data processing method according to another embodiment of the present disclosure.

As shown in fig. 4, the data processing method of another embodiment of the present disclosure includes operations S410 to S450.

In operation S410, analysis processing is performed on the subject files. A first subject file is acquired, and a first tree structure corresponding to the first subject file is built according to a first file structure of the first subject file; a second subject file is acquired, and a second tree structure corresponding to the second subject file is built according to a second file structure of the second subject file; and the current subject to be mapped and its corresponding parent subject in the first subject file are determined through the first tree structure.

In operation S420, extraction of the target parent subject is performed. A second subject corresponding to the parent subject of the current subject to be mapped is determined from the second subject file, to obtain the target parent subject.

In operation S430, general policy processing is performed. The child subjects of the target parent subject are determined from the second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; the name similarity between the first subject name of the current subject to be mapped and the second subject names of the candidate subjects in the candidate subject set is determined, to obtain a first candidate subject subset; further, the candidate subjects are ranked by name similarity, to obtain a second candidate subject subset.

In operation S440, other policy processing is performed. The candidate subject corresponding to the current subject to be mapped is determined from the second candidate subject subset by manual selection or by another processing strategy.

In operation S450, a mapping result file is generated according to the current subject to be mapped and the mapping subject.

According to the embodiment of the disclosure, for each current subject to be mapped, a target parent subject corresponding to the current subject to be mapped is found through the parent subject, and then the mapping subject of the current subject to be mapped is determined in the child subjects of the target parent subject, so that the corresponding position of the current subject to be mapped in the second subject file can be accurately positioned, and a subject mapping relation is established. The problem that subject matching is inaccurate and subject levels are difficult to correspond is solved. The automatic processing of the subject mapping confirmation work is realized, the accuracy of the mapping result is improved, and the efficiency of manual confirmation is improved.
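The overall flow, mapping parents before children and comparing names only among the children of the target parent subject, can be sketched end to end as follows. This is a simplified illustration under several assumptions: subjects are given as dictionaries from code to (name, parent code), the edit-distance similarity is used, the best-scoring candidate above a threshold stands in for the manual or other-policy step, and all names and sample data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

SubjectTable = Dict[str, Tuple[str, Optional[str]]]  # code -> (name, parent code)


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def map_subjects(first: SubjectTable, second: SubjectTable,
                 seed_mapping: Dict[str, str], threshold: float = 0.6) -> Dict[str, str]:
    """Return first-subject code -> second-subject code, extending the confirmed seed mapping."""
    mapping = dict(seed_mapping)
    children: Dict[Optional[str], List[str]] = {}
    for code, (_name, parent) in second.items():
        children.setdefault(parent, []).append(code)

    def resolve(code: str) -> Optional[str]:
        if code in mapping:
            return mapping[code]
        name, parent = first[code]
        # Map the parent first, so mapping proceeds from higher to lower levels.
        target_parent = resolve(parent) if parent is not None else None
        if parent is not None and target_parent is None:
            return None  # parent unmapped: left for manual handling
        scored = [(similarity(name, second[c][0]), c) for c in children.get(target_parent, [])]
        scored = [item for item in scored if item[0] > threshold]
        if not scored:
            return None
        mapping[code] = max(scored)[1]  # stand-in for manual or other-policy selection
        return mapping[code]

    for code in first:
        resolve(code)
    return mapping


first = {"A": ("Assets", None), "A1": ("Cash on hand", "A"), "A2": ("Loans", "A")}
second = {"b": ("Assets", None), "b1": ("Cash in hand", "b"), "b2": ("Loans", "b"),
          "c": ("Liabilities", None)}
result = map_subjects(first, second, seed_mapping={"A": "b"})
```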

Based on the data processing method, the disclosure also provides a data processing device. The device will be described in detail below in connection with fig. 5.

Fig. 5 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.

As shown in fig. 5, the data processing apparatus 500 of this embodiment includes a first acquisition module 510, a first determination module 520, a second determination module 530, and a third determination module 540.

The first acquisition module 510 is configured to acquire a first subject file and a second subject file, where the first subject file includes M first subjects, the first subjects include first subject identification information and first subject level information, the second subject file includes N second subjects, and the second subjects include second subject identification information and second subject level information, where M is a positive integer greater than or equal to 1, and N is a positive integer greater than or equal to 1. In an embodiment, the first acquisition module 510 may be configured to perform the operation S210 described above, which is not described herein again.

The first determining module 520 is configured to, when it is determined that the i-th first subject in the first subject file is the current subject to be mapped, determine, according to the first subject level information of the current subject to be mapped, a second subject corresponding to the parent subject of the current subject to be mapped from the second subject file, to obtain a target parent subject, where i is greater than or equal to 1 and less than or equal to M. In an embodiment, the first determining module 520 may be configured to perform the operation S220 described above, which is not described herein again.

The second determining module 530 is configured to determine, according to the second subject level information of the target parent subject, the child subjects of the target parent subject from the N second subjects, to obtain a candidate subject set. In an embodiment, the second determining module 530 may be configured to perform the operation S230 described above, which is not described herein again.

The third determining module 540 is configured to determine, according to the first subject identification information of the current subject to be mapped, a candidate subject corresponding to the current subject to be mapped from the candidate subject set, where the candidate subject is the mapping subject of the current subject to be mapped. In an embodiment, the third determining module 540 may be configured to perform the operation S240 described above, which is not described herein again.

According to the embodiment of the disclosure, for each current subject to be mapped, the first determining module 520 and the second determining module 530 find the corresponding target parent subject through the parent subject, and then determine the mapping subject of the current subject to be mapped in the child subjects of the target parent subject through the third determining module 540, so that the corresponding position of the current subject to be mapped in the second subject file can be accurately located, and a subject mapping relationship can be established. The problem that subject matching is inaccurate and subject levels are difficult to correspond is solved. The automatic processing of the subject mapping confirmation work is realized, the accuracy of the mapping result is improved, and the efficiency of manual confirmation is improved.

According to an embodiment of the present disclosure, the data processing apparatus 500 further comprises a second acquisition module.

The second acquisition module is configured to acquire a mapping file, wherein the mapping file includes first subjects and second subjects whose mapping relationship has been confirmed.

According to an embodiment of the present disclosure, the first determination module 520 includes a first determination sub-module and a second determination sub-module.

The first determining submodule is configured to determine that the i-th first subject is the current subject to be mapped in the case that the parent subject of the i-th first subject is included in the mapping file.

The second determining submodule is configured to determine that the parent subject of the i-th first subject is the current subject to be mapped in the case that the parent subject of the i-th first subject is not included in the mapping file.

According to an embodiment of the present disclosure, the data processing apparatus 500 further comprises a first writing module.

The first writing module is configured to write the i-th first subject and the mapping subject of the i-th first subject into the mapping file.

According to an embodiment of the present disclosure, the data processing apparatus 500 further comprises a first building module, a fourth determining module, a second building module and a fifth determining module.

The first construction module is used for constructing a first tree structure corresponding to the first subject file according to the first file structure of the first subject file.

The fourth determining module is configured to determine the first subject identification information and the first subject level information of the M first subjects according to the first tree structure.

The second building module is configured to build a second tree structure corresponding to the second subject file according to the second file structure of the second subject file.

The fifth determining module is configured to determine the second subject identification information and the second subject level information of the N second subjects according to the second tree structure.

According to an embodiment of the present disclosure, the third determination module 540 includes a third determination sub-module and a fourth determination sub-module.

The third determination submodule is configured to determine the name similarity between the first subject name of the current subject to be mapped and the second subject names of the candidate subjects in the candidate subject set.

The fourth determination submodule is configured to determine the candidate subject corresponding to the current subject to be mapped according to the name similarity.

According to an embodiment of the present disclosure, the fourth determination submodule includes a selection unit and a determination unit.

The selecting unit is configured to select the candidate subjects whose name similarity is greater than a preset value, to obtain a first candidate subject subset.

The determining unit is configured to determine the candidate subject corresponding to the current subject to be mapped from the first candidate subject subset.

According to an embodiment of the present disclosure, the determining unit comprises a sorting subunit, a selecting subunit and a determining subunit.

The sorting subunit is configured to sort the candidate subjects in the candidate subject set from high to low according to the name similarity.

The selecting subunit is configured to select the candidate subjects whose sorting positions are located before the preset position, to obtain a second candidate subject subset.

The determining subunit is configured to determine the candidate subject corresponding to the current subject to be mapped from the second candidate subject subset.

According to an embodiment of the present disclosure, the data processing apparatus 500 further comprises a second writing module.

The second writing module is configured to write the current subject to be mapped and the mapping subject into the mapping result file according to the preset format.

According to an embodiment of the present disclosure, any of the first acquisition module 510, the first determination module 520, the second determination module 530, and the third determination module 540 may be combined in one module to be implemented, or any of the modules may be split into a plurality of modules. Or at least some of the functionality of one or more of the modules may be combined with, and implemented in, at least some of the functionality of other modules. According to embodiments of the present disclosure, at least one of the first acquisition module 510, the first determination module 520, the second determination module 530, and the third determination module 540 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging the circuitry, or in any one of or a suitable combination of three of software, hardware, and firmware. Or at least one of the first acquisition module 510, the first determination module 520, the second determination module 530 and the third determination module 540 may be at least partially implemented as computer program modules which, when executed, may perform the respective functions.

Fig. 6 schematically illustrates a block diagram of a data processing apparatus 600 according to another embodiment of the present disclosure.

As shown in fig. 6, a data processing apparatus 600 of another embodiment of the present disclosure includes a subject file analysis module 610, a general policy processing module 620, an other policy processing module 630, and a mapping result generation module 640.

The subject file analysis module 610 is configured to obtain a first subject file, and build a first tree structure corresponding to the first subject file according to a first file structure of the first subject file; acquiring a second subject file, and building a second tree structure corresponding to the second subject file according to a second file structure of the second subject file; and determining the current subject to be mapped and the corresponding parent subject thereof in the first subject file through the first tree structure.

The general policy processing module 620 is configured to determine a second subject corresponding to the parent subject of the current subject to be mapped from the second subject file, to obtain the target parent subject; determine the child subjects of the target parent subject from the second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; determine the name similarity between the first subject name of the current subject to be mapped and the second subject names of the candidate subjects in the candidate subject set, to obtain a first candidate subject subset; and further rank the candidate subjects by name similarity, to obtain a second candidate subject subset.

The other policy processing module 630 is configured to determine the candidate subject corresponding to the current subject to be mapped from the second candidate subject subset by manual selection or by another processing strategy.

The mapping result generation module 640 is configured to generate a mapping result file according to the current subject to be mapped and the mapping subject.

As shown in fig. 7, an electronic device 700 according to an embodiment of the present disclosure includes a processor 701 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.

In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. The processor 701 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 702 and/or the RAM 703. Note that the program may be stored in one or more memories other than the ROM 702 and the RAM 703. The processor 701 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.

According to an embodiment of the present disclosure, the electronic device 700 may further include an input/output (I/O) interface 705, the input/output (I/O) interface 705 also being connected to the bus 704. The electronic device 700 may also include one or more of the following components connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output portion 707 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. The drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read therefrom is mounted into the storage section 708 as necessary.

The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 702 and/or RAM 703 and/or one or more memories other than ROM 702 and RAM 703 described above.

Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. When the computer program product runs on a computer system, the program code causes the computer system to carry out the methods provided by the embodiments of the present disclosure.

The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 701. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.

In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed over a network medium in the form of signals, downloaded and installed via the communication section 709, and/or installed from the removable medium 711. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 709, and/or installed from the removable medium 711. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 701. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.

According to embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages. In particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or incorporated in a variety of ways, even if such combinations or incorporations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or incorporated without departing from the spirit and teachings of the present disclosure. All such combinations and/or incorporations fall within the scope of the present disclosure.

The embodiments of the present disclosure are described above. These embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims

1. A method of data processing, the method comprising:

acquiring a first subject file and a second subject file, wherein the first subject file comprises M first subjects, each first subject comprises first subject identification information and first subject level information, the second subject file comprises N second subjects, each second subject comprises second subject identification information and second subject level information, M is a positive integer greater than or equal to 1, and N is a positive integer greater than or equal to 1;

in a case where the ith first subject in the first subject file is determined to be the current subject to be mapped, determining, from the second subject file and according to the first subject level information of the current subject to be mapped, a second subject corresponding to a parent subject of the current subject to be mapped, to obtain a target parent subject, wherein i is greater than or equal to 1 and less than or equal to M;

determining child subjects of the target parent subject from the N second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; and

determining, from the candidate subject set and according to the first subject identification information of the current subject to be mapped, a candidate subject corresponding to the current subject to be mapped, wherein the candidate subject is the mapping subject of the current subject to be mapped.
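
As an illustrative, non-limiting sketch, the steps of claim 1 might be expressed in Python as follows. The `Subject` record, the `confirmed` dictionary (first parent code to second parent code), and the use of `difflib.SequenceMatcher` as the comparison on identification information are all assumptions made for this example; the disclosure does not prescribe them.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Subject:
    code: str                   # identification information
    name: str                   # identification information
    parent_code: Optional[str]  # level information; None for a top-level subject


def map_subject(current: Subject,
                second_subjects: List[Subject],
                confirmed: Dict[str, str]) -> Optional[Subject]:
    """Return the second subject chosen as the mapping subject, or None."""
    # Target parent subject: the second subject already paired with the
    # parent of the current subject (assumed to be recorded in `confirmed`).
    if current.parent_code is None:
        return None
    target_parent_code = confirmed.get(current.parent_code)
    if target_parent_code is None:
        return None
    # Candidate subject set: only the children of the target parent subject.
    candidates = [s for s in second_subjects
                  if s.parent_code == target_parent_code]
    if not candidates:
        return None
    # Pick the candidate whose name is closest to the current subject's name.
    return max(candidates,
               key=lambda s: SequenceMatcher(None, current.name, s.name).ratio())
```

Restricting the comparison to the children of the target parent subject keeps a same-named subject under a different parent from being selected.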

2. The method according to claim 1, wherein the method further comprises:

obtaining a mapping file, wherein the mapping file comprises first subjects and second subjects whose mapping relation has been confirmed;

wherein determining that the ith first subject in the first subject file is the current subject to be mapped comprises:

in a case where the mapping file includes the parent subject of the ith first subject, determining the ith first subject as the current subject to be mapped; and

in a case where the mapping file does not include the parent subject of the ith first subject, determining the parent subject of the ith first subject as the current subject to be mapped.

3. The method according to claim 2, wherein the method further comprises:

writing the ith first subject and the mapping subject of the ith first subject into the mapping file.
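
A minimal sketch of the selection rule of claim 2 and the write-back of claim 3 is given below, assuming the mapping file has been loaded into a dictionary keyed by first subject code. The names `select_current`, `record_mapping`, and `parent_of` are hypothetical, and treating a top-level subject as directly mappable is an assumption of this example only.

```python
from typing import Dict, Optional


def select_current(code: str,
                   parent_of: Dict[str, Optional[str]],
                   mapping: Dict[str, str]) -> str:
    """Choose which first subject to map next for the subject `code`."""
    parent = parent_of.get(code)
    # Assumption: a top-level subject has no parent to check first.
    if parent is None or parent in mapping:
        return code
    # The parent is not yet mapped, so the parent becomes the current subject.
    return parent


def record_mapping(mapping: Dict[str, str],
                   first_code: str,
                   second_code: str) -> None:
    """Store a confirmed pair so that child subjects can rely on it later."""
    mapping[first_code] = second_code
```

Recording each confirmed pair means that, once a parent subject is mapped, its child subjects become eligible to be mapped directly on a later pass.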

4. The method according to claim 1, wherein the method further comprises:

building a first tree structure corresponding to the first subject file according to a first file structure of the first subject file;

determining the first subject identification information and the first subject level information of the M first subjects according to the first tree structure;

building a second tree structure corresponding to the second subject file according to a second file structure of the second subject file; and

determining the second subject identification information and the second subject level information of the N second subjects according to the second tree structure.
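
By way of illustration only, the tree construction of claim 4 could be sketched as below. The example assumes each subject file has already been parsed into `(code, name, parent_code)` rows that form an acyclic hierarchy; the actual file structure is not specified here, and `build_tree` and `depth` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str, Optional[str]]  # (code, name, parent_code)


def build_tree(rows: List[Row]):
    """Derive identification and level information from flat rows."""
    names: Dict[str, str] = {}                    # identification information
    parent_of: Dict[str, Optional[str]] = {}      # level information
    children: Dict[Optional[str], List[str]] = defaultdict(list)
    for code, name, parent in rows:
        names[code] = name
        parent_of[code] = parent
        children[parent].append(code)             # key None collects the roots
    return names, parent_of, dict(children)


def depth(code: str, parent_of: Dict[str, Optional[str]]) -> int:
    """Level of a subject, counting a top-level subject as level 1."""
    level = 1
    while parent_of.get(code) is not None:
        code = parent_of[code]
        level += 1
    return level
```

The same helper would be applied once to the first subject file and once to the second subject file, yielding two independent tree structures.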

5. The method of claim 1, wherein the first subject identification information includes at least one of a first subject code and a first subject name; the second subject identification information includes at least one of a second subject code and a second subject name.

6. The method of claim 1, wherein the first subject identification information comprises a first subject name; the second subject identification information includes a second subject name;

wherein the determining, from the candidate subject set and according to the first subject identification information of the current subject to be mapped, a candidate subject corresponding to the current subject to be mapped comprises:

determining a name similarity between the first subject name of the current subject to be mapped and the second subject name of each candidate subject in the candidate subject set; and

determining the candidate subject corresponding to the current subject to be mapped according to the name similarity.

7. The method of claim 6, wherein the determining candidate subjects corresponding to the current subject to be mapped based on the name similarity comprises:

selecting candidate subjects whose name similarity is greater than a preset value, to obtain a first candidate subject subset; and

determining the candidate subject corresponding to the current subject to be mapped from the first candidate subject subset.

8. The method of claim 6, wherein the determining candidate subjects corresponding to the current subject to be mapped based on the name similarity comprises:

ranking the candidate subjects in the candidate subject set in descending order of name similarity;

selecting candidate subjects ranked ahead of a preset position, to obtain a second candidate subject subset; and

determining the candidate subject corresponding to the current subject to be mapped from the second candidate subject subset.
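
The two selection strategies of claims 7 and 8 can be sketched as follows, purely as an example. The similarity measure is not fixed by the claims; `difflib.SequenceMatcher` from the Python standard library is used here as a stand-in, and reading "ahead of a preset position" as "the first `preset_position` entries" is an assumption of this sketch.

```python
from difflib import SequenceMatcher
from typing import List


def name_similarity(first_name: str, second_name: str) -> float:
    """Similarity in [0, 1]; a stand-in for whatever measure is actually used."""
    return SequenceMatcher(None, first_name, second_name).ratio()


def above_preset_value(name: str,
                       candidate_names: List[str],
                       preset_value: float) -> List[str]:
    """First candidate subset: similarity strictly greater than the preset value."""
    return [c for c in candidate_names
            if name_similarity(name, c) > preset_value]


def ahead_of_preset_position(name: str,
                             candidate_names: List[str],
                             preset_position: int) -> List[str]:
    """Second candidate subset: the highest-ranked candidates by similarity."""
    ranked = sorted(candidate_names,
                    key=lambda c: name_similarity(name, c),
                    reverse=True)
    return ranked[:preset_position]
```

A threshold yields a subset whose size varies with the data and may be empty, whereas a rank cut-off always returns up to a fixed number of candidates; either subset may then be passed on for final confirmation.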

9. The method according to claim 1, wherein the method further comprises:

writing the current subject to be mapped and the mapping subject into a mapping result file according to a preset format.
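
As one possible illustration of claim 9, the sketch below writes mapping results as CSV. The preset format is not defined by the claim; the CSV layout, the `HEADER` column names, and the function names are assumptions chosen for this example.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple

# Hypothetical preset format: one header row, then one row per mapped pair.
HEADER = ("first_code", "first_name", "second_code", "second_name")

MappingRow = Tuple[str, str, str, str]


def write_mapping_result(rows: Iterable[MappingRow], out: TextIO) -> None:
    """Write each (subject to be mapped, mapping subject) pair as a CSV row."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(rows)


def render_mapping_result(rows: Iterable[MappingRow]) -> str:
    """Convenience wrapper returning the file content as a string."""
    buffer = io.StringIO()
    write_mapping_result(rows, buffer)
    return buffer.getvalue()
```

When writing to a real file with the `csv` module, the file would typically be opened with `newline=""` so that line endings are not translated twice.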

10. A data processing apparatus, the apparatus comprising:

a first acquisition module, configured to acquire a first subject file and a second subject file, wherein the first subject file comprises M first subjects, each first subject comprises first subject identification information and first subject level information, the second subject file comprises N second subjects, each second subject comprises second subject identification information and second subject level information, M is a positive integer greater than or equal to 1, and N is a positive integer greater than or equal to 1;

a first determining module, configured to, in a case where the ith first subject in the first subject file is determined to be the current subject to be mapped, determine, from the second subject file and according to the first subject level information of the current subject to be mapped, a second subject corresponding to a parent subject of the current subject to be mapped, to obtain a target parent subject, wherein i is greater than or equal to 1 and less than or equal to M;

a second determining module, configured to determine child subjects of the target parent subject from the N second subjects according to the second subject level information of the target parent subject, to obtain a candidate subject set; and

a third determining module, configured to determine, from the candidate subject set and according to the first subject identification information of the current subject to be mapped, a candidate subject corresponding to the current subject to be mapped, wherein the candidate subject is the mapping subject of the current subject to be mapped.

11. An electronic device, comprising:

one or more processors;

a memory for storing one or more computer programs,

characterized in that the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 9.

12. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-9.

13. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-9.