CN116541449A - Integrated analysis method and system for multi-source heterogeneous data of tobacco - Google Patents

Integrated analysis method and system for multi-source heterogeneous data of tobacco

Info

Publication number
CN116541449A
Authority
CN
China
Prior art keywords
data
name
target
sales
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310533120.7A
Other languages
Chinese (zh)
Other versions
CN116541449B (en)
Inventor
桂洪洋
桑万里
雷建岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Mingshi Security Prevention Engineering Co ltd
Original Assignee
Henan Mingshi Security Prevention Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Mingshi Security Prevention Engineering Co ltd
Priority to CN202310533120.7A
Publication of CN116541449A
Application granted
Publication of CN116541449B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/256 Integrating or interfacing systems involving database management systems in federated or virtual databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an integrated analysis method and system for multi-source heterogeneous data of tobacco, belonging to the technical field of data processing. The method comprises the following steps. Step S1: setting a target data structure and a matching dictionary library. Step S2: extracting source data from each basic database, dividing the source data into first structure data and second structure data, extracting target information from the first structure data and the second structure data respectively, mapping the target information into the target data structure based on the matching dictionary library, and merging the target data structures to obtain standard integrated data. Step S3: acquiring the sales time period covered by the standard integrated data. Step S4: acquiring predicted sales data for the sales time period based on a sales prediction model, and comparing the predicted sales data with the corresponding data in the standard integrated data to identify abnormal data in the standard integrated data. The invention not only integrates original data of the same type but in different data formats, but also maps the original data into a preset standard data structure.

Description

Integrated analysis method and system for multi-source heterogeneous data of tobacco
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to an integrated analysis method and system for multi-source heterogeneous data of tobacco.
Background
Because tobacco sales are specially regulated, enterprises qualified to sell tobacco develop sales management systems to manage the tobacco sales data of each store under their jurisdiction and to analyze whether each store's sales are normal. To keep the sales management system running smoothly, the system must be updated at intervals or a new sales system must be introduced. However, the sales systems of different stores are not necessarily updated or replaced at the same time, so during the handover period sales data are distributed across different systems and stored in different data formats, and these format differences make data analysis difficult.
Schemes have been proposed in the prior art to solve the above problems. For example, Chinese patent application CN113342880A discloses a method and a device for obtaining metadata in tobacco data. The method first extracts original data from a database, obtains the service type corresponding to the original data, and classifies the original data by service type to obtain service data of each type; it then acquires the data attributes of the service data, compares the data attributes across all types, and determines the standard requirement for the data attributes from the comparison result; finally, it constructs a data standardization model according to the standard requirement and standardizes the service data with that model. As another example, Chinese patent application CN114429305A discloses a method for normalizing tobacco data. The method obtains tobacco data of different types from other systems in the tobacco industry, processes the tobacco data of each type, determines the format of each item of tobacco sub-data in the corresponding tobacco data, integrates the tobacco sub-data of the same format, and generates a cleaning template for each type; finally, according to the type of the tobacco data, it calls the cleaning template for that type to clean the tobacco sub-data and generate the corresponding tobacco standard data.
However, the first method cannot handle original data of the same type that is stored in different data formats, while the template generated by the second method is determined by the system according to the proportion of each data format, so the converted data does not conform to a manually specified standard.
Disclosure of Invention
In order to solve the problems, the invention provides an integrated analysis method and an integrated analysis system for tobacco multi-source heterogeneous data, which can integrate original data with the same content but different data formats and map the original data into standardized data.
In order to achieve the above object, the present invention provides an integrated analysis method for multi-source heterogeneous data of tobacco, comprising:
step S1: setting a target data structure, wherein the target data structure is structured data, the target data structure comprises standard names, the standard names comprise a domain name, an entity name and an attribute name, a matching dictionary library is generated based on the standard names, and the matching dictionary library comprises a plurality of expansion synonyms similar in meaning to the standard names;
step S2: determining the basic databases to be extracted, extracting source data from each basic database, wherein the source data are semi-structured table data, acquiring the text information contained in each cell of the source data, dividing the source data into first structure data if each piece of text information appears only once in the source data, otherwise dividing the source data into second structure data, extracting target information from the first structure data and the second structure data respectively, mapping the target information into the target data structure based on the matching dictionary library, and merging the target data structures to obtain standard integrated data;
step S3: extracting a target entity name from the standard integrated data, and acquiring an attribute name associated with the target entity name, wherein the attribute name associated with the target entity name comprises a sales number, a sales date and a sales price, and acquiring a sales time period included in the standard integrated data from the sales date;
step S4: historical sales data are extracted, a sales volume prediction model is established, the predicted sales number, the predicted sales date and the predicted sales price in the sales time period are obtained based on the sales volume prediction model, and the predicted sales number, the predicted sales date and the predicted sales price are compared with corresponding data in the standard integrated data, so that abnormal data in the standard integrated data are obtained.
Further, in the step S2, extracting the target information from the first structure data and the second structure data includes the steps of:
step S21: if the source data is the first structure data, acquiring the outer frame lines and inner frame lines of the source data, wherein the outer frame lines form the outline of the source data table, the inner frame lines lie inside the outline with both ends connected to outer frame lines, the horizontally extending inner frame lines are numbered $a_1, a_2, \ldots, a_n$ from top to bottom, and the vertically extending inner frame lines are numbered $b_1, b_2, \ldots, b_n$ from left to right; locating the target cells between inner frame line $a_1$ and the outer frame line and between inner frame line $b_1$ and the outer frame line, and defining the text information filled in the target cells as the target information;
step S22: if the source data is the second structure data, acquiring the text information that appears multiple times in the second structure data and defining it as the target information.
Further, in the step S3, mapping the target information to the target data structure includes the following steps:
step S31: comparing each piece of target information with each standard name and each expansion synonym to obtain a corresponding similarity value, and setting a first threshold; if the similarity value between the target information and a standard name is larger than the first threshold, mapping the target information to that standard name; if the similarity value between the target information and an expansion synonym is larger than the first threshold, acquiring the standard name corresponding to that expansion synonym and mapping the target information to that standard name;
step S32: and defining text information corresponding to the attribute names in the source data as actual attributes, associating the domain names, the entity names, the attribute names and the actual attributes in the source data based on the corresponding relation of the tables, associating the actual attributes with the entity names respectively if a plurality of entity names corresponding to the same actual attribute exist, and mapping the associated domain names, entity names and attribute names into the target data structure.
Further, in the step S31, after the source data mapping is completed, a new domain name is generated based on the following steps:
acquiring the entity names included in the domain name and the attribute names included in each entity name; if several entity names correspond to the same attribute name, acquiring the words those entity names have in common and the corresponding attribute name, and generating a new domain name based on the words and the attribute name; and if the entity names have no word in common, generating a blank placeholder as the new domain name.
Further, after the target data structure is generated, the domain names are combined based on the following steps:
extracting a first domain and a second domain, which are two different domain names, and acquiring a first entity set and a second entity set, which are the sets of entity names included in the first domain and the second domain respectively and contain a first number and a second number of entity names respectively;
comparing the first entity set with the second entity set to obtain the entity names they share and the corresponding third number, and calculating a fourth number based on a first formula, wherein the first formula is: $\alpha_4 = \mathrm{MAX}[\alpha_1, \alpha_2] - \alpha_3$, wherein $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are the first number, the second number, the third number and the fourth number respectively, and $\mathrm{MAX}[\alpha_1, \alpha_2]$ returns the larger of $\alpha_1$ and $\alpha_2$;
setting a second threshold, and calculating the rejection degree $\beta$ of the first domain and the second domain based on a second formula (the formula itself is missing from this text), wherein $\omega$ is a preset adjustment coefficient and $\mathrm{MIN}[\alpha_1, \alpha_2]$ returns the smaller of $\alpha_1$ and $\alpha_2$; and merging the first domain with the second domain when the rejection degree is less than the second threshold.
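As a rough illustration of the first formula only, the following Python sketch counts the entity names in two domains and computes the fourth number. The function name and the example entity sets are hypothetical. The rejection degree is deliberately not computed, because the second formula is not available in this text.

```python
def fourth_quantity(first_entities: set, second_entities: set) -> int:
    """Compute alpha_4 = MAX[alpha_1, alpha_2] - alpha_3 for two entity sets."""
    alpha_1 = len(first_entities)                    # first number
    alpha_2 = len(second_entities)                   # second number
    alpha_3 = len(first_entities & second_entities)  # third number: shared entity names
    return max(alpha_1, alpha_2) - alpha_3


# Hypothetical domains: the first has three entity names, the second has two,
# and two entity names are shared.
first_domain = {"tobacco AA", "tobacco BB", "tobacco CC"}
second_domain = {"tobacco BB", "tobacco CC"}
print(fourth_quantity(first_domain, second_domain))
```

With these example sets the first number is 3, the second is 2 and the third is 2, so the fourth number is 1.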
Further, dividing the source data into the first structure data and the second structure data includes the steps of:
acquiring the text information contained in each cell of the source data, dividing the source data into the first structure data if each piece of text information appears only once in the same source data, and otherwise dividing the source data into the second structure data.
The invention also provides an integrated analysis system for tobacco multi-source heterogeneous data, which is used for implementing the above integrated analysis method for tobacco multi-source heterogeneous data and mainly comprises:
the storage module is used for storing a target data structure and a matching dictionary library, the target data structure is structured data, the target data structure comprises a standard name, the standard name comprises a domain name, an entity name and an attribute name, and the matching dictionary library comprises a plurality of expansion synonyms similar to the standard name in word sense;
the mapping module is used for determining the basic databases to be extracted, extracting source data from each basic database, wherein the source data are semi-structured table data, acquiring the text information contained in each cell of the source data, dividing the source data into first structure data if each piece of text information appears only once in the source data, otherwise dividing the source data into second structure data, extracting target information from the first structure data and the second structure data respectively, mapping the target information into the target data structure, and merging the target data structures to obtain standard integrated data;
the data screening module is used for extracting a target entity name from the standard integrated data, acquiring an attribute name associated with the target entity name, wherein the attribute name associated with the target entity name comprises a sales number, a sales date and a sales price, acquiring a sales time period included in the standard integrated data from the sales date, further extracting historical sales data, establishing a sales prediction model, acquiring the predicted sales number, the predicted sales date and the predicted sales price of the sales time period based on the sales prediction model, comparing the predicted sales number, the predicted sales date and the predicted sales price with corresponding data in the standard integrated data, and acquiring abnormal data in the standard integrated data.
Compared with the prior art, the invention has the following beneficial effects:
First, a target data structure, namely a standard data structure, is set, and a matching dictionary library is built based on the standard domain names, entity names and attribute names. When source data is extracted, its format is first distinguished and target information is extracted in a manner suited to that format, which improves the accuracy of the extracted target information. After extraction, the target information is mapped into the target data structure based on the matching dictionary library, yielding data in the standard structure. A sales volume prediction model is then established; because the data have already been organized into a standard data structure, the data needed for prediction can be extracted accurately. Finally, the predicted sales data are compared with the actual sales data to determine whether the actual sales data contain abnormalities, thereby realizing effective supervision of tobacco sales data.
Drawings
FIG. 1 is a flow chart of steps of an integrated analysis method for multi-source heterogeneous data of tobacco in accordance with the present invention;
FIG. 2 is a diagram of first structural data according to the present invention;
FIG. 3 is a diagram of second structure data according to the present invention;
FIG. 4 is a block diagram of an integrated analysis system for multi-source heterogeneous data of tobacco in accordance with the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It will be understood that the terms "first," "second," and the like, as used herein, may be used to describe various elements, but these elements are not limited by these terms unless otherwise specified. These terms are only used to distinguish one element from another element. For example, a first xx script may be referred to as a second xx script, and similarly, a second xx script may be referred to as a first xx script, without departing from the scope of the present application.
As shown in fig. 1, an integrated analysis method for multi-source heterogeneous data of tobacco includes:
step S1: setting a target data structure, wherein the target data structure is structured data, the target data structure comprises standard names, the standard names comprise a domain name, an entity name and an attribute name, a matching dictionary library is generated based on the standard names, and the matching dictionary library comprises a plurality of expansion synonyms similar in meaning to the standard names;
Specifically, the target data structure defines the standard storage format of the data. For the standard names, the domain name, entity name and attribute name form a triple relationship commonly used in the art: a domain name includes a plurality of entity names, and each entity name includes a plurality of attribute names. To increase matching accuracy, synonym expansion is performed on each name. For example, among entity names, other names of tobacco A are taken as expansion synonyms of tobacco A; among attribute names, for an attribute of tobacco A such as a time whose value is November 1, 2020, the form 2020.11.01 is taken as an expansion synonym. An expansion synonym in the invention can therefore be a word, a textual description or a record format.
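One possible in-memory representation of the target data structure and the matching dictionary library is sketched below in Python. This is an illustration under stated assumptions rather than the patented implementation: the class `StandardName`, the function `build_matching_dictionary`, and the synonyms "tobacco name" and "sale date" are hypothetical, while "tobacco AB" as an expansion synonym of "tobacco BB" follows the example given later in the description.

```python
from dataclasses import dataclass, field


@dataclass
class StandardName:
    """A standard name with its expansion synonyms (words or record formats)."""
    name: str
    kind: str  # "domain", "entity" or "attribute"
    synonyms: set = field(default_factory=set)


def build_matching_dictionary(names):
    """Map every standard name and every expansion synonym to its standard name."""
    lookup = {}
    for std in names:
        lookup[std.name] = std.name
        for synonym in std.synonyms:
            lookup[synonym] = std.name
    return lookup


# Target data structure as a domain -> entity -> attribute hierarchy.
target_structure = {
    "commodity name": {
        "tobacco AA": ["sales time", "sales number", "sales price"],
        "tobacco BB": ["sales time", "sales number", "sales price"],
    }
}

matching_dictionary = build_matching_dictionary([
    StandardName("commodity name", "domain", {"tobacco name"}),  # hypothetical synonym
    StandardName("tobacco AA", "entity"),
    StandardName("tobacco BB", "entity", {"tobacco AB"}),
    StandardName("sales time", "attribute", {"sale date"}),      # hypothetical synonym
])
```

A flat lookup keeps synonym resolution to a single dictionary access, while the nested dictionary mirrors the domain, entity and attribute triple.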
Step S2: determining basic databases to be extracted, extracting source data in each basic database, wherein the source data are semi-structured form data, acquiring text information contained in each cell of the source data, dividing the source data into first structure data if each text information only appears once in the source data, otherwise dividing the source data into second structure data, extracting target information from the first structure data and the second structure data respectively, mapping the target information into a target data structure based on a matching dictionary database, and merging the target data structures to obtain standard integrated data;
The source data is semi-structured table data. It may be a spreadsheet stored in a database, or an electronic document generated by scanning a paper form, such as a PDF document, in which case the text information can be extracted with an OCR text recognition algorithm. After the source data is extracted, it is classified into first structure data, in which the text information in each cell appears only once, as shown in FIG. 2, and second structure data, in which text information such as "tobacco name" appears multiple times, as shown in FIG. 3. In particular, because tobacco sales are specially regulated, after the sales information is entered into a spreadsheet, the spreadsheet is printed, stamped with the official seal, and then scanned and stored to ensure the accuracy of the sales data. However, the same data may be recorded in different forms: in FIG. 2, one tobacco price corresponds to several tobacco names to simplify recording, whereas in FIG. 3 each tobacco is recorded separately to show its information in detail. This step therefore classifies the source data, which makes it easier to identify which cells contain entity names and which contain attribute names.
After the target information is extracted, it is determined which pieces of target information are domain names, which are entity names and which are attribute names. The domain names, entity names and attribute names are then associated according to the preset relationships in the matching dictionary library, so that the target information is mapped into the target data structure; the target data structures are then merged to simplify them and obtain the standard integrated data.
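The classification rule of step S2 can be sketched as follows in Python, assuming the table has already been read into a list of rows of cell text (for example after OCR). The function name and the sample tables are hypothetical, and skipping empty cells is an assumption not stated in the source.

```python
from collections import Counter


def classify_source_table(cells):
    """Return "first" if every non-empty cell text is unique, else "second"."""
    counts = Counter(
        text.strip() for row in cells for text in row if text.strip()
    )
    return "first" if all(n == 1 for n in counts.values()) else "second"


# Hypothetical table in which every cell text appears once.
compact_table = [
    ["tobacco name", "tobacco A", "tobacco B"],
    ["sales time", "2022.01.05", "2022.01.06"],
]

# Hypothetical table in which "tobacco name" and "sales number" repeat.
detailed_table = [
    ["tobacco name", "tobacco A"],
    ["sales number", "10"],
    ["tobacco name", "tobacco B"],
    ["sales number", "12"],
]

print(classify_source_table(compact_table))   # first
print(classify_source_table(detailed_table))  # second
```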
Step S3: extracting a target entity name from the standard integrated data, and acquiring an attribute name associated with the target entity name, wherein the attribute name associated with the target entity name comprises the sales number, the sales date and the sales price, and acquiring a sales time period included in the standard integrated data from the sales date;
step S4: historical sales data are extracted, a sales volume prediction model is established, the predicted sales number, the predicted sales date and the predicted sales price in the sales time period are obtained based on the sales volume prediction model, and the predicted sales number, the predicted sales date and the predicted sales price are compared with corresponding data in the standard integrated data, so that abnormal data in the standard integrated data are obtained.
Specifically, after the data are organized into standard integrated data, an automatic analysis model is established to analyze the tobacco sales data and identify abnormal data, such as abnormal single-order sales volume or abnormal order frequency. In this embodiment, the sales volume prediction model is built on a neural network. The sales number, sales date and sales price of tobacco in a given time period are obtained from the standard integrated data and used as the input features of the neural network; historical sales data for the same period are then obtained and input into the sales volume prediction model to obtain a sales volume prediction for a certain future time period. The predicted sales volume is then compared with the actual sales volume to find abnormal data, thereby realizing effective monitoring of tobacco sales.
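The comparison in step S4 might look like the following Python sketch. It takes the predictions as given and does not implement the neural network. The relative-deviation rule, the `tolerance` value of 0.3, the function name and the sample figures are all assumptions made for illustration; the source does not state how large a difference counts as abnormal.

```python
def find_abnormal_records(actual, predicted, tolerance=0.3):
    """Return records whose actual value deviates from the prediction by more than tolerance.

    `actual` and `predicted` map a (entity name, sales date) key to a sales number.
    """
    abnormal = []
    for key, actual_value in actual.items():
        expected = predicted.get(key)
        if expected is None or expected == 0:
            continue  # no usable prediction for this record
        deviation = abs(actual_value - expected) / abs(expected)
        if deviation > tolerance:
            abnormal.append((key, actual_value, expected))
    return abnormal


# Hypothetical actual and predicted sales numbers.
actual_sales = {
    ("tobacco AA", "2022-01-05"): 10,
    ("tobacco BB", "2022-01-05"): 40,
}
predicted_sales = {
    ("tobacco AA", "2022-01-05"): 11,
    ("tobacco BB", "2022-01-05"): 20,
}
abnormal_records = find_abnormal_records(actual_sales, predicted_sales)
```

Here only the tobacco BB record is flagged, since 40 is double the predicted 20 while 10 is within 30% of 11.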
It is especially noted that the present invention not only implements integration of the original data of the same type but different data formats, but also maps it to a preset standard data structure through the above steps.
In step S2, extracting the target information from the first structure data and the second structure data includes the steps of:
step S21: if the source data is the first structure data, acquiring the outer frame lines and inner frame lines of the source data, wherein the outer frame lines form the outline of the source data table, the inner frame lines lie inside the outline with both ends connected to outer frame lines, the horizontally extending inner frame lines are numbered $a_1, a_2, \ldots, a_n$ from top to bottom, and the vertically extending inner frame lines are numbered $b_1, b_2, \ldots, b_n$ from left to right; locating the target cells between inner frame line $a_1$ and the outer frame line and between inner frame line $b_1$ and the outer frame line, and defining the text information filled in the target cells as target information;
step S22: if the source data is the second structure data, text information which appears for many times in the second structure data is acquired, and the text information is defined as target information.
As shown in FIG. 2, when the source data is the first structure data $E_1$, the outer frame lines and inner frame lines of the source data are acquired. In the figure, $L_A$, $L_B$, $L_C$ and $L_D$ are in turn the outer frame lines of the source data, and $a_1$ and $b_1$ are two inner frame lines. As can be seen from the figure, the two ends of $a_1$ are connected to the outer frame lines $L_B$ and $L_D$, and the two ends of $b_1$ are connected to the outer frame lines $L_A$ and $L_C$. After the inner and outer frame lines are determined, the cells between inner frame line $a_1$ and outer frame line $L_A$ and the cells between inner frame line $b_1$ and outer frame line $L_D$ are obtained, and the text information filled in each of these cells is defined as target information. When the source data is the second structure data $E_2$, as shown in FIG. 3, each name corresponds to one attribute, so the text information in some cells appears in the same table multiple times; for example, "tobacco name" appears three times in the table, and such text information is extracted as target information.
In the prior art, table data is captured by identifying the cells one by one, extracting the text information from each cell and then judging its type; when there is a large amount of table data, this approach wastes considerable time and computing resources.
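The two extraction branches of steps S21 and S22 can be sketched in Python as below. The sketch assumes the table is already available as a grid of cell text, so that the cells between $a_1$ and the top outer frame line are the first row and the cells between $b_1$ and the left outer frame line are the first column. The function name and sample tables are hypothetical, and frame-line detection itself is not shown.

```python
from collections import Counter


def extract_target_information(cells, structure):
    """Extract target information from a table grid.

    structure == "first": take the first row and first column (header cells).
    otherwise: take every text that appears more than once in the table.
    """
    if structure == "first":
        header_row = cells[0]
        header_column = [row[0] for row in cells[1:]]
        # Keep order, drop empty cells and duplicates.
        return list(dict.fromkeys(t for t in header_row + header_column if t))
    counts = Counter(text for row in cells for text in row if text)
    return [text for text, n in counts.items() if n > 1]


first_table = [
    ["tobacco name", "tobacco A", "tobacco B"],
    ["sales time", "2022.01.05", "2022.01.06"],
    ["sales number", "10", "12"],
]
second_table = [
    ["tobacco name", "tobacco A"],
    ["sales number", "10"],
    ["tobacco name", "tobacco B"],
    ["sales number", "12"],
]

first_targets = extract_target_information(first_table, "first")
second_targets = extract_target_information(second_table, "second")
```

In the first branch only the header cells are read, which is what lets this approach avoid classifying every cell one by one.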
In step S3, mapping the target information into a target data structure includes the steps of:
step S31: comparing each piece of target information with each standard name and the expansion synonym, obtaining a corresponding similarity value, setting a first threshold value, mapping the target information into the standard name if the similarity value of the target information and the standard name is larger than the first threshold value, obtaining the standard name corresponding to the expansion synonym if the similarity value of the target information and the expansion synonym is larger than the first threshold value, and mapping the target information into the standard name;
step S32: and defining text information corresponding to the attribute names in the source data as actual attributes, associating the domain names, the entity names, the attribute names and the actual attributes in the source data based on the corresponding relation of the tables, associating the actual attributes with the entity names respectively if a plurality of entity names corresponding to the same actual attribute exist, and mapping the associated domain names, entity names and attribute names into a target data structure.
In the following explanation of the above steps, since the standard names include a domain name, an entity name, and an attribute name, such as a commodity name, a tobacco AA, a tobacco BB, and a sales time, for example, the extracted target information is a tobacco name, a tobacco a, and a sales time, then the commodity name is compared with the tobacco name, the sales time, and the sales price, a similarity value between the two texts is obtained, for example, a first threshold value is set to 40%, the similarity value of the commodity name and the tobacco name is 50%, then the tobacco name is mapped to the commodity name, the name is the domain name, and similarly, the tobacco a is mapped to the tobacco AA, the tobacco B is mapped to the tobacco BB, the name is the entity name, and the sales time is mapped to the sales time, and the name is the attribute name. In particular, the similarity value between tobacco B and tobacco BB is less than the first threshold, but the similarity value between tobacco B and tobacco AB is greater than the first threshold, tobacco AB being the same extension of tobacco BB, thus mapping tobacco B to tobacco BB.
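Step S31 can be illustrated with the Python sketch below. The source does not say how the similarity value is computed, so the standard-library `difflib.SequenceMatcher` ratio is used here purely as a stand-in, and its scores differ from the 50% quoted above. Picking the best-scoring candidate above the threshold is also an assumption, as are the function names and the synonym "tobacco name".

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity value in [0, 1]; a stand-in for the unspecified metric."""
    return SequenceMatcher(None, a, b).ratio()


def map_to_standard_name(target, dictionary, threshold=0.4):
    """Map target text to a standard name via the name itself or its synonyms.

    `dictionary` maps each standard name to a list of expansion synonyms.
    Returns None when no candidate scores above the threshold.
    """
    best_name, best_score = None, threshold
    for standard_name, synonyms in dictionary.items():
        for candidate in [standard_name, *synonyms]:
            score = similarity(target, candidate)
            if score > best_score:
                best_name, best_score = standard_name, score
    return best_name


matching_library = {
    "commodity name": ["tobacco name"],  # hypothetical synonym
    "tobacco AA": [],
    "tobacco BB": ["tobacco AB"],
    "sales time": [],
}

print(map_to_standard_name("tobacco B", matching_library))     # tobacco BB
print(map_to_standard_name("tobacco name", matching_library))  # commodity name
```

A match on an expansion synonym returns the synonym's standard name, which is the behaviour described for "tobacco AB" and "tobacco BB".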
Because the commodity name comprises the tobacco AA, and the tobacco AA comprises the sales time, the relations among the data can be associated automatically after the target information is mapped. After the data are associated, the actual attributes of the source data are converted to a standard format and associated with the standard entity; for example, the time 2022.01.05 in fig. 2 is converted to the standard date format (year 2022, month 1, day 5) and then associated with the tobacco AA entity.
Further, if one actual attribute corresponds to a plurality of entity names, it is split after mapping to the target data structure so that each entity name has its own copy. For example, in fig. 2 the selling number 10 corresponds to the two entity names tobacco A and tobacco B; after mapping to the target data structure, it is split into two actual attributes 10, the first corresponding to tobacco AA and the second to tobacco BB. Through the above steps, the target information can be mapped into the target data structure rapidly and accurately.
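The normalization and splitting of actual attributes in step S32 might look like the following Python sketch. The record layout, the function names and the use of ISO 8601 as the "standard format" are assumptions made for illustration; the text only specifies that a dotted date is converted to a standard date and that a shared attribute is duplicated per entity.

```python
from datetime import datetime


def normalize_date(raw: str) -> str:
    """Convert a dotted date such as '2022.01.05' to ISO format (assumed standard format)."""
    return datetime.strptime(raw, "%Y.%m.%d").date().isoformat()


def split_shared_attribute(entity_names, attribute_name, actual_attribute):
    """Give every entity its own copy of an actual attribute shared by several entities."""
    return [
        {"entity": entity, "attribute": attribute_name, "value": actual_attribute}
        for entity in entity_names
    ]


# The selling number 10 shared by two entities becomes two separate records.
records = split_shared_attribute(["tobacco AA", "tobacco BB"], "sales number", 10)
print(normalize_date("2022.01.05"), records)
```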
In practice, when the matching dictionary library is set up, the domain names are often designed from the operators' experience. However, manually designed names do not always reflect the relations among the data well: if a domain is defined unreasonably, it sits at too high a level and contains too many entities. For example, a first domain contains tobacco A, tobacco B and tobacco C, where tobacco A and tobacco B each have two attributes, one of which they share. Because tobacco A and tobacco B share an attribute, they can be assigned to a more refined domain. The present invention therefore expands the domain names, on the basis of the manually constructed domains, through the following steps.
Acquiring entity names included in the domain names and attribute names included in each entity name, acquiring words and corresponding attribute names commonly included in each entity name if the entity names correspond to the same attribute names, generating a new domain name based on the words and the attribute names, and generating a blank placeholder as the new domain name if the entity names do not contain the same words.
Specifically, if entity names extracted from the same source data fall in the same domain after mapping, for example tobacco A, tobacco B and tobacco C are mapped to a first domain, and tobacco A and tobacco B include the same attribute name, the words that tobacco A and tobacco B have in common are obtained. For example, the word tobacco is extracted from tobacco A and tobacco B, and tobacco A and tobacco B both include the attribute cigarette; the domain name cigarette tobacco is then generated and added to the matching dictionary library, which further subdivides the domain and refines the target data structure. In particular, if no new domain name can be generated by the above steps, a blank placeholder is generated as the domain name; the blank placeholder carries a specific number and can be modified later.
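A possible reading of this domain-expansion step is sketched below in Python. Splitting names on whitespace to find shared words, the order in which the attribute and the word are joined, and the `placeholder_<number>` naming are all assumptions for illustration; the text does not fix these details.

```python
from typing import Dict, Optional, Set


def propose_domain_name(entity_attributes: Dict[str, Set[str]],
                        placeholder_number: int) -> Optional[str]:
    """Propose a refined domain name for entities that share an attribute name.

    entity_attributes maps each entity name to its set of attribute names.
    Returns None when the entities share no attribute (no new domain is needed).
    """
    if len(entity_attributes) < 2:
        return None
    shared_attributes = set.intersection(*entity_attributes.values())
    if not shared_attributes:
        return None
    # Words common to every entity name (whitespace tokenization is an assumption).
    token_sets = [set(name.lower().split()) for name in entity_attributes]
    shared_words = set.intersection(*token_sets)
    if not shared_words:
        # No common word: fall back to a numbered blank placeholder.
        return f"placeholder_{placeholder_number}"
    return " ".join(sorted(shared_attributes) + sorted(shared_words))


print(propose_domain_name(
    {"tobacco A": {"cigarette", "sales time"}, "tobacco B": {"cigarette", "sales price"}},
    placeholder_number=1,
))
```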
Since the domain names in the above steps are generated automatically by the system, two domains may be too similar, that is, the two domains may include too many identical entities, so that entities are associated repeatedly and the amount of stored data increases. Therefore, after the new domain names are generated, the domain names are merged based on the following steps.
Extracting a first domain and a second domain respectively, wherein the first domain and the second domain are different domain names, and acquiring a first entity set and a second entity set, wherein the first entity set and the second entity set are the sets of entity names of the first domain and the second domain respectively, and comprise a first number and a second number of entity names respectively;
comparing the first entity set with the second entity set to obtain the entity names they share and the corresponding third number, and calculating a fourth number based on a first formula, wherein the first formula is: $\alpha_4 = \mathrm{MAX}[\alpha_1, \alpha_2] - \alpha_3$, wherein $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are the first number, the second number, the third number and the fourth number respectively, and $\mathrm{MAX}[\alpha_1, \alpha_2]$ returns the larger of $\alpha_1$ and $\alpha_2$;
setting a second threshold, and calculating the rejection degree $\beta$ of the first domain and the second domain based on a second formula (the formula itself is not reproduced in this text), wherein $\omega$ is a preset adjustment coefficient and $\mathrm{MIN}[\alpha_1, \alpha_2]$ returns the smaller of $\alpha_1$ and $\alpha_2$; and merging the first domain with the second domain when the rejection degree is smaller than the second threshold.
The above steps are explained as follows. When merging, two domain names, namely a first domain and a second domain, are first extracted, and the entity names included in each are obtained. For example, if the first domain includes 10 entity names, the second domain includes 8 entity names, and 7 entity names are the same, then the first number is 10, the second number is 8 and the third number is 7; substituting these values into the first formula gives a fourth number of 3. After the fourth number is obtained, the rejection degree of the first domain and the second domain is calculated based on the second formula. In particular, an adjustment coefficient is introduced into the second formula to correct the rejection degree. For example, with the adjustment coefficient set to 1 and the second threshold set to 0.5, the rejection degree calculated from the above values is 0.425, which is smaller than the second threshold, so the first domain and the second domain are merged. If instead the first number is 1, the second number is 2, the third number is 1 and the fourth number is 1, the calculated result is 1.5, which is larger than the second threshold, so the domains are not merged. The purpose of the adjustment coefficient is thus to handle the case where the first number and the second number are both small and close to each other, for example 1 and 2: such a case may arise because the entities in the first domain were separated out for containing a particular attribute, and setting the adjustment coefficient prevents such domains from being merged.
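The worked values above can be checked with a short Python sketch. The first formula is taken from the text. The second formula is not legible in this text, so the form used below, rejection degree = fourth number / MAX[first, second] + adjustment coefficient / MIN[first, second], is only an assumption: it was chosen because it uses the symbols the text names and reproduces both worked results (0.425 and 1.5) with an adjustment coefficient of 1. The actual formula may differ, and the function names are illustrative.

```python
def fourth_number(a1: int, a2: int, a3: int) -> int:
    """First formula from the text: a4 = MAX[a1, a2] - a3."""
    return max(a1, a2) - a3


def rejection_degree(a1: int, a2: int, a3: int, omega: float = 1.0) -> float:
    """ASSUMED form of the second formula, fitted to the two worked examples."""
    a4 = fourth_number(a1, a2, a3)
    return a4 / max(a1, a2) + omega / min(a1, a2)


def should_merge(a1: int, a2: int, a3: int,
                 omega: float = 1.0, second_threshold: float = 0.5) -> bool:
    """Merge the two domains when the rejection degree is below the second threshold."""
    return rejection_degree(a1, a2, a3, omega) < second_threshold


# First worked example: 10 and 8 entity names, 7 shared.
print(fourth_number(10, 8, 7), rejection_degree(10, 8, 7), should_merge(10, 8, 7))
# Second worked example: 1 and 2 entity names, 1 shared.
print(fourth_number(1, 2, 1), rejection_degree(1, 2, 1), should_merge(1, 2, 1))
```

Under this assumed form, the term with the adjustment coefficient grows as the smaller domain shrinks, which matches the stated purpose of keeping very small domains from being merged.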
Dividing the source data into first structure data and second structure data comprises the following step:
acquiring the text information contained in each cell of the source data; if each piece of text information appears only once in the source data, dividing the source data into first structure data, and otherwise dividing the source data into second structure data.
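The division can be illustrated with a small Python sketch. It follows the rule as stated for step S2 and for the mapping module (every cell text appears only once: first structure data; otherwise: second structure data). Representing a table as a list of rows of strings, ignoring empty cells, and the function names are assumptions for illustration.

```python
from collections import Counter


def count_cell_text(table):
    """Count how often each non-empty cell text occurs in a table (list of rows)."""
    return Counter(
        cell.strip() for row in table for cell in row if cell and cell.strip()
    )


def classify_source_data(table) -> str:
    """Return 'first' if every cell text is unique, otherwise 'second'."""
    counts = count_cell_text(table)
    return "first" if all(n == 1 for n in counts.values()) else "second"


def repeated_text(table):
    """Texts that occur more than once, in order of first appearance."""
    return [text for text, n in count_cell_text(table).items() if n > 1]


unique_table = [["name", "price"], ["tobacco A", "10"]]
repeating_table = [["name", "tobacco A"], ["name", "tobacco B"]]
print(classify_source_data(unique_table), classify_source_data(repeating_table))
```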
As shown in fig. 4, the present invention further provides an integrated analysis system for multi-source heterogeneous data of tobacco, which is used to implement the above integrated analysis method for multi-source heterogeneous data of tobacco. The system mainly includes:
the storage module is used for storing a target data structure and a matching dictionary library, the target data structure is structured data, the target data structure comprises a standard name, the standard name comprises a domain name, an entity name and an attribute name, and the matching dictionary library comprises a plurality of expansion synonyms similar to the word meaning of the standard name;
the mapping module is used for determining basic databases to be extracted, extracting source data in each basic database D1, D2 and D3, wherein the source data are semi-structured form data, acquiring text information contained in each cell of the source data, dividing the source data into first structure data if each text information only appears once in the source data, otherwise dividing the source data into second structure data, extracting target information from the first structure data and the second structure data respectively, mapping the target information into a target data structure, and merging the target data structures to obtain standard integrated data;
the data screening module is used for extracting target entity names from the standard integrated data, acquiring attribute names related to the target entity names, wherein the attribute names related to the target entity names comprise sales quantity, sales date and sales price, acquiring a sales time period included in the standard integrated data from the sales date, further extracting historical sales data, establishing a sales prediction model, acquiring predicted sales quantity, predicted sales date and predicted sales price of the sales time period based on the sales prediction model, and comparing the predicted sales quantity, the predicted sales date and the predicted sales price with corresponding data in the standard integrated data to acquire abnormal data in the standard integrated data.
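The comparison performed by the data screening module can be sketched in Python as follows. The text does not specify the sales prediction model or how large a deviation counts as abnormal, so a moving average over recent history and a relative tolerance are used here purely as assumed stand-ins; the record fields and names are hypothetical.

```python
from statistics import mean


def predict_sales_number(history, window=3):
    """Assumed stand-in prediction model: mean of the last `window` historical values."""
    return mean(history[-window:])


def find_abnormal_records(records, history, tolerance=0.3):
    """Flag records whose sales number deviates from the prediction by more than `tolerance`.

    `tolerance` is a relative deviation (0.3 = 30 %), an assumed criterion.
    """
    predicted = predict_sales_number(history)
    return [
        record for record in records
        if abs(record["sales_number"] - predicted) > tolerance * predicted
    ]


historical_sales = [100, 110, 90]
integrated_records = [
    {"entity": "tobacco AA", "sales_date": "2022-01-05", "sales_number": 105},
    {"entity": "tobacco AA", "sales_date": "2022-01-06", "sales_number": 300},
]
print(find_abnormal_records(integrated_records, historical_sales))
```

The same comparison could be applied to the sales date and sales price once a prediction for each is available.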
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of the steps is not strictly limited to this order, and the steps may be executed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a non-transitory computer-readable storage medium and which, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the foregoing embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered to fall within the scope of this disclosure.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (7)

1. An integrated analysis method for multi-source heterogeneous data of tobacco, which is characterized by comprising the following steps:
step S1: setting a target data structure, wherein the target data structure is structured data, the target data structure comprises a standard name, the standard name comprises a domain name, an entity name and an attribute name, a matching dictionary base is generated based on the standard name, and the matching dictionary base comprises a plurality of expansion synonyms similar to the standard name in word sense;
step S2: determining a basic database to be extracted, extracting source data in each basic database, wherein the source data are semi-structured form data, acquiring text information contained in each cell of the source data, dividing the source data into first structure data if each text information only appears once in the source data, otherwise dividing the source data into second structure data, extracting target information from the first structure data and the second structure data respectively, mapping the target information into a target data structure based on the matching dictionary database, and merging the target data structures to obtain standard integrated data;
step S3: extracting a target entity name from the standard integrated data, and acquiring an attribute name associated with the target entity name, wherein the attribute name associated with the target entity name comprises a sales number, a sales date and a sales price, and acquiring a sales time period included in the standard integrated data from the sales date;
step S4: historical sales data are extracted, a sales volume prediction model is established, the predicted sales number, the predicted sales date and the predicted sales price in the sales time period are obtained based on the sales volume prediction model, and the predicted sales number, the predicted sales date and the predicted sales price are compared with corresponding data in the standard integrated data, so that abnormal data in the standard integrated data are obtained.
2. The method according to claim 1, wherein in the step S2, extracting the target information from the first structural data and the second structural data comprises the steps of:
step S21: if the source data is the first structure data, obtaining an outer frame line and inner frame lines of the source data, wherein the outer frame line is the frame line forming the table outline of the source data, each inner frame line is located within the table outline of the source data with its two ends connected to the outer frame line, the horizontally extending inner frame lines are numbered $a_1, a_2, \dots, a_n$ in sequence from top to bottom, and the vertically extending inner frame lines are numbered $b_1, b_2, \dots, b_n$ from left to right; locating the target cells between the inner frame line $a_1$ and the outer frame line and between the inner frame line $b_1$ and the outer frame line, and defining the text information filled in the target cells as the target information;
step S22: if the source data is the second structure data, acquiring text information which appears multiple times in the second structure data, and defining the text information as the target information.
3. The method according to claim 1, wherein in the step S3, mapping the target information into the target data structure comprises the steps of:
step S31: comparing each piece of target information with each standard name and each expansion synonym to obtain a corresponding similarity value, setting a first threshold, mapping the target information to the standard name if the similarity value of the target information and the standard name is larger than the first threshold, and mapping the target information to the standard name corresponding to the expansion synonym if the similarity value of the target information and the expansion synonym is larger than the first threshold;
step S32: defining text information corresponding to the attribute names in the source data as actual attributes, associating the domain names, the entity names, the attribute names and the actual attributes in the source data based on the corresponding relation of the tables, associating the actual attribute with each of the entity names respectively if a plurality of entity names correspond to the same actual attribute, and mapping the associated domain names, entity names and attribute names into the target data structure.
4. The integrated analysis method of multi-source heterogeneous data of tobacco according to claim 3, wherein in the step S31, after the mapping of the source data is completed, a new domain name is generated based on the following steps:
acquiring entity names included in the domain names and attribute names included in each entity name, acquiring words and corresponding attribute names commonly included in each entity name if the entity names correspond to the same attribute names, generating a new domain name based on the words and the attribute names, and generating a blank placeholder as the new domain name if the words which are the same are not included in each entity name.
5. The integrated analysis method of multi-source heterogeneous data of tobacco according to claim 4, wherein after the target data structure is generated, the domain names are combined based on the following steps:
extracting a first domain and a second domain respectively, wherein the first domain and the second domain are respectively different domain names, and acquiring a first entity set and a second entity set, wherein the first entity set and the second entity set respectively comprise entity name sets for the first domain and the second domain, and the first entity set and the second entity set respectively comprise a first number and a second number of entity names;
comparing the first entity set with the second entity set to obtain the entity names they share and the corresponding third number, and calculating a fourth number based on a first formula, wherein the first formula is: $\alpha_4 = \mathrm{MAX}[\alpha_1, \alpha_2] - \alpha_3$, wherein $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are the first number, the second number, the third number and the fourth number respectively, and $\mathrm{MAX}[\alpha_1, \alpha_2]$ returns the larger of $\alpha_1$ and $\alpha_2$;
setting a second threshold, and calculating the rejection degree $\beta$ of the first domain and the second domain based on a second formula (the formula itself is not reproduced in this text), wherein $\omega$ is a preset adjustment coefficient and $\mathrm{MIN}[\alpha_1, \alpha_2]$ returns the smaller of $\alpha_1$ and $\alpha_2$; and merging the first domain with the second domain when the rejection degree is less than the second threshold.
6. The method of claim 4, wherein dividing the source data into the first and second structural data comprises:
acquiring the text information contained in each cell of the source data, dividing the source data into the first structure data if each piece of text information appears only once in the source data, and otherwise dividing the source data into the second structure data.
7. An integrated analysis system for multi-source heterogeneous data of tobacco, for implementing an integrated analysis method for multi-source heterogeneous data of tobacco as claimed in any one of claims 1 to 6, comprising:
the storage module is used for storing a target data structure and a matching dictionary library, the target data structure is structured data, the target data structure comprises a standard name, the standard name comprises a domain name, an entity name and an attribute name, and the matching dictionary library comprises a plurality of expansion synonyms similar to the standard name in word sense;
the mapping module is used for determining a basic database to be extracted, extracting source data in each basic database, wherein the source data are semi-structured form data, acquiring text information contained in each cell of the source data, dividing the source data into first structure data if each text information only appears once in the source data, otherwise dividing the source data into second structure data, extracting target information from the first structure data and the second structure data respectively, mapping the target information into a target data structure, and merging the target data structures to obtain standard integrated data;
the data screening module is used for extracting a target entity name from the standard integrated data, acquiring an attribute name associated with the target entity name, wherein the attribute name associated with the target entity name comprises a sales number, a sales date and a sales price, acquiring a sales time period included in the standard integrated data from the sales date, further extracting historical sales data, establishing a sales prediction model, acquiring the predicted sales number, the predicted sales date and the predicted sales price of the sales time period based on the sales prediction model, comparing the predicted sales number, the predicted sales date and the predicted sales price with corresponding data in the standard integrated data, and acquiring abnormal data in the standard integrated data.
CN202310533120.7A 2023-05-12 2023-05-12 Integrated analysis method and system for multi-source heterogeneous data of tobacco Active CN116541449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310533120.7A CN116541449B (en) 2023-05-12 2023-05-12 Integrated analysis method and system for multi-source heterogeneous data of tobacco

Publications (2)

Publication Number Publication Date
CN116541449A true CN116541449A (en) 2023-08-04
CN116541449B CN116541449B (en) 2023-10-13

Family

ID=87455663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310533120.7A Active CN116541449B (en) 2023-05-12 2023-05-12 Integrated analysis method and system for multi-source heterogeneous data of tobacco

Country Status (1)

Country Link
CN (1) CN116541449B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170091289A1 (en) * 2015-09-30 2017-03-30 Hitachi, Ltd. Apparatus and method for executing an automated analysis of data, in particular social media data, for product failure detection
CN107256263A (en) * 2017-06-13 2017-10-17 成都布林特信息技术有限公司 Internet hot spots information automatic monitoring method
CN111274301A (en) * 2020-01-20 2020-06-12 启迪数华科技有限公司 Intelligent management method and system based on data assets
CN112507035A (en) * 2020-11-25 2021-03-16 国网电力科学研究院武汉南瑞有限责任公司 Power transmission line multi-source heterogeneous data unified standardized processing system and method
CN113723985A (en) * 2021-03-04 2021-11-30 京东城市(北京)数字科技有限公司 Training method and device for sales prediction model, electronic equipment and storage medium
CN113779981A (en) * 2021-09-13 2021-12-10 广州汇通国信科技有限公司 Recommendation method and device based on pointer network and knowledge graph
CN113822698A (en) * 2021-06-30 2021-12-21 腾讯科技(深圳)有限公司 Content pushing method and device, computer equipment and storage medium
CN115713362A (en) * 2022-11-24 2023-02-24 广西中烟工业有限责任公司 Cigarette commodity consumption conversion behavior analysis method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MARIA-ESTHER VIDAL等: "Semantic Data Integration Techniques for Transforming Big Biomedical Data into Actionable Knowledge", 《2019 IEEE 32ND INTERNATIONAL SYMPOSIUM ON COMPUTER-BASED MEDICAL SYSTEMS (CBMS) 》, pages 563 - 566 *
任娟: "数据驱动的图书销量预测理论框架研究", 《出版与印刷》, pages 10 - 21 *

Also Published As

Publication number Publication date
CN116541449B (en) 2023-10-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant