CN113486630A - Supply chain data vectorization and visualization processing method and device
- Publication number
- CN113486630A CN202111045671.6A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/358—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
Embodiments of this specification provide a supply chain data vectorization and visualization processing method and device. In the vectorization method, enterprise master data is obtained, and internal sub-enterprise information and external business cooperation enterprise information are extracted from it. Word segments are extracted from each of the two kinds of information by using a feature word extraction model. Each kind of information, together with its word segments, is input into a text conversion model to obtain the corresponding first and second feature vectors. The feature vectors are then fused to obtain a target feature vector corresponding to the enterprise master data. The target feature vectors of multiple pieces of enterprise master data are displayed on an interface in a visual, interactive manner by means of a suitable dimensionality reduction algorithm and clustering algorithm, and suitable algorithm parameters are determined by observation. Finally, potentially problematic data can be located and cleaned either directly from the clustering result or through visual search. The visual search view also supports versioning and revisiting of previous search and modification records.
Description
Technical Field
One or more embodiments of the present disclosure relate to the field of data processing, and in particular, to a method and an apparatus for vectorizing and visualizing supply chain data.
Background
Data is the core of, and the key to, a successful enterprise digital transformation, and data quality directly affects the authenticity and reliability of data analysis. With the continuous development of modern internet technology, data brings ever greater value to large supply chain integrated service group companies: it drives business operation and innovation, raises the level of management, leads transformation and upgrading, and continuously creates new economic value. In practice, such a company faces massive data volumes, a complex data environment, and latent data defects. The up-front cost of analyzing, mining, and applying supply chain data such as master data, business data, and analysis data is high, and the value actually realized at the data application stage often deviates greatly from what was expected. The root cause is that enterprises usually attend only to the analysis, mining, and application of data and neglect the data processing stage, so data quality is low, which hinders group-level data integration and deep data utilization. Introducing data processing work is therefore crucial, and it is essential for the development and application of big data analysis systems in particular.
A large supply chain integrated service group company consists of multiple first-level sub-enterprises and their subordinate sub-enterprises, and needs a unified group-level supply chain database to strengthen data management and group control. However, a large amount of irregular, redundant data can arise for several reasons: the data operation and management system is imperfect, data management processes across sub-enterprises are incomplete, division of work and execution of responsibilities are inadequate, duplicate checking during data extraction, cleaning, conversion, collection, and distribution is not strict, and data problems are not found and handled promptly and properly. For example, enterprise master data, which includes internal sub-enterprise information and external business cooperation enterprise (customer and supplier) information, suffers from quality problems such as duplicated master data, and these problems have become a bottleneck restricting the digital transformation of enterprises. During digital transformation, if existing data is not effectively cleaned and checked for duplicates, and incremental data is not effectively checked for duplicates, data quality problems accumulate day by day, seriously affect data mining, analysis, and application, and have a major impact on business operation and management control. In day-to-day data operation and control, traditional cleaning methods handle such redundant data mainly by manual review and processing record by record, which is inefficient and suitable only for small data sets.
Intelligent cleaning methods are faster, but users cannot take part in the data processing, and the reliability of the cleaning result cannot be guaranteed for complex data problems. A solution is therefore urgently needed that processes data more efficiently and more accurately, which is of great significance for mining the value of data, driving business innovation, and making targeted, scientific decisions on production, operation, and management.
Disclosure of Invention
One or more embodiments of the present specification describe a supply chain data vectorization and visualization processing method and apparatus that can process enterprise master data more efficiently and more accurately.
In a first aspect, a supply chain data vectorization method is provided, including:
acquiring enterprise master data, the enterprise master data comprising internal sub-enterprise information and external business cooperation enterprise information;
extracting the internal sub-enterprise information and the external business cooperation enterprise information from the enterprise master data;
extracting first word segments and second word segments from the internal sub-enterprise information and the external business cooperation enterprise information, respectively, by using a feature word extraction model;
inputting the internal sub-enterprise information and the first word segments into a text conversion model, and inputting the external business cooperation enterprise information and the second word segments into the text conversion model;
wherein the text conversion model comprises a first sub-model based on time sequence and a second sub-model based on word frequency statistics; the first sub-model is used to determine a first feature vector of the input data according to the vector representation of each item of content in the input data, the position information of each item of content, and the vector representation of the input word segments; the second sub-model is used to determine a second feature vector of the input data according to the word frequency statistics of each item of content in the input data and the word frequency statistics of the input word segments;
fusing, for each of the internal sub-enterprise information and the external business cooperation enterprise information, the corresponding first and second feature vectors to obtain a respective fusion vector; and
fusing the respective fusion vectors of the internal sub-enterprise information and the external business cooperation enterprise information to obtain a target feature vector corresponding to the enterprise master data.
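The two-level fusion in the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the use of concatenation as the fusion operation, and the toy stand-in sub-models are all hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A sub-model maps (information text, word segments) to a feature vector.
SubModel = Callable[[str, Sequence[str]], Vector]


def concat(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Fuse two vectors by concatenation (one possible fusion operation)."""
    return list(a) + list(b)


def target_feature_vector(
    internal_info: str,
    first_segments: Sequence[str],
    external_info: str,
    second_segments: Sequence[str],
    first_submodel: SubModel,
    second_submodel: SubModel,
    fuse: Callable[[Sequence[float], Sequence[float]], Vector] = concat,
) -> Vector:
    """Fuse per-part vectors first, then fuse the two fusion vectors."""
    fusion_vectors = []
    for info, segments in (
        (internal_info, first_segments),
        (external_info, second_segments),
    ):
        first_vec = first_submodel(info, segments)    # order-aware features
        second_vec = second_submodel(info, segments)  # frequency features
        fusion_vectors.append(fuse(first_vec, second_vec))
    return fuse(fusion_vectors[0], fusion_vectors[1])


# Toy stand-ins for the trained sub-models (assumptions for illustration only).
def toy_first(info: str, segments: Sequence[str]) -> Vector:
    return [float(len(info))]


def toy_second(info: str, segments: Sequence[str]) -> Vector:
    return [float(len(segments))]
```

With the toy sub-models, each kind of information contributes a two-element fusion vector, and the target feature vector is the concatenation of the two.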
In a second aspect, a supply chain data visualization processing method is provided, including:
displaying an interactive interface for editing enterprise master data;
receiving a distance threshold input through the interactive interface;
clustering a plurality of pieces of enterprise master data into a plurality of class clusters based on the distance threshold and the similarity distance between every two pieces of the enterprise master data, wherein the similarity distance between every two pieces of enterprise master data is determined based on their respective target feature vectors obtained by the method according to the first aspect;
determining a network structure of the plurality of pieces of enterprise master data according to the plurality of class clusters, wherein the network structure comprises a plurality of groups of nodes, each group corresponds to one of the class clusters, and each node in a group represents one piece of enterprise master data belonging to the corresponding class cluster; and
displaying the network structure.
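As a rough illustration of the network structure described in the second aspect, the sketch below groups record identifiers by class cluster so that each group holds the nodes of one cluster. The function name and the dictionary layout are assumptions made for this example; the text does not prescribe a data structure.

```python
from typing import Dict, Hashable, List


def build_network_structure(
    cluster_of: Dict[str, Hashable],
) -> Dict[Hashable, List[str]]:
    """Group nodes (record ids) by the class cluster they belong to.

    cluster_of maps each record id to its cluster label; the result maps
    each cluster label to the list of nodes in that group.
    """
    groups: Dict[Hashable, List[str]] = {}
    for record_id, label in cluster_of.items():
        groups.setdefault(label, []).append(record_id)
    return groups
```

A display layer could then draw one group of nodes per key of the returned dictionary.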
In a third aspect, a supply chain data vectorization apparatus is provided, including:
an acquisition unit, configured to acquire enterprise master data, the enterprise master data comprising internal sub-enterprise information and external business cooperation enterprise information;
an extracting unit, configured to extract the internal sub-enterprise information and the external business cooperation enterprise information from the enterprise master data;
the extracting unit being further configured to extract first word segments and second word segments from the internal sub-enterprise information and the external business cooperation enterprise information, respectively, by using a feature word extraction model;
an input unit, configured to input the internal sub-enterprise information and the first word segments into a text conversion model, and to input the external business cooperation enterprise information and the second word segments into the text conversion model;
wherein the text conversion model comprises a first sub-model based on time sequence and a second sub-model based on word frequency statistics; the first sub-model is used to determine a first feature vector of the input data according to the vector representation of each item of content in the input data, the position information of each item of content, and the vector representation of the input word segments; the second sub-model is used to determine a second feature vector of the input data according to the word frequency statistics of each item of content in the input data and the word frequency statistics of the input word segments;
a fusion unit, configured to fuse, for each of the internal sub-enterprise information and the external business cooperation enterprise information, the corresponding first and second feature vectors to obtain a respective fusion vector;
the fusion unit being further configured to fuse the respective fusion vectors of the internal sub-enterprise information and the external business cooperation enterprise information to obtain a target feature vector corresponding to the enterprise master data.
In a fourth aspect, a supply chain data visualization processing apparatus is provided, including:
a display unit, configured to display an interactive interface for editing enterprise master data;
a receiving unit, configured to receive a distance threshold input through the interactive interface;
a clustering unit, configured to cluster a plurality of pieces of enterprise master data into a plurality of class clusters based on the distance threshold and the similarity distance between every two pieces of the enterprise master data, wherein the similarity distance between every two pieces of enterprise master data is determined based on their respective target feature vectors obtained by the method according to the first aspect;
a determining unit, configured to determine a network structure of the plurality of pieces of enterprise master data according to the plurality of class clusters, wherein the network structure comprises a plurality of groups of nodes, each group corresponds to one of the class clusters, and each node in a group represents one piece of enterprise master data belonging to the corresponding class cluster;
the display unit being further configured to display the network structure.
In the supply chain data vectorization method provided by one or more embodiments of the present specification, the target feature vector of the enterprise master data is obtained by fusing the feature vectors output by a first sub-model that takes sequential relationships into account and a second sub-model that does not. This can greatly improve the accuracy of the vector representation of the enterprise master data, so that the enterprise master data can be processed more accurately.
Drawings
To illustrate the technical solutions of the embodiments of the present specification more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present specification, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an implementation scenario provided in an embodiment of the present specification;
Fig. 2 is a flowchart of a supply chain data vectorization method provided in an embodiment of the present specification;
Fig. 3 is a flowchart of a supply chain data vectorization method provided in another embodiment of the present specification;
Fig. 4 is a flowchart of a supply chain data visualization processing method provided in an embodiment of the present specification;
Fig. 5a is a first schematic diagram of an interactive interface provided in an example of the present specification;
Fig. 5b is a second schematic diagram of an interactive interface provided in an example of the present specification;
Fig. 5c is a third schematic diagram of an interactive interface provided in an example of the present specification;
Fig. 5d is a fourth schematic diagram of an interactive interface provided in an example of the present specification;
Fig. 6 is a flowchart of a supply chain data visualization processing method provided in another embodiment of the present specification;
Fig. 7a is a first schematic diagram of an interactive interface provided in another example of the present specification;
Fig. 7b is a second schematic diagram of an interactive interface provided in another example of the present specification;
Fig. 7c is a third schematic diagram of an interactive interface provided in another example of the present specification;
Fig. 8 is a schematic diagram of a processing method of enterprise master data provided in the present specification;
Fig. 9 is a schematic diagram of a supply chain data vectorization device provided in an embodiment of the present specification;
Fig. 10 is a schematic diagram of a supply chain data vectorization device provided in another embodiment of the present specification;
Fig. 11 is a schematic diagram of a supply chain data visualization processing apparatus provided in an embodiment of the present specification.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation scenario provided in an embodiment of the present specification. In Fig. 1, for a current piece of enterprise master data, a feature word extraction model may be used to extract target word segments from the enterprise master data. The enterprise master data and the target word segments may then be input into a text conversion model, which includes a first sub-model based on time sequence and a second sub-model based on word frequency statistics. The first sub-model determines a first feature vector according to the vector representation of each item of content in the input data, the position information of each item of content, and the vector representation of the input word segments. The second sub-model determines a second feature vector according to the word frequency statistics of each item of content in the input data and the word frequency statistics of the input word segments. The first feature vector and the second feature vector are fused to obtain a target feature vector corresponding to the enterprise master data.
In practical applications, the enterprise master data may instead be split first, for example into internal sub-enterprise information and external business cooperation enterprise information. The feature word extraction model then extracts word segments from each of the two kinds of information, and the text conversion model determines the target feature vector of the enterprise master data from the internal sub-enterprise information with its word segments and the external business cooperation enterprise information with its word segments.
After the target feature vector of a piece of enterprise master data is obtained, dimensionality reduction may be performed on it to obtain a reduced feature vector. The similarity distance between every two pieces of enterprise master data is calculated from the reduced feature vector of this piece and those of the other pieces. The pieces of enterprise master data are then clustered according to the pairwise similarity distances and a distance threshold, and the network structure corresponding to the enterprise master data is displayed based on the clustering result.
Specific implementations of the above scheme are described below.
Fig. 2 is a flowchart of a supply chain data vectorization method provided in an embodiment of the present specification. It can be understood that the method may be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities. As shown in Fig. 2, the method may include the following steps:
Step 202: acquire enterprise master data.
The enterprise master data described in the present specification is the master data for business operations between the internal sub-enterprises of a large supply chain integrated service group company and its external business cooperation enterprises; it is also commonly referred to as supply chain data.
In one example, the enterprise master data includes at least internal sub-enterprise information and external business cooperation enterprise information. The external business cooperation enterprise information may include, but is not limited to, at least one of the following: the name of the business cooperation enterprise, its country, and its unified social credit code. The internal sub-enterprise information may include, but is not limited to, at least one of the following: the name and code of the sub-enterprise, and the name and code of the first-level sub-enterprise to which it belongs.
Of course, in an actual business application scenario, the enterprise master data may further include other information such as a BP number, a supplier ID number, and a customer number, which is not limited in the present specification.
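One way to picture a record with the fields listed above is the following sketch. The class names, field names, and sample values are hypothetical and chosen only for illustration; the specification does not define a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class InternalSubEnterpriseInfo:
    name: str              # name of the sub-enterprise
    code: str              # code of the sub-enterprise
    first_level_name: str  # name of the first-level sub-enterprise it belongs to
    first_level_code: str  # code of that first-level sub-enterprise


@dataclass
class ExternalPartnerInfo:
    name: str                        # business cooperation enterprise name
    country: str
    unified_social_credit_code: str


@dataclass
class EnterpriseMasterRecord:
    internal: InternalSubEnterpriseInfo
    external: ExternalPartnerInfo
    # Optional extras such as a BP number, supplier ID number, or customer number.
    extra: Dict[str, str] = field(default_factory=dict)


def split_record(
    record: EnterpriseMasterRecord,
) -> Tuple[InternalSubEnterpriseInfo, ExternalPartnerInfo]:
    """Return the two kinds of information held by one master data record."""
    return record.internal, record.external


# Sample values are made up for the example.
sample = EnterpriseMasterRecord(
    internal=InternalSubEnterpriseInfo("Sub A", "S001", "First-level X", "F01"),
    external=ExternalPartnerInfo("Partner B", "CN", "CODE-PLACEHOLDER"),
    extra={"bp_number": "BP-0001"},
)
```

Keeping the two kinds of information as separate objects makes it straightforward to vectorize the record either as a whole or after splitting.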
Step 204: extract target word segments from the enterprise master data by using the feature word extraction model.
The feature word extraction model may include a hidden Markov model with a new-word discovery function. Specifically, it may be implemented with jieba, SnowNLP, pkuseg, THULAC, or the like.
In one example, the target word segments may be extracted from an enterprise name in the enterprise master data (such as the external business cooperation enterprise name or the internal sub-enterprise name described above). The target word segments may be additional information contained in the enterprise name, such as region information, business type, and brand name.
For example, assume that the external business cooperation enterprise name is "Zhejiang Medicine, Inc."; the extracted target word segments may then be "Zhejiang", "Medicine", and "Limited".
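To make the idea of segmenting an unspaced enterprise name concrete, here is a minimal sketch that uses dictionary-based forward maximum matching. This is a deliberately simple stand-in, not the hidden Markov model mentioned above and not the behavior of jieba or the other tools; the vocabulary and the space-free ASCII name are assumptions for the example.

```python
from typing import List, Set


def forward_max_match(text: str, vocabulary: Set[str], max_len: int = 8) -> List[str]:
    """Greedy segmentation: at each position take the longest known word.

    Characters that start no known word are emitted as single-character
    segments so the loop always advances.
    """
    segments: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                segments.append(piece)
                i += size
                break
    return segments


# Hypothetical vocabulary and a space-free stand-in for an enterprise name.
vocab = {"zhejiang", "medicine", "limited"}
name_segments = forward_max_match("zhejiangmedicinelimited", vocab)
```

A statistical segmenter with new-word discovery would additionally handle names whose parts are missing from the dictionary, which this greedy sketch cannot do.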
Step 206: input the enterprise master data and the target word segments into the text conversion model.
The text conversion model includes a first sub-model based on time sequence and a second sub-model based on word frequency statistics. The first sub-model determines a first feature vector according to the vector representation of each item of content in the input data, the position information of each item of content, and the vector representation of the input word segments. The second sub-model determines a second feature vector according to the word frequency statistics of each item of content in the input data and the word frequency statistics of the input word segments.
Note that enterprise master data may contain content that carries sequence information, such as the address content "Zhejiang Province, Hangzhou City, West Lake District". The first sub-model can learn the features of such sequential content well and determine a vector representation of the enterprise master data on that basis.
Of course, in an actual business application scenario, the enterprise master data may also include content without sequence information, such as an enterprise name and a business type. The second sub-model can classify and summarize the features of such content, for which order does not matter, and determine a vector representation of the enterprise master data on that basis.
In a specific example, the first sub-model may be trained with the PV-DM method of Doc2Vec, commonly called the Distributed Memory model. The second sub-model may be trained with the PV-DBOW method of Doc2Vec, commonly called the Distributed Bag of Words model.
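The difference between an order-aware representation and a word-frequency representation can be shown with a toy contrast. The sketch below is not Doc2Vec and involves no training; both functions are hypothetical hand-written features, used only to show that frequency counts ignore token order while position-based features do not.

```python
from collections import Counter
from typing import List, Sequence


def frequency_vector(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Order-agnostic features: how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def position_aware_vector(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Order-aware features: sum of 1-based positions at which each word occurs."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for position, token in enumerate(tokens, start=1):
        if token in index:
            vector[index[token]] += position
    return vector


vocabulary = ["Zhejiang", "Hangzhou", "West Lake"]
address = ["Zhejiang", "Hangzhou", "West Lake"]
shuffled = ["West Lake", "Hangzhou", "Zhejiang"]

same_counts = frequency_vector(address, vocabulary) == frequency_vector(shuffled, vocabulary)
same_positions = position_aware_vector(address, vocabulary) == position_aware_vector(shuffled, vocabulary)
```

Here `same_counts` is true and `same_positions` is false, which is why fusing both kinds of features can capture more than either alone. With a real Doc2Vec implementation such as gensim's, the choice between PV-DM and PV-DBOW is typically made through the `dm` parameter (`dm=1` or `dm=0`).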
Step 208: fuse the first feature vector and the second feature vector to obtain a target feature vector corresponding to the enterprise master data.
In one example, the target feature vector may be obtained by concatenating the first feature vector and the second feature vector.
Of course, in an actual business application scenario, the target feature vector may also be obtained by summing, averaging, or weighted averaging the first feature vector and the second feature vector.
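The fusion options named in step 208 can be sketched as plain list operations. The function names and the `weight_a` parameter are assumptions for illustration; note that summing and averaging require the two vectors to have the same length, whereas concatenation does not.

```python
from typing import List, Sequence


def concatenate(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Join the two vectors end to end; lengths may differ."""
    return list(a) + list(b)


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b):
        raise ValueError("element-wise fusion needs vectors of equal length")


def elementwise_sum(a: Sequence[float], b: Sequence[float]) -> List[float]:
    _check_same_length(a, b)
    return [x + y for x, y in zip(a, b)]


def weighted_average(a: Sequence[float], b: Sequence[float], weight_a: float) -> List[float]:
    """Weight the first vector by weight_a and the second by 1 - weight_a."""
    _check_same_length(a, b)
    return [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]


def average(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Plain averaging is the weighted average with equal weights."""
    return weighted_average(a, b, 0.5)
```

Concatenation keeps all information from both sub-models at the cost of a longer vector; the element-wise variants keep the dimension fixed.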
It should be noted that, in the embodiments of the present specification, the feature vector of the enterprise master data is obtained by using both the first sub-model, which considers order information, and the second sub-model, which does not, so that the accuracy of the vector representation of the enterprise master data can be greatly improved.
It should be understood that the above describes the vectorization method using one piece of enterprise master data as an example. Feature vectors of the other enterprise master data in the group-level supply chain database of the supply chain integrated service group company can be obtained in the same way. After the feature vectors of the enterprise master data in the group-level supply chain database are obtained, the following steps may further be performed:
Dimensionality reduction is performed on the feature vector of each piece of enterprise master data to obtain its reduced feature vector. The similarity distance between every two pieces of enterprise master data is then calculated from the reduced feature vectors. Finally, the pieces of enterprise master data are clustered according to the pairwise similarity distances and a distance threshold, and the network structure corresponding to the enterprise master data is displayed based on the clustering result. The specific display method is described later.
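Clustering by a distance threshold can be sketched as follows. The text does not name a clustering algorithm, so this example assumes a simple single-linkage rule: two records fall into the same cluster whenever a chain of pairwise distances at or below the threshold connects them. The function names and the use of Euclidean distance are likewise assumptions.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_by_threshold(
    vectors: Dict[str, Sequence[float]],
    threshold: float,
    distance: Callable[[Sequence[float], Sequence[float]], float] = euclidean,
) -> List[List[str]]:
    """Merge records whose pairwise distance is within the threshold (union-find)."""
    parent = {record_id: record_id for record_id in vectors}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(vectors, 2):
        if distance(vectors[a], vectors[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for record_id in vectors:
        clusters.setdefault(find(record_id), []).append(record_id)
    return list(clusters.values())
```

Raising the threshold merges more records into each cluster, which is the kind of parameter a user could tune interactively while watching the resulting groups.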
The similarity distance between the business owner data information may include any one of the following: cosine similarity distance, euclidean distance, manhattan distance, pearson correlation coefficient, and the like. Further, the method of the above dimension reduction processing may include any one of: a Principal Component Analysis (PCA) method, a multi-dimensional scale change method, a T-random proximity embedding method, an incremental principal component analysis method and the like.
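The distance measures listed above can be illustrated with plain Python. This is a sketch under stated assumptions: the function names are hypothetical, and turning cosine similarity and the Pearson coefficient into distances as "1 minus the value" is a common convention assumed here, not something the specification fixes.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity (assumed convention); undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return 1.0 - dot / norm


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(u, v))


def pearson_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - Pearson correlation (assumed convention); undefined for constant vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    cov = sum(a * b for a, b in zip(du, dv))
    scale = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - cov / scale


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan_distance([0.0, 0.0], [3.0, 4.0]))  # 7.0
```

In this sketch the measures are applied to the dimensionality reduction feature vectors, so each pairwise distance is computed on short vectors.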
It should be understood that the above is a description of a method of vectorizing business owner data without splitting the business owner data. The following describes a vectorization method of business owner data in the case of splitting the business owner data.
Fig. 3 is a flowchart of a supply chain data vectorization method according to another embodiment of the present disclosure. It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. As shown in fig. 3, the method may include the steps of:
Step 302 is the same as step 202.
Step 304, extracting internal sub-enterprise information and external business cooperation enterprise information from the business owner data.
The internal sub-enterprise information may include, but is not limited to, at least one of: the name and code of the sub-enterprise, and the name and code of the first-level sub-enterprise to which it belongs. The external business cooperation enterprise information may include, but is not limited to, at least one of: business cooperation enterprise name, country, unified social credit code, and the like.
Step 306, extracting a first word segment and a second word segment from the internal sub-enterprise information and the external business cooperation enterprise information respectively by using the feature word extraction model.
For the feature word extraction model, refer to the description of step 204 above.
In one example, the feature word extraction model may be used to extract the first word segment from the sub-enterprise name in the internal sub-enterprise information and the second word segment from the business cooperation enterprise name in the external business cooperation enterprise information.
Step 308, inputting the internal sub-enterprise information and the first word segment into the text conversion model, and inputting the external business cooperation enterprise information and the second word segment into the text conversion model.
It should be understood that the two kinds of information, each with its word segment, are input separately. For example, the internal sub-enterprise information and the first word segment are first input into the text conversion model to obtain the corresponding first and second feature vectors, and then the external business cooperation enterprise information and the second word segment are input into the text conversion model to obtain the corresponding first and second feature vectors. Of course, in an actual service application scenario, the external business cooperation enterprise information and the second word segment may be input first, followed by the internal sub-enterprise information and the first word segment; this is not limited in this specification.
It should be noted that commonly used text vectorization methods include the one-hot representation method, the Word2vec method, the Doc2vec method, and the like. The storage efficiency of the one-hot representation method is limited by the length of the text, and its result cannot preserve the relationships between words. The Word2vec method is based on a statistical approach and obtains a prediction model by training a neural network; the prediction models include the Continuous Bag-of-Words (CBOW) model, which predicts the current word from its context, and the Skip-Gram model, which predicts the context from the current word. This method generates low-dimensional dense vectors, its result retains the semantic information of words, and the vectors of similar words are close to each other in the vector space. The Doc2vec method introduces the concept of paragraph vectors on the basis of the Word2vec method; its result incorporates word order information while retaining word semantics, removes the limit on the length of the input text, and is therefore better suited to training on multidimensional table data of indefinite length.
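The difference between the two Word2vec prediction tasks can be made concrete by listing the training examples each one would see. The sketch below only builds the (input, target) pairs and trains nothing; the function names, the `window` parameter, and the sample tokens are hypothetical.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 1) -> List[Tuple[List[str], str]]:
    """(context words, current word): CBOW predicts the current word from its context."""
    pairs = []
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, word))
    return pairs


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """(current word, context word): Skip-Gram predicts each context word from the current word."""
    return [(word, ctx)
            for ctx_list, word in cbow_pairs(tokens, window)
            for ctx in ctx_list]


tokens = ["zhengzhou", "mingyang", "trading"]  # hypothetical segmented name
print(cbow_pairs(tokens)[1])      # (['zhengzhou', 'trading'], 'mingyang')
print(skipgram_pairs(tokens)[0])  # ('zhengzhou', 'mingyang')
```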
In order to retain the information of the attribute values and the relationships between attributes, each piece of business owner data is regarded as a text with specific semantics and a specific structure, and the Doc2vec method is adopted to extract the corresponding feature vector. That is, the text conversion model is trained based on the Doc2vec method. The text conversion model comprises a first submodel based on time sequence and a second submodel based on word frequency statistics. The first submodel is used to determine a first feature vector according to the vector representations of the items of content in the input data, the position information of those items, and the vector representations of the input word segments. The second submodel is used to determine a second feature vector according to the word frequency statistics of the items of content in the input data and the word frequency statistics of the input word segments.
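To illustrate what "not considering order" means for the second submodel, the sketch below computes a plain relative-frequency vector over a fixed vocabulary. It is only the order-free statistic, not the trained submodel itself; the function name and the vocabulary are hypothetical.

```python
from collections import Counter
from typing import List


def term_frequency_vector(tokens: List[str], vocabulary: List[str]) -> List[float]:
    """Relative frequency of each vocabulary word in `tokens`, ignoring word order."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in vocabulary]


vocabulary = ["a", "b", "c"]       # hypothetical vocabulary
sentence = ["a", "b", "a", "c"]
print(term_frequency_vector(sentence, vocabulary))        # [0.5, 0.25, 0.25]
print(term_frequency_vector(sentence[::-1], vocabulary))  # [0.5, 0.25, 0.25]
```

Reversing the sentence leaves the vector unchanged, which is exactly the information the order-aware first submodel is meant to add back.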
Step 310, fusing the first and second feature vectors corresponding to the external business cooperation enterprise information, and fusing the first and second feature vectors corresponding to the internal sub-enterprise information, to obtain the respective fusion vectors of the external business cooperation enterprise information and the internal sub-enterprise information.
In one example, the first and second feature vectors of the external business cooperation enterprise information may be spliced to obtain a corresponding fusion vector. And the first and second feature vectors of the internal sub-enterprise information can be spliced to obtain a corresponding fusion vector.
Of course, in practical applications, the corresponding fusion vector may also be obtained by summing, averaging, or weighted averaging the first and second feature vectors.
Step 312, fusing the respective fusion vectors of the external business cooperation enterprise information and the internal sub-enterprise information to obtain a target feature vector corresponding to the business owner data.
In one example, the respective fusion vectors of the external business cooperation enterprise information and the internal sub-enterprise information may be spliced to obtain a target feature vector corresponding to the enterprise owner data.
Of course, in an actual service application scenario, the corresponding target feature vector may also be obtained by summing, averaging, or weighted averaging the two fused vectors.
The following describes a method for acquiring a target feature vector of business owner data, that is, a method for vectorizing supply chain data, with reference to a specific example.
First, assume that the external business cooperation enterprise information includes the attributes business cooperation enterprise name, country, and unified social credit code, denoted $A_1$, $A_2$, and $A_3$ respectively, and that the internal sub-enterprise information includes the attributes sub-enterprise code, sub-enterprise name, code of the first-level sub-enterprise to which it belongs, and name of that first-level sub-enterprise, denoted $B_1$, $B_2$, $B_3$, and $B_4$ respectively. Further, assume that M words are extracted from the business cooperation enterprise name, denoted $a_1, a_2, a_3, \ldots, a_M$, and that N words are extracted from the sub-enterprise name, denoted $b_1, b_2, b_3, \ldots, b_N$. Combining each kind of information with its word segments yields the following two input sentences: $s_1 = \{A_1, A_2, A_3, a_1, a_2, a_3, \ldots, a_M\}$ and $s_2 = \{B_1, B_2, B_3, B_4, b_1, b_2, b_3, \ldots, b_N\}$.
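The construction of the two input sentences can be sketched as follows. The record layout, the field names, and all sample values other than the partner name and country are hypothetical, and a whitespace split merely stands in for the feature word extraction model.

```python
from typing import Dict, List


def build_sentence(record: Dict[str, str], attributes: List[str],
                   segment_field: str) -> List[str]:
    """Attribute values in the given order, followed by the words of one field."""
    values = [record[name] for name in attributes]
    # Whitespace split is a stand-in for the feature word extraction model.
    words = record[segment_field].split()
    return values + words


record = {  # hypothetical piece of business owner data
    "partner_name": "Zhengzhou Mingyang", "country": "zh", "credit_code": "CODE-001",
    "sub_code": "S01", "sub_name": "East Branch",
    "parent_code": "P01", "parent_name": "East Region",
}
s1 = build_sentence(record, ["partner_name", "country", "credit_code"], "partner_name")
s2 = build_sentence(record, ["sub_code", "sub_name", "parent_code", "parent_name"], "sub_name")
print(s1)  # ['Zhengzhou Mingyang', 'zh', 'CODE-001', 'Zhengzhou', 'Mingyang']
print(s2)  # ['S01', 'East Branch', 'P01', 'East Region', 'East', 'Branch']
```

Here s1 has 3 + M items and s2 has 4 + N items, matching the definitions above with M = N = 2.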
For $s_1$, two corresponding feature vectors are output by the two submodels in the text conversion model. When the corresponding feature vector is output by the first submodel, the input sentence is represented by a unique sentence vector D. Since the sentence vector D plays a memory role, it can be regarded as another word vector and also needs to be added to the input. The (M + 4) × O matrix V is a weight matrix, where O is the size of the hidden layer. Denoting the hidden layer by h, the output of the hidden layer h is calculated first:
where $h_i$ denotes the i-th row of h. Next, the input to each node of the output layer is calculated:
Here, the column vector used in this calculation is the j-th column of the matrix V′, which is the weight matrix connecting the hidden layer and the output layer.
Finally, the output of the output layer, referred to as the word order feature vector, is obtained by calculation:
Here, the word order feature vector is the first feature vector.
The feature representations of the first submodel and the second submodel mirror each other, and a word sense feature vector β can be obtained through a similar feature representation; the word sense feature vector is the second feature vector.
From these outputs, the word order feature vector α and the word sense feature vector β corresponding to the external business cooperation enterprise information are obtained. For $s_2$, the same feature representation method yields the word order feature vector α′ and the word sense feature vector β′ corresponding to the internal sub-enterprise information. Finally, the four vectors are superposed to obtain the high-dimensional feature vector γ of the business owner data:
It should be noted that the high-dimensional feature vector is the target feature vector.
When the feature vectors of other business owner data are obtained, only the division into attribute subsets and the choice of attributes requiring word segmentation need to be adjusted; the high-dimensional feature vectors corresponding to all business owner data can then be obtained through a similar process.
It should be noted that, in the embodiment of the present specification, when obtaining the feature vector of the business owner data, first, external business cooperation enterprise information and internal sub-enterprise information are extracted from the business owner data, then, for each of the two kinds of information, a first sub-model considering sequence information and a second sub-model not considering sequence information are respectively used to determine a first feature vector and a second feature vector corresponding to each other, and finally, the two feature vectors of each kind of information are fused, and then the two obtained fusion vectors are fused again to obtain a final target feature vector, so that accuracy of vector representation of the business owner data can be greatly improved.
Likewise, the above describes the vectorization method using a single piece of business owner data as an example. Feature vectors for the other business owner data in the group-level supply chain database may be obtained in the same way. After the feature vectors of the business owner data in the group-level supply chain database are acquired, the following steps can be further executed:
Dimensionality reduction is performed on the feature vector of each business owner data to obtain its dimensionality reduction feature vector. The similarity distance between every two business owner data is then calculated based on the dimensionality reduction feature vectors of all the business owner data. Finally, the business owner data are clustered according to the similarity distance between every two business owner data and a distance threshold, and the network structure corresponding to the business owner data is displayed based on the clustering result. The specific display method will be described later.
A method of displaying business owner data, that is, a method of visualizing supply chain data will be described below.
Fig. 4 is a flowchart of a supply chain data visualization processing method according to an embodiment of the present disclosure. It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. As shown in fig. 4, the method may include the steps of:
Step 402, displaying an interactive interface for editing business owner data.
In one example, the interactive interface may be as shown in FIG. 5a. In FIG. 5a, the interactive interface may include a parameter setting area and a display area, wherein the parameter setting area may include two drop-down lists and one counter. The first drop-down list can be used to select a dimensionality reduction algorithm; supported options include principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), incremental principal component analysis, and the like. The second drop-down list can be used to select a clustering algorithm; supported options include the connected graph algorithm, the DBSCAN algorithm, and the like. The counter may be used to enter a distance threshold, which may be increased or decreased by clicking the "+" and "-" buttons on either side of the counter. The display area is used for displaying the network structure of the business owner data.
Step 404, receiving a distance threshold input through the interactive interface.
It should be understood that the distance threshold may be input by the user for the first time, or may be adjusted by the user by clicking the "+" and "-" buttons on both sides of the counter.
Step 406, clustering the plurality of business owner data into a plurality of class clusters based on the distance threshold and the similarity distance between every two business owner data in the plurality of business owner data.
The plurality of business owner data here may be understood as the business owner data in the group-level supply chain database of a large supply chain integrated services group company, each of which has a target feature vector obtained by the method steps of FIG. 2 or FIG. 3 described above. In addition, before the visualization processing method is executed, the similarity distance between every two business owner data in the group-level supply chain database is recorded in advance. The similarity distance is calculated as described above: dimensionality reduction is performed on the target feature vector of each business owner data to obtain its dimensionality reduction feature vector, and the similarity distance between every two business owner data is then calculated based on these dimensionality reduction feature vectors.
In step 406, a clustering algorithm may be used to cluster the plurality of business owner data based on the distance threshold and the pre-recorded similarity distance between every two business owner data. The clustering algorithm here may include, but is not limited to, any of the following: the connected graph algorithm, the k-means algorithm, hierarchy-based clustering algorithms (e.g., the BIRCH algorithm, the CURE algorithm, etc.), and density-based clustering algorithms (e.g., the DBSCAN algorithm, the OPTICS algorithm, etc.).
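One of the listed options, connected graph clustering, can be sketched with a union-find structure: two records fall into the same class cluster whenever a chain of pairwise distances below the threshold links them. This is an illustrative sketch only; the function name, the dictionary layout for the pre-recorded distances, and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def cluster_by_threshold(n: int, distances: Dict[Tuple[int, int], float],
                         threshold: float) -> List[List[int]]:
    """Connected components of the graph whose edges are pairs closer than `threshold`."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), d in distances.items():
        if d < threshold:
            parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical pre-recorded distances; pairs not listed are treated as unlinked.
distances = {(0, 1): 0.1, (1, 2): 0.3, (0, 2): 0.5, (3, 4): 0.9, (2, 3): 1.2}
print(cluster_by_threshold(5, distances, 0.4))  # [[0, 1, 2], [3], [4]]
print(cluster_by_threshold(5, distances, 1.0))  # [[0, 1, 2], [3, 4]]
```

Raising the threshold merges more records into the same class cluster, which is the effect the counter in the parameter setting area controls.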
Step 408, determining the network structure of the plurality of business owner data according to the plurality of class clusters.
The network structure may include a plurality of groups of nodes, each group corresponding to one of the plurality of class clusters, and each node in each group representing a respective enterprise owner data belonging to the corresponding class cluster.
Step 410, displaying the network structure.
In one example, the network structure may be displayed directly, i.e., the groups of nodes are displayed directly. It should be understood that in this example, there are no connecting edges between the nodes in each group.
In another example, displaying the network structure may specifically include: for each node in the first group corresponding to any first class cluster, fusing the similarity distances between the business owner data represented by the other nodes in the group and the business owner data represented by that node to obtain a fusion distance corresponding to the node; selecting, from the nodes of the first group, the node with the lowest fusion distance as the reference node of the first group; establishing connecting edges between the other nodes in the first group and the reference node; and displaying the nodes and connecting edges in the first group. Other groups corresponding to other class clusters may be displayed similarly.
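The selection of the reference node and the resulting star-shaped edges can be sketched as follows. Summation is assumed here as the way of fusing the distances, since the specification does not fix the fusion operation; the function name and the one-dimensional sample data are hypothetical.

```python
from typing import Callable, List, Tuple


def star_edges(group: List[int],
               distance: Callable[[int, int], float]) -> Tuple[int, List[Tuple[int, int]]]:
    """Pick the node with the lowest fused distance and connect all other nodes to it."""

    def fused(node: int) -> float:
        # Assumption: distances to the other nodes are fused by summation.
        return sum(distance(node, other) for other in group if other != node)

    reference = min(group, key=fused)
    edges = [(reference, node) for node in group if node != reference]
    return reference, edges


# Hypothetical one-dimensional "feature vectors", indexed by node id.
points = [0.0, 1.0, 1.5, 2.0, 5.0]
reference, edges = star_edges([0, 1, 2, 3, 4], lambda i, j: abs(points[i] - points[j]))
print(reference)  # 2
print(edges)      # [(2, 0), (2, 1), (2, 3), (2, 4)]
```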
It should be understood that, in the network structure shown in this example, the nodes of each group are presented as a star structure whose central point is the reference node of the group, which makes it easy to see at a glance how the business owner data close to the business owner data represented by the reference node are distributed in each class cluster.
In one example, an interactive interface comprising the network structure described above may be as shown in FIG. 5b. In FIG. 5b, the distance threshold entered in the parameter setting area is 0.4. The network structure displayed in the display area comprises a plurality of groups of nodes clustered with a distance threshold of 0.4, wherein each node represents one piece of business owner data. In addition, the light nodes in FIG. 5b represent nodes that were not successfully clustered, that is, the similarity distance between the business owner data they represent and the business owner data represented by every other node is greater than 0.4. The dark nodes represent successfully clustered nodes, for which the similarity distance to the business owner data represented by the nodes in the same group is less than 0.4.
It should be noted that fig. 5b is only an illustration of a display of the network structure on a single scale, and when the scale is adjusted, information such as the positions of the nodes in fig. 5b is changed accordingly.
After the network structure is displayed, a first click instruction aiming at any first node based on the input of the interactive interface can be received. And according to the first click instruction, highlighting the first group to which the first node belongs.
For example, in FIG. 5b, after a click instruction for a node is received, the group to which the node belongs may be highlighted, as shown in FIG. 5c. For the network structure shown in the display area of FIG. 5c, refer to the description of FIG. 5b; the difference is that the group of nodes at the lower left is highlighted in FIG. 5c.
For the highlighted first group, a second click instruction for any second node in the first group based on the input of the interactive interface may also be received. And then, according to the second click instruction, the first group is displayed in an enlarged mode, and the detailed contents of each piece of business owner data represented by each node in the first group are displayed. In addition, the brief content of the business owner data represented by the second node can be displayed in the annotation box displayed in a floating manner. The brief content here may include, for example: internal sub-enterprise names and external business cooperation enterprise names, etc.
In one example, details of the pieces of business owner data represented by the nodes in the first group may be displayed in a table. Wherein one row of the table corresponds to the details of one piece of business owner data.
In one particular example, the details of the business owner data represented by the second node may also be highlighted in the table. It should be understood that, in this example, when a click instruction for any row of the table is received through the interactive interface, the corresponding node in the network structure may likewise be highlighted, achieving the effect of multi-view linkage.
It should be noted that, in addition to the contents (e.g., internal sub-enterprise information, external business cooperation enterprise information, etc.) covered by the detailed content of a piece of business owner data, the detailed content of a piece of business owner data described in this specification may also include a similarity distance between the piece of business owner data and data (hereinafter referred to as reference data) represented by a reference node of a class cluster to which the piece of business owner data belongs.
Specifically, it may include the similarity distance between the external business cooperation enterprise information in the piece of business owner data and the external business cooperation enterprise information in the reference data, and the similarity distance between the internal sub-enterprise information in the piece of business owner data and the internal sub-enterprise information in the reference data. The similarity distance between the external business cooperation enterprise information may be calculated based on the respective fusion vectors (see step 310). Similarly, the similarity distance between the internal sub-enterprise information may also be calculated based on the respective fusion vectors.
After the detailed contents of each piece of business owner data represented by each node in a certain group are displayed in the table, editing instructions (including modification and deletion) for the detailed contents of any piece of business owner data input based on the interactive interface can be received. And adjusting the detailed content of any piece of business owner data according to the editing instruction.
For example, after a click instruction for a node in the highlighted group in FIG. 5c is received, the interface may be as shown in FIG. 5d. In FIG. 5d, the enlarged group is displayed in the upper part, with each node representing a piece of business owner data belonging to the highlighted group. The table in the lower part shows the detailed contents of the business owner data represented by each node in the highlighted group, such as business cooperation enterprise information and sub-enterprise information. The first piece of business owner data in the table is the reference data of the group.
Of course, if a piece of business owner data is erroneously modified, the wrong modification may be undone by editing the piece of business owner data again in the table.
In summary, the visualized processing method for supply chain data provided in the embodiments of the present specification clusters similar business owner data in a group-level supply chain database, and then displays a network structure of the business owner data according to a clustering result, which may facilitate a user to search for business owner data (i.e., abnormal business owner data) containing redundant or irregular content. In addition, by displaying the detailed contents of the business owner data in the selected class cluster in the table, the user can be facilitated to quickly view or modify redundant or irregular contents. In a word, the scheme of the specification can greatly improve the processing efficiency of the enterprise owner data.
It should be noted that, with the above supply chain data visualization processing method, a user can only screen abnormal data by clicking each group one by one, which may affect the screening efficiency of the data. In practice, there is often a case where a user may roughly know information about some abnormal data, such as a business name and the like. It should be understood that based on the partial information, the screening efficiency of the abnormal data can be greatly improved. For this reason, the present specification further provides another supply chain data visualization processing method, which is described in detail below.
Fig. 6 is a flowchart of a supply chain data visualization processing method according to another embodiment of the present disclosure. It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. As shown in fig. 6, the method may include the steps of:
step 602, displaying an interactive interface for editing business owner data.
In one example, the interactive interface may be as shown in FIG. 7a. In FIG. 7a, the interactive interface may include first and second parameter setting areas, a creation area, and a display area. The first parameter setting area (shown with a grey background) and the display area (shown with a white background) are in the upper part of FIG. 7a. The first parameter setting area may include two drop-down lists and one counter. The first drop-down list may be used to select a dimensionality reduction algorithm and the second to select a clustering algorithm, with the supported options as described above. The counter may be used to enter a distance threshold, which may be increased or decreased by clicking the "+" and "-" buttons on either side of the counter. The display area is used for displaying the network structure of the business owner data. The second parameter setting area (shown with a grey background) and the creation area (shown with a white background and bounded by a dashed line) are in the lower part of FIG. 7a. The second parameter setting area may include several button controls, where different button controls are used to enter different types of values. For example, some button controls are used to input the external business cooperation enterprise name, country, unified social credit code, and the like, and others are used to input the internal sub-enterprise name, the sub-enterprise code, the name of the first-level sub-enterprise to which it belongs, the code of that first-level sub-enterprise, and the like. It should be noted that a button control can be dragged to the creation area, which is used for creating a search condition. Specifically, the search condition may be created according to the input values of the button controls dragged to the creation area.
Step 604, receiving a search condition and a distance threshold value based on the input of the interactive interface.
The search condition here may be determined according to the input values of the button controls dragged to the creation area by the user. For example, assume that two button controls are dragged to the creation area, that they are used to input the external business cooperation enterprise name and the country respectively, and that the values input at the two controls are "Zhengzhou Mingyang" and "zh". The corresponding search condition is then: external business cooperation enterprise name: "Zhengzhou Mingyang" and country: "zh".
In addition, the distance threshold may be first input by the user, or may be adjusted by the user by clicking the "+" and "-" buttons on both sides of the counter.
Step 606, clustering the plurality of business owner data into a plurality of class clusters based on the distance threshold and the similarity distance between every two business owner data in the plurality of business owner data.
It should be understood that each business owner data here has a target feature vector obtained by the method steps described above in fig. 2 or fig. 3. The specific clustering process can be described with reference to the step 406, and the description herein is not repeated. In addition, before the visualization processing method is executed, the similarity distance between every two business owner data in the group-level supply chain database is also recorded in advance. The similarity distance calculation method may be as described above, that is, the respective target feature vectors of each enterprise owner data are subjected to the dimensionality reduction processing to obtain the respective dimensionality reduction feature vectors. And then calculating the similarity distance between every two business owner data based on the respective dimensionality reduction feature vectors of all the business owner data.
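Step 608, determining the network structure of the plurality of business owner data according to the plurality of class clusters.
The network structure here includes several groups of nodes, each group corresponding to one of the plurality of class clusters, and each node in a group representing one piece of business owner data belonging to the corresponding class cluster.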
Step 610, selecting a target class cluster from the plurality of class clusters based on the search condition.
Wherein the target class cluster here contains target data matching the search condition.
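Selecting the target class cluster amounts to finding the cluster that contains a record matching every field of the search condition. The sketch below assumes exact matching on field values; the function name, the field names, and the sample records other than the "Zhengzhou Mingyang" / "zh" values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]


def find_target(clusters: List[List[int]], records: List[Record],
                condition: Record) -> Optional[Tuple[int, int]]:
    """Return (cluster index, record index) of the first record matching every condition field."""
    for cluster_index, cluster in enumerate(clusters):
        for record_index in cluster:
            record = records[record_index]
            if all(record.get(key) == value for key, value in condition.items()):
                return cluster_index, record_index
    return None


records = [  # hypothetical business owner data
    {"partner_name": "Zhengzhou Mingyang", "country": "zh"},
    {"partner_name": "Zhengzhou Mingyang Co", "country": "zh"},
    {"partner_name": "Other Partner", "country": "zh"},
]
clusters = [[2], [0, 1]]
condition = {"partner_name": "Zhengzhou Mingyang", "country": "zh"}
hit = find_target(clusters, records, condition)
print(hit)  # (1, 0)
```

The returned record index identifies the target data, and the cluster index identifies the target class cluster whose group is then displayed differently.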
Step 612, the target group corresponding to the target class cluster is displayed differently from other groups.
The differentiated display may specifically include: displaying the target group corresponding to the target class cluster in a distinctive manner. The groups in the network structure other than the target group may be displayed according to the method described in step 410.
In one example, displaying the target group corresponding to the target class cluster in a distinctive manner may specifically include: taking each node in the target group in turn as the current node and performing a similarity distance judgment. The similarity distance judgment comprises: calculating the similarity distance between the business owner data represented by the current node and the target data, and, if the calculated similarity distance is smaller than the distance threshold, establishing a connecting edge between the current node and the target node corresponding to the target data. After the similarity distance judgment is finished, the nodes and connecting edges in the target group are highlighted and enlarged.
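The similarity distance judgment described above can be sketched as a single pass over the target group. Skipping the target node itself, so that no self-loop is created, is an assumption of this sketch; the function name and the sample data are hypothetical.

```python
from typing import Callable, List, Tuple


def target_edges(group: List[int], target: int,
                 distance: Callable[[int, int], float],
                 threshold: float) -> List[Tuple[int, int]]:
    """Connect the target node to every other group node closer than `threshold`."""
    return [(target, node)
            for node in group
            if node != target and distance(node, target) < threshold]


# Hypothetical one-dimensional "feature vectors", indexed by node id.
points = [0.0, 0.3, 0.5, 0.9]


def dist(i: int, j: int) -> float:
    return abs(points[i] - points[j])


print(target_edges([0, 1, 2, 3], 0, dist, 0.4))  # [(0, 1)]
print(target_edges([0, 1, 2, 3], 0, dist, 0.6))  # [(0, 1), (0, 2)]
```

Nodes that belong to the target group but lie farther from the target data than the threshold remain in the group without an edge.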
In one example, an interactive interface comprising the network structure described above may be as shown in FIG. 7b. In FIG. 7b, the interactive interface may include a first parameter setting area, a second parameter setting area, a creation area, and a display area; for each area, refer to the description above, which is not repeated here. The first parameter setting area and the display area are in the upper part of FIG. 7b, wherein the display area shows a network structure comprising groups of nodes clustered with a distance threshold of 1.5, each node representing a piece of business owner data. In addition, the display area highlights the group containing the business owner data whose external business cooperation enterprise name is "Zhengzhou Mingyang" and whose country is "zh". The second parameter setting area and the creation area are in the lower part of FIG. 7b, wherein the creation area shows two controls corresponding to the external business cooperation enterprise name: "Zhengzhou Mingyang" and the country: "zh". It should be appreciated that the corresponding search condition can be determined based on the input values of the two controls.
It should be noted that, based on the above network structure, the distribution of each enterprise owner data in the target class cluster searched based on the search condition can be clearly observed.
In one example, if the user observes that similar groupings still exist around the currently highlighted grouping, the user may adjust the current distance threshold to facilitate re-clustering of various business owner data. The method comprises the following specific steps:
receiving a modification instruction aiming at the distance threshold value and input based on the interactive interface, and determining the modified distance threshold value according to the modification instruction. And clustering the plurality of business owner data again based on the modified distance threshold and the similarity distance between every two business owner data in the plurality of business owner data to obtain a plurality of updated clusters. And updating the display network structure according to the updated plurality of class clusters.
In addition, for the network structure shown in the embodiments of the present specification, a click instruction for a node may also be received through the interactive interface. For example, when a click instruction is received for a node in a non-highlighted group, that group may be highlighted. For another example, when a click instruction is received for a node in a highlighted group, the group may be displayed in an enlarged manner, and the details of each piece of business owner data represented by the nodes in the group may be displayed in a table. Note that, if the group currently highlighted and enlarged is the target group described above, the first row of business owner data in the table may be the target data described above. That is, in the visualization processing method shown in FIG. 6, the reference data of the target group is the target data.
In addition, the brief content of the business owner data represented by a node may also be displayed in a floating annotation box, as described above; the details are not repeated here.
For example, after the user changes the distance threshold in FIG. 7b to "1.8", clicks a node to highlight its group, and then clicks a node in the highlighted group, the interface of FIG. 7b may change to that shown in FIG. 7c. In FIG. 7c, the first parameter setting area and the display area are shown at the top, and the display area shows a network structure comprising a group of highlighted nodes, each of which represents a piece of business owner data in the group. The second parameter setting area and the creation area are shown in the middle of FIG. 7c, and the creation area shows two controls corresponding to the external business cooperation enterprise name "Zheng Zhou Ming Yang" and the country "zh". It should be appreciated that the corresponding search condition can be determined based on the input values of the two controls. Finally, a table containing the details of each piece of business owner data in the group is shown at the bottom of FIG. 7c. The first row of the table, that is, the reference data, is the business owner data whose external business cooperation enterprise name is "Zheng Zhou Ming Yang" and whose country is "zh".
In summary, the visualized processing method for supply chain data provided in the embodiments of the present specification clusters similar business owner data in a group-level supply chain database, searches a target cluster including the abnormal data from a clustering result based on a search condition of the abnormal data, and displays the target cluster in a special manner, so that the efficiency of searching for the abnormal data can be greatly improved, and further, the business owner data can be efficiently managed.
FIG. 8 is a schematic diagram of a processing method for business owner data provided in this specification. In FIG. 8, the processing method includes the following steps:
Step a, data preprocessing. This step may specifically include: a1, acquiring the feature vectors of the business owner data; a2, performing dimension reduction processing on the feature vectors; a3, calculating the similarity distance between every two business owner data.
Step b, displaying the network structure. This step may specifically include: b1, clustering the business owner data based on the distance threshold; b2, displaying the network structure according to the clustering result.
Step c, locating abnormal data. This step can be implemented in two ways: either each group is viewed in turn, or the target group is located based on the search condition.
Step d, editing abnormal data. Specifically, the abnormal data may be modified or deleted based on the table. If business owner data is modified by mistake, it can be restored by undoing the incorrect modification in the table.
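Step a3 of the preprocessing stage can be sketched as follows. The sketch assumes that the similarity distance is the Euclidean distance between dimension-reduced feature vectors; the specification does not name a particular metric, and the helper `pairwise_distances` and the sample vectors are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(vectors):
    """Return {(i, j): distance} for every pair i < j of feature vectors."""
    return {
        (i, j): math.dist(vectors[i], vectors[j])  # Euclidean distance (assumed metric)
        for i, j in combinations(range(len(vectors)), 2)
    }


# Hypothetical 2-D dimension-reduced feature vectors for three records.
reduced = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
distances = pairwise_distances(reduced)
```

Computing the distances once in the preprocessing stage means that later changes to the distance threshold only require re-clustering, not re-vectorizing the data.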
Corresponding to the supply chain data vectorization method, an embodiment of the present specification further provides a supply chain data vectorization apparatus. As shown in FIG. 9, the apparatus may include:
An obtaining unit 902, configured to obtain business owner data.
An extracting unit 904, configured to extract the target participle from the business owner data by using the feature word extraction model.
An input unit 906, configured to input the enterprise owner data and the target word segmentation into a text conversion model, where the text conversion model includes a first sub-model based on time sequence and a second sub-model based on word frequency statistics. The first submodel is used for determining a first feature vector according to the vector representation of each item of content in the input data, the position information of each item of content and the vector representation of the input word segmentation. The second submodel is used for determining a second feature vector according to the word frequency statistics of each item of content in the input data and the word frequency statistics of the input participles.
A fusion unit 908, configured to fuse the first feature vector and the second feature vector to obtain a target feature vector corresponding to the business owner data.
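The word-frequency side of the text conversion model can be illustrated with a minimal sketch. It assumes raw term counts over a fixed vocabulary as the word frequency statistic; the actual second sub-model is not specified at this level of detail, and the function name, tokens, and vocabulary below are hypothetical.

```python
from collections import Counter


def word_frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in tokens (raw counts)."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


# Hypothetical tokens of one record plus its extracted target participles.
record_tokens = ["zhengzhou", "mingyang", "trading", "zh"]
target_participles = ["mingyang", "trading"]
vocabulary = ["mingyang", "trading", "zh", "logistics"]

# The record content and the extracted participles are counted together,
# so the extracted feature words carry extra weight in the vector.
second_vector = word_frequency_vector(record_tokens + target_participles, vocabulary)
```

In this sketch the extracted participles are simply appended to the record tokens before counting, which is one possible way of letting both inputs contribute to the second feature vector.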
Optionally, the apparatus may further include:
A dimension reduction unit 910, configured to perform dimension reduction processing on the target feature vector to obtain a dimension reduction feature vector.
The dimension reduction processing method comprises any one of the following: a principal component analysis (PCA) method, a multidimensional scaling (MDS) method, a t-distributed stochastic neighbor embedding (t-SNE) method, and an incremental principal component analysis method.
A calculating unit 912, configured to calculate a similarity distance between every two business owner data based on the dimensionality reduction feature vector of the business owner data and the dimensionality reduction feature vectors of other business owner data.
A clustering unit 914, configured to cluster each piece of business owner data according to the similarity distance between every two pieces of business owner data and the distance threshold.
A display unit 916, configured to display a network structure corresponding to each piece of business owner data based on the clustering result.
The functions of each functional module of the device in the above embodiments of the present description may be implemented through each step of the above method embodiments, and therefore, a specific working process of the device provided in one embodiment of the present description is not repeated herein.
The supply chain data vectorization device provided by one embodiment of the present specification can greatly improve the accuracy of vector representation of supply chain data.
Corresponding to the supply chain data vectorization method, an embodiment of the present specification further provides a supply chain data vectorization apparatus. As shown in FIG. 10, the apparatus may include:
An obtaining unit 1002, configured to obtain business owner data, where the business owner data includes internal sub-enterprise information and external business cooperation enterprise information.
An extracting unit 1004, configured to extract the internal sub-enterprise information and the external business cooperation enterprise information from the business owner data.
The extracting unit 1004 is further configured to extract the first participle and the second participle from the internal sub-enterprise information and the external business cooperation enterprise information, respectively, by using the feature word extraction model.
An input unit 1006, configured to input the internal sub-enterprise information and the first participle into a text conversion model, and input the external business cooperation enterprise information and the second participle into the text conversion model.
The text conversion model comprises a first sub-model based on time sequence and a second sub-model based on word frequency statistics. The first submodel is used for determining a first feature vector of the input data according to the vector representation of each item of content in the input data, the position information of each item of content and the vector representation of the input word segmentation. The second submodel is used for determining a second feature vector of the input data according to the word frequency statistics of each item of content in the input data and the word frequency statistics of the input participles.
The fusion unit 1008 is configured to fuse the first and second feature vectors corresponding to the internal sub-enterprise information and the external business cooperation enterprise information, respectively, to obtain a fusion vector of the internal sub-enterprise information and the external business cooperation enterprise information, respectively.
The fusion unit 1008 is further configured to fuse the respective fusion vectors of the internal sub-enterprise information and the external business cooperation enterprise information to obtain a target feature vector corresponding to the business owner data.
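The two-level fusion performed by the fusion unit can be sketched as follows. Concatenation is used here purely as an assumed fusion rule, because the specification does not state how vectors are fused at this point; the sub-model outputs are placeholder values, and `concat` is a hypothetical helper.

```python
def concat(*vectors):
    """Concatenate vectors; used here as an assumed fusion rule."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


# Placeholder sub-model outputs (first, second feature vector) for each part.
internal_first, internal_second = [0.1, 0.3], [1.0, 0.0]
external_first, external_second = [0.7, -0.2], [0.0, 2.0]

# Level 1: one fusion vector per information part.
internal_fusion = concat(internal_first, internal_second)
external_fusion = concat(external_first, external_second)

# Level 2: fuse the two fusion vectors into the target feature vector.
target_feature_vector = concat(internal_fusion, external_fusion)
```

Other fusion rules, such as weighted sums of equal-length vectors, would fit the same two-level structure; only the body of `concat` would change.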
The functions of each functional module of the device in the above embodiments of the present description may be implemented through each step of the above method embodiments, and therefore, a specific working process of the device provided in one embodiment of the present description is not repeated herein.
The supply chain data vectorization device provided by one embodiment of the present specification can greatly improve the accuracy of vector representation of supply chain data.
Corresponding to the supply chain data visualization processing method, an embodiment of the present specification further provides a supply chain data visualization processing apparatus. As shown in FIG. 11, the apparatus may include:
A display unit 1102, configured to display an interactive interface for editing business owner data.
A receiving unit 1104, configured to receive a distance threshold input through the interactive interface.
The clustering unit 1106 is configured to cluster the plurality of business owner data into a plurality of clusters based on the distance threshold and the similarity distance between every two business owner data in the plurality of business owner data. The similarity distance between every two business owner data is determined based on the target feature vectors of the business owner data obtained by the method steps shown in fig. 2 or fig. 3.
A determining unit 1108, configured to determine a network structure of the plurality of business owner data according to the plurality of class clusters. The network structure comprises a plurality of groups of nodes, wherein each group corresponds to one of the plurality of class clusters, and the nodes in each group represent the respective pieces of business owner data belonging to the corresponding class cluster.
The display unit 1102 is further configured to display a network structure.
The display unit 1102 is specifically configured to:
for each node in a first group corresponding to any first cluster, fusing similarity distances between business owner data represented by other nodes and business owner data represented by the node to obtain a fusion distance corresponding to the node;
selecting a node with the lowest corresponding fusion distance from all nodes of the first group as a reference node of the first group;
establishing connection edges between other nodes in the first group and the reference node;
displaying the nodes and connecting edges in the first group.
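The selection of a reference node and the construction of connecting edges can be sketched as follows. The sketch assumes that the fusion distance of a node is the sum of its similarity distances to all other nodes in the group; a mean or another aggregation would fit the description equally well. The function names and sample distances are hypothetical.

```python
def reference_node(nodes, distance):
    """Pick the node whose fusion distance to the rest of its group is lowest.

    Assumption: the fusion distance is the sum of the node's similarity
    distances to all other nodes in the group.
    """
    def fusion_distance(node):
        return sum(distance(node, other) for other in nodes if other != node)

    return min(nodes, key=fusion_distance)


def star_edges(nodes, distance):
    """Connect every other node of the group to the reference node."""
    ref = reference_node(nodes, distance)
    return ref, [(ref, node) for node in nodes if node != ref]


# Hypothetical group of three records with symmetric similarity distances.
table = {frozenset({"a", "b"}): 1.0,
         frozenset({"a", "c"}): 1.2,
         frozenset({"b", "c"}): 0.4}
ref, edges = star_edges(["a", "b", "c"], lambda x, y: table[frozenset({x, y})])
```

Here node "b" has the lowest summed distance (1.4, against 2.2 for "a" and 1.6 for "c"), so it becomes the reference node and the group is drawn as a star centred on it.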
Optionally, the receiving unit 1104 is further configured to receive a first click instruction for an arbitrary first node based on the input of the interactive interface.
The display unit 1102 is further configured to highlight, according to the first click instruction, the first group to which the first node belongs.
Optionally, the receiving unit 1104 is further configured to receive a second click instruction for any second node in the first group based on the input of the interactive interface.
The display unit 1102 is further configured to display the first group in an enlarged manner according to the second click instruction, and display details of each piece of business owner data represented by each node in the first group.
The display unit 1102 is specifically configured to: details of each piece of business owner data represented by each node in the first group are displayed in a table. Wherein one row of the table corresponds to the details of one piece of business owner data.
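The two click instructions described above can be modelled as a small state transition. This is only an illustrative sketch: the state dictionary, its keys, and the function `handle_click` are hypothetical and do not correspond to any interface named in the specification.

```python
def handle_click(state, group_id):
    """Return the new view state after a node in group_id is clicked.

    state is a dict with keys 'highlighted' and 'enlarged', each holding a
    group id or None.
    """
    if state["highlighted"] != group_id:
        # Click on a node of a non-highlighted group: highlight that group.
        return {"highlighted": group_id, "enlarged": None}
    # Click on a node of the already highlighted group: enlarge it, which is
    # also the point at which the detail table would be shown.
    return {"highlighted": group_id, "enlarged": group_id}


state = {"highlighted": None, "enlarged": None}
after_first_click = handle_click(state, 2)
after_second_click = handle_click(after_first_click, 2)
after_other_group = handle_click(after_second_click, 5)
```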
Optionally, the display unit 1102 is further configured to display the brief content of the business owner data represented by the second node in the annotation box displayed in a floating manner.
Optionally, the receiving unit 1104 is further configured to receive an editing instruction, input through the interactive interface, for the detailed content of any piece of business owner data. The detailed content of that piece of business owner data is then adjusted according to the editing instruction.
Optionally, the apparatus may further include: a selection unit 1110.
The receiving unit 1104 is further configured to receive a search condition input based on the interactive interface.
The selecting unit 1110 is configured to select a target class cluster from the multiple class clusters based on the search condition. Wherein the target class cluster contains target data matching the search criteria.
The display unit 1102 is specifically configured to:
the target group corresponding to the target class cluster is displayed differently from the other groups.
In one example, the target group may be displayed in a special manner.
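Selecting the target class cluster from a search condition can be sketched as follows. Exact matching on field values is an assumption, and the field names `partner_name` and `country` as well as the sample records are hypothetical stand-ins for the two controls of the creation area.

```python
def select_target_cluster(clusters, condition):
    """Return (cluster_index, record) for the first record matching condition.

    clusters: list of clusters, each a list of record dicts.
    condition: dict of field -> required value (exact match is an assumption).
    """
    for index, cluster in enumerate(clusters):
        for record in cluster:
            if all(record.get(field) == value for field, value in condition.items()):
                return index, record
    return None, None


# Hypothetical clustering result with two class clusters.
clusters = [
    [{"partner_name": "Luo Yang Trading", "country": "zh"}],
    [{"partner_name": "Zheng Zhou Ming Yang", "country": "zh"},
     {"partner_name": "Zheng Zhou Mingyang Co", "country": "zh"}],
]
index, target = select_target_cluster(
    clusters, {"partner_name": "Zheng Zhou Ming Yang", "country": "zh"}
)
```

The returned record plays the role of the target data, and the returned index identifies the group that would be displayed differently from the others.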
The display unit 1102 is further specifically configured to:
sequentially take each node in the target group as the current node and perform a similarity distance judgment. The similarity distance judgment comprises: calculating the similarity distance between the business owner data represented by the current node and the target data, and, if the calculated similarity distance is smaller than the distance threshold, establishing a connecting edge between the current node and the target node corresponding to the target data. After the similarity distance judgment is finished, the nodes and connecting edges in the target group are highlighted and displayed in an enlarged manner.
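The similarity distance judgment for the target group can be sketched as follows, again assuming Euclidean distance between feature vectors as the similarity distance. The function `target_edges`, the node identifiers, and the vectors are hypothetical.

```python
import math


def target_edges(group, target, threshold):
    """Return edges (node, target) for nodes closer to the target than threshold.

    group: dict of node id -> feature vector; target: id of the target node.
    Euclidean distance is an assumed similarity distance.
    """
    edges = []
    for node, vector in group.items():
        if node == target:
            continue  # the target node is not connected to itself
        if math.dist(vector, group[target]) < threshold:
            edges.append((node, target))
    return edges


# Hypothetical target group: "t" is the node of the target data.
group = {"t": (0.0, 0.0), "n1": (1.0, 0.0), "n2": (0.0, 2.0)}
edges = target_edges(group, "t", 1.5)
```

Unlike the reference-node layout used for ordinary groups, the centre of the star here is fixed by the search result, so only nodes within the threshold of the target data receive an edge.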
Optionally, the receiving unit 1104 is further configured to receive a modification instruction for the distance threshold based on the input of the interactive interface.
The determining unit 1108 is further configured to determine the modified distance threshold according to the modification instruction.
The clustering unit 1106 is further configured to re-cluster the plurality of business owner data based on the modified distance threshold and the similarity distance between every two business owner data in the plurality of business owner data, so as to obtain a plurality of updated clusters.
The display unit 1102 is further configured to update the displayed network structure according to the updated plurality of class clusters.
The functions of each functional module of the device in the above embodiments of the present description may be implemented through each step of the above method embodiments, and therefore, a specific working process of the device provided in one embodiment of the present description is not repeated herein.
The supply chain data visualization processing device provided by one embodiment of the specification helps a user intuitively understand the distribution of similar data in supply chain data.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2, 3 or 4.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in conjunction with fig. 2, fig. 3, or fig. 4.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or may be embodied in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a server. Of course, the processor and the storage medium may also reside as discrete components in a server.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted as one or more instructions or code on, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The specific embodiments described above further explain the objects, technical solutions and advantages of the present specification in detail. It should be understood that they are only specific embodiments of the present specification and are not intended to limit its scope; any modification, equivalent substitution, improvement and the like made on the basis of the technical solutions of the present specification shall fall within the scope of the present specification.
Claims (10)
1. A supply chain data vectorization method, comprising:
acquiring enterprise owner data; the enterprise owner data comprises internal sub-enterprise information and external business cooperation enterprise information;
extracting the internal sub-enterprise information and the external business cooperation enterprise information from the enterprise owner data;
extracting a first word segmentation and a second word segmentation from the internal sub-enterprise information and the external business cooperation enterprise information respectively by using a feature word extraction model;
inputting the internal sub-enterprise information and the first segmentation into a text conversion model, and inputting the external business cooperation enterprise information and the second segmentation into the text conversion model;
the text conversion model comprises a first sub-model based on time sequence and a second sub-model based on word frequency statistics; the first sub-model is used for determining a first feature vector of input data according to vector representation of each item of content in the input data, position information of each item of content and vector representation of input word segmentation; the second submodel is used for determining a second feature vector of the input data according to the word frequency statistics of each content in the input data and the word frequency statistics of the input participles;
fusing first and second feature vectors corresponding to the internal sub-enterprise information and the external business cooperation enterprise information respectively to obtain respective fusion vectors of the internal sub-enterprise information and the external business cooperation enterprise information;
and fusing respective fusion vectors of the internal sub-enterprise information and the external business cooperation enterprise information to obtain a target feature vector corresponding to the enterprise owner data.
2. The method of claim 1, further comprising:
performing dimensionality reduction processing on the target feature vector to obtain a dimensionality reduction feature vector;
calculating the similarity distance between every two business owner data based on the dimensionality reduction characteristic vector of the business owner data and the dimensionality reduction characteristic vectors of other business owner data;
clustering each business owner data according to the similarity distance between every two business owner data and a distance threshold;
and displaying the network structure corresponding to each piece of business owner data based on the clustering result.
3. A supply chain data visualization processing method comprises the following steps:
displaying an interactive interface for editing business owner data;
receiving a distance threshold based on the interactive interface input;
clustering the plurality of enterprise owner data into a plurality of clusters based on the distance threshold and the similarity distance between every two enterprise owner data in the plurality of enterprise owner data; the similarity distance between every two business owner data is determined based on the target feature vectors of the business owner data acquired according to the method of claim 1;
determining the network structure of the plurality of enterprise owner data according to the plurality of class clusters; the network structure comprises a plurality of groups of nodes, wherein each group corresponds to one of the plurality of class clusters, and each node in each group represents each enterprise owner data belonging to the corresponding class cluster;
and displaying the network structure.
4. The method of claim 3, wherein said displaying the network structure comprises:
for each node in a first group corresponding to any first cluster, fusing similarity distances between business owner data represented by other nodes and business owner data represented by the node to obtain a fusion distance corresponding to the node;
selecting a node with the lowest corresponding fusion distance from all nodes of the first group as a reference node of the first group;
establishing connection edges between other nodes in the first group and the reference node;
and displaying the nodes and the connecting edges in the first group.
5. The method of claim 3 or 4, further comprising:
receiving a first click instruction aiming at any first node and input based on the interactive interface;
and according to the first click instruction, highlighting a first group to which the first node belongs, and displaying the data content of each enterprise owner data represented by each node in the first group.
6. The method of claim 3, wherein prior to said displaying said network structure, further comprising:
receiving a search condition input based on the interactive interface;
selecting a target class cluster from the plurality of class clusters based on the search condition; wherein the target class cluster contains target data matched with a search condition;
the displaying the network structure includes:
and displaying the target group corresponding to the target class cluster distinctively from the other groups.
7. The method of claim 6, wherein said displaying the target group corresponding to the target class cluster distinctively from the other groups comprises:
sequentially taking each node in the target group as a current node and performing a similarity distance judgment; the similarity distance judgment comprises the following steps: calculating the similarity distance between the business owner data represented by the current node and the target data, and if the calculated similarity distance is smaller than the distance threshold, establishing a connecting edge between the current node and the target node corresponding to the target data;
and after the similarity distance judgment is finished, highlighting and enlarging the nodes and the connecting edges in the target group.
8. The method of claim 3, further comprising:
receiving a modification instruction for the distance threshold based on the interactive interface input;
determining a modified distance threshold according to the modification instruction;
clustering the plurality of enterprise owner data again based on the modified distance threshold value and the similarity distance between every two enterprise owner data in the plurality of enterprise owner data to obtain a plurality of updated clusters;
and updating and displaying the network structure according to the updated plurality of class clusters.
9. A supply chain data vectorization device, comprising:
the acquisition unit is used for acquiring enterprise owner data; the enterprise owner data comprises internal sub-enterprise information and external business cooperation enterprise information;
an extracting unit, configured to extract the internal sub-enterprise information and the external business cooperation enterprise information from the enterprise owner data;
the extracting unit is further configured to extract a first participle and a second participle from the internal sub-enterprise information and the external business cooperation enterprise information respectively by using a feature word extracting model;
the input unit is used for inputting the internal sub-enterprise information and the first segmentation into a text conversion model, and inputting the external business cooperation enterprise information and the second segmentation into the text conversion model;
the text conversion model comprises a first sub-model based on time sequence and a second sub-model based on word frequency statistics; the first sub-model is used for determining a first feature vector of input data according to vector representation of each item of content in the input data, position information of each item of content and vector representation of input word segmentation; the second submodel is used for determining a second feature vector of the input data according to the word frequency statistics of each content in the input data and the word frequency statistics of the input participles;
the fusion unit is used for fusing the first and second feature vectors corresponding to the internal sub-enterprise information and the external business cooperation enterprise information respectively to obtain fusion vectors of the internal sub-enterprise information and the external business cooperation enterprise information respectively;
the fusion unit is further configured to fuse respective fusion vectors of the internal sub-enterprise information and the external business cooperation enterprise information to obtain a target feature vector corresponding to the enterprise owner data.
10. A supply chain data visualization processing apparatus, comprising:
the display unit is used for displaying an interactive interface for editing business owner data;
the receiving unit is used for receiving a distance threshold value input based on the interactive interface;
the clustering unit is used for clustering the plurality of enterprise owner data into a plurality of clusters based on the distance threshold value and the similarity distance between every two enterprise owner data in the plurality of enterprise owner data; the similarity distance between every two business owner data is determined based on the target feature vectors of the business owner data acquired according to the method of claim 1;
a determining unit, configured to determine a network structure of the plurality of enterprise owner data according to the plurality of class clusters; the network structure comprises a plurality of groups of nodes, wherein each group corresponds to one of the plurality of class clusters, and each node in each group represents each enterprise owner data belonging to the corresponding class cluster;
the display unit is further used for displaying the network structure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111045671.6A CN113486630B (en) | 2021-09-07 | 2021-09-07 | Supply chain data vectorization and visualization processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113486630A (en) | 2021-10-08 |
CN113486630B (en) | 2021-11-19 |
Family
ID=77947287
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111045671.6A Active CN113486630B (en) | 2021-09-07 | 2021-09-07 | Supply chain data vectorization and visualization processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113486630B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107943847A (en) * | 2017-11-02 | 2018-04-20 | 平安科技(深圳)有限公司 | Business connection extracting method, device and storage medium |
CN110275942A (en) * | 2019-06-26 | 2019-09-24 | 上海交通大学 | A kind of electronics authority security incident convergence analysis method |
CN111898378A (en) * | 2020-07-31 | 2020-11-06 | 中国联合网络通信集团有限公司 | Industry classification method and device for government and enterprise clients, electronic equipment and storage medium |
CN112632980A (en) * | 2020-12-30 | 2021-04-09 | 广州友圈科技有限公司 | Enterprise classification method and system based on big data deep learning and electronic equipment |
Non-Patent Citations (2)
Title |
---|
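Zhou Shunxian et al.: "Text feature representation method based on Word2vector", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) * |
Tan Jinfeng et al.: "On enterprise master data management against the background of the integration of informatization and industrialization", Information Technology and Informatization * |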
Also Published As
Publication number | Publication date |
---|---|
CN113486630B (en) | 2021-11-19 |
Similar Documents
Publication | Title |
---|---|
Xiang et al. | Interactive correction of mislabeled training data |
JP7002459B2 (en) | Systems and methods for ontology induction with statistical profiling and reference schema matching |
US7398268B2 (en) | Systems and methods that facilitate data mining |
De Bie et al. | Automating data science |
CA3179300C (en) | Domain-specific language interpreter and interactive visual interface for rapid screening |
WO2018236886A1 (en) | System and method for code and data versioning in computerized data modeling and analysis |
WO2023130837A1 (en) | Automatic machine learning implementation method, platform and apparatus for scientific research application |
Crescenzi et al. | A framework for learning web wrappers from the crowd |
US11775267B2 (en) | Identification and application of related source code edits |
Wu et al. | Explainable data transformation recommendation for automatic visualization |
US11513673B2 (en) | Steering deep sequence model with prototypes |
WO2021240370A1 (en) | Domain-specific language interpreter and interactive visual interface for rapid screening |
US20110153530A1 (en) | Method and system for analyzing a legacy system based on trails through the legacy system |
CN117056528A (en) | Method for constructing knowledge graph system based on real-time big data |
CN113486630B (en) | Supply chain data vectorization and visualization processing method and device |
CN110222032A (en) | A kind of generalised event model based on software data analysis |
US11809698B1 (en) | Phrase builder for data analytic natural language interface |
CN117764536B (en) | Innovative entrepreneur project auxiliary management system based on artificial intelligence |
US12019998B1 (en) | Phrase recommendations for data visualizations |
US20240152522A1 (en) | Data set semantic similarity clustering |
Bernstein et al. | Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting |
Mining | Gaël Bernard¹ and Periklis Andritsos² Faculty of Business and Economics (HEC), University of Lausanne, Lausanne, Switzerland gael.bernard@unil.ch Faculty of Information, University of Toronto, Toronto, Canada |
Zubcoff et al. | On the suitability of time series analysis on data warehouses |
Marconi | Stakeholder Relationship Management for Software Projects |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |