CN112650951A

CN112650951A - Enterprise similarity matching method, system and computing device

Info

Publication number: CN112650951A
Application number: CN202011522207.7A
Authority: CN
Inventors: 龙非池; 张炫
Original assignee: Rocking Digital Chongqing Technology Co ltd
Current assignee: Rocking Digital Chongqing Technology Co ltd
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date: 2021-04-13

Abstract

The invention provides an enterprise similarity matching method, an enterprise similarity matching system and a computing device. The method comprises the following steps: inputting query information and generating a query information vector from the query information; acquiring an enterprise corpus and generating corresponding enterprise word vectors from the enterprise description data in the enterprise corpus; calculating the similarity between the query information vector and each enterprise word vector by a similarity algorithm; and ranking the enterprises according to the similarity of the enterprise word vectors, obtaining a ranking result, and outputting similar enterprises according to the ranking result. The invention can match similar enterprises semantically, which improves matching accuracy; it avoids full-text traversal of the enterprise description data; and it improves matching speed by having the system perform the computation in cooperation with a computing device.

Description

Enterprise similarity matching method, system and computing device

Technical Field

The invention relates to the technical field of enterprise data analysis, in particular to an enterprise similarity matching method, an enterprise similarity matching system and a computing device.

Background

As an enterprise develops, it often needs to keep constant watch on its competitors, adopting their strengths and discarding their weaknesses, so as to improve itself and develop better; the enterprise data analysis industry arose from this need. When enterprise data analysis is performed, competitor enterprises must first be queried so that competitors can be selected and their activity followed continuously. In the prior art, an inverted index is generally built over the enterprise description texts in a database for keyword retrieval; keywords are then extracted from the query information, similar enterprises are retrieved using those keywords, and the similar enterprises are ranked by an algorithm.

However, when similar enterprises are searched by keyword, the system often cannot fully recognize valid synonyms, or a synonym list has to be maintained manually, so the matching process is rather mechanical and the accuracy of the matching result is low. In addition, keyword queries in the prior art require a full-text traversal of all enterprise description texts, which takes a long time.

Disclosure of Invention

In view of the foregoing, it is desirable to provide an enterprise similarity matching method, system and computing device.

An enterprise similarity matching method comprises the following steps: inputting query information and generating a query information vector from the query information; acquiring an enterprise corpus and generating corresponding enterprise word vectors from the enterprise description data in the enterprise corpus; calculating the similarity between the query information vector and each enterprise word vector by a similarity algorithm; and ranking the enterprises according to the similarity of the enterprise word vectors, obtaining a ranking result, and outputting similar enterprises according to the ranking result.

In one embodiment, acquiring an enterprise corpus and generating corresponding enterprise word vectors from the enterprise description data in the enterprise corpus specifically includes: the enterprise corpus comprises a plurality of items of enterprise description data, and word segmentation is performed on the enterprise description data to obtain enterprise description words; descriptor vectors of the enterprise description words are obtained through the word2vec algorithm; and the descriptor vectors of the enterprise description words are processed through a recurrent neural network to obtain the enterprise word vector.

In one embodiment, the descriptor vector of the enterprise description vocabulary is processed through a transformer network to obtain an enterprise word vector.

In one embodiment, calculating the similarity between the query information vector and the enterprise word vector by a similarity algorithm specifically includes: dividing the enterprise description data into an enterprise name field, a business scope field and an enterprise profile field; calculating, against the query information vector, the similarity of the enterprise name field, the business scope field and the enterprise profile field respectively, to obtain a name similarity, a business scope similarity and a profile similarity; and weighting the name similarity, the business scope similarity and the profile similarity by their corresponding weights to obtain the similarity of the enterprise word vector.

In one embodiment, weighting the name similarity, the business scope similarity and the profile similarity by their corresponding weights to obtain the similarity of the enterprise word vector specifically includes: setting the weight of the name similarity to 2, and setting the weights of the business scope similarity and the profile similarity to 1.

In one embodiment, ranking according to the similarity of the enterprise word vectors and outputting similar enterprises specifically includes: ranking the enterprises in descending order of the similarity of their enterprise word vectors to generate a ranking result; presetting an output enterprise threshold, and selecting similar enterprises from the ranking result according to the output enterprise threshold; and outputting and displaying the selected similar enterprises.

An enterprise similarity matching system comprises: an information input module for inputting query information and generating a query information vector from the query information; a first vector generation module for acquiring an enterprise corpus and generating corresponding enterprise word vectors from the enterprise description data in the enterprise corpus; a first similarity calculation module for calculating the similarity between the query information vector and the enterprise word vector through a similarity algorithm; and a result output module for ranking the enterprises according to the similarity of the enterprise word vectors, obtaining a ranking result, and outputting similar enterprises according to the ranking result.

A computing device for computing enterprise similarities in cooperation with the enterprise similarity matching system comprises a second vector generation module and a second similarity calculation module, which are integrated on the same chip and communicatively connected. The second vector generation module is used for generating the enterprise word vectors from the enterprise description data in the enterprise corpus through a neural network algorithm and transmitting the enterprise word vectors to the second similarity calculation module; the second similarity calculation module is used for calculating the similarity between the query information vector and the enterprise word vector and ranking according to the similarity of the enterprise word vectors.

In one embodiment, the second similarity calculation module includes an enterprise field vector calculation unit, an enterprise vector calculation unit, a similarity calculation unit and a sorting unit, which are communicatively connected. The enterprise field vector calculation unit is used for generating a corresponding enterprise field vector from each enterprise field; the enterprise vector calculation unit is used for calculating an enterprise word vector from the enterprise field vectors; the similarity calculation unit is used for calculating the similarity between the enterprise word vector and the query information vector; and the sorting unit is used for sorting the enterprises according to the similarity.

Compared with the prior art, the invention has the advantages and beneficial effects that:

1. The similar enterprise retrieval system can acquire synonyms automatically, without a manually maintained synonym list, so that the semantic information of the text is understood during retrieval and the accuracy of similar enterprise matching is improved.

2. Computation is performed jointly by the computing device and the enterprise similarity matching system, which improves the speed of enterprise similarity matching.

Drawings

FIG. 1 is a schematic flow chart diagram illustrating a method for enterprise similarity matching, according to an embodiment;

FIG. 2 is a block diagram that illustrates an enterprise similarity matching system, according to an embodiment;

FIG. 3 is a diagram of a computing device in one embodiment;

FIG. 4 is a schematic structural diagram of the second similarity calculation module in FIG. 3.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings by way of specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In one embodiment, as shown in FIG. 1, there is provided an enterprise similarity matching method, including the following steps:

Step S101, inputting query information, and generating a query information vector according to the query information.

Specifically, the query information may include an enterprise name field, a business scope field and an enterprise profile field. Query information can be entered into each field separately; word vector conversion is performed on the query information of each field, and the word vectors of all the query information are integrated to obtain the query information vector.

To ensure that the vectors generated from the enterprise name field are comparable, the field generally needs to be processed to identify and extract the information that is meaningful for comparison, such as "science and technology company", "consulting and management company" and "agricultural cooperative". Likewise, when the enterprise name field in the enterprise corpus is converted into a vector, the information that is meaningful for comparison also needs to be extracted.

Step S102, acquiring an enterprise corpus, and generating corresponding enterprise word vectors from the enterprise description data in the enterprise corpus.

Specifically, the enterprise corpus includes enterprise description data for a plurality of enterprises. The enterprise description data may include the enterprise name, business scope, enterprise profile and the like, and is converted into high-dimensional vectors carrying its semantic information through neural network algorithms, for example word2vec (a model for generating word vectors) and LSTM (Long Short-Term Memory network).

To speed up the subsequent calculation when searching for similar enterprises, the enterprise description data in the enterprise corpus can be converted into the corresponding enterprise word vectors in advance and stored in a database. After a query request is received, the query vector is compared directly against the stored enterprise word vectors, so that a query result is obtained quickly.
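As a non-limiting illustration of the precomputation described above, the following Python sketch stores one vector per enterprise ahead of time and compares the query vector against the stored vectors at query time. The names `toy_embed`, `cosine` and `EnterpriseIndex` are hypothetical, an in-memory dictionary stands in for the database, and the hashing-based `toy_embed` is only a placeholder for the neural network encoder of this embodiment.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for the neural encoder: hashed bag of words."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Sum of character codes gives a deterministic bucket per token.
        vec[sum(ord(c) for c in token) % dim] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class EnterpriseIndex:
    """In-memory stand-in for the database of precomputed enterprise vectors."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._vectors: Dict[str, Vector] = {}

    def add(self, name: str, description: str) -> None:
        # Offline step: encode once and store the result.
        self._vectors[name] = self._embed(description)

    def query(self, text: str) -> List[Tuple[str, float]]:
        # Online step: encode only the query, then compare with stored vectors.
        q = self._embed(text)
        scored = [(name, cosine(q, v)) for name, v in self._vectors.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With this arrangement only the query text is encoded when a request arrives; the enterprise description texts themselves are never traversed at query time.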

Specifically, this embodiment does not limit the language of the enterprise description data, which may be Chinese, English, French or Japanese. Although description data in different languages must be handled by different natural language processing modules, the vectors generated from the enterprise description data fields can be mapped into a single space in which similarities are comparable; this serves the function of translation and facilitates the subsequent similarity calculation.

Step S103, calculating the similarity of the query information vector and the enterprise word vector by a similarity algorithm.

The cosine distance between the query information vector and each enterprise word vector is calculated through a similarity algorithm, such as the cosine similarity algorithm, and the similarity between the query information vector and the enterprise word vector is judged from the cosine distance, so that similar enterprises are obtained.
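For illustration only, cosine similarity between two vectors can be computed as in the following Python sketch. The function names are hypothetical, and the convention that cosine distance equals one minus cosine similarity is a common one assumed here rather than one stated in this embodiment.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return cos(theta) between a and b, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed convention: distance = 1 - similarity (smaller is closer)."""
    return 1.0 - cosine_similarity(a, b)
```

Vectors pointing in the same direction give a similarity of 1 and a distance of 0, while orthogonal vectors give a similarity of 0.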

In addition, the enterprise description data can be divided into different fields, such as an enterprise name field, a business scope field and an enterprise profile field. The similarity is calculated for each field separately, and the similarity between the enterprise and the query information is then obtained by weighting according to the weights of the different fields, so that similar enterprises are obtained.

Step S104, ranking the enterprises according to the similarity of the enterprise word vectors, obtaining a ranking result, and outputting similar enterprises according to the ranking result.

Specifically, the similarity between each enterprise and the query information is obtained from the similarity of the enterprise word vectors; the enterprises can then be ranked in descending order of similarity, and the enterprises in the ranking result are displayed as the output.

In this embodiment, query information is input and a query information vector is generated from it; an enterprise corpus is acquired and the enterprise description data in it is converted into corresponding enterprise word vectors; the similarity between the query information vector and the enterprise word vectors is calculated through a similarity algorithm; and the enterprises are ranked by similarity and similar enterprises are output according to the ranking result. Similar enterprises can thus be matched semantically, which improves matching accuracy, and full-text traversal of the enterprise description data is avoided, which improves matching speed.

In one embodiment, step S102 specifically includes: the enterprise corpus comprises a plurality of enterprise description data, and the enterprise description data is subjected to word segmentation processing to obtain enterprise description words; obtaining a descriptor vector of an enterprise description vocabulary through a word2vec algorithm; and processing the descriptor vector of the enterprise description vocabulary through a recurrent neural network to obtain an enterprise word vector.
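The following Python sketch illustrates, under simplifying assumptions, how a sequence of word vectors can be folded into a single enterprise vector by a recurrent network. It uses a plain Elman-style recurrence rather than an LSTM, the 2-dimensional word vectors in `WORD_VECTORS` and the weight matrices `W_X` and `W_H` are invented for illustration (trained word2vec vectors and learned weights would be used in practice), and the input is assumed to be already segmented into words.

```python
import math
from typing import Dict, List

Vector = List[float]

# Hypothetical 2-dimensional word vectors standing in for word2vec output.
WORD_VECTORS: Dict[str, Vector] = {
    "cloud": [0.9, 0.1],
    "software": [0.8, 0.3],
    "farming": [0.1, 0.9],
}

# Illustrative fixed weights; a real network would learn these in training.
W_X = [[0.5, 0.0], [0.0, 0.5]]  # input-to-hidden
W_H = [[0.3, 0.1], [0.1, 0.3]]  # hidden-to-hidden


def matvec(m: List[Vector], v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_encode(tokens: List[str]) -> Vector:
    """Run h_t = tanh(W_X x_t + W_H h_{t-1}) and return the final state."""
    h = [0.0, 0.0]
    for tok in tokens:
        x = WORD_VECTORS.get(tok)
        if x is None:
            continue  # skip out-of-vocabulary words
        pre = [a + b for a, b in zip(matvec(W_X, x), matvec(W_H, h))]
        h = [math.tanh(p) for p in pre]
    return h
```

Because each hidden state depends on the previous one, the final vector reflects word order, which a simple average of word vectors would not.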

In one embodiment, the descriptor vector of the enterprise description vocabulary is processed through a transformer network to obtain an enterprise word vector.

Specifically, the transformer network is a neural network architecture based on a self-attention mechanism. It can process a whole sentence at once as a matrix, which speeds up computation of the enterprise word vector and thus accelerates similar enterprise matching.
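To make the self-attention mechanism concrete, the following Python sketch applies scaled dot-product self-attention to a sentence represented as a matrix with one row per word vector, and then averages the rows into a single vector. This is a minimal sketch under stated assumptions: the query, key and value projections are taken to be the identity, and a full transformer would additionally use learned projections, multiple heads, positional encodings and feed-forward layers. The function names and the mean pooling step are illustrative choices, not part of this embodiment.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """softmax(X X^T / sqrt(d)) X with identity Q/K/V projections (assumption).

    Expects a non-empty matrix whose rows all have the same length d.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    out: Matrix = []
    for q in x:
        # Attention scores of this word against every word in the sentence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def mean_pool(x: Matrix) -> List[float]:
    """Average the rows to obtain a single fixed-size vector."""
    n = len(x)
    return [sum(row[j] for row in x) / n for j in range(len(x[0]))]
```

Each output row depends on all input rows at once and not on a previous step, which is why the rows can be computed in parallel as a matrix operation.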

In one embodiment, step S103 specifically includes: dividing the enterprise description data into an enterprise name field, a business scope field and an enterprise profile field; calculating, against the query information vector, the similarity of the enterprise name field, the business scope field and the enterprise profile field respectively, to obtain a name similarity, a business scope similarity and a profile similarity; and weighting the name similarity, the business scope similarity and the profile similarity by their corresponding weights to obtain the similarity of the enterprise word vector.

In one embodiment, the weight of the name similarity may be set to 2, and the weights of the business scope similarity and the profile similarity may be set to 1. Of course, the weights of the different fields may also be adjusted according to actual needs and are not limited to the values set in this embodiment.
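The per-field weighting described above may be sketched in Python as follows. The field keys, the function names and the division by the sum of the weights are assumptions made for this illustration; the embodiment specifies only the weights 2, 1 and 1.

```python
import math
from typing import Dict, Sequence

# Weights from this embodiment: name counts double.
FIELD_WEIGHTS: Dict[str, float] = {
    "name": 2.0,
    "business_scope": 1.0,
    "profile": 1.0,
}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def weighted_similarity(
    query_vec: Sequence[float],
    field_vecs: Dict[str, Sequence[float]],
    weights: Dict[str, float] = FIELD_WEIGHTS,
) -> float:
    """Weighted mean of the per-field similarities to the query vector."""
    total = sum(
        weights[field] * cosine_similarity(query_vec, vec)
        for field, vec in field_vecs.items()
    )
    # Assumption: normalize by the weight sum so the score stays in [-1, 1].
    return total / sum(weights[field] for field in field_vecs)
```

For example, with a name similarity of 1, a business scope similarity of 0 and a profile similarity of 1, the weighted result is (2×1 + 1×0 + 1×1) / 4 = 0.75.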

Specifically, the enterprise description data is converted into an n-dimensional vector, where the value of n is not limited; Table 1 gives an example with n = 2:

TABLE 1 Example of 2-dimensional vectors generated from enterprise description data

The enterprise word vectors shown in Table 1 are obtained by weighted integration, with the weight of the enterprise name vector set to 2 and the weights of the business scope vector and the enterprise profile vector set to 1. The similarity between the query information vector and each enterprise word vector is then calculated according to a similarity algorithm, which gives the enterprise similarity.
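One possible reading of the weighted integration is a weighted sum of the field vectors, sketched below in Python for n = 2. The function name is hypothetical and the vectors in the usage note are invented for illustration; they are not the values of Table 1.

```python
from typing import List, Sequence, Tuple


def combine_field_vectors(
    name_vec: Sequence[float],
    scope_vec: Sequence[float],
    profile_vec: Sequence[float],
    weights: Tuple[float, float, float] = (2.0, 1.0, 1.0),
) -> List[float]:
    """Weighted sum of the three field vectors (assumed form of integration)."""
    w_name, w_scope, w_profile = weights
    return [
        w_name * a + w_scope * b + w_profile * c
        for a, b, c in zip(name_vec, scope_vec, profile_vec)
    ]
```

For instance, a name vector [1.0, 0.0], a business scope vector [0.0, 1.0] and a profile vector [1.0, 1.0] combine to [3.0, 2.0]. Because cosine similarity ignores vector length, dividing the sum by the total weight to form a weighted mean would yield the same similarity.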

Alternatively, the enterprise name vector, the business scope vector and the enterprise profile vector in the corpus may each be compared with the corresponding enterprise name vector, business scope vector and enterprise profile vector in the query information to calculate per-field similarities, and the similarities of the different fields are finally weighted and integrated to obtain the enterprise similarity.

In one embodiment, step S104 specifically includes: ranking the enterprises in descending order of the similarity of their enterprise word vectors to generate a ranking result; presetting an output enterprise threshold, and selecting similar enterprises from the ranking result according to the output enterprise threshold; and outputting and displaying the selected similar enterprises.

Specifically, TOP-K ranking may be performed: the K enterprises with the highest similarity are selected as similar enterprises, where K is the preset output enterprise threshold and may be set as required. For example, with K = 10, the enterprises in the top 10 positions of the ranking result are taken as similar enterprises and are output and displayed.
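A TOP-K selection of this kind can be sketched in Python with the standard library `heapq.nlargest` function. The function name `top_k` and the mapping from enterprise name to similarity are assumptions for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def top_k(similarities: Dict[str, float], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k (enterprise, similarity) pairs with the highest similarity,
    in descending order of similarity."""
    return heapq.nlargest(k, similarities.items(), key=lambda item: item[1])
```

When K is much smaller than the number of enterprises, selecting with a heap avoids fully sorting the whole list.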

As shown in FIG. 2, there is provided an enterprise similarity matching system 20, comprising an information input module 21, a first vector generation module 22, a first similarity calculation module 23 and a result output module 24, wherein:

the information input module 21 is used for inputting query information and generating a query information vector from the query information;

the first vector generation module 22 is used for acquiring an enterprise corpus and generating corresponding enterprise word vectors from the enterprise description data in the enterprise corpus;

the first similarity calculation module 23 is used for calculating the similarity between the query information vector and the enterprise word vector through a similarity algorithm; and

the result output module 24 is used for ranking the enterprises according to the similarity of the enterprise word vectors, obtaining a ranking result, and outputting similar enterprises according to the ranking result.

In one embodiment, the first vector generation module 22 is specifically configured to: the enterprise corpus comprises a plurality of enterprise description data, and the enterprise description data is subjected to word segmentation processing to obtain enterprise description words; obtaining a descriptor vector of an enterprise description vocabulary through a word2vec algorithm; and processing the descriptor vector corresponding to each sentence of enterprise description data through a recurrent neural network to obtain an enterprise word vector.

In one embodiment, the first similarity calculation module 23 is specifically configured to: divide the enterprise description data into an enterprise name field, a business scope field and an enterprise profile field; calculate, against the query information vector, the similarity of the enterprise name field, the business scope field and the enterprise profile field respectively, to obtain a name similarity, a business scope similarity and a profile similarity; and weight the name similarity, the business scope similarity and the profile similarity by their corresponding weights to obtain the similarity of the enterprise word vector.

In one embodiment, the result output module 24 is specifically configured to: rank the enterprises in descending order of the similarity of their enterprise word vectors to generate a ranking result; preset an output enterprise threshold, and select similar enterprises from the ranking result according to the output enterprise threshold; and output and display the selected similar enterprises.

As shown in FIG. 3, there is provided a computing device 30 for computing enterprise similarities in cooperation with the enterprise similarity matching system, comprising a second vector generation module 31 and a second similarity calculation module 32, which are integrated on the same chip and communicatively connected. The second vector generation module 31 is configured to generate enterprise word vectors from the enterprise description data in the enterprise corpus through a neural network algorithm and to transmit the enterprise word vectors to the second similarity calculation module 32; the second similarity calculation module 32 is configured to calculate the similarity between the query information vector and the enterprise word vectors and to rank according to the similarity of the enterprise word vectors.

In the embodiment, in order to ensure low delay of similar enterprise retrieval by the enterprise similarity matching method and improve the speed of the retrieval process, a computing device 30 is provided. The computing device 30 may be used as a co-processor in cooperation with the enterprise similarity matching system 20 to perform enterprise similarity calculation.

Specifically, the second vector generation module 31 is configured to generate enterprise word vectors from the enterprise description data in the enterprise corpus. For languages such as Chinese and Japanese, word segmentation must be performed on the original text before the word vectors are calculated. The second vector generation module 31 then computes the word vector corresponding to each word, and these word vectors are integrated to obtain the enterprise word vector.

The word segmentation process is generally performed by software, and may be performed by, for example, a CPU of a server or a computing unit in the computing device 30.

Specifically, after the second vector generation module 31 obtains the enterprise word vector, it transmits the enterprise word vector to the second similarity calculation module 32, which calculates the similarity between the enterprise word vector and the query information vector.

In an embodiment, the computing device 30 may also be an integrated circuit designed as shown in FIG. 3, that is, a structure including a second vector generation module and a second similarity calculation module. Each such structure serves as a computing core; a plurality of computing cores are connected by a network-on-chip (NoC) to form a board, a plurality of boards are installed in a server, and a plurality of servers form a cluster.

Specifically, the computing device 30 may include a plurality of computing cores among which the workload is divided. Each computing core is responsible for a certain portion of the enterprise description data: it calculates the similarity between the corresponding enterprise word vectors and the query information vector and ranks that portion of the enterprises by similarity. Finally, the enterprise word vectors whose similarity ranks within a preset threshold range in each computing core are gathered through a summary algorithm and ranked a second time to obtain the final similar enterprises.

Specifically, the computing process of similar enterprise matching can be further accelerated through a plurality of computing cores, so that the matching speed is improved.

The summary algorithm includes, but is not limited to, merge sorting on a CPU, pairwise merge sorting between computing cores, and the like.
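The two-stage ranking described above (a local ranking per computing core followed by a merge of the partial results) can be imitated in software as in the following Python sketch. Ordinary lists stand in for the computing cores, the function names are hypothetical, and the merge runs on a single CPU using the standard library `heapq.merge`.

```python
import heapq
import itertools
from typing import Iterable, List, Tuple

Scored = Tuple[str, float]  # (enterprise, similarity)


def local_top_k(shard: Iterable[Scored], k: int) -> List[Scored]:
    """Stage 1: each core keeps its own k best results, in descending order."""
    return heapq.nlargest(k, shard, key=lambda pair: pair[1])


def merge_top_k(partials: List[List[Scored]], k: int) -> List[Scored]:
    """Stage 2: merge the already-sorted partial lists and keep the global top k."""
    merged = heapq.merge(*partials, key=lambda pair: pair[1], reverse=True)
    return list(itertools.islice(merged, k))
```

Because every partial list is already in descending order, the merge only needs to draw K items in total and does not re-sort all candidates.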

In one embodiment, as shown in FIG. 4, the second similarity calculation module 32 includes an enterprise field vector calculation unit 321, an enterprise vector calculation unit 322, a similarity calculation unit 323 and a sorting unit 324, which are communicatively connected to one another. The enterprise field vector calculation unit 321 is configured to generate a corresponding enterprise field vector from each enterprise field; the enterprise vector calculation unit 322 is configured to calculate an enterprise word vector from the enterprise field vectors; the similarity calculation unit 323 is configured to calculate the similarity between the enterprise word vector and the query information vector; and the sorting unit 324 is configured to sort the enterprises according to the similarity.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and optionally they may be implemented in program code executable by a computing device, such that they may be stored on a computer storage medium (ROM/RAM, magnetic disks, optical disks) and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The foregoing is a more detailed description of the present invention that is presented in conjunction with specific embodiments, and the practice of the invention is not to be considered limited to those descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. An enterprise similarity matching method is characterized by comprising the following steps:

inputting query information, and generating a query information vector according to the query information;

acquiring an enterprise corpus, and generating corresponding enterprise word vectors from enterprise description data in the enterprise corpus;

calculating the similarity of the query information vector and the enterprise word vector by a similarity algorithm;

and sequencing the enterprises according to the similarity of the enterprise word vectors, acquiring a sequencing result, and outputting similar enterprises according to the sequencing result.

2. The method according to claim 1, wherein the obtaining an enterprise corpus generates corresponding enterprise word vectors from enterprise description data in the enterprise corpus, specifically comprising:

the enterprise corpus comprises a plurality of enterprise description data, and the enterprise description data is subjected to word segmentation processing to obtain enterprise description words;

obtaining a descriptor vector of the enterprise description vocabulary through a word2vec algorithm;

and processing the descriptor vector of the enterprise description vocabulary through a recurrent neural network to obtain an enterprise word vector.

3. The method as claimed in claim 2, wherein the descriptor vector of the enterprise description vocabulary is processed through a transformer network to obtain an enterprise word vector.

4. The method according to claim 1, wherein the calculating the similarity between the query information vector and the enterprise word vector by a similarity algorithm specifically comprises:

dividing the enterprise description data into an enterprise name field, a business scope field and an enterprise profile field;

respectively calculating, according to the query information vector, the similarity of the enterprise name field, the business scope field and the enterprise profile field, and acquiring the name similarity, the business scope similarity and the profile similarity;

and weighting according to the weights corresponding to the name similarity, the business scope similarity and the profile similarity to obtain the similarity of the enterprise word vectors.

5. The method according to claim 4, wherein the obtaining the similarity of the enterprise word vectors by weighting according to the weights corresponding to the name similarity, the business scope similarity and the profile similarity specifically comprises:

setting the weight of the name similarity to 2, and setting the weights of the business scope similarity and the profile similarity to 1.

6. The method according to claim 1, wherein the sorting according to the similarity of the enterprise word vectors and outputting similar enterprise results specifically comprises:

sequencing similar enterprises in descending order of the similarity of the enterprise word vectors to generate a sequencing result;

presetting an output enterprise threshold, and selecting similar enterprises in the sequencing result according to the output enterprise threshold;

and outputting and displaying the selected similar enterprises.

7. An enterprise similarity matching system, comprising:

the information input module is used for inputting query information and generating a query information vector according to the query information;

the first vector generation module is used for acquiring an enterprise corpus and generating corresponding enterprise word vectors from enterprise description data in the enterprise corpus;

the first similarity calculation module is used for calculating the similarity of the query information vector and the enterprise word vector through a similarity algorithm;

and the result output module is used for sequencing the enterprises according to the similarity of the enterprise word vectors, acquiring a sequencing result and outputting similar enterprises according to the sequencing result.

8. A computing device configured to compute enterprise similarities in cooperation with the enterprise similarity matching system, comprising:

a second vector generation module and a second similarity calculation module, which are integrated on the same chip and communicatively connected;

the second vector generation module is used for generating the enterprise word vectors from the enterprise description data in the enterprise corpus through a neural network algorithm and transmitting the enterprise word vectors to the second similarity calculation module;

and the second similarity calculation module is used for calculating the similarity between the query information vector and the enterprise word vector and sequencing according to the similarity of the enterprise word vector.

9. The computing device according to claim 8, wherein the second similarity calculation module comprises:

an enterprise field vector calculation unit, an enterprise vector calculation unit, a similarity calculation unit and a sequencing unit, which are communicatively connected;

the enterprise field vector calculation unit is used for generating a corresponding enterprise field vector according to the enterprise field;

the enterprise vector calculation unit is used for calculating an enterprise word vector according to the enterprise field vector;

the similarity calculation unit is used for calculating the similarity between the enterprise word vector and the query information vector;

and the sequencing unit is used for sequencing the similarity of the enterprises according to the similarity.