CN117499340A - Communication resource name matching method, device, equipment and medium

Info

Publication number: CN117499340A
Application number: CN202311439792.8A
Authority: CN
Inventors: 武军
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2023-11-01
Filing date: 2023-11-01
Publication date: 2024-02-02

Abstract

The application relates to a communication resource name matching method, device, equipment and medium. The method comprises: obtaining the resource name of a resource to be matched, wherein the resource to be matched is a communication resource in a communication resource system; obtaining, according to characteristic information in the resource name, M associated resource names associated with the resource name, wherein the associated resources corresponding to the associated resource names and the resource to be matched belong to different communication resource systems; performing word segmentation processing on the resource name and the M associated resource names to obtain word segmentation word bags corresponding to the resource name and the M associated resource names respectively; obtaining the TF-IDF similarity and the Jaccard similarity between the resource name and the M associated resource names according to these word segmentation word bags; and determining, based on the TF-IDF similarity and the Jaccard similarity, N target resource names matched with the resource name from the M associated resource names, wherein the target resources corresponding to the N target resource names are matched with the resource to be matched.

Description

Communication resource name matching method, device, equipment and medium

Technical Field

The present disclosure relates to the field of communications technologies, and in particular, to a method, an apparatus, a device, and a medium for matching communication resource names.

Background

With the rapid development of internet technology, the communication field based on the internet is becoming a basic service industry necessary for people's daily life. In the field of communication service, the number of resources is numerous and complex, the attributes are variable, and the resources are often managed by different independent systems, and can be divided into equipment resources and line resources according to types and can be divided into resources such as a core network, a transmission network, an access network, a mobile network and the like according to a topological structure.

In the prior art, in order to realize intelligent matching of resource names among different systems, a keyword matching method, a fuzzy query method and other character string processing methods are generally adopted to find out another resource matched with a certain resource.

However, naming rules of different resource systems are inconsistent, and the above method only relies on a single character string to perform resource matching, which results in low association accuracy.

Disclosure of Invention

The application provides a method, a device, equipment and a medium for matching communication resource names, which are used for solving the problem of low accuracy of resource matching among different resource systems in the prior art.

In a first aspect, the present application provides a method for matching communication resource names, including:

acquiring a resource name of a resource to be matched, wherein the resource to be matched is a communication resource in a communication resource system;

According to the characteristic information in the resource names, M associated resource names associated with the resource names are obtained, and associated resources corresponding to the associated resource names and the resources to be matched belong to different communication resource systems, wherein the characteristic information comprises geographic position information, and M is a positive integer;

performing word segmentation processing on the resource names and the M associated resource names to obtain word segmentation word bags corresponding to the resource names and the M associated resource names respectively;

acquiring TF-IDF similarity and Jaccard similarity of the resource names and the M associated resource names according to the word segmentation word bags corresponding to the resource names and the word segmentation word bags corresponding to the M associated resource names;

and determining N target resource names matched with the resource name from the M associated resource names based on the TF-IDF similarity and the Jaccard similarity, wherein target resources corresponding to the N target resource names are matched with the resource to be matched, N is a positive integer, and M is greater than or equal to N.

In one possible implementation manner, the determining, based on the TF-IDF similarity and the Jaccard similarity, N target resource names that match the resource name from the M associated resource names includes:

determining a comprehensive similarity based on the TF-IDF similarity and the Jaccard similarity;

and determining N target resource names matched with the resource names in the M associated resource names according to the comprehensive similarity, wherein the comprehensive similarity of the N target resource names and the resource names is greater than or equal to a preset similarity.

In one possible implementation, the determining the comprehensive similarity based on the TF-IDF similarity and the Jaccard similarity includes:

calculating the product of the TF-IDF similarity and the first weight to obtain a first sub-similarity;

calculating the product of the Jaccard similarity and a second weight to obtain a second sub-similarity, wherein the sum of the second weight and the first weight is a preset weight;

and summing the first sub-similarity and the second sub-similarity to obtain the comprehensive similarity.

In a possible implementation manner, the performing word segmentation processing on the resource name and the M associated resource names to obtain word segmentation word bags corresponding to the resource name and the M associated resource names respectively includes:

performing word segmentation on the resource name and the M associated resource names according to a non-segmentable word list, to obtain the non-segmentable words and the segmentable words contained in the resource name and in each of the M associated resource names, wherein the non-segmentable word list comprises at least one of administrative area names, communication proper nouns, communication site names, organization unit names, continuous letters or continuous numbers;

obtaining the word segmentation word bag corresponding to the resource name according to the non-segmentable words and the segmentable words contained in the resource name;

and obtaining the word segmentation word bag corresponding to each associated resource name in the M associated resource names according to the non-segmentable words and the segmentable words contained in the M associated resource names.

In a possible implementation manner, the obtaining, according to the word segmentation bag corresponding to the resource name and the word segmentation bags corresponding to the M associated resource names, TF-IDF similarity between the resource name and the M associated resource names includes:

removing, according to a stop word list, the stop words in the word segmentation word bag corresponding to the resource name and in the word segmentation word bags corresponding to the M associated resource names, to obtain new word segmentation word bags corresponding to the resource name and the M associated resource names respectively, wherein the stop word list comprises function words and punctuation marks;

calculating the TF value and the IDF value of each word in the new word segmentation word bags corresponding to the resource name and to each of the M associated resource names, and calculating the TF-IDF value of each word;

and calculating the cosine similarity between the resource name and each associated resource name in the M associated resource names based on the TF-IDF values of the words in the new word segmentation word bags corresponding to the resource name and the M associated resource names, and taking the cosine similarity as the TF-IDF similarity.

In one possible implementation manner, the calculating the cosine similarity between the resource name and each associated resource name in the M associated resource names based on TF-IDF values of each word in the new word-segmentation word bag corresponding to each of the resource name and the M associated resource names includes:

vectorizing the resource names and the M associated resource names to obtain matrixes corresponding to the resource names and the M associated resource names respectively;

calculating the cosine similarity between the matrix corresponding to the resource name and the matrix corresponding to each associated resource name in the M associated resource names;

the vectorizing the resource names and the M associated resource names to obtain matrices corresponding to the resource names and the M associated resource names, including:

taking the resource name as a row of a matrix, taking the word in the new word bag corresponding to the resource name as a column of the matrix, and taking the TF-IDF value of each word as a value of the matrix.
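The vectorization and cosine step described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function names are hypothetical, the TF value is taken as the relative frequency of a word within one bag, and a smoothed IDF is assumed because the text does not fix a particular formula.

```python
import math
from collections import Counter


def tfidf_vectors(bags):
    """Build one TF-IDF row per bag; the shared vocabulary forms the columns."""
    vocab = sorted({word for bag in bags for word in bag})
    n_names = len(bags)
    # Document frequency: in how many bags each word occurs.
    df = Counter(word for bag in bags for word in set(bag))
    # Smoothed IDF (an assumption; the source does not specify the variant).
    idf = {w: math.log((1 + n_names) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for bag in bags:
        counts = Counter(bag)
        total = len(bag) or 1  # avoid division by zero for an empty bag
        rows.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tfidf_similarities(query_bag, candidate_bags):
    """TF-IDF similarity between the query bag and each candidate bag."""
    _, rows = tfidf_vectors([query_bag] + candidate_bags)
    return [cosine(rows[0], row) for row in rows[1:]]
```

Here the first row corresponds to the resource name to be matched and each further row to one associated resource name, so the returned list holds one TF-IDF similarity per associated resource name.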

In a possible implementation manner, the obtaining, according to the word segmentation word bag corresponding to the resource name and the word segmentation word bags corresponding to the M associated resource names, the Jaccard similarity between the resource name and the M associated resource names includes:

taking the intersection of the new word segmentation word bag corresponding to the resource name and the new word segmentation word bag corresponding to each associated resource name in the M associated resource names to obtain m words, wherein m is a positive integer;

taking the union of the new word segmentation word bag corresponding to the resource name and the new word segmentation word bag corresponding to each associated resource name in the M associated resource names to obtain n words, wherein n is a positive integer, and n is greater than or equal to m;

and calculating the quotient of m and n to obtain the Jaccard similarity between the resource name and each associated resource name in the M associated resource names.
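The intersection, union and quotient steps above can be sketched as a short function. The function name is hypothetical, and returning 0.0 when both bags are empty is an assumption made here to avoid division by zero.

```python
def jaccard_similarity(bag_a, bag_b):
    """Size of the intersection divided by the size of the union of two word bags."""
    set_a, set_b = set(bag_a), set(bag_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty bags are treated as dissimilar
    return len(set_a & set_b) / len(union)
```

For example, the bags `["a", "b", "c"]` and `["b", "c", "d"]` share two words out of four distinct words, giving a similarity of 0.5.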

In a second aspect, the present application provides a communication resource name matching apparatus, including:

the acquisition module is used for acquiring the resource name of the resource to be matched, wherein the resource to be matched is a communication resource in the communication resource system;

the first processing module is used for acquiring M associated resource names associated with the resource names according to the characteristic information in the resource names, wherein associated resources corresponding to the associated resource names and the resources to be matched belong to different communication resource systems, the characteristic information comprises geographic position information, and M is a positive integer;

The second processing module is used for carrying out word segmentation processing on the resource names and the M associated resource names to obtain word segmentation word bags corresponding to the resource names and the M associated resource names respectively;

the third processing module is used for obtaining TF-IDF similarity and Jaccard similarity of the resource names and the M associated resource names according to the word segmentation bag corresponding to the resource names and the word segmentation bags corresponding to the M associated resource names;

and a fourth processing module, configured to determine, based on the TF-IDF similarity and the Jaccard similarity, N target resource names that match the resource name among the M associated resource names, wherein the target resources corresponding to the N target resource names match the resource to be matched, N is a positive integer, and M is greater than or equal to N.

In a third aspect, the present application provides a communication resource name matching apparatus, including:

a memory;

at least one processor;

wherein the memory stores computer-executable instructions;

the at least one processor executes the computer-executable instructions stored in the memory to implement the communication resource name matching method as described in the various possible implementations of the first aspect above.

In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the communication resource name matching method as described in the various possible implementations of the first aspect.

According to the communication resource name matching method, device, equipment and medium provided by the application, the resource name of a resource to be matched is obtained, wherein the resource to be matched is a communication resource in a communication resource system; M associated resource names associated with the resource name are obtained according to characteristic information in the resource name, wherein the associated resources corresponding to the associated resource names and the resource to be matched belong to different communication resource systems, the characteristic information comprises geographic position information, and M is a positive integer; word segmentation processing is performed on the resource name and the M associated resource names to obtain word segmentation word bags corresponding to the resource name and the M associated resource names respectively; the TF-IDF similarity and the Jaccard similarity between the resource name and the M associated resource names are obtained according to the word segmentation word bag corresponding to the resource name and the word segmentation word bags corresponding to the M associated resource names; and N target resource names matched with the resource name are determined from the M associated resource names based on the TF-IDF similarity and the Jaccard similarity, wherein the target resources corresponding to the N target resource names are matched with the resource to be matched, N is a positive integer, and M is greater than or equal to N.

In this method, when target resource names matching the resource to be matched need to be found in other communication resource systems, that is, systems other than the one in which the resource to be matched is located, a plurality of associated resource names associated with the resource name are first preliminarily determined based on the characteristic information in the resource name to be matched. Word segmentation processing is then performed on the resource name and the associated resource names to obtain the word segmentation word bag corresponding to each name, and the TF-IDF similarity and the Jaccard similarity between the resource name and the associated resource names are calculated. Finally, the two similarities are used as screening conditions to select the target resource names with higher similarity from the associated resource names; the target resources corresponding to the target resource names are the resources in the other communication resource systems that match the resource to be matched. In this way, intelligent matching of resource names and resources among different communication resource systems can be realized.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

Fig. 1 is a schematic diagram of an application flow provided in the present application;

Fig. 2 is a first flowchart of a communication resource name matching method provided in the present application;

Fig. 3 is a second flowchart of a communication resource name matching method provided in the present application;

Fig. 4 is a third flowchart of a communication resource name matching method provided in the present application;

Fig. 5 is a fourth flowchart of a communication resource name matching method provided in the present application;

Fig. 6 is a flowchart illustrating the steps for calculating the TF-IDF similarity in Fig. 5;

Fig. 7 is a fifth flowchart of a communication resource name matching method provided in the present application;

Fig. 8 is a schematic structural diagram of a communication resource name matching device provided in the present application;

Fig. 9 is a schematic hardware diagram of a communication resource name matching device provided in the present application.

Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is apparent that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

The terms "preset," "first," "second," "third," and the like in the description and in the claims and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein.

In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

With the progress of urbanization and the rapid development of communication technologies, more and more residents rely on internet communication. In the field of communication service, resources can be divided into equipment resources and line resources according to types, and are divided into resources such as a core network, a transmission network, an access network, a mobile network and the like according to a topological structure. These resources are numerous and complex in number and variable in nature and often managed by different independent systems.

Due to the characteristics of the whole communication network, the resources are mutually connected, fused and interweaved. One resource often appears in a different resource system, but the naming rules are not consistent, and secondly, two types of resources that are coupled to each other in the same geographic location may belong to different systems. The realization of the association of these resources is an essential link for communication operation.

In the prior art, a character string processing method such as keyword matching, fuzzy query and the like is mainly adopted to correlate a certain resource in one resource system with another resource in another resource system, so that intelligent matching of resource names among different resource systems is realized.

However, resources of different systems can be associated only by names, and the association accuracy is not high due to non-uniformity of naming rules.

In view of the above problems, the application provides a method for matching communication resource names among different communication resource systems. The method takes characteristic information, including geographic position information, in the resource name of the resource to be matched as a preliminary screening condition and screens out M associated resource names from other communication resource systems, that is, systems other than the one in which the resource to be matched is located. It then performs word segmentation processing on the resource name and the M associated resource names, calculates the TF-IDF similarity and the Jaccard similarity, and selects N target resource names with higher similarity from the M associated resource names, wherein the resources corresponding to the N target resource names are matched with the resource to be matched.

TF-IDF (term frequency-inverse document frequency) is a common weighting technique for information retrieval and data mining. TF is the term frequency, which refers to the frequency with which a word occurs in a resource name; IDF is the inverse document frequency, a measure of the general importance of a word.

The Jaccard similarity coefficient is used to compare the similarity and diversity of finite sample sets. The larger the Jaccard coefficient value, the higher the sample similarity.
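For reference, these quantities are commonly written as below. These are the textbook forms and are given here as an assumption, since the application does not fix a particular variant. Here f_{t,d} denotes the number of occurrences of word t in the word bag of resource name d, D denotes the total number of resource names being compared, and A and B denote the word sets of two resource names.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{IDF}(t) = \log \frac{D}{\lvert \{ d : t \in d \} \rvert}, \qquad
\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)

J(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}
```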

An application scenario of the present application will be described below with reference to fig. 1.

Fig. 1 is a schematic application flow chart provided in an embodiment of the present application. As shown in fig. 1, M associated resource names are obtained according to feature information including geographical location information in resource names of the resources to be matched, resources corresponding to the M associated resource names and the resources to be matched belong to different communication resource systems, word segmentation processing is performed on the resource names, and word segmentation word bags are built for each resource name, including word segmentation word bags corresponding to the resource names and word segmentation word bags corresponding to the M associated resource names.

Then, based on the word segmentation word bags corresponding to the resource names and the M associated resource names, creating word segmentation vectors for each resource name, namely taking the resource names as rows of a matrix, taking the words in the word segmentation word bags as columns of the matrix, taking the TF-IDF value of each word segmentation as the value of the matrix, and calculating the cosine similarity, namely the TF-IDF similarity, of the matrix corresponding to the resource names and the matrix corresponding to each associated resource name.

And then, calculating the Jaccard similarity of the resource name to be matched and each associated resource name based on the word segmentation word bags corresponding to the resource names and the M associated resource names.

Finally, a comprehensive similarity is determined based on the TF-IDF similarity and the Jaccard similarity, and, according to a Top-N rule, the N associated resource names whose comprehensive similarity ranks highest and is higher than the preset similarity are taken as the target resource names; the resources corresponding to the N target resource names are the resources matched with the resource to be matched.

Using both the TF-IDF similarity and the Jaccard similarity not only makes it convenient to determine the resources matched with the resource to be matched, but also allows the TF-IDF similarity to be corrected by the Jaccard similarity, which improves the accuracy of resource matching.

The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be implemented independently or combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.

Fig. 2 is a flowchart of a communication resource name matching method according to an embodiment of the present application. As shown in fig. 2, the method includes:

S201, obtaining the resource name of the resource to be matched, wherein the resource to be matched is the communication resource in the communication resource system.

In the step, the resource to be matched belongs to a communication resource system, and the resource name of the resource to be matched is used as key information for searching the resource.

S202, obtaining M associated resource names associated with the resource names according to characteristic information in the resource names, wherein associated resources corresponding to the associated resource names and resources to be matched belong to different communication resource systems, the characteristic information comprises geographic position information, and M is a positive integer.

In the above scheme, the characteristic information is used as the association condition to screen out, from the other communication resource systems (systems other than the one in which the resource to be matched is located), the associated resource names corresponding to the associated resources.
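As a rough illustration of this preliminary screening, the sketch below keeps only those names from another system that share a geographic term with the resource name to be matched. The application does not say how geographic position information is extracted or compared; the list of known place names, the substring test and both function names are assumptions made purely for illustration.

```python
def extract_locations(name, known_locations):
    """Return the known place names that occur in a resource name (hypothetical rule)."""
    return {loc for loc in known_locations if loc in name}


def screen_candidates(name, other_system_names, known_locations):
    """Keep names from another system that share at least one place name with `name`."""
    target = extract_locations(name, known_locations)
    if not target:
        return []  # no geographic feature found, so nothing can be associated
    return [
        candidate
        for candidate in other_system_names
        if target & extract_locations(candidate, known_locations)
    ]
```

The names returned by `screen_candidates` play the role of the M associated resource names that are passed on to word segmentation.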

S203, word segmentation processing is carried out on the resource names and the M associated resource names, and word segmentation word bags corresponding to the resource names and the M associated resource names are obtained.

In this step, each resource name has more than one word, which may include a plurality of words such as administrative domain name, communication proper noun, letter and number, so that the word segmentation process needs to be performed on the resource name and the associated resource name to obtain a corresponding word segmentation bag, and the word segmentation bag contains one or more words so as to perform more refined analysis processing.

S204, acquiring TF-IDF similarity and Jaccard similarity of the resource names and M associated resource names according to the word segmentation bag corresponding to the resource names and the word segmentation bags corresponding to the M associated resource names.

In the above scheme, each word segmentation word bag contains the split word corresponding to each resource name, and based on the split words, the TF-IDF value of each resource name can be calculated, so that the TF-IDF similarity of the resource name and each associated resource name is obtained, and meanwhile, the Jaccard similarity of the resource name and each associated resource name can be calculated.

S205, determining N target resource names matched with the resource name from the M associated resource names based on the TF-IDF similarity and the Jaccard similarity, wherein N is a positive integer, and M is greater than or equal to N.

In this step, after the TF-IDF similarity and the Jaccard similarity between the resource name and each associated resource name are calculated, the two similarities may be combined into a final similarity. According to the final similarity corresponding to each of the M associated resource names, the N target resource names whose final similarity ranks highest and satisfies a given magnitude condition are determined, and the target resources corresponding to the N target resource names are the resources matched with the resource to be matched.

According to the communication resource name matching method, when target resource names matching the resource to be matched need to be found in other communication resource systems, that is, systems other than the one in which the resource to be matched is located, M associated resource names associated with the resource name are first preliminarily determined based on the characteristic information in the resource name to be matched. Word segmentation processing is then performed on the resource name and the M associated resource names to obtain the word segmentation word bag corresponding to each name, and the TF-IDF similarity and the Jaccard similarity between the resource name and the M associated resource names are calculated. Finally, the two similarities are used as screening conditions to select N target resource names with higher similarity from the M associated resource names; the target resources corresponding to the N target resource names are the resources in the other communication resource systems that match the resource to be matched.

The following describes, with reference to Fig. 3 and a specific embodiment, the implementation procedure for determining N target resource names matching the resource name among the M associated resource names based on the TF-IDF similarity and the Jaccard similarity in the communication resource name matching method of the present application.

Fig. 3 is a second flowchart of a communication resource name matching method according to an embodiment of the present application. As shown in Fig. 3, the method includes:

S301, calculating the product of the TF-IDF similarity and the first weight to obtain a first sub-similarity.

In this step, the TF-IDF similarity and the Jaccard similarity are combined by weighting, which facilitates subsequent analysis and calculation and allows the Jaccard similarity to correct the TF-IDF similarity, improving the accuracy of resource matching.

S302, calculating the product of the Jaccard similarity and the second weight to obtain a second sub-similarity, wherein the sum of the second weight and the first weight is a preset weight.

In the above scheme, because the resource name vectors are highly sparse, relying on cosine similarity alone may produce misleading results, whereas the Jaccard index is well suited to highly sparse data; this embodiment therefore corrects the TF-IDF similarity with the Jaccard similarity coefficient.

In a specific implementation, for example, a weight of 0.7 may be assigned to the TF-IDF similarity and a weight of 0.3 to the Jaccard similarity, in which case the preset weight is 1.

And S303, summing the first sub-similarity and the second sub-similarity to obtain the comprehensive similarity.

In this step, the first sub-similarity corresponding to the TF-IDF similarity and the second sub-similarity corresponding to the Jaccard similarity are simply added to obtain a single-valued comprehensive similarity, which facilitates subsequent statistical analysis.
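Steps S301 to S303 amount to a weighted sum, which can be sketched as follows. The default weights 0.7 and 0.3 and the preset weight of 1 come from the example in the text; the function name and the explicit check on the weight sum are illustrative assumptions.

```python
import math


def combined_similarity(tfidf_sim, jaccard_sim, w_tfidf=0.7, w_jaccard=0.3, preset_weight=1.0):
    """Weighted sum of the TF-IDF similarity and the Jaccard similarity."""
    if not math.isclose(w_tfidf + w_jaccard, preset_weight):
        raise ValueError("the two weights must sum to the preset weight")
    first_sub = w_tfidf * tfidf_sim        # S301: first sub-similarity
    second_sub = w_jaccard * jaccard_sim   # S302: second sub-similarity
    return first_sub + second_sub          # S303: comprehensive similarity
```

With a TF-IDF similarity of 0.8 and a Jaccard similarity of 0.5, the default weights give 0.7 × 0.8 + 0.3 × 0.5 = 0.71.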

S304, determining N target resource names matched with the resource names in the M associated resource names according to the comprehensive similarity, wherein the comprehensive similarity of the N target resource names and the resource names is greater than or equal to the preset similarity.

In the scheme, after the Jaccard similarity is adopted to correct the TF-IDF similarity, the comprehensive similarity with higher accuracy can be obtained, the comprehensive similarity corresponding to M associated resource names can be ranked in size, and N target resource names with top ranking are output; n associated resource names with the comprehensive similarity larger than or equal to the preset similarity corresponding to the M associated resource names can be used as target resource names; of course, the ranking and the preset threshold value can be combined, and the target resource name can be screened out.
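The third option mentioned above, combining ranking with a preset threshold, can be sketched as below. The function and parameter names are hypothetical; candidates are sorted by comprehensive similarity, those below the preset similarity are dropped, and at most N names are kept.

```python
def select_targets(names, scores, top_n, min_similarity):
    """Rank candidates by score, drop those below the threshold, keep at most top_n."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in ranked if score >= min_similarity][:top_n]
```

Setting `min_similarity` to 0 reduces this to pure Top-N ranking, and setting `top_n` to the number of candidates reduces it to pure threshold filtering.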

In the embodiment of the application, different weights are given to the TF-IDF similarity and the Jaccard similarity, so that the Jaccard similarity corrects the TF-IDF similarity and a comprehensive similarity that is more accurate and convenient for statistical analysis is obtained; the target resource names are then determined according to the magnitude relation between the comprehensive similarities corresponding to the M associated resource names and the preset similarity.

The implementation process of word segmentation processing on the resource name and the M associated resource names to obtain word segmentation word bags corresponding to the resource name and the M associated resource names in the communication resource name matching method of the present application is described below with reference to fig. 4 and a specific embodiment.

Fig. 4 is a flowchart III of a communication resource name matching method according to an embodiment of the present application. As shown in fig. 4, the method includes:

S401, according to a non-segmentable word list, performing word splitting on the resource name and the M associated resource names to obtain the non-segmentable words and segmentable words contained in the resource name and in each of the M associated resource names, wherein the non-segmentable word list comprises at least one of administrative area names, communication proper nouns, communication site names, organization unit names, continuous letters or continuous numbers.

In this step, word segmentation rules are applied to the resource name and the M associated resource names. A word segmentation rule is a rule for dividing a name into a plurality of word blocks. It may be a rule that identifies and splits words according to keywords such as administrative areas, address names and communication proper nouns; it may be a rule for semantic recognition obtained through semantic analysis training based on natural language processing (NLP); or it may be implemented using an existing word segmentation library (e.g., LAC, Jieba, etc.).

In a specific implementation, for example, the word segmentation rules may be formed by importing a user-defined non-segmentable word list into the Jieba word segmentation library, where the non-segmentable word list mainly comprises the following four types of words: a. the administrative area name, street name, community name, village name and residential community name of the resource location; b. communication proper nouns and the station names of the resource location; c. the names of government agencies, enterprises and institutions at the resource location; d. consecutive letters and consecutive numbers in the resource name. These four types of words are formulated according to the characteristics of communication resource names, which helps ensure matching accuracy.

In this embodiment, when word segmentation is performed using the non-segmentable word list, the four types of words above may be incorporated into the list; that is, the non-segmentable word list may include at least one of administrative area names, communication proper nouns, communication site names, organization unit names, continuous letters, or continuous numbers.

S402, obtaining the word segmentation word bag corresponding to the resource name according to the non-segmentable words and segmentable words contained in the resource name.

In this scheme, based on the non-segmentable word list, the resource name of the resource to be matched can be split into non-segmentable words and segmentable words, from which the corresponding word segmentation word bag is obtained.

Illustratively, the resource name is "east-south city-northwest garden house division 10# building east distribution box 001ZDH", and the word segmentation result is: east-south city, northwest garden, house division, 10, # building, east, distribution box, 001, ZDH, where "east-south city" and "northwest garden" are residential community names; "house division", "distribution box" and "ZDH" are communication proper nouns; and "10" and "001" are the building number and the terminal box number, respectively.
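Illustratively, the effect of a non-segmentable word list may be sketched without any third-party library as a greedy longest-match splitter. This is a simplified stand-in written for illustration and is not the Jieba-based implementation mentioned in this embodiment; the function name `segment`, the vocabulary entries and the resource name below are hypothetical.

```python
import re


def segment(name, vocab):
    """Split a name while never breaking words in the non-segmentable list."""
    run = re.compile(r"[A-Za-z]+|[0-9]+")   # consecutive letters or digits
    words_by_len = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(name):
        m = run.match(name, i)
        if m:                                # keep letter/digit runs whole
            tokens.append(m.group())
            i = m.end()
            continue
        # Longest vocabulary word starting at position i, if any.
        hit = next((w for w in words_by_len if name.startswith(w, i)), None)
        if hit:
            tokens.append(hit)
            i += len(hit)
        else:                                # fall back to a single character
            tokens.append(name[i])
            i += 1
    return tokens


# Hypothetical non-segmentable word list and resource name.
vocab = {"幸福小区", "室分", "分纤箱"}
tokens = segment("幸福小区室分3号楼分纤箱002ABC", vocab)
```

In this sketch the community name, the communication proper nouns, the digit runs and the letter run each survive as a single word, and the remaining characters are emitted one by one.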

S403, obtaining the word segmentation word bag corresponding to each of the M associated resource names according to the non-segmentable words and segmentable words contained in the M associated resource names.

In this step, similarly, based on the non-segmentable word list, each of the M associated resource names can be split into non-segmentable words and segmentable words, from which the corresponding word segmentation word bag is obtained.

In the embodiment of the application, the resource name of the resource to be matched and the M associated resource names are each split into words using the non-segmentable word list, so that the word segmentation word bags corresponding to the resource name and to the M associated resource names are obtained.

The implementation process of obtaining the TF-IDF similarity between the resource name and M associated resource names according to the word segmentation bag corresponding to the resource name and the word segmentation bag corresponding to the M associated resource names in the communication resource name matching method of the present application is described below with reference to fig. 5 and a specific embodiment.

Fig. 5 is a flowchart IV of a communication resource name matching method according to an embodiment of the present application, and fig. 6 is a flowchart illustrating detailed steps for calculating TF-IDF similarity. As shown in fig. 5 and 6, the method includes:

S501, removing, according to a stop word list, the stop words in the word segmentation word bag corresponding to the resource name and in the word segmentation word bags corresponding to the M associated resource names, to obtain new word segmentation word bags corresponding to the resource name and to the M associated resource names respectively, wherein the stop word list comprises function words and punctuation marks.

In this step, resource names often contain function words. Compared with other words, function words carry no actual meaning, and because they are so common and purely functional they can hardly express, on their own, any information about the relevance of a document. In the field of information retrieval such words are called stop words, meaning that when they are encountered during processing they are immediately discarded. Filtering out stop words in advance improves the performance of subsequent processing.

In a specific implementation, a commonly used Chinese stop word list may be selected, such as the Harbin Institute of Technology stop word list, the Baidu stop word list or the stop word library of the Machine Intelligence Laboratory of Sichuan University, and meaningless symbols found in communication resource names, such as punctuation marks, may be added to it.

In this embodiment, function words and punctuation marks are included in the stop word list, and the stop words in each word segmentation word bag are removed on the basis of this list, so as to obtain a new word segmentation word bag for the resource name and new word segmentation word bags for the M associated resource names.
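Illustratively, stop word removal may be sketched as a simple filter over a word segmentation word bag. The stop word set below is a small hypothetical sample (two function words and a few punctuation marks), not one of the published stop word lists, and the function name `remove_stop_words` and the sample bag are likewise assumptions.

```python
# Hypothetical stop word list: function words plus punctuation marks.
STOP_WORDS = {"的", "和", "-", "(", ")"}


def remove_stop_words(bag, stop_words=STOP_WORDS):
    """Return a new word bag with every stop word filtered out."""
    return [word for word in bag if word not in stop_words]


# Hypothetical word segmentation word bag before filtering.
bag = ["幸福小区", "的", "室分", "-", "分纤箱", "002"]
new_bag = remove_stop_words(bag)
```

The order of the remaining words is preserved, so the new word bag can be passed directly to the TF-IDF and Jaccard calculations.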

S502, calculating TF values and IDF values of each word in the new word-segmentation word bags corresponding to the resource names and the M associated resource names, and calculating TF-IDF values of each word.

In the above scheme, in order to prevent the situation of biasing towards longer names, in this embodiment, the TF value is normalized, and the calculation formula is as follows:

TF (t) = (number of occurrences of the word t in the resource name)/(total number of words in the resource name)

Illustratively, the resource name is "east-south city-northwest garden house division 10# building east distribution box 001ZDH"; in its word segmentation word bag the word "distribution box" occurs 1 time and the total number of words is 9, so the TF value of the word "distribution box" is 1/9.

In this embodiment, the IDF value is smoothed, and the calculation formula is as follows:

IDF (t) = log (total number of names in the corpus / (number of names in the corpus containing the word t + 1))

It is noted that, owing to the particular nature of communication resource names, when the IDF is calculated the set of related resource names is used as the corpus for each resource name. Illustratively, all resource names related to the words east-south city, northwest garden, house division, 10, # building, east, distribution box, 001 and ZDH form the corpus corresponding to the resource name; assuming that this corpus contains 100 resource names and that 10 of them contain the word "distribution box", the IDF value of the word "distribution box" is log (100/11).

The calculation formula of TF-IDF is:

TF-IDF(t)＝TF(t)×IDF(t)

Illustratively, the TF-IDF value of the word "distribution box" is 1/9×log (100/11).
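Illustratively, the normalized TF, the smoothed IDF and their product may be sketched as follows, reproducing the "distribution box" example. The function names and the construction of the 100-name corpus are assumptions made for illustration; the natural logarithm is also an assumption, since the base of the logarithm is not specified here.

```python
import math


def tf(word, bag):
    """Normalized term frequency: occurrences / total words in the name."""
    return bag.count(word) / len(bag)


def idf(word, corpus):
    """Smoothed IDF: log(total names / (names containing the word + 1))."""
    containing = sum(1 for b in corpus if word in b)
    return math.log(len(corpus) / (containing + 1))  # natural log assumed


def tf_idf(word, bag, corpus):
    return tf(word, bag) * idf(word, corpus)


# Word bag of the example resource name (9 words).
bag = ["east-south city", "northwest garden", "house division", "10",
       "# building", "east", "distribution box", "001", "ZDH"]
# Hypothetical corpus: 100 names, 10 of which contain "distribution box".
corpus = [bag] + [["distribution box"]] * 9 + [["other"]] * 90
value = tf_idf("distribution box", bag, corpus)
```

Under these assumptions `value` equals 1/9 × log(100/11), matching the worked example.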

S503, taking the resource name as a row of a matrix, taking the word in the new word bag corresponding to the resource name as a column of the matrix, and taking the TF-IDF value of each word as a value of the matrix to obtain the matrix corresponding to the resource name and M associated resource names.

In this step, the new word segmentation vector of each resource name is converted into a matrix, where the resource name is used as a row of the matrix, all the words are used as columns, and TF-IDF values of the words are used as values of the matrix, so that a matrix corresponding to each of the resource name and M associated resource names can be obtained.

In a specific implementation, since resource names are mostly short texts, the word segmentation vectors are sparse, that is, the TF-IDF values of many words are 0. Writing the TF-IDF values of all words into a matrix and storing it in full would greatly increase storage pressure. In this case a hash table may be used instead of the matrix: an entry is stored only when its TF-IDF value is not 0 and is omitted when its TF-IDF value is 0.
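Illustratively, the hash table storage may be sketched with nested Python dictionaries, together with a helper that expands the sparse rows back into the name-by-word matrix when a dense form is needed. The function names, the placeholder names `name_a` and `name_b`, and the TF-IDF values are hypothetical.

```python
def to_sparse_rows(tfidf_by_name):
    """Keep only non-zero TF-IDF entries: {name: {word: value}}."""
    return {name: {w: v for w, v in weights.items() if v != 0.0}
            for name, weights in tfidf_by_name.items()}


def to_dense_matrix(sparse_rows):
    """Rebuild the matrix: one row per name, one column per word."""
    columns = sorted({w for row in sparse_rows.values() for w in row})
    matrix = [[row.get(w, 0.0) for w in columns]
              for row in sparse_rows.values()]
    return columns, matrix


# Hypothetical TF-IDF values; zero entries are dropped from storage.
tfidf_by_name = {
    "name_a": {"box": 0.2, "001": 0.0},
    "name_b": {"001": 0.1, "box": 0.0},
}
sparse = to_sparse_rows(tfidf_by_name)
columns, matrix = to_dense_matrix(sparse)
```

Only two of the four entries are stored in `sparse`; the dense matrix is recovered on demand by filling missing entries with 0.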

S504, calculating cosine similarity of each associated resource name in the resource names and M associated resource names, and taking the cosine similarity as TF-IDF similarity.

In the above scheme, the resource name and each of the M associated resource names have been vectorized, so the similarity between resource names can be calculated with the cosine similarity algorithm, which evaluates the similarity of two vectors by measuring the cosine of the angle between them. The calculation formula is as follows:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

wherein θ is the included angle between vector A and vector B, vector A is the vector corresponding to resource name A, vector B is the vector corresponding to resource name B, A_i and B_i are the TF-IDF values of the i-th word in vector A and vector B respectively, and n is the number of words, that is, the dimension of the vectors.
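Illustratively, the cosine similarity of two sparse TF-IDF vectors may be sketched as follows. Each vector is assumed to be stored as a dictionary mapping a word to its TF-IDF value; the function name `cosine_similarity` and the vector contents are hypothetical, and returning 0 for an all-zero vector is a design choice of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors {word: tf-idf}."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical vectors sharing one of their two words.
vec_a = {"box": 1.0, "001": 1.0}
vec_b = {"box": 1.0, "splitter": 1.0}
sim = cosine_similarity(vec_a, vec_b)
```

For these two vectors the dot product is 1 and each norm is √2, so the similarity is 1/2.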

In the embodiment of the application, the TF value and the IDF value are obtained through word segmentation word bags corresponding to the resource names and the M associated resource names respectively, the TF-IDF value of the words segmented in each word segmentation word bag is further obtained, word segmentation vectors in the resource names and the associated resource names are converted into matrixes based on the TF-IDF value, and finally the TF-IDF similarity is obtained based on a cosine similarity algorithm.

The following describes, with reference to fig. 7 and a specific embodiment, the implementation process of obtaining the Jaccard similarity between the resource name and the M associated resource names according to the word segmentation word bag corresponding to the resource name and the word segmentation word bags corresponding to the M associated resource names in the communication resource name matching method of the present application.

Fig. 7 is a flowchart V of a communication resource name matching method provided in an embodiment of the present application. As shown in fig. 7, the method includes:

S701, intersecting the new word segmentation word bag corresponding to the resource name with the new word segmentation word bag corresponding to each of the M associated resource names to obtain m words, wherein m is a positive integer.

In this step, the numerator of the Jaccard similarity coefficient is the size of the intersection of the word sets of two resource names, so the new word segmentation word bag corresponding to the resource name needs to be intersected with the new word segmentation word bag corresponding to each associated resource name. Illustratively, resource name A is "highway bureau division one family quarters 5# building 4 unit 001WLX", and its new word segmentation word bag comprises: highway bureau, division one, family quarters, 5#, building, 4 unit, 001, WLX; resource name B is "highway bureau division one family quarters 4# building 2 unit 001WLX optical splitter", and its new word segmentation word bag comprises: highway bureau, division one, family quarters, 4#, building, 2 unit, 001, WLX, optical splitter. The intersection of the two is: highway bureau, division one, family quarters, building, 001, WLX, so m = 6.

S702, taking the union of the new word segmentation word bag corresponding to the resource name and the new word segmentation word bag corresponding to each of the M associated resource names to obtain n words, wherein n is a positive integer and n is greater than or equal to m.

In the above scheme, the denominator of the Jaccard similarity coefficient is the size of the union of the word sets of two resource names, so the new word segmentation word bag corresponding to the resource name needs to be united with the new word segmentation word bag corresponding to each associated resource name. Illustratively, the union of resource name A and resource name B is: highway bureau, division one, family quarters, building, 001, WLX, 5#, 4#, optical splitter, 4 unit, 2 unit, so n = 11.

S703, calculating the quotient of m and n to obtain the Jaccard similarity between the resource name and each of the M associated resource names.

In this step, the numerator is divided by the denominator to obtain the Jaccard similarity. The Jaccard similarity coefficient of resource name A and resource name B is denoted J(A, B), and J(A, B) = |A ∩ B| / |A ∪ B| = 6/11 ≈ 0.545.
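Illustratively, steps S701 to S703 may be sketched with Python sets, reproducing the 6/11 example. The function name `jaccard_similarity` is an assumption, and the English spellings of the words in the two bags are normalized for illustration; only the counts (6 shared words, 11 distinct words) are taken from the example above.

```python
def jaccard_similarity(bag_a, bag_b):
    """Size of the intersection divided by size of the union of two bags."""
    set_a, set_b = set(bag_a), set(bag_b)
    union = set_a | set_b                    # S702: denominator
    if not union:
        return 0.0                           # assumption: two empty bags
    return len(set_a & set_b) / len(union)   # S701 and S703


# New word segmentation word bags of resource names A and B.
bag_a = ["highway bureau", "division one", "family quarters",
         "5#", "building", "4 unit", "001", "WLX"]
bag_b = ["highway bureau", "division one", "family quarters",
         "4#", "building", "2 unit", "001", "WLX", "optical splitter"]
j = jaccard_similarity(bag_a, bag_b)  # 6 shared words, 11 distinct words
```

Because the bags are converted to sets, repeated words in a name count once, which is consistent with the set-based definition of the coefficient.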

To better compare the results of the TF-IDF similarity and the Jaccard similarity, this embodiment also calculates the TF-IDF similarity of resource name A and resource name B using the TF-IDF similarity formula, obtaining 0.81. The Jaccard similarity coefficient thus shows that the two resource names differ significantly, whereas the calculated TF-IDF similarity of 0.81 is misleadingly high; the Jaccard similarity corrects it.
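Illustratively, the size of the correction can be seen by combining the two values with the example weights of 0.7 and 0.3 given earlier in this description. The symbol S for the comprehensive similarity is introduced here only for illustration, and whether the pair is ultimately kept depends on the preset similarity, for which no value is given.

```latex
S = 0.7 \times 0.81 + 0.3 \times \frac{6}{11}
  \approx 0.567 + 0.164
  = 0.731
```

Under these weights the comprehensive similarity of resource names A and B is about 0.73, lower than the TF-IDF similarity of 0.81 taken alone.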

In the embodiment of the application, the intersection and the union are calculated for the new word segmentation word bag corresponding to the resource name and the new word segmentation word bag corresponding to each of the M associated resource names, so that a Jaccard similarity capable of correcting the TF-IDF similarity is obtained.

In summary, according to the communication resource name matching method provided in the embodiment of the present application, the feature information in the resource name of the resource to be matched may be used as a preliminary screening condition to screen out M associated resource names from communication resource systems other than the one in which the resource to be matched is located; word segmentation processing is performed on the resource name and the M associated resource names, and the TF-IDF similarity and the Jaccard similarity are calculated, so that N target resource names with higher similarity are selected from the M associated resource names, and the resources corresponding to the N target resource names are matched with the resource to be matched.

Fig. 8 is a schematic structural diagram of a communication resource name matching device provided in the present application. As shown in fig. 8, the communication resource name matching device 80 may include various functional modules for implementing the foregoing communication resource name matching method, and any functional module may be implemented by software and/or hardware.

For example, the communication resource name matching means 80 may include: an acquisition module 801, a first processing module 802, a second processing module 803, a third processing module 804, and a fourth processing module 805.

An obtaining module 801, configured to obtain a resource name of a resource to be matched, where the resource to be matched is a communication resource in a communication resource system;

a first processing module 802, configured to obtain M associated resource names associated with the resource names according to feature information in the resource names, where the associated resources and the resources to be matched belong to different communication resource systems, and the feature information includes geographic location information, and M is a positive integer;

the second processing module 803 is configured to perform word segmentation processing on the resource name and M associated resource names, so as to obtain word segmentation word bags corresponding to the resource name and M associated resource names respectively;

a third processing module 804, configured to obtain the TF-IDF similarity and the Jaccard similarity between the resource name and the M associated resource names according to the word segmentation word bag corresponding to the resource name and the word segmentation word bags corresponding to the M associated resource names;

a fourth processing module 805, configured to determine, based on the TF-IDF similarity and the Jaccard similarity, N target resource names matched with the resource name from the M associated resource names, where N is a positive integer, and M is greater than or equal to N.

Optionally, the second processing module 803 may be further specifically configured to: perform word splitting on the resource name and the M associated resource names according to a non-segmentable word list to obtain the non-segmentable words and segmentable words contained in the resource name and in each of the M associated resource names, wherein the non-segmentable word list comprises at least one of administrative area names, communication proper nouns, communication site names, organization unit names, continuous letters or continuous numbers; obtain the word segmentation word bag corresponding to the resource name according to the non-segmentable words and segmentable words contained in the resource name; and obtain the word segmentation word bag corresponding to each of the M associated resource names according to the non-segmentable words and segmentable words contained in the M associated resource names.

Optionally, the third processing module 804 may be further configured to: remove, according to a stop word list, the stop words in the word segmentation word bag corresponding to the resource name and in the word segmentation word bags corresponding to the M associated resource names, to obtain new word segmentation word bags corresponding to the resource name and to the M associated resource names respectively, where the stop word list includes function words and punctuation marks; calculate the TF value and the IDF value of each word in the new word segmentation word bags corresponding to the resource name and to the M associated resource names, and calculate the TF-IDF value of each word; and, based on the TF-IDF value of each word in the new word segmentation word bags corresponding to the resource name and to the M associated resource names, calculate the cosine similarity between the resource name and each of the M associated resource names, and take the cosine similarity as the TF-IDF similarity.

Optionally, the third processing module 804 may be further specifically configured to: vectorize the resource name and the M associated resource names to obtain matrices corresponding to the resource name and to the M associated resource names respectively; and calculate the cosine similarity between the matrix corresponding to the resource name and the matrix corresponding to each of the M associated resource names. Vectorizing the resource name and the M associated resource names to obtain the corresponding matrices comprises: taking the resource name as a row of a matrix, taking the words in the new word segmentation word bag corresponding to the resource name as columns of the matrix, and taking the TF-IDF value of each word as a value of the matrix.

Optionally, the third processing module 804 may be further configured to: intersect the new word segmentation word bag corresponding to the resource name with the new word segmentation word bag corresponding to each of the M associated resource names to obtain m words, where m is a positive integer; take the union of the new word segmentation word bag corresponding to the resource name and the new word segmentation word bag corresponding to each of the M associated resource names to obtain n words, where n is a positive integer and n is greater than or equal to m; and calculate the quotient of m and n to obtain the Jaccard similarity between the resource name and each of the M associated resource names.

Optionally, the fourth processing module 805 may be further specifically configured to: determine a comprehensive similarity based on the TF-IDF similarity and the Jaccard similarity; and determine, according to the comprehensive similarity, N target resource names matched with the resource name from the M associated resource names, wherein the comprehensive similarity between each of the N target resource names and the resource name is greater than or equal to the preset similarity.

Optionally, the fourth processing module 805 may be further configured to: calculate the product of the TF-IDF similarity and the first weight to obtain a first sub-similarity; calculate the product of the Jaccard similarity and the second weight to obtain a second sub-similarity, wherein the sum of the second weight and the first weight is a preset weight; and sum the first sub-similarity and the second sub-similarity to obtain the comprehensive similarity.

The communication resource name matching device is used for executing the technical scheme provided by the embodiment of the communication resource name matching method, and the implementation principle and the technical effect are similar to those of the embodiment of the method, and are not repeated here.

The application also provides a communication resource name matching device, which comprises: at least one processor and memory;

the memory stores computer-executable instructions;

at least one processor executes the computer-executable instructions stored in the memory to cause the at least one processor to perform a communication resource name matching method.

Fig. 9 is a schematic hardware diagram of a communication resource name matching device according to an embodiment of the present invention. As shown in fig. 9, the present embodiment provides a communication resource name matching apparatus 90 including: at least one processor 901 and a memory 902.

A memory 902 for storing computer-executable instructions;

a processor 901 for executing computer-executable instructions stored in the memory 902 to implement the steps executed by the communication resource name matching method in the above-described embodiments. Reference may be made in particular to the description of the foregoing embodiments of the method for matching communication resource names.

Alternatively, the memory 902 may be separate or integrated with the processor 901.

When the memory 902 is provided separately, the communication resource name matching device 90 further comprises a bus 903 for connecting the memory 902 and the processor 901.

The present application also provides a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements the steps of the communication resource name matching method as described above.

Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. 
Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

Claims

1. A method for matching communication resource names, comprising:

2. The method of claim 1, wherein the determining N target resource names that match the resource name among the M associated resource names based on the TF-IDF similarity and the jaccard similarity comprises:

3. The method of claim 2, wherein said determining a composite similarity based on said TF-IDF similarity and said jaccard similarity comprises:

4. The method of claim 1, wherein the performing word segmentation on the resource name and the M associated resource names to obtain word segmentation bags corresponding to the resource name and the M associated resource names respectively includes:

5. The method of claim 1, wherein the obtaining the TF-IDF similarity between the resource name and the M associated resource names according to the word segmentation bag corresponding to the resource name and the word segmentation bags corresponding to the M associated resource names comprises:

6. The method of claim 5, wherein calculating the cosine similarity of the resource name and each of the M associated resource names based on TF-IDF values of each of the new word-segmentation bags corresponding to the resource name and each of the M associated resource names, comprises:

7. The method of claim 6, wherein the obtaining the jaccard similarity between the resource name and the M associated resource names according to the word segmentation bag corresponding to the resource name and the word segmentation bags corresponding to the M associated resource names comprises:

8. A communication resource name matching apparatus, comprising:

9. A communication resource name matching device, comprising:

a memory;

at least one processor;

wherein the memory stores computer-executable instructions;

the at least one processor executing computer-executable instructions stored in the memory to implement the communication resource name matching method of any of claims 1 to 7.

10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the communication resource name matching method as claimed in any of claims 1 to 7.