CN112613310A - Name matching method and device, electronic equipment and storage medium

Info

Publication number: CN112613310A
Application number: CN202110003686.XA
Authority: CN
Inventors: 黄建颖
Original assignee: Chengdu Yanchuang Qixin Information Technology Co ltd
Current assignee: Chengdu Knownsec Information Technology Co ltd
Priority date: 2021-01-04
Filing date: 2021-01-04
Publication date: 2021-04-06

Abstract

The application provides a name matching method, a name matching device, electronic equipment and a storage medium, and relates to the technical field of name matching. First, word segmentation and word frequency analysis are performed on an article to be detected to obtain a target person name in the article and the target keywords corresponding to that name. The target person name and the target keywords are then input into a vector space model to obtain a target feature vector corresponding to the target person name. Next, the similarity between the target feature vector and a pre-stored feature vector is determined. Finally, when the similarity is greater than a threshold value, the target feature vector is determined to match the pre-stored feature vector. The name matching method, device, electronic equipment and storage medium reduce errors in name matching.

Description

Name matching method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of name matching technologies, and in particular, to a name matching method and apparatus, an electronic device, and a storage medium.

Background

At present, to find articles that mention a particular person, articles generally need to be screened by person name.

However, in the prior art, name screening can only identify the person names that appear in an article; it cannot determine whether a person named in the article is the same person as the one being screened for.

In summary, name screening in the prior art suffers from large matching errors.

Disclosure of Invention

The application aims to provide a name matching method, a name matching device, electronic equipment and a storage medium, so as to solve the problem that in the prior art, when name screening is carried out, matching errors are large.

In order to achieve the above purpose, the embodiments of the present application employ the following technical solutions:

In a first aspect, the present application provides a name matching method, including:

performing word segmentation and word frequency analysis on an article to be detected to obtain a target name of the article to be detected and a target keyword corresponding to the target name;

inputting the target person name and the target keyword into a vector space model to obtain a target feature vector corresponding to the target person name;

determining the similarity between the target characteristic vector and a pre-stored characteristic vector;

when the similarity is larger than a threshold value, determining that the target feature vector is matched with the pre-stored feature vector.

In a second aspect, the present application further provides a name matching apparatus, the apparatus including:

the information acquisition unit is used for performing word segmentation and word frequency analysis on the article to be detected so as to acquire a target name of the article to be detected and a target keyword corresponding to the target name;

the feature vector acquisition unit is used for inputting the target person name and the target keyword into a vector space model so as to acquire a target feature vector corresponding to the target person name;

a similarity determining unit, configured to determine a similarity between the target feature vector and a pre-stored feature vector;

a matching determination unit, configured to determine that the target feature vector matches the pre-stored feature vector when the similarity is greater than a threshold.

In a third aspect, the present application provides an electronic device, comprising: a memory for storing one or more programs; a processor; the one or more programs, when executed by the processor, implement the person name matching method described above.

In a fourth aspect, the present application also provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor implements the name matching method described above.

Compared with the prior art, the method has the following beneficial effects:

The application provides a person name matching method, a person name matching device, electronic equipment and a storage medium. When person names are matched, the matching is based on both the names and their keywords, and the keywords are determined from the whole article to be detected, so the person names matched in the article to be detected have smaller errors.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and it will be apparent to those skilled in the art that other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a schematic block diagram of an electronic device according to an embodiment of the present disclosure.

Fig. 2 is an exemplary flowchart of a name matching method provided in an embodiment of the present application.

Fig. 3 is another exemplary flowchart of a name matching method provided in an embodiment of the present application.

Fig. 4 is a schematic block diagram of a name matching apparatus according to an embodiment of the present application.

Reference numerals: 100 - electronic device; 101 - processor; 102 - memory; 103 - communication interface; 200 - name matching apparatus; 210 - information acquisition unit; 220 - feature vector acquisition unit; 230 - similarity determination unit; 240 - matching determination unit.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

As described in the background, the names of people with social influence, such as film stars or singers, often need special attention in content scheduling for related news on mainstream content distribution platforms, major news websites, and government media platforms.

Therefore, person name matching is required when searching for related articles. However, different people may share the same name. Existing person name recognition algorithms only locate person names in the text, so the error is likely to be large. For example, when articles about the singer "Zhang San" need to be screened, the prior art may also match the actor "Zhang San", the painter "Zhang San", the teacher "Zhang San", and so on.

In view of this, the present application provides a person name matching method that determines a target feature vector from both a person name and its keywords, so that the finally matched person name is more accurate.

It should be noted that the name matching method provided in the present application can be applied to an electronic device 100, and fig. 1 illustrates a schematic structural block diagram of the electronic device 100 provided in the embodiment of the present application, where the electronic device 100 includes a memory 102, a processor 101, and a communication interface 103, and the memory 102, the processor 101, and the communication interface 103 are electrically connected to each other directly or indirectly to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.

The memory 102 may be used to store software programs and modules, such as program instructions or modules corresponding to the name matching apparatus provided in the embodiment of the present application, and the processor 101 executes the software programs and modules stored in the memory 102 to execute various functional applications and data processing, thereby executing the steps of the name matching method provided in the embodiment of the present application. The communication interface 103 may be used for communicating signaling or data with other node devices.

The Memory 102 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Read-Only Memory (EPROM), an electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The processor 101 may be an integrated circuit chip having signal processing capabilities. The Processor 101 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

It will be appreciated that the configuration shown in FIG. 1 is merely illustrative and that electronic device 100 may include more or fewer components than shown in FIG. 1 or have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.

The following describes an exemplary name matching method provided in the embodiment of the present application, with the electronic device 100 as a schematic execution subject.

As an implementation manner, referring to fig. 2, the name matching method includes:

S102, performing word segmentation and word frequency analysis on the article to be detected to obtain a target person name of the article to be detected and a target keyword corresponding to the target person name.

S104, inputting the target person name and the target keyword into a vector space model to obtain a target feature vector corresponding to the target person name.

S106, determining the similarity between the target feature vector and a pre-stored feature vector.

S108, when the similarity is greater than the threshold value, determining that the target feature vector matches the pre-stored feature vector.

As an implementation manner, when name matching is required, matching may be performed on each article in sequence in a queue manner. For example, when the names of all articles published by a certain website are matched, the names of all articles published by the website are sequentially matched in a queue manner.
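The queue-based processing described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: `process_queue` and the `match_article` callback are hypothetical names, and the lambda used here is only a placeholder for the real matching steps.

```python
from collections import deque


def process_queue(articles, match_article):
    """Match articles one at a time, in first-in, first-out order."""
    queue = deque(articles)
    results = []
    while queue:
        article = queue.popleft()  # take the oldest pending article
        results.append((article, match_article(article)))
    return results


# Placeholder matcher (assumption): flags an article that mentions the name.
results = process_queue(
    ["Zhang San sings on stage", "weather report"],
    lambda text: "Zhang San" in text,
)
```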

When person names in an article are matched, word segmentation and word frequency analysis must first be performed on the article. Word segmentation splits a whole sentence into individual words, and word frequency analysis measures how often a specific word occurs. The present application does not limit the word segmentation tool or the word frequency analysis tool; for example, any suitable word segmentation tool may be used for segmentation, and TF-IDF may be used for word frequency analysis.
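As a rough illustration of word segmentation followed by TF-IDF weighting, the sketch below uses only the Python standard library. It makes several assumptions not stated in the source: the text is English, so a regular expression stands in for a real word segmentation tool (Chinese text would need a dedicated segmenter), and the TF-IDF variant is the plain `tf * log(N / df)` form. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def segment(text):
    """Split text into lowercase word tokens (stand-in for a real segmenter)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [segment(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the singer sang on the stage",
    "the painter sold the landscape",
]
weights = tf_idf(docs)
```

Terms that occur in every document, such as "the" here, receive a weight of zero, while terms specific to one document receive a positive weight.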

After word segmentation, the person names mentioned in the article can be obtained, and the keywords corresponding to each person name can be determined by word frequency analysis.

As an implementation manner, when the keywords corresponding to a person name are determined with the word frequency analysis tool, candidate keywords may be selected by context; for example, the target keywords are determined from the paragraph in which the name appears together with the preceding and following paragraphs. It should be noted that the same person may appear many times in an article to be detected. Determining the target keywords by context therefore draws on multiple places in the article, so that the keywords are effectively determined over the full text and are more accurate.
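The context-based selection can be sketched as below: for every paragraph that mentions the name, the paragraph itself and its immediate neighbours are collected, and the most frequent words in that window are returned. The function name, the whitespace tokenization, the raw-count ranking and the stop-word set are all assumptions made for illustration.

```python
from collections import Counter


def context_keywords(paragraphs, name, top_n=3, stop_words=frozenset()):
    """Return the most frequent words near paragraphs that mention `name`."""
    selected = set()
    for i, paragraph in enumerate(paragraphs):
        if name in paragraph:
            # Current paragraph plus the previous and next ones, if present.
            for j in (i - 1, i, i + 1):
                if 0 <= j < len(paragraphs):
                    selected.add(j)
    counts = Counter()
    for j in sorted(selected):
        text = paragraphs[j].replace(name, " ")  # the name is not a keyword
        for raw in text.lower().split():
            word = raw.strip(".,;:!?\"'")
            if word and word not in stop_words:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]


paragraphs = [
    "The concert opened in Sichuan.",
    "Zhang San took the stage, singing for two hours.",
    "The stage lights dimmed after the singing ended.",
    "Unrelated weather report follows.",
    "Rain is expected tomorrow.",
]
keywords = context_keywords(paragraphs, "Zhang San", top_n=2, stop_words={"the"})
```

Paragraphs outside the window (the weather paragraphs here) contribute nothing, which is the point of restricting keyword extraction to the context of the name.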

It can also be understood that, with this implementation, the more information an article gives about a person, the more keywords can be extracted and the more accurate the keywords are.

For example, for the singer "Zhang San", all keywords related to him can be extracted from the article and used as weighted keywords, such as "man, Sichuan, stage, singing".

It should be noted that several person names may appear in the same article; for example, an article may report on Zhang San, Li Si and Wang Wu at the same time. As one implementation, all the person names in the article can be determined at the same time, and word frequency analysis is used to determine the keywords corresponding to each name. For example, the keywords corresponding to Zhang San are "man, Sichuan, stage, singing …", the keywords corresponding to Li Si are "calligraphy, auction, landscape, ink …", and the keywords corresponding to Wang Wu are "student, university, physics …".

After all the person names and their corresponding keywords in the article to be detected are determined, they can be input into a vector space model to obtain the target feature vector corresponding to each name. The vector space model, also called the term vector model, is an algebraic model used in information filtering, information retrieval, indexing and relevance ranking.
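One simple way to picture the vector space model is to give every distinct keyword its own dimension and place each person's keyword weights on those dimensions. The sketch below does this under stated assumptions: the function names and the weights are hypothetical, and the person name is used only as a lookup key rather than as a dimension of the vector.

```python
def build_vocabulary(keyword_sets):
    """Assign a fixed dimension index to every distinct keyword."""
    vocabulary = sorted({kw for keywords in keyword_sets for kw in keywords})
    return {kw: index for index, kw in enumerate(vocabulary)}


def to_vector(keyword_weights, vocabulary):
    """Map a {keyword: weight} dict onto the vocabulary's dimensions."""
    vector = [0.0] * len(vocabulary)
    for keyword, weight in keyword_weights.items():
        if keyword in vocabulary:
            vector[vocabulary[keyword]] = weight
    return vector


# Hypothetical keyword weights for two people named in one article.
profiles = {
    "Zhang San": {"man": 1.0, "Sichuan": 1.0, "stage": 2.0, "singing": 3.0},
    "Li Si": {"calligraphy": 2.0, "auction": 1.0, "landscape": 1.0},
}
vocabulary = build_vocabulary(profiles.values())
vectors = {name: to_vector(weights, vocabulary) for name, weights in profiles.items()}
```

Each name in the article ends up with one vector of the same length, so any two vectors can be compared dimension by dimension.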

It is understood that when multiple names are present in an article, multiple target feature vectors can be determined from the article, for example, when 3 names are present in the article, 3 target feature vectors can be determined from the article.

After the target feature vectors of all the person names are determined, whether a name appearing in the article refers to the person being searched for is decided by similarity. Optionally, the similarity between the target feature vector and a pre-stored feature vector is determined. When the similarity is greater than the threshold value, the target feature vector is determined to match the pre-stored feature vector, and the person name appearing in the article is the person name being searched for; when the similarity is smaller than the threshold value, the target feature vector is determined not to match the pre-stored feature vector, and the person name appearing in the article is not the one being searched for.

As an implementation manner, when several person names appear in an article, the pre-stored feature vector can be compared in turn with each target feature vector from the article to determine whether any name in the article is the person name being searched for.

In a possible implementation manner, the present application may determine the similarity between the target feature vector and the pre-stored feature vector by using cosine similarity or a K-nearest neighbor classification algorithm. Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. The vectors are placed in a vector space, such as the familiar two-dimensional space, according to their coordinate values. In the K-nearest neighbor algorithm, given a training data set, the K training instances nearest to a new input instance are found, and the input instance is assigned to the class to which most of those K instances belong.
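The K-nearest neighbor idea can be sketched in a few lines. This is a generic illustration, assuming Euclidean distance and a simple majority vote; the source does not specify the distance measure, the value of K, or the training data, and the labels below are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    # Sort labelled vectors by Euclidean distance to the query, keep k.
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors with known labels.
training = [
    ([1.0, 0.0], "singer"),
    ([0.9, 0.1], "singer"),
    ([0.8, 0.2], "singer"),
    ([0.0, 1.0], "painter"),
    ([0.1, 0.9], "painter"),
]
label = knn_classify([0.95, 0.05], training, k=3)
```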

Taking cosine similarity as an example, when cosine similarity is used to determine the similarity between the two vectors, it satisfies the formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

where A_i represents the components of the target feature vector and B_i represents the components of the pre-stored feature vector.

In other words, after the target feature vector corresponding to the name in the article to be detected is determined, the target feature vector and the pre-stored feature vector may be substituted into the above formula, so as to determine the similarity between the two vectors.
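Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch follows; returning 0.0 when either vector is all zeros is an assumption made here to avoid division by zero, not something the source specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


similarity = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Vectors that point in the same direction score 1 regardless of their length, and vectors with no shared non-zero dimension score 0.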

It should be noted that the cosine similarity result indicates the degree of similarity between the word frequency feature vectors. Because word frequency weights are non-negative, it is a value between 0 and 1, and the closer it is to 1, the higher the similarity and the more likely the name appearing in the content to be detected is the one being searched for. As an implementation manner, when the cosine similarity is greater than or equal to 0.5 and less than or equal to 1, the name in the article is considered to match the name being searched for; when the cosine similarity is less than 0.5, the name in the article is considered not to match. In this way, it can be determined whether the article to be detected is a required article.
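The threshold decision can be sketched as a loop over every name found in an article. The example below is an illustration under assumptions: vectors are stored as sparse `{keyword: weight}` dicts, the function names and weights are hypothetical, and the comparison uses the inclusive 0.5 bound given in this paragraph (the method steps elsewhere say "greater than the threshold").

```python
import math

THRESHOLD = 0.5  # example value from the description


def sparse_cosine(a, b):
    """Cosine similarity of two {keyword: weight} dicts."""
    dot = sum(weight * b.get(keyword, 0.0) for keyword, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def matching_names(article_vectors, stored_vector, threshold=THRESHOLD):
    """Return the names whose vectors are similar enough to the stored one."""
    return [
        name
        for name, vector in article_vectors.items()
        if sparse_cosine(vector, stored_vector) >= threshold
    ]


stored = {"man": 1.0, "sichuan": 1.0, "stage": 1.0, "singing": 1.0}
article_vectors = {
    "Zhang San": {"man": 1.0, "sichuan": 1.0, "singing": 1.0},
    "Li Si": {"calligraphy": 1.0, "auction": 1.0},
}
matches = matching_names(article_vectors, stored)
```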

In addition, as an alternative implementation, referring to fig. 3, before S102, the method further includes:

S101-1, obtaining preset name introduction information.

S101-2, performing word segmentation and word frequency analysis on the name introduction information to obtain names in the name introduction information and keywords corresponding to the names.

S101-3, inputting the name and the keyword into a vector space model to obtain a pre-stored characteristic vector.

That is, in the present application, a pre-stored feature vector must first be determined. As an implementation, the feature vector may be determined from material such as a biographical introduction of the person. For example, the information about a person in Baidu Baike is used as the preset name introduction information. The server performs word segmentation and word frequency analysis on the name introduction information to obtain the person name in it and the keywords corresponding to the name, and then inputs the name and the keywords into the vector space model to obtain the feature vector corresponding to the name.

For example, suppose the introduction information of Zhang San is: Zhang San, male, from Sichuan, a well-known film star, representative work "xxx". The keywords obtained by word frequency analysis after word segmentation are then "man, Sichuan, movie star" and "xxx". Inputting these keywords and the name Zhang San into the vector space model determines the feature vector corresponding to Zhang San.
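Building the pre-stored profile from introduction text might look like the sketch below. It is only an illustration: the function name, the English example sentence, the regular-expression tokenizer, the raw term counts and the stop-word set are all assumptions, and a real system would use a proper word segmentation tool and weighting scheme.

```python
import re
from collections import Counter


def build_reference_profile(name, introduction, stop_words=frozenset()):
    """Count the keywords in a person's introduction, excluding the name."""
    text = introduction.replace(name, " ")
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Profiles are computed once and stored, keyed by the person's name.
reference_store = {}
reference_store["Zhang San"] = build_reference_profile(
    "Zhang San",
    "Zhang San, male, from Sichuan, is a well-known film star.",
    stop_words={"from", "is", "a", "well", "known"},
)
```

Computing the reference profile ahead of time means each incoming article only needs its own vectors built before comparison.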

When the article to be detected is analyzed, suppose the person named in an article is also Zhang San, but the corresponding keywords are "woman, northeast, painter". Although the two names are the same, the corresponding keywords are completely different, so the cosine similarity between the two feature vectors is 0, and it can be determined that the person named in the article is not the person being searched for.

Alternatively, suppose the person named in an article is Zhang San and the corresponding keywords are "man, Sichuan, xx professor". The names are the same and some of the keywords are the same, but the cosine similarity between the two feature vectors may be only 0.2, which is less than 0.5, so the two are determined not to match.
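The two cases can be checked with a small worked calculation. The weights are hypothetical, chosen only for illustration, since the source does not give the actual vectors. Take the dimensions (man, Sichuan, movie star, xxx, professor), let the pre-stored vector be A = (1, 1, 2, 2, 0), and let the article vector for the professor be B = (1, 1, 0, 0, 3). For the painter, whose keywords share no dimension with A, every product A_i B_i is zero, so the similarity is 0. For the professor:

```latex
\cos\theta
= \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}
= \frac{1\cdot 1 + 1\cdot 1 + 2\cdot 0 + 2\cdot 0 + 0\cdot 3}
       {\sqrt{1+1+4+4+0}\,\sqrt{1+1+0+0+9}}
= \frac{2}{\sqrt{10}\,\sqrt{11}}
\approx 0.19 < 0.5
```

With these assumed weights the shared keywords "man" and "Sichuan" carry little weight compared with the distinguishing ones, so the similarity stays well below the 0.5 threshold.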

It should be noted that the more keywords there are, the higher the matching accuracy. Therefore, the introduction information provided for a person name should be as detailed as possible, so that the subsequent cosine similarity determination is more accurate.

Through the implementation mode, the matched name of the person can be more accurate.

Based on the foregoing implementation, please refer to fig. 4, the present application further provides a name matching apparatus 200, where the name matching apparatus 200 includes:

the information obtaining unit 210 is configured to perform word segmentation and word frequency analysis on the article to be detected to obtain a target name of the article to be detected and a target keyword corresponding to the target name.

It is understood that S102 may be performed by the information obtaining unit 210.

The feature vector obtaining unit 220 is configured to input the target person name and the target keyword into the vector space model to obtain a target feature vector corresponding to the target person name.

It is understood that S104 may be performed by the feature vector acquisition unit 220.

A similarity determining unit 230, configured to determine a similarity between the target feature vector and a pre-stored feature vector.

It is to be understood that S106 may be performed by the similarity determination unit 230.

A matching determination unit 240, configured to determine that the target feature vector matches a pre-stored feature vector when the similarity is greater than the threshold.

It is understood that S108 may be performed by the matching determining unit 240.

Further, the information obtaining unit 210 is also configured to obtain preset name introduction information.

It is understood that S101-1 may be performed by the information acquisition unit 210.

The information obtaining unit 210 is further configured to perform word segmentation and word frequency analysis on the name introduction information to obtain names and keywords corresponding to the names in the name introduction information.

It is understood that S101-2 may be performed by the information acquisition unit 210.

The feature vector obtaining unit 220 is further configured to input the person name and the keyword into the vector space model to obtain a pre-stored feature vector.

It is understood that the feature vector acquisition unit 220 may perform S101-3.

Optionally, the similarity determining unit 230 is configured to determine the similarity between the target feature vector and a pre-stored feature vector by using a cosine similarity or a K-nearest neighbor classification algorithm.

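When cosine similarity is adopted, the cosine similarity satisfies the formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

where A_i represents the components of the target feature vector and B_i represents the components of the pre-stored feature vector.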

In summary, the present application provides a person name matching method, an apparatus, an electronic device, and a storage medium, where a to-be-detected article is first subjected to word segmentation and word frequency analysis to obtain a target person name of the to-be-detected article and a target keyword corresponding to the target person name, then the target person name and the target keyword are input into a vector space model to obtain a target feature vector corresponding to the target person name, then a similarity between the target feature vector and a pre-stored feature vector is determined, and finally, when the similarity is greater than a threshold value, it is determined that the target feature vector matches the pre-stored feature vector. According to the method and the device, when the names are matched, the matching can be carried out based on the names and the keywords, and the keywords are confirmed based on the whole article to be detected, so that the matched names in the article to be detected have smaller errors.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).

It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

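1. A person name matching method, the method comprising:

performing word segmentation and word frequency analysis on an article to be detected to obtain a target person name of the article to be detected and a target keyword corresponding to the target person name;

inputting the target person name and the target keyword into a vector space model to obtain a target feature vector corresponding to the target person name;

determining the similarity between the target feature vector and a pre-stored feature vector;

when the similarity is greater than a threshold value, determining that the target feature vector matches the pre-stored feature vector.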

2. The person name matching method according to claim 1, wherein before the step of determining the similarity of the target feature vector to a pre-stored feature vector, the method further comprises:

acquiring preset name introduction information;

performing word segmentation and word frequency analysis on the name introduction information to obtain names in the name introduction information and keywords corresponding to the names;

and inputting the name and the keyword into a vector space model to obtain the pre-stored characteristic vector.

3. The person name matching method according to claim 1, wherein the step of determining the similarity of the target feature vector and a pre-stored feature vector comprises:

and determining the similarity between the target characteristic vector and a pre-stored characteristic vector by utilizing a cosine similarity or K nearest neighbor classification algorithm.

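4. The person name matching method according to claim 1, wherein the cosine similarity satisfies a formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

where A_i represents the components of the target feature vector and B_i represents the components of the pre-stored feature vector.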

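5. A person name matching apparatus, characterized in that the apparatus comprises:

an information acquisition unit, configured to perform word segmentation and word frequency analysis on an article to be detected to obtain a target person name of the article to be detected and a target keyword corresponding to the target person name;

a feature vector acquisition unit, configured to input the target person name and the target keyword into a vector space model to obtain a target feature vector corresponding to the target person name;

a similarity determining unit, configured to determine a similarity between the target feature vector and a pre-stored feature vector;

a matching determination unit, configured to determine that the target feature vector matches the pre-stored feature vector when the similarity is greater than a threshold.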

6. The person name matching apparatus according to claim 5, wherein the information obtaining unit is further configured to obtain preset person name introduction information;

the information acquisition unit is also used for carrying out word segmentation and word frequency analysis on the name introduction information so as to acquire names in the name introduction information and keywords corresponding to the names;

the feature vector obtaining unit is further configured to input the name and the keyword into a vector space model to obtain the pre-stored feature vector.

7. The name matching apparatus as claimed in claim 5, wherein the similarity determining unit is configured to determine the similarity between the target feature vector and a pre-stored feature vector by using a cosine similarity or K-nearest neighbor classification algorithm.

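8. The person name matching apparatus according to claim 7, wherein the cosine similarity satisfies a formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

where A_i represents the components of the target feature vector and B_i represents the components of the pre-stored feature vector.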

9. An electronic device, comprising:

a memory for storing one or more programs;

a processor;

the one or more programs, when executed by the processor, implement the method of any of claims 1-4.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-4.