CN110674640B

CN110674640B - Chinese name acquisition method, and training method and device of Chinese name extraction model

Info

Publication number: CN110674640B
Application number: CN201910910471.9A
Authority: CN
Inventors: 袁杰; 张�杰; 陈秀坤; 高古明
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-09-25
Filing date: 2019-09-25
Publication date: 2022-10-25
Anticipated expiration: 2039-09-25
Also published as: CN110674640A

Abstract

The application provides a Chinese name acquisition method, and a training method and device for a Chinese name extraction model. A feature vector is obtained for each candidate name, and the feature vectors together form a feature matrix; because the feature vectors of different candidate names differ, the contextual relationship between names need not be considered. A parameter column vector in the name extraction model is then trained using the feature matrix. A first feature vector is determined from the position of the maximum value in the index vector output by the model. When the distance value between the first feature vector and a second feature vector corresponding to the real name is not equal to a preset value, the model is updated and the feature matrix is input into the updated model. This continues until the distance value between a third feature vector, determined using the updated model, and the second feature vector equals the preset value, at which point the updated model is taken as the trained name extraction model. The trained model can thus accurately determine the real name from a series of names.

Description

Chinese name acquisition method, and training method and device of Chinese name extraction model

Technical Field

The application relates to the technical field of information identification, in particular to a Chinese name acquisition method, a training method of a Chinese name extraction model and a training device of the Chinese name extraction model.

Background

In some service scenarios, because data come from many sources, one mobile phone number often corresponds to multiple names. The same person is stored under different contact names in the address books of people in different relationship networks. For example, for a particular person, Zhang San (male), his parents may store him under a nickname such as "baby son", his friends or classmates may store him under his real name "Zhang San", and his coworkers and clients may store him under a name that includes company information, such as "xx company, Zhang San".

However, existing methods of extracting Chinese names typically extract the names from text based on the contextual relationship of words in long sentences. For example, take the sentence "He was born in Jinhua, Zhejiang, and his name is Jinhua." The text sequence is first segmented into words; sequence labeling is then performed with algorithms such as deep learning or conditional random fields, using the part-of-speech context, to obtain the part of speech and entity type of each segmented word. The final result is "he\r born\v in\p Zhejiang\ns Jinhua\ns ,\w he\r 's\u name\n called\v Jinhua\nr .\w", where \ns represents a place name, \nr represents a person name, \v represents a verb, and \w represents a punctuation mark.

Because the prior art can only extract names from long-sentence text that has contextual relationships, and has difficulty extracting names from fixed noun phrases that lack a semantic environment, it cannot accurately determine the real name from a series of names.

Summary of the Application

In view of this, an object of the embodiments of the present application is to provide a Chinese name acquisition method, and a training method and device for a Chinese name extraction model, so as to accurately determine the real name from a series of names.

In a first aspect, an embodiment of the present application provides a method for training a chinese name extraction model, where the method includes: obtaining a feature vector corresponding to each candidate name in a plurality of candidate names, and forming a feature matrix by the plurality of feature vectors; wherein, the feature vectors corresponding to different candidate names are different; the candidate names comprise real names required to be determined from the candidate names; inputting the characteristic matrix into a name extraction model to obtain an index vector; determining whether a first distance value of a first feature vector corresponding to the position of the maximum value in the index vector and a second feature vector corresponding to the real name is equal to a preset value; updating a parameter column vector in the name extraction model upon determining that the first distance value is not equal to the preset value; wherein the dimensions of the parameter column vector and the dimensions of the feature vector are the same; inputting the feature matrix into the updated name extraction model to obtain a new index vector; determining whether a second distance value between a third feature vector corresponding to the position of the maximum value in the new index vector and the second feature vector is equal to the preset value; and when the second distance value is determined to be equal to the preset value, taking the updated name extraction model as a trained name extraction model.

In this implementation, a feature vector is obtained for each candidate name and the feature vectors are combined into a feature matrix. Because the feature vectors of different candidate names differ, each candidate name is unique, and the contextual relationships among the names need not be considered. The parameter column vector in the name extraction model is then trained with the feature matrix. The first feature vector predicted by the model is determined from the correspondence between the position of the maximum value in the index vector output by the model and the feature vectors in the feature matrix. When the distance value between the first feature vector and the second feature vector corresponding to the real name is not equal to the preset value, the current model cannot accurately determine the real name from the candidate names, so the parameter column vector is updated and the feature matrix is input into the updated model again. This is repeated until the distance value between the third feature vector determined by the updated model and the second feature vector equals the preset value, at which point the updated model is taken as the trained name extraction model, which can accurately determine the real name from the candidate names.

In one possible design based on the first aspect, the name extraction model is

Wherein X represents the plurality of candidate names, f(X) represents the feature matrix, and α represents the parameter column vector; v_i represents the feature vector corresponding to the i-th candidate name, and z represents the index vector; i is an integer greater than or equal to 1.

In the implementation process, each parameter in the parameter column vector in the name extraction model is equivalent to the weight required to be multiplied by each feature vector, and the feature vector corresponding to the maximum value in the index vector output by the name extraction model is the same as the feature vector of the real name by continuously adjusting the value of the weight, so that the distinguishing capability of the trained parameter column vector on each candidate name is ensured, and the accuracy of the trained name extraction model can be further ensured.
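The weighting described above can be sketched as follows. This is a minimal illustration that assumes the index vector is computed as z = f(X)ᵀα, a linear form consistent with the definitions given here but not confirmed by the text. The function name `index_vector` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def index_vector(feature_vectors: Sequence[Sequence[float]],
                 alpha: Sequence[float]) -> List[float]:
    """Return z, where z[i] is the dot product of alpha and v_i.

    Assumption: the model is linear, z = f(X)^T * alpha.
    """
    for v in feature_vectors:
        if len(v) != len(alpha):
            raise ValueError("alpha must have the same dimension as each feature vector")
    return [sum(a * x for a, x in zip(alpha, v)) for v in feature_vectors]


# Three hypothetical candidates, each described by n = 3 features.
features = [[1.0, 1.0, 3.0], [2.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
alpha = [-1.0, 2.0, 0.5]  # illustrative weights, not trained values

z = index_vector(features, alpha)
print(z)
```

Under this assumed form, each entry of α weights one feature, so adjusting α changes which candidate receives the largest entry of z.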

In a possible design based on the first aspect, obtaining a feature vector corresponding to each candidate name in a plurality of candidate names includes: determining a sum of distances between the candidate name and each of the remaining candidate names in the plurality of candidate names; obtaining a judgment result for representing whether a first word in the candidate name belongs to a surname; acquiring the sum of the times of occurrence of each word in the candidate name in the candidate names; and respectively endowing corresponding elements in the corresponding feature vectors with corresponding values by using the distance sum, the judgment result and the frequency sum to obtain the corresponding feature vectors.

In this implementation, the distance sum characterizes how similar the candidate name is to the other candidate names: the smaller the distance sum, the more likely the candidate name is the real name. Because the first word of a real name necessarily belongs to a surname, whereas the first word of other candidate names may not, judging whether the first word belongs to a surname helps separate the real name from the other candidate names. In addition, the more often the words of a candidate name occur among the candidate names, the more likely that candidate name is the real name, and the less often, the less likely. Using the distance sum, the judgment result, and the frequency sum as elements of the candidate name's feature vector therefore keeps the real name distinguishable from the other candidate names, so that the name extraction model can determine the real name more accurately and quickly.
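The three elements can be assembled as in the sketch below. It assumes that each name is treated as a set of characters for the Jaccard distance and that the vector has exactly these three elements in this order; the helper names, the one-entry surname set, and the sample candidates are hypothetical.

```python
from typing import List, Set


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between the character sets of two names (assumed representation)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def feature_vector(i: int, candidates: List[str], surnames: Set[str]) -> List[float]:
    """Build [distance sum, surname flag, frequency sum] for candidates[i]."""
    name = candidates[i]
    # Sum of distances to every remaining candidate name.
    distance_sum = sum(jaccard_distance(name, other)
                       for j, other in enumerate(candidates) if j != i)
    # 1 if the first character is a known surname, otherwise 0.
    surname_flag = 1.0 if name[:1] in surnames else 0.0
    # Total occurrences, across all candidates, of each character in the name.
    all_chars = "".join(candidates)
    frequency_sum = float(sum(all_chars.count(ch) for ch in name))
    return [distance_sum, surname_flag, frequency_sum]


candidates = ["张三", "小张", "儿子"]  # Zhang San, Xiao Zhang, son
surnames = {"张"}                      # tiny illustrative surname set

vectors = [feature_vector(i, candidates, surnames) for i in range(len(candidates))]
for name, vec in zip(candidates, vectors):
    print(name, vec)
```

In this toy data the real name "张三" is the only candidate whose surname flag is 1, which is what makes its feature vector differ from that of "小张".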

In a possible design, based on the first aspect, updating the parameter column vector in the name extraction model includes: updating a parameter column vector in the name extraction model based on a difference of the second feature vector and the first feature vector.

In the implementation process, the parameter column vector is updated by using the difference value between the first characteristic vector and the characteristic vector corresponding to the real name, so that the updated parameter column vector can be quickly close to the real parameter column vector, the training times are reduced, and the model training efficiency is improved.
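One way such a difference-based update could look is sketched below. The additive, perceptron-style step and the `learning_rate` parameter are assumptions; the text states only that the update is based on the difference between the second and first feature vectors.

```python
from typing import List, Sequence


def update_alpha(alpha: Sequence[float],
                 v_true: Sequence[float],
                 v_pred: Sequence[float],
                 learning_rate: float = 1.0) -> List[float]:
    """Shift alpha along (v_true - v_pred).

    Assumption: an additive step; the exact update rule is not given in the text.
    """
    return [a + learning_rate * (t - p) for a, t, p in zip(alpha, v_true, v_pred)]


alpha = [0.0, 0.0, 0.0]
v_true = [1.0, 1.0, 3.0]  # second feature vector (the real name)
v_pred = [2.0, 0.0, 2.0]  # first feature vector (the model's current pick)

new_alpha = update_alpha(alpha, v_true, v_pred)
print(new_alpha)
```

If candidates are scored by the dot product with α, this step moves the score of the real name's feature vector from 0 to 3 while the wrongly chosen vector stays at 0, so the real name overtakes it after a single update.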

Based on the first aspect, in one possible design, after determining whether the second distance value is equal to the preset value, the method further includes: when the second distance value is not equal to the preset value, determining whether the current updating times are equal to preset updating times; when the current updating times are determined to be equal to the preset updating times, determining a minimum distance value from the first distance value and the second distance value; and updating the name extraction model by using the parameter column vector corresponding to the minimum distance value.

In this implementation, the distance value obtained while training the name extraction model may never equal the preset value. To keep training from continuing indefinitely, training ends when the current number of updates equals the preset number of updates, which improves training efficiency. Moreover, a smaller distance value indicates that the name determined by the name extraction model is closer to the real name, so the name extraction model is updated with the parameter column vector corresponding to the minimum distance value, which safeguards the accuracy of the model.

In a possible design based on the first aspect, after determining whether the current number of updates is equal to the preset number of updates, the method further includes: when the current updating times are determined to be smaller than the preset updating times, updating parameter column vectors in the name extraction model to obtain a new updated name extraction model; when a third distance value determined by using a new updated name extraction model is equal to the preset value, taking the new updated name extraction model as the trained name extraction model; or when the next updating time is equal to the preset updating time, updating the name extraction model by using the parameter column vector corresponding to the minimum distance value in the first distance value, the second distance value and the third distance value.

In this implementation, when the current number of updates is less than the preset number of updates and the currently obtained distance value is not equal to the preset value, training of the name extraction model continues. Either a third distance value determined with the newly updated name extraction model equals the preset value, in which case the newly updated model is taken as the trained name extraction model, or the number of updates reaches the preset number of updates, in which case the name extraction model is updated with the parameter column vector corresponding to the minimum among the first, second and third distance values. This safeguards the accuracy of the trained name extraction model.
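The capped training procedure can be sketched as a loop. The sketch assumes dot-product scoring, Euclidean distance, a preset value of 0, a zero-initialized parameter vector, and an additive update; none of these specifics are fixed by the text, and the names `train`, `max_updates` and `preset` are hypothetical.

```python
from typing import List, Sequence, Tuple


def distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance; 0 means the two feature vectors coincide."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def train(features: Sequence[Sequence[float]], true_index: int,
          max_updates: int = 100, preset: float = 0.0) -> Tuple[List[float], int]:
    """Return (alpha, number of updates performed)."""
    alpha = [0.0] * len(features[0])
    v_true = features[true_index]
    best_alpha, best_dist = list(alpha), float("inf")
    updates = 0
    while True:
        # Index vector and the position of its maximum value.
        z = [sum(a * x for a, x in zip(alpha, v)) for v in features]
        pred = max(range(len(z)), key=z.__getitem__)
        d = distance(features[pred], v_true)
        if d < best_dist:  # remember the parameters with the smallest distance
            best_alpha, best_dist = list(alpha), d
        if d == preset:
            return alpha, updates        # model picks the real name
        if updates == max_updates:
            return best_alpha, updates   # fall back to the best parameters seen
        # Assumed update: move alpha along (v_true - v_pred).
        alpha = [a + (t - p) for a, t, p in zip(alpha, v_true, features[pred])]
        updates += 1


features = [[2.0, 0.0, 2.0], [1.0, 1.0, 3.0], [3.0, 0.0, 2.0]]
alpha, updates = train(features, true_index=1)
print(alpha, updates)

# A case that cannot converge under linear scoring: the cap ends training.
capped_alpha, capped_updates = train([[0.0], [1.0], [2.0]], true_index=1, max_updates=5)
print(capped_alpha, capped_updates)
```

The second call oscillates between two parameter values, which shows why the update cap and the minimum-distance fallback are needed.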

Based on the first aspect, in a possible design, before obtaining the feature vector corresponding to each candidate name, the method further includes: acquiring a plurality of alternative names corresponding to the same mobile phone number; and segmenting the plurality of alternative names using a word segmentation method to obtain the plurality of candidate names.

In the implementation process, the alternative names are segmented by using a segmentation method, so that the number of the candidate names is more than or equal to the number of the alternative names, and the probability of the real name appearing in the candidate names is increased.

In a second aspect, an embodiment of the present application provides a method for extracting a Chinese name, where the method includes: obtaining a feature vector of each candidate name in a plurality of candidate names to obtain a feature matrix formed by the feature vectors; inputting the feature matrix into a name extraction model trained by the method of the first aspect to obtain an index vector; obtaining the position of the maximum value in the index vector; and determining the name corresponding to that position among the plurality of candidate names as the real name.

In the implementation process, by obtaining the feature vector of each candidate name in the plurality of candidate names and inputting the feature matrix formed by the plurality of feature vectors into the trained name extraction model, the trained name extraction model can be used for quickly and accurately determining the real name from the plurality of candidate names due to the uniqueness of the feature vectors corresponding to different candidate names.
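The extraction step can be sketched as mapping the position of the largest index-vector entry back to a candidate name. Dot-product scoring is an assumption, and the feature vectors and trained parameters below are hypothetical values chosen for illustration.

```python
from typing import Sequence


def extract_real_name(candidates: Sequence[str],
                      feature_vectors: Sequence[Sequence[float]],
                      alpha: Sequence[float]) -> str:
    """Return the candidate at the position of the largest index-vector entry."""
    if not candidates or len(candidates) != len(feature_vectors):
        raise ValueError("need one feature vector per candidate name")
    z = [sum(a * x for a, x in zip(alpha, v)) for v in feature_vectors]
    position = max(range(len(z)), key=z.__getitem__)  # first maximum wins ties
    return candidates[position]


candidates = ["张三", "小张", "儿子"]  # Zhang San, Xiao Zhang, son
feature_vectors = [[5 / 3, 1.0, 3.0], [5 / 3, 0.0, 3.0], [2.0, 0.0, 2.0]]
alpha = [-1.0, 1.0, 1.0]  # hypothetical trained parameters

real_name = extract_real_name(candidates, feature_vectors, alpha)
print(real_name)
```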

In a third aspect, an embodiment of the present application provides a training apparatus for a chinese name extraction model, where the apparatus includes: the first feature vector obtaining unit is used for obtaining a feature vector corresponding to each candidate name in a plurality of candidate names and forming a feature matrix by the feature vectors; wherein, the feature vectors corresponding to different candidate names are different; the candidate names comprise real names required to be determined from the candidate names; the first acquisition unit is used for inputting the characteristic matrix into a name extraction model to obtain an index vector; a first determining unit, configured to determine whether a first distance value of a first feature vector corresponding to a position of a maximum value in the index vector and a second feature vector corresponding to the real name are equal to a preset value; the updating unit is used for updating the parameter column vectors in the name extraction model when the first distance value is determined not to be equal to the preset value; wherein the dimensions of the parameter column vector and the dimensions of the feature vector are the same; the second acquisition unit is used for inputting the feature matrix to the updated name extraction model to obtain a new index vector; a second determining unit, configured to determine whether a second distance value between a third eigenvector corresponding to a position of a maximum value in the new index vector and the second eigenvector is equal to the preset value; and the model determining unit is used for taking the updated name extraction model as a trained name extraction model when the second distance value is determined to be equal to the preset value.

In a possible design, based on the third aspect, the name extraction model is

Wherein X represents the plurality of candidate names, f(X) represents the feature matrix, and α represents the parameter column vector; v_i represents the feature vector corresponding to the i-th candidate name, and z represents the index vector; i is an integer greater than or equal to 1.

Based on the third aspect, in one possible design, the first feature vector obtaining unit is further configured to determine a sum of distances between the candidate name and each of the remaining candidate names in the plurality of candidate names; and obtaining a judgment result for representing whether the first word in the candidate name belongs to a surname; acquiring the sum of the times of occurrence of each word in the candidate name in the candidate names; and obtaining the distance sum, the judgment result and the vector with the frequency sum as an element, wherein the vector is the corresponding characteristic vector.

In a possible design, based on the third aspect, the updating unit is further configured to update the parameter column vector in the name extraction model based on a difference between the second feature vector and the first feature vector.

In a possible design based on the third aspect, the apparatus further includes: the frequency determining unit is used for determining whether the current updating frequency is equal to the preset updating frequency or not when the second distance value is determined not to be equal to the preset value; a minimum distance value determining unit, configured to determine a minimum distance value from all the first distance values and the second distance values when it is determined that the current update time is equal to the preset update time; and the model updating unit is used for updating the name extraction model by using the parameter column vector corresponding to the minimum distance value.

In a possible design based on the third aspect, the apparatus further includes: the parameter updating unit is used for updating the parameter column vector in the name extraction model when the current updating times are determined to be smaller than the preset updating times, and obtaining a new updated name extraction model; when a third distance value determined by using a new updated name extraction model is equal to the preset value, taking the new updated name extraction model as the trained name extraction model; or when the next updating time is equal to the preset updating time, updating the name extraction model by using the parameter column vector corresponding to the minimum distance value in the first distance value, the second distance value and the third distance value.

In a possible design based on the third aspect, the apparatus further includes: the alternative name acquisition unit is used for acquiring a plurality of alternative names corresponding to the same mobile phone number; and the candidate name acquisition unit is used for segmenting the plurality of alternative names using a word segmentation method to obtain the plurality of candidate names.

In a fourth aspect, an embodiment of the present application provides a chinese name obtaining apparatus, where the apparatus includes: the second feature vector acquisition unit is used for acquiring a feature vector of each candidate name in a plurality of candidate names to obtain a feature matrix formed by the feature vectors; a third obtaining unit, configured to input the feature matrix into a name extraction model trained by the method of the first aspect, and obtain an index vector; a position obtaining unit, configured to obtain a position of a maximum value in the index vector; a name determining unit, configured to determine a name corresponding to the position in the candidate names as a real name.

In a fifth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory connected to the processor, where a computer program is stored in the memory, and when the computer program is executed by the processor, the electronic device is caused to perform the method of the first aspect and the second aspect.

In a sixth aspect, embodiments of the present application provide a storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the method of the first aspect and the second aspect.

Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a flowchart of a training method for a Chinese name extraction model according to an embodiment of the present application.

Fig. 2 is a flowchart of a method for extracting a Chinese name according to an embodiment of the present application.

Fig. 3 is a schematic structural diagram of a training apparatus for a Chinese name extraction model according to an embodiment of the present application.

Fig. 4 is a schematic structural diagram of a Chinese name extraction device according to an embodiment of the present application.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solution in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

Referring to Fig. 1, Fig. 1 is a flowchart of a training method for a Chinese name extraction model according to an embodiment of the present application. The method includes steps S100, S200, S300, S400, S500, S600 and S700.

S100: obtaining a feature vector corresponding to each candidate name in a plurality of candidate names, and forming a feature matrix by the plurality of feature vectors; wherein, the feature vectors corresponding to different candidate names are different; the candidate names include real names required to be determined from the candidate names.

S200: inputting the feature matrix into a name extraction model to obtain an index vector.

S300: determining whether a first distance value between a first feature vector corresponding to the position of the maximum value in the index vector and a second feature vector corresponding to the real name is equal to a preset value.

S400: updating a parameter column vector in the name extraction model upon determining that the first distance value is not equal to the preset value; wherein the dimension of the parameter column vector is the same as the dimension of the feature vector.

S500: inputting the feature matrix into the updated name extraction model to obtain a new index vector.

S600: determining whether a second distance value between a third feature vector corresponding to the position of the maximum value in the new index vector and the second feature vector is equal to the preset value.

S700: when the second distance value is determined to be equal to the preset value, taking the updated name extraction model as a trained name extraction model.

The above method is described in detail below:

In some service scenarios, because data come from many sources, one mobile phone number often corresponds to multiple names; for example, the same person is stored under inconsistent contact names in the address books of people in different relationship networks. Accurately extracting the real name from the multiple names corresponding to one mobile phone number is therefore valuable for name matching services. For example, it can help police officers quickly and accurately find the real name of a criminal suspect among a plurality of candidate names, which facilitates subsequent investigation, evidence collection and case solving.

Before S100, the method further comprises steps A and B.

A: acquiring a plurality of alternative names corresponding to the same mobile phone number.

In actual implementation, step A may be carried out by collecting, for one mobile phone number, the names under which different people have stored the contact corresponding to that number. No particular means of collecting the alternative names is required, and the mobile phone number itself is not limited. The more alternative names correspond to one mobile phone number, and/or the more mobile phone numbers are used to train the model, the more sample data are available and the better the applicability and performance of the trained name extraction model.

B: segmenting the plurality of alternative names using a word segmentation method to obtain the plurality of candidate names.

Because the real name may not appear among the multiple alternative names, word segmentation is applied, after the alternative names corresponding to the same mobile phone number are obtained, to those alternative names that can be segmented, so as to increase the probability that the real name appears among the candidate names; this yields the plurality of candidate names. Alternative names that can be segmented are segmented, and those that cannot are left unchanged. For example, "Zhang San" is still "Zhang San" after segmentation, whereas "xx company Zhang San" is segmented into ("xx", "company" and "Zhang San") or ("xx company" and "Zhang San").

Because segmentation can split one alternative name into one, two, three or more candidate names, the number of candidate names obtained by the word segmentation method is greater than or equal to the number of alternative names. Note that an alternative name that yields only one candidate name is, in effect, an alternative name that cannot be segmented.

The word segmentation method may be the jieba, HanLP or Stanford word segmentation method, or the like. Because the specific implementation of segmenting the plurality of alternative names is well known to those skilled in the art, it is not described in detail here.
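Steps A and B can be sketched with a pluggable segmenter. The function `expand_candidates` and the lookup-table segmenter `toy_segment` are hypothetical stand-ins so that the example runs without third-party packages; a real segmenter that returns a list of tokens, for instance `jieba.lcut`, could be passed in its place.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical segmentation results standing in for a real word segmenter.
TOY_SEGMENTS: Dict[str, List[str]] = {"xx公司张三": ["xx公司", "张三"]}


def toy_segment(name: str) -> List[str]:
    """Return the stored pieces, or the name itself if it cannot be segmented."""
    return TOY_SEGMENTS.get(name, [name])


def expand_candidates(alternatives: Iterable[str],
                      segment: Callable[[str], List[str]]) -> List[str]:
    """Segment every alternative name and collect all pieces as candidate names."""
    candidates: List[str] = []
    for name in alternatives:
        pieces = segment(name)
        # A name that yields no pieces is kept whole, so nothing is lost.
        candidates.extend(pieces if pieces else [name])
    return candidates


alternatives = ["张三", "宝贝儿子", "xx公司张三"]  # names stored for one phone number
candidates = expand_candidates(alternatives, toy_segment)
print(candidates)
```

Each alternative name contributes at least one piece, so the candidate list is never shorter than the list of alternative names.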

In this embodiment, the number of candidate names is defined as m, where m is an integer greater than or equal to 2 whose value is determined by the actual situation, and each feature vector has dimension n × 1. In other embodiments, each feature vector may instead have dimension 1 × n. Here n is a positive integer greater than or equal to 2 whose value is determined by actual requirements. The larger the value of n, the lower the similarity between the feature vectors of different candidate names, and the easier the names are to distinguish.

Define the feature mapping function as $f$, and let $x_i$ denote the $i$-th candidate name in the plurality of candidate names. The feature vector $v_i$ corresponding to that candidate name is defined as:

$$v_i = f(x_i), \quad i = 1, 2, \ldots, m$$

Thus, for a set $X$ comprising $m$ candidate names, $X = \langle x_1, x_2, \ldots, x_i, \ldots, x_m \rangle$, after the feature vector corresponding to each of the $m$ candidate names is obtained, the feature vectors form a feature matrix of dimension $n \times m$:

$$f(X) = \langle v_1, v_2, \ldots, v_i, \ldots, v_m \rangle$$

After the plurality of candidate names are acquired, in order to better distinguish the candidate names so that each candidate name is unique, S100 includes, as an embodiment, steps C, D, E and F.

C: a sum of distances between the candidate name and each of the remaining candidate names in the plurality of candidate names is determined.

For each candidate name in the plurality of candidate names, a Jaccard distance between the candidate name and each candidate name remaining in the plurality of candidate names is calculated using a Jaccard algorithm, the Jaccard distance being used to measure dissimilarity between the two sets. Wherein, the smaller the Jaccard distance between two candidate names, the more similar the two candidate names are. Since it is well known to those skilled in the art to use the Jaccard algorithm to calculate the Jaccard distance between two candidate names, it is not described herein in detail.

For example: for set A = {a, b, c, d} and set B = {c, d, e, f}, A ∩ B = {c, d} and A ∪ B = {a, b, c, d, e, f}, so the Jaccard distance is 1 - |A ∩ B| / |A ∪ B| = 1 - 2/6 = 1 - 1/3 = 2/3.

D: obtaining a judgment result for representing whether the first word in the candidate name belongs to a surname.

Surnames are stored in advance in surname dictionaries, such as a dictionary of common surnames, a dictionary of surnames appearing in books, and a dictionary of surnames of historical figures; the more surname data are collected, the more accurate the judgment result. For each candidate name in the plurality of candidate names, the first word of the candidate name is determined and compared, character by character, against the pre-stored dictionaries. If the first word is found, a judgment result indicating that the first word of the candidate name belongs to a surname is obtained; otherwise, a judgment result indicating that it does not belong to a surname is obtained. In this embodiment, the judgment result is 1 when the first word belongs to a surname and 0 when it does not. In other embodiments, other numbers, such as -1 and -2, may be used to represent the judgment result.
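The surname check can be sketched as a set lookup over merged dictionaries. The two short surname lists are illustrative samples rather than the dictionaries used in the application, and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def build_surname_set(*dictionaries: Iterable[str]) -> Set[str]:
    """Merge several surname dictionaries into one lookup set."""
    merged: Set[str] = set()
    for dictionary in dictionaries:
        merged.update(dictionary)
    return merged


def surname_flag(name: str, surnames: Set[str]) -> int:
    """Return 1 if the first character of the name is a known surname, else 0."""
    return 1 if name and name[0] in surnames else 0


COMMON_SURNAMES: List[str] = ["张", "王", "李"]   # illustrative sample
HISTORICAL_SURNAMES: List[str] = ["赵", "钱"]     # illustrative sample

surnames = build_surname_set(COMMON_SURNAMES, HISTORICAL_SURNAMES)
flags = [surname_flag(name, surnames) for name in ["张三", "小张", "儿子"]]
print(flags)
```

This sketch checks only the first character, so a two-character compound surname such as "欧阳" would need a longer prefix match.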

E: acquiring the sum of the numbers of times each word in the candidate name appears in the plurality of candidate names.

As an embodiment, each word in the candidate name is compared one by one with every word in the plurality of candidate names, and the number of matches is counted; this gives the number of times the word appears in the plurality of candidate names. Adding up these numbers for all words in the candidate name gives the sum of times for that candidate name.

For example, the plurality of candidate names includes "Zhang San", "Xiao Zhang" and "son". For the candidate name "Zhang San", the word "Zhang" is compared one by one with each word in the plurality of candidate names, and the number of matches is determined to be 2, that is, "Zhang" appears 2 times in the plurality of candidate names; the word "San" is compared one by one with each word in the plurality of candidate names, and the number of matches is determined to be 1, that is, "San" appears 1 time in the plurality of candidate names. Therefore, the sum of the numbers of times each word in the candidate name "Zhang San" appears in the plurality of candidate names is 2+1=3.

As an embodiment, after the number of times each word appears in the plurality of candidate names has been determined, a first correspondence between each word and its number of times is established for the plurality of candidate names. When the same word is encountered again, its number of times is looked up directly in the first correspondence, which improves efficiency.

As an embodiment, the number of times each word appears among all the words of the plurality of candidate names is determined in advance, and a second correspondence between each word and its number of times is established and stored. For each candidate name in the plurality of candidate names, the number of times of each word in the candidate name is looked up in the second correspondence, and the sum of the numbers of times each word in the candidate name appears is then determined.

For example, the plurality of candidate names includes "Zhang San", "Xiao Zhang" and "son". The number of times of "Zhang" is 2, the number of times of "San" is 1, the number of times of "Xiao" is 1, and the number of times of each word in "son" is 1. The sum of the numbers of times each word in the candidate name "Zhang San" appears in the plurality of candidate names is 2+1=3. The sum of the numbers of times each word in the candidate name "son" appears in the plurality of candidate names is 1+1=2.
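The stored word-to-count correspondence can be sketched with a counter built once over all candidate names. The Chinese strings in the usage note are an assumed back-translation of the example's names, and `count_sums` is a hypothetical name.

```python
from collections import Counter


def count_sums(candidates):
    """Sum, per candidate, how often each of its characters occurs overall."""
    # Character -> number of occurrences across all candidate names.
    counts = Counter(ch for name in candidates for ch in name)
    return [sum(counts[ch] for ch in name) for name in candidates]
```

Assuming the example's names are 张三, 小张 and 儿子, `count_sums(["张三", "小张", "儿子"])` gives `[3, 3, 2]`, which agrees with the sums 2+1=3 and 1+1=2 stated in the text.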

Wherein the execution order of C, D and E is not limited.

F: obtaining a vector whose elements are the distance sum, the judgment result and the sum of times, wherein the vector is the corresponding feature vector.

In this embodiment, the value of the determination result is 0 or 1.

For each candidate name in the plurality of candidate names, after the distance sum, the judgment result and the sum of times corresponding to the candidate name are obtained, the value of the distance sum is assigned to the first element of the feature vector corresponding to the candidate name, the value of the judgment result is assigned to the second element, and the value of the sum of times is assigned to the third element, thereby obtaining the corresponding feature vector.

It should be noted that the value of the distance sum, the value of the sum of times and the value of the judgment result corresponding to the candidate name may each be assigned to any element of the feature vector corresponding to the candidate name. However, each value is assigned to exactly one element: the value of the distance sum, for example, cannot be assigned to two or three elements of the feature vector at the same time, and the same applies to the value of the sum of times and the value of the judgment result.

For example, when the distance sum of a candidate name is 1.5, the judgment result is 0 and the sum of times is 2, the feature vector of the candidate name may be [1.5, 0, 2], [1.5, 2, 0] or [2, 1.5, 0]. It is understood that the positions of these three values in the feature vector are not limited, as long as the feature vector of every candidate name is obtained in the same manner.
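Assembling the feature matrix from the three per-candidate values can be sketched as below. The fixed order (distance sum, judgment result, sum of times) is one of the permitted layouts, and `build_feature_matrix` is a hypothetical helper name.

```python
def build_feature_matrix(distance_sums, surname_flags, count_sums):
    """Stack one [distance_sum, surname_flag, count_sum] row per candidate."""
    if not (len(distance_sums) == len(surname_flags) == len(count_sums)):
        raise ValueError("all three feature lists must have the same length")
    # The same element order is used for every candidate, as the text requires.
    return [
        [float(d), float(s), float(c)]
        for d, s, c in zip(distance_sums, surname_flags, count_sums)
    ]
```

For the values in the example above, `build_feature_matrix([1.5], [0], [2])` yields the single row `[1.5, 0.0, 2.0]`.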

As an embodiment, corresponding values are respectively given to corresponding elements in the corresponding feature vectors by using the distance sum and the judgment result, so as to obtain the corresponding feature vectors.

As an embodiment, S100 includes: acquiring the word number of the candidate name; acquiring the sum of the frequencies of occurrence of each word in the candidate name in the plurality of candidate names; acquiring the sum of the times of occurrence of each word in the candidate name in the plurality of candidate names; obtaining a judgment result representing whether the first word in the candidate name is a surname; determining a sum of distances between the candidate name and each of the remaining candidate names in the plurality of candidate names; and obtaining a vector whose elements are the distance sum, the judgment result, the sum of times, the word number and the sum of frequencies, wherein the vector is the corresponding feature vector.

S200: inputting the feature matrix into a name extraction model to obtain an index vector.

Wherein the name extraction model is

Wherein X represents the plurality of candidate names, f(X) represents the feature matrix, and α represents the parameter column vector; v_i represents the feature vector corresponding to the i-th candidate name, and z represents the index vector; wherein i is an integer of 1 or more.

In the present embodiment, the parameter column vector α is initialized by assigning 0 to all of its elements. In other embodiments, α may be initialized by assigning values other than 0 to its elements. The parameter column vector α has a dimension of n × 1 and the index vector z has a dimension of m × 1.

The purpose of training is to learn the parameter column vector in the name extraction model; once the parameter column vector has been trained, the trained name extraction model is obtained. Therefore, for the plurality of candidate names, after the feature matrix f(X) corresponding to them is obtained, the feature matrix is input into the name extraction model to obtain the value of each element of the index vector z.
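The model formula itself is not reproduced in this text. Given the stated dimensions (α is n × 1, z is m × 1, one feature vector v_i per candidate), a linear score z = f(X)·α with the feature vectors as the rows of f(X) is a natural reading, and the sketch below adopts it purely as an assumption. The name `index_vector` is hypothetical.

```python
def index_vector(feature_matrix, alpha):
    """Assumed model: z[i] is the dot product of row v_i of f(X) with alpha."""
    for row in feature_matrix:
        if len(row) != len(alpha):
            raise ValueError("feature vector and alpha must have the same dimension")
    return [sum(v_k * a_k for v_k, a_k in zip(row, alpha)) for row in feature_matrix]
```

Under this assumption, an all-zero α yields an all-zero index vector, so the first prediction is decided only by how ties between equal maxima are broken.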

S300: determining whether a first distance value between a first feature vector corresponding to the position of the maximum value in the index vector and a second feature vector corresponding to the real name is equal to a preset value. In order to ensure the performance of the trained name extraction model, in this embodiment the preset value is 0. In other embodiments, the preset value may also be a number greater than 0 and smaller than 1.

After the index vector is obtained, the maximum value among its elements is determined, and then the position of the maximum value in the index vector is determined, the position being represented by the row number of the maximum value in the index vector. The first feature vector, that is, the feature vector whose row number in the feature matrix equals that row number, is then looked up in the feature matrix. To determine whether the name predicted by the name extraction model is the same as the real name, the Euclidean distance between the first feature vector and the second feature vector corresponding to the real name is determined; this Euclidean distance is the first distance value. The difference between the first distance value and the preset value is then computed. When the difference is 0, the first distance value is equal to the preset value, that is, the name predicted by the name extraction model is the same as the real name; when the difference is not 0, the first distance value is not equal to the preset value, that is, the name predicted by the name extraction model is different from the real name.

For example, the index vector is

The second feature vector of the real name is

Then, the maximum value in the index vector is 8, the position corresponding to the maximum value is 3, and the first feature vector is

Thus, the first distance value between the first feature vector and the second feature vector is

It is worth mentioning that when at least two maximum values with the same value exist in the index vector, one position is arbitrarily selected from the positions of the at least two maximum values in the index vector, and then the first feature vector corresponding to the selected position is determined according to the selected position.
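The check in S300 can be sketched as follows. Positions here are 0-based, whereas the text's example counts rows from 1; breaking ties by taking the first maximum is just one admissible "arbitrary" choice. All function names are hypothetical.

```python
import math


def predicted_row(z):
    """Row index of the maximum of z; the first maximum wins on ties."""
    return max(range(len(z)), key=z.__getitem__)


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def prediction_matches(feature_matrix, z, true_vector, preset=0.0):
    """True when the distance between the predicted and real vectors equals preset."""
    first_vector = feature_matrix[predicted_row(z)]
    return euclidean(first_vector, true_vector) == preset
```

With the default preset of 0, the comparison succeeds only when the selected feature vector is identical to the real name's feature vector, which is why the feature vectors of different candidate names must differ.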

S400: updating parameter column vectors in the name extraction model when the first distance value is determined not to be equal to the preset value; wherein the dimension of the parameter column vector is the same as the dimension of the feature vector.

It will be appreciated that after the parameter column vectors in the name extraction model are updated, the name extraction model is updated as well.

When it is determined in S300 that the first distance value is not equal to the preset value, the most recently obtained parameter column vector needs to be updated. So that the updated parameter column vector quickly approaches the real parameter column vector, which reduces the number of training iterations and improves model training efficiency, as an embodiment, S400 includes: updating the parameter column vector in the name extraction model based on the difference between the second feature vector and the first feature vector.

Obtaining a difference vector by making a difference between the second feature vector and the first feature vector, wherein the dimension of the difference vector is the same as the dimension of the parameter column vector, and updating the parameter column vector in the name extraction model by using the sum of the difference vector and the parameter column vector, wherein it can be understood that the sum of the difference vector and the parameter column vector is the updated parameter column vector.

For example, if the second feature vector is [1, 2, 3]^T and the first feature vector is [2, 2, 3]^T, the difference vector is [-1, 0, 0]^T. Assuming that the parameter column vector is [0, 0, 0]^T, the updated parameter column vector is the sum of the difference vector [-1, 0, 0]^T and the parameter column vector [0, 0, 0]^T, that is, the updated parameter column vector is [-1, 0, 0]^T.
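The update of S400 adds the difference between the real name's feature vector and the predicted feature vector to the parameter column vector. A minimal sketch, with the hypothetical name `update_alpha` and illustrative three-dimensional values:

```python
def update_alpha(alpha, true_vector, predicted_vector):
    """Return alpha + (true_vector - predicted_vector), element by element."""
    if not (len(alpha) == len(true_vector) == len(predicted_vector)):
        raise ValueError("all vectors must have the same dimension")
    return [a + (t - p) for a, t, p in zip(alpha, true_vector, predicted_vector)]
```

For instance, `update_alpha([0, 0, 0], [1, 2, 3], [2, 2, 3])` returns `[-1, 0, 0]`. The rule has the same form as a perceptron-style update, which is offered here only as an observation and not as a statement from the source.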

In one embodiment, a difference vector is obtained by subtracting the first feature vector and the second feature vector, and the parameter column vector in the name extraction model is updated by using the sum of the difference vector and the parameter column vector.

In other embodiments, the parameter column vector may be updated in other ways.

After the updated name extraction model is obtained, in order to verify whether it can correctly predict the real name, the feature matrix is input into the updated name extraction model to obtain the value of each element of a new index vector. It can be understood that the values of the elements of the new index vector are not all the same as the values of the corresponding elements of the previously calculated index vector. Since the specific implementation of S500 is the same as that of S200, it is not described herein again.

Since the specific implementation of S600 is the same as that of S300, it is not described again.

When the second distance value is determined to be equal to the preset value, the name predicted by the updated name extraction model is the same as the real name, and the updated name extraction model is taken as the trained name extraction model. It is to be understood that the parameter column vector in the updated name extraction model is the parameter column vector in the trained name extraction model.

As an embodiment, after S600, the method further includes: and when it is determined that the second distance value is not equal to the preset value, performing S400-S600 until the name predicted by the name extraction model is the same as the real name.

Since there may be cases in which the name extraction model cannot be trained to determine the real name accurately, in order to prevent the training process from falling into an endless loop while still ensuring the performance of the trained name extraction model, as an embodiment, after S300, the method further includes the steps G, H and I.

G: when it is determined that the second distance value is not equal to the preset value, determining whether the current number of updates is equal to a preset number of updates.

The preset number of updates may be a positive integer such as 8, 10, 15 or 18. The higher the preset number of updates, the higher the precision of the trained name extraction model, but the longer the training takes.

When it is determined that the Euclidean distance value between the first feature vector and the second feature vector corresponding to the real name is not equal to the preset value, the current number of updates is determined and its difference from the preset number of updates is computed. When the difference is 0, the current number of updates is equal to the preset number of updates; when the difference is not 0, the current number of updates is not equal to the preset number of updates.

For example, the name extraction model is updated for the second time, and when the distance value determined by using the name extraction model updated for the second time is not equal to the preset value, the current number of updates is determined to be 2.

H: when it is determined that the current number of updates is equal to the preset number of updates, determining a minimum distance value from the first distance value and the second distance value.

For the plurality of candidate names, each parameter column vector corresponds to one distance value. Therefore, when it is determined that the current number of updates is equal to the preset number of updates, in order to ensure the performance of the trained name extraction model, the minimum distance value is determined by pairwise comparison of all the distance values; the distance value with the smallest value is the minimum distance value.

I: updating the name extraction model by using the parameter column vector corresponding to the minimum distance value.

The parameter column vector corresponding to the minimum distance value is looked up based on the pre-stored correspondence between parameter column vectors and distance values, and the parameter column vector in the name extraction model is updated with it to obtain the updated name extraction model.

To ensure the performance of the trained name extraction model, as an embodiment, after G, the method further comprises: when it is determined that the current number of updates is smaller than the preset number of updates, obtaining a newly updated name extraction model; when a third distance value determined by using the newly updated name extraction model is equal to the preset value, taking the newly updated name extraction model as the trained name extraction model; or, when the next number of updates is equal to the preset number of updates, updating the name extraction model by using the parameter column vector corresponding to the minimum distance value among the first distance value, the second distance value and the third distance value.

The specific implementation of updating the parameter column vectors in the name extraction model is the same as S400-S600, G, H, and I, and therefore, the details are not repeated here.

The following provides a general description of the training method of the Chinese name extraction model.

Firstly, a feature matrix corresponding to a plurality of candidate names is acquired; the feature matrix is input into an initialized name extraction model to obtain a first index vector; it is determined whether a first distance value between a first feature vector corresponding to the position of the maximum value in the first index vector and a second feature vector corresponding to the real name is equal to a preset value; when it is determined that the first distance value is not equal to the preset value, the parameter column vector in the initialized name extraction model is updated to obtain a name extraction model after the first update; the feature matrix is input into the name extraction model after the first update to obtain a second index vector; it is determined whether a second distance value between a third feature vector corresponding to the position of the maximum value in the second index vector and the second feature vector is equal to the preset value; when it is determined that the second distance value is not equal to the preset value, it is determined whether the current number of updates is equal to the preset number of updates; when it is determined that the current number of updates is not equal to the preset number of updates, the parameter column vector in the name extraction model after the first update is updated to obtain a name extraction model after the second update; the feature matrix is input into the name extraction model after the second update to obtain a third index vector; it is determined whether a third distance value between a fourth feature vector corresponding to the position of the maximum value in the third index vector and the second feature vector is equal to the preset value; and when it is determined that the third distance value is equal to the preset value, the name extraction model after the second update is taken as the trained name extraction model.

Alternatively, when it is determined that the third distance value is not equal to the preset value, it is determined whether the current number of updates is equal to the preset number of updates; when it is, a minimum distance value is determined from the first distance value, the second distance value and the third distance value, and the initialized name extraction model is updated by using the parameter column vector corresponding to the minimum distance value to obtain the trained name extraction model.
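The overall procedure, including the update cap and the fallback to the parameter vector with the smallest distance, can be sketched as one loop. This is an illustration under assumptions, not the patented implementation: the score is assumed to be the dot product of each feature vector with α, ties are broken by taking the first maximum, and the names `train` and `max_updates` are hypothetical.

```python
import math


def train(feature_matrix, true_vector, max_updates=10, preset=0.0):
    """Return the trained parameter vector alpha as a list of floats."""
    alpha = [0.0] * len(true_vector)  # all-zero initialization, as in the embodiment
    history = []                      # (distance, alpha) for every model that failed
    updates = 0
    while True:
        # Assumed model: index vector z = f(X) . alpha
        z = [sum(x * a for x, a in zip(row, alpha)) for row in feature_matrix]
        predicted = feature_matrix[max(range(len(z)), key=z.__getitem__)]
        distance = math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true_vector)))
        if distance == preset:
            return alpha
        history.append((distance, alpha))
        if updates == max_updates:
            # Steps H and I: fall back to the alpha with the minimum distance.
            return min(history, key=lambda item: item[0])[1]
        alpha = [a + (t - p) for a, t, p in zip(alpha, true_vector, predicted)]
        updates += 1
```

Storing each distance together with its parameter vector plays the role of the pre-stored correspondence between parameter column vectors and distance values described above.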

Referring to fig. 2, an embodiment of the present application provides a flowchart of a Chinese name acquisition method, where the method includes: S10, S20, S30 and S40.

S10: acquiring a feature vector of each candidate name in a plurality of candidate names to obtain a feature matrix formed by the plurality of feature vectors.

S20: inputting the feature matrix into the name extraction model trained by the training method of the Chinese name extraction model in the above embodiment to obtain an index vector.

S30: acquiring the position of the maximum value in the index vector.

S40: determining a name corresponding to the position in the plurality of candidate names as a real name.

Since the specific implementation of S10-S30 is basically the same as that of the corresponding steps of the training method of the Chinese name extraction model, the only difference being that the model used in S20 is the name extraction model trained by that training method, the details are not described here again.

In an actual implementation, S40 may be implemented as follows. Since the position is represented by the row number of the maximum value in the index vector, after the position is obtained, the feature vector whose row number in the feature matrix is equal to the row number of the maximum value in the index vector is looked up in the feature matrix. Since the feature vectors in the feature matrix correspond one-to-one to the candidate names, the name corresponding to that feature vector can be determined, and that name is the real name.
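Steps S20 to S40 can be sketched as a single lookup once a trained parameter vector is available. As before, the dot-product score is an assumption, and `pick_real_name` and the sample values are hypothetical.

```python
def pick_real_name(candidates, feature_matrix, alpha):
    """Return the candidate whose row scores highest under the assumed model."""
    if len(candidates) != len(feature_matrix):
        raise ValueError("one feature vector is required per candidate name")
    # S20: index vector; S30: position of its maximum; S40: name at that position.
    z = [sum(x * a for x, a in zip(row, alpha)) for row in feature_matrix]
    row = max(range(len(z)), key=z.__getitem__)
    return candidates[row]
```

Because rows of the feature matrix and candidate names share the same order, the row index of the maximum maps directly back to a name without a reverse search over feature vectors.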

Referring to fig. 3, an embodiment of the present application provides a training apparatus for a Chinese name extraction model, the apparatus including:

a first feature vector obtaining unit 410, configured to obtain a feature vector corresponding to each candidate name in a plurality of candidate names, and form a feature matrix from the plurality of feature vectors; wherein, the feature vectors corresponding to different candidate names are different; the candidate names include real names required to be determined from the candidate names.

The first obtaining unit 420 is configured to input the feature matrix into a name extraction model to obtain an index vector.

A first determining unit 430, configured to determine whether a first distance value of a first feature vector corresponding to a position of a maximum value in the index vector and a second feature vector corresponding to the real name are equal to a preset value.

An updating unit 440, configured to update the parameter column vector in the name extraction model when it is determined that the first distance value is not equal to the preset value; wherein the dimension of the parameter column vector is the same as the dimension of the feature vector.

The second obtaining unit 450 is configured to input the feature matrix to the updated name extraction model to obtain a new index vector.

A second determining unit 460, configured to determine whether a second distance value between a third feature vector corresponding to the position of the maximum value in the new index vector and the second feature vector is equal to the preset value.

A model determining unit 470, configured to take the updated name extraction model as a trained name extraction model when it is determined that the second distance value is equal to the preset value.

As one embodiment, the name extraction model is

As an embodiment, the first feature vector obtaining unit 410 is further configured to: determine a sum of distances between the candidate name and each of the remaining candidate names in the plurality of candidate names; obtain a judgment result representing whether the first word in the candidate name is a surname; acquire the sum of the numbers of times each word in the candidate name appears in the plurality of candidate names; and obtain a vector whose elements are the distance sum, the judgment result and the sum of times, wherein the vector is the corresponding feature vector.

As an embodiment, the updating unit 440 is further configured to update the parameter column vector in the name extraction model based on a difference between the second feature vector and the first feature vector.

As an embodiment, the apparatus further comprises: an update count determining unit, configured to determine whether the current number of updates is equal to the preset number of updates when it is determined that the second distance value is not equal to the preset value; a minimum distance value determining unit, configured to determine a minimum distance value from the first distance value and the second distance value when it is determined that the current number of updates is equal to the preset number of updates; and a model updating unit, configured to update the name extraction model by using the parameter column vector corresponding to the minimum distance value.

As an embodiment, the apparatus further comprises: a parameter updating unit, configured to update the parameter column vector in the name extraction model when it is determined that the current number of updates is smaller than the preset number of updates.

As an embodiment, the apparatus further comprises: an alternative name acquisition unit, configured to acquire a plurality of alternative names corresponding to the same mobile phone number; and a candidate name acquisition unit, configured to segment the plurality of alternative names by using a word segmentation method to obtain the plurality of candidate names.

In a fourth aspect, an embodiment of the present application provides a Chinese name obtaining apparatus, where the apparatus includes:

a second feature vector obtaining unit 510, configured to obtain a feature vector of each candidate name in the plurality of candidate names to obtain a feature matrix formed by the plurality of feature vectors.

A third obtaining unit 520, configured to input the feature matrix into the name extraction model trained by the method according to the first aspect, so as to obtain an index vector.

A position obtaining unit 530, configured to obtain a position of a maximum value in the index vector.

A name determining unit 540, configured to determine a name corresponding to the position in the multiple candidate names as a real name.

For the process of implementing each function by each functional unit in this embodiment, please refer to the content described in the embodiment shown in fig. 1 and fig. 2, which is not described herein again.

Referring to fig. 5, an embodiment of the present application provides an electronic device 100 applied to the methods described in fig. 1 and fig. 2, and in the embodiment of the present application, the electronic device 100 may be a tablet computer, a smart phone, a Personal Digital Assistant (PDA), and the like.

The electronic device may include: a memory 102, a processor 101, and a communication bus for enabling connection and communication between these components.

The memory 102 is configured to store various data, such as the element values of the parameter column vector, the feature vector of each candidate name in a plurality of candidate names, the name extraction model, the trained name extraction model, and the computer program instructions corresponding to the training method of the Chinese name extraction model, the Chinese name acquisition method, and the devices provided in the embodiments of the present application. The memory 102 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), and the like.

The processor 101 is configured to execute the training method of the Chinese name extraction model provided in the embodiments of the present application when reading and running the computer program instructions stored in the memory that correspond to that training method, so as to train the name extraction model and obtain the trained name extraction model.

The processor 101 is further configured to execute the Chinese name acquisition method provided in the embodiments of the present application when reading and running the computer program instructions stored in the memory that correspond to that method, so as to determine that the name corresponding to the position in the plurality of candidate names is the real name.

The processor 101 may be an integrated circuit chip having signal processing capability. The Processor 101 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also a Digital Signal Processor (DSP), discrete gate or transistor logic, discrete hardware components.

In addition, a storage medium is provided in an embodiment of the present application, and a computer program is stored in the storage medium, and when the computer program runs on a computer, the computer is caused to execute the method provided in any embodiment of the present application.

In summary, in the Chinese name acquisition method and the training method and device of the Chinese name extraction model provided in the embodiments of the present application, a feature vector corresponding to each candidate name is obtained and the plurality of feature vectors form a feature matrix. The feature vectors corresponding to different candidate names are different, so each candidate name is unique and the context relationship between names does not need to be considered. The parameter column vector in the name extraction model is then trained using the feature matrix: a first feature vector predicted by the name extraction model is determined according to the correspondence between the position of the maximum value in the index vector output by the name extraction model and the feature vectors in the feature matrix. When the distance value between the first feature vector and the second feature vector corresponding to the real name is not equal to the preset value, which indicates that the currently obtained name extraction model cannot accurately determine the real name from the plurality of candidate names, the parameter column vector in the name extraction model is updated and the feature matrix is input into the updated name extraction model. This continues until the distance value between a third feature vector determined by the updated name extraction model and the second feature vector is equal to the preset value, at which point the updated name extraction model is taken as the trained name extraction model, so that the trained model can accurately determine the real name from the plurality of candidate names.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method of training a Chinese name extraction model, the method comprising:

obtaining a feature vector corresponding to each candidate name in a plurality of candidate names, and forming a feature matrix from the plurality of feature vectors; wherein the feature vectors corresponding to different candidate names are different, and the plurality of candidate names include a real name that needs to be determined from among the candidate names;

inputting the feature matrix into a name extraction model to obtain an index vector;

determining whether a first distance value between a first feature vector, corresponding to the position of the maximum value in the index vector, and a second feature vector, corresponding to the real name, is equal to a preset value;

updating a parameter column vector in the name extraction model upon determining that the first distance value is not equal to the preset value; wherein the dimensions of the parameter column vector and the dimensions of the feature vector are the same;

inputting the feature matrix into the updated name extraction model to obtain a new index vector;

determining whether a second distance value between a third feature vector corresponding to the position of the maximum value in the new index vector and the second feature vector is equal to the preset value;

when the second distance value is determined to be equal to the preset value, taking the updated name extraction model as a trained name extraction model;

wherein the name extraction model is [formula not reproduced in the source text]; wherein X represents the plurality of candidate names, [symbol not reproduced] represents the feature matrix, [symbol not reproduced] represents the parameter column vector, [symbol not reproduced] represents the feature vector corresponding to the i-th candidate name, and z represents the index vector; wherein i is an integer greater than or equal to 1.
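The steps of claim 1 can be sketched in a few lines of Python. Because the model formula is not reproduced in this text, the sketch assumes a linear score (each index-vector entry is the dot product of a feature vector with the parameter column vector), Euclidean distance, a preset value of 0, and an additive update based on the difference of the two feature vectors. The function names `index_vector`, `distance`, `selected_row`, and `train_once` are illustrative only and do not come from the application.

```python
import math


def index_vector(feature_matrix, w):
    """Score every candidate. ASSUMPTION: linear form z_i = phi(x_i) . w."""
    return [sum(f * p for f, p in zip(row, w)) for row in feature_matrix]


def distance(u, v):
    """Euclidean distance. ASSUMPTION: the claim does not name a metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def selected_row(feature_matrix, w):
    """Feature vector at the position of the maximum of the index vector."""
    z = index_vector(feature_matrix, w)
    return feature_matrix[z.index(max(z))]


def train_once(feature_matrix, real_index, w, preset=0.0):
    """One check, at most one update, one re-check, as in claim 1.

    Returns (parameter vector, whether the model counts as trained).
    """
    second = feature_matrix[real_index]        # feature vector of the real name
    first = selected_row(feature_matrix, w)    # feature vector the model picked
    if distance(first, second) == preset:
        return w, True                         # already selects the real name
    # ASSUMPTION: additive update by the difference of the two vectors.
    w_new = [p + (s - f) for p, s, f in zip(w, second, first)]
    third = selected_row(feature_matrix, w_new)
    return w_new, distance(third, second) == preset


features = [[3.0, 0.0, 2.0], [1.0, 1.0, 4.0], [2.0, 0.0, 1.0]]
w_trained, ok = train_once(features, real_index=1, w=[1.0, 0.0, 0.0])
```

With these made-up numbers the initial parameters select row 0, a single update moves the maximum to row 1, and the second check succeeds. Because different candidates have different feature vectors, a distance of 0 means the model picked the real name itself.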

2. The method of claim 1, wherein obtaining a feature vector corresponding to each candidate name in the plurality of candidate names comprises:

determining a sum of distances between the candidate name and each of the remaining candidate names in the plurality of candidate names;

obtaining a judgment result representing whether the first character of the candidate name is a surname;

acquiring the sum of the numbers of times each character of the candidate name occurs in the plurality of candidate names;

and obtaining a vector whose elements are the distance sum, the judgment result, and the occurrence-count sum, the vector being the corresponding feature vector.
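The three features of claim 2 can be illustrated as follows. This is a minimal sketch under several assumptions: the distance between two names is taken to be Levenshtein edit distance, each "word" of a name is treated as a single Chinese character, the surname check uses a tiny hypothetical lookup set `SURNAMES`, and repeated characters in a name are counted once per position. None of these choices is fixed by the claim.

```python
from collections import Counter

# ASSUMPTION: tiny hypothetical surname list, for illustration only.
SURNAMES = {"张", "王", "李", "赵"}


def edit_distance(a, b):
    """Levenshtein distance (an assumed choice of name distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def feature_vector(index, candidates):
    """Return [distance sum, surname flag, occurrence-count sum]."""
    name = candidates[index]
    others = candidates[:index] + candidates[index + 1:]
    distance_sum = sum(edit_distance(name, other) for other in others)
    is_surname = 1 if name[:1] in SURNAMES else 0
    # Count how often each character occurs across all candidate names.
    counts = Counter(ch for cand in candidates for ch in cand)
    occurrence_sum = sum(counts[ch] for ch in name)
    return [distance_sum, is_surname, occurrence_sum]


candidates = ["张伟", "张伟伟", "快递"]
feature_matrix = [feature_vector(i, candidates) for i in range(len(candidates))]
```

For these three made-up candidates the feature matrix is `[[3, 1, 5], [4, 1, 8], [5, 0, 2]]`: the non-name "快递" is far from the other candidates, does not start with a listed surname, and shares no characters with them.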

3. The method of claim 1, wherein updating the parameter column vector in the name extraction model comprises:

updating the parameter column vector in the name extraction model based on the difference between the second feature vector and the first feature vector.
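Read as a formula, and assuming an additive, perceptron-style rule with a hypothetical step size $\eta$ (the claim fixes neither the form of the update nor a step size), the update could be written as follows. Here $w$ denotes the parameter column vector, $\phi_{\mathrm{real}}$ the second feature vector, and $\phi_{\mathrm{sel}}$ the first feature vector; these symbol names are chosen for illustration.

$$
w' = w + \eta\,\bigl(\phi_{\mathrm{real}} - \phi_{\mathrm{sel}}\bigr), \qquad \eta > 0
$$

If the model also scores candidates linearly, $z_i = \phi_i^{\top} w$ (again an assumption), then

$$
\bigl(\phi_{\mathrm{real}} - \phi_{\mathrm{sel}}\bigr)^{\top} w'
= \bigl(\phi_{\mathrm{real}} - \phi_{\mathrm{sel}}\bigr)^{\top} w
+ \eta\,\bigl\lVert \phi_{\mathrm{real}} - \phi_{\mathrm{sel}} \bigr\rVert^{2},
$$

so each such update raises the score of the real name relative to the wrongly selected candidate by $\eta\,\lVert \phi_{\mathrm{real}} - \phi_{\mathrm{sel}} \rVert^{2}$.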

4. The method of claim 1, wherein after determining whether the second distance value is equal to the preset value, the method further comprises:

when the second distance value is not equal to the preset value, determining whether the current number of updates is equal to a preset number of updates;

when the current number of updates is determined to be equal to the preset number of updates, determining a minimum distance value from among the first distance value and the second distance value;

and updating the name extraction model by using the parameter column vector corresponding to the minimum distance value.

5. The method of claim 4, wherein after determining whether the current number of updates is equal to the preset number of updates, the method further comprises:

when the current number of updates is determined to be less than the preset number of updates, updating the parameter column vector in the name extraction model to obtain a newly updated name extraction model;

when a third distance value determined by using the newly updated name extraction model is equal to the preset value, taking the newly updated name extraction model as the trained name extraction model; or

when the number of updates then reaches the preset number of updates, updating the name extraction model by using the parameter column vector corresponding to the minimum distance value among the first distance value, the second distance value, and the third distance value.
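The stopping logic of claims 4 and 5 amounts to a bounded loop that remembers the parameter column vector with the smallest distance seen so far. The sketch below assumes the same linear score, Euclidean distance, preset value of 0, and difference-based update as a working hypothesis. `train_bounded`, `_select`, and `_dist` are hypothetical helper names.

```python
import math


def _select(feature_matrix, w):
    """Row at the maximum of the index vector (ASSUMPTION: linear score)."""
    z = [sum(f * p for f, p in zip(row, w)) for row in feature_matrix]
    return feature_matrix[z.index(max(z))]


def _dist(u, v):
    """Euclidean distance (ASSUMPTION: metric not fixed by the claims)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def train_bounded(feature_matrix, real_index, w, max_updates, preset=0.0):
    """Update until the distance equals `preset` or `max_updates` is reached.

    Returns (parameter vector, number of updates performed). If the preset
    distance is never reached, the parameters with the smallest observed
    distance are returned.
    """
    target = feature_matrix[real_index]
    chosen = _select(feature_matrix, w)
    d = _dist(chosen, target)
    if d == preset:
        return w, 0
    best_d, best_w = d, w            # first distance value, initial parameters
    updates = 0
    while updates < max_updates:
        # ASSUMPTION: additive, difference-based update.
        w = [p + (t - c) for p, t, c in zip(w, target, chosen)]
        updates += 1
        chosen = _select(feature_matrix, w)
        d = _dist(chosen, target)
        if d == preset:
            return w, updates        # model now selects the real name
        if d < best_d:
            best_d, best_w = d, w    # remember the closest attempt
    return best_w, updates


converged = train_bounded([[3.0, 0.0, 2.0], [1.0, 1.0, 4.0], [2.0, 0.0, 1.0]],
                          real_index=1, w=[1.0, 0.0, 0.0], max_updates=5)
stuck = train_bounded([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
                      real_index=2, w=[0.0, 0.0], max_updates=3)
```

The second call uses made-up data in which the real row is the midpoint of the other two, so a linear score can never rank it strictly highest. The loop stops after three updates and falls back to the parameters with the smallest recorded distance, here the initial ones.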

6. The method of claim 1, wherein before obtaining the feature vector characterizing the candidate name, the method further comprises:

acquiring a plurality of alternative names corresponding to the same mobile phone number;

and segmenting the plurality of alternative names by using a word segmentation method to obtain the candidate names.
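As an illustration of this preprocessing step, the sketch below collects the alternative names recorded for one mobile phone number and segments them. The claim does not say which word segmentation method is used; forward maximum matching against a small hypothetical vocabulary stands in for it here. The record layout, the phone numbers, the vocabulary, and the removal of duplicate segments are all assumptions made for the example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each
    position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def candidates_for_number(records, phone, vocab):
    """Segment every alternative name stored for `phone`; keep first-seen
    order and drop duplicates (ASSUMPTION: duplicates are not wanted)."""
    result = []
    for number, alternative_name in records:
        if number != phone:
            continue
        for token in forward_max_match(alternative_name, vocab):
            if token not in result:
                result.append(token)
    return result


# Hypothetical data: (phone number, alternative name) pairs.
records = [
    ("13800000000", "张伟快递"),
    ("13800000000", "张伟先生"),
    ("13900000000", "李娜"),
]
vocab = {"张伟", "快递", "先生"}
names = candidates_for_number(records, "13800000000", vocab)
```

Here the two alternative names stored for the first number yield the candidate names "张伟", "快递", and "先生".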

7. A method for Chinese name acquisition, the method comprising:

obtaining a feature vector of each candidate name in a plurality of candidate names to obtain a feature matrix formed by the feature vectors;

inputting the feature matrix into a name extraction model trained by the method of any one of claims 1-6 to obtain an index vector;

obtaining the position of the maximum value in the index vector;

determining the candidate name at that position among the plurality of candidate names to be the real name.
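Once a parameter column vector has been trained, the acquisition method reduces to scoring the candidates and taking the position of the maximum. The sketch below again assumes a linear score. The function name `pick_real_name`, the feature values, and the parameter vector are made up for illustration.

```python
def pick_real_name(candidates, feature_matrix, w):
    """Return the candidate at the position of the maximum of the index
    vector. ASSUMPTION: index vector entries are dot products phi(x_i) . w."""
    z = [sum(f * p for f, p in zip(row, w)) for row in feature_matrix]
    return candidates[z.index(max(z))]


candidates = ["张伟", "张伟伟", "快递"]
feature_matrix = [[3, 1, 5], [4, 1, 8], [5, 0, 2]]  # hypothetical features
w = [-1.0, 2.0, 0.1]                                # hypothetical trained parameters
real_name = pick_real_name(candidates, feature_matrix, w)
```

With these hypothetical parameters the first candidate, "张伟", receives the highest score and is returned.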

8. A training apparatus for a Chinese name extraction model, the apparatus comprising:

a first feature vector obtaining unit, configured to obtain a feature vector corresponding to each candidate name in a plurality of candidate names and to form a feature matrix from the plurality of feature vectors; wherein the feature vectors corresponding to different candidate names are different, and the plurality of candidate names include a real name that needs to be determined from among the candidate names;

a first obtaining unit, configured to input the feature matrix into a name extraction model to obtain an index vector, where the name extraction model is [formula not reproduced in the source text]; wherein X represents the plurality of candidate names, [symbol not reproduced] represents the feature matrix, [symbol not reproduced] represents a parameter column vector, [symbol not reproduced] represents the feature vector corresponding to the i-th candidate name, and z represents the index vector; wherein i is an integer greater than or equal to 1;

a first determining unit, configured to determine whether a first distance value between a first feature vector, corresponding to the position of the maximum value in the index vector, and a second feature vector, corresponding to the real name, is equal to a preset value;

an updating unit, configured to update the parameter column vector in the name extraction model when the first distance value is determined not to be equal to the preset value; wherein the dimensions of the parameter column vector and the dimensions of the feature vector are the same;

a second obtaining unit, configured to input the feature matrix into the updated name extraction model to obtain a new index vector;

a second determining unit, configured to determine whether a second distance value between a third feature vector, corresponding to the position of the maximum value in the new index vector, and the second feature vector is equal to the preset value;

and a model determining unit, configured to take the updated name extraction model as a trained name extraction model when the second distance value is determined to be equal to the preset value.

9. An electronic device comprising a memory and a processor, the memory having stored therein computer program instructions that, when read and executed by the processor, perform the method of any of claims 1-7.

10. A storage medium having stored thereon computer program instructions which, when read and executed by a computer, perform the method of any one of claims 1-7.