CN108255806B - Name recognition method and device - Google Patents

Name recognition method and device

Info

Publication number
CN108255806B
Authority
CN
China
Prior art keywords
name
ambiguous
corpus
credible
names
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711414492.9A
Other languages
Chinese (zh)
Other versions
CN108255806A (en
Inventor
刘兵
苗艳军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201711414492.9A
Publication of CN108255806A
Application granted
Publication of CN108255806B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries

Abstract

The invention provides a name recognition method and a name recognition device that automatically recognize the names contained in a name corpus to be recognized by using a name recognition model. Because the name knowledge base covers names comprehensively, a model trained on the name knowledge base and the video knowledge base can identify ambiguous names with a certain accuracy while still recognizing names in general, which improves the overall effect of name recognition.

Description

Name recognition method and device
Technical Field
The invention relates to the technical field of internet search, and in particular to a name recognition method and device.
Background
Named Entity Recognition (NER), also known as "proper name recognition", refers to recognizing entity names of specific significance in text, such as person names, place names, and organization names. In the video industry, for example, video titles and entertainment news contain a large number of person names, and how well those names are recognized greatly influences the promotion of video application products.
Currently, name recognition is mainly achieved by constructing a general-purpose name recognition model, such as a classification model or a conditional random field model. However, ambiguous name mentions, such as "dawn", are likely to appear in the text to be recognized, and the general-purpose name recognition model has a high error rate on such mentions, which degrades video application products such as video search and video push.
Disclosure of Invention
In view of the above, the present invention provides a name recognition method and device to solve the problem that existing general-purpose name recognition models have a high error rate on ambiguous names, which degrades video application products such as video search and video push. The technical scheme is as follows:
A name recognition method, comprising:
receiving a name corpus to be recognized;
calling a name recognition model, wherein the name recognition model is obtained by training a preset machine learning classification model in advance by using a name knowledge base and a video knowledge base;
and recognizing the names contained in the name corpus to be recognized by using the name recognition model.
Preferably, the process of training the preset machine learning classification model by using the name knowledge base and the video knowledge base in advance to obtain the name recognition model comprises the following steps:
extracting an ambiguous name list and a credible name list from a name knowledge base;
extracting ambiguous name corpus from a video knowledge base based on the ambiguous name list, and extracting ambiguous name features of the ambiguous name corpus;
extracting a credible name corpus from the video knowledge base based on the credible name list, and extracting credible name features of the credible name corpus;
and training a preset machine learning classification model by using the ambiguous name features and the credible name features to obtain a name recognition model.
Preferably, the extracting of the ambiguous name list and the credible name list from the name knowledge base includes:
acquiring ambiguous names from the name knowledge base, and generating an ambiguous name list containing the ambiguous names;
acquiring the non-ambiguous names, other than the ambiguous names, from the name knowledge base;
calling a search log, and selecting credible names from the non-ambiguous names by using the search log;
and generating a credible name list containing the credible names.
Preferably, the extracting the ambiguous person name corpus from the video knowledge base based on the ambiguous person name list includes:
for each video text in a video knowledge base, acquiring a title of the video text;
performing word segmentation on the titles to obtain a plurality of title phrases;
judging whether the plurality of title phrases contain at least one ambiguous name in the ambiguous name list;
if yes, determining the video text as an ambiguous name text;
and generating an ambiguous name corpus comprising all the ambiguous name texts.
Preferably, before the video text is determined as an ambiguous name text, the method further includes:
respectively calculating text similarity distances between the video text and the at least one ambiguous name contained in the video text;
judging whether any of the calculated text similarity distances is greater than a distance threshold;
and if so, executing the step of determining the video text as an ambiguous name text.
Preferably, the extracting the ambiguous person name feature of the ambiguous person name corpus includes:
extracting features of the ambiguous name corpus to obtain a first feature list, wherein the first feature list comprises a plurality of features;
performing word segmentation on the ambiguous name corpus to obtain a plurality of first corpus word groups;
and adding labels, respectively, to the features in the first feature list that are the same as the first corpus phrases and to the ambiguous names used to extract the ambiguous name corpus, and converting the labeled features and ambiguous names into feature sequences.
Preferably, the extracting of the credible name features of the credible name corpus includes:
performing feature extraction on the credible name corpus to obtain a second feature list, wherein the second feature list comprises a plurality of features;
performing word segmentation on the credible name corpus to obtain a plurality of second corpus word groups;
and adding labels, respectively, to the features in the second feature list that are the same as the second corpus phrases and to the credible names used to extract the credible name corpus, and converting the labeled features and credible names into feature sequences.
A name recognition apparatus, comprising a corpus receiving module, a model calling module and a name recognition module, wherein the model calling module comprises a model generation unit;
the corpus receiving module is used for receiving a name corpus to be recognized;
the model generation unit is used for training a preset machine learning classification model in advance by using a name knowledge base and a video knowledge base to obtain a name recognition model;
the model calling module is used for calling the name recognition model;
and the name recognition module is used for recognizing the names contained in the name corpus to be recognized by using the name recognition model.
Preferably, the model generating unit is specifically configured to:
extracting an ambiguous name list and a credible name list from a name knowledge base; extracting ambiguous name corpus from a video knowledge base based on the ambiguous name list, and extracting ambiguous name features of the ambiguous name corpus; extracting a credible name corpus from the video knowledge base based on the credible name list, and extracting credible name features of the credible name corpus; and training a preset machine learning classification model by using the ambiguous name features and the credible name features to obtain a name recognition model.
Preferably, the model generation unit, when extracting the ambiguous name list and the credible name list from the name knowledge base, is specifically configured to:
acquire ambiguous names from the name knowledge base, and generate an ambiguous name list containing the ambiguous names; acquire the non-ambiguous names, other than the ambiguous names, from the name knowledge base; call a search log, and select credible names from the non-ambiguous names by using the search log; and generate a credible name list containing the credible names.
The name recognition method and the name recognition device provided by the invention can automatically recognize the names contained in the name corpus to be recognized by using the name recognition model. Because the name knowledge base covers names comprehensively, a model trained on the name knowledge base and the video knowledge base can identify ambiguous names with a certain accuracy while still recognizing names in general, which improves the overall effect of name recognition.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for generating a name recognition model according to an embodiment of the present invention;
Fig. 2 is a flowchart of step S20 in the name recognition model generation method according to an embodiment of the present invention;
Fig. 3 is a flowchart of "extracting an ambiguous name list and a credible name list from a name knowledge base" in step S201 of the name recognition model generation method according to an embodiment of the present invention;
Fig. 4 is a flowchart of "extracting an ambiguous name corpus from a video knowledge base based on the ambiguous name list" in step S202 of the name recognition model generation method according to an embodiment of the present invention;
Fig. 5 is a flowchart of "extracting ambiguous name features of the ambiguous name corpus" in step S202 of the name recognition model generation method according to an embodiment of the present invention;
Fig. 6 is a flowchart of "extracting credible name features of the credible name corpus" in step S203 of the name recognition model generation method according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a name recognition model generation apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the invention provides a name recognition method. A flowchart of the method is shown in Fig. 1, and the method comprises the following steps:
S10, receiving a name corpus to be recognized;
S20, calling a name recognition model, wherein the name recognition model is obtained by training a preset machine learning classification model in advance by using a name knowledge base and a video knowledge base;
In the embodiment of the invention, a name knowledge base and a video knowledge base are provided in the background database of a video search engine company.
The name knowledge base holds a large amount of name data. A name may be a credible name or an ambiguous name. A credible name is a conventional name: whenever it is mentioned, the mention can be determined to be a person name. An ambiguous name is an unconventional name: when it is mentioned, it cannot be determined whether the mention refers to a person or to something else. In addition, the name knowledge base stores structured field information for each name, including native place, birthday, alias, life story, related works, related persons and the like.
The video knowledge base includes video title text and video introduction text.
First, ambiguous names and credible names are extracted from the name knowledge base; then, the ambiguous name corpus is extracted from the video knowledge base by using the ambiguous names, and the credible name corpus is extracted from the video knowledge base by using the credible names; finally, a preset machine learning classification model is trained with the ambiguous name corpus and the credible name corpus to obtain the name recognition model.
In a specific implementation, the process in step S20 of "training a preset machine learning classification model in advance by using a name knowledge base and a video knowledge base to obtain the name recognition model" is shown in Fig. 2 and comprises the following steps:
S201, extracting an ambiguous name list and a credible name list from the name knowledge base;
In this embodiment, since credible names in the name knowledge base far outnumber ambiguous names, the ambiguous name list may be obtained from the name knowledge base first, and the credible name list may then be obtained from the remaining non-ambiguous names.
In a specific implementation process, in the process of "extracting an ambiguous name list and a trusted name list from a name knowledge base" in step S201, the following steps may be specifically adopted, and a flowchart of the method is shown in fig. 3:
S1001, acquiring ambiguous names from the name knowledge base and generating an ambiguous name list containing the ambiguous names;
In the process of executing step S1001, the names in the name knowledge base may be compared with a word segmentation dictionary; a name that coincides with a word tagged with a non-name part of speech in the dictionary is an ambiguous name. When a name does not exist in the word segmentation dictionary, prompt information can be generated so that a user determines whether the name is ambiguous, and the ambiguous name is added to the word segmentation dictionary once its part of speech has been calibrated, which makes the dictionary more comprehensive.
It should be noted that the part of speech of each phrase in the word segmentation dictionary is labeled in advance as either a name part of speech or a non-name part of speech; that is, each phrase has exactly one part of speech.
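The dictionary comparison in step S1001 can be sketched as follows. This is a minimal illustration, not the patented implementation: the tag constants, the function name `split_ambiguous_names`, and the placeholder names "Zhang San" and "Li Si" are assumptions made for the example, and the segmentation dictionary is modeled as a plain mapping from phrase to part of speech.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical tag values; the text only distinguishes a name part of
# speech from a non-name part of speech.
PERSON = "person_name"
NON_PERSON = "non_person_name"


def split_ambiguous_names(
    knowledge_base_names: Iterable[str],
    segmentation_dict: Dict[str, str],
) -> Tuple[List[str], List[str], List[str]]:
    """Return (ambiguous, non_ambiguous, needs_review) name lists."""
    ambiguous: List[str] = []
    non_ambiguous: List[str] = []
    needs_review: List[str] = []
    for name in knowledge_base_names:
        pos = segmentation_dict.get(name)
        if pos is None:
            # Not in the dictionary: leave the decision to a human reviewer.
            needs_review.append(name)
        elif pos == NON_PERSON:
            # The name coincides with an ordinary word, so it is ambiguous.
            ambiguous.append(name)
        else:
            non_ambiguous.append(name)
    return ambiguous, non_ambiguous, needs_review


if __name__ == "__main__":
    names = ["dawn", "Zhang San", "Li Si"]
    dictionary = {"dawn": NON_PERSON, "Zhang San": PERSON}
    print(split_ambiguous_names(names, dictionary))
```

Names routed to the review list correspond to the prompt described above; once a reviewer has calibrated the part of speech, the entry can be written back into the dictionary.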
S1002, acquiring the non-ambiguous names, other than the ambiguous names, from the name knowledge base;
S1003, calling a search log, and selecting credible names from the non-ambiguous names by using the search log;
In the process of executing step S1003, the search log is used to count the number of times each non-ambiguous name has been searched; a non-ambiguous name whose search count is higher than a threshold is regarded as a popular name, and popular names are selected as credible names.
Alternatively, the names can be sorted from high to low by search count, and a preset number of top-ranked names are selected as credible names.
S1004, generating a credible name list containing the credible names.
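Both selection strategies for step S1003 (a search-count threshold, or the top-ranked names) can be sketched in a few lines. The sketch assumes the search log is simply a sequence of query strings and that a query counts toward a name only when it equals the name exactly; the function name and parameters are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional


def select_credible_names(
    non_ambiguous_names: Iterable[str],
    search_log: Iterable[str],
    min_searches: int = 0,
    top_n: Optional[int] = None,
) -> List[str]:
    """Pick credible names by search popularity.

    If top_n is given, return the top_n most searched names; otherwise
    return every name searched more than min_searches times.
    """
    candidates = set(non_ambiguous_names)
    counts = Counter(query for query in search_log if query in candidates)
    ranked = [name for name, _ in counts.most_common()]
    if top_n is not None:
        return ranked[:top_n]
    return [name for name in ranked if counts[name] > min_searches]


if __name__ == "__main__":
    log = ["A", "A", "A", "B", "B", "C", "dawn"]
    print(select_credible_names(["A", "B", "C"], log, min_searches=1))
    print(select_credible_names(["A", "B", "C"], log, top_n=1))
```

A real search log would likely need query normalization and partial matching; exact matching is used here only to keep the counting rule visible.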
S202, extracting an ambiguous name corpus from the video knowledge base based on the ambiguous name list, and extracting ambiguous name features of the ambiguous name corpus;
In the process of executing step S202, the texts in the video knowledge base that mention ambiguous names form the ambiguous name corpus; ambiguous name features are then extracted from the ambiguous name corpus according to preset features.
In a specific implementation process, in the process of "extracting ambiguous-person-name corpus from the video knowledge base based on the ambiguous-person-name list" in step S202, the following steps may be specifically adopted, and a flowchart of the method is shown in fig. 4:
S1005, for each video text in the video knowledge base, acquiring the title of the video text;
In this embodiment, since video title text is more likely to contain person names, the video title text is selected as the source of the ambiguous name corpus.
S1006, performing word segmentation on the title to obtain a plurality of title phrases;
In the process of executing step S1006, assume that a certain news title is segmented into title phrase a, title phrase b, title phrase c, and title phrase d.
S1007, determining whether the plurality of title phrases include at least one ambiguous name in the ambiguous name list;
S1008, if yes, determining the video text as an ambiguous name text;
In practical applications, in order to select more effective ambiguous name texts from the video knowledge base, a text similarity distance between the video text and each of the at least one ambiguous name contained in the video text can be calculated; it is then judged whether any of the calculated text similarity distances is greater than a distance threshold; if so, the step of determining the video text as an ambiguous name text is executed.
In the process of calculating the text similarity distance, the structured field information that the name knowledge base stores for each name can be used as knowledge features. For example, title phrase a is ambiguous name A, and the knowledge features of ambiguous name A include "person name B", "person name C", "program 1", "program 2", and "program 3". Calculating the similarity distance between the video text and the ambiguous name can therefore be converted into calculating the similarity distance between the video text and the knowledge features, so any text similarity calculation method can be used.
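The text does not name a particular similarity measure. As one possible choice, the sketch below scores a title against the knowledge features of an ambiguous name with Jaccard similarity over phrase sets; the function name and the choice of Jaccard are assumptions. A larger score means a closer match, which is consistent with keeping texts whose value is greater than the threshold.

```python
from typing import Iterable


def knowledge_similarity(
    title_phrases: Iterable[str],
    knowledge_features: Iterable[str],
) -> float:
    """Jaccard similarity between title phrases and knowledge features."""
    phrases = set(title_phrases)
    features = set(knowledge_features)
    if not phrases or not features:
        return 0.0
    return len(phrases & features) / len(phrases | features)


if __name__ == "__main__":
    title = ["ambiguous name A", "person name B", "program 1", "tonight"]
    knowledge = ["person name B", "person name C",
                 "program 1", "program 2", "program 3"]
    # Two shared items out of seven distinct items in total.
    print(knowledge_similarity(title, knowledge))
```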
S1009, generating an ambiguous name corpus including all the ambiguous name texts.
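Steps S1005 to S1009, including the optional similarity filter, might be put together roughly as follows. Everything here is illustrative: whitespace splitting stands in for a real word segmenter (Chinese titles would need a proper one), and `overlap_score` is an assumed similarity measure, not one specified in the text.

```python
from typing import Callable, Dict, Iterable, List, Set


def whitespace_segment(title: str) -> List[str]:
    """Stand-in segmenter for the example only."""
    return title.split()


def overlap_score(context: Set[str], features: Set[str]) -> float:
    """Assumed similarity: share of knowledge features found in the title."""
    return len(context & features) / len(features) if features else 0.0


def build_ambiguous_corpus(
    titles: Iterable[str],
    ambiguous_names: Set[str],
    knowledge: Dict[str, Set[str]],
    threshold: float,
    segment: Callable[[str], List[str]] = whitespace_segment,
) -> List[str]:
    corpus: List[str] = []
    for title in titles:                                      # S1005
        phrases = segment(title)                              # S1006
        hits = [p for p in phrases if p in ambiguous_names]   # S1007
        # Keep the title only if some ambiguous name in it scores above
        # the threshold against its own knowledge features.
        if any(
            overlap_score(set(phrases) - {name},
                          knowledge.get(name, set())) > threshold
            for name in hits
        ):
            corpus.append(title)                              # S1008
    return corpus                                             # S1009


if __name__ == "__main__":
    titles = [
        "dawn concert program1 live",
        "beautiful dawn over the sea",
        "weather report",
    ]
    knowledge = {"dawn": {"program1", "nameB"}}
    print(build_ambiguous_corpus(titles, {"dawn"}, knowledge, threshold=0.2))
```

In this example the second title contains "dawn" but shares nothing with the knowledge features, so it is filtered out; the third title has no ambiguous name at all.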
In a specific implementation process, the process of "extracting ambiguous person name features of ambiguous person name corpus" in step S202 may specifically adopt the following steps, and a flowchart of the method is shown in fig. 5:
S1010, performing feature extraction on the ambiguous name corpus to obtain a first feature list, wherein the first feature list comprises a plurality of features;
During feature selection, the inventors found through data observation and practical analysis that the context in which a name is mentioned has two strongly related kinds of features: first, the word or part-of-speech features in the left and right windows, for example "concert"; second, the strongly related knowledge features of the mentioned name, for example the strongly related knowledge feature "person name B" of ambiguous name A.
Feature extraction is performed on the ambiguous name corpus according to the window features and the context knowledge features. For example, feature extraction on the news title above yields the feature list shown in Table 1. All features are numbered, so each feature in the list is represented by an integer serial number; CONTEXT_KNOWLEDGE_FEA is the context knowledge feature.
Feature number    Feature
1                 CONTEXT_KNOWLEDGE_FEA
2                 T01/title phrase b
3                 T02/title phrase c
……                ……
TABLE 1
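A feature list like Table 1 could be produced along the following lines. The text does not define the templates T01 and T02; this sketch assumes they denote the phrases one and two positions to the right of the name, and adds two hypothetical left-window templates (T03, T04) purely for illustration. Function names are also assumptions.

```python
from typing import Dict, Iterable, List, Set

# Assumed mapping from template id to window offset relative to the name.
# Only T01 and T02 appear in Table 1; the offsets, and T03/T04, are guesses.
WINDOW_TEMPLATES = {"T01": 1, "T02": 2, "T03": -1, "T04": -2}
CONTEXT_KNOWLEDGE_FEA = "CONTEXT_KNOWLEDGE_FEA"


def extract_features(
    phrases: List[str], name: str, knowledge_features: Set[str]
) -> List[str]:
    """Window and context-knowledge features for one name mention."""
    idx = phrases.index(name)
    feats: List[str] = []
    # Context knowledge feature: some other phrase in the title is a
    # knowledge feature of the name.
    if any(p in knowledge_features for i, p in enumerate(phrases) if i != idx):
        feats.append(CONTEXT_KNOWLEDGE_FEA)
    for template, offset in WINDOW_TEMPLATES.items():
        pos = idx + offset
        if 0 <= pos < len(phrases):
            feats.append(f"{template}/{phrases[pos]}")
    return feats


def build_feature_list(samples: Iterable[List[str]]) -> Dict[str, int]:
    """Number every distinct feature with an integer, starting at 1."""
    feature_ids: Dict[str, int] = {}
    for feats in samples:
        for feat in feats:
            feature_ids.setdefault(feat, len(feature_ids) + 1)
    return feature_ids


if __name__ == "__main__":
    title = ["title phrase a", "title phrase b",
             "title phrase c", "title phrase d"]
    feats = extract_features(title, "title phrase a", {"title phrase d"})
    print(build_feature_list([feats]))
```

With title phrase a as the ambiguous name, the example reproduces the three rows of Table 1; the left-window templates produce nothing because the name is the first phrase.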
S1011, performing word segmentation on the ambiguous name corpus to obtain a plurality of first corpus phrases;
S1012, adding labels, respectively, to the features in the first feature list that are the same as the first corpus phrases and to the ambiguous names used to extract the ambiguous name corpus, and converting the labeled features and ambiguous names into feature sequences;
In the process of executing step S1012, a negative label is added to each feature in the first feature list that is the same as a first corpus phrase, and a positive label is added to the ambiguous name used to extract the ambiguous name corpus. For example, the conversion results for ambiguous name A with a positive label and the feature title phrase b with a negative label are as follows:
Label    Feature sequence
1        1:1 2:1 3:1 4:1……
2        5:1 6:1 7:1……
……       ……
TABLE 2
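The rows of Table 2 resemble the sparse "label index:value" text format used by tools such as LIBSVM. A minimal conversion sketch, assuming every active feature takes the value 1 and that features missing from the feature list are skipped (the function name is hypothetical):

```python
from typing import Dict, Iterable


def to_feature_sequence(
    label: int, features: Iterable[str], feature_ids: Dict[str, int]
) -> str:
    """Render one sample as 'label idx:1 idx:1 ...' with sorted indices."""
    indices = sorted({feature_ids[f] for f in features if f in feature_ids})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])


if __name__ == "__main__":
    ids = {"CONTEXT_KNOWLEDGE_FEA": 1,
           "T01/title phrase b": 2,
           "T02/title phrase c": 3}
    print(to_feature_sequence(1, list(ids), ids))
```

The label values 1 and 2 are taken from Table 2 as given; which of them denotes the positive label is not stated there.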
S203, extracting a credible name corpus from the video knowledge base based on the credible name list, and extracting credible name features of the credible name corpus;
In this embodiment, since a credible name is unambiguous, a mention of it in any context can be determined to be a person name, so all video texts containing credible names can be extracted from the video knowledge base as the credible name corpus. Because video title text is more likely to contain person names, the video title text is selected as the source, and video title texts containing credible names serve as the credible name corpus.
In the specific implementation process, the process of "extracting the credible name features of the credible name corpus" in step S203 may specifically adopt the following steps, and a flowchart of the method is shown in fig. 6:
S1013, performing feature extraction on the credible name corpus to obtain a second feature list, wherein the second feature list comprises a plurality of features;
During feature selection, the inventors found through data observation and practical analysis that the context in which a name is mentioned has two strongly related kinds of features: first, the word or part-of-speech features in the left and right windows, for example "concert"; second, the strongly related knowledge features of the mentioned name.
Feature extraction is performed on the credible name corpus according to the window features and the context knowledge features.
S1014, performing word segmentation on the credible name corpus to obtain a plurality of second corpus phrases;
S1015, adding labels to the features in the second feature list that are the same as the second corpus phrases, and converting the labeled features into feature sequences.
S204, training a preset machine learning classification model by using the ambiguous name features and the credible name features to obtain the name recognition model.
In this embodiment, the preset machine learning classification model includes, but is not limited to, a support vector machine model, a logistic regression model, and the like, and may be selected according to actual needs, and this embodiment is not particularly limited.
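As an illustration of step S204, the sketch below trains a small logistic regression classifier on sparse binary feature sequences using plain stochastic gradient descent. It is a toy stand-in written without external libraries, not the model used in the patent; the 0/1 label encoding, the hyperparameters, and the function names are all assumptions.

```python
import math
from typing import Dict, List, Tuple

# One sample: (label, active feature numbers); label 1 = name, 0 = not a name.
Sample = Tuple[int, List[int]]


def train_logistic_regression(
    samples: List[Sample], epochs: int = 200, lr: float = 0.5
) -> Tuple[Dict[int, float], float]:
    """Fit weights for binary features with stochastic gradient descent."""
    weights: Dict[int, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for label, active in samples:
            z = bias + sum(weights.get(i, 0.0) for i in active)
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label  # gradient of the log loss with respect to z
            bias -= lr * grad
            for i in active:
                weights[i] = weights.get(i, 0.0) - lr * grad
    return weights, bias


def predict_is_name(
    weights: Dict[int, float], bias: float, active: List[int]
) -> bool:
    """Classify a mention as a name when the linear score is positive."""
    return bias + sum(weights.get(i, 0.0) for i in active) > 0.0


if __name__ == "__main__":
    data = [(1, [1, 2]), (1, [1, 3]), (0, [4, 5]), (0, [4, 6])]
    w, b = train_logistic_regression(data)
    print(predict_is_name(w, b, [1, 2]), predict_is_name(w, b, [4, 5]))
```

In practice an established library implementation of logistic regression or a support vector machine would normally replace this loop; the point here is only how the feature sequences feed a binary classifier.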
S30, recognizing the names contained in the name corpus to be recognized by using the name recognition model.
The above steps S201 to S204 are only one preferred implementation of the process of "training a preset machine learning classification model in advance by using a name knowledge base and a video knowledge base to obtain the name recognition model" in step S20 of the embodiment of the present application; the specific implementation of this process may be set as needed and is not limited herein.
The above steps S1001 to S1004 are only one preferred implementation of the process of "extracting an ambiguous name list and a credible name list from the name knowledge base" in step S201 of the embodiment of the present application; the specific implementation of this process may be set as needed and is not limited herein.
The above steps S1005 to S1009 are only one preferred implementation of the process of "extracting an ambiguous name corpus from the video knowledge base based on the ambiguous name list" in step S202 of the embodiment of the present application; the specific implementation of this process may be set as needed and is not limited herein.
The above steps S1010 to S1012 are only one preferred implementation of the process of "extracting ambiguous name features of the ambiguous name corpus" in step S202 of the embodiment of the present application; the specific implementation of this process may be set as needed and is not limited herein.
The above steps S1013 to S1015 are only one preferred implementation of the process of "extracting credible name features of the credible name corpus" in step S203 of the embodiment of the present application; the specific implementation of this process may be set as needed and is not limited herein.
The name recognition method provided by the embodiment of the invention can automatically recognize the names contained in the name corpus to be recognized by using the name recognition model. Because the name knowledge base covers names comprehensively, a model trained on the name knowledge base and the video knowledge base can identify ambiguous names with a certain accuracy while still recognizing names in general, which improves the overall effect of name recognition.
Based on the name recognition method provided in the foregoing embodiment, an apparatus for performing the name recognition method in the embodiment of the present invention is shown in Fig. 7 and includes a corpus receiving module 10, a model calling module 20 and a name recognition module 30, wherein the model calling module 20 comprises a model generation unit 201;
the corpus receiving module 10 is used for receiving a name corpus to be recognized;
the model generation unit 201 is used for training a preset machine learning classification model in advance by using a name knowledge base and a video knowledge base to obtain a name recognition model;
the model calling module 20 is used for calling the name recognition model;
and the name recognition module 30 is used for recognizing the names contained in the name corpus to be recognized by using the name recognition model.
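The module layout can be pictured as a handful of small classes. The sketch below only mirrors the structure described above; the class and method names, and the stub trainer that stands in for the real training procedure, are hypothetical.

```python
from typing import Callable, Iterable, List, Optional

# A model maps one text to the list of names found in it.
Model = Callable[[str], List[str]]


class ModelGenerationUnit:
    """Unit 201: trains the name recognition model in advance."""

    def __init__(self, trainer: Callable[[], Model]) -> None:
        self._trainer = trainer

    def generate(self) -> Model:
        return self._trainer()


class ModelCallingModule:
    """Module 20: owns the generation unit and hands out the model."""

    def __init__(self, unit: ModelGenerationUnit) -> None:
        self._unit = unit
        self._model: Optional[Model] = None

    def call(self) -> Model:
        if self._model is None:  # train once, then reuse
            self._model = self._unit.generate()
        return self._model


class CorpusReceivingModule:
    """Module 10: receives the name corpus to be recognized."""

    def receive(self, corpus: Iterable[str]) -> List[str]:
        return list(corpus)


class NameRecognitionModule:
    """Module 30: applies the model to every text in the corpus."""

    def recognize(self, model: Model, corpus: List[str]) -> List[List[str]]:
        return [model(text) for text in corpus]


def stub_trainer() -> Model:
    # Placeholder for real training: a lookup against a fixed name set.
    known = {"dawn"}
    return lambda text: [w for w in text.split() if w in known]


if __name__ == "__main__":
    corpus = CorpusReceivingModule().receive(["dawn concert", "weather report"])
    model = ModelCallingModule(ModelGenerationUnit(stub_trainer)).call()
    print(NameRecognitionModule().recognize(model, corpus))
```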
In some other embodiments, the model generating unit 201 is specifically configured to:
extracting an ambiguous name list and a credible name list from a name knowledge base; extracting ambiguous name corpus from a video knowledge base based on the ambiguous name list, and extracting ambiguous name features of the ambiguous name corpus; extracting a credible name corpus from a video knowledge base based on a credible name list, and extracting credible name features of the credible name corpus; and training a preset machine learning classification model by using the ambiguous name features and the credible name features to obtain a name recognition model.
In some other embodiments, the model generating unit 201 for extracting the ambiguous name list and the trusted name list from the name knowledge base is specifically configured to:
acquiring ambiguous names from a name knowledge base, and generating an ambiguous name list containing the ambiguous names; acquiring non-ambiguous names except ambiguous names from a name knowledge base; calling a search log, and selecting a credible name from the non-ambiguous names by using the search log; and generating a trusted person name list containing the trusted person names.
The name recognition device provided by the embodiment of the invention can automatically recognize the names contained in the name corpus to be recognized by using the name recognition model. Because the names contained in the name knowledge base are comprehensive, a model trained with the name knowledge base and the video knowledge base can recognize ambiguous names with a certain degree of accuracy while still recognizing names in general, which improves the overall effect of name recognition.
The name recognition method and device provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for the relevant details, refer to the description of the method.
It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A person name recognition method is characterized by comprising the following steps:
receiving a name corpus to be recognized;
calling a name recognition model, wherein the name recognition model is obtained by utilizing a name knowledge base and a video knowledge base to train a preset machine learning classification model in advance, specifically, an ambiguous name list and a credible name list are respectively extracted from the name knowledge base, an ambiguous name corpus is extracted from the video knowledge base by utilizing the ambiguous name list, ambiguous name features of the ambiguous name corpus are extracted, credible name corpora are extracted from the video knowledge base by utilizing the credible name list, credible name features of the credible name corpus are extracted, and the preset machine learning classification model is trained by utilizing the ambiguous name features of the ambiguous name corpus and the credible name features of the credible name corpus to obtain the name recognition model;
identifying the name contained in the name corpus to be identified by using the name identification model;
wherein the extracting of the ambiguous name features of the ambiguous name corpus comprises: performing feature extraction on the ambiguous name corpus to obtain a first feature list, wherein the first feature list comprises a plurality of features; performing word segmentation on the ambiguous name corpus to obtain a plurality of first corpus phrases; adding labels respectively to the features in the first feature list that are the same as the first corpus phrases and to the ambiguous names of the ambiguous name corpus, and converting the labeled features and ambiguous names into feature sequences;
wherein the extracting of the credible name features of the credible name corpus comprises: performing feature extraction on the credible name corpus to obtain a second feature list, wherein the second feature list comprises a plurality of features; performing word segmentation on the credible name corpus to obtain a plurality of second corpus phrases; and adding labels respectively to the features in the second feature list that are the same as the second corpus phrases and to the credible names of the credible name corpus, and converting the labeled features and credible names into feature sequences.
2. The method of claim 1, wherein the extracting of the ambiguous name list and the credible name list from the name knowledge base, respectively, comprises: acquiring ambiguous names from the name knowledge base, and generating an ambiguous name list containing the ambiguous names; acquiring non-ambiguous names, other than the ambiguous names, from the name knowledge base; calling a search log, and selecting credible names from the non-ambiguous names by using the search log; and generating a credible name list containing the credible names.
3. The method of claim 1, wherein the extracting of the ambiguous name corpus from the video knowledge base using the ambiguous name list comprises:
for each video text in a video knowledge base, acquiring a title of the video text;
performing word segmentation on the titles to obtain a plurality of title phrases;
judging whether the plurality of title phrases contain at least one ambiguous name in the ambiguous name list;
if yes, determining the video text as an ambiguous name text;
and generating an ambiguous name corpus comprising all the ambiguous name texts.
4. The method of claim 3, wherein, before the determining of the video text as an ambiguous name text, the method further comprises:
respectively calculating text similarity distances between the video text and the at least one ambiguous name contained in the video text;
judging whether any of the calculated text similarity distances is larger than a distance threshold;
and if so, executing the step of determining the video text as the ambiguous name text.
5. A person name recognition apparatus, comprising: a corpus receiving module, a model calling module, and a name recognition module, wherein the model calling module comprises a model generation unit;
the corpus receiving module is used for receiving a name corpus to be recognized;
the model generation unit is used for training a preset machine learning classification model by utilizing a name knowledge base and a video knowledge base in advance to obtain a name recognition model, specifically, respectively extracting an ambiguous name list and a credible name list from the name knowledge base, extracting ambiguous name corpora from the video knowledge base by utilizing the ambiguous name list, extracting ambiguous name features of the ambiguous name corpora, extracting credible name corpora from the video knowledge base by utilizing the credible name list, extracting credible name features of the credible name corpora, and training the preset machine learning classification model by utilizing the ambiguous name features of the ambiguous name corpora and the credible name features of the credible name corpora to obtain the name recognition model;
the model calling module is used for calling the name recognition model;
the name recognition module is used for recognizing names contained in the name corpus to be recognized by using the name recognition model;
wherein the model generation unit extracting the ambiguous name features of the ambiguous name corpus comprises: the model generation unit performs feature extraction on the ambiguous name corpus to obtain a first feature list, wherein the first feature list comprises a plurality of features; performs word segmentation on the ambiguous name corpus to obtain a plurality of first corpus phrases; and adds labels respectively to the features in the first feature list that are the same as the first corpus phrases and to the ambiguous names of the ambiguous name corpus, and converts the labeled features and ambiguous names into feature sequences;
wherein the model generation unit extracting the credible name features of the credible name corpus comprises: the model generation unit performs feature extraction on the credible name corpus to obtain a second feature list, wherein the second feature list comprises a plurality of features; performs word segmentation on the credible name corpus to obtain a plurality of second corpus phrases; and adds labels respectively to the features in the second feature list that are the same as the second corpus phrases and to the credible names of the credible name corpus, and converts the labeled features and credible names into feature sequences.
6. The apparatus according to claim 5, wherein the model generation unit, when extracting the ambiguous name list and the credible name list from the name knowledge base, respectively, is specifically configured to:
acquire ambiguous names from the name knowledge base, and generate an ambiguous name list containing the ambiguous names; acquire non-ambiguous names, other than the ambiguous names, from the name knowledge base; call a search log, and select credible names from the non-ambiguous names by using the search log; and generate a credible name list containing the credible names.
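As a rough illustration of the corpus extraction in claims 3 and 4, the following Python sketch filters video texts by whether their segmented titles contain an ambiguous name, and then applies the distance-threshold check. The claims do not define the text similarity distance, so `distance` is passed in as a parameter. The dictionary layout of a video text, the whitespace tokenizer, and the toy occurrence-count distance are assumptions made purely for demonstration.

```python
from typing import Callable, Dict, Iterable, List, Set

VideoText = Dict[str, str]  # assumed layout: {"title": ..., "text": ...}


def extract_ambiguous_corpus(
    video_texts: Iterable[VideoText],
    ambiguous_names: Set[str],
    tokenize: Callable[[str], List[str]],
    distance: Callable[[str, str], float],
    threshold: float,
) -> List[VideoText]:
    """Collect video texts whose titles contain an ambiguous name (claim 3)
    and for which some name's distance exceeds the threshold (claim 4)."""
    corpus: List[VideoText] = []
    for video in video_texts:
        title_phrases = tokenize(video["title"])
        hits = [p for p in title_phrases if p in ambiguous_names]
        if not hits:
            continue  # title contains no ambiguous name
        if any(distance(video["text"], name) > threshold for name in hits):
            corpus.append(video)
    return corpus


def toy_distance(text: str, name: str) -> float:
    # Hypothetical measure: how often the name occurs in the text.
    return float(text.split().count(name))


videos = [
    {"title": "White on tour", "text": "White sings and White plays"},
    {"title": "White paint tips", "text": "paint the wall White"},
    {"title": "cooking show", "text": "White rice recipe"},
]
ambiguous_corpus = extract_ambiguous_corpus(
    videos, {"White"}, str.split, toy_distance, threshold=1.0
)
```

In this toy run only the first video passes both checks. The second contains the name in its title but fails the threshold, and the third has no ambiguous name in its title.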
CN201711414492.9A 2017-12-22 2017-12-22 Name recognition method and device Active CN108255806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711414492.9A CN108255806B (en) 2017-12-22 2017-12-22 Name recognition method and device

Publications (2)

Publication Number Publication Date
CN108255806A CN108255806A (en) 2018-07-06
CN108255806B true CN108255806B (en) 2021-12-17

Family

ID=62722815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711414492.9A Active CN108255806B (en) 2017-12-22 2017-12-22 Name recognition method and device

Country Status (1)

Country Link
CN (1) CN108255806B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401083B (en) * 2019-01-02 2023-05-02 阿里巴巴集团控股有限公司 Name identification method and device, storage medium and processor

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101454750A (en) * 2006-03-31 2009-06-10 谷歌公司 Disambiguation of named entities
CN101950284A (en) * 2010-09-27 2011-01-19 北京新媒传信科技有限公司 Chinese word segmentation method and system
CN102033879A (en) * 2009-09-27 2011-04-27 腾讯科技(深圳)有限公司 Method and device for identifying Chinese name
CN104424332A (en) * 2013-09-11 2015-03-18 富士通株式会社 Unambiguous Japanese name list building method and name identification method and device
CN105868193A (en) * 2015-01-19 2016-08-17 富士通株式会社 Device and method used to detect product relevant information in electronic text
CN106156051A (en) * 2015-03-27 2016-11-23 深圳市腾讯计算机系统有限公司 Build the method and device of name language material identification model
CN106407180A (en) * 2016-08-30 2017-02-15 北京奇艺世纪科技有限公司 Entity disambiguation method and apparatus
CN106708796A (en) * 2015-07-15 2017-05-24 中国科学院计算技术研究所 Text-based key personal name extraction method and system
CN106779080A (en) * 2017-01-13 2017-05-31 武汉理工数字传播工程有限公司 A kind of people information knowledge base method for auto constructing
CN107180087A (en) * 2017-05-09 2017-09-19 北京奇艺世纪科技有限公司 A kind of searching method and device
CN107391485A (en) * 2017-07-18 2017-11-24 中译语通科技(北京)有限公司 Entity recognition method is named based on the Korean of maximum entropy and neural network model

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8594996B2 (en) * 2007-10-17 2013-11-26 Evri Inc. NLP-based entity recognition and disambiguation
CN101833569A (en) * 2010-04-08 2010-09-15 中国科学院自动化研究所 Method for automatically identifying film human face image
CN102521321B (en) * 2011-12-02 2013-07-31 华中科技大学 Video search method based on search term ambiguity and user preferences
GB2499249B (en) * 2012-02-13 2016-09-21 Sony Computer Entertainment Europe Ltd System and method of image augmentation
US8799257B1 (en) * 2012-03-19 2014-08-05 Google Inc. Searching based on audio and/or visual features of documents
CN103714094B (en) * 2012-10-09 2017-07-11 富士通株式会社 The apparatus and method of the object in identification video
US9721002B2 (en) * 2013-11-29 2017-08-01 Sap Se Aggregating results from named entity recognition services
CN104918136B (en) * 2015-05-28 2018-08-31 北京奇艺世纪科技有限公司 Video locating method and device
CN106446754A (en) * 2015-08-11 2017-02-22 阿里巴巴集团控股有限公司 Image identification method, metric learning method, image source identification method and devices
CN106649272B (en) * 2016-12-23 2019-06-25 东北大学 A kind of name entity recognition method based on mixed model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
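A Uyghur person name recognition method combining statistics and rules; 塔什甫拉提·尼扎木丁 et al.; Acta Automatica Sinica (《自动化学报》); 2017-04-30; Vol. 43, No. 4; main text p. 656, Section 3.2 and Fig. 2 *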

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant