CN110413983B - Method and device for identifying name - Google Patents

Method and device for identifying name

Info

Publication number
CN110413983B
CN110413983B
Authority
CN
China
Prior art keywords
text
name
recognized
sentence
person
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810392724.3A
Other languages
Chinese (zh)
Other versions
CN110413983A (en)
Inventor
何耀
蒋松岐
刘笑逸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Haima Light Sail Entertainment Technology Co ltd
Original Assignee
Beijing Haima Light Sail Entertainment Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Haima Light Sail Entertainment Technology Co ltd filed Critical Beijing Haima Light Sail Entertainment Technology Co ltd
Priority to CN201810392724.3A priority Critical patent/CN110413983B/en
Publication of CN110413983A publication Critical patent/CN110413983A/en
Application granted granted Critical
Publication of CN110413983B publication Critical patent/CN110413983B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a method and a device for identifying a person name, wherein the method comprises the following steps: extracting the voice-over text adjacent to the dialog text from the text to be recognized; performing word segmentation on the voice-over text to obtain at least one participle; calculating a likelihood score for the participle, the likelihood score characterizing the likelihood that the participle is a person name; and determining whether the participle is a person name according to the likelihood score. The method therefore segments the voice-over text adjacent to the dialog text and calculates, for each participle, the likelihood that it is a person name, so that person names in the text to be recognized can be identified effectively.

Description

Method and device for identifying name
Technical Field
The invention relates to the field of automatic identification, in particular to a method and a device for identifying a name of a person.
Background
With the development of automatic identification technology, automatic identification technology is adopted in many fields to replace traditional manual identification. The automatic identification technology has the advantages of high efficiency and high accuracy.
Automatic recognition techniques may be applied to plot analysis of literary works, for example to the analysis of narrative literary works. Narrative literature describes a story line from the viewpoint of the events that occur. When performing plot analysis on narrative literature, identifying the character roles, namely the person names, is particularly important, and recognizing the person names in narrative literature facilitates its plot analysis.
Therefore, there is a need to provide a method for accurately and efficiently identifying names of people in narrative literature.
Disclosure of Invention
The invention aims to solve the technical problem of how to accurately and effectively extract the names of people in narrative literature.
In a first aspect, an embodiment of the present invention provides a method for identifying a person name, including:
extracting a voice-over text adjacent to the dialog text from the text to be recognized;
performing word segmentation on the voice-over text to obtain at least one word segmentation;
calculating a likelihood score for the participle, the likelihood score characterizing a likelihood that the participle is a person name;
and determining whether the participle is a name of a person according to the possibility score.
Optionally, the extracting the voice-over text adjacent to the dialog content from the text to be recognized includes:
segmenting a text to be recognized to obtain a plurality of sentences; the obtained plurality of sentences includes a conversational sentence and a non-conversational sentence;
extracting a voice-over sentence adjacent to the dialogue sentence from the obtained sentences as voice-over text adjacent to the dialogue content.
Optionally, the extracting the voice-over sentence adjacent to the dialogue sentence from the obtained sentences includes:
extracting a conversation sentence from the obtained plurality of sentences;
judging whether the previous sentence of the dialogue sentence is ended by a colon or a comma or not, if so, extracting the previous sentence of the dialogue sentence;
and,
and judging whether the next sentence of the dialogue sentence is a non-dialogue sentence or not, and if so, extracting the next sentence of the dialogue sentence.
Optionally, the extracting, from the obtained plurality of sentences, a voice-over sentence adjacent to the dialogue sentence includes:
extracting a non-conversational sentence from the obtained plurality of sentences;
and judging whether the non-dialog sentence is ended by a colon or a comma, and if so, extracting the non-dialog sentence.
Optionally, the calculating the likelihood score of the participle includes:
calculating the times of the word segmentation in the text to be recognized;
extracting the times of the occurrence of the participles in other texts;
and calculating the possibility score of the word segmentation according to the times of the word segmentation in the text to be recognized and the times of the word segmentation in the other texts.
Optionally, the determining whether the participle is a name of a person according to the likelihood score includes:
judging whether the possibility score is larger than a first preset threshold value or not;
if yes, determining the participle as a name of a person.
Optionally, the method further includes:
and obtaining the first preset threshold value in advance.
Optionally, the obtaining the first preset threshold in advance includes:
manually marking the name of the person in the text to be recognized;
presetting a first preset number of reference thresholds, and respectively counting the recognized names corresponding to the reference thresholds; determining the accuracy of the name recognition and the recall rate of the name recognition corresponding to each reference threshold value according to the recognized name and the name in the manually marked text to be recognized;
calculating a name identification score corresponding to the reference threshold according to the name identification accuracy and the name identification recall rate corresponding to the reference threshold;
and taking the reference threshold value corresponding to the highest name identification score as the first preset threshold value.
Optionally, the obtaining the first preset threshold in advance includes:
manually marking the name of the person in the text to be recognized;
calculating the possibility score of the person name in the text to be recognized, and obtaining the range of the possibility score of the person name in the text to be recognized;
determining the first preset threshold according to the range of the likelihood score.
In a second aspect, an embodiment of the present invention provides an apparatus for identifying a name of a person, including:
a voice-over text extracting unit for extracting the voice-over text adjacent to the dialog text from the text to be recognized;
a word segmentation unit for performing word segmentation on the voice-over text to obtain at least one participle;
a likelihood score calculating unit for calculating a likelihood score of the participle, the likelihood score being used for representing the likelihood that the participle is a person name;
and the name determining unit is used for determining whether the participle is the name of the person according to the possibility score.
Optionally, the voice-over text extracting unit includes: a segmentation subunit and an extraction subunit.
The segmentation subunit is used for segmenting the text to be recognized to obtain a plurality of sentences; the obtained plurality of sentences includes a conversational sentence and a non-conversational sentence;
the extracting subunit is configured to extract, from the obtained plurality of sentences, a bystander sentence that is adjacent to the dialogue sentence as a bystander text that is adjacent to the dialogue content.
Optionally, the extraction subunit is specifically configured to:
extracting a dialog sentence from the obtained plurality of sentences;
judging whether the previous sentence of the dialogue sentence is ended by a colon or a comma or not, if so, extracting the previous sentence of the dialogue sentence;
and,
and judging whether the next sentence of the dialogue sentence is a non-dialogue sentence or not, and if so, extracting the next sentence of the dialogue sentence.
Optionally, the extraction subunit is specifically configured to:
extracting a non-conversational sentence from the obtained plurality of sentences;
and judging whether the non-dialog sentence is ended by a colon or a comma, and if so, extracting the non-dialog sentence.
Optionally, the likelihood score calculating unit includes: a first frequency calculation subunit, a second frequency extraction subunit, and a likelihood score calculation subunit.
And the first frequency calculating subunit is used for calculating the times of the occurrences of the participles in the text to be recognized.
And the second frequency extraction subunit is used for extracting the times of the occurrences of the participles in other texts.
And the possibility score calculating subunit is used for calculating the possibility score of the participle according to the frequency of the participle appearing in the text to be recognized and the frequency of the participle appearing in the other texts.
Optionally, the name determining unit includes:
the judging subunit is used for judging whether the possibility score is larger than a first preset threshold value or not;
and the name determining subunit is used for determining the participle as the name of the person when the possibility score is greater than a first preset threshold value.
Optionally, the apparatus further comprises: a first preset threshold obtaining unit.
The first preset threshold obtaining unit is configured to obtain the first preset threshold in advance.
Optionally, the first preset threshold obtaining unit includes: a first name labeling subunit, a counting subunit, an accuracy and recall rate determining subunit, a name identification score calculating subunit, and a first preset threshold determining subunit.
And the first name labeling subunit is used for manually labeling names in the text to be recognized.
The counting subunit is configured to preset a first preset number of reference thresholds, and count the identified names corresponding to the reference thresholds respectively.
And the accuracy and recall rate determining subunit is used for determining the accuracy of the name identification and the recall rate of the name identification corresponding to each reference threshold value according to the identified name and the name in the manually marked text to be identified.
And the name identification score calculating subunit is used for calculating the name identification score corresponding to the reference threshold according to the name identification accuracy and the name identification recall rate corresponding to the reference threshold.
And the first preset threshold determining subunit is configured to use a reference threshold corresponding to the highest name recognition score as the first preset threshold.
Optionally, the first preset threshold obtaining unit includes: the second name labeling subunit, the possibility score range obtaining subunit and the first preset threshold obtaining subunit.
The second name labeling subunit is used for manually labeling names in the text to be recognized;
the likelihood score range obtaining subunit is configured to calculate a likelihood score of a person name in the text to be recognized, and obtain a range of the likelihood score of the person name in the text to be recognized;
and the first preset threshold obtaining subunit is configured to determine the first preset threshold according to the range of the likelihood score.
In a third aspect, an embodiment of the present invention provides an apparatus for identifying a name of a person, including a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
extracting a voice-over text adjacent to the dialog text from the text to be recognized;
performing word segmentation on the voice-over text to obtain at least one word segmentation;
calculating a likelihood score for the participle, the likelihood score characterizing a likelihood that the participle is a person name;
and determining whether the participle is a name of a person according to the possibility score.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, where instructions of the storage medium, when executed by a processor of an electronic device, enable the processor to perform the method for identifying a person name according to any one of the above first aspects.
Compared with the prior art, the embodiment of the invention has the following advantages:
in the embodiment of the invention, the voice-over text adjacent to the dialog text is first extracted from the text to be recognized, and the voice-over text is then segmented to obtain at least one participle. Finally, whether each participle is a person name is determined. Specifically, a likelihood score is calculated for the participle; since the likelihood score characterizes the likelihood that the participle is a person name, whether the participle is a person name can be judged according to the likelihood score. Therefore, the method for identifying person names provided by the embodiment of the invention segments the voice-over text adjacent to the dialog text and calculates, for each participle, the likelihood that it is a person name, so that person names in the text to be recognized can be accurately and effectively identified.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and it is also possible for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for identifying a name according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for identifying a name according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a method for identifying a name according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a method for obtaining a first preset threshold in advance according to an embodiment of the present invention;
fig. 5 is a schematic flowchart of another method for obtaining a first preset threshold in advance according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an apparatus for identifying a name of a person according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The inventor of the invention finds that, when performing plot analysis on narrative literature, identifying the character roles, namely the person names, is particularly important, and that recognizing the person names facilitates the plot analysis of the narrative literature. In practical application, person names can be identified by a part-of-speech tagging method. Part-of-speech tagging can identify the parts of speech of nouns, verbs, adjectives, prepositions and the like in a sentence, and can further distinguish place names and person names among the nouns, so that person names in the text to be recognized can be identified. However, person names in narrative literature often contain rare surnames and given names, and the part-of-speech tagging method can usually only identify person names composed of common surnames and common given names. Therefore, the part-of-speech tagging method cannot effectively identify person names in narrative literature.
The inventor also found in the research that, on one hand, narrative literature is mainly composed of voice-over and dialogue, that voice-over may appear before a dialogue starts, after it ends, or in the middle of it, and that person names are especially likely to appear in such voice-over. That is, there is a high probability that a person name appears in the voice-over adjacent to a dialogue. On the other hand, the likelihood score of a person name differs from that of a non-person name.
In view of this, in the embodiment of the present invention, the voice-over text adjacent to the dialog text is first extracted from the text to be recognized, and the voice-over text is then subjected to word segmentation to obtain at least one participle. Finally, whether each participle is a person name is determined. Specifically, a likelihood score is calculated for the participle; since the likelihood score characterizes the likelihood that the participle is a person name, whether the participle is a person name can be judged according to the likelihood score. Therefore, the method for identifying person names provided by the embodiment of the invention segments the voice-over text adjacent to the dialog text and calculates, for each participle, the likelihood that it is a person name, so that person names in the text to be recognized can be accurately and effectively identified.
Various non-limiting embodiments of the present invention are described in detail below with reference to the accompanying drawings.
First embodiment
Referring to fig. 1, the figure is a schematic flow chart of a method for identifying a name according to this embodiment.
The method for identifying a person name provided in this embodiment can be implemented through the following steps 101 to 104.
Step 101: and extracting the adjacent voice-over text adjacent to the dialog text from the text to be recognized.
The text to be recognized mentioned in the present embodiment includes text composed of contents such as conversation and voice-over, for example, the text to be recognized may be narrative literary works.
The voice-over text adjacent to the dialog text in this embodiment refers to the non-dialog text one sentence before the dialog text or the non-dialog text one sentence after the dialog text.
It should be noted that, in this embodiment, "previous sentence" and "next sentence" only identify the positional relationship between the dialog text and the adjacent voice-over text; the word "sentence" does not require that the voice-over text adjacent to the dialog text be a complete sentence, i.e., it is not required to end with a period. That is, the voice-over text adjacent to the dialog text mentioned in this embodiment may or may not be a complete sentence.
Step 102: and performing word segmentation on the voice text to obtain at least one word segmentation.
It should be noted that, in this embodiment, the voice-over text may be divided into a plurality of participles according to a certain window size, where the window size refers to the number of characters included in a participle, and one character may be a Chinese character, an English word, a character in another language, or the like. For example, if the text to be recognized is Chinese and the window size is 1, the text to be recognized is divided into a plurality of segments, and each segment contains one Chinese character.
It is considered that a person name in the text to be recognized may consist of two, three, or four characters. Therefore, in this embodiment, the participles may be obtained by segmenting the voice-over text with window sizes of 2, 3, and 4, and the operation of step 103 may then be performed on each participle.
For example, if the voice-over text is the four-character phrase "天气很好" ("the weather is very good"), segmenting it with a window size of 2 yields the participles "天气", "气很" and "很好"; with a window size of 3 it yields "天气很" and "气很好"; and with a window size of 4 it yields "天气很好". The operation of step 103 is then performed on each of these participles.
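A minimal Python sketch of the windowed segmentation in step 102 follows; the function name, the default window sizes, and the example phrase are illustrative assumptions rather than a prescribed implementation.

```python
def window_segments(text, window_sizes=(2, 3, 4)):
    """Return every contiguous substring of `text` whose length is one of the window sizes."""
    segments = []
    for w in window_sizes:
        for i in range(len(text) - w + 1):
            segments.append(text[i:i + w])
    return segments

# Assuming the example voice-over text above is the phrase "天气很好" ("the weather is very good"):
print(window_segments("天气很好"))
# ['天气', '气很', '很好', '天气很', '气很好', '天气很好']
```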
Step 103: calculating a likelihood score for the participle, the likelihood score characterizing a likelihood that the participle is a person's name.
Step 104: and determining whether the participle is a name of a person according to the possibility score.
With respect to step 103 and step 104, it should be noted that, since the likelihood score may represent the likelihood that the participle is a person name, it may be determined whether the participle is a person name through the likelihood score of the participle. Thereby identifying the name of the person in the text to be identified.
In a specific implementation, step 104 may determine whether the likelihood score is greater than a first preset threshold, and if so, determine that the word segmentation is a name of a person.
The first preset threshold is predetermined. The specific value of the first preset threshold is not specifically limited in this embodiment, and the first preset threshold may be specifically set according to the text to be recognized. As an example, the first preset threshold may be 15.
It is understood that when the likelihood score of the participle is greater than the first preset threshold, the possibility that the participle is the name of the person is high, and thus, the participle can be determined to be the name of the person.
The method for recognizing the name of the person provided by the embodiment includes the steps of firstly, extracting the voice-over text adjacent to the dialogue text from the text to be recognized, and then segmenting the voice-over text to obtain at least one segmented word. And finally, judging whether the participle is a name of a person or not. Specifically, whether the participle is a name of a person is judged, and the possibility that the participle is the name of the person can be identified by calculating the possibility score of the participle, so that whether the participle is the name of the person is judged according to the possibility score. Therefore, the method for identifying the names of the people provided by the embodiment of the invention identifies and segments the voice-over text adjacent to the dialogue text, and calculates the possibility that each segment is the name of the people, so that the names of the people in the text to be identified can be accurately and effectively identified.
Second embodiment
A first embodiment provides a method for identifying a person name, and a second embodiment will describe a specific implementation manner of step 101 in the first embodiment with reference to the drawings.
Referring to fig. 2, this figure is a schematic flowchart of a method for extracting a text adjacent to a dialog text according to this embodiment.
The method for extracting the voice-over text adjacent to the dialog text provided by the embodiment can be implemented by the following steps 201 to 202.
Step 201: the method comprises the steps of segmenting a text to be recognized to obtain a plurality of sentences, wherein the obtained sentences comprise a dialogue sentence and a non-dialogue sentence.
The text to be recognized includes dialog text and voice-over text. Dialog text typically appears within double quotation marks, and at the end of a dialog a punctuation mark such as a period, an exclamation mark or a question mark typically appears immediately followed by the right quotation mark.
Therefore, in this embodiment, the text to be recognized may be segmented using double quotation marks and other punctuation marks. Specifically, the text to be recognized is first split at pairs of consecutive punctuation marks, such as a period followed by a right quotation mark, an exclamation mark followed by a right quotation mark, or a question mark followed by a right quotation mark, so as to obtain a first segmented text. The first segmented text is then further split at left quotation marks to obtain a second segmented text. It is understood that the second segmented text already isolates the dialog text in the text to be recognized; the remaining non-dialog text is split again at punctuation marks such as commas and periods, so that the text to be recognized is segmented into a plurality of sentences.
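As one possible, non-authoritative reading of step 201, the three-pass segmentation could be sketched in Python as follows; the regular expressions and the assumption that dialog is enclosed in full-width quotation marks “…” are illustrative.

```python
import re

def split_sentences(text):
    # First pass: break after a period/exclamation/question mark immediately followed by a right quote.
    pieces = re.split(r'(?<=[。！？]”)', text)
    # Second pass: break before a left quote so that dialog starts its own piece.
    pieces = [p for piece in pieces for p in re.split(r'(?=“)', piece) if p]
    # Third pass: split the remaining non-dialog runs at commas and periods.
    sentences = []
    for piece in pieces:
        if piece.startswith('“'):
            sentences.append(piece)  # dialog sentence, kept whole
        else:
            sentences.extend(s for s in re.split(r'(?<=[，。])', piece) if s)
    return sentences
```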
Step 202: extracting a voice-over sentence adjacent to the dialog sentence from the obtained plurality of sentences as voice-over text adjacent to the dialog text.
After the text to be recognized is divided into a plurality of sentences, the bystander sentences adjacent to the dialog sentences can be extracted according to the position relationship between the dialog sentences and the non-dialog sentences, so that the bystander texts adjacent to the dialog texts can be extracted.
When the step 202 is implemented in detail, various implementations are possible.
In one possible implementation, a dialog sentence may be extracted from the obtained plurality of sentences; judging whether the previous sentence of the dialogue sentence is ended by a colon or a comma or not, if so, extracting the previous sentence of the dialogue sentence; and judging whether the next sentence of the dialogue sentences is a non-dialogue sentence, and if so, extracting the next sentence of the dialogue sentences.
It is understood that if the preceding sentence of the dialog sentence ends with a colon or a comma, it is indicated that the preceding sentence of the dialog sentence is a bystander sentence adjacent to the dialog sentence. If the next sentence of the dialog sentence is a non-dialog sentence, it is indicated that the next sentence of the dialog sentence is a bystander sentence adjacent to the dialog sentence.
In another possible implementation, a non-conversational sentence may be extracted from the obtained plurality of sentences; and judging whether the non-dialog sentence is ended by a colon or a comma, and if so, extracting the non-dialog sentence.
It is understood that if a non-dialog sentence ends with a colon or a comma, the sentence following it is a dialog sentence; that is, the non-dialog sentence is a voice-over sentence adjacent to a dialog sentence.
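Continuing the sketch, the first implementation of step 202 (previous sentence ending in a colon or comma, next sentence being non-dialog) could look like the following; the dialog test and the punctuation set are assumptions consistent with the segmentation sketch above.

```python
def is_dialog(sentence):
    # Assumption: dialog sentences begin with a full-width left quotation mark.
    return sentence.startswith('“')

def adjacent_voiceover(sentences):
    """Collect the voice-over sentences adjacent to dialog sentences."""
    voiceover = []
    for i, sent in enumerate(sentences):
        if not is_dialog(sent):
            continue
        # The previous sentence is adjacent voice-over if it ends with a colon or comma.
        if i > 0 and sentences[i - 1].rstrip().endswith(('：', '，', ':', ',')):
            voiceover.append(sentences[i - 1])
        # The next sentence is adjacent voice-over if it is a non-dialog sentence.
        if i + 1 < len(sentences) and not is_dialog(sentences[i + 1]):
            voiceover.append(sentences[i + 1])
    return voiceover
```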
In the method provided by this embodiment, the text to be recognized is segmented by punctuation marks to obtain a dialog sentence and a non-dialog sentence, and the bystander text adjacent to the dialog text is extracted by using the position relationship between the dialog sentence and the non-dialog sentence.
Third embodiment
A first embodiment provides a method for identifying a person name, and a third embodiment will describe a specific implementation manner of step 103 in the first embodiment with reference to the drawings.
Referring to fig. 3, the figure is a schematic flow chart of the method for calculating the likelihood score of the participle according to the embodiment.
The method for calculating the likelihood score of the participle provided by the embodiment can be implemented by the following steps 301 to 303.
Step 301: and calculating the times of the occurrence of the word segmentation in the text to be recognized.
Step 302: and extracting the times of the occurrences of the participles in other texts.
It should be noted that the number of times that the word segmentation occurs in other texts is pre-calculated and stored in the corresponding storage space. The storage space stores the times of occurrence of many segmented words in other texts. Thus, the number of times the word-segmentation occurs in other text can be extracted directly from the storage space.
The present embodiment does not specifically limit the number of texts included in the other texts. For example, the other text may be text composed of 10 novel.
It is understood that, as the number of texts making up the other texts changes, the number of occurrences of a participle in the other texts may also change, so the occurrence counts stored in the storage space may be updated. For example, when the other texts grow from 10 novels to 20 novels, the stored number of times a participle appears in the other texts may be updated from its count over the 10 novels to its count over the 20 novels.
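A small sketch of how the occurrence counts over the other texts might be precomputed and refreshed when the corpus grows; the corpus loader named here is hypothetical, and the embodiment only requires that such counts exist in a storage space.

```python
from collections import Counter

def background_counts(corpus_texts, window_sizes=(2, 3, 4)):
    """Count how often every windowed segment occurs across the other texts."""
    counts = Counter()
    for text in corpus_texts:
        for w in window_sizes:
            for i in range(len(text) - w + 1):
                counts[text[i:i + w]] += 1
    return counts

# When the corpus grows from 10 to 20 novels, the stored counts can simply be rebuilt:
# counts = background_counts(load_novels("novels/"))   # load_novels is a hypothetical loader
```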
Step 303: and calculating the possibility score of the word segmentation according to the times of the word segmentation in the text to be recognized and the times of the word segmentation in other texts.
It should be noted that the specific implementation of step 303 is similar to a term frequency-inverse document frequency (TF-IDF) algorithm. TF-IDF is a statistical method to assess how important a word is to a document. The importance of a word-segmentation increases in proportion to the number of times it appears in a document, but at the same time decreases in inverse proportion to the frequency with which it appears in other documents.
In this embodiment, a ratio of the number of times that the participle appears in the text to be recognized to the number of times that the participle appears in other texts may be calculated as a likelihood score of the participle, which is used for representing the likelihood that the participle is a person name.
However, in practical applications a very rare person name in the text to be recognized may never appear in the other texts, in which case the denominator of the ratio above would be 0 and the calculation would be meaningless. Therefore, in this embodiment the likelihood score may adopt the following formula:
s_i = f_i^1 / (f_i^2 + 1)
wherein s_i represents the likelihood score of the participle i, f_i^1 denotes the number of times the participle i occurs in the text to be recognized, and f_i^2 denotes the number of times the participle i occurs in the other texts.
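Assuming the smoothed ratio given above, steps 301 to 303 can be sketched as follows; the dictionary-based count lookups are illustrative.

```python
def likelihood_score(segment, target_counts, other_counts):
    """Likelihood score s_i = f_i1 / (f_i2 + 1) for a participle `segment`."""
    f1 = target_counts.get(segment, 0)   # occurrences in the text to be recognized
    f2 = other_counts.get(segment, 0)    # occurrences in the other texts
    return f1 / (f2 + 1)
```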
In the method provided by this embodiment, the thought of the TF-IDF algorithm is used to calculate the possibility score of the participle, so that whether the participle is a name or not can be determined according to the possibility score.
Fourth embodiment
The first embodiment provides a method for identifying a person name, and in the first embodiment, it is mentioned that whether the participle is a person name can be determined by determining whether the likelihood score of the participle is greater than a first preset threshold. It should be noted that the first preset threshold may be obtained in advance.
On the basis of the method for identifying a person name provided in the first embodiment, a fourth embodiment will describe a specific implementation method for obtaining a first preset threshold in advance with reference to the drawings.
Referring to fig. 4, the figure is a schematic flowchart of a method for obtaining a first preset threshold in advance according to this embodiment.
The method for obtaining the first preset threshold in advance provided by this embodiment may be implemented by the following steps 401 to 404.
Step 401: and manually marking the name of the person in the text to be recognized.
The manual marking of the name in the text to be recognized in this embodiment means that the name in the text to be recognized is marked in a manual participation manner.
Step 402: presetting a first preset number of reference thresholds, and respectively counting the recognized names corresponding to the reference thresholds.
It should be noted that, the specific value of the first preset number is not specifically limited in this embodiment, and the first preset number may be specifically set according to an actual situation. For example, the first preset number may be 5.
The step 402 of counting the recognized names corresponding to the reference thresholds means that the reference thresholds are respectively used as first preset thresholds, and the method for recognizing the names provided in the first embodiment is used to perform name recognition, so as to count the recognized names corresponding to the reference thresholds.
Step 403: and determining the name recognition accuracy and the name recognition recall rate corresponding to each reference threshold value according to the recognized name and the name in the manually marked text to be recognized.
It is understood that the reference threshold is different, and the recognized names are different, and the recognized names may include correct names or incorrect names.
The accuracy of the person name recognition corresponding to a reference threshold mentioned in this embodiment is the ratio of the number of correctly recognized person names corresponding to that reference threshold to the total number of recognized person names.
The recall rate of the name recognition corresponding to the reference threshold in this embodiment is a ratio of the number of correctly recognized names corresponding to the reference threshold to the number of names in the manually labeled text to be recognized.
Step 404: and calculating the name identification score corresponding to the reference threshold according to the name identification accuracy and the name identification recall rate corresponding to the reference threshold.
It can be understood that the accuracy of the name recognition represents the ratio of the number of correctly recognized names to the total number of recognized names in the recognized names corresponding to the reference threshold, and represents the quality of the name recognition corresponding to the reference threshold; the recall rate of the person name recognition represents the ratio of the number of correct recognitions to the number of manually labeled person names in the recognized person names corresponding to the reference threshold, and represents the number of the recognizable person names.
It is understood that, when performing person name recognition, on the one hand non-person-names should be recognized as person names as rarely as possible, that is, the accuracy of the person name recognition should be relatively high. On the other hand, as many of the person names in the text to be recognized as possible should be recognized, i.e., the recall rate of the person name recognition should be high.
In the embodiment of the present application, the accuracy of the name recognition and the recall rate of the name recognition corresponding to the reference threshold may be integrated, and the name recognition score corresponding to the reference threshold may be calculated. The name recognition score may be viewed as a weighted balance of the accuracy of the name recognition and the recall rate of the name recognition. Specifically, the name recognition score may be obtained by the following formula:
s = 2 · f_1 · f_2 / (f_1 + f_2)
wherein s represents the name recognition score, f_1 represents the accuracy of the name recognition, and f_2 represents the recall rate of the name recognition.
Step 405: and taking the reference threshold value corresponding to the highest name identification score as the first preset threshold value. It can be understood that, the higher the name recognition score is, the higher the recall and accuracy rate of the name recognition corresponding to the reference threshold may be, and therefore, the better the name recognition effect is.
The following description will be made by taking an example of steps 401 to 405. For example, 5 reference thresholds are set in advance, and the 5 reference thresholds are 12, 13, 14, 15, and 16, respectively. Respectively performing word segmentation by using the method in the first embodiment to obtain a possibility score of each word segmentation, taking the word segmentation with the possibility score larger than 12 as a name, and counting the name; accordingly, the participles with the likelihood scores of more than 13, 14, 15 and 16 are respectively taken as the names of the persons and are respectively counted. The accuracy of the person name recognition and the recall rate of the person name recognition corresponding to the reference threshold values 12, 13, 14, 15 and 16 are calculated, respectively, thereby calculating the person name recognition scores corresponding to the reference threshold values 12, 13, 14, 15 and 16. The name recognition scores corresponding to the respective reference thresholds are shown in table 1. Wherein 50 person names are manually labeled.
TABLE 1
[Table 1 is reproduced as an image in the original publication and is not shown here. For each reference threshold from 12 to 16 it lists the number of correctly identified person names, the total vocabulary number of the recognition output, the accuracy, the recall rate, and the corresponding name recognition score.]
In Table 1, the "number of correctly identified names" refers to the number of recognized names that are also among the manually labeled names; the "total vocabulary number of recognition output" refers to the total number of recognized names.
As can be seen from table 1, when the reference threshold is 15, the corresponding name recognition score is the highest, and therefore, the reference threshold 15 may be used as the first preset threshold, so that the name of the person in the text to be recognized can be accurately recognized by using the method provided by the first embodiment.
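A sketch of the threshold search in steps 401 to 405; the candidate thresholds mirror the example above, and the score dictionary and labeled-name set are assumed inputs.

```python
def pick_threshold(scored_segments, labeled_names, candidate_thresholds=(12, 13, 14, 15, 16)):
    """Choose the reference threshold whose recognized names maximize the name recognition score."""
    best_threshold, best_score = None, -1.0
    for t in candidate_thresholds:
        recognized = {seg for seg, score in scored_segments.items() if score > t}
        correct = recognized & set(labeled_names)
        accuracy = len(correct) / len(recognized) if recognized else 0.0
        recall = len(correct) / len(labeled_names) if labeled_names else 0.0
        name_score = 2 * accuracy * recall / (accuracy + recall) if (accuracy + recall) else 0.0
        if name_score > best_score:
            best_threshold, best_score = t, name_score
    return best_threshold
```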
In addition to the method shown in fig. 4, another method for obtaining the first preset threshold value in advance is provided in the present embodiment.
Referring to fig. 5, this is a schematic flow chart of another method for obtaining a first preset threshold in advance according to this embodiment.
The method for obtaining the first preset threshold in advance provided by this embodiment can be implemented by the following steps 501 to 503.
Step 501: and manually marking the name of the person in the text to be recognized.
Step 501 is the same as step 401, and for the specific description, reference may be made to the description part in step 401, which is not described herein again.
Step 502: and calculating the possibility score of the person name in the text to be recognized, and obtaining the range of the possibility score of the person name in the text to be recognized.
Step 503: determining the first preset threshold according to the range of the likelihood score.
It should be noted that, since all the person names of the text to be recognized are manually marked, the range of the likelihood score of the person names in the text to be recognized can be obtained.
Step 503, when implemented specifically, may use the lower limit of the range of the likelihood score as a first preset threshold. For example, if the probability score of the person name in the text to be recognized ranges from 15 to 21, the lower limit 15 of the range from 15 to 21 may be used as the first preset threshold.
Since the probability scores of all the names in the text to be recognized are greater than or equal to 15, the names in the text to be recognized can be almost recognized by using 15 as the first preset threshold.
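The alternative of steps 501 to 503 reduces to taking the lower bound of the labeled names' likelihood scores; a minimal sketch, with assumed inputs matching the previous sketches:

```python
def threshold_from_labeled_names(scored_segments, labeled_names):
    """Use the smallest likelihood score among the manually labeled person names as the first preset threshold."""
    name_scores = [scored_segments[name] for name in labeled_names if name in scored_segments]
    return min(name_scores) if name_scores else None

# With the example above (name scores ranging from 15 to 21), this returns 15.
```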
Fifth embodiment
The present invention also provides an apparatus for recognizing a person's name based on the method for recognizing a person's name provided in the above first to fourth embodiments, and the fifth embodiment will be described with reference to the accompanying drawings.
Referring to fig. 6, this figure is a schematic structural diagram of an apparatus 600 for identifying a name according to this embodiment. The apparatus 600 may specifically include, for example: a voice-over text extracting unit 610, a word segmentation unit 620, a likelihood score calculating unit 630, and a name determining unit 640.
The voice-over text extracting unit 610 is configured to extract the voice-over text adjacent to the dialog text from the text to be recognized.
The word segmentation unit 620 is configured to perform word segmentation on the voice-over text to obtain at least one word segmentation.
The likelihood score calculating unit 630 is configured to calculate a likelihood score of the participle, where the likelihood score is used to characterize a likelihood that the participle is a person name.
The name determining unit 640 is configured to determine whether the participle is a name according to the likelihood score.
Optionally, the voice-over text extracting unit 610 includes: a segmentation subunit and an extraction subunit.
The segmentation subunit is used for segmenting the text to be recognized to obtain a plurality of sentences; the obtained plurality of sentences includes a conversational sentence and a non-conversational sentence;
the extracting subunit is configured to extract, from the obtained plurality of sentences, a voice-over sentence adjacent to the dialogue sentence as a voice-over text adjacent to the dialogue content.
Optionally, the extraction subunit is specifically configured to:
extracting a dialog sentence from the obtained plurality of sentences;
judging whether the previous sentence of the dialogue sentence is ended by a colon or a comma or not, if so, extracting the previous sentence of the dialogue sentence;
and,
and judging whether the next sentence of the dialogue sentence is a non-dialogue sentence or not, and if so, extracting the next sentence of the dialogue sentence.
Optionally, the extraction subunit is specifically configured to:
extracting a non-conversational sentence from the obtained plurality of sentences;
and judging whether the non-dialog sentence is ended by a colon or a comma, and if so, extracting the non-dialog sentence.
Optionally, the possibility score calculating unit 630 includes: a first frequency calculation subunit, a second frequency extraction subunit, and a likelihood score calculation subunit.
And the first frequency calculating subunit is used for calculating the times of the occurrences of the participles in the text to be recognized.
And the second frequency extraction subunit is used for extracting the times of the occurrences of the participles in other texts.
And the possibility score calculating subunit is used for calculating the possibility score of the participle according to the frequency of the participle appearing in the text to be recognized and the frequency of the participle appearing in the other texts.
Optionally, the name determining unit 640 includes:
the judging subunit is used for judging whether the possibility score is larger than a first preset threshold value or not;
and the name determining subunit is used for determining the participle as the name of the person when the possibility score is greater than a first preset threshold value.
Optionally, the apparatus further comprises: a first preset threshold obtaining unit.
The first preset threshold obtaining unit is configured to obtain the first preset threshold in advance.
Optionally, the first preset threshold obtaining unit includes: a first name labeling subunit, a counting subunit, an accuracy and recall rate determining subunit, a name identification score calculating subunit, and a first preset threshold determining subunit.
And the first name labeling subunit is used for manually labeling names in the text to be recognized.
The counting subunit is configured to preset a first preset number of reference thresholds, and count the identified names corresponding to the reference thresholds respectively.
And the accuracy and recall rate determining subunit is used for determining the accuracy of the name identification and the recall rate of the name identification corresponding to each reference threshold value according to the identified name and the name in the manually marked text to be identified.
And the name identification score calculating subunit is used for calculating the name identification score corresponding to the reference threshold according to the name identification accuracy and the name identification recall rate corresponding to the reference threshold.
And the first preset threshold determining subunit is configured to use a reference threshold corresponding to the highest name recognition score as the first preset threshold.
Optionally, the first preset threshold obtaining unit includes: the second name labeling subunit, the possibility score range obtaining subunit and the first preset threshold obtaining subunit.
The second name labeling subunit is used for manually labeling names in the text to be recognized;
the likelihood score range obtaining subunit is used for calculating a likelihood score of the name in the text to be recognized and obtaining a range of the likelihood score of the name in the text to be recognized;
the first preset threshold obtaining subunit is configured to determine the first preset threshold according to the range of the likelihood score.
The device for identifying a person name provided in this embodiment is a device corresponding to the method for identifying a person name provided in the first to fourth embodiments, and therefore, specific implementation portions may refer to the descriptions in the first to fourth embodiments, and are not described herein again.
The apparatus for recognizing a person name provided in this embodiment extracts, first, the bystander text adjacent to the dialog text from the text to be recognized, and then performs word segmentation on the bystander text to obtain at least one word segmentation. And finally, judging whether the participle is a name of a person or not. Specifically, whether the participle is a name of a person is judged, and the possibility that the participle is the name of the person can be identified by calculating the possibility score of the participle, so that whether the participle is the name of the person is judged according to the possibility score. Therefore, the device for identifying the names of the people provided by the embodiment of the invention identifies and segments the voice-over text adjacent to the dialogue text, and calculates the possibility that each segment is the name of the people, so that the names of the people in the text to be identified can be accurately and effectively identified.
Based on the methods for identifying a person name provided in the first to fourth embodiments above, the present invention also provides a device for identifying a person name. The apparatus includes a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
extracting a voice-over text adjacent to the dialog text from the text to be recognized;
performing word segmentation on the voice-over text to obtain at least one word segmentation;
calculating a likelihood score for the participle, the likelihood score characterizing a likelihood that the participle is a person name;
and determining whether the participle is a name of a person according to the possibility score.
The present invention also provides a non-transitory computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the processor to perform the method of recognizing a person name as described in the first to fourth embodiments.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is only limited by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims (11)

1. A method of identifying a person's name, comprising:
extracting a voice-over text adjacent to the dialog text from the text to be recognized;
performing word segmentation on the voice-over text to obtain at least one word segmentation;
calculating a likelihood score for the participle, the likelihood score characterizing a likelihood that the participle is a person name;
determining whether the participle is a name of a person according to the possibility score;
the calculating the likelihood score of the participle comprises:
calculating the times of the word segmentation in the text to be recognized;
extracting the times of the occurrence of the participles in other texts;
and calculating the possibility score of the word segmentation according to the times of the word segmentation in the text to be recognized and the times of the word segmentation in the other texts.
2. The method of claim 1, wherein the extracting the voice-over text adjacent to the dialog text from the text to be recognized comprises:
segmenting a text to be recognized to obtain a plurality of sentences; the obtained plurality of sentences includes a conversational sentence and a non-conversational sentence;
extracting a voice-over sentence adjacent to the dialogue sentence from the obtained sentences as voice-over text adjacent to the dialogue content.
3. The method according to claim 2, wherein said extracting a bystander sentence adjacent to the dialogue sentence from the obtained plurality of sentences comprises:
extracting a conversation sentence from the obtained plurality of sentences;
judging whether the previous sentence of the dialogue sentence is ended by a colon or a comma or not, if so, extracting the previous sentence of the dialogue sentence;
and,
and judging whether the next sentence of the dialogue sentence is a non-dialogue sentence or not, and if so, extracting the next sentence of the dialogue sentence.
4. The method according to claim 2, wherein said extracting the bystander sentences adjacent to the dialogue sentence from the obtained plurality of sentences comprises:
extracting a non-conversational sentence from the obtained plurality of sentences;
and judging whether the non-dialog sentence is ended by a colon or a comma, and if so, extracting the non-dialog sentence.
5. The method of claim 1, wherein determining whether the participle is a name of a person according to the likelihood score comprises:
judging whether the possibility score is larger than a first preset threshold value or not;
if so, determining the participle as a name of a person.
6. The method of claim 5, further comprising:
and obtaining the first preset threshold value in advance.
7. The method according to claim 6, wherein the pre-obtaining the first preset threshold comprises:
manually marking the name of the person in the text to be recognized;
presetting a first preset number of reference thresholds, and respectively counting the recognized names corresponding to the reference thresholds; determining the accuracy of the name recognition and the recall rate of the name recognition corresponding to each reference threshold value according to the recognized name and the name in the manually marked text to be recognized;
calculating a name identification score corresponding to the reference threshold according to the name identification accuracy and the name identification recall rate corresponding to the reference threshold;
and taking the reference threshold value corresponding to the highest name identification score as the first preset threshold value.
8. The method according to claim 6, wherein the pre-obtaining the first preset threshold value comprises:
manually marking the name of the person in the text to be recognized;
calculating the possibility score of the person name in the text to be recognized, and obtaining the range of the possibility score of the person name in the text to be recognized;
determining the first preset threshold according to the range of the likelihood score.
9. An apparatus for recognizing a name of a person, comprising:
a voice-over text extracting unit, used for extracting the voice-over text adjacent to the dialog text from the text to be recognized;
a word segmentation unit, used for performing word segmentation on the voice-over text to obtain at least one participle;
a likelihood score calculating unit for calculating a likelihood score of the participle, the likelihood score being used for representing the likelihood that the participle is a person name;
the name determining unit is used for determining whether the participle is a name according to the possibility score;
the likelihood score calculating unit is specifically configured to:
calculating the times of the word segmentation in the text to be recognized;
extracting the times of the occurrence of the participles in other texts;
and calculating the possibility score of the word segmentation according to the times of the word segmentation in the text to be recognized and the times of the word segmentation in the other texts.
10. An apparatus for identifying a person's name, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and wherein execution of the one or more programs by one or more processors comprises instructions for:
extracting a voice-over text adjacent to the dialog text from the text to be recognized;
performing word segmentation on the voice-over text to obtain at least one word segmentation;
calculating a likelihood score for the participle, the likelihood score characterizing a likelihood that the participle is a person name;
determining whether the participle is a name of a person according to the possibility score;
the calculating the likelihood score of the participle comprises:
calculating the times of the word segmentation in the text to be recognized;
extracting the times of the occurrence of the participles in other texts;
and calculating the possibility score of the word segmentation according to the times of the word segmentation in the text to be recognized and the times of the word segmentation in the other texts.
11. A non-transitory computer readable storage medium having instructions stored thereon that, when executed by a processor of an electronic device, enable the processor to perform the method of identifying a person name of any one of claims 1-8.
CN201810392724.3A 2018-04-27 2018-04-27 Method and device for identifying name Active CN110413983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810392724.3A CN110413983B (en) 2018-04-27 2018-04-27 Method and device for identifying name

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810392724.3A CN110413983B (en) 2018-04-27 2018-04-27 Method and device for identifying name

Publications (2)

Publication Number Publication Date
CN110413983A CN110413983A (en) 2019-11-05
CN110413983B true CN110413983B (en) 2022-09-27

Family

ID=68346651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810392724.3A Active CN110413983B (en) 2018-04-27 2018-04-27 Method and device for identifying name

Country Status (1)

Country Link
CN (1) CN110413983B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131871B (en) * 2020-09-22 2023-06-30 平安国际智慧城市科技股份有限公司 Method, device, equipment and storage medium for identifying Chinese name

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104023124A (en) * 2014-05-14 2014-09-03 上海卓悠网络科技有限公司 Method and device for automatically identifying and extracting a name in short message
CN106294321A (en) * 2016-08-04 2017-01-04 北京智能管家科技有限公司 The dialogue method for digging of a kind of specific area and device
CN107729309A (en) * 2016-08-11 2018-02-23 中兴通讯股份有限公司 A kind of method and device of the Chinese semantic analysis based on deep learning

Also Published As

Publication number Publication date
CN110413983A (en) 2019-11-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant