CN112883162A - Transliteration name recognition method, transliteration name recognition device, recognition equipment and readable storage medium - Google Patents


Publication number
CN112883162A
Authority
CN
China
Prior art keywords
recognized
phrase
name
transliteration
transliterated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110242757.1A
Other languages
Chinese (zh)
Inventor
聂镭
齐凯杰
聂颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Original Assignee
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority to CN202110242757.1A
Publication of CN112883162A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The embodiments of the present application are applicable to the field of natural language processing and provide a transliteration name recognition method, a transliteration name recognition device and a readable storage medium. The method comprises the following steps: acquiring a text to be processed; performing word segmentation on the text to be processed to obtain a phrase to be recognized, wherein the text to be processed comprises at least one phrase to be recognized; calling a preset transliteration name recognition rule; and recognizing the phrase to be recognized according to the transliteration name recognition rule to obtain a transliteration name recognition result, wherein the transliteration name recognition result comprises the phrases to be recognized that are recognized as transliterated names. The method and the device can therefore recognize transliterated foreign names in Chinese text by means of the preset transliteration name recognition rule, which improves the accuracy of current Chinese named entity recognition.

Description

Transliteration name recognition method, transliteration name recognition device, recognition equipment and readable storage medium
Technical Field
The present application relates to the field of natural language processing, and in particular, to a transliteration name recognition method, apparatus, recognition device, and readable storage medium.
Background
At the present stage, Chinese named entity recognition mainly recognizes entities such as person names, place names and organizations. Recognition of Chinese person names is relatively mature, but the recognition results are relatively poor when the Chinese text contains transliterated foreign names. Transliterated foreign names vary in length and have no obvious boundary marker words, so current Chinese named entity recognition cannot recognize them, which results in low recognition accuracy.
Disclosure of Invention
In view of the above, embodiments of the present application provide a transliteration name recognition method, apparatus, recognition device and readable storage medium, so as to solve the problem of low recognition accuracy caused by the inability of current Chinese named entity recognition to recognize transliterated foreign names.
A first aspect of an embodiment of the present application provides a transliteration name recognition method, including:
acquiring a text to be processed;
performing word segmentation processing on the text to be processed to obtain a phrase to be recognized, wherein the text to be processed comprises at least one phrase to be recognized;
calling a preset transliteration name recognition rule;
and identifying the phrase to be identified according to the transliteration name identification rule to obtain a transliteration name identification result, wherein the transliteration name identification result comprises the phrase to be identified which is identified as the transliteration name.
In a possible implementation manner of the first aspect, invoking a preset transliteration name recognition rule includes:
extracting text attributes of the text to be processed;
and calling a transliteration name database corresponding to the text attribute.
In a possible implementation manner of the first aspect, identifying the phrase to be identified according to the transliteration name identification rule to obtain a transliteration name identification result includes:
inputting the phrase to be recognized into the transliteration name database to obtain the frequency corresponding to each character in the phrase to be recognized;
and obtaining a transliteration name recognition result of the phrase to be recognized according to the frequency corresponding to each character in the phrase to be recognized.
In a possible implementation manner of the first aspect, before the inputting the phrase to be recognized into the transliteration name database and obtaining the frequency corresponding to each word in the phrase to be recognized, the method further includes:
recognizing the word group to be recognized obtained after word segmentation processing according to a preset name recognition model, and generating a name recognition result; the person name recognition result comprises a word group to be recognized, wherein the word group comprises a person name;
and taking the phrase to be recognized with the name recognition result as the phrase to be recognized before being input into the transliteration name database.
In a possible implementation manner of the first aspect, obtaining a transliteration name recognition result of the phrase to be recognized according to a frequency corresponding to each word in the phrase to be recognized includes:
determining, according to the following formula, whether the internal solidification degree of the phrase to be recognized meets a first preset condition:
$$\min_{1 \le i \le n} p_i > P_{\text{threshold}}$$
where i = 1, 2, ..., n, $\min(p_i)$ denotes the minimum frequency of the characters in the phrase to be recognized, and $P_{\text{threshold}}$ denotes a frequency threshold;
and when the internal solidification degree of the phrase to be recognized meets a first preset condition, determining that the phrase to be recognized is an transliterated name.
In a possible implementation manner of the first aspect, after obtaining a transliterated name recognition result of the phrase to be recognized according to a frequency corresponding to each word in the phrase to be recognized, the method further includes:
extracting the transliterated names in the transliterated name recognition result;
verifying the transliteration name to obtain a verification result;
and eliminating the transliterated names corresponding to the verification results which do not accord with the second preset condition in the transliterated name identification results.
In a possible implementation manner of the first aspect, the verification result includes a first part-of-speech verification result and a second part-of-speech verification result;
verifying the transliteration name to obtain a verification result, wherein the verification result comprises the following steps:
inputting the transliterated name into a preset prefix corpus to obtain a left adjacent character corresponding to the transliterated name;
inputting the transliterated name into a preset suffix corpus to obtain a right adjacent character corresponding to the transliterated name;
performing first part-of-speech analysis on the left adjacent character to obtain a first part-of-speech verification result;
and performing second part-of-speech analysis on the right adjacent character to obtain a second part-of-speech verification result.
A second aspect of an embodiment of the present application provides a transliteration name recognition apparatus, including:
the acquisition module is used for acquiring a text to be processed;
the word segmentation processing module is used for carrying out word segmentation processing on the text to be processed to obtain a word group to be recognized, wherein the text to be processed comprises at least one word group to be recognized;
the calling module is used for calling a preset transliteration name recognition rule;
and the recognition module is used for recognizing the phrase to be recognized according to the transliteration name recognition rule to obtain a transliteration name recognition result.
In a possible implementation manner of the second aspect, the invoking module includes:
the extraction unit is used for extracting the text attribute of the text to be processed;
and the calling unit is used for calling the transliteration name database corresponding to the text attribute.
In one possible implementation manner of the second aspect, the identification module includes:
the recognition unit is used for inputting the phrase to be recognized into the transliteration name database to obtain the frequency corresponding to each character in the phrase to be recognized;
and the generating unit is used for obtaining the transliterated name recognition result of the phrase to be recognized according to the frequency corresponding to each character in the phrase to be recognized.
In a possible implementation manner of the second aspect, the apparatus further includes:
the generating module is used for identifying the word group to be identified obtained after word segmentation processing according to a preset name identification model and generating a name identification result; the person name recognition result comprises a word group to be recognized, wherein the word group comprises a person name;
and the screening module is used for taking the phrase to be recognized with the name recognition result of the person as the phrase to be recognized before the phrase is input into the transliteration name database.
In a possible implementation manner of the second aspect, the generating unit includes:
a judging subunit, configured to determine, according to the following formula, whether the internal solidification degree of the phrase to be recognized meets a first preset condition:
$$\min_{1 \le i \le n} p_i > P_{\text{threshold}}$$
where i = 1, 2, ..., n, $\min(p_i)$ denotes the minimum frequency of the characters in the phrase to be recognized, and $P_{\text{threshold}}$ denotes a frequency threshold;
and the determining subunit is used for determining the phrase to be recognized as the transliteration name when the internal solidification degree of the phrase to be recognized meets a first preset condition.
In a possible implementation manner of the second aspect, the apparatus further includes:
the extracting module is used for extracting the transliterated names in the transliterated name recognition result;
the verification module is used for verifying the transliteration name to obtain a verification result;
and the rejecting module is used for rejecting the transliterated name corresponding to the verification result which does not accord with the second preset condition in the transliterated name recognition result.
In a possible implementation manner of the second aspect, the check result includes a first part-of-speech check result and a second part-of-speech check result, and the check module includes:
the first extraction unit is used for inputting the transliterated name into a preset prefix corpus to obtain a left adjacent character corresponding to the transliterated name;
the second extraction unit is used for inputting the transliterated name into a preset suffix corpus to obtain a right adjacent character corresponding to the transliterated name;
the first analysis unit is used for carrying out first part-of-speech analysis on the left adjacent words to obtain a first part-of-speech verification result;
and the second analysis unit is used for performing second part-of-speech analysis on the right adjacent character to obtain a second part-of-speech verification result.
A third aspect of an embodiment of the present application provides an identification device, including: a memory, a processor, an image capture device, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method as described above when executing the computer program.
A fourth aspect of embodiments of the present application provides a readable storage medium, which stores a computer program that, when executed by a processor, implements the steps of the method as described above.
Compared with the prior art, the embodiment of the application has the advantages that:
according to the method and the device, the names of the foreigners who contain transliterations in the Chinese text are identified through the preset transliteration name identification rule, and the identification accuracy in the current stage of the named entity identification of Chinese is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a transliteration name recognition method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a transliteration name recognition apparatus according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a recognition device provided by an embodiment of the present application;
Fig. 4 is a directed acyclic graph used in the process, shown in Fig. 1, of performing word segmentation on the text to be processed to obtain a phrase to be recognized.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Fig. 1 is a schematic flowchart of a transliteration name recognition method provided in an embodiment of the present application. The method is applied to a recognition device, which may be a server or a terminal device; the server may be a computing device such as a cloud server, and the terminal device may be a computing device such as a desktop computer, a notebook computer or a palmtop computer. The method includes the following steps:
Step S101, acquiring a text to be processed.
The voice text to be processed is obtained by recognizing voice audio with a speech recognition technology. The voice audio may be obtained directly: for example, in an intelligent customer service scenario, the voice audio of a customer is acquired directly and recognized by the speech recognition technology to obtain the voice text to be processed. The voice audio may also be obtained indirectly: for example, in a voice quality inspection scenario, a transit server stores the voice calls between customer service personnel and customers, the calls are obtained from the transit server, and they are recognized by the speech recognition technology to obtain the voice text to be processed.
Step S102, performing word segmentation processing on the text to be processed to obtain a phrase to be recognized.
The text to be processed comprises at least one phrase to be recognized.
In a specific application, a word segmentation method based on character string matching can be used to segment the text to be processed into phrases to be recognized. The method matches substrings of the sentence to be segmented against the entries of a corpus (lexicon) according to a given scanning mode and returns the matching results. According to the scanning mode, the matching algorithms can be divided into the following types:
1. Forward maximum matching method (from left to right)
2. Reverse maximum matching method (from right to left)
3. Minimum segmentation (minimizing the number of words cut out of each sentence)
4. Bidirectional maximum matching method (two scans, one from left to right and one from right to left).
For example:
Corpus (lexicon): [we, I, today, especially, especially want, want, eat, mango]
The sentence to be segmented is "I especially want to eat mango today".
The forward maximum matching algorithm is implemented as follows. First, a maximum entry length (the length of the sliding window) is given, here max_num = 3. The first 3 characters of the sentence are taken and looked up in the lexicon. If they form a lexicon entry, they are returned as the first word and the sliding window advances by 3 positions. If not, the window is shrunk by 1 from the right and the first 2 characters are looked up; if they form an entry, the word is returned and the window advances by 2 positions; otherwise the window keeps shrinking. This is repeated until the whole sentence has been traversed, which gives the final word segmentation result. In the example, the window first covers the first three characters of the sentence ("I today"), which are not in the lexicon. The window is shrunk by 1, and the remaining two characters are still not in the lexicon. The window is shrunk by 1 again; the single character "I" is in the lexicon, so it is returned as the first word and the window advances by 1 position. The window now starts at "today", and the same process is repeated until the whole sentence has been traversed, which yields the final word segmentation of "I especially want to eat mango today".
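The forward maximum matching procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function name `forward_max_match` is hypothetical, and the Chinese strings are an assumed back-translation of the example lexicon and sentence (they do not appear in this text). Characters that match no lexicon entry are emitted as single-character words, which is also an assumption.

```python
def forward_max_match(text, lexicon, max_num=3):
    """Greedy left-to-right segmentation with a shrinking window."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest window first, then shrink it from the right.
        for size in range(min(max_num, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Assumption: an unmatched single character becomes its own word.
            if candidate in lexicon or size == 1:
                words.append(candidate)
                pos += size  # advance the window past the matched word
                break
    return words


# Assumed back-translation of: we, I, today, especially, especially want,
# want, eat, mango / "I especially want to eat mango today".
lexicon = {"我们", "我", "今天", "特别", "特别想", "想", "吃", "芒果"}
sentence = "我今天特别想吃芒果"
print(forward_max_match(sentence, lexicon))
# ['我', '今天', '特别想', '吃', '芒果']
```

With max_num = 3 the window shrinks twice before matching the first word, as in the walk-through above.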
The reverse maximum matching algorithm is implemented in exactly the opposite way to the forward maximum matching algorithm: the sliding window moves from the end of the sentence towards the beginning.
The minimum segmentation algorithm is implemented so that the number of words cut out of each sentence is minimized. (Dynamic programming can be used here, dividing the task into local sub-problems.)
The bidirectional maximum matching algorithm is implemented as follows. In most cases forward and reverse matching give the same result, but there are cases where they differ. Bidirectional maximum matching performs both forward and reverse maximum matching and then compares the two results. If the two results are consistent, no ambiguity is considered to exist; if they are not, the ambiguous field has to be located and processed. The advantage of this method is that it improves word segmentation accuracy and eliminates part of the ambiguity. The disadvantage is that the algorithm requires scanning in both directions, which increases the time complexity.
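The reverse and bidirectional variants can be sketched along the same lines. In this hedged example the names `max_match` and `bidirectional_max_match` are hypothetical, the Chinese strings are an assumed back-translation of the example, and the handling of a disagreement is left to the caller because the text does not specify how the ambiguous field is processed.

```python
def max_match(text, lexicon, max_num=3, reverse=False):
    """Maximum matching, scanning left to right or (reverse=True) right to left."""
    words = []
    start, end = 0, len(text)
    while start < end:
        for size in range(min(max_num, end - start), 0, -1):
            candidate = text[end - size:end] if reverse else text[start:start + size]
            # Assumption: an unmatched single character becomes its own word.
            if candidate in lexicon or size == 1:
                words.append(candidate)
                if reverse:
                    end -= size
                else:
                    start += size
                break
    # Reverse scanning collects words from the end, so restore sentence order.
    return words[::-1] if reverse else words


def bidirectional_max_match(text, lexicon, max_num=3):
    """Return both segmentations and a flag telling whether they disagree."""
    forward = max_match(text, lexicon, max_num)
    backward = max_match(text, lexicon, max_num, reverse=True)
    return forward, backward, forward != backward


lexicon = {"我们", "我", "今天", "特别", "特别想", "想", "吃", "芒果"}
forward, backward, ambiguous = bidirectional_max_match("我今天特别想吃芒果", lexicon)
print(forward, backward, ambiguous)
```

For the example sentence both scans agree, so the flag is False; a caller would only need further disambiguation when it is True.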
The maximum matching algorithm is in fact a greedy algorithm: at each step it considers only a locally optimal choice rather than the problem as a whole. Whether the locally optimal solution found by a greedy algorithm is also globally optimal depends on the specific scenario. Dynamic programming, by contrast, yields a globally optimal solution: it also solves a local sub-problem at each step, but the optimum of each sub-problem feeds into the others, so the problem is solved more cleverly, the complexity can be reduced and the calculation efficiency is improved. How to choose max_num for the maximum matching algorithm is also a question. Generally, the larger the value, the better the word segmentation, but the time complexity grows in proportion to it; with max_num = 100, for example, each local match needs up to 100 iterations in the worst case. In some application scenarios, however, the longest match is not the best one, and preferring it gives a poor segmentation. The maximum matching algorithm relies only on matching and does not consider semantics, so it does not resolve ambiguity well.
In another specific application, a word segmentation method based on statistics may be used to segment the text to be processed into phrases to be recognized. Given a large amount of already segmented text, a statistics-based method uses a statistical machine learning model to learn the rules of word segmentation (called training) and then segments unseen text. Examples are the maximum probability word segmentation method and the maximum entropy word segmentation method. Statistics-based word segmentation can take semantics into account to a certain extent in order to obtain the best segmentation. The main statistical models are the N-gram model, the Hidden Markov Model (HMM), the maximum entropy model (ME) and the Conditional Random Field model (CRF). The process is generally divided into two steps: 1. finding all word segmentation results of the sentence; 2. finding the best one among all the word segmentation results.
Given a lexicon and a sentence, finding all word segmentation results is essentially the same task as LeetCode problem 140 and can be implemented recursively. Once all word segmentation results have been obtained, the best one has to be chosen, and a language model can be used to judge the quality of a segmentation. Specific models include unigram, bigram and so on, generally not going beyond trigram. Assuming that a unigram model is used and that the weight of each word in the lexicon is known (its probability, usually estimated from its frequency of occurrence), the score of each word segmentation result is the product of the probabilities of its words.
For example, the lexicon above is corpus = ['we', 'I', 'today', 'especially', 'especially want', 'eat', 'mango', 'want'],
and the probability of each word in the lexicon is p = {'we': 0.2, 'I': 0.2, 'today': 0.05, 'especially': 0.04, 'especially want': 0.01, 'eat': 0.05, 'mango': 0.4, 'want': 0.1} (the probabilities should sum to 1).
According to the unigram model:
p(I, today, especially, want, eat, mango) = p(I) p(today) p(especially) p(want) p(eat) p(mango) = 0.2 × 0.05 × 0.04 × 0.1 × 0.05 × 0.4 = 0.0000008
p(I, today, especially want, eat, mango) = p(I) p(today) p(especially want) p(eat) p(mango) = 0.2 × 0.05 × 0.01 × 0.05 × 0.4 = 0.000002
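The two steps (enumerating every segmentation recursively, then scoring each one with a unigram model) can be sketched as follows. This is an illustrative sketch only: `all_segmentations` and `unigram_score` are hypothetical names, and the Chinese keys are an assumed back-translation of the example lexicon, paired with the probabilities listed above.

```python
from functools import lru_cache
from math import prod


def all_segmentations(text, lexicon):
    """Recursively enumerate every way to split text into lexicon words."""

    @lru_cache(maxsize=None)
    def split_from(pos):
        if pos == len(text):
            return [[]]  # one way to segment the empty remainder
        results = []
        for end in range(pos + 1, len(text) + 1):
            word = text[pos:end]
            if word in lexicon:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


def unigram_score(words, probs):
    """Score of a segmentation: product of its word probabilities."""
    return prod(probs[w] for w in words)


# Assumed back-translation of the example lexicon, with its probabilities.
probs = {"我们": 0.2, "我": 0.2, "今天": 0.05, "特别": 0.04,
         "特别想": 0.01, "吃": 0.05, "芒果": 0.4, "想": 0.1}
candidates = all_segmentations("我今天特别想吃芒果", probs.keys())
for words in candidates:
    print(words, unigram_score(words, probs))
best = max(candidates, key=lambda words: unigram_score(words, probs))
```

Only two segmentations exist for this sentence, and the one that keeps "especially want" as a single word gets the higher product.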
Comparing the two scores, the word segmentation result with the higher score is the better one. Since the probability of each word is a value between 0 and 1, the product approaches 0 as more probabilities are multiplied; the longer the sentence, the smaller the score, and for a very long sentence the product underflows, that is, it falls below the smallest value that a floating-point number can represent. To solve this problem, the logarithm of each probability is taken and a negative sign is added, so that the score of a word segmentation result becomes
score(I, today, especially, want, eat, mango) = -[log(p(I)) + log(p(today)) + log(p(especially)) + log(p(want)) + log(p(eat)) + log(p(mango))]
score(I, today, especially want, eat, mango) = -[log(p(I)) + log(p(today)) + log(p(especially want)) + log(p(eat)) + log(p(mango))]
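A short sketch shows why the negative log score is preferred and what the two example scores come to. The function name `neg_log_score` is hypothetical, the Chinese keys are an assumed back-translation of the example lexicon, and a base-10 logarithm is assumed because it matches the rounded weights given for Fig. 4.

```python
import math

# Assumed back-translation of the example lexicon, with its probabilities.
probs = {"我们": 0.2, "我": 0.2, "今天": 0.05, "特别": 0.04,
         "特别想": 0.01, "吃": 0.05, "芒果": 0.4, "想": 0.1}


def neg_log_score(words, probs):
    """Sum of -log10(p) over the words; a lower value is a better segmentation."""
    return -sum(math.log10(probs[w]) for w in words)


score_a = neg_log_score(["我", "今天", "特别", "想", "吃", "芒果"], probs)
score_b = neg_log_score(["我", "今天", "特别想", "吃", "芒果"], probs)
print(round(score_a, 2), round(score_b, 2))  # 6.1 5.7

# A long product of small probabilities underflows to 0.0,
# while the equivalent sum of logarithms stays representable.
long_product = math.prod([0.05] * 300)
long_log_sum = -300 * math.log10(0.05)
print(long_product, round(long_log_sum, 1))  # 0.0 390.3
```

The segmentation with the lower negative log score is the one with the higher probability product, so the ranking of candidates is unchanged.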
Because of the negative sign in front of the logarithm, finding the highest score becomes a minimization problem. Here the lexicon and the sentence are both small, so there are only two candidate word segmentation results. In general, if the lexicon is large and the sentence is long, enumerating every possible segmentation is expensive and produces many results, and the score of every result must then be calculated, with time complexity O(n). An optimized algorithm based on the idea of dynamic programming is therefore used: a directed graph, as shown in Fig. 4, is constructed from the lexicon probabilities of the words in the sentence (using -log(p) as the weight), and the shortest path through it is found.
As can be seen, the corpus in Fig. 4 is corpus = ['we', 'I', 'today', 'especially', 'especially want', 'eat', 'mango', 'want'], the probabilities of its entries are P = [0.2, 0.2, 0.05, 0.04, 0.01, 0.05, 0.4, 0.1], and the corresponding score weights -log(p) (base-10 logarithm, rounded) are [0.7, 0.7, 1.3, 1.4, 2, 1.3, 0.4, 1].
Each edge of this directed graph corresponds to a candidate word, its weight is the score weight -log(p) of that word, and the number of nodes is the number of characters in the sentence plus one. The problem of finding the best word segmentation result thus becomes the problem of finding the shortest path. Because a shortest path is sought, words that are not in the lexicon can be given a large weight, so that segmentations containing out-of-lexicon words are avoided when the shortest path is found.
From Fig. 4, the shortest path from node 1 to node 10 is 1, 2, 4, 7, 8, 10, and the corresponding word segmentation result is ['I', 'today', 'especially want', 'eat', 'mango'].
For such a directed graph, dynamic programming is used to find the shortest path. The idea of dynamic programming is to decompose a problem into sub-problems (sub-states) and to find the shortest path (optimal solution) of each sub-state in turn, working up to the final path. Let the final target be f(10), the minimum total weight of a path from node 1 to node 10; f(9) is the minimum total weight of a path from node 1 to node 9, and so on. The calculation starts from f(1): f(2) is f(1) plus the weight of the edge from node 1 to node 2; f(3) is the smaller of f(2) plus the weight of the edge from node 2 to node 3 and f(1) plus the weight of the edge from node 1 to node 3; and so on, until f(10) is obtained. During the calculation, the nodes on the shortest path from node 1 to each node are recorded, so that the final shortest path can be recovered.
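The shortest-path recurrence can be sketched as a small dynamic program over character boundaries. This is a hedged illustration rather than the applicant's code: `best_segmentation` and `unknown_weight` are hypothetical names, the Chinese keys are an assumed back-translation of the example lexicon, base-10 logarithms are assumed, and the code indexes boundaries from 0 while the text numbers nodes from 1.

```python
import math


def best_segmentation(text, probs, unknown_weight=20.0):
    """Shortest path over boundaries 0..len(text), edge weight -log10(p(word))."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: minimum total weight to reach boundary i
    back = [0] * (n + 1)         # back[i]: previous boundary on that best path
    best[0] = 0.0
    max_len = max(map(len, probs))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in probs:
                weight = -math.log10(probs[word])
            elif end - start == 1:
                weight = unknown_weight  # large weight for an out-of-lexicon character
            else:
                continue
            if best[start] + weight < best[end]:
                best[end] = best[start] + weight
                back[end] = start
    # Follow the recorded predecessors back from the last boundary.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1], best[n]


probs = {"我们": 0.2, "我": 0.2, "今天": 0.05, "特别": 0.04,
         "特别想": 0.01, "吃": 0.05, "芒果": 0.4, "想": 0.1}
words, cost = best_segmentation("我今天特别想吃芒果", probs)
nodes = [1]
for word in words:
    nodes.append(nodes[-1] + len(word))  # 1-based node numbers as in the text
print(words, nodes, round(cost, 2))
```

For the example sentence the recovered node sequence is 1, 2, 4, 7, 8, 10, the same path as the one read off Fig. 4.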
Step S103, calling a preset transliteration name recognition rule.
In a specific application, calling a preset transliteration name recognition rule comprises the following steps:
First, extracting the text attribute of the text to be processed.
The text attribute may refer to the country corresponding to the text to be processed.
In a specific application, the text to be processed can be input into a language detection interface such as Google's langdetect to identify the country corresponding to the text to be processed.
Second, calling the transliteration name database corresponding to the text attribute.
The transliteration name database is constructed from corpora of different text attributes (such as countries).
It can be understood that names from different countries follow different naming rules. For example, a Burmese name consists of a given name only, with no surname; the name has at least one character and can be as long as six or seven characters. When a person is addressed, however, a prefix is placed before the name, such as 'Du' or 'Ma' for women and 'Guo' or 'Wu' for men, and the prefix varies with age, social status and the like, so these rules need to be considered when recognizing Burmese names. When an English name is transliterated, a connector such as "." or "-" may appear in the middle of the name, so the influence of such special characters needs to be considered when recognizing transliterated English names. Corresponding recognition rules are established according to the transliteration characteristics of names from different countries; although the rules cannot recognize every name, they can recognize part of the valid names and reduce the error rate.
Step S104, recognizing the phrase to be recognized according to the transliteration name recognition rule to obtain a transliteration name recognition result.
The transliteration name recognition result comprises the phrases to be recognized that are recognized as transliterated names.
In a specific application, recognizing the phrase to be recognized according to the transliteration name recognition rule to obtain a transliteration name recognition result comprises the following steps:
First, inputting the phrase to be recognized into the transliteration name database to obtain the frequency corresponding to each character in the phrase to be recognized.
Second, obtaining the transliteration name recognition result of the phrase to be recognized according to the frequency corresponding to each character in the phrase to be recognized.
Illustratively, whether the internal solidification degree of the phrase to be recognized meets a first preset condition is determined according to the following formula:
$$\min(p_i) > P_{\text{threshold}}$$
where $p_i$ denotes the frequency of the $i$-th character, $i = 1, 2, \ldots, n$; $\min(p_i)$ denotes the minimum frequency among the characters in the phrase to be recognized, and $P_{\text{threshold}}$ denotes the frequency threshold;
and when the internal solidification degree of the phrase to be recognized meets the first preset condition, the phrase to be recognized is determined to be a transliterated name.
The first preset condition is that the minimum frequency among the characters in the phrase to be recognized is greater than the frequency threshold.
It can be understood that the phrases to be recognized in the sentence are queried in the transliteration name database one by one, and when the internal solidification degree of a phrase is greater than the threshold, the phrase is likely to be a transliterated name.
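The first preset condition can be sketched in Python as follows. The frequency table stands in for the transliteration name database; its contents, the threshold value and the function name `is_transliterated_name` are hypothetical and chosen only for illustration. Characters missing from the table are assumed to have frequency 0.

```python
def is_transliterated_name(phrase, char_freq, threshold):
    """Return True when the minimum character frequency exceeds the threshold.

    char_freq maps a character to its frequency in the (hypothetical)
    transliteration name database; unknown characters count as 0.0.
    """
    if not phrase:
        return False
    min_freq = min(char_freq.get(ch, 0.0) for ch in phrase)
    return min_freq > threshold


# Illustrative frequencies only, not taken from any real corpus.
toy_freq = {"迈": 0.02, "克": 0.05, "德": 0.04, "温": 0.01}
accepted = is_transliterated_name("迈克", toy_freq, threshold=0.005)
rejected = is_transliterated_name("迈的", toy_freq, threshold=0.005)
```

Using the minimum rather than an average means that a single character that rarely or never occurs in transliterated names is enough to reject the phrase.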
In an optional implementation, after the transliterated name recognition result of the phrase to be recognized is obtained according to the frequency corresponding to each character in the phrase to be recognized, the method further includes:
First, extracting the transliterated names in the transliterated name recognition result.
Second, verifying the transliterated names to obtain a verification result.
The verification result comprises a first part-of-speech verification result and a second part-of-speech verification result.
Specifically, verifying the transliterated name to obtain a verification result includes:
1. Inputting the transliterated name into a preset prefix corpus to obtain the left adjacent character corresponding to the transliterated name.
2. Inputting the transliterated name into a preset suffix corpus to obtain the right adjacent character corresponding to the transliterated name.
It can be understood that transliterated names are collected manually from the data in the corpus, and the words or characters immediately before and after each transliterated name are obtained through word segmentation. Taking an English transliterated name as an example: "Ohio Governor Mike·Dewen said that law enforcement agencies remain concerned about possible violent incidents in the next few days and will continue to maintain the alert level at the state building." Here "Mike·Dewen" is the person name, the word preceding it is "Governor", and the word following it is "said". A large number of transliterated names are collected from the corpus data to form a transliterated name corpus; the words or characters preceding the transliterated names form the prefix corpus; and the words or characters following them form the suffix corpus.
3. Performing a first part-of-speech analysis on the left adjacent character to obtain the first part-of-speech verification result.
The first part-of-speech verification result is a noun, a verb or an adverb.
4. Performing a second part-of-speech analysis on the right adjacent character to obtain the second part-of-speech verification result.
The second part-of-speech verification result is a verb or an auxiliary word.
It can be understood that the word preceding a candidate transliterated name may be a noun, a verb, etc., but not an adverb, and the word following it may be a verb, an auxiliary word, etc. Part-of-speech analysis is performed on the data in the prefix corpus and the suffix corpus accordingly.
In a specific application, a rule-based part-of-speech analysis method, a part-of-speech analysis method based on a statistical model, or a part-of-speech tagging method based on deep learning can be adopted.
Third, eliminating from the transliterated name recognition result the transliterated names whose verification results do not meet a second preset condition.
The second preset condition is that the part of speech of the left adjacent character is not an adverb, and the part of speech of the right adjacent character is a verb or an auxiliary word.
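The second preset condition can be expressed as a simple filter, sketched here in Python. The part-of-speech tags are assumed to have been produced already by whichever tagging method is used; the tag strings ("noun", "verb", "adverb", "auxiliary") and the function names are hypothetical.

```python
# Tag names are placeholders; a real tagger would use its own tag set.
REJECTED_LEFT_TAGS = {"adverb"}
ACCEPTED_RIGHT_TAGS = {"verb", "auxiliary"}


def meets_second_condition(left_pos, right_pos):
    """Left neighbour must not be an adverb; right neighbour must be a verb or auxiliary word."""
    return left_pos not in REJECTED_LEFT_TAGS and right_pos in ACCEPTED_RIGHT_TAGS


def filter_candidates(candidates):
    """Keep the names whose neighbours satisfy the second preset condition.

    candidates is an iterable of (name, left_pos, right_pos) tuples.
    """
    return [
        name
        for name, left_pos, right_pos in candidates
        if meets_second_condition(left_pos, right_pos)
    ]


# Illustrative candidates only.
kept = filter_candidates([
    ("迈克·德温", "noun", "verb"),   # kept: noun before, verb after
    ("候选甲", "adverb", "verb"),    # rejected: adverb before
    ("候选乙", "noun", "noun"),      # rejected: noun after
])
```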
In an optional implementation, before the phrase to be recognized is input into the transliteration name database to obtain the frequency corresponding to each character in the phrase to be recognized, the method further includes:
First, recognizing the phrases to be recognized obtained after word segmentation processing according to a preset person name recognition model, and generating a person name recognition result.
The person name recognition result comprises the phrases to be recognized that contain a person name. The person name recognition model may be a classification model, such as an SVM, a maximum entropy model, a Bayesian model, or the like.
It can be understood that the person name recognition model is obtained by training the classification model with person names as the corpus.
Second, taking only the phrases to be recognized that contain a person name according to the person name recognition result as the phrases to be recognized to be input into the transliteration name database.
It can be understood that in the embodiment of the application, the phrases to be recognized that do not contain a person name are removed by the person name recognition model and are not input into the transliteration name database, which mitigates problems such as ambiguity or false extraction of transliterated names in sentences.
In the embodiment of the application, transliterated foreign person names contained in Chinese text are recognized through the preset transliteration name recognition rule, which improves the accuracy of current Chinese named entity recognition.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
The transliteration name recognition apparatus provided in the embodiments of the present application will be described below.
The transliterated name recognition apparatus of the present embodiment corresponds to the transliterated name recognition method described above.
Fig. 2 is a schematic structural diagram of a transliterated name recognition apparatus provided in an embodiment of the present application, where the apparatus may be specifically integrated in a recognition device, and the apparatus may include:
an obtaining module 21, configured to obtain a text to be processed;
the word segmentation processing module 22 is configured to perform word segmentation processing on the text to be processed to obtain a word group to be recognized, where the text to be processed includes at least one word group to be recognized;
the calling module 23 is used for calling a preset transliteration name recognition rule;
and the recognition module 24 is configured to recognize the phrase to be recognized according to the transliteration name recognition rule, so as to obtain a transliteration name recognition result.
In one possible implementation, the calling module includes:
the extraction unit is used for extracting the text attribute of the text to be processed;
and the calling unit is used for calling the transliteration name database corresponding to the text attribute.
In one possible implementation, the recognition module includes:
the recognition unit is used for inputting the phrase to be recognized into the transliteration name database to obtain the frequency corresponding to each character in the phrase to be recognized;
and the generating unit is used for obtaining the transliterated name recognition result of the phrase to be recognized according to the frequency corresponding to each character in the phrase to be recognized.
In one possible implementation, the apparatus further includes:
the generating module is used for recognizing the phrases to be recognized obtained after word segmentation processing according to a preset person name recognition model and generating a person name recognition result; the person name recognition result comprises the phrases to be recognized that contain a person name;
and the screening module is used for taking the phrases to be recognized that contain a person name according to the person name recognition result as the phrases to be recognized to be input into the transliteration name database.
In one possible implementation, the generating unit includes:
a judging subunit, configured to determine whether the internal solidification degree of the phrase to be recognized meets a first preset condition according to the following formula:
$$\min(p_i) > P_{\text{threshold}}$$
where $p_i$ denotes the frequency of the $i$-th character, $i = 1, 2, \ldots, n$; $\min(p_i)$ denotes the minimum frequency among the characters in the phrase to be recognized, and $P_{\text{threshold}}$ denotes the frequency threshold;
and the determining subunit is used for determining the phrase to be recognized as the transliteration name when the internal solidification degree of the phrase to be recognized meets a first preset condition.
In one possible implementation, the apparatus further includes:
the extracting module is used for extracting the transliterated names in the transliterated name recognition result;
the verification module is used for verifying the transliteration name to obtain a verification result;
and the rejecting module is used for rejecting the transliterated name corresponding to the verification result which does not accord with the second preset condition in the transliterated name recognition result.
In a possible implementation manner, the verification result includes a first part-of-speech verification result and a second part-of-speech verification result, and the verification module includes:
the first extraction unit is used for inputting the transliterated name into a preset prefix corpus to obtain a left adjacent character corresponding to the transliterated name;
the second extraction unit is used for inputting the transliterated name into a preset suffix corpus to obtain a right adjacent character corresponding to the transliterated name;
the first analysis unit is used for performing first part-of-speech analysis on the left adjacent character to obtain a first part-of-speech verification result;
and the second analysis unit is used for performing second part-of-speech analysis on the right adjacent character to obtain a second part-of-speech verification result.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Fig. 3 is a schematic diagram of an identification device 3 provided in an embodiment of the present application. As shown in fig. 3, the identification apparatus 3 of this embodiment includes: a processor 30, a memory 31 and a computer program 32 stored in said memory 31 and executable on said processor 30. The steps in the above-described method embodiments are implemented when the computer program 32 is executed by the processor 30. Alternatively, the processor 30 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 32.
Illustratively, the computer program 32 may be partitioned into one or more modules/units that are stored in the memory 31 and executed by the processor 30 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 32 in the identification device 3.
The identification device 3 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The identification device 3 may include, but is not limited to, a processor 30, a memory 31. It will be appreciated by a person skilled in the art that fig. 3 is only an example of the identification device 3 and does not constitute a limitation of the identification device 3 and may comprise more or less components than shown, or combine certain components, or different components, e.g. the identification device 3 may further comprise an input output device, a network access device, a bus, etc.
The processor 30 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 31 may be an internal storage unit of the identification device 3, such as a hard disk or a memory of the identification device 3. The memory 31 may also be an external storage device of the identification device 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) and the like provided on the identification device 3. Further, the memory 31 may also include both an internal storage unit of the identification device 3 and an external storage device. The memory 31 is used for storing the computer program and other programs and data required by the identification device 3. The memory 31 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed terminal device and method may be implemented in other ways. For example, the above-described terminal device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical function division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above can be realized by a computer program, which can be stored in a readable storage medium, where the readable storage medium can be a computer readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments described above can be realized. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content covered by the computer readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A transliterated name recognition method, comprising:
acquiring a text to be processed;
performing word segmentation processing on the text to be processed to obtain a phrase to be recognized, wherein the text to be processed comprises at least one phrase to be recognized;
calling a preset transliteration name recognition rule;
and recognizing the phrase to be recognized according to the transliteration name recognition rule to obtain a transliteration name recognition result, wherein the transliteration name recognition result comprises the phrase to be recognized which is recognized as a transliterated name.
2. The transliteration name recognition method of claim 1, wherein invoking the preset transliteration name recognition rule comprises:
extracting text attributes of the text to be processed;
and calling a transliteration name database corresponding to the text attribute.
3. The transliteration name recognition method of claim 2, wherein recognizing the phrase to be recognized according to the transliteration name recognition rule to obtain a transliteration name recognition result comprises:
inputting the phrase to be recognized into the transliteration name database to obtain the frequency corresponding to each character in the phrase to be recognized;
and obtaining a transliteration name recognition result of the phrase to be recognized according to the frequency corresponding to each character in the phrase to be recognized.
4. The transliteration name recognition method according to claim 2, wherein before the phrase to be recognized is input into the transliteration name database and the frequency corresponding to each word in the phrase to be recognized is obtained, the method further comprises:
recognizing the phrases to be recognized obtained after word segmentation processing according to a preset person name recognition model, and generating a person name recognition result; the person name recognition result comprises the phrases to be recognized that contain a person name;
and taking the phrases to be recognized that contain a person name according to the person name recognition result as the phrases to be recognized to be input into the transliteration name database.
5. The transliteration name recognition method of claim 3, wherein obtaining the transliteration name recognition result of the phrase to be recognized according to the frequency corresponding to each word in the phrase to be recognized comprises:
determining whether the internal solidification degree of the phrase to be recognized meets a first preset condition according to the following formula:
$$\min(p_i) > P_{\text{threshold}}$$
where $p_i$ denotes the frequency of the $i$-th character, $i = 1, 2, \ldots, n$; $\min(p_i)$ denotes the minimum frequency among the characters in the phrase to be recognized, and $P_{\text{threshold}}$ denotes the frequency threshold;
and when the internal solidification degree of the phrase to be recognized meets the first preset condition, determining that the phrase to be recognized is a transliterated name.
6. The transliteration name recognition method of claim 3, wherein after obtaining the transliteration name recognition result of the phrase to be recognized according to the frequency corresponding to each word in the phrase to be recognized, the method further comprises:
extracting the transliterated names in the transliterated name recognition result;
verifying the transliteration name to obtain a verification result;
and eliminating, from the transliterated name recognition result, the transliterated names whose verification results do not meet a second preset condition.
7. The transliterated name recognition method of claim 6, wherein the verification results comprise a first part-of-speech verification result and a second part-of-speech verification result;
verifying the transliteration name to obtain a verification result, wherein the verification result comprises the following steps:
inputting the transliterated name into a preset prefix corpus to obtain a left adjacent character corresponding to the transliterated name;
inputting the transliterated name into a preset suffix corpus to obtain a right adjacent character corresponding to the transliterated name;
performing first part-of-speech analysis on the left adjacent character to obtain a first part-of-speech verification result;
and performing second part-of-speech analysis on the right adjacent character to obtain a second part-of-speech verification result.
8. A transliterated name recognition apparatus, comprising:
the acquisition module is used for acquiring a text to be processed;
the word segmentation processing module is used for carrying out word segmentation processing on the text to be processed to obtain a word group to be recognized, wherein the text to be processed comprises at least one word group to be recognized;
the calling module is used for calling a preset transliteration name recognition rule;
and the recognition module is used for recognizing the phrase to be recognized according to the transliteration name recognition rule to obtain a transliteration name recognition result.
9. Identification device comprising a memory, a processor, a camera means and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the computer program.
10. Readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202110242757.1A 2021-03-05 2021-03-05 Transliteration name recognition method, transliteration name recognition device, recognition equipment and readable storage medium Pending CN112883162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110242757.1A CN112883162A (en) 2021-03-05 2021-03-05 Transliteration name recognition method, transliteration name recognition device, recognition equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110242757.1A CN112883162A (en) 2021-03-05 2021-03-05 Transliteration name recognition method, transliteration name recognition device, recognition equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112883162A true CN112883162A (en) 2021-06-01

Family

ID=76055447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110242757.1A Pending CN112883162A (en) 2021-03-05 2021-03-05 Transliteration name recognition method, transliteration name recognition device, recognition equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112883162A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101145166A (en) * 2007-11-13 2008-03-19 北京中搜在线软件有限公司 Syllable drive based transliterated entity name computer automatic identification method
CN102227724A (en) * 2008-10-10 2011-10-26 谷歌公司 Machine learning for transliteration
CN104657343A (en) * 2013-11-15 2015-05-27 富士通株式会社 Method and device for recognizing transliteration name
CN105070289A (en) * 2015-07-06 2015-11-18 百度在线网络技术(北京)有限公司 English name recognition method and device
CN109446521A (en) * 2018-10-18 2019-03-08 京东方科技集团股份有限公司 Name entity recognition method, device, electronic equipment, machine readable storage medium
US10789410B1 (en) * 2017-06-26 2020-09-29 Amazon Technologies, Inc. Identification of source languages for terms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
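李志刚 (Li Zhigang): "音译外国人名自动识别的研究" (Research on Automatic Recognition of Transliterated Foreign Person Names), 《中国优秀硕士学位论文全文数据库信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) *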

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210601