CN111144102B - Method and device for identifying entity in statement and electronic equipment - Google Patents

Method and device for identifying entity in statement and electronic equipment

Info

Publication number
CN111144102B
Authority
CN
China
Prior art keywords
language
entity
classified
entities
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911373507.0A
Other languages
Chinese (zh)
Other versions
CN111144102A (en)
Inventor
王萌萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201911373507.0A priority Critical patent/CN111144102B/en
Publication of CN111144102A publication Critical patent/CN111144102A/en
Application granted granted Critical
Publication of CN111144102B publication Critical patent/CN111144102B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The present disclosure provides a method for identifying entities in a sentence, comprising: acquiring a sentence to be processed in a first language, the first language being Japanese; extracting candidate entities from the sentence to be processed; matching the candidate entities against a classified entity set in a second language, the second language being English, and screening out candidate entities whose matching degree is higher than a predetermined threshold as entities to be classified; and processing the entities to be classified with a classification model to obtain the categories to which they belong. The classification model is trained on the classified entity set in the second language, and the first language is different from the second language. The disclosure also provides an apparatus and an electronic device for identifying entities in a sentence.

Description

Method and device for identifying entity in statement and electronic equipment
Technical Field
The disclosure relates to a method, a device and an electronic device for identifying entities in sentences.
Background
In traditional machine-learning-based methods, identifying entities in a sentence is treated as a sequence labeling problem: a labeling model is learned from a large-scale corpus and used to label every position of the sentence, so as to determine whether an entity is present at each position. With such a method, however, when entities are to be recognized in a to-be-processed sentence in a first language for which no classified entity set exists, a great deal of time must first be spent collecting and annotating a corpus in the first language before a labeling model can be trained and entity classification performed on its basis.
Disclosure of Invention
One aspect of the present disclosure provides a method for identifying an entity in a sentence, comprising: acquiring a sentence to be processed in a first language, the first language being Japanese; extracting candidate entities from the sentence to be processed; matching the candidate entities against a classified entity set in a second language, the second language being English, and screening out candidate entities whose matching degree is higher than a predetermined threshold as entities to be classified; and processing the entities to be classified with a classification model to obtain the categories to which they belong. The classification model is trained on the classified entity set in the second language, and the first language is different from the second language.
Optionally, extracting candidate entities from the sentence to be processed includes: extracting candidate entities in the first language from the sentence to be processed, and converting the candidate entities in the first language into candidate entities in the second language.
Optionally, extracting the candidate entities in the first language from the sentence to be processed includes: extracting, from the sentence to be processed, continuous character strings that contain neither Japanese kana nor kanji as candidate entities in the first language; and/or extracting continuous katakana from the sentence to be processed as candidate entities in the first language.
Optionally, the classified entity set in the second language includes a plurality of classified entities, each of which carries labeling information characterizing the category to which it belongs.
Optionally, matching the candidate entities with the classified entity set in the second language includes: obtaining a first vector representation of a candidate entity in the second language; obtaining a second vector representation of any classified entity in the classified entity set in the second language; and calculating a similarity between the first vector representation and the second vector representation. On this basis, screening out candidate entities whose matching degree is higher than the predetermined threshold as entities to be classified includes: determining the candidate entity in the second language as an entity to be classified if the calculated similarity is higher than the predetermined threshold.
Optionally, processing the entity to be classified with the classification model to obtain the category to which it belongs includes: inputting the first vector representation of the entity to be classified to the classification model, and determining the category to which the entity to be classified belongs based on the output of the classification model.
Optionally, obtaining the first vector representation of the candidate entity in the second language includes: converting each character of the candidate entity in the second language into a feature value to obtain a character vector for the candidate entity; acquiring feature values of specified features of the candidate entity in the second language; and combining the character vector and the feature values of the specified features into the first vector representation.
Optionally, the specified features include at least one of: whether the candidate entity in the second language includes a word, whether the candidate entity in the second language includes a number, whether the candidate entity in the second language includes a particular symbol, and the length of the candidate entity in the second language.
Another aspect of the present disclosure provides an apparatus for identifying an entity in a sentence, including: the device comprises an acquisition module, an extraction module, a matching module, a screening module and an identification module. The acquisition module is used for acquiring the statements to be processed in a first language, and the first language is Japanese. The extraction module is used for extracting candidate entities from the statement to be processed. The matching module is used for matching the candidate entity with the classified entity set of the second language, wherein the second language is English. The screening module is used for screening out candidate entities with the matching degree higher than a preset threshold value as entities to be classified. The identification module is used for processing the entity to be classified by utilizing the classification model so as to obtain the category to which the entity to be classified belongs. Wherein the classification model is trained based on the classified entity set of the second language.
Another aspect of the present disclosure provides an electronic device including: a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor, when executing the computer program, is configured to: acquire a sentence to be processed in a first language, the first language being Japanese; extract candidate entities from the sentence to be processed; match the candidate entities against a classified entity set in a second language, the second language being English, and screen out candidate entities whose matching degree is higher than a predetermined threshold as entities to be classified; and process the entities to be classified with a classification model to obtain the categories to which they belong. The classification model is trained on the classified entity set in the second language, and the first language is different from the second language.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario of the method and apparatus for identifying entities in statements according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a method for identifying entities in a statement according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flow diagram of a method for identifying entities in a statement according to another embodiment of the present disclosure;
FIG. 4A schematically illustrates an example diagram of a process for identifying entities in a statement according to an embodiment of this disclosure;
FIG. 4B schematically illustrates a schematic diagram of matching a candidate entity with a classified entity set according to an embodiment of the present disclosure;
FIG. 5 schematically shows a block diagram of an apparatus for identifying entities in a statement according to an embodiment of the present disclosure; and
fig. 6 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that these descriptions are illustrative only and are not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system.
Embodiments of the present disclosure provide a method and an apparatus for identifying an entity in a sentence. The method may include an acquisition process, a preprocessing process and an identification process. In the acquisition process, a sentence to be processed in a first language is obtained, where the first language may be Japanese. A preprocessing process is then performed, which can be divided into an extraction process, a matching process and a screening process: candidate entities are extracted from the sentence to be processed in the extraction process, the candidate entities are matched against a classified entity set in a second language in the matching process, and candidate entities whose matching degree is higher than a predetermined threshold are screened out as entities to be classified in the screening process, where the second language is English. The identification process can then be carried out, in which the screened-out entities to be classified are processed with a classification model to obtain the categories to which they belong. The classification model is trained on the classified entity set in the second language, and the first language is different from the second language.
In recent years, deep learning methods based on neural networks have developed rapidly in fields such as computer vision, speech recognition and Natural Language Processing (NLP). One of the key basic tasks of natural language processing is Named Entity Recognition (NER), also called named recognition, which has a very wide range of applications. NER identifies and extracts entities from unstructured input text; an entity may be any special text segment that meets business requirements, including person names, place names, organization names, dates and times, proper nouns, and so on, and further entity types may be recognized according to business requirements, such as product names, model numbers, prices, brand names, software names and operating system names, without limitation here. Entity recognition in a sentence underlies many NLP tasks such as relation extraction, event extraction, knowledge graphs, machine translation and question-answering systems.
Fig. 1 schematically illustrates an application scenario of the method and apparatus for identifying an entity in a sentence according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the application scenario may include a terminal device 101, a network 102 and a server/server cluster 103. Network 102 serves as a medium for providing communication links between terminal devices 101 and server/server clusters 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server/server cluster 103 over network 102 to input questions and receive answers. The terminal device 101 may be various electronic devices having input and output functions, including but not limited to a smart phone, a tablet computer, a laptop portable computer, a desktop computer, and the like.
The server/server cluster 103 may be a server or a server cluster providing various services; such a background management server or server cluster may analyze and process data such as received user questions and feed corresponding answers back to the terminal device. For example, upon receiving a user's question about a computer business, identifying the entities contained in the question, such as model, brand, part, software name and operating system name, helps to better identify the user's intent and thus answer the question more accurately.
It should be noted that the method for identifying entities in statements provided by the embodiments of the present disclosure may be generally performed by the server/server cluster 103. Accordingly, the apparatus for identifying entities in statements provided by the embodiments of the present disclosure may be generally disposed in the server/server cluster 103. Alternatively, the method for identifying the entity in the sentence provided by the embodiment of the present disclosure may also be executed by the terminal device 101. Correspondingly, the apparatus for identifying the entity in the sentence provided by the embodiment of the present disclosure may also be disposed in the terminal device 101. Alternatively, the method for identifying the entity in the sentence provided by the embodiment of the present disclosure may also be performed by a server or a server cluster which is different from the server/server cluster 103 and can communicate with the terminal device 101 and/or the server/server cluster 103. Correspondingly, the apparatus for identifying the entity in the sentence provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server/server cluster 103 and is capable of communicating with the terminal device 101 and/or the server/server cluster 103.
It should be understood that the number of end devices, networks, and server/server clusters in fig. 1 is illustrative only. There may be any number of end devices, networks, and server/server clusters, as desired for implementation.
In traditional machine-learning-based methods, identifying entities in a sentence is treated as a sequence labeling problem: a labeling model is learned from a large-scale corpus and used to label every position of the sentence, so as to determine whether an entity is present at each position. With such a method, however, when entities are to be recognized in a to-be-processed sentence in a first language for which no classified entity set exists, a great deal of time must first be spent collecting and annotating a corpus in the first language before a labeling model can be trained and entity classification performed on its basis.
In another entity recognition scheme, word vectors of different languages are mapped into the same semantic space, and the words of the sentence to be recognized are translated one by one through that space, for example from the first language into the second language; a labeled corpus is then obtained from the labels available in the second language, and entity recognition and classification are performed with the sequence labeling method described above. Although this approach saves manual labeling time, the corpus obtained through word-by-word translation contains large errors, so the accuracy of the final entity recognition and classification is low.
According to an embodiment of the present disclosure, a method for identifying entities in a sentence is provided, which identifies entities in a sentence to be processed and classifies them. The method is illustrated with reference to the figures below. It should be noted that the sequence numbers of the operations in the following methods merely denote the operations for the purpose of description and should not be construed as indicating their execution order. The methods need not be performed in the exact order shown, unless explicitly stated.
FIG. 2 schematically shows a flow chart of a method for identifying entities in a statement according to an embodiment of the disclosure.
As shown in fig. 2, the method may include operations S210 to S250 as follows.
In operation S210, a to-be-processed sentence in a first language is acquired.
Illustratively, the entities to be recognized can be preset according to business requirements, and the first language may be a language with a specific expression pattern for the preset entities. For example, if the preset entities to be recognized are technology-related vocabulary, Japanese can serve as the first language in this example, because most technology-related vocabulary in Japanese consists of loanwords with a specific expression pattern.
Then, in operation S220, candidate entities are extracted from the sentence to be processed.
Illustratively, a candidate entity is a text segment extracted from the sentence to be processed that, according to the expression rules of the first language for the preset entities, has a high likelihood of being an entity to be recognized. Candidate entities still require further screening in subsequent operations.
Next, in operation S230, the candidate entities are matched with the classified entity set in the second language.
The second language is different from the first language. Illustratively, the second language may be a language whose use in the field of natural language processing is more mature, so that a large number of related data sets, labeling information and the like can be used directly, with no time spent on data collection and annotation. For example, the second language may be English. The classified entity set in the second language includes a plurality of classified entities in the second language. Operation S230 matches the extracted candidate entities against the classified entity set in the second language to obtain a matching degree between each candidate entity and the classified entity set.
Next, in operation S240, candidate entities with matching degrees higher than a predetermined threshold are screened out as entities to be classified.
When the matching degree between a candidate entity and the classified entity set in the second language is higher than the predetermined threshold, the candidate entity matches the characteristics of a classified entity in the second language; it can therefore be taken as an entity to be classified and classified with a classification model for the second language.
In operation S250, the entity to be classified is processed by using the classification model to obtain a category to which the entity to be classified belongs.
The classification model is a model trained for the entity classification task in the second language; illustratively, it is trained on the classified entity set in the second language. Because the second language is relatively mature in the field of natural language processing, well-trained classification models for it already exist and can simply be obtained and used directly, without spending time and labor on training one.
Those skilled in the art will understand that the method for identifying entities in a sentence according to the embodiments of the present disclosure extracts candidate entities from a sentence to be processed in the first language, maps the candidate entities in the first language and the classified entities in the second language into the same feature space through the matching process against the classified entity set in the second language, and screens out, in that feature space, the entities to be classified that match classified entities in the second language. Entity classification can therefore be performed directly with an existing classification model for second-language entities, yielding the category information of each entity in the first-language sentence to be processed. The embodiments of the present disclosure thereby realize cross-language entity recognition without corpus labeling, word-by-word translation or model training; the scheme avoids the large error introduced by word-by-word translation while largely guaranteeing both the efficiency and the accuracy of entity recognition.
According to an embodiment of the present disclosure, the process of extracting candidate entities from the sentence to be processed in the first language may include: extracting candidate entities in the first language from the sentence to be processed, and converting the candidate entities in the first language into candidate entities in the second language. The extracted candidate entities have a relatively clear correspondence between the first language and the second language, so converting first-language candidate entities into second-language candidate entities is relatively accurate.
For example, the first language is Japanese and the second language is English, and the preset entities to be recognized are technology-related entities such as mobile phone models, company names, software names and operating system names. Technology-related entities in Japanese are mostly loanwords from English, usually written as katakana or as continuous character strings that contain neither Japanese kana nor kanji. Candidate entities in the first language may therefore be extracted from the sentence to be processed in at least one of the following ways: (1) extracting, from the Japanese sentence to be processed, continuous character strings that contain neither Japanese kana nor kanji as Japanese candidate entities; (2) extracting continuous katakana from the Japanese sentence to be processed as Japanese candidate entities. Katakana terms usually have a clearly corresponding English vocabulary item, so the extracted katakana can be converted into English with high accuracy, while continuous character strings containing neither Japanese kana nor kanji need no conversion. English candidate entities are thereby obtained.
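To make the two extraction rules concrete, a minimal Python sketch is given below. It is only an illustration under assumptions: the regular expressions, the helper name extract_candidates, and the katakana_to_english dictionary (standing in for an English-Japanese dictionary resource) are not taken from the disclosure.

    import re

    # Rule (1): continuous strings containing neither kana nor kanji (e.g. "Windows10", "realtek").
    NON_KANA_KANJI_RUN = re.compile(r"[A-Za-z0-9][A-Za-z0-9.\-]*")
    # Rule (2): continuous katakana runs, including the long-vowel mark (e.g. "レノボ", "フリーズ").
    KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")

    def extract_candidates(sentence: str, katakana_to_english: dict) -> list:
        """Extract candidate entities from a Japanese sentence and map them to English."""
        candidates = list(NON_KANA_KANJI_RUN.findall(sentence))   # rule (1): no conversion needed
        for kana in KATAKANA_RUN.findall(sentence):               # rule (2): dictionary lookup
            english = katakana_to_english.get(kana)
            if english:
                candidates.append(english)
        return candidates

For the example sentence used later in this description, such a sketch would yield candidates like "Windows10", "realtek", "freeze", "lenovo", "page", "audio driver", "download" and "install", assuming the dictionary covers the corresponding katakana terms.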
Fig. 3 schematically shows a flowchart of a method for identifying an entity in a sentence according to another embodiment of the present disclosure, for illustrating the process of matching the candidate entity with the classified entity set of the second language in operation S230 described above.
As shown in fig. 3, the method may include operations S231 to S233 as follows.
In operation S231, a first vector representation of a candidate entity in a second language is obtained.
Illustratively, obtaining the first vector representation of the candidate entity in the second language may include: converting each character of the candidate entity in the second language into a feature value to obtain a character vector for the candidate entity; acquiring feature values of specified features of the candidate entity in the second language; and combining the character vector and the feature values of the specified features into the first vector representation. For example, the specified features include at least one of: (1) whether the candidate entity in the second language includes a word, with a feature value of 1 if so and 0 otherwise; (2) whether the candidate entity in the second language includes a number, with a feature value of 1 if so and 0 otherwise; (3) whether the candidate entity in the second language includes a particular symbol, with a feature value of 1 if so and 0 otherwise; (4) the length of the candidate entity in the second language, whose feature value may be, for example, the number of characters. These feature values are only examples and may be set as needed; they are not limited here.
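A minimal sketch of such a first vector representation follows, assuming characters are encoded by their Unicode code points and the character vector is padded or truncated to a fixed length; the padding length, the encoding and the set of "particular symbols" are assumptions chosen for illustration.

    import numpy as np

    MAX_CHARS = 20                       # assumed fixed length of the character vector
    SPECIAL_SYMBOLS = set("-_./")        # assumed set of particular symbols

    def entity_vector(entity: str) -> np.ndarray:
        """Combine a per-character vector with the feature values of the specified features."""
        chars = [float(ord(c)) for c in entity[:MAX_CHARS]]
        chars += [0.0] * (MAX_CHARS - len(chars))                 # pad to the fixed length

        has_word = 1.0 if any(c.isalpha() for c in entity) else 0.0
        has_number = 1.0 if any(c.isdigit() for c in entity) else 0.0
        has_symbol = 1.0 if any(c in SPECIAL_SYMBOLS for c in entity) else 0.0
        length = float(len(entity))

        return np.array(chars + [has_word, has_number, has_symbol, length], dtype=np.float32)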
Then, in operation S232, a second vector representation of any one of the classified entities in the set of classified entities in the second language is obtained.
The process of obtaining the second vector representation of a classified entity in the second language is the same as the process of obtaining the first vector representation of a candidate entity in the second language, and is not repeated here. At this point, the candidate entities originating from the first language and the classified entities in the second language have been mapped into the same feature space.
Next, in operation S233, a similarity between the first vector representation and the second vector representation is calculated.
The similarity between the first vector representation and the second vector representation can be calculated in various ways, for example by computing the Euclidean distance between them: the smaller the Euclidean distance between the first vector representation and the second vector representation, the higher the similarity, indicating that their features are closer. It can be understood that in the above matching process the candidate entities and the classified entities, which originally belong to different languages, are mapped into the same feature space, and the matching relationship between them is determined by the distance between their vector representations in that space, so that the entities to be classified originating from the first language can be screened out.
On this basis, according to an embodiment of the present disclosure, the process of screening out candidate entities whose matching degree is higher than the predetermined threshold as entities to be classified may be illustrated as follows. Suppose the classified entity set in the second language includes 10 classified entities; for a candidate entity, the similarity between the candidate entity and each of the 10 classified entities is calculated according to the matching method shown in FIG. 3. If the similarity between the candidate entity and at least one classified entity is higher than the predetermined threshold, the candidate entity in the second language is determined to be an entity to be classified.
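A sketch of this matching and screening step is shown below, reusing the entity_vector helper sketched above. The conversion of Euclidean distance into a similarity in (0, 1] is an assumption made so that a single threshold can be applied; the disclosure only requires that smaller distances correspond to higher similarity.

    import numpy as np

    def similarity(first_vec: np.ndarray, second_vec: np.ndarray) -> float:
        """Map Euclidean distance to a similarity score: smaller distance, higher similarity."""
        distance = float(np.linalg.norm(first_vec - second_vec))
        return 1.0 / (1.0 + distance)

    def screen_candidates(candidate_vecs: dict, classified_vecs: list, threshold: float = 0.5) -> list:
        """Keep a candidate if it matches at least one classified entity above the threshold."""
        to_classify = []
        for candidate, first_vec in candidate_vecs.items():
            if any(similarity(first_vec, second_vec) >= threshold for second_vec in classified_vecs):
                to_classify.append(candidate)
        return to_classify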
According to an embodiment of the present disclosure, the classified entity set in the second language includes a plurality of classified entities, each of which carries labeling information characterizing the category to which it belongs.
The classification model is trained on the plurality of classified entities carrying labeling information in the classified entity set. According to an embodiment of the present disclosure, processing the entity to be classified with the classification model to obtain the category to which it belongs includes: inputting the first vector representation of the entity to be classified to the classification model, and determining the category to which the entity to be classified belongs based on the output of the classification model.
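The disclosure does not fix a particular model architecture; as a stand-in for the classification model, the sketch below trains an off-the-shelf scikit-learn classifier on the labeled English entity set, reusing the entity_vector helper sketched above, and reads the category off its output.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_classifier(labeled_entities: list) -> LogisticRegression:
        """labeled_entities: list of (entity, category) pairs from the classified entity set."""
        X = np.stack([entity_vector(entity) for entity, _ in labeled_entities])
        y = [category for _, category in labeled_entities]
        return LogisticRegression(max_iter=1000).fit(X, y)

    def classify(model: LogisticRegression, entity_to_classify: str) -> str:
        """Input the first vector representation to the model and return the predicted category."""
        return model.predict(entity_vector(entity_to_classify).reshape(1, -1))[0]

For instance, a model trained on pairs such as ("lenovo", "brand") and ("windows 10", "operating system") (hypothetical labels) would be expected to return "brand" for the entity "lenovo".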
Referring to fig. 4A and 4B, a method for identifying an entity in a sentence according to an embodiment of the present disclosure is illustrated with reference to a specific example.
FIG. 4A schematically illustrates an example schematic diagram of a process for identifying entities in a statement according to an embodiment of this disclosure.
As shown in FIG. 4A, the method for identifying entities in a sentence according to an embodiment of the present disclosure relies on the preprocessing stage 410 shown in the dashed box. The preprocessing stage 410 may be carried out by various participants and various devices during development in the natural language field, and is not limited here. The method for identifying entities in a sentence according to an embodiment of the present disclosure obtains, from the preprocessing stage 410, a set of classified entities in English, the second vector representations of the classified entities, and the classification model.
In the preprocessing stage 410, an English entity set 412, i.e. the classified entity set described above, is obtained from the English corpus 411. In operation S413, word vector representations are trained on the English entity set 412, i.e. the second vector representations are obtained. In operation S414, an entity classification model is trained with the English entity set 412, i.e. the classification model described above is obtained.
On this basis, the method for identifying entities in a sentence according to the embodiments of the present disclosure can be carried out. In operation S421, candidate entities are acquired and converted into English candidate entities with the English-Japanese dictionary 420. Operation S422 then determines a vector representation of each candidate entity, which may be called the entity representation. In operation S423, entity screening is performed by matching the entity representations with the second vector representations to obtain the entities to be classified. In operation S424, the entities to be classified are classified with the classification model, which completes the process.
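Combining the stages of FIG. 4A, a compact end-to-end sketch might look as follows, reusing the helpers sketched above (extract_candidates, entity_vector, screen_candidates, train_classifier, classify); the data structures for the English entity set and the English-Japanese dictionary are assumptions for illustration.

    def identify_entities(sentence: str,
                          english_entities: dict,          # {entity: category}, from the English corpus
                          katakana_to_english: dict) -> dict:
        # Preprocessing stage 410: second vector representations and classification model.
        classified_vecs = [entity_vector(e) for e in english_entities]
        model = train_classifier(list(english_entities.items()))

        # S421-S422: extract candidates and build their first vector representations.
        candidates = extract_candidates(sentence, katakana_to_english)
        candidate_vecs = {c: entity_vector(c) for c in candidates}

        # S423: screen the candidates against the classified entity set.
        to_classify = screen_candidates(candidate_vecs, classified_vecs)

        # S424: classify the screened entities.
        return {entity: classify(model, entity) for entity in to_classify}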
For example, the sentence to be processed is as follows:
"" Zi フリ - ズしてしまいます "" Windows10 でこ -P なことは beginning めてです "" レノボ - プ - ジ Li ろ dealtek is one ディ - ドライバをダウン one ドしてインスト - ルしました -DI improved しませ - でした.
From this sentence to be processed, continuous katakana runs and continuous character strings containing neither Japanese kana nor kanji are extracted as candidate entities, and the candidate entities written in katakana are translated into English. The resulting candidate entities are as follows:
a) candidate entity: フリーズ, corresponding English candidate entity: freeze;
b) candidate entity: Windows10, corresponding English candidate entity: Windows 10;
c) candidate entity: レノボ, corresponding English candidate entity: lenovo;
d) candidate entity: ページ, corresponding English candidate entity: page;
e) candidate entity: realtek, corresponding English candidate entity: realtek;
f) candidate entity: オーディオドライバ, corresponding English candidate entity: audio driver;
g) candidate entity: ダウンロード, corresponding English candidate entity: download;
h) candidate entity: インストール, corresponding English candidate entity: install.
Vector representations are then generated for the candidate entities to obtain a first vector representation for each candidate entity. A second vector representation of each classified entity in the English entity set can be obtained in the same way. Similarity is then calculated, one by one, between the first vector representations of the candidate entities and the second vector representations of the classified entities obtained from the English corpus.
Fig. 4B schematically illustrates a schematic diagram of matching a candidate entity with a classified entity set according to an embodiment of the present disclosure.
As shown in FIG. 4B, the English entity set includes: entity 1, entity 2, entity 3, entity 4, and so on. Taking the candidate entity "lenovo" as an example, the similarity score between "lenovo" and entity 1 is calculated to be 0.9, the score between "lenovo" and entity 2 is 0.3, the score between "lenovo" and entity 3 is 0.2, the score between "lenovo" and entity 4 is 0.1, and so on for the remaining entities. The predetermined threshold is set to 0.5. If the similarity score between any entity in the English entity set and the candidate entity "lenovo" is greater than the predetermined threshold, the candidate entity is retained as an entity to be classified; if none of the similarity scores between the entities in the English entity set and the candidate entity "lenovo" is greater than the predetermined threshold, the candidate entity is discarded and not used as an entity to be classified later on. In this example, the similarity score between "lenovo" and entity 1 is 0.9, so "lenovo" is determined to be an entity to be classified.
The entities to be classified may then be classified. The classification model in this example is a specified classifier. Taking the entity to be classified "lenovo" as an example, its first vector representation is taken as the input of the specified classifier, and the classification result, i.e. the entity label of "lenovo", is obtained.
For example, the specified classifier outputs the entity label brand for "lenovo" (a label indicating that the entity belongs to the brand category). Therefore, the classification result of the entity "レノボ" in the sentence to be processed, which corresponds to "lenovo", is also brand.
Fig. 5 schematically shows a block diagram of an apparatus for identifying entities in a sentence according to an embodiment of the present disclosure.
As shown in fig. 5, the apparatus 500 for identifying an entity in a sentence includes: an acquisition module 510, an extraction module 520, a matching module 530, a screening module 540, and an identification module 550.
The obtaining module 510 is configured to obtain a sentence to be processed in a first language, where the first language is Japanese.
The extraction module 520 is used for extracting candidate entities from the sentence to be processed.
The matching module 530 is used to match the candidate entities with the classified entity set of the second language.
The screening module 540 is configured to screen out the candidate entities with matching degrees higher than a predetermined threshold as entities to be classified.
The recognition module 550 is configured to process the entity to be classified by using the classification model to obtain the category to which the entity to be classified belongs. Wherein the classification model is trained based on the classified entity set of the second language.
It should be noted that the implementation, solved technical problems, implemented functions, and achieved technical effects of each module/unit/subunit and the like in the apparatus part embodiment are respectively the same as or similar to the implementation, solved technical problems, implemented functions, and achieved technical effects of each corresponding step in the method part embodiment, and are not described herein again.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be implemented at least partly as a computer program module, which when executed, may perform a corresponding function.
For example, any of the obtaining module 510, the extracting module 520, the matching module 530, the screening module 540, and the identifying module 550 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 510, the extracting module 520, the matching module 530, the screening module 540, and the identifying module 550 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of three implementations of software, hardware, and firmware, or in a suitable combination of any several of them. Alternatively, at least one of the obtaining module 510, the extracting module 520, the matching module 530, the screening module 540 and the identifying module 550 may be at least partially implemented as a computer program module, which when executed may perform a corresponding function.
Fig. 6 schematically shows a block diagram of an electronic device adapted to implement the above described method according to an embodiment of the present disclosure. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 includes a processor 610 and a computer-readable storage medium 620. The electronic device 600 may perform a method according to an embodiment of the present disclosure.
In particular, the processor 610 may comprise, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 610 may also include onboard memory for caching purposes. The processor 610 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
Computer-readable storage medium 620, for example, may be a non-volatile computer-readable storage medium, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and so on.
The computer-readable storage medium 620 may include a computer program 621, which computer program 621 may include code/computer-executable instructions that, when executed by the processor 610, cause the processor 610 to perform a method according to an embodiment of the disclosure, or any variation thereof.
The computer program 621 may be configured with, for example, computer program code comprising computer program modules. For example, in an example embodiment, code in the computer program 621 may include one or more program modules, for example modules 621A, 621B, and so on. It should be noted that the division and number of the modules are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation, so that when these program modules are executed by the processor 610, the processor 610 can carry out the method according to the embodiments of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, at least one of the obtaining module 510, the extracting module 520, the matching module 530, the screening module 540 and the identifying module 550 may be implemented as a computer program module as described with reference to fig. 6, which, when executed by the processor 610, may implement the method described above.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
While the disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims and their equivalents. Accordingly, the scope of the present disclosure should not be limited to the above-described embodiments, but should be defined not only by the appended claims, but also by equivalents thereof.

Claims (9)

1. A method for identifying entities in a statement, comprising:
acquiring a sentence to be processed in a first language, wherein the first language is Japanese;
extracting candidate entities from the sentence to be processed, including: extracting candidate entities of a first language from the sentence to be processed; and converting the candidate entities in the first language to candidate entities in a second language;
matching the candidate entities with a classified set of entities in a second language, the second language being English;
screening out candidate entities with matching degree higher than a preset threshold value as entities to be classified; and
processing the entity to be classified by using a classification model to obtain the category of the entity to be classified, wherein the classification model is obtained by training based on the classified entity set of the second language.
2. The method of claim 1, wherein,
the extracting candidate entities in a first language from the sentence to be processed comprises:
extracting continuous character strings which do not contain Japanese kana and Chinese characters from the sentence to be processed as candidate entities of the first language; and/or
extracting continuous katakana from the sentence to be processed as candidate entities of the first language.
3. The method of claim 1, wherein the set of classified entities in the second language comprises a plurality of classified entities, any one of the plurality of classified entities bearing labeling information characterizing a category to which the any one classified entity belongs.
4. The method of claim 3, wherein said matching the candidate entity with the classified set of entities in the second language comprises:
obtaining a first vector representation of a candidate entity in the second language;
obtaining a second vector representation of any classified entity in the set of classified entities in the second language; and
calculating a similarity between the first vector representation and the second vector representation;
screening out candidate entities with the matching degree higher than a preset threshold value as entities to be classified comprises: determining the candidate entity of the second language as the entity to be classified if the similarity is higher than the preset threshold value.
5. The method of claim 4, wherein the processing the entity to be classified by using the classification model to obtain the class to which the entity to be classified belongs comprises:
inputting a first vector representation of the entity to be classified to the classification model; and
determining a category to which the entity to be classified belongs based on an output of the classification model.
6. The method of claim 4, wherein the obtaining the first vector representation of the candidate entity in the second language comprises:
converting any character in the candidate entity of the second language into a characteristic value to obtain a character vector aiming at the candidate entity;
acquiring a characteristic value of the designated characteristic in the candidate entity of the second language; and
combining the character vector and the feature value for the specified feature into the first vector representation.
7. The method of claim 6, wherein the specified characteristics include at least one of:
whether a word is included in a candidate entity in the second language, whether a number is included in a candidate entity in the second language, whether a particular symbol is included in a candidate entity in the second language, and a length of a candidate entity in the second language.
8. An apparatus for identifying an entity in a sentence, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a sentence to be processed in a first language, and the first language is Japanese;
an extracting module, configured to extract candidate entities from the to-be-processed sentence, including: extracting candidate entities of a first language from the sentence to be processed; and converting the candidate entities in the first language to candidate entities in a second language; a matching module for matching the candidate entities with a classified entity set of a second language, the second language being English;
the screening module is used for screening out candidate entities with the matching degree higher than a preset threshold value as entities to be classified; and
the recognition module is used for processing the entity to be classified by utilizing a classification model to obtain the category of the entity to be classified, wherein the classification model is obtained by training based on the classified entity set of the second language.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor;
the processor, when executing the computer program, is configured to:
acquiring a sentence to be processed in a first language, wherein the first language is Japanese;
extracting candidate entities from the sentence to be processed, including: extracting candidate entities of a first language from the sentence to be processed; and converting the candidate entities in the first language to candidate entities in a second language;
matching the candidate entities with a classified set of entities in a second language, the second language being English;
screening out candidate entities with a matching degree higher than a preset threshold as entities to be classified; and
processing the entity to be classified by using a classification model to obtain the category to which the entity to be classified belongs, wherein the classification model is trained based on the classified entity set in the second language.
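The apparatus of claim 8 and the electronic device of claim 9 wire the same steps together. Purely as an assumed end-to-end sketch, the pipeline below injects the candidate extractor, the Japanese-to-English converter, the vectorizer, and the classifier as callables, because the claims do not name concrete implementations of those components; the similarity helper mirrors the one sketched after claim 4.

import numpy as np

def _cosine(v1, v2):
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0

def recognize_entities(sentence_ja, extract, translate, vectorize, classify,
                       classified_vectors, threshold=0.8):
    # Acquire -> extract (first language) -> convert (second language) ->
    # match -> screen -> classify; every callable is an injected placeholder.
    candidates_ja = extract(sentence_ja)                    # candidates in the first language
    candidates_en = [translate(c) for c in candidates_ja]   # candidates in the second language
    categories = {}
    for candidate in candidates_en:
        first_vec = vectorize(candidate)
        best_match = max(_cosine(first_vec, v) for v in classified_vectors.values())
        if best_match > threshold:                          # screening module
            categories[candidate] = classify(first_vec)     # recognition module
    return categories

Keeping the extractor and translator as injected callables leaves the sketch agnostic to whichever candidate-extraction and machine-translation components an implementation would actually use.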
CN201911373507.0A 2019-12-26 2019-12-26 Method and device for identifying entity in statement and electronic equipment Active CN111144102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911373507.0A CN111144102B (en) 2019-12-26 2019-12-26 Method and device for identifying entity in statement and electronic equipment

Publications (2)

Publication Number Publication Date
CN111144102A CN111144102A (en) 2020-05-12
CN111144102B true CN111144102B (en) 2022-05-31

Family

ID=70521244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911373507.0A Active CN111144102B (en) 2019-12-26 2019-12-26 Method and device for identifying entity in statement and electronic equipment

Country Status (1)

Country Link
CN (1) CN111144102B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761922A (en) * 2020-06-05 2021-12-07 北京金山数字娱乐科技有限公司 Word processing method and device based on multitask model
CN111813942B (en) * 2020-07-23 2022-07-12 思必驰科技股份有限公司 Entity classification method and device
CN112765977B (en) * 2021-01-11 2023-12-12 百果园技术(新加坡)有限公司 Word segmentation method and device based on cross-language data enhancement
CN113688243B (en) * 2021-08-31 2024-02-13 中国平安人寿保险股份有限公司 Method, device, equipment and storage medium for labeling entities in sentences

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138684B (en) * 2015-09-15 2018-12-14 联想(北京)有限公司 A kind of information processing method and information processing unit
US11238363B2 (en) * 2017-04-27 2022-02-01 Accenture Global Solutions Limited Entity classification based on machine learning techniques

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073707A (en) * 2010-12-22 2011-05-25 百度在线网络技术(北京)有限公司 Method and device for identifying short text category information in real time, and computer equipment
CN103559181A (en) * 2013-11-14 2014-02-05 苏州大学 Establishment method and system for bilingual semantic relation classification model
CN104900230A (en) * 2014-03-03 2015-09-09 联想(北京)有限公司 Information processing method and electronic equipment
CN105354199A (en) * 2014-08-20 2016-02-24 北京羽扇智信息科技有限公司 Scene information based entity meaning identification method and system
CN104536953A (en) * 2015-01-22 2015-04-22 苏州大学 Method and device for recognizing textual emotion polarity
WO2017040436A1 (en) * 2015-08-31 2017-03-09 Microsoft Technology Licensing, Llc Distributed server system for language understanding
CN106933802A (en) * 2017-02-24 2017-07-07 黑龙江特士信息技术有限公司 A kind of social security class entity recognition method and device towards multi-data source
CN109582969A (en) * 2018-12-04 2019-04-05 联想(北京)有限公司 Methodology for Entities Matching, device and electronic equipment
CN109614615A (en) * 2018-12-04 2019-04-12 联想(北京)有限公司 Methodology for Entities Matching, device and electronic equipment
CN110516254A (en) * 2019-08-30 2019-11-29 联想(北京)有限公司 A kind of information processing method and electronic equipment

Also Published As

Publication number Publication date
CN111144102A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111144102B (en) Method and device for identifying entity in statement and electronic equipment
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
US10192545B2 (en) Language modeling based on spoken and unspeakable corpuses
US11106879B2 (en) Multilingual translation device and method
US10176804B2 (en) Analyzing textual data
US10831796B2 (en) Tone optimization for digital content
US20170286397A1 (en) Predictive Embeddings
CN104503998B (en) For the kind identification method and device of user query sentence
US20180025121A1 (en) Systems and methods for finer-grained medical entity extraction
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
US9703773B2 (en) Pattern identification and correction of document misinterpretations in a natural language processing system
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN111783471A (en) Semantic recognition method, device, equipment and storage medium of natural language
CN110717021A (en) Input text and related device for obtaining artificial intelligence interview
CN110633456B (en) Language identification method, language identification device, server and storage medium
US11126797B2 (en) Toxic vector mapping across languages
CN111354354A (en) Training method and device based on semantic recognition and terminal equipment
CN111666405A (en) Method and device for recognizing text implication relation
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN114398896A (en) Information input method and device, electronic equipment and computer readable storage medium
CN110276001B (en) Checking page identification method and device, computing equipment and medium
Faruqe et al. Bangla Hate Speech Detection System Using Transformer-Based NLP and Deep Learning Techniques
Manenti et al. Unsupervised speech unit discovery using k-means and neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant