CN112464667B - Text entity identification method and device, electronic equipment and storage medium - Google Patents

Text entity identification method and device, electronic equipment and storage medium

Info

Publication number
CN112464667B
Authority
CN
China
Prior art keywords
processed
text
entity recognition
participles
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011294254.0A
Other languages
Chinese (zh)
Other versions
CN112464667A (en
Inventor
郭韦良
阳晓文
张荣驰
何小莲
邓奕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huabin Licheng Technology Co ltd
Original Assignee
Beijing Huabin Licheng Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text entity identification method, a text entity identification device, electronic equipment and a storage medium, and relates to the technical field of data processing, wherein the text entity identification method comprises the following steps: acquiring a text to be processed; the text to be processed is a mixed text of at least two languages; obtaining a sentence dividing tool according to the language category, and carrying out sentence dividing processing on the text to be processed through the sentence dividing tool to obtain a plurality of sentences to be processed; performing word segmentation processing on a plurality of sentences to be processed to obtain a plurality of participles to be processed, and splicing the plurality of participles to be processed into a character string with a target length; and when the target length is greater than the preset length threshold value, matching and labeling the multiple participles to be processed based on the vocabulary entry of the dictionary to obtain an entity recognition result. Therefore, the entity recognition of the multi-language mixed text is realized, and the accuracy of the entity recognition of the overlong text can be improved.

Description

Text entity identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a text entity identification method and device, electronic equipment, and a storage medium.
Background
At present, with the continuous development of the medical and health field, data from different sources and in different formats continue to emerge, and a large amount of information that can be identified and mined is hidden in these data. As the most important step of medical data analysis, medical entity identification (especially disease entity identification) extracts the medical terms present in relevant texts, which plays an important role in subsequent research. Medical texts from different sources pose different problems. For example, Chinese medical documents are often interspersed with disease words, target words, and the like written in English, and medical patent texts often contain overly long descriptive sentences.
In the related art, algorithms based on BERT or BERT variants, that is, fine-tuned multi-language coding models, have become the new technical standard in the field of NLP (Natural Language Processing), including entity recognition. However, the pretrain-and-fine-tune scheme represented by BERT has several problems: a BERT model fine-tuned on English data cannot be directly transferred to prediction on Chinese data; for an ultra-long sentence, the input text is truncated and cannot be completely recognized; and the detailed preprocessing for the specific scenario of disease and target recognition is not accurate enough.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a text entity identification method that implements entity identification on multi-language mixed text and improves the accuracy of entity identification on overly long text, thereby solving the technical problems in the prior art that an ultra-long sentence cannot be completely identified because the input text data is truncated, and that the identification result is not accurate enough.
A second object of the present application is to provide a text entity recognition apparatus.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a non-transitory computer-readable storage medium.
A fifth object of the present application is to propose a computer program product.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a text entity identification method, including:
acquiring a text to be processed; the text to be processed is a mixed text of at least two languages;
obtaining a sentence dividing tool according to the language category, and carrying out sentence dividing processing on the text to be processed through the sentence dividing tool to obtain a plurality of sentences to be processed;
performing word segmentation processing on the sentences to be processed to obtain a plurality of participles to be processed, and splicing the participles to be processed into a character string with a target length;
when the target length is greater than a preset length threshold, performing matching labeling on the multiple to-be-processed participles based on entries of a dictionary to obtain an entity recognition result, wherein the preset length threshold is used for judging an entity recognition processing mode of the multiple to-be-processed participles, and the judging the entity recognition processing mode of the multiple to-be-processed participles specifically comprises: if the target length exceeds the preset length threshold, calling a dictionary-based automatic labeling system DALS module to perform matching labeling on the dictionary-based entries to obtain an entity recognition result, and if the target length does not exceed the preset length threshold, calling a multi-language coding fine tuning model to perform entity recognition to obtain an entity recognition result.
The entity identification method of the text comprises the steps of obtaining the text to be processed; the text to be processed is a mixed text of at least two languages; obtaining a sentence dividing tool according to the language category, and carrying out sentence dividing processing on the text to be processed through the sentence dividing tool to obtain a plurality of sentences to be processed; performing word segmentation processing on a plurality of sentences to be processed to obtain a plurality of participles to be processed, and splicing the plurality of participles to be processed into a character string with a target length; and when the target length is greater than the preset length threshold value, matching and labeling the multiple participles to be processed based on the vocabulary entry of the dictionary to obtain an entity recognition result. Therefore, the entity recognition of the multi-language mixed text is realized, and the accuracy of the entity recognition of the overlong text can be improved.
In an embodiment of the application, when the target length is less than or equal to the preset length threshold, the character string is input into a multi-language coding fine tuning model for entity recognition, and an entity recognition result is obtained.
In an embodiment of the present application, before the matching and labeling of the multiple to-be-processed participles by the dictionary-based entry and obtaining a labeling result, the method further includes:
acquiring a vocabulary entry list of a target category;
semantic analysis is carried out on the entries in the entry list, the entries in the entry list are adjusted according to semantic information, and stop words are deleted from the entry list;
dividing each entry into a group according to the upper and lower inclusion relation among the entries, and sequencing each group according to a preset length; each entry and the corresponding entity type form a pair.
In an embodiment of the present application, the matching and labeling the multiple to-be-processed segmented words by the dictionary-based entry to obtain the entity recognition result includes:
performing first matching on each word to be processed and the entry in the entry list, and replacing the word to be processed corresponding to complete matching with a label;
and after the first matching, performing second matching on the to-be-processed participles which are not replaced by the labels with the entries in the entry list, and replacing the to-be-processed participles which are completely matched with the labels until the matching of the to-be-processed participles is completed, so as to obtain an entity recognition result.
In an embodiment of the application, the method for entity identification of text further includes:
acquiring a training data text;
segmenting the training data text to obtain a plurality of training participles, and obtaining target training data of which the character lengths of the training participles are larger than the maximum sequence length value;
and sorting the character lengths of the training participles corresponding to the target training data according to a descending order, and selecting the minimum character length as the target length.
In an embodiment of the application, the method for entity identification of a text performs sentence segmentation on the text to be processed by using a sentence segmentation tool to obtain a plurality of sentences to be processed, including:
segmenting each Chinese character in the text to be processed based on the regular pattern to obtain word segmentation results and non-Chinese texts of each Chinese character;
and segmenting the non-Chinese text according to the blank space.
In order to achieve the above object, a second embodiment of the present application provides an entity recognition apparatus for text, including:
the acquisition module is used for acquiring a text to be processed; the text to be processed is a mixed text of at least two languages;
the word acquiring and segmenting module is used for acquiring a sentence segmenting tool according to the language category and carrying out sentence segmenting processing on the text to be processed through the sentence segmenting tool to acquire a plurality of sentences to be processed;
the word segmentation and splicing module is used for carrying out word segmentation on the sentences to be processed to obtain a plurality of words to be processed and splicing the words to be processed into character strings with target length;
a processing module, configured to perform matching labeling on the multiple to-be-processed segmented words based on a vocabulary entry of a dictionary when the target length is greater than a preset length threshold, to obtain an entity recognition result, where the preset length threshold is used to determine an entity recognition processing manner of the multiple to-be-processed segmented words, and the processing module is specifically configured to: if the target length exceeds the preset length threshold, calling a dictionary-based automatic labeling system DALS module to perform matching labeling on the dictionary-based entries to obtain an entity recognition result, and if the target length does not exceed the preset length threshold, calling a multi-language coding fine tuning model to perform entity recognition to obtain an entity recognition result.
The entity recognition device for the text obtains the text to be processed; the text to be processed is a mixed text of at least two languages; obtaining a sentence dividing tool according to the language category, and carrying out sentence dividing processing on the text to be processed through the sentence dividing tool to obtain a plurality of sentences to be processed; performing word segmentation processing on a plurality of sentences to be processed to obtain a plurality of participles to be processed, and splicing the plurality of participles to be processed into a character string with a target length; and when the target length is greater than the preset length threshold value, matching and labeling the multiple participles to be processed based on the vocabulary entry of the dictionary to obtain an entity recognition result. Therefore, the entity recognition of the multi-language mixed text is realized, and the accuracy of the entity recognition of the overlong text can be improved.
To achieve the above object, a third aspect of the present application provides a computer device, including: a processor; a memory for storing the processor-executable instructions; the processor reads the executable program code stored in the memory to run a program corresponding to the executable program code, so as to execute the entity identification method of the text described in the embodiment of the first aspect.
In order to achieve the above object, a fourth aspect of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, implement an entity identification method for text according to an embodiment of the first aspect of the present application.
In order to achieve the above object, a fifth aspect of the present application provides a computer program product, where instructions in the computer program product, when executed by a processor, implement the entity identification method of the text according to the first aspect of the present application.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flowchart illustrating an entity identification method for a text according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an entity recognition apparatus for text according to an embodiment of the present disclosure; and
Fig. 3 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
An entity identification method, an apparatus, an electronic device, and a storage medium of the text of the embodiments of the present application are described below with reference to the drawings.
The text entity recognition method of the present application solves the problem that an ultra-long sentence cannot be fully recognized because it is truncated by BERT, and supports disease and target recognition in scenarios such as mixed Chinese and English text. When no labeled training data exists, the automatic dictionary matching function of the invention can be used to label the data with a dictionary, which greatly reduces the cost of manual labeling.
The text entity identification method can be applied to many scenarios. For example: (1) disease and target identification in the titles, abstracts, and claims of Chinese and English patent texts, where analyzing the targets or diseases in patent texts makes it possible to analyze research and development trends in the pharmaceutical industry, monitor the latest industry developments, and capture the most valuable targets or diseases; (2) disease identification in English clinical trial titles and inclusion criteria texts, where mining the subdivided indications in clinical trial texts connects the clinical trial, clinical result, and subdivided indication data chain, making it easier to search for unmet clinical needs and clinical research and development trends; (3) indication (disease) identification in Chinese and English drug instructions, where identifying and standardizing the indication information in drug instruction texts helps connect the indication data chain between the Chinese clinical research database and the Chinese drug declaration database, supplements drug indication information, and makes it easier to search for related drugs by indication; and (4) discovery of new disease and target words in medical literature.
Fig. 1 is a flowchart illustrating a text entity identification method according to an embodiment of the present application.
As shown in Fig. 1, the method for recognizing the entity of the text includes the following steps:
Step 101, acquiring a text to be processed; the text to be processed is a mixed text of at least two languages.
In the embodiment of the present application, the text entity identification method described above may be applied to many scenarios, and the text to be processed may be selected according to the specific application scenario, as illustrated below.
In a first example, disease and target identification is performed on the titles, abstracts, and claims of Chinese and English patent texts, and the text to be processed is a Chinese or English patent text.
In a second example, new disease and target words are discovered in medical literature, and the text to be processed is the medical literature.
In the embodiment of the present application, the text to be processed is a mixed text of at least two languages, such as a mixed text of Chinese and English or a mixed text of Chinese and French.
Step 102, acquiring a sentence dividing tool according to the language category, and performing sentence dividing processing on the text to be processed through the sentence dividing tool to acquire a plurality of sentences to be processed.
Step 103, performing word segmentation on the sentences to be processed to obtain a plurality of participles to be processed, and splicing the participles to be processed into a character string with a target length.
In the embodiment of the application, the text to be processed is first divided into sentences (if it contains two or more sentences), and each sentence is then segmented using the mixed-language word segmentation. If the length of the segmented text exceeds a preset threshold, a DALS (Dictionary-based Auto Labeling System) module is called to perform matching and labeling based on dictionary entries to obtain an entity recognition result; if the length does not exceed the threshold, a multi-language coding fine tuning model is called to perform entity recognition to obtain an entity recognition result.
In the embodiment of the application, different processing modes may be selected for different mixed-language texts. As an example, each Chinese character in the text to be processed is split off individually based on a regular expression, yielding a segmentation result for each Chinese character together with the remaining non-Chinese text, and the non-Chinese text is then segmented on spaces.
For example, there may be one or more texts to be processed, and each text may contain one or more sentences. Suppose the texts to be processed are [ 'How can diabetes be cured? PD-1 and H.I.V. is the target point?', 'What is central diabetes insipidus (CDI)?' ], where the words following "H.I.V." are Chinese characters in the original text and are rendered here in translation. The texts to be processed are converted into: [ [ [ 'How', 'can', 'diabetes', 'be', 'cured', '?' ], [ 'PD-1', 'and', 'HIV', 'is', 'target', 'dot', 'is' ] ], [ [ 'What', 'is', 'central', 'diabetes', 'insipidus', '(', 'CDI', ')', '?' ] ] ], with each Chinese character forming its own token.
In the embodiment of the present application, a language identification tool is used to identify whether an input text (which may be a sentence, an article, etc.) is English or Chinese, and a different sentence segmentation tool is used for each language. Sentence segmentation splits the text into individual sentences, using sentence-terminating punctuation in the grammatical sense as the sentence boundary. Word segmentation, that is, the process described above, is then performed sentence by sentence.
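The language-dependent sentence splitting described above might be sketched as follows. This is only an illustration under stated assumptions, not the tooling used in the application: `detect_language` and `split_sentences` are hypothetical names, language identification is approximated by checking for CJK characters, and the two regular expressions stand in for the separate Chinese and English sentence segmentation tools. Abbreviations containing periods (such as "H.I.V.") are not handled specially.

```python
import re

# Assumption: a text counts as Chinese if it contains any CJK ideograph.
CJK = re.compile(r"[\u4e00-\u9fff]")
# Chinese: cut right after a sentence terminator (no whitespace required).
ZH_BOUNDARY = re.compile(r"(?<=[。！？!?])")
# English: cut at whitespace that follows a sentence terminator.
EN_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def detect_language(text):
    """Very rough stand-in for a language identification tool."""
    return "zh" if CJK.search(text) else "en"


def split_sentences(text):
    """Pick a splitter by language and return the non-empty sentences."""
    boundary = ZH_BOUNDARY if detect_language(text) == "zh" else EN_BOUNDARY
    return [part.strip() for part in boundary.split(text) if part.strip()]
```

Each returned sentence would then be passed to the word segmentation step on its own.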
More specifically, each Chinese character in the text to be processed is cut out individually based on a regular expression, and the other segments are temporarily kept whole. For example, 'How can diabetes be cured?' contains no Chinese and is therefore temporarily output as a whole. In 'PD-1 and H.I.V.', 'is', 'target', 'dot', 'do', the Chinese characters (rendered here in translation) are split off one by one and the rest is kept as it is. Likewise, 'What is Central Diabetes Insipidus (CDI)?' contains no Chinese and is kept whole.
Further, the non-Chinese segments above are split on spaces, and any non-letter symbols at the end of each segment are separated. If there is no space between a parenthesis and the adjacent non-Chinese text, the parenthesis is cut out as an independent participle (token), because the words inside the parentheses may also be entities that need to be recognized. If there is no space between the sentence-final punctuation and the last segment of the sentence, the punctuation is cut out as a separate token. If a segment without spaces contains an "anti-" prefix (variants such as "ANTI-" are also considered), the prefix is split off, since in medical text the word following "anti-" is likely to be a target word. The result is: [ 'How', 'can', 'diabetes', 'be', 'cured', '?' ]; [ 'PD-1', 'and', 'HIV', 'is', 'target', 'dot', 'does' ]; [ 'What', 'is', 'central', 'diabetes', 'insipidus', '(', 'CDI', ')', '?' ].
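The mixed-language segmentation rules above can be approximated with a few regular expressions. The sketch below is an assumption-laden illustration rather than the implementation of the application: the function name `segment`, the CJK range `\u4e00-\u9fff`, and the treatment of trailing symbols as a single token are all choices made here for brevity.

```python
import re

CJK_CHAR = re.compile(r"([\u4e00-\u9fff])")          # one Chinese character
PAREN = re.compile(r"([()])")                         # parentheses become tokens
TRAILING = re.compile(r"^(.*[0-9A-Za-z])([^0-9A-Za-z]+)$")  # word + end symbols
ANTI = re.compile(r"^(anti-)(.+)$", re.IGNORECASE)    # anti-, Anti-, ANTI-


def _split_word(word):
    """Split one space-free, non-Chinese fragment into tokens."""
    tokens = []
    for piece in PAREN.split(word):
        if not piece:
            continue
        if PAREN.fullmatch(piece):
            tokens.append(piece)
            continue
        tail = None
        match = TRAILING.match(piece)
        if match:  # separate symbols at the end, e.g. "cured?" -> "cured", "?"
            piece, tail = match.group(1), match.group(2)
        match = ANTI.match(piece)
        if match:  # split the prefix so the target word stands alone
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(piece)
        if tail:
            tokens.append(tail)
    return tokens


def segment(sentence):
    """Chinese characters one by one; everything else split on spaces."""
    tokens = []
    for chunk in CJK_CHAR.split(sentence):
        if CJK_CHAR.fullmatch(chunk):
            tokens.append(chunk)
        else:
            for word in chunk.split():
                tokens.extend(_split_word(word))
    return tokens
```

For instance, `segment("What is central diabetes insipidus (CDI)?")` yields the nine tokens listed above, with the parentheses and the question mark as separate tokens.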
Then, for each sentence, the tokens produced in the previous step are joined with single spaces into a character string, and the target length of that string, counted in characters, is checked.
Step 104, when the target length is greater than a preset length threshold value, matching and labeling the multiple participles to be processed based on the vocabulary entry of the dictionary to obtain an entity recognition result.
In the embodiment of the application, the preset length threshold is preset, and as a possible implementation mode, a training data text is obtained; segmenting a training data text to obtain a plurality of training participles, and obtaining target training data of which the character lengths of the training participles are larger than the maximum sequence length value; and sorting the character lengths of the training participles corresponding to the target training data according to a descending order, and selecting the minimum character length as the target length.
Specifically, a batch of training data texts is first processed with the mixed-language segmentation strategy adopted by the application, and the resulting participles are joined with single spaces. Each joined text is then tokenized by the word segmentation module that ships with BERT to determine whether the number of training participles is greater than 512 (512 is the maximum sequence length value specified by BERT). The proportion of over-length texts in the batch is counted. The character lengths of the over-length training data texts (that is, the single-space-joined texts after the mixed-language segmentation adopted by the application) are then arranged in descending order, and the smallest character length is selected as the preset length threshold.
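The threshold selection just described can be sketched as below. This is a minimal illustration under assumptions: `length_threshold` is a hypothetical name, and `count_subtokens` is a caller-supplied stand-in for the BERT word segmentation module (any function that returns the number of model tokens for a string), so no particular tokenizer library is implied.

```python
def length_threshold(joined_texts, count_subtokens, max_seq_len=512):
    """Return (threshold, overlong_ratio) for a batch of space-joined texts.

    joined_texts    -- training texts after mixed-language segmentation,
                       with tokens joined by single spaces
    count_subtokens -- assumed stand-in for the BERT tokenizer: str -> int
    max_seq_len     -- maximum sequence length accepted by the model
    """
    # Character lengths of the texts whose model-token count is too large.
    overlong = [len(text) for text in joined_texts
                if count_subtokens(text) > max_seq_len]
    if not overlong:
        return None, 0.0
    overlong.sort(reverse=True)          # descending order, as in the text
    threshold = overlong[-1]             # smallest over-length character count
    return threshold, len(overlong) / len(joined_texts)
```

Any input string longer than the returned threshold would then be routed to dictionary-based labeling instead of the model.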
In the embodiment of the application, when the target length is greater than the preset length threshold, matching and labeling are carried out on a plurality of to-be-processed participles based on the vocabulary entry of the dictionary, and an entity recognition result is obtained.
It should be noted that, when the target length is less than or equal to the preset length threshold, the character string is input into the multi-language coding fine tuning model for entity recognition, and an entity recognition result is obtained.
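The choice between the two processing modes can be written as a small dispatcher. The sketch below assumes hypothetical callables `dictionary_labeler` (standing in for the DALS module) and `model_labeler` (standing in for the multi-language coding fine tuning model); only the routing rule itself comes from the description above.

```python
def recognize_entities(tokens, threshold, dictionary_labeler, model_labeler):
    """Route one segmented sentence by the character length of its joined form."""
    text = " ".join(tokens)              # single-space splicing
    if len(text) > threshold:            # too long for the model: use the dictionary
        return "dictionary", dictionary_labeler(tokens)
    return "model", model_labeler(text)  # short enough: use the fine-tuned model
```

Returning the name of the chosen path alongside the result is a convenience added here so that callers can see which mode handled each sentence.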
In this embodiment of the present application, before matching and labeling the multiple to-be-processed segmented words based on the vocabulary entry of the dictionary and obtaining the labeling result, the method further includes: acquiring a vocabulary entry list of a target category; semantic analysis is carried out on the entries in the entry list, the entries in the entry list are adjusted according to semantic information, and stop words are deleted from the entry list; dividing each entry into a group according to the upper and lower inclusion relation among the entries, and sequencing each group according to a preset length; each entry and the corresponding entity type form a pair.
In the embodiment of the present application, matching and labeling a plurality of to-be-processed segmented words based on the vocabulary entry of the dictionary to obtain an entity recognition result includes: performing first matching on each word to be processed and the entries in the entry list, and replacing each completely matched word to be processed with a label; and after the first matching, performing second matching between the to-be-processed participles that have not been replaced by labels and the entries in the entry list, and replacing each completely matched to-be-processed participle with a label, until the matching of the to-be-processed participles is completed, so as to obtain an entity recognition result.
Specifically, the Dictionary-based Auto Labeling System (DALS) first labels a batch of texts to be processed automatically and then hands them over for manual proofreading. The quality of the automatic labeling depends mainly on the vocabulary coverage of the dictionary: the more entries it contains, the better the initial labeling. The approach is particularly suitable when entity words depend little on context, so that previously accumulated dictionaries can be fully used to reduce manual work. A large proportion of the sentences are very long (and the part exceeding the length limit is truncated by BERT, causing data loss), so DALS is also used to label these very long sentences. DALS comprises the following steps: dictionary formatting, dictionary stop word removal, dictionary entry grouping, dictionary entry word segmentation, dictionary entry labeling, and input text-dictionary matching.
Specifically, for dictionary formatting and dictionary stop word removal: the original disease and target dictionary is stored in a table similar to an Excel sheet. Each row starts with a core word in the first column, and the following columns of the same row hold abbreviations, full names, aliases, and the like of the core word; all words in a row are synonyms, and words in different rows are not synonyms of each other. Formatting flattens all disease and target dictionaries into a single column, discards the synonym layout, and removes entries that are identical to stop words. If an abbreviated alias is attached in parentheses at the end of an entry, the abbreviation and the main word outside the parentheses are each split off as new entries. After formatting, the entries are persisted as a binary file so that they can be loaded quickly later.
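A rough sketch of the formatting step follows. The name `flatten_dictionary`, the row-of-synonyms input shape, and the decision to keep the original parenthesized entry next to its two parts are assumptions made for illustration; persistence to a binary file is omitted.

```python
import re

# "Main word (ABBR)" with ASCII or full-width parentheses at the end.
WITH_ABBR = re.compile(r"^(.*\S)\s*[(（]([^()（）]+)[)）]$")


def flatten_dictionary(rows, stop_words):
    """Flatten synonym rows into one de-duplicated list of entries."""
    seen, entries = set(), []

    def add(entry):
        entry = entry.strip()
        # Drop empty cells, stop words, and entries already collected.
        if entry and entry.lower() not in stop_words and entry not in seen:
            seen.add(entry)
            entries.append(entry)

    for row in rows:                      # each row holds synonyms of one core word
        for term in row:
            add(term)
            match = WITH_ABBR.match(term.strip())
            if match:                     # also add the main word and the abbreviation
                add(match.group(1))
                add(match.group(2))
    return entries
```

The flattened list could then be serialized, for example with Python's `pickle` module, to obtain the binary file mentioned above.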
Specifically, the dictionary entries are grouped: entries that stand in an inclusion relation (one entry contains another) are placed in the same group and sorted by length, and each entry is paired with its entity type. Example:
Input: {'disease': ['diabetes', 'type 2 diabetes', 'hyperglycemia'], 'target': ['Hepatitis D Virus (HDV)', 'hepatitis D virus', 'HDV', 'EDA-FN']}.
Output: [[['target', 'Hepatitis D Virus (HDV)'], ['target', 'hepatitis D virus'], ['target', 'HDV']], [['disease', 'type 2 diabetes'], ['disease', 'diabetes']], [['disease', 'hyperglycemia']], [['target', 'EDA-FN']]].
In the above example there are three entities of the 'disease' class. 'type 2 diabetes' contains 'diabetes' but does not contain 'hyperglycemia', so 'type 2 diabetes' and 'diabetes' form one group and 'hyperglycemia' forms another.
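The grouping rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: containment is tested case-insensitively on the raw strings, each entry joins the first group whose longest entry contains it, and groups are ordered by the length of their longest entry. The name `group_entries` is hypothetical.

```python
def group_entries(dictionary):
    """Group entries by containment; each group is sorted longest first."""
    groups = []
    for entity_type, entries in dictionary.items():
        type_groups = []
        # Visit longer entries first so that each group starts with its container.
        for entry in sorted(entries, key=len, reverse=True):
            for group in type_groups:
                head = group[0][1]
                if entry.lower() in head.lower():
                    group.append([entity_type, entry])
                    break
            else:
                # No existing group contains this entry: start a new group.
                type_groups.append([[entity_type, entry]])
        groups.extend(type_groups)
    # Assumption: order groups by the length of their longest entry.
    groups.sort(key=lambda g: len(g[0][1]), reverse=True)
    return groups


dictionary = {
    "disease": ["diabetes", "type 2 diabetes", "hyperglycemia"],
    "target": ["Hepatitis D Virus (HDV)", "hepatitis D virus", "HDV", "EDA-FN"],
}
groups = group_entries(dictionary)
```

Entries are only compared within the same entity type here, so a disease entry never lands in a target group even if the strings overlap.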
Specifically, for word segmentation and labeling of dictionary entries, this step applies the same word segmentation scheme that is used for the input text. Each entry is then labeled with the BIO labeling scheme, and the number of tokens in each entry is counted for use in the next step.
An example of entry labeling, based on the result of the above example, in which each token is a single character of the original Chinese entry shown as a gloss: [{'entry_tokens': ['2', 'type', 'sugar', 'urine', 'disease'], 'label': ['b-disease', 'i-disease', 'i-disease', 'i-disease', 'i-disease'], 'entry_token_#': 5}, {'entry_tokens': ['sugar', 'urine', 'disease'], 'label': ['b-disease', 'i-disease', 'i-disease'], 'entry_token_#': 3}, {'entry_tokens': ['high', 'blood', 'sugar'], 'label': ['b-disease', 'i-disease', 'i-disease'], 'entry_token_#': 3}].
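A minimal Python sketch of BIO labeling for one already-segmented entry follows. The function name `label_entry` and the exact key names are assumptions, and the Chinese string is the usual Chinese term for type 2 diabetes, assumed here to be the form used in the original dictionary.

```python
def label_entry(entity_type, entry_tokens):
    """Attach BIO labels and a token count to one segmented dictionary entry."""
    labels = [
        ("b-" if i == 0 else "i-") + entity_type  # first token begins the entity
        for i in range(len(entry_tokens))
    ]
    return {
        "entry_tokens": list(entry_tokens),
        "label": labels,
        "entry_token_#": len(entry_tokens),
    }


# One token per Chinese character, as produced by character-level segmentation.
record = label_entry("disease", list("2型糖尿病"))
```

The token count stored with each entry is what the matching step later uses as the window width when comparing against the input tokens.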
Specifically, for matching the input text against the dictionary, the segmented input text is matched against the dictionary entries prepared in the previous steps and labeled accordingly. The label sequence of the sentence is first initialized according to the number of tokens obtained after token refinement (mixed-language segmentation), for example:
Input: What is the difference between Hepatitis D Virus (HDV) and hepatitis D virus? (The original sentence mixes Chinese and English; its Chinese characters appear below as word-for-word glosses.)
Word segmentation result: ['hepatitis', 'D', 'virus', '(', 'HDV', ')', 'and', 'hepatitis', 'D', 'virus', 'asso', 'how', 'zone', 'other'].
Initialized labels of the input text: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'].
Then the tokens of the input text are matched and labeled group by group using the dictionary prepared above: whenever a run of tokens completely matches the tokens of a dictionary entry, the initialized labels at those positions are replaced by the labels of that entry. This is where grouping the entries by inclusion relation and length pays off: within each group the entries are matched one after another in descending order of length, so once a longer entry has been matched, the shorter entries of the group that it contains can no longer match the tokens it covers.
In the results of this process shown below, '#labeled#' indicates that the token at that position has already been matched to a dictionary entry entity (using the token count prepared for the corresponding entry in the previous step); such temporary tokens do not take part in subsequent matching rounds.
First round of matching: temporary tokens ['#labeled#', '#labeled#', '#labeled#', '#labeled#', '#labeled#', '#labeled#', 'and', 'hepatitis', 'D', 'virus', 'asso', 'how', 'zone', 'other']; temporary labels ['b-target', 'i-target', 'i-target', 'i-target', 'i-target', 'i-target', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'].
Second round of matching: temporary tokens ['#labeled#', '#labeled#', '#labeled#', '#labeled#', '#labeled#', '#labeled#', 'and', '#labeled#', '#labeled#', '#labeled#', 'asso', 'how', 'zone', 'other']; temporary labels ['b-target', 'i-target', 'i-target', 'i-target', 'i-target', 'i-target', 'O', 'b-target', 'i-target', 'i-target', 'O', 'O', 'O', 'O'].
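The matching rounds above can be sketched in Python as a sliding-window comparison. This is a simplified illustration: the function name `match_entries`, the case-insensitive comparison and the left-to-right scan are assumptions, and the dictionary entries are passed in already segmented.

```python
PLACEHOLDER = "#labeled#"


def match_entries(tokens, grouped_entries):
    """Label tokens by exact matching against grouped dictionary entries.

    grouped_entries is a list of groups; each group is a list of
    (entity_type, entry_tokens) pairs sorted longest first.
    """
    temp_tokens = [t.lower() for t in tokens]  # assumption: ignore case
    labels = ["O"] * len(tokens)
    for group in grouped_entries:
        for entity_type, entry_tokens in group:
            pattern = [t.lower() for t in entry_tokens]
            n = len(pattern)
            if n == 0:
                continue
            i = 0
            while i + n <= len(temp_tokens):
                if temp_tokens[i:i + n] == pattern:
                    labels[i] = "b-" + entity_type
                    for j in range(i + 1, i + n):
                        labels[j] = "i-" + entity_type
                    # Mask matched tokens so shorter entries cannot reuse them.
                    temp_tokens[i:i + n] = [PLACEHOLDER] * n
                    i += n
                else:
                    i += 1
    return labels


tokens = ["Hepatitis", "D", "Virus", "(", "HDV", ")", "and",
          "hepatitis", "D", "virus", "?"]
group = [
    ("target", ["Hepatitis", "D", "Virus", "(", "HDV", ")"]),
    ("target", ["hepatitis", "D", "virus"]),
    ("target", ["HDV"]),
]
labels = match_entries(tokens, [group])
```

Because the first six tokens are masked after the longest entry matches, the shorter entry 'HDV' finds nothing left to match inside the parentheses, while 'hepatitis D virus' still matches the second, unmasked mention.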
In this way, the present application segments multilingual sentences such as mixed Chinese-English sentences, and solves the problem that a model obtained by fine-tuning mBERT on pure English data cannot be used directly for entity prediction on Chinese text. The approach is suitable not only for Chinese and English but also for languages typeset like English, in which words are separated by spaces (such as German, French and Spanish), and for languages typeset like Chinese, which consist of individual characters without spaces between words (such as Simplified and Traditional Mandarin and Cantonese).
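A mixed-language segmenter of the kind described can be sketched in Python with a single regular expression. The pattern below is an assumption for illustration: it treats each character in the basic CJK Unified Ideographs range as its own token, keeps runs of other letters and digits together, and emits each punctuation mark separately. The Chinese sentence is an assumed rendering of the example input.

```python
import re

# One CJK character | a run of non-CJK word characters | one punctuation mark.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+|[^\w\s]")


def segment_mixed(text):
    """Split mixed Chinese/space-delimited text into tokens."""
    return _TOKEN.findall(text)


tokens = segment_mixed("Hepatitis D Virus (HDV)和hepatitis D virus有什么区别")
joined = " ".join(tokens)  # single-space join used for the length check
```

Restricting the first alternative to one Unicode block is a simplification; text with characters from CJK extension blocks or other scripts without spaces would need a wider range.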
Furthermore, after the segmented words are joined into a character string with single spaces, the length of the string in characters is checked. If the length exceeds a preset threshold, the DALS module performs dictionary-based automatic matching, which solves the problem that entities cannot be recognized accurately when overlong text is truncated. Formatting, segmenting and labeling the dictionary accommodates the data characteristics of disease and target words, and avoids matching conflicts between different entries that stand in an inclusion relation.
In the entity recognition method for text of the embodiments of the present application, a text to be processed is obtained, the text being a mixed text of at least two languages; a sentence segmentation tool is obtained according to the language category, and the text is split by the tool into a plurality of sentences to be processed; word segmentation is performed on the sentences to obtain a plurality of segmented words to be processed, which are joined into a character string having a target length; and when the target length is greater than a preset length threshold, the segmented words are matched and labeled against the entries of a dictionary to obtain an entity recognition result. Entity recognition of multilingual mixed text is thereby achieved, and entities in overlong text can be recognized accurately.
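The length-based choice between the two recognition paths can be sketched in Python as below. The function `recognize_entities` and its two callable parameters are hypothetical stand-ins: one for the dictionary-based DALS matcher and one for the fine-tuned multilingual model, neither of which is implemented here.

```python
def recognize_entities(tokens, length_threshold, dals_labeler, model_labeler):
    """Route a segmented sentence to DALS or to the model by string length."""
    text = " ".join(tokens)  # join segmented words with single spaces
    if len(text) > length_threshold:
        # Overlong text: dictionary matching avoids truncation by the model.
        return dals_labeler(tokens)
    return model_labeler(text)


# Stub labelers used only to show which branch is taken.
def fake_dals(tokens):
    return ("dals", ["O"] * len(tokens))


def fake_model(text):
    return ("model", text)


short_result = recognize_entities(["a", "b"], 10, fake_dals, fake_model)
long_result = recognize_entities(["hepatitis", "D", "virus"], 10, fake_dals, fake_model)
```

Passing the labelers in as parameters keeps the routing rule separate from both recognizers, so the threshold can be tuned without touching either of them.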
In order to implement the above embodiments, the present application further provides an entity recognition apparatus for a text.
Fig. 2 is a schematic structural diagram of an entity recognition apparatus for text according to an embodiment of the present disclosure.
As shown in fig. 2, the entity recognition apparatus for text includes an obtaining module 210, a sentence segmentation module 220, a word segmentation and concatenation module 230, and a processing module 240.
The obtaining module 210 is configured to obtain a text to be processed; the text to be processed is a mixed text of at least two languages.
The sentence segmentation module 220 is configured to obtain a sentence segmentation tool according to the language category, and to split the text to be processed into a plurality of sentences to be processed by means of the sentence segmentation tool.
The word segmentation and concatenation module 230 is configured to perform word segmentation on the plurality of sentences to be processed to obtain a plurality of segmented words to be processed, and to join the segmented words into a character string having a target length.
The processing module 240 is configured to, when the target length is greater than a preset length threshold, match and label the plurality of segmented words to be processed against the entries of a dictionary to obtain an entity recognition result.
In an embodiment of the application, the processing module 240 is further configured to, when the target length is less than or equal to the preset length threshold, input the character string into a multi-language coding fine tuning model for entity recognition, and obtain an entity recognition result.
The entity recognition apparatus for text of the embodiments of the present application obtains a text to be processed, the text being a mixed text of at least two languages; obtains a sentence segmentation tool according to the language category and splits the text into a plurality of sentences to be processed by means of the tool; performs word segmentation on the sentences to obtain a plurality of segmented words to be processed and joins them into a character string having a target length; and, when the target length is greater than a preset length threshold, matches and labels the segmented words against the entries of a dictionary to obtain an entity recognition result. Entity recognition of multilingual mixed text is thereby achieved, and the accuracy of entity recognition for overlong text can be improved.
It should be noted that the explanation of the embodiment of the text entity identification method is also applicable to the text entity identification apparatus of the embodiment, and is not repeated here.
In order to implement the foregoing embodiments, the present application also provides a computer device, including: a processor, and a memory for storing processor-executable instructions.
Wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the entity identification method of the text as proposed in the foregoing embodiment of the present application.
In order to implement the foregoing embodiments, the present application also provides a non-transitory computer-readable storage medium storing instructions which, when executed by a processor, enable the processor to perform the entity recognition method for text proposed in the foregoing embodiments of the present application.
In order to implement the foregoing embodiments, the present application also provides a computer program product; when instructions in the computer program product are executed by a processor, the entity recognition method for text proposed in the foregoing embodiments of the present application is performed.
FIG. 3 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present application. The computer device 12 shown in fig. 3 is only an example and should not bring any limitation to the function and scope of use of the embodiments of the present application.
As shown in FIG. 3, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, to name a few.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, and commonly referred to as a "hard drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes various functional applications and data processing, such as implementing the entity recognition method of the text mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A text entity identification method is characterized by comprising the following steps:
acquiring a text to be processed; the text to be processed is a mixed text of at least two languages;
obtaining a sentence dividing tool according to the language category, and carrying out sentence dividing processing on the text to be processed through the sentence dividing tool to obtain a plurality of sentences to be processed;
performing word segmentation processing on the sentences to be processed to obtain a plurality of participles to be processed, and splicing the participles to be processed into a character string with a target length;
when the target length is greater than a preset length threshold, performing matching labeling on the multiple to-be-processed participles based on entries of a dictionary to obtain an entity recognition result, wherein the preset length threshold is used for judging an entity recognition processing mode of the multiple to-be-processed participles, and the judging the entity recognition processing mode of the multiple to-be-processed participles specifically comprises: if the target length exceeds the preset length threshold, calling a dictionary-based automatic labeling system DALS module to perform matching labeling on the dictionary-based entries to obtain an entity recognition result, and if the target length does not exceed the preset length threshold, calling a multi-language coding fine tuning model to perform entity recognition to obtain an entity recognition result.
2. The method for entity recognition of text according to claim 1, further comprising:
and when the target length is less than or equal to the preset length threshold, inputting the character string into a multi-language coding fine tuning model for entity recognition to obtain an entity recognition result.
3. The method for entity recognition of text according to claim 1, wherein before the matching labeling of the plurality of to-be-processed segmented words by the dictionary-based vocabulary entry and obtaining the labeling result, the method further comprises:
acquiring a vocabulary entry list of a target category;
semantic analysis is carried out on the entries in the entry list, the entries in the entry list are adjusted according to semantic information, and stop words are deleted from the entry list;
dividing each entry into a group according to the upper and lower inclusion relation among the entries, and sequencing each group according to a preset length; each entry and the corresponding entity type form a pair.
4. The method for entity recognition of text according to claim 3, wherein the matching and labeling of the multiple to-be-processed participles based on the vocabulary entry of the dictionary to obtain the entity recognition result comprises:
performing first matching on each word to be processed and the entry in the entry list, and replacing the word to be processed corresponding to complete matching with a label;
and after the first matching, performing second matching on the to-be-processed participles which are not replaced by the labels with the entries in the entry list, and replacing the to-be-processed participles which are completely matched with the labels until the matching of the to-be-processed participles is completed, so as to obtain an entity recognition result.
5. The method for entity recognition of text according to claim 1, further comprising:
acquiring a training data text;
segmenting the training data text to obtain a plurality of training participles, and obtaining target training data of which the character lengths of the training participles are larger than the maximum sequence length value;
and sorting the character lengths of the training participles corresponding to the target training data according to a descending order, and selecting the minimum character length as the target length.
6. The method for entity recognition of text according to claim 1, wherein the sentence segmentation processing of the text to be processed by the sentence segmentation tool to obtain a plurality of sentences to be processed comprises:
segmenting each Chinese character in the text to be processed based on the regular pattern to obtain word segmentation results and non-Chinese texts of each Chinese character;
and segmenting the non-Chinese text according to the blank space.
7. An apparatus for entity recognition of text, the apparatus comprising:
the acquisition module is used for acquiring a text to be processed; the text to be processed is a mixed text of at least two languages;
the word acquiring and segmenting module is used for acquiring a sentence segmenting tool according to the language category and carrying out sentence segmenting processing on the text to be processed through the sentence segmenting tool to acquire a plurality of sentences to be processed;
the word segmentation and splicing module is used for carrying out word segmentation on the sentences to be processed to obtain a plurality of words to be processed and splicing the words to be processed into character strings with target length;
a processing module, configured to perform matching labeling on the multiple to-be-processed segmented words based on a vocabulary entry of a dictionary when the target length is greater than a preset length threshold, to obtain an entity recognition result, where the preset length threshold is used to determine an entity recognition processing manner of the multiple to-be-processed segmented words, and the processing module is specifically configured to: if the target length exceeds the preset length threshold, calling a dictionary-based automatic labeling system DALS module to perform matching labeling on the dictionary-based entries to obtain an entity recognition result, and if the target length does not exceed the preset length threshold, calling a multi-language coding fine tuning model to perform entity recognition to obtain an entity recognition result.
8. The apparatus for entity recognition of text as recited in claim 7, further comprising:
and the processing module is further used for inputting the character string into a multi-language coding fine tuning model for entity recognition when the target length is less than or equal to the preset length threshold value, and acquiring an entity recognition result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, when executing the program, implementing a method of entity recognition of a text as claimed in any one of claims 1 to 6.
10. A non-transitory computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method for entity identification of a text according to any one of claims 1 to 6.
CN202011294254.0A 2020-11-18 2020-11-18 Text entity identification method and device, electronic equipment and storage medium Active CN112464667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011294254.0A CN112464667B (en) 2020-11-18 2020-11-18 Text entity identification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112464667A CN112464667A (en) 2021-03-09
CN112464667B true CN112464667B (en) 2021-11-16

Family

ID=74836657

Country Status (1)

Country Link
CN (1) CN112464667B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant