CN108052508B - Information extraction method and device - Google Patents

Info

Publication number
CN108052508B
Authority
CN
China
Prior art keywords
word
undetermined
words
segmentation result
currently
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711476786.4A
Other languages
Chinese (zh)
Other versions
CN108052508A (en
Inventor
李重勋
王利叶
胡可云
陈联忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiahesen Health Technology Co., Ltd.
Original Assignee
Beijing Jiahesen Health Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses an information extraction method, which comprises: segmenting a preset text according to a preset word bank to obtain a first segmentation result; extracting a plurality of undetermined words from the first segmentation result; and determining, from the plurality of undetermined words, the undetermined words that have no inclusion relation with one another as the information extraction result of the first segmentation result. With this two-pass segmentation, the longer first segmentation result is extracted first, and shorter words without inclusion relations are then extracted from it; for example, words representing a body part or a disease are extracted from a complete operation name. This increases the amount of extracted information, and the hierarchy formed by the first segmentation result and its information extraction result strengthens data structuring and makes data easier to query and locate. The embodiment of the application also discloses an information extraction device.

Description

Information extraction method and device
Technical Field
The present application relates to the field of text processing, and in particular, to an information extraction method and apparatus.
Background
Electronic Medical Records (EMRs) are also called computerized medical record systems or computer-based patient records. An EMR is a digital record of medical services, comprising text, symbols, charts, data, images and the like, that the medical staff of a medical institution generate through an information system in the course of clinical diagnosis, treatment and guided intervention for outpatients and inpatients. The development of electronic medical records makes it convenient for doctors to obtain patient information in real time and supports clinical research. However, electronic medical records currently contain both structured and unstructured data, and much important information exists only in the unstructured data, such as the chief complaint, the history of present illness and the past medical history. Therefore, to use electronic medical records effectively and discover the useful information in them, the unstructured data needs to be converted into structured data; this process is information extraction.
In the information extraction process, the text usually needs to be segmented based on a preset word bank to obtain useful information, such as words representing diseases, symptoms and operations. In the prior art, word segmentation follows the longest matching principle, that is, the text is segmented at the longest word that matches the word bank. In many cases, however, the longest word itself contains shorter words that are also useful information. These shorter words cannot be extracted under the longest matching principle, so less information is extracted and the effect of data structuring suffers.
For example, suppose the segmentation result "diverticulectomy in bladder" is obtained under the longest matching principle. As a whole the word is an operation name, but it contains a site name (in bladder), a disease name (diverticulum) and an operation name (resection). Because "diverticulectomy in bladder" exists in the word bank, the three words "in bladder", "diverticulum" and "resection" cannot be extracted under the longest matching principle even if they also exist in the word bank.
Disclosure of Invention
In order to solve the technical problem that a large number of words in a word segmentation result obtained according to a longest matching principle cannot be extracted in the prior art, the embodiment of the application provides an information extraction method and device.
In a first aspect, an embodiment of the present application provides an information extraction method, where the method includes:
segmenting words of a preset text according to a preset word bank to obtain a first segmentation result;
extracting a plurality of undetermined words from the first segmentation result based on the preset word bank, wherein the undetermined words do not comprise the first segmentation result;
and determining, from the plurality of undetermined words, the undetermined words that have no inclusion relation as the information extraction result of the first segmentation result.
Optionally, the determining, from the plurality of undetermined words, the undetermined words that have no inclusion relation as the information extraction result of the first segmentation result includes:
sorting the plurality of undetermined words according to the position of the first character and/or the last character of each undetermined word in the first segmentation result;
if the current undetermined word has an adjacent next undetermined word, judging whether the current undetermined word and the adjacent next undetermined word have an inclusion relation, and if not, taking the current undetermined word and/or the adjacent next undetermined word as the information extraction result, wherein the current undetermined word is one of the plurality of undetermined words.
Optionally, if the current undetermined word and the adjacent next undetermined word have an inclusion relation, the method further includes:
if the current undetermined word contains the adjacent next undetermined word, covering the current undetermined word with the adjacent next undetermined word, taking the adjacent next undetermined word as the current undetermined word, and executing again the step of judging whether the current undetermined word and the adjacent next undetermined word have an inclusion relation;
if the current undetermined word is contained by the adjacent next undetermined word, taking the adjacent next undetermined word as the current undetermined word, and executing again the step of judging whether the current undetermined word and the adjacent next undetermined word have an inclusion relation.
Optionally, if the information extraction result includes a first undetermined word and a second undetermined word, the method further includes:
judging, based on the preset word bank, whether the first undetermined word and the second undetermined word include a cross word, wherein the cross word is both a part of the first undetermined word and a part of the second undetermined word;
if yes, judging whether the attribute of the cross word is the same as the attribute of a first undetermined sub-word, wherein the first undetermined sub-word is the word adjacent to the cross word within the first undetermined word; if the attributes are the same, removing the cross word from the second undetermined word to obtain a modified second undetermined word; and if not, removing the cross word from the first undetermined word to obtain a modified first undetermined word.
Optionally, if the attribute of the cross word is the same as the attribute of the first undetermined sub-word, the method further includes:
storing the mapping relation among the first segmentation result, the first undetermined word and the modified second undetermined word;
if the attribute of the cross word is different from the attribute of the first undetermined sub-word, the method further includes:
storing the mapping relation among the first segmentation result, the modified first undetermined word and the second undetermined word.
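The cross word rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the attribute table, the example words (the presumed Chinese forms of "knee joint" and "joint swelling and pain") and the function names `cross_word` and `resolve` are hypothetical, and the first undetermined sub-word is taken to be the part of the first word that precedes the cross word.

```python
# Hypothetical attribute table: word -> category. The categories are
# assumptions made for illustration, not values given in the source.
ATTRIBUTE = {
    "膝": "organ",        # knee
    "关节": "organ",      # joint
    "膝关节": "organ",    # knee joint
    "肿痛": "symptom",    # swelling and pain
    "关节肿痛": "symptom",  # joint swelling and pain
}


def cross_word(first, second):
    """Longest lexicon word that ends `first` and also begins `second`."""
    for size in range(min(len(first), len(second)) - 1, 0, -1):
        candidate = second[:size]
        if first.endswith(candidate) and candidate in ATTRIBUTE:
            return candidate
    return None


def resolve(first, second):
    """Remove the cross word from one of two overlapping words."""
    cross = cross_word(first, second)
    if cross is None:
        return first, second
    # Assumed reading: the sub-word of `first` adjacent to the cross word
    # is whatever precedes the cross word inside `first`.
    first_sub = first[:-len(cross)]
    if ATTRIBUTE.get(cross) == ATTRIBUTE.get(first_sub):
        # Same attribute: the cross word stays with the first word.
        return first, second[len(cross):]
    # Different attribute: the cross word stays with the second word.
    return first_sub, second


result = resolve("膝关节", "关节肿痛")
```

With these assumed attributes the cross word "关节" (joint) shares the organ attribute of "膝" (knee), so it is removed from the second word, and `result` is `("膝关节", "肿痛")`.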
In a second aspect, an embodiment of the present application provides an information extraction apparatus, including:
the word segmentation unit is used for segmenting words of a preset text according to a preset word bank to obtain a first word segmentation result;
the extracting unit is used for extracting a plurality of undetermined words from the first segmentation result based on the preset word bank, wherein the undetermined words do not comprise the first segmentation result;
and the determining unit is used for determining the undetermined words without containing relations from the undetermined words as the information extraction result of the first segmentation result.
Optionally, the determining unit includes:
a sorting subunit, configured to sort the plurality of undetermined words according to the position of the first character and/or the last character of each undetermined word in the first segmentation result;
a judging subunit, configured to judge, if the current undetermined word has an adjacent next undetermined word, whether the current undetermined word and the adjacent next undetermined word have an inclusion relation, wherein the current undetermined word is one of the plurality of undetermined words;
and a determining subunit, configured to, if the judgment result is negative, take the current undetermined word and/or the adjacent next undetermined word as the information extraction result.
Optionally, the determining unit further includes an executing subunit, configured to:
when the currently undetermined word and the adjacent next undetermined word have an inclusion relationship, if the currently undetermined word contains the adjacent next undetermined word, covering the currently undetermined word with the adjacent next undetermined word, enabling the adjacent next undetermined word to be the currently undetermined word, and executing the step of judging whether the currently undetermined word contains the adjacent next undetermined word;
if the currently undetermined word is contained by the adjacent next undetermined word, the adjacent next undetermined word is made to be the currently undetermined word, and the step of judging whether the currently undetermined word contains the adjacent next undetermined word is executed.
Optionally, the apparatus further comprises:
a cross word judging unit, configured to judge, based on the preset word bank, whether the first undetermined word and the second undetermined word include a cross word if the information extraction result includes the first undetermined word and the second undetermined word, wherein the cross word is both a part of the first undetermined word and a part of the second undetermined word;
an attribute judging unit, configured to judge, if a cross word is included, whether the attribute of the cross word is the same as the attribute of a first undetermined sub-word, wherein the first undetermined sub-word is the word adjacent to the cross word within the first undetermined word;
a modifying unit, configured to remove the cross word from the second undetermined word to obtain a modified second undetermined word if the attributes are the same, and to remove the cross word from the first undetermined word to obtain a modified first undetermined word if they are not.
Optionally, the apparatus further includes a storage unit, specifically configured to:
if the attribute of the cross word is the same as that of the first undetermined subword, storing the mapping relation among the first word segmentation result, the first undetermined word and the modified second undetermined word;
and if the attribute of the cross word is different from the attribute of the first undetermined subword, storing the mapping relation among the first word segmentation result, the modified first undetermined word and the second undetermined word.
As can be seen from the above, in the information extraction method provided in the embodiment of the present application, first, a preset text is segmented according to a preset word bank to obtain a first segmentation result, then, based on the preset word bank, a plurality of undetermined words included in the first segmentation result are extracted from the first segmentation result, the plurality of undetermined words do not include the first segmentation result, and an undetermined word without an inclusion relation is determined from the plurality of undetermined words as an information extraction result of the first segmentation result.
With this two-pass segmentation, the longer first segmentation result is extracted first, and shorter words that have no inclusion relation with one another are then extracted from it as its information extraction result; for example, words representing a body part or a disease are extracted from a complete operation name. On the one hand this increases the amount of extracted information; on the other hand the hierarchy formed by the first segmentation result and its information extraction result strengthens data structuring and makes data easier to search and locate.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of an information extraction method provided in the present application;
Fig. 2 is a flowchart of a method, provided in the present application, for determining undetermined words without inclusion relations from a plurality of undetermined words as the information extraction result of a first segmentation result;
Fig. 3 is a flowchart of an information extraction method provided in the present application;
Fig. 4 is a schematic structural diagram of an information extraction apparatus provided in the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a conventional information extraction method, a text is segmented based on a preset word bank to obtain useful information, such as words representing diseases, symptoms and operations. Segmentation follows the longest matching principle, that is, the text is segmented at the longest word matched in the word bank, and the resulting segmentation is used to extract information from the electronic medical record.
However, in many cases, the longest word also includes other shorter words, which are also very useful information, and these shorter words cannot be extracted based on the longest matching principle, so that the extracted information is less, and the effect of data structuring is affected.
In view of this, an embodiment of the present application provides an information extraction method, where a preset text is segmented according to a preset word bank to obtain a first segmentation result, and then multiple undetermined words included in the first segmentation result are extracted based on the preset word bank, where the multiple undetermined words do not include the first segmentation result, and an undetermined word without an inclusion relation is determined from the multiple undetermined words to serve as an information extraction result of the first segmentation result.
With this two-pass segmentation, the longer first segmentation result is extracted first, and shorter words that have no inclusion relation with one another are then extracted from it as its information extraction result; for example, words representing a body part or a disease are extracted from a complete operation name. On the one hand this increases the amount of extracted information; on the other hand the hierarchy formed by the first segmentation result and its information extraction result strengthens data structuring and makes data easier to query and locate.
In order to make the information extraction method provided by the embodiment of the present application clearer, a specific implementation manner of the information extraction method provided by the embodiment of the present application is described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of an information extraction method according to an embodiment of the present application. Referring to Fig. 1, the method includes:
S101: Segment the preset text according to the preset word bank to obtain a first segmentation result.
The preset word bank can be understood as a dictionary prepared in advance. In the embodiment of the present application it may be a word bank for the medical field, including lexicons for symptoms (symptom), time (time), organs and body parts (organ), diseases (disease), operations (operation), and the like. The preset word bank may comprise at least one of these lexicons. To improve the accuracy of the segmentation result, it may comprise more of them; for example, all of the above lexicons may be used together as the preset word bank.
The preset text is the text from which information is to be extracted. As an example, it may be an electronic medical record. In other possible implementations of the embodiment of the present application, it may be any other text from which information is to be extracted, which is not limited in the embodiment of the present application.
The preset text may include at least one sentence. Each sentence in the preset text can be segmented according to the preset word bank. Segmentation can be implemented in various ways, including matching against the preset word bank and matching by regular-expression rules, and it yields at least one segmentation result per sentence. The segmentation results may include words that are in the preset word bank and words that are not; those that are in the preset word bank are taken as first segmentation results.
For ease of understanding, this is illustrated.
For example, the preset text is a user's electronic medical record that includes the sentence "knee joint swelling and pain for 6 months, aggravated for 3 weeks." Segmenting the sentence according to the preset word bank gives the following result:
"knee joint swelling and pain/symptom 6 months/time ,/O aggravated/O 3 weeks/time ./O"
Here symptom denotes a symptom, time denotes time, and O denotes other. The segments labeled symptom or time are in the preset word bank, and the segments labeled O are not. The segments in the preset word bank, "knee joint swelling and pain", "6 months" and "3 weeks", can be determined as first segmentation results, while the segments not in the preset word bank, ",", "aggravated" and ".", are not taken as first segmentation results.
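As a rough illustration of S101, the following Python sketch segments the example sentence by forward longest matching against a toy word bank. The lexicon entries, the Chinese rendering of the sentence and the function name `segment` are assumptions made for this sketch; the source does not prescribe a particular matching implementation.

```python
# Toy word bank (hypothetical entries): word -> category label.
LEXICON = {
    "膝关节肿痛": "symptom",  # knee joint swelling and pain
    "膝关节": "organ",        # knee joint
    "肿痛": "symptom",        # swelling and pain
    "6月": "time",            # 6 months
    "3周": "time",            # 3 weeks
}
MAX_LEN = max(len(word) for word in LEXICON)


def segment(sentence):
    """Forward longest matching; unmatched characters are labeled 'O'."""
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest window first and shrink until a lexicon word fits.
        for size in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in LEXICON:
                tokens.append((piece, LEXICON[piece]))
                i += size
                break
        else:
            # No lexicon word starts here: emit one character as "other".
            tokens.append((sentence[i], "O"))
            i += 1
    return tokens


tokens = segment("膝关节肿痛6月，加重3周。")
# Only segments found in the word bank count as first segmentation results.
first_results = [word for word, label in tokens if label != "O"]
```

Here `first_results` is `["膝关节肿痛", "6月", "3周"]`. Unlike the example above, this simple sketch emits unmatched text one character at a time, so "aggravated" (加重) comes out as two O tokens rather than one.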
S102: Extract a plurality of undetermined words from the first segmentation result based on the preset word bank.
An undetermined word is a candidate word extracted from the first segmentation result, for which it is still to be decided whether it belongs to the final result. The first segmentation result may be a longer word and an undetermined word a shorter word extracted from it, so the plurality of undetermined words does not include the first segmentation result itself.
The first segmentation result is obtained by segmenting the preset text based on the preset word bank, for example according to the longest matching principle, and is itself a word in the preset word bank. The undetermined words it contains can therefore also be extracted from it based on the preset word bank.
In a possible implementation, the first segmentation result may be backtracked: it is scanned against the preset word bank, every possible word that can be formed within it is recorded, and the first segmentation result itself is deleted from the record. The remaining words are the undetermined words included in the first segmentation result.
Taking "knee joint swelling and pain" as an example, scanning it against the preset word bank gives all possible words:
knee, knee joint, knee joint swelling and pain, joint, joint swelling and pain, swelling and pain, etc.
Deleting the first segmentation result "knee joint swelling and pain" from the record leaves the words "knee, knee joint, joint, joint swelling and pain, swelling and pain", which are determined as the undetermined words included in the first segmentation result "knee joint swelling and pain".
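The scan described for S102 can be sketched as an enumeration of every substring of the first segmentation result that is a word in the word bank. The lexicon contents, the presumed Chinese form of the example word and the function name `pending_words` are assumptions made for illustration.

```python
# Hypothetical word bank containing the words of the example.
LEXICON = {
    "膝",          # knee
    "膝关节",      # knee joint
    "膝关节肿痛",  # knee joint swelling and pain
    "关节",        # joint
    "关节肿痛",    # joint swelling and pain
    "肿痛",        # swelling and pain
}


def pending_words(first_result, lexicon):
    """Return (start, end, word) for each lexicon word inside first_result.

    `end` is exclusive. The first segmentation result itself is skipped,
    mirroring the step that deletes it from the record.
    """
    found = []
    for start in range(len(first_result)):
        for end in range(start + 1, len(first_result) + 1):
            word = first_result[start:end]
            if word in lexicon and word != first_result:
                found.append((start, end, word))
    return found


candidates = pending_words("膝关节肿痛", LEXICON)
```

Keeping the character spans alongside each word is a design choice of this sketch: the later sorting and inclusion checks depend on where each undetermined word sits in the first segmentation result.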
S103: Determine, from the plurality of undetermined words, the undetermined words without inclusion relations as the information extraction result of the first segmentation result.
The purpose of the embodiments of the present application is to extract shorter words containing useful information from longer words, for example, to extract a plurality of shorter words such as diseases and parts from a longer word such as a complete operation name. It is to be understood that the extracted plurality of shorter words may be understood as words that are independent of each other and have no inclusion relationship.
Therefore, the undetermined words included in the first segmentation result can be judged, and the undetermined words without inclusion relation are used as the information extraction result of the first segmentation result. The information extraction result may be regarded as a shorter segmentation result including useful information extracted from a longer first segmentation result. By extracting the shorter information extraction result from the first segmentation result, the extracted information can be increased, and the data structuring effect can be enhanced.
Because the plurality of undetermined words are extracted from the first segmentation result based on the preset word bank, and the plurality of undetermined words corresponding to the same first segmentation result may have an inclusion relationship, at least one of the undetermined words having the inclusion relationship can be removed to obtain undetermined words having no inclusion relationship, and then the undetermined words having no inclusion relationship are used as an information extraction result.
For ease of understanding, this is illustrated.
For example, the undetermined words included in the first segmentation result "knee joint swelling and pain" are "knee, knee joint, joint, joint swelling and pain, swelling and pain". Among them, "knee" and "joint" each have an inclusion relation with "knee joint", and "joint" and "swelling and pain" each have an inclusion relation with "joint swelling and pain". Removing "knee", "joint" and "swelling and pain" leaves the undetermined words "knee joint" and "joint swelling and pain", which have no inclusion relation with each other and can be taken as the information extraction result of the first segmentation result "knee joint swelling and pain".
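The inclusion check in this example can be sketched as a pairwise comparison of character spans. The spans below assume the Chinese form 膝关节肿痛 of the example word, and dropping every word that is contained in another word is one simple policy assumed here for illustration; the source only requires that at least one word of each inclusion pair be removed.

```python
# (start, end, word) spans inside the assumed first segmentation result
# "膝关节肿痛" (knee joint swelling and pain); `end` is exclusive.
PENDING = [
    (0, 1, "膝"),        # knee
    (0, 3, "膝关节"),    # knee joint
    (1, 3, "关节"),      # joint
    (1, 5, "关节肿痛"),  # joint swelling and pain
    (3, 5, "肿痛"),      # swelling and pain
]


def contains(outer, inner):
    """True if the span of `inner` lies inside the span of `outer`."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


# Words contained in at least one other undetermined word.
contained = {
    inner[2]
    for outer in PENDING
    for inner in PENDING
    if contains(outer, inner)
}

# Assumed policy: keep only the words that no other word contains.
kept = [word for _, _, word in PENDING if word not in contained]
```

Under this policy `kept` is `["膝关节", "关节肿痛"]`, that is, "knee joint" and "joint swelling and pain". The two kept words still overlap on "joint", which is not an inclusion relation.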
The above specific implementation manner of the information extraction method provided in the embodiment of the present application is to perform word segmentation on a preset text according to a preset word bank to obtain a first word segmentation result, and then extract a plurality of undetermined words from the first word segmentation result based on the preset word bank, where the plurality of undetermined words do not include the first word segmentation result, and determine, from the plurality of undetermined words, an undetermined word without an inclusion relation as an information extraction result of the first word segmentation result.
With this two-pass segmentation, the longer first segmentation result is extracted first, and shorter words that have no inclusion relation with one another are then extracted from it as its information extraction result; for example, words representing a body part or a disease are extracted from a complete operation name. On the one hand this increases the amount of extracted information; on the other hand the hierarchy formed by the first segmentation result and its information extraction result strengthens data structuring and makes data easier to search and locate.
To extract more information from the preset text and strengthen data structuring, the key is to determine, from the plurality of undetermined words included in the first segmentation result, the undetermined words without inclusion relations and take them as the information extraction result of the first segmentation result. This can be implemented in various ways; for example, the undetermined words can be compared with one another one by one to determine whether an inclusion relation exists between them.
To determine the undetermined words without inclusion relations more quickly, the undetermined words can be sorted first. Compared with direct one-by-one comparison, this reduces the amount of computation and improves processing efficiency.
A specific implementation manner of determining an undetermined word without inclusion relation from a plurality of undetermined words as an information extraction result according to the embodiment of the present application is described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of a method for determining undetermined words without inclusion relations from a plurality of undetermined words as an information extraction result. Referring to Fig. 2, the method includes:
S201: Sort the plurality of undetermined words according to the position of the first character of each undetermined word in the first segmentation result.
The position of the first character of an undetermined word in the first segmentation result is the position at which that character occurs in the first segmentation result. When sorting by this position, the undetermined words whose first character occupies the first position of the first segmentation result are placed first, those whose first character occupies the second position are placed after them, and so on, until all undetermined words of the first segmentation result have been placed, at which point the sorting is complete.
The sorting of undetermined words by the position of their first character is illustrated below with a specific example.
The first segmentation result "knee joint swelling and pain" includes 5 undetermined words: "knee, knee joint, joint, joint swelling and pain, swelling and pain". The undetermined words whose first character is at the first position of the first segmentation result, that is, "knee" and "knee joint", are placed first. Next come the undetermined words whose first character is at the second position, "joint" and "joint swelling and pain". No undetermined word begins at the third position, so the undetermined word whose first character is at the fourth position, "swelling and pain", is placed after them. This completes the sorting of the undetermined words of the first segmentation result, and the sorting result is:
knee, knee joint, joint, joint swelling and pain, swelling and pain.
When sorting by the first character only, undetermined words with the same first character may be in any order. For example, "knee joint" may be placed before "knee".
As an extension of the embodiment of the present application, the multiple undetermined words may also be sorted according to the position of the tail character of each undetermined word in the first word segmentation result. For example, the undetermined words "膝关", "膝关节", "关节", "关节肿痛" and "肿痛" of the first word segmentation result "knee joint swelling and pain" (膝关节肿痛) have three distinct tail characters, "关", "节" and "痛". According to the positions of these tail characters in the first word segmentation result, the undetermined word with the tail character "关" is ranked in front, the undetermined words with the tail character "节" are ranked in the middle, and the undetermined words with the tail character "痛" are ranked last. The sorting result may be:
膝关, 膝关节, 关节, 关节肿痛, 肿痛.
When sorting by the tail character only, the order among undetermined words with the same tail character may be arbitrary. For example, "膝关节" may also be ranked after "关节". In that case the sorting result may be:
膝关, 关节, 膝关节, 肿痛, 关节肿痛.
In other possible implementation manners of the embodiment of the present application, the multiple undetermined words may also be sorted according to the positions of both the first character and the tail character of each undetermined word in the first word segmentation result. The undetermined words are first sorted by the position of their first character in the first word segmentation result, which gives a temporary sorting result in which undetermined words sharing a first character may appear in any order. Undetermined words with the same first character are then sorted by the position of their tail character, which gives the final sorting result "膝关, 膝关节, 关节, 关节肿痛, 肿痛".
The above are some examples of sorting the words to be determined, and in other possible implementation manners of the embodiment of the present application, sorting may be performed in other manners, or sorting may not be performed, which is not limited in the embodiment of the present application.
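The sorting options described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `sort_undetermined` and its `by` parameter are hypothetical, each undetermined word is assumed to occur exactly once in the first word segmentation result, and the Chinese strings are an assumed reconstruction of the translated "knee joint swelling and pain" example.

```python
def sort_undetermined(result, words, by="both"):
    """Sort undetermined words by character position in the first
    word segmentation result (hypothetical helper, for illustration).

    by="first": position of each word's first character
    by="tail":  position of each word's tail character
    by="both":  first character, then tail character for ties
    """
    def span(word):
        start = result.find(word)            # position of the first character
        if start < 0:
            raise ValueError(f"{word!r} does not occur in {result!r}")
        return start, start + len(word) - 1  # (first, tail) positions

    keys = {
        "first": lambda w: span(w)[0],
        "tail": lambda w: span(w)[1],
        "both": span,
    }
    if by not in keys:
        raise ValueError("by must be 'first', 'tail' or 'both'")
    # sorted() is stable, so ties keep their input order; the text allows
    # any order among words that share the same sort position.
    return sorted(words, key=keys[by])


# Assumed reconstruction: 膝关节肿痛 = "knee joint swelling and pain"
result = "膝关节肿痛"
words = ["肿痛", "关节肿痛", "膝关节", "关节", "膝关"]
ordered = sort_undetermined(result, words, by="both")
print(ordered)
```

Sorting on the pair (first position, tail position) covers the combined case in one pass, because tuples compare element by element.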
S202: If the current undetermined word has an adjacent next undetermined word, judge whether the current undetermined word has an inclusion relationship with the adjacent next undetermined word.
The current undetermined word is one of the multiple undetermined words. In a possible manner, the first undetermined word in the sorted sequence may be taken as the current undetermined word. Two undetermined words have an inclusion relationship when one contains the other or is contained by it. Specifically, in this step, if the current undetermined word has an adjacent next undetermined word, it may be judged whether the current undetermined word contains the adjacent next undetermined word, or whether it is contained by the adjacent next undetermined word.
To judge whether the current undetermined word contains the adjacent next undetermined word, the characters of the next undetermined word may be matched one by one against the current undetermined word. If every character of the next undetermined word can be matched consecutively in the current undetermined word, the matching succeeds and the current undetermined word contains the adjacent next undetermined word. Judging whether the current undetermined word is contained by the adjacent next undetermined word is similar: the characters of the current undetermined word are matched one by one against the adjacent next undetermined word, and the matching process is not repeated here.
For ease of understanding, an example is given below.
If the current undetermined word is "膝关" (the first two characters of "膝关节") and the adjacent next undetermined word is "膝关节" (knee joint), it may first be judged whether the current undetermined word "膝关" contains the next undetermined word "膝关节". Specifically, the first character "膝" of the next undetermined word is first matched in the current undetermined word. When at least one match is found, for each match the character following the matched position in the current undetermined word is compared with the second character "关" of the next undetermined word. In this example the second character is also matched in the current undetermined word. The third character "节" of the next undetermined word is then compared, and it cannot be matched in the current undetermined word, so the matching fails and the current undetermined word "膝关" does not contain the next undetermined word "膝关节".
Then it is judged whether the current undetermined word "膝关" is contained by the adjacent next undetermined word "膝关节". Specifically, the first character "膝" of the current undetermined word is matched in the next undetermined word "膝关节". After it is matched, the second character "关" of the current undetermined word is matched on the basis of the first match, that is, it is judged whether the character following the matched "膝" in the next undetermined word is "关". The judgment result is yes, so the matching succeeds and the current undetermined word is contained by the adjacent next undetermined word.
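The character-by-character matching described above can be sketched as follows. The function names `contains` and `has_inclusion` are hypothetical, and the Chinese strings are an assumed reconstruction of the translated example; the sketch only illustrates the consecutive-match idea and is not taken from the embodiment.

```python
def contains(word, candidate):
    """Return True if every character of `candidate` can be matched
    consecutively inside `word` (hypothetical helper)."""
    if not candidate or len(candidate) > len(word):
        return False
    # Try every position in `word` where the candidate could start.
    for start in range(len(word) - len(candidate) + 1):
        matched = True
        for offset, ch in enumerate(candidate):
            if word[start + offset] != ch:
                matched = False   # mismatch: give up on this start position
                break
        if matched:
            return True
    return False


def has_inclusion(current, nxt):
    """True if either word contains the other."""
    return contains(current, nxt) or contains(nxt, current)


current, nxt = "膝关", "膝关节"
print(contains(current, nxt))   # does the current word contain the next one?
print(contains(nxt, current))   # is the current word contained by the next one?
```

In Python the same check is available as `candidate in word`; the explicit loops are kept here only to mirror the matching steps in the text.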
S203: If not, take the current undetermined word and/or the adjacent next undetermined word as an information extraction result.
If the judgment result is negative, that is, the current undetermined word and the adjacent next undetermined word have no inclusion relationship, the current undetermined word may be taken as an information extraction result. Further, the adjacent next undetermined word may then be taken as the current undetermined word, and the step of judging whether the current undetermined word has an inclusion relationship with its adjacent next undetermined word is executed again, so as to decide, according to the inclusion relationship, whether that word is taken as an information extraction result.
In some possible implementation manners of the embodiment of the present application, when the current undetermined word and the adjacent next undetermined word have no inclusion relationship, the next undetermined word may be taken as an information extraction result, or the current undetermined word and/or the adjacent next undetermined word may be taken as information extraction results.
In some cases, the judgment result is that the current undetermined word and the adjacent next undetermined word have an inclusion relationship. Because the undetermined words in the information extraction result must not have an inclusion relationship with one another, the contained word of the two is not taken as an information extraction result, and it may be overwritten by the other of the two undetermined words. After the contained undetermined word is overwritten, the adjacent next undetermined word may be taken as the current undetermined word, and the step of judging whether the current undetermined word has an inclusion relationship with its adjacent next undetermined word is executed again.
In a possible implementation manner, if the current undetermined word contains the adjacent next undetermined word, the adjacent next undetermined word is overwritten with the current undetermined word, the overwritten next undetermined word is taken as the current undetermined word, and the judging step is executed again; if the current undetermined word is contained by the adjacent next undetermined word, the adjacent next undetermined word is taken as the current undetermined word and the judging step is executed again.
It should be noted that, when the current undetermined word is the penultimate undetermined word in the sequence and the adjacent next undetermined word is the last one, the last undetermined word may be taken as an information extraction result after the above steps are performed. That is, when the current undetermined word contains the last undetermined word, the last undetermined word is overwritten with the penultimate one, and the overwritten last undetermined word is taken as an information extraction result. When the last undetermined word has no inclusion relationship with the current undetermined word, the last undetermined word itself may be taken as an information extraction result.
For ease of understanding, the present step is described below by way of example.
Take the undetermined word sequence "膝关, 膝关节, 关节, 关节肿痛, 肿痛" as an example, where "膝关节" means knee joint, "关节" means joint, "关节肿痛" means joint swelling and pain, and "肿痛" means swelling and pain. First, "膝关" is the current undetermined word and "膝关节" is the adjacent next undetermined word. Because "膝关节" contains "膝关", "膝关节" overwrites "膝关", and the sequence becomes "膝关节, 膝关节, 关节, 关节肿痛, 肿痛". Then the second "膝关节" is the current undetermined word and "关节" is the adjacent next undetermined word. The current undetermined word "膝关节" contains "关节", so "膝关节" overwrites "关节", and the sequence becomes "膝关节, 膝关节, 膝关节, 关节肿痛, 肿痛". Next, the third "膝关节" is the current undetermined word and "关节肿痛" is the adjacent next undetermined word. The two have no inclusion relationship, so "膝关节" is taken as an information extraction result. Next, "关节肿痛" is the current undetermined word and "肿痛" is the adjacent next undetermined word. "关节肿痛" contains "肿痛", so "关节肿痛" overwrites "肿痛", and the sequence becomes "膝关节, 膝关节, 膝关节, 关节肿痛, 关节肿痛". Because "关节肿痛" is now the last undetermined word, it is taken as an information extraction result. That is, "knee joint" (膝关节) and "joint swelling and pain" (关节肿痛) are determined as the information extraction results.
The above is a specific implementation manner, provided by the embodiment of the present application, of determining from multiple undetermined words those having no inclusion relationship as the information extraction result: the multiple undetermined words are sorted according to the first character and/or the tail character, the current undetermined word is compared with the adjacent next undetermined word, and if the two have no inclusion relationship, the current undetermined word and/or the adjacent next undetermined word is taken as an information extraction result. Because the undetermined words are sorted, the number of comparisons can be reduced, which speeds up the determination of undetermined words having no inclusion relationship and improves the efficiency of obtaining the information extraction result.
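The adjacent-comparison procedure of S202 and S203 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `extract_without_inclusion` is hypothetical, the input is assumed to be already sorted, Python's `in` operator stands in for the character matching, and the Chinese strings are an assumed reconstruction of the translated example.

```python
def extract_without_inclusion(sorted_words):
    """Walk a sorted list of undetermined words and keep those that have
    no inclusion relationship with their neighbor (hypothetical helper)."""
    words = list(sorted_words)          # copy, because entries get overwritten
    results = []
    for i in range(len(words)):
        current = words[i]
        if i == len(words) - 1:
            results.append(current)     # the last (possibly overwritten) word
            break
        nxt = words[i + 1]
        if nxt in current:
            words[i + 1] = current      # current contains next: overwrite next
        elif current in nxt:
            continue                    # current is contained: drop it, move on
        else:
            results.append(current)     # no inclusion relationship: keep it
    return results


sequence = ["膝关", "膝关节", "关节", "关节肿痛", "肿痛"]
print(extract_without_inclusion(sequence))
```

Each word is compared only with its immediate neighbor, so the walk makes at most one comparison step per word, which is the saving the sorting is meant to provide.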
In the above example, the information extraction result includes the two undetermined words "knee joint" and "joint swelling and pain", both of which contain the cross word "joint". The cross word in the undetermined words may be processed to obtain a simplified information extraction result.
Based on this, the embodiment of the application provides an information extraction method, and when an information extraction result includes at least two undetermined words, the method can be used for further processing the undetermined words in the information extraction result, so that the information extraction result is simplified.
Fig. 3 is a flowchart of an information extraction method according to an embodiment of the present application. Referring to Fig. 3, if the information extraction result includes a first undetermined word and a second undetermined word, the method further includes:
S301: Judge, based on a preset lexicon, whether the first undetermined word and the second undetermined word include a cross word.
A cross word is a part of the first undetermined word and also a part of the second undetermined word. When the first undetermined word includes the cross word, it may also include a first undetermined subword adjacent to the cross word. It should be noted that the first undetermined word may include at least one first undetermined subword. For ease of understanding, the embodiments of the present application are described by taking a first undetermined word that includes one first undetermined subword as an example.
The first undetermined word and the second undetermined word may be regarded as character strings, for example Chinese character strings, and whether they include a cross word may be determined by character matching in combination with the preset lexicon. In a possible implementation manner, character matching is first performed on the first undetermined word and the second undetermined word to obtain their overlapping part, the overlapping part is looked up in the preset lexicon to obtain the words it can form, and the cross word is determined from those words. In other possible implementation manners of the embodiment of the present application, the first undetermined word and the second undetermined word may each be looked up in the preset lexicon to obtain all words that can be formed from the first undetermined word and all words that can be formed from the second undetermined word. The first undetermined word and the second undetermined word themselves are removed from the respective sets, the remaining words of the first undetermined word are matched against the remaining words of the second undetermined word, and a word that matches successfully may be regarded as a cross word of the two.
The process of determining cross-words is described below with reference to specific examples.
In this example, the information extraction result includes the first undetermined word "knee joint" (膝关节) and the second undetermined word "joint swelling and pain" (关节肿痛). Looking up the first undetermined word in the preset lexicon gives three possible words, "膝关", "膝关节" and "关节" (joint); looking up the second undetermined word gives three possible words, "关节", "关节肿痛" and "肿痛" (swelling and pain). After the first undetermined word and the second undetermined word themselves are removed, the possible words of the first undetermined word are "膝关" and "关节", and those of the second undetermined word are "关节" and "肿痛". Matching the two sets determines that the cross word is "joint" (关节).
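The second way of finding the cross word (enumerate the lexicon words inside each undetermined word, drop the words themselves, then intersect) can be sketched as follows. The helper names and the contents of `lexicon` are hypothetical, and the Chinese strings are an assumed reconstruction of the translated example; a real preset lexicon would be far larger.

```python
def lexicon_words_in(text, lexicon):
    """Return every substring of `text` that is an entry of `lexicon`."""
    found = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in lexicon:
                found.add(text[i:j])
    return found


def cross_words(first, second, lexicon):
    """Lexicon words shared by both undetermined words, excluding the
    undetermined words themselves (hypothetical helper)."""
    first_parts = lexicon_words_in(first, lexicon) - {first}
    second_parts = lexicon_words_in(second, lexicon) - {second}
    return first_parts & second_parts


# Hypothetical miniature lexicon for the example only.
lexicon = {"膝关", "膝关节", "关节", "关节肿痛", "肿痛"}
print(cross_words("膝关节", "关节肿痛", lexicon))
```

Enumerating all substrings is quadratic in the word length, which is acceptable for short undetermined words; a production system would more likely reuse the dictionary matcher used for segmentation.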
S302: If yes, judge whether the attribute of the cross word is the same as the attribute of the first undetermined subword.
An attribute may be understood as the type of a word. The attribute of the cross word and the attribute of the first undetermined subword may be obtained from the labeling result based on the preset lexicon. For example, a word labeled using a symptom dictionary in the preset lexicon has the attribute "symptom", and a word labeled using an organ part dictionary in the preset lexicon has the attribute "organ part".
If the first undetermined word and the second undetermined word include a cross word, it may be judged whether the attribute of the cross word is the same as that of the first undetermined subword. Given the writing characteristics of the preset text, for example an electronic medical record, when several words with the same attribute are juxtaposed, for example a first undetermined subword and a cross word that both denote organ parts, they can be combined into one coarser-grained word denoting an organ part. Therefore, when a cross word exists, whether the attributes of the first undetermined subword and the cross word are the same may be judged in order to decide how to process the first undetermined word and the second undetermined word.
It should be noted that, when the first undetermined word and the second undetermined word include a cross word and the first undetermined word includes multiple first undetermined subwords in addition to the cross word, it may be judged whether the attribute of the first undetermined subword adjacent to the cross word is the same as the attribute of the cross word. For the specific judgment process, reference may be made to the case where the first undetermined word includes one first undetermined subword, which is not repeated here.
S303: If they are the same, remove the cross word from the second undetermined word to obtain a modified second undetermined word.
In this step, if the attribute of the first undetermined subword is the same as the attribute of the cross word, the first undetermined subword in the first undetermined word can be juxtaposed with the cross word. To avoid repetition, the cross word may be removed from the second undetermined word to obtain the modified second undetermined word.
For example, the first undetermined word "knee joint" and the second undetermined word "joint swelling and pain" have the cross word "joint", and the first undetermined word also has the first undetermined subword "knee". Both "knee" and "joint" denote body parts, that is, they have the same attribute, so they can be kept together as "knee joint". To avoid repeating the meaning, the cross word "joint" is removed from the second undetermined word "joint swelling and pain" to obtain the modified second undetermined word "swelling and pain".
S304: If not, remove the cross word from the first undetermined word to obtain a modified first undetermined word.
In this step, if the attribute of the first undetermined subword is different from the attribute of the cross word, the first undetermined subword in the first undetermined word is not suitable to be juxtaposed with the cross word, and the cross word may be removed from the first undetermined word to obtain the modified first undetermined word. So as not to lose the information carried by the cross word, the cross word in the second undetermined word is retained.
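Steps S302 to S304 can be sketched as follows. This is an illustration under stated assumptions: the function name `resolve_cross_word` and the `attributes` mapping are hypothetical, the first undetermined word is assumed to consist of exactly one first undetermined subword plus the cross word, and the Chinese strings (膝 = knee, 关节 = joint, 肿痛 = swelling and pain) are an assumed reconstruction of the translated example.

```python
def resolve_cross_word(first, second, cross, attributes):
    """Remove the cross word from one of two undetermined words.

    `attributes` maps a word to its type label and must contain both the
    cross word and the first undetermined subword (hypothetical input).
    Returns the (possibly modified) first and second undetermined words.
    """
    # First undetermined subword: what is left of `first` without the cross word.
    subword = first.replace(cross, "", 1)
    if attributes[subword] == attributes[cross]:
        # S303: same attribute, keep `first` whole and trim the second word.
        return first, second.replace(cross, "", 1)
    # S304: different attribute, trim the first word and keep the second whole.
    return subword, second


attributes = {"膝": "organ", "关节": "organ", "肿痛": "symptom"}
print(resolve_cross_word("膝关节", "关节肿痛", "关节", attributes))
```

Either branch leaves the cross word in exactly one of the two words, so no information is lost and nothing is repeated.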
In order to facilitate information search and information positioning, the mapping relation among the first word segmentation result, the first pending word and the second pending word can be stored. It should be noted that, when the first pending word and/or the second pending word is modified, the mapping relationship between the first segmentation result and the modified first pending word and/or the modified second pending word is stored. In a possible implementation manner, if the attribute of the cross word is the same as that of the first undetermined subword, storing a mapping relation among the first word segmentation result, the first undetermined word and the modified second undetermined word; and if the attribute of the cross word is different from the attribute of the first undetermined subword, storing the mapping relation among the first word segmentation result, the modified first undetermined word and the second undetermined word.
In addition to the first word segmentation result, other word segmentation results that are not in the preset lexicon may also be stored. To distinguish them, the first word segmentation result and the other word segmentation results may be stored in different formats. In a possible implementation manner, the first word segmentation result and its corresponding information extraction result may be stored in the format {"signal": "", "word": "", "value": []}, and other word segmentation results not in the preset lexicon may be stored in the format ["", "O"]. Here "word" represents the first word segmentation result, "signal" represents the attribute of the first word segmentation result, and "value" represents the information extraction result corresponding to the first word segmentation result. When the longer first word segmentation result needs to be extracted, it can be retrieved from "word"; when the shorter information extraction result needs to be extracted, it can be retrieved from "value". This structured storage allows word segmentation results to be located and extracted accurately.
For the convenience of understanding, the storage process of the first segmentation result and other segmentation results not in the preset lexicon is described below with reference to specific examples.
For example, the sentence "Knee joint swelling and pain for 6 months, aggravated for 3 weeks." in an electronic medical record is segmented to obtain the first word segmentation results "knee joint swelling and pain", "6 months" and "3 weeks", and the other word segmentation results ",", "aggravation" and ".", which are not in the preset lexicon.
According to the storage format provided by the implementation manner, the word segmentation result can be stored as follows:
[{"signal": "symptom", "word": "knee joint swelling and pain", "value": [["knee joint", "organ"], ["swelling and pain", "symptom"]]}, {"signal": "time", "word": "6 months", "value": []}, [",", "O"], ["aggravation", "O"], {"signal": "time", "word": "3 weeks", "value": []}, [".", "O"]]
In this example, when the user intends to extract "knee joint" but not "knee joint swelling and pain", "knee joint" may be retrieved from "value", so that only "knee joint" is extracted. When the user intends to extract "knee joint swelling and pain", it may be retrieved from "word".
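The storage format and the two kinds of retrieval can be sketched as follows. The field names "signal", "word" and "value" and the "O" label come from the text; the helper names `store`, `find_long` and `find_short` and the tuple-based input are hypothetical.

```python
import json


def store(segments):
    """Build the structured list from (word, attribute, value) tuples.
    `attribute` is None for segments that are not in the preset lexicon."""
    records = []
    for word, attribute, value in segments:
        if attribute is None:
            records.append([word, "O"])
        else:
            records.append({"signal": attribute, "word": word, "value": value})
    return records


def find_long(records, attribute):
    """Retrieve longer first word segmentation results from "word"."""
    return [r["word"] for r in records
            if isinstance(r, dict) and r["signal"] == attribute]


def find_short(records, attribute):
    """Retrieve shorter information extraction results from "value"."""
    return [word for r in records if isinstance(r, dict)
            for word, attr in r["value"] if attr == attribute]


records = store([
    ("knee joint swelling and pain", "symptom",
     [["knee joint", "organ"], ["swelling and pain", "symptom"]]),
    ("6 months", "time", []),
    (",", None, []),
    ("aggravation", None, []),
    ("3 weeks", "time", []),
    (".", None, []),
])
print(json.dumps(records))
print(find_short(records, "organ"))
print(find_long(records, "symptom"))
```

Using a dict for lexicon hits and a two-element list for everything else lets a reader tell the two kinds of record apart by type alone, which is what the two formats in the text achieve.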
In the information extraction method provided in the embodiment of the present application, when the information extraction result of the first word segmentation result includes a first undetermined word and a second undetermined word, it may be judged, based on the preset lexicon, whether the two include a cross word. When they do, it is judged whether the attribute of the first undetermined subword adjacent to the cross word in the first undetermined word is the same as the attribute of the cross word; if so, the cross word in the second undetermined word is removed, and if not, the cross word in the first undetermined word is removed. By removing the cross word from one of the first undetermined word and the second undetermined word according to the attributes of the cross word and the first undetermined subword, a simplified information extraction result is obtained.
Based on the specific implementation manner of the information extraction method provided by the foregoing embodiment, an embodiment of the present application further provides an information extraction device, and the information extraction device provided by the embodiment of the present application is introduced from the perspective of a functional module with reference to the accompanying drawings.
Fig. 4 is a schematic structural diagram of an information extraction apparatus according to an embodiment of the present application, please refer to fig. 4, where the apparatus includes:
a word segmentation unit 401, configured to perform word segmentation on a preset text according to a preset lexicon to obtain a first word segmentation result;
an extracting unit 402, configured to extract multiple undetermined words from the first segmentation result based on the preset lexicon, where the multiple undetermined words do not include the first segmentation result;
a determining unit 403, configured to determine, from the multiple undetermined words, undetermined words having no inclusion relationship as an information extraction result of the first word segmentation result.
Optionally, the determining unit 403 includes:
a sorting subunit, configured to sort the multiple undetermined words according to the position of the first character and/or the tail character of each undetermined word in the first word segmentation result;
a judging subunit, configured to judge, if the current undetermined word has an adjacent next undetermined word, whether the current undetermined word has an inclusion relationship with the adjacent next undetermined word, where the current undetermined word is one of the multiple undetermined words;
and a determining subunit, configured to, if the judgment result is negative, take the current undetermined word and/or the adjacent next undetermined word as the information extraction result.
Optionally, the determining unit 403 further includes an executing subunit, configured to:
when the current undetermined word and the adjacent next undetermined word have an inclusion relationship, if the current undetermined word contains the adjacent next undetermined word, overwrite the adjacent next undetermined word with the current undetermined word, take the adjacent next undetermined word as the current undetermined word, and execute the step of judging whether the current undetermined word has an inclusion relationship with its adjacent next undetermined word;
if the current undetermined word is contained by the adjacent next undetermined word, take the adjacent next undetermined word as the current undetermined word, and execute the step of judging whether the current undetermined word has an inclusion relationship with its adjacent next undetermined word.
Optionally, the apparatus further comprises:
a cross word judging unit, configured to judge, based on the preset word library, whether the first to-be-determined word and the second to-be-determined word include a cross word if the information extraction result includes the first to-be-determined word and the second to-be-determined word, where the cross word is a part of the first to-be-determined word and is a part of the second to-be-determined word;
an attribute judging unit, configured to judge, if a cross word is included, whether the attribute of the cross word is the same as the attribute of the first undetermined subword;
a modifying unit, configured to remove the cross word from the second undetermined word to obtain a modified second undetermined word if the attributes are the same, and to remove the cross word from the first undetermined word to obtain a modified first undetermined word if the attributes are different, where the first undetermined subword is a word adjacent to the cross word in the first undetermined word.
Optionally, the apparatus further includes a storage unit, specifically configured to:
if the attribute of the cross word is the same as that of the first undetermined subword, storing the mapping relation among the first word segmentation result, the first undetermined word and the modified second undetermined word;
and if the attribute of the cross word is different from the attribute of the first undetermined subword, storing the mapping relation among the first word segmentation result, the modified first undetermined word and the second undetermined word.
The above is a specific implementation manner of the information extraction device provided in this embodiment of the present application, where words are segmented according to a preset word bank to obtain a first segmentation result, and then a plurality of undetermined words included in the first segmentation result are extracted based on the preset word bank, where the plurality of undetermined words do not include the first segmentation result, and an undetermined word without an inclusion relation is determined from the plurality of undetermined words as an information extraction result of the first segmentation result.
By performing word segmentation twice, the longer first word segmentation result can be extracted, and shorter information extraction results having no inclusion relationship can be further extracted from it; for example, words representing information such as body parts and diseases can be extracted from a complete word representing an operation name. On the one hand, this increases the amount of information extracted; on the other hand, arranging the first word segmentation result and its information extraction results in a structured hierarchy strengthens the structuring of the data and facilitates data search and positioning.
When introducing elements of various embodiments of the present application, the articles "a," "an," "the," and "said" are intended to mean that there are one or more of the elements. The terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements.
It should be noted that, as one of ordinary skill in the art would understand, all or part of the processes of the above method embodiments may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when executed, the computer program may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, and the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing describes embodiments of the present application. It should be noted that those skilled in the art may make numerous modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application.

Claims (8)

1. An information extraction method, the method comprising:
segmenting a preset text into words according to a preset word bank to obtain a first segmentation result;
extracting a plurality of undetermined words from the first segmentation result based on the preset word bank, wherein the undetermined words do not include the first segmentation result itself;
determining, from the plurality of undetermined words, undetermined words that have no inclusion relation as information extraction results of the first segmentation result;
wherein, if the information extraction result comprises a first undetermined word and a second undetermined word, the method further comprises:
judging, based on the preset word bank, whether the first undetermined word and the second undetermined word include a cross word, wherein the cross word is both a part of the first undetermined word and a part of the second undetermined word;
if so, judging whether the attribute of the cross word is the same as the attribute of a first undetermined subword, wherein the first undetermined subword is the word adjacent to the cross word within the first undetermined word; if the attributes are the same, removing the cross word from the second undetermined word to obtain a modified second undetermined word; if not, removing the cross word from the first undetermined word to obtain a modified first undetermined word;
wherein segmenting the preset text according to the preset word bank to obtain the first segmentation result comprises: segmenting the preset text according to the preset word bank to obtain segmentation results representing symptoms, time, and other words, wherein the segmentation results representing other words are segmentation results that are not in the preset word bank.
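The cross-word handling recited in claim 1 can be illustrated with the sketch below. It is a hedged example only: the function name `resolve_cross`, the (start, end, text) span representation, the attribute table, and the sample words are assumptions, as is the simplification that the cross word is the tail of the first undetermined word and the head of the second, so that the first undetermined subword is whatever precedes the cross word.

```python
def resolve_cross(first, second, attributes):
    """Remove a shared cross word from one of two partially overlapping words.

    `first` and `second` are (start, end, text) spans within the first
    segmentation result; `attributes` maps lexicon words to an attribute
    label. All of these names are hypothetical.
    """
    s1, e1, w1 = first
    s2, e2, w2 = second
    # Only a partial overlap counts as a cross word; disjoint spans and
    # inclusion relations are left untouched.
    if not (s1 < s2 < e1 < e2):
        return first, second
    overlap = e1 - s2
    cross = w2[:overlap]   # shared part of both words
    rest = w1[:-overlap]   # first undetermined subword, adjacent to the cross word
    if cross not in attributes:
        # The cross word must itself be known to the word bank.
        return first, second
    if attributes.get(rest) == attributes[cross]:
        # Same attribute: the cross word stays with the first word.
        return first, (e1, e2, w2[overlap:])
    # Different attribute: the cross word stays with the second word.
    return (s1, s2, rest), second


# Hypothetical attribute tables for the text "左上腹痛".
same = {"左上": "body_part", "腹": "body_part"}
different = {"左上": "orientation", "腹": "body_part"}
pair = ((0, 3, "左上腹"), (2, 4, "腹痛"))
kept_in_first = resolve_cross(*pair, same)
kept_in_second = resolve_cross(*pair, different)
```

With the first table the result is "左上腹" and "痛"; with the second it is "左上" and "腹痛". Which attribute labels a real word bank would assign is not stated in the claims, so both tables are purely illustrative.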
2. The method of claim 1, wherein determining, from the plurality of undetermined words, the undetermined words that have no inclusion relation as the information extraction result of the first segmentation result comprises:
sorting the plurality of undetermined words according to the position of the first character and/or the last character of each undetermined word in the first segmentation result;
if a current undetermined word has an adjacent next undetermined word, judging whether the current undetermined word has an inclusion relation with the adjacent next undetermined word, and if not, taking the current undetermined word and/or the adjacent next undetermined word as the information extraction result, wherein the current undetermined word is one of the plurality of undetermined words.
3. The method of claim 2, wherein if the current undetermined word has an inclusion relation with the adjacent next undetermined word, the method further comprises:
if the current undetermined word contains the adjacent next undetermined word, covering the current undetermined word with the adjacent next undetermined word, making the adjacent next undetermined word the current undetermined word, and executing again the step of judging whether the current undetermined word has an inclusion relation with the adjacent next undetermined word;
if the current undetermined word is contained by the adjacent next undetermined word, making the adjacent next undetermined word the current undetermined word, and executing again the step of judging whether the current undetermined word has an inclusion relation with the adjacent next undetermined word.
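The sorting and pairwise inclusion check of claims 2 and 3 can be sketched as below. This is an illustration under a loudly stated assumption: when two words are in an inclusion relation, the sketch retains the containing (longer) word. The translated claim wording ("covering") can also be read the other way, so this choice is an interpretation and not a statement of the claimed method. The names `contains` and `filter_inclusion` and the span tuples are hypothetical.

```python
def contains(a, b):
    """True if span `a` fully covers span `b`; spans are (start, end, text)."""
    return a[0] <= b[0] and b[1] <= a[1]


def filter_inclusion(candidates):
    """Keep undetermined words that have no inclusion relation.

    Assumption (not confirmed by the source): of two words in an
    inclusion relation, the containing word is the one retained.
    """
    # Sort by position of the first character, then of the last character.
    ordered = sorted(candidates, key=lambda c: (c[0], c[1]))
    results, current = [], None
    for nxt in ordered:
        if current is None:
            current = nxt
        elif contains(current, nxt):
            continue            # next is inside current: current stays
        elif contains(nxt, current):
            current = nxt       # current is inside next: next takes over
        else:
            results.append(current)  # no inclusion: current is a result
            current = nxt
    if current is not None:
        results.append(current)
    return results


# Hypothetical undetermined words from "腹腔镜胆囊切除术".
kept = filter_inclusion([
    (3, 8, "胆囊切除术"),
    (0, 3, "腹腔镜"),
    (3, 5, "胆囊"),
])
```

Sorting first means each word only has to be compared with its neighbor, which is the reason the claim orders the words by character position before checking inclusion.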
4. The method of claim 1, wherein if the attribute of the cross word is the same as the attribute of the first undetermined subword, the method further comprises:
storing the mapping relation among the first segmentation result, the first undetermined word, and the modified second undetermined word;
and if the attribute of the cross word is different from the attribute of the first undetermined subword, the method further comprises:
storing the mapping relation among the first segmentation result, the modified first undetermined word, and the second undetermined word.
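One simple way to hold the mapping relation of claim 4 is a dictionary keyed by the first segmentation result. The sketch below is an assumption for illustration only: the claims do not specify a storage format, and the name `store_mapping` and the sample strings are hypothetical.

```python
def store_mapping(store, first_result, first_word, second_word):
    """Record a pair of (possibly modified) undetermined words under the
    first segmentation result they were extracted from."""
    store.setdefault(first_result, []).append((first_word, second_word))
    return store


# Hypothetical example: the second word was modified from "腹痛" to "痛".
store = {}
store_mapping(store, "左上腹痛", "左上腹", "痛")
```

Keying by the first segmentation result keeps the longer word and its extracted parts together, which matches the hierarchical arrangement the description says makes the data easier to search.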
5. An information extraction apparatus, characterized in that the apparatus comprises:
a word segmentation unit, configured to segment a preset text into words according to a preset word bank to obtain a first segmentation result;
an extracting unit, configured to extract a plurality of undetermined words from the first segmentation result based on the preset word bank, wherein the undetermined words do not include the first segmentation result itself;
a determining unit, configured to determine, from the plurality of undetermined words, undetermined words that have no inclusion relation as information extraction results of the first segmentation result;
wherein the apparatus further comprises:
a cross word judging unit, configured to judge, based on the preset word bank, whether a first undetermined word and a second undetermined word include a cross word if the information extraction result comprises the first undetermined word and the second undetermined word, wherein the cross word is both a part of the first undetermined word and a part of the second undetermined word;
an attribute judging unit, configured to judge, if a cross word is included, whether the attribute of the cross word is the same as the attribute of a first undetermined subword, wherein the first undetermined subword is the word adjacent to the cross word within the first undetermined word;
a modifying unit, configured to remove the cross word from the second undetermined word to obtain a modified second undetermined word if the attributes are the same, and to remove the cross word from the first undetermined word to obtain a modified first undetermined word if they are not;
wherein segmenting the preset text according to the preset word bank to obtain the first segmentation result comprises: segmenting the preset text according to the preset word bank to obtain segmentation results representing symptoms, time, and other words, wherein the segmentation results representing other words are segmentation results that are not in the preset word bank.
6. The apparatus of claim 5, wherein the determining unit comprises:
a sorting subunit, configured to sort the plurality of undetermined words according to the position of the first character and/or the last character of each undetermined word in the first segmentation result;
a judging subunit, configured to judge, if a current undetermined word has an adjacent next undetermined word, whether the current undetermined word has an inclusion relation with the adjacent next undetermined word, wherein the current undetermined word is one of the plurality of undetermined words;
and a determining subunit, configured to take the current undetermined word and/or the adjacent next undetermined word as the information extraction result if the judgment result is negative.
7. The apparatus of claim 6, wherein the determining unit further comprises an executing subunit configured to:
when the current undetermined word and the adjacent next undetermined word have an inclusion relation, if the current undetermined word contains the adjacent next undetermined word, cover the current undetermined word with the adjacent next undetermined word, make the adjacent next undetermined word the current undetermined word, and execute again the step of judging whether the current undetermined word has an inclusion relation with the adjacent next undetermined word;
if the current undetermined word is contained by the adjacent next undetermined word, make the adjacent next undetermined word the current undetermined word, and execute again the step of judging whether the current undetermined word has an inclusion relation with the adjacent next undetermined word.
8. The apparatus of claim 5, further comprising a storage unit specifically configured to:
if the attribute of the cross word is the same as the attribute of the first undetermined subword, store the mapping relation among the first segmentation result, the first undetermined word, and the modified second undetermined word;
and if the attribute of the cross word is different from the attribute of the first undetermined subword, store the mapping relation among the first segmentation result, the modified first undetermined word, and the second undetermined word.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711476786.4A CN108052508B (en) 2017-12-29 2017-12-29 Information extraction method and device

Publications (2)

Publication Number Publication Date
CN108052508A CN108052508A (en) 2018-05-18
CN108052508B CN108052508B (en) 2021-11-09

Family

ID=62129433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711476786.4A Active CN108052508B (en) 2017-12-29 2017-12-29 Information extraction method and device

Country Status (1)

Country Link
CN (1) CN108052508B (en)

Also Published As

Publication number Publication date
CN108052508A (en) 2018-05-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190729

Address after: Room 2307, 3 storeys, No. 7 Pioneer Road, Shangdi Information Industry Base, Haidian District, Beijing 100085

Applicant after: Beijing Jiahesen Health Technology Co., Ltd.

Address before: 100085 Haidian District city on the base of the information industry base, Pioneer Road, building No. 7, section I, layer three, layer

Applicant before: Goodwill Information Technology Co., Ltd.

GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 100085 room 2208, 2nd floor, building 1, No.7 Kaifa Road, Shangdi Information Industry base, Haidian District, Beijing

Patentee after: Beijing Jiahesen Health Technology Co.,Ltd.

Address before: 100085 room 2307, third floor, building 1, No.7 Kaifa Road, Shangdi Information Industry base, Haidian District, Beijing

Patentee before: Beijing Jiahesen Health Technology Co.,Ltd.
