CN113496116A - Method, apparatus, and storage medium for recognizing text - Google Patents

Method, apparatus, and storage medium for recognizing text Download PDF

Info

Publication number
CN113496116A
CN113496116A
Authority
CN
China
Prior art keywords
text
string
word
strings
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010256902.7A
Other languages
Chinese (zh)
Other versions
CN113496116B (en)
Inventor
郑仲光
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN202010256902.7A priority Critical patent/CN113496116B/en
Publication of CN113496116A publication Critical patent/CN113496116A/en
Application granted granted Critical
Publication of CN113496116B publication Critical patent/CN113496116B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a method and apparatus for recognizing text, and a storage medium. The method comprises the following steps: decomposing each text in an original text library into a set of word strings and merging these sets with the corresponding word strings of each text in the original text library to form a new text library; for each word in the text to be recognized, identifying the word string in the new text library that begins with that word and has the longest match with the text to be recognized; and iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings, according to the position information of the identified matched word strings in the text to be recognized, to obtain a final recognition result.

Description

Method, apparatus, and storage medium for recognizing text
Technical Field
The present disclosure relates to Natural Language Processing (NLP), and in particular to Named Entity Recognition (NER).
Background
In the field of natural language processing, named entity recognition is a fundamental task whose purpose is to recognize words or phrases of particular classes in text, such as names of people, addresses, organizations, proper nouns, and so on. The results of NER are widely used in downstream tasks such as information retrieval and automatic translation.
In general, the models required by the NER task are supervised models, i.e., a labeled corpus is needed for training. In actual scenarios, however, annotated data cannot be obtained for all categories of entities.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
According to an aspect of the present invention, there is provided a method for recognizing text, including: decomposing each text in an original text library into a set of word strings and merging these sets with the corresponding word strings of each text in the original text library to form a new text library; for each word in the text to be recognized, identifying the word string in the new text library that begins with that word and has the longest match with the text to be recognized; and iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings, according to the position information of the identified matched word strings in the text to be recognized, to obtain a final recognition result.
According to another aspect of the present invention, there is provided an apparatus for recognizing text, including: decomposing means configured to decompose each text in an original text library into a set of word strings and merge these sets with the corresponding word strings of each text in the original text library to form a new text library; identifying means configured to identify, for each word in the text to be recognized, the word string in the new text library that begins with that word and has the longest match with the text to be recognized; and expanding and merging means configured to iteratively expand and merge adjacent or partially overlapping word strings in the set of matched word strings, according to the position information of the identified matched word strings in the text to be recognized, to obtain a final recognition result.
According to other aspects of the invention, corresponding computer program code, computer readable storage medium and computer program product are also provided.
With the above text recognition method and apparatus, text can be recognized without a labeled training set, thereby reducing labor and time costs and reducing complexity.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
To further clarify the above and other advantages and features of the present disclosure, a more particular description of embodiments of the present disclosure will be rendered by reference to the appended drawings. Which are incorporated in and form a part of this specification, along with the detailed description that follows. Elements having the same function and structure are denoted by the same reference numerals. It is appreciated that these drawings depict only typical examples of the disclosure and are therefore not to be considered limiting of its scope. In the drawings:
FIG. 1 shows an example of a list of disease names;
FIG. 2 is a flow diagram of a method 200 for recognizing text, according to one embodiment;
FIG. 3 illustrates a flow diagram for iteratively expanding and merging each matched string to the right, according to one embodiment;
FIG. 4 illustrates a flowchart for iteratively expanding and merging each matched string to the left, in accordance with a preferred embodiment;
FIG. 5 is a block diagram of an apparatus 500 for recognizing text, according to one embodiment; and
FIG. 6 is a block diagram of an exemplary architecture of a general purpose personal computer in which methods and/or apparatus according to embodiments of the invention may be implemented.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structures and/or processing steps closely related to the scheme according to the present disclosure are shown in the drawings, and other details not so relevant to the present disclosure are omitted.
The models required by the NER task must be trained on a labeled corpus. Without labeled data, a supervised model cannot be trained. Constructing a labeled corpus is labor- and time-intensive, costly, and difficult, whereas a domain dictionary is relatively easy to obtain.
The present invention provides a recognition method based on a domain dictionary, so that target entities in a corpus can be recognized to the greatest extent possible using the domain dictionary without labeling the corpus, thereby overcoming or mitigating the drawbacks of the prior art.
It should be noted that, although the following description uses the example of identifying disease names in the medical literature, the skilled person will understand that the solution of the invention can be applied to entity recognition in the literature of any other field.
It should also be noted that, although the following description uses the recognition of English text as an example, the person skilled in the art will understand that the solution of the invention can be applied to any other language.
For example, assume that the task is to identify disease names in the medical literature. Although existing medical dictionaries already contain many entries, their coverage is still low. That is, if the entries in the dictionary are used directly for matching in the medical literature, much entity information is lost. For example, assume that the goal is to recognize the disease name "Advanced diffuse histiocytic lymphoma" in the text "Advanced diffuse histiocytic lymphoma, a potentially curable disease", but there is no corresponding entry in the existing medical dictionary. In this case, if the text is recognized only by exact matches against dictionary entries, the disease name cannot be recognized.
However, each of the words "advanced", "diffuse", "histiocytic" and "lymphoma" in the text appears in existing dictionary entries, for example "advanced sleep phase syndrome" and "diffuse large B-cell lymphoma". This situation is relatively common. For example, FIG. 1 shows an example of a disease name list obtained from an existing medical dictionary. It can be observed that, although the entry-level coverage of the medical dictionary is low, the word-level coverage is relatively high. The scheme of the invention identifies entities based on this characteristic.
The method 200 for recognizing text according to an embodiment will be described in detail below in conjunction with fig. 2.
The method 200 starts at step 201. In step 201, each text in the original text library is decomposed into a set of word strings, which is merged with the corresponding word strings of each text in the original text library to form a new text library. Specifically, in the present embodiment, the original text library is, for example, a medical dictionary, each text is an entry in the medical dictionary, and a word is a word in the entry.
According to a preferred embodiment, the dictionary can be disassembled as follows.
For an entry t = [w0, w1, …, wn], where wi denotes a word in the entry, the entry can be decomposed into a set of word strings [s0, s1, …, sm], where sh = [wi, wi+1, …, wj] with j - i < n.
For sh: if it begins with w0, it is defined as a prefix string; if it ends with wn, it is defined as a suffix string; otherwise, it is defined as an intermediate string. For example, for the entry "obsolete meningococcal optic neuritis", the following string set can be obtained:
s0=obsolete
s1=obsolete meningococcal
s2=obsolete meningococcal optic
s3=meningococcal
s4=meningococcal optic
s5=meningococcal optic neuritis
s6=optic neuritis
s7=neuritis
where s0, s1 and s2 are prefix strings; s5, s6 and s7 are suffix strings; and the others are intermediate strings.
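A minimal sketch of this decomposition rule follows; the function and label names are illustrative, not from the patent. Note that a literal enumeration of all contiguous sub-strings shorter than the entry also yields the single word "optic", which the example list above omits, so the exact enumeration rule may differ slightly:

```python
def decompose(entry):
    """Enumerate every contiguous word string of an entry that is shorter
    than the entry itself (j - i < n), labeled prefix/suffix/intermediate."""
    words = entry.split()
    n = len(words)
    strings = []
    for i in range(n):
        for j in range(i, n):
            if j - i == n - 1:
                continue  # skip the full entry itself
            s = " ".join(words[i:j + 1])
            if i == 0:
                label = "prefix"        # begins with the first word w0
            elif j == n - 1:
                label = "suffix"        # ends with the last word wn
            else:
                label = "intermediate"  # contains neither endpoint
            strings.append((s, label))
    return strings

for s, label in decompose("obsolete meningococcal optic neuritis"):
    print(f"{label:12s} {s}")
```

Running this reproduces the prefix strings s0 to s2, the intermediate strings s3 and s4, and the suffix strings s5 to s7 listed above.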
By performing the same processing for each entry in the dictionary and merging the resulting string sets with the corresponding strings of the entries in the original dictionary, a new dictionary D = [t0, t1, …, tn] can be obtained, where ti = [ent, role]; ent = [w0, w1, …, wn] is the word string corresponding to an entry, and role = [n0, n1, n2, n3] is a word-formation feature array for the string, where n0 indicates whether ti is an entry in the original dictionary (1 for yes, 0 for no); n1 is the number of times ti appears as a prefix string; n2 is the number of times ti appears as a suffix string; and n3 is the number of times ti appears as an intermediate string.
Thus, from the above example, one can obtain:
t0=[ent=[“obsolete meningococcal optic neuritis”],role=[1,0,0,0]]
t1=[ent=[“obsolete”],role=[0,1,0,0]]
t2=[ent=[“optic neuritis”],role=[0,0,1,0]]
t3=[ent=[“meningococcal optic”],role=[0,0,0,1]]
……
By decomposing each entry in the dictionary in this way, the entry granularity is reduced, thereby improving coverage during recognition.
It should be understood that the manner of decomposing the dictionary described above is merely an example, and the present invention is not limited thereto. For example, the decomposition may be performed in units of two, three, or more words, as desired.
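The construction of the new dictionary D with its word-formation feature arrays can be sketched as follows; the function name and the dictionary-of-lists representation are assumptions made for illustration:

```python
from collections import defaultdict

def build_new_dictionary(entries):
    """Merge original entries with their decomposed word strings into a new
    dictionary mapping string -> role = [n0, n1, n2, n3], where n0 flags an
    original entry and n1/n2/n3 count prefix/suffix/intermediate occurrences."""
    role = defaultdict(lambda: [0, 0, 0, 0])
    for entry in entries:
        role[entry][0] = 1                  # n0: this is an original dictionary entry
        words = entry.split()
        n = len(words)
        for i in range(n):
            for j in range(i, n):
                if j - i == n - 1:
                    continue                # skip the full entry itself
                s = " ".join(words[i:j + 1])
                if i == 0:
                    role[s][1] += 1         # n1: prefix occurrence
                elif j == n - 1:
                    role[s][2] += 1         # n2: suffix occurrence
                else:
                    role[s][3] += 1         # n3: intermediate occurrence
    return dict(role)

D = build_new_dictionary(["obsolete meningococcal optic neuritis"])
print(D["obsolete meningococcal optic neuritis"])  # [1, 0, 0, 0]
print(D["obsolete"])                               # [0, 1, 0, 0]
print(D["optic neuritis"])                         # [0, 0, 1, 0]
print(D["meningococcal optic"])                    # [0, 0, 0, 1]
```

The four printed arrays match t0 to t3 of the example above.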
According to a preferred embodiment, the method 200 may further comprise a step 201' of applying a weight to the word-formation feature array of the word strings of each text in the new text library.
Specifically, in step 201', for each entry ti in the new dictionary, it is desirable to know whether it is a commonly used entry in the field. That is, when the entry appears in the text to be recognized, it can be determined with high probability that the position where it appears contains domain vocabulary. However, not all entries in the dictionary have domain characteristics; many entries also appear in the general domain, such as some commonly used words like "disconnect", "drug" or "constant". These words appear frequently in both disease-name and non-disease-name texts and therefore do not discriminate clearly.
To improve matching accuracy, the word-formation feature array role = [n0, n1, n2, n3] of each entry in the new dictionary may be given a weight, namely a factor fd representing domain importance, calculated as follows:
(Equation (1): reproduced only as an image in the original publication; it defines fd in terms of the quantities c1, c2, N1, N2 and norm described below.)
where c1 denotes the number of occurrences of entry ti in the new dictionary; c2 denotes the number of occurrences of ti in non-domain text, where the non-domain text may for example be a news-domain corpus; N1 denotes the sum of the occurrence counts, within the dictionary, of all entries contained in the new dictionary; N2 denotes the sum of the occurrence counts of all entries contained in the new dictionary in the non-domain corpus; and norm is a smoothing factor, which may for example be taken as norm = 1.
After calculating the domain importance factor fd for each entry ti in the new dictionary according to equation (1), the word-formation feature array of each entry may be replaced with: role = [n0*fd(ti), n1*fd(ti), n2*fd(ti), n3*fd(ti)].
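Since equation (1) appears only as an image in this publication, its exact form is not recoverable from the text. The sketch below therefore uses a hypothetical relative-frequency formula that is consistent with the quantities c1, c2, N1, N2 and norm described above; it is an assumption for illustration, not the patent's equation:

```python
def domain_importance(c1, c2, N1, N2, norm=1.0):
    """HYPOTHETICAL domain-importance factor fd: the entry's relative
    frequency in the domain dictionary (c1/N1) against its relative
    frequency in a non-domain corpus (c2/N2), smoothed by norm.
    The patent's actual equation (1) is reproduced only as an image."""
    p_dom = c1 / N1
    p_gen = c2 / N2
    return p_dom / (p_dom + p_gen + norm)

def weight_role(role, fd):
    """Scale the word-formation feature array [n0, n1, n2, n3] by fd."""
    return [n * fd for n in role]

fd = domain_importance(c1=5, c2=2, N1=100, N2=10000, norm=0.01)
print(weight_role([0, 1, 0, 1], fd))
```

Entries frequent in the domain dictionary but rare in the general corpus receive a factor close to 1; common general-domain words are damped toward 0.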
Next, in step 202, for each word in the text to be recognized, the word string in the new text library that begins with that word and has the longest match with the text to be recognized is identified.
Specifically, in the present embodiment, after the new dictionary has been merged, recognition of a given input sentence s = [w0, w1, …, wn] can be performed as follows:
- starting with wi, find the longest string [wi, …, wj] (j > i) that matches an entry in the new dictionary;
- if there is a match, record the position [i, j, ti], where ti is the matched dictionary entry;
- for each element of the input sentence s, from w0 to wn, perform the above two steps in turn and record the positions of all matched dictionary entries.
At this point, a set of matching positions [[i0, j0, ti], [i1, j1, tj], …] is obtained.
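The longest-match scan can be sketched as follows. Names are illustrative; note that although the text states j > i, the worked example later in the description records single-word matches, so j >= i is assumed here:

```python
def full_segmentation(words, D):
    """Step 202 sketch: for each start index i, record the longest word
    string starting at words[i] that matches an entry of the new
    dictionary D, as a position triple (i, j, matched_string)."""
    matches = []
    for i in range(len(words)):
        best = None
        for j in range(i, len(words)):
            s = " ".join(words[i:j + 1])
            if s in D:
                best = (i, j, s)  # longer matches found later replace shorter ones
        if best is not None:
            matches.append(best)
    return matches

# Toy dictionary containing only single words, mirroring the worked example
D = {"Advanced", "diffuse", "histiocytic", "lymphoma"}
sentence = "Advanced diffuse histiocytic lymphoma a potentially curable disease".split()
print(full_segmentation(sentence, D))
# → [(0, 0, 'Advanced'), (1, 1, 'diffuse'), (2, 2, 'histiocytic'), (3, 3, 'lymphoma')]
```

Words not present in the dictionary ("a", "potentially", and so on) simply produce no match at their position.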
The above full segmentation step 202 yields the basic segmentation result. In step 203, the basic segmentation results may be expanded and merged to obtain new, larger-grained segmentation results, so that entries not included in the original dictionary can be recognized.
In step 203, according to the position information of the identified matching word strings in the text to be identified, the adjacent or partially overlapped word strings in the set of matching word strings are iteratively expanded and merged to obtain the final identification result.
In this embodiment, step 203 may include iteratively expanding and merging each matched string to the right. FIG. 3 illustrates a flow chart for iteratively expanding and merging to the right for each matching string. As shown in FIG. 3, iteratively expanding and merging to the right for each matching string includes:
Step 2033: using the position information of the first word and the last word of a particular string in the text to be recognized, determining whether the particular string in the set of matched strings is adjacent to or partially overlaps other strings in the set on its right side;
Step 2034: merging the adjacent or partially overlapping strings on the right into a new string if it is determined that the particular string corresponds to a text in the original text corpus, or the number of times the particular string appears as a suffix string in the new text corpus is greater than a first threshold, or the number of times the particular string appears as an intermediate string in the new text corpus is greater than a second threshold; and
step 2035: repeating the above steps until no new string is generated.
Specifically, the above steps 2033, 2034 and 2035 can be realized, for example, by the following loop process:
For a segmentation result [i, j, ti], expand to the right:
if there exists [i', j', tj] satisfying:
1. i < i' <= j < j' (step 2033), and
2. n0 > 0 with n0 in role(tj), or n2 > first threshold TH1 with n2 in role(tj), or n3 > second threshold TH2 with n3 in role(tj),
then a new segmentation result [i, j', ti'] is generated, where
ti' = [ent = [wi, …, wj'], role = [min(role_ti, role_tj)]] (step 2034).
The above steps (steps 2033 and 2034) are repeated until no new segmentation result is generated (step 2035).
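The rightward loop can be sketched as below. The function names, the adjacency interpretation (i' = j + 1 counts as adjacent, which the worked example later in the text requires), and the role bookkeeping for newly merged strings are assumptions made for illustration:

```python
def expand_right(matches, words, role, TH1=0, TH2=0):
    """Sketch of the rightward expand-and-merge loop (steps 2033-2035).
    `matches` holds (i, j, string) spans over `words`; `role` maps a
    string to its word-formation array [n0, n1, n2, n3]."""
    results = set(matches)
    changed = True
    while changed:
        changed = False
        for (i, j, s) in sorted(results):
            for (i2, j2, s2) in sorted(results):
                if i < i2 <= j + 1 and j < j2:  # adjacent or overlapping on the right
                    r1 = role.get(s, [0, 0, 0, 0])
                    r2 = role.get(s2, [0, 0, 0, 0])
                    # merge if the right-hand string is an original entry (n0),
                    # a frequent suffix (n2), or a frequent intermediate (n3)
                    if r2[0] > 0 or r2[2] > TH1 or r2[3] > TH2:
                        merged = (i, j2, " ".join(words[i:j2 + 1]))
                        if merged not in results:
                            role[merged[2]] = [min(a, b) for a, b in zip(r1, r2)]
                            results.add(merged)
                            changed = True
    return results

words = "Advanced diffuse histiocytic lymphoma".split()
role = {"Advanced": [0, 1, 0, 0], "diffuse": [0, 0, 0, 1],
        "histiocytic": [0, 0, 0, 1], "lymphoma": [0, 0, 1, 0]}
spans = [(0, 0, "Advanced"), (1, 1, "diffuse"),
         (2, 2, "histiocytic"), (3, 3, "lymphoma")]
out = expand_right(spans, words, role, TH1=0, TH2=0)
print((0, 3, "Advanced diffuse histiocytic lymphoma") in out)  # → True
```

Iterating until the result set is stable reproduces the fixed-point behavior of step 2035.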
According to a preferred embodiment, step 203 may further comprise iteratively expanding and merging each matched string to the left before iteratively expanding and merging each matched string to the right. FIG. 4 illustrates a flow chart for iteratively expanding and merging to the left for each matching string. As shown in FIG. 4, iteratively expanding and merging each matched string to the left includes:
Step 2031: using the position information, determining whether a particular string is adjacent to or partially overlaps other strings in the set of matched strings on its left side; and
Step 2032: if it is determined that there are adjacent or partially overlapping strings on the left, and the particular string corresponds to a text in the original text corpus, or the number of times the particular string appears as a prefix string in the new text corpus is greater than a third threshold, or the number of times the particular string appears as an intermediate string in the new text corpus is greater than the second threshold, merging the adjacent or partially overlapping strings on the left into a new string.
Specifically, steps 2031 to 2035 can be implemented by the following loop process:
For a segmentation result [i, j, ti]:
Expand to the left:
if there exists [i', j', tj] satisfying:
1. i' < i <= j' < j (step 2031), and
2. n0 > 0 with n0 in role(tj), or n1 > third threshold TH3 with n1 in role(tj), or n3 > second threshold TH2 with n3 in role(tj),
then a new segmentation result [i', j, ti'] is generated, where
ti' = [ent = [wi', …, wj], role = [min(role_ti, role_tj)]] (step 2032).
Expand to the right:
if there exists [i', j', tj] satisfying:
1. i < i' <= j < j' (step 2033), and
2. n0 > 0 with n0 in role(tj), or n2 > first threshold TH1 with n2 in role(tj), or n3 > second threshold TH2 with n3 in role(tj),
then a new segmentation result [i, j', ti'] is generated, where
ti' = [ent = [wi, …, wj'], role = [min(role_ti, role_tj)]] (step 2034).
The above steps (steps 2031 to 2034) are repeated until no new segmentation result is generated (step 2035).
Note that the first threshold TH1, the second threshold TH2, and the third threshold TH3 may be set as needed, and may be set to the same value or different values.
It should also be noted that, for the sake of brevity, no weights (i.e., the above-mentioned domain importance factor fd) were applied to the word-formation feature arrays in steps 2031 to 2035 above. It should be appreciated that steps 2031 to 2035 may be performed with or without applying weights to the word-formation feature arrays.
For example, for the input text "Advanced diffuse histiocytic lymphoma, a potentially curable disease", the result after the full segmentation step 202 is:
[0,0,“Advanced”]
[1,1,“diffuse”]
[2,2,“histiocytic”]
[3,3,“lymphoma”]
the above results are then expanded and merged 203 (step 2031-2035):
the first cycle is to obtain
[0,1,“Advanced diffuse”]
[2,3,“histiocytic lymphoma”]
After the second cycle, one obtains:
[0,3,“Advanced diffuse histiocytic lymphoma”]
No new segmentation result is generated thereafter, and the final recognition result "Advanced diffuse histiocytic lymphoma" is obtained.
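The two merge cycles above can be reproduced with a deliberately simplified pass that merges each span with its right neighbor when they are adjacent or overlapping; the role-array conditions of steps 2031 to 2035 are omitted here for brevity, so this is a sketch of the control flow only:

```python
def merge_pass(spans):
    """One simplified expand-and-merge cycle over (i, j) spans:
    each span absorbs its right neighbor when adjacent or overlapping."""
    spans = sorted(spans)
    out, k = [], 0
    while k < len(spans):
        i, j = spans[k]
        if k + 1 < len(spans) and spans[k + 1][0] <= j + 1:
            out.append((i, max(j, spans[k + 1][1])))  # merge with right neighbor
            k += 2
        else:
            out.append((i, j))
            k += 1
    return out

words = "Advanced diffuse histiocytic lymphoma".split()
spans = [(0, 0), (1, 1), (2, 2), (3, 3)]
spans = merge_pass(spans)  # first cycle
print([" ".join(words[i:j + 1]) for i, j in spans])
# → ['Advanced diffuse', 'histiocytic lymphoma']
spans = merge_pass(spans)  # second cycle
print([" ".join(words[i:j + 1]) for i, j in spans])
# → ['Advanced diffuse histiocytic lymphoma']
```

A third pass leaves the single span unchanged, matching the fixed point reached in the example.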
Through the method 200 for recognizing text described above in conjunction with FIGS. 2 to 4, target entities in a corpus can be recognized to the greatest extent possible based on a domain dictionary, without labeling the corpus.
The methods discussed above may be implemented entirely by computer-executable programs, or may be implemented partially or entirely using hardware and/or firmware. When implemented in hardware and/or firmware, or when a computer-executable program is loaded into a hardware device capable of running the program, the apparatus for recognizing text described hereinafter is implemented. In the following, an overview of these apparatuses is given without repeating details already discussed above; it should be noted that, although these apparatuses may perform the methods described in the foregoing, those methods are not necessarily performed by, or limited to, the components of the described apparatuses.
Fig. 5 shows an apparatus 500 for recognizing text according to an embodiment comprising a disassembling means 501, a recognizing means 502 and an expanding and merging means 503. The disassembling device 501 is configured to disassemble each text in the original text library into a string set, so as to merge the string set with the corresponding string of each text in the original text library into a new text library. The identifying means 502 is configured to identify, starting from each word in the text to be identified, the word string in the new text library starting from the word and having the longest match with the text to be identified. The expanding and merging device 503 is configured to iteratively expand and merge adjacent or partially overlapping strings in the set of matched strings according to the position information of the identified matched strings in the text to be identified, so as to obtain a final identification result.
According to a preferred embodiment, the apparatus 500 for recognizing text further comprises weighting means 501' for applying a weight to the array of formation characteristics of the word strings of each text in the new text corpus.
The apparatus 500 for recognizing text shown in fig. 5 corresponds to the method 200 shown in fig. 2 to 4. Accordingly, the relevant details of each device in the apparatus for recognizing text 500 have been given in detail in the description of the method for recognizing text 200 of fig. 2 to 4, and will not be described herein again.
Each constituent module and unit in the above-described apparatus may be configured by software, firmware, hardware, or a combination thereof. The specific means or manner in which the configuration can be used is well known to those skilled in the art and will not be described further herein. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network to a computer (for example, a general-purpose computer 600 shown in fig. 6) having a dedicated hardware configuration, and the computer can execute various functions and the like when various programs are installed.
FIG. 6 is a block diagram of an exemplary architecture of a general purpose personal computer in which methods and/or apparatus according to embodiments of the invention may be implemented. As shown in fig. 6, a Central Processing Unit (CPU)601 performs various processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 to a Random Access Memory (RAM) 603. In the RAM 603, data necessary when the CPU 601 executes various processes and the like is also stored as necessary. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output interface 605 is also connected to bus 604.
The following components are connected to the input/output interface 605: an input section 606 (including a keyboard, a mouse, and the like), an output section 607 (including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like), a storage section 608 (including a hard disk and the like), a communication section 609 (including a network interface card such as a LAN card, a modem, and the like). The communication section 609 performs communication processing via a network such as the internet. The driver 610 may also be connected to the input/output interface 605 as desired. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that the computer program read out therefrom is installed in the storage section 608 as necessary.
In the case where the series of processes described above is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 611.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 611 shown in fig. 6 in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 611 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, in which programs are stored and which are distributed to users together with the apparatus including them.
The invention also provides a corresponding computer program code and a computer program product with a machine readable instruction code stored. The instruction codes are read by a machine and can execute the method according to the embodiment of the invention when being executed.
Accordingly, storage media configured to carry the above-described program product having machine-readable instruction code stored thereon are also included in the present disclosure. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
Through the above description, the embodiments of the present disclosure provide the following technical solutions, but are not limited thereto.
Supplementary note 1. a method of text recognition, comprising:
decomposing each text in an original text library into a set of word strings and merging these sets with the corresponding word strings of each text in the original text library to form a new text library;
for each word in the text to be recognized, identifying the word string in the new text library that begins with that word and has the longest match with the text to be recognized; and
iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings, according to the position information of the identified matched word strings in the text to be recognized, to obtain a final recognition result.
Note 2. The method of note 1, wherein the set of word strings includes prefix strings, suffix strings and intermediate strings, wherein a prefix string refers to a string that includes the first word of the text, a suffix string refers to a string that includes the last word of the text, and an intermediate string refers to a string that includes neither the first nor the last word of the text.
Supplementary notes 3. the method according to supplementary notes 2, wherein each text in the new text corpus comprises a word string corresponding to the text and a word formation feature array corresponding to the word string, the word formation feature array being constructed based on: whether the text is a text in the original text corpus, and the number of times the word string of the text appears as a prefix word string, as a suffix word string, and as an intermediate word string in the new text corpus.
Appendix 4. the method of appendix 3, wherein iteratively expanding and merging adjacent or partially overlapping strings within the set of matched strings includes iteratively expanding and merging right for each matched string, iteratively expanding and merging right for each matched string including:
determining whether a particular word string in the set of matched word strings is adjacent to or partially overlapped on the right side with other word strings in the set of matched word strings by using position information of a first word and a last word of the particular word string in the text to be recognized; and
if it is determined that the adjacent or partially overlapping strings on the right hand side correspond to text in the original text corpus and that the particular string corresponds to text in the original text corpus, or that the number of times the particular string appears as a suffix string in the new text corpus is greater than a first threshold, or that the number of times the particular string appears as an intermediate string in the new text corpus is greater than a second threshold, merging the adjacent or partially overlapping strings on the right hand side into a new string,
repeating the above steps until no new word string is generated.
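The rightward expand-and-merge loop of supplementary note 4 can be sketched as follows. This is one reading only: matched strings are represented as (word-tuple, start, end) spans, `feats` uses the assumed `[is_original, prefix_count, suffix_count, intermediate_count]` layout, and the merge condition is read as "the particular string is an original entry, or its suffix count exceeds the first threshold, or its intermediate count exceeds the second threshold".

```python
def expand_and_merge_right(matches, feats, t1, t2):
    """Iteratively merge each eligible matched string with a string that is
    adjacent to it, or partially overlaps it, on its right side.
    matches: list of (word_tuple, start, end) spans, end exclusive."""
    changed = True
    while changed:                           # repeat until no new string appears
        changed = False
        for s, a, b in list(matches):
            f = feats.get(s, [False, 0, 0, 0])
            if not (f[0] or f[2] > t1 or f[3] > t2):
                continue                     # particular string fails every test
            for s2, a2, b2 in list(matches):
                # adjacent (a2 == b) or partially overlapping (a2 < b) on the right
                if a < a2 <= b < b2:
                    merged = s + s2[b - a2:]     # drop the overlapped words
                    if (merged, a, b2) not in matches:
                        matches.append((merged, a, b2))
                        changed = True
    return matches
```

Each merge produces a longer span, so on a finite text the loop terminates once no merge adds a new (string, span) pair.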
Supplementary note 5. The method according to supplementary note 4, further comprising expanding and merging the particular word string to the left, the leftward expanding and merging comprising:
determining, using the position information, whether the particular word string is adjacent to or partially overlaps other word strings in the set of matched word strings on the left side; and
if it is determined that there is an adjacent or partially overlapping word string on the left side, and that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a prefix string in the new text corpus is greater than a third threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold, merging the adjacent or partially overlapping word string on the left side into a new word string.
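The leftward step of supplementary note 5 mirrors the rightward one, with the prefix count and the third threshold taking the place of the suffix count and the first threshold. The same representational assumptions apply: spans as (word-tuple, start, end), the feature-array layout `[is_original, prefix_count, suffix_count, intermediate_count]`, and a hypothetical function name.

```python
def expand_and_merge_left(matches, feats, t2, t3):
    """Merge each eligible matched string with a string that is adjacent to
    it, or partially overlaps it, on its LEFT side."""
    changed = True
    while changed:
        changed = False
        for s, a, b in list(matches):
            f = feats.get(s, [False, 0, 0, 0])
            if not (f[0] or f[1] > t3 or f[3] > t2):
                continue                     # fails original/prefix/intermediate tests
            for s2, a2, b2 in list(matches):
                # s2 is adjacent (b2 == a) or partially overlapping (b2 > a) on the left
                if a2 < a <= b2 < b:
                    merged = s2 + s[b2 - a:]     # drop the overlapped words of s
                    if (merged, a2, b) not in matches:
                        matches.append((merged, a2, b))
                        changed = True
    return matches
```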
Supplementary note 6. The method according to supplementary note 5, wherein the first threshold, the second threshold, and the third threshold are the same.
Supplementary note 7. The method according to supplementary note 5, wherein the first threshold, the second threshold, and the third threshold are different.
Supplementary note 8. The method according to any one of supplementary notes 3 to 7, further comprising applying a weight to the word formation feature array of the word string of each text in the new text corpus.
Supplementary note 9. The method according to supplementary note 8, wherein the weight depends on the number of times the corresponding word string appears in the new text corpus, the number of times the corresponding word string appears in other text, the sum of the numbers of times all texts contained in the new text corpus appear in the new text corpus, and the sum of the numbers of times all texts in the new text corpus appear in the other text.
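Supplementary note 9 lists the four quantities the weight depends on but does not fix a formula. One plausible instantiation, offered purely as an assumption, is a log ratio of the string's relative frequency in the new (domain) corpus to its relative frequency in other (non-domain) text:

```python
import math

def weight(n_in, n_out, total_in, total_out, eps=1e-9):
    """Hypothetical weight from the four counts named in supplementary note 9:
    n_in / n_out   -- occurrences of this word string in the new corpus / other text
    total_in/out   -- summed occurrences of all corpus texts in each collection.
    eps avoids division by zero for strings unseen in the other text."""
    p_in = n_in / max(total_in, 1)           # relative frequency in the new corpus
    p_out = (n_out + eps) / max(total_out, 1)  # smoothed frequency in other text
    return math.log(p_in / p_out)            # > 0: more domain-specific
```

A string common in the domain dictionary but rare elsewhere gets a large positive weight; a string equally common in both gets a weight near zero.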
Supplementary note 10. The method according to supplementary note 9, wherein the original text corpus is a domain-specific dictionary, the texts in the original text corpus are entries in the dictionary, the word strings comprise one or more words or terms constituting the entries, and the other text is non-domain text.
Supplementary note 11. An apparatus for recognizing text, comprising:
decomposition means configured to decompose each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying means configured to identify, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
expanding and merging means configured to iteratively expand and merge adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
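The longest-match identification performed by the identifying means can be sketched with a naive scan. This is a minimal sketch under assumptions not in the patent: the corpus is a set of word-string tuples, the text is pre-tokenized, and `longest_matches` is a hypothetical name.

```python
def longest_matches(tokens, corpus):
    """For each position in the text to be recognized, return the longest
    word string in `corpus` that starts at that position, as
    (word_tuple, start, end) with end exclusive; positions with no match
    are skipped."""
    out = []
    for i in range(len(tokens)):
        best = None
        for j in range(i + 1, len(tokens) + 1):
            if tuple(tokens[i:j]) in corpus:
                best = (tuple(tokens[i:j]), i, j)  # keep the longest so far
        if best:
            out.append(best)
    return out
```

The recorded (start, end) spans are exactly the position information the expanding and merging means later uses to test adjacency and partial overlap.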
Supplementary note 12. The apparatus according to supplementary note 11, wherein the set of word strings includes prefix strings, which contain the first word of the text, suffix strings, which contain the last word of the text, and intermediate strings, which contain neither the first word nor the last word of the text.
Supplementary note 13. The apparatus according to supplementary note 12, wherein each text in the new text corpus comprises a word string corresponding to the text and a word formation feature array corresponding to the word string, the word formation feature array being constructed based on: whether the text is a text in the original text corpus, and the numbers of times the word string of the text appears as a prefix string, as a suffix string, and as an intermediate string in the new text corpus.
Supplementary note 14. The apparatus according to supplementary note 13, wherein the expanding and merging means comprises right-side expanding and merging means, the right-side expanding and merging means comprising:
right-side determining means configured to determine, using position information of a first word and a last word of a particular word string in the text to be recognized, whether the particular word string in the set of matched word strings is adjacent to or partially overlaps other word strings in the set on the right side; and
right-side merging means configured to merge the adjacent or partially overlapping word string on the right side into a new word string if it is determined that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a suffix string in the new text corpus is greater than a first threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold.
Supplementary note 15. The apparatus according to supplementary note 14, wherein the expanding and merging means further comprises left-side expanding and merging means, the left-side expanding and merging means comprising:
left-side determining means configured to determine, using the position information, whether the particular word string is adjacent to or partially overlaps other word strings in the set of matched word strings on the left side; and
left-side merging means configured to merge the adjacent or partially overlapping word string on the left side into a new word string if it is determined that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a prefix string in the new text corpus is greater than a third threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold.
Supplementary note 16. The apparatus according to supplementary note 15, wherein the first threshold, the second threshold, and the third threshold are the same or different.
Supplementary note 17. The apparatus according to any one of supplementary notes 13 to 16, further comprising weighting means configured to apply a weight to the word formation feature array of the word string of each text in the new text corpus.
Supplementary note 18. The apparatus according to supplementary note 17, wherein the weight depends on the number of times the corresponding word string appears in the new text corpus, the number of times the corresponding word string appears in other text, the sum of the numbers of times all texts contained in the new text corpus appear in the new text corpus, and the sum of the numbers of times all texts in the new text corpus appear in the other text.
Supplementary note 19. The apparatus according to supplementary note 18, wherein the original text corpus is a domain-specific dictionary, the texts in the original text corpus are entries in the dictionary, the word strings comprise one or more words or terms constituting the entries, and the other text is non-domain text.
Supplementary note 20. A computer-readable storage medium storing a program executable by a processor to perform the following operations:
decomposing each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
Finally, it should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Furthermore, without further limitation, an element introduced by the phrase "comprising a(n) …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, it should be understood that the embodiments described above are intended only to illustrate the present invention and do not limit it. It will be apparent to those skilled in the art that various modifications and variations can be made to the above-described embodiments without departing from the spirit and scope of the invention. Accordingly, the scope of the invention is defined only by the appended claims and their equivalents.

Claims (10)

1. A method for recognizing text, comprising:
decomposing each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
2. The method according to claim 1, wherein the set of word strings includes prefix strings, which contain the first word of the text, suffix strings, which contain the last word of the text, and intermediate strings, which contain neither the first word nor the last word of the text.
3. The method of claim 2, wherein each text in the new text corpus comprises a word string corresponding to the text and a word formation feature array corresponding to the word string, the word formation feature array being constructed based on: whether the text is a text in the original text corpus, and the number of times the word string of the text appears as a prefix word string, as a suffix word string, and as an intermediate word string in the new text corpus.
4. The method according to claim 3, wherein iteratively expanding and merging adjacent or partially overlapping word strings within the set of matched word strings comprises iteratively expanding and merging each matched word string to the right, the iterative rightward expanding and merging comprising:
determining, using position information of a first word and a last word of a particular word string in the text to be recognized, whether the particular word string in the set of matched word strings is adjacent to or partially overlaps other word strings in the set on the right side; and
if it is determined that there is an adjacent or partially overlapping word string on the right side, and that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a suffix string in the new text corpus is greater than a first threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold, merging the adjacent or partially overlapping word string on the right side into a new word string,
repeating the above steps until no new word string is generated.
5. The method according to claim 4, further comprising expanding and merging the particular word string to the left, the leftward expanding and merging comprising:
determining, using the position information, whether the particular word string is adjacent to or partially overlaps other word strings in the set of matched word strings on the left side; and
if it is determined that there is an adjacent or partially overlapping word string on the left side, and that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a prefix string in the new text corpus is greater than a third threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold, merging the adjacent or partially overlapping word string on the left side into a new word string.
6. The method of claim 5, wherein the first threshold, the second threshold, and the third threshold are the same or different.
7. The method according to any one of claims 3 to 6, further comprising applying a weight to the word formation feature array of the word string of each text in the new text corpus, the weight depending on the number of times the corresponding word string appears in the new text corpus, the number of times the corresponding word string appears in other text, the sum of the numbers of times all texts contained in the new text corpus appear in the new text corpus, and the sum of the numbers of times all texts in the new text corpus appear in the other text.
8. The method according to claim 7, wherein the original text corpus is a domain-specific dictionary, the texts in the original text corpus are entries in the dictionary, the word strings comprise one or more words or terms constituting the entries, and the other text is non-domain text.
9. An apparatus for recognizing text, comprising:
decomposition means configured to decompose each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying means configured to identify, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
expanding and merging means configured to iteratively expand and merge adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
10. A computer-readable storage medium storing a program executable by a processor to:
decomposing each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
CN202010256902.7A 2020-04-01 2020-04-01 Method, apparatus and storage medium for recognizing text Active CN113496116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010256902.7A CN113496116B (en) 2020-04-01 2020-04-01 Method, apparatus and storage medium for recognizing text


Publications (2)

Publication Number Publication Date
CN113496116A true CN113496116A (en) 2021-10-12
CN113496116B CN113496116B (en) 2024-07-05

Family

ID=77994703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010256902.7A Active CN113496116B (en) 2020-04-01 2020-04-01 Method, apparatus and storage medium for recognizing text

Country Status (1)

Country Link
CN (1) CN113496116B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196900A (en) * 2007-12-27 2008-06-11 中国移动通信集团湖北有限公司 Information searching method based on metadata
US20090171953A1 (en) * 2007-12-26 2009-07-02 Cameron Craig Morris Techniques for recognizing multiple patterns within a string
CN110162782A (en) * 2019-04-17 2019-08-23 平安科技(深圳)有限公司 Entity extraction method, apparatus, equipment and storage medium based on Medical Dictionary


Also Published As

Publication number Publication date
CN113496116B (en) 2024-07-05

Similar Documents

Publication Publication Date Title
US11030407B2 (en) Computer system, method and program for performing multilingual named entity recognition model transfer
CN110427618B (en) Countermeasure sample generation method, medium, device and computing equipment
US8484238B2 (en) Automatically generating regular expressions for relaxed matching of text patterns
JP5599662B2 (en) System and method for converting kanji into native language pronunciation sequence using statistical methods
KR101136007B1 (en) System and method for anaylyzing document sentiment
Mohtaj et al. Parsivar: A language processing toolkit for Persian
US20200372088A1 Recommending web API's and associated endpoints
Yazdani et al. Automated misspelling detection and correction in Persian clinical text
US9984071B2 (en) Language ambiguity detection of text
US10546065B2 (en) Information extraction apparatus and method
JP6077727B1 (en) Computer system, method, and program for transferring multilingual named entity recognition model
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
Vanetik et al. An unsupervised constrained optimization approach to compressive summarization
Laur et al. Estnltk 1.6: Remastered estonian nlp pipeline
US20050273316A1 (en) Apparatus and method for translating Japanese into Chinese and computer program product
Pérez et al. Inferred joint multigram models for medical term normalization according to ICD
CN111859858A (en) Method and device for extracting relationship from text
Loftsson et al. Tagging a morphologically complex language using an averaged perceptron tagger: The case of Icelandic
List et al. Toward a sustainable handling of interlinear-glossed text in language documentation
KR101663038B1 Entity boundary detection apparatus in text by usage-learning on the entity's surface string candidates and method thereof
CN111368547A (en) Entity identification method, device, equipment and storage medium based on semantic analysis
JP4361299B2 (en) Evaluation expression extraction apparatus, program, and storage medium
CN113496116A (en) Method, apparatus, and storage medium for recognizing text
JP3441400B2 (en) Language conversion rule creation device and program recording medium
Brierley et al. Tools for Arabic Natural Language Processing: a case study in qalqalah prosody

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant