CN113496116A - Method, apparatus, and storage medium for recognizing text - Google Patents

Method, apparatus, and storage medium for recognizing text Download PDF

Info

Publication number
CN113496116A
CN113496116A
Authority
CN
China
Prior art keywords
text
string
word
strings
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010256902.7A
Other languages
Chinese (zh)
Other versions
CN113496116B (en)
Inventor
郑仲光
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN202010256902.7A priority Critical patent/CN113496116B/en
Publication of CN113496116A publication Critical patent/CN113496116A/en
Application granted granted Critical
Publication of CN113496116B publication Critical patent/CN113496116B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a method and apparatus for recognizing text, and a storage medium. The method comprises the following steps: decomposing each text in an original text library into a set of word strings and merging these sets with the corresponding word strings of each text in the original text library to form a new text library; for each word in the text to be recognized, identifying the word string in the new text library that begins with that word and has the longest match with the text to be recognized; and iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings, according to the position information of the identified matched word strings in the text to be recognized, to obtain a final recognition result.

Description

Method, apparatus, and storage medium for recognizing text
Technical Field
The present disclosure relates to Natural Language Processing (NLP), and in particular to Named Entity Recognition (NER).
Background
In the field of natural language processing, named entity recognition is a fundamental task whose purpose is to recognize words or phrases of particular classes in text, such as names of people, addresses, organizations, proper nouns, and so on. The results of NER are widely used in downstream tasks such as information retrieval and automatic translation.
In general, the models required by the NER task are supervised models, i.e., a labeled corpus is needed for training. In actual scenarios, however, annotated data cannot be obtained for all categories of entities.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
According to an aspect of the present invention, there is provided a method for recognizing text, including: decomposing each text in an original text library into a set of word strings and merging these sets with the corresponding word strings of each text in the original text library to form a new text library; for each word in the text to be recognized, identifying the word string in the new text library that begins with that word and has the longest match with the text to be recognized; and iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings, according to the position information of the identified matched word strings in the text to be recognized, to obtain a final recognition result.
According to another aspect of the present invention, there is provided an apparatus for recognizing text, including: decomposing means configured to decompose each text in an original text library into a set of word strings and merge these sets with the corresponding word strings of each text in the original text library to form a new text library; identifying means configured to identify, for each word in the text to be recognized, the word string in the new text library that begins with that word and has the longest match with the text to be recognized; and expanding and merging means configured to iteratively expand and merge adjacent or partially overlapping word strings in the set of matched word strings, according to the position information of the identified matched word strings in the text to be recognized, to obtain a final recognition result.
According to other aspects of the invention, corresponding computer program code, computer readable storage medium and computer program product are also provided.
With the above text recognition method and apparatus, text can be recognized without a labeled training set, thereby reducing labor and time costs and reducing complexity.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
To further clarify the above and other advantages and features of the present disclosure, a more particular description of embodiments of the present disclosure will be rendered by reference to the appended drawings. Which are incorporated in and form a part of this specification, along with the detailed description that follows. Elements having the same function and structure are denoted by the same reference numerals. It is appreciated that these drawings depict only typical examples of the disclosure and are therefore not to be considered limiting of its scope. In the drawings:
FIG. 1 shows an example of a list of disease names;
FIG. 2 is a flow diagram of a method 200 for recognizing text, according to one embodiment;
FIG. 3 illustrates a flow diagram for iteratively expanding and merging each matched string to the right, according to one embodiment;
FIG. 4 illustrates a flowchart for iteratively expanding and merging each matched string to the left, in accordance with a preferred embodiment;
FIG. 5 is a block diagram of an apparatus 500 for recognizing text, according to one embodiment; and
FIG. 6 is a block diagram of an exemplary architecture of a general purpose personal computer in which methods and/or apparatus according to embodiments of the invention may be implemented.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structures and/or processing steps closely related to the scheme according to the present disclosure are shown in the drawings, and other details not so relevant to the present disclosure are omitted.
The models required by the NER task must be trained on a labeled corpus. Without labeled data, a supervised model cannot be trained. Constructing a labeled corpus is labor- and time-intensive, costly, and difficult, whereas a domain dictionary is relatively easy to obtain.
The present invention provides a recognition method based on a domain dictionary, so that target entities in a corpus can be recognized to the greatest extent possible using the domain dictionary without labeling the corpus, thereby overcoming or mitigating the drawbacks of the prior art.
It should be noted that, although the following description uses the example of identifying disease names in the medical literature, the skilled person will understand that the solution of the invention can be applied to entity recognition in the literature of any other field.
It should also be noted that, although the following description uses the recognition of English text as an example, the person skilled in the art will understand that the solution of the invention can be applied to any other language.
For example, assume that the task is to identify disease names in the medical literature. Although existing medical dictionaries already contain many entries, their coverage is still low. That is, if the entries in the dictionary are used directly for matching in the medical literature, much entity information is lost. For example, assume that the goal is to recognize the disease name "Advanced diffuse histiocytic lymphoma" in the text "Advanced diffuse histiocytic lymphoma, a potentially curable disease", but there is no corresponding entry in the existing medical dictionary. In this case, if the text is recognized only by exact matches against dictionary entries, the disease name cannot be recognized.
However, each of the words "advanced", "diffuse", "histiocytic" and "lymphoma" in the text appears in existing dictionary entries, for example "advanced sleep phase syndrome" and "diffuse large B-cell lymphoma". This situation is relatively common. For example, FIG. 1 shows an example of a disease name list obtained from an existing medical dictionary. It can be observed that, although the entry-level coverage of the medical dictionary is low, the word-level coverage is relatively high. The scheme of the invention identifies entities based on this characteristic.
The method 200 for recognizing text according to an embodiment will be described in detail below in conjunction with fig. 2.
The method 200 starts at step 201. In step 201, each text in the original text library is decomposed into a set of word strings, which is merged with the corresponding word strings of each text in the original text library to form a new text library. Specifically, in the present embodiment, the original text library is, for example, a medical dictionary, each text is an entry in the medical dictionary, and a word is a word in the entry.
According to a preferred embodiment, the dictionary can be disassembled as follows.
For an entry t = [w0, w1, …, wn], where wi denotes a word in the entry, the entry can be decomposed into a set of word strings [s0, s1, …, sm], where sh = [wi, wi+1, …, wj] with j - i < n.
For sh: if it begins with w0, it is defined as a prefix string; if it ends with wn, it is defined as a suffix string; otherwise, it is defined as an intermediate string. For example, for the entry "obsolete meningococcal optic neuritis", the following string set can be obtained:
s0=obsolete
s1=obsolete meningococcal
s2=obsolete meningococcal optic
s3=meningococcal
s4=meningococcal optic
s5=meningococcal optic neuritis
s6=optic neuritis
s7=neuritis
where s0, s1 and s2 are prefix strings; s5, s6 and s7 are suffix strings; and the others are intermediate strings.
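A minimal sketch of this decomposition rule follows; the function and label names are illustrative, not from the patent. Note that a literal enumeration of all contiguous sub-strings shorter than the entry also yields the single word "optic", which the example list above omits, so the exact enumeration rule may differ slightly:

```python
def decompose(entry):
    """Enumerate every contiguous word string of an entry that is shorter
    than the entry itself (j - i < n), labeled prefix/suffix/intermediate."""
    words = entry.split()
    n = len(words)
    strings = []
    for i in range(n):
        for j in range(i, n):
            if j - i == n - 1:
                continue  # skip the full entry itself
            s = " ".join(words[i:j + 1])
            if i == 0:
                label = "prefix"        # begins with the first word w0
            elif j == n - 1:
                label = "suffix"        # ends with the last word wn
            else:
                label = "intermediate"  # contains neither endpoint
            strings.append((s, label))
    return strings

for s, label in decompose("obsolete meningococcal optic neuritis"):
    print(f"{label:12s} {s}")
```

Running this reproduces the prefix strings s0 to s2, the intermediate strings s3 and s4, and the suffix strings s5 to s7 listed above.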
By performing the same processing for each entry in the dictionary and merging the resulting string sets with the corresponding strings of the entries in the original dictionary, a new dictionary D = [t0, t1, …, tn] can be obtained, where ti = [ent, role]; ent = [w0, w1, …, wn] is the word string corresponding to an entry, and role = [n0, n1, n2, n3] is a word-formation feature array for the string, where n0 indicates whether ti is an entry in the original dictionary (1 for yes, 0 for no); n1 is the number of times ti appears as a prefix string; n2 is the number of times ti appears as a suffix string; and n3 is the number of times ti appears as an intermediate string.
Thus, from the above example, one can obtain:
t0=[ent=[“obsolete meningococcal optic neuritis”],role=[1,0,0,0]]
t1=[ent=[“obsolete”],role=[0,1,0,0]]
t2=[ent=[“optic neuritis”],role=[0,0,1,0]]
t3=[ent=[“meningococcal optic”],role=[0,0,0,1]]
……
By decomposing each entry in the dictionary in this way, the entry granularity is reduced, thereby improving coverage during recognition.
It should be understood that the manner of decomposing the dictionary described above is merely an example, and the present invention is not limited thereto. For example, the decomposition may be performed in units of two, three, or more words, as desired.
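The construction of the new dictionary D with its word-formation feature arrays can be sketched as follows; the function name and the dictionary-of-lists representation are assumptions made for illustration:

```python
from collections import defaultdict

def build_new_dictionary(entries):
    """Merge original entries with their decomposed word strings into a new
    dictionary mapping string -> role = [n0, n1, n2, n3], where n0 flags an
    original entry and n1/n2/n3 count prefix/suffix/intermediate occurrences."""
    role = defaultdict(lambda: [0, 0, 0, 0])
    for entry in entries:
        role[entry][0] = 1                  # n0: this is an original dictionary entry
        words = entry.split()
        n = len(words)
        for i in range(n):
            for j in range(i, n):
                if j - i == n - 1:
                    continue                # skip the full entry itself
                s = " ".join(words[i:j + 1])
                if i == 0:
                    role[s][1] += 1         # n1: prefix occurrence
                elif j == n - 1:
                    role[s][2] += 1         # n2: suffix occurrence
                else:
                    role[s][3] += 1         # n3: intermediate occurrence
    return dict(role)

D = build_new_dictionary(["obsolete meningococcal optic neuritis"])
print(D["obsolete meningococcal optic neuritis"])  # [1, 0, 0, 0]
print(D["obsolete"])                               # [0, 1, 0, 0]
print(D["optic neuritis"])                         # [0, 0, 1, 0]
print(D["meningococcal optic"])                    # [0, 0, 0, 1]
```

The four printed arrays match t0 to t3 of the example above.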
According to a preferred embodiment, the method 200 may further comprise a step 201' of applying a weight to the word-formation feature array of the word strings of each text in the new text library.
Specifically, in step 201', for each entry ti in the new dictionary, it is desirable to know whether it is a commonly used entry in the field. That is, when the entry appears in the text to be recognized, it can be determined with high probability that the position where it appears contains domain vocabulary. However, not all entries in the dictionary have domain characteristics; many entries also appear in the general domain, such as some commonly used words like "disconnect", "drug" or "constant". These words appear frequently in both disease-name and non-disease-name texts and therefore do not discriminate clearly.
To improve matching accuracy, the word-formation feature array role = [n0, n1, n2, n3] of each entry in the new dictionary may be given a weight, namely a factor fd representing domain importance, calculated as follows:
(Equation (1): reproduced only as an image in the original publication; it defines fd in terms of the quantities c1, c2, N1, N2 and norm described below.)
where c1 denotes the number of occurrences of entry ti in the new dictionary; c2 denotes the number of occurrences of ti in non-domain text, where the non-domain text may for example be a news-domain corpus; N1 denotes the sum of the occurrence counts, within the dictionary, of all entries contained in the new dictionary; N2 denotes the sum of the occurrence counts of all entries contained in the new dictionary in the non-domain corpus; and norm is a smoothing factor, which may for example be taken as norm = 1.
After calculating the domain importance factor fd for each entry ti in the new dictionary according to equation (1), the word-formation feature array of each entry may be replaced with: role = [n0*fd(ti), n1*fd(ti), n2*fd(ti), n3*fd(ti)].
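Since equation (1) appears only as an image in this publication, its exact form is not recoverable from the text. The sketch below therefore uses a hypothetical relative-frequency formula that is consistent with the quantities c1, c2, N1, N2 and norm described above; it is an assumption for illustration, not the patent's equation:

```python
def domain_importance(c1, c2, N1, N2, norm=1.0):
    """HYPOTHETICAL domain-importance factor fd: the entry's relative
    frequency in the domain dictionary (c1/N1) against its relative
    frequency in a non-domain corpus (c2/N2), smoothed by norm.
    The patent's actual equation (1) is reproduced only as an image."""
    p_dom = c1 / N1
    p_gen = c2 / N2
    return p_dom / (p_dom + p_gen + norm)

def weight_role(role, fd):
    """Scale the word-formation feature array [n0, n1, n2, n3] by fd."""
    return [n * fd for n in role]

fd = domain_importance(c1=5, c2=2, N1=100, N2=10000, norm=0.01)
print(weight_role([0, 1, 0, 1], fd))
```

Entries frequent in the domain dictionary but rare in the general corpus receive a factor close to 1; common general-domain words are damped toward 0.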
Next, in step 202, for each word in the text to be recognized, the word string in the new text library that begins with that word and has the longest match with the text to be recognized is identified.
Specifically, in the present embodiment, after the new dictionary has been merged, recognition of a given input sentence s = [w0, w1, …, wn] can be performed as follows:
- starting with wi, find the longest string [wi, …, wj] (j > i) that matches an entry in the new dictionary;
- if there is a match, record the position [i, j, ti], where ti is the matched dictionary entry;
- for each element of the input sentence s, from w0 to wn, perform the above two steps in turn and record the positions of all matched dictionary entries.
At this point, a set of matching positions [[i0, j0, ti], [i1, j1, tj], …] is obtained.
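The longest-match scan can be sketched as follows. Names are illustrative; note that although the text states j > i, the worked example later in the description records single-word matches, so j >= i is assumed here:

```python
def full_segmentation(words, D):
    """Step 202 sketch: for each start index i, record the longest word
    string starting at words[i] that matches an entry of the new
    dictionary D, as a position triple (i, j, matched_string)."""
    matches = []
    for i in range(len(words)):
        best = None
        for j in range(i, len(words)):
            s = " ".join(words[i:j + 1])
            if s in D:
                best = (i, j, s)  # longer matches found later replace shorter ones
        if best is not None:
            matches.append(best)
    return matches

# Toy dictionary containing only single words, mirroring the worked example
D = {"Advanced", "diffuse", "histiocytic", "lymphoma"}
sentence = "Advanced diffuse histiocytic lymphoma a potentially curable disease".split()
print(full_segmentation(sentence, D))
# → [(0, 0, 'Advanced'), (1, 1, 'diffuse'), (2, 2, 'histiocytic'), (3, 3, 'lymphoma')]
```

Words not present in the dictionary ("a", "potentially", and so on) simply produce no match at their position.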
The above full segmentation step 202 yields the basic segmentation result. In step 203, the basic segmentation results may be expanded and merged to obtain new, larger-grained segmentation results, so that entries not included in the original dictionary can be recognized.
In step 203, according to the position information of the identified matching word strings in the text to be identified, the adjacent or partially overlapped word strings in the set of matching word strings are iteratively expanded and merged to obtain the final identification result.
In this embodiment, step 203 may include iteratively expanding and merging each matched string to the right. FIG. 3 illustrates a flow chart for iteratively expanding and merging to the right for each matching string. As shown in FIG. 3, iteratively expanding and merging to the right for each matching string includes:
Step 2033: using the position information of the first word and the last word of a particular string in the text to be recognized, determining whether the particular string in the set of matched strings is adjacent to or partially overlaps other strings in the set on its right side;
Step 2034: merging the adjacent or partially overlapping strings on the right into a new string if it is determined that the particular string corresponds to a text in the original text corpus, or the number of times the particular string appears as a suffix string in the new text corpus is greater than a first threshold, or the number of times the particular string appears as an intermediate string in the new text corpus is greater than a second threshold; and
step 2035: repeating the above steps until no new string is generated.
Specifically, the above steps 2033, 2034 and 2035 can be realized, for example, by the following loop process:
For a segmentation result [i, j, ti], expand to the right:
if there exists [i', j', tj] satisfying:
1. i < i' <= j < j' (step 2033), and
2. n0 > 0 with n0 in role(tj), or n2 > first threshold TH1 with n2 in role(tj), or n3 > second threshold TH2 with n3 in role(tj),
then a new segmentation result [i, j', ti'] is generated, where
ti' = [ent = [wi, …, wj'], role = [min(role_ti, role_tj)]] (step 2034).
The above steps (steps 2033 and 2034) are repeated until no new segmentation result is generated (step 2035).
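The rightward loop can be sketched as below. The function names, the adjacency interpretation (i' = j + 1 counts as adjacent, which the worked example later in the text requires), and the role bookkeeping for newly merged strings are assumptions made for illustration:

```python
def expand_right(matches, words, role, TH1=0, TH2=0):
    """Sketch of the rightward expand-and-merge loop (steps 2033-2035).
    `matches` holds (i, j, string) spans over `words`; `role` maps a
    string to its word-formation array [n0, n1, n2, n3]."""
    results = set(matches)
    changed = True
    while changed:
        changed = False
        for (i, j, s) in sorted(results):
            for (i2, j2, s2) in sorted(results):
                if i < i2 <= j + 1 and j < j2:  # adjacent or overlapping on the right
                    r1 = role.get(s, [0, 0, 0, 0])
                    r2 = role.get(s2, [0, 0, 0, 0])
                    # merge if the right-hand string is an original entry (n0),
                    # a frequent suffix (n2), or a frequent intermediate (n3)
                    if r2[0] > 0 or r2[2] > TH1 or r2[3] > TH2:
                        merged = (i, j2, " ".join(words[i:j2 + 1]))
                        if merged not in results:
                            role[merged[2]] = [min(a, b) for a, b in zip(r1, r2)]
                            results.add(merged)
                            changed = True
    return results

words = "Advanced diffuse histiocytic lymphoma".split()
role = {"Advanced": [0, 1, 0, 0], "diffuse": [0, 0, 0, 1],
        "histiocytic": [0, 0, 0, 1], "lymphoma": [0, 0, 1, 0]}
spans = [(0, 0, "Advanced"), (1, 1, "diffuse"),
         (2, 2, "histiocytic"), (3, 3, "lymphoma")]
out = expand_right(spans, words, role, TH1=0, TH2=0)
print((0, 3, "Advanced diffuse histiocytic lymphoma") in out)  # → True
```

Iterating until the result set is stable reproduces the fixed-point behavior of step 2035.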
According to a preferred embodiment, step 203 may further comprise iteratively expanding and merging each matched string to the left before iteratively expanding and merging each matched string to the right. FIG. 4 illustrates a flow chart for iteratively expanding and merging to the left for each matching string. As shown in FIG. 4, iteratively expanding and merging each matched string to the left includes:
Step 2031: using the position information, determining whether a particular string is adjacent to or partially overlaps other strings in the set of matched strings on its left side; and
Step 2032: if it is determined that there are adjacent or partially overlapping strings on the left, and the particular string corresponds to a text in the original text corpus, or the number of times the particular string appears as a prefix string in the new text corpus is greater than a third threshold, or the number of times the particular string appears as an intermediate string in the new text corpus is greater than the second threshold, merging the adjacent or partially overlapping strings on the left into a new string.
Specifically, steps 2031 to 2035 can be implemented by the following loop process:
For a segmentation result [i, j, ti]:
Expand to the left:
if there exists [i', j', tj] satisfying:
1. i' < i <= j' < j (step 2031), and
2. n0 > 0 with n0 in role(tj), or n1 > third threshold TH3 with n1 in role(tj), or n3 > second threshold TH2 with n3 in role(tj),
then a new segmentation result [i', j, ti'] is generated, where
ti' = [ent = [wi', …, wj], role = [min(role_ti, role_tj)]] (step 2032).
Expand to the right:
if there exists [i', j', tj] satisfying:
1. i < i' <= j < j' (step 2033), and
2. n0 > 0 with n0 in role(tj), or n2 > first threshold TH1 with n2 in role(tj), or n3 > second threshold TH2 with n3 in role(tj),
then a new segmentation result [i, j', ti'] is generated, where
ti' = [ent = [wi, …, wj'], role = [min(role_ti, role_tj)]] (step 2034).
The above steps (steps 2031 to 2034) are repeated until no new segmentation result is generated (step 2035).
Note that the first threshold TH1, the second threshold TH2, and the third threshold TH3 may be set as needed, and may be set to the same value or different values.
It should also be noted that, for the sake of brevity, no weights (i.e., the above-mentioned domain importance factor fd) were applied to the word-formation feature arrays in steps 2031 to 2035 above. It should be appreciated that steps 2031 to 2035 may be performed with or without applying weights to the word-formation feature arrays.
For example, for the input text "Advanced diffuse histiocytic lymphoma, a potentially curable disease", the result after the full segmentation step 202 is:
[0,0,“Advanced”]
[1,1,“diffuse”]
[2,2,“histiocytic”]
[3,3,“lymphoma”]
the above results are then expanded and merged 203 (step 2031-2035):
the first cycle is to obtain
[0,1,“Advanced diffuse”]
[2,3,“histiocytic lymphoma”]
After the second cycle, one obtains:
[0,3,“Advanced diffuse histiocytic lymphoma”]
No new segmentation result is generated thereafter, and the final recognition result "Advanced diffuse histiocytic lymphoma" is obtained.
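The two merge cycles above can be reproduced with a deliberately simplified pass that merges each span with its right neighbor when they are adjacent or overlapping; the role-array conditions of steps 2031 to 2035 are omitted here for brevity, so this is a sketch of the control flow only:

```python
def merge_pass(spans):
    """One simplified expand-and-merge cycle over (i, j) spans:
    each span absorbs its right neighbor when adjacent or overlapping."""
    spans = sorted(spans)
    out, k = [], 0
    while k < len(spans):
        i, j = spans[k]
        if k + 1 < len(spans) and spans[k + 1][0] <= j + 1:
            out.append((i, max(j, spans[k + 1][1])))  # merge with right neighbor
            k += 2
        else:
            out.append((i, j))
            k += 1
    return out

words = "Advanced diffuse histiocytic lymphoma".split()
spans = [(0, 0), (1, 1), (2, 2), (3, 3)]
spans = merge_pass(spans)  # first cycle
print([" ".join(words[i:j + 1]) for i, j in spans])
# → ['Advanced diffuse', 'histiocytic lymphoma']
spans = merge_pass(spans)  # second cycle
print([" ".join(words[i:j + 1]) for i, j in spans])
# → ['Advanced diffuse histiocytic lymphoma']
```

A third pass leaves the single span unchanged, matching the fixed point reached in the example.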
Through the method 200 for recognizing text described above in conjunction with FIGS. 2 to 4, target entities in a corpus can be recognized to the greatest extent possible based on a domain dictionary, without labeling the corpus.
The methods discussed above may be implemented entirely by computer-executable programs, or may be implemented partially or entirely using hardware and/or firmware. When implemented in hardware and/or firmware, or when a computer-executable program is loaded into a hardware device capable of running the program, the apparatus for recognizing text described hereinafter is implemented. In the following, an overview of these apparatuses is given without repeating details already discussed above; it should be noted that, although these apparatuses may perform the methods described in the foregoing, those methods are not necessarily performed by, or limited to, the components of the described apparatuses.
Fig. 5 shows an apparatus 500 for recognizing text according to an embodiment comprising a disassembling means 501, a recognizing means 502 and an expanding and merging means 503. The disassembling device 501 is configured to disassemble each text in the original text library into a string set, so as to merge the string set with the corresponding string of each text in the original text library into a new text library. The identifying means 502 is configured to identify, starting from each word in the text to be identified, the word string in the new text library starting from the word and having the longest match with the text to be identified. The expanding and merging device 503 is configured to iteratively expand and merge adjacent or partially overlapping strings in the set of matched strings according to the position information of the identified matched strings in the text to be identified, so as to obtain a final identification result.
According to a preferred embodiment, the apparatus 500 for recognizing text further comprises weighting means 501' for applying a weight to the array of formation characteristics of the word strings of each text in the new text corpus.
The apparatus 500 for recognizing text shown in fig. 5 corresponds to the method 200 shown in fig. 2 to 4. Accordingly, the relevant details of each device in the apparatus for recognizing text 500 have been given in detail in the description of the method for recognizing text 200 of fig. 2 to 4, and will not be described herein again.
Each constituent module and unit in the above-described apparatus may be configured by software, firmware, hardware, or a combination thereof. The specific means or manner in which the configuration can be used is well known to those skilled in the art and will not be described further herein. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network to a computer (for example, a general-purpose computer 600 shown in fig. 6) having a dedicated hardware configuration, and the computer can execute various functions and the like when various programs are installed.
FIG. 6 is a block diagram of an exemplary architecture of a general purpose personal computer in which methods and/or apparatus according to embodiments of the invention may be implemented. As shown in fig. 6, a Central Processing Unit (CPU)601 performs various processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 to a Random Access Memory (RAM) 603. In the RAM 603, data necessary when the CPU 601 executes various processes and the like is also stored as necessary. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output interface 605 is also connected to bus 604.
The following components are connected to the input/output interface 605: an input section 606 (including a keyboard, a mouse, and the like), an output section 607 (including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like), a storage section 608 (including a hard disk and the like), a communication section 609 (including a network interface card such as a LAN card, a modem, and the like). The communication section 609 performs communication processing via a network such as the internet. The driver 610 may also be connected to the input/output interface 605 as desired. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that the computer program read out therefrom is installed in the storage section 608 as necessary.
In the case where the series of processes described above is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 611.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 611 shown in fig. 6 in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 611 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, in which programs are stored and which are distributed to users together with the apparatus including them.
The invention also provides a corresponding computer program code and a computer program product with a machine readable instruction code stored. The instruction codes are read by a machine and can execute the method according to the embodiment of the invention when being executed.
Accordingly, storage media configured to carry the above-described program product having machine-readable instruction code stored thereon are also included in the present disclosure. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
Through the above description, the embodiments of the present disclosure provide the following technical solutions, but are not limited thereto.
Supplementary note 1. a method of text recognition, comprising:
decomposing each text in an original text library into a set of word strings and merging these sets with the corresponding word strings of each text in the original text library to form a new text library;
for each word in the text to be recognized, identifying the word string in the new text library that begins with that word and has the longest match with the text to be recognized; and
iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings, according to the position information of the identified matched word strings in the text to be recognized, to obtain a final recognition result.
Note 2. The method of note 1, wherein the set of word strings includes prefix strings, suffix strings and intermediate strings, wherein a prefix string refers to a string that includes the first word of the text, a suffix string refers to a string that includes the last word of the text, and an intermediate string refers to a string that includes neither the first nor the last word of the text.
Supplementary notes 3. the method according to supplementary notes 2, wherein each text in the new text corpus comprises a word string corresponding to the text and a word formation feature array corresponding to the word string, the word formation feature array being constructed based on: whether the text is a text in the original text corpus, and the number of times the word string of the text appears as a prefix word string, as a suffix word string, and as an intermediate word string in the new text corpus.
Appendix 4. the method of appendix 3, wherein iteratively expanding and merging adjacent or partially overlapping strings within the set of matched strings includes iteratively expanding and merging right for each matched string, iteratively expanding and merging right for each matched string including:
determining whether a particular word string in the set of matched word strings is adjacent to or partially overlapped on the right side with other word strings in the set of matched word strings by using position information of a first word and a last word of the particular word string in the text to be recognized; and
if it is determined that the adjacent or partially overlapping strings on the right hand side correspond to text in the original text corpus and that the particular string corresponds to text in the original text corpus, or that the number of times the particular string appears as a suffix string in the new text corpus is greater than a first threshold, or that the number of times the particular string appears as an intermediate string in the new text corpus is greater than a second threshold, merging the adjacent or partially overlapping strings on the right hand side into a new string,
repeating the above steps until no new word string is generated.
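The rightward expand-and-merge loop of supplementary note 4 can be sketched as follows. This is one reading only: matched strings are represented as (word-tuple, start, end) spans, `feats` uses the assumed `[is_original, prefix_count, suffix_count, intermediate_count]` layout, and the merge condition is read as "the particular string is an original entry, or its suffix count exceeds the first threshold, or its intermediate count exceeds the second threshold".

```python
def expand_and_merge_right(matches, feats, t1, t2):
    """Iteratively merge each eligible matched string with a string that is
    adjacent to it, or partially overlaps it, on its right side.
    matches: list of (word_tuple, start, end) spans, end exclusive."""
    changed = True
    while changed:                           # repeat until no new string appears
        changed = False
        for s, a, b in list(matches):
            f = feats.get(s, [False, 0, 0, 0])
            if not (f[0] or f[2] > t1 or f[3] > t2):
                continue                     # particular string fails every test
            for s2, a2, b2 in list(matches):
                # adjacent (a2 == b) or partially overlapping (a2 < b) on the right
                if a < a2 <= b < b2:
                    merged = s + s2[b - a2:]     # drop the overlapped words
                    if (merged, a, b2) not in matches:
                        matches.append((merged, a, b2))
                        changed = True
    return matches
```

Each merge produces a longer span, so on a finite text the loop terminates once no merge adds a new (string, span) pair.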
Supplementary note 5. The method according to supplementary note 4, further comprising expanding and merging the particular word string to the left, the leftward expanding and merging comprising:
determining, using the position information, whether the particular word string is adjacent to or partially overlaps other word strings in the set of matched word strings on the left side; and
if it is determined that there is an adjacent or partially overlapping word string on the left side, and that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a prefix string in the new text corpus is greater than a third threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold, merging the adjacent or partially overlapping word string on the left side into a new word string.
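The leftward step of supplementary note 5 mirrors the rightward one, with the prefix count and the third threshold taking the place of the suffix count and the first threshold. The same representational assumptions apply: spans as (word-tuple, start, end), the feature-array layout `[is_original, prefix_count, suffix_count, intermediate_count]`, and a hypothetical function name.

```python
def expand_and_merge_left(matches, feats, t2, t3):
    """Merge each eligible matched string with a string that is adjacent to
    it, or partially overlaps it, on its LEFT side."""
    changed = True
    while changed:
        changed = False
        for s, a, b in list(matches):
            f = feats.get(s, [False, 0, 0, 0])
            if not (f[0] or f[1] > t3 or f[3] > t2):
                continue                     # fails original/prefix/intermediate tests
            for s2, a2, b2 in list(matches):
                # s2 is adjacent (b2 == a) or partially overlapping (b2 > a) on the left
                if a2 < a <= b2 < b:
                    merged = s2 + s[b2 - a:]     # drop the overlapped words of s
                    if (merged, a2, b) not in matches:
                        matches.append((merged, a2, b))
                        changed = True
    return matches
```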
Supplementary note 6. The method according to supplementary note 5, wherein the first threshold, the second threshold, and the third threshold are the same.
Supplementary note 7. The method according to supplementary note 5, wherein the first threshold, the second threshold, and the third threshold are different.
Supplementary note 8. The method according to any one of supplementary notes 3 to 7, further comprising applying a weight to the word formation feature array of the word string of each text in the new text corpus.
Supplementary note 9. The method according to supplementary note 8, wherein the weight depends on the number of times the corresponding word string appears in the new text corpus, the number of times the corresponding word string appears in other text, the sum of the numbers of times all texts contained in the new text corpus appear in the new text corpus, and the sum of the numbers of times all texts in the new text corpus appear in the other text.
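Supplementary note 9 lists the four quantities the weight depends on but does not fix a formula. One plausible instantiation, offered purely as an assumption, is a log ratio of the string's relative frequency in the new (domain) corpus to its relative frequency in other (non-domain) text:

```python
import math

def weight(n_in, n_out, total_in, total_out, eps=1e-9):
    """Hypothetical weight from the four counts named in supplementary note 9:
    n_in / n_out   -- occurrences of this word string in the new corpus / other text
    total_in/out   -- summed occurrences of all corpus texts in each collection.
    eps avoids division by zero for strings unseen in the other text."""
    p_in = n_in / max(total_in, 1)           # relative frequency in the new corpus
    p_out = (n_out + eps) / max(total_out, 1)  # smoothed frequency in other text
    return math.log(p_in / p_out)            # > 0: more domain-specific
```

A string common in the domain dictionary but rare elsewhere gets a large positive weight; a string equally common in both gets a weight near zero.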
Supplementary note 10. The method according to supplementary note 9, wherein the original text corpus is a domain-specific dictionary, the texts in the original text corpus are entries in the dictionary, the word strings comprise one or more words or terms constituting the entries, and the other text is non-domain text.
Supplementary note 11. An apparatus for recognizing text, comprising:
decomposition means configured to decompose each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying means configured to identify, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
expanding and merging means configured to iteratively expand and merge adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
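The longest-match identification performed by the identifying means can be sketched with a naive scan. This is a minimal sketch under assumptions not in the patent: the corpus is a set of word-string tuples, the text is pre-tokenized, and `longest_matches` is a hypothetical name.

```python
def longest_matches(tokens, corpus):
    """For each position in the text to be recognized, return the longest
    word string in `corpus` that starts at that position, as
    (word_tuple, start, end) with end exclusive; positions with no match
    are skipped."""
    out = []
    for i in range(len(tokens)):
        best = None
        for j in range(i + 1, len(tokens) + 1):
            if tuple(tokens[i:j]) in corpus:
                best = (tuple(tokens[i:j]), i, j)  # keep the longest so far
        if best:
            out.append(best)
    return out
```

The recorded (start, end) spans are exactly the position information the expanding and merging means later uses to test adjacency and partial overlap.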
Supplementary note 12. The apparatus according to supplementary note 11, wherein the set of word strings includes prefix strings, which contain the first word of the text, suffix strings, which contain the last word of the text, and intermediate strings, which contain neither the first word nor the last word of the text.
Supplementary note 13. The apparatus according to supplementary note 12, wherein each text in the new text corpus comprises a word string corresponding to the text and a word formation feature array corresponding to the word string, the word formation feature array being constructed based on: whether the text is a text in the original text corpus, and the numbers of times the word string of the text appears as a prefix string, as a suffix string, and as an intermediate string in the new text corpus.
Supplementary note 14. The apparatus according to supplementary note 13, wherein the expanding and merging means comprises right-side expanding and merging means, the right-side expanding and merging means comprising:
right-side determining means configured to determine, using position information of a first word and a last word of a particular word string in the text to be recognized, whether the particular word string in the set of matched word strings is adjacent to or partially overlaps other word strings in the set on the right side; and
right-side merging means configured to merge the adjacent or partially overlapping word string on the right side into a new word string if it is determined that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a suffix string in the new text corpus is greater than a first threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold.
Supplementary note 15. The apparatus according to supplementary note 14, wherein the expanding and merging means further comprises left-side expanding and merging means, the left-side expanding and merging means comprising:
left-side determining means configured to determine, using the position information, whether the particular word string is adjacent to or partially overlaps other word strings in the set of matched word strings on the left side; and
left-side merging means configured to merge the adjacent or partially overlapping word string on the left side into a new word string if it is determined that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a prefix string in the new text corpus is greater than a third threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold.
Supplementary note 16. The apparatus according to supplementary note 15, wherein the first threshold, the second threshold, and the third threshold are the same or different.
Supplementary note 17. The apparatus according to any one of supplementary notes 13 to 16, further comprising weighting means configured to apply a weight to the word formation feature array of the word string of each text in the new text corpus.
Supplementary note 18. The apparatus according to supplementary note 17, wherein the weight depends on the number of times the corresponding word string appears in the new text corpus, the number of times the corresponding word string appears in other text, the sum of the numbers of times all texts contained in the new text corpus appear in the new text corpus, and the sum of the numbers of times all texts in the new text corpus appear in the other text.
Supplementary note 19. The apparatus according to supplementary note 18, wherein the original text corpus is a domain-specific dictionary, the texts in the original text corpus are entries in the dictionary, the word strings comprise one or more words or terms constituting the entries, and the other text is non-domain text.
Supplementary note 20. A computer-readable storage medium storing a program executable by a processor to perform the following operations:
decomposing each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
Finally, it should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Furthermore, without further limitation, an element introduced by the phrase "comprising a(n) …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, it should be understood that the embodiments described above are intended only to illustrate the present invention and do not limit it. It will be apparent to those skilled in the art that various modifications and variations can be made to the above-described embodiments without departing from the spirit and scope of the invention. Accordingly, the scope of the invention is defined only by the appended claims and their equivalents.

Claims (10)

1. A method for recognizing text, comprising:
decomposing each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
2. The method according to claim 1, wherein the set of word strings includes prefix strings, which contain the first word of the text, suffix strings, which contain the last word of the text, and intermediate strings, which contain neither the first word nor the last word of the text.
3. The method of claim 2, wherein each text in the new text corpus comprises a word string corresponding to the text and a word formation feature array corresponding to the word string, the word formation feature array being constructed based on: whether the text is a text in the original text corpus, and the number of times the word string of the text appears as a prefix word string, as a suffix word string, and as an intermediate word string in the new text corpus.
4. The method according to claim 3, wherein iteratively expanding and merging adjacent or partially overlapping word strings within the set of matched word strings comprises iteratively expanding and merging each matched word string to the right, the iterative rightward expanding and merging comprising:
determining, using position information of a first word and a last word of a particular word string in the text to be recognized, whether the particular word string in the set of matched word strings is adjacent to or partially overlaps other word strings in the set on the right side; and
if it is determined that there is an adjacent or partially overlapping word string on the right side, and that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a suffix string in the new text corpus is greater than a first threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold, merging the adjacent or partially overlapping word string on the right side into a new word string,
repeating the above steps until no new word string is generated.
5. The method according to claim 4, further comprising expanding and merging the particular word string to the left, the leftward expanding and merging comprising:
determining, using the position information, whether the particular word string is adjacent to or partially overlaps other word strings in the set of matched word strings on the left side; and
if it is determined that there is an adjacent or partially overlapping word string on the left side, and that the particular word string corresponds to a text in the original text corpus, or that the number of times the particular word string appears as a prefix string in the new text corpus is greater than a third threshold, or that the number of times it appears as an intermediate string in the new text corpus is greater than a second threshold, merging the adjacent or partially overlapping word string on the left side into a new word string.
6. The method of claim 5, wherein the first threshold, the second threshold, and the third threshold are the same or different.
7. The method according to any one of claims 3 to 6, further comprising applying a weight to the word formation feature array of the word string of each text in the new text corpus, the weight depending on the number of times the corresponding word string appears in the new text corpus, the number of times the corresponding word string appears in other text, the sum of the numbers of times all texts contained in the new text corpus appear in the new text corpus, and the sum of the numbers of times all texts in the new text corpus appear in the other text.
8. The method according to claim 7, wherein the original text corpus is a domain-specific dictionary, the texts in the original text corpus are entries in the dictionary, the word strings comprise one or more words or terms constituting the entries, and the other text is non-domain text.
9. An apparatus for recognizing text, comprising:
decomposition means configured to decompose each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying means configured to identify, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
expanding and merging means configured to iteratively expand and merge adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
10. A computer-readable storage medium storing a program executable by a processor to:
decomposing each text in an original text corpus into a set of word strings, the sets of word strings being combined with the corresponding texts of the original text corpus to form a new text corpus;
identifying, starting from each word in the text to be recognized, the word string in the new text corpus that starts with that word and has the longest match with the text to be recognized; and
iteratively expanding and merging adjacent or partially overlapping word strings in the set of matched word strings according to position information of the identified matched word strings in the text to be recognized, so as to obtain a final recognition result.
CN202010256902.7A 2020-04-01 2020-04-01 Method, apparatus and storage medium for recognizing text Active CN113496116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010256902.7A CN113496116B (en) 2020-04-01 2020-04-01 Method, apparatus and storage medium for recognizing text


Publications (2)

Publication Number Publication Date
CN113496116A true CN113496116A (en) 2021-10-12
CN113496116B CN113496116B (en) 2024-07-05

Family

ID=77994703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010256902.7A Active CN113496116B (en) 2020-04-01 2020-04-01 Method, apparatus and storage medium for recognizing text

Country Status (1)

Country Link
CN (1) CN113496116B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196900A (en) * 2007-12-27 2008-06-11 中国移动通信集团湖北有限公司 Information searching method based on metadata
US20090171953A1 (en) * 2007-12-26 2009-07-02 Cameron Craig Morris Techniques for recognizing multiple patterns within a string
CN110162782A (en) * 2019-04-17 2019-08-23 平安科技(深圳)有限公司 Entity extraction method, apparatus, equipment and storage medium based on Medical Dictionary


Also Published As

Publication number Publication date
CN113496116B (en) 2024-07-05

Similar Documents

Publication Publication Date Title
US11030407B2 (en) Computer system, method and program for performing multilingual named entity recognition model transfer
CN110427618B (en) Countermeasure sample generation method, medium, device and computing equipment
US8484238B2 (en) Automatically generating regular expressions for relaxed matching of text patterns
JP5599662B2 (en) System and method for converting kanji into native language pronunciation sequence using statistical methods
KR101136007B1 (en) System and method for anaylyzing document sentiment
Mohtaj et al. Parsivar: A language processing toolkit for Persian
US20200372088A1 Recommending web API's and associated endpoints
Yazdani et al. Automated misspelling detection and correction in Persian clinical text
US9984071B2 (en) Language ambiguity detection of text
US10546065B2 (en) Information extraction apparatus and method
JP6077727B1 (en) Computer system, method, and program for transferring multilingual named entity recognition model
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
Vanetik et al. An unsupervised constrained optimization approach to compressive summarization
Laur et al. Estnltk 1.6: Remastered estonian nlp pipeline
US20050273316A1 (en) Apparatus and method for translating Japanese into Chinese and computer program product
Pérez et al. Inferred joint multigram models for medical term normalization according to ICD
CN111859858A (en) Method and device for extracting relationship from text
Loftsson et al. Tagging a morphologically complex language using an averaged perceptron tagger: The case of Icelandic
List et al. Toward a sustainable handling of interlinear-glossed text in language documentation
KR101663038B1 Entity boundary detection apparatus in text by usage-learning on the entity's surface string candidates and method thereof
CN111368547A (en) Entity identification method, device, equipment and storage medium based on semantic analysis
JP4361299B2 (en) Evaluation expression extraction apparatus, program, and storage medium
CN113496116A (en) Method, apparatus, and storage medium for recognizing text
JP3441400B2 (en) Language conversion rule creation device and program recording medium
Brierley et al. Tools for Arabic Natural Language Processing: a case study in qalqalah prosody

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant