CN111125306A - Method, device and equipment for determining central word and storage medium - Google Patents

Method, device and equipment for determining central word and storage medium

Info

Publication number
CN111125306A
Authority
CN
China
Prior art keywords
word
target
value
mutual information
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911259955.8A
Other languages
Chinese (zh)
Inventor
陈建华
崔朝辉
赵立军
张霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201911259955.8A
Publication of CN111125306A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for determining a central word. The method includes: obtaining a word segmentation result corresponding to a target text; when the value of mutual information between a consecutive first object and second object in the word segmentation result is greater than a first preset threshold, splicing the first object and the second object to obtain a target word, where the first object and the second object are characters and/or words obtained by segmenting the target text; and determining a central word of the target text according to the target word. Two consecutive objects in the word segmentation result are thus spliced into one word, namely the target word. Because the target word is more likely to be the central word than the first object or the second object alone, the central word determined based on the target word is relatively accurate.

Description

Method, device and equipment for determining central word and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining a central word.
Background
A central word is generally a word that reflects the subject or main content of a text such as a paragraph or an article. In practical applications, the central word of a text may be extracted and mapped to the text, so that the corresponding text can later be found from the central word. For example, in the field of information retrieval, if a keyword input by a user matches the central word, the text corresponding to the central word can be presented to the user, so as to meet the user's need for information.
However, whether a suitable central word can be extracted for a text depends to a large extent on how accurately the text is segmented. In particular, vocabulary in certain application fields, such as professional terms in the medical field, is difficult for current word segmenters to recognize accurately, so the accuracy of the central words extracted for such texts is low.
Disclosure of Invention
In order to solve the above problem, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for determining a central word, so as to improve the accuracy of determining the central word.
In a first aspect, an embodiment of the present application provides a method for determining a central word, where the method includes:
acquiring a word segmentation result corresponding to a target text;
when the value of mutual information between a consecutive first object and second object in the word segmentation result is greater than a first preset threshold, splicing the first object and the second object to obtain a target word, where the first object and the second object are characters and/or words obtained by segmenting the target text;
and determining the central word of the target text according to the target word.
In a possible implementation, when the value of mutual information between the consecutive first object and second object in the word segmentation result is greater than the first preset threshold, the splicing the first object and the second object to obtain the target word includes:
when the number of consecutive single characters in the word segmentation result is greater than a preset number, calculating the value of mutual information between the first object and the second object, where the first object and the second object are two characters among the consecutive single characters;
and when the value of mutual information between the first object and the second object is greater than the first preset threshold, splicing the first object and the second object to obtain the target word.
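The trigger condition in this implementation can be sketched in Python as follows. This is only an illustrative sketch: the function names `single_char_runs` and `candidate_pairs`, the choice to pair adjacent characters inside a run, and the sample tokens are assumptions made here and are not taken from the application.

```python
def single_char_runs(tokens, preset_number):
    """Return (start, end) index pairs, end exclusive, for every run of
    consecutive single-character tokens whose length exceeds preset_number."""
    runs = []
    start = None
    # An empty sentinel token flushes a run that reaches the end of the list.
    for idx, tok in enumerate(list(tokens) + [""]):
        if len(tok) == 1:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start > preset_number:
                runs.append((start, idx))
            start = None
    return runs


def candidate_pairs(tokens, preset_number):
    """Yield adjacent (first object, second object) pairs inside qualifying runs.
    Pairing adjacent characters is an assumption of this sketch."""
    for start, end in single_char_runs(tokens, preset_number):
        for i in range(start, end - 1):
            yield tokens[i], tokens[i + 1]


if __name__ == "__main__":
    # Illustrative tokens only; the last two are multi-character words.
    sample = ["扑", "热", "息", "痛", "是", "一种", "药物"]
    print(single_char_runs(sample, 2))
    print(list(candidate_pairs(sample, 2)))
```

Only the pairs produced this way would then have their mutual information calculated and compared with the first preset threshold.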
In a possible implementation, the determining the central word of the target text according to the target word includes:
calculating the value of mutual information between the target word and a third object, where the third object is a character or a word that is consecutive with the first object in the word segmentation result, or a character or a word that is consecutive with the second object in the word segmentation result;
when the value of mutual information between the target word and the third object is greater than a second preset threshold, splicing the target word and the third object to obtain a new target word;
and determining the central word of the target text according to the new target word.
In a possible implementation, the method further includes:
when the value of mutual information between the first object and the second object is less than the first preset threshold, determining the first object as an independent character or word, and determining the central word of the target text based on the independent character or word.
In a possible implementation, the method further includes:
calculating the value of mutual information between the second object and a fourth object, where the fourth object is a character or a word that is consecutive with the second object in the word segmentation result;
when the value of mutual information between the second object and the fourth object is greater than the first preset threshold, splicing the second object and the fourth object to obtain a new target word;
and determining the central word of the target text according to the new target word.
In a possible implementation, the determining the central word of the target text according to the target word includes:
calculating a weight value corresponding to the target word based on the term frequency-inverse document frequency (TF-IDF) algorithm;
and determining the central word from a plurality of candidate words of the target text according to the weight value corresponding to the target word.
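As a rough illustration of the TF-IDF weighting step, the following Python sketch scores every candidate word of a text against a small corpus and picks the highest-weighted one. The application does not specify the TF-IDF variant or the selection rule, so the smoothed IDF formula, the "take the maximum weight" rule, the function names and the sample corpus are all assumptions of this sketch.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """Return a TF-IDF weight for each distinct word in doc_tokens.
    corpus is a list of token lists (already segmented and spliced)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        # Document frequency: number of corpus texts containing the word.
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumption; the source names no specific variant).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = (count / len(doc_tokens)) * idf
    return weights


def pick_central_word(doc_tokens, corpus):
    """Pick the candidate with the largest weight (assumed selection rule)."""
    weights = tfidf_weights(doc_tokens, corpus)
    return max(weights, key=weights.get)


if __name__ == "__main__":
    # Made-up corpus of three already-spliced texts.
    corpus = [
        ["扑热息痛", "是", "一种", "药物"],
        ["青霉素", "是", "一种", "药物"],
        ["干酵母", "是", "一种", "药物"],
    ]
    print(pick_central_word(corpus[0], corpus))
```

In this toy corpus the spliced drug name appears in only one text, so it receives a higher weight than the function words shared by all three texts.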
In a second aspect, an embodiment of the present application further provides an apparatus for determining a central word, where the apparatus includes:
an acquisition module, configured to acquire a word segmentation result corresponding to a target text;
a first splicing module, configured to splice a first object and a second object to obtain a target word when the value of mutual information between the consecutive first object and second object in the word segmentation result is greater than a first preset threshold, where the first object and the second object are characters and/or words obtained by segmenting the target text;
and a first determining module, configured to determine the central word of the target text according to the target word.
In a possible implementation, the first splicing module includes:
a first calculating unit, configured to calculate the value of mutual information between the first object and the second object when the number of consecutive single characters in the word segmentation result is greater than a preset number, where the first object and the second object are two characters among the consecutive single characters;
and a first splicing unit, configured to splice the first object and the second object to obtain the target word when the value of mutual information between the first object and the second object is greater than the first preset threshold.
In a possible implementation, the first determining module includes:
a second calculating unit, configured to calculate the value of mutual information between the target word and a third object, where the third object is a character or a word that is consecutive with the first object in the word segmentation result, or a character or a word that is consecutive with the second object in the word segmentation result;
a second splicing unit, configured to splice the target word and the third object to obtain a new target word when the value of mutual information between the target word and the third object is greater than a second preset threshold;
and a first determining unit, configured to determine the central word of the target text according to the new target word.
In a possible implementation, the apparatus further includes:
a second determining module, configured to determine the first object as an independent character or word when the value of mutual information between the first object and the second object is less than the first preset threshold;
and a third determining module, configured to determine the central word of the target text based on the independent character or word.
In a possible implementation, the apparatus further includes:
a calculating module, configured to calculate the value of mutual information between the second object and a fourth object, where the fourth object is a character or a word that is consecutive with the second object in the word segmentation result;
a second splicing module, configured to splice the second object and the fourth object to obtain a new target word when the value of mutual information between the second object and the fourth object is greater than the first preset threshold;
and a fourth determining module, configured to determine the central word of the target text according to the new target word.
In a possible implementation, the first determining module includes:
a third calculating unit, configured to calculate a weight value corresponding to the target word based on the term frequency-inverse document frequency (TF-IDF) algorithm;
and a second determining unit, configured to determine the central word from a plurality of candidate words of the target text according to the weight value corresponding to the target word.
In a third aspect, an embodiment of the present application further provides a device, where the device includes a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the method for determining a central word according to any of the embodiments of the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium including instructions which, when run on a computer, cause the computer to perform the method for determining a central word according to any one of the embodiments of the first aspect.
In the embodiments of the present application, a word segmentation result corresponding to the target text is obtained. Because the words produced by the word segmenter may not be accurate, the value of mutual information between two consecutive objects in the word segmentation result is also considered: when the value of mutual information between a consecutive first object and second object in the word segmentation result is greater than a first preset threshold, the first object and the second object are spliced to obtain a target word, where the first object and the second object are characters and/or words obtained by segmenting the target text, and the central word of the target text is then determined according to the target word. In other words, when the central word of a text is extracted, a value of mutual information between two consecutive objects that is greater than the preset threshold indicates that the two objects are likely to form one word, so they can be spliced into one word, namely the target word. Because the target word is more likely to be the central word than the first object or the second object alone, the central word determined based on the target word is relatively accurate.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some of the embodiments described in the present application, and those skilled in the art can obtain other drawings from them.
FIG. 1 is a schematic diagram of an exemplary application scenario in an embodiment of the present application;
FIG. 2 is a schematic diagram of another exemplary application scenario in an embodiment of the present application;
FIG. 3 is a flowchart of a method for determining a central word in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for determining a central word in an embodiment of the present application;
FIG. 5 is a schematic hardware structure diagram of a device in an embodiment of the present application.
Detailed Description
In practical applications, when the central word of a text such as a sentence, a paragraph or an article is extracted, an existing word segmenter is usually used to segment the text, and the central word that reflects the topic or main content of the text is then determined from the resulting segments. However, when central words are extracted from texts in certain technical fields (such as medical, computer or biochemical texts), the accuracy of the extracted central word is not high.
The inventors found that texts in these technical fields contain many professional terms, and that existing word segmenters segment such terms with low accuracy. For example, when "扑热息痛" (paracetamol, an antipyretic and analgesic drug without anti-inflammatory effect) is segmented by a current word segmenter, the result is likely to be the four separate characters "扑", "热", "息" and "痛". At the same time, professional terms are highly likely to reflect the topic or main content of a text, so when they are segmented inaccurately, the accuracy of the central word finally extracted for the text is also low.
Based on this, the embodiments of the present application provide a method for determining a central word, so as to improve the accuracy of central word extraction. Specifically, a word segmentation result corresponding to the target text is obtained. Because the words produced by the word segmenter may not be accurate, the value of mutual information between two consecutive objects in the word segmentation result is also considered: when the value of mutual information between a consecutive first object and second object is greater than a first preset threshold, the first object and the second object are spliced to obtain a target word, where the first object and the second object are characters and/or words obtained by segmenting the target text, and the central word of the target text is then determined according to the target word. A value of mutual information above the preset threshold indicates that the two objects are likely to form one word, so they can be spliced into one word, namely the target word. Because the target word is more likely to be the central word than the first object or the second object alone, the central word determined based on the target word is relatively accurate.
As an example, the embodiments of the present application may be applied to the exemplary application scenario shown in FIG. 1. In this scenario, the user 101 can use the terminal 102 to extract the central word of a text (hereinafter referred to as the target text). In a specific implementation, the user 101 inputs the target text into the terminal 102. After obtaining the target text, the terminal 102 uses a pre-configured word segmenter to segment the target text and obtain the corresponding word segmentation result. The terminal 102 then calculates the value of mutual information between two consecutive objects in the word segmentation result and, when that value is greater than a preset threshold, splices the two objects into one object to obtain a target word, where each of the two objects may be a character or a word obtained by segmenting the target text. In this way, the terminal 102 adjusts the word segmentation result through the splicing process, determines the central word of the target text based on the segments in the adjusted result, and presents the determined central word to the user 101.
It is to be understood that the above scenario is only one example provided in the embodiments of the present application, and the embodiments are not limited to it. For example, in other possible application scenarios, the user needs to extract central words from a large number of texts while the computing resources and data processing capability of the terminal are limited. In that case, after the terminal 102 acquires the target text input by the user, it can send the target text to the server 103 for processing and present the central word determined by the server 103 to the user 101, as shown in FIG. 2. In summary, the embodiments of the present application may be applied in any applicable scenario and are not limited to the scenario examples described above.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, various non-limiting embodiments accompanying the present application examples are described below with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to FIG. 3, FIG. 3 is a flowchart of a method for determining a central word in an embodiment of the present application. The method may be performed by the terminal 102 shown in FIG. 1, by the server 103 shown in FIG. 2, or by the terminal 102 and the server 103 in cooperation. The method specifically includes the following steps:
S301: Acquire a word segmentation result corresponding to the target text.
In an exemplary implementation, when the central word of a text (hereinafter referred to as the target text for convenience of description) is extracted, the target text may first be segmented. Specifically, the target text may be input into a pre-configured word segmenter, which segments the target text, so that a central word reflecting the topic or main content of the target text can be determined from the characters and/or words (that is, the word segmentation result) output by the word segmenter.
It can be understood that the accuracy of current word segmenters is usually not high, especially for texts that contain many professional terms, where a complete word is often split into several characters and/or words. For example, when the word "扑热息痛" (paracetamol) in a text is segmented, the word segmenter generally splits it into "扑", "热", "息" and "痛", whereas the correct result is to keep "扑热息痛" as one complete word. Therefore, if the central word of the target text is determined directly from the segments in the word segmentation result, the extracted central word may not be the actual central word of the target text. For a text describing the drug "扑热息痛", the central word should be "扑热息痛"; but since the word segmenter cannot segment this word correctly, the central word finally extracted for the text will not be "扑热息痛".
Based on this, in this embodiment, the word segmentation result output by the word segmenter may be adjusted using mutual information; specifically, some of the objects in the word segmentation result may be spliced. Taking the text about "扑热息痛" above as an example, the four characters "扑", "热", "息" and "痛" may be spliced into the word "扑热息痛" based on mutual information, so that "扑热息痛" can be taken as the central word of the text from the adjusted word segmentation result, which improves the accuracy of the extracted central word. Therefore, after the word segmentation result is obtained in step S301, this embodiment continues with the following step S302.
S302: When the value of mutual information between a consecutive first object and second object in the obtained word segmentation result is greater than a first preset threshold, splice the first object and the second object to obtain a target word, where the first object and the second object are characters and/or words obtained by segmenting the target text.
After the word segmentation result is obtained, the value of mutual information between two consecutive objects (the first object and the second object) in the word segmentation result can be calculated. The value of mutual information represents the degree of association between the first object and the second object: in general, the larger the value, the higher the degree of association between the two objects, and the smaller the value, the lower the degree of association. In this embodiment, a first preset threshold is used to demarcate the degree of association. When the value of mutual information between the first object and the second object is greater than the first preset threshold, the degree of association between them is high enough that they should not be divided into independent characters or words, so the first object and the second object can be spliced to obtain a target word that includes both. For example, the drug name "双克" (also called "双氢克尿噻", hydrochlorothiazide) may be split into two characters (objects), "双" and "克", during the initial word segmentation. When the value of mutual information between the two characters is greater than the first preset threshold, "双" and "克" can be spliced into "双克" in the order in which they appear in the text, so that the complete word is recovered and the target word "双克" obtained by splicing can participate in the determination of the central word of the target text.
Of course, if the value of mutual information between the first object and the second object is not greater than the first preset threshold, the degree of association between the two objects is low. In that case the first object is treated as an independent object, that is, determined to be an independent character or word, and it participates in the determination of the central word of the target text as that independent character or word.
The first object and the second object may both be single characters, such as "双" and "克" above; or both be words, such as "苄星" (benzathine) and "青霉素" (penicillin) (benzathine penicillin is a drug suitable for patients who need long-term penicillin prophylaxis); or be a combination of a character and a word, such as "干" (dry) and "酵母" (yeast) (dry yeast is a dried cell preparation).
As an example, the value of mutual information between the first object and the second object may be calculated using the following formula (1):
$$I(A, B) = \log \frac{P(A, B)}{P(A)\,P(B)} \tag{1}$$
where A and B denote the first object and the second object respectively, I(A, B) denotes the value of mutual information between A and B, P(A, B) denotes the probability that A and B appear together, and P(A) and P(B) denote the probabilities that A and B, respectively, appear in a preset corpus. In practical applications, when the user needs to extract central words from a large number of texts, those texts can themselves be used as the corpus. Assuming that the preset corpus contains N segmented words, formula (1) can be expressed as the following formula (2):
$$I(A, B) = \log \frac{N \cdot n(A, B)}{n(A)\,n(B)} \tag{2}$$
where n(A) denotes the number of times A appears in the corpus, n(B) denotes the number of times B appears in the corpus, and n(A, B) denotes the number of times A and B appear together in the corpus.
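A count-based calculation of this kind can be sketched in Python as follows. The sketch makes several assumptions that the text does not state: "appearing together" is interpreted as A being immediately followed by B, the natural logarithm is used, a pair that never occurs is given a value of negative infinity, and the names `build_counts` and `mutual_information` as well as the toy corpus are invented for illustration.

```python
import math
from collections import Counter


def build_counts(segmented_texts):
    """Count single objects and adjacent object pairs over a corpus,
    where each text is a list of segmented characters and/or words."""
    unigram = Counter()
    bigram = Counter()
    for tokens in segmented_texts:
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))
    return unigram, bigram


def mutual_information(a, b, unigram, bigram):
    """Count-based mutual information: log(N * n(A, B) / (n(A) * n(B))).
    Returns -inf when the pair (or either object) never occurs."""
    n_total = sum(unigram.values())  # N: number of segmented objects
    n_ab = bigram[(a, b)]            # n(A, B): A immediately followed by B
    if n_ab == 0 or unigram[a] == 0 or unigram[b] == 0:
        return float("-inf")
    return math.log(n_total * n_ab / (unigram[a] * unigram[b]))


if __name__ == "__main__":
    # Toy corpus of two segmented texts (illustrative only).
    corpus = [["双", "克", "是", "药物"], ["双", "克", "片"]]
    unigram, bigram = build_counts(corpus)
    print(mutual_information("双", "克", unigram, bigram))
```

With such a function, the comparison against the first preset threshold reduces to a single numeric test per pair of consecutive objects.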
It should be noted that the first object and the second object may be any two consecutive objects in the word segmentation result. Based on the value of mutual information between them, either the two consecutive objects are spliced into one object, or the first object is determined to be an independent character or word.
S303: Determine the central word of the target text according to the target word.
In this embodiment, when the value of mutual information between the first object and the second object is greater than the first preset threshold, the first object and the second object may be spliced to obtain a target word, and the central word of the target text may be determined based on the target word. In one implementation, the target word itself may be the central word of the target text. For example, when the target text describes the drug "苄星青霉素" (benzathine penicillin), the target word "苄星青霉素" obtained by splicing "苄星" and "青霉素" may be used as the central word of the text.
Of course, in other possible implementations, the target word obtained by splicing may still not be a complete word. For example, assuming that the first object is "扑" and the second object is "热", the target word obtained by splicing them is "扑热", which is still only a part of the complete word "扑热息痛" (paracetamol). Therefore, after the target word is obtained, the value of mutual information between the target word and a third object can be calculated, and whether the target word and the third object need to be spliced is determined based on that value. The third object is a character or a word in the word segmentation result that is consecutive with the first object (that is, it immediately precedes the first object in the text), or a character or a word that is consecutive with the second object (that is, it immediately follows the second object in the text). When the value of mutual information between the target word and the third object is greater than a second preset threshold, the target word and the third object may be spliced in the order in which they appear in the text to obtain a new target word, so that the central word of the text can be determined based on the new target word. When the value of mutual information between the target word and the third object is not greater than the second preset threshold, the target word is determined to be suitable as an independent word. The first preset threshold may be equal to or greater than the second preset threshold.
Further, after a new target word is obtained, the value of mutual information between the new target word and its next consecutive object may be calculated in turn, and whether to continue splicing is decided by comparing that value with the corresponding threshold, until the value of mutual information between the spliced word and its next consecutive object is not greater than the corresponding preset threshold. By repeatedly splicing objects in this way, the value of mutual information between any two consecutive objects in the final word segmentation result is not greater than the first preset threshold.
Accordingly, once the value of mutual information between the current object (which may be a spliced word or an object from the initial word segmentation) and its next consecutive object is determined to be not greater than the corresponding preset threshold, the calculation continues from that next consecutive object. Taking the first object and the second object as an example, when the value of mutual information between them is not greater than the first preset threshold, the first object is determined to be an independent object. The value of mutual information between the second object and its next consecutive object (referred to as the fourth object for convenience of description) is then calculated. If that value is greater than the first preset threshold, the second object and the fourth object are spliced into one object; if it is not greater than the first preset threshold, the second object is determined to be an independent object, and the calculation continues from the fourth object and its next consecutive object.
For ease of understanding, the above process is described with the target text "paracetamol is a drug" (in Chinese, "paracetamol" is the four-character word 扑热息痛). The word segmentation result obtained by segmenting the target text with a word segmenter is "扑" (plop), "热" (heat), "息" (breath), "痛" (pain), "is", "a kind of", and "drug". The value of mutual information between consecutive objects in the word segmentation result is calculated iteratively. Because the value of mutual information between "扑" and "热" is greater than the first preset threshold, "扑" and "热" can be spliced into "扑热" (i.e., the target word). Then, the value of mutual information between "扑热" and its next consecutive object "息" is calculated; because this value is greater than the corresponding second preset threshold, "扑热" and "息" can be spliced to obtain "扑热息" (i.e., the new target word). Next, the value of mutual information between "扑热息" and its next consecutive object "痛" is calculated; because this value is still greater than the corresponding preset threshold, splicing continues and yields "扑热息痛" (paracetamol). Finally, the value of mutual information between "扑热息痛" and the next object "is" is calculated; because this value is smaller than the corresponding preset threshold, "扑热息痛" is confirmed to be a complete word. The four objects "扑", "热", "息", and "痛" in the original word segmentation result are thus spliced into the single object "扑热息痛" (paracetamol).
Then, the value of mutual information between the object "is" and its next consecutive object "a kind of" can be calculated; because this value is not greater than the first preset threshold, "is" can be determined as an independent object. Next, the value of mutual information between the object "a kind of" and the last object "drug" is calculated; because this value is also not greater than the first preset threshold, "a kind of" and "drug" can each be determined as independent objects.
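As a non-limiting illustration, the iterative splicing described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation of the embodiment: the names splice_objects, mutual_info, first_threshold, and later_threshold are hypothetical, the toy scores are invented for the example, and how the value of mutual information is actually computed is left to the caller.

```python
from typing import Callable, List


def splice_objects(
    objects: List[str],
    mutual_info: Callable[[str, str], float],
    first_threshold: float,
    later_threshold: float,
) -> List[str]:
    """Greedily splice consecutive objects whose mutual information is high.

    `mutual_info` is a caller-supplied function (an assumption of this sketch).
    """
    result = []
    i = 0
    while i < len(objects):
        current = objects[i]
        threshold = first_threshold
        j = i + 1
        # Keep absorbing the next object while the value exceeds the threshold.
        while j < len(objects) and mutual_info(current, objects[j]) > threshold:
            current += objects[j]
            threshold = later_threshold  # threshold applied to an already spliced word
            j += 1
        # `current` is now either a spliced word or an independent object.
        result.append(current)
        i = j  # continue from the object that was not spliced
    return result


# Toy scores (invented for illustration); unseen pairs score 0.0.
toy_scores = {("扑", "热"): 5.0, ("扑热", "息"): 4.0, ("扑热息", "痛"): 4.5}


def toy_mi(left: str, right: str) -> float:
    return toy_scores.get((left, right), 0.0)


tokens = ["扑", "热", "息", "痛", "是", "一种", "药物"]
spliced = splice_objects(tokens, toy_mi, first_threshold=3.0, later_threshold=2.5)
print(spliced)  # -> ['扑热息痛', '是', '一种', '药物']
```

Passing the mutual information function as a parameter keeps the sketch independent of any particular corpus statistics; the two thresholds mirror the first and second preset thresholds mentioned in the description.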
It should be noted that, in this embodiment, a window with a fixed size in characters may be preset, and the mutual information calculation may be performed on the objects within the window to determine whether to splice them. The window size may be determined according to the number of Chinese characters that words contain in the practical application. For example, if terms in fields such as medical technology or computer technology generally do not exceed 9 Chinese characters, the window size may be set to 9 characters. When the value of mutual information between an object A (which may be the first object or an object obtained by splicing) and an object B in the window is smaller than the corresponding preset threshold, the window may move backward so that the first object in the moved window is object B, and the calculation of the value of mutual information with the next consecutive object continues from object B. This process is repeated until the mutual information calculation has been completed for the last object in the word segmentation result.
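One possible reading of the fixed-size window can be sketched in Python as follows. The sketch assumes that the window caps the length, in characters, of any spliced word and that the window restarts at the first object that was not spliced; this interpretation, the name splice_within_window, and the parameter window_chars are assumptions made for illustration only.

```python
from typing import Callable, List


def splice_within_window(
    objects: List[str],
    mutual_info: Callable[[str, str], float],
    threshold: float,
    window_chars: int = 9,
) -> List[str]:
    """Splice consecutive objects, never letting a spliced word outgrow the window."""
    result = []
    start = 0
    while start < len(objects):
        current = objects[start]
        nxt = start + 1
        while (
            nxt < len(objects)
            # The candidate word must still fit inside the window (assumed rule).
            and len(current) + len(objects[nxt]) <= window_chars
            and mutual_info(current, objects[nxt]) > threshold
        ):
            current += objects[nxt]
            nxt += 1
        result.append(current)
        start = nxt  # the moved window begins at the object that was not spliced
    return result
```

With a constant mutual information of 1.0, a threshold of 0.5, and window_chars=3, the objects "a", "b", "c", "d" would be spliced into "abc" and "d", because adding "d" would exceed the window.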
Of course, in other possible embodiments, the value of mutual information may be calculated only for runs of consecutive single characters in the word segmentation result, so as to determine whether those single characters are components of one complete word. It can be understood that, for terms such as "paracetamol" (扑热息痛) and "alteplase" (阿替普酶), a word segmenter is likely to split the former into the consecutive single characters "扑", "热", "息", "痛" and the latter into the consecutive single characters "阿", "替", "普", "酶". Based on this, in this embodiment, when the objects in the word segmentation result are adjusted, the mutual information calculation may be performed on the consecutive single characters to determine whether the segmenter's segmentation of them is accurate.
In a specific implementation, when it is determined that the number of consecutive single characters in the word segmentation result is greater than a preset number (e.g., 3 or 4), the value of mutual information between the first character (i.e., the first object) and the second character (i.e., the second object) of the run may be calculated first, and when this value is greater than the first preset threshold, the first character and the second character are spliced. Then, the value of mutual information between the spliced word and the next single character may be calculated to determine whether to splice that character onto the word as well. This continues until the mutual information calculation has been completed for the last of the consecutive single characters, or until the value of mutual information between the spliced word and the next single character is smaller than the corresponding preset threshold.
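The restriction to runs of consecutive single characters can be illustrated with the following Python sketch. It is a simplified example under stated assumptions: the names splice_single_char_runs and min_run are hypothetical, a single threshold is used for every comparison, and the mutual information function and its toy scores are supplied by the caller purely for demonstration.

```python
from typing import Callable, List


def _splice_run(run: List[str], mutual_info: Callable[[str, str], float],
                threshold: float) -> List[str]:
    """Greedy splicing inside one run of single characters."""
    spliced = []
    current = run[0]
    for nxt in run[1:]:
        if mutual_info(current, nxt) > threshold:
            current += nxt           # the character belongs to the current word
        else:
            spliced.append(current)  # the current word is complete
            current = nxt
    spliced.append(current)
    return spliced


def splice_single_char_runs(objects: List[str],
                            mutual_info: Callable[[str, str], float],
                            threshold: float, min_run: int = 3) -> List[str]:
    """Only touch runs of more than `min_run` consecutive single characters."""
    result = []
    i = 0
    while i < len(objects):
        j = i
        while j < len(objects) and len(objects[j]) == 1:
            j += 1                   # extend the run of single characters
        run = objects[i:j]
        if len(run) > min_run:
            result.extend(_splice_run(run, mutual_info, threshold))
            i = j
        elif run:
            result.extend(run)       # run too short: keep the segmenter's output
            i = j
        else:
            result.append(objects[i])  # multi-character object, left unchanged
            i += 1
    return result


# Toy scores (invented for illustration); unseen pairs score 0.0.
toy_scores = {("扑", "热"): 5.0, ("扑热", "息"): 4.0, ("扑热息", "痛"): 4.5}
toy_mi = lambda left, right: toy_scores.get((left, right), 0.0)

print(splice_single_char_runs(["扑", "热", "息", "痛", "是", "一种", "药物"], toy_mi, 3.0))
# -> ['扑热息痛', '是', '一种', '药物']
```

Skipping short runs avoids computing mutual information where the segmenter's output is less likely to be a fragmented term.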
In one example, for a spliced object S (e.g., "paracetamol") composed of multiple objects (obtained by word segmentation), the mutual information of the whole character string may be:
[Formula image BDA0002311352020000121; the formula is not reproduced in this text]
where I(S) denotes the mutual information of the whole character string of the spliced object S; m is a positive integer greater than 1 and denotes the number of objects contained in the spliced object S; S_i denotes the i-th object in the whole character string; n(S_i) denotes the number of occurrences of S_i in the corpus; i is a positive integer in the range [2, m]; S_{i-1} denotes the (i-1)-th object; and n(S_{i-1}) denotes the number of occurrences of S_{i-1} in the corpus.
After the objects in the word segmentation result are spliced by the above process, a weight value for each object in the spliced word segmentation result can be calculated using the term frequency-inverse document frequency (TF-IDF) algorithm. The higher the weight value of an object, the more suitable it is as the central word of the text; conversely, the lower the weight value, the less suitable it is. Therefore, in one example, the object with the largest weight value may be used as the central word of the text. It can be understood that, after consecutive objects are spliced according to the value of mutual information between them, the accuracy of the central word extracted from the spliced objects is generally higher than that of a central word extracted directly from the original word segmentation result.
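The weighting step can be illustrated with the following Python sketch. It uses one common smoothed variant of TF-IDF, which is an assumption of this example and is not stated in the description; the names tfidf_weights and central_word and the three-document toy corpus are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(document: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Weight each object of `document` by a smoothed TF-IDF (assumed variant)."""
    counts = Counter(document)
    n_docs = len(corpus)
    weights = {}
    for obj, count in counts.items():
        tf = count / len(document)
        # Document frequency: number of corpus documents containing the object.
        df = sum(1 for doc in corpus if obj in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[obj] = tf * idf
    return weights


def central_word(document: List[str], corpus: List[List[str]]) -> str:
    """Return the object with the largest weight value."""
    weights = tfidf_weights(document, corpus)
    return max(weights, key=weights.get)


# Toy corpus of already spliced segmentation results (invented for illustration).
target = ["扑热息痛", "是", "一种", "药物"]
corpus = [
    target,
    ["阿司匹林", "是", "一种", "药物"],
    ["这", "是", "一种", "方法"],
]
print(central_word(target, corpus))  # -> 扑热息痛
```

In this toy corpus the spliced term appears in only one document, so it receives the largest weight, whereas the unspliced single characters would each have been weighted separately.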
In this embodiment, a word segmentation result corresponding to a target text is obtained first. Because the words produced by the word segmenter may not be accurate, the value of mutual information between two consecutive objects in the word segmentation result is also considered: when the value of mutual information between a consecutive first object and second object in the word segmentation result is greater than a first preset threshold, the first object and the second object are spliced to obtain a target word, where the first object and the second object are characters and/or words obtained by segmenting the target text. The central word of the target text can then be determined according to the target word. In other words, during extraction of the central word, a value of mutual information between two consecutive objects that is greater than the preset threshold indicates that the two objects very likely form a single word, so they can be spliced into one word, namely the target word. The target word is more likely to be the central word than the first object or the second object alone, so the central word determined based on the target word is relatively accurate.
In addition, an embodiment of the present application further provides an apparatus for determining a central word. Referring to FIG. 4, which is a schematic structural diagram of an apparatus for determining a central word in an embodiment of the present application, the apparatus 400 may include:
an obtaining module 401, configured to obtain a word segmentation result corresponding to a target text;
a first splicing module 402, configured to splice a first object and a second object to obtain a target word when the value of mutual information between the consecutive first object and second object in the word segmentation result is greater than a first preset threshold, where the first object and the second object are characters and/or words obtained by segmenting the target text;
a first determining module 403, configured to determine a central word of the target text according to the target word.
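Purely as an illustration of how the three modules cooperate, the apparatus can be sketched in Python as a small class that chains three caller-supplied functions. The class name CentralWordApparatus, its field names, and the stand-in functions used in the example are hypothetical; a real system would plug in an actual word segmenter, a mutual-information-based splicer, and a weighting step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CentralWordApparatus:
    obtain: Callable[[str], List[str]]        # role of the obtaining module 401
    splice: Callable[[List[str]], List[str]]  # role of the first splicing module 402
    determine: Callable[[List[str]], str]     # role of the first determining module 403

    def run(self, target_text: str) -> str:
        objects = self.obtain(target_text)   # word segmentation result
        candidates = self.splice(objects)    # objects after splicing
        return self.determine(candidates)    # central word of the target text


# Stand-in modules (invented for illustration only).
apparatus = CentralWordApparatus(
    obtain=lambda text: text.split(),                         # whitespace "segmenter"
    splice=lambda objs: ["".join(objs[:2])] + objs[2:],       # always joins the first two objects
    determine=lambda candidates: max(candidates, key=len),    # longest object wins
)
print(apparatus.run("para cetamol is a drug"))  # -> paracetamol
```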
In one possible implementation, the first splicing module 402 includes:
the first calculation unit is used for calculating the value of mutual information between the first object and the second object when the number of the continuous single characters in the word segmentation result is greater than the preset number, wherein the first object and the second object are two characters in the continuous single characters;
and the first splicing unit is used for splicing the first object and the second object to obtain the target word when the value of the mutual information between the first object and the second object is greater than the first preset threshold value.
In a possible implementation, the first determining module 403 includes:
a second calculating unit, configured to calculate a value of mutual information between the target word and a third object, where the third object is a character or word that is consecutive with the first object in the word segmentation result, or a character or word that is consecutive with the second object in the word segmentation result;
the second splicing unit is used for splicing the target word and the third object to obtain a new target word when the value of mutual information between the target word and the third object is larger than a second preset threshold value;
and the first determining unit is used for determining the central word of the target text according to the new target word.
In a possible implementation, the apparatus 400 further includes:
the second determining module is used for determining the first object as an independent character or word when the value of mutual information between the first object and the second object is smaller than the first preset threshold value;
and the third determining module is used for determining the central word of the target text based on the independent character or word.
In a possible implementation, the apparatus 400 further includes:
the calculation module is used for calculating a value of mutual information between a second object and a fourth object, wherein the fourth object is a character or a word which is continuous with the second object in the word segmentation result;
the second splicing module is used for splicing the second object and the fourth object to obtain a new target word when the value of the mutual information between the second object and the fourth object is greater than the first preset threshold value;
and the fourth determining module is used for determining the central word of the target text according to the new target word.
In a possible implementation, the first determining module 403 includes:
the third calculating unit is used for calculating a weight value corresponding to the target word based on the term frequency-inverse document frequency (TF-IDF) algorithm;
and the second determining unit is used for determining a central word from a plurality of candidate words of the target text according to the weight value corresponding to the target word.
It should be noted that the information interaction and execution processes between the modules and units of the apparatus are based on the same concept as the method embodiment of the present application and bring the same technical effects. For details, refer to the description of the foregoing method embodiment, which is not repeated here.
In addition, an embodiment of the present application further provides a device. Referring to FIG. 5, which shows a hardware structure diagram of a device in an embodiment of the present application, the device 500 may include a processor 501 and a memory 502.
Wherein the memory 502 is used for storing computer programs;
the processor 501 is configured to execute the following steps according to the computer program:
acquiring a word segmentation result corresponding to the target text;
when the value of mutual information between a consecutive first object and second object in the word segmentation result is greater than a first preset threshold value, splicing the first object and the second object to obtain a target word, wherein the first object and the second object are characters and/or words obtained by segmenting the target text;
and determining the central word of the target text according to the target word.
In a possible implementation, the processor 501 is specifically configured to execute the following steps according to the computer program:
when the number of the continuous single characters in the word segmentation result is larger than a preset number, calculating a value of mutual information between the first object and the second object, wherein the first object and the second object are two characters in the continuous single characters;
and when the value of the mutual information between the first object and the second object is larger than the first preset threshold value, splicing the first object and the second object to obtain the target word.
In a possible implementation, the processor 501 is specifically configured to execute the following steps according to the computer program:
calculating a value of mutual information between the target word and a third object, wherein the third object is a character or a word which is continuous with the first object in the word segmentation result, or is a character or a word which is continuous with the second object in the word segmentation result;
when the value of mutual information between the target word and the third object is larger than a second preset threshold value, splicing the target word and the third object to obtain a new target word;
and determining the central word of the target text according to the new target word.
In a possible implementation, the processor 501 is further configured to perform the following steps according to the computer program:
when the value of mutual information between the first object and the second object is smaller than the first preset threshold value, determining the first object as an independent character or word, and determining a central word of the target text based on the independent character or word.
In a possible implementation, the processor 501 is further configured to perform the following steps according to the computer program:
calculating a value of mutual information between a second object and a fourth object, wherein the fourth object is a character or a word which is continuous with the second object in the word segmentation result;
when the value of mutual information between the second object and the fourth object is larger than the first preset threshold value, splicing the second object and the fourth object to obtain a new target word;
and determining the central word of the target text according to the new target word.
In a possible implementation, the processor 501 is specifically configured to execute the following steps according to the computer program:
calculating a weight value corresponding to the target word based on the term frequency-inverse document frequency (TF-IDF) algorithm;
determining a central word from a plurality of candidate words of the target text according to the weight value corresponding to the target word.
An embodiment of the present application further provides a computer-readable storage medium. The methods described in the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media may include computer storage media and communication media, and may include any medium that can transfer a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer.
As an alternative design, a computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also intended to be included within the scope of computer-readable media.
It should be noted that the terms "of", "corresponding to", and "corresponding" may sometimes be used interchangeably in the present application; when the difference between them is not emphasized, their intended meanings are consistent.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described herein as "exemplary" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete fashion.
In the present application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, to describe the technical solutions of the embodiments of the present application clearly, terms such as "first" and "second" are used to distinguish identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that the terms "first", "second", and the like do not denote any quantity, order, or importance.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (10)

1. A method for determining a central word, the method comprising:
acquiring a word segmentation result corresponding to the target text;
when the value of mutual information between a consecutive first object and second object in the word segmentation result is greater than a first preset threshold value, splicing the first object and the second object to obtain a target word, wherein the first object and the second object are characters and/or words obtained by segmenting the target text;
and determining the central word of the target text according to the target word.
2. The method of claim 1, wherein when a value of mutual information between the first object and the second object which are consecutive in the word segmentation result is greater than a first preset threshold, the splicing the first object and the second object to obtain a target word comprises:
when the number of the continuous single characters in the word segmentation result is larger than a preset number, calculating a value of mutual information between the first object and the second object, wherein the first object and the second object are two characters in the continuous single characters;
and when the value of the mutual information between the first object and the second object is larger than the first preset threshold value, splicing the first object and the second object to obtain the target word.
3. The method according to claim 1 or 2, wherein the determining the central word of the target text according to the target word comprises:
calculating a value of mutual information between the target word and a third object, wherein the third object is a character or a word which is continuous with the first object in the word segmentation result, or is a character or a word which is continuous with the second object in the word segmentation result;
when the value of mutual information between the target word and the third object is larger than a second preset threshold value, splicing the target word and the third object to obtain a new target word;
and determining the central word of the target text according to the new target word.
4. The method of claim 2, further comprising:
when the value of mutual information between the first object and the second object is smaller than the first preset threshold value, determining the first object as an independent character or word, and determining a central word of the target text based on the independent character or word.
5. The method of claim 4, further comprising:
calculating a value of mutual information between a second object and a fourth object, wherein the fourth object is a character or a word which is continuous with the second object in the word segmentation result;
when the value of mutual information between the second object and the fourth object is larger than the first preset threshold value, splicing the second object and the fourth object to obtain a new target word;
and determining the central word of the target text according to the new target word.
6. The method of claim 1, wherein the determining a central word of the target text according to the target word comprises:
calculating a weight value corresponding to the target word based on the term frequency-inverse document frequency (TF-IDF) algorithm;
determining a central word from a plurality of candidate words of the target text according to the weight value corresponding to the target word.
7. An apparatus for determining a central word, the apparatus comprising:
the acquisition module is used for acquiring word segmentation results corresponding to the target text;
the first splicing module is used for splicing the first object and the second object to obtain a target word when the value of mutual information between the continuous first object and the second object in the word segmentation result is greater than a first preset threshold value, wherein the first object and the second object are characters and/or words obtained by segmenting the target text;
and the first determining module is used for determining the central word of the target text according to the target word.
8. The apparatus of claim 7, wherein the first splicing module comprises:
a calculating unit, configured to calculate a value of mutual information between the first object and the second object when the number of consecutive single characters in the word segmentation result is greater than a preset number, where the first object and the second object are two characters in the consecutive single characters;
and the splicing unit is used for splicing the first object and the second object to obtain the target word when the value of the mutual information between the first object and the second object is greater than the first preset threshold value.
9. An apparatus, comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to perform the method for determining a central word of any of claims 1-6 according to the computer program.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method for determining a central word of any of claims 1 to 5.
CN201911259955.8A 2019-12-10 2019-12-10 Method, device and equipment for determining central word and storage medium Pending CN111125306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911259955.8A CN111125306A (en) 2019-12-10 2019-12-10 Method, device and equipment for determining central word and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911259955.8A CN111125306A (en) 2019-12-10 2019-12-10 Method, device and equipment for determining central word and storage medium

Publications (1)

Publication Number Publication Date
CN111125306A true CN111125306A (en) 2020-05-08

Family

ID=70498168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911259955.8A Pending CN111125306A (en) 2019-12-10 2019-12-10 Method, device and equipment for determining central word and storage medium

Country Status (1)

Country Link
CN (1) CN111125306A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110258196A1 (en) * 2008-12-30 2011-10-20 Skjalg Lepsoy Method and system of content recommendation
CN108021558A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Keyword recognition method and device, electronic equipment and storage medium
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 A kind of Chinese new word discovery method based on novel degree
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 A kind of method and device of vocabulary building

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
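Xiu Chi et al.: "Ambiguity resolution method for domain-specific word segmentation based on unsupervised learning" (基于无监督学习的专业领域分词歧义消解方法), Journal of Computer Applications (计算机应用), vol. 33, no. 3, 1 March 2013 (2013-03-01), pages 780 - 783 *
Yang Yang; Wei Xiao; Qin Chenglei: "Optimization of Chinese word segmentation results based on Web knowledge" (基于Web知识的中文分词结果优化), no. 12, pages 55 - 58 *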

Similar Documents

Publication Publication Date Title
CN107193792B (en) Method and device for generating article based on artificial intelligence
US11343569B2 (en) System and method for context aware detection of objectionable speech in video
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
US10108602B2 (en) Dynamic portmanteau word semantic identification
CN111126060B (en) Method, device, equipment and storage medium for extracting subject term
US9715497B1 (en) Event detection based on entity analysis
CN107766325B (en) Text splicing method and device
US20230076387A1 (en) Systems and methods for providing a comment-centered news reader
CN111161739A (en) Speech recognition method and related product
US20150339616A1 (en) System for real-time suggestion of a subject matter expert in an authoring environment
CN110377750B (en) Comment generation method, comment generation device, comment generation model training device and storage medium
CN110019948B (en) Method and apparatus for outputting information
WO2019173085A1 (en) Intelligent knowledge-learning and question-answering
US20150339310A1 (en) System for recommending related-content analysis in an authoring environment
JP2023002690A (en) Semantics recognition method, apparatus, electronic device, and storage medium
WO2016191912A1 (en) Comment-centered news reader
US11557381B2 (en) Clinical trial editing using machine learning
WO2020052060A1 (en) Method and apparatus for generating correction statement
JP6885506B2 (en) Response processing program, response processing method, response processing device and response processing system
US20190121833A1 (en) Rendering content items of a social networking system
CN104021202A (en) Device and method for processing entries of knowledge sharing platform
CN111125332B (en) Method, device, equipment and storage medium for calculating TF-IDF value of word
CN111046169B (en) Method, device, equipment and storage medium for extracting subject term
US9946762B2 (en) Building a domain knowledge and term identity using crowd sourcing
CN111125306A (en) Method, device and equipment for determining central word and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination