CN113177410B - Text word segmentation method and device, storage medium and electronic equipment - Google Patents

Text word segmentation method and device, storage medium and electronic equipment

Info

Publication number
CN113177410B
Authority
CN
China
Prior art keywords
phrase
speech
confirmed
chain
phrases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110494883.6A
Other languages
Chinese (zh)
Other versions
CN113177410A (en
Inventor
徐欢春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Multipoint Shenzhen Digital Technology Co ltd
Original Assignee
Multipoint Shenzhen Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Multipoint Shenzhen Digital Technology Co ltd filed Critical Multipoint Shenzhen Digital Technology Co ltd
Priority to CN202110494883.6A priority Critical patent/CN113177410B/en
Publication of CN113177410A publication Critical patent/CN113177410A/en
Application granted granted Critical
Publication of CN113177410B publication Critical patent/CN113177410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text word segmentation method, a device, a storage medium and electronic equipment, wherein firstly, an input text is segmented according to an identification symbol in the input text to obtain a pre-word set, wherein the pre-word set comprises at least one group of phrases to be confirmed, and the phrases to be confirmed are part of the input text; and screening out the target phrase from the phrase to be confirmed according to the standard part-of-speech chain set and the part-of-speech chain of the phrase to be confirmed, wherein the standard part-of-speech chain set comprises at least one standard part-of-speech chain, the part-of-speech chain of the target phrase is the standard part-of-speech chain, and the target phrase is a continuous part of the phrase to be confirmed. Based on the original word segmentation basis, the complete phrase is accurately extracted through reprocessing on a standard part-of-speech chain, and text phrase information is more accurately extracted. And taking the target phrase as a final word segmentation result to obtain the meaning originally intended to be expressed in the text, thereby guaranteeing the usability and accuracy of the word segmentation result for the user.

Description

Text word segmentation method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text word segmentation method, a device, a storage medium, and an electronic apparatus.
Background
With the development of the Internet, more and more people have started to use it, and the number of Internet users is increasing rapidly. Users are accustomed to communicating and sharing through text on the Internet, and because the number of Internet users is huge, the amount of text on the Internet is also huge. How to extract useful information from this huge amount of text has become a problem to be solved.
Disclosure of Invention
The present application aims to provide a text word segmentation method, a text word segmentation device, a storage medium and an electronic device, so as to at least partially improve the above problems.
In order to achieve the above purpose, the technical solution adopted in the embodiment of the present application is as follows:
In a first aspect, an embodiment of the present application provides a text word segmentation method, including:
dividing an input text according to an identification symbol in the input text to obtain a pre-word set, wherein the pre-word set comprises at least one group of phrases to be confirmed, and the phrases to be confirmed are part of the input text;
and screening target phrases from the phrases to be confirmed according to a standard part-of-speech chain set and part-of-speech chains of the phrases to be confirmed, wherein the standard part-of-speech chain set comprises at least one standard part-of-speech chain, the part-of-speech chains of the target phrases are the standard part-of-speech chains, and the target phrases are continuous parts of the phrases to be confirmed.
In a second aspect, an embodiment of the present application provides a text word segmentation apparatus, including:
the first word segmentation unit is used for segmenting the input text according to the identification symbol in the input text to obtain a pre-word set, wherein the pre-word set comprises at least one group of phrases to be confirmed, and the phrases to be confirmed are part of the input text;
and the second word segmentation unit is used for screening out target phrases from the phrases to be confirmed according to a standard part-of-speech chain set and part-of-speech chains of the phrases to be confirmed, wherein the standard part-of-speech chain set comprises at least one standard part-of-speech chain, the part-of-speech chains of the target phrases are the standard part-of-speech chains, and the target phrases are continuous parts of the phrases to be confirmed.
In a third aspect, embodiments of the present application provide a storage medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory for storing one or more programs; the above-described method is implemented when the one or more programs are executed by the processor.
Compared with the prior art, the text word segmentation method, the device, the storage medium and the electronic equipment provided by the embodiment of the application firstly segment the input text according to the identification symbol in the input text to obtain a pre-word set, wherein the pre-word set comprises at least one group of phrases to be confirmed, and the phrases to be confirmed are part of the input text; and screening out the target phrase from the phrase to be confirmed according to the standard part-of-speech chain set and the part-of-speech chain of the phrase to be confirmed, wherein the standard part-of-speech chain set comprises at least one standard part-of-speech chain, the part-of-speech chain of the target phrase is the standard part-of-speech chain, and the target phrase is a continuous part of the phrase to be confirmed. Based on the original word segmentation basis, the complete phrase is accurately extracted through reprocessing on a standard part-of-speech chain, and text phrase information is more accurately extracted. And taking the target phrase as a final word segmentation result to obtain the meaning originally intended to be expressed in the text, thereby guaranteeing the usability and accuracy of the word segmentation result for the user.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting in scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a text word segmentation method provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of sub-steps of S30 provided in an embodiment of the present application;
Fig. 4 is a schematic diagram of further sub-steps of S30 provided in an embodiment of the present application;
Fig. 5 is a schematic flow chart of a text word segmentation method according to an embodiment of the present application;
Fig. 6 is a schematic diagram of the effect of a text word segmentation method provided in an embodiment of the present application;
Fig. 7 is a schematic diagram of the units of a text word segmentation device according to an embodiment of the present application.
In the figure: 10-a processor; 11-memory; 12-bus; 13-a communication interface; 201-a first word segmentation unit; 202-a second word segmentation unit.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the description of the present application, it should be noted that, the terms "upper," "lower," "inner," "outer," and the like indicate an orientation or a positional relationship based on the orientation or the positional relationship shown in the drawings, or an orientation or a positional relationship conventionally put in use of the product of the application, merely for convenience of description and simplification of the description, and do not indicate or imply that the apparatus or element to be referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present application.
In the description of the present application, it should also be noted that, unless explicitly specified and limited otherwise, the terms "disposed," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art in a specific context.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
With the development of artificial intelligence technology, Natural Language Processing (NLP) has also made significant progress, and the performance and accuracy of NLP are gradually reaching a standard suitable for commercial use. The word segmentation toolkit used in the embodiment of the present application is based on HanLP, but the method is applicable to all word segmentation algorithms. HanLP (Han Language Processing) is an NLP toolkit composed of a series of models and algorithms, with the goal of popularizing the application of natural language processing in production environments.
The embodiment of the present application is based on a specific requirement in a production environment. For example, a business party distributes a large number of questionnaires to customers, and the questionnaires include open questions. After the information fed back by users is received, the customers' answers to the open questions need to be counted quantitatively as data. Basic word segmentation algorithms already exist, such as HMM, perceptron, and CRF, all of which are used to solve basic word segmentation. Taking a piece of customer feedback information as an example, the following result is obtained after word segmentation and part-of-speech tagging based on a perceptron model: [ capacity/n, small/a, per u, good/a,/w, population/n, small/a,/w, eating up/v, buying again/v,/w, total/d, eating up/v, new/a, per u,/w, not/d, like/v, large/a, capacity/n, per u,/w, vial/n, per u, good/a ]. If word frequencies are counted directly on this original word segmentation and provided to the business party, a large amount of distorted information results. For example, "like" is an independent segment, but the information originally submitted by the customer shows that the intended meaning is "dislike". The original word segmentation therefore needs to be reprocessed so that more complete phrases are extracted accurately. The text word segmentation method provided in the embodiment of the present application extracts text phrase information more accurately on the basis of the original basic word segmentation. The following is the result output by the text word segmentation method provided in the embodiment of the present application: small, good, capacity, small population, complete eating, buying again, general eating new, dislike big, good vials.
The embodiment of the application provides electronic equipment which can be server equipment or other intelligent terminal equipment. Referring to fig. 1, a schematic structure of an electronic device is shown. The electronic device comprises a processor 10, a memory 11, a bus 12. The processor 10 and the memory 11 are connected by a bus 12, the processor 10 being adapted to execute executable modules, such as computer programs, stored in the memory 11.
The processor 10 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the text segmentation method may be accomplished by integrated logic circuitry of hardware in the processor 10 or instructions in the form of software. The processor 10 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; but also digital signal processors (Digital Signal Processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
The memory 11 may comprise a high-speed random access memory (Random Access Memory, RAM) and may also comprise a non-volatile memory, such as at least one disk memory.
Bus 12 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Only one double-headed arrow is shown in Fig. 1, but this does not mean that there is only one bus 12 or only one type of bus 12.
The memory 11 is used for storing programs, such as programs corresponding to the text segmentation apparatus. The text segmentation means comprises at least one software function module which may be stored in the memory 11 in the form of software or firmware (firmware) or cured in an Operating System (OS) of the electronic device. The processor 10, upon receiving the execution instruction, executes the program to implement the text word segmentation method.
Optionally, the electronic device provided in the embodiment of the present application further includes a communication interface 13. The communication interface 13 is connected to the processor 10 via a bus. The electronic device may obtain text information transmitted by other terminals through the communication interface 13.
It should be understood that the structure shown in fig. 1 is a schematic structural diagram of only a portion of an electronic device, which may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
The text word segmentation method provided in the embodiment of the present application may be applied to, but not limited to, the electronic device shown in fig. 1, and refer to fig. 2 for a specific flow:
S10, dividing the input text according to the identification symbol in the input text to obtain a pre-word set.
Wherein the pre-word set includes at least one group of phrases to be confirmed, the phrases to be confirmed being part of the input text.
For example, the customer feedback information described above is "small-volume good, small-population, after-eating, and after-eating, new, dislike large-volume good vials". In this example, the identification symbol in the input text is a comma; for other input texts, the identification symbol may also be another symbol that marks a sentence break, such as a period, a colon, a semicolon, an exclamation mark, or a question mark. The phrases to be confirmed corresponding to the customer feedback information are "good with small capacity", "small population", "buying after eating, totally eating new", "dislike large capacity", and "good with small bottle".
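As an illustration of S10, the following is a minimal Python sketch of splitting an input text at identification symbols. The function name `pre_segment`, the constant `IDENTIFIERS`, and the exact set of symbols are assumptions made for this example and are not taken from the application; the English input string is likewise only a stand-in for real customer feedback.

```python
import re
from typing import List

# Assumed set of sentence-breaking identification symbols
# (ASCII and full-width forms); adjust for the actual input text.
IDENTIFIERS = ",.;:!?，。；：！？"


def pre_segment(text: str) -> List[str]:
    """Split the text at identification symbols and drop empty pieces."""
    pattern = "[" + re.escape(IDENTIFIERS) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


# Each element of the result is one phrase to be confirmed.
phrases = pre_segment("good with small capacity, small population; dislike large capacity!")
```

Each returned string is a part of the input text, which corresponds to the requirement that the phrases to be confirmed are parts of the input text.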
S30, screening target phrases from the phrases to be confirmed according to the standard part-of-speech chain set and the part-of-speech chains of the phrases to be confirmed.
The standard part-of-speech chain set comprises at least one standard part-of-speech chain, the part-of-speech chain of the target phrase is the standard part-of-speech chain, and the target phrase is a continuous part of the phrases to be confirmed.
Optionally, the standard part-of-speech chain is a part-of-speech chain set by staff; for example, a standard part-of-speech chain may be (n, v) or (d, v). The part-of-speech chains of the phrases to be confirmed are obtained as follows: the part-of-speech chain of "good with small capacity" is (n, a, u, a), the part-of-speech chain of "dislike large capacity" is (d, v, a, n, u), and so on for the others.
When the part-of-speech chain of a continuous part of a phrase to be confirmed is identical to a standard part-of-speech chain, that part can be used as a target phrase. For example, the part-of-speech chain of "dislike" in "dislike large capacity" is (d, v), which is identical to a standard part-of-speech chain, so "dislike" can be used as a target phrase.
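The matching criterion of S30 can be sketched in Python as follows. This is a brute-force illustration of the criterion only (a continuous part whose part-of-speech chain equals a standard chain), not the step-by-step procedure of the embodiment. The names `chain_of` and `contiguous_matches` are hypothetical, and the English words "not", "like" and "of" are assumed stand-ins for the original tokens.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def chain_of(tokens: List[Token]) -> Tuple[str, ...]:
    """Return the part-of-speech chain of a token sequence."""
    return tuple(tag for _, tag in tokens)


def contiguous_matches(tokens: List[Token],
                       standard_chains: Set[Tuple[str, ...]]) -> List[List[Token]]:
    """Collect every continuous part whose chain is a standard chain."""
    found = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            if chain_of(tokens[start:end]) in standard_chains:
                found.append(tokens[start:end])
    return found


STANDARD = {("n", "v"), ("d", "v")}
# "dislike large capacity" with the chain (d, v, a, n, u)
phrase = [("not", "d"), ("like", "v"), ("large", "a"), ("capacity", "n"), ("of", "u")]
hits = contiguous_matches(phrase, STANDARD)
```

With these inputs, the only continuous part that matches is the (d, v) part corresponding to "dislike".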
If word frequencies are counted directly on the original word segmentation and provided to a business party, a large amount of distorted information results. For example, "like" is an independent segment, but the information originally submitted by the customer shows that the intended meaning is "dislike". According to the text word segmentation method provided by the embodiment of the application, more complete phrases are accurately extracted by reprocessing the original word segmentation against the standard part-of-speech chains, so that text phrase information is extracted more accurately. The target phrase is taken as the final word segmentation result, which recovers the meaning originally intended in the text and thereby ensures the usability and accuracy of the word segmentation result for the user.
In summary, the embodiment of the application provides a text word segmentation method, firstly, dividing an input text according to an identifier in the input text to obtain a pre-word set, wherein the pre-word set comprises at least one group of phrases to be confirmed, and the phrases to be confirmed are part of the input text; and screening out the target phrase from the phrase to be confirmed according to the standard part-of-speech chain set and the part-of-speech chain of the phrase to be confirmed, wherein the standard part-of-speech chain set comprises at least one standard part-of-speech chain, the part-of-speech chain of the target phrase is the standard part-of-speech chain, and the target phrase is a continuous part of the phrase to be confirmed. Based on the original word segmentation basis, the complete phrase is accurately extracted through reprocessing on a standard part-of-speech chain, and text phrase information is more accurately extracted. And taking the target phrase as a final word segmentation result to obtain the meaning originally intended to be expressed in the text, thereby guaranteeing the usability and accuracy of the word segmentation result for the user.
On the basis of fig. 2, for the content in S30, a possible implementation manner is further provided in the embodiment of the present application, please refer to fig. 3, S30 includes:
S301, judging whether the part-of-speech chain length of the phrase to be confirmed is greater than the first length. If not, S302 is executed; if yes, S306 is executed.
The first length is the length of the longest standard part-of-speech chain in the standard part-of-speech chain set. For example, the length of the standard part-of-speech chain (n, v) is 2 and the length of the standard part-of-speech chain (n, a, u, a) is 4. When the standard part-of-speech chain set only includes these two standard part-of-speech chains, the first length is 4.
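The first length and the S301 check can be written down in a few lines of Python. This is a small sketch; the names `STANDARD_CHAINS`, `first_length` and `exceeds_first_length` are assumptions introduced for the example.

```python
# The two standard part-of-speech chains from the example above.
STANDARD_CHAINS = {("n", "v"), ("n", "a", "u", "a")}

# First length: length of the longest standard part-of-speech chain.
first_length = max(len(chain) for chain in STANDARD_CHAINS)


def exceeds_first_length(chain) -> bool:
    """S301: True routes to S306, False routes to S302."""
    return len(chain) > first_length
```

For these two chains `first_length` is 4, so a chain such as (d, v, a, n, u) is routed to S306 and a chain such as (n, a) is routed to S302.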
S302, judging whether the part-of-speech chain of the phrase to be confirmed is identical to any one group of standard part-of-speech chains. If yes, then execute S303; if not, S304 is performed.
Optionally, when the part-of-speech chain of the phrase to be confirmed is the same as any one set of standard part-of-speech chains, then the phrase to be confirmed is indicated to be the target phrase, and S303 is executed; otherwise, it is necessary to continue to determine whether a certain part of the phrase to be confirmed is the target phrase, and S304 is executed.
S303, determining the phrase to be confirmed as a target phrase.
S304, removing words corresponding to the first part of speech or the last part of speech in the part-of-speech chain of the phrase to be confirmed to obtain a new phrase to be confirmed.
Taking the phrase to be confirmed "dislike large capacity" as an example, its part-of-speech chain is (d, v, a, n, u), which differs from every standard part-of-speech chain. In this case, the vocabulary corresponding to the u at the end of the part-of-speech chain (d, v, a, n, u) can be removed, and the new phrase to be confirmed "dislike large capacity" has the part-of-speech chain (d, v, a, n).
S305, judging whether part-of-speech chains of the phrases to be confirmed are empty. If not, executing S302; if yes, ending.
For example, the part-of-speech chain of the new phrase to be confirmed "dislike large capacity" is (d, v, a, n), which is not empty, so it is necessary to further determine whether the new phrase to be confirmed is a target phrase, and S302 is executed again. If, after several repetitions, no target phrase has been obtained and the part-of-speech chain of the phrase to be confirmed is empty, the process ends.
In one possible implementation, S304 performs two actions separately to arrive at a different new phrase to be validated.
The first action, removing words corresponding to the last part of speech in the part-of-speech chain of the phrase to be confirmed to obtain a new phrase to be confirmed;
and a second action, removing the vocabulary corresponding to the first part of speech in the part-of-speech chain of the phrase to be confirmed, so as to obtain a new phrase to be confirmed.
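The loop formed by S302 to S305, including the two removal actions, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `trim_until_match` and the `from_end` flag are hypothetical, the S301 length check is deliberately omitted, and the English words are stand-ins for the original tokens.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def trim_until_match(tokens: List[Token],
                     standard_chains: Set[Tuple[str, ...]],
                     from_end: bool = True) -> Optional[List[Token]]:
    """Repeat S302..S305 until a standard chain matches or nothing is left."""
    current = list(tokens)
    while current:  # S305: stop once the part-of-speech chain is empty
        chain = tuple(tag for _, tag in current)
        if chain in standard_chains:  # S302
            return current            # S303: this is a target phrase
        # S304: first action removes the last word, second action the first word
        current = current[:-1] if from_end else current[1:]
    return None


phrase = [("not", "d"), ("like", "v"), ("large", "a"), ("capacity", "n"), ("of", "u")]
by_end = trim_until_match(phrase, {("d", "v")}, from_end=True)
by_start = trim_until_match(phrase, {("n", "u")}, from_end=False)
```

Running both actions on the same phrase can yield different new phrases to be confirmed, which is why the embodiment describes them as separate actions.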
With continued reference to fig. 3, in the embodiment of the present application, if the part-of-speech chain length of the phrase to be confirmed is greater than the first length, S306 is performed.
S306, extracting a first sub-phrase from the phrase to be confirmed.
The part-of-speech chain length of the first sub-phrase is a first length, and the first vocabulary in the first sub-phrase is the first vocabulary in the phrase to be confirmed.
Continuing with the example of the phrase to be validated as "dislike large volume," the part-of-speech chain of "dislike large volume" is (d, v, a, n, u), assuming the first length is 2, then the first sub-phrase is "dislike".
S307, judging whether the part-of-speech chain of the first sub-phrase is identical to any one group of standard part-of-speech chains. If yes, then execution S308; if not, S312 is performed.
When the part-of-speech chain of the first sub-phrase is the same as any one group of standard part-of-speech chains, the first sub-phrase is the target phrase, and S308 is executed at this time; otherwise, whether the combination of the part of the first sub-phrase and the rest of the phrases to be confirmed is the target phrase needs to be continuously judged, and S312 is executed.
S308, taking the first sub-phrase as a group of target phrases.
S309, judging whether the length between the part of speech corresponding to the first divided vocabulary and the tail end of the part of speech chain of the phrase to be confirmed is larger than the first length. If yes, executing S310; if not, S311 is performed.
The first divided words are words adjacent to the tail end of the first sub-phrase in the phrase to be confirmed.
Optionally, when the first sub-phrase is used as a set of target phrases, it needs to be determined whether the remaining part of the phrases to be confirmed contains other target phrases, and further, it needs to be determined whether the part-of-speech chain length corresponding to the remaining part of the phrases to be confirmed is greater than the first length. If not, the rest of the phrases to be confirmed can be directly used as new phrases to be confirmed, and then whether the new phrases to be confirmed are the same as any group of standard part-of-speech chains is judged, namely S311 is executed. Otherwise, a new first sub-phrase needs to be obtained from the rest of the phrases to be validated, i.e., S310 is performed.
S310, using the first divided vocabulary as a first vocabulary of a new first sub phrase.
Continuing with the phrase to be confirmed "dislike large capacity" as an example, its part-of-speech chain is (d, v, a, n, u). Assuming that the first length is 2, when the first sub-phrase is "dislike" and "dislike" is a target phrase, the remaining part "large capacity" has the part-of-speech chain (a, n, u), whose length is still greater than 2. In this case, the first divided vocabulary is "large", and the new first sub-phrase is "large capacity".
And repeatedly judging whether the part-of-speech chain of the first sub-phrase is the same as any one set of standard part-of-speech chains or not until the length between the part of speech corresponding to the new first segmentation vocabulary and the tail end of the part-of-speech chain of the phrase to be confirmed is smaller than or equal to the first length, or the part-of-speech chain of the first sub-phrase is different from any one set of standard part-of-speech chains, namely repeatedly executing S307.
S311, taking the first divided words to the tail end of the phrase to be confirmed as a new phrase to be confirmed.
And repeatedly judging whether the part-of-speech chain of the phrase to be confirmed is the same as any group of standard part-of-speech chains or not until the part-of-speech chain of the phrase to be confirmed is empty, namely repeatedly executing S302.
With continued reference to fig. 3, in the embodiment of the present application, if the part-of-speech chain of the first sub-phrase is different from any one of the standard part-of-speech chains, S312 is performed.
S312, judging whether the length between the part of speech corresponding to the second divided vocabulary and the tail end of the part of speech chain of the phrase to be confirmed is larger than the first length. If yes, then execute S314; if not, S313 is performed.
Wherein the second divided word is a second digit word in the first sub-phrase.
Continuing with the phrase to be confirmed "dislike large capacity" as an example, its part-of-speech chain is (d, v, a, n, u). Assuming that the first length is 2, the first sub-phrase is "dislike". If "dislike" is not a target phrase, the second divided vocabulary is "like", and the part-of-speech chain length from "like" to the end of the phrase to be confirmed is 4, which is still greater than the first length. It is therefore necessary to determine whether other target phrases exist between the second divided vocabulary and the end of the phrase to be confirmed, and S314 is executed to construct a new first sub-phrase. Otherwise, when the part-of-speech chain length from the second divided vocabulary to the end of the phrase to be confirmed is less than or equal to the first length, it can be determined directly whether the part from the second divided vocabulary to the end of the phrase to be confirmed is a target phrase, that is, S313 is executed.
S313, taking the span from the second divided word to the end of the phrase to be confirmed as a new phrase to be confirmed.
The judgment of whether the part-of-speech chain of the phrase to be confirmed is the same as any standard part-of-speech chain is then repeated, that is, S302 is executed again, until the part-of-speech chain of the phrase to be confirmed is empty.
S314, taking the second divided word as the first word of a new first sub-phrase.
The new first sub-phrase is constructed in the same manner as in S310 described above.
The judgment of whether the part-of-speech chain of the first sub-phrase is the same as any standard part-of-speech chain is then repeated, that is, S307 is executed again, until the length from the part of speech corresponding to the new second divided word to the end of the part-of-speech chain of the phrase to be confirmed is less than or equal to the first length, or until the part-of-speech chain of the first sub-phrase is the same as any standard part-of-speech chain.
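The branch taken in S312 depends only on how many parts of speech remain from the second divided word to the end of the chain. The following is a minimal sketch of that decision, assuming 0-based indexing; the function name `next_step_after_mismatch` and its parameters are hypothetical and do not appear in the original.

```python
def next_step_after_mismatch(chain_length: int, window_start: int, first_length: int) -> str:
    """Choose between S314 and S313 after the first sub-phrase failed to match.

    window_start is the 0-based index of the first word of the current
    first sub-phrase (an indexing convention assumed for this sketch).
    """
    second_word_index = window_start + 1           # position of the second divided word
    remaining = chain_length - second_word_index   # parts of speech from that word to the end
    return "S314" if remaining > first_length else "S313"


# Chain (d, v, a, n, u), first length 2, first sub-phrase starting at index 0:
# 4 parts of speech remain, which is greater than 2.
print(next_step_after_mismatch(5, 0, 2))  # prints S314
```

When the remaining length is exactly equal to the first length, the sketch returns S313, matching the "if not" branch of S312.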
S315, judging whether the value of the first count is equal to the value of the first length. If yes, S316 is performed; if not, this step is skipped.
The first count represents the number of consecutive times the judgment result has been that the part-of-speech chain of the first sub-phrase differs from every standard part-of-speech chain. The last word of the second sub-phrase is the first word of the current first sub-phrase, and the length of the second sub-phrase is the first length.
Take the phrase to be confirmed "dislike large capacity XYZM" as an example, with part-of-speech chain (d, v, a, n, u, x, y, z, m), and assume that the first length is 4. When the part-of-speech chain of the first sub-phrase has differed from every standard part-of-speech chain 4 consecutive times, the first sub-phrases from the first to the fourth time are "dislike large capacity", "like large capacity", "large capacity X" and "capacity XY", respectively; the current first sub-phrase is "capacity XY", and the second sub-phrase is "dislike large capacity".
When the value of the first count is equal to the value of the first length, it is necessary to determine whether the second sub-phrase contains other target phrases, such as "dislike"; S316 is then performed.
S316, obtaining the second sub-phrase and clearing the first count.
S317, taking the second sub-phrase as a new phrase to be confirmed.
The judgment of whether the part-of-speech chain of the phrase to be confirmed is the same as any standard part-of-speech chain is then repeated, that is, S302 is executed again, until the part-of-speech chain of the phrase to be confirmed is empty.
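The relationship between the failed first sub-phrases and the second sub-phrase in the example above can be reproduced with plain list slicing. This is only an illustrative sketch: the function name `failed_windows` is hypothetical, and the English words "not", "like", "large", "capacity" and "of" are assumed stand-ins for the five words tagged (d, v, a, n, u).

```python
from typing import List, Tuple


def failed_windows(words: List[str], start: int, first_length: int) -> Tuple[List[List[str]], List[str]]:
    """Return the first_length consecutive first sub-phrases beginning at `start`,
    together with the second sub-phrase used for backward compensation."""
    windows = [words[i:i + first_length] for i in range(start, start + first_length)]
    second_sub_phrase = words[start:start + first_length]
    return windows, second_sub_phrase


words = ["not", "like", "large", "capacity", "of", "X", "Y", "Z", "M"]
windows, second = failed_windows(words, start=0, first_length=4)
current_first = windows[-1]   # the fourth window, corresponding to "capacity XY"

print(current_first)  # ['capacity', 'of', 'X', 'Y']
print(second)         # ['not', 'like', 'large', 'capacity']
# The last word of the second sub-phrase is the first word of the current first sub-phrase.
print(second[-1] == current_first[0])  # True
```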
If the part-of-speech chain length of the phrase to be confirmed is greater than the first length, the embodiments of the present application further provide a possible implementation of how to obtain the target phrases in the phrase to be confirmed, as shown in fig. 4.
First step: initialize i = 0 and s = 0.
Second step: judge whether i + MAX is greater than or equal to L. If not, execute the third step; if yes, execute the tenth step.
Here MAX represents the first length, and L represents the total length of the part-of-speech chain of the phrase to be confirmed.
Third step: take the i-th through (i+MAX-1)-th words of the phrase to be confirmed as the first sub-phrase.
The first sub-phrase in the third step has the same meaning as the first sub-phrase described above.
Fourth step: judge whether the first sub-phrase is a target phrase. If yes, execute the fifth step; otherwise, execute the seventh step.
Fifth step: determine the first sub-phrase to be a target phrase.
Sixth step: i = i + MAX, s = 0.
The second step is then executed again.
Seventh step: i++, s++.
Eighth step: judge whether s = MAX holds. If yes, execute the ninth step; if not, execute the second step again.
Ninth step: s = 0; obtain the second sub-phrase and take it as a new phrase to be confirmed.
The second sub-phrase in the ninth step has the same meaning as the second sub-phrase described above.
Tenth step: take the span from the first divided word to the end of the phrase to be confirmed as a new phrase to be confirmed.
The first divided word in the tenth step has the same meaning as the first divided word described above.
Eleventh step: judge whether the part-of-speech chain of the phrase to be confirmed is the same as any standard part-of-speech chain.
The eleventh step is equivalent to S302 in fig. 3, after which S303 to S305 may be performed.
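The eleven steps above can be read as a sliding-window loop. The following Python sketch is one possible reading, not the patented implementation. It rests on three assumptions that the text does not state: a phrase is a list of (word, part-of-speech) pairs, the short-phrase routine always removes the last word (the text allows the first or the last), and scanning resumes from position i after the backward compensation in the ninth step. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]    # (word, part-of-speech tag)
Chain = Tuple[str, ...]    # a part-of-speech chain


def chain_of(tokens: List[Token]) -> Chain:
    return tuple(pos for _, pos in tokens)


def text_of(tokens: List[Token]) -> str:
    return " ".join(word for word, _ in tokens)


def match_short(tokens: List[Token], standard: Set[Chain]) -> List[str]:
    """Eleventh step (S302 onward): trim the phrase until its chain matches or it is empty.

    Assumption: the word is always removed from the end of the phrase.
    """
    tokens = list(tokens)
    while tokens:
        if chain_of(tokens) in standard:
            return [text_of(tokens)]
        tokens.pop()
    return []


def match_long(tokens: List[Token], standard: Set[Chain]) -> List[str]:
    max_len = max(len(c) for c in standard)    # MAX, the first length
    total = len(tokens)                         # L
    targets: List[str] = []
    i = s = 0                                   # first step
    while i + max_len < total:                  # second step
        window = tokens[i:i + max_len]          # third step
        if chain_of(window) in standard:        # fourth step
            targets.append(text_of(window))     # fifth step
            i, s = i + max_len, 0               # sixth step
        else:
            i, s = i + 1, s + 1                 # seventh step
            if s == max_len:                    # eighth step
                s = 0                           # ninth step: backward compensation
                targets.extend(match_short(tokens[i - max_len:i], standard))
    # Tenth and eleventh steps: the remaining tail becomes the new phrase to be confirmed.
    targets.extend(match_short(tokens[i:], standard))
    return targets


phrase = [("not", "d"), ("like", "v"), ("large", "a"), ("capacity", "n"), ("of", "u")]
print(match_long(phrase, {("d", "v"), ("a", "n")}))  # ['not like', 'large capacity']
```

The standard chains in the usage line are made up for illustration. Each window test is a set lookup on a tuple of at most MAX tags, so the scan visits each position a bounded number of times.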
On the basis of fig. 2, regarding how to output the word segmentation result, the embodiments of the present application further provide a possible implementation. Referring to fig. 5, the text word segmentation method further includes:
S20, sequentially extracting the phrases to be confirmed from the pre-word set.
S40, adding the target phrase to the word segmentation result set.
S50, outputting the word segmentation result set when the pre-word set is empty.
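Taken together, S20, S40 and S50 form a simple drain loop over the pre-word set. The sketch below shows that loop under the assumption that the screening step (S30) is supplied as a callable; the names `segment` and `screen`, and the stand-in screening rule in the usage example, are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List

Phrase = List[str]   # a phrase to be confirmed, as a list of words


def segment(pre_word_set: Iterable[Phrase], screen: Callable[[Phrase], List[str]]) -> List[str]:
    """Drain the pre-word set in order and collect the target phrases."""
    pending: Deque[Phrase] = deque(pre_word_set)
    results: List[str] = []
    while pending:                        # S50: finish once the pre-word set is empty
        phrase = pending.popleft()        # S20: extract the next phrase to be confirmed
        results.extend(screen(phrase))    # S30 screens the phrase, S40 stores the targets
    return results


# Stand-in screening rule for demonstration only: accept two-word phrases.
def demo_screen(phrase: Phrase) -> List[str]:
    return [" ".join(phrase)] if len(phrase) == 2 else []


print(segment([["small", "capacity"], ["good"], ["always", "new"]], demo_screen))
# ['small capacity', 'always new']
```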
In one possible implementation, after S40, the text word segmentation method further includes: matching the target phrases in the word segmentation result set against the stop words in a stop-word lexicon, and deleting a target phrase from the word segmentation result set if the match succeeds.
Although the embodiments of the present application provide the most likely part-of-speech chains of combined words, in practice the same part-of-speech chain may be filled by different words, some of which yield meaningless results. Such results need to be compensated for and eliminated through a stop-word mechanism, and different fields can maintain the same stop-word lexicon and stop-word rules. Optionally, a simple stop-word rule is defined first: a word segmentation result consisting of a single word is treated directly as a stop word and is not output. Optionally, a stop-word marking function is provided in the system: the user can mark unreasonable word segmentation results on the system page, and they are added to the stop-word lexicon after passing review.
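The stop-word compensation described above amounts to a filter over the word segmentation result set. A minimal sketch follows, assuming each target phrase is represented as a tuple of words so that the single-word rule can be checked by length; the function name `filter_stop_words` and the `drop_single_word` flag are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Phrase = Tuple[str, ...]   # a target phrase as a tuple of words


def filter_stop_words(
    results: Iterable[Phrase],
    stop_words: Set[Phrase],
    drop_single_word: bool = True,
) -> List[Phrase]:
    """Remove target phrases that are stop words or that trip the simple single-word rule."""
    kept: List[Phrase] = []
    for phrase in results:
        if phrase in stop_words:
            continue                      # matched an entry in the stop-word lexicon
        if drop_single_word and len(phrase) == 1:
            continue                      # simple rule: single-word results are not output
        kept.append(phrase)
    return kept


results = [("small", "capacity"), ("of",), ("always", "new")]
print(filter_stop_words(results, stop_words={("always", "new")}))
# [('small', 'capacity')]
```

Entries marked by users and approved in review would simply be added to the `stop_words` set before the filter runs.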
On the basis of fig. 2, for the content in S10, a possible implementation manner is further provided in the embodiments of the present application, please refer to the following.
Take customer feedback as the input text, for example: "good with small capacity, small population, buying after eating, always eating new, dislike good with large capacity and small bottle". Part-of-speech tagging is performed on the input text to obtain a tagged result. The input text is then divided at the parts of speech corresponding to the identification symbols to obtain the pre-word set.
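Dividing the tagged text at identification symbols can be sketched as a single pass over (word, part-of-speech) pairs. The sketch assumes that the tagger marks punctuation with a dedicated tag; the tag name "w", the function name `pre_segment`, and the sample tokens are hypothetical and are not taken from the original.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]   # (word, part-of-speech tag)


def pre_segment(tagged: Iterable[Token], separator_tags: Set[str]) -> List[List[Token]]:
    """Split a tagged token stream into phrases to be confirmed at separator tags."""
    phrases: List[List[Token]] = []
    current: List[Token] = []
    for word, pos in tagged:
        if pos in separator_tags:
            if current:                   # close the phrase collected so far
                phrases.append(current)
            current = []
        else:
            current.append((word, pos))
    if current:                           # keep a trailing phrase with no closing symbol
        phrases.append(current)
    return phrases


tagged = [("small", "a"), ("capacity", "n"), (",", "w"), ("good", "a"), (".", "w")]
print(pre_segment(tagged, separator_tags={"w"}))
# [[('small', 'a'), ('capacity', 'n')], [('good', 'a')]]
```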
A word segmentation effect schematic diagram of the text word segmentation method provided by the embodiment of the application is shown in fig. 6.
The text word segmentation method provided by the embodiments of the present application segments text by means of predefined standard part-of-speech chains; at present, no approach in the industry performs word segmentation by means of predefined part-of-speech chains. The predefined part-of-speech chains are themselves an important asset: without good predefined part-of-speech chains, the word segmentation effect is greatly reduced, and more complex compensation mechanisms may be required. The mechanism of predefined standard part-of-speech chains and the empirically summarized set of predefined standard part-of-speech chains should be protected. The matching algorithm works on a principle similar to a sliding block: the sliding block takes successive pieces out of the part-of-speech chain of the segmented input text and matches them continuously. The text word segmentation method has two advantages: first, it innovatively adopts different word segmentation algorithms for long and short texts; second, it performs backward compensation during the matching process. Unlike the currently popular statistics-based word segmentation algorithms, the method does not require training on a large amount of data and is effective for both long and short texts. The algorithm complexity is O(N) and few resources are consumed, so the method is particularly suitable for analyzing short user feedback. Fine-tuned algorithms that use the same part-of-speech matching principle are also within the scope of protection.
Referring to fig. 7, fig. 7 is a schematic diagram illustrating a text word segmentation apparatus according to an embodiment of the present application, and optionally, the text word segmentation apparatus is applied to the electronic device described above.
The text word segmentation apparatus includes a first word segmentation unit 201 and a second word segmentation unit 202.
The first word segmentation unit 201 is configured to segment an input text according to an identifier in the input text, so as to obtain a pre-word set, where the pre-word set includes at least one group of phrases to be confirmed, and the phrases to be confirmed are part of the input text. Optionally, the first word segmentation unit 201 may perform S10 described above.
The second word segmentation unit 202 is configured to screen a target phrase from the phrases to be confirmed according to a standard part-of-speech chain set and part-of-speech chains of the phrases to be confirmed, where the standard part-of-speech chain set includes at least one standard part-of-speech chain, the part-of-speech chain of the target phrase is a standard part-of-speech chain, and the target phrase is a continuous part of the phrases to be confirmed. Optionally, the second word segmentation unit 202 may perform S30 described above.
The second word segmentation unit is further used for judging whether the length of the part-of-speech chain of the phrase to be confirmed is greater than the first length, wherein the first length is the length of the longest standard part-of-speech chain; if the length of the part-of-speech chain of the phrase to be confirmed is less than or equal to the first length, judging whether the part-of-speech chain of the phrase to be confirmed is the same as any standard part-of-speech chain; if they are the same, determining the phrase to be confirmed to be a target phrase; if they are different, removing the word corresponding to the first part of speech or the last part of speech in the part-of-speech chain of the phrase to be confirmed so as to obtain a new phrase to be confirmed; and repeatedly judging whether the part-of-speech chain of the phrase to be confirmed is the same as any standard part-of-speech chain until the part-of-speech chain of the phrase to be confirmed is empty. Optionally, the second word segmentation unit 202 may perform S301 to S317 described above.
It should be noted that the text word segmentation device provided in this embodiment can execute the method flows shown in the method embodiments above to achieve the corresponding technical effects. For brevity, for matters not mentioned in this embodiment, refer to the corresponding content of the above embodiments.
The present application also provides a storage medium storing computer instructions or a program that, when read and executed, performs the text word segmentation method of the above embodiments. The storage medium may include memory, flash memory, registers, combinations thereof, or the like.
The following provides an electronic device, which may be a server, as shown in fig. 1, and may implement the text word segmentation method described above; specifically, the electronic device includes: a processor 10, a memory 11, a bus 12. The processor 10 may be a CPU. The memory 11 is used to store one or more programs that, when executed by the processor 10, perform the text segmentation method of the above-described embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners as well. The apparatus embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (6)

1. A text word segmentation method, the method comprising:
dividing an input text according to an identification symbol in the input text to obtain a pre-word set, wherein the pre-word set comprises at least one group of phrases to be confirmed, and the phrases to be confirmed are part of the input text;
sequentially extracting phrases to be confirmed in the pre-word set;
screening target phrases from the phrases to be confirmed according to a standard part-of-speech chain set and part-of-speech chains of the phrases to be confirmed, wherein the standard part-of-speech chain set comprises at least one standard part-of-speech chain, the part-of-speech chains of the target phrases are the standard part-of-speech chains, and the target phrases are continuous parts of the phrases to be confirmed;
Adding the target phrase into a word segmentation result set;
outputting the word segmentation result set under the condition that the pre-word set is empty;
the step of screening the target phrase from the phrases to be confirmed according to the standard part-of-speech chain set and the part-of-speech chain of the phrases to be confirmed comprises the following steps:
judging whether the length of the part-of-speech chain of the phrase to be confirmed is larger than a first length, wherein the first length is the length of the longest standard part-of-speech chain;
if the length of the part-of-speech chain of the phrase to be confirmed is smaller than or equal to the first length, judging whether the part-of-speech chain of the phrase to be confirmed is identical to any group of standard part-of-speech chains;
if the phrases to be confirmed are the same, determining that the phrases to be confirmed are the target phrases;
if the words are different, removing words corresponding to the first part of speech or the last part of speech in the part-of-speech chain of the phrase to be confirmed so as to obtain a new phrase to be confirmed;
repeatedly judging whether the part-of-speech chain of the phrase to be confirmed is the same as any group of standard part-of-speech chains or not until the part-of-speech chain of the phrase to be confirmed is empty;
the step of screening the target phrase from the phrases to be confirmed according to the standard part-of-speech chain set and the part-of-speech chain of the phrases to be confirmed, further comprises:
if the part-of-speech chain length of the phrase to be confirmed is greater than the first length, extracting a first sub-phrase from the phrase to be confirmed, wherein the part-of-speech chain length of the first sub-phrase is the first length, and the first word of the first sub-phrase is the first word of the phrase to be confirmed;
Judging whether the part-of-speech chain of the first sub-phrase is the same as any group of standard part-of-speech chains;
if the first sub-phrases are the same, the first sub-phrases are used as a group of target phrases;
judging whether the length between the part of speech corresponding to a first divided word and the tail end of the part of speech chain of the phrase to be confirmed is larger than the first length, wherein the first divided word is a word adjacent to the tail end of the first sub-phrase in the phrase to be confirmed;
if greater, taking the first divided word as the first word of a new first sub-phrase;
repeatedly judging whether the part-of-speech chain of the first sub-phrase is the same as any group of standard part-of-speech chains or not until the length between the part of speech corresponding to the new first divided word and the tail end of the part-of-speech chain of the phrase to be confirmed is smaller than or equal to the first length;
if smaller than or equal, taking the span from the first divided word to the tail end of the phrase to be confirmed as a new phrase to be confirmed;
and repeatedly judging whether the part-of-speech chain of the phrase to be confirmed is the same as any group of standard part-of-speech chains or not until the part-of-speech chain of the phrase to be confirmed is empty.
2. The text word segmentation method according to claim 1, wherein the step of screening a target phrase from the phrases to be confirmed according to a standard part-of-speech chain set and the part-of-speech chain of the phrases to be confirmed further comprises:
if the part-of-speech chain of the first sub-phrase is different from every group of standard part-of-speech chains, judging whether the length between the part of speech corresponding to a second divided word and the tail end of the part-of-speech chain of the phrase to be confirmed is larger than the first length, wherein the second divided word is the second word in the first sub-phrase;
if greater, taking the second divided word as the first word of a new first sub-phrase;
repeatedly judging whether the part-of-speech chain of the first sub-phrase is the same as any group of standard part-of-speech chains or not until the length between the part of speech corresponding to the new second divided word and the tail end of the part-of-speech chain of the phrase to be confirmed is smaller than or equal to the first length;
if the value of the first count is equal to the value of the first length, obtaining a second sub-phrase and clearing the first count, wherein the first count represents the number of consecutive times that the part-of-speech chain of the first sub-phrase is different from every group of standard part-of-speech chains, the last word of the second sub-phrase is the first word of the current first sub-phrase, and the length of the second sub-phrase is the first length;
taking the second sub-phrase as a new phrase to be confirmed;
And repeatedly judging whether the part-of-speech chain of the phrase to be confirmed is the same as any group of standard part-of-speech chains or not until the part-of-speech chain of the phrase to be confirmed is empty.
3. The text word segmentation method according to claim 2, wherein the step of screening a target phrase from the phrases to be confirmed according to a standard part-of-speech chain set and the part-of-speech chain of the phrases to be confirmed further comprises:
if the length between the part of speech corresponding to the second divided word and the tail end of the part of speech chain of the phrase to be confirmed is smaller than or equal to the first length, taking the second divided word to the tail end of the phrase to be confirmed as a new phrase to be confirmed;
and repeatedly judging whether the part-of-speech chain of the phrase to be confirmed is the same as any group of standard part-of-speech chains or not until the part-of-speech chain of the phrase to be confirmed is empty.
4. A text word segmentation apparatus, the apparatus comprising:
the first word segmentation unit is used for segmenting the input text according to the identification symbol in the input text to obtain a pre-word set, wherein the pre-word set comprises at least one group of phrases to be confirmed, and the phrases to be confirmed are part of the input text;
The text word segmentation device is also used for sequentially extracting phrases to be confirmed in the pre-word set;
the second word segmentation unit is used for screening target phrases from the phrases to be confirmed according to a standard part-of-speech chain set and part-of-speech chains of the phrases to be confirmed, wherein the standard part-of-speech chain set comprises at least one standard part-of-speech chain, the part-of-speech chains of the target phrases are the standard part-of-speech chains, and the target phrases are continuous parts of the phrases to be confirmed;
the text word segmentation device is also used for adding the target phrase into a word segmentation result set; outputting the word segmentation result set under the condition that the pre-word set is empty;
the second word segmentation unit is further used for judging whether the length of the part-of-speech chain of the phrase to be confirmed is larger than a first length, wherein the first length is the length of the longest standard part-of-speech chain; if the length of the part-of-speech chain of the phrase to be confirmed is smaller than or equal to the first length, judging whether the part-of-speech chain of the phrase to be confirmed is identical to any group of standard part-of-speech chains; if the phrases to be confirmed are the same, determining that the phrases to be confirmed are the target phrases; if the words are different, removing words corresponding to the first part of speech or the last part of speech in the part-of-speech chain of the phrase to be confirmed so as to obtain a new phrase to be confirmed; repeatedly judging whether the part-of-speech chain of the phrase to be confirmed is the same as any group of standard part-of-speech chains or not until the part-of-speech chain of the phrase to be confirmed is empty;
the second word segmentation unit is further configured to extract a first sub-phrase from the phrase to be confirmed if the part-of-speech chain length of the phrase to be confirmed is greater than the first length, where the part-of-speech chain length of the first sub-phrase is the first length, and the first word of the first sub-phrase is the first word of the phrase to be confirmed; judging whether the part-of-speech chain of the first sub-phrase is the same as any group of standard part-of-speech chains; if the first sub-phrases are the same, the first sub-phrases are used as a group of target phrases; judging whether the length between the part of speech corresponding to a first divided word and the tail end of the part-of-speech chain of the phrase to be confirmed is larger than the first length, wherein the first divided word is a word adjacent to the tail end of the first sub-phrase in the phrase to be confirmed; if greater, taking the first divided word as the first word of a new first sub-phrase; repeatedly judging whether the part-of-speech chain of the first sub-phrase is the same as any group of standard part-of-speech chains or not until the length between the part of speech corresponding to the new first divided word and the tail end of the part-of-speech chain of the phrase to be confirmed is smaller than or equal to the first length; if smaller than or equal, taking the span from the first divided word to the tail end of the phrase to be confirmed as a new phrase to be confirmed; and repeatedly judging whether the part-of-speech chain of the phrase to be confirmed is the same as any group of standard part-of-speech chains or not until the part-of-speech chain of the phrase to be confirmed is empty.
5. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method according to any of claims 1-3.
6. An electronic device, comprising: a processor and a memory for storing one or more programs; the method of any of claims 1-3 being implemented when the one or more programs are executed by the processor.
CN202110494883.6A 2021-05-07 2021-05-07 Text word segmentation method and device, storage medium and electronic equipment Active CN113177410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110494883.6A CN113177410B (en) 2021-05-07 2021-05-07 Text word segmentation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110494883.6A CN113177410B (en) 2021-05-07 2021-05-07 Text word segmentation method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113177410A CN113177410A (en) 2021-07-27
CN113177410B true CN113177410B (en) 2023-04-25

Family

ID=76928246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110494883.6A Active CN113177410B (en) 2021-05-07 2021-05-07 Text word segmentation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113177410B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274361A (en) * 2020-01-21 2020-06-12 北京明略软件系统有限公司 Industry new word discovery method and device, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975475A (en) * 2016-03-31 2016-09-28 华南理工大学 Chinese phrase string-based fine-grained thematic information extraction method
US10417268B2 (en) * 2017-09-22 2019-09-17 Druva Technologies Pte. Ltd. Keyphrase extraction system and method
CN110895655A (en) * 2018-09-11 2020-03-20 北京京东尚科信息技术有限公司 Method and device for extracting text core phrase
CN110532567A (en) * 2019-09-04 2019-12-03 北京百度网讯科技有限公司 Extracting method, device, electronic equipment and the storage medium of phrase

Also Published As

Publication number Publication date
CN113177410A (en) 2021-07-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant