CN111027316A - Text processing method and device, electronic equipment and computer readable storage medium - Google Patents

Text processing method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN111027316A
CN111027316A
Authority
CN
China
Prior art keywords: target, text, phrase, candidate, phrases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911129869.5A
Other languages
Chinese (zh)
Inventor
王卓然
岳猛
周圣凯
秦海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Yunzhihui Technology Co Ltd
Original Assignee
Dalian Yunzhihui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Yunzhihui Technology Co Ltd filed Critical Dalian Yunzhihui Technology Co Ltd
Priority to CN201911129869.5A priority Critical patent/CN111027316A/en
Publication of CN111027316A publication Critical patent/CN111027316A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of information processing, and discloses a text processing method, a text processing device, electronic equipment and a computer readable storage medium, wherein the text processing method comprises the following steps: acquiring a target text, and identifying a target general phrase in the target text; and converting the target text into a target vector set, wherein the target vector set comprises at least one target vector corresponding to the target general phrase, and the target general phrase is converted as a single whole word. The text processing method provided by the application preserves the context relation among strongly associated words, so that the converted target vector set retains the semantics of the target text and the accuracy of text semantic recognition can be effectively improved.

Description

Text processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a text processing method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
In natural language processing tasks, when a user needs to ask a computer a question, the target text input by the user is represented as data that the computer can understand and process easily. When the user inputs a target text, the computer calculates the semantic similarity between the target text and at least one pre-stored candidate text.
In the prior art, a text is generally segmented directly into words, the words are converted into vectors, and the similarity between the vectors of the target text and the vectors of the candidate text is then calculated. However, this segmentation and conversion loses the context relation between strongly associated words in the text, so the accuracy of text matching is low.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical drawbacks, and to provide the following solutions:
in a first aspect, a text processing method is provided, including:
acquiring a target text, and identifying a target general phrase in the target text;
converting the target text into a target vector set; the target vector set comprises at least one target vector corresponding to the target general phrase; the target general phrase is converted as a single whole word;
acquiring at least one candidate text, and acquiring a candidate vector set of the candidate text aiming at each candidate text;
and acquiring the similarity between the target vector set and the candidate vector set.
In an alternative embodiment of the first aspect, the step of identifying the target common phrase in the target text comprises:
matching at least one pre-stored general phrase with a target text;
and setting a phrase in the target text that is identical to a general phrase as the target general phrase.
In an optional embodiment of the first aspect, before the step of identifying the target general phrase in the target text, the method further comprises:
and obtaining a plurality of corpus texts, and screening at least one general phrase from the plurality of corpus texts.
In an alternative embodiment of the first aspect, the step of filtering at least one common phrase from the plurality of corpus texts comprises:
splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the plurality of corpus texts;
and setting candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold as the general phrases.
In an alternative embodiment of the first aspect, the step of filtering at least one common phrase from the plurality of corpus texts comprises:
splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the plurality of corpus texts;
setting candidate phrases with the total frequency larger than a preset threshold value in a plurality of corpus texts as high-frequency phrases;
counting the frequency of occurrence of high-frequency phrases in the corpus texts of each category respectively, wherein any corpus text belongs to at least one category;
and if the ratio of the frequency appearing in the corpus text of one category to the total frequency is not more than the preset ratio, setting the high-frequency phrase as a general phrase.
In an optional embodiment of the first aspect, the step of converting the target text into a set of target vectors comprises:
performing word segmentation on other parts except the target general phrase in the target text to obtain at least one target word;
and treating the target general phrase as a single word, performing word embedding vectorization on the target general phrase and each target word to obtain a target vector set.
In a second aspect, a text processing apparatus is provided, including:
the recognition module is used for receiving a target text input by a user and recognizing a target general phrase in the target text;
the conversion module is used for converting the target text into a target vector set; the target vector set comprises at least one target vector corresponding to the target general phrase; the target general phrase is converted as a single whole word;
the first acquisition module is used for acquiring at least one candidate text and acquiring a candidate vector set of the candidate text aiming at each candidate text;
and the second acquisition module is used for acquiring the similarity between the target vector set and the candidate vector set.
In an optional embodiment of the second aspect, the recognition module, when recognizing the target common phrase in the target text, is specifically configured to:
matching at least one pre-stored general phrase with a target text;
and setting a phrase in the target text that is identical to a general phrase as the target general phrase.
In an optional embodiment of the second aspect, the text processing apparatus further comprises:
and the screening module is used for obtaining a plurality of corpus texts and screening at least one general phrase from the plurality of corpus texts.
In an optional embodiment of the second aspect, when the filtering module filters at least one common phrase from the plurality of corpus texts, the filtering module is specifically configured to:
splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the plurality of corpus texts;
and setting candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold as the general phrases.
In an optional embodiment of the second aspect, when the filtering module filters at least one common phrase from the plurality of corpus texts, the filtering module is specifically configured to:
splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the plurality of corpus texts;
setting candidate phrases with the total frequency larger than a preset threshold value in a plurality of corpus texts as high-frequency phrases;
counting the frequency of occurrence of high-frequency phrases in the corpus texts of each category respectively, wherein any corpus text belongs to at least one category;
and if the ratio of the frequency appearing in the corpus text of one category to the total frequency is not more than the preset ratio, setting the high-frequency phrase as a general phrase.
In an optional embodiment of the second aspect, when the conversion module converts the target text into the target vector set, the conversion module is specifically configured to:
performing word segmentation on other parts except the target general phrase in the target text to obtain at least one target word;
and treating the target general phrase as a single word, performing word embedding vectorization on the target general phrase and each target word to obtain a target vector set.
In a third aspect, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the text processing method shown in the first aspect of the present application is implemented.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the text processing method shown in the first aspect of the present application.
The beneficial effect that technical scheme that this application provided brought is:
When the semantics of the target text needs to be identified and the target text is converted into vectors, the target general phrase in the target text is identified first and treated as a single whole word while the target text is converted into a target vector set. The context relation between strongly associated words is preserved during the conversion, so the converted target vector set retains the semantics of the target text, and the accuracy of text semantic recognition can be effectively improved.
Furthermore, when the general phrases are obtained, the plurality of corpus texts are first split to obtain a plurality of candidate phrases, candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold are set as general phrases, and the target general phrase in the target text is then identified according to the general phrases, so strongly associated target general phrases in the target text can be effectively identified.
Furthermore, when the general phrases are obtained, the plurality of corpus texts may first be split to obtain a plurality of candidate phrases, candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold are set as high-frequency phrases, and high-frequency phrases that occur with high frequency only in the corpus texts of a certain category are then excluded to obtain the general phrases, so target general phrases that are strongly associated and not limited to a particular category can be effectively identified in the target text.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a text processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of similarity calculation provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of a text processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device for text processing according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
The term "frequency" as used herein is not limited to representing the number and frequency, but may include a series of meanings such as probability, frequency, etc., which represent the number of occurrences in statistics.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The text processing method, the text processing device, the electronic device and the computer-readable storage medium provided by the application aim to solve the technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
A possible implementation manner is provided in the embodiment of the present application, and as shown in fig. 1, a text processing method is provided, which may include the following steps:
step S101, acquiring a target text, and identifying a target general phrase in the target text.
In this step, the user may input the target text directly, or may input speech that the text-processing terminal or server converts into the target text; the manner of obtaining the target text is not limited here.
The target general phrase refers to a phrase formed by at least two strongly associated words in the target text; for example, "how" and "to handle" are commonly used together, so the resulting "how to handle" can serve as a general phrase.
In a specific implementation, a plurality of general phrases may be pre-stored in the text-processing terminal or server and matched against the target text one by one to identify the target general phrase in the target text; one or more target general phrases may be identified.
Step S102, converting the target text into a target vector set; the target vector set comprises at least one target vector corresponding to the target general phrase; the target general phrase is converted as a single whole word.
Specifically, a plurality of words in the target text can be mapped into the high-dimensional vector by adopting a word embedding mode to obtain a corresponding target vector set; the trained neural network may also be adopted to split the target text into a plurality of words, and the words are input into the trained neural network to obtain a target vector set, and the specific manner for converting the target text into the target vector set is not limited herein.
It should be noted that, no matter how the target text is converted into the target vector set, the target general phrase is converted as a single whole word during the conversion.
For example, the target text "the mobile phone is bad, how to handle it" is divided into "mobile phone", "bad" and "how to handle", which are converted into vectors respectively; "how to handle" is a target general phrase and is converted as a single whole word.
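To make this step concrete, the following Python sketch shows one possible way to keep a matched general phrase as a single token before vectorization. The phrase list, the whitespace word segmentation and the example sentence are illustrative assumptions, not part of the disclosed embodiments.

```python
# Illustrative sketch of steps S101-S102: match a pre-stored general phrase in
# the target text and keep it as one whole token during word segmentation.
# The phrase list and whitespace splitting are assumptions for English text.

GENERAL_PHRASES = ["how to handle"]  # hypothetical pre-stored general phrases

def tokenize_with_phrases(text, phrases):
    """Split text into words, keeping a matched general phrase as one token.

    For brevity this sketch handles at most one occurrence of one phrase.
    """
    for phrase in sorted(phrases, key=len, reverse=True):  # longest match first
        if phrase in text:
            before, _, after = text.partition(phrase)
            return before.split() + [phrase] + after.split()
    return text.split()

tokens = tokenize_with_phrases("the mobile phone is bad how to handle it", GENERAL_PHRASES)
print(tokens)  # ['the', 'mobile', 'phone', 'is', 'bad', 'how to handle', 'it']
```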
Step S103, at least one candidate text is obtained, and a candidate vector set of the candidate text is obtained for each candidate text.
In an implementation, at least one candidate text may be pre-stored in the text-matching terminal or server. The candidate general phrases in each candidate text are identified, and each candidate text is then converted into a candidate vector set; during this conversion each candidate general phrase is converted as a single whole word, and each candidate text may contain one or more candidate general phrases.
And step S104, acquiring the similarity between the target vector set and the candidate vector set.
Specifically, the process of calculating the similarity includes performing word alignment on the words or target common phrases in the target text and the words or candidate common phrases in the candidate text to obtain a many-to-many or one-to-many relationship between the words or target common phrases of the target text and the words or candidate common phrases in the candidate text.
Taking fig. 2 as an example, suppose the target text is "the mobile phone is bad, how to handle it" and the candidate text is a similarly worded question about a bad mobile phone. The target text is split into "mobile phone", "bad" and "how to handle", the candidate text is split into its corresponding words and candidate general phrase, and after both are converted into vector sets, word alignment is performed: "mobile phone" is aligned with "mobile phone", "bad" with "bad", and "how to handle" with "how to handle". In the similarity calculation shown in fig. 2, a forward calculation from the candidate vector set to the target vector set yields per-word similarity scores of 1, 1 and 0.9, and averaging them gives a forward matching score of 0.93; a backward calculation from the target vector set to the candidate vector set likewise yields per-word similarity scores of 1, 1 and 0.9, and averaging them gives a backward matching score of 0.93.
In other embodiments, the similarity may be calculated in other manners, such as cosine similarity; the similarity score may also be calculated by assigning a different weighting coefficient to each word. The specific method for calculating the similarity is not limited here.
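A minimal sketch of this bidirectional matching is given below. The toy vectors, the use of cosine similarity and the unweighted averaging of the forward and backward scores are assumptions made for illustration; the embodiment equally allows other similarity measures and per-word weighting.

```python
# Sketch of step S104: bidirectional word-alignment similarity between a target
# vector set and a candidate vector set. Cosine similarity, plain averaging and
# the final averaging of the two directional scores are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def directional_score(src_vectors, dst_vectors):
    """Align each source vector with its best-matching destination vector and
    average the resulting similarity scores."""
    return float(np.mean([max(cosine(s, d) for d in dst_vectors) for s in src_vectors]))

def bidirectional_similarity(target_vectors, candidate_vectors):
    forward = directional_score(candidate_vectors, target_vectors)   # candidate -> target
    backward = directional_score(target_vectors, candidate_vectors)  # target -> candidate
    return (forward + backward) / 2.0

# Toy vectors standing in for "mobile phone", "bad" and "how to handle".
target = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.8])]
candidate = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.9])]
print(bidirectional_similarity(target, candidate))
```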
According to the text processing method provided by this embodiment, when the semantics of the target text needs to be identified and the target text is converted into vectors, the target general phrase in the target text is identified first and treated as a single whole word while the target text is converted into the target vector set. The context relation between strongly associated words is preserved during the conversion, so the converted target vector set retains the semantics of the target text, and the accuracy of text semantic recognition can be effectively improved.
A possible implementation manner is provided in the embodiment of the present application, and the identifying a target general phrase in a target text in step S101 may include:
(1) matching at least one pre-stored general phrase with a target text;
(2) and setting a phrase in the target text that is identical to a general phrase as the target general phrase.
In a specific implementation, at least one general phrase may be pre-stored in the text-processing terminal or server. When a target text input by a user is received, the pre-stored general phrases are matched against the target text one by one, and when the target text contains a phrase identical to a general phrase, that phrase is set as a target general phrase of the target text.
As shown in fig. 3, before the step S101 of recognizing the target common phrase in the target text, a possible implementation manner is provided in the embodiment of the present application, which may further include:
step S100, obtaining a plurality of corpus texts, and screening at least one general phrase from the corpus texts.
Specifically, texts input by a plurality of different sample users can be collected as corpus texts, and the collected corpus texts are then classified into a plurality of different categories, wherein any corpus text belongs to at least one category; a category is associated with a profession or industry, such as corpus texts of the medical category or corpus texts of the education category.
The corpus texts may also be collected from corpus databases of different categories, for example corpus texts of the medical category from a medicine-related database and corpus texts of the education category from an education-related database. The manner of screening general phrases from the corpus texts may vary and is further described with reference to the following embodiments.
The embodiment of the present application provides a possible implementation manner, the step S100 of obtaining at least one common phrase from the plurality of corpus texts by screening may include:
(1) and splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the plurality of corpus texts.
Specifically, the corpus texts may be split in various ways: for example, any two or three adjacent words in a corpus text may be combined to obtain candidate phrases, or substrings of a preset character length may be extracted from the corpus texts as candidate phrases; the specific splitting manner is not limited here.
(2) And setting candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold as the general phrases.
In a specific implementation, the total frequency with which each candidate phrase appears in all the corpus texts is counted; if the total frequency is greater than a preset threshold, the candidate phrase appears frequently, that is, the at least two words it contains are strongly associated, so the candidate phrase can be set as a general phrase.
In this implementation, when the general phrases are obtained, the plurality of corpus texts are first split to obtain a plurality of candidate phrases, candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold are set as general phrases, and the target general phrase in the target text is then identified according to the general phrases, so strongly associated target general phrases in the target text can be effectively identified.
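A minimal sketch of this frequency-based screening follows; the bigram/trigram splitting, the toy corpus and the threshold value are assumptions chosen only for illustration.

```python
# Sketch of step S100 (first variant): split corpus texts into candidate phrases
# of 2-3 adjacent words, count total occurrences across all corpus texts, and
# keep phrases whose total frequency exceeds a preset threshold.
from collections import Counter

def candidate_phrases(text, sizes=(2, 3)):
    words = text.split()
    return [" ".join(words[i:i + n]) for n in sizes for i in range(len(words) - n + 1)]

def screen_general_phrases(corpus_texts, threshold):
    counts = Counter()
    for text in corpus_texts:
        counts.update(candidate_phrases(text))
    return {phrase for phrase, total in counts.items() if total > threshold}

corpus = [
    "the phone is broken how to handle it",
    "how to handle a slow computer",
    "how to handle a lost password",
]
print(screen_general_phrases(corpus, threshold=2))
# phrases occurring more than twice survive: 'how to', 'to handle', 'how to handle'
```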
The embodiment of the present application provides a possible implementation manner, the step S100 of obtaining at least one common phrase from the plurality of corpus texts by screening may include:
(1) and splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the plurality of corpus texts.
Specifically, the corpus texts may be split in various ways: for example, any two or three adjacent words in a corpus text may be combined to obtain candidate phrases, or substrings of a preset character length may be extracted from the corpus texts as candidate phrases; the specific splitting manner is not limited here.
(2) And setting candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold as high-frequency phrases.
In a specific implementation, the total frequency with which each candidate phrase appears in all the corpus texts is counted; if the total frequency is greater than a preset threshold, the candidate phrase appears frequently, that is, the at least two words it contains may have a certain association relation, so the candidate phrase can be set as a high-frequency phrase.
(3) And counting the frequency of the occurrence of the high-frequency phrases in the corpus texts of each category respectively, wherein any corpus text belongs to at least one category.
Specifically, the plurality of corpus texts include corpus texts of different categories, and any one corpus text belongs to at least one category; categories refer to those associated with different professions or industries, such as corpus text in the medical category, corpus text in the educational category, and so forth.
In a specific implementation, the frequency with which each high-frequency phrase appears in the corpus texts of each category is counted, so as to judge whether the high-frequency phrase appears frequently only in a certain category.
(4) And if the ratio of the frequency appearing in the corpus text of one category to the total frequency is not more than the preset ratio, setting the high-frequency phrase as a general phrase.
Specifically, if the ratio of the frequency with which a high-frequency phrase appears in the corpus texts of one category to its total frequency is greater than a preset ratio, the high-frequency phrase is strongly tied to that category and cannot be set as a general phrase. If the ratio of the frequency with which the high-frequency phrase appears in the corpus texts of one category to its total frequency is not greater than the preset ratio, the high-frequency phrase is not concentrated in that category but appears widely in corpus texts of different categories, so it can be set as a general phrase.
In the above embodiment, when the general phrases are obtained, the plurality of corpus texts are first split to obtain a plurality of candidate phrases, candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold are set as high-frequency phrases, and high-frequency phrases that occur with high frequency only in the corpus texts of a certain category are then excluded to obtain the general phrases, so target general phrases that are strongly associated and not limited to a particular category can be effectively identified in the target text.
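One possible realization of this category-based filtering is sketched below; the category labels, example texts and the preset ratio are assumptions for illustration only.

```python
# Sketch of step S100 (second variant): among high-frequency phrases, keep only
# those whose occurrences are not concentrated in a single category of corpus text.
from collections import Counter, defaultdict

def filter_by_category(high_freq_phrases, labeled_corpus, max_ratio):
    """labeled_corpus: (category, text) pairs; a phrase remains general only if
    no single category contributes more than max_ratio of its total frequency."""
    per_category = defaultdict(Counter)
    total = Counter()
    for category, text in labeled_corpus:
        for phrase in high_freq_phrases:
            hits = text.count(phrase)
            per_category[category][phrase] += hits
            total[phrase] += hits
    general = set()
    for phrase in high_freq_phrases:
        if total[phrase] == 0:
            continue
        ratios = [per_category[c][phrase] / total[phrase] for c in per_category]
        if max(ratios) <= max_ratio:
            general.add(phrase)
    return general

corpus = [
    ("medical", "how to handle a fever"),
    ("education", "how to handle exam stress"),
    ("medical", "side effects and how to handle them"),
    ("medical", "side effects of the drug"),
]
print(filter_by_category({"how to handle", "side effects"}, corpus, max_ratio=0.8))
# 'how to handle' is spread across categories and is kept; 'side effects' occurs
# only in the medical category and is excluded.
```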
A possible implementation manner is provided in the embodiment of the present application, and the converting the target text into the target vector set in step S102 may include:
(1) and performing word segmentation on other parts except the target general phrase in the target text to obtain at least one target word.
Specifically, after the target general phrase in the target text is identified, the parts other than the target general phrase are segmented to obtain at least one target word; that is, the target text is divided into at least one target word and at least one target general phrase.
For example, if the target text is "the mobile phone is bad, how to handle it", after the target general phrase "how to handle" is recognized, the remaining part is segmented to obtain the target words "mobile phone" and "bad"; that is, the target text is split into "mobile phone", "bad" and "how to handle".
(2) And treating the target general phrase as a single word, performing word embedding vectorization on the target general phrase and each target word to obtain a target vector set.
In this embodiment, the at least one target word and the at least one target general phrase in the target text are mapped into high-dimensional vectors by word embedding to obtain the corresponding target vector set. In other embodiments, a trained neural network may also be used: the at least one target word and the at least one target general phrase obtained by splitting the target text are input into the trained neural network to obtain the target vector set. The manner of converting the target text into the target vector set is not limited here; however, no matter how the conversion is performed, the order of the vectors in the target vector set corresponds one to one to the order of the target words and target general phrases in the target text, and the target general phrase is converted as a single whole word during the conversion.
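The following sketch illustrates this vectorization step; the hash-seeded toy embedding merely stands in for a trained word-embedding model or neural network, and the vector dimension is an assumption.

```python
# Sketch of the word embedding step: every target word and the target general
# phrase (kept as one whole token) is mapped to a vector, and the vectors keep
# the same order as the tokens. The toy embedding below is illustrative only.
import hashlib
import numpy as np

def toy_embedding(token, dim=8):
    """Deterministic pseudo-random vector per token, standing in for a trained model."""
    seed = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

tokens = ["mobile phone", "bad", "how to handle"]  # general phrase kept whole
target_vector_set = [toy_embedding(t) for t in tokens]
print(len(target_vector_set), target_vector_set[0].shape)  # 3 (8,)
```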
A possible implementation manner is provided in this embodiment of the present application, and the step of obtaining the candidate vector set of the candidate text in step S103 may include:
(1) and querying candidate universal phrases in the candidate texts.
Specifically, the pre-stored preset general phrases are matched against the candidate text to identify the candidate general phrases: when the candidate text contains a phrase identical to a preset general phrase, that phrase is set as a candidate general phrase. A candidate text may contain one or more candidate general phrases, or may contain none.
(2) Converting the candidate text into a candidate vector set; if candidate general phrases are found in the candidate text, the candidate vector set comprises a candidate vector corresponding to each candidate general phrase, and each candidate general phrase is converted as a single whole word; if no candidate general phrase is found in the candidate text, the candidate text is directly segmented to obtain a plurality of candidate words, and word embedding vectorization is performed on the candidate words to obtain the candidate vector set.
In a specific implementation, if a candidate general phrase is found in the candidate text, the parts other than the candidate general phrase are segmented to obtain at least one candidate word; that is, the candidate text is divided into at least one candidate word and at least one candidate general phrase.
It can be understood that in other application scenarios, a plurality of candidate texts may be obtained, the similarity between the target text and each candidate text is calculated, and the candidate text with the highest similarity is selected as the candidate text semantically closest to the target text; the specific application scenario is not limited here.
According to the text processing method, when the semantics of the target text needs to be recognized, the target general phrase in the target text is recognized first and treated as a single whole word while the target text is converted into the target vector set. The context relation among strongly associated words is preserved during the conversion, so the converted target vector set retains the semantics of the target text, and the accuracy of text semantic recognition can be effectively improved.
Furthermore, when the general phrases are obtained, the plurality of corpus texts are first split to obtain a plurality of candidate phrases, candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold are set as general phrases, and the target general phrase in the target text is then identified according to the general phrases, so strongly associated target general phrases in the target text can be effectively identified.
Furthermore, when the general phrases are obtained, the plurality of corpus texts may first be split to obtain a plurality of candidate phrases, candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold are set as high-frequency phrases, and high-frequency phrases that occur with high frequency only in the corpus texts of a certain category are then excluded to obtain the general phrases, so target general phrases that are strongly associated and not limited to a particular category can be effectively identified in the target text.
A possible implementation manner is provided in the embodiment of the present application, and as shown in fig. 4, a text processing apparatus 40 is provided. The text processing apparatus 40 may include: a recognition module 401 and a conversion module 402, wherein,
the recognition module 401 is configured to obtain a target text and recognize a target general phrase in the target text;
a conversion module 402, configured to convert the target text into a target vector set; the target vector set comprises at least one target vector corresponding to the target general phrase; the target general phrase is converted as a single whole word;
a first obtaining module 403, configured to obtain at least one candidate text, and obtain, for each candidate text, a candidate vector set of the candidate text;
a second obtaining module 404, configured to obtain a similarity between the target vector set and the candidate vector set.
When the text processing apparatus performs vector conversion on the target text, the target general phrase in the target text is identified first and treated as a single whole word while the target text is converted into the target vector set. The context relation among strongly associated words is preserved during the conversion, so the converted target vector set retains the semantics of the target text, and the accuracy of text semantic recognition can be effectively improved.
In the embodiment of the present application, a possible implementation manner is provided, and when identifying a target general phrase in a target text, the identifying module 401 is specifically configured to:
matching at least one pre-stored general phrase with a target text;
and setting a phrase in the target text that is identical to a general phrase as the target general phrase.
In the embodiment of the present application, a possible implementation manner is provided, and as shown in fig. 5, the text processing apparatus further includes:
the filtering module 400 is configured to obtain a plurality of corpus texts, and filter at least one general phrase from the corpus texts.
In an embodiment of the present application, a possible implementation manner is provided, and when the filtering module 400 filters at least one common phrase from a plurality of corpus texts, the module is specifically configured to:
splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the plurality of corpus texts;
and setting candidate phrases whose total frequency of occurrence in the corpus texts is greater than a preset threshold as the general phrases.
In an embodiment of the present application, a possible implementation manner is provided, and when the filtering module 400 filters at least one common phrase from a plurality of corpus texts, the module is specifically configured to:
splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the plurality of corpus texts;
setting candidate phrases with the total frequency larger than a preset threshold value in a plurality of corpus texts as high-frequency phrases;
counting the frequency of occurrence of high-frequency phrases in the corpus texts of each category respectively, wherein any corpus text belongs to at least one category;
and if the ratio of the frequency appearing in the corpus text of one category to the total frequency is not more than the preset ratio, setting the high-frequency phrase as a general phrase.
A possible implementation manner is provided in the embodiment of the present application, and when the conversion module 402 converts the target text into the target vector set, the conversion module is specifically configured to:
performing word segmentation on other parts except the target general phrase in the target text to obtain at least one target word;
and treating the target general phrase as a single word, performing word embedding vectorization on the target general phrase and each target word to obtain a target vector set.
The text processing apparatus according to the embodiments of the present disclosure may execute the text processing method provided by the embodiments of the present disclosure, and the implementation principle is similar. The actions performed by each module of the text processing apparatus correspond to the steps of the text processing method according to the embodiments of the present disclosure; for a detailed functional description of each module of the text processing apparatus, reference may be made to the description of the corresponding text processing method shown above, and details are not repeated here.
Based on the same principle as the method shown in the embodiments of the present disclosure, the embodiments of the present disclosure also provide an electronic device, which may include but is not limited to: a processor and a memory, the memory being used for storing computer operating instructions and the processor being used for executing the text processing method shown in the above embodiments by calling the computer operating instructions. Compared with the prior art, the text processing method keeps the context relation between strongly associated words during the conversion, so the converted target vector set retains the semantics of the target text, and the accuracy of text semantic recognition can be effectively improved.
In an alternative embodiment, an electronic device is provided, as shown in fig. 6, the electronic device 4000 shown in fig. 6 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application specific integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (extended industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), a magnetic disk storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments. Compared with the prior art, the text processing method can keep the context relation among the strongly-associated words, so that the converted target vector set can keep the semantics of the target text, and the accuracy of text semantic recognition is effectively improved.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a module does not in some cases constitute a limitation on the module itself, for example, a recognition module may also be described as a "module that recognizes a target common phrase".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.

Claims (10)

1. A method of text processing, comprising:
acquiring a target text, and identifying a target general phrase in the target text;
converting the target text into a target vector set; wherein the target vector set comprises at least one target vector corresponding to the target general phrase; the target general phrase is converted as a single whole word;
acquiring at least one candidate text, and acquiring a candidate vector set of the candidate text aiming at each candidate text;
and acquiring the similarity between the target vector set and the candidate vector set.
2. The text processing method of claim 1, wherein the step of identifying the target universal phrase in the target text comprises:
matching at least one pre-stored common phrase with the target text;
and setting the phrase which is the same as the general phrase in the target text as the target general phrase.
3. The text processing method of claim 2, wherein the step of identifying the target universal phrase in the target text is preceded by the step of:
and obtaining a plurality of language material texts, and screening at least one general phrase from the plurality of language material texts.
4. The method of claim 3, wherein said step of filtering at least one of said common phrases from a plurality of corpus texts comprises:
splitting the plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total number of times each candidate phrase appears in the corpus texts;
and setting candidate phrases whose total number of occurrences in the corpus texts is greater than a preset threshold as the general phrases.
5. The method according to claim 3, wherein said step of filtering at least one of said common phrases from a plurality of corpus texts comprises:
splitting a plurality of corpus texts to obtain a plurality of candidate phrases, and counting the total frequency of the candidate phrases appearing in the corpus texts;
setting candidate phrases with the total frequency larger than a preset threshold value in the plurality of corpus texts as high-frequency phrases;
counting the frequency of the high-frequency phrases appearing in the corpus texts of each category respectively, wherein any corpus text belongs to at least one category;
and if the ratio of the frequency appearing in the corpus text of one category to the total frequency is not more than a preset ratio, setting the high-frequency phrase as the general phrase.
6. The method of claim 1, wherein the step of converting the target text into a set of target vectors comprises:
performing word segmentation on the parts of the target text other than the target general phrase to obtain at least one target word;
and treating the target general phrase as a single word, performing word embedding vectorization on the target general phrase and each target word to obtain the target vector set.
7. A text processing apparatus, comprising:
the identification module is used for acquiring a target text and identifying a target general phrase in the target text;
the conversion module is used for converting the target text into a target vector set; wherein the target vector set comprises at least one target vector corresponding to the target general phrase; the target general phrase is converted as a single whole word;
the first acquisition module is used for acquiring at least one candidate text and acquiring a candidate vector set of the candidate text aiming at each candidate text;
and the second acquisition module is used for acquiring the similarity between the target vector set and the candidate vector set.
8. The text processing apparatus of claim 7, wherein the recognition module, when recognizing the target universal phrase in the target text, is specifically configured to:
matching at least one pre-stored general phrase with a target text;
and setting a phrase in the target text that is identical to a general phrase as the target general phrase.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the text processing method of any of claims 1-6 when executing the program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, implements the text processing method of any one of claims 1 to 6.
CN201911129869.5A 2019-11-18 2019-11-18 Text processing method and device, electronic equipment and computer readable storage medium Pending CN111027316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911129869.5A CN111027316A (en) 2019-11-18 2019-11-18 Text processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911129869.5A CN111027316A (en) 2019-11-18 2019-11-18 Text processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111027316A true CN111027316A (en) 2020-04-17

Family

ID=70200440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911129869.5A Pending CN111027316A (en) 2019-11-18 2019-11-18 Text processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111027316A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649256A (en) * 2016-09-21 2017-05-10 清华大学 Method for extracting high-quality phrases in electronic medical records
CN107577663A (en) * 2017-08-24 2018-01-12 北京奇艺世纪科技有限公司 A kind of key-phrase extraction method and apparatus
CN108304377A (en) * 2017-12-28 2018-07-20 东软集团股份有限公司 A kind of extracting method and relevant apparatus of long-tail word
CN109165291A (en) * 2018-06-29 2019-01-08 厦门快商通信息技术有限公司 A kind of text matching technique and electronic equipment
CN110008309A (en) * 2019-03-21 2019-07-12 腾讯科技(深圳)有限公司 A kind of short phrase picking method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708863A (en) * 2020-06-02 2020-09-25 上海硬通网络科技有限公司 Method and device for text matching based on doc2vec and electronic equipment
CN111708863B (en) * 2020-06-02 2024-03-15 上海硬通网络科技有限公司 Text matching method and device based on doc2vec and electronic equipment
CN111767714A (en) * 2020-06-28 2020-10-13 平安科技(深圳)有限公司 Text smoothness determination method, device, equipment and medium
CN111767714B (en) * 2020-06-28 2022-02-11 平安科技(深圳)有限公司 Text smoothness determination method, device, equipment and medium
CN113836937A (en) * 2021-09-23 2021-12-24 平安普惠企业管理有限公司 Text processing method, device, equipment and storage medium based on comparison model
CN113836937B (en) * 2021-09-23 2023-11-10 上海瑞释信息科技有限公司 Text processing method, device, equipment and storage medium based on comparison model

Similar Documents

Publication Publication Date Title
CN106997342B (en) Intention identification method and device based on multi-round interaction
CN112164391A (en) Statement processing method and device, electronic equipment and storage medium
CN112507704B (en) Multi-intention recognition method, device, equipment and storage medium
CN111027316A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN111814770A (en) Content keyword extraction method of news video, terminal device and medium
CN113033438B (en) Data feature learning method for modal imperfect alignment
CN110795541A (en) Text query method and device, electronic equipment and computer readable storage medium
CN110968686A (en) Intention recognition method, device, equipment and computer readable medium
CN113934848B (en) Data classification method and device and electronic equipment
CN114218945A (en) Entity identification method, device, server and storage medium
CN116882372A (en) Text generation method, device, electronic equipment and storage medium
CN117011581A (en) Image recognition method, medium, device and computing equipment
CN110489740B (en) Semantic analysis method and related product
CN112908339B (en) Conference link positioning method and device, positioning equipment and readable storage medium
CN113177479B (en) Image classification method, device, electronic equipment and storage medium
CN113434630B (en) Customer service evaluation method, customer service evaluation device, terminal equipment and medium
CN112541357B (en) Entity identification method and device and intelligent equipment
CN115630643A (en) Language model training method and device, electronic equipment and storage medium
CN111767710B (en) Indonesia emotion classification method, device, equipment and medium
CN110162614B (en) Question information extraction method and device, electronic equipment and storage medium
CN114117062A (en) Text vector representation method and device and electronic equipment
CN110175241B (en) Question and answer library construction method and device, electronic equipment and computer readable medium
CN113836297A (en) Training method and device for text emotion analysis model
CN109033070B (en) Data processing method, server and computer readable medium
CN111460214A (en) Classification model training method, audio classification method, device, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination