CN113486662A - Text processing method, system and medium

Info

Publication number: CN113486662A
Application number: CN202110811108.9A
Authority: CN
Inventors: 王伟; 梁玮; 兰斌旋; 彭婧; 龙鲜菊
Original assignee: SAIC GM Wuling Automobile Co Ltd
Current assignee: SAIC GM Wuling Automobile Co Ltd
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2021-10-08

Abstract

The invention discloses a text processing method, system and medium. The method comprises the following steps: obtaining automobile user comment text; performing word segmentation and stop-word removal on the automobile user comment text; extracting text keywords from the text after word segmentation and stop-word removal to obtain a keyword extraction result; constructing a corresponding similarity vector space, and vectorizing the automobile user comment text to obtain an ultra-high-dimensional vector; based on the high-dimensional vector, performing unbalanced cosine similarity analysis on the automobile user comment text to obtain a segment similarity; if the segment similarity is greater than a preset threshold, taking the automobile user comment text as text to be deleted; otherwise, retaining the automobile user comment text. The method improves the robustness of text de-duplication, avoids the defect that content is easily removed by mistake in short and ultra-short texts such as automobile comments, and solves the problem that the traditional cosine vector method cannot recognize long and short sentences that repeat the same meaning.

Description

Text processing method, system and medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a text processing method, system, and medium.

Background

At present, keyword extraction from automobile after-sales problem reports and forum user comments mostly relies on either the MD5 method, which maps content absolutely, or locality-sensitive hashing followed by cosine similarity analysis of high-dimensional space vectors. The traditional MD5 method detects identical text content efficiently, but a slight change of characters causes repeated keywords to be recognized as new content. The locality-sensitive hashing method ignores the word order of different sentence patterns, searches for locally similar words and determines weights by Hamming distance; it is effective for duplicate checking of long text passages, but in short and ultra-short texts such as automobile comments it very easily removes content by mistake, so its de-duplication robustness is low.

In addition, the traditional cosine similarity method is difficult to apply to short-text analysis. In a sparse high-dimensional vector space such as that of automobile comments, sentences of very different lengths that describe the same attribute of the same object produce different cosine vectors, so repeated content is missed.

Disclosure of Invention

The main purpose of the invention is to provide a text processing method, system and medium that improve the robustness of text de-duplication, avoid the defect that content is easily removed by mistake in short and ultra-short texts such as automobile comments, and solve the problem that the traditional cosine vector method cannot recognize long and short sentences that repeat the same meaning.

In order to achieve the above object, the present invention provides a text processing method, including the following steps:

obtaining a comment text of an automobile user;

performing word segmentation and stop-word removal on the automobile user comment text;

extracting text keywords from the text after word segmentation and stop-word removal to obtain a keyword extraction result;

constructing a corresponding similarity vector space, and vectorizing the automobile user comment text based on the keyword extraction result to obtain an ultra-high-dimensional vector;

based on the high-dimensional vector, performing unbalanced cosine similarity analysis on the automobile user comment text to obtain a segment similarity;

if the segment similarity is greater than a preset threshold, taking the automobile user comment text as text to be deleted;

and if the segment similarity is less than or equal to the preset threshold, retaining the automobile user comment text.

The step of extracting text keywords from the text after word segmentation and stop-word removal comprises:

replacing the segmented words by table lookup against the synonyms/near-synonyms in the automobile comment word library, and weighting term frequency by the inverse document frequency index using the TF-IDF method to obtain the keyword extraction result.

The step of constructing a corresponding similarity vector space and vectorizing the automobile user comment text based on the keyword extraction result to obtain an ultra-high-dimensional vector comprises:

constructing a corresponding similarity vector space and converting the extracted keywords into numbers, so that each piece of user data containing several words becomes a multi-dimensional array, giving an ultra-high-dimensional vector, and the full set of user comment texts forms a co-occurrence matrix of high-dimensional vectors.

The step of performing unbalanced cosine similarity analysis on the automobile user comment text based on the high-dimensional vector to obtain the segment similarity comprises:

introducing a weighting matrix formed by combining a forward user-feedback input matrix with reverse engineer input, weighting the sensitive words that engineers care about, and calculating the unbalanced cosine similarity between comment texts to obtain the segment similarity.

Performing the unbalanced cosine similarity analysis on the automobile user comment text comprises:

based on the cosine similarity mathematical model, subtracting from the value of each dimension of a cosine vector the average value of all vectors of the data set in that dimension, and replacing each dimension value of the original model vector with the result, thereby constructing new vectors whose dimensions shift by different amounts, breaking the balance of the original vectors, for use in similarity comparison.
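A minimal Python sketch of this mean-centering step is given here for illustration only. It assumes each comment is already a numeric vector of equal length; the function name `centered_cosine` and the sample vectors are hypothetical, and returning 0.0 when a centered vector has zero length is an assumption that the source does not state.

```python
import math

def centered_cosine(vectors, i, j):
    """Cosine similarity of vectors i and j after subtracting, in every
    dimension, the mean of all vectors of the data set in that dimension."""
    n_docs = len(vectors)
    n_dims = len(vectors[0])
    # Per-dimension mean over the whole data set.
    means = [sum(v[k] for v in vectors) / n_docs for k in range(n_dims)]
    a = [vectors[i][k] - means[k] for k in range(n_dims)]
    b = [vectors[j][k] - means[k] for k in range(n_dims)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a vector equal to the mean matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical count vectors: the first two point in the same direction.
data = [[1, 1, 0], [2, 2, 0], [0, 0, 1]]
score = centered_cosine(data, 0, 1)
```

In this hypothetical data the first two vectors are parallel, so plain cosine similarity rates them as identical; after centering they no longer coincide, because each dimension is shifted by a different amount.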

In addition, the present invention further provides a text processing system comprising a memory, a processor, and a computer program stored on the memory, wherein the computer program, when executed by the processor, implements the steps of the text processing method as described above.

Furthermore, the present invention also proposes a computer storage medium having a computer program stored thereon, which, when being executed by a processor, implements the steps of the text processing method as described above.

The technical effects of the scheme of the invention are as follows:

In the invention, text is converted into data through Chinese word segmentation and stop-word removal (stop words being high-frequency words that do not affect semantics), and a corresponding similarity vector space is constructed; this combined data cleaning improves the robustness of the vector comparison algorithm for ultra-short sentences.

The method converts the cleaned ultra-short text data into numbers: the text is segmented into words, the words are converted into numbers, and each user comment containing several words becomes a multi-dimensional array, that is, a multi-dimensional vector. Different text keyword extraction algorithms can be used to match different scenarios of automobile-specific problems, which gives better anti-interference capability.

The method vectorizes a huge volume of automobile user review texts at once to obtain ultra-high-dimensional vectors, and uses a dimensionality reduction method to lower the risk of dimension explosion in the high-dimensional sparse matrix. It applies and scales well to the high-dimensional sparse matrices generated by massive review content, and computational efficiency on large data volumes is improved by 31%.

To address the huge vocabulary of the high-dimensional vectors, which easily causes storage overflow, actual projects use a scenario library matched to the specific automobile project: the text keyword of a near-synonym replaces the keyword number of the corresponding automobile comment, the vectors corresponding to comment sentences are recombined, and the distinctness of the sentence vectors is greatly improved.

The traditional cosine similarity method struggles when sentences of very different lengths describe the same attribute of the same thing: they produce different cosine vectors, and repeated content is missed. To solve this, the unbalanced cosine similarity method compares vectors with different weights, which eliminates the interference from long and short sentences that repeat the same meaning but differ greatly in length (the Hamming distance between keywords), and merging identical items reduces the fault keywords of the project software by about 27%.

Drawings

FIG. 1 is a schematic flow chart of a text processing method according to the present invention;

FIG. 2 is a schematic flowchart of an example text processing method according to the present invention.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

At present, vehicle manufacturers mostly de-duplicate automobile user comments either by absolute content mapping or by converting text to a vector space model (VSM) and then analyzing the similarity of high-dimensional space vectors, so de-duplication robustness is low. Automobile user comments include both long and short comments with complex semantic structure, which makes de-duplication results unstable. The locality-sensitive hashing method ignores the word order of different sentence patterns, searches for locally similar words and determines weights by Hamming distance; it is effective for duplicate checking of long text passages, but it is error-prone in short and ultra-short texts such as automobile comments.

The invention converts the cleaned ultra-short text data into numbers: the text is segmented into words, the words are converted into numbers, and each user comment containing several words becomes a multi-dimensional array, that is, a multi-dimensional vector.

The method vectorizes a huge volume of automobile user review texts at once to obtain ultra-high-dimensional vectors, and uses a dimensionality reduction method to lower the risk of dimension explosion in the high-dimensional sparse matrix. It applies well to the high-dimensional sparse matrices generated by massive review content, and computational efficiency on large data volumes is improved by 31%.

Specifically, as shown in FIG. 1, the present invention provides a text processing method comprising the following steps:

S1, obtaining automobile user comment text;

S2, performing word segmentation and stop-word removal on the automobile user comment text;

S3, extracting text keywords from the text after word segmentation and stop-word removal to obtain a keyword extraction result;

wherein the segmented words are replaced by table lookup against the synonyms/near-synonyms in the automobile comment word library, and term frequency is weighted by the inverse document frequency index using the TF-IDF method to obtain the keyword extraction result;

S4, constructing a corresponding similarity vector space, and vectorizing the automobile user comment text based on the keyword extraction result to obtain an ultra-high-dimensional vector;

S5, based on the high-dimensional vector, performing unbalanced cosine similarity analysis on the automobile user comment text to obtain a segment similarity;

S6, if the segment similarity is greater than a preset threshold, taking the automobile user comment text as text to be deleted; and if the segment similarity is less than or equal to the preset threshold, retaining the automobile user comment text.
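The threshold screening of the last two steps can be sketched as follows. This is an illustrative outline, not the patented implementation: the function `deduplicate`, the threshold value 0.8, the greedy comparison against previously kept comments, and the Jaccard stand-in `jaccard` used in place of the unbalanced cosine similarity are all assumptions.

```python
def jaccard(a, b):
    """Stand-in similarity on keyword sets (NOT the unbalanced cosine)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def deduplicate(comments, similarity, threshold=0.8):
    """Split comments into (kept, to_delete) by threshold screening.

    A comment is marked for deletion when its similarity to any comment
    already kept is strictly greater than the threshold; a similarity
    less than or equal to the threshold keeps the comment.
    """
    kept, to_delete = [], []
    for comment in comments:
        if any(similarity(comment, other) > threshold for other in kept):
            to_delete.append(comment)
        else:
            kept.append(comment)
    return kept, to_delete

# Hypothetical keyword lists standing in for processed comments.
comments = [
    ["drive shaft", "vibration"],
    ["vibration", "drive shaft"],
    ["torque", "large"],
]
kept, to_delete = deduplicate(comments, jaccard)
```

Passing the similarity function as a parameter keeps the screening rule separate from the similarity measure, so a real unbalanced cosine function could be substituted for the stand-in without changing the loop.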

The following details the embodiments of the present invention:

The invention addresses the following problems: in traditional de-duplication methods, modified characters cause repeated keywords to be recognized as new content; short and ultra-short texts such as automobile comments are very easily removed by mistake; in a big-data environment the total vocabulary is large, the converted high-dimensional vectors have poor sensitivity, and the availability and effectiveness of dimensionality reduction algorithms drop sharply; and when comment texts of very different lengths are converted into a sparse high-dimensional vector space, they always yield very different cosine vectors, so repeated content is not excluded.

In order to solve the technical problems, the invention adopts the technical scheme that:

Building on existing natural language processing methods for computer text analysis, and targeting the particular de-duplication needs of short automobile user comments, a complete calculation chain of keyword extraction, text vectorization and unbalanced cosine similarity threshold screening is adopted. Keyword extraction is focused, and synonyms and near-synonyms of the subject object are replaced to form the corresponding similarity vector space; an automobile-specific problem word library is used, the text keyword of a near-synonym replaces the keyword number of the corresponding automobile comment, and the vectors corresponding to comment sentences are recombined, which improves anti-interference capability; the dimensionality reduction method is combined with the high-dimensional sparse matrix, which gives wide applicability; and high-order space vectors replace the Hamming distance for calculating keyword similarity and merging similar keywords, which gives strong anti-interference capability.

The method can match different scenarios of automobile-specific problems and use different text keyword extraction algorithms, and it offers better anti-interference capability than weights calculated from the Hamming distance.

The traditional cosine similarity mathematical model is modified: from the value of each dimension of a cosine vector, the average value of all vectors of the data set in that dimension is subtracted, and the result replaces the corresponding dimension value of the original model vector. Each dimension therefore shifts by a different amount, which breaks the balance of the original vectors and yields a similarity comparison method that solves the problem that the traditional cosine vector method cannot recognize long and short sentences that repeat the same meaning.

In particular, automobile engineers must continuously collect user feedback to improve products, but the huge volume of feedback multiplies the difficulty of distilling the information.

As shown in FIG. 1 and FIG. 2, the method first segments automobile after-sales user feedback that contains a large number of short comments: following traditional computer natural language processing practice, punctuation marks are removed and the Chinese text is segmented with CRF++. For example, the original sentence "Compared with the Changan golden ox stand, the Hongguang automobile drive shaft resonance is large." gives the word segmentation result "compare/Changan//chariot/Hongguang/car/drive shaft/resonance/large".

Word segmentation allows each short comment, in the later mathematical analysis, to form the elements of a word vector according to how often each word occurs. Highly repeated words distort the similarity judgment: a high-frequency function word such as the particle in the previous example would, because it occurs so often, invalidate the similarity judgment. In the second step, therefore, a stop-word library of frequently occurring automobile-comment stop words is applied after segmentation to remove such words.

In the third step, text keywords are extracted: the words obtained in the second step are replaced by table lookup against the synonyms/near-synonyms in the company's automobile comment word library, and term frequency is weighted by the inverse document frequency index using the TF-IDF method. For example, the "resonance" in "Hongguang/car/drive shaft/resonance/large" falls under "vibration" in the comment thesaurus (whose entry groups shaking, resonance and vibration), so the result of synonym processing is "Hongguang/car/drive shaft/vibration/large", and the TF-IDF keyword extraction result is "Hongguang, drive shaft, vibration, large". This step improves the accuracy of the similarity analysis performed after keyword features have been extracted from automobile text.
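The stop-word removal and synonym lookup of the second and third steps might look like the sketch below. The stop-word set, the synonym table and the function `clean_tokens` are hypothetical stand-ins for the company word libraries, and the input is assumed to be already segmented (the CRF++ segmentation itself is not shown).

```python
# Hypothetical excerpts of the stop-word library and the synonym table.
STOP_WORDS = {"the", "of"}
SYNONYMS = {"resonance": "vibration", "shaking": "vibration"}

def clean_tokens(tokens, stop_words=STOP_WORDS, synonyms=SYNONYMS):
    """Drop empty tokens and stop words, then map each remaining word to
    its canonical synonym when the table has an entry for it."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token or token in stop_words:
            continue
        cleaned.append(synonyms.get(token, token))
    return cleaned

# Illustrative, already segmented comment; the empty string mimics an
# empty token left behind by segmentation.
tokens = ["the", "", "drive shaft", "resonance", "large"]
result = clean_tokens(tokens)
```

Replacing synonyms before counting means that comments using different words for the same symptom contribute to the same vector dimension.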

TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique for information retrieval and data mining:

where TF (c, d) represents the frequency of occurrence of the word c in the comment d.
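For illustration, a common textbook form of the weighting is sketched below in Python. TF(c, d) follows the definition above, taken here as a relative frequency; the inverse document frequency log(N / n_c), with N the number of comments and n_c the number of comments containing word c, is an assumption, as is the function name `tf_idf`.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per comment.

    Assumed form: score = TF(c, d) * log(N / n_c), where TF(c, d) is the
    relative frequency of word c in comment d.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per comment
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical keyword lists for two comments.
docs = [["drive shaft", "vibration"], ["vibration", "torque"]]
scores = tf_idf(docs)
```

With this form, a word that appears in every comment receives a score of zero, which is why high-frequency words contribute little to the extracted keywords.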

The fourth step, text vectorization, provides the vector elements for cosine analysis. Each comment is expressed as a count vector: each keyword in the comment, together with its frequency of occurrence, forms one element of the vector, so that the whole set of comments to be analyzed forms a matrix. The vectors in the matrix must have the same length; shorter vectors are zero-padded to form an n × n matrix, and the size of the vocabulary table determines the dimension. For example, "doc1, doc2, doc3, doc4 and doc5" represent 5 independent user comments, and "drive shaft, vibration, large torque" are keywords extracted from the comments to be analyzed; each element of the matrix in the following table is the frequency of one keyword in one comment. If doc1 does not mention "drive shaft" and doc2 mentions it once, then D₁₁ = 0 and D₁₂ = 1.

The matrix D is:
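As an illustration of how such a keyword-by-comment count matrix can be assembled, the sketch below uses hypothetical comments (not the values of the patent's table) and a hypothetical helper `build_count_matrix`. Rows correspond to keywords and columns to comments, so the entry for "drive shaft" in doc1 plays the role of D₁₁.

```python
def build_count_matrix(keywords, docs):
    """Return D where D[i][j] is how often keywords[i] occurs in docs[j]."""
    return [[tokens.count(keyword) for tokens in docs] for keyword in keywords]

keywords = ["drive shaft", "vibration", "torque"]
docs = [
    ["vibration"],                      # doc1: does not mention "drive shaft"
    ["drive shaft", "vibration"],       # doc2: mentions "drive shaft" once
    ["torque", "torque", "vibration"],  # doc3
]
D = build_count_matrix(keywords, docs)
# Python indices start at 0, so D[0][0] and D[0][1] correspond to D11 and D12.
```

Every column has one entry per keyword, so comments that mention few keywords are padded with zeros automatically.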

The fifth step addresses the information bias caused by users' preconceptions in technical comments about automobiles. The cosine similarity algorithm is efficient for the mathematical analysis of high-dimensional vectors, but the biggest difference between user comments and the after-sales problem reports of a 4S shop is that users are not professionally trained maintenance technicians, and each user understands automobile products to a different degree: class A users cannot use professional terms, class B users offer guesses based on experience, and class C users report the part where the symptom finally appears as the source of the problem (for example, with "steering wheel shaking" the steering wheel itself is not what shakes, and assigning the problem to the steering wheel airbag engineer causes trouble). To eliminate this bias in clustering and to correct the numerical influence of such differences in comment information on the similarity between the multi-dimensional vectors generated in the previous step, the balance of the cosine similarity method is adjusted and the sensitive words that engineers care about are weighted. Specifically, a weighting matrix E is introduced during detailed analysis, formed by combining a forward user-feedback input matrix with reverse engineer input. In the table above, for example, user comments doc1 and doc2 both involve the sensitive word, so the similarity between doc1 and doc2 is not affected, whereas doc5 does not involve the sensitive word "vibration", which increases the cosine distance between doc1/doc2 and doc5.

Now, the insertion of the matrix weight is realized through a formula:

where Cos(doc1, doc2) is the cosine similarity coefficient between the two comments doc1 and doc2, doc1_i and doc2_i are the frequencies with which each keyword is mentioned in the two comments, and M_i is the matrix formed by the weights of the sensitive words with sequence number i in the word library.
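One plausible way to insert per-keyword weights into a cosine similarity is sketched below in Python. The weighted form used here (each weight multiplying the corresponding term in the dot product and in both norms) is an assumption rather than the patent's formula; the function `weighted_cosine`, the keyword order and the three vectors are hypothetical. Only the weight 1.6 for "torque" is taken from the example in the text.

```python
import math

def weighted_cosine(a, b, weights):
    """Cosine similarity under a weighted inner product (assumed form):
    sum(w*a*b) / (sqrt(sum(w*a*a)) * sqrt(sum(w*b*b)))."""
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical keyword order: drive shaft, vibration, torque.
ref = [0, 1, 1]   # mentions vibration and torque
x = [0, 1, 0]     # mentions vibration only
y = [0, 0, 1]     # mentions torque only
uniform = [1.0, 1.0, 1.0]
sensitive = [1.0, 1.0, 1.6]  # "torque" weighted 1.6

plain_x = weighted_cosine(ref, x, uniform)       # about 0.707
plain_y = weighted_cosine(ref, y, uniform)       # about 0.707
weighted_x = weighted_cosine(ref, x, sensitive)  # about 0.620
weighted_y = weighted_cosine(ref, y, sensitive)  # about 0.784
```

With uniform weights the two comparisons tie; once the sensitive word is weighted, the comment that shares it with the reference scores higher than the one that does not.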

As in the above example, with the sensitive word "torque" given a weight of 1.6, the sets for doc2 and doc5 are as follows:

With the traditional cosine similarity method, the similarity of doc2 to doc1 and the similarity of doc2 to doc5 are the same, 0.707. In practice, however, doc1, a neutral evaluation of vibration and torque, and doc5, which raises large torque, are two different user feedback opinions and cannot be placed in one category.

The new method is used to obtain the unbalanced cosine similarity between doc1 and doc2 as follows:

The unbalanced cosine similarity of doc2 and doc5 is obtained as follows:

By implementing the method in full, better results can be obtained in user comment classification in actual use.

It should be noted that the above example is a description of the present method, and the application of the method is not limited to the above embodiment.

Compared with the prior art, the technical effect of the scheme of the invention is as follows:

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method of text processing, the method comprising the steps of:

obtaining a comment text of an automobile user;

2. The method of claim 1, wherein the step of extracting the text keywords from the text after word segmentation and stop-word removal comprises:

3. The text processing method of claim 1, wherein the step of constructing a corresponding similarity vector space, vectorizing the comment text of the automobile user based on the keyword extraction result, and obtaining a super-high-dimensional vector comprises:

4. The text processing method according to claim 1, wherein the step of performing an unbalanced cosine similarity analysis on the automobile user comment text based on the high-dimensional vector to obtain the segment similarity comprises:

5. The text processing method of claim 4, wherein the performing of the unbalanced cosine similarity analysis on the automobile user comment text comprises:

6. A text processing system, comprising: the system comprises a memory, a processor, and a computer program stored on the memory, which computer program, when executed by the processor, carries out the steps of the text processing method according to any one of claims 1-5.

7. A computer storage medium, characterized in that the computer storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the text processing method according to any one of claims 1-5.