CN113486662A - Text processing method, system and medium - Google Patents

Text processing method, system and medium

Info

Publication number
CN113486662A
Authority
CN
China
Prior art keywords
text
similarity
word
vector
automobile
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110811108.9A
Other languages
Chinese (zh)
Inventor
王伟
梁玮
兰斌旋
彭婧
龙鲜菊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SAIC GM Wuling Automobile Co Ltd
Original Assignee
SAIC GM Wuling Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SAIC GM Wuling Automobile Co Ltd
Priority to CN202110811108.9A
Publication of CN113486662A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text processing method, system and medium. The method comprises the following steps: obtaining automobile user comment text; performing word segmentation and stop-word removal on the automobile user comment text; extracting text keywords from the segmented, stop-word-filtered text to obtain a keyword extraction result; constructing a corresponding similarity vector space and vectorizing the automobile user comment text to obtain an ultra-high-dimensional vector; performing unbalanced cosine similarity analysis on the automobile user comment text based on the high-dimensional vector to obtain a segment similarity; if the segment similarity is greater than a preset threshold, taking the automobile user comment text as text to be deleted; otherwise, retaining the automobile user comment text. The method improves the robustness of text deduplication, avoids the tendency to remove content by mistake in short and ultra-short texts such as automobile comments, and solves the problem that the traditional cosine method cannot handle long and short sentences that repeat the same meaning.

Description

Text processing method, system and medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a text processing method, system, and medium.
Background
At present, keyword extraction for automobile after-sales problems and forum user comments mostly relies on two methods: the MD5 method, which maps content absolutely, and locality-sensitive hashing (LSH) followed by cosine similarity analysis of high-dimensional space vectors. The traditional MD5 method detects identical text content efficiently, but a slight change of characters causes duplicated keywords to be recognized as different content. The LSH method ignores the word order of different sentence patterns, searches for locally similar words and determines weights by Hamming distance; it is effective for duplicate checking of long text passages, but in short and ultra-short texts such as automobile comments it very easily removes content by mistake, so its deduplication robustness is low.
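The fragility of exact-hash deduplication described above can be illustrated with a minimal sketch. The example strings and the helper name `md5_fingerprint` are hypothetical and are not taken from the patent; only the standard-library `hashlib.md5` call is used.

```python
import hashlib


def md5_fingerprint(text: str) -> str:
    """Return the hex MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical comments used only for illustration.
a = "drive shaft vibration is large"
b = "drive shaft vibration is large"   # exact copy of a
c = "drive shaft vibration is large!"  # one extra character

# Identical text yields identical digests, so exact duplicates are found.
same_exact = md5_fingerprint(a) == md5_fingerprint(b)

# A one-character change yields a different digest, so the near-duplicate
# is treated as new content.
same_near = md5_fingerprint(a) == md5_fingerprint(c)
```

Under this sketch, `same_exact` is true and `same_near` is false, which is the behaviour the background section attributes to the MD5 approach.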
In addition, the traditional cosine similarity method is difficult to apply to short-text analysis. In a sparse high-dimensional vector space such as that of automobile comments, sentences of very different lengths that describe the same specific attribute of the same object produce different cosine vectors, so duplicate content is missed.
Disclosure of Invention
The main object of the invention is to provide a text processing method, system and medium that improve the robustness of text deduplication, avoid the tendency to remove content by mistake in short and ultra-short texts such as automobile comments, and solve the problem that the traditional cosine method cannot handle long and short sentences that repeat the same meaning.
In order to achieve the above object, the present invention provides a text processing method, including the following steps:
obtaining automobile user comment text;
performing word segmentation and stop-word removal on the automobile user comment text;
extracting text keywords from the segmented, stop-word-filtered text to obtain a keyword extraction result;
constructing a corresponding similarity vector space, and vectorizing the automobile user comment text based on the keyword extraction result to obtain an ultra-high-dimensional vector;
based on the high-dimensional vector, performing unbalanced cosine similarity analysis on the automobile user comment text to obtain a segment similarity;
if the segment similarity is greater than a preset threshold, taking the automobile user comment text as text to be deleted;
and if the segment similarity is less than or equal to the preset threshold, retaining the automobile user comment text.
The step of extracting text keywords from the segmented, stop-word-filtered text comprises:
replacing the segmented words by table lookup against the synonyms/near-synonyms in the automobile comment lexicon, and weighting term frequency by the inverse document frequency index using the TF-IDF method to obtain the keyword extraction result.
The step of constructing a corresponding similarity vector space and vectorizing the automobile user comment text based on the keyword extraction result to obtain an ultra-high-dimensional vector comprises:
constructing a corresponding similarity vector space, converting the extracted keywords into numbers, and thereby converting each piece of user data containing several words into a multi-dimensional array to obtain an ultra-high-dimensional vector, so that the user comment texts as a whole form a co-occurrence matrix of high-dimensional vectors.
The step of performing unbalanced cosine similarity analysis on the automobile user comment text based on the high-dimensional vector to obtain the segment similarity comprises:
introducing a weighting matrix, formed by combining a forward user-feedback input matrix with reverse engineer input, to weight the sensitive words that engineers care about, and calculating the unbalanced cosine similarity between comment texts to obtain the segment similarity.
Performing the unbalanced cosine similarity analysis on the automobile user comment text comprises:
based on the cosine similarity mathematical model, subtracting from the value of each dimension of a vector the mean of all vectors of the data set in that dimension, and replacing each dimension value of the original vector with the result, thereby constructing new vectors whose differing per-dimension shifts break the balance of the original vectors, for use in the similarity comparison.
In addition, the present invention further provides a text processing system comprising a memory, a processor, and a computer program stored on the memory, wherein the computer program, when executed by the processor, implements the steps of the text processing method described above.
Furthermore, the present invention also provides a computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the text processing method described above.
The technical effects of the scheme of the invention are as follows:
In the invention, text is converted to data by Chinese word segmentation and stop-word removal (stop words being high-frequency words that do not affect semantics), and a corresponding similarity vector space is constructed; this combined data cleaning improves the robustness of the vector comparison algorithm for ultra-short sentences.
The method digitally converts the cleaned ultra-short text data: the text is split into words, the words are converted into numbers, and each user comment containing several words is converted into a multi-dimensional array, that is, a multi-dimensional vector. Through this conversion the method can be matched to different scenarios of automobile-specific problems, using different text keyword extraction algorithms, and has better anti-interference capability;
according to the method, a massive number of automobile user comment texts are vectorized at the same time to obtain ultra-high-dimensional vectors, and a dimensionality reduction method is used to reduce the risk of dimensional explosion in the high-dimensional sparse matrix; the method has good applicability and extensibility for the high-dimensional sparse matrices generated by massive comment content, and computational efficiency on large data volumes is improved by 31%.
To address the huge vocabulary of the high-dimensional vectors, which easily causes storage overflow, a scenario library matched to automobile-specific projects is used in actual projects: the keyword numbers of the corresponding automobile comments are replaced with those of the near-synonym text keywords, the vectors corresponding to the comment sentences are recombined and changed, and the clarity of the sentence vectors is greatly improved.
The traditional cosine similarity method handles poorly sentences of very different lengths that describe the same specific attribute of the same thing: they produce different cosine vectors, and duplicate content is missed. To solve this, the unbalanced cosine similarity method compares vectors carrying different weights, eliminating the interference from long and short sentences that repeat the same meaning but differ greatly in length (the Hamming distance between keywords); merging identical items reduces the fault keywords of the project software by about 27%.
Drawings
FIG. 1 is a schematic flow chart of a text processing method according to the present invention;
fig. 2 is a schematic flowchart of an example text processing method according to the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
At present, most vehicle manufacturers deduplicate automobile user comments either by absolute content mapping or by converting the text into a vector space model (VSM) and then performing similarity analysis on the high-dimensional space vectors, and the deduplication robustness of both is low. Automobile user comments include both long and short comments and have complex semantic structures, so the deduplication results are unstable. The locality-sensitive hashing method ignores the word order of different sentence patterns, searches for locally similar words and determines weights by Hamming distance; it is effective for duplicate checking of long text passages, but it is error-prone in short and ultra-short texts such as automobile comments.
Specifically, as shown in FIG. 1, the present invention provides a text processing method comprising the following steps:
S1, obtaining automobile user comment text;
S2, performing word segmentation and stop-word removal on the automobile user comment text;
S3, extracting text keywords from the segmented, stop-word-filtered text to obtain a keyword extraction result;
specifically, replacing the segmented words by table lookup against the synonyms/near-synonyms in the automobile comment lexicon, and weighting term frequency by the inverse document frequency index using the TF-IDF method to obtain the keyword extraction result.
S4, constructing a corresponding similarity vector space, and vectorizing the automobile user comment text based on the keyword extraction result to obtain an ultra-high-dimensional vector;
specifically, constructing a corresponding similarity vector space, converting the extracted keywords into numbers, and thereby converting each piece of user data containing several words into a multi-dimensional array to obtain an ultra-high-dimensional vector, so that the user comment texts as a whole form a co-occurrence matrix of high-dimensional vectors.
S5, based on the high-dimensional vector, performing unbalanced cosine similarity analysis on the automobile user comment text to obtain a segment similarity;
specifically, introducing a weighting matrix, formed by combining a forward user-feedback input matrix with reverse engineer input, to weight the sensitive words that engineers care about, and calculating the unbalanced cosine similarity between comment texts to obtain the segment similarity.
Performing the unbalanced cosine similarity analysis on the automobile user comment text comprises:
based on the cosine similarity mathematical model, subtracting from the value of each dimension of a vector the mean of all vectors of the data set in that dimension, and replacing each dimension value of the original vector with the result, thereby constructing new vectors whose differing per-dimension shifts break the balance of the original vectors, for use in the similarity comparison.
S6, if the segment similarity is greater than a preset threshold, taking the automobile user comment text as text to be deleted; and if the segment similarity is less than or equal to the preset threshold, retaining the automobile user comment text.
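The threshold decision in the final step can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say which comments each new comment is compared against, so the sketch assumes comparison against the comments already kept; plain cosine similarity stands in for the unbalanced variant; and the names `cosine` and `deduplicate`, the sample vectors and the threshold 0.9 are all hypothetical.

```python
import math


def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(vectors, threshold, similarity=cosine):
    """Split comment indices into kept and to-be-deleted lists.

    A comment is marked for deletion if its similarity to any already
    kept comment is strictly greater than the threshold (assumed rule).
    """
    kept, deleted = [], []
    for idx, vec in enumerate(vectors):
        if any(similarity(vec, vectors[k]) > threshold for k in kept):
            deleted.append(idx)
        else:
            kept.append(idx)
    return kept, deleted


# Hypothetical keyword-count vectors for three comments.
vectors = [[1, 1, 0], [2, 2, 0], [0, 0, 1]]
kept, deleted = deduplicate(vectors, threshold=0.9)
```

Any other similarity function with the same two-argument shape can be passed through the `similarity` parameter.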
The following details the embodiments of the present invention:
The invention takes the following problems into account: with traditional deduplication methods, a slight change of characters causes duplicated keywords to be recognized as different content; short and ultra-short texts such as automobile comments are very easily removed by mistake; in a big-data environment the total vocabulary is large, the converted high-dimensional vectors have poor sensitivity, and the usability and effectiveness of dimensionality reduction algorithms drop sharply; and when comment texts of very different lengths are converted into a sparse high-dimensional vector space, cosine vectors with large differences are generated, so that duplicate content is not excluded.
In order to solve the technical problems, the invention adopts the technical scheme that:
On the basis of existing computer text analysis and natural language processing methods, and in view of the particular requirements of deduplicating short automobile user comment texts, a complete calculation method is adopted consisting of keyword extraction, text vectorization, and screening by an unbalanced cosine similarity threshold. Keyword extraction is focused on the subject object, with synonyms and near-synonyms replaced, to form the corresponding similarity vector space; an automobile-specific problem lexicon is used, the keyword numbers of the corresponding automobile comments are replaced with those of the near-synonym text keywords, and the vectors corresponding to the comment sentences are recombined and changed, which improves anti-interference capability; the dimensionality reduction method is combined with the high-dimensional sparse matrix, giving wide applicability; and high-dimensional space vectors are used in place of the Hamming distance to calculate keyword similarity and merge similar keywords, giving strong anti-interference capability.
One innovation of the method is that it can be matched to different scenarios of automobile-specific problems, using different text keyword extraction algorithms, and has better anti-interference capability than weights calculated from the Hamming distance;
another innovation is that the traditional cosine similarity mathematical model is changed: the mean of all vectors of the data set in each dimension is subtracted from the value of that dimension of a vector, and the result replaces each dimension value of the original vector, so that the differing per-dimension shifts break the balance of the original vectors before the similarity comparison. This solves the problem that the traditional cosine method cannot handle long and short sentences that repeat the same meaning.
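The mean-subtraction step described above can be sketched as follows. This is only an illustration of subtracting the per-dimension data-set mean before computing cosine similarity; the function names `center` and `cosine` and the sample vectors are hypothetical, and the patent's exact formula may differ.

```python
import math


def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def center(vectors):
    """Subtract from every dimension the mean of that dimension over the data set."""
    n = len(vectors)
    means = [sum(col) / n for col in zip(*vectors)]
    return [[x - m for x, m in zip(vec, means)] for vec in vectors]


# Hypothetical count vectors: the second is the first with doubled counts.
vectors = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 3]]

plain = cosine(vectors[0], vectors[1])      # direction only, ignores magnitude
centered = center(vectors)
shifted = cosine(centered[0], centered[1])  # similarity after mean subtraction
```

Plain cosine treats the first two vectors as identical because they point in the same direction; after centring, each dimension is shifted by a different amount, so the two vectors are no longer collinear and their similarity drops below 1.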
In particular, automobile engineers need to collect user feedback continuously in order to improve products, but the huge volume of user feedback multiplies the difficulty of distilling the information.
As shown in FIG. 1 and FIG. 2, the method first segments automobile after-sales user feedback, which contains a large number of short comments. Following conventional computer natural language processing practice, punctuation is removed and the Chinese text is segmented with CRF++. For example, for the original sentence "Compared with the Changan automobile transmission shaft with the great light of a golden ox stand, the resonance is large.", the word segmentation result is "contrast/changan// chariot/macro light/car/propeller shaft/resonance/large".
The purpose of word segmentation is to let each short comment form tuples in a word vector according to the occurrence frequency of its words in the later mathematical analysis. Words that repeat very often distort the similarity judgment: in the previous example, a word with a very high occurrence frequency would defeat a similarity judgment based on word frequency. Therefore, in the second step, a stop-word lexicon of words that occur frequently in automobile comments is applied to the segmented words, so that such stop words are removed after segmentation.
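The stop-word filtering step can be sketched as a simple set lookup over already segmented tokens. The stop-word entries, the token list and the function name `remove_stop_words` are hypothetical; the patent's actual automobile-comment stop-word lexicon is not reproduced in the text.

```python
# Hypothetical stop-word lexicon; the real lexicon is not given in the text.
STOP_WORDS = {"的", "了", "is", "the"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens and tokens that appear in the stop-word lexicon."""
    return [t for t in tokens if t and t not in stop_words]


# Hypothetical segmented comment, split on the "/" separator used above.
tokens = "macro light/的/car/propeller shaft//resonance/large".split("/")
cleaned = remove_stop_words(tokens)
```

Empty tokens, such as the one produced by a doubled separator, are dropped together with the stop words.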
The third step is text keyword extraction. The words obtained in the second step are replaced by table lookup against the synonym/near-synonym table in the company's automobile comment lexicon, and the TF-IDF method is then applied, weighting term frequency by the inverse document frequency index. For example, the "resonance" in "macro light/car/propeller shaft/resonance/large" corresponds to "vibration" in the comment thesaurus, whose entry reads "vibration: shaking, resonance, shaking, vibration"; the result of synonym processing is "macro light/car/propeller shaft/vibration/large", and the TF-IDF keyword extraction result is "macro light, propeller shaft, vibration, large". This step improves the accuracy of the similarity analysis performed after keyword features are extracted from automobile text.
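The table-lookup replacement can be sketched as a dictionary that maps each variant to its canonical keyword. The thesaurus content below only mirrors the "vibration" example from the text; the data structure and the names `build_lookup` and `normalize` are assumptions for illustration.

```python
# Hypothetical thesaurus: canonical keyword -> list of synonyms/near-synonyms.
THESAURUS = {"vibration": ["shaking", "resonance"]}


def build_lookup(thesaurus):
    """Invert the thesaurus into a variant -> canonical keyword table."""
    lookup = {}
    for canonical, variants in thesaurus.items():
        for variant in variants:
            lookup[variant] = canonical
    return lookup


def normalize(tokens, lookup):
    """Replace each token by its canonical keyword if the table has one."""
    return [lookup.get(token, token) for token in tokens]


tokens = ["macro light", "car", "propeller shaft", "resonance", "large"]
normalized = normalize(tokens, build_lookup(THESAURUS))
```

Tokens without an entry in the table pass through unchanged, so only "resonance" is rewritten here.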
TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technique for information retrieval and data mining:
[Formula image BDA0003168253960000071 (TF-IDF weighting formula) is not reproduced in this text.]
where TF (c, d) represents the frequency of occurrence of the word c in the comment d.
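Since the formula image is not available, the following sketch uses the commonly cited form of TF-IDF, in which the term frequency TF(c, d) is multiplied by log(N / DF(c)), with N the number of comments and DF(c) the number of comments containing the word c. Whether the patent uses exactly this form is an assumption; the function names and sample comments are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} map per tokenized comment.

    Assumed weighting: (count / comment length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def top_keywords(weight_map, k):
    """Pick the k highest-weighted words, ties broken alphabetically."""
    ranked = sorted(weight_map.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical tokenized comments after stop-word removal and synonym lookup.
docs = [
    ["macro light", "car", "propeller shaft", "vibration", "large"],
    ["car", "torque", "large"],
    ["car", "seat", "comfortable"],
]
weights = tf_idf(docs)
keywords = top_keywords(weights[0], 4)
```

In this toy corpus "car" occurs in every comment, so its weight is zero and it falls out of the top four keywords of the first comment, consistent with the extraction result quoted in the text.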
The fourth step, text vectorization, provides the vector elements for cosine analysis. In this step each comment is expressed as a count vector: each keyword in the comment, together with its number of occurrences, becomes one tuple of the vector, so that the comments to be analyzed as a whole form a matrix. The vectors in the matrix must have the same length; shorter vectors are zero-padded to form an n x n matrix, and the size of the vocabulary table determines the dimension. For example, "doc1, doc2, doc3, doc4 and doc5" represent five independent user comments, and "drive shaft, vibration, large torque" are the keywords extracted from the comments to be analyzed; each element of the matrix in the following table is the frequency with which a keyword occurs in an independent comment. For example, if doc1 does not mention "drive shaft" and doc2 mentions it once, then D_{11} = 0 and D_{12} = 1.
[Table image BDA0003168253960000072 (keyword frequencies per comment doc1 to doc5) is not reproduced in this text.]
The matrix D is:
[Matrix image BDA0003168253960000073 (matrix D) is not reproduced in this text.]
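The construction of the count matrix can be sketched as follows, with one row per keyword and one column per comment so that D_{11} and D_{12} refer to "drive shaft" in doc1 and doc2. The comments and vocabulary below are hypothetical because the original table is not reproduced, and using a fixed shared vocabulary (which yields equal-length vectors with zeros for absent keywords) is an assumption about how the zero padding is realized.

```python
from collections import Counter


def count_matrix(docs, vocabulary):
    """Build a keyword-by-comment count matrix (rows: keywords, columns: comments)."""
    counts = [Counter(doc) for doc in docs]
    return [[c[word] for c in counts] for word in vocabulary]


# Hypothetical vocabulary and tokenized comments.
vocabulary = ["drive shaft", "vibration", "torque"]
docs = [
    ["vibration", "torque"],       # doc1: does not mention "drive shaft"
    ["drive shaft", "vibration"],  # doc2: mentions "drive shaft" once
    ["torque", "torque"],          # doc3
]
D = count_matrix(docs, vocabulary)
```

With zero-based Python indexing, `D[0][0]` plays the role of D_{11} and `D[0][1]` the role of D_{12}.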
The fifth step addresses the information bias caused by users' preconceptions in technical comments about automobiles. The cosine similarity algorithm is efficient for the mathematical analysis of high-dimensional vectors, but the biggest difference between user comments and the after-sales problem records of a 4S shop is that users are not professionally trained maintenance technicians, and each user understands automobile products to a different degree: class A users cannot use professional terms, class B users offer guesses based on experience, and class C users report the component where the symptom finally appears as the source of the problem (for example, with a "steering wheel shaking" problem the steering wheel itself is not the cause of the shaking, and assigning the problem to the steering wheel airbag engineer causes some trouble). To eliminate the bias of the clustering method and to better account for the numerical influence of differences in automobile comment information on the similarity between the multi-dimensional vectors generated in the previous step, the balance of the cosine similarity method is adjusted and the sensitive words that engineers care about are weighted. Specifically, a weighting matrix E is introduced during detailed analysis, formed by combining a forward user-feedback input matrix with reverse engineer input. In the table above, for example, user comments doc1 and doc2 both involve the sensitive word, so the similarity between doc1 and doc2 is not affected, whereas doc5 does not involve the sensitive word "vibration", which increases the cosine distance between doc1/doc2 and doc5.
Now, the insertion of the matrix weight is realized through a formula:
[Formula image BDA0003168253960000081 (weighted cosine similarity formula) is not reproduced in this text.]
wherein Cos(doc1, doc2) is the cosine similarity coefficient between the two comments doc1 and doc2; doc1_i and doc2_i are the frequencies with which each keyword is mentioned in the two comments; and M_i is the matrix formed by the weights of the sensitive words with sequence number i in the lexicon.
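Because the formula itself is not reproduced here, the following is only one plausible form of a weighted cosine similarity, in which the weight M_i multiplies the i-th term of both the dot product and the norms. This choice, the function name `weighted_cosine`, and the sample vectors are assumptions; only the weight 1.6 for "torque" is taken from the example in the text.

```python
import math


def weighted_cosine(a, b, weights):
    """Cosine similarity under a per-dimension positive weight (assumed form).

    sum(M_i * a_i * b_i) / (sqrt(sum(M_i * a_i^2)) * sqrt(sum(M_i * b_i^2)))
    """
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vocabulary order: ["drive shaft", "vibration", "torque"].
unit = [1.0, 1.0, 1.0]
boosted = [1.0, 1.0, 1.6]      # sensitive word "torque" weighted 1.6

both = [0, 1, 1]               # mentions vibration and torque
vibration_only = [0, 1, 0]
torque_only = [0, 0, 1]

plain_v = weighted_cosine(both, vibration_only, unit)
plain_t = weighted_cosine(both, torque_only, unit)
weighted_v = weighted_cosine(both, vibration_only, boosted)
weighted_t = weighted_cosine(both, torque_only, boosted)
```

With unit weights the two comparisons tie at about 0.707; with the boosted weight the tie is broken in favour of the comment that shares the sensitive word, which is the qualitative effect the text describes.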
As in the above example, with the sensitive word "torque" given a weight of 1.6, the sets for doc2 and doc5 are as follows:
[Image BDA0003168253960000082 (weighted sets for doc2 and doc5) is not reproduced in this text.]
With the traditional cosine similarity method, the similarity of doc2 to doc1 and to doc5 is the same, 0.707; in practice, however, doc1, a neutral evaluation of vibration and torque, and doc5, which praises the large torque, are two user feedback opinions that should not be classified into one category.
The new method is used to obtain the unbalanced cosine similarity between doc1 and doc2 as follows:
[Formula image BDA0003168253960000083 (computed unbalanced cosine similarity of doc1 and doc2) is not reproduced in this text.]
the unbalanced cosine similarity of doc2 and doc5 is obtained as follows:
[Formula image BDA0003168253960000084 (computed unbalanced cosine similarity of doc2 and doc5) is not reproduced in this text.]
With the complete implementation of the method, better results can be obtained when classifying user comments in actual use.
It should be noted that the above example is a description of the present method, and the application of the method is not limited to the above embodiment.
In addition, the present invention further provides a text processing system comprising a memory, a processor, and a computer program stored on the memory, wherein the computer program, when executed by the processor, implements the steps of the text processing method described above.
Furthermore, the present invention also provides a computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the text processing method described above.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (7)

1. A method of text processing, the method comprising the steps of:
obtaining a comment text of an automobile user;
performing word segmentation and stop-word removal on the automobile user comment text;
extracting text keywords from the text after word segmentation and stop-word removal to obtain a keyword extraction result;
constructing a corresponding similarity vector space, and performing vectorization processing on the automobile user comment text based on the keyword extraction result to obtain a high-dimensional vector of ultra-high dimensionality;
based on the high-dimensional vector, carrying out unbalanced cosine similarity analysis on the automobile user comment text to obtain a speech segment similarity;
if the speech segment similarity is greater than a preset threshold value, taking the automobile user comment text as a text to be deleted;
and if the speech segment similarity is less than or equal to the preset threshold value, keeping the automobile user comment text.
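The steps of claim 1 can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: whitespace splitting stands in for word segmentation, `STOP_WORDS` is a made-up list, plain cosine similarity stands in for the unbalanced cosine similarity, and the threshold of 0.8 is arbitrary.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "very", "and", "too"}  # illustrative list only


def keywords(text):
    """Lower-case, split on whitespace, drop stop words.

    Whitespace splitting is a stand-in for a real word segmenter.
    """
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def cosine(a, b):
    """Plain cosine similarity between two Counter term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def deduplicate(comments, threshold=0.8):
    """Mark a comment for deletion if it is too similar to one already kept."""
    kept, kept_vectors, to_delete = [], [], []
    for text in comments:
        vec = Counter(keywords(text))
        if any(cosine(vec, other) > threshold for other in kept_vectors):
            to_delete.append(text)
        else:
            kept.append(text)
            kept_vectors.append(vec)
    return kept, to_delete


comments = [
    "the engine is very noisy",
    "engine noisy",
    "the seats are comfortable",
]
kept, to_delete = deduplicate(comments)
print("kept:", kept)
print("to delete:", to_delete)
```

Comparing each comment only against the comments already kept is one possible design choice; the claim itself does not state which pairs of texts are compared.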
2. The text processing method of claim 1, wherein the step of extracting the text keywords from the text after word segmentation and stop-word removal comprises:
replacing the segmented words with their synonyms or near-synonyms by table lookup in the automobile comment word library, and multiplying the term frequency by the inverse document frequency using the TF-IDF method to obtain the keyword extraction result.
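As a reminder of how TF-IDF weighting works in general, the sketch below computes a weight for each term as term frequency multiplied by inverse document frequency. The unsmoothed formula `tf * log(N / df)`, the function names, and the sample documents are assumptions chosen for illustration; the patent does not specify which TF-IDF variant it uses.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    weight = (count / document length) * log(number of documents / document frequency)
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


def top_keywords(weight_map, k=1):
    """Return the k terms with the highest TF-IDF weight."""
    return sorted(weight_map, key=weight_map.get, reverse=True)[:k]


documents = [
    ["engine", "noisy", "engine"],
    ["engine", "smooth"],
    ["seat", "soft"],
]
weights = tf_idf(documents)
print(top_keywords(weights[0]))
```

In the first document "engine" occurs more often than "noisy", but "noisy" is rarer across the collection, so it receives the higher weight.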
3. The text processing method of claim 1, wherein the step of constructing a corresponding similarity vector space and performing vectorization processing on the automobile user comment text based on the keyword extraction result to obtain a high-dimensional vector of ultra-high dimensionality comprises:
constructing a corresponding similarity vector space and converting the extracted keywords into numerical form, so that each piece of user data containing a plurality of words is converted into a multi-dimensional array and a high-dimensional vector of ultra-high dimensionality is obtained, whereby the user comment texts as a whole form a co-occurrence matrix of the high-dimensional vectors.
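One simple way to picture this numerical conversion is a count matrix with one row per comment and one column per keyword. The sketch below is an assumption-laden illustration: the function names are hypothetical, and a plain term-count matrix is used as a stand-in for the co-occurrence matrix named in the claim.

```python
def build_vocabulary(keyword_lists):
    """Assign each distinct keyword a column index, in sorted order."""
    terms = sorted({term for kws in keyword_lists for term in kws})
    return {term: index for index, term in enumerate(terms)}


def vectorize(keyword_lists, vocab):
    """Build one row per comment; each cell counts a keyword's occurrences."""
    matrix = []
    for kws in keyword_lists:
        row = [0] * len(vocab)
        for term in kws:
            row[vocab[term]] += 1
        matrix.append(row)
    return matrix


keyword_lists = [
    ["engine", "noisy"],
    ["engine", "engine", "smooth"],
]
vocab = build_vocabulary(keyword_lists)
matrix = vectorize(keyword_lists, vocab)
for row in matrix:
    print(row)
```

With a realistic vocabulary most cells are zero, which is why the description speaks of a high-dimensional sparse matrix.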
4. The text processing method according to claim 1, wherein the step of performing an unbalanced cosine similarity analysis on the automobile user comment text based on the high-dimensional vector to obtain the speech segment similarity comprises:
introducing a weighting matrix formed by combining a forward user feedback input matrix with reverse engineer input, so as to weight the sensitive words of concern to engineers, and calculating the unbalanced cosine similarity between the comment texts to obtain the speech segment similarity.
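A weighted cosine similarity of this kind can be sketched with one weight per dimension, which corresponds to a diagonal weighting matrix. The diagonal form, the function name, the sample weights, and the column meanings are all assumptions for illustration; how the patent combines user feedback with engineer input into the weighting matrix is not reproduced here.

```python
import math


def weighted_cosine(a, b, weights):
    """Cosine similarity in which each dimension i is scaled by weights[i].

    Assumes non-negative weights (a diagonal weighting matrix).
    """
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical columns: brake, noise, seat
comment_a = [1, 1, 0]
comment_b = [1, 0, 1]

uniform = weighted_cosine(comment_a, comment_b, [1, 1, 1])
brake_sensitive = weighted_cosine(comment_a, comment_b, [4, 1, 1])
print(uniform, brake_sensitive)
```

Raising the weight of the shared "brake" dimension raises the similarity of the two comments, which is the intended effect of emphasizing words that engineers care about.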
5. The text processing method of claim 4, wherein the performing of the unbalanced cosine similarity analysis on the automobile user comment text comprises:
based on a cosine similarity mathematical model, subtracting from the value of each dimension of a cosine vector the mean value of all vectors of the data set in that dimension, and replacing each dimension value of the original model vector with the obtained result, thereby constructing new vectors whose dimensions vary by different amounts and which break the balance of the original vectors, the new vectors being used for the similarity comparison.
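The mean-subtraction step can be sketched as follows: compute the mean of every dimension over the whole data set, subtract it from each vector, and take the cosine of the centered vectors. The function names and the sample matrix are assumptions for illustration only.

```python
import math


def column_means(matrix):
    """Mean of each dimension over all vectors in the data set."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


def center(vector, means):
    """Subtract the data-set mean of each dimension from the vector."""
    return [x - m for x, m in zip(vector, means)]


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centered_cosine(i, j, matrix):
    """Cosine similarity of rows i and j after mean-centering every dimension."""
    means = column_means(matrix)
    return cosine(center(matrix[i], means), center(matrix[j], means))


matrix = [
    [2, 1, 0],
    [4, 2, 0],
    [0, 1, 3],
]
print(cosine(matrix[0], matrix[1]))
print(centered_cosine(0, 1, matrix))
print(centered_cosine(0, 2, matrix))
```

Rows 0 and 1 point in the same direction, so plain cosine cannot tell them apart; after centering, the result depends on how each row deviates from the data-set average, and rows 0 and 2 come out negatively related.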
6. A text processing system, comprising a memory, a processor, and a computer program stored on the memory, wherein the computer program, when executed by the processor, carries out the steps of the text processing method according to any one of claims 1-5.
7. A computer storage medium, wherein the computer storage medium has stored thereon a computer program which, when executed by a processor, carries out the steps of the text processing method according to any one of claims 1-5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110811108.9A CN113486662A (en) 2021-07-19 2021-07-19 Text processing method, system and medium

Publications (1)

Publication Number Publication Date
CN113486662A true CN113486662A (en) 2021-10-08

Family

ID=77942164

Country Status (1)

Country Link
CN (1) CN113486662A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297401A (en) * 2021-12-14 2022-04-08 中航机载系统共性技术有限公司 System knowledge extraction method based on clustering algorithm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408809A (en) * 2018-09-25 2019-03-01 天津大学 A kind of sentiment analysis method for automobile product comment based on term vector
CN110134925A (en) * 2019-05-15 2019-08-16 北京信息科技大学 A kind of Chinese patent text similarity calculating method
CN110705286A (en) * 2019-09-24 2020-01-17 青木数字技术股份有限公司 Comment information-based data processing method and device
CN111401045A (en) * 2020-03-16 2020-07-10 腾讯科技(深圳)有限公司 Text generation method and device, storage medium and electronic equipment
CN112257410A (en) * 2020-10-15 2021-01-22 江苏卓易信息科技股份有限公司 Similarity calculation method for unbalanced text

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN105045781B (en) Query term similarity calculation method and device and query term search method and device
CN107423371B (en) Method for classifying positive and negative emotions of text
JP2001034623A (en) Information retrievel method and information reteraval device
CN108875065B (en) Indonesia news webpage recommendation method based on content
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
CN108363694B (en) Keyword extraction method and device
Ye et al. Unknown Chinese word extraction based on variety of overlapping strings
CN111241824B (en) Method for identifying Chinese metaphor information
Lan Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF‐IDF Method
WO2014002774A1 (en) Synonym extraction system, method, and recording medium
CN113011194A (en) Text similarity calculation method fusing keyword features and multi-granularity semantic features
Rathod Extractive text summarization of Marathi news articles
JP6340351B2 (en) Information search device, dictionary creation device, method, and program
Hardeniya et al. An approach to sentiment analysis using lexicons with comparative analysis of different techniques
CN113486662A (en) Text processing method, system and medium
Jiang et al. Word network topic model based on Word2Vector
Farhan et al. Sentiment-specific word embedding for Indonesian sentiment analysis
CN111339778A (en) Text processing method, device, storage medium and processor
Santosh et al. Obtaining feature-and sentiment-based linked instance RDF data from unstructured reviews using ontology-based machine learning
CN113987133A (en) Method for realizing extraction type text summarization by fusing TFIDF and LDA
Saad et al. Dewy index based Arabic document classification with synonyms merge feature reduction
CN113761104A (en) Method and device for detecting entity relationship in knowledge graph and electronic equipment
Wang et al. Natural language semantic corpus construction based on cloud service platform
Kumamoto et al. Improving a method for quantifying readers’ impressions of news articles with a regression equation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211008