WO2024037483A1 - Text processing method, device, equipment and medium - Google Patents

Text processing method, device, equipment and medium

Info

Publication number
WO2024037483A1
Authority
WO
WIPO (PCT)
Prior art keywords
frequency
target
sentence
sentence element
target high
Prior art date
Application number
PCT/CN2023/112841
Other languages
English (en)
French (fr)
Inventor
王兆麟
丁冠源
回姝
王兆麒
郭富琦
黄嘉桐
郑彤
张文娟
Original Assignee
中国第一汽车股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国第一汽车股份有限公司
Publication of WO2024037483A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the field of computers, for example, to a text processing method, device, equipment and medium.
  • Online reviews express users' experiences, evaluations or opinions in the form of text. They are often time-sensitive, quantitative, unstructured, and complex in content. With the rapid increase in the amount of online review information, text mining of online reviews can not only help consumers make rational decisions, but also guide designers in design, production, and version updates.
  • This application provides a text processing method, device, equipment and medium, which can achieve more accurate and effective processing of review text, and can extract and visualize more accurate and effective information.
  • a text processing method including:
  • the review text is a text that comments on the relevant performance of the car of the preset ranking model
  • each sentence element includes at least one of the following: nouns and phrases;
  • the results of the review text processing are visually displayed based on the quantitative value of importance, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
  • a text processing device including:
  • a splitting module configured to split the comment text into at least two target sentences;
  • the comment text is a text commenting on the relevant performance of the car of the preset ranking model;
  • the determining module is configured to determine the target high-frequency sentence elements based on the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences; where each sentence element includes at least one of the following: nouns and phrases;
  • the calculation module is configured to calculate the quantitative importance value associated with the target high-frequency sentence element based on the evaluation level of the review text where the target high-frequency sentence element is located;
  • the visualization module is configured to visually display the results of the comment text processing based on the quantitative value of importance, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
  • an electronic device including:
  • the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the text processing method described in any embodiment of the present application.
  • a computer-readable storage medium stores computer instructions, and the computer instructions are used to implement the text processing method of any embodiment of the present application when executed by a processor.
  • Figure 1 is a flow chart of a text processing method provided by Embodiment 1 of the present application.
  • Figure 2 is a flow chart of a text processing method provided in Embodiment 2 of the present application.
  • Figure 3 is a flow chart of a text processing method provided in Embodiment 3 of the present application.
  • Figure 4 is a structural diagram of a text processing device provided in Embodiment 4 of the present application.
  • FIG. 5 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present application.
  • Figure 1 is a flow chart of a text processing method provided in Embodiment 1 of the present application. This embodiment is suitable for analyzing and processing review texts about automobile-related performance.
  • The method can be executed by a text processing device, and the device can be implemented in the form of software and/or hardware and integrated into an electronic device capable of text processing. As shown in Figure 1, the method includes:
  • the comment text is text that comments on the relevant performance of the car of the preset ranking model.
  • the target sentence refers to the sentence in the review text that satisfies the filtering conditions.
  • Each review text can include at least two target sentences.
  • a preset number such as 5,000
  • The trained model outputs the at least two target sentences obtained after splitting, that is, the comment text is split into at least two target sentences. The word_tokenize function in the Natural Language Toolkit (NLTK) can also be used to segment the comment text, that is, to split the comment text into at least two target sentences.
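  • The splitting step can be illustrated with a minimal, standard-library sketch. It is not the method of the application: the function name and the rule "split after ., ! or ?" are assumptions, and every non-empty sentence is treated as a target sentence because the filtering conditions are not specified. As a side note, in NLTK sentence-level splitting is usually done with sent_tokenize, while word_tokenize returns word tokens.

```python
import re


def split_into_target_sentences(review_text: str) -> list[str]:
    """Split a review into sentences after '.', '!' or '?' followed by whitespace.

    Hypothetical helper: the real filtering conditions for a "target sentence"
    are not given in the text, so every non-empty sentence is kept.
    """
    parts = re.split(r"(?<=[.!?])\s+", review_text.strip())
    return [p for p in parts if p]


sentences = split_into_target_sentences(
    "The bluetooth connects quickly. The wiper is noisy! Would buy again."
)
for s in sentences:
    print(s)
```

  • A regular expression keeps the sketch free of external downloads; a trained tokenizer would handle abbreviations and decimals more robustly.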
  • S102 Determine the target high-frequency sentence elements based on the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences.
  • Sentence elements are nouns and/or phrases.
  • a phrase is a common phrase consisting of at least two words.
  • a sentence element can be a noun or a phrase.
  • The sentence elements may be "bluetooth", "wiper", "car window", "vehicle rearview mirror", etc.
  • Term frequency refers to the number or frequency of statement elements appearing in all review texts. Similarity refers to the semantic similarity between the semantic features represented by multiple sentence elements.
  • Target high-frequency sentence elements refer to sentence elements that meet the preset target filtering conditions.
  • The sentence elements that meet the preset filtering conditions can be screened again by similarity to determine the target high-frequency sentence elements. Alternatively, the word frequency with which at least one sentence element in each target sentence appears in all review texts, together with the similarity between multiple sentence elements, can be input into a pre-trained model, which outputs the target high-frequency sentence elements.
  • The evaluation level characterizes the degree of satisfaction with the target high-frequency sentence element. For example, when the user evaluates the target high-frequency sentence element and writes the comment text, he or she can also select an evaluation level for the target high-frequency sentence element, so that the user's evaluation level for the comment text where the target high-frequency sentence element is located can be obtained.
  • the evaluation level can be Strongly Disagree, Disagree, Neither, Strongly Agree or Agree.
  • the quantitative value of importance refers to the quantitative value that can represent the importance of the target high-frequency sentence element information.
  • The evaluation levels of the multiple review texts where a target high-frequency sentence element is located can be input into a pre-trained model, which outputs the quantitative importance value associated with that element. Alternatively, for each target high-frequency sentence element, the evaluation levels of the multiple review texts where it is located can be statistically analyzed according to preset rules to determine its associated quantitative importance value.
  • Calculating the quantitative importance value associated with each target high-frequency sentence element based on the evaluation level of the review text where it is located includes: determining the evaluation score of each target high-frequency sentence element in the target sentences to which it belongs, based on the evaluation levels of the multiple review texts where it is located and the correlation between evaluation levels and evaluation scores; and using the average of those evaluation scores as the quantitative importance value.
  • the evaluation score refers to the score corresponding to the evaluation level.
  • the correlation between the evaluation grade and the evaluation score is a one-to-one correspondence.
  • For example, when the evaluation level is "Strongly Disagree", the evaluation score can be 1; when the evaluation level is "Disagree", the evaluation score can be 2; when the evaluation level is "Neither", the evaluation score can be 3; when the evaluation level is "Strongly Agree", the evaluation score can be 4; when the evaluation level is "Agree", the evaluation score can be 5.
  • For example, the target high-frequency sentence element A appears twice in review text B (in target sentences C and D) and once in review text E (in target sentence F). The evaluation level of review text B is Disagree, so its evaluation score is 2, and the evaluation level of review text E is Strongly Agree, so its evaluation score is 4. The average of the scores of element A over the target sentences to which it belongs (target sentences C, D and F) is then (2+2+4)/3; that is, the quantitative importance value of target high-frequency sentence element A is 8/3.
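  • The averaging rule in this example can be reproduced with a short sketch. The function name and the data layout (one entry per target sentence containing the element, tagged with the evaluation level of the enclosing review) are illustrative assumptions; the level-to-score mapping follows the text, with the score 1 for "Strongly Disagree" inferred from the order in which the levels are listed.

```python
from fractions import Fraction

# Evaluation level -> evaluation score, as listed in the text.
# "Strongly Disagree" -> 1 is inferred from the ordering of the levels.
LEVEL_TO_SCORE = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither": 3,
    "Strongly Agree": 4,
    "Agree": 5,
}


def importance_value(occurrences):
    """Average the scores over every target sentence that contains the element.

    occurrences: list of (target_sentence_id, evaluation_level_of_review).
    """
    scores = [LEVEL_TO_SCORE[level] for _, level in occurrences]
    return Fraction(sum(scores), len(scores))


# Element A: sentences C and D belong to review B (Disagree),
# sentence F belongs to review E (Strongly Agree).
occurrences_of_a = [("C", "Disagree"), ("D", "Disagree"), ("F", "Strongly Agree")]
value_a = importance_value(occurrences_of_a)
print(value_a)  # 8/3
```

  • Using Fraction keeps the result exact, so it matches the 8/3 of the worked example rather than a rounded float.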
  • the topic type refers to the type of topic direction represented by the semantics of target high-frequency sentence elements.
  • The theme type can be an experience theme, a car hardware theme, a product display theme or a car color theme.
  • The sentence elements of the experience theme can be "stability", "comfort", "controllability", etc.
  • The sentence elements of the car hardware theme may be, for example, "door", "window", "microphone", "display", etc.
  • The sentence elements of the product display theme may be, for example, "pressure", "traction", "strut", etc.
  • The sentence elements of the car color theme may be, for example, "white", "black", etc.
  • The word frequency of the target high-frequency sentence element refers to the total frequency with which the target high-frequency sentence element occurs in all review texts.
  • The target high-frequency sentence elements can first be filtered by quantitative importance value, topic type, and word frequency, and the quantitative importance value, topic type, and word frequency of those that meet the preset filtering conditions are visually displayed as the result of text processing; that is, the results of review text processing are visually displayed based on the quantitative importance value, the topic type to which each target high-frequency sentence element belongs, and the word frequency of each target high-frequency sentence element.
  • The review text is split into at least two target sentences; the target high-frequency sentence elements are determined based on the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts and the similarity between the multiple sentence elements included in the at least two target sentences; the quantitative importance value associated with the target high-frequency sentence elements is calculated based on the evaluation level of the comment text where they are located; and the results of the review text processing are visually displayed based on the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
  • In this way, high-frequency sentence elements can be extracted from the review text, and the review text processing results can be visually displayed based on the relevant information of the high-frequency sentence elements. This achieves more accurate and effective processing of the review text, and more accurate and effective information can be extracted and visualized to facilitate subsequent improvement of product quality.
  • The preprocessing operation can include: deleting the brackets in the review text together with the words they contain, special symbols, and preset sensitive words.
  • The preprocessing of the comment text may also include: converting all English letters in the comment text to lowercase.
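  • The preprocessing operations above might look like the following sketch. The sensitive-word list, the set of characters treated as "special symbols", and the function name are all hypothetical; the text does not define them.

```python
import re

# Hypothetical sensitive-word list; the text only says it is "preset".
SENSITIVE_WORDS = {"badword"}


def preprocess(review_text: str, sensitive_words=SENSITIVE_WORDS) -> str:
    # Convert all English letters to lowercase.
    text = review_text.lower()
    # Delete brackets together with the words they contain.
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", " ", text)
    # Delete special symbols (assumption: keep letters, digits, whitespace
    # and basic sentence punctuation).
    text = re.sub(r"[^a-z0-9\s.,!?']", " ", text)
    # Delete preset sensitive words.
    words = [w for w in text.split() if w.strip(".,!?") not in sensitive_words]
    return " ".join(words)


cleaned = preprocess("The Wiper (front) is GREAT!!! #badword @@")
print(cleaned)  # the wiper is great!!!
```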
  • Figure 2 is a flow chart of a text processing method provided in Embodiment 2 of the present application. As shown in Figure 2, the method includes:
  • S202 Determine at least one sentence element in each target sentence, and count the frequency of occurrence of the at least one sentence element in all review texts.
  • the nouns and/or phrases in each target sentence can be extracted according to preset matching rules, that is, the sentence elements in each target sentence can be determined.
  • the preset word frequency threshold refers to a preset threshold that measures the word frequency of sentence elements.
  • the preset word frequency threshold may be 50.
  • Candidate high-frequency sentence elements refer to sentence elements whose word frequency satisfies the required relationship with the preset word frequency threshold.
  • The word frequency can be compared with the preset word frequency threshold. If the word frequency of a sentence element is greater than the preset word frequency threshold, the sentence element is determined to be a candidate high-frequency sentence element.
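  • The word frequency comparison can be sketched as follows. The extraction of nouns and phrases from each target sentence is assumed to have happened already, and the function name and toy data are illustrative; a threshold of 1 is used only so that the small example produces a result (the text mentions 50 as a possible threshold).

```python
from collections import Counter


def candidate_elements(elements_per_sentence, threshold):
    """Count each sentence element over all sentences and keep those whose
    word frequency is strictly greater than the threshold."""
    counts = Counter(e for elems in elements_per_sentence for e in elems)
    return {e: n for e, n in counts.items() if n > threshold}


# Toy input: sentence elements already extracted from four target sentences.
elements_per_sentence = [
    ["bluetooth", "wiper"],
    ["bluetooth"],
    ["car window", "bluetooth"],
    ["wiper"],
]
candidates = candidate_elements(elements_per_sentence, threshold=1)
print(candidates)  # {'bluetooth': 3, 'wiper': 2}
```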
  • Similarity refers to the similarity of the semantic features represented by multiple candidate high-frequency sentence elements.
  • the target high-frequency sentence elements refer to the high-frequency sentence elements that meet the preset similarity filtering conditions among the candidate high-frequency sentence elements.
  • The similarity between each candidate high-frequency sentence element and the other candidate high-frequency sentence elements can be input directly into a pre-trained model, which outputs the target high-frequency sentence elements. Alternatively, the similarity between candidate high-frequency sentence elements can be analyzed according to preset similarity filtering conditions to determine the target high-frequency sentence elements from the candidate high-frequency sentence elements.
  • Determining the target high-frequency sentence elements from the candidate high-frequency sentence elements based on the similarity between them includes: determining the similarity between at least two candidate high-frequency sentence elements and, when the similarity is higher than the preset similarity threshold, placing the at least two candidate high-frequency sentence elements in one group; and determining the target high-frequency sentence elements from each group of candidate high-frequency sentence elements based on the word frequency with which the candidate high-frequency sentence elements in that group appear in all review texts.
  • the preset similarity threshold refers to a preset threshold that measures the degree of semantic similarity between candidate high-frequency sentence elements.
  • the number of target high-frequency sentence elements is determined to be at least one from each set of candidate high-frequency sentence elements.
  • For example, the similarity between the candidate high-frequency sentence element "pattern" and the candidate high-frequency sentence element "style" is higher than the preset similarity threshold, and the similarity between the candidate high-frequency sentence element "cloth" and the candidate high-frequency sentence element "materials" is higher than the preset similarity threshold.
  • For example, if the similarity between candidate high-frequency sentence element A and candidate high-frequency sentence element B is 0.4, the similarity between candidate high-frequency sentence element C and candidate high-frequency sentence element D is 0.7, and the similarity threshold is 0.5, then C and D are placed in one group, while A and B are not.
  • The sentence elements in each target sentence are determined, and the word frequency of each sentence element in all review texts is counted. Based on the relationship between the word frequency of each sentence element and the preset word frequency threshold, candidate high-frequency sentence elements are determined from the sentence elements in each target sentence; based on the similarity between the candidate high-frequency sentence elements, the target high-frequency sentence elements are determined from the candidate high-frequency sentence elements; finally, the quantitative importance values associated with the target high-frequency sentence elements are calculated and displayed visually. In this way, an implementable method of determining target high-frequency sentence elements from the target sentences is provided, and more accurate and effective high-frequency sentence elements can be determined.
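  • The grouping step of this embodiment can be sketched with a small union-find. Two points are assumptions rather than statements of the application: the element with the highest word frequency is taken as the representative of each group, and an element that is similar to no other element forms a group of its own. The similarity scores and frequencies below are made up for illustration; only the threshold 0.5 appears in the text.

```python
def select_targets(freq, similarity, threshold):
    """Group candidates whose pairwise similarity exceeds the threshold,
    then keep one element per group (assumption: the most frequent one)."""
    parent = {e: e for e in freq}

    def find(e):
        # Follow parent links to the group root, compressing the path.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for (a, b), sim in similarity.items():
        if sim > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for e in freq:
        groups.setdefault(find(e), []).append(e)
    return sorted(max(g, key=lambda e: freq[e]) for g in groups.values())


# Hypothetical word frequencies and pairwise semantic similarities.
freq = {"pattern": 80, "style": 120, "cloth": 95, "materials": 60, "wiper": 70}
similarity = {
    ("pattern", "style"): 0.7,
    ("cloth", "materials"): 0.8,
    ("pattern", "wiper"): 0.4,
}
targets = select_targets(freq, similarity, threshold=0.5)
print(targets)  # ['cloth', 'style', 'wiper']
```

  • Union-find makes the grouping transitive: if A is similar to B and B to C, all three land in one group even when A and C are not directly similar. Whether the application intends this is not stated.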
  • Determining the topic types to which the multiple target high-frequency sentence elements belong includes: using a clustering algorithm to cluster the target sentences containing the target high-frequency sentence elements, based on the semantic similarity between those target sentences; and determining, based on the clustering results, the topic types to which the multiple target high-frequency sentence elements belong.
  • The clustering algorithm can be the commonly used k-means clustering algorithm (k-means).
  • Theme types include: experience theme, car hardware theme, product display theme and car color theme.
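  • As an illustration of the clustering step, the sketch below runs a plain k-means on toy two-dimensional vectors. It assumes that each target sentence has already been mapped to a numeric vector capturing its semantics; how that is done is not described here, and the vectors, the deterministic initialization and the function name are hypothetical.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


# Hypothetical sentence vectors: two sentences about one topic, two about another.
sentence_vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels = kmeans(sentence_vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

  • The cluster indices carry no meaning on their own; mapping each cluster to a theme such as car hardware or car color would still require inspecting the sentences in it.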
  • Figure 3 is a flow chart of a text processing method provided in Embodiment 3 of the present application. As shown in Figure 3, the method includes:
  • the target topic type refers to the topic type that needs to be visually displayed specified by relevant personnel.
  • the target theme type can be an experience theme, a car hardware theme, a product display theme, or a car color theme.
  • The topic types to which the multiple target high-frequency sentence elements belong can be obtained, and the target high-frequency sentence elements whose topic type is the target topic type can be determined; that is, the target high-frequency sentence elements corresponding to the target topic type are screened out from the target high-frequency sentence elements.
  • The quantitative importance value of the at least one target high-frequency sentence element corresponding to the target topic type can be determined, and the elements can be sorted in descending order of quantitative importance value to determine the sorting number of each one. The word frequency with which each of these target high-frequency sentence elements appears in all review texts is then counted and used as its associated label information.
  • the sorting result refers to the sorting sequence number result obtained after sorting at least one target high-frequency sentence element corresponding to the target topic type.
  • The target high-frequency sentence elements corresponding to the target topic type can be visually displayed in the order of their sorting numbers, and the label information of each target high-frequency sentence element can also be displayed in the form of a label; that is, the text processing results corresponding to the target topic type to be visually displayed are visually displayed.
  • The target topic type to be visually displayed is obtained, and the target high-frequency sentence elements corresponding to the target topic type are screened out from the target high-frequency sentence elements. These elements are sorted according to their quantitative importance values, and the word frequency of each one is used as its associated label information. Finally, based on the sorting results and label information, the text processing results corresponding to the target topic type to be visually displayed are visually displayed. In this way, an implementable method for visually displaying text processing results is provided, which can better display the text processing results and make it easier for relevant personnel to obtain information.
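  • The filter, sort and label sequence of this embodiment can be sketched as below. The record layout, the function name and the sample values are illustrative assumptions, and the "visual display" is reduced to printed rows.

```python
def ranked_view(elements, target_topic):
    """Filter by topic, sort by importance (descending), label with frequency."""
    selected = [e for e in elements if e["topic"] == target_topic]
    selected.sort(key=lambda e: e["importance"], reverse=True)
    return [
        (rank, e["element"], f"freq={e['frequency']}")
        for rank, e in enumerate(selected, start=1)
    ]


# Hypothetical processed results for three target high-frequency sentence elements.
elements = [
    {"element": "door", "topic": "car hardware", "importance": 3.2, "frequency": 140},
    {"element": "comfort", "topic": "experience", "importance": 4.1, "frequency": 210},
    {"element": "display", "topic": "car hardware", "importance": 4.5, "frequency": 90},
]
view = ranked_view(elements, "car hardware")
for row in view:
    print(row)
```

  • The rows could be handed to any charting tool; the sorting number fixes the display order and the frequency string serves as the label.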
  • After the comment text that has undergone the replacement operation is split into at least two target sentences, the method also includes: performing semantic analysis on the at least two target sentences to determine whether the semantics of each target sentence are fluent; when a semantically disfluent target sentence exists among the at least two target sentences, analyzing the contextual relationship of the comment text where that sentence is located and identifying the misspelled words in it; and, when a misspelled word is determined to belong to the preset common wrong-form words, correcting the misspelled word according to the correct form of the common wrong-form word.
  • the target sentence may contain misspelled words.
  • The context of the comment text where the target sentence is located can be analyzed to infer which words are misspelled. Each misspelled word is then compared against a pre-stored library of common wrong-form words to determine whether it exists in that library.
  • If it does, the misspelled word can be corrected to the correct form of the corresponding common wrong-form word; that is, the misspelled word is corrected according to the correct form of the common wrong-form word.
  • The misspelled word is corrected according to a preset spelling corrector (such as the FAROO spelling corrector).
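  • The lookup against the common wrong-form word library can be sketched as a dictionary lookup. The library contents and function names are hypothetical, the semantic analysis that flags misspelled words is skipped (every word is simply looked up), and treating the preset spelling corrector as a fallback for words not found in the library is an assumption; it is represented by an optional callable rather than a real FAROO API.

```python
# Hypothetical library of common wrong-form words and their correct forms.
COMMON_WRONG_FORMS = {"blutooth": "bluetooth", "wipper": "wiper"}


def correct_word(word, fallback=None):
    """Use the library first; otherwise defer to an optional fallback corrector."""
    if word in COMMON_WRONG_FORMS:
        return COMMON_WRONG_FORMS[word]
    if fallback is not None:
        return fallback(word)
    return word


def correct_sentence(sentence, fallback=None):
    return " ".join(correct_word(w, fallback) for w in sentence.split())


fixed = correct_sentence("the blutooth and wipper work")
print(fixed)  # the bluetooth and wiper work
```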
  • Figure 4 is a structural diagram of a text processing device provided in Embodiment 4 of the present application. The text processing device provided in the embodiment of the present application can execute the text processing method provided in any embodiment of the present application, and has the functional modules and effects corresponding to the executed method.
  • the device includes:
  • the splitting module 401 is configured to split the comment text into at least two target sentences; the comment text is a text commenting on the relevant performance of the car of the preset ranking model;
  • the determination module 402 is configured to determine the target high-frequency sentence elements based on the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences; where each sentence element includes at least one of the following: nouns and phrases;
  • the calculation module 403 is configured to calculate the quantitative importance value associated with the target high-frequency sentence element based on the evaluation level of the comment text where the target high-frequency sentence element is located;
  • the visualization module 404 is configured to visually display the results of the review text processing based on the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
  • The review text is split into at least two target sentences; the target high-frequency sentence elements are determined based on the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts and the similarity between the multiple sentence elements included in the at least two target sentences; and the quantitative importance value associated with the target high-frequency sentence elements is calculated based on the evaluation level of the comment text where they are located.
  • The result of the review text processing is visually displayed based on the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
  • In this way, high-frequency sentence elements can be extracted from the review text, and the review text processing results can be visually displayed based on the relevant information of the high-frequency sentence elements. This achieves more accurate and effective processing of the review text, and more accurate and effective information can be extracted and visualized to facilitate subsequent improvement of product quality.
  • Determining module 402 may include:
  • the first determination unit is configured to determine at least one sentence element in each target sentence, and count the word frequency of the at least one sentence element appearing in all review texts;
  • the second determination unit is configured to determine a plurality of candidate high-frequency sentence elements from the plurality of sentence elements included in the at least two target sentences based on the relationship between the word frequency of each sentence element and the preset word frequency threshold;
  • the third determination unit is configured to determine the target high-frequency sentence element from the plurality of candidate high-frequency sentence elements based on the similarity between the plurality of candidate high-frequency sentence elements.
  • the third determination unit is set to:
  • the target high-frequency sentence element is determined from each group of candidate high-frequency sentence elements based on the frequency of occurrence of each candidate high-frequency sentence element in all review texts in each group of candidate high-frequency sentence elements.
  • the calculation module 403 is set to:
  • determine, according to the evaluation grade of the review text where the target high-frequency sentence element is located and the correlation between evaluation grades and evaluation scores, the evaluation score of the target high-frequency sentence element in at least one target sentence to which it belongs;
  • For each target high-frequency sentence element calculate the average of the evaluation scores of the target high-frequency sentence element in at least one target sentence to which it belongs;
  • the average value is used as the quantitative importance value of the target high-frequency sentence element association.
  • the above devices also include:
  • the topic type determination module is configured to use a clustering algorithm to cluster the at least one target sentence containing the target high-frequency sentence element, based on the semantic similarity between the at least one target sentence in which the target high-frequency sentence element is located;
  • and to determine, based on the clustering results, the theme type to which the target high-frequency sentence element belongs; the theme type includes at least one of the following: experience theme, car hardware theme, product display theme and car color theme.
  • the above devices also include:
  • a preprocessing module configured to perform semantic analysis on the at least two target sentences and determine whether the semantics of each target sentence is smooth
  • the misspelled word is corrected according to the correct form of the common wrong-form word.
  • Visualization module 404 is set to:
  • obtain the target topic type to be visually displayed, and filter out at least one target high-frequency sentence element corresponding to the target topic type from the target high-frequency sentence elements;
  • the at least one target high-frequency sentence element is sorted according to the quantitative importance value of the at least one target high-frequency sentence element corresponding to the target topic type, and the word frequency of the at least one target high-frequency sentence element is used as its associated label information;
  • the text processing results corresponding to the target topic type to be visually displayed are visually displayed.
  • FIG. 5 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present application.
  • FIG. 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of the present application.
  • Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches), and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present application described and/or claimed herein.
  • As shown in FIG. 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, wherein the memory stores a computer program executable by the at least one processor.
  • The processor 11 can perform a variety of appropriate actions and processes according to a computer program stored in the ROM 12 or a computer program loaded from the storage unit 18 into the RAM 13.
  • The RAM 13 can also store various programs and data required for the operation of the electronic device 10.
  • The processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14.
  • An input/output (I/O) interface 15 is also connected to the bus 14.
  • Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or an optical disk; and a communication unit 19, such as a network card, a modem, or a wireless communication transceiver.
  • The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
  • The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller.
  • The processor 11 performs the methods and processes described above, such as the text processing method.
  • In some embodiments, the text processing method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 18.
  • In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19.
  • Alternatively, in other embodiments, the processor 11 may be configured to perform the text processing method in any other suitable manner (e.g., by means of firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Computer programs for implementing the methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • A computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • In the context of this application, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer-readable storage media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • Alternatively, the computer-readable storage medium may be a machine-readable signal medium.
  • More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM) or flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • To provide interaction with the user, the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or middleware components (e.g., an application server), or front-end components (e.g., a user computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or in a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), blockchain network, and the Internet.
  • Computing systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private servers (VPS).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

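This application discloses a text processing method, apparatus, device, and medium. The method includes: splitting a review text into at least two target sentences; determining target high-frequency sentence elements according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences; calculating the quantitative importance value associated with each target high-frequency sentence element according to the evaluation grade of the review text in which that element is located; and visually displaying the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.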

Description

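Text processing method, apparatus, device and medium
This application claims priority to Chinese patent application No. 202210978990.0, filed with the China Patent Office on August 16, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computers, for example to a text processing method, apparatus, device, and medium.
Background
Online reviews express users' experiences, evaluations, or opinions in text form. They tend to be time-sensitive, voluminous, unstructured, and complex in content. As the volume of online review information grows rapidly, text mining of online reviews can not only help consumers make rational decisions but also guide designers in design, production, and version updates.
Therefore, how to better process review text, accurately extract the useful information it contains, and visualize that information so that the relevant personnel can improve product quality in a targeted way is a problem that urgently needs to be solved.
Summary
This application provides a text processing method, apparatus, device, and medium that can process review text more accurately and effectively, extracting more precise useful information and visualizing it.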
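According to one aspect of this application, a text processing method is provided, including:
splitting a review text into at least two target sentences, the review text being text that reviews the relevant performance of cars of preset-ranked vehicle models;
determining a target high-frequency sentence element according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences, wherein each sentence element includes at least one of the following: a noun, a phrase;
calculating the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located;
visually displaying the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.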
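According to another aspect of this application, a text processing apparatus is provided, including:
a splitting module, configured to split a review text into at least two target sentences, the review text being text that reviews the relevant performance of cars of preset-ranked vehicle models;
a determination module, configured to determine a target high-frequency sentence element according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences, wherein each sentence element includes at least one of the following: a noun, a phrase;
a calculation module, configured to calculate the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located;
a visualization module, configured to visually display the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.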
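According to another aspect of this application, an electronic device is provided, the electronic device including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the text processing method described in any embodiment of this application.
According to another aspect of this application, a computer-readable storage medium is provided. The computer-readable storage medium stores computer instructions which, when executed by a processor, implement the text processing method described in any embodiment of this application.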
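Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below.
FIG. 1 is a flowchart of a text processing method provided in Embodiment 1 of this application;
FIG. 2 is a flowchart of a text processing method provided in Embodiment 2 of this application;
FIG. 3 is a flowchart of a text processing method provided in Embodiment 3 of this application;
FIG. 4 is a structural diagram of a text processing apparatus provided in Embodiment 4 of this application;
FIG. 5 is a schematic structural diagram of an electronic device provided in Embodiment 5 of this application.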
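Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the drawings in the embodiments of this application.
The terms "first", "second", "candidate", "target", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.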
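Embodiment 1
FIG. 1 is a flowchart of a text processing method provided in Embodiment 1 of this application. This embodiment is applicable to analyzing and processing review text about car-related performance. The method may be performed by a text processing apparatus, which may be implemented in software and/or hardware and may be integrated into an electronic device capable of text processing. As shown in FIG. 1, the method includes:
S101. Split the review text into at least two target sentences.
The review text is text that reviews the relevant performance of cars of preset-ranked vehicle models. A target sentence is a sentence in the review text that satisfies the filtering conditions. Each review text may include at least two target sentences.
Optionally, a crawler tool may be used to randomly crawl user reviews from relevant car review websites. For example, a preset number (e.g., 5,000) of review texts about the top 10 ranked vehicle models may be obtained; this is how the review texts are acquired.
Optionally, the review text may be split into at least two target sentences directly at preset punctuation marks, such as question marks, periods, and exclamation marks. Alternatively, all punctuation marks other than the preset ones (question marks, periods, and exclamation marks) may first be replaced with periods, and the review text then split at the preset punctuation marks. Alternatively, the review text may be fed directly into a pre-trained model that outputs the at least two target sentences. Alternatively, the word_tokenize function in the Natural Language Toolkit (NLTK) may be used to split the review text into sentences, yielding at least two target sentences.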
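The punctuation-based splitting strategies in S101 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `split_review`, the full-width punctuation marks, and the set of "other" punctuation that is normalized to a period are all assumptions. For the toolkit-based route, note that in NLTK sentence-level splitting is provided by `sent_tokenize`, whereas `word_tokenize` returns word tokens.

```python
import re
from typing import List

# Preset sentence-ending punctuation. The text names ?, . and !;
# the full-width forms are an added assumption for Chinese reviews.
SENTENCE_END = "?.!？。！"
# Other punctuation normalized to a period in the second strategy
# (an illustrative choice; the source does not enumerate it).
OTHER_PUNCT = ";:；："


def split_review(review: str, normalize: bool = False) -> List[str]:
    """Split one review text into target sentences at preset punctuation."""
    if normalize:
        # Second strategy: replace other punctuation with periods first.
        review = re.sub(f"[{re.escape(OTHER_PUNCT)}]", ".", review)
    parts = re.split(f"[{re.escape(SENTENCE_END)}]+", review)
    # Drop empty fragments left by trailing punctuation.
    return [p.strip() for p in parts if p.strip()]
```

Calling `split_review(text, normalize=True)` corresponds to the variant that first replaces other punctuation with periods; the default corresponds to splitting directly at the preset marks.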
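S102. Determine target high-frequency sentence elements according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences.
A sentence element is a noun and/or a phrase. A phrase is a commonly used phrase made up of at least two words. A sentence element may be a single noun or a single phrase. For example, a sentence element may be "bluetooth", "wiper", "car window", or "vehicle rearview mirror".
Word frequency is the number of times a sentence element appears across all review texts. Similarity is the semantic similarity between the semantic features represented by multiple sentence elements. A target high-frequency sentence element is a sentence element that satisfies the preset target filtering conditions.
Optionally, the sentence elements whose word frequency satisfies a preset filtering condition may first be determined according to the word frequency with which the at least one sentence element in each target sentence appears in all review texts; these elements are then filtered again according to the similarity between them to determine the target high-frequency sentence elements. Alternatively, the word frequency of the at least one sentence element in each target sentence across all review texts, together with the similarity between the multiple sentence elements, may be fed into a pre-trained model that outputs the target high-frequency sentence elements.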
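S103. Calculate the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located.
The evaluation grade characterizes the degree of satisfaction with the target high-frequency sentence element. For example, when a user evaluates a target high-frequency sentence element and writes the review text, the user can also tick an evaluation grade, so the evaluation grade of the review text containing the target high-frequency sentence element can be obtained. The evaluation grade may be Strongly Disagree, Disagree, Neither, Strongly Agree, or Agree. The quantitative importance value is a numerical value that characterizes the importance of the information carried by the target high-frequency sentence element.
Optionally, for each target high-frequency sentence element, the evaluation grades of the multiple review texts containing that element may be fed into a pre-trained model that outputs the quantitative importance value associated with the element. Alternatively, for each target high-frequency sentence element, the evaluation grades of the multiple review texts containing that element may be statistically analyzed according to preset rules to determine the associated quantitative importance value. Optionally, calculating the quantitative importance value associated with each target high-frequency sentence element according to the evaluation grade of the review text in which it is located includes: determining the evaluation score of each target high-frequency sentence element in the at least two target sentences to which it belongs, according to the evaluation grades of the review texts containing the element and the correspondence between evaluation grades and evaluation scores; for each target high-frequency sentence element, calculating the average of its scores over the at least two target sentences to which it belongs; and using the average as the quantitative importance value associated with the element.
The evaluation score is the score corresponding to an evaluation grade. Evaluation grades and evaluation scores are in one-to-one correspondence: when the evaluation grade is Strongly Disagree, the evaluation score may be 1; when it is Disagree, the score may be 2; when it is Neither, the score may be 3; when it is Strongly Agree, the score may be 4; and when it is Agree, the score may be 5.
For example, target high-frequency sentence element A appears twice in review text B, in target sentence C and target sentence D, and once in review text E, in target sentence F. The evaluation grade of review text B is Disagree, so its evaluation score is 2; the evaluation grade of review text E is Strongly Agree, so its evaluation score is 4. The average score of element A over the target sentences to which it belongs (target sentences C, D, and F) is (2+2+4)/3, so the quantitative importance value of element A is 8/3.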
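The averaging rule and the worked example for element A can be reproduced with a short sketch. The grade-to-score mapping is copied from the text; the function name `importance_values` and the input shape (one pair per target sentence in which an element occurs) are assumptions made for illustration.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

# Grade-to-score mapping exactly as listed in the text.
GRADE_SCORE = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither": 3,
    "Strongly Agree": 4,
    "Agree": 5,
}


def importance_values(occurrences: List[Tuple[str, str]]) -> Dict[str, Fraction]:
    """Average the review-level score over every target sentence an element occurs in.

    `occurrences` holds one (element, grade_of_containing_review) pair per
    target sentence in which the element appears (hypothetical input shape).
    """
    totals: Dict[str, int] = {}
    counts: Dict[str, int] = {}
    for element, grade in occurrences:
        totals[element] = totals.get(element, 0) + GRADE_SCORE[grade]
        counts[element] = counts.get(element, 0) + 1
    return {e: Fraction(totals[e], counts[e]) for e in totals}


# Element A: sentences C and D in review B (Disagree),
# sentence F in review E (Strongly Agree).
example = importance_values([
    ("A", "Disagree"),
    ("A", "Disagree"),
    ("A", "Strongly Agree"),
])
```

Exact fractions are used so that the result for element A matches the 8/3 of the example rather than a rounded float.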
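S104. Visually display the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
The topic type is the type of topic direction represented by the semantics of a target high-frequency sentence element. The topic type may be the perception and experience topic, the car hardware topic, the product display topic, or the car color topic. Sentence elements of the perception and experience topic may be, for example, "stability", "comfort", and "controllability". Sentence elements of the car hardware topic may be, for example, "door", "window", "microphone", and "display". Sentence elements of the product display topic may be, for example, "pressure", "traction", and "strut". Sentence elements of the car color topic may be, for example, "white" and "black". The word frequency of a target high-frequency sentence element is the total number of times it appears in all review texts.
Optionally, the quantitative importance value, topic type, and word frequency of each target high-frequency sentence element may be taken as the text processing result and visually displayed in a preset display mode. Alternatively, the target high-frequency sentence elements may first be filtered according to their quantitative importance values, topic types, and word frequencies, and the quantitative importance values, topic types, and word frequencies of those that satisfy the preset filtering conditions taken as the text processing result and visually displayed.
Processing the review text and visualizing the result makes it easier for the relevant personnel to assess how users feel about the product and to improve product quality in a targeted way, thereby improving the user experience and the perceived product quality.
In this embodiment of the application, the review text is split into at least two target sentences; target high-frequency sentence elements are determined according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences; the quantitative importance value associated with each target high-frequency sentence element is calculated according to the evaluation grade of the review text in which it is located; and the result of processing the review text is visually displayed according to the quantitative importance value, the topic type of the target high-frequency sentence element, and its word frequency. In this way, high-frequency sentence elements can be extracted from the review text and the processing result visualized on the basis of information about those elements, so that review text is processed more accurately and effectively, more precise useful information is extracted and visualized, and subsequent improvement of product quality is made easier.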
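Optionally, before the review text is split, the acquired review text may first be preprocessed. The preprocessing may include deleting parentheses and the words they enclose, special symbols, and preset sensitive words from the review text.
Parenthesized sentences are deleted because they are usually explanatory supplements to the review and carry no substantive meaning. Deleting special symbols and preset sensitive words makes it easier to identify the topic of each target high-frequency sentence element accurately later on and improves the efficiency of text processing.
Optionally, when the review text is in English, the preprocessing may further include converting all English letters in the review text to lowercase.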
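The preprocessing operations described above can be sketched as a single cleaning function. This is only an illustration under stated assumptions: the function name `preprocess`, the particular characters treated as "special symbols", and the whole-word matching of sensitive words are not specified by the source.

```python
import re
from typing import Iterable


def preprocess(review: str, sensitive_words: Iterable[str] = (), lowercase: bool = True) -> str:
    """Delete bracketed asides, sensitive words and special symbols, then lowercase."""
    # Remove parentheses (half- and full-width) together with their contents.
    text = re.sub(r"\([^()]*\)|（[^（）]*）", " ", review)
    # Remove preset sensitive words (whole-word, case-insensitive; an assumption).
    for word in sensitive_words:
        text = re.sub(rf"\b{re.escape(word)}\b", " ", text, flags=re.IGNORECASE)
    # Remove special symbols; which symbols count as "special" is an assumption.
    text = re.sub(r"[#@$%^&*_~|<>\\/]+", " ", text)
    if lowercase:
        text = text.lower()
    # Collapse the whitespace left behind by the deletions.
    return re.sub(r"\s+", " ", text).strip()
```

Sentence-ending punctuation is deliberately left in place so that the cleaned text can still be split into target sentences afterwards.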
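Embodiment 2
FIG. 2 is a flowchart of a text processing method provided in Embodiment 2 of this application. As shown in FIG. 2, the method includes:
S201. Split the review text into at least two target sentences.
S202. Determine at least one sentence element in each target sentence, and count the word frequency with which the at least one sentence element appears in all review texts.
Optionally, the nouns and/or phrases in each target sentence may be extracted according to preset matching rules; these are the sentence elements of that target sentence.
Optionally, once the sentence elements of all target sentences of all review texts have been determined, the number of times each sentence element appears in all target sentences of all review texts may be counted and used as the word frequency of that sentence element.
S203. Determine multiple candidate high-frequency sentence elements from the multiple sentence elements included in the at least two target sentences, according to how the word frequency of each sentence element compares with a preset word frequency threshold.
The preset word frequency threshold is a threshold set in advance for gauging the word frequency of a sentence element; for example, it may be 50. A candidate high-frequency sentence element is a sentence element whose word frequency satisfies the required relationship to the preset threshold.
Optionally, after the word frequency of each sentence element is determined, it may be compared with the preset word frequency threshold; when the word frequency of a sentence element is greater than the threshold, the sentence element is determined to be a candidate high-frequency sentence element.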
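Steps S202 and S203 amount to counting and thresholding, which can be sketched as below. The function name `candidate_elements` and the input shape (one list of already-extracted elements per target sentence) are assumptions; extraction of nouns and phrases is taken as given.

```python
from collections import Counter
from typing import Dict, Iterable, List


def candidate_elements(
    elements_per_sentence: Iterable[List[str]], threshold: int = 50
) -> Dict[str, int]:
    """Count each element over all target sentences and keep those above the threshold."""
    freq: Counter = Counter()
    for elements in elements_per_sentence:
        freq.update(elements)  # every occurrence in every sentence counts
    # Strictly greater than the threshold, as in the text.
    return {e: n for e, n in freq.items() if n > threshold}
```

The default threshold of 50 follows the example value given in the text; a smaller value is convenient when trying the function on a handful of sentences.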
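S204. Determine the target high-frequency sentence elements from the multiple candidate high-frequency sentence elements according to the similarity between them.
Similarity is the similarity of the semantic features represented by the candidate high-frequency sentence elements. A target high-frequency sentence element is a candidate high-frequency sentence element that satisfies the preset similarity filtering condition.
Optionally, the similarities between each candidate high-frequency sentence element and the other candidates may be fed directly into a pre-trained model that outputs the target high-frequency sentence elements. Alternatively, the similarities between candidates may be analyzed against the preset similarity filtering condition to determine the target high-frequency sentence elements. Optionally, determining the target high-frequency sentence elements from the candidates according to the similarity between them includes: determining the similarity between at least two candidate high-frequency sentence elements and, when the similarity is higher than a preset similarity threshold, placing those candidates in one group; and determining the target high-frequency sentence element from each group according to the word frequency with which each candidate in the group appears in all review texts.
The preset similarity threshold is a threshold set in advance for gauging the degree of semantic similarity between candidate high-frequency sentence elements. At least one target high-frequency sentence element is determined from each group of candidates.
For example, the similarity between the candidates "pattern" and "style" is higher than the preset similarity threshold, as is the similarity between the candidates "cloth" and "materials".
For example, suppose the similarity between candidates A and B is 0.4, the similarity between candidates C and D is 0.7, and the similarity threshold is 0.5. Candidates C and D then satisfy the preset similarity filtering condition and are placed in one group. If the word frequency of C is 50 and that of D is 70, the target high-frequency sentence element of this group is D.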
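The grouping rule in S204 can be sketched with a small union-find structure. Several points here are assumptions rather than statements of the source: the function name `select_targets`, the use of transitive grouping, the choice of the single most frequent member per group, and the treatment of a candidate that is similar to no other candidate as a group of one that is kept. The frequencies of A and B in the example are also invented, since the text gives none.

```python
from typing import Dict, List, Tuple


def select_targets(
    freq: Dict[str, int],
    similarities: Dict[Tuple[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Group candidates whose pairwise similarity exceeds the threshold and
    keep the most frequent member of each group."""
    parent = {e: e for e in freq}

    def find(e: str) -> str:
        while parent[e] != e:
            parent[e] = parent[parent[e]]  # path halving
            e = parent[e]
        return e

    for (a, b), sim in similarities.items():
        if sim > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, List[str]] = {}
    for e in freq:
        groups.setdefault(find(e), []).append(e)
    # Assumption: a candidate similar to no other forms a group of one and is kept.
    return sorted(max(members, key=freq.get) for members in groups.values())


targets = select_targets(
    {"A": 60, "B": 55, "C": 50, "D": 70},  # frequencies of A and B are hypothetical
    {("A", "B"): 0.4, ("C", "D"): 0.7},
)
```

With the values from the example, C and D fall into one group and D is selected because its word frequency (70) exceeds that of C (50).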
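S205. Calculate the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located.
S206. Visually display the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
In this embodiment of the application, the sentence elements in each target sentence are determined and the word frequency of each sentence element in all review texts is counted; candidate high-frequency sentence elements are determined from the sentence elements according to how each word frequency compares with the preset word frequency threshold; target high-frequency sentence elements are determined from the candidates according to the similarity between them; and finally the quantitative importance values associated with the target high-frequency sentence elements are determined and visualized. This provides a practicable way of determining target high-frequency sentence elements from target sentences, so that more accurate and effective high-frequency sentence elements can be identified.
Optionally, determining the topic types to which the target high-frequency sentence elements belong includes: using a clustering algorithm to cluster the target sentences containing target high-frequency sentence elements, based on the semantic similarity between those target sentences; and determining the topic types of the target high-frequency sentence elements according to the clustering result.
The clustering algorithm may be the commonly used k-means clustering algorithm (k-means). The topic types include the perception and experience topic, the car hardware topic, the product display topic, and the car color topic.
Optionally, the semantic features of the target sentences containing the target high-frequency sentence elements may be determined first, followed by the semantic similarity between those target sentences. A clustering algorithm then groups semantically similar target sentences into one class, so that all target sentences end up in four classes: the perception and experience topic, the car hardware topic, the product display topic, and the car color topic. From the clustering result, the topic type of a target high-frequency sentence element can be determined to be the topic type of the target sentence to which it belongs.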
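The clustering step can be illustrated with a plain k-means routine. This is a sketch under strong assumptions: sentences are taken to be already encoded as numeric vectors (the source does not say how semantic features are obtained), the vectors below are invented, initial centroids are simply the first k points, and mapping a cluster index to a named topic is left to the analyst. In practice a library implementation such as scikit-learn's `KMeans` would normally be used instead.

```python
from typing import List, Sequence


def kmeans(points: Sequence[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster index per point.

    Initial centroids are the first k points, a deterministic simplification
    chosen for illustration.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 2-D sentence vectors: two sentences on one topic, two on another.
vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.1, 4.9)]
labels = kmeans(vectors, k=2)
```

A target high-frequency sentence element would then inherit the topic of the cluster that contains its target sentences; with the four topic types of the text, `k` would be 4.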
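Embodiment 3
FIG. 3 is a flowchart of a text processing method provided in Embodiment 3 of this application. As shown in FIG. 3, the method includes:
S301. Split the review text into at least two target sentences.
S302. Determine target high-frequency sentence elements according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences.
S303. Calculate the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located.
S304. Obtain the target topic type to be visually displayed, and filter out from the target high-frequency sentence elements those corresponding to the target topic type.
The target topic type is the topic type that the relevant personnel designate for visual display. It may be the perception and experience topic, the car hardware topic, the product display topic, or the car color topic.
Optionally, the topic types of the target high-frequency sentence elements may be obtained and the elements whose topic type is the target topic type identified; these are the target high-frequency sentence elements corresponding to the target topic type.
S305. Sort the at least one target high-frequency sentence element corresponding to the target topic type according to its quantitative importance value, and use the word frequency of the at least one target high-frequency sentence element as its associated label information.
Optionally, the quantitative importance values of the at least one target high-frequency sentence element corresponding to the target topic type may be determined and the elements sorted in descending order of that value, giving each element a rank. The word frequency with which each of these elements appears in all review texts is also counted and used as the element's associated label information.
S306. Visually display the text processing result corresponding to the target topic type according to the sorting result and the label information.
The sorting result is the set of ranks obtained by sorting the at least one target high-frequency sentence element corresponding to the target topic type.
Optionally, the target high-frequency sentence elements corresponding to the target topic type may be displayed in rank order according to the sorting result, with the label information of each element also displayed in the form of a label; this is how the text processing result corresponding to the target topic type is visually displayed.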
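Steps S304 to S306 can be sketched as a filter, a sort, and a labelled listing. The `Element` record, the function name `ranked_rows`, the sample values, and the text-row output format are all illustrative assumptions; the source does not prescribe a particular display form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    name: str
    topic: str
    importance: float
    frequency: int


def ranked_rows(elements: List[Element], target_topic: str) -> List[str]:
    """Filter by topic, sort by importance (descending), label with frequency."""
    chosen = [e for e in elements if e.topic == target_topic]
    chosen.sort(key=lambda e: e.importance, reverse=True)
    return [
        f"{rank}. {e.name} [freq={e.frequency}] importance={e.importance:.2f}"
        for rank, e in enumerate(chosen, start=1)
    ]


rows = ranked_rows(
    [
        Element("door", "car hardware", 3.2, 120),
        Element("white", "car color", 4.5, 80),
        Element("display", "car hardware", 4.1, 95),
    ],
    "car hardware",
)
```

Each row carries the rank from the sorting result and the word frequency as the label, which is the information a chart or table renderer would need.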
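In this embodiment of the application, the target topic type to be visually displayed is obtained and the target high-frequency sentence elements corresponding to it are filtered out; these elements are sorted according to their quantitative importance values, with the word frequency of each element used as its associated label information; and finally the text processing result corresponding to the target topic type is visually displayed according to the sorting result and the label information. This provides a practicable way of visualizing the text processing result, so that the result is presented more clearly and the relevant personnel can obtain the information more easily.
Optionally, after the review text on which the replacement operation has been performed is split into at least two target sentences, the method further includes: performing semantic analysis on the at least two target sentences to determine whether the semantics of each target sentence are coherent; when a semantically incoherent target sentence exists among them, analyzing the context of the review text containing that sentence and identifying the misspelled word in it; and when the misspelled word belongs to a preset set of common misspellings, correcting it according to the correct form of that common misspelling.
Optionally, when the semantics of a target sentence are found to be incoherent, the sentence may be assumed to contain a misspelled word. The context of the review text containing the sentence can then be analyzed to infer the misspelled word, which is compared against a pre-stored lexicon of common misspellings to determine whether it appears there. If it does, the misspelled word is corrected to the correct form recorded for that common misspelling. If it does not, the misspelled word is corrected using a preset spelling corrector (such as the FAROO spelling corrector).
Optionally, the target sentence may also be analyzed directly to determine whether it contains a misspelled word; if it does, the misspelled word is corrected using a preset spelling corrector (such as the FAROO spelling corrector).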
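The two-stage correction (lexicon lookup first, generic corrector as fallback) can be sketched as follows. The lexicon and vocabulary contents are invented, and the fallback uses Python's standard `difflib.get_close_matches` purely as a stand-in; it is not the FAROO corrector named in the text, and detection of the misspelled word from context is assumed to have happened already.

```python
import difflib
from typing import Dict, Sequence

# Hypothetical lexicon of common misspellings and their correct forms.
COMMON_MISSPELLINGS: Dict[str, str] = {"blutooth": "bluetooth", "wipper": "wiper"}
# Hypothetical vocabulary used by the stand-in fallback corrector.
VOCABULARY: Sequence[str] = ("bluetooth", "wiper", "window", "display", "comfort")


def correct_word(word: str) -> str:
    """Look the word up in the common-misspelling lexicon, else fall back."""
    if word in COMMON_MISSPELLINGS:
        return COMMON_MISSPELLINGS[word]
    # Stand-in for the preset spelling corrector: closest vocabulary entry, if any.
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word
```

Checking the lexicon first keeps frequent, domain-specific errors on a fast and predictable path, while the fallback handles words the lexicon has not seen.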
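Embodiment 4
FIG. 4 is a structural diagram of a text processing apparatus provided in Embodiment 4 of this application. The text processing apparatus provided in this embodiment can perform the text processing method provided in any embodiment of this application, and has the functional modules and effects corresponding to that method.
As shown in FIG. 4, the apparatus includes:
a splitting module 401, configured to split a review text into at least two target sentences, the review text being text that reviews the relevant performance of cars of preset-ranked vehicle models;
a determination module 402, configured to determine a target high-frequency sentence element according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences, wherein each sentence element includes at least one of the following: a noun, a phrase;
a calculation module 403, configured to calculate the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located;
a visualization module 404, configured to visually display the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
In this embodiment of the application, the review text is split into at least two target sentences; target high-frequency sentence elements are determined according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences; the quantitative importance value associated with each target high-frequency sentence element is calculated according to the evaluation grade of the review text in which it is located; and the result of processing the review text is visually displayed according to the quantitative importance value, the topic type of the target high-frequency sentence element, and its word frequency. In this way, high-frequency sentence elements can be extracted from the review text and the processing result visualized on the basis of information about those elements, so that review text is processed more accurately and effectively, more precise useful information is extracted and visualized, and subsequent improvement of product quality is made easier.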
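The determination module 402 may include:
a first determination unit, configured to determine at least one sentence element in each target sentence and count the word frequency with which the at least one sentence element appears in all review texts;
a second determination unit, configured to determine multiple candidate high-frequency sentence elements from the multiple sentence elements included in the at least two target sentences, according to how the word frequency of each sentence element compares with the preset word frequency threshold;
a third determination unit, configured to determine the target high-frequency sentence element from the multiple candidate high-frequency sentence elements according to the similarity between them.
The third determination unit is configured to:
determine the similarity between at least two candidate high-frequency sentence elements and, when the similarity is higher than a preset similarity threshold, place the at least two candidate high-frequency sentence elements in one group;
determine the target high-frequency sentence element from each group of candidate high-frequency sentence elements according to the word frequency with which each candidate in the group appears in all review texts.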
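The calculation module 403 is configured to:
determine the evaluation score of the target high-frequency sentence element in the at least one target sentence to which it belongs, according to the evaluation grade of the review text in which the target high-frequency sentence element is located and the correspondence between evaluation grades and evaluation scores;
for each target high-frequency sentence element, calculate the average of its evaluation scores over the at least one target sentence to which it belongs;
use the average as the quantitative importance value associated with the target high-frequency sentence element.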
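The above apparatus further includes:
a topic type determination module, configured to use a clustering algorithm to cluster the at least one target sentence containing the target high-frequency sentence element, based on the semantic similarity between the at least one target sentence in which the target high-frequency sentence element is located;
and to determine, according to the clustering result, the topic type to which the target high-frequency sentence element belongs; the topic type includes at least one of the following: perception and experience topic, car hardware topic, product display topic, and car color topic.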
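The above apparatus further includes:
a preprocessing module, configured to perform semantic analysis on the at least two target sentences and determine whether the semantics of each target sentence are coherent;
when a semantically incoherent target sentence exists among the at least two target sentences, analyze the context of the review text containing the semantically incoherent target sentence and identify the misspelled word in that sentence;
when the misspelled word belongs to a preset set of common misspellings, correct the misspelled word according to the correct form of that common misspelling.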
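The visualization module 404 is configured to:
obtain the target topic type to be visually displayed, and filter out from the target high-frequency sentence elements at least one target high-frequency sentence element corresponding to the target topic type;
sort the at least one target high-frequency sentence element corresponding to the target topic type according to its quantitative importance value, and use the word frequency of the at least one target high-frequency sentence element as its associated label information;
visually display the text processing result corresponding to the target topic type according to the sorting result and the label information.
In the technical solution of this application, the acquisition, storage, use, and processing of review texts and related data comply with the relevant national laws and regulations.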
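Embodiment 5
FIG. 5 is a schematic structural diagram of an electronic device provided in Embodiment 5 of this application. FIG. 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of this application. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of this application described and/or claimed herein.
As shown in FIG. 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, wherein the memory stores a computer program executable by the at least one processor. The processor 11 can perform a variety of appropriate actions and processes according to a computer program stored in the ROM 12 or a computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to each other via the bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or an optical disk; and a communication unit 19, such as a network card, a modem, or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The processor 11 performs the methods and processes described above, such as the text processing method.
In some embodiments, the text processing method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the text processing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the text processing method in any other suitable manner (e.g., by means of firmware).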
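Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Computer programs for implementing the methods of this application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the computer program, when executed by the processor, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this application, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. Computer-readable storage media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM) or flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.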
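To provide interaction with the user, the systems and techniques described herein may be implemented on an electronic device having: a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or middleware components (e.g., an application server), or front-end components (e.g., a user computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or in a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), blockchain network, and the Internet.
Computing systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private servers (VPS).
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in this application may be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this application is achieved; no limitation is imposed here.
The above specific embodiments do not limit the scope of protection of this application.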

Claims (10)

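  1. A text processing method, comprising:
    splitting a review text into at least two target sentences, the review text being text that reviews the relevant performance of cars of preset-ranked vehicle models;
    determining a target high-frequency sentence element according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences, wherein each sentence element includes at least one of the following: a noun, a phrase;
    calculating the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located;
    visually displaying the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.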
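  2. The method according to claim 1, wherein determining the target high-frequency sentence element according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences, comprises:
    determining at least one sentence element in each target sentence, and counting the word frequency with which the at least one sentence element appears in all review texts;
    determining multiple candidate high-frequency sentence elements from the multiple sentence elements included in the at least two target sentences, according to how the word frequency of each sentence element compares with a preset word frequency threshold;
    determining the target high-frequency sentence element from the multiple candidate high-frequency sentence elements according to the similarity between the multiple candidate high-frequency sentence elements.
  3. The method according to claim 2, wherein determining the target high-frequency sentence element from the multiple candidate high-frequency sentence elements according to the similarity between the multiple candidate high-frequency sentence elements comprises:
    determining the similarity between at least two candidate high-frequency sentence elements and, when the similarity is higher than a preset similarity threshold, placing the at least two candidate high-frequency sentence elements in one group;
    determining the target high-frequency sentence element from each group of candidate high-frequency sentence elements according to the word frequency with which each candidate high-frequency sentence element in the group appears in all review texts.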
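  4. The method according to claim 1, wherein calculating the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located comprises:
    determining the evaluation score of the target high-frequency sentence element in the at least one target sentence to which it belongs, according to the evaluation grade of the review text in which the target high-frequency sentence element is located and the correspondence between evaluation grades and evaluation scores;
    for each target high-frequency sentence element, calculating the average of its evaluation scores over the at least one target sentence to which it belongs;
    using the average as the quantitative importance value associated with the target high-frequency sentence element.
  5. The method according to claim 1, further comprising determining the topic type to which the target high-frequency sentence element belongs, which comprises:
    using a clustering algorithm to cluster the at least one target sentence containing the target high-frequency sentence element, based on the semantic similarity between the at least one target sentence in which the target high-frequency sentence element is located;
    determining, according to the clustering result, the topic type to which the target high-frequency sentence element belongs; the topic type includes at least one of the following: perception and experience topic, car hardware topic, product display topic, and car color topic.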
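  6. The method according to any one of claims 1-5, further comprising, before the splitting of the review text into at least two target sentences: performing a replacement operation on the review text;
    and further comprising, after the splitting of the review text into at least two target sentences:
    performing semantic analysis on the at least two target sentences to determine whether the semantics of each target sentence are coherent;
    when a semantically incoherent target sentence exists among the at least two target sentences, analyzing the context of the review text containing the semantically incoherent target sentence and identifying the misspelled word in that sentence;
    when the misspelled word belongs to a preset set of common misspellings, correcting the misspelled word according to the correct form of that common misspelling.
  7. The method according to claim 1, wherein visually displaying the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element comprises:
    obtaining the target topic type to be visually displayed, and filtering out from the target high-frequency sentence elements at least one target high-frequency sentence element corresponding to the target topic type;
    sorting the at least one target high-frequency sentence element corresponding to the target topic type according to its quantitative importance value, and using the word frequency of the at least one target high-frequency sentence element as its associated label information;
    visually displaying the text processing result corresponding to the target topic type according to the sorting result and the label information.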
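  8. A text processing apparatus, comprising:
    a splitting module, configured to split a review text into at least two target sentences, the review text being text that reviews the relevant performance of cars of preset-ranked vehicle models;
    a determination module, configured to determine a target high-frequency sentence element according to the word frequency with which at least one sentence element in each of the at least two target sentences appears in all review texts, and the similarity between the multiple sentence elements included in the at least two target sentences, wherein each sentence element includes at least one of the following: a noun, a phrase;
    a calculation module, configured to calculate the quantitative importance value associated with the target high-frequency sentence element according to the evaluation grade of the review text in which the target high-frequency sentence element is located;
    a visualization module, configured to visually display the result of processing the review text according to the quantitative importance value, the topic type to which the target high-frequency sentence element belongs, and the word frequency of the target high-frequency sentence element.
  9. An electronic device, the electronic device comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the text processing method according to any one of claims 1-7.
  10. A computer-readable storage medium, the computer-readable storage medium storing computer instructions which, when executed by a processor, implement the text processing method according to any one of claims 1-7.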
PCT/CN2023/112841 2022-08-16 2023-08-14 Text processing method, apparatus, device and medium WO2024037483A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210978990.0A CN115309867A (zh) 2022-08-16 2022-08-16 Text processing method, apparatus, device and medium
CN202210978990.0 2022-08-16

Publications (1)

Publication Number Publication Date
WO2024037483A1 true WO2024037483A1 (zh) 2024-02-22

Family

ID=83861957

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/112841 WO2024037483A1 (zh) 2022-08-16 2023-08-14 Text processing method, apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN115309867A (zh)
WO (1) WO2024037483A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309867A (zh) * 2022-08-16 2022-11-08 中国第一汽车股份有限公司 Text processing method, apparatus, device and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408809A (zh) * 2018-09-25 2019-03-01 天津大学 Word-vector-based sentiment analysis method for automotive product reviews
CN110175325A (zh) * 2019-04-26 2019-08-27 南京邮电大学 Review analysis method and visual interactive interface based on word vectors and syntactic features
US20200020000A1 (en) * 2018-07-16 2020-01-16 Ebay Inc. Generating product descriptions from user reviews
CN115309867A (zh) * 2022-08-16 2022-11-08 中国第一汽车股份有限公司 Text processing method, apparatus, device and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200020000A1 (en) * 2018-07-16 2020-01-16 Ebay Inc. Generating product descriptions from user reviews
CN109408809A (zh) * 2018-09-25 2019-03-01 天津大学 Word-vector-based sentiment analysis method for automotive product reviews
CN110175325A (zh) * 2019-04-26 2019-08-27 南京邮电大学 Review analysis method and visual interactive interface based on word vectors and syntactic features
CN115309867A (zh) * 2022-08-16 2022-11-08 中国第一汽车股份有限公司 Text processing method, apparatus, device and medium

Also Published As

Publication number Publication date
CN115309867A (zh) 2022-11-08

Similar Documents

Publication Publication Date Title
US11093854B2 (en) Emoji recommendation method and device thereof
US10713571B2 (en) Displaying quality of question being asked a question answering system
US10691770B2 (en) Real-time classification of evolving dictionaries
EP2664997B1 (en) System and method for resolving named entity coreference
WO2021093755A1 (zh) Question matching method and apparatus, and question reply method and apparatus
WO2024037483A1 (zh) Text processing method, apparatus, device and medium
CN104850617A (zh) Short text processing method and apparatus
US11361165B2 (en) Methods and systems for topic detection in natural language communications
JP7221526B2 (ja) Analysis method, analysis device, and analysis program
US20220198358A1 (en) Method for generating user interest profile, electronic device and storage medium
CN115409039A (zh) Method, apparatus, electronic device and medium for analyzing benchmark vehicle model data
US20200401767A1 (en) Summary evaluation device, method, program, and storage medium
US11989677B2 (en) Framework for early warning of domain-specific events
Mehta et al. Sentiment analysis on product reviews using Hadoop
CN115577109A (zh) Text classification method, apparatus, electronic device and storage medium
CN115048523A (zh) Text classification method, apparatus, device and storage medium
KR20220024251A (ko) Method and apparatus for building an event library, electronic device, and computer-readable medium
CN113792546A (zh) Corpus construction method, apparatus, device and storage medium
CN107590163A (zh) Method, apparatus and system for text feature selection
CN112926297A (zh) Method, apparatus, device and storage medium for processing information
US11907668B2 (en) Method for selecting annotated sample, apparatus, electronic device and storage medium
US20230274088A1 (en) Sentiment parsing method, electronic device, and storage medium
US11609957B2 (en) Document processing device, method of controlling document processing device, and non-transitory computer-readable recording medium containing control program
CN114186552B (zh) Text analysis method, apparatus, device and computer storage medium
CN113656393B (zh) Data processing method, apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23854384

Country of ref document: EP

Kind code of ref document: A1