CN114757187A - Intelligent device and effective semantic word extraction method - Google Patents

Info

Publication number
CN114757187A
Authority
CN
China
Prior art keywords
semantic
text
words
word
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210455759.3A
Other languages
Chinese (zh)
Inventor
李俊彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Electronic Technology Wuhan Co ltd
Original Assignee
Hisense Electronic Technology Wuhan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Electronic Technology Wuhan Co ltd
Priority to CN202210455759.3A
Publication of CN114757187A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an intelligent device and an effective semantic word extraction method. The intelligent device comprises a storage module and a processing module, and the processing module is configured to: acquire a text to be extracted; perform word segmentation on the text to be extracted to obtain a word set comprising a plurality of semantic words; replace the semantic words in the word set with generic tokens to generate an annotated text set; input the text to be extracted and the annotated text set into a semantic extraction model; and obtain the semantic similarity output by the semantic extraction model and filter the semantic words in the word set according to a similarity threshold to obtain effective semantic words. In this way, key semantic words that influence semantic understanding can be extracted from the user's query text, which helps the search engine better understand the user's intention, helps the intelligent device return accurate search results, and improves the user experience.

Description

Intelligent device and effective semantic word extraction method
Technical Field
The application relates to the technical field of natural language processing, in particular to an intelligent device and an effective semantic word extraction method.
Background
Natural language processing is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language, and it integrates linguistics, computer science and mathematics. Because research in this field concerns natural language, that is, the language people use every day, it is closely related to linguistics but differs from it in important ways. Natural language processing does not study natural language in general; rather, it develops computer systems, and particularly software systems, that can effectively implement natural language communication. It is thus part of computer science. Natural language processing is mainly applied to machine translation, public opinion monitoring, automatic summarization, viewpoint extraction, text classification, question answering, text semantic comparison, speech recognition, Chinese OCR and the like.
In the consumer smart device field, users' query statements tend to be colloquial and carry personal and regional habits; for example, speakers from Shandong often phrase queries as inverted sentences. Downstream understanding of the user's intention is generally performed directly on the queried sentence. Some manufacturers also use classification algorithms that incorporate business-specific part-of-speech information, but such methods often misclassify when the query contains many meaningless or ambiguous words. In smart device search, the titles of search targets are increasingly diverse and shaped by internet usage, so they are full of coined words, exaggerated vocabulary and other meaningless words, which seriously affects the accuracy and efficiency of searching. The common approach in this field directly matches the user query against the search target titles and returns the highest-scoring items. This approach falls short mainly because, on the one hand, the meaningful words in user queries or titles cannot be analyzed effectively and, on the other hand, training data with wide coverage is difficult to collect.
Disclosure of Invention
The application provides an intelligent device and an effective semantic word extraction method, and aims to solve the problem that the existing intelligent device searching method cannot effectively analyze meaningful words in user query or titles.
In one aspect, the present application provides a smart device, comprising:
a storage module configured to store a semantic extraction model;
a processing module configured to:
acquiring a text to be extracted;
performing word segmentation on a text to be extracted to obtain a word set, wherein the word set comprises a plurality of semantic words;
replacing semantic words in the word set with generic tokens to generate a set of annotated text;
inputting the text to be extracted and the labeled text set into a semantic extraction model, wherein the semantic extraction model is generated by training a training sample set and a labeled sample set, and the training sample set comprises training sentences with semantic labels; the labeled sample set comprises labeled sample sentences with labeled probabilities, and the labeled sample sentences are sentences formed by replacing keywords in the training sentences with general labels;
and obtaining semantic similarity output by the semantic extraction model, and filtering semantic words in the word set according to a similarity threshold value to obtain effective semantic words.
On the other hand, the application also provides an effective semantic word extraction method, which comprises the following steps:
acquiring a text to be extracted;
performing word segmentation on a text to be extracted to obtain a word set, wherein the word set comprises a plurality of semantic words;
replacing semantic words in the word set with generic tokens to generate a set of annotated text;
inputting the text to be extracted and the labeled text set into a semantic extraction model, wherein the semantic extraction model is generated by training a training sample set and a labeled sample set, and the training sample set comprises training sentences with semantic labels; the labeled sample set comprises labeled sample sentences with labeled probabilities, and the labeled sample sentences are sentences formed by replacing keywords in the training sentences with general labels;
and obtaining semantic similarity output by the semantic extraction model, and filtering semantic words in the word set according to a similarity threshold value to obtain effective semantic words.
According to the above technical scheme, the application provides an intelligent device and an effective semantic word extraction method. The intelligent device comprises a storage module and a processing module, and the processing module is configured to: acquire a text to be extracted; perform word segmentation on the text to be extracted to obtain a word set comprising a plurality of semantic words; replace the semantic words in the word set with generic tokens to generate an annotated text set; input the text to be extracted and the annotated text set into a semantic extraction model, where the semantic extraction model is generated by training on a training sample set and an annotated sample set, the training sample set comprises training sentences with semantic labels, the annotated sample set comprises annotated sample sentences with labeling probabilities, and the annotated sample sentences are formed by replacing keywords in the training sentences with generic tokens; and obtain the semantic similarity output by the semantic extraction model and filter the semantic words in the word set according to a similarity threshold to obtain effective semantic words. In this way, key semantic words that influence semantic understanding can be extracted from the user's query text, which helps the search engine better understand the user's intention, helps the intelligent device return accurate search results, and improves the user experience.
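The extraction steps summarized above can be illustrated with a minimal Python sketch. It is not the implementation of the application: the `[MASK]` token, the function names, the whitespace segmentation and the toy similarity function are all hypothetical stand-ins. The direction of the threshold test (a word is kept when replacing it makes the text less similar to the original) is also an assumption, since the text only states that words are filtered according to a similarity threshold.

```python
from typing import Callable, List, Tuple

GENERIC_TOKEN = "[MASK]"  # hypothetical generic token; the source does not fix its form


def build_annotated_texts(words: List[str], token: str = GENERIC_TOKEN) -> List[Tuple[str, str]]:
    """For each word, return (word, text with that single word replaced by the token)."""
    annotated = []
    for i, word in enumerate(words):
        replaced = words[:i] + [token] + words[i + 1:]
        annotated.append((word, " ".join(replaced)))
    return annotated


def extract_effective_words(
    text: str,
    segment: Callable[[str], List[str]],
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Segment the text, score each annotated variant, and filter by threshold."""
    words = segment(text)
    effective = []
    for word, annotated_text in build_annotated_texts(words):
        score = similarity(text, annotated_text)
        # Assumption: low similarity after replacement means the word mattered.
        if score < threshold:
            effective.append(word)
    return effective


def toy_similarity(original: str, annotated: str) -> float:
    """Stand-in for the trained model: pretends only 'actor' and 'movie' carry meaning."""
    important = {"actor", "movie"}
    remaining = annotated.split()
    missing = [w for w in original.split() if w not in remaining]
    return 0.1 if any(w in important for w in missing) else 0.9


result = extract_effective_words("search movie of actor", str.split, toy_similarity)
```

In a real system `segment` would be a word segmenter suited to the query language and `similarity` would call the trained semantic extraction model; both are passed in as parameters here so the sketch stays independent of any particular library.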
Drawings
In order to explain the technical solution of the present application more clearly, the drawings used in the embodiments are briefly described below. Those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic processing flow diagram of an intelligent device in an embodiment of the present application;
FIG. 2 is a schematic diagram of a model training process in an embodiment of the present application;
FIG. 3 is a schematic diagram of a semantic extraction model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of obtaining a text to be extracted in the embodiment of the present application;
FIG. 5 is a schematic diagram of a word segmentation process in the embodiment of the present application;
FIG. 6 is a schematic diagram of a process for generating a tagged text set in an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating a process of obtaining a tag classification result in the embodiment of the present application;
FIG. 8 is a schematic diagram of a threshold filtering process in the embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following examples do not represent all embodiments consistent with the present application; they are merely examples of systems and methods consistent with certain aspects of the application, as recited in the claims.
In the embodiment of the application, the semantic extraction method can be applied to intelligent devices that have a data processing function and semantic extraction requirements. The smart devices include but are not limited to computers, intelligent terminals, smart televisions, intelligent wearable devices, intelligent display devices, servers and the like. The intelligent device can have a built-in or externally connected storage module and provide a processing module, forming a text classification system capable of executing the text classification method.
In the embodiment of the present application, the smart device may be a smart television device, and the smart television device has a built-in memory and a controller, where the memory may be used to store data such as text, a natural language processing model, and a control program. The controller may then recall data from memory and perform processing on the recalled data by running a control program.
The application scenario of the method and the device is the field of intelligent device searching, key semantic words influencing semantic understanding can be extracted from query texts of users, and a search engine is helped to understand user intentions better, so that the intelligent devices are helped to give accurate responses, and user experience is improved. On the other hand, due to the fact that the title names of the current search target data are various, the retrieval system can be helped to retrieve the content interested by the user more easily through extraction of the effective semantic words.
Fig. 1 is a schematic diagram of the processing flow of an intelligent device in the embodiment of the present application. Natural language processing may include two stages, namely a model training stage and a semantic extraction stage. In the model training stage, the processing module can acquire training sample data from a network or in other ways and input the training sample data into the initial semantic model for model training. The training model outputs classification probabilities based on the input sample data. A loss value is calculated from the classification probability and the labeling probability, and the model parameters of the initial semantic model are adjusted according to the loss value to obtain the semantic extraction model. The parameters of the training model are thus repeatedly adjusted and optimized over a sufficient volume of training sample data to obtain a well-trained semantic extraction model. After the model training process is completed, the processing module stores the trained semantic extraction model in the storage module so that subsequent applications can call it.
In the embodiment of the application, in the semantic extraction stage, the processing module can call the trained semantic extraction model from the storage module and input the text data to be processed into it. Through the internal operation of the semantic extraction model, the semantic words extracted from the current text data are then obtained and can be used for related queries.
In the embodiment of the present application, the training model or the semantic extraction model described in the above embodiment may be a model based on Natural Language Processing (NLP), for example a BERT model or another NLP model obtained by optimizing or modifying a BERT model. It should be noted that the model is referred to as the training model in the model training stage and as the semantic extraction model in the semantic extraction stage. Because the training model and the semantic extraction model are merely different stages of one model, and both take text data as input, text data that can be input into the training model can also be input into the semantic extraction model. Accordingly, in the embodiments described hereinafter, the training model and the semantic extraction model are not distinguished unless otherwise stated.
In the embodiment of the application, the semantic extraction process can enable the intelligent device to determine the semantic similarity of the semantic words in the text data according to the labeling probability by setting different labeling probabilities, so as to determine the semantic words in the text data. Namely, the semantic extraction process can determine the machine language from the natural language text data to realize machine learning. It can be seen that the semantic extraction process can be applied in the fields related to natural language processing, such as intelligent voice control, intelligent question answering, image recognition processing, business statistical analysis, and the like.
In the embodiment of the present application, in order to implement the above semantic extraction process, the intelligent device may perform model training and semantic extraction by embedding an Artificial Intelligence (AI) algorithm in an operating system. Obviously, for intelligent devices with different purposes, due to different implementation functions, artificial intelligence algorithms built in operating systems of the intelligent devices are different, but the artificial intelligence algorithms are all used for realizing semantic extraction processes in essence. For example, in an AI algorithm built in the intelligent question-answering robot, the semantics of the text input by the voice of the user are extracted to make relevant answers. In an intelligent search system built in the intelligent television, semantic extraction is carried out on text input by a user through voice or typing so as to search related programs.
In the embodiment of the application, in addition to the AI algorithm being built in the operating system, the AI algorithm corresponding to the semantic extraction function may also be built in the application program. Namely, the intelligent device can also realize the semantic extraction function by installing an application program. The application program capable of implementing the semantic extraction function may be a system application or a third party application. For example, to implement the intelligent question and answer function, the computer may download and install an "intelligent question and answer robot" application program, and invoke the semantic extraction model by running the application program, and implement the semantic extraction function on the text data by acquiring the text data input by the user in real time and inputting the text data into the semantic extraction model.
In the embodiment of the application, the semantic extraction function is not limited to the operation of one intelligent device, but can also be realized through the cooperative operation among a plurality of devices, that is, the intelligent device can establish a communication connection relationship with the server, acquire the input text of the user in real time by the intelligent device in practical application, execute the model training, the semantic extraction process and the query process by the server, and display the query result by the intelligent device.
In the embodiment of the application, the intelligent device can acquire the text data input by the user in real time during operation and send the text data to the server. An AI algorithm and a semantic extraction model for realizing the semantic extraction function are built in the server, so that after receiving text data sent by the intelligent equipment, the server can input the text data into the semantic extraction model to obtain a semantic extraction result output by the model. The server feeds the semantic extraction result back to the intelligent equipment, so that the semantic extraction result and the related query information can be fed back to the user through the intelligent equipment.
In the embodiment of the present application, in order to implement more service requirements and reduce data processing amount, in practical applications, specific device data for implementing a semantic extraction function through cooperative operation of multiple devices may be flexibly set according to requirements of implemented functions. And the specific semantic extraction process can be flexibly set according to the hardware configuration and the data volume of the equipment, so that the repeated data processing process is reduced, and the operational capacity of the equipment is saved. For example, multiple smart devices may establish a communication connection with a server at the same time. The server is used for providing a semantic extraction model for the plurality of intelligent devices in a unified mode, and different intelligent devices can perform programs such as data input, model operation and result output automatically after acquiring the semantic extraction model, so that the semantic extraction function is achieved. Meanwhile, the intelligent device can report the text data processed by the intelligent device to the server so as to further carry out model training in the server and continuously improve the semantic extraction model. Therefore, the server can push the semantic extraction model to the intelligent devices at a preset time, and update the semantic extraction model in each intelligent device so as to keep the timeliness of the semantic extraction model.
In the embodiment of the application, when the semantic extraction function is realized through the cooperative operation of multiple devices, the operation load condition of each device can be monitored in real time, and the actual execution main body of the model training stage and/or the semantic extraction stage can be dynamically adjusted according to the real-time load condition. The application program for realizing semantic extraction can be network application, and the intelligent device and the server which are accessed to the same network can realize the semantic extraction function by operating the network application after the network application is installed. In the process of running the network application, the network application can monitor the operation loads of each intelligent device and the server in real time, wherein the operation loads include the CPU usage amount, the memory usage amount, the network delay and the like. When the data corresponding to any operation load is abnormal, the AI algorithm execution main body of the corresponding equipment can be adjusted in real time, so that the semantic extraction process can be smoothly operated.
In the embodiment of the application, model calculation in the semantic extraction process can be executed by the intelligent device in a normal state, and when the monitored memory usage amount of the intelligent device exceeds a threshold value, the process of executing the model calculation by the intelligent device can be suspended, the intelligent device is automatically controlled to send the acquired text data to the server, so that the model calculation is executed by the server, and a semantic extraction result is fed back to reduce the processing load of the intelligent device and improve the timeliness of semantic extraction.
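The memory-based fallback described above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, memory usage is passed in as a number rather than measured, and the local model and the server call are represented by plain callables.

```python
from typing import Callable, List


def run_extraction(
    text: str,
    memory_usage: float,
    memory_threshold: float,
    local_model: Callable[[str], List[str]],
    send_to_server: Callable[[str], List[str]],
) -> List[str]:
    """Run the model locally in the normal state; offload when memory is over the threshold."""
    if memory_usage > memory_threshold:
        # Local model calculation is suspended and the text is sent to the server.
        return send_to_server(text)
    return local_model(text)


# Hypothetical stand-ins for the on-device model and the server round trip.
local = lambda t: ["local"] + t.split()
remote = lambda t: ["server"] + t.split()

normal_result = run_extraction("movie 1", 0.50, 0.80, local, remote)
overloaded_result = run_extraction("movie 1", 0.95, 0.80, local, remote)
```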
As can be seen from the above embodiments, when semantic extraction is applied, the smart device or the server needs to input text data into the training model (or the semantic extraction model) in both the model training stage and the semantic extraction stage. Since the text data to be extracted is natural language text, its form differs depending on its source. For example, text generated from a user's voice input tends to be colloquial and short, generally only one sentence long, whereas typed input text is generally longer.
Fig. 2 is a schematic diagram of the model training process in an embodiment of the present application. In this embodiment, the processing module is configured to: obtain a training sample set, where the training sample set is constructed from query texts input by users and resource title texts; extract training sentences from the training sample set; replace keywords in the training sentences with generic tokens to generate the annotated sample sentences; set a labeling probability for each annotated sample sentence; and train the semantic extraction model using the training sentences and the annotated sample sentences.
In the embodiment of the application, the processing module can collect training data from the text input by users and from the title texts of the media resource library in the server. The way users input text is not limited: a user can enter the text to be searched by voice or by typing. The title texts of the media resources in the server are the title texts of the users' final search targets, and combining the user input text with these title texts as training data improves the comprehensiveness of model training. The processing module can extract sample sentences for training from the training sample set and replace the keywords in the training sentences with generic tokens to generate the annotated sample sentences. The form of the generic token is not limited; it can be a single character or a group of characters, as long as it serves to replace a semantic word. A labeling probability, annotated as a soft label, is set for each annotated sample sentence, and the semantic extraction model is trained using the training sentences and the annotated sample sentences.
In the embodiment of the application, the general mark replaces a certain keyword in the training sentence, other contents in the training sentence are kept unchanged, a labeled sample sentence is generated, a labeling probability is set for each labeled sample sentence, and the set of the training sentence, the labeled sample sentence and the labeling probability is the training sample of the semantic extraction model.
In the embodiment of the application, the soft label form expresses the degree of similarity between the original text and the replaced annotated text in a fine-grained way: the closer the labeling probability is to 1, the more similar the original text and the replaced annotated text are, and the closer it is to 0, the less similar they are. The labeling probabilities used by the semantic extraction model are soft labels with a certain gradient; that is, graded labels distinguish the importance of different words.
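A training sample as described above (a training sentence, an annotated sample sentence with one keyword replaced, and a soft-label probability) might be represented as follows. This is a minimal sketch: the class name, the `[MASK]` token and the example probability values are illustrative assumptions and do not come from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    training_sentence: str      # original sentence (Query1)
    annotated_sentence: str     # sentence with one keyword replaced (Query2)
    labeling_probability: float # soft label in [0, 1]; near 1 means "still similar"


def make_sample(
    words: List[str],
    keyword_index: int,
    labeling_probability: float,
    token: str = "[MASK]",  # hypothetical generic token
) -> TrainingSample:
    """Replace exactly one keyword and keep the rest of the sentence unchanged."""
    if not 0.0 <= labeling_probability <= 1.0:
        raise ValueError("labeling probability must lie in [0, 1]")
    annotated = list(words)
    annotated[keyword_index] = token
    return TrainingSample(" ".join(words), " ".join(annotated), labeling_probability)


words = ["search", "movie", "1", "of", "actor", "A"]
# Illustrative values only: replacing "actor" is assumed to change the meaning
# a lot (low probability), replacing "search" only a little (high probability).
important = make_sample(words, 4, 0.1)
unimportant = make_sample(words, 0, 0.9)
```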
In an embodiment of the present application, the processing module is configured to invoke an initial semantic model in the step of training the semantic extraction model using the training sentences and the annotated sample sentences; inputting the training sentences and the annotation sample sentences into the initial semantic model; obtaining the classification probability output by the initial semantic model; calculating to obtain a loss value according to the classification probability and the labeling probability; and adjusting model parameters of the initial semantic model according to the loss value to obtain the semantic extraction model.
Fig. 3 is a schematic diagram of the semantic extraction model in the embodiment of the present application. The initial semantic model is a Query2Query interactive matching model, in which Query1 is the training sentence, Query2 is the annotated sample sentence, sim_prob_new is the labeling probability, [prob0, prob1] is the classification probability, [labels_0, labels_1] is the label classification result of the training sentence and the annotated sample sentence, and Loss is the loss value. The training sentence and the annotated sample sentence are input into the initial semantic model to calculate the classification probability [prob0, prob1], where:
the loss value Loss is calculated as:
Loss = labels_g - [prob0, prob1];
labels_g is calculated as:
labels_g = [labels_0, labels_1] * [1 - sim_prob_new, sim_prob_new];
and the label classification result [labels_0, labels_1] is calculated as:
[Formula image BDA0003618672480000051 of the original publication; not reproduced in this text]
and adjusting the model parameters of the initial semantic model according to the loss value obtained by calculation to obtain the semantic extraction model, and converting the original semantic word extraction task into a task for calculating the semantic similarity by using a semantic similarity calculation method, thereby effectively solving the problem that the same word has different importance degrees in different texts.
In the embodiment of the application, the semantic extraction model effectively incorporates label information with a certain gradient when calculating the loss value loss, that is, the dynamic perception of the text similarity is effectively incorporated in the whole training process.
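The two stated formulas can be illustrated with a minimal Python sketch. It assumes the product in the labels_g formula is element-wise, takes [labels_0, labels_1] as an input because its formula is only available as a figure, and adds a hypothetical `scalar_loss` reduction (sum of absolute values) that the text does not specify.

```python
from typing import List


def soft_label_target(labels: List[float], sim_prob_new: float) -> List[float]:
    """labels_g = [labels_0, labels_1] * [1 - sim_prob_new, sim_prob_new].

    Assumption: the product is element-wise.
    """
    weights = [1.0 - sim_prob_new, sim_prob_new]
    return [label * weight for label, weight in zip(labels, weights)]


def loss_vector(labels_g: List[float], probs: List[float]) -> List[float]:
    """Loss = labels_g - [prob0, prob1], taken element-wise."""
    return [g - p for g, p in zip(labels_g, probs)]


def scalar_loss(loss_vec: List[float]) -> float:
    """Hypothetical reduction to one number (not specified in the text)."""
    return sum(abs(v) for v in loss_vec)


# Illustrative values only: labels and probs are made up for the example.
labels_g = soft_label_target([1.0, 1.0], sim_prob_new=0.9)
loss = loss_vector(labels_g, probs=[0.2, 0.8])
print(labels_g, loss, scalar_loss(loss))
```

Because sim_prob_new enters the target directly, a labeled sample sentence with a labeling probability of 0.9 pulls the second classification probability toward 0.9 rather than toward a hard 0 or 1.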
As shown in fig. 4, which is a schematic flow diagram of obtaining a text to be extracted in the embodiment of the present application, the processing module is configured to receive a query instruction input by a user in a step of obtaining the text to be extracted, where the query instruction includes a query text; analyzing a query text from the query instruction; and deleting the meaningless words in the query text by using a preset word bank to obtain the text to be extracted.
In the embodiment of the present application, when a user searches with the intelligent device, the user may enter a voice query instruction directly through the voice input function of the intelligent device. The processing module of the intelligent device parses a query text from the voice query instruction. The query text may include many meaningless words, for example "then search movie 1 of actor A" or "this movie 2 of actor B". The processing module uses the preset word bank to delete the meaningless words "then" and "this" from the query text and obtains the text to be extracted, i.e., "search movie 1 of actor A" and "movie 2 of actor B".
In the embodiment of the present application, when a user searches with the intelligent device, the user may instead enter a text query instruction through the text input function of the intelligent device. The processing module of the intelligent device parses a query text from the text query instruction. The query text may include many meaningless modal particles and punctuation marks, for example "movie 1 of actor A ha" or "movie 2 of actor B o!". The processing module uses the preset word bank to delete the meaningless words and symbols "ha", "o" and "!" from the query text and obtains the text to be extracted, i.e., "movie 1 of actor A" and "movie 2 of actor B".
In the embodiment of the application, the preset word bank is a self-built bank. A query instruction input by a user searching with the intelligent device may contain other meaningless modal particles, punctuation marks and special symbols, all of which can be deleted by means of the preset word bank. Entries can be added to, modified in and deleted from the preset word bank at any time, so that the preset word bank handles the meaningless words in the query text better.
In the embodiment of the application, the model training process may repeatedly yield segmented words that rank very low by semantic similarity and have very low semantic similarity scores; such recurring meaningless words may be added to the preset word bank to further improve it.
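The word-bank cleanup described above can be sketched as follows. This is only an illustration: the set `PRESET_WORD_BANK`, the function name and the whitespace and punctuation tokenization are assumptions made for the English renderings of the examples, not the word bank or tokenizer of the application.

```python
import re
from typing import Set

# Hypothetical word bank holding the meaningless words from the examples.
PRESET_WORD_BANK: Set[str] = {"then", "this", "ha", "o", "!"}


def remove_meaningless_words(query_text: str, word_bank: Set[str]) -> str:
    """Drop every token that appears in the preset word bank."""
    # Words and single punctuation marks become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", query_text)
    kept = [tok for tok in tokens if tok.lower() not in word_bank]
    return " ".join(kept)


print(remove_meaningless_words("then search movie 1 of actor A", PRESET_WORD_BANK))
print(remove_meaningless_words("movie 2 of actor B o!", PRESET_WORD_BANK))

# The bank can be extended at any time, e.g. with a low-scoring recurring word.
PRESET_WORD_BANK.add("um")
```

Keeping the bank as a plain set makes the add, modify and delete operations mentioned in the text single calls.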
As shown in Fig. 5, which is a schematic diagram of the word segmentation process in the embodiment of the present application, the processing module is configured to invoke a word segmentation tool in the step of segmenting the text to be extracted, and to input the text to be extracted into the word segmentation tool so as to divide the text to be extracted into a plurality of semantic words and form the word set.
In the embodiment of the present application, the word segmentation tool is not limited. For example, inputting the text to be extracted "European crossroads! Explore the old town of Tallinn, Estonia and appreciate the complete medieval buildings" into the word segmentation tool yields the segmentation results "Europe", "crossroads", "explore", "Estonia Tallinn", "old town", "appreciate", "medieval", "complete" and "building"; inputting "movie 1 playback of actor A" yields "actor A", "of", "movie 1" and "playback".
In the embodiment of the present application, Chinese word segmentation refers to the process of segmenting a sequence of Chinese characters into individual words; that is, word segmentation recombines a continuous character sequence into a word sequence according to a certain criterion. The segmentation result conforms to the common understanding of words and phrases, and the word segmentation tool can segment the text to be extracted completely into individual words regardless of the length of the text.
For example, jieba, a word segmentation tool that is very popular in the field of NLP (natural language processing), may be selected as the word segmentation tool. The main function of jieba is Chinese word segmentation: it supports simple segmentation, parallel segmentation and command-line segmentation, and it also supports keyword extraction, part-of-speech tagging, word position lookup and the like. Although jieba is based on Python, it is also available for other languages and platforms such as C++, Go, R, Rust, Node.js, PHP, iOS and Android, so jieba can meet the needs of various developers. The number of GitHub stars of the jieba project has reached 24k. jieba is selected as the word segmentation tool because of its strong universality and rich lexicon. jieba has three segmentation modes: the first is the full mode, which scans out all the words that can be formed in the sentence; the second is the precise mode (the jieba default), which tries to cut the sentence most accurately; the third is the search engine mode, which segments long words again on the basis of the precise mode to improve recall and is suitable for search engine segmentation. In the embodiment of the present application, the word segmentation operation may be performed in any one of the three modes.
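To make the segmentation step concrete without depending on any particular tool, the sketch below uses a small forward-maximum-matching segmenter as a stand-in. It is not jieba and does not reproduce jieba's algorithm; the function `segment`, the `vocabulary` set and the English example sentence are assumptions for illustration. A real deployment would call the chosen word segmentation tool instead.

```python
from typing import List, Set


def segment(text: str, vocabulary: Set[str], max_len: int = 3) -> List[str]:
    """Forward maximum matching over whitespace-separated units.

    At each position the longest phrase found in the vocabulary is taken;
    a single unit is always accepted as a fallback.
    """
    units = text.split()
    words: List[str] = []
    i = 0
    while i < len(units):
        for n in range(min(max_len, len(units) - i), 0, -1):
            candidate = " ".join(units[i:i + n])
            if n == 1 or candidate in vocabulary:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical vocabulary with the multi-unit words of the running example.
vocabulary = {"movie 1", "actor A"}
print(segment("movie 1 playback of actor A", vocabulary))
```

The token order follows the English rendering of the example sentence, so it differs from the order in which the segmentation results are listed in the text.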
As shown in Fig. 6, which is a schematic diagram of the process of generating the labeled text set in the embodiment of the present application, the processing module is configured to traverse the semantic words in the word set in the step of replacing the semantic words in the word set with general marks to generate the labeled text set; replace one semantic word in the word set at a time with a general mark, so that each replacement yields a labeled text sentence, wherein the labeled text sentence comprises the general mark and the semantic words in the word set that are not replaced by the general mark, and set a labeling probability for the labeled text sentence; and combine the labeled text sentences generated by each replacement to form the labeled text set.
In the embodiment of the present application, the processing module takes the word segmentation result, replaces one semantic word at a time with the general mark "N" and keeps the other semantic words unchanged to obtain the labeled text sentences. For example, from the text to be extracted "movie 1 playback of actor A", the labeled text sentences "movie 1 playback of N", "movie 1 playback N actor A", "N playback of actor A" and "movie 1 N of actor A" can be obtained. A labeling probability is set for each labeled text sentence, for example "movie 1 playback of N, 0.9", "movie 1 playback N actor A, 1", "N playback of actor A, 0.7" and "movie 1 N of actor A, 0.8". All the labeled text sentences are combined to form the labeled text set.
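The replacement procedure can be sketched in a few lines of Python. The function name `build_labeled_text_set`, the dictionary layout and the way labeling probabilities are passed in are assumptions for illustration; the probabilities are the example values given in the text, keyed by the replaced word.

```python
from typing import Dict, List, Optional


def build_labeled_text_set(
    words: List[str],
    labeling_probs: Optional[Dict[str, float]] = None,
    general_mark: str = "N",
) -> List[dict]:
    """Replace one word at a time with the general mark.

    Each entry records the replaced word, the resulting labeled text
    sentence and, if supplied, the labeling probability for that sentence.
    """
    labeled_set = []
    for i, word in enumerate(words):
        replaced = words[:i] + [general_mark] + words[i + 1:]
        labeled_set.append({
            "replaced_word": word,
            "sentence": " ".join(replaced),
            "labeling_prob": (labeling_probs or {}).get(word),
        })
    return labeled_set


words = ["movie 1", "playback", "of", "actor A"]
probs = {"actor A": 0.9, "of": 1.0, "movie 1": 0.7, "playback": 0.8}
for entry in build_labeled_text_set(words, probs):
    print(entry)
```

A word set with n semantic words therefore always produces n labeled text sentences, each differing from the original text in exactly one position.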
In the embodiment of the present application, replacing a semantic word with the general mark serves to mark that semantic word. Both the model training step and the semantic extraction step operate on labeled text sentences obtained by replacing one semantic word with the general mark while leaving the other semantic words unchanged, and the result output by the semantic extraction step is the semantic similarity of the replaced semantic word, that is, of the marked semantic word.
In the embodiment of the application, the labeling probability is set for each labeled text sentence in the form of a soft label. A soft label represents the similarity between the text to be extracted and the labeled text sentence by a value from 0 to 1, and this labeling format is based on a teacher-student design.
In the embodiment of the application, a labeling probability in the form of a soft label is set for each labeled text generated by replacing a semantic word with the general mark; that is, the original text is compared with the replaced labeled text, and the labeling probability indicates the similarity between the two. For example, "movie 1 playback of N, 0.9" means that the similarity between "movie 1 playback of N" and "movie 1 playback of actor A" is 0.9; "movie 1 playback N actor A, 1" means that the similarity between "movie 1 playback N actor A" and "movie 1 playback of actor A" is 1, i.e., the two are essentially identical in meaning.
In the embodiment of the application, the soft-label form of labeling expresses the degree of similarity between the original text and the replaced labeled text at a fine granularity: the closer the labeling probability is to 1, the more similar the two texts are, and the closer it is to 0, the less similar they are. The semantic extraction model thus uses labeling probabilities in the form of graded soft labels, that is, graded labels are used to distinguish the degrees of importance of different words.
In an embodiment of the present application, the processing module is configured to input the text to be extracted and the labeled text set into a semantic extraction model, where the semantic extraction model is generated by training a training sample set and a labeled sample set, and the training sample set includes training sentences with semantic labels; the labeled sample set comprises labeled sample sentences with labeled probability, and the labeled sample sentences are sentences formed by replacing keywords in the training sentences with general labels.
In the embodiment of the application, the semantic extraction model is an improvement on a text interaction matching model. The semantic extraction model converts the input text into a vector and also outputs a vector; the semantic similarity is obtained by selecting a value from the output label classification result vector.
In the embodiment of the application, the semantic similarity of the output of the semantic words replaced by the general labels can be obtained by inputting the text to be extracted and the labeled text set into the trained semantic extraction model, wherein the labeled text set and the labeled sample set have the same composition and each include a plurality of labeled text sentences with labeled probabilities.
As shown in Fig. 7, which is a schematic flow diagram of obtaining a label classification result in the embodiment of the present application, the text to be extracted Query1 and the labeled text sentence Query2 are input into the semantic extraction model to obtain the label classification result [labels_0, labels_1], which is calculated as:
Figure BDA0003618672480000081
In the embodiment of the application, the model parameters of the initial semantic model are adjusted according to the calculated loss value to obtain the semantic extraction model, and the original semantic word extraction task is converted into a semantic similarity calculation task by using a semantic similarity calculation method, which addresses the problem that the same word has different degrees of importance in different texts.
In this embodiment of the present application, in the step of obtaining the semantic similarity output by the semantic extraction model, the processing module is configured to obtain, for the text to be extracted and each labeled text sentence, the semantic similarity output by the semantic extraction model for the semantic word replaced by the general mark in that labeled text sentence; sort the replaced semantic words in descending order of semantic similarity to obtain a semantic sorting result; and screen effective semantic words in the word set according to the semantic sorting result.
Fig. 8 is a schematic diagram of the threshold filtering process in this embodiment of the application, where prob is the semantic similarity, and the value of prob is equal to the value of labels_1 in the label classification result [labels_0, labels_1]. The semantic similarity output by the semantic extraction model for the text to be extracted and each labeled text sentence is obtained from the label classification result [labels_0, labels_1]. For example, for the text to be extracted "movie 1 playback of actor A", the following can be obtained: "actor A, prob 0.99924904", "of, prob 0.3582316", "movie 1, prob 0.99599165" and "playback, prob 0.9925249". Sorting by the semantic similarity prob gives the semantic sorting result: "actor A", "movie 1", "playback", "of".
In the embodiment of the present application, the degree of importance of the semantic words in the text to be extracted can be determined from the semantic sorting result. For example, in the text to be extracted "movie 1 playback of actor A", the most important semantic word is "actor A", the next is "movie 1", then "playback", and the least important is "of". The semantic sorting result can be used to identify the most important semantic words in the text to be extracted, to compile big-data statistics on user preferences and habits, and to provide search suggestions and personalized recommendations to the user according to the most important semantic words in the query texts the user frequently inputs.
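The sorting step can be illustrated with the prob values quoted in the example. The sketch below is a minimal illustration; the function name `rank_by_similarity` and the dictionary representation are assumptions, and the particle is written as "of".

```python
from typing import Dict, List, Tuple


def rank_by_similarity(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort replaced semantic words by semantic similarity, largest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# prob values quoted in the example for each replaced semantic word.
scores = {
    "actor A": 0.99924904,
    "of": 0.3582316,
    "movie 1": 0.99599165,
    "playback": 0.9925249,
}

for word, prob in rank_by_similarity(scores):
    print(word, prob)
```

The first entry of the ranked list is the semantic word treated as most important for the query.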
In an embodiment of the present application, the processing module is configured to set a filtering threshold in the step of screening valid semantic words in the word set according to the semantic sorting result; comparing the semantic similarity of the semantic word replaced by the universal mark in each annotation text sentence with the filtering threshold value; and extracting effective semantic words, wherein the effective semantic words are semantic words replaced by general marks in the labeled text sentences of which the semantic similarity is greater than or equal to the filtering threshold value.
In the embodiment of the present application, the filtering threshold is set as needed. For example, if the filtering threshold is set to 0.5, the effective semantic words in the text to be extracted "movie 1 playback of actor A" are "actor A", "movie 1" and "playback".
In this embodiment of the present application, the filtering threshold is set according to a requirement of a query result range, if the filtering threshold is set to be larger, the number of valid semantic words in the text to be extracted is smaller, the range of the query result is smaller, and the query result is more accurate, whereas if the filtering threshold is set to be smaller, the number of semantic words in the text to be extracted is larger, the range of the query result is larger, and the query result is wider, that is, the filtering threshold is negatively related to the range of the query result.
In the embodiment of the application, if the filtering threshold is set to 0.1 in the query process, much weakly relevant content appears in the search result; if the filtering threshold is set to 0.9, only highly relevant content appears. For example, if the query text is "movie 1 playback of actor A" and the filtering threshold is 0.1, the search result contains every item whose title includes any of the semantic words "actor A", "movie 1" and "playback", which is a massive amount of content; if the query text is "movie 1 playback of actor A" and the filtering threshold is 0.995, the search result contains only items whose title includes either of the semantic words "actor A" and "movie 1", and the search result is very accurate.
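The effect of the filtering threshold can be checked against the example scores. This is a sketch under the stated rule that a semantic word is effective when its semantic similarity is greater than or equal to the threshold; the function name `filter_valid_words` is an assumption.

```python
from typing import Dict, List


def filter_valid_words(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep the words whose semantic similarity is >= the filtering threshold."""
    return [word for word, prob in scores.items() if prob >= threshold]


scores = {
    "actor A": 0.99924904,
    "of": 0.3582316,
    "movie 1": 0.99599165,
    "playback": 0.9925249,
}

print(filter_valid_words(scores, 0.5))    # moderate threshold
print(filter_valid_words(scores, 0.995))  # strict threshold
```

Raising the threshold from 0.5 to 0.995 drops "playback" and leaves "actor A" and "movie 1", which narrows the query as described.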
In the embodiment of the application, the filtering threshold value can be set according to a specific application scene, for example, in an application background during news searching, the filtering threshold value can be set to a higher value so as to ensure the accuracy of the searched news and meet the requirements of users; in the application background during short video search, the filtering threshold value can be set to be a moderate value so as to ensure that the searched short video content is enough to meet the requirements of users; in the application background of the entertainment message searching, the filtering threshold value can be set to be a lower value so as to ensure that the searched entertainment message has wide coverage and meets the requirements of users.
In an embodiment of the present application, the processing module is configured to, after the step of obtaining valid semantic words, query the storage module for associated media asset items using the valid semantic words; or sending a query instruction for querying the associated media asset item to a server by using the effective semantic word.
In this embodiment of the present application, after obtaining the valid semantic words in the word set, the processing module uses the valid semantic words to query the associated media asset items in the storage module, that is, searches the associated content in the training data through the valid semantic words, or sends a query instruction for querying the associated media asset items to the server through the valid semantic words, that is, searches the associated content in the server data through the valid semantic words. If the effective semantic words are "actor A", "movie 1" and "play", the storage module or the server is queried about the associated contents of the three words "actor A", "movie 1" and "play".
In the embodiment of the application, after the intelligent device obtains the effective semantic words through the semantic extraction model, it can search its own database with the effective semantic words to obtain associated content, or it can send the effective semantic words to other servers for searching, so as to meet the query requirements of the user. For example, the user can use a smartphone to search for information about actor B stored on the phone itself and on the Internet; the user can use a smart television at home to search the media library of an Internet server for videos related to food documentaries; the user can use the intelligent question-and-answer robot commonly found in shopping malls to search the robot's own information base for the location of a pizza shop in the mall.
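A possible shape of the query step is sketched below. Everything in it is hypothetical: the title list, the substring matching rule, the `query_server` callback and the choice to contact the server only when the local store has no hit are assumptions, since the text only states that either the storage module or a server is queried.

```python
from typing import Callable, List, Optional


def query_media_assets(valid_words: List[str], titles: List[str]) -> List[str]:
    """Return the titles that contain at least one effective semantic word."""
    return [t for t in titles if any(word in t for word in valid_words)]


def search(
    valid_words: List[str],
    local_titles: List[str],
    query_server: Optional[Callable[[List[str]], List[str]]] = None,
) -> List[str]:
    """Query local storage first; fall back to the server callback if empty."""
    hits = query_media_assets(valid_words, local_titles)
    if hits or query_server is None:
        return hits
    return query_server(valid_words)


# Hypothetical local media asset titles.
local_titles = ["movie 1 of actor A", "movie 2 of actor B", "movie 3 of actor A"]
print(search(["actor A", "movie 1"], local_titles))
```

Matching on any effective semantic word mirrors the example in which a lower threshold yields more words and therefore a broader result set.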
Based on the above intelligent device, some embodiments of the present application further provide a method for extracting effective semantic words, including:
acquiring a text to be extracted;
performing word segmentation on a text to be extracted to obtain a word set, wherein the word set comprises a plurality of semantic words;
replacing semantic words in the word set with generic labels to generate a set of labeled text;
inputting the text to be extracted and the labeled text set into a semantic extraction model, wherein the semantic extraction model is generated by training a training sample set and a labeled sample set, and the training sample set comprises training sentences with semantic labels; the labeled sample set comprises labeled sample sentences with labeled probability, and the labeled sample sentences are sentences formed by replacing keywords in the training sentences with general labels;
and obtaining semantic similarity output by the semantic extraction model, and filtering semantic words in the word set according to a similarity threshold value to obtain effective semantic words.
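The steps listed above can be chained as in the following sketch. The trained semantic extraction model is replaced by a hypothetical lookup table (`STUB_SCORES`) that returns the prob values quoted earlier in the description, and the input is assumed to be already cleaned and segmented; none of the names below come from the application.

```python
from typing import Callable, List, Tuple

# (text to be extracted, labeled text sentence) -> semantic similarity
Scorer = Callable[[str, str], float]

# Hypothetical stand-in for the trained model, keyed by labeled text sentence.
STUB_SCORES = {
    "movie 1 playback of N": 0.99924904,       # "actor A" replaced
    "movie 1 playback N actor A": 0.3582316,   # "of" replaced
    "N playback of actor A": 0.99599165,       # "movie 1" replaced
    "movie 1 N of actor A": 0.9925249,         # "playback" replaced
}


def stub_scorer(text: str, labeled_sentence: str) -> float:
    return STUB_SCORES[labeled_sentence]


def extract_valid_words(
    words: List[str],
    scorer: Scorer,
    threshold: float = 0.5,
    general_mark: str = "N",
) -> List[str]:
    """Replace each word in turn, score it, rank and filter by threshold."""
    text = " ".join(words)
    scored: List[Tuple[str, float]] = []
    for i, word in enumerate(words):
        labeled_sentence = " ".join(words[:i] + [general_mark] + words[i + 1:])
        scored.append((word, scorer(text, labeled_sentence)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return [word for word, prob in scored if prob >= threshold]


words = ["movie 1", "playback", "of", "actor A"]
print(extract_valid_words(words, stub_scorer, threshold=0.5))
```

Swapping `stub_scorer` for a call into a real model would leave the rest of the pipeline unchanged, which is the main reason for passing the scorer in as a function.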
In the embodiment of the application, the effective semantic word extraction method obtains a set of a plurality of semantic words by segmenting a text to be extracted; replacing semantic words in the semantic word set with general marks to generate a marked text set; inputting the text to be extracted and the labeled text set into a semantic extraction model, wherein the semantic extraction model is generated by training a training sample set and a labeled sample set, and the training sample set comprises training sentences with semantic labels; the labeled sample set comprises labeled sample sentences with labeled probability, and the labeled sample sentences are sentences formed by replacing keywords in the training sentences with general labels; and obtaining semantic similarity output by the semantic extraction model, and filtering semantic words in the word set according to a similarity threshold value to obtain effective semantic words.
In the embodiment of the application, a semantic extraction model in the effective semantic word extraction method is marked with a labeling probability in a soft label form with a certain gradient, namely, the importance degree of different words is distinguished by a label with a certain gradient in a standard way; adjusting the model parameters of the initial semantic model according to the loss value obtained by calculation to obtain the semantic extraction model, and converting the original semantic word extraction task into a task for calculating the semantic similarity by using a semantic similarity calculation method, so that the problem that the same word has different importance degrees in different texts is effectively solved; the semantic extraction model effectively incorporates label information marked with a certain gradient when calculating the loss value, namely effectively incorporates dynamic perception of text similarity in the whole training process.
According to the technical scheme, the intelligent device and the effective semantic word extraction method provided by the application comprise a storage module and a processing module, wherein the processing module is configured to acquire the text to be extracted; performing word segmentation on a text to be extracted to obtain a word set, wherein the word set comprises a plurality of semantic words; replacing semantic words in the word set with generic labels to generate a set of labeled text; inputting the text to be extracted and the labeled text set into a semantic extraction model, wherein the semantic extraction model is generated by training a training sample set and a labeled sample set, and the training sample set comprises training sentences with semantic labels; the labeled sample set comprises labeled sample sentences with labeled probabilities, and the labeled sample sentences are sentences formed by replacing keywords in the training sentences with general labels; and obtaining semantic similarity output by the semantic extraction model, and filtering semantic words in the word set according to a similarity threshold value to obtain effective semantic words. According to the method and the device, key semantic words influencing semantic understanding can be extracted from the query text of the user, and the search engine is helped to understand the user intention better, so that the intelligent device can be helped to give out an accurate search result, and the user experience is improved.
The embodiments provided in the present application are only a few examples of the general concept of the present application, and do not limit the scope of the present application. Any other embodiments extended according to the scheme of the present application without inventive efforts will be within the scope of protection of the present application for a person skilled in the art.

Claims (10)

1. A smart device, comprising:
a storage module configured to store a semantic extraction model;
a processing module configured to:
acquiring a text to be extracted;
performing word segmentation on a text to be extracted to obtain a word set, wherein the word set comprises a plurality of semantic words;
replacing semantic words in the word set with general marks to generate a labeled text set;
inputting the text to be extracted and the labeled text set into a semantic extraction model, wherein the semantic extraction model is generated by training a training sample set and a labeled sample set, and the training sample set comprises training sentences with semantic labels; the labeled sample set comprises labeled sample sentences with labeled probabilities, and the labeled sample sentences are sentences formed by replacing keywords in the training sentences with general labels;
and obtaining semantic similarity output by the semantic extraction model, and filtering semantic words in the word set according to a similarity threshold value to obtain effective semantic words.
2. The smart device of claim 1, wherein the processing module is configured to:
in the step of segmenting the words of the text to be extracted, a word segmentation tool is called;
and inputting the text to be extracted into the word segmentation tool so as to divide the text to be extracted into a plurality of semantic words and form the word set.
3. The smart device of claim 1, wherein the processing module is configured to:
in the step of replacing the semantic words in the word set by the general marks to generate a labeled text set, traversing the semantic words in the word set;
replacing one semantic word in the word set by using a general mark in sequence to obtain a labeled text sentence in the process of replacing the semantic word each time, wherein the labeled text sentence comprises the general mark and the semantic word which is not replaced by the general mark in the word set, and setting a labeling probability for the labeled text sentence;
and combining the labeled text sentences generated in the process of replacing the semantic words each time to form the labeled text set.
4. The smart device of claim 3, wherein the processing module is configured to:
in the step of obtaining the semantic similarity output by the semantic extraction model, obtaining the semantic similarity output by the semantic extraction model to the semantic words replaced by the general tags in the text to be extracted and each of the annotated text sentences;
sequencing the semantic words replaced by the general tags in each annotation text sentence according to the sequence of the semantic similarity from large to small so as to obtain a semantic sequencing result;
and screening effective semantic words in the word set according to the semantic sorting result.
5. The smart device of claim 4, wherein the processing module is configured to:
setting a filtering threshold value in the step of screening effective semantic words in the word set according to the semantic sorting result;
comparing the semantic similarity of the semantic words replaced by the general marks in each of the labeled text sentences with the filtering threshold value;
and extracting effective semantic words, wherein the effective semantic words are semantic words replaced by general marks in the labeled text sentences of which the semantic similarity is greater than or equal to the filtering threshold.
6. The smart device of claim 1, wherein the processing module is configured to:
acquiring a training sample set, wherein the training sample set is constructed by a query text and a media asset title text input by a user;
extracting training sentences from the training sample set;
replacing keywords in the training sentences with general labels to generate the labeled sample sentences;
setting a labeling probability for the labeling sample sentence;
training the semantic extraction model using the training sentences and the annotated sample sentences.
7. The smart device of claim 1, wherein the processing module is configured to:
in the step of training the semantic extraction model by using the training sentences and the labeled sample sentences, calling an initial semantic model;
inputting the training sentences and the labeling sample sentences into the initial semantic model;
obtaining the classification probability output by the initial semantic model;
calculating to obtain a loss value according to the classification probability and the labeling probability;
and adjusting the model parameters of the initial semantic model according to the loss value to obtain the semantic extraction model.
8. The smart device of claim 1, wherein the processing module is configured to:
in the step of acquiring the text to be extracted, receiving a query instruction input by a user, wherein the query instruction comprises a query text;
analyzing a query text from the query instruction;
and deleting the meaningless words in the query text by using a preset word bank to obtain the text to be extracted.
9. The smart device of claim 1, wherein the processing module is configured to:
after the step of obtaining the effective semantic words, using the effective semantic words to inquire the associated media asset items in the storage module; or,
and sending a query instruction for querying the associated media asset items to a server by using the effective semantic words.
10. A method for extracting valid semantic words is characterized by comprising the following steps:
acquiring a text to be extracted;
performing word segmentation on a text to be extracted to obtain a word set, wherein the word set comprises a plurality of semantic words;
replacing semantic words in the word set with general marks to generate a labeled text set;
inputting the text to be extracted and the labeled text set into a semantic extraction model, wherein the semantic extraction model is generated by training a training sample set and a labeled sample set, and the training sample set comprises training sentences with semantic labels; the labeled sample set comprises labeled sample sentences with labeled probability, and the labeled sample sentences are sentences formed by replacing keywords in the training sentences with general labels;
and obtaining semantic similarity output by the semantic extraction model, and filtering semantic words in the word set according to a similarity threshold value to obtain effective semantic words.
CN202210455759.3A 2022-04-27 2022-04-27 Intelligent device and effective semantic word extraction method Pending CN114757187A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210455759.3A CN114757187A (en) 2022-04-27 2022-04-27 Intelligent device and effective semantic word extraction method

Publications (1)

Publication Number Publication Date
CN114757187A true CN114757187A (en) 2022-07-15

Family

ID=82334140

Country Status (1)

Country Link
CN (1) CN114757187A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362656A (en) * 2019-06-03 2019-10-22 广东幽澜机器人科技有限公司 A kind of semantic feature extracting method and device
US20200184151A1 (en) * 2018-11-30 2020-06-11 Thomson Reuters Special Services Llc Systems and methods for identifying an event in data
CN112667800A (en) * 2020-12-21 2021-04-16 深圳壹账通智能科技有限公司 Keyword generation method and device, electronic equipment and computer storage medium
CN113392305A (en) * 2020-11-25 2021-09-14 腾讯科技(深圳)有限公司 Keyword extraction method and device, electronic equipment and computer storage medium

Similar Documents

Publication Publication Date Title
CN109165302B (en) Multimedia file recommendation method and device
US10192544B2 (en) Method and system for constructing a language model
CN112201228A (en) Multimode semantic recognition service access method based on artificial intelligence
CN107526809B (en) Method and device for pushing music based on artificial intelligence
CN109101479A (en) A kind of clustering method and device for Chinese sentence
US20070118519A1 (en) Question answering system, data search method, and computer program
US11875585B2 (en) Semantic cluster formation in deep learning intelligent assistants
CN111353026A (en) Intelligent law attorney assistant customer service system
CN111400513A (en) Data processing method, data processing device, computer equipment and storage medium
CN111881283A (en) Business keyword library creating method, intelligent chat guiding method and device
CN117668181A (en) Information processing method, device, terminal equipment and storage medium
CN113704507A (en) Data processing method, computer device and readable storage medium
CN109992651B (en) Automatic identification and extraction method for problem target features
CN110795547A (en) Text recognition method and related product
JP2012003704A (en) Faq candidate extraction system and faq candidate extraction program
CN109800326B (en) Video processing method, device, equipment and storage medium
CN111354350A (en) Voice processing method and device, voice processing equipment and electronic equipment
CN114430832A (en) Data processing method and device, electronic equipment and storage medium
CN116361416A (en) Speech retrieval method, system and medium based on semantic analysis and high-dimensional modeling
CN114757187A (en) Intelligent device and effective semantic word extraction method
CN111949781B (en) Intelligent interaction method and device based on natural sentence syntactic analysis
CN114662002A (en) Object recommendation method, medium, device and computing equipment
CN115618873A (en) Data processing method and device, computer equipment and storage medium
CN112507105A (en) Multi-mode intelligent question-answering system and method based on WeChat public number
JP2004118856A (en) Information retrieval method and information retrieval system using agent

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination