CN113282711B - Internet of vehicles text matching method and device, electronic equipment and storage medium - Google Patents

Internet of vehicles text matching method and device, electronic equipment and storage medium

Info

Publication number
CN113282711B
Authority
CN
China
Prior art keywords
text
model
matching
matched
vector
Prior art date
Legal status
Active
Application number
CN202110622070.0A
Other languages
Chinese (zh)
Other versions
CN113282711A (en)
Inventor
邹博松
王卉捷
宋娟
郭盈
Current Assignee
China Software Evaluation Center
Original Assignee
China Software Evaluation Center
Priority date
Filing date
Publication date
Application filed by China Software Evaluation Center filed Critical China Software Evaluation Center
Priority to CN202110622070.0A priority Critical patent/CN113282711B/en
Publication of CN113282711A publication Critical patent/CN113282711A/en
Application granted granted Critical
Publication of CN113282711B publication Critical patent/CN113282711B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/3344 — Information retrieval of unstructured textual data; querying; query processing; query execution using natural language analysis
    • G06F 40/211 — Handling natural language data; natural language analysis; parsing; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/216 — Handling natural language data; natural language analysis; parsing; parsing using statistical methods
    • G06F 40/289 — Handling natural language data; natural language analysis; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an Internet of vehicles text matching method and device, an electronic device and a storage medium, which are used to solve the problem that a good text matching effect is difficult to obtain in terms of semantic representation. The method comprises the following steps: acquiring a text to be matched, and extracting the abstract content of the text to be matched and the dependency syntax core components of the text to be matched; performing word segmentation and vectorization on the abstract content, the dependency syntax core components and the text to be matched to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse sequence position embedded vectors; fusing the sentence component vectors, the token embedded vectors, the position embedded vectors and/or the reverse sequence position embedded vectors to obtain an input representation vector; and matching and ranking a plurality of retrieved texts according to the input representation vector using a text matching model, to obtain a plurality of ranked retrieved texts.

Description

Internet of vehicles text matching method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of Internet of vehicles and natural language processing, in particular to an Internet of vehicles text matching method, an Internet of vehicles text matching device, electronic equipment and a storage medium.
Background
The Internet of Vehicles (IoV) means that the vehicle-mounted equipment on a vehicle makes effective use of the dynamic information of all vehicles on an information network platform through wireless communication technology, and provides different functional services while the vehicle is running. IoV is also referred to as V2X (vehicle-to-everything), i.e. communication between the vehicle and other vehicles or devices that may affect it, which means that the devices on the vehicle can exchange information through different kinds of communication modes, including: vehicle-to-infrastructure (V2I), vehicle-to-network (Vehicle to Network, V2N), vehicle-to-vehicle (V2V), vehicle-to-pedestrian (Vehicle to Pedestrian, V2P), or vehicle-to-device (V2D).
Text matching (Text Match) is an important problem in natural language processing (Natural Language Processing, NLP), and many NLP tasks can be abstracted as text matching problems. For example, the process of web search in a search engine can be abstracted as the problem of matching a set of related web pages against the user's search text (Query Text); similarly, the automatic question-answering task can be abstracted as a satisfaction-matching problem between candidate answers and questions, and the text deduplication task can be abstracted as a similarity-matching problem between the query text and the text to be deduplicated.
In the current Internet of vehicles industry, information retrieval and information exchange for automobiles and V2X are mostly realized based on text matching, and current Internet of vehicles text matching methods are mostly based on machine learning. For example, in a scenario where a user enters keywords and a search engine ranks web page files by relevance to those keywords, the search engine may rank the search results using the term frequency–inverse document frequency (TF-IDF) algorithm or the vector space model (Vector Space Model, VSM) algorithm. In practice it has been found that text matching based on such machine learning methods only uses word frequency, inverse document frequency, document length and similar factors, which causes many problems, for example: compositional structure problems of language in automatic question-answering tasks (e.g. the question "high-speed rail from Beijing to Shanghai" being answered with the candidate "high-speed rail from Shanghai to Beijing"), as well as ambiguity and synonymy problems of language. That is, ranking search results by lexical-level similarity factors with a machine-learning-based approach makes it difficult to achieve a good text matching effect from the perspective of semantic representation.
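For background, the lexical-level ranking described above can be sketched as follows. This is only a minimal illustration using scikit-learn with an invented corpus and query, not the method of this application; note that the query "high-speed rail from Beijing to Shanghai" and a document about "high-speed rail from Shanghai to Beijing" receive identical bag-of-words vectors, which is exactly the word-order problem mentioned above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative candidate documents and query (not from the application).
documents = [
    "high-speed rail schedule from Beijing to Shanghai",
    "high-speed rail schedule from Shanghai to Beijing",
    "bus schedule from Beijing to Tianjin",
]
query = "high-speed rail from Beijing to Shanghai"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)          # TF-IDF weights per document
query_vector = vectorizer.transform([query])              # same vocabulary and weighting
scores = cosine_similarity(query_vector, doc_matrix)[0]   # lexical relevance scores
ranking = scores.argsort()[::-1]                          # document indices, best first
print(ranking, scores)  # documents 0 and 1 score identically: word order is lost
```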
Disclosure of Invention
The embodiments of the application aim to provide an Internet of vehicles text matching method and device, an electronic device and a storage medium, which are used to solve the problem that a good text matching effect is difficult to obtain in terms of semantic representation.
The embodiment of the application provides an Internet of vehicles text matching method, which comprises the following steps: acquiring a text to be matched, and extracting the abstract content of the text to be matched and the dependency syntax core components of the text to be matched; performing word segmentation and vectorization on the abstract content, the dependency syntax core components and the text to be matched to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse sequence position embedded vectors; fusing the sentence component vectors, the token embedded vectors, the position embedded vectors and/or the reverse sequence position embedded vectors to obtain an input representation vector; and matching and ranking a plurality of retrieved texts according to the input representation vector using a text matching model, to obtain a plurality of ranked retrieved texts, wherein the text matching model is obtained through multi-task joint training. In this implementation, the input representation vector obtained by word segmentation, vectorization and fusion of the abstract content, the dependency syntax core components and the text to be matched can better represent the text content in terms of semantic representation; adding sentence component vectors that distinguish the abstract content and the dependency syntax core components makes it easier for the text matching model to distinguish the core abstract content and the dependency syntax core components of the text, so that matching and ranking are performed according to the core abstract content and dependency syntax core components that better represent the semantics, which effectively improves the text matching effect in terms of semantic representation.
Optionally, in the embodiment of the present application, extracting the abstract content of the text to be matched and the dependency syntax core components of the text to be matched includes: using a pre-trained generative pre-training model as an abstract extraction model to extract an abstract of the text to be matched, so as to obtain the abstract content of the text to be matched; and using a dependency analysis tool to extract the subject-predicate relation components, verb-object relation components, adverbial structure components and/or head (core) relation components in the text to be matched, and determining these components as the dependency syntax core components of the text to be matched. In this implementation, extracting the abstract content and the dependency syntax core components of the text to be matched makes it easier for the text matching model to distinguish the core components of the text, so that matching and ranking are performed according to the abstract content and core components that better represent the semantics, which effectively improves the text matching effect in terms of semantic representation.
Optionally, in an embodiment of the present application, before using the pre-trained generative pre-training model as the abstract extraction model to extract the abstract of the text to be matched, the method further includes: acquiring a text data set and an abstract data set, wherein the abstract texts in the abstract data set are obtained by abstracting the sample texts in the text data set; and training a generative pre-training network with the text data set and the abstract data set to obtain the generative pre-training model. In this implementation, a generative pre-training model is trained separately with the text data set and the abstract data set, instead of directly adopting an existing generative pre-training model from the Internet; this avoids the problem that an existing generative pre-training model may not be applicable, and gives the separately trained model better generalization ability and higher accuracy.
Optionally, in an embodiment of the present application, the text matching model includes a feature extraction model and a deep network model; using the text matching model to match and rank the plurality of retrieved texts according to the input representation vector comprises: extracting a feature vector of the input representation vector using the feature extraction model; and matching and ranking the text vectors corresponding to the plurality of retrieved texts according to the feature vector using the deep network model. In this implementation, the feature extraction model extracts a feature vector of the input representation vector and the deep network model matches and ranks the text vectors corresponding to the plurality of retrieved texts according to that feature vector, so that matching and ranking are performed with feature vectors at the level of semantic representation, which effectively improves the text matching effect in terms of semantic representation.
Optionally, in an embodiment of the present application, before using the text matching model to match and rank the plurality of retrieved texts according to the input representation vector, the method further includes: acquiring a text data set, an abstract data set and a dependency data set; and performing multi-task joint training on the feature extraction model using the abstract data set and the dependency data set, and training the deep network model using the text data set, to obtain the text matching model. In this implementation, the multi-task joint training of the feature extraction model with the abstract data set and the dependency data set further improves the ability of the obtained text matching model to capture the core abstract content and the dependency syntax core components in the text, and thus improves the accuracy of text matching.
Optionally, in an embodiment of the present application, the text data set includes: a query content sample, a positive sample text, and a plurality of negative sample texts; training the deep network model using the text data set comprises: predicting the predicted matching text corresponding to the query content sample using the deep network model in the text matching model; calculating PairWise loss values between the predicted matching text and the positive sample text and between the predicted matching text and the negative sample texts; calculating a ListWise loss value among the query content sample, the positive sample text and the plurality of negative sample texts; and training the deep network model in the text matching model according to the PairWise loss values and the ListWise loss value. In this implementation, by combining the ideas of PairWise and ListWise and training the deep network model according to both the PairWise loss values and the ListWise loss value, training with only one of the PointWise, PairWise or ListWise loss values is avoided, so that the model has the PairWise advantage that the loss values of similar and dissimilar sentences have a larger margin, and the ListWise advantage that a plurality of texts are considered jointly, which improves the generalization ability of the text matching model.
Optionally, in an embodiment of the present application, the feature extraction model uses a RoBERTa model or a BERT model. Using the RoBERTa model as the feature extraction model to extract the feature vector of the input representation vector can further improve the accuracy of feature vector extraction compared with other models.
In the text matching model of the Internet of vehicles text matching method, the input representation vector, which better reflects the semantic representation, is fed to a pre-trained language model such as RoBERTa or BERT; furthermore, the text matching model is jointly trained with the text matching task, the part-of-speech prediction task and the dependency relation task, so that the text matching model has better semantic representation capability. Text matching or text retrieval with the jointly trained text matching model can then effectively improve the text matching or text retrieval effect in terms of semantic representation.
The embodiment of the application also provides an Internet of vehicles text matching device, which comprises: a text content extraction module, configured to acquire the text to be matched and extract the abstract content of the text to be matched and the dependency syntax core components of the text to be matched; a vector matrix obtaining module, configured to perform word segmentation and vectorization on the abstract content, the dependency syntax core components and the text to be matched to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse sequence position embedded vectors; a representation vector obtaining module, configured to fuse the sentence component vectors, the token embedded vectors, the position embedded vectors and/or the reverse sequence position embedded vectors to obtain an input representation vector; and a text matching and ranking module, configured to match and rank a plurality of retrieved texts according to the input representation vector using a text matching model, to obtain a plurality of ranked retrieved texts, wherein the text matching model is obtained through multi-task joint training.
Optionally, in an embodiment of the present application, the text content extraction module includes: an abstract content extraction module, configured to extract the abstract of the text to be matched using a pre-trained generative pre-training model as the abstract extraction model, to obtain the abstract content of the text to be matched; and a dependency relation analysis module, configured to extract the subject-predicate relation components, verb-object relation components, indirect-object relation components, adverbial structure components and/or head (core) relation components in the text to be matched using a dependency analysis tool, and determine these components as the dependency syntax core components of the text to be matched.
Optionally, in an embodiment of the present application, the text matching model includes a feature extraction model and a deep network model; the text matching and ranking module includes: a feature vector extraction module, configured to extract a feature vector of the input representation vector using the feature extraction model; and a vector matching and ranking module, configured to match and rank the text vectors corresponding to the plurality of retrieved texts according to the feature vector using the deep network model.
Optionally, in an embodiment of the present application, the Internet of vehicles text matching device further includes: a training data acquisition module, configured to acquire a text data set, an abstract data set and a dependency data set; and a matching model obtaining module, configured to perform multi-task joint training on the feature extraction model using the abstract data set and the dependency data set, and train the deep network model using the text data set, to obtain the text matching model.
Optionally, in an embodiment of the present application, the text data set includes: a query content sample, a positive sample text, and a plurality of negative sample texts; the matching model obtaining module includes: a matching text prediction module, configured to predict the predicted matching text corresponding to the query content sample using the deep network model in the text matching model; a first loss calculation module, configured to calculate PairWise loss values between the predicted matching text and the positive sample text and between the predicted matching text and the negative sample texts; a second loss calculation module, configured to calculate a ListWise loss value among the query content sample, the positive sample text and the plurality of negative sample texts; and a network model training module, configured to train the deep network model in the text matching model according to the PairWise loss values and the ListWise loss value.
Optionally, in an embodiment of the present application, the feature extraction model uses a RoBERTa model or a BERT model.
The embodiment of the application also provides an electronic device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, wherein the machine-readable instructions, when executed by the processor, perform the method described above.
Embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs a method as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a text matching method of internet of vehicles, which is provided by an embodiment of the application;
FIG. 2 is a schematic structural diagram of a text matching model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a training text matching model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an internet of vehicles text matching device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
Before introducing the text matching method of the internet of vehicles provided by the embodiment of the application, some concepts related in the embodiment of the application are introduced:
Natural language processing (Natural Language Processing, NLP) refers to the study of problems related to the cognition of natural language; understanding natural language requires extensive knowledge about the outside world and the ability to manipulate that knowledge.
Dependency parsing (Dependency Parsing) reveals the syntactic structure of a language unit by analyzing the dependency relations among its components; it holds that the core verb of a sentence is the central component that governs the other components. A dependency is a binary, asymmetric relation between a head word and its dependents; the head of a sentence is usually a verb, and every other word either depends on the head directly or is connected to it through a dependency path. Dependency grammar analyzes a sentence into a dependency syntax tree, describing the dependency relations among the words and thereby also indicating their syntactic collocation relations, which are related to the semantics.
Vectorization (Vectorization) may refer to representing the regularized character sequences described above as vectors, i.e. converting the character sequences into vector form. In a specific implementation, the character sequence may be vectorized directly, or the regularized character sequences may first be tokenized into a plurality of words; if vector representations are used for the words, a plurality of word vectors (Word Vectors) are obtained, and if vector representations are used for sentences, a plurality of sentence vectors (Sentence Vectors) are obtained.
Joint training, also known as Joint Learning (Joint Learning), refers to the use of a multi-task Learning framework to jointly train multiple neural networks in a model, specifically for example: and training the neural network models such as the feature extraction model and the deep network model successively or simultaneously by using a multi-task learning framework.
KL divergence (Kullback-Leibler divergence, KLD) is called relative entropy (relative entropy) in information theory, randomness (randomness) in continuous time series, and information gain (information gain) in statistical model inference; it is also called information divergence (information divergence). KL divergence is an asymmetric measure of the difference between two probability distributions P and Q; it measures the average number of additional bits needed to encode samples drawn from P using a code based on Q. Typically, P represents the true distribution of the data, and Q represents a theoretical distribution, an estimated model distribution, or an approximation of P.
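For reference, the discrete form of the KL divergence described above is the standard definition (not specific to this application): D_KL(P ‖ Q) = Σ_x P(x) · log(P(x) / Q(x)).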
It should be noted that the Internet of vehicles text matching method provided by the embodiment of the present application may be executed by an electronic device, where the electronic device refers to a device terminal or a server capable of executing a computer program; the device terminal is, for example, a smart phone, a personal computer (personal computer, PC), a tablet computer or a mobile Internet device (mobile Internet device, MID). The server is, for example, an x86 server or a non-x86 server, where non-x86 servers include mainframes, minicomputers and UNIX servers.
Before introducing the internet of vehicles text matching method provided by the embodiment of the application, an application scene suitable for the internet of vehicles text matching method is introduced, wherein the application scene comprises but is not limited to: fields such as internet of vehicles information retrieval, collaborative filtering, commodity recommendation, advertisement pushing, internet of vehicles question-answering systems (Question Answering System, QA System) and the like; the application scenario of the internet of vehicles information retrieval specifically includes: the user provides text content to the search engine, and then the search engine matches a plurality of related webpage sets according to the relevance of the text content, or searches all files in the local file system for files semantically matched with the target file content, and the like. The application scenario of the internet of vehicles question-answering system specifically includes: the user asks "how to go from Beijing to New York", and the Internet of vehicles question-answering system provides candidate answers which best meet the user's requirements.
Please refer to the flow chart of the Internet of vehicles text matching method provided by an embodiment of the present application shown in Fig. 1. The main idea of the method is that the input representation vector is obtained through word segmentation, vectorization and fusion of the abstract content, the dependency syntax core components and the text to be matched, so that the input representation vector can better represent the text content in terms of semantic representation; extracting the abstract content and the dependency syntax core components makes it easier for the text matching model to distinguish the core components of the text, and matching and ranking are performed according to the abstract content and core components that better represent the semantics, which effectively improves the text matching effect in terms of semantic representation. The Internet of vehicles text matching method may comprise the following steps:
Step S110: and obtaining the text to be matched, and extracting abstract content of the text to be matched and dependency syntax core components of the text to be matched.
The dependency syntax core components are the core components obtained after dependency relation analysis of the text to be matched, i.e. the components that can represent the semantic core content of the text to be matched, for example the sentence trunk components in subject-predicate, verb-object and indirect-object relations. The text to be matched refers to the text content that needs to be matched, for example the content of a question posed to an automatic question-answering system.
The embodiment of obtaining the text to be matched, the abstract content, and the dependency syntax core element in step S110 may include:
step S111: and obtaining the text to be matched.
The text to be matched in the step S111 may be obtained in various ways, including but not limited to: the first acquisition mode is to receive texts to be matched sent by other terminal equipment and store the texts to be matched into a file system, a database or mobile storage equipment; the second obtaining method obtains a pre-stored text to be matched, specifically for example: obtaining a text to be matched from a file system, or obtaining the text to be matched from a database, or obtaining the text to be matched from a mobile storage device; and thirdly, acquiring texts to be matched on the Internet by using software such as a browser or the like, or acquiring the texts to be matched by using other application programs to access the Internet.
Step S112: performing abstract extraction on the text to be matched using a pre-trained generative pre-training model as the abstract extraction model, to obtain the abstract content of the text to be matched.
The generative pre-training (Generative Pre-Training, GPT) model, also referred to simply as the GPT model or GPT-2 model, is a large-scale Transformer-based language model released by OpenAI; it adopts a training paradigm of pre-training plus fine-tuning (Fine-tuning), and can be used for tasks such as classification, reasoning, question answering and similarity.
It can be appreciated that before the above generative pre-training model (GPT or GPT-2) is used, it needs to be trained separately; by training a dedicated generative pre-training model with the text data set and the abstract data set instead of directly adopting an existing generative pre-training model from the Internet, the problem that an existing model may not be applicable is avoided. The separate training process of the generative pre-training model is specifically as follows: a text data set and an abstract data set are acquired, wherein the abstract texts in the abstract data set are obtained by abstracting the sample texts in the text data set; the acquired text data set and abstract data set may be manually collected and manually annotated data sets, and when the number of manually annotated samples is large enough, a neural network model trained on the manually annotated data can replace manual abstract extraction and be used to label a machine-annotated data set; the generative pre-training network is then trained by supervised learning (Supervised Learning) using the text data set and the abstract data set to obtain the generative pre-training model.
The embodiment of step S112 is, for example: using the pre-trained generative pre-training model (i.e. the GPT model or GPT-2 model described above) as an abstract extraction network model or abstract generation network model to extract the abstract of the text to be matched, so as to obtain the abstract content of the text to be matched. Of course, in a specific implementation, other models may be used as the abstract extraction network model, including the MatchSum model, the BertSum model, and the like; abstract generation network models may also be used, including the masked sequence-to-sequence (Masked Sequence to Sequence, MASS) model, the unified language model (Unified Language Model, UNILM), and the like.
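A minimal sketch of this abstract-extraction step, assuming a Hugging Face Transformers summarization pipeline; the checkpoint path is a placeholder, and in the setting described here it would point to the separately trained generative pre-training model rather than an off-the-shelf one.

```python
from transformers import pipeline

# "path/to/finetuned-summarizer" is a placeholder for the separately trained
# generative pre-training (GPT/GPT-2 style) abstract extraction model.
summarizer = pipeline("summarization", model="path/to/finetuned-summarizer")

text_to_match = "..."  # the Internet of vehicles text to be matched (illustrative)
abstract_content = summarizer(text_to_match, max_length=64, min_length=8)[0]["summary_text"]
```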
Step S113: and performing dependency relationship analysis on the text to be matched by using a dependency analysis tool to obtain the dependency syntax core component of the text to be matched.
The embodiment of step S113 is, for example: performing dependency analysis on the text to be matched with a dependency analysis tool, so as to extract the subject-predicate (SuBject-Verb, denoted SBV) components, verb-object (Verb-OBject, denoted VOB) components, indirect-object (Indirect-OBject, denoted IOB) components, adverbial (ADVerbial, denoted ADV) components and/or head relation (HEaD, denoted HED) components in the text to be matched, and determining these components as the dependency syntax core components of the text to be matched; dependency analysis tools that may be used here include, but are not limited to, the PyTorch-based LTP4 tool, the HanLP tool, the DDParser tool, and the like.
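A minimal sketch of keeping only the core relations, assuming a dependency parser (e.g. LTP4, HanLP or DDParser) whose output has been converted into (token, head index, relation label) triples; the exact API of those tools differs, so the parse shown here is hand-written for illustration.

```python
# Relations treated as dependency syntax core components.
CORE_RELATIONS = {"SBV", "VOB", "IOB", "ADV", "HED"}

def extract_core_components(parsed_tokens):
    """Keep only tokens whose dependency relation is one of the core relations."""
    return [token for token, head, rel in parsed_tokens if rel in CORE_RELATIONS]

# Illustrative parse of "我 喜欢 猫" as (token, head index, relation) triples.
parsed = [("我", 2, "SBV"), ("喜欢", 0, "HED"), ("猫", 2, "VOB")]
print(extract_core_components(parsed))  # ['我', '喜欢', '猫']
```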
After step S110, step S120 is performed: and performing word segmentation and vectorization on the abstract content, the dependency syntax core components and the text to be matched to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse sequence position embedded vectors.
Since the sentence component vector, token embedded vector, position embedded vector and reverse sequence position embedded vector can be combined in many ways, there are also many implementations of step S120; in a specific practical process, a subset of the vectors may be selected for fusion according to the situation, for example only the sentence component vector, token embedded vector and position embedded vector may be fused. Of course, all of the vectors may also be selected for fusion, and this embodiment includes:
step S121: and performing word segmentation and vectorization on the abstract content, the dependency syntax core component and the text to be matched to obtain a token embedded vector corresponding to the abstract content, a token embedded vector corresponding to the dependency syntax core component and a token embedded vector corresponding to the text to be matched.
Token embedding (Token Embedding), like word embedding, refers to mapping each word piece (Word Piece), complete word or other special character into a continuous vector space, and a token embedded vector (Token Embedding Vector) is the vector obtained after this mapping.
The embodiment of step S121 is, for example: tokenizing the abstract content, the dependency syntax core components and the text to be matched with a grammar- and rule-based word segmentation method, a mechanical (i.e. dictionary-based) word segmentation method, or a statistics-based method, to obtain a plurality of words; mechanical word segmentation methods include the dictionary-based forward maximum matching method, reverse maximum matching method and minimum segmentation method, and statistics-based methods include the hidden Markov model (Hidden Markov Model, HMM) method, the N-gram method, the conditional random field method, and the like. Then, the characters in each of the plurality of words are vectorized using pre-trained language models (Pretraining Language Models, PLMs) to obtain the token embedded vectors corresponding to the abstract content, the dependency syntax core components and the text to be matched. PLMs, also simply called pre-trained models, are neural network models obtained by semi-supervised machine learning of a neural network on a large amount of text corpus as training data; such pre-trained models capture the textual structure relations of the language.
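A minimal sketch of tokenization and token-embedding lookup, assuming a BERT-style pre-trained language model from Hugging Face Transformers; the checkpoint name and example sentences are illustrative, and the application does not prescribe a specific tokenizer or PLM.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # illustrative checkpoint
plm = AutoModel.from_pretrained("bert-base-chinese")

abstract_content = "我喜欢猫"      # abstract content (illustrative)
core_component = "喜欢 猫"         # dependency syntax core component (illustrative)
text_to_match = "其实我很喜欢猫"    # text to be matched (illustrative)

# Tokenize the three parts into one sequence of token ids.
encoded = tokenizer(abstract_content + core_component + text_to_match, return_tensors="pt")

# Token embedding: each token id is mapped into the PLM's continuous vector space.
token_embedded_vectors = plm.get_input_embeddings()(encoded["input_ids"])
print(token_embedded_vectors.shape)  # (1, sequence_length, hidden_size)
```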
Step S122: and performing word segmentation and vectorization on the abstract content, the dependency syntax core component and the text to be matched to obtain sentence component vectors corresponding to the abstract content, sentence component vectors corresponding to the dependency syntax core component and sentence component vectors corresponding to the text to be matched.
The sentence component vector is a vector indicating whether a token (Token) belongs to the abstract content, the dependency syntax core components, or the text to be matched; for example, a token belonging to the abstract content is labeled 1, a token belonging to the dependency syntax core components is labeled 2, and a token belonging to the text to be matched is labeled 3.
Step S122 specifically includes: after the abstract content, the dependency syntax core components and the text to be matched are each tokenized, the number of tokens in each of them can be counted, and sentence component vectorization is applied according to these token counts, so that the sentence component vectors corresponding to the abstract content, the dependency syntax core components and the text to be matched are obtained.
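A minimal sketch of this step, assuming the labelling scheme from the example above (1 = abstract content, 2 = dependency syntax core component, 3 = text to be matched); the embedding dimension is an assumption.

```python
import torch

def sentence_component_ids(n_abstract_tokens, n_core_tokens, n_text_tokens):
    # One id per token, indicating which sentence component the token belongs to.
    return torch.tensor(
        [1] * n_abstract_tokens + [2] * n_core_tokens + [3] * n_text_tokens
    )

component_ids = sentence_component_ids(5, 4, 7)  # illustrative token counts
# An embedding table turns each id into a sentence component vector.
component_embedding = torch.nn.Embedding(num_embeddings=4, embedding_dim=768)
sentence_component_vectors = component_embedding(component_ids)  # shape (16, 768)
```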
Step S123: and carrying out vectorization processing on the abstract content, the dependency syntax core component and each word position in the text to be matched to obtain a position embedding vector corresponding to the abstract content, a position embedding vector corresponding to the dependency syntax core component and a position embedding vector corresponding to the text to be matched.
Position embedding (Position Embedding) is similar to the token embedding above, except that the position embedded vector vectorizes the position of the token rather than the token itself; the position embedded vector is the vector obtained by vectorizing the position of the token.
The embodiment of step S123 is, for example: vectorizing the position of each word in the abstract content, the dependency syntax core components and the text to be matched using a GloVe model, a word2vec model, a FastText model or the like, to obtain the position embedded vectors corresponding to the abstract content, the dependency syntax core components and the text to be matched.
It will be appreciated that the above position embedded vectors can take two forms. In the first form, the positions are counted over the abstract content, the dependency syntax core components and the text to be matched taken together, i.e. each word embedding (Word Embedding) has only one unique position embedded vector across the abstract content, the dependency syntax core components and the text to be matched to represent its position. For example, suppose the abstract content is "[CLS] I like cat [SEP]", the dependency syntax core component is "[CLS] like cat [SEP]", and the text to be matched is "[CLS] I just like cat [SEP]"; both the abstract content and the text to be matched contain "I", but at different positions: the position index of "I" in the abstract content is 1, while the position index of "I" in the text to be matched is 8. Position embedded vectors counted jointly in this way can effectively represent the positions of the same word in different sentence components (abstract content, dependency syntax core components, or text to be matched). In the second form, the positions of the abstract content, the dependency syntax core components and the text to be matched are counted separately, i.e. the same position index or position embedded vector may appear in the abstract content, the dependency syntax core components and the text to be matched. For example, with the same abstract content "[CLS] I like cat [SEP]", dependency syntax core component "[CLS] like cat [SEP]" and text to be matched "[CLS] I just like cat [SEP]", "I" appears in both the abstract content and the text to be matched with the same relative index position 1, and "[CLS]" appears in all three with the relative index position 0.
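A minimal sketch of the two position-indexing forms described above, using the example sentences from the preceding paragraph (token boundaries are illustrative); the resulting indices would then be fed through an embedding table to obtain position embedded vectors.

```python
summary_tokens = ["[CLS]", "I", "like", "cat", "[SEP]"]
core_tokens = ["[CLS]", "like", "cat", "[SEP]"]
text_tokens = ["[CLS]", "I", "just", "like", "cat", "[SEP]"]

all_tokens = summary_tokens + core_tokens + text_tokens

# First form: one shared position index over the concatenated sequence,
# so the same word gets different indices in different sentence components.
shared_positions = list(range(len(all_tokens)))

# Second form: each sentence component is indexed independently from 0,
# so the same relative index can appear in several components.
separate_positions = (
    list(range(len(summary_tokens)))
    + list(range(len(core_tokens)))
    + list(range(len(text_tokens)))
)
```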
Step S124: and carrying out vectorization processing on the abstract content, the dependency syntax core component and the reverse sequence position of each word in the text to be matched to obtain an embedded vector of the reverse sequence position.
Reverse-order position embedding (Reverse Position Embedding) is similar to the position embedding above, except that position embedding vectorizes the forward-order position of the token, while reverse-order position embedding vectorizes the reverse-order position of the token; the reverse sequence position embedded vector is the vector obtained by vectorizing the reverse-order position of the token. Similarly, the reverse sequence position embedded vector has two forms: in the first form, the positions are counted over the abstract content, the dependency syntax core components and the text to be matched taken together; in the second form, the positions of the abstract content, the dependency syntax core components and the text to be matched are counted separately. Reference can be made to the description of the two forms of the position embedded vector above.
The embodiment of step S124 is, for example: vectorizing the reverse-order position of each word in the abstract content, the dependency syntax core components and the text to be matched using a GloVe model, a word2vec model, a FastText model or the like, to obtain the reverse sequence position embedded vectors corresponding to the abstract content, the dependency syntax core components and the text to be matched.
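A minimal sketch of the reverse-order position indices described above, shown for a single illustrative token sequence; the same shared/separate distinction as for forward positions applies.

```python
tokens = ["[CLS]", "I", "like", "cat", "[SEP]"]           # illustrative token sequence
positions = list(range(len(tokens)))                      # forward order: [0, 1, 2, 3, 4]
reverse_positions = list(range(len(tokens) - 1, -1, -1))  # reverse order: [4, 3, 2, 1, 0]
```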
After step S120, step S130 is performed: and respectively carrying out fusion processing on the sentence component vector, the token embedded vector, the position embedded vector and/or the reverse sequence position embedded vector to obtain an input representation vector.
The embodiment of step S130 includes two forms. In the first embodiment, the embedded vectors are fused by element-wise summation (sum); for example, assuming the vectors are all 2-dimensional, with the sentence component vector [0.11, 0.07], the token embedded vector [0.02, 0.13], the position embedded vector [0.04, 0.03] and the reverse sequence position embedded vector [0.07, 0.04], the input representation vector obtained by summation is [0.24, 0.27]. In the second embodiment, the embedded vectors are fused by concatenation (concat); with the same vectors as above, the input representation vector obtained by concatenation is [0.11, 0.07, 0.02, 0.13, 0.04, 0.03, 0.07, 0.04].
In the implementation process, the token embedded vector, the position embedded vector and the reverse sequence position embedded vector are fused in the input expression vector, so that the input expression vector can better express text content in terms of semantic expression, and the text matching effect in terms of semantic expression is effectively improved.
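A minimal sketch reproducing the 2-dimensional fusion example above with PyTorch tensors; the numbers are the illustrative values from the text.

```python
import torch

sentence_component = torch.tensor([0.11, 0.07])
token_embedded = torch.tensor([0.02, 0.13])
position_embedded = torch.tensor([0.04, 0.03])
reverse_position_embedded = torch.tensor([0.07, 0.04])

# Fusion by element-wise summation -> approximately [0.24, 0.27]
input_repr_sum = sentence_component + token_embedded + position_embedded + reverse_position_embedded

# Fusion by concatenation -> an 8-dimensional input representation vector
input_repr_concat = torch.cat(
    [sentence_component, token_embedded, position_embedded, reverse_position_embedded]
)
```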
After step S130, step S140 is performed: and matching and sorting the plurality of search texts according to the input representation vector by using a text matching model to obtain a plurality of sorted search texts, wherein the text matching model is obtained through multi-task joint training.
Please refer to Fig. 2, which illustrates a schematic structural diagram of the text matching model according to an embodiment of the present application. The text matching model (Text Matching Model) is a neural network model for text matching and includes a feature extraction model and a deep network model; the deep network model includes a plurality of Transformer blocks (12 are shown in the figure), each of which includes a self-attention layer, a first regularization layer, a fully connected layer and a second regularization layer connected in sequence. Because the deep network model contains multiple self-attention (Self-Attention) layers, the text matching model that includes it can better attend to the core components of the PairWise data (i.e. positive sample text and negative sample text) and the ListWise data (i.e. the query content sample and a plurality of sample texts), and the matching data pair (Pair) of positive and negative sample text is expanded into a matching sequence (List) of the query content sample and a plurality of sample texts, which greatly enhances the generalization ability of the text matching model.
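A minimal sketch of one block of the deep network model described above (self-attention, a first regularization layer, a fully connected layer and a second regularization layer); the hidden sizes, residual connections and GELU activation are assumptions, not values prescribed by the application.

```python
import torch
import torch.nn as nn

class MatchingBlock(nn.Module):
    """One Transformer-style block: self-attention + two regularization (layer-norm) layers."""
    def __init__(self, hidden_size=768, num_heads=12, ffn_size=3072):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)   # first regularization layer
        self.ffn = nn.Sequential(                # fully connected layer
            nn.Linear(hidden_size, ffn_size), nn.GELU(), nn.Linear(ffn_size, hidden_size)
        )
        self.norm2 = nn.LayerNorm(hidden_size)   # second regularization layer

    def forward(self, x):
        attn_out, _ = self.attention(x, x, x)    # self-attention over the sequence
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x

# 12 stacked blocks, as in the figure.
deep_network = nn.Sequential(*[MatchingBlock() for _ in range(12)])
```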
The embodiment of step S140 may include:
step S141: feature vectors of the input representation vectors are extracted using a feature extraction model.
The embodiment of step S141 is, for example: extracting the feature vector of the input representation vector using a RoBERTa model or a Bidirectional Encoder Representations from Transformers (BERT) model as the feature extraction model.
Step S142: and matching and sorting text vectors corresponding to the plurality of search texts according to the feature vectors by using the depth network model to obtain a plurality of sorted search texts.
The embodiment of step S142 is, for example: using the 12 Transformer blocks in sequence to match and rank the text vectors corresponding to the plurality of retrieved texts, to obtain a plurality of ranked retrieved texts; the text matching model is obtained through multi-task joint training, and each Transformer block comprises a self-attention layer, a first regularization layer, a fully connected layer and a second regularization layer connected in sequence.
In the implementation process, firstly, abstract content of a text to be matched and dependency syntax core components of the text to be matched are extracted, then word segmentation, vectorization and fusion processing are carried out on the abstract content, the dependency syntax core components and the text to be matched to obtain input expression vectors fused with sentence component vectors, and finally, a text matching model is used for matching and sequencing a plurality of search texts according to the input expression vectors. That is, the input representation vector obtained by word segmentation, vectorization and fusion processing of the abstract content, the dependency syntax core component and the text to be matched enables the input representation vector to better represent text content in terms of semantic representation of sentence components, and the abstract content and the dependency syntax core component are extracted, so that the text matching model can more easily distinguish the abstract content and the core component of the text, and match and sort according to the abstract content and the core component capable of better representing the semantics, thereby effectively improving the text matching effect in terms of semantic representation.
Please refer to fig. 3, which is a schematic flowchart of a training text matching model according to an embodiment of the present application; it will be appreciated that before using the text matching model in step S140, training the text matching model is further required, and an embodiment of training the text matching model may include:
step S210: a text dataset, a summary dataset, and a dependency dataset are obtained.
The embodiment of step S210 may include:
step S211: a pre-trained abstract extraction network model, a dependency analysis tool, and a text dataset comprising a plurality of sample texts and part-of-speech tag values corresponding to the sample texts are obtained.
The abstract extraction network model, the dependency analysis tool, and the text data set in step S211 are obtained by, for example: the first acquisition mode is to receive abstract extraction network models, dependency analysis tools and/or text data sets sent by other terminal equipment and store the abstract extraction network models, the dependency analysis tools and/or the text data sets into a file system, a database or mobile storage equipment; the second way of obtaining, obtains a pre-stored abstract extraction network model, dependency analysis tool and/or text data set, specifically for example: acquiring the data set from a file system, a database and/or a mobile storage device; a third way of obtaining, using software such as a browser or other application, access the Internet to obtain the abstract extraction network model, dependency analysis tool and/or text data set.
Step S212: and abstracting the sample text by using a pre-trained abstract extraction network model or an abstract generation network model to obtain an abstract text of the sample text, performing part-of-speech prediction on the abstract text of the sample text to obtain a part-of-speech tag value of the abstract text, and adding the abstract text of the sample text and the part-of-speech tag value of the abstract text into an abstract data set.
The implementation principle and implementation of this step S212 are similar to those of the step S112, and thus, the implementation principle and implementation thereof will not be described again here, and reference may be made to the description of the step S112 if it is not clear.
Step S213: extracting the dependency syntax core component of the sample text by using the dependency analysis tool, and adding the dependency syntax core component of the sample text as a dependency tag value into the dependency data set.
The implementation principle and implementation of this step S213 are similar to those of the step S113, and thus, the implementation principle and implementation thereof will not be described here, and reference may be made to the description of the step S113 if it is not clear.
After step S210, step S220 is performed: and performing multi-task combined training on the feature extraction model by using the abstract data set and the dependency data set, and training the deep network model by using the text data set to obtain a text matching model.
The text data set includes: a query content sample, a positive sample text, and a plurality of negative sample texts. The query content sample is a sample of the text content that needs to be matched, for example a question in an automatic question-answering task, or the text to be searched submitted by a user in a search engine task. The positive sample text is a matching content text whose similarity, relevance or satisfaction exceeds a preset threshold, for example the content text that is the most appropriate answer to the question in an automatic question-answering task, or the content text in a search engine task that best matches what the user wants to find. Similarly, a negative sample text is a matching content text whose similarity, relevance or satisfaction does not exceed the preset threshold, for example content text returned in a search engine task that the user did not intend to find.
The embodiment of the multi-task joint training of the feature extraction model using the abstract data set and the dependency data set in step S220 is as follows: performing multi-task joint training (Joint Training) on the feature extraction model according to the abstract data set and the dependency data set using a multi-task learning framework; the multiple tasks include, but are not limited to, the part-of-speech prediction and dependency analysis tasks corresponding to each token (Token), and the multi-task learning framework that may be used includes, but is not limited to, the Multi-gate Mixture-of-Experts (MMoE) framework and the like.
In a specific practice, a loss function may be used for each task in the multi-task joint training to calculate a loss value for each task; for example: calculating a first loss value between the part-of-speech prediction value and the part-of-speech tag value corresponding to each token (Token) using a multi-class cross-entropy loss function, where the first loss value may be denoted L1; and/or calculating a second loss value between the dependency prediction value and the dependency label value in the dependency data set in the dependency analysis task using a multi-class cross-entropy loss function, where the second loss value may be denoted L2.
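A minimal sketch of the two per-task loss values L1 and L2 described above, assuming logits of shape (num_tokens, num_classes); how the task losses are weighted and combined in the joint training is an assumption, not something the text prescribes.

```python
import torch.nn.functional as F

def multitask_losses(pos_logits, pos_labels, dep_logits, dep_labels):
    # L1: multi-class cross entropy between part-of-speech predictions and tag values.
    l1 = F.cross_entropy(pos_logits, pos_labels)
    # L2: multi-class cross entropy between dependency predictions and label values.
    l2 = F.cross_entropy(dep_logits, dep_labels)
    return l1, l2

# Example combination (equal weighting assumed): total_loss = l1 + l2
```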
The training of the deep network model using the text data set in step S220 may include:
step S221: and predicting the predicted matching text corresponding to the query content sample by using a depth network model in the text matching model.
The embodiment of step S221 described above is, for example: word segmentation and vectorization are carried out on abstract content in the abstract data set, dependency syntax core components in the dependency data set and sample texts in the text data set to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse sequence position embedded vectors; respectively carrying out fusion processing on sentence component vectors, token embedded vectors, position embedded vectors and/or reverse sequence position embedded vectors to obtain input representation vectors; and extracting feature vectors of the input representing vectors by using a feature extraction model in the text matching model, finally, matching and sorting text vectors corresponding to the plurality of search texts by using a depth network model according to the feature vectors to obtain a plurality of sorted search texts, and determining the plurality of sorted search texts as predictive matching texts.
It will be appreciated that the above embodiment of determining the sorted plurality of search texts as the predicted matching texts differs depending on which loss value is calculated. For example: when calculating the PairWise loss value, the predicted matching texts include one positive sample text and one negative sample text, so the search text found to be most similar to (i.e., best matching) the predicted matching text among the sorted search texts is taken as the positive sample text, and the search text found to be least similar to (i.e., least matching) the predicted matching text is taken as the negative sample text. When calculating the ListWise loss value, the predicted matching texts include one positive sample text and a plurality of negative sample texts, so the search text found to be most similar to the predicted matching text among the sorted search texts is taken as the positive sample text, and the remaining search texts are taken as the plurality of negative sample texts.
Step S222: calculating a PairWise loss value between the predicted matching text and the positive sample text and the negative sample text.
The above embodiment of step S222 is, for example: predicting the predicted matching text corresponding to the query content sample using the deep network model in the text matching model, and calculating the PairWise loss value between the predicted matching text and the positive sample text and the negative sample text using a fourth loss function; the fourth loss function can be expressed as L_PairWise = max(0, m − h_θ(q_i, c_i⁺) + h_θ(q_i, c_i⁻)), where L_PairWise denotes the PairWise loss value between the predicted matching text and the positive and negative sample texts, m denotes a preset boundary (margin) threshold, q_i denotes the i-th predicted matching text, c_i⁺ denotes the i-th positive sample text, c_i⁻ denotes the i-th negative sample text, h_θ(q_i, c_i⁺) denotes calculating the absolute correlation value between the i-th predicted matching text and the i-th positive sample text, and h_θ(q_i, c_i⁻) denotes calculating the absolute correlation value between the i-th predicted matching text and the i-th negative sample text.
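For illustration only, the following sketch computes a margin-based PairWise loss of the form given above over a batch of absolute correlation scores; averaging over the batch and the margin value are assumptions.

```python
import torch

def pairwise_loss(pos_scores, neg_scores, margin=0.5):
    """Mean over the batch of max(0, m - h(q_i, c_i+) + h(q_i, c_i-)); margin value is illustrative."""
    return torch.clamp(margin - pos_scores + neg_scores, min=0.0).mean()

# pos_scores / neg_scores: 1-D tensors of absolute correlation values h_theta(q_i, c_i+/-)
loss = pairwise_loss(torch.tensor([0.9, 0.4]), torch.tensor([0.2, 0.5]))
```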
Step S223: calculating a ListWise loss value among the query content sample, the positive sample text, and the plurality of negative sample texts.
The above embodiment of step S223 is, for example: first, an absolute relevance score between the query content sample and each sample text of the plurality of sample texts is calculated according to the formula Score_j = h_θ(q_i, c_ij), so as to obtain a plurality of relevance scores, where Score_j denotes the j-th relevance score in the text data set, q_i denotes the i-th predicted matching text, c_ij denotes the j-th sample text in the i-th group of sample texts, h_θ(q_i, c_ij) denotes calculating the absolute correlation value between the i-th predicted matching text and the j-th sample text in the i-th group of sample texts, and the plurality of sample texts comprises: the positive sample text and the plurality of negative sample texts. Then, the plurality of relevance scores are normalized according to the formula S = softmax([Score_1, Score_2, …, Score_m]), so as to obtain a normalized relevance score set, where S denotes the normalized relevance score set and Score_1, Score_2, …, Score_m denote the plurality of relevance scores. Finally, the relevance labels corresponding to the query content sample are normalized using the formula Y = Y′ / ΣY′, so as to obtain normalized relevance labels, where Y denotes the normalized relevance labels, Y′ denotes the relevance labels corresponding to the query content sample, and ΣY′ denotes the sum of all relevance labels; the KL divergence between the normalized relevance score set and the normalized relevance labels is then calculated, and its value is taken as the ListWise loss value, which can be denoted L_ListWise.
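For illustration only, the following sketch computes the ListWise loss as the KL divergence between the normalized relevance labels Y and the softmax-normalized score set S, following the formulas above; the epsilon term and batch averaging are assumptions.

```python
import torch
import torch.nn.functional as F

def listwise_loss(scores, relevance_labels, eps=1e-12):
    """KL(Y || S) with Y = Y' / sum(Y') and S = softmax(scores), averaged over the batch."""
    log_s = F.log_softmax(scores, dim=-1)                               # log S
    y = relevance_labels / relevance_labels.sum(dim=-1, keepdim=True)   # Y = Y' / sum(Y')
    return (y * (torch.log(y + eps) - log_s)).sum(dim=-1).mean()

# scores / relevance_labels: (batch, m) tensors over the positive and negative sample texts
```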
Step S224: training the deep network model in the text matching model according to the PairWise loss value and the ListWise loss value.
The above embodiment of step S224 is, for example: constructing a third loss function of the deep network model according to the PairWise loss value and the ListWise loss value, and training the deep network model in the text matching model using the third loss function to obtain the trained deep network model; the calculation using the third loss function can be expressed as L3 = m×L_PairWise + n×L_ListWise, where L3 denotes the third loss value, m and n are two different hyper-parameters, L_PairWise denotes the PairWise loss value, and L_ListWise denotes the ListWise loss value.
In a specific implementation process, the feature extraction model and the deep network model are usually trained jointly in an alternating manner using the multi-task learning framework, so the loss function of the feature extraction model and the loss function of the deep network model can be combined into a total loss function, which can be expressed as Loss = a×L1 + b×L2 + c×L3, where L1 denotes the first loss value, L2 denotes the second loss value, L3 denotes the third loss value, and a, b and c are different hyper-parameters.
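For illustration only, the following sketch combines the loss values into the third loss and the total loss described above; the default weight values are assumptions.

```python
def total_loss(l1, l2, l_pairwise, l_listwise, a=1.0, b=1.0, c=1.0, m=1.0, n=1.0):
    """Loss = a*L1 + b*L2 + c*L3, with L3 = m*L_PairWise + n*L_ListWise (weights are illustrative)."""
    l3 = m * l_pairwise + n * l_listwise
    return a * l1 + b * l2 + c * l3
```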
In the text matching model used in the Internet of vehicles text matching method, the input representation vector carrying the semantic representation is fed to a pre-trained language model such as Roberta or BERT, and the text matching model is then jointly trained on the text matching task, the part-of-speech prediction task and the dependency analysis task, so that the text matching model acquires a stronger semantic representation capability; performing text matching or text retrieval with the jointly trained text matching model therefore effectively improves the matching or retrieval effect at the semantic level.
Please refer to fig. 4, which illustrates a schematic structural diagram of an internet of vehicles text matching device according to an embodiment of the present application; the embodiment of the application provides a text matching device 300 for internet of vehicles, which comprises the following components:
the text content extraction module 310 is configured to obtain a text to be matched, and extract abstract content of the text to be matched and dependency syntax core components of the text to be matched.
The vector matrix obtaining module 320 is configured to perform word segmentation and vectorization on the summary content, the dependency syntax core component, and the text to be matched to obtain an embedded vector matrix, where the embedded vector matrix includes a sentence component vector, a token embedded vector, a position embedded vector, and/or an inverted position embedded vector;
the representation vector obtaining module 330 is configured to perform fusion processing on the sentence component vector, the token embedded vector, the position embedded vector, and/or the inverted position embedded vector to obtain an input representation vector.
The text matching and sorting module 340 is configured to perform matching and sorting on the plurality of search texts according to the input representation vector by using a text matching model, so as to obtain a plurality of sorted search texts, where the text matching model is obtained through multi-task joint training.
Optionally, in an embodiment of the present application, the text content extraction module includes:
The abstract content extraction module, configured to summarize the text to be matched using a pre-trained generative pre-training model as the abstract extraction model, so as to obtain the abstract content of the text to be matched (an illustrative summarization sketch is given after this module list).
The dependency relation analysis module, configured to extract the subject-predicate relation components, verb-object relation components, indirect-object relation components, adverbial-head structure components and/or core relation components in the text to be matched using a dependency analysis tool, and to determine these components as the dependency syntax core components of the text to be matched (an illustrative parser sketch follows this module list).
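For illustration only (as referenced in the abstract content extraction module above), the following sketch calls a generative summarization model through the Hugging Face transformers pipeline; the checkpoint name and length limit are assumptions, since the embodiments above do not name a specific model.

```python
from transformers import pipeline

# "t5-small" is a placeholder checkpoint; the embodiments only require a pre-trained generative model.
summarizer = pipeline("summarization", model="t5-small")

def extract_summary(text_to_match: str, max_length: int = 64) -> str:
    """Return the abstract content of the text to be matched."""
    return summarizer(text_to_match, max_length=max_length, truncation=True)[0]["summary_text"]
```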
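For illustration only (as referenced in the dependency relation analysis module above), the following sketch extracts core syntactic components with spaCy as a stand-in dependency analysis tool; the tool choice and the set of retained relation labels are assumptions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in parser; the embodiments do not name a specific tool

# Rough analogues of the subject-predicate, verb-object, indirect-object, adverbial and core relations
CORE_DEPS = {"nsubj", "dobj", "iobj", "advmod", "ROOT"}

def dependency_core_components(text_to_match: str) -> str:
    """Return the words participating in the retained dependency relations, joined as a string."""
    doc = nlp(text_to_match)
    return " ".join(tok.text for tok in doc if tok.dep_ in CORE_DEPS)
```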
Optionally, in an embodiment of the present application, the text matching model includes: a feature extraction model and a deep network model; the text matching and sorting module includes:

the feature vector extraction module, configured to extract the feature vector of the input representation vector using the feature extraction model;

the vector matching and sorting module, configured to match and sort the text vectors corresponding to the plurality of search texts according to the feature vectors using the deep network model.
Optionally, in an embodiment of the present application, the internet of vehicles text matching device further includes:
and the training data acquisition module is used for acquiring the text data set, the abstract data set and the dependency data set.
And the matching model obtaining module is used for performing multi-task joint training on the feature extraction model by using the abstract data set and the dependency data set, and training the deep network model by using the text data set to obtain a text matching model.
Optionally, in an embodiment of the present application, the text data set includes: a query content sample, a positive sample text, and a plurality of negative sample texts; the matching model obtaining module includes:

the matching text prediction module, configured to predict the predicted matching text corresponding to the query content sample using the deep network model in the text matching model;

the first loss calculation module, configured to calculate the PairWise loss value between the predicted matching text and the positive sample text and the negative sample text;

the second loss calculation module, configured to calculate the ListWise loss value among the query content sample, the positive sample text, and the plurality of negative sample texts;

the network model training module, configured to train the deep network model in the text matching model according to the PairWise loss value and the ListWise loss value.
It should be understood that this apparatus corresponds to the Internet of vehicles text matching method embodiment described above and can perform each step involved in that method embodiment; for the specific functions of the apparatus, reference may be made to the above description, and detailed descriptions are omitted here to avoid repetition. The apparatus includes at least one software functional module that can be stored in memory in the form of software or firmware, or be built into the operating system (OS) of the apparatus.
Please refer to fig. 5, which illustrates a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 400 provided in an embodiment of the present application includes: a processor 410 and a memory 420, the memory 420 storing machine-readable instructions executable by the processor 410, which when executed by the processor 410 perform the method as described above.
The embodiment of the present application also provides a computer readable storage medium 430, on which computer readable storage medium 430 a computer program is stored which, when executed by the processor 410, performs a method as above.
The computer-readable storage medium 430 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
In the embodiments of the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality and operation of possible implementations of apparatuses, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
In addition, the functional modules of the embodiments of the present application may be integrated together to form a single part, or the modules may exist separately, or two or more modules may be integrated to form a single part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The foregoing description is merely an optional implementation of the embodiment of the present application, but the scope of the embodiment of the present application is not limited thereto, and any person skilled in the art may easily think about changes or substitutions within the technical scope of the embodiment of the present application, and the changes or substitutions are covered by the scope of the embodiment of the present application.

Claims (6)

1. An Internet of vehicles text matching method, characterized by comprising the following steps:
acquiring a text to be matched, and extracting abstract content of the text to be matched and dependency syntax core components of the text to be matched;
word segmentation and vectorization are carried out on the abstract content, the dependency syntax core component and the text to be matched to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse sequence position embedded vectors;
fusion processing is carried out on the sentence component vector, the token embedded vector, the position embedded vector and/or the reverse sequence position embedded vector, so that an input representation vector is obtained;
Matching and sorting the plurality of search texts according to the input representation vector by using a text matching model to obtain a plurality of sorted search texts, wherein the text matching model is obtained through multi-task joint training;
the extracting the abstract content of the text to be matched and the dependency syntax core components of the text to be matched comprises: using a pre-trained generative pre-training model as an abstract extraction model to summarize the text to be matched, so as to obtain the abstract content of the text to be matched; extracting a subject-predicate relation component, a verb-object relation component, an adverbial-head structure component and/or a core relation component in the text to be matched using a dependency analysis tool, and determining the subject-predicate relation component, the verb-object relation component, the adverbial-head structure component and/or the core relation component as the dependency syntax core components of the text to be matched;
the text matching model comprises: a feature extraction model and a depth network model; the using a text matching model to match and rank a plurality of search texts according to the input representation vector includes: obtaining a text dataset, a summary dataset, and a dependency dataset, the text dataset comprising: querying a content sample, a positive sample text, and a plurality of negative sample texts; performing multi-task joint training on the feature extraction model by using the abstract data set and the dependency data set, and predicting a prediction matching text corresponding to the query content sample by using a deep network model in the text matching model; calculating a PairWise loss value between the predictive matching text and the positive sample text and between the predictive matching text and the negative sample text; calculating a ListWise loss value among the query content sample, the positive sample text, and the plurality of negative sample texts; training a depth network model in the text matching model according to the PairWise loss value and the ListWise loss value to obtain the text matching model; extracting a feature vector of the input representation vector using the feature extraction model; and matching and sorting text vectors corresponding to the plurality of search texts according to the feature vectors by using the depth network model.
2. The method of claim 1, further comprising, prior to said summarizing the text to be matched using the pre-trained generated pre-training model as a summary extraction model:
acquiring a text data set and a summary data set, wherein summary text in the summary data set is obtained by abstracting sample text in the text data set;
and training a generated pre-training network by using the text data set and the abstract data set to obtain the generated pre-training model.
3. The method of any of claims 1-2, wherein the feature extraction model employs a Roberta model.
4. An Internet of vehicles text matching device, characterized by comprising:
the text content extraction module is used for obtaining a text to be matched and extracting abstract content of the text to be matched and dependency syntax core components of the text to be matched;
the vector matrix obtaining module is used for word segmentation and vectorization of the abstract content, the dependency syntax core component and the text to be matched to obtain an embedded vector matrix, and the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse sequence position embedded vectors;
The representation vector obtaining module is used for carrying out fusion processing on the sentence component vector, the token embedded vector, the position embedded vector and/or the reverse sequence position embedded vector to obtain an input representation vector;
the text matching and sorting module is used for matching and sorting a plurality of search texts according to the input representation vector by using a text matching model to obtain a plurality of sorted search texts, and the text matching model is obtained through multi-task joint training;
the extracting the abstract content of the text to be matched and the dependency syntax core components of the text to be matched comprises: using a pre-trained generative pre-training model as an abstract extraction model to summarize the text to be matched, so as to obtain the abstract content of the text to be matched; extracting a subject-predicate relation component, a verb-object relation component, an adverbial-head structure component and/or a core relation component in the text to be matched using a dependency analysis tool, and determining the subject-predicate relation component, the verb-object relation component, the adverbial-head structure component and/or the core relation component as the dependency syntax core components of the text to be matched;
The text matching model comprises: a feature extraction model and a depth network model; the using a text matching model to match and rank a plurality of search texts according to the input representation vector includes: obtaining a text dataset, a summary dataset, and a dependency dataset, the text dataset comprising: querying a content sample, a positive sample text, and a plurality of negative sample texts; performing multi-task joint training on the feature extraction model by using the abstract data set and the dependency data set, and predicting a prediction matching text corresponding to the query content sample by using a deep network model in the text matching model; calculating a PairWise loss value between the predictive matching text and the positive sample text and between the predictive matching text and the negative sample text; calculating a ListWise loss value among the query content sample, the positive sample text, and the plurality of negative sample texts; training a depth network model in the text matching model according to the PairWise loss value and the ListWise loss value to obtain the text matching model; extracting a feature vector of the input representation vector using the feature extraction model; and matching and sorting text vectors corresponding to the plurality of search texts according to the feature vectors by using the depth network model.
5. An electronic device, comprising: a processor and a memory storing machine-readable instructions executable by the processor, which when executed by the processor perform the method of any one of claims 1 to 3.
6. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the method according to any of claims 1 to 3.