CN117312928B - Method and system for identifying user equipment information based on AIGC - Google Patents

Method and system for identifying user equipment information based on AIGC

Info

Publication number
CN117312928B
Authority
CN
China
Prior art keywords
user agent
character string
agent character
vector
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311601466.2A
Other languages
Chinese (zh)
Other versions
CN117312928A (en)
Inventor
杨本芊
杨文俊
蔡海翔
任翔
徐健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Mesh Information Technology Co ltd
Original Assignee
Nanjing Mesh Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Mesh Information Technology Co ltd filed Critical Nanjing Mesh Information Technology Co ltd
Priority to CN202311601466.2A priority Critical patent/CN117312928B/en
Publication of CN117312928A publication Critical patent/CN117312928A/en
Application granted granted Critical
Publication of CN117312928B publication Critical patent/CN117312928B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a system for identifying user equipment information based on AIGC, belonging to the field of device identification and comprising the following steps: acquiring user agent character strings and the device information corresponding to each user agent character string; encoding the user agent character strings with a pre-trained language model to obtain user agent character string vectors and constructing a user agent character string vector library; retrieving from the vector library the first N user agent character string vectors most similar to the vector of the user agent character string to be identified; combining the device information of the user agent character strings corresponding to the first N vectors with the user agent character string to be identified to generate a sequence, and constructing prompt information from the sequence; and taking the prompt information and the user agent character string to be identified as input and outputting predicted device information as the device information corresponding to the user agent character string to be identified. Addressing the low accuracy of device information identification in the prior art, the method improves the accuracy of information identification.

Description

Method and system for identifying user equipment information based on AIGC
Technical Field
The present application relates to the field of device identification, and more particularly, to a method and system for identifying user equipment information based on AIGC.
Background
For a long time, the mainstream method for identifying user equipment has been to parse the User Agent character string. The User Agent character string contains information such as the manufacturer, model and operating system of the device and can reflect the user's device type. However, this method suffers from low accuracy.
The User Agent character string itself carries limited device information, often only the manufacturer, operating system, browser and the like, which is insufficient for determining the specific device model. Moreover, different manufacturers express User Agent character strings in very different forms, and the same device information may be expressed by different character strings, which increases the difficulty of parsing.
The Chinese patent application with application number CN202110795179.4, published on September 21, 2021, discloses a named entity recognition method, apparatus, device and storage medium. That application includes: acquiring a graph feature vector of the character to be identified; and inputting the graph feature vector of the character to be identified into a target named entity recognition model to obtain the entity category corresponding to that graph feature vector. However, that application has at least the following problems: training for named entity recognition typically requires a large amount of annotated data. If the annotated data contains errors or inconsistencies, the model may be affected during training and prediction. Erroneous annotations may cause the model to learn wrong patterns or fail to capture correct ones, thereby lowering the accuracy of information identification.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problem of low information identification accuracy in existing methods, the application provides a method and a system for identifying user equipment information based on AIGC, which improve the accuracy of information identification by encoding user agent character strings with a pre-trained language model and retrieving them with a vector retrieval method.
2. Technical solution
The object of the present application is achieved by the following technical solution.
An aspect of the embodiments of the present specification provides a method for identifying user equipment information based on AIGC, including: acquiring user agent character strings and the device information corresponding to each user agent character string; encoding the user agent character strings by utilizing a pre-trained language model to obtain user agent character string vectors and constructing a user agent character string vector library; receiving a user agent character string to be identified; encoding the user agent character string to be identified by utilizing the pre-trained language model to obtain a vector of the user agent character string to be identified; retrieving from the user agent character string vector library the first N user agent character string vectors similar to the vector of the user agent character string to be identified, wherein N is a positive integer; combining the device information of the user agent character strings corresponding to the first N user agent character string vectors with the user agent character string to be identified to generate a sequence, and constructing prompt information from the sequence; and taking the prompt information and the user agent character string to be identified as input to the pre-trained language model, and outputting predicted device information as the device information corresponding to the user agent character string to be identified.
The vectorized representation of user agent character strings is realized through the pre-trained language model: the character strings are converted into points in a vector space, so that similarity between character strings can be computed. A user agent character string vector library is constructed, and vector space retrieval is used to quickly find the TopN character string vectors most similar to the character string to be identified. The device information corresponding to the similar character string vectors is used to construct a prompt information sequence, which serves as input to the language model to strengthen the contextual relevance of device information prediction. Generating context-aware device information predictions with the pre-trained language model improves recognition accuracy compared with pure matching retrieval. The whole technical route integrates pre-trained language model encoding, vector retrieval and context-based prediction, improving the recognition accuracy of user equipment information while maintaining recognition speed.
Further, the step of obtaining the first N user agent string vectors includes: encoding the user agent character string to be identified by utilizing the pre-trained language model, and obtaining the vector of the user agent character string to be identified; performing similarity comparison between the vector of the user agent character string to be identified and the user agent character string vector in the user agent character string vector library by using a vector retrieval method; based on the result of the similarity comparison, the first N user agent string vectors that are similar to the vector of user agent strings to be identified are selected. And carrying out vectorization coding on the character string to be identified by using the pre-training language model, so that vector representation with fixed length can be obtained, and subsequent vector space calculation is facilitated.
Applying the vector retrieval method to compute the similarity between the vector of the character string to be identified and the vectors in the character string vector library allows the most similar character strings to be found quickly and efficiently. Ranking by similarity and selecting the TopN vectors yields the character strings most relevant in semantics while keeping computation efficient. Compared with comparing all character strings one by one, this route narrows the search range, computes similarity only for highly correlated vectors, and reduces computational complexity. With a vector space model and a similarity ranking algorithm, the TopN vectors most similar to the character string to be identified can be retrieved quickly and accurately. The retrieved TopN similar character strings provide rich associated context for subsequent device information prediction.
Further, the vector retrieval method is cosine similarity, Euclidean distance, Manhattan distance or Minkowski distance.
Cosine similarity considers the directions of two vectors rather than their lengths; similarity is judged by the cosine of the angle between the two vectors, and the smaller the angle, the more similar they are. Because it measures the similarity of vector directions, it reflects the semantic correlation of two character strings and is well suited to text semantic similarity computation.
The Euclidean distance is the square root of the sum of squared component differences of the two vectors; the smaller the distance, the more similar they are. It measures the actual distance between vectors in Euclidean space and reflects the numerical gap between two character strings.
The Manhattan distance is the sum of the absolute values of the corresponding component differences of the two vectors, a special case of the Minkowski distance. It sums the absolute differences of the vector components in each dimension and reflects per-dimension feature differences.
The Minkowski distance generalizes the Euclidean and Manhattan distances through an order parameter; in its weighted form, weights can be assigned to different dimensions to emphasize the distance contribution of important features.
Each of these distance measures has its own emphasis. Adopting cosine similarity, which matches the requirement of semantic similarity, improves the accuracy of judging the semantic relevance of character strings and yields TopN results more relevant to the character string to be identified.
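As an illustration only, the four measures above can be written in a few lines of numpy; the function names are ours, not the patent's:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the Euclidean norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Square root of the sum of squared component differences.
    return float(np.linalg.norm(a - b))

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Sum of absolute component differences.
    return float(np.abs(a - b).sum())

def minkowski_distance(a: np.ndarray, b: np.ndarray, p: float = 3.0) -> float:
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2) distances.
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))
```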
Further, the step of generating the sequence includes: judging whether the first N user agent character string vectors are acquired or not, wherein N is a positive integer; if the judgment result is yes, acquiring user agent character strings corresponding to the first N user agent character string vectors; acquiring equipment information corresponding to the first N user agent character strings from a user agent character string vector library; and splicing the equipment information of the first N user agent character strings with the user agent character strings to be identified to generate a sequence.
Judging whether the TopN vectors have been acquired allows the input source for sequence generation to be adjusted dynamically according to the actual situation, ensuring the robustness of the method. Obtaining the user agent character strings corresponding to the TopN vectors yields the character string information most relevant in semantics. Retrieving the device information of the TopN character strings from the vector library provides a rich source of related context. Splicing the TopN character strings and their device information with the character string to be identified constructs a sequence that reflects both semantics and device information. Using the generated sequence as language model input provides the most effective context and helps the model predict the device information correctly. Compared with a random sequence, this generation strategy produces input sequences that are more valuable for predicting device information. The sequence generation process is simple and reliable, and effectively extracts and uses the semantics and device information of the TopN similar character strings.
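A minimal sketch of the splicing step described above; the hit fields (`ua`, `device`) and the `[SEP]` separator are assumptions for illustration, not details fixed by the patent:

```python
from typing import Dict, List

def build_sequence(ua_to_identify: str, topn_hits: List[Dict[str, str]]) -> str:
    """Splice the device info of the TopN similar UA strings with the UA to identify.

    Each hit is assumed to look like {"ua": "...", "device": "..."}.
    """
    if not topn_hits:                          # TopN vectors not acquired
        return ua_to_identify                  # fall back to the raw string alone
    parts = [f"{hit['ua']} => {hit['device']}" for hit in topn_hits]
    return " [SEP] ".join(parts) + " [SEP] " + ua_to_identify
```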
Further, the step of constructing the prompt message includes: acquiring the semantic and contextual information of the user agent character string to be identified by utilizing a pre-trained language model; acquiring vocabulary corresponding to the equipment information from the sequence by using the acquired semantic and context information through a word vector method as a prompt vocabulary; and generating prompt information through a pre-trained language model according to the prompt vocabulary.
Analyzing the character strings with the pre-trained language model accurately captures semantic information and context and provides the knowledge source for subsequent prompt generation. A word vector method extracts the device-related vocabulary from the sequence, filtering out irrelevant content while retaining effective information. The extracted hint vocabulary is highly relevant to the device information and effectively assists prompt generation. Based on the hint vocabulary, the language model can generate more accurate and coherent prompt messages. Compared with random hints, this generation policy produces hints that are more valuable for device information prediction. The hint information provides additional semantic constraints that help the model output the correct device information. The prompt generation process is simple and effective, filtering useless information and emphasizing effective information.
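A hedged sketch of prompt construction; in place of the word-vector filter described above it uses a small fixed device vocabulary, and the prompt wording is illustrative only:

```python
from typing import List

# Illustrative device-related vocabulary; the patent derives it via word vectors instead.
DEVICE_TERMS = {"iphone", "ipad", "android", "windows", "macintosh",
                "huawei", "xiaomi", "samsung", "harmonyos"}

def extract_hint_words(sequence: str) -> List[str]:
    # Keep only tokens belonging to the assumed device vocabulary.
    cleaned = sequence.lower().replace(";", " ").replace("(", " ").replace(")", " ")
    return [token for token in cleaned.split() if token in DEVICE_TERMS]

def build_prompt(hint_words: List[str], ua_to_identify: str) -> str:
    hints = ", ".join(sorted(set(hint_words))) or "unknown"
    return (f"Known similar user agents point to these device clues: {hints}. "
            f"Identify the device for this user agent: {ua_to_identify}")
```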
Further, the step of outputting predicted device information includes: combining the prompt information and the user agent character string to be identified as a text sequence, and inputting the text sequence into a pre-trained language model; semantic understanding is carried out on the text sequence through the pre-trained language model, and prediction equipment information corresponding to the user agent character string to be recognized is output.
The prompt information and the character string sequence are combined, so that a complete semantic context can be provided, and model understanding and prediction are facilitated. The sequences contain string semantic information and high quality cues, which can maximize the provision of valid information for prediction. The pre-training language model is applied to predict, so that the semantic understanding and generating capability of the model can be fully exerted. Compared with the input of the independent character strings, the prompt sequence can remarkably improve the accuracy of equipment information prediction. The sequence semantic understanding can filter irrelevant content, and helps the model output the correct device for matching the character string.
Further, the pre-trained language model is a BERT model, a GPT model, or a Transformer model. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model built on the Transformer structure; it obtains context information through a bidirectional pre-training mechanism and can be used for language understanding and generation tasks. GPT (Generative Pretrained Transformer) is an autoregressive pre-trained language generation model based on the Transformer; it obtains context information through unidirectional pre-training and can generate coherent, semantically continuous text. The Transformer is a neural network model based on the self-attention mechanism; it realizes sequence-to-sequence modeling through an Encoder-Decoder structure and is the basis of pre-trained language models such as BERT and GPT. The Encoder-Decoder architecture is a neural network architecture typically used for sequence-to-sequence tasks and comprises two main components: an Encoder and a Decoder. These pre-trained language models can capture text semantic information, realize language understanding and generation, possess strong semantic modeling capability, are suitable for text encoding, understanding and generation tasks, and can improve the effect of identifying device information from user agent character strings.
Further, the user agent character strings are in one-to-one correspondence with the corresponding device information. The one-to-one correspondence enables the user agent string to uniquely determine the corresponding device information, avoiding ambiguity of one-to-many matches. When the user agent character string vector library is constructed, each character string vector only needs to store one piece of equipment information, so that the design of a storage structure is simplified. When similar character strings are searched, the returned TopN results correspond to the unique N pieces of equipment information, so that the probability of collision is reduced. When the sequence is generated, each character string only needs to combine one piece of equipment information, so that the repeated and contradictory equipment information is avoided. The consistent correspondence reduces the association judgment between the equipment information and simplifies the design of the prediction model. In the final prediction result, the user agent character string can be directly mapped to single equipment information, so that the recognition accuracy is improved.
Further, the process of encoding the user agent character string to obtain the user agent character string vector adopts a word embedding method. Word embedding can map words in a character string into a continuous dense vector space, so that the character string can be represented by a vector with fixed length, and similarity calculation is facilitated. The word is embedded into the semantic association among the learnable words, and the similar words can be mapped to the similar positions of the vector space, so that the correlation of the character string semantics can be reflected. The word embedding model can learn word vectors from large-scale non-labeled texts, does not need manual labeling, and reduces resource requirements. Compared with a general word bag model, word embedding can learn richer word characteristic representation, and the quality of character string semantic matching is improved. The embedded vector contains semantic information, can be directly used for similarity comparison, does not need artificial feature engineering, and simplifies the flow. By adopting the mature word embedding model, the character string vector representation with better quality can be obtained quickly, and the realizability of the technology is improved. The word vector can be multiplexed into a plurality of natural language processing tasks, so that the resource cost of each task is reduced. By using the existing word vector, the vector representation of the character string can be realized faster, and the technical iteration period is shortened.
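A possible realization of the word-embedding encoding step using the Hugging Face transformers library; the checkpoint name and the mean-pooling scheme are assumptions, not requirements of the patent:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"      # assumed checkpoint; the patent does not name one

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def encode_ua(ua_string: str) -> np.ndarray:
    # Tokenize, run the encoder, and mean-pool token embeddings into one fixed-length vector.
    inputs = tokenizer(ua_string, return_tensors="pt", truncation=True, max_length=128)
    hidden = model(**inputs).last_hidden_state               # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)             # (1, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # masked mean pooling
    return pooled.squeeze(0).numpy()
```

The resulting fixed-length vector can be stored in the user agent character string vector library and compared with other vectors by the similarity measures above.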
Another aspect of the embodiments of the present specification also provides a system for identifying user equipment information based on an AIGC, including: the training module is used for acquiring data containing user agent character strings and equipment information corresponding to the user agent character strings, coding the user agent character strings by utilizing a pre-trained language model, acquiring corresponding user agent character string vectors and constructing a user agent character string vector library; the receiving module is used for receiving the user agent character string to be identified; the coding module is used for coding the user agent character string to be identified by utilizing the pre-trained language model to obtain a vector of the user agent character string to be identified; the retrieval module is used for retrieving the first N user agent character string vectors similar to the vector of the user agent character string to be identified from the user agent character string vector library by using a vector retrieval method, wherein N is a positive integer; the generating module combines the device information of the user agent character strings corresponding to the first N user agent character string vectors with the user agent character strings to be identified to generate a sequence, and builds prompt information by using the sequence; the output module takes the prompt information and the user agent character string to be identified as input, inputs the pre-trained language model, and outputs predicted equipment information as equipment information corresponding to the user agent character string to be identified.
3. Advantageous effects
Compared with the existing method, the method has the advantages that:
(1) Using coding and vector retrieval methods: encoding the user agent character string into a vector representation by using a pre-trained language model, and carrying out similarity matching by using a vector retrieval method, and retrieving equipment information similar to the user agent character string to be identified from a vector library; the method utilizes the similarity measurement of the vector space, improves the accuracy of equipment information identification, and can more accurately match the user equipment information;
(2) Accurate equipment information is acquired: an acquisition module in the improved scheme acquires equipment information from similar user agent character string vectors; because the vector matching and similarity calculation methods are used, the device information with higher similarity with the user agent character string to be identified can be more accurately selected, and the accurate device information can be obtained from the device information;
(3) And (3) equipment information prediction and prompt information construction: the improvement scheme comprises the steps of combining the obtained equipment information with a user agent character string to be identified to generate a sequence containing the equipment information; and generating a prompt text related to the device information based on the device information by using a prompt information construction module, and providing more specific operation suggestions or related information. The targeted prompt information can further improve the accuracy of equipment information identification;
In conclusion, the improvement scheme improves the accuracy of equipment information identification from a plurality of aspects through the method characteristics of coding, vector retrieval, equipment information acquisition, equipment information prediction, prompt information construction and the like, and provides more accurate and precise user equipment information identification.
Drawings
FIG. 1 is an exemplary flow chart of a method of identifying user device information based on AIGC in accordance with the present application;
FIG. 2 is an exemplary flow chart of a device identification flow of the present application;
fig. 3 is a schematic diagram of a vector search service architecture diagram of the present application.
Detailed Description
The present application is described in detail below with reference to the attached drawing figures and specific examples.
Fig. 1 is an exemplary flowchart of a method for identifying user equipment information based on AIGC according to the present application. As shown in fig. 1, the method obtains a data set containing user agent character strings and the device information corresponding to each user agent character string, where the user agent character strings correspond one-to-one to their device information; performs word embedding and encoding on the user agent character strings with a pre-trained language model to obtain vector representations of the user agent character strings, and constructs a user agent character string vector library; receives a user agent character string to be identified; performs word embedding encoding on the user agent character string to be identified to obtain its vector representation; retrieves from the user agent character string vector library the top N user agent character string vectors most similar to the vector of the character string to be identified by a vector retrieval method such as cosine similarity, where N is a positive integer; acquires the device information corresponding to the first N similar user agent character strings and splices it with the user agent character string to be identified into a sequence; analyzes the semantics and context information of the sequence with the pre-trained language model to generate a hint vocabulary and, from it, the prompt information; and merges the prompt information with the user agent character string to be identified, inputs them into the pre-trained language model, and outputs the predicted device information corresponding to the user agent character string to be identified. The pre-trained language model may be a BERT model, a GPT model, or a Transformer model.
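The flow of fig. 1 can be sketched as a single orchestration function; `encode`, `search_top_n` and `generate` are assumed interfaces standing in for the encoding, retrieval and generation steps described above, not components named by the patent:

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]

def identify_device(ua_to_identify: str,
                    encode: Callable[[str], Vector],
                    search_top_n: Callable[[Vector, int], List[Dict[str, str]]],
                    generate: Callable[[str], str],
                    n: int = 5) -> str:
    """Sketch of the fig. 1 flow; the three callables are assumed interfaces."""
    query_vec = encode(ua_to_identify)                 # encode the UA with the pre-trained LM
    hits = search_top_n(query_vec, n)                  # TopN similar UA strings + device info
    spliced = " ; ".join(f"{h['ua']} => {h['device']}" for h in hits)
    prompt = (f"Similar user agents and their devices: {spliced}. "
              f"Identify the device for: {ua_to_identify}")
    return generate(f"{prompt}\n{ua_to_identify}")     # predicted device information
```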
By generating the prompt information to correct the user agent character string, the accuracy of the final equipment information identification can be improved. In a word, the user agent character string data set and the equipment corresponding information are obtained, and the accurate identification of the user equipment information is realized by utilizing technologies such as coding, vector retrieval, semantic analysis and the like of the pre-training language model. According to the method, the user agent character string data set is constructed, semantic information of character strings is extracted through the pre-training language model, the character strings with the closest semantics are found in the vector space, and corresponding equipment information is used for constructing prompts, so that the model is helped to accurately predict the equipment types of the character strings to be identified. The method and the device integrate the pre-training language model and the vector space retrieval technology, and can greatly improve the accuracy of equipment information identification.
The method comprises the following steps: by acquiring the data set containing the user agent character string and the corresponding equipment information, the one-to-one correspondence relation between the user agent character string and the equipment information is constructed, and accurate identification of the user equipment information is realized. The language model is trained by means of a large number of user agent string data, so that the language model learns the semantic features of the user agent strings. Then, word embedding encoding is carried out on the user agent character string, and the word embedding encoding is converted into dense vector representation with fixed length, so that a vector library of the user agent character string is constructed. The vector representation abstracts semantic information of the user agent strings, and the dimension reduction represents semantic similarity among the strings. The distance between the character strings is calculated in the vector space, so that the character string similarity retrieval of the semantic level can be realized. In the equipment information identification task, candidate equipment information can be collected by finding out the vector closest to the character string to be identified in the user agent character string vector library, so that the identification accuracy is improved. The semantic similarity calculation of vector space more accurately characterizes the relevance between user agent strings than the lexical level comparison of strings. The device information candidate collection strategy based on vector representation and semantic similarity improves the final recognition accuracy.
The pre-trained language model is a BERT model, a GPT model or a Transformer model. The BERT model uses two pre-training tasks, masked language modeling and next-sentence prediction, to obtain strong semantic understanding capability; it can fully capture the contextual semantic information of character strings and helps extract device-related keywords. The GPT model is autoregressively pre-trained with the Transformer decoder structure; its strong generation capability can predict the subsequent vocabulary of a character string and generate hint words describing the device. The Transformer introduces a self-attention mechanism, can model long-range lexical dependencies, and learns the inherent semantic associations of a character string through multi-head attention, which helps extract device semantics. All of the above models adopt the Transformer encoder or decoder structures used in BERT and GPT; the self-attention mechanism of the Transformer is used to analyze the semantics of character strings without distance limitations. In deep learning and natural language processing, the Transformer refers to a neural network structure based on self-attention; its self-attention mechanism allows the model to attend to all positions in the input sequence simultaneously when processing a sequence, without recursion or convolution. These architectural innovations make the Transformer well suited to parallel computing, which speeds up training. Compared with sequential models such as RNNs, the Transformer handles long-distance dependencies more efficiently. In addition, large-scale corpus pre-training endows the model with deep language understanding capability. Based on this, BERT, GPT or the Transformer is selected to obtain the semantic information of the character string, and its strong contextual representation capability is used to generate the hint vocabulary, improving the accuracy of subsequent device information identification.
Specifically, the process of encoding the user agent string to obtain the user agent string vector employs a word embedding method. The user agent string is encoded into a vector representation using a word embedding method. Word embedding may capture semantic information in a string. And inputting the vector obtained in the last step into a pre-trained language model. The pre-trained model can better understand the semantics of the character string, thereby generating character string vectors containing rich semantic information. And inputting the character string vector into a classifier, and performing recognition prediction of the equipment type. Compared with the traditional rule-based method, the method and the device can better utilize the capability of the pre-training language model to understand the semantics of the user agent character strings, thereby improving the recognition accuracy of the device types.
Receive the user agent character string to be identified and encode it with the pre-trained language model to obtain its vector. The user agent character string contains characteristic information of the client device and is the key information source for device identification. The character string is encoded using an advanced pre-trained language model such as a BERT model, a GPT model or a Transformer model, so that the semantic information of the character string can be fully captured and a high-quality character string vector representation obtained. The character string vector is then input into a classifier for device type recognition. The classifier may be a Softmax classifier, an MLP classifier (Multilayer Perceptron Classifier), an SVM classifier (Support Vector Machine Classifier), or the like, and the device type identification result is output. Compared with traditional rule-based identification methods, this approach has the following advantages: the deep neural network completes identification end to end, avoiding the limitations of manually crafted identification rules; the pre-trained language model deeply understands the character string semantics and extracts richer features; the character string vector carries richer semantic information and provides high-quality input for the classifier, improving recognition performance; the deep neural network classifier has a strong ability to fit complex features and is suitable for distinguishing device types; and end-to-end learning and reasoning reduce manual work and are easy to engineer.
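For the classifier variant mentioned above, a toy sketch with scikit-learn; the synthetic vectors, labels and hyperparameters are placeholders for illustration only:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768)).astype(np.float32)   # placeholder UA string vectors
y_train = rng.integers(0, 3, size=200)                     # placeholder device-type labels

clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=200, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict(X_train[:1]))                            # predicted device-type id
```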
Retrieving first N user agent character string vectors similar to the vector of the user agent character string to be identified from a user agent character string vector library, wherein N is a positive integer; and constructing a user agent character string vector library. A plurality of user agent character strings are encoded into vectors by using a pre-training language model, and a character string vector library is constructed. And the character strings to be identified are also encoded identically, so that vector representation of the character strings is obtained. And searching the first N character string vectors which are most similar to the character string vector to be identified in the character string vector library. The similarity may be measured by cosine similarity or the like. And presuming the equipment type of the character string to be recognized by the equipment type corresponding to the similar character string vector.
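A minimal in-memory sketch of the vector library construction and TopN cosine retrieval just described; the class and field names are ours, not the patent's:

```python
from typing import Dict, List
import numpy as np

class UAVectorLibrary:
    """In-memory UA string vector library with cosine-similarity TopN search (a sketch)."""

    def __init__(self) -> None:
        self.vectors: List[np.ndarray] = []
        self.records: List[Dict[str, str]] = []      # each record: {"ua": ..., "device": ...}

    def add(self, vector: np.ndarray, ua: str, device: str) -> None:
        self.vectors.append(vector / np.linalg.norm(vector))   # store unit vectors
        self.records.append({"ua": ua, "device": device})

    def search(self, query: np.ndarray, top_n: int = 5) -> List[Dict[str, object]]:
        matrix = np.stack(self.vectors)                         # (num_strings, dim)
        scores = matrix @ (query / np.linalg.norm(query))       # cosine similarities
        order = np.argsort(-scores)[:top_n]
        return [{**self.records[i], "score": float(scores[i])} for i in order]
```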
The pre-trained language model is used to encode the user agent character string to be identified and obtain its vector; a vector retrieval method compares the similarity between the vector of the character string to be identified and the user agent character string vectors in the vector library; and the first N user agent character string vectors similar to the vector of the character string to be identified are selected according to the similarity comparison. The user agent character string to be identified is encoded by a pre-trained language model (a BERT model, a GPT model or a Transformer model) to obtain its vector representation.
The pre-trained language model may capture semantic information of the character string. A user agent string vector library is constructed containing vector representations of a plurality of user agent strings. To identify the character string vector, the TopN character string vectors most similar to the character string vector are found out in the character string vector library by using a vector space searching method (cosine similarity and the like). And generating device candidates of the character string to be identified through the device types corresponding to the similar character string vectors. The final aggregated candidate results generate an identification output. Semantic features of the character string may be extracted by means of semantic coding capabilities of the pre-trained language model. The character string vector contains rich semantics, which is beneficial to similarity comparison. Zero sample recognition can be realized based on the recognition mode of the similarity. By utilizing the vector space semantic structure, high-precision semantic matching retrieval is realized, reliable similar equipment types are used as candidates, and the recognition precision is improved.
Specifically, the vector retrieval method is cosine similarity, Euclidean distance, Manhattan distance or Minkowski distance. Cosine similarity, computed between two vectors, is a common vector similarity algorithm and can effectively measure the semantic similarity of two character string vectors. The Euclidean distance is the straight-line distance between two vector points; the shorter the distance, the more similar they are. It is simple to compute and also commonly used for vector similarity comparison. The Manhattan distance sums the absolute values of the per-dimension differences between the vectors as a similarity measure and is simple to implement. The Minkowski distance generalizes the Euclidean and Manhattan distances; its computation is smooth and its similarity measurement more flexible. These measures effectively compute semantic similarity between vectors and realize character string similarity comparison. Similar character strings correspond to similar device types, so similar devices can be found. The distance algorithms are simple and effective, easy to implement for character string similarity comparison, applicable to character string vectors of any dimension, and highly extensible. In conclusion, these vector retrieval algorithms cooperate with the character string vector representation to effectively improve the accuracy of similarity-based device information identification.
Combining the device information of the user agent character strings corresponding to the first N user agent character string vectors with the user agent character strings to be identified to generate a sequence, and constructing prompt information by using the sequence; and combining the equipment information corresponding to the similar TopN character string vector of the character string to be identified with the character string to be identified to form a sequence. With this sequence, a hint can be constructed. The hint information may include: a similar string, a device type to which the similar string corresponds, a similarity score, etc. The hint information reflects the presumed type of the possible device by matching the similar strings and can be output as a candidate for the model. The user can check according to the prompt information to select the correct equipment type. The fed-back real equipment type can be used for enhancing the original recognition model, and further improving the model precision. By means of the prompt information, the recognition accuracy can be remarkably improved. The model is enhanced immediately, a feedback sample improvement model is accumulated, and prompt information provides a model inference basis and can be explained more.
Judge whether the first N user agent character string vectors have been acquired, where N is a positive integer; if so, acquire the user agent character strings corresponding to the first N vectors; acquire the device information corresponding to these first N user agent character strings from the user agent character string vector library; and splice the device information of the first N user agent character strings with the user agent character string to be identified to generate a sequence. Specifically, if the value of N is set too small, too few similar character strings and too little device information are obtained, the generated sequence carries insufficient information, and the prediction accuracy of the model drops. If N is set too large, a large number of redundant character strings are obtained, the generated sequence becomes too long, the computational burden of the model increases, and training is hindered. Based on this, the invention provides a technical scheme for dynamically setting the value of N: initialize N to a small value, for example 5; judge whether the first N user agent character string vectors have been obtained, and if not, increase N and repeat the judgment until TopN character string vectors meeting the requirements are obtained; acquire the corresponding character strings and device information to generate the sequence; input the generated sequence into the pre-trained language model, and if the model loss function keeps decreasing, increase N further; if the model loss function fluctuates or rises over successive training rounds or epochs, stop increasing N. The final value of N is the current optimal number of TopN character strings. This scheme uses the change of the loss function to drive the increase of N and can dynamically adjust N to obtain a similar character string set of suitable size. It avoids insufficient information while preventing the introduction of excessive redundant features, achieving the best balance between model accuracy and efficiency.
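A sketch of the dynamic-N strategy under the assumption that an `evaluate_loss(n)` callback builds the TopN sequences, runs a training/validation round, and returns the loss; the step size and patience values are illustrative:

```python
from typing import Callable

def choose_n(evaluate_loss: Callable[[int], float], n_start: int = 5,
             n_step: int = 5, n_max: int = 50, patience: int = 2) -> int:
    """Grow N while the model loss keeps dropping; stop once it stalls or rises."""
    n = n_start
    best_loss = evaluate_loss(n)
    bad_rounds = 0
    while n + n_step <= n_max and bad_rounds < patience:
        candidate = evaluate_loss(n + n_step)
        if candidate < best_loss:          # loss still decreasing: accept the larger N
            best_loss = candidate
            n += n_step
            bad_rounds = 0
        else:                              # loss fluctuating or rising
            bad_rounds += 1
    return n
```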
The user agent character string to be identified is input, and the pre-trained language model performs vectorized encoding of the character string to obtain the vector of the character string to be identified. In the user agent character string vector library, cosine similarity is used as the similarity index to retrieve the TopN character string vectors most similar to the vector of the character string to be identified. Cosine similarity judges how close two vectors are; its value ranges from -1 to 1, and the larger the value, the more similar the two vectors. It is computed as the dot product of the two vectors divided by the product of their Euclidean norms. Similar character string vectors share similar semantic connotations, and the corresponding device types are also similar, so zero-shot recognition of unknown character strings can be achieved through similar character strings. The source character strings corresponding to the TopN similar vectors and their device information are acquired, and the TopN device information is spliced with the character string to be identified to form a sequence. The sequence contains rich device type information as well as the context of the character string to be identified, providing the basis for subsequent semantic prompting. The pre-trained language model performs semantic analysis on the sequence to obtain the semantic and contextual information of the character strings. With this contextual information, vocabulary related to the device type is extracted by word vector techniques as the hint vocabulary. From the hint vocabulary, the language model generates human-readable prompt sentences that help the user verify the recognition result. This noticeably improves both the accuracy and the interpretability of device information identification.
Specifically, the semantic and contextual information of the user agent character string to be identified is obtained with the pre-trained language model; from the sequence, vocabulary corresponding to the device information is extracted by a word vector method, using the obtained semantic and contextual information, as the hint vocabulary; and the prompt information is generated by the pre-trained language model from the hint vocabulary. The user agent character string to be identified is input and encoded and analyzed with a pre-trained language model (a BERT model, a GPT model or a Transformer model). The pre-trained language model understands semantics through structures such as the Transformer and extracts the semantic features and contextual information of the character string. Semantic analysis accurately captures the semantics and context of the character string. The character string sequence contains rich device-type vocabulary, and word vector techniques judge the semantic relevance between words. According to the contextual information, vocabulary semantically related to the device type is selected as the hint vocabulary, which accurately reflects the possible device information. Based on the extracted hint vocabulary, the pre-trained language model performs semantic generation; it accurately grasps the semantic information conveyed by the hint vocabulary and generates readable sentences incorporating the hint vocabulary as the prompt information, which accurately expresses the device type information carried by the vocabulary. In conclusion, semantic analysis is fully exploited to obtain accurate device information hints, significantly improving the interpretability of the recognition system.
The prompt information and the user agent character string to be identified are taken as input to the pre-trained language model, and the predicted device information is output as the device information corresponding to the character string to be identified. Input: the user agent character string to be identified and the prompt information (including the device information vocabulary). The prompt information accurately reflects the device types the character string may correspond to; both are fed to the model, providing additional device type information. The fused information is input into the pre-trained language model, which focuses on the device vocabulary in the prompt through its attention mechanism. The Transformer structure fully understands the semantics and performs joint reasoning over the two inputs. The model analyzes the character string semantics together with the hint vocabulary and predicts the most probable device type, which is output as the recognition result for the character string to be identified. The prompt information reduces recognition bias and leads to more accurate device information output; the auxiliary information improves both the recognition capability and the interpretability of the model. In conclusion, the method reasonably uses the external information carried by the prompt to guide the model toward a more accurate device information identification result.
The prompt information and the user agent character string to be identified are combined into a text sequence and input into the pre-trained language model; the model performs semantic understanding on the text sequence and outputs the predicted device information corresponding to the character string to be identified. The prompt sentence and the character string to be identified are directly spliced into a text sequence that contains both the complete character string information to be recognized and the related device-type hint vocabulary; the sequence thus fuses external side information and provides additional semantic cues for the model. The fused text sequence is input into a pre-trained language model (a BERT model, a GPT model or a Transformer model), which converts the sequence into vector representations through a word embedding layer. The Transformer-like structure captures the semantic information of the sequence and its contextual dependencies; the model fully understands the sequence semantics and the device-type vocabulary cues, focuses on the hint vocabulary through its attention mechanism, and performs cascaded semantic reasoning. The predicted device type is output as the recognition result for the character string to be recognized. The prompt information helps the model generate more accurate predictions of device information.
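A hedged sketch of this final prediction step with a causal language model from the transformers library; the `gpt2` checkpoint and the prompt template are stand-ins, not choices made by the patent:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

GEN_MODEL = "gpt2"                         # stand-in checkpoint, purely for illustration

gen_tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
gen_model = AutoModelForCausalLM.from_pretrained(GEN_MODEL)
gen_model.eval()

@torch.no_grad()
def predict_device(prompt: str, ua_to_identify: str, max_new_tokens: int = 32) -> str:
    # Merge the prompt text and the UA string into one sequence and let the LM continue it.
    text = f"{prompt}\nUser-Agent: {ua_to_identify}\nDevice:"
    inputs = gen_tokenizer(text, return_tensors="pt")
    output = gen_model.generate(inputs["input_ids"],
                                attention_mask=inputs["attention_mask"],
                                max_new_tokens=max_new_tokens,
                                pad_token_id=gen_tokenizer.eos_token_id)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return gen_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

In practice the output would be post-processed or constrained to a known device vocabulary; the raw continuation shown here is only meant to illustrate the prompt-plus-string input scheme.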
Specifically, the pre-training language model is directly utilized for prediction, so that an additional model training process is avoided, the system flow is simplified, and the implementation difficulty is reduced. The pre-training language model has generalization capability, and can directly predict the equipment information of the character string sequence without seeing the sample. This reduces reliance on training data size and coverage. The prompt information provides additional semantic signals, so that the defect of the expression of equipment information in the character string is supplemented, and the pre-training model is helped to more accurately understand text semantics. The combined sequence is used as model input, and the semantic association of the character strings and the prompt information can be automatically learned by utilizing a pre-training language model without manually constructing features. The one-step end-to-end prediction reduces model chains and reduces the risk of error accumulation. In summary, the method and the device can utilize the strong prediction capability of the pre-training language model and assist in prompting information guidance to realize visual and accurate prediction of equipment information. The training of a new model is avoided, the system realization difficulty is reduced, and the generalization of the model is enhanced.
In conclusion, the generation of the prompt information fully utilizes the semantic understanding capability of the pre-training language model, and can capture key equipment description vocabulary and provide additional supervision signals; and combining the prompt information with the original character string to form a text sequence with complete semantics, and providing complete context information for the language model, thereby being beneficial to understanding the semantics of the text. The end-to-end prediction mode reduces model chains, reduces the risk of error accumulation, and generates a prediction result more intuitively. By means of large-scale pre-training, the language model obtains deep language understanding capability and good generalization, and unknown character string sequences can be accurately analyzed. The self-attention mechanism can effectively model remote dependency relationships, analyze complex semantic logic and understand the intrinsic meaning of the character string. Multitasking and transfer learning improve the adaptability of the model to different tasks, such as device information prediction tasks. The method and the device can effectively fuse prompt messages, enhance and predict by using the language model, and improve the recognition performance of the equipment information.
Fig. 2 is an exemplary flowchart of the device identification procedure of the present application, and fig. 3 is a schematic diagram of the vector retrieval service architecture of the present application. As shown in figs. 2 and 3, a specific embodiment of constructing a vector retrieval service using the method for identifying user equipment information based on AIGC according to the present application is as follows: a data foundation is laid for automatically constructing the model prompt, and the vector retrieval service is based on the constructed user agent character string vector library. A user agent character string to be identified is input and vectorized to obtain its vector representation. In the vector library, the TopN character string vectors most similar to it are found and output. The vectors in the library represent the semantic information of the character strings; cosine similarity between vectors is computed to judge semantic similarity, and the TopN character strings with meanings closest to the input character string are returned. Similar character strings can provide relevant device information and supply data for the next step of automatically constructing the prompt, which helps predict the correct device information. In summary, the vector retrieval service can find similar character strings and lays the foundation for automated prompt generation.
User Agent (UA) character string data reported by users is obtained from the data platform, together with the corresponding device category data returned by Google Ads. The UA data contains the characteristic identification string of the user device, and the device data represents the actual device type corresponding to the UA. The UA character string is segmented structurally with regular expressions, which capture the grammatical features of the UA string; features are extracted according to the UA protocol syntax specification to obtain a structured UA feature phrase. A mapping between the UA data and the device data is then established, forming the training data set for UA string recognition; the mapping reflects the correspondence between UA strings and device types. In summary, the UA training data is acquired and processed, and a data set is constructed for training the device information identification model.
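A rough illustration of regular-expression segmentation of a UA string; the patterns and field names are assumptions rather than the patent's actual rules:

```python
import re
from typing import Dict, List

PRODUCT_TOKEN = re.compile(r"[A-Za-z][\w.]*/[\w.]+")     # e.g. "Chrome/119.0.0.0"
SYSTEM_BLOCK = re.compile(r"\(([^)]*)\)")                # first parenthesized platform block

def segment_ua(ua: str) -> Dict[str, List[str]]:
    """Rough structural segmentation of a UA string with regular expressions (a sketch)."""
    system = SYSTEM_BLOCK.search(ua)
    return {
        "products": PRODUCT_TOKEN.findall(ua),
        "system": [part.strip() for part in system.group(1).split(";")] if system else [],
    }

example = ("Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36")
print(segment_ua(example))
```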
Embedding is performed on the segmented UA character strings using a pre-trained language model (LLM). The User Agent (UA) character string is segmented to obtain a word sequence; each word in the sequence is mapped to a dense vector using word embedding technology, forming the corresponding word vector sequence; the word vector sequence is fed into the LLM, which predicts each word vector according to its context, gradually adjusting the model parameters and learning the distribution of the language. After pre-training is completed, the LLM possesses language representation capabilities. In the present application, word embedding can be applied directly to the segmented UA character strings using the pre-trained LLM, yielding a semantic vector representation of the UA string. The word vector layer in the LLM helps to extract semantic information from character strings, and its strong language modeling capability supports downstream semantic parsing tasks. Compared with traditional bag-of-words models and the like, the language representation learned by the LLM is more coherent and richer, and is better suited to expressing the semantic content of UA character strings. Therefore, the invention uses the LLM to extract the feature vector of the UA character string, which improves the effect of subsequent equipment information identification.
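By way of non-limiting illustration, the following sketch encodes a UA string with a generic pre-trained encoder via the Hugging Face transformers library; the model name bert-base-uncased is a placeholder, and mean pooling over token embeddings is one common choice rather than a method prescribed by the application.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder; any pre-trained encoder could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def encode_ua(ua: str) -> torch.Tensor:
    """Encode a UA string into a single dense semantic vector by mean-pooling token embeddings."""
    inputs = tokenizer(ua, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)            # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (1, hidden_dim)

ua_vector = encode_ua("Mozilla/5.0 (Linux; Android 13; SM-S918B) AppleWebKit/537.36")
```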
Specifically, the LLM converts the UA characters into dense vectors through its word embedding layer; the vector encodes the semantic information of the UA string. The original UA character string, the corresponding device data, and the encoded UA vector are stored in the vector library, forming combined UA string / device mapping / UA vector records. The vector library stores the vector representations of a large number of UA strings, and these vectors encode the semantic features of the UA strings. The vector library provides data support for UA string recognition based on vector space analysis: during recognition, the UA vector is input and similar vectors are searched in the library to obtain the corresponding device information. In summary, the present application vector-encodes the UA string through the LLM model and constructs a vector library in preparation for subsequent recognition.
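By way of non-limiting illustration, the combined UA string / device mapping / UA vector records could be organized as follows before indexing; the field names and the example values are assumptions, not the application's storage schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UARecord:
    ua_string: str        # original User Agent string
    device_info: str      # actual device type reported for this UA
    vector: np.ndarray    # dense semantic vector produced by the LLM encoder

# Combined UA string / device mapping / UA vector records forming the vector library.
vector_library = [
    UARecord("Mozilla/5.0 (Linux; Android 13; SM-S918B) ...", "Samsung Galaxy S23 Ultra",
             np.random.rand(128).astype("float32")),  # placeholder vector
]
```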
The UA text string can be encoded using a pre-trained language model such as a BERT model, a GPT model, or a Transformer model. Different models have their respective advantages and can be selected according to requirements and resources. The embedding vector dimension directly affects model effect and computational complexity: high-dimensional vectors encode richer information but are costly to compute and store, while low-dimensional vectors are efficient to compute but easily lose key information. An appropriate dimension may be selected empirically or experimentally according to the actual circumstances. 128 and 256 dimensions are common choices that balance effect and efficiency; 512 or 768 dimensions may be selected for scenarios that emphasize accuracy, and 64 or 96 dimensions for storage- and speed-limited scenarios. To sum up, the dimension should be chosen carefully when selecting the UA character string embedding, balancing model effect and computational cost.
The vector features of each device are extracted and stored in a database, and the vectors are normalized. Faiss supports multiple indexing schemes, each with its own advantages and disadvantages. For the device identification scenario, it is suggested to select an inverted file flat index (IVF Flat) or an inverted file product quantization index (IVF PQ, Inverted File Product Quantization); both are cluster-based indexes that handle high-dimensional vectors well and provide good query accuracy. The IndexIVFFlat or IndexIVFPQ classes of Faiss are used to construct the index object, and the device vector data is added to the index. Hyperparameters such as the number of clusters and the compression factor are set to balance index size, construction speed, and query precision. A new device vector is input, and the indexed search function is used to query the TopK most similar devices; retrieval parameters such as query time constraints can optionally be set. The IDs of the retrieved vectors are obtained and the corresponding device information is looked up. According to the similarity score, a threshold is set to screen out accurately matched devices. Index hyperparameters are adjusted, or different index types are tried, to evaluate device identification accuracy. The index is updated with newly added device data and continuously optimized to keep improving device identification performance. The choice of index algorithm should consider factors such as data volume, computing resources, and retrieval effect.
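By way of non-limiting illustration, a minimal sketch using the Faiss Python bindings is given below; the vector dimension, cluster count, nprobe, and TopK values are illustrative assumptions rather than parameters specified by the application.

```python
import faiss
import numpy as np

d, nlist, k = 128, 100, 5                      # vector dimension, number of clusters, TopK
vectors = np.random.rand(10000, d).astype("float32")
faiss.normalize_L2(vectors)                    # normalize so inner product = cosine similarity

quantizer = faiss.IndexFlatIP(d)               # coarse quantizer over the cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(vectors)                           # learn the cluster centroids
index.add(vectors)                             # add all device/UA vectors to the index

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
index.nprobe = 10                              # number of clusters visited at query time
scores, ids = index.search(query, k)           # TopK most similar stored vectors
```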
A vector retrieval interface is provided externally through a Web service, so that UAs and device information similar to the UA to be identified can be obtained conveniently and efficiently, and the accuracy of device identification is improved by constructing the Prompt. A Python Web framework such as Flask is used to write the vector retrieval interface. The interface receives the vector of the UA to be identified and internally calls the vector retrieval engine index to find the vectors and device information of the TopN most similar UAs. The vector retrieval index is pre-built using tools such as Faiss (Facebook AI Similarity Search) and updated periodically, so that the index contains sufficiently comprehensive UA vector data. The Web framework connects to the back-end store of the vector index, such as Elasticsearch or Milvus; after retrieval is completed, the store is queried to obtain detailed metadata of the similar UAs. The interface results are filtered, for example by setting a similarity threshold so that only results sufficiently similar to the UA to be identified are returned. Cross-origin access is enabled using Flask's Cross-Origin Resource Sharing (CORS) support, so the front-end system can call the API conveniently without deploying a vector retrieval engine. Additional functionality is provided, such as supporting the upload of custom UA vectors, which are then added to the retrieval index to continuously optimize the service capability. The service is deployed to a cloud platform using technologies such as Docker to ensure that it is stable and reliable.
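By way of non-limiting illustration, such an interface could be sketched with Flask and the flask-cors extension as follows; the route path, payload fields, similarity threshold, and the helper search_index are hypothetical placeholders standing in for the Faiss-backed retrieval shown earlier.

```python
import numpy as np
from flask import Flask, jsonify, request
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # allow cross-origin calls from the front-end system

def search_index(vector: np.ndarray, top_n: int):
    """Placeholder standing in for a call into the Faiss-backed index built earlier."""
    return [("Mozilla/5.0 (Linux; Android 13; ...) ...", "example-device", 0.92)][:top_n]

@app.route("/api/ua/similar", methods=["POST"])
def similar_ua():
    payload = request.get_json()
    vector = np.asarray(payload["vector"], dtype="float32")
    top_n = int(payload.get("top_n", 5))
    threshold = float(payload.get("threshold", 0.8))   # similarity filter on the results
    hits = [
        {"ua": ua, "device": device, "score": score}
        for ua, device, score in search_index(vector, top_n)
        if score >= threshold
    ]
    return jsonify({"results": hits})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```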
As new data is added to the vector database, the vector index needs to be updated in time so that the retrieval service is always based on the latest vector library and device identification remains accurate. Changes to the vector database are monitored and new data is tracked. Depending on the magnitude of the change, either a full update or an incremental update is chosen. A full update rebuilds the index directly, adding all old and new vectors in the database to the index again; this is suitable when the data changes by a large magnitude, where direct reconstruction is faster. An incremental update adds only the newly added vectors to the index, leaving the original data unaffected, using the add function of the Faiss index; this is suitable for small data changes and is more efficient. A time for rebuilding the index is set periodically, for example a full rebuild at the end of each week in the middle of the night. When updating, a new index is built first and then swapped in, so that stale results are not returned during the query transition period. An index rebuild application programming interface (API) is provided through the Flask framework and is invoked when the index is updated, pausing vector retrieval requests so that all requests hit the latest index. Distributed index support can be added as an extension: multiple index nodes are deployed, and traffic is switched to other nodes while a single node is updated, avoiding global query interruption. The index reconstruction algorithm is optimized by testing different index parameters and selecting settings with fast reconstruction and good query quality.
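By way of non-limiting illustration, the incremental-versus-full update decision could be sketched as follows with the Faiss index from the earlier example; the 20% rebuild threshold and cluster count are assumptions for demonstration.

```python
import faiss
import numpy as np

REBUILD_RATIO = 0.2  # assumed: rebuild fully if new data exceeds 20% of the existing index

def update_index(index: faiss.Index, all_vectors: np.ndarray, new_vectors: np.ndarray) -> faiss.Index:
    """Incrementally add new vectors, or rebuild the whole index when the change is large."""
    if len(new_vectors) <= REBUILD_RATIO * index.ntotal:
        faiss.normalize_L2(new_vectors)
        index.add(new_vectors)             # incremental update: original data untouched
        return index
    # Full rebuild: re-train and re-add every vector, then swap the new index in.
    d, nlist = all_vectors.shape[1], 100
    quantizer = faiss.IndexFlatIP(d)
    rebuilt = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    faiss.normalize_L2(all_vectors)
    rebuilt.train(all_vectors)
    rebuilt.add(all_vectors)
    return rebuilt
```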
The UA data to be identified is encoded using the pre-trained LLM, and the Prompt is generated through the vector retrieval service. The new UA is vectorized with the pre-trained language model (LLM) to obtain its dense vector. The previously developed vector retrieval service is called with the UA vector passed in, and the TopN UAs most similar to it are looked up in the vector library. According to the returned list of similar UAs, the metadata database (Metadata Database) is queried to obtain the device information of the corresponding UAs, and the devices with the highest confidence are screened to form the Prompt candidates. As an example, to identify a UA, the formed Prompt is input into the pre-trained language model, which outputs the most probable device choice. A manual feedback function is provided for erroneous results, and the feedback data is used to incrementally update the index of the vector retrieval service, continuously optimizing service quality. Container technology such as Docker is used to deploy the service, ensuring stable and highly available automatic Prompt generation.
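By way of non-limiting illustration, a Prompt could be assembled from the TopN retrieval results as sketched below; the template wording and field names are assumptions and not the application's prescribed Prompt format.

```python
def build_prompt(ua_to_identify: str, similar: list[tuple[str, str, float]]) -> str:
    """Combine TopN similar UAs and their device info with the UA to be identified."""
    lines = ["Known user agent strings and their devices:"]
    for ua, device, score in similar:
        lines.append(f"- UA: {ua}\n  Device: {device} (similarity {score:.2f})")
    lines.append(f"User agent to identify: {ua_to_identify}")
    lines.append("Answer with the most likely device for the user agent above.")
    return "\n".join(lines)

prompt = build_prompt(
    "Mozilla/5.0 (Linux; Android 13; SM-S918B) ...",
    [("Mozilla/5.0 (Linux; Android 13; SM-S911B) ...", "Samsung Galaxy S23", 0.93)],
)
```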
Through the preceding flow, the Prompt for the new UA has been generated. The Prompt is then further analyzed with the AIGC model to obtain the device prediction. A large number of real devices and their corresponding Prompts are collected to train and tune the AIGC model, yielding a pre-trained model for device identification. The newly generated Prompt is used as input to call the prediction interface of the AIGC model and obtain the device prediction result. The AIGC model learns the contextual semantic information of the sample through its self-attention mechanism to obtain association weights over the candidate devices; when the weights are ambiguous, multiple candidate devices are returned instead of a single result. The prediction results of the model are verified; if accuracy is low, error cases are collected and the model is further trained and optimized. The AIGC service is deployed to the cloud environment to ensure stable and reliable online service, with model versions controlled and model performance tracked. Identification metrics (Metrics) of the online service, such as accuracy, recall, and F1 score, are monitored to judge whether the model's performance is satisfactory or degrading. When the identification metrics drop, the model is incrementally optimized with new training data, and the underperforming online model version is replaced.
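By way of non-limiting illustration, the metric monitoring step could be sketched with scikit-learn as follows; the F1 alert threshold, the example labels, and the trigger_incremental_retraining hook are hypothetical assumptions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

F1_ALERT_THRESHOLD = 0.90  # assumed service-level target for device identification

def evaluate_online_batch(y_true: list, y_pred: list) -> dict:
    """Compute identification metrics over a batch of verified online predictions."""
    return {
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

def trigger_incremental_retraining(error_cases):
    """Hypothetical hook: collect error cases and launch incremental model training."""
    pass

metrics = evaluate_online_batch(["phone", "tablet", "phone"], ["phone", "phone", "phone"])
if metrics["f1"] < F1_ALERT_THRESHOLD:
    errors = [("ua-or-prompt", "tablet", "phone")]  # (input, expected, predicted) placeholders
    trigger_incremental_retraining(errors)
```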
The application also provides a system for identifying user equipment information based on AIGC, comprising: a training module, used for acquiring data containing user agent character strings and the corresponding equipment information, encoding the user agent character strings with a pre-trained language model to obtain the corresponding user agent character string vectors, and constructing a user agent character string vector library; a receiving module, used for receiving the user agent character string to be identified; an encoding module, used for encoding the user agent character string to be identified with the pre-trained language model to obtain the vector of the user agent character string to be identified; a retrieval module, used for retrieving, from the user agent character string vector library by a vector retrieval method, the first N user agent character string vectors similar to the vector of the user agent character string to be identified, where N is a positive integer; a generating module, used for combining the equipment information of the user agent character strings corresponding to the first N user agent character string vectors with the user agent character string to be identified to generate a sequence, and constructing prompt information from the sequence; and an output module, used for taking the prompt information and the user agent character string to be identified as input, feeding them into the pre-trained language model, and outputting the predicted equipment information as the equipment information corresponding to the user agent character string to be identified.
According to the method and system, user equipment information is identified by constructing a user agent character string vector library and combining vector retrieval with language model prediction, so that the feature information in the user agent character string is fully utilized and the identification accuracy of the equipment information is improved. The pre-trained language model endows the system with strong semantic understanding and prediction capability, while retrieval of similar character string vectors provides effective supervision information; combining the two overcomes the obscurity and loosely standardized expression of user agent character strings and greatly improves the intelligence and robustness of the system. In general, the method and system integrate the advantages of vector retrieval and deep learning algorithms, and can realize accurate and efficient identification of user equipment information.
The foregoing is a schematic description of the invention and its embodiments, which is not limiting; the invention can be embodied in other specific forms without departing from its spirit or essential characteristics. The drawings depict only one embodiment of the invention, so the actual construction is not limited thereto, and any reference numerals in the claims shall not limit the claims concerned. Therefore, if a person of ordinary skill in the art, informed by this disclosure, devises structures and embodiments similar to the present scheme without creative design and without departing from the gist of the invention, they shall fall within the protection scope of this patent. Furthermore, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Several of the elements recited in the product claims may be embodied by one element in software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.

Claims (8)

1. A method of identifying user equipment information based on an AIGC, comprising:
acquiring equipment information corresponding to the user agent character string and the user agent character string;
encoding the user agent character string by utilizing a pre-trained language model to acquire a user agent character string vector, and constructing a user agent character string vector library;
receiving a user agent character string to be identified;
encoding the user agent character string to be identified by utilizing the pre-trained language model to obtain a vector of the user agent character string to be identified;
retrieving the first N user agent character string vectors similar to the vector of the user agent character string to be identified from a user agent character string vector library, wherein N is a positive integer;
combining the device information of the user agent character strings corresponding to the first N user agent character string vectors with the user agent character strings to be identified to generate a sequence, and constructing prompt information by using the sequence;
taking the prompt information and the user agent character string to be identified as inputs, inputting a pre-trained language model, and outputting predicted equipment information as equipment information corresponding to the user agent character string to be identified;
wherein, AIGC means to output predicted device information as device information corresponding to a user agent character string to be recognized by using a pre-trained language model;
The step of generating the sequence comprises:
judging whether the first N user agent character string vectors are acquired or not, wherein N is a positive integer;
if the judgment result is yes, acquiring user agent character strings corresponding to the first N user agent character string vectors;
acquiring equipment information corresponding to the first N user agent character strings from a user agent character string vector library;
splicing the equipment information of the first N user agent character strings with the user agent character strings to be identified to generate a sequence;
the step of constructing the prompt message comprises the following steps:
acquiring the semantic and contextual information of the user agent character string to be identified by utilizing a pre-trained language model;
acquiring vocabulary corresponding to the equipment information from the sequence by using the acquired semantic and context information through a word vector method as a prompt vocabulary;
and generating prompt information through a pre-trained language model according to the prompt vocabulary.
2. The method for identifying user equipment information based on AIGC of claim 1, wherein:
the step of obtaining the first N user agent string vectors includes:
encoding the user agent character string to be identified by utilizing the pre-trained language model, and obtaining the vector of the user agent character string to be identified;
Performing similarity comparison between the vector of the user agent character string to be identified and the user agent character string vector in the user agent character string vector library by using a vector retrieval method;
based on the result of the similarity comparison, the first N user agent string vectors that are similar to the vector of user agent strings to be identified are selected.
3. The method for identifying user equipment information based on AIGC of claim 2, wherein:
the vector retrieval method is cosine similarity, euclidean distance, manhattan distance or Minkowski distance.
4. The method for identifying user equipment information based on AIGC of claim 1, wherein:
the step of outputting predicted device information includes:
combining the prompt information and the user agent character string to be identified as a text sequence, and inputting the text sequence into a pre-trained language model;
semantic understanding is carried out on the text sequence through the pre-trained language model, and prediction equipment information corresponding to the user agent character string to be recognized is output.
5. The method for identifying user equipment information based on AIGC of claim 4, wherein:
the pre-trained language model is a BERT model, a GPT model, or a Transformer model.
6. The method for identifying user equipment information based on AIGC of claim 1, wherein:
the user agent character strings are in one-to-one correspondence with the corresponding device information.
7. The method for identifying user equipment information based on AIGC of claim 1, wherein:
the process of encoding the user agent character string to obtain the user agent character string vector adopts a word embedding method.
8. A system based on the AIGC-based method of identifying user equipment information of any of claims 1 to 7, comprising:
the training module is used for acquiring data containing user agent character strings and equipment information corresponding to the user agent character strings, coding the user agent character strings by utilizing a pre-trained language model, acquiring corresponding user agent character string vectors and constructing a user agent character string vector library;
the receiving module is used for receiving the user agent character string to be identified;
the coding module is used for coding the user agent character string to be identified by utilizing the pre-trained language model to obtain a vector of the user agent character string to be identified;
the retrieval module is used for retrieving the first N user agent character string vectors similar to the vector of the user agent character string to be identified from the user agent character string vector library by using a vector retrieval method, wherein N is a positive integer;
the generating module is used for combining the device information of the user agent character strings corresponding to the first N user agent character string vectors with the user agent character strings to be identified to generate a sequence, and constructing prompt information by using the sequence;
the output module is used for taking the prompt information and the user agent character string to be identified as input, inputting the pre-trained language model, and outputting predicted equipment information as equipment information corresponding to the user agent character string to be identified.
CN202311601466.2A 2023-11-28 2023-11-28 Method and system for identifying user equipment information based on AIGC Active CN117312928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311601466.2A CN117312928B (en) 2023-11-28 2023-11-28 Method and system for identifying user equipment information based on AIGC

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311601466.2A CN117312928B (en) 2023-11-28 2023-11-28 Method and system for identifying user equipment information based on AIGC

Publications (2)

Publication Number Publication Date
CN117312928A CN117312928A (en) 2023-12-29
CN117312928B true CN117312928B (en) 2024-02-13

Family

ID=89250212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311601466.2A Active CN117312928B (en) 2023-11-28 2023-11-28 Method and system for identifying user equipment information based on AIGC

Country Status (1)

Country Link
CN (1) CN117312928B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733822A (en) * 2021-03-31 2021-04-30 上海旻浦科技有限公司 End-to-end text detection and identification method
CN114692635A (en) * 2022-02-23 2022-07-01 北京快确信息科技有限公司 Information analysis method and device based on vocabulary enhancement and electronic equipment
WO2022141878A1 (en) * 2020-12-28 2022-07-07 平安科技(深圳)有限公司 End-to-end language model pretraining method and system, and device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022141878A1 (en) * 2020-12-28 2022-07-07 平安科技(深圳)有限公司 End-to-end language model pretraining method and system, and device and storage medium
CN112733822A (en) * 2021-03-31 2021-04-30 上海旻浦科技有限公司 End-to-end text detection and identification method
CN114692635A (en) * 2022-02-23 2022-07-01 北京快确信息科技有限公司 Information analysis method and device based on vocabulary enhancement and electronic equipment

Also Published As

Publication number Publication date
CN117312928A (en) 2023-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant