CN112016010A

CN112016010A - Natural language semantic library construction method for automatic driving test scene description

Info

Publication number: CN112016010A
Application number: CN202010462504.0A
Authority: CN
Inventors: 王赟芝; 杜志彬; 赵瑞文; 周博林; 陈蔯; 赵启东; 翟洋
Original assignee: Sinotruk Data Co ltd; China Automotive Technology and Research Center Co Ltd; Automotive Data of China Tianjin Co Ltd
Current assignee: Sinotruk Data Co ltd; China Automotive Technology and Research Center Co Ltd; Automotive Data of China Tianjin Co Ltd
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2020-12-01

Abstract

The invention provides a natural language semantic library construction method for automatic driving test scene description, comprising the following steps: step 1: crawling specific online resources with a crawler program; step 2: standardizing the format of the crawled information resource address links; step 3: processing repeatedly captured content with a Bloom filter; step 4: preprocessing the obtained text by word segmentation and part-of-speech labeling; step 5: ranking the keywords of the preprocessed text with a text keyword ranking algorithm; step 6: improving the weight distribution of the keywords in the obtained text through dimensions such as pointing importance; and step 7: adjusting the weight distribution according to the keyword ranking result and finally generating the automatic driving test scene semantic library. The method avoids the acquisition of repeated content and the interference that such content causes in the statistics and extraction of the final keywords.

Description

Natural language semantic library construction method for automatic driving test scene description

Technical Field

The invention belongs to the field of automatic driving, and particularly relates to a method for constructing a natural language semantic library for automatic driving test scene description.

Background

The automatic driving simulation test scene is key to whether automatic driving vehicles can be deployed in practice. The completeness of the semantic library is a crucial link for the natural language interface of the test scene library. Many methods for constructing a semantic library exist at present; the most representative keyword extraction algorithms include TF-IDF, LDA and TextRank. However, these algorithms cannot meet the requirements of an automatic driving test scene semantic library in terms of keyword identification, ranking and selection, information is easily captured repeatedly, and no natural language semantic library dedicated to describing automatic driving test scenes exists.

Disclosure of Invention

In view of this, the invention provides a natural language semantic library construction method for automatic driving test scene description, so as to solve the problem of repeated web page capture.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

A natural language semantic library construction method for automatic driving test scene description comprises the following steps:

step 1: crawling specific online resources by applying a crawler program;

step 2: standardizing the format of the crawled information resource address links, and deleting the information resource addresses that have already been accessed;

step 3: processing repeatedly captured content by applying a Bloom filter;

step 4: preprocessing the obtained text by word segmentation, part-of-speech labeling and the like;

step 5: ranking the keywords of the preprocessed text by applying a text keyword ranking algorithm;

step 6: improving the weight distribution of the keywords in the obtained text through three dimensions, namely pointing importance, part-of-speech importance and frequency importance, thereby optimizing the keyword ranking result;

step 7: adjusting the weight distribution according to the keyword ranking result, and finally generating the automatic driving test scene semantic library.

Further, the standardization of the format of the crawled information resource address links in step 2 comprises the following steps:

step a: converting the URL protocol name and host name to lowercase;

step b: converting the escape sequences in the string to uppercase;

step c: deleting the fragment;

step d: deleting an empty query string "?";

step e: deleting the default suffix;

step f: deleting redundant dot-segments;

step g: deleting the prefix "www";

step h: deleting variables that have default values;

step i: deleting redundant query strings;

step j: for similar web pages, processing the URL according to the handling defined for the different links.

Further, the repeatedly captured content is processed in step 3 as follows: the captured content data is converted into hash values by hash functions; if the bits at the positions given by the hash values of a piece of content are all already 1, that content is judged to be the same as or similar to content captured earlier, and one of the two is deleted.

Further, the preprocessing of the text in step 4 splits the text into complete sentences at periods, performs word segmentation and part-of-speech labeling on each sentence, and removes punctuation marks and stop words.

Further, after the text keywords are ranked by importance in step 5 and step 6, the output is audited against the descriptive terms of automatic driving test scene standards. If the output is unsatisfactory, the weight factors for pointing importance, part-of-speech importance and frequency importance are adjusted and the keywords are ranked again. Finally, the keywords at the top of the importance ranking are taken as the extraction result.

Compared with the prior art, the invention has the following advantages:

(1) By standardizing URL links and deleting duplicate content, the acquisition of duplicate content and its interference with final keyword statistics and extraction are avoided.

(2) On the basis of the TextRank keyword ranking algorithm, weight adjustment factors for pointing importance, part-of-speech importance and frequency importance are added according to the characteristics of the language used to describe automatic driving test scenes, which strengthens keyword acquisition for the automatic driving test scene semantic library.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

Fig. 1 is a schematic flow chart of a method for constructing a natural language semantic library for automatic driving test scene description according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

In the description of the present invention, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

As shown in Fig. 1, a natural language semantic library construction method for automatic driving test scene description comprises the following steps:

step 1: crawling a specific online resource by applying a crawler program;

step 2: to avoid repeated page acquisition, the links crawled by the crawler need to be standardized. Each information resource on the network has a unique address called a URL (Uniform Resource Locator). A crawler usually reaches and crawls web page content by following the URLs, contained in a page, that point to other pages. However, many URLs that look different actually refer to the same link, which can cause the same page to be crawled repeatedly and in turn affects content capture. URL standardization unifies the URL format and removes URLs that have already been accessed, preventing the crawler from entering an accessed URL a second time;

step 3: although some web pages have different URL link addresses, their content is roughly or completely duplicated. To avoid capturing repeated content, which causes errors in keyword statistics and extraction, a Bloom filter is applied to process the repeatedly captured content;

step 5: a text keyword ranking algorithm is applied: the text is split into words that serve as network nodes, forming a word network graph model, and the keywords in the text are ranked by a voting mechanism;

Further, in step 2, the URLs obtained by the crawler program are standardized uniformly according to the structure of a URL. A URL consists of three parts: the resource type, the host domain name where the resource is stored, and the resource file name, that is, protocol://hostname[:port]/path/[;parameters][?query]#fragment. The URL is processed according to the following steps:

step a: converting the URL protocol name and host name to lowercase;

step b: converting the escape sequences in the string to uppercase;

step c: deleting the fragment;

step d: deleting an empty query string "?";

step e: deleting the default suffix;

step f: deleting redundant dot-segments;

step g: deleting the prefix "www";

step h: deleting variables that have default values;

step i: deleting redundant query strings;
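
Steps a to i above can be sketched in Python with the standard library. This is a minimal illustration rather than the patented implementation: the function name `normalize_url`, the list of default page names `DEFAULT_PAGES` and the table of default query values `DEFAULT_PARAMS` are hypothetical, and a real crawler would configure them per site.

```python
import posixpath
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical configuration: which file names count as a "default suffix"
# and which query variables carry a default value that can be dropped.
DEFAULT_PAGES = {"index.html", "index.htm", "default.html"}
DEFAULT_PARAMS = {"page": "1"}


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                    # step a: lowercase protocol
    host = (parts.hostname or "").lower()            # step a: lowercase host
    if host.startswith("www."):                      # step g: drop "www" prefix
        host = host[4:]
    # Any user:password part of the URL is discarded in this sketch.
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # step b: uppercase the hex digits of percent-escape sequences
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), parts.path)
    # step f: resolve "." and ".." segments
    path = posixpath.normpath(path or "/")
    # step e: drop a default file name at the end of the path
    head, tail = posixpath.split(path)
    if tail.lower() in DEFAULT_PAGES:
        path = head

    # steps h and i: drop default-valued variables and repeated pairs
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs = [(k, v) for k, v in pairs if DEFAULT_PARAMS.get(k) != v]
    query = urlencode(list(dict.fromkeys(pairs)))

    # step c: the fragment is dropped; step d: an empty query leaves no "?"
    return urlunsplit((scheme, netloc, path, query, ""))


if __name__ == "__main__":
    visited = set()
    for link in ["HTTP://WWW.Example.COM/a/./b/../c/index.html?page=1&x=2&x=2#top",
                 "http://example.com/a/c?x=2"]:
        key = normalize_url(link)
        if key in visited:
            print("skip", link)       # already accessed under another spelling
        else:
            visited.add(key)
            print("crawl", key)
```

Keeping the normalized form in a set of visited URLs, as in the usage part, is one simple way to prevent the crawler from entering an accessed address a second time.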

Further, the repeatedly captured content is processed in step 3 as follows: the captured content data is converted into hash values by hash functions; if the bits at the positions given by the hash values of a piece of content are all already 1, that content is judged to be the same as or similar to content captured earlier, and one of the two is deleted, so as to reduce the interference with the statistics.
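
The duplicate check can be illustrated with a small Bloom filter. The sketch below is written under stated assumptions: the class `BloomFilter`, the helper `deduplicate`, the bit-array size and the use of seeded SHA-256 digests as hash functions are all choices made for illustration, not details given in this text. It only detects content that is identical after whitespace and case normalization; detecting merely similar content would need an additional technique.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; may give rare false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # A different seed prefix yields a different hash function.
            digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def __contains__(self, item: str) -> bool:
        # Content counts as already seen only if every addressed bit is 1.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)


def deduplicate(pages):
    """Keep the first copy of each page text and drop later repeats."""
    seen = BloomFilter()
    unique = []
    for text in pages:
        key = " ".join(text.split()).lower()   # normalize whitespace and case
        if key in seen:
            continue
        seen.add(key)
        unique.append(text)
    return unique


if __name__ == "__main__":
    pages = ["A  car turns left.", "a car turns left.", "Rain on the highway."]
    print(deduplicate(pages))
```

Because a Bloom filter never stores the content itself, memory use stays fixed regardless of page length; the trade-off is a small probability that new content is wrongly treated as a repeat.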

Further, the preprocessing of the text in step 4 splits the text into complete sentences at periods, performs word segmentation and part-of-speech labeling on each sentence, and removes punctuation marks and stop words such as "because", "the", "and" and "is", so as to reduce interference from high-frequency words that carry no meaning.
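
A minimal sketch of this preprocessing is shown below for English text. The stop-word set `STOP_WORDS` and the tiny part-of-speech table `POS_LEXICON` are hypothetical stand-ins; a production pipeline would use a real word segmenter and tagger for the language being processed.

```python
import re

# Illustrative stop words and a toy part-of-speech lexicon (both assumptions).
STOP_WORDS = {"because", "the", "and", "is", "a", "of", "on"}
POS_LEXICON = {
    "vehicle": "n", "pedestrian": "n", "road": "n",
    "turns": "v", "crosses": "v", "wet": "adj",
}


def preprocess(text: str):
    """Split into sentences at periods, tokenize, tag, and drop stop words."""
    sentences = [s.strip() for s in re.split(r"[.。]", text) if s.strip()]
    result = []
    for sentence in sentences:
        # \w+ keeps word characters only, which also removes punctuation.
        tokens = re.findall(r"\w+", sentence.lower())
        tagged = [(t, POS_LEXICON.get(t, "x")) for t in tokens if t not in STOP_WORDS]
        result.append(tagged)
    return result


if __name__ == "__main__":
    for sentence in preprocess("The vehicle turns left. The road is wet."):
        print(sentence)
```

Words missing from the lexicon receive the placeholder tag "x" here, so later steps can still decide how much weight an untagged word deserves.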

TextRank keyword graph. The TextRank model can be expressed as a keyword graph $G = (V, E)$ composed of a node set $V$ and an edge set $E$. The weight of the edge between any two nodes $V_i$ and $V_j$ in the graph is $\omega_{ji}$. For any node $V_i$, $In(V_i)$ is the set of nodes pointing to $V_i$, and $Out(V_i)$ is the set of nodes that $V_i$ points to. The weight score of node $V_i$ is as follows:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{\omega_{ji}}{\sum_{V_k \in Out(V_j)} \omega_{jk}} \, WS(V_j)$$

where $d$ is a damping coefficient in the range 0 to 1 that represents the probability that node $V_i$ points to any other node; it is usually set to 0.85.
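
The voting mechanism can be sketched as a plain TextRank iteration over a word co-occurrence graph. The sketch assumes the standard TextRank score update with damping coefficient d, an undirected graph in which words within a sliding window are linked, and a fixed number of iterations; the function name `textrank` and the `window` parameter are illustrative choices.

```python
from collections import defaultdict


def textrank(sentences, window=2, d=0.85, iterations=50):
    """Rank words by iterating the TextRank score update on a co-occurrence graph."""
    weights = defaultdict(lambda: defaultdict(float))
    for words in sentences:
        for i, w in enumerate(words):
            for u in words[i + 1 : i + 1 + window]:
                if u != w:
                    weights[w][u] += 1.0     # undirected edge, counted both ways
                    weights[u][w] += 1.0

    scores = {w: 1.0 for w in weights}
    for _ in range(iterations):
        new_scores = {}
        for vi in weights:
            total = 0.0
            for vj, w_ji in weights[vi].items():       # neighbours voting for vi
                out_sum = sum(weights[vj].values())    # total weight leaving vj
                total += w_ji / out_sum * scores[vj]
            new_scores[vi] = (1 - d) + d * total
        scores = new_scores
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sentences = [
        ["vehicle", "turns", "left"],
        ["vehicle", "brakes"],
        ["pedestrian", "crosses", "vehicle"],
    ]
    for word, score in textrank(sentences):
        print(f"{word}: {score:.3f}")
```

In this example "vehicle" co-occurs with every other word, so it collects the most votes and ranks first.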

Optimization of the TextRank algorithm. The TextRank algorithm passes the weight of $V_i$ to its associated nodes in uniform proportion. In the language used to describe automatic driving test scenes, more weight should be assigned to words along the following dimensions:

a. Pointing importance: the more distinct nodes point to node $V_i$, the more important $V_i$ is.

b. Part-of-speech importance: in the scene description language, verbs, nouns, adjectives and similar words, as well as words describing roads, weather, orientation and the like, should receive more attention.

c. Frequency importance: the more often a keyword occurs in the text, the more important the keyword is.

A, B and C denote the influence weights of the pointing importance, the part-of-speech importance and the frequency importance respectively, W denotes the influence weight of the node as a whole, and W + A + B + C = 1.

Finally, the weight score of every node is expressed by an iterative formula, in which the vector term is an N-dimensional vector whose elements are all 1 and the matrix term is the weight distribution matrix among words.

Each entry $\omega_{i,j}$ of this matrix can be expressed in terms of the pointing importance $\omega_A$, the part-of-speech importance $\omega_B$ and the frequency importance $\omega_C$,

where $X(V_j)$ denotes the importance value of node $V_j$, which is assigned to vehicle and pedestrian behavior verbs, scene description nouns, adjectives and the like according to the international standards for automatic driving, and $F(V)$ denotes the number of times the keyword $V$ appears in the text.
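
One possible way to combine the three importance dimensions into edge weights is sketched below. The exact formulas are not reproduced in this text, so the linear blend used here is an assumption made purely for illustration: each outgoing edge of a node receives a uniform share weighted by W plus shares proportional to the target's in-degree (pointing importance), its assigned part-of-speech value, and its frequency, weighted by A, B and C. The function name `transition_weights` and its parameters are hypothetical.

```python
def transition_weights(neighbors, pos_score, freq, W=0.4, A=0.2, B=0.2, C=0.2):
    """Blend uniform, pointing, part-of-speech and frequency shares per edge.

    neighbors: dict mapping each node to the list of nodes it points to.
    pos_score: dict mapping a word to its assigned part-of-speech value.
    freq:      dict mapping a word to its number of occurrences in the text.
    """
    assert abs(W + A + B + C - 1.0) < 1e-9, "factors must sum to 1"

    # Pointing importance is taken here as the number of nodes pointing at a word.
    in_degree = {v: 0 for v in neighbors}
    for outs in neighbors.values():
        for u in outs:
            in_degree[u] = in_degree.get(u, 0) + 1

    def share(outs, value):
        # Normalize a per-word value over one node's outgoing edges;
        # fall back to a uniform split when all values are zero.
        total = sum(value(u) for u in outs)
        return {u: (value(u) / total if total else 1.0 / len(outs)) for u in outs}

    weights = {}
    for vi, outs in neighbors.items():
        if not outs:
            continue
        w_a = share(outs, lambda u: in_degree[u])
        w_b = share(outs, lambda u: pos_score.get(u, 0.0))
        w_c = share(outs, lambda u: freq.get(u, 0))
        weights[vi] = {
            u: W / len(outs) + A * w_a[u] + B * w_b[u] + C * w_c[u] for u in outs
        }
    return weights


if __name__ == "__main__":
    graph = {"vehicle": ["turns", "road"], "turns": ["vehicle"], "road": ["vehicle"]}
    pos = {"vehicle": 1.0, "turns": 0.8, "road": 1.0}
    counts = {"vehicle": 3, "turns": 1, "road": 2}
    for node, row in transition_weights(graph, pos, counts).items():
        print(node, row)
```

Because every component is normalized per source node and the factors sum to 1, each row of weights also sums to 1, so adjusting W, A, B and C during the audit changes the ranking without changing the total amount of weight a node passes on.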

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A natural language semantic library construction method for automatic driving test scene description is characterized by comprising the following steps:

step 1: crawling a specific online resource by applying a crawler program;

step 6: improving the weight distribution proportion of the keywords in the obtained text through three dimensions of the pointing importance, the part-of-speech importance and the frequency importance, and further achieving the goal of optimizing the keyword sequencing result;

2. The natural language semantic library construction method for automatic driving test scene description according to claim 1, characterized in that the standardization of the format of the crawled information resource address links in step 2 comprises the following steps:

step a: converting the URL protocol name and host name to lowercase;

step b: converting the escape sequences in the string to uppercase;

step c: deleting the fragment;

step d: deleting an empty query string "?";

step e: deleting the default suffix;

step f: deleting redundant dot-segments;

step g: deleting the prefix "www";

step h: deleting variables that have default values;

step i: deleting redundant query strings;

3. The natural language semantic library construction method for automatic driving test scene description according to claim 1, characterized in that: the repeatedly captured content is processed in step 3 by converting the captured content data into hash values through hash functions; if the bits at the positions given by the hash values of a piece of content are all already 1, that content is judged to be the same as or similar to content captured earlier, and one of the two is deleted.

4. The natural language semantic library construction method for automatic driving test scene description according to claim 1, characterized in that: the preprocessing of the text in step 4 splits the text into complete sentences at periods, performs word segmentation and part-of-speech labeling on each sentence, and removes punctuation marks and stop words.

5. The natural language semantic library construction method for automatic driving test scene description according to claim 1, characterized in that: after the text keywords are ranked by importance in step 5 and step 6, the output is audited against the descriptive terms of automatic driving test scene standards; if the output is unsatisfactory, the weight factors for pointing importance, part-of-speech importance and frequency importance are adjusted and the keywords are ranked again; finally, the keywords at the top of the importance ranking are taken as the extraction result.