CN112016010A - Natural language semantic library construction method for automatic driving test scene description - Google Patents

Info

Publication number
CN112016010A
Authority
CN
China
Prior art keywords
automatic driving
importance
driving test
text
test scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010462504.0A
Other languages
Chinese (zh)
Inventor
王赟芝
杜志彬
赵瑞文
周博林
陈蔯
赵启东
翟洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sinotruk Data Co ltd
China Automotive Technology and Research Center Co Ltd
Automotive Data of China Tianjin Co Ltd
Original Assignee
Sinotruk Data Co ltd
China Automotive Technology and Research Center Co Ltd
Automotive Data of China Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sinotruk Data Co ltd, China Automotive Technology and Research Center Co Ltd, Automotive Data of China Tianjin Co Ltd filed Critical Sinotruk Data Co ltd
Priority to CN202010462504.0A priority Critical patent/CN112016010A/en
Publication of CN112016010A publication Critical patent/CN112016010A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a natural language semantic library construction method for automatic driving test scene description, comprising the following steps: step 1: crawling specific online resources with a crawler program; step 2: standardizing the format of the crawled information resource address links; step 3: processing repeatedly captured content with a bloom filter; step 4: preprocessing the obtained text by word segmentation and part-of-speech tagging; step 5: ranking the keywords of the preprocessed text with a text keyword ranking algorithm; step 6: improving the weight distribution of the keywords in the obtained text through the pointing importance and other dimensions; step 7: adjusting the weight distribution according to the keyword ranking result, and finally generating an automatic driving test scene semantic library. The method avoids the acquisition of repeated content and its interference with the statistics and acquisition of the final keywords.

Description

Natural language semantic library construction method for automatic driving test scene description
Technical Field
The invention belongs to the field of automatic driving, and particularly relates to a method for constructing a natural language semantic library for automatic driving test scene description.
Background
The automatic driving simulation test scene is key to whether automatic driving vehicles can be put into practical use. The completeness of the semantic library is a crucial link for the natural language interface of the test scene library. Many methods for constructing a semantic library currently exist; among keyword extraction algorithms, the most representative include TF-IDF, LDA and TextRank. However, these algorithms cannot meet the requirements of an automatic driving test scene semantic library in terms of keyword identification, ranking and selection, information is easily captured repeatedly, and no natural language semantic library dedicated to describing automatic driving test scenes exists.
Disclosure of Invention
In view of this, the invention provides a natural language semantic library construction method for automatic driving test scene description, so as to solve the problem of repeated web page capture.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
A natural language semantic library construction method for automatic driving test scene description comprises the following steps:
step 1: crawling specific online resources with a crawler program;
step 2: standardizing the format of the crawled information resource address links, and deleting information resource addresses that have already been accessed;
step 3: processing repeatedly captured content with a bloom filter;
step 4: preprocessing the obtained text by word segmentation, part-of-speech tagging and the like;
step 5: ranking the keywords of the preprocessed text with a text keyword ranking algorithm;
step 6: improving the weight distribution of the keywords in the obtained text along three dimensions, namely pointing importance, part-of-speech importance and frequency importance, thereby optimizing the keyword ranking result;
step 7: adjusting the weight distribution according to the keyword ranking result, and finally generating an automatic driving test scene semantic library.
Further, the standardization of the format of the crawled information resource address links in step 2 comprises the following steps:
step a: converting the URL protocol name and host name to lowercase;
step b: converting the letters in escape sequences to uppercase;
step c: deleting the fragment;
step d: deleting an empty query string '?';
step e: deleting the default suffix;
step f: deleting redundant dot-segments ("." and "..");
step g: deleting the prefix "www";
step h: deleting variables that have default values;
step i: deleting redundant query strings;
step j: applying uniform processing to URLs where different links lead to similar web pages.
Further, the repeatedly captured content is processed in step 3 as follows: the captured content data is converted into hash values by hash functions; if the bits at the positions corresponding to the hash values of two pieces of content are all 1, the two are judged to be the same or similar, and one of them is deleted.
Further, the text preprocessing in step 4 splits the text into complete sentences at periods, performs word segmentation and part-of-speech tagging on each sentence, and removes punctuation marks and stop words.
Further, the text keywords are ranked by importance in step 5 and step 6, and the output is reviewed against the descriptive terms of automatic driving test scene standards. If the output is unsatisfactory, the weight factors for pointing importance, part-of-speech importance and frequency importance are adjusted and the keywords are ranked again; finally, the top-ranked keywords in the importance ranking are taken as the extraction result.
Compared with the prior art, the invention has the following advantages:
(1) By standardizing URL links and deleting duplicate content, the acquisition of repeated content and its interference with the statistics and acquisition of the final keywords are avoided.
(2) On the basis of the TextRank keyword ranking algorithm, weight adjustment factors for pointing importance, part-of-speech importance and frequency importance are added according to the characteristics of the language used to describe automatic driving test scenes, which strengthens keyword acquisition for the automatic driving test scene semantic library.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
Fig. 1 is a schematic flow chart of a natural language semantic library construction method for automatic driving test scene description according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In the description of the present invention, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in fig. 1, a natural language semantic library construction method for automatic driving test scene description comprises the following steps:
step 1: crawling specific online resources with a crawler program;
step 2: to avoid acquiring the same page repeatedly, the links crawled by the crawler are standardized. Each information resource on the network has a unique address called a URL (Uniform Resource Locator). A crawler usually accesses and crawls web content by following the URL addresses contained in a web page that point to other web pages. However, many URL addresses that look different actually refer to the same link, which can cause the same web page to be crawled repeatedly and thereby affect the crawled content. URL standardization unifies the URL format, removes URLs that have already been accessed, and prevents the crawler from entering an accessed URL address a second time.
step 3: although some web pages have different URL link addresses, their content is largely or completely duplicated. To avoid capturing repeated content, which would cause errors in keyword statistics and extraction, a bloom filter is applied to process the repeatedly captured content.
step 4: preprocessing the obtained text by word segmentation, part-of-speech tagging and the like;
step 5: applying a text keyword ranking algorithm, in which the text is split into words that serve as network nodes to form a word network graph model, and the keywords in the text are ranked by a voting mechanism;
step 6: improving the weight distribution of the keywords in the obtained text along three dimensions, namely pointing importance, part-of-speech importance and frequency importance, thereby optimizing the keyword ranking result;
step 7: adjusting the weight distribution according to the keyword ranking result, and finally generating an automatic driving test scene semantic library.
Further, in step 2, the URLs obtained by the crawler program are standardized uniformly according to the structure of a URL. A URL consists of three parts: the resource type, the domain name of the host where the resource is stored, and the resource file name, that is, protocol://hostname[:port]/path/[;parameters][?query]#fragment. The URL is processed according to the following steps:
step a: converting the URL protocol name and host name to lowercase;
step b: converting the letters in escape sequences to uppercase;
step c: deleting the fragment;
step d: deleting an empty query string '?';
step e: deleting the default suffix;
step f: deleting redundant dot-segments ("." and "..");
step g: deleting the prefix "www";
step h: deleting variables that have default values;
step i: deleting redundant query strings;
step j: applying uniform processing to URLs where different links lead to similar web pages.
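Steps a to i can be sketched with Python's standard `urllib.parse` module. This is only an illustrative reading of the steps, not the patented implementation: the default ports, default file names, default query values and redundant query keys below are hypothetical placeholders, sorting the remaining query variables is an added assumption, and step j is not covered because the source does not specify it.

```python
import posixpath
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical configuration (none of these values come from the source).
DEFAULT_PORTS = {"http": 80, "https": 443}
DEFAULT_SUFFIXES = ("index.html", "index.htm")
DEFAULT_QUERY_VALUES = {"page": "1"}
REDUNDANT_QUERY_KEYS = {"sessionid"}


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                      # step a
    host = (parts.hostname or "").lower()              # step a
    if host.startswith("www."):                        # step g
        host = host[4:]
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    path = parts.path or "/"
    # step b: uppercase the hex digits of percent-escapes (%7e -> %7E)
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)
    # step f: resolve "." and ".." segments, keeping a trailing slash
    keep_slash = path.endswith("/")
    path = posixpath.normpath(path)
    if keep_slash and not path.endswith("/"):
        path += "/"
    # step e: drop a default file name at the end of the path
    for suffix in DEFAULT_SUFFIXES:
        if path.endswith("/" + suffix):
            path = path[: -len(suffix)]
            break

    # steps d, h, i: an empty query disappears; default-valued and
    # redundant variables are removed; the rest are sorted (assumption).
    pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in REDUNDANT_QUERY_KEYS and DEFAULT_QUERY_VALUES.get(k) != v
    ]
    query = urlencode(sorted(pairs))

    return urlunsplit((scheme, netloc, path, query, ""))  # step c: no fragment


def unseen(urls):
    """Yield each normalized URL once, skipping addresses already visited."""
    visited = set()
    for url in urls:
        key = normalize_url(url)
        if key not in visited:
            visited.add(key)
            yield key
```

With this sketch, `http://www.example.com/index.html` and `HTTP://example.com:80/` normalize to the same string, so the crawler would visit the page only once.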
Further, the repeatedly captured content is processed in step 3 as follows: the captured content data is converted into hash values by hash functions; if the bits at the positions corresponding to the hash values of two pieces of content are all 1, the two are judged to be the same or similar, and one of them is deleted, so as to reduce interference with the statistics.
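A minimal bloom filter sketch of this duplicate check follows. The bit-array size, the number of hash functions, the use of salted SHA-256 and the whitespace and case normalization are all assumptions made for illustration. As written it only catches content that is identical after normalization; detecting merely similar pages would need an additional technique that the source does not describe. A bloom filter can also report false positives, so a small fraction of new pages may be dropped.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions (sizes are illustrative)."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # Salt the input with the seed to emulate independent hashes.
            digest = hashlib.sha256(seed.to_bytes(2, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        # "Seen" only if every corresponding bit is already 1.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def drop_duplicates(pages):
    """Keep the first copy of each page text, discarding repeats."""
    seen = BloomFilter()
    kept = []
    for text in pages:
        key = " ".join(text.split()).lower()  # assumed normalization
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept
```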
Further, the text preprocessing in step 4 splits the text into complete sentences at periods, performs word segmentation and part-of-speech tagging on each sentence, and removes punctuation marks and stop words such as "because of", "the", "and" and "is", so as to reduce interference from high-frequency words that carry no meaning.
Further, the text keywords are ranked by importance in step 5 and step 6, and the output is reviewed against the descriptive terms of automatic driving test scene standards. If the output is unsatisfactory, the weight factors for pointing importance, part-of-speech importance and frequency importance are adjusted and the keywords are ranked again; finally, the top-ranked keywords in the importance ranking are taken as the extraction result.
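The preprocessing in step 4 might look roughly like the sketch below. It works on English text with regular expressions purely for illustration; the stop-word list and the tiny part-of-speech lexicon are hypothetical stand-ins, and a real pipeline would use a proper word segmenter and tagger for the target language.

```python
import re

# Hypothetical stop words and toy part-of-speech lexicon (assumptions).
STOP_WORDS = {"because", "of", "the", "and", "is", "a"}
POS_LEXICON = {
    "vehicle": "n", "road": "n", "pedestrian": "n",
    "turns": "v", "crosses": "v",
    "wet": "adj", "left": "adj",
}


def preprocess(text: str):
    """Split into sentences, tokenize, tag, and drop stop words."""
    # Split at periods (ASCII or CJK full stop) and skip empty pieces.
    sentences = [s.strip() for s in re.split(r"[.。]", text) if s.strip()]
    result = []
    for sentence in sentences:
        # Keeping only alphabetic runs also removes punctuation marks.
        tokens = re.findall(r"[A-Za-z]+", sentence.lower())
        tagged = [
            (tok, POS_LEXICON.get(tok, "x"))  # "x" marks an unknown tag
            for tok in tokens
            if tok not in STOP_WORDS
        ]
        result.append(tagged)
    return result
```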
TextRank keyword graph. The TextRank model can be expressed as a keyword graph $G = (V, E)$, composed of a node set $V$ and an edge set $E \subseteq V \times V$. The weight of the edge between any two nodes $V_i$ and $V_j$ in the graph is $\omega_{ji}$. For any node $V_i$, $In(V_i)$ is the set of nodes that point to $V_i$, and $Out(V_i)$ is the set of nodes that $V_i$ points to. The weight score of node $V_i$ is as follows:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{\omega_{ji}}{\sum_{V_k \in Out(V_j)} \omega_{jk}} \, WS(V_j)$$
where $d$ is a damping coefficient in the range 0 to 1 that represents the probability that node $V_i$ points to any other node; it is usually set to 0.85.
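The scoring rule can be illustrated with a short iteration over a word co-occurrence graph. This sketch uses the standard TextRank update with a damping coefficient of 0.85; the co-occurrence window of 2, the fixed iteration count and the function names are assumptions chosen for the example.

```python
from collections import defaultdict


def build_graph(tokens, window=2):
    """Undirected co-occurrence graph: weights[u][v] counts how often
    u and v appear within `window` positions of each other."""
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0
    return {w: dict(nbrs) for w, nbrs in weights.items()}


def textrank(weights, d=0.85, iterations=50):
    """Iterate WS(Vi) = (1 - d) + d * sum_j w_ji / sum_k w_jk * WS(Vj)."""
    nodes = list(weights)
    score = {v: 1.0 for v in nodes}
    out_sum = {v: sum(weights[v].values()) for v in nodes}
    for _ in range(iterations):
        new_score = {}
        for vi in nodes:
            total = 0.0
            for vj in nodes:
                w_ji = weights[vj].get(vi, 0.0)
                if w_ji:
                    total += w_ji / out_sum[vj] * score[vj]
            new_score[vi] = (1 - d) + d * total
        score = new_score
    return score


tokens = "vehicle lane change vehicle brake lane vehicle".split()
scores = textrank(build_graph(tokens))
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, "vehicle" co-occurs with every other word most often, so it receives the highest score.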
Optimizing the TextRank algorithm. The TextRank algorithm passes the weight of $V_i$ to its associated nodes in equal proportions. In the language used to describe automatic driving test scenes, more weight should be assigned to words along the following dimensions:
a. Pointing importance: the more distinct nodes point to node $V_i$, the more important $V_i$ is.
b. Part-of-speech importance: in the scene description language, verbs, nouns and adjectives, as well as words describing roads, weather, orientation and the like, should receive more attention.
c. Frequency importance: the more often a keyword occurs in the text, the more important it is.
A, B and C denote the influence weights of the pointing importance, the part-of-speech importance and the frequency importance respectively, W denotes the influence weight of the node as a whole, and W + A + B + C = 1.
Finally, the iterative formula for the weight score of any node is
Figure BDA0002511503540000071
where the vector in the formula is an N-dimensional vector whose elements are all 1, and the weight distribution matrix among words is
Figure BDA0002511503540000072
Any entry $\omega_{i,j}$ can be expressed in terms of the pointing importance $\omega_A$, the part-of-speech importance $\omega_B$ and the frequency importance $\omega_C$:
Figure BDA0002511503540000073
where $X(V_j)$ denotes the importance of node $V_j$, assigned to vehicle and pedestrian behavior verbs, scene description nouns, adjectives and the like according to international automatic driving standards; $F(V)$ denotes the number of times keyword $V$ appears in the text.
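Because the formula images are not reproduced in this text, the exact combination of the three importance terms is unknown. The sketch below shows one hypothetical reading only: each outgoing weight is a convex combination of a uniform share (weight W) and three shares proportional to the target node's in-degree (A), its part-of-speech score $X(V_j)$ (B) and its frequency $F(V_j)$ (C), each normalized over the nodes that $V_i$ points to. The function name, the default factors and the normalization are assumptions, not the patent's formula.

```python
def transition_weights(out_links, in_degree, pos_score, freq,
                       W=0.25, A=0.25, B=0.25, C=0.25):
    """Hypothetical weight from Vi to each Vj in Out(Vi).

    Assumes every target has positive in-degree, part-of-speech score
    and frequency, so each row of weights sums to 1.
    """
    if abs(W + A + B + C - 1.0) > 1e-9:
        raise ValueError("W + A + B + C must equal 1")
    weights = {}
    for vi, targets in out_links.items():
        deg_sum = sum(in_degree[v] for v in targets)
        pos_sum = sum(pos_score[v] for v in targets)
        freq_sum = sum(freq[v] for v in targets)
        weights[vi] = {
            vj: W / len(targets)               # uniform TextRank share
            + A * in_degree[vj] / deg_sum      # pointing importance
            + B * pos_score[vj] / pos_sum      # part-of-speech importance
            + C * freq[vj] / freq_sum          # frequency importance
            for vj in targets
        }
    return weights


# Toy data: "road" is a frequent noun, "quickly" a rare adverb.
example = transition_weights(
    out_links={"vehicle": ["road", "quickly"]},
    in_degree={"road": 3, "quickly": 1},
    pos_score={"road": 1.0, "quickly": 0.2},
    freq={"road": 4, "quickly": 1},
)
```

Under these assumptions, adjusting A, B and C after reviewing the ranking output changes how strongly each dimension pulls weight toward a word, which matches the re-ranking loop described for steps 5 and 6.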
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (5)

1. A natural language semantic library construction method for automatic driving test scene description, characterized by comprising the following steps:
step 1: crawling specific online resources with a crawler program;
step 2: standardizing the format of the crawled information resource address links, and deleting information resource addresses that have already been accessed;
step 3: processing repeatedly captured content with a bloom filter;
step 4: preprocessing the obtained text by word segmentation, part-of-speech tagging and the like;
step 5: ranking the keywords of the preprocessed text with a text keyword ranking algorithm;
step 6: improving the weight distribution of the keywords in the obtained text along three dimensions, namely pointing importance, part-of-speech importance and frequency importance, thereby optimizing the keyword ranking result;
step 7: adjusting the weight distribution according to the keyword ranking result, and finally generating an automatic driving test scene semantic library.
2. The natural language semantic library construction method for automatic driving test scene description according to claim 1, characterized in that the standardization of the format of the crawled information resource address links in step 2 comprises the following steps:
step a: converting the URL protocol name and host name to lowercase;
step b: converting the letters in escape sequences to uppercase;
step c: deleting the fragment;
step d: deleting an empty query string '?';
step e: deleting the default suffix;
step f: deleting redundant dot-segments ("." and "..");
step g: deleting the prefix "www";
step h: deleting variables that have default values;
step i: deleting redundant query strings;
step j: applying uniform processing to URLs where different links lead to similar web pages.
3. The natural language semantic library construction method for automatic driving test scene description according to claim 1, characterized in that: the repeatedly captured content is processed in step 3 by converting the captured content data into hash values through hash functions; if the bits at the positions corresponding to the hash values of two pieces of content are all 1, the two are judged to be the same or similar, and one of them is deleted.
4. The natural language semantic library construction method for automatic driving test scene description according to claim 1, characterized in that: the text preprocessing in step 4 splits the text into complete sentences at periods, performs word segmentation and part-of-speech tagging on each sentence, and removes punctuation marks and stop words.
5. The natural language semantic library construction method for automatic driving test scene description according to claim 1, characterized in that: the text keywords are ranked by importance in step 5 and step 6, and the output is reviewed against the descriptive terms of automatic driving test scene standards; if the output is unsatisfactory, the weight factors for pointing importance, part-of-speech importance and frequency importance are adjusted and the keywords are ranked again; finally, the top-ranked keywords in the importance ranking are taken as the extraction result.
CN202010462504.0A 2020-05-27 2020-05-27 Natural language semantic library construction method for automatic driving test scene description Pending CN112016010A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010462504.0A CN112016010A (en) 2020-05-27 2020-05-27 Natural language semantic library construction method for automatic driving test scene description

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010462504.0A CN112016010A (en) 2020-05-27 2020-05-27 Natural language semantic library construction method for automatic driving test scene description

Publications (1)

Publication Number Publication Date
CN112016010A true CN112016010A (en) 2020-12-01

Family

ID=73507148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010462504.0A Pending CN112016010A (en) 2020-05-27 2020-05-27 Natural language semantic library construction method for automatic driving test scene description

Country Status (1)

Country Link
CN (1) CN112016010A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123642A (en) * 2012-02-22 2013-05-29 深圳市谷古科技有限公司 Searching method and device based on web language
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
WO2018157805A1 (en) * 2017-03-03 2018-09-07 腾讯科技(深圳)有限公司 Automatic questioning and answering processing method and automatic questioning and answering system
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
CN107832457A (en) * 2017-11-24 2018-03-23 国网山东省电力公司电力科学研究院 Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112744229A (en) * 2021-01-18 2021-05-04 国汽智控(北京)科技有限公司 Generation system of proprietary language in automatic driving field
CN112744229B (en) * 2021-01-18 2021-12-21 国汽智控(北京)科技有限公司 Generation system of proprietary language in automatic driving field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination