CN111291154B

CN111291154B - Dialect sample data extraction method, device and equipment and storage medium

Info

Publication number: CN111291154B
Application number: CN202010054280.XA
Authority: CN
Inventors: 陈鑫; 肖龙源; 蔡振华; 李稀敏; 刘晓葳; 谭玉坤
Original assignee: Xiamen Kuaishangtong Technology Co Ltd
Current assignee: Xiamen Kuaishangtong Technology Co Ltd
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2022-08-23
Anticipated expiration: 2040-01-17
Also published as: CN111291154A

Abstract

The invention discloses a dialect sample data extraction method comprising the following steps: acquiring a first dialect of each of a plurality of dialect areas and city data corresponding to each dialect area, wherein one dialect area corresponds to one city; classifying the dialect areas that share the same first dialect into the same dialect group, thereby obtaining a plurality of dialect groups; sorting each dialect group according to the city data corresponding to each dialect area, and determining a target dialect area of each dialect group from the sorted dialect group; acquiring medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group; and taking the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data. In selecting machine learning data, the data should therefore in theory cover all Mandarin (official-speech) regions, which strengthens the generalization capability of the model.

Description

Dialect sample data extraction method, device, equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting dialect sample data.

Background

In natural language processing, and particularly in task-oriented dialogue robots, dialects are a persistent headache: China has roughly seven major Mandarin (guanhua, "official speech") regions and hundreds or even thousands of dialects. Mainstream natural language processing is built on standard Chinese, so in non-Mandarin regions task recognition errors, such as intent recognition errors, may occur. For example, when asking a price, people in most places say "how much money", while people in some dialect areas use a phrase that translates literally as "more money". Because the training sample data lacks dialect data, the dialogue robot cannot recognize the dialect and therefore cannot answer the user accurately.

Disclosure of Invention

The invention provides a dialect sample data extraction method, a dialect sample data extraction device, dialect sample data extraction equipment and a computer readable storage medium, and mainly aims to extract sample data for all dialect areas and improve the universality of machine learning algorithm data.

In order to achieve the above object, the present invention provides a dialect sample data extraction method applied to an electronic device, where the method includes:

acquiring a first dialect of a plurality of dialect areas and city data corresponding to each dialect area in the plurality of dialect areas, wherein one dialect area corresponds to one city;

classifying the dialect regions with the same first dialect into the same dialect group, and obtaining a plurality of dialect groups;

sorting each dialect group according to the city data corresponding to each dialect region, and determining a target dialect region of each dialect group from each sorted dialect group;

acquiring medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group;

and taking the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data.

Preferably, the city data includes GDP data, and the sorting of each dialect group according to the city data corresponding to each dialect area and the determining of the target dialect area of each dialect group from each sorted dialect group include:

sorting the dialect areas in each dialect group in descending order of the GDP data of the city corresponding to each dialect area, and taking the top N dialect areas, N being a preset number, as the target dialect areas of that dialect group.

Preferably, the city data includes medical aesthetics consumption data, and the sorting of each dialect group according to the city data corresponding to each dialect area and the determining of the target dialect area of each dialect group from each sorted dialect group include:

sorting the dialect areas in each dialect group in descending order of the medical aesthetics consumption data of the city corresponding to each dialect area, and taking the top N dialect areas, N being a preset number, as the target dialect areas of that dialect group.

Preferably, the acquiring of medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group includes:

acquiring medical aesthetics dialogue data stored on a server of a medical aesthetics representative organization in the city corresponding to the target dialect area of each dialect group.

Preferably, the medical aesthetics representative organization comprises one or more of the following: a medical aesthetics representative institution and a medical aesthetics hospital.

Preferably, the method further comprises:

acquiring a second dialect of each dialect area;

acquiring medical aesthetics dialogue data, conducted in the second dialect, of the city corresponding to the target dialect area of each dialect group;

and taking the acquired second-dialect medical aesthetics dialogue data corresponding to each dialect group as dialect sample data.

Preferably, the method further comprises:

counting dialect sample data size of each dialect group;

and if the dialect sample data size of one target dialect group is lower than the data size threshold, increasing the dialect sample data size of the target dialect group.

To achieve the above object, the present invention further provides an electronic device, including a memory and a processor, where the memory stores a dialect sample data extraction program executable on the processor, and the dialect sample data extraction program, when executed by the processor, implements the following steps:

classifying the dialect areas with the same first dialect into the same dialect group to obtain a plurality of dialect groups;

In order to achieve the above object, the present invention further provides an electronic device, comprising:

an obtaining module, configured to acquire a first dialect of each of a plurality of dialect areas and city data corresponding to each dialect area, wherein one dialect area corresponds to one city;

a classification module, configured to classify the dialect areas with the same first dialect into the same dialect group, thereby obtaining a plurality of dialect groups;

a determining module, configured to sort each dialect group according to the city data corresponding to each dialect area and determine a target dialect area of each dialect group from each sorted dialect group;

the obtaining module being further configured to acquire medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group;

the determining module being further configured to take the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data.

Furthermore, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a dialect sample data extraction program, which is executable by one or more processors to implement the steps of the dialect sample data extraction method as described above.

The method acquires a first dialect of each of a plurality of dialect areas and city data corresponding to each dialect area, wherein one dialect area corresponds to one city; classifies the dialect areas with the same first dialect into the same dialect group, thereby obtaining a plurality of dialect groups; sorts each dialect group according to the city data corresponding to each dialect area and determines a target dialect area of each dialect group from each sorted dialect group; acquires medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group; and takes the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data. The method mainly addresses the poor generalization of a model caused by machine learning training data with insufficient features. In selecting machine learning data, the data should therefore in theory cover all Mandarin (official-speech) regions, which strengthens the generalization capability of the model.

Drawings

Fig. 1 is a schematic flow chart illustrating a dialect sample data extraction method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of city data corresponding to a plurality of dialect areas according to the present invention;

Fig. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present invention;

Fig. 4 is a block diagram illustrating a dialect sample data extraction program according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

An embodiment of the present invention provides a dialect sample data extraction method applied to electronic devices, including, but not limited to, medical aesthetics robots, terminals, and the like. The electronic device acquires a first dialect of each of a plurality of dialect areas and city data corresponding to each dialect area, wherein one dialect area corresponds to one city; classifies the dialect areas with the same first dialect into the same dialect group, thereby obtaining a plurality of dialect groups; sorts each dialect group according to the city data corresponding to each dialect area and determines a target dialect area of each dialect group from each sorted dialect group; acquires medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group; and takes the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data. The method mainly addresses the poor generalization of a model caused by machine learning training data with insufficient features. In selecting machine learning data, the data should therefore in theory cover all Mandarin (official-speech) regions, which strengthens the generalization capability of the model.

The present invention will be described in detail with reference to examples.

The invention provides a dialect sample data extraction method. Fig. 1 is a schematic flow chart of the dialect sample data extraction method according to an embodiment of the present invention. The method may be performed by an electronic device, which may be implemented in software and/or hardware. The method of this embodiment is not limited to the steps shown in the flow chart; some of the steps shown may be omitted, and the order of the steps may be changed.

In this embodiment, the dialect sample data extraction method is applied to an electronic device, and includes:

S10, acquiring a first dialect of each of a plurality of dialect areas and city data corresponding to each of the dialect areas.

In this embodiment, one dialect area corresponds to one city. For example, Fig. 2 shows five dialect areas, each corresponding to one city; the city corresponding to the first Gan dialect area is Huangshi.

In this embodiment, the dialects of the plurality of dialect areas may include all dialects within China, including but not limited to Gan, Zang (Tibetan), Mandarin (guanhua), and Bai. When the dialect areas cover as many dialects as possible, the subsequent samples contain more dialect types and the machine can be trained more accurately.

S11, classifying the dialect areas with the same first dialect into the same dialect group, and obtaining a plurality of dialect groups.

For example, as shown in Fig. 2, the first dialect of the first, second and third Gan dialect areas is Gan in each case, so these three areas are classified into the Gan dialect group. Likewise, the first dialect of the first and second Mandarin areas is Northern Mandarin in both cases, so these two areas are classified into the Northern Mandarin group.
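The grouping in step S11 can be sketched as follows. This is a minimal illustration, assuming each dialect area is available as a simple record; the record layout, the function name `group_by_first_dialect`, and every city name other than Huangshi are hypothetical placeholders rather than anything specified in the description.

```python
from collections import defaultdict

# Hypothetical records: (dialect area name, first dialect, city).
# Only Huangshi is named in the description; the other cities are placeholders.
areas = [
    ("Gan area 1", "Gan", "Huangshi"),
    ("Gan area 2", "Gan", "City B"),
    ("Gan area 3", "Gan", "City C"),
    ("Mandarin area 1", "Northern Mandarin", "City D"),
    ("Mandarin area 2", "Northern Mandarin", "City E"),
]


def group_by_first_dialect(records):
    """Step S11: put areas that share a first dialect into one dialect group."""
    groups = defaultdict(list)
    for name, first_dialect, city in records:
        groups[first_dialect].append((name, city))
    return dict(groups)


groups = group_by_first_dialect(areas)
```

Keying the groups by the first dialect keeps the one-area-to-one-city pairing intact, so the later ranking step can look up city data for each member of a group.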

S12, sorting each dialect group according to the city data corresponding to each dialect area, and determining the target dialect area of each dialect group from each sorted dialect group.

In an embodiment of the present invention, the city data includes GDP data, and the sorting of each dialect group according to the city data corresponding to each dialect area and the determining of the target dialect area of each dialect group from each sorted dialect group include:

sorting the dialect areas in each dialect group in descending order of the GDP data of the city corresponding to each dialect area, and taking the top N dialect areas, N being a preset number, as the target dialect areas of that dialect group. Sorting each dialect group by city GDP identifies the cities with higher consumption capacity within each dialect; such cities have richer medical aesthetics dialogue data, so more sample data can be obtained.

In an embodiment of the present invention, the city data includes medical aesthetics consumption data, and the sorting of each dialect group according to the city data corresponding to each dialect area and the determining of the target dialect area of each dialect group from each sorted dialect group include:

sorting the dialect areas in each dialect group in descending order of the medical aesthetics consumption data of the city corresponding to each dialect area, and taking the top N dialect areas, N being a preset number, as the target dialect areas of that dialect group. Using the medical aesthetics consumption data of each dialect area directly reflects the medical aesthetics consumption level of the corresponding city more accurately. The higher that level, the more medical aesthetics dialogue data can be obtained from the dialect area, which enriches the sample data of the dialect group.
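Both ranking embodiments of step S12 follow the same pattern: sort one dialect group in descending order of a city metric and keep the top N areas. The sketch below assumes the metric is passed by name; the field names `gdp` and `aesthetics_spend`, the function `select_target_areas`, and all figures are hypothetical and in arbitrary units.

```python
def select_target_areas(group, metric, n):
    """Step S12: sort one dialect group by a city metric, descending, keep the top n."""
    ranked = sorted(group, key=lambda area: area[metric], reverse=True)
    return ranked[:n]


# Hypothetical figures in arbitrary units; not taken from the description.
gan_group = [
    {"area": "Gan area 1", "city": "City A", "gdp": 170, "aesthetics_spend": 4},
    {"area": "Gan area 2", "city": "City B", "gdp": 520, "aesthetics_spend": 9},
    {"area": "Gan area 3", "city": "City C", "gdp": 310, "aesthetics_spend": 12},
]

top_by_gdp = select_target_areas(gan_group, "gdp", n=2)
top_by_spend = select_target_areas(gan_group, "aesthetics_spend", n=1)
```

In this made-up data the two metrics disagree on the leading area, which illustrates why the description treats the consumption data as the more direct indicator.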

S13, acquiring medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group.

In an embodiment of the present invention, the acquiring of medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group includes:

acquiring medical aesthetics dialogue data stored on a server of a medical aesthetics representative organization in the city corresponding to the target dialect area of each dialect group. The medical aesthetics representative organization comprises one or more of the following: a medical aesthetics representative institution and a medical aesthetics hospital.

For example, the electronic device communicates directly with the servers of the medical aesthetics representative organizations; when the sample data on one server accumulates to a certain amount, that data can be transmitted directly to the electronic device. Rich dialect medical aesthetics dialogue data can thus be acquired in a more timely manner.
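One way to picture this accumulate-then-transmit behaviour is the sketch below. It is only an illustration under stated assumptions: the class `AestheticsServerBuffer`, the `push_threshold` parameter standing in for "a certain amount", and the `send` callback that models delivery to the electronic device are all hypothetical, and no real network transport is shown.

```python
class AestheticsServerBuffer:
    """Hypothetical stand-in for one representative organization's server."""

    def __init__(self, city, push_threshold, send):
        self.city = city
        self.push_threshold = push_threshold  # assumed value for "a certain amount"
        self.send = send                      # callback that delivers a batch to the device
        self.pending = []

    def record(self, utterance):
        """Store one dialogue record and push the batch once enough have accumulated."""
        self.pending.append(utterance)
        if len(self.pending) >= self.push_threshold:
            batch, self.pending = self.pending, []
            self.send(self.city, batch)


received = []  # what the electronic device has collected so far
server = AestheticsServerBuffer(
    "City B",
    push_threshold=3,
    send=lambda city, batch: received.append((city, batch)),
)
for text in ["dialogue 1", "dialogue 2", "dialogue 3", "dialogue 4"]:
    server.record(text)
```

After the loop, the first three records have been delivered as one batch and the fourth is still waiting on the server for the next batch.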

S14, taking the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data.
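Step S14 amounts to labelling each acquired dialogue with the dialect group it was collected for. A minimal sketch, assuming the target cities per group are already known and that step S13 is available as a lookup function; `build_dialect_samples`, `fake_store`, and the city names are hypothetical.

```python
def build_dialect_samples(targets_by_group, fetch_dialogues):
    """Step S14: label every dialogue fetched for a group's target cities with that group."""
    samples = []
    for group, cities in targets_by_group.items():
        for city in cities:
            for text in fetch_dialogues(city):
                samples.append({"dialect_group": group, "city": city, "text": text})
    return samples


# Hypothetical in-memory stand-in for the data acquired in step S13.
fake_store = {"City B": ["dialogue 1", "dialogue 2"], "City D": ["dialogue 3"]}

samples = build_dialect_samples(
    {"Gan": ["City B"], "Northern Mandarin": ["City D"]},
    lambda city: fake_store.get(city, []),
)
```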

In an embodiment of the invention, the method further comprises:

acquiring a second dialect of each dialect area;

Although the first dialect is the main language of each dialect area, users often also converse in the second dialect. Adding the second-dialect medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group therefore increases the number of dialect types, broadens the dialect sample data, and improves the accuracy of model training.

In an embodiment of the present invention, the method further includes:

counting dialect sample data size of each dialect group;

Counting the dialect sample data size of each dialect group makes it possible to balance the data volumes of the groups. This avoids the situation in which some dialect groups have too little data, contribute too little to model training, and therefore cannot be recognized accurately by the model; balancing the data volumes keeps the contribution of each dialect group to the model balanced.
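The counting and threshold check can be sketched as follows. The description only says that the sample data size of a group below the threshold is increased; topping up by resampling existing records, as done here, is an assumption made for illustration (collecting more dialogues would be another option). The function `balance_groups`, its parameters, and the sample records are hypothetical.

```python
from collections import Counter
import random


def balance_groups(samples, threshold, seed=0):
    """Count samples per dialect group and top up groups below the threshold.

    Topping up by resampling existing records is an assumption; the description
    only states that the data size of such a group is increased.
    """
    rng = random.Random(seed)
    counts = Counter(s["dialect_group"] for s in samples)
    balanced = list(samples)
    for group, count in counts.items():
        if count < threshold:
            pool = [s for s in samples if s["dialect_group"] == group]
            balanced.extend(rng.choices(pool, k=threshold - count))
    return counts, balanced


# Hypothetical sample set: five Gan records and two Northern Mandarin records.
samples = [{"dialect_group": "Gan", "text": f"gan {i}"} for i in range(5)]
samples += [{"dialect_group": "Northern Mandarin", "text": f"nm {i}"} for i in range(2)]

counts, balanced = balance_groups(samples, threshold=4)
balanced_counts = Counter(s["dialect_group"] for s in balanced)
```

With a threshold of 4, the Gan group is left alone and the Northern Mandarin group is brought up to the threshold.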

The method acquires a first dialect of each of a plurality of dialect areas and city data corresponding to each dialect area, wherein one dialect area corresponds to one city; classifies the dialect areas with the same first dialect into the same dialect group, thereby obtaining a plurality of dialect groups; sorts each dialect group according to the city data corresponding to each dialect area and determines a target dialect area of each dialect group from each sorted dialect group; acquires medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group; and takes the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data. The method mainly addresses the poor generalization of a model caused by machine learning training data with insufficient features. In selecting machine learning data, the data should therefore in theory cover all Mandarin (official-speech) regions, which strengthens the generalization capability of the model.

Fig. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present invention; in the present embodiment, the electronic device 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.

In the present embodiment, the electronic device 1 may be a Personal Computer (PC), or may be a terminal device such as a smartphone, a tablet Computer, a portable Computer, or a robot.

The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, for example a hard disk of the electronic device 1. The memory 11 may be an external storage device in other embodiments, such as a plug-in hard disk provided on the electronic device 1, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the dialect sample data extraction program 01, but also to temporarily store data that has been output or is to be output.

The processor 12, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip, is used for executing program codes or Processing data stored in the memory 11, such as dialect sample data extraction program 01.

The communication bus 13 is used to realize connection communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface), and is typically used to establish a communication link between the electronic device 1 and other electronic devices.

Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying processed information and for displaying a visualized user interface.

Fig. 3 shows only the electronic device 1 with the components 11-14 and the dialect sample data extraction program 01, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may include fewer or more components than shown, or combine certain components, or a different arrangement of components.

In the embodiment of the electronic device 1 shown in fig. 3, a dialect sample data extraction program 01 is stored in the memory 11; the processor 12 executes the dialect sample data extraction program 01 stored in the memory 11 to implement the following steps:

The functions or operation steps implemented when the above steps are executed are substantially the same as those of the above embodiments, and are not described herein again.

Alternatively, in other embodiments, the dialect sample data extracting program may be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.

For example, referring to fig. 4, a schematic diagram of program modules of a dialect sample data extraction program in an embodiment of the electronic device of the present invention is shown, in which the dialect sample data extraction program may be divided into an obtaining module 10, a classifying module 20, and a determining module 30, and exemplarily:

the obtaining module 10 is configured to acquire a first dialect of each of a plurality of dialect areas and city data corresponding to each dialect area, wherein one dialect area corresponds to one city;

the classification module 20 is configured to classify the dialect regions with the same first dialect into the same dialect group, and obtain a plurality of dialect groups;

a determining module 30, configured to sort each dialect group according to the city data corresponding to each dialect region, and determine a target dialect region of each dialect group from each sorted dialect group;

the obtaining module 10 is further configured to obtain medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group;

the determining module 30 is further configured to use the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data.

The functions or operation steps of the above-mentioned obtaining module 10, classifying module 20 and determining module 30 when executed are substantially the same as those of the above-mentioned embodiments, and are not described herein again.
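The division into an obtaining module 10, a classifying module 20 and a determining module 30 could be organized along the following lines. This is a rough sketch only: the class and method names, the in-memory inputs, and the sample figures are hypothetical and are not the program structure disclosed for Fig. 4.

```python
from collections import defaultdict


class ObtainingModule:
    """Module 10: supplies area records and, later, dialogue data per city."""

    def __init__(self, area_records, dialogue_store):
        self.area_records = area_records      # hypothetical in-memory inputs
        self.dialogue_store = dialogue_store

    def get_areas(self):
        return self.area_records

    def get_dialogues(self, city):
        return self.dialogue_store.get(city, [])


class ClassifyingModule:
    """Module 20: groups areas by first dialect."""

    def classify(self, areas):
        groups = defaultdict(list)
        for area in areas:
            groups[area["first_dialect"]].append(area)
        return groups


class DeterminingModule:
    """Module 30: ranks each group, picks targets, and labels the sample data."""

    def pick_targets(self, groups, metric, n):
        return {g: sorted(members, key=lambda a: a[metric], reverse=True)[:n]
                for g, members in groups.items()}

    def as_samples(self, targets, obtaining):
        return {g: [t for area in members for t in obtaining.get_dialogues(area["city"])]
                for g, members in targets.items()}


# Hypothetical inputs; figures are in arbitrary units.
areas = [
    {"first_dialect": "Gan", "city": "City A", "gdp": 1},
    {"first_dialect": "Gan", "city": "City B", "gdp": 3},
    {"first_dialect": "Northern Mandarin", "city": "City D", "gdp": 2},
]
obtaining = ObtainingModule(
    areas, {"City B": ["dialogue 1"], "City D": ["dialogue 2", "dialogue 3"]}
)
groups = ClassifyingModule().classify(obtaining.get_areas())
determining = DeterminingModule()
targets = determining.pick_targets(groups, "gdp", n=1)
samples = determining.as_samples(targets, obtaining)
```

The obtaining module is used twice, once for area and city data and once for dialogue data, which mirrors the "further configured" wording in the module description.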

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a dialect sample data extraction program is stored on the computer-readable storage medium, where the dialect sample data extraction program can be executed by one or more processors, and implemented functions or operation steps are substantially the same as those in the foregoing embodiments, and are not described herein again.

It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, an electronic device, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A dialect sample data extraction method, the method comprising:

sorting each dialect group according to city data corresponding to each dialect area, and determining a target dialect area of each dialect group from each sorted dialect group, wherein the city data comprise medical aesthetics consumption data and GDP data;

acquiring medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group, wherein the acquiring of medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group comprises:

acquiring medical aesthetics dialogue data stored on a server of a medical aesthetics representative organization in the city corresponding to the target dialect area of each dialect group;

taking the acquired medical aesthetics dialogue data corresponding to each dialect group as dialect sample data;

acquiring a second dialect of each dialect area;

2. The method of claim 1, wherein the city data comprises GDP data, and the sorting each dialect group according to the city data corresponding to each dialect region and determining the target dialect region of each dialect group from each sorted dialect group comprises:

3. The method of claim 1, wherein the city data comprises medical aesthetics consumption data, and wherein sorting each dialect group according to the city data corresponding to each dialect area and determining a target dialect area for each dialect group from each sorted dialect group comprises:

sorting the dialect areas in each dialect group in descending order of the medical aesthetics consumption data of the city corresponding to each dialect area, and taking the top N dialect areas, N being a preset number, as the target dialect areas of each dialect group.

4. The dialect sample data extraction method of claim 1, wherein the medical aesthetics representative organization comprises one or more of the following: a medical aesthetics representative institution and a medical aesthetics hospital.

5. The method of dialect sample data extraction of claim 1, the method further comprising:

counting dialect sample data size of each dialect group;

6. An electronic device for operating the dialect sample data extraction method according to any one of claims 1-5, wherein the electronic device comprises a memory and a processor, the memory having stored thereon a dialect sample data extraction program operable on the processor, the dialect sample data extraction program when executed by the processor implementing the steps of:

7. An electronic device for operating the dialect sample data extraction method according to any one of claims 1-5, wherein the electronic device comprises

a determining module, configured to sort each dialect group according to the city data corresponding to each dialect area, and determine a target dialect area of each dialect group from each sorted dialect group, wherein the city data comprise medical aesthetics consumption data and GDP data;

the obtaining module is further configured to obtain medical aesthetics dialogue data of the city corresponding to the target dialect area of each dialect group, and the obtaining module is specifically configured to:

8. A computer readable storage medium having stored thereon a dialect sample data extraction program executable by one or more processors for carrying out the steps of the dialect sample data extraction method of any one of claims 1 to 5.