Abstract

Machine learning in natural language processing analyzes datasets to make predictions for various real-world fields. By training machine learning algorithms on text datasets, a model can learn the patterns and structure of text in many different languages. The model can then perform text classification, sentiment analysis, and other tasks. A large and balanced dataset is required to develop an accurate machine learning model. However, collecting a reliable, large, and evenly distributed dataset is challenging and requires significant resources and time. As a solution to this challenge, data augmentation can be used to increase the size of a dataset by generating new data from the original dataset. This study investigates the impact of data augmentation on the performance of machine learning models using small datasets in three diverse languages: French, German, and Japanese. After data augmentation enlarges the training dataset for each of the three languages, three models are trained, one on each augmented training dataset. The performance of these three models was compared with that of three other models, each trained on one of the original training datasets. This addresses not only the lack of large and balanced datasets but also dataset scarcity in various areas. To this end, the generalization of each model trained on an augmented dataset is evaluated on the test datasets in the different languages. A machine learning model's ability to generalize can benefit situations where cross-lingual capabilities are needed, such as international market research, multilingual customer support, and obtaining cultural insights. The models' performance and generalization are measured with four evaluation metrics: accuracy, precision, recall, and F1-score. Our results show that data augmentation improved the models' sentiment analysis performance for French and Japanese.
The results also showed that a model trained on the Japanese dataset performed better in sentiment analysis when tested on the German test data, and vice versa. Similarly, a model trained on the German dataset showed marked improvement in sentiment analysis when tested on the French test dataset, and vice versa.
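
For readers unfamiliar with the four evaluation metrics named in the abstract, the following is a minimal sketch of how they are commonly computed for a binary sentiment task. It is an illustration only, not the thesis's implementation: the abstract does not state whether the labels are binary or multi-class or how scores are averaged, so the function name `binary_metrics`, the 0/1 label encoding, and the toy labels are all assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class.

    Hypothetical helper for illustration; assumes binary sentiment labels.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1

    total = tp + fp + fn + tn
    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (assumed: 1 = positive sentiment, 0 = negative sentiment).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Under this sketch, comparing a model trained on an original dataset with one trained on its augmented counterpart would amount to calling the same function on each model's predictions for the same test set.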

Advisor

Subah Alkushayni

Committee Member

Naseef Mansoor

Committee Member

John Burke

Date of Degree

2024

Language

English

Document Type

Thesis

Degree

Master of Science (MS)

Program of Study

Data Science

Department

Mathematics and Statistics

College

Science, Engineering and Technology

Included in

Data Science Commons

Rights Statement

In Copyright