Accentator

An AI system that restores accents (tone marks) to unaccented Vietnamese text using a Transformer.

Introduction

In Vietnamese, unlike in English, a single base spelling can carry many different accents, and each one produces a different word. For example, "kho" changes meaning with the accent placed on it: "khó" means "hard", "khổ" means "miserable", "khò" is the sound of sleeping, and even unaccented "kho" has a meaning, which is "storage". Despite this, much Vietnamese text is written without accents, because it is quicker to skip the extra one or two keystrokes for accents that readers can usually infer anyway. However, there are cases where this causes confusion, as the "kho" example shows. Our project proposes a rough solution: using AI and machine learning to turn unaccented text into accented text based on its context.
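To make the ambiguity concrete, here is a minimal sketch of the reverse operation, stripping accents, using only the Python standard library. The function name `strip_accents` is hypothetical and is not taken from the project's code; it only illustrates how several distinct words collapse to the same unaccented spelling, which is what the model has to undo.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove Vietnamese diacritics from text (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # "đ"/"Đ" are standalone letters that do not decompose, so map them by hand.
    return without_marks.replace("đ", "d").replace("Đ", "D")


# All four words from the example above collapse to the same string.
for word in ["khó", "khổ", "khò", "kho"]:
    print(word, "->", strip_accents(word))
```

A function like this could also be used to build (unaccented, accented) training pairs from ordinary accented text, although the source does not say how the project's data was prepared.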

Although limited, there have been previous studies on this problem using Transformers. The most notable is duongntbk's repo, which uses the BERT architecture (Bidirectional Encoder Representations from Transformers) and treats the task as a machine translation problem. His method achieved 94.05% accuracy on test datasets. Another related study is from Phuong, who also used the BERT architecture to restore diacritics in online text, with the purpose of detecting hate speech on Vietnamese social media. That work achieved around 92% accuracy.

In our project, Accentator, we apply a lightweight version of the GPT-2 architecture, which is unidirectional and decoder-only, to see whether it improves on bidirectional methods.
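As an illustration of what "unidirectional" means here, the sketch below builds a causal attention mask in plain Python: position i may attend to positions 0..i but not to later ones, whereas a bidirectional model such as BERT sees the whole sequence. The function names are hypothetical and the code is not taken from the project.

```python
from typing import List, Sequence


def causal_mask(n: int) -> List[List[bool]]:
    """Return an n x n mask where mask[i][j] is True if i may attend to j."""
    # A decoder-only model only looks at the current and earlier positions.
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_context(tokens: Sequence[str], i: int) -> List[str]:
    """Tokens that position i is allowed to see under the causal mask."""
    return list(tokens[: i + 1])


tokens = list("kho")
for row in causal_mask(len(tokens)):
    print(["x" if allowed else "." for allowed in row])
print(visible_context(tokens, 1))
```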

Methodology

The structure of Accentator is as follows:

Word/Character embedding + Positional encoding
6 blocks, each consisting of:
- Layer Norm
- Attention Layer
- Layer Norm
- Linear Layer + GeLU
- Linear Layer
Linear Layer to Output
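The ordering inside one block can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the project's implementation: it works on a single vector, omits the learned scale and shift of layer normalization, takes the attention and feed-forward sublayers as placeholder callables, and assumes GPT-2 style residual connections, which the list above does not mention.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v: float) -> float:
    """Exact GELU activation: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def block(
    x: Vector,
    attention: Callable[[Vector], Vector],
    mlp: Callable[[Vector], Vector],
) -> Vector:
    """One block: Layer Norm -> attention, then Layer Norm -> feed-forward."""
    # Assumption: each sublayer output is added back to its input (residual).
    attended = attention(layer_norm(x))
    x = [xi + ai for xi, ai in zip(x, attended)]
    fed_forward = mlp(layer_norm(x))
    return [xi + fi for xi, fi in zip(x, fed_forward)]


# Placeholder sublayers: identity attention, and GELU standing in for
# "Linear + GELU, then Linear" with the linear maps left out.
out = block([1.0, 2.0, 3.0], attention=lambda h: h, mlp=lambda h: [gelu(v) for v in h])
print(out)
```

In a real model the placeholders would be a masked multi-head attention layer and the two linear layers listed above, and six such blocks would be stacked between the embedding and the output layer.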

Currently, the hyperparameters are:

Context size: 128
Vocab size:
Head number: 6
Embedding size: 256
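
The values above could be collected in a small configuration object, as in the following sketch. The class and field names are hypothetical, the block count of 6 comes from the structure listed earlier, and the vocab size is left unset because it is not given here. The `head_size` property assumes each head gets the embedding size divided by the head count, rounded down, which is a common convention but is not confirmed for this project; under that assumption 256 // 6 gives 42 dimensions per head, so the concatenated heads would span 252 dimensions rather than 256.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccentatorConfig:
    """Hypothetical container for the hyperparameters listed above."""

    context_size: int = 128
    vocab_size: Optional[int] = None  # not stated in the source
    n_head: int = 6
    n_embd: int = 256
    n_layer: int = 6  # "6 blocks" from the structure description

    @property
    def head_size(self) -> int:
        # Assumption: per-head width is floor(n_embd / n_head).
        return self.n_embd // self.n_head


config = AccentatorConfig()
print(config.head_size, config.head_size * config.n_head)
```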

We ran experiments on an NVIDIA A5000 GPU with 24 GB of VRAM.

Results

See more explanations and results here: https://docs.google.com/presentation/d/1S9H-FJmKm0u2wHCjLgMIbcm8y_VzR3SdODFEyvns7UA/edit?usp=sharing

goatguy2310/accentator

Folders and files

Latest commit

History

Repository files navigation

Accentator

Introduction

Methodology

Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages