Korean Datasets for LLMs
Large language models achieve strong results on many natural language tasks by training on massive datasets. In particular, many large language models, such as BERT and GPT, are trained on English datasets, some of which are publicly available. Because these datasets are biased toward English, large language models show significant performance degradation on tasks in other languages, such as Korean. We are building a massive Korean dataset to train LLMs for Korean tasks, using various techniques to gather and clean the necessary data, including web crawling, de-identification, and de-duplication.
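As a concrete illustration of the de-duplication step, the sketch below drops exact duplicates by hashing normalized documents. The normalization rules and the document-level granularity are simplifying assumptions for illustration, not our actual pipeline settings; near-duplicate detection (e.g., MinHash) would be layered on top.

```python
# Minimal sketch of exact de-duplication by content hashing.
# Normalization rules and document-level granularity are illustrative assumptions.
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies collide."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(documents):
    """Keep the first occurrence of each distinct (normalized) document."""
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

if __name__ == "__main__":
    crawled = [
        "대규모 언어 모델을 위한 한국어 데이터셋",
        "대규모 언어 모델을  위한 한국어 데이터셋",  # duplicate with extra whitespace
        "한국어 토크나이저 연구",
    ]
    print(deduplicate(crawled))
```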
Korean Tokenizers
We are developing a tokenizer based on deep learning for Korean natural language processing (NLP). Tokenization breaks down text into tokens for computer programs or NLP models to process. Korean tokenization commonly involves both decomposing words into morphemes (‘형태소’) and assigning part-of-speech (‘품사’) tags to the decomposed morphemes. We are using large-scale Transformer-based models, such as GPT-2, for this task. Our goal is to develop an accurate and efficient tokenizer suited for Korean NLP tasks.
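The sketch below only illustrates the interface such a tokenizer exposes: a Korean sentence goes in, and (morpheme, part-of-speech) pairs come out. The tiny lookup table stands in for the Transformer model, and the Sejong-style tags are just one possible tag set; both are hypothetical placeholders.

```python
# Illustrative sketch of the tokenizer interface: sentence in, (morpheme, POS) pairs out.
# The lookup table is a hypothetical stand-in for a trained Transformer model.
from typing import List, Tuple

# Hypothetical analysis results; a trained model would produce these.
TOY_ANALYSES = {
    "나는 학교에 간다": [
        ("나", "NP"),    # pronoun
        ("는", "JX"),    # topic particle
        ("학교", "NNG"),  # common noun
        ("에", "JKB"),   # adverbial particle
        ("가", "VV"),    # verb stem
        ("ㄴ다", "EF"),   # sentence-final ending
    ],
}

def tokenize(sentence: str) -> List[Tuple[str, str]]:
    """Return (morpheme, part-of-speech) pairs for the input sentence."""
    return TOY_ANALYSES.get(sentence, [])

if __name__ == "__main__":
    for morpheme, tag in tokenize("나는 학교에 간다"):
        print(f"{morpheme}/{tag}")
```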
Korean Language Models
LLMs have been successful on English NLP tasks, but they perform poorly on Korean tasks. For example, ChatGPT shows significantly higher latency and degraded quality on Korean chat requests. To overcome this issue, we are developing Korean language models from scratch based on Transformer architectures. We also build an appropriate Korean training dataset and a Korean tokenizer for these models.
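As a rough sketch of what training from scratch looks like, the snippet below instantiates a randomly initialized GPT-2-style model with Hugging Face Transformers. The vocabulary size and model dimensions are placeholder assumptions, not our actual configuration.

```python
# Minimal sketch of instantiating a GPT-2-style Korean LM from scratch.
# Vocabulary size and model dimensions are placeholder assumptions.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=51200,   # assumed size of the Korean tokenizer's vocabulary
    n_positions=1024,   # maximum sequence length
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Randomly initialized weights: no English pre-training is reused.
model = GPT2LMHeadModel(config)
print(f"parameters: {model.num_parameters() / 1e6:.1f}M")
```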
Energy Efficiency Optimization of GPT-3
Energy consumption has become a serious issue in training large language models. To improve energy efficiency, we are characterizing the performance and energy consumption of GPUs during GPT-3 training at the granularity of hardware modules (FPU, tensor core, cache, shared memory, etc.) and deep learning layers. We also propose various methods to increase the energy efficiency of GPT-3 training, including energy-aware parallelization and optimal underclocking of GPUs.
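As an example of the kind of measurement and control this involves, the sketch below samples GPU power draw and locks the SM clock to a lower frequency through NVML (via pynvml). The 1,200 MHz cap is an arbitrary value for illustration, not a recommended setting, and locking clocks requires an NVIDIA GPU, driver support, and sufficient privileges.

```python
# Hedged sketch: sample GPU power draw and underclock via NVML (pynvml).
# Requires an NVIDIA GPU/driver; clock values are arbitrary examples.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

try:
    # Cap the graphics clock as a simple form of underclocking.
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, 210, 1200)

    # Sample power draw (reported in milliwatts) while a training job runs.
    for _ in range(5):
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        print(f"GPU power draw: {power_w:.1f} W")
        time.sleep(1)
finally:
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)  # restore default clock behavior
    pynvml.nvmlShutdown()
```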
Overcoming GPU Memory Capacity Problem for LLM Training
Recently, huge Transformer models with more than 100 billion parameters (e.g., GPT-3) have been developed rapidly. Training these models on a cluster is challenging due to the limited capacity of GPU memory. A common solution to the memory capacity problem is parallelizing the DNN model across multiple GPUs. Utilizing main memory or storage devices as a backup space for GPU memory is another promising solution. However, the interaction between these techniques is non-trivial, and users of conventional DL frameworks must write complex programs to support such out-of-GPU-memory (OoGM) scale models. We are working on a cluster-targeting DL framework that utilizes CPU memory, NVMe SSDs, and HDDs to break the GPU memory wall. Our framework also uses a performance model to find the optimal parallelization strategy.
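The sketch below shows one simple ingredient of such a framework: manually offloading a layer's weights to CPU memory between uses and prefetching them back to the GPU right before the layer runs. The layer size and the manual offload points are illustrative assumptions, not our framework's actual mechanism.

```python
# Minimal sketch of parameter offloading between CPU and GPU memory.
# Layer size and offload points are illustrative assumptions.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
layer = nn.Linear(4096, 4096)

def offload(module: nn.Module):
    """Move parameters to CPU memory to free GPU memory until the layer is needed."""
    for p in module.parameters():
        cpu_tensor = p.data.to("cpu", non_blocking=True)
        # Pinned memory enables faster asynchronous copies back to the GPU.
        p.data = cpu_tensor.pin_memory() if torch.cuda.is_available() else cpu_tensor

def prefetch(module: nn.Module):
    """Copy parameters back to the GPU right before the layer is used."""
    for p in module.parameters():
        p.data = p.data.to(device, non_blocking=True)

x = torch.randn(8, 4096, device=device)
prefetch(layer)   # bring weights in just before use
y = layer(x)
offload(layer)    # release GPU memory until the layer is needed again
print(y.shape)
```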
Large Model Compression
Recent advances in deep learning suggest that the predictive power of deep learning models comes largely from their enormous number of parameters. However, deploying these large models on devices with limited computation and memory resources is challenging. Model compression reduces model size and improves training/inference performance with little or no quality degradation. Quantization, pruning, and knowledge distillation are well-known model compression techniques. We study each compression technique in depth and investigate methods to combine multiple techniques properly.
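As a small example of combining techniques, the sketch below applies magnitude pruning and then dynamic INT8 quantization to a toy PyTorch model. The model architecture and the 50% sparsity target are illustrative assumptions.

```python
# Hedged sketch: magnitude pruning followed by dynamic INT8 quantization.
# The toy model and 50% sparsity target are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# 1) Prune 50% of the smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

# 2) Quantize the remaining weights to INT8 for inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)
```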
LLM Training Techniques
The required dataset size and computing power grow as deep learning models become larger. In particular, LLMs such as GPT-3 require terabytes of data and thousands of GPU nodes for training, which makes large-model training nearly impossible without the support of big-tech companies. We are focusing on new training techniques for LLMs, such as auto-encoders, layer-wise training, mixture of experts, and better dataset generation.
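As one example, the sketch below implements a dense mixture-of-experts feed-forward layer in PyTorch, where a learned gate weights every expert's output. Production MoE layers typically route each token to only the top-k experts; the dimensions and number of experts here are illustrative assumptions.

```python
# Minimal sketch of a dense mixture-of-experts (MoE) feed-forward layer.
# Dimensions and the number of experts are illustrative assumptions.
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, d_model: int = 256, d_hidden: int = 1024, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)  # token-wise gating network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)                # (B, S, E)
        expert_out = torch.stack([e(x) for e in self.experts], -1)   # (B, S, D, E)
        return torch.einsum("bse,bsde->bsd", weights, expert_out)

moe = MixtureOfExperts()
tokens = torch.randn(2, 16, 256)
print(moe(tokens).shape)  # torch.Size([2, 16, 256])
```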