<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>hwkim-dev Blog</title>
        <link>https://hwkim-dev.github.io/hwkim-dev/blog</link>
        <description>hwkim-dev Blog</description>
        <lastBuildDate>Sun, 19 Apr 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[[Project] llm-lite — Gemma 3N E4B Lightweight Inference Engine]]></title>
            <link>https://hwkim-dev.github.io/hwkim-dev/blog/llm-lite-intro</link>
            <guid>https://hwkim-dev.github.io/hwkim-dev/blog/llm-lite-intro</guid>
            <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[llm-lite is a multi-backend inference engine built to run Gemma 3N E4B on low-spec local machines without the cloud]]></description>
            <content:encoded><![CDATA[<p><strong>llm-lite</strong> is a multi-backend inference engine built to run Gemma 3N E4B on low-spec local
machines <strong>without the cloud</strong>. Rather than touching the model architecture, it takes the route of
aggressive quantization (INT4 weights + MMAP) and low-level hardware acceleration to extract performance.</p>
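<p>As a rough illustration of the W4A32 idea (hypothetical NumPy, not llm-lite's actual kernels): each weight row gets one FP32 scale, and the weights themselves become 4-bit integers in [-8, 7].</p>

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-row INT4 quantization: FP32 weights -> int4 codes + FP32 scales."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map the largest |w| per row to 7
    scale[scale == 0] = 1.0                             # guard against all-zero rows
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """W4A32: weights stay INT4 in storage, the math happens in FP32."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
err = np.abs(w - w_hat).max()   # rounding error is at most scale/2 per row
```

In a real engine two 4-bit codes are packed per byte; the sketch keeps them in int8 for readability.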
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="타겟-하드웨어">Target Hardware<a href="https://hwkim-dev.github.io/hwkim-dev/blog/llm-lite-intro#%ED%83%80%EA%B2%9F-%ED%95%98%EB%93%9C%EC%9B%A8%EC%96%B4" class="hash-link" aria-label="Direct link to Target Hardware" title="Direct link to Target Hardware" translate="no">​</a></h2>
<p>The primary target is a Linux machine with an <strong>AMD Ryzen 5 4500U APU</strong> (Renoir, 6C/6T, Radeon RX Vega 6 iGPU).
Secondary targets are macOS (Apple Silicon / Intel, via MoltenVK), Raspberry Pi 4/5
(aarch64), and the <strong>Xilinx KV260 FPGA</strong>. On the KV260, a dedicated NPU backend
(uCA — micro Compute Architecture) is used.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="아키텍처-요약">Architecture Summary<a href="https://hwkim-dev.github.io/hwkim-dev/blog/llm-lite-intro#%EC%95%84%ED%82%A4%ED%85%8D%EC%B2%98-%EC%9A%94%EC%95%BD" class="hash-link" aria-label="Direct link to Architecture Summary" title="Direct link to Architecture Summary" translate="no">​</a></h2>
<table><thead><tr><th>Layer</th><th>Technology</th></tr></thead><tbody><tr><td>Inference engine</td><td>Python 3.12 + NumPy</td></tr><tr><td>CPU kernels</td><td>C++17 + SIMD / OpenMP</td></tr><tr><td>GPU kernels</td><td>Vulkan 1.2 Compute + GLSL</td></tr><tr><td>Web GUI</td><td>Flask 3 + SSE streaming</td></tr><tr><td>Native GUI</td><td>Dear ImGui 1.91 + Vulkan</td></tr><tr><td>Quantization</td><td>W4A32 default — INT4 weights, FP32 activations</td></tr><tr><td>Weight loading</td><td>safetensors + MMAP (zero-copy)</td></tr></tbody></table>
<p>Prefill runs at ~35 tokens/sec and decode at ~8-12 tokens/sec, fast enough for everyday
conversation even on the Ryzen 4500U. Model RAM is about 2.8 GB with INT4 MMAP.</p>
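<p>The MMAP part is what keeps resident memory low: weights are mapped, not copied into RAM. A minimal sketch of the idea with <code>np.memmap</code> (llm-lite itself loads safetensors files, but the paging behavior it relies on is the same):</p>

```python
import os
import tempfile
import numpy as np

# Write a weight tensor to disk, then map it instead of reading it into RAM.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
w = np.arange(16, dtype=np.float32).reshape(4, 4)
w.tofile(path)

# np.memmap returns a lazily-paged view of the file: pages are faulted in on
# first access and can be evicted by the OS, which keeps resident memory low.
w_map = np.memmap(path, dtype=np.float32, mode="r", shape=(4, 4))
row_sum = float(w_map[2].sum())  # touching one row pages in only that part
```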
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="최근-업데이트">Recent Updates<a href="https://hwkim-dev.github.io/hwkim-dev/blog/llm-lite-intro#%EC%B5%9C%EA%B7%BC-%EC%97%85%EB%8D%B0%EC%9D%B4%ED%8A%B8" class="hash-link" aria-label="Direct link to Recent Updates" title="Direct link to Recent Updates" translate="no">​</a></h2>
<ul>
<li class=""><strong>Expanded quantization modes</strong>: INT8 / FP16 / FP32 weight modes added on top of the existing INT4.
Mode selection matters because older iGPUs (such as Vega 6) can be faster at floating-point than at integer arithmetic.</li>
<li class=""><strong>Model manager</strong>: download a HuggingFace model → quantize → delete existing variants,
all in one flow from the GUI.</li>
<li class=""><strong>Speculative Decoding groundwork</strong>: started researching a draft model built by slicing E2B
out of E4B using Gemma 3N's MatFormer structure. This is currently a scaffold; the actual
implementation is tracked in a separate issue.</li>
</ul>
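<p>For intuition, the draft-then-verify loop that speculative decoding builds on can be sketched with toy stand-ins for the draft (E2B slice) and target (E4B) models; everything below is hypothetical illustration, not project code:</p>

```python
def draft_next(token):
    """Toy stand-in for the small draft model (cheap, possibly wrong)."""
    return (token * 7 + 3) % 100

def target_next(token):
    """Toy stand-in for the full target model (authoritative)."""
    return (token * 7 + 3) % 100  # here it happens to agree with the draft

def speculative_decode(start, n_tokens, k=4):
    """Draft k tokens ahead, then verify them with the target model; keep the
    agreeing prefix and substitute the target's token at the first mismatch."""
    out = [start]
    while len(out) < n_tokens + 1:
        drafts, t = [], out[-1]
        for _ in range(k):               # 1) draft k tokens cheaply
            t = draft_next(t)
            drafts.append(t)
        t = out[-1]
        for d in drafts:                 # 2) verify against the target model
            expected = target_next(t)
            if d == expected:
                out.append(d)
                t = d
            else:
                out.append(expected)     # the target's token wins on mismatch
                break
            if len(out) == n_tokens + 1:
                break
    return out[1:]

tokens = speculative_decode(1, 6, k=4)   # identical models -> greedy output
```

When the draft and target agree often (the MatFormer bet), most tokens cost only a draft step plus a shared verification pass.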
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="관련-링크">Related Links<a href="https://hwkim-dev.github.io/hwkim-dev/blog/llm-lite-intro#%EA%B4%80%EB%A0%A8-%EB%A7%81%ED%81%AC" class="hash-link" aria-label="Direct link to Related Links" title="Direct link to Related Links" translate="no">​</a></h2>
<ul>
<li class="">GitHub: <a href="https://github.com/hwkim-dev/llm-lite" target="_blank" rel="noopener noreferrer" class="">hwkim-dev/llm-lite</a></li>
<li class="">Reference manual: <a href="https://github.com/hwkim-dev/llm-lite/blob/main/docs/Gemma3N_Reference_Manual.md" target="_blank" rel="noopener noreferrer" class="">Gemma3N_Reference_Manual.md</a></li>
<li class="">Related paper post: <a class="" href="https://hwkim-dev.github.io/hwkim-dev/blog/gemma-3n-e4b">Gemma 3 4B Internal Processing</a></li>
</ul>]]></content:encoded>
            <category>llm-lite</category>
            <category>gemma</category>
            <category>llm</category>
            <category>inference</category>
            <category>vulkan</category>
            <category>cpp</category>
        </item>
        <item>
            <title><![CDATA[[Paper] Attention Is All You Need]]></title>
            <link>https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need</link>
            <guid>https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need</guid>
            <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[This post covers the core concepts and mathematical principles of the Transformer model architecture.]]></description>
            <content:encoded><![CDATA[<p>This post covers the core concepts and mathematical principles of the Transformer model architecture.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-background-of-the-transformers-emergence">1. Background of the Transformer's Emergence<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#1-background-of-the-transformers-emergence" class="hash-link" aria-label="Direct link to 1. Background of the Transformer's Emergence" title="Direct link to 1. Background of the Transformer's Emergence" translate="no">​</a></h2>
<p>Before the Transformer, the mainstream models in NLP were RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks). These models process data sequentially: given the sentence "I go to school," the model processes "I", uses that result to process "go to," and uses that result in turn to process "school."</p>
<p>This sequential processing approach has two critical limitations:</p>
<ol>
<li class="">
<p><strong>Inability to process in parallel:</strong> The computation for the next word can only be performed after the computation for the previous word is finished, making it impossible to utilize the computer's computational resources simultaneously for parallel processing.</p>
</li>
<li class="">
<p><strong>Long-term Dependency problem:</strong> As the sentence gets longer, the information of words inputted early on tends to fade as it progresses towards the end.</p>
</li>
</ol>
<p>The Transformer originated from the idea: "Instead of inputting words sequentially, let's <strong>input the entire sentence at once</strong> and <strong>calculate the relationships between words simultaneously</strong>." The core technology that made this possible is the <strong>Attention</strong> mechanism.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-model-architecture">2. Model Architecture<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#2-model-architecture" class="hash-link" aria-label="Direct link to 2. Model Architecture" title="Direct link to 2. Model Architecture" translate="no">​</a></h2>
<p>The Transformer adopts an <strong>Encoder-Decoder</strong> structure optimized for Sequence Transduction tasks like machine translation.</p>
<ul>
<li class=""><strong>Auto-regressive property:</strong> When generating an output, the model uses the previously generated output symbols as additional input for the next step. In other words, it predicts the 1st word, and then predicts the 2nd word by including that 1st word.</li>
</ul>
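<p>The auto-regressive property is a feedback loop: the sequence generated so far becomes the input for the next step. A minimal sketch with a toy stand-in for the model:</p>

```python
def generate(next_symbol, prompt, n):
    """Auto-regressive decoding: each new symbol is predicted from the
    full sequence produced so far, then appended and fed back in."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_symbol(seq))  # conditioned on everything before it
    return seq

# toy "model": predicts the current sequence length as the next symbol
out = generate(lambda s: len(s), ["<s>"], 3)   # ["<s>", 1, 2, 3]
```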
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="21-encoder">2.1 Encoder<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#21-encoder" class="hash-link" aria-label="Direct link to 2.1 Encoder" title="Direct link to 2.1 Encoder" translate="no">​</a></h3>
<p>The Encoder reads the input source sentence (e.g., a Korean sentence), comprehends the meanings and context of its words, and transforms the sentence into compressed information (a Representation).</p>
<ul>
<li class="">
<p><strong>Layer structure:</strong> It consists of a stack of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>N</mi><mo>=</mo><mn>6</mn></mrow><annotation encoding="application/x-tex">N = 6</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">6</span></span></span></span> identical layers.</p>
</li>
<li class="">
<p><strong>Sub-layer:</strong> Each layer internally has two sub-layers.</p>
<ol>
<li class="">
<p><strong>Multi-Head Self-Attention:</strong> It determines how the words within the sentence relate to one another.</p>
</li>
<li class="">
<p><strong>Position-wise Feed-Forward Network (FFN):</strong> A Neural Network that deeply learns the features of each word based on the identified relationship information.</p>
</li>
</ol>
</li>
<li class="">
<p><strong>Residual Connection and Layer Normalization:</strong>
The output of each sub-layer is processed with the following formula:</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>O</mi><mi>u</mi><mi>t</mi><mi>p</mi><mi>u</mi><mi>t</mi><mo>=</mo><mi>L</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mi>N</mi><mi>o</mi><mi>r</mi><mi>m</mi><mo stretchy="false">(</mo><mi>x</mi><mo>+</mo><mi>S</mi><mi>u</mi><mi>b</mi><mi>l</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">Output = LayerNorm(x + Sublayer(x))</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0278em">O</span><span class="mord mathnormal">u</span><span class="mord mathnormal">tp</span><span class="mord mathnormal">u</span><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">L</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mord mathnormal" style="margin-right:0.0278em">or</span><span class="mord mathnormal">m</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" 
style="margin-right:0.0576em">S</span><span class="mord mathnormal">u</span><span class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-right:0.0197em">l</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">))</span></span></span></span></span>
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span></span></span></span><strong>:</strong> The original input value entering the sub-layer.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>S</mi><mi>u</mi><mi>b</mi><mi>l</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">Sublayer(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0576em">S</span><span class="mord mathnormal">u</span><span class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-right:0.0197em">l</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span><strong>:</strong> The resulting value after going through the Attention or FFN computation.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi><mo>+</mo><mi>S</mi><mi>u</mi><mi>b</mi><mi>l</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">x + Sublayer(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6667em;vertical-align:-0.0833em"></span><span class="mord mathnormal">x</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0576em">S</span><span class="mord mathnormal">u</span><span class="mord mathnormal">b</span><span class="mord mathnormal" style="margin-right:0.0197em">l</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span> <strong>(Residual Connection):</strong> The original input value is added to the computation result. This prevents the loss of initial information even as the layers get deeper, stabilizing the training.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>L</mi><mi>a</mi><mi>y</mi><mi>e</mi><mi>r</mi><mi>N</mi><mi>o</mi><mi>r</mi><mi>m</mi><mo stretchy="false">(</mo><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">LayerNorm(...)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">L</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord mathnormal" style="margin-right:0.0278em">er</span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mord mathnormal" style="margin-right:0.0278em">or</span><span class="mord mathnormal">m</span><span class="mopen">(</span><span class="mord">...</span><span class="mclose">)</span></span></span></span><strong>:</strong> Calculates the mean and variance of the added result to normalize the data into a consistent range.</p>
</li>
</ul>
</li>
<li class="">
<p><strong>Dimensionality unification:</strong> To facilitate the Residual Connection smoothly, the output dimensions of all sub-layers and Embedding layers within the model are fixed at <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>=</mo><mn>512</mn></mrow><annotation encoding="application/x-tex">d_{model} = 512</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">512</span></span></span></span>.</p>
</li>
</ul>
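<p>The sub-layer plumbing above condenses into a few lines of NumPy. This is an illustrative sketch with random, untrained weights; <code>d_model = 512</code> and the FFN inner dimension <code>d_ff = 2048</code> are the paper's values:</p>

```python
import numpy as np

d_model, d_ff = 512, 2048   # dimensions used in the paper

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
W2 = rng.standard_normal((d_ff, d_model)) * 0.02

def ffn(x):
    """Position-wise feed-forward network: applied to each position independently."""
    return np.maximum(0.0, x @ W1) @ W2   # ReLU between two linear maps

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = rng.standard_normal((10, d_model))    # 10 positions in the sequence
out = layer_norm(x + ffn(x))              # Output = LayerNorm(x + Sublayer(x))
```

Because every sub-layer outputs <code>d_model</code> features, the residual addition <code>x + ffn(x)</code> is always shape-compatible.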
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="22-decoder">2.2 Decoder<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#22-decoder" class="hash-link" aria-label="Direct link to 2.2 Decoder" title="Direct link to 2.2 Decoder" translate="no">​</a></h3>
<p>The Decoder generates the target sentence (e.g., a translated English sentence) one by one, based on the context information compressed by the Encoder. Like the Encoder, it consists of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>N</mi><mo>=</mo><mn>6</mn></mrow><annotation encoding="application/x-tex">N = 6</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">6</span></span></span></span> identical layers, but the number of sub-layers increases to 3.</p>
<ol>
<li class="">
<p><strong>Masked Multi-Head Self-Attention:</strong></p>
<ul>
<li class="">
<p>When the Decoder generates an output word, it masks the words after the current position (in the future) so they cannot be seen in advance.</p>
</li>
<li class="">
<p>For example, when predicting the 3rd word, it masks the similarity scores of future words with <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>−</mo><mi mathvariant="normal">∞</mi></mrow><annotation encoding="application/x-tex">-\infty</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6667em;vertical-align:-0.0833em"></span><span class="mord">−</span><span class="mord">∞</span></span></span></span>, so that their Attention weights become 0 after passing through the Softmax function and only the 1st and 2nd words can be referenced.</p>
</li>
</ul>
</li>
<li class="">
<p><strong>Multi-Head Attention (Encoder-Decoder Attention):</strong></p>
<ul>
<li class="">
<p>This is where the Decoder decides "which part of the original sentence to focus on" to generate a word.</p>
</li>
<li class="">
<p>Here, the Decoder uses its own information as the standard (Query) and references the final output of the Encoder as the Key and Value.</p>
</li>
</ul>
</li>
<li class="">
<p><strong>Position-wise Feed-Forward Network:</strong> Identical to the Encoder's structure.</p>
</li>
</ol>
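<p>The masking in step 1 is easy to verify numerically: setting future scores to negative infinity before the Softmax drives their weights to exactly 0. A small NumPy sketch (illustrative, not code from the paper):</p>

```python
import numpy as np

def causal_softmax(scores):
    """Mask future positions with -inf so their softmax weights become 0."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))   # uniform similarity scores for 4 positions
w = causal_softmax(scores)
# position 2 (the 3rd word) attends only to positions 0..2
```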
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-attention-mechanism">3. Attention Mechanism<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#3-attention-mechanism" class="hash-link" aria-label="Direct link to 3. Attention Mechanism" title="Direct link to 3. Attention Mechanism" translate="no">​</a></h2>
<p>The Attention mechanism is the core of the Transformer. The Attention function can be described as mapping a Query and a set of Key-Value pairs to an output.</p>
<p>To use an analogy, it is like the process of finding information in a library.</p>
<ul>
<li class="">
<p><strong>Query (Q):</strong> The 'search term' entered by the user in the search bar (the target word currently being analyzed).</p>
</li>
<li class="">
<p><strong>Key (K):</strong> The 'index' or 'label' attached to the books in the library (features possessed by other words).</p>
</li>
<li class="">
<p><strong>Value (V):</strong> The actual 'content' of that book (the actual information possessed by other words).</p>
</li>
</ul>
<p>(* In the case of Self-Attention, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Q</mi></mrow><annotation encoding="application/x-tex">Q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em"></span><span class="mord mathnormal">Q</span></span></span></span>, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>K</mi></mrow><annotation encoding="application/x-tex">K</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.0715em">K</span></span></span></span>, and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>V</mi></mrow><annotation encoding="application/x-tex">V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span> are all generated from the same input sentence, each transformed for its specific purpose by multiplying different weight matrices.)</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="31-scaled-dot-product-attention">3.1 Scaled Dot-Product Attention<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#31-scaled-dot-product-attention" class="hash-link" aria-label="Direct link to 3.1 Scaled Dot-Product Attention" title="Direct link to 3.1 Scaled Dot-Product Attention" translate="no">​</a></h3>
<p>The paper proposes a method called 'Scaled Dot-Product Attention' to compute attention. The computation formula is as follows:</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>A</mi><mi>t</mi><mi>t</mi><mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo stretchy="false">(</mo><mi>Q</mi><mo separator="true">,</mo><mi>K</mi><mo separator="true">,</mo><mi>V</mi><mo stretchy="false">)</mo><mo>=</mo><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi><mo stretchy="false">(</mo><mfrac><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mfrac><mo stretchy="false">)</mo><mi>V</mi></mrow><annotation encoding="application/x-tex">Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">A</span><span class="mord mathnormal">tt</span><span class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span class="mopen">(</span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.2222em">V</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.4483em;vertical-align:-0.93em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord 
mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span><span class="mopen">(</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.5183em"><span style="top:-2.2528em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8572em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8172em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1828em"><span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.677em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord mathnormal">Q</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8413em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.93em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose">)</span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span></span>
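<p>The formula transcribes almost verbatim into NumPy (an illustrative sketch; the shapes below are arbitrary and only <code>d_k = 64</code> comes from the paper):</p>

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # similarity of each query to each key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                           # weighted mix of value vectors

d_k = 64                                         # the paper's per-head dimension
Q = np.random.randn(5, d_k)                      # 5 query positions
K = np.random.randn(7, d_k)                      # 7 key/value positions
V = np.random.randn(7, 32)
out = attention(Q, K, V)                         # one output row per query
```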
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Q</mi></mrow><annotation encoding="application/x-tex">Q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em"></span><span class="mord mathnormal">Q</span></span></span></span> <strong>(Query Matrix):</strong> [Question] A matrix gathering the vectors of the words currently being processed.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>K</mi></mrow><annotation encoding="application/x-tex">K</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.0715em">K</span></span></span></span> <strong>(Key Matrix):</strong> [Position] A matrix gathering the vectors of words to be referenced.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>V</mi></mrow><annotation encoding="application/x-tex">V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span> <strong>(Value Matrix):</strong> [Content] A matrix gathering the actual information vectors of words to be referenced.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>K</mi><mi>T</mi></msup></mrow><annotation encoding="application/x-tex">K^T</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8413em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8413em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span><strong>:</strong> The Transposed Matrix of the Key Matrix. Its rows and columns are swapped for matrix multiplication.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub></mrow><annotation encoding="application/x-tex">d_k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span><strong>:</strong> The dimensionality of the Query and Key vectors. 
(The paper uses <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub><mo>=</mo><mn>64</mn></mrow><annotation encoding="application/x-tex">d_k = 64</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">64</span></span></span></span>.)</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mrow><annotation encoding="application/x-tex">\sqrt{d_k}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.04em;vertical-align:-0.1828em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8572em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8172em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1828em"><span></span></span></span></span></span></span></span></span><strong>:</strong> The square root of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub></mrow><annotation encoding="application/x-tex">d_k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>. 
(In the paper, it becomes <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><mn>64</mn></msqrt><mo>=</mo><mn>8</mn></mrow><annotation encoding="application/x-tex">\sqrt{64} = 8</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.04em;vertical-align:-0.1328em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.9072em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord">64</span></span></span><span style="top:-2.8672em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1328em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">8</span></span></span></span>.)</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi></mrow><annotation encoding="application/x-tex">softmax</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span></span></span></span><strong>:</strong> A function that converts inputted values into probabilities between 0 and 1, ensuring their sum equals 1. (Formula: <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><msup><mi>e</mi><msub><mi>x</mi><mi>i</mi></msub></msup><mrow><mo>∑</mo><msup><mi>e</mi><msub><mi>x</mi><mi>j</mi></msub></msup></mrow></mfrac></mrow><annotation encoding="application/x-tex">\frac{e^{x_i}}{\sum e^{x_j}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.4413em;vertical-align:-0.5303em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.911em"><span style="top:-2.6447em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mop op-symbol small-op mtight" style="position:relative;top:0em">∑</span><span class="mspace mtight" style="margin-right:0.1952em"></span><span class="mord mtight"><span class="mord mathnormal mtight">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.779em"><span 
style="top:-2.9714em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3448em;margin-left:0em;margin-right:0.1em"><span class="pstrut" style="height:2.6595em"></span><span class="mord mathnormal mtight" style="margin-right:0.0572em">j</span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.5092em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.394em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7385em"><span style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight">x</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3448em;margin-left:0em;margin-right:0.1em"><span class="pstrut" style="height:2.6595em"></span><span class="mord mathnormal mtight">i</span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.3147em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.5303em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span>)</p>
</li>
</ul>
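<p>As a quick sanity check of the softmax definition above, here is a minimal NumPy sketch (the function name and input values are illustrative, not from the paper):</p>

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability;
    # this shift does not change the resulting probabilities.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# Each entry lies between 0 and 1, and the entries sum to 1;
# larger scores receive larger probabilities.
```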
<hr>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>A</mi><mi>t</mi><mi>t</mi><mi>e</mi><mi>n</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo stretchy="false">(</mo><mi>Q</mi><mo separator="true">,</mo><mi>K</mi><mo separator="true">,</mo><mi>V</mi><mo stretchy="false">)</mo><mo>=</mo><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi><mo stretchy="false">(</mo><mfrac><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mfrac><mo stretchy="false">)</mo><mi>V</mi></mrow><annotation encoding="application/x-tex">Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">A</span><span class="mord mathnormal">tt</span><span class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span class="mord mathnormal">t</span><span class="mord mathnormal">i</span><span class="mord mathnormal">o</span><span class="mord mathnormal">n</span><span class="mopen">(</span><span class="mord mathnormal">Q</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.2222em">V</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.4483em;vertical-align:-0.93em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord 
mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span><span class="mopen">(</span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.5183em"><span style="top:-2.2528em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8572em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8172em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1828em"><span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.677em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord mathnormal">Q</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8413em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.93em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mclose">)</span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span></span>
<ol>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><annotation encoding="application/x-tex">QK^T</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0358em;vertical-align:-0.1944em"></span><span class="mord mathnormal">Q</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8413em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span> <strong>(Similarity Calculation):</strong> Performs Matrix Multiplication between the Query matrix and the transposed Key matrix. This is the process of computing the dot product between the Query word vector and each Key word vector at once, yielding a mathematical score of how highly related (similar) the Query word is to each key word. A larger value means a higher correlation between the two words.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mfrac><mrow><mi>Q</mi><msup><mi>K</mi><mi>T</mi></msup></mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mfrac></mrow><annotation encoding="application/x-tex">\frac{QK^T}{\sqrt{d_k}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.6275em;vertical-align:-0.538em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.0895em"><span style="top:-2.5864em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord sqrt mtight"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8622em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord mtight" style="padding-left:0.833em"><span class="mord mtight"><span class="mord mathnormal mtight">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3488em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1512em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8222em"><span class="pstrut" style="height:3em"></span><span class="hide-tail mtight" style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1778em"><span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.4461em"><span class="pstrut" style="height:3em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">Q</span><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0715em">K</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.9191em"><span style="top:-2.931em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight" style="margin-right:0.1389em">T</span></span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.538em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span> <strong>(Scaling):</strong> When performing the dot product, the resulting values tend to grow very large as the dimensionality (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub></mrow><annotation encoding="application/x-tex">d_k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span 
style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>) increases. If the values become too large, the gradient approaches 0 in the subsequent Softmax function, causing an issue where learning does not progress. To prevent this, the scores are divided by <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msqrt><msub><mi>d</mi><mi>k</mi></msub></msqrt></mrow><annotation encoding="application/x-tex">\sqrt{d_k}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.04em;vertical-align:-0.1828em"></span><span class="mord sqrt"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.8572em"><span class="svg-align" style="top:-3em"><span class="pstrut" style="height:3em"></span><span class="mord" style="padding-left:0.833em"><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span><span style="top:-2.8172em"><span class="pstrut" style="height:3em"></span><span class="hide-tail" 
style="min-width:0.853em;height:1.08em"><svg xmlns="http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"><path d="M95,702
c-2.7,0,-7.17,-2.7,-13.5,-8c-5.8,-5.3,-9.5,-10,-9.5,-14
c0,-2,0.3,-3.3,1,-4c1.3,-2.7,23.83,-20.7,67.5,-54
c44.2,-33.3,65.8,-50.3,66.5,-51c1.3,-1.3,3,-2,5,-2c4.7,0,8.7,3.3,12,10
s173,378,173,378c0.7,0,35.3,-71,104,-213c68.7,-142,137.5,-285,206.5,-429
c69,-144,104.5,-217.7,106.5,-221
l0 -0
c5.3,-9.3,12,-14,20,-14
H400000v40H845.2724
s-225.272,467,-225.272,467s-235,486,-235,486c-2.7,4.7,-9,7,-19,7
c-6,0,-10,-1,-12,-3s-194,-422,-194,-422s-65,47,-65,47z
M834 80h400000v40h-400000z"></path></svg></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1828em"><span></span></span></span></span></span></span></span></span> to appropriately adjust (Scale) the magnitude of the values.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>m</mi><mi>a</mi><mi>x</mi><mo stretchy="false">(</mo><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">softmax(...)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">so</span><span class="mord mathnormal" style="margin-right:0.1076em">f</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ma</span><span class="mord mathnormal">x</span><span class="mopen">(</span><span class="mord">...</span><span class="mclose">)</span></span></span></span> <strong>(Weight Probabilization):</strong> The scaled scores are passed through the Softmax function. Through this process, the score for each word is converted into a probability value (weight) between 0 and 1. For example, a result of "0.9" means it is very strongly associated with this word, while "0.01" means it can be largely ignored.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>×</mo><mi>V</mi></mrow><annotation encoding="application/x-tex">\times V</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7667em;vertical-align:-0.0833em"></span><span class="mord">×</span><span class="mord mathnormal" style="margin-right:0.2222em">V</span></span></span></span> <strong>(Combining Information):</strong> The calculated Softmax weight is multiplied by the Value Matrix, which is the actual information. Consequently, a large amount of information (Value) from highly correlated words is retrieved, and a small amount of information from weakly correlated words is retrieved, merging them into one. This result becomes the final output of Attention.</p>
</li>
</ol>
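<p>The four steps above can be sketched directly in NumPy. This is a single-head sketch with illustrative random inputs (only the shapes follow the paper's <em>d<sub>k</sub></em> = 64), not the full model:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T                    # 1. similarity: dot product of every Query with every Key
    scores = scores / np.sqrt(d_k)      # 2. scaling: divide by sqrt(d_k) to keep values moderate
    weights = softmax(scores, axis=-1)  # 3. probabilization: each row sums to 1
    return weights @ V                  # 4. combine: weighted sum of the Value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 64))  # 3 query positions, d_k = 64 as in the paper
K = rng.standard_normal((5, 64))  # 5 key positions
V = rng.standard_normal((5, 64))
out = scaled_dot_product_attention(Q, K, V)
# out has one 64-dimensional context vector per query position: shape (3, 64)
```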
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="32-multi-head-attention">3.2 Multi-Head Attention<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#32-multi-head-attention" class="hash-link" aria-label="Direct link to 3.2 Multi-Head Attention" title="Direct link to 3.2 Multi-Head Attention" translate="no">​</a></h3>
<p>The Transformer does not perform the above single Attention just once; it splits the dimensions into multiple parts and performs several Attentions in parallel. This is called Multi-Head Attention.</p>
<p>In the paper, the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>=</mo><mn>512</mn></mrow><annotation encoding="application/x-tex">d_{model} = 512</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">512</span></span></span></span> dimension is split into <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>h</mi><mo>=</mo><mn>8</mn></mrow><annotation encoding="application/x-tex">h = 8</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal">h</span><span class="mspace" 
style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">8</span></span></span></span> Heads. Therefore, each Head handles a vector of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mi>k</mi></msub><mo>=</mo><msub><mi>d</mi><mi>v</mi></msub><mo>=</mo><mn>512</mn><mi mathvariant="normal">/</mi><mn>8</mn><mo>=</mo><mn>64</mn></mrow><annotation encoding="application/x-tex">d_k = d_v = 512 / 8 = 64</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal 
mtight" style="margin-right:0.0359em">v</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord">512/8</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">64</span></span></span></span> dimensions.</p>
<p><strong>Why use Multiple Heads?</strong></p>
<p>The relationships between words in a sentence can be interpreted from multiple angles.
For example, in the sentence "He kicked the ball hard," the word 'kicked' can be connected to 'He' (the subject: who did it?) or to 'ball' (the object: what was kicked?).
A single Attention can only capture an average over these various relationships, whereas dividing into 8 Heads lets each Head simultaneously capture a different contextual feature (a different Representation subspace), such as the relationship with the subject, the object, or the tense.</p>
<p>The 8 output matrices, one per Head, are concatenated, and the result is multiplied by a linear projection matrix to form the final output matrix.</p>
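<p>The per-head attention, concatenation, and final projection can be sketched in NumPy (a minimal illustration using the paper's dimensions; the random matrices here merely stand in for learned weights):</p>

```python
import numpy as np

# Dimensions from the paper: d_model = 512, h = 8 heads, d_k = d_v = 64.
d_model, h, d_k = 512, 8, 64
seq_len = 10
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head.
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V        # (seq_len, d_k)

# Each head works with its own 64-dimensional Q, K, V projections.
head_outputs = []
for _ in range(h):
    Q = rng.standard_normal((seq_len, d_k))
    K = rng.standard_normal((seq_len, d_k))
    V = rng.standard_normal((seq_len, d_k))
    head_outputs.append(attention(Q, K, V))

# Concatenate the 8 head outputs, then apply the final linear projection.
concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, 8 * 64) = (10, 512)
W_O = rng.standard_normal((h * d_k, d_model))    # stand-in for the learned W^O
output = concat @ W_O                            # (10, 512)
print(output.shape)  # (10, 512)
```

<p>Note that the concatenation restores the 512-dimensional width (8 heads × 64 dims), so the projection maps 512 → 512.</p>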
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-position-wise-feed-forward-network">4. Position-wise Feed-Forward Network<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#4-position-wise-feed-forward-network" class="hash-link" aria-label="Direct link to 4. Position-wise Feed-Forward Network" title="Direct link to 4. Position-wise Feed-Forward Network" translate="no">​</a></h2>
<p>The data that has passed through the Attention sub-layer goes through a Fully Connected Feed-Forward Network (FFN) included in each layer.</p>
<p>"Position-wise" means that the exact same Neural Network is applied independently to each individual word position that makes up the sentence.</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>F</mi><mi>F</mi><mi>N</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mo>=</mo><mi>max</mi><mo>⁡</mo><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mi>x</mi><msub><mi>W</mi><mn>1</mn></msub><mo>+</mo><msub><mi>b</mi><mn>1</mn></msub><mo stretchy="false">)</mo><msub><mi>W</mi><mn>2</mn></msub><mo>+</mo><msub><mi>b</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1389em">F</span><span class="mord mathnormal" style="margin-right:0.1389em">F</span><span class="mord mathnormal" style="margin-right:0.109em">N</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mop">max</span><span class="mopen">(</span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">x</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span 
class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mclose">)</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span 
style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span></span>
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span></span></span></span><strong>:</strong> The input vector that has passed through the Attention layer. The dimension is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>=</mo><mn>512</mn></mrow><annotation encoding="application/x-tex">d_{model} = 512</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span 
class="mord">512</span></span></span></span>.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>1</mn></msub><mo separator="true">,</mo><msub><mi>b</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">W_1, b_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span><strong>:</strong> The weight matrix and bias vector for the first linear transformation.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>max</mi><mo>⁡</mo><mo stretchy="false">(</mo><mn>0</mn><mo separator="true">,</mo><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mi mathvariant="normal">.</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\max(0, ...)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mop">max</span><span class="mopen">(</span><span class="mord">0</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">...</span><span class="mclose">)</span></span></span></span><strong>:</strong> The ReLU (Rectified Linear Unit) activation function. If the calculation result inside the parenthesis is less than 0, it becomes 0; if it is greater than 0, the value is maintained. This is a core element that imparts non-linearity.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>2</mn></msub><mo separator="true">,</mo><msub><mi>b</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">W_2, b_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">b</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span><strong>:</strong> The weight matrix and bias vector for the second linear transformation.</p>
</li>
</ul>
<p>This neural network has a sandwich structure.</p>
<ol>
<li class="">
<p><strong>Dimension Expansion:</strong> The input vector <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span></span></span></span> (512 dimensions) is multiplied by the weight <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">W_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> to greatly expand the dimension to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>f</mi><mi>f</mi></mrow></msub><mo>=</mo><mn>2048</mn></mrow><annotation encoding="application/x-tex">d_{ff} = 2048</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9805em;vertical-align:-0.2861em"></span><span class="mord"><span class="mord mathnormal">d</span><span 
class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.1076em">f</span><span class="mord mathnormal mtight" style="margin-right:0.1076em">f</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">2048</span></span></span></span> dimensions.</p>
</li>
<li class="">
<p><strong>Activation:</strong> It passes through the ReLU function in the expanded space to extract the non-linear features of the data. In this process, negative values are zeroed out, discarding unneeded information.</p>
</li>
<li class="">
<p><strong>Dimension Compression:</strong> It is multiplied again by the weight <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">W_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> to compress it back to the original dimension of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mo>=</mo><mn>512</mn></mrow><annotation encoding="application/x-tex">d_{model} = 512</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal 
mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">512</span></span></span></span> and outputted.</p>
</li>
</ol>
<p>If Attention is the process of collecting 'relationships' between words, the FFN layer takes on the role of processing and remembering the 'meaning' of each word itself in a more complex and richer way based on the collected information. Most of the learning parameters (weights) of the entire model are concentrated in the <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mn>1</mn></msub><mo separator="true">,</mo><msub><mi>W</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">W_1, W_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8778em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> matrices of this FFN.</p>
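<p>The expand–activate–compress sandwich above can be written directly from the formula (a minimal NumPy sketch; the small random weights are placeholders for trained parameters):</p>

```python
import numpy as np

# FFN(x) = max(0, x W1 + b1) W2 + b2, with d_model = 512, d_ff = 2048.
d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

W1 = rng.standard_normal((d_model, d_ff)) * 0.02   # expansion weights
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02   # compression weights
b2 = np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(0, x @ W1 + b1)   # expand to 2048 dims, apply ReLU
    return hidden @ W2 + b2               # compress back to 512 dims

# "Position-wise": the same weights apply independently to every row (word position).
x = rng.standard_normal((10, d_model))    # 10 word positions
y = ffn(x)
print(y.shape)  # (10, 512)
```

<p>Because the rows are processed independently with shared weights, applying <code>ffn</code> to a whole sentence is just one matrix multiplication per layer.</p>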
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-positional-encoding">5. Positional Encoding<a href="https://hwkim-dev.github.io/hwkim-dev/blog/attention-is-all-you-need#5-positional-encoding" class="hash-link" aria-label="Direct link to 5. Positional Encoding" title="Direct link to 5. Positional Encoding" translate="no">​</a></h2>
<p>The Transformer abandoned the RNN structure and opted for parallel processing through matrix multiplication. However, this creates an inherent limitation. Because the Attention operation treats a set of words like an unordered 'Bag of words', it can mathematically perceive "I eat rice" and "Rice eat I" as identical.</p>
<p>To solve this, a vector encoding position is added to each input word's Embedding vector, so the model can recover the relative or absolute 'position (order)' of words within a Sequence. This process is called <strong>Positional Encoding</strong>.</p>
<p>The paper uses Sine and Cosine functions with various frequencies to generate position information.</p>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>P</mi><msub><mi>E</mi><mrow><mo stretchy="false">(</mo><mi>p</mi><mi>o</mi><mi>s</mi><mo separator="true">,</mo><mn>2</mn><mi>i</mi><mo stretchy="false">)</mo></mrow></msub><mo>=</mo><mi>sin</mi><mo>⁡</mo><mo stretchy="false">(</mo><mi>p</mi><mi>o</mi><mi>s</mi><mi mathvariant="normal">/</mi><msup><mn>10000</mn><mrow><mn>2</mn><mi>i</mi><mi mathvariant="normal">/</mi><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub></mrow></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0385em;vertical-align:-0.3552em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0576em">E</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.5198em;margin-left:-0.0576em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">p</span><span class="mord mathnormal mtight">os</span><span class="mpunct mtight">,</span><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.3552em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" 
style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.188em;vertical-align:-0.25em"></span><span class="mop">sin</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord mathnormal">os</span><span class="mord">/1000</span><span class="mord"><span class="mord">0</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.938em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mord mtight">/</span><span class="mord mtight"><span class="mord mathnormal mtight">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3488em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1512em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>P</mi><msub><mi>E</mi><mrow><mo stretchy="false">(</mo><mi>p</mi><mi>o</mi><mi>s</mi><mo separator="true">,</mo><mn>2</mn><mi>i</mi><mo>+</mo><mn>1</mn><mo stretchy="false">)</mo></mrow></msub><mo>=</mo><mi>cos</mi><mo>⁡</mo><mo stretchy="false">(</mo><mi>p</mi><mi>o</mi><mi>s</mi><mi mathvariant="normal">/</mi><msup><mn>10000</mn><mrow><mn>2</mn><mi>i</mi><mi mathvariant="normal">/</mi><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub></mrow></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0385em;vertical-align:-0.3552em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0576em">E</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.5198em;margin-left:-0.0576em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">p</span><span class="mord mathnormal mtight">os</span><span class="mpunct mtight">,</span><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mbin mtight">+</span><span class="mord mtight">1</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.3552em"><span></span></span></span></span></span></span><span class="mspace" 
style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.188em;vertical-align:-0.25em"></span><span class="mop">cos</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mord mathnormal">os</span><span class="mord">/1000</span><span class="mord"><span class="mord">0</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.938em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mord mtight">/</span><span class="mord mtight"><span class="mord mathnormal mtight">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3488em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1512em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>
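<p>The two formulas can be implemented as a single table that is computed once and added to the embeddings (a minimal NumPy sketch; <code>positional_encoding</code> is an illustrative helper name):</p>

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even indices get sine
    pe[:, 1::2] = np.cos(angle)   # odd indices get cosine
    return pe

pe = positional_encoding(50)
print(pe.shape)    # (50, 512)
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 alternate -> [0. 1. 0. 1.]
```

<p>Each pair of dimensions oscillates at its own frequency, so every position gets a distinct pattern while nearby positions stay similar.</p>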
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi><mi>o</mi><mi>s</mi></mrow><annotation encoding="application/x-tex">pos</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal">p</span><span class="mord mathnormal">os</span></span></span></span><strong>:</strong> The position index of the corresponding word within the sentence. (e.g., the first word is 0, the second word is 1)</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord mathnormal">i</span></span></span></span><strong>:</strong> The dimension index. It indicates which position the value holds within the Embedding vector.
The range of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord mathnormal">i</span></span></span></span> is from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0</mn></mrow><annotation encoding="application/x-tex">0</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">0</span></span></span></span> to <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub><mi mathvariant="normal">/</mi><mn>2</mn><mo>−</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">d_{model}/2 - 1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.15em"><span></span></span></span></span></span></span><span class="mord">/2</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span>, and through this, different trigonometric functions are paired and applied to the even index (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>i</mi></mrow><annotation encoding="application/x-tex">2i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord">2</span><span class="mord mathnormal">i</span></span></span></span>) and odd index (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>i</mi><mo>+</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">2i+1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7429em;vertical-align:-0.0833em"></span><span class="mord">2</span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span>) of the vector, respectively.</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2</mn><mi>i</mi><mo separator="true">,</mo><mn>2</mn><mi>i</mi><mo>+</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">2i, 2i+1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.854em;vertical-align:-0.1944em"></span><span class="mord">2</span><span class="mord mathnormal">i</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">2</span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span><strong>:</strong> This means the sine (sin) function is used when the vector's index is even (2i), and the cosine (cos) function is used when it is odd (2i+1).</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub></mrow><annotation encoding="application/x-tex">d_{model}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span><strong>:</strong> The total dimensionality of the Embedding vector (512).</p>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mn>10000</mn><mrow><mn>2</mn><mi>i</mi><mi mathvariant="normal">/</mi><msub><mi>d</mi><mrow><mi>m</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>l</mi></mrow></msub></mrow></msup></mrow><annotation encoding="application/x-tex">10000^{2i/d_{model}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.888em"></span><span class="mord">1000</span><span class="mord"><span class="mord">0</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.888em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span><span class="mord mathnormal mtight">i</span><span class="mord mtight">/</span><span class="mord mtight"><span class="mord mathnormal mtight">d</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3448em"><span style="top:-2.3488em;margin-left:0em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">m</span><span class="mord mathnormal mtight">o</span><span class="mord mathnormal mtight">d</span><span class="mord mathnormal mtight">e</span><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.1512em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><strong>:</strong> The denominator term that determines the frequency. 
As the index <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord mathnormal">i</span></span></span></span> increases, the denominator gets larger, causing the frequency to decrease, which makes the position values oscillate more slowly.</p>
</li>
</ul>
<p>This formula generates continuous real numbers with a unique pattern for each position (pos) within the sentence and each dimension (i) of the vector. Because trigonometric functions are used, every value of the position vector stays bounded between -1 and 1.</p>
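The pairing described above — sine on even indices, cosine on odd indices, with the frequency denominator 10000^(2i/d_model) — can be sketched in a few lines of NumPy (using the original Transformer's d_model = 512; the function name is ours, not from any particular library):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1) column of positions
    i = np.arange(0, d_model, 2)              # i pairs: 0, 2, ..., d_model - 2
    div = np.power(10000.0, i / d_model)      # frequency denominator per pair
    pe[:, 0::2] = np.sin(pos / div)           # even indices (2i)   -> sin
    pe[:, 1::2] = np.cos(pos / div)           # odd indices (2i+1)  -> cos
    return pe

pe = positional_encoding(seq_len=50, d_model=512)
print(pe.shape)                          # (50, 512)
print(pe.min() >= -1.0, pe.max() <= 1.0) # True True: bounded in [-1, 1]
```

Note that position 0 yields sin(0) = 0 on even dimensions and cos(0) = 1 on odd ones, and higher values of i oscillate ever more slowly, exactly as described above.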
<p>This 512-dimensional 'position vector', generated by a fixed mathematical rule, is simply added (+) to the word's 'Embedding vector' right before the data enters the first layer of the Encoder or Decoder. Consequently, as training progresses, the model can grasp not only the inherent meaning of each word but also its relative position by reading off this trigonometric wave pattern, realizing, "Ah, this word is at the beginning of the sentence," or "That word comes right next."</p>]]></content:encoded>
            <category>paper</category>
            <category>transformer</category>
            <category>nlp</category>
            <category>deep-learning</category>
        </item>
        <item>
            <title><![CDATA[[Paper] GPT-1 — Improving Language Understanding by Generative Pre-Training]]></title>
            <link>https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1</link>
            <guid>https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1</guid>
            <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[This document is a note organizing the architecture and training process of the GPT-1 paper by combining mathematical definitions with intuitive interpretations.]]></description>
            <content:encoded><![CDATA[<p>This document is a note organizing the architecture and training process of the GPT-1 paper by combining mathematical definitions with intuitive interpretations.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-core-fundamental-concepts-of-language-models">1. Core Fundamental Concepts of Language Models<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#1-core-fundamental-concepts-of-language-models" class="hash-link" aria-label="Direct link to 1. Core Fundamental Concepts of Language Models" title="Direct link to 1. Core Fundamental Concepts of Language Models" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-context-window">1) Context Window<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#1-context-window" class="hash-link" aria-label="Direct link to 1) Context Window" title="Direct link to 1) Context Window" translate="no">​</a></h3>
<ul>
<li class="">
<p><strong>Definition</strong>: Refers to the <strong>maximum number of tokens</strong> (sequence length <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0315em">k</span></span></span></span>) that the model can process at one time. The computational complexity of the Transformer's Self-Attention operation is <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>O</mi><mo stretchy="false">(</mo><msup><mi>k</mi><mn>2</mn></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">O(k^2)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0641em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0278em">O</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0315em">k</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span>.</p>
</li>
<li class="">
<p><strong>Intuitive Explanation</strong>:</p>
<ul>
<li class=""><strong>Pros</strong>: As the Context Window (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0315em">k</span></span></span></span>) gets larger, the model can remember words from the more distant past. With more hints available, the model can grasp the context more precisely and improve the accuracy of predicting the next word.</li>
<li class=""><strong>Cons</strong>: Transformers must calculate relationships (Attention) between all pairs of words. Therefore, if the context window becomes 10 times longer, the computational complexity (or cost) explodes quadratically (100 times). In short, increasing <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0315em">k</span></span></span></span> is a realistic barrier directly linked to hardware memory and training costs.</li>
</ul>
</li>
</ul>
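The quadratic cost is visible directly in the shape of the attention score matrix. A minimal NumPy sketch (sizes here are illustrative, not GPT-1's actual dimensions):

```python
import numpy as np

# The Q·Kᵀ score matrix in Self-Attention holds one entry per token pair,
# so it is (k, k): doubling the context window k quadruples its size.
k, d = 512, 64                    # illustrative sequence length and head dim
Q = np.random.randn(k, d)
K = np.random.randn(k, d)
scores = Q @ K.T / np.sqrt(d)     # one scaled dot-product score per token pair
print(scores.shape)               # (512, 512) -> k^2 = 262,144 entries
print((2 * k) ** 2 / k ** 2)      # 4.0: 2x longer context, 4x the scores
```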
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-maximize-likelihood">2) Maximize Likelihood<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#2-maximize-likelihood" class="hash-link" aria-label="Direct link to 2) Maximize Likelihood" title="Direct link to 2) Maximize Likelihood" translate="no">​</a></h3>
<ul>
<li class="">
<p><strong>Definition</strong>: A mathematical objective function that optimizes the model's internal parameters <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Θ</mi></mrow><annotation encoding="application/x-tex">\Theta</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">Θ</span></span></span></span> to <strong>Maximize</strong> the conditional probability (<strong>Likelihood</strong>) that the actual ground-truth word will appear after a given context.</p>
</li>
<li class="">
<p><strong>Intuitive Explanation</strong>: Simply put, this is the fundamental "goal" that the language model learns. It is the process of constantly adjusting internal circuits (parameters) so that the word predicted by the model matches the word written in the actual text while reading massive amounts of text data.</p>
</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-gpt-backbone-transformer-decoder">2. GPT Backbone: Transformer Decoder<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#2-gpt-backbone-transformer-decoder" class="hash-link" aria-label="Direct link to 2. GPT Backbone: Transformer Decoder" title="Direct link to 2. GPT Backbone: Transformer Decoder" translate="no">​</a></h2>
<p>Originally, the Transformer published by Google consisted of an Encoder (understanding input) and a Decoder (generating output) for machine translation. However, GPT boldly discarded the Encoder and adopted a <strong>structure consisting of a 12-layer Decoder stack</strong>.</p>
<ul>
<li class=""><strong>Why use only the Decoder?</strong>
The essence of GPT is <strong>Next Token Prediction (Auto-regressive)</strong>. Inside the Decoder, there is a core feature called <strong>Masked Self-Attention</strong>. This <strong>Masking</strong> prevents the model from seeing future words when processing the current word, effectively preventing "cheating." This structure aligns perfectly with GPT's philosophy of inferring the next step by looking only at the context from the past to the present.</li>
</ul>
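The "no cheating" masking idea can be sketched as follows (a toy illustration of causal masking in general, not GPT-1's actual implementation):

```python
import numpy as np

# Causal (look-ahead) mask: position i may attend only to positions <= i.
k = 5
scores = np.random.randn(k, k)                    # raw attention scores
future = np.triu(np.ones((k, k), dtype=bool), 1)  # True strictly above diagonal
scores[future] = -np.inf                          # future tokens are blocked

# Row-wise softmax: exp(-inf) = 0, so future positions get zero weight.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.allclose(weights[0, 1:], 0.0))           # True: token 0 sees only itself
print(np.allclose(weights.sum(axis=-1), 1.0))     # True: each row still sums to 1
```

Because the mask is applied before the softmax, the blocked positions contribute exactly zero attention weight rather than merely a small one.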
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-gpt-1s-two-stage-learning-pipeline">3. GPT-1's Two-Stage Learning Pipeline<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#3-gpt-1s-two-stage-learning-pipeline" class="hash-link" aria-label="Direct link to 3. GPT-1's Two-Stage Learning Pipeline" title="Direct link to 3. GPT-1's Two-Stage Learning Pipeline" translate="no">​</a></h2>
<!-- -->
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="stage-1-unsupervised-pre-training">Stage 1: Unsupervised Pre-training<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#stage-1-unsupervised-pre-training" class="hash-link" aria-label="Direct link to Stage 1: Unsupervised Pre-training" title="Direct link to Stage 1: Unsupervised Pre-training" translate="no">​</a></h3>
<p>This is the stage where the model learns the general patterns of language on its own through massive unlabeled text data.</p>
<ul>
<li class=""><strong>Definition (Objective Function)</strong>:
Given a massive corpus of unlabeled tokens <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">U</mi><mo>=</mo><mo stretchy="false">{</mo><msub><mi>u</mi><mn>1</mn></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>u</mi><mi>n</mi></msub><mo stretchy="false">}</mo></mrow><annotation encoding="application/x-tex">\mathcal{U} = \{u_1, \dots, u_n\}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0993em">U</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mopen">{</span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span 
class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">n</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mclose">}</span></span></span></span>, the model is trained to maximize the following Log-Likelihood:</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">U</mi><mo stretchy="false">)</mo><mo>=</mo><munder><mo>∑</mo><mi>i</mi></munder><mi>log</mi><mo>⁡</mo><mi>P</mi><mo stretchy="false">(</mo><msub><mi>u</mi><mi>i</mi></msub><mi mathvariant="normal">∣</mi><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mi>k</mi></mrow></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mn>1</mn></mrow></msub><mo separator="true">;</mo><mi mathvariant="normal">Θ</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, \dots, u_{i-1}; \Theta)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0993em">U</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.3277em;vertical-align:-1.2777em"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span 
class="vlist-r"><span class="vlist" style="height:1.05em"><span style="top:-1.8723em;margin-left:0em"><span class="pstrut" style="height:3.05em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span><span style="top:-3.05em"><span class="pstrut" style="height:3.05em"></span><span><span class="mop op-symbol large-op">∑</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.2777em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mord">∣</span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span></span><span class="vlist-s">​</span></span><span 
class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span><span class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord">Θ</span><span class="mclose">)</span></span></span></span></span>
<blockquote>
<p>When the model (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Θ</mi></mrow><annotation encoding="application/x-tex">\Theta</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">Θ</span></span></span></span>) is shown previous words (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mi>k</mi></mrow></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mn>1</mn></mrow></msub></mrow><annotation encoding="application/x-tex">u_{i-k}, \dots, u_{i-1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6389em;vertical-align:-0.2083em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord 
mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span></span></span></span>), it calculates the probability <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mo>…</mo><mtext> </mtext><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">P(\dots)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mclose">)</span></span></span></span> of guessing the correct next word (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">u_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" 
style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>), and this value is summed (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mo>∑</mo><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">\sum_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0497em;vertical-align:-0.2997em"></span><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.162em"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2997em"><span></span></span></span></span></span></span></span></span></span>) over all text data to get <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">U</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_1(\mathcal{U})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" 
style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0993em">U</span><span class="mclose">)</span></span></span></span>.</p>
</blockquote>
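The objective L1(U) can be written as a toy sketch. Here <code>model_prob</code> is a hypothetical uniform stand-in for the Transformer Decoder's output distribution, purely to show the sliding window of k previous tokens and the sum of log-probabilities:

```python
import numpy as np

def model_prob(context: tuple, token: int, vocab_size: int = 4) -> float:
    # Placeholder: a real model returns P(token | context; Theta).
    # A uniform dummy keeps the arithmetic checkable by hand.
    return 1.0 / vocab_size

def log_likelihood(tokens: list, k: int) -> float:
    """L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta)."""
    total = 0.0
    for i in range(len(tokens)):
        context = tuple(tokens[max(0, i - k):i])  # up to k previous tokens
        total += np.log(model_prob(context, tokens[i]))
    return total

corpus = [0, 2, 1, 3, 2]
print(log_likelihood(corpus, k=2))  # 5 * log(1/4) ≈ -6.931
```

Training maximizes this sum, i.e. it nudges the parameters so the model assigns higher probability to each actual next token than the uniform dummy does.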
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">U</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_1(\mathcal{U})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0993em">U</span><span class="mclose">)</span></span></span></span>:</p>
<ul>
<li class="">Means the <strong>Objective Function</strong>.<br>
<!-- -->Here, <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">U</mi></mrow><annotation encoding="application/x-tex">\mathcal{U}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0993em">U</span></span></span></span> is the massive unlabeled text Corpus used for training.<br>
<!-- -->In other words, it is a score representing "how well the model understands (predicts) the data <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">U</mi></mrow><annotation encoding="application/x-tex">\mathcal{U}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0993em">U</span></span></span></span>."</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mo>∑</mo><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">\sum_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0497em;vertical-align:-0.2997em"></span><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.162em"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2997em"><span></span></span></span></span></span></span></span></span></span>:</p>
<ul>
<li class="">Means to sum up the probability values below over the index <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>i</mi></mrow><annotation encoding="application/x-tex">i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6595em"></span><span class="mord mathnormal">i</span></span></span></span> of all words (tokens) in the sentences (data).</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>log</mi><mo>⁡</mo></mrow><annotation encoding="application/x-tex">\log</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span></span></span></span>:</p>
<ul>
<li class="">The logarithm function. Probabilities are decimals between 0 and 1, so multiplying the probabilities of many words drives the product toward 0 (the underflow problem). Applying the log converts multiplication into addition (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mo>∑</mo></mrow><annotation encoding="application/x-tex">\sum</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span></span></span></span>), which is far more numerically stable for a computer.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mo>…</mo><mtext> </mtext><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">P(\dots)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mclose">)</span></span></span></span>:</p>
<ul>
<li class=""><strong>Probability</strong>. (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>P</mi></mrow><annotation encoding="application/x-tex">P</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span></span></span></span> = Conditional probability calculated by the Transformer Decoder with parameters <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Θ</mi></mrow><annotation encoding="application/x-tex">\Theta</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">Θ</span></span></span></span>).</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">u_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>:</p>
<ul>
<li class="">The <strong>'current (next) word'</strong> the model needs to guess.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mi>k</mi></mrow></msub><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msub><mi>u</mi><mrow><mi>i</mi><mo>−</mo><mn>1</mn></mrow></msub></mrow><annotation encoding="application/x-tex">u_{i-k} ,…,u_{i−1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6389em;vertical-align:-0.2083em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord mathnormal mtight" style="margin-right:0.0315em">k</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">i</span><span class="mbin mtight">−</span><span class="mord 
mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2083em"><span></span></span></span></span></span></span></span></span></span>:</p>
<ul>
<li class="">The words that appeared before <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>u</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">u_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">u</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>. <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>k</mi></mrow><annotation encoding="application/x-tex">k</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0315em">k</span></span></span></span> stands for the Context Window Size the model can see at once. In short, it is the <strong>'context up to this point'</strong>.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="normal">Θ</mi></mrow><annotation encoding="application/x-tex">\Theta</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord">Θ</span></span></span></span> (Theta):</p>
<ul>
<li class="">The <strong>parameters (weights) of the AI model</strong> we want to train.</li>
</ul>
</li>
</ul>
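<p>The underflow problem mentioned above is easy to demonstrate: multiplying a few hundred probabilities collapses to exactly zero in floating point, while summing their logs stays perfectly representable. A minimal Python sketch (the 0.1-per-token probability is an arbitrary toy value, not from any real model):</p>

```python
import math

# Toy probabilities for 500 tokens, each around 0.1.
probs = [0.1] * 500

# Naive product: 0.1 ** 500 = 1e-500, far below what float64
# can represent (~1e-308), so the result underflows to 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 — underflow

# Summing logs instead keeps the value in a comfortable range.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)   # 500 * ln(0.1) ≈ -1151.29
```

This is exactly why the objective is written as a sum of log-probabilities rather than a product of probabilities.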
<hr>
<ul>
<li class="">
<p><strong>Intuitive Explanation</strong>:</p>
<ul>
<li class="">
<p><strong>Method</strong>: The model reads massive amounts of text gathered from the internet (news, books, wikis, etc.) in order and repeatedly guesses the next word.<br>
(<em>In fact, GPT-1's main training corpus is 'BooksCorpus', a collection of over 7,000 unpublished books. The long-form nature of book data was very helpful for learning long-range dependencies.</em>)</p>
</li>
<li class="">
<p><strong>Why it is Unsupervised</strong>: Humans do not need to label answers one by one. The sentence "The capital of South Korea is [Seoul]" serves as both the question and the answer on its own.</p>
</li>
<li class="">
<p><strong>Result</strong>: Through this massive, simple "next-word guessing game," the model learns grammar, common-sense knowledge about the world, and contextual reasoning all at once.</p>
</li>
</ul>
</li>
</ul>
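<p>The pre-training objective above can be sketched directly in code. A toy Python sketch, where <code>prob_fn</code> is a hypothetical stand-in for the Transformer decoder's conditional probability (here just a uniform distribution over a tiny vocabulary — not a real model):</p>

```python
import math

def pretraining_objective(tokens, prob_fn, k):
    """L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta).

    prob_fn(context, token) stands in for the Transformer decoder's
    conditional probability; k is the context window size.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]   # at most k previous tokens
        total += math.log(prob_fn(context, tokens[i]))
    return total

# Toy stand-in "model": uniform probability over a 4-word vocabulary.
uniform = lambda context, token: 0.25

tokens = ["the", "capital", "of", "korea", "is", "seoul"]
print(pretraining_objective(tokens, uniform, k=3))   # 5 * log(0.25) ≈ -6.93
```

Training maximizes this quantity over Θ; the real model's <code>prob_fn</code> is the softmax output of the Transformer decoder.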
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="stage-2-supervised-fine-tuning">Stage 2: Supervised Fine-tuning<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#stage-2-supervised-fine-tuning" class="hash-link" aria-label="Direct link to Stage 2: Supervised Fine-tuning" title="Direct link to Stage 2: Supervised Fine-tuning" translate="no">​</a></h3>
<p>After pre-training completes, this is the stage where the model is tuned for the specific task we actually want to solve (sentiment analysis, multiple choice, etc.). Because it uses labeled data, this is Supervised Learning.</p>
<ul>
<li class=""><strong>Definition (Objective Function)</strong>:
Given an input sequence <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>x</mi><mn>1</mn></msup><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msup><mi>x</mi><mi>m</mi></msup></mrow><annotation encoding="application/x-tex">x^1, \dots, x^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0085em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span></span></span></span></span></span></span></span> from a labeled dataset <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">C</mi></mrow><annotation encoding="application/x-tex">\mathcal{C}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" 
style="margin-right:0.0583em">C</span></span></span></span> and a label <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span>, the prediction probability and objective function are as follows:</li>
</ul>
<hr>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="label-answer-prediction-probability">Label (Answer) Prediction Probability<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#label-answer-prediction-probability" class="hash-link" aria-label="Direct link to Label (Answer) Prediction Probability" title="Direct link to Label (Answer) Prediction Probability" translate="no">​</a></h3>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mi>y</mi><mi mathvariant="normal">∣</mi><msup><mi>x</mi><mn>1</mn></msup><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msup><mi>x</mi><mi>m</mi></msup><mo stretchy="false">)</mo><mo>=</mo><mtext>softmax</mtext><mo stretchy="false">(</mo><msubsup><mi>h</mi><mi>l</mi><mi>m</mi></msubsup><msub><mi>W</mi><mi>y</mi></msub><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">P(y | x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.1141em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord">∣</span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8641em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7144em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span 
class="mord mathnormal mtight">m</span></span></span></span></span></span></span></span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1.0361em;vertical-align:-0.2861em"></span><span class="mord text"><span class="mord">softmax</span></span><span class="mopen">(</span><span class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.7144em"><span style="top:-2.453em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.247em"><span></span></span></span></span></span></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>
<ul>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msup><mi>x</mi><mn>1</mn></msup><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msup><mi>x</mi><mi>m</mi></msup></mrow><annotation encoding="application/x-tex">x^1, \dots, x^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.0085em;vertical-align:-0.1944em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8141em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span></span></span></span></span></span></span></span>:</p>
<ul>
<li class="">The input sentence (data). Consists of <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi></mrow><annotation encoding="application/x-tex">m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">m</span></span></span></span> words (tokens). (e.g., "This movie is so fun")</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span>:</p>
<ul>
<li class="">The target label we need to predict. (e.g., Positive or Negative)</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msubsup><mi>h</mi><mi>l</mi><mi>m</mi></msubsup></mrow><annotation encoding="application/x-tex">h_l^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9775em;vertical-align:-0.2831em"></span><span class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-2.4169em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2831em"><span></span></span></span></span></span></span></span></span></span>:</p>
<ul>
<li class="">The <strong>final output value (Hidden state)</strong> produced by processing the very last word (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>m</mi></mrow><annotation encoding="application/x-tex">m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">m</span></span></span></span>) in the last layer (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>l</mi></mrow><annotation encoding="application/x-tex">l</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal" style="margin-right:0.0197em">l</span></span></span></span>) of the pre-trained Transformer model. You can think of it as the <strong>'core meaning of the sentence'</strong> summarized by the model after reading the entire sentence from start to finish.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mi>y</mi></msub></mrow><annotation encoding="application/x-tex">W_y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span></span></span></span>:</p>
<ul>
<li class="">The weights of the Linear Layer newly added to perform a specific task (classification). It receives the model's summary (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msubsup><mi>h</mi><mi>l</mi><mi>m</mi></msubsup></mrow><annotation encoding="application/x-tex">h_l^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9775em;vertical-align:-0.2831em"></span><span class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-2.4169em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2831em"><span></span></span></span></span></span></span></span></span></span>) and converts it into scores corresponding to the number of answer labels.</li>
</ul>
</li>
<li class="">
<p><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mtext>softmax</mtext></mrow><annotation encoding="application/x-tex">\text{softmax}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord text"><span class="mord">softmax</span></span></span></span></span>:</p>
<ul>
<li class="">The Softmax function. It converts the raw scores (logits) from <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mi>y</mi></msub></mrow><annotation encoding="application/x-tex">W_y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span></span></span></span> into probability values that sum to 1 (100%). (e.g., Positive 0.9, Negative 0.1)</li>
</ul>
</li>
</ul>
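<p>The classification head described above fits in a few lines. A minimal pure-Python sketch — the hidden size, label count, and all numeric values below are toy assumptions for illustration, not GPT-1's actual dimensions:</p>

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_lm, W_y):
    """P(y | x^1, ..., x^m) = softmax(h_l^m @ W_y).

    h_lm : final hidden state of the last token (length d).
    W_y  : d x num_labels weight matrix of the newly added linear layer.
    """
    num_labels = len(W_y[0])
    scores = [sum(h_lm[j] * W_y[j][c] for j in range(len(h_lm)))
              for c in range(num_labels)]
    return softmax(scores)

# Toy numbers: 3-dim hidden state, 2 labels (positive / negative).
h = [0.5, -1.0, 2.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, -0.5]]
print(classify(h, W))   # ≈ [0.97, 0.03] — probabilities summing to 1
```

During fine-tuning, only a small head like <code>W_y</code> is added on top of the pre-trained Transformer, and the whole stack is trained on the labeled pairs.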
<hr>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="fine-tuning-objective-function">Fine-Tuning Objective Function<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#fine-tuning-objective-function" class="hash-link" aria-label="Direct link to Fine-Tuning Objective Function" title="Direct link to Fine-Tuning Objective Function" translate="no">​</a></h3>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>L</mi><mn>2</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo><mo>=</mo><munder><mo>∑</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo separator="true">,</mo><mi>y</mi><mo stretchy="false">)</mo></mrow></munder><mi>log</mi><mo>⁡</mo><mi>P</mi><mo stretchy="false">(</mo><mi>y</mi><mi mathvariant="normal">∣</mi><msup><mi>x</mi><mn>1</mn></msup><mo separator="true">,</mo><mo>…</mo><mo separator="true">,</mo><msup><mi>x</mi><mi>m</mi></msup><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y | x^1, \dots, x^m)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.566em;vertical-align:-1.516em"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.05em"><span 
style="top:-1.809em;margin-left:0em"><span class="pstrut" style="height:3.05em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">x</span><span class="mpunct mtight">,</span><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span><span class="mclose mtight">)</span></span></span></span><span style="top:-3.05em"><span class="pstrut" style="height:3.05em"></span><span><span class="mop op-symbol large-op">∑</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.516em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="mord mathnormal" style="margin-right:0.0359em">y</span><span class="mord">∣</span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8641em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span></span></span></span></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">x</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.7144em"><span style="top:-3.113em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span 
class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span></span></span></span></span><span class="mclose">)</span></span></span></span></span>
<ul>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>2</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_2(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The objective function of the second learning stage (Fine-tuning). <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">C</mi></mrow><annotation encoding="application/x-tex">\mathcal{C}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0583em">C</span></span></span></span> refers to the labeled dataset where humans have directly attached answers (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span>) (e.g., review-star rating data).</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mo>∑</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo separator="true">,</mo><mi>y</mi><mo stretchy="false">)</mo></mrow></msub></mrow><annotation encoding="application/x-tex">\sum_{(x,y)}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1.2247em;vertical-align:-0.4747em"></span><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.2253em"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mopen mtight">(</span><span class="mord mathnormal mtight">x</span><span class="mpunct mtight">,</span><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span><span class="mclose mtight">)</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.4747em"><span></span></span></span></span></span></span></span></span></span>:<!-- -->
<ul>
<li class="">Indicates that the quantity below is summed over every (input sentence <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>x</mi></mrow><annotation encoding="application/x-tex">x</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">x</span></span></span></span>, answer <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span>) pair in the dataset <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">C</mi></mrow><annotation encoding="application/x-tex">\mathcal{C}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0583em">C</span></span></span></span>.</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>log</mi><mo>⁡</mo><mi>P</mi><mo stretchy="false">(</mo><mo>…</mo><mtext> </mtext><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\log P(\dots)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.1389em">P</span><span class="mopen">(</span><span class="minner">…</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The log of the probability that the model assigns to the true answer <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>y</mi></mrow><annotation encoding="application/x-tex">y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">y</span></span></span></span>.</li>
</ul>
</li>
</ul>
<p>(<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msubsup><mi>h</mi><mi>l</mi><mi>m</mi></msubsup></mrow><annotation encoding="application/x-tex">h_l^m</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9775em;vertical-align:-0.2831em"></span><span class="mord"><span class="mord mathnormal">h</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-2.4169em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0197em">l</span></span></span><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">m</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2831em"><span></span></span></span></span></span></span></span></span></span> is the final activation vector of the Transformer's last block, and <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>W</mi><mi>y</mi></msub></mrow><annotation encoding="application/x-tex">W_y</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.9694em;vertical-align:-0.2861em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.1389em">W</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.1514em"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 
size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0359em">y</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2861em"><span></span></span></span></span></span></span></span></span></span> is the weight matrix of the output layer.)</p>
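<p>As a concrete illustration, the supervised objective above can be sketched in a few lines of NumPy: take the final activation vector for each labeled example, project it with the output weight matrix, apply softmax, and sum the log-probabilities of the true answers. The function names, toy shapes, and random data below are illustrative placeholders, not GPT-1's actual code or dimensions.</p>

```python
import numpy as np

def softmax(z):
    # Shift by the max before exponentiating so np.exp never overflows
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def l2_objective(h_finals, labels, W_y):
    """Sum of log P(y | x) over all (x, y) pairs in the labeled dataset C.

    h_finals : (N, d) final activation vectors (one per input sentence x)
    labels   : (N,)   index of the true answer y for each x
    W_y      : (d, K) weight matrix of the output layer
    """
    total = 0.0
    for h, y in zip(h_finals, labels):
        probs = softmax(h @ W_y)   # P(y | x) over the K classes
        total += np.log(probs[y])  # log-probability of the true answer
    return total

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))   # 4 toy examples, hidden size 8
y = np.array([0, 1, 1, 2])    # toy labels
W = rng.normal(size=(8, 3))   # 3 classes
print(l2_objective(h, y, W))  # a log-likelihood (negative) to be maximized
```

Training then adjusts the parameters to push this sum toward zero, i.e., to make the model assign higher probability to the correct labels.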
<hr>
<ul>
<li class=""><strong>Utilization of Auxiliary Objective</strong>:
To improve training stability and convergence speed in the supervised learning stage as well, GPT-1 additionally uses the language-modeling (next-word prediction) objective from Stage 1 as an auxiliary objective.</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>L</mi><mn>3</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo><mo>=</mo><msub><mi>L</mi><mn>2</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo><mo>+</mo><mi>λ</mi><mo>⋅</mo><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" 
style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">+</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal">λ</span><span class="mspace" style="margin-right:0.2222em"></span><span class="mbin">⋅</span><span class="mspace" style="margin-right:0.2222em"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span></span>
<ul>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>3</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_3(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">3</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The overall objective the model ultimately maximizes in the Fine-Tuning stage.</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>2</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_2(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The score for 'guessing the answer (label)' explained previously. (Supervised Learning)</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub><mo stretchy="false">(</mo><mi mathvariant="script">C</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">L_1(\mathcal{C})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mopen">(</span><span class="mord mathcal" style="margin-right:0.0583em">C</span><span class="mclose">)</span></span></span></span>:<!-- -->
<ul>
<li class="">The score for 'guessing the next word' explained at the very beginning. (The method used in pre-training) However, here it predicts the next word using the text from the currently training labeled dataset (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">C</mi></mrow><annotation encoding="application/x-tex">\mathcal{C}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0583em">C</span></span></span></span>), not the massive internet data (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi mathvariant="script">U</mi></mrow><annotation encoding="application/x-tex">\mathcal{U}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em"></span><span class="mord mathcal" style="margin-right:0.0993em">U</span></span></span></span>).</li>
</ul>
</li>
<li class=""><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>λ</mi></mrow><annotation encoding="application/x-tex">\lambda</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6944em"></span><span class="mord mathnormal">λ</span></span></span></span> (lambda):<!-- -->
<ul>
<li class="">A number controlling the <strong>Weight</strong>. It is a control dial that decides "Guessing the answer (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">L_2</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">2</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>) is the main mission, but at what ratio should we mix in guessing the next word (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">L_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span>)?" (Usually, a value like 0.5 is used).</li>
</ul>
</li>
</ul>
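<p>The combined objective is straightforward to write down in code. The sketch below computes a toy version of both terms from raw scores and mixes them with lambda; every name and shape here is an illustrative assumption, not GPT-1's implementation.</p>

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def l3_objective(cls_logits, labels, lm_logits, next_tokens, lam=0.5):
    """GPT-1 fine-tuning objective: L3(C) = L2(C) + lambda * L1(C).

    cls_logits  : (N, K) classification scores for each labeled example
    labels      : (N,)   true answer y for each example
    lm_logits   : (T, V) next-word scores at each position of the text in C
    next_tokens : (T,)   the word that actually comes next at each position
    """
    # L2: log-likelihood of the correct labels (supervised term)
    l2 = log_softmax(cls_logits)[np.arange(len(labels)), labels].sum()
    # L1: log-likelihood of the actual next words (auxiliary LM term)
    l1 = log_softmax(lm_logits)[np.arange(len(next_tokens)), next_tokens].sum()
    return l2 + lam * l1

rng = np.random.default_rng(0)
print(l3_objective(rng.normal(size=(4, 3)), np.array([0, 1, 2, 0]),
                   rng.normal(size=(10, 50)), rng.integers(0, 50, size=10)))
```

Setting <code>lam=0</code> recovers pure supervised fine-tuning; larger values keep the model closer to its language-modeling behavior.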
<h3><strong>Why bring back <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>L</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">L_1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal">L</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight">1</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> after pre-training is already finished?</strong></h3>
<blockquote>
<p><strong>Improving Generalization (Preventing Overfitting):</strong><br>
<!-- -->If the model focuses only on guessing the answer (label), it might forget the true meaning of the text and only learn shallow tricks (e.g., unconditionally picking 'Positive' if a certain word appears). Making it continue to predict the next word forces it to maintain the ability to deeply understand context.</p>
</blockquote>
<blockquote>
<p><strong>Increasing Learning Speed (Faster Convergence):</strong><br>
<!-- -->Since it learns while continuing to recognize the structure of language, the speed at which the model finds the answer becomes much faster.</p>
</blockquote>
<blockquote>
<p><strong>Retaining Pre-trained Knowledge:</strong><br>
<!-- -->It prevents the phenomenon of <strong>Catastrophic Forgetting</strong>, where the smart brain (weights) painstakingly built by reading the entire internet is lost (or destroyed) while learning only one specific task.</p>
</blockquote>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="4-task-aware-input-transformations">4. Task-aware Input Transformations<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#4-task-aware-input-transformations" class="hash-link" aria-label="Direct link to 4. Task-aware Input Transformations" title="Direct link to 4. Task-aware Input Transformations" translate="no">​</a></h2>
<p>The core of this technique is <strong>not altering the well-structured 12-layer decoder architecture</strong>. Without changing the architecture, it performs various tasks by manipulating only the shape of the text input using special tokens.</p>
<!-- -->
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-the-role-of-special-tokens">1) The Role of Special Tokens<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#1-the-role-of-special-tokens" class="hash-link" aria-label="Direct link to 1) The Role of Special Tokens" title="Direct link to 1) The Role of Special Tokens" translate="no">​</a></h3>
<ul>
<li class="">
<p><strong><code>&lt;S&gt; (Start)</code> token</strong>: Attached to the front of the sequence, serving as an <strong>Anchor</strong> to signal the start of a new task.</p>
<ul>
<li class=""><em>Difference from Positional Encoding</em>: While positional encoding informs the 'physical location' of a word, the <code>&lt;S&gt;</code> token is a 'structural initialization signal' indicating a new independent problem disconnected from the previous context. Without this token, the first word would have to perform both a semantic role and a structural role, causing an overload in attention computation.</li>
</ul>
</li>
<li class="">
<p><strong><code>$ (Delim)</code> token</strong>: Acts as a <strong>separator</strong> distinguishing different types of text, such as the premise and the hypothesis (options).</p>
</li>
<li class="">
<p><strong><code>&lt;E&gt; (Extract)</code> token</strong>: A token attached to the very end of the sequence. When the decoder reaches this token, all previous context information has been calculated. In other words, it acts as a <strong>Trigger to extract a single summary Vector</strong> that compresses the meaning of the entire sentence.</p>
</li>
</ul>
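<p>The three special tokens amount to a simple input-formatting rule, sketched below for an entailment-style (premise/hypothesis) task. The literal strings <code>&lt;s&gt;</code>, <code>$</code>, <code>&lt;e&gt;</code> and the function name are placeholders of my own; in GPT-1 these are learned special tokens in the vocabulary, not raw text.</p>

```python
# Hypothetical token strings standing in for GPT-1's learned special tokens.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def format_entailment(premise: str, hypothesis: str) -> str:
    """Task-aware input transformation: the architecture is untouched;
    only the shape of the text input changes."""
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

print(format_entailment("It is raining.", "The ground is wet."))
# <s> It is raining. $ The ground is wet. <e>
```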
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-processing-mechanism-for-multiple-choice-questions">2) Processing Mechanism for Multiple Choice Questions<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#2-processing-mechanism-for-multiple-choice-questions" class="hash-link" aria-label="Direct link to 2) Processing Mechanism for Multiple Choice Questions" title="Direct link to 2) Processing Mechanism for Multiple Choice Questions" translate="no">​</a></h3>
<p>The following is the process for solving a multiple-choice question (1 premise, 4 options), as on the SAT.</p>
<ol>
<li class="">
<p><strong>Batch Construction</strong>: The 4 options are not bundled into one long text. Instead, one independent sequence is constructed per option, as follows:</p>
<ul>
<li class="">
<p><code>&lt;S&gt; (Start)</code> + Premise + <code>$ (Delim)</code> + Option 1 + <code>&lt;E&gt; (Extract)</code></p>
</li>
<li class="">
<p><code>&lt;S&gt; (Start)</code> + Premise + <code>$ (Delim)</code> + Option 2 + <code>&lt;E&gt; (Extract)</code> (Same for the rest)</p>
</li>
</ul>
</li>
<li class="">
<p><strong>Parallel Operation</strong>: These 4 independent sequences are bundled into a batch and passed through the model at once.</p>
</li>
<li class="">
<p><strong>Score Derivation</strong>: The 4 vectors produced at the <code>&lt;E&gt; (Extract)</code> token at the end of each sequence are passed through the same Linear Classifier to obtain 4 raw scores (logits), one for each option. These logits are then collected and passed through the Softmax function to derive the probability of each option being the answer.</p>
</li>
</ol>
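<p>Steps 2 and 3 can be sketched as follows: one vector per option (taken at the <code>&lt;E&gt;</code> position), a single shared linear classifier producing one logit per option, and a softmax over those logits. The shapes, names, and random data are illustrative assumptions, not GPT-1's actual dimensions.</p>

```python
import numpy as np

def score_options(extract_vectors, w, b=0.0):
    """Multiple-choice scoring: each option's sequence yields one vector at
    its <E> (Extract) position; the SAME linear classifier maps every vector
    to a single logit, and softmax over the options gives answer probabilities.

    extract_vectors : (n_options, d) one <E>-position vector per option
    w, b            : shared linear classifier weights (d,) and bias
    """
    logits = extract_vectors @ w + b   # one raw score (logit) per option
    logits = logits - logits.max()     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 16))        # 4 options, toy hidden size 16
probs = score_options(vecs, rng.normal(size=16))
print(probs, probs.sum())              # 4 probabilities, summing to 1
```

Because the classifier is shared across all options, the model cannot cheat by treating option positions differently; only the content of each sequence determines its score.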
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="5-mathematical-processing-and-error-calculation-completion-of-learning">5. Mathematical Processing and Error Calculation (Completion of Learning)<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#5-mathematical-processing-and-error-calculation-completion-of-learning" class="hash-link" aria-label="Direct link to 5. Mathematical Processing and Error Calculation (Completion of Learning)" title="Direct link to 5. Mathematical Processing and Error Calculation (Completion of Learning)" translate="no">​</a></h2>
<p>This is the essential mathematical process of comparing the raw scores (logits) produced by the model with the actual answer in order to update (train) the parameters.</p>
<!-- -->
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-softmax-function">1) Softmax Function<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#1-softmax-function" class="hash-link" aria-label="Direct link to 1) Softmax Function" title="Direct link to 1) Softmax Function" translate="no">​</a></h3>
<ul>
<li class=""><strong>Definition</strong>: Converts the raw scores (logits) <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><msub><mi>z</mi><mi>i</mi></msub></mrow><annotation encoding="application/x-tex">z_i</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5806em;vertical-align:-0.15em"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.044em">z</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:-0.044em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span></span></span></span> of each class output by the linear classifier into probability values.</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>σ</mi><mo stretchy="false">(</mo><mi mathvariant="bold">z</mi><msub><mo stretchy="false">)</mo><mi>i</mi></msub><mo>=</mo><mfrac><msup><mi>e</mi><msub><mi>z</mi><mi>i</mi></msub></msup><mrow><munderover><mo>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow><mi>K</mi></munderover><msup><mi>e</mi><msub><mi>z</mi><mi>j</mi></msub></msup></mrow></mfrac></mrow><annotation encoding="application/x-tex">\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0359em">σ</span><span class="mopen">(</span><span class="mord mathbf">z</span><span class="mclose"><span class="mclose">)</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.6484em;vertical-align:-1.307em"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3414em"><span style="top:-2.1288em"><span class="pstrut" style="height:3em"></span><span class="mord"><span 
class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em">∑</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.9812em"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0572em">j</span><span class="mrel mtight">=</span><span class="mord mtight">1</span></span></span></span><span style="top:-3.2029em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.0715em">K</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.4358em"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6065em"><span style="top:-3.0051em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.044em">z</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3281em"><span style="top:-2.357em;margin-left:-0.044em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight" style="margin-right:0.0572em">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:0.2819em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em"><span class="pstrut" style="height:3em"></span><span class="frac-line" style="border-bottom-width:0.04em"></span></span><span style="top:-3.677em"><span class="pstrut" style="height:3em"></span><span class="mord"><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.6644em"><span style="top:-3.063em;margin-right:0.05em"><span class="pstrut" style="height:2.7em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.044em">z</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3281em"><span style="top:-2.357em;margin-left:-0.044em;margin-right:0.0714em"><span class="pstrut" style="height:2.5em"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.143em"><span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.307em"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span>
<ul>
<li class="">
<p><strong>Intuitive Explanation</strong>:
The four scores produced by the linear classifier (e.g., 10, 5, 1, -2) are on an arbitrary scale. There are two reasons to use Softmax instead of simply comparing them:</p>
<ol>
<li class="">
<p><strong>Probability Distribution Conversion</strong>: Rescales the scores so that they sum to exactly 1 (100%), with each value satisfying <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>0</mn><mo>&lt;</mo><mi>σ</mi><mo>&lt;</mo><mn>1</mn></mrow><annotation encoding="application/x-tex">0 &lt; \sigma &lt; 1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6835em;vertical-align:-0.0391em"></span><span class="mord">0</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em"></span><span class="mord mathnormal" style="margin-right:0.0359em">σ</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">&lt;</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:0.6444em"></span><span class="mord">1</span></span></span></span>. Because it uses the exponential function (<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>e</mi></mrow><annotation encoding="application/x-tex">e</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">e</span></span></span></span>), large scores are amplified and small scores are suppressed, pushing the model toward a confident prediction.</p>
</li>
<li class="">
<p><strong>Differentiability</strong>: Backpropagation requires every operation in the computation graph to be differentiable, and Softmax is smooth everywhere, so it satisfies this condition.</p>
</li>
</ol>
</li>
</ul>
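<p>As a sanity check on the two points above, here is a minimal softmax sketch (plain Python, names illustrative; real code would use a library such as NumPy or PyTorch):</p>

```python
import math

def softmax(scores):
    """Convert raw classifier scores (logits) into a probability distribution."""
    # Subtract the max score first: exp() of large logits overflows otherwise,
    # and shifting every score by a constant does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([10.0, 5.0, 1.0, -2.0])  # the example scores from the text
print(probs)       # each value lies strictly between 0 and 1
print(sum(probs))  # sums to 1.0 (up to floating-point rounding)
```

<p>Note how the exponential sharpens the gap: a score of 10 is only twice a score of 5, but its probability ends up dominating the distribution.</p>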
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-one-hot-encoding">2) One-hot Encoding<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#2-one-hot-encoding" class="hash-link" aria-label="Direct link to 2) One-hot Encoding" title="Direct link to 2) One-hot Encoding" translate="no">​</a></h3>
<ul>
<li class=""><strong>Definition</strong>: The target probability distribution <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi></mrow><annotation encoding="application/x-tex">p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal">p</span></span></span></span> when the answer is class <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>c</mi></mrow><annotation encoding="application/x-tex">c</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">c</span></span></span></span> is as follows:</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>p</mi><mo stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo><mo>=</mo><mrow><mo fence="true">{</mo><mtable rowspacing="0.36em" columnalign="left left" columnspacing="1em"><mtr><mtd><mstyle scriptlevel="0" displaystyle="false"><mn>1</mn></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="false"><mrow><mtext>if&nbsp;</mtext><mi>i</mi><mo>=</mo><mi>c</mi></mrow></mstyle></mtd></mtr><mtr><mtd><mstyle scriptlevel="0" displaystyle="false"><mn>0</mn></mstyle></mtd><mtd><mstyle scriptlevel="0" displaystyle="false"><mrow><mtext>if&nbsp;</mtext><mi>i</mi><mo mathvariant="normal">≠</mo><mi>c</mi></mrow></mstyle></mtd></mtr></mtable></mrow></mrow><annotation encoding="application/x-tex">p(i) = \begin{cases} 1 &amp; \text{if } i = c \\ 0 &amp; \text{if } i \neq c \end{cases}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal">p</span><span class="mopen">(</span><span class="mord mathnormal">i</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:3em;vertical-align:-1.25em"></span><span class="minner"><span class="mopen delimcenter" style="top:0em"><span class="delimsizing size4">{</span></span><span class="mord"><span class="mtable"><span class="col-align-l"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.69em"><span style="top:-3.69em"><span class="pstrut" style="height:3.008em"></span><span class="mord"><span class="mord">1</span></span></span><span style="top:-2.25em"><span class="pstrut" style="height:3.008em"></span><span 
class="mord"><span class="mord">0</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.19em"><span></span></span></span></span></span><span class="arraycolsep" style="width:1em"></span><span class="col-align-l"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.69em"><span style="top:-3.69em"><span class="pstrut" style="height:3.008em"></span><span class="mord"><span class="mord text"><span class="mord">if&nbsp;</span></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">c</span></span></span><span style="top:-2.25em"><span class="pstrut" style="height:3.008em"></span><span class="mord"><span class="mord text"><span class="mord">if&nbsp;</span></span><span class="mord mathnormal">i</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel"><span class="mrel"><span class="mord vbox"><span class="thinbox"><span class="rlap"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em"></span><span class="inner"><span class="mord"><span class="mrel"></span></span></span><span class="fix"></span></span></span></span></span><span class="mspace nobreak"></span><span class="mrel">=</span></span><span class="mspace" style="margin-right:0.2778em"></span><span class="mord mathnormal">c</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.19em"><span></span></span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span>
<ul>
<li class=""><strong>Intuitive Explanation</strong>: For the computer to compare its predicted probability (70%, 20%, 8%, 2%) with the real answer, the answer must also be in a 'probability shape'. If the answer is number 2, it means assigning 100% (1.0) only to the second slot and 0% (0.0) to the rest, making it into the form <code>[0.0, 1.0, 0.0, 0.0]</code>.</li>
</ul>
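<p>The encoding described above is a one-liner; this sketch (illustrative, with classes 0-indexed, so the "second slot" is index 1) makes it concrete:</p>

```python
def one_hot(c, num_classes):
    """Target distribution p: 1.0 at the answer class c (0-indexed), 0.0 elsewhere."""
    return [1.0 if i == c else 0.0 for i in range(num_classes)]

# Answer in the second slot out of 4 classes:
print(one_hot(1, 4))  # [0.0, 1.0, 0.0, 0.0]
```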
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-cross-entropy-loss">3) Cross-Entropy Loss<a href="https://hwkim-dev.github.io/hwkim-dev/blog/gpt-1#3-cross-entropy-loss" class="hash-link" aria-label="Direct link to 3) Cross-Entropy Loss" title="Direct link to 3) Cross-Entropy Loss" translate="no">​</a></h3>
<ul>
<li class=""><strong>Definition</strong>: Measures the difference (Loss) between the model's predicted probability distribution <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>q</mi></mrow><annotation encoding="application/x-tex">q</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal" style="margin-right:0.0359em">q</span></span></span></span> and the actual answer distribution <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi></mrow><annotation encoding="application/x-tex">p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em"></span><span class="mord mathnormal">p</span></span></span></span>.</li>
</ul>
<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>H</mi><mo stretchy="false">(</mo><mi>p</mi><mo separator="true">,</mo><mi>q</mi><mo stretchy="false">)</mo><mo>=</mo><mo>−</mo><munder><mo>∑</mo><mi>x</mi></munder><mi>p</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo><mi>log</mi><mo>⁡</mo><mi>q</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">H(p, q) = -\sum_{x} p(x) \log q(x)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0813em">H</span><span class="mopen">(</span><span class="mord mathnormal">p</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0359em">q</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em"></span></span><span class="base"><span class="strut" style="height:2.3em;vertical-align:-1.25em"></span><span class="mord">−</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mop op-limits"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.05em"><span style="top:-1.9em;margin-left:0em"><span class="pstrut" style="height:3.05em"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mathnormal mtight">x</span></span></span></span><span style="top:-3.05em"><span class="pstrut" style="height:3.05em"></span><span><span class="mop op-symbol large-op">∑</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" 
style="height:1.25em"><span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal">p</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.1667em"></span><span class="mop">lo<span style="margin-right:0.0139em">g</span></span><span class="mspace" style="margin-right:0.1667em"></span><span class="mord mathnormal" style="margin-right:0.0359em">q</span><span class="mopen">(</span><span class="mord mathnormal">x</span><span class="mclose">)</span></span></span></span></span>
<p>When the answer is One-hot Encoded, the probability is calculated only for the actual answer class <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>c</mi></mrow><annotation encoding="application/x-tex">c</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em"></span><span class="mord mathnormal">c</span></span></span></span>.
As the probability <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>q</mi><mo stretchy="false">(</mo><mi>c</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">q(c)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em"></span><span class="mord mathnormal" style="margin-right:0.0359em">q</span><span class="mopen">(</span><span class="mord mathnormal">c</span><span class="mclose">)</span></span></span></span> assigned by the model to the answer class gets closer to 1, the error (Loss) converges to 0, and if the probability is low, the error diverges to infinity.</p>
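<p>With a one-hot target, the sum above collapses to a single term, -log q(c). A small sketch (plain Python, names illustrative; the tiny epsilon guarding against log(0) is my own addition):</p>

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); eps avoids log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

p = [0.0, 1.0, 0.0, 0.0]              # one-hot answer
confident = [0.01, 0.97, 0.01, 0.01]  # model is sure of the right class
hesitant  = [0.10, 0.70, 0.05, 0.15]  # model is less sure
print(cross_entropy(p, confident))  # small loss, close to 0
print(cross_entropy(p, hesitant))   # larger loss: -log(0.70)
```

<p>As q(c) approaches 1 the loss approaches 0, and as q(c) approaches 0 the loss blows up, exactly as described above.</p>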
<ul>
<li class=""><strong>Intuitive Explanation</strong>:
MSE (Mean Squared Error) is used for continuous numbers (regression) like predicting house prices. On the other hand, for multiple-choice or classification problems, <strong>Cross-Entropy, which measures the distance between two probability distributions (Prediction vs. Answer)</strong>, is much more suitable.
The model calculates the error value between the predicted value (e.g., <code>[0.1, 0.7, 0.05, 0.15]</code>) and the answer (<code>[0, 1, 0, 0]</code>), and then updates its internal parameters in the direction that reduces this error, gradually improving its accuracy.</li>
</ul>]]></content:encoded>
            <category>paper</category>
            <category>gpt</category>
            <category>nlp</category>
            <category>llm</category>
            <category>deep-learning</category>
        </item>
        <item>
            <title><![CDATA[[Daily] Spring, and a Fresh Start]]></title>
            <link>https://hwkim-dev.github.io/hwkim-dev/blog/daily-first</link>
            <guid>https://hwkim-dev.github.io/hwkim-dev/blog/daily-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Cherry blossoms are starting to bloom — and so is this new site.]]></description>
            <content:encoded><![CDATA[<p>Cherry blossoms are starting to bloom — and so is this new site.</p>
<p>I have been working on a GPU programming project in the lab lately, and I lose track of time whenever I am deep in code.
That moment when a CUDA kernel runs correctly for the first time... still thrilling every time 😄</p>
<p>My goal is to write here consistently — not just study notes but casual moments like this too.</p>
<p>Today I wrapped up the site setup over a cup of coffee.
A good day, just like spring.</p>]]></content:encoded>
            <category>daily</category>
        </item>
        <item>
            <title><![CDATA[[Misc] Rebuilt My Personal Homepage with Docusaurus]]></title>
            <link>https://hwkim-dev.github.io/hwkim-dev/blog/jabdori-first</link>
            <guid>https://hwkim-dev.github.io/hwkim-dev/blog/jabdori-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[I finally set up a proper personal homepage. What used to be just a GitHub Profile README is now a full static site powered by Docusaurus.]]></description>
            <content:encoded><![CDATA[<p>I finally set up a proper personal homepage. What used to be just a GitHub Profile README is now a full static site powered by Docusaurus.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-docusaurus">Why Docusaurus?<a href="https://hwkim-dev.github.io/hwkim-dev/blog/jabdori-first#why-docusaurus" class="hash-link" aria-label="Direct link to Why Docusaurus?" title="Direct link to Why Docusaurus?" translate="no">​</a></h2>
<ul>
<li class=""><strong>Markdown-first</strong>: Blog posts are just <code>.md</code> files — no fuss.</li>
<li class=""><strong>React extensible</strong>: Custom pages like Papers, Projects, and Chatbot are plain React components.</li>
<li class=""><strong>GitHub Pages deploy</strong>: One push to <code>gh-pages</code> and it's live.</li>
<li class=""><strong>Dark mode out of the box</strong>: No extra work needed 😄</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="what-this-site-covers">What This Site Covers<a href="https://hwkim-dev.github.io/hwkim-dev/blog/jabdori-first#what-this-site-covers" class="hash-link" aria-label="Direct link to What This Site Covers" title="Direct link to What This Site Covers" translate="no">​</a></h2>
<table><thead><tr><th>Section</th><th>Content</th></tr></thead><tbody><tr><td>Home</td><td>Bio, tech stack, contact</td></tr><tr><td>Blog</td><td>Study / Misc / Daily / Review / News</td></tr><tr><td>Papers</td><td>Archive of papers I've authored</td></tr><tr><td>Projects</td><td>GitHub repositories &amp; release showcase</td></tr><tr><td>Chatbot</td><td>AI Q&amp;A about me and my projects (coming soon)</td></tr></tbody></table>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="things-left-to-do">Things Left To Do<a href="https://hwkim-dev.github.io/hwkim-dev/blog/jabdori-first#things-left-to-do" class="hash-link" aria-label="Direct link to Things Left To Do" title="Direct link to Things Left To Do" translate="no">​</a></h2>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Deploy and wire up the chatbot</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Fill in papers and projects data</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Write blog posts consistently (the hardest part...)</li>
</ul>
<p>Hoping to keep this space low-pressure and fun to update. Stop by anytime!</p>]]></content:encoded>
            <category>misc</category>
        </item>
        <item>
            <title><![CDATA[[News] AI / HPC Weekly Clips — 2026.04.14]]></title>
            <link>https://hwkim-dev.github.io/hwkim-dev/blog/news-first</link>
            <guid>https://hwkim-dev.github.io/hwkim-dev/blog/news-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A quick summary of notable updates in deep learning inference, GPU architecture, and HPC this week.]]></description>
            <content:encoded><![CDATA[<p>A quick summary of notable updates in deep learning inference, GPU architecture, and HPC this week.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="highlights">Highlights<a href="https://hwkim-dev.github.io/hwkim-dev/blog/news-first#highlights" class="hash-link" aria-label="Direct link to Highlights" title="Direct link to Highlights" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="1-nvidia-blackwell-2nd-gen-inference-benchmarks-published">1. NVIDIA Blackwell 2nd-Gen Inference Benchmarks Published<a href="https://hwkim-dev.github.io/hwkim-dev/blog/news-first#1-nvidia-blackwell-2nd-gen-inference-benchmarks-published" class="hash-link" aria-label="Direct link to 1. NVIDIA Blackwell 2nd-Gen Inference Benchmarks Published" title="Direct link to 1. NVIDIA Blackwell 2nd-Gen Inference Benchmarks Published" translate="no">​</a></h3>
<p>New benchmarks show up to <strong>4× throughput improvement</strong> over H100 for FP8 inference workloads.
Memory bandwidth efficiency during LLM decoding stages appears to be the biggest gain.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="2-flashattention-3-posted-on-arxiv">2. FlashAttention-3 Posted on arXiv<a href="https://hwkim-dev.github.io/hwkim-dev/blog/news-first#2-flashattention-3-posted-on-arxiv" class="hash-link" aria-label="Direct link to 2. FlashAttention-3 Posted on arXiv" title="Direct link to 2. FlashAttention-3 Posted on arXiv" translate="no">​</a></h3>
<p>The third iteration of the Flash Attention series is out.
It leverages Hopper's <strong>Tensor Memory Accelerator (TMA)</strong> and asynchronous pipelines to further reduce attention kernel overhead on H100.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="3-pytorch-27-released">3. PyTorch 2.7 Released<a href="https://hwkim-dev.github.io/hwkim-dev/blog/news-first#3-pytorch-27-released" class="hash-link" aria-label="Direct link to 3. PyTorch 2.7 Released" title="Direct link to 3. PyTorch 2.7 Released" translate="no">​</a></h3>
<p>Stability improvements for <code>torch.compile</code> and enhanced CUDA Graph automation are the headline features.</p>
<hr>
<p><em>These are personal notes — please verify details from original sources!</em></p>]]></content:encoded>
            <category>news</category>
            <category>AI</category>
            <category>GPU</category>
        </item>
        <item>
            <title><![CDATA[[Review] CUDA by Example — Best Book for GPU Beginners]]></title>
            <link>https://hwkim-dev.github.io/hwkim-dev/blog/review-first</link>
            <guid>https://hwkim-dev.github.io/hwkim-dev/blog/review-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[The book that helped me most when I first started learning CUDA programming.]]></description>
            <content:encoded><![CDATA[<p>The book that helped me most when I first started learning CUDA programming.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="book-info">Book Info<a href="https://hwkim-dev.github.io/hwkim-dev/blog/review-first#book-info" class="hash-link" aria-label="Direct link to Book Info" title="Direct link to Book Info" translate="no">​</a></h2>
<ul>
<li class=""><strong>Title</strong>: CUDA by Example: An Introduction to General-Purpose GPU Programming</li>
<li class=""><strong>Authors</strong>: Jason Sanders, Edward Kandrot</li>
<li class=""><strong>Publisher</strong>: Addison-Wesley Professional (2010)</li>
<li class=""><strong>Difficulty</strong>: ⭐⭐☆☆☆ (Beginner)</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="why-its-good">Why It's Good<a href="https://hwkim-dev.github.io/hwkim-dev/blog/review-first#why-its-good" class="hash-link" aria-label="Direct link to Why It's Good" title="Direct link to Why It's Good" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="example-driven-structure">Example-Driven Structure<a href="https://hwkim-dev.github.io/hwkim-dev/blog/review-first#example-driven-structure" class="hash-link" aria-label="Direct link to Example-Driven Structure" title="Direct link to Example-Driven Structure" translate="no">​</a></h3>
<p>Instead of burying you in theory, it shows <strong>working code first</strong> and explains afterward.
The progression feels natural: kernel basics → memory management → textures/constant memory → streams.</p>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="core-topics-covered">Core Topics Covered<a href="https://hwkim-dev.github.io/hwkim-dev/blog/review-first#core-topics-covered" class="hash-link" aria-label="Direct link to Core Topics Covered" title="Direct link to Core Topics Covered" translate="no">​</a></h3>
<table><thead><tr><th>Chapter</th><th>Topic</th></tr></thead><tbody><tr><td>3</td><td>Writing &amp; launching kernels</td></tr><tr><td>4</td><td>Parallel reduction</td></tr><tr><td>5</td><td>Thread cooperation &amp; shared memory</td></tr><tr><td>9</td><td>Atomic operations</td></tr><tr><td>10</td><td>CUDA streams</td></tr></tbody></table>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="downsides">Downsides<a href="https://hwkim-dev.github.io/hwkim-dev/blog/review-first#downsides" class="hash-link" aria-label="Direct link to Downsides" title="Direct link to Downsides" translate="no">​</a></h2>
<ul>
<li class="">Published in 2010, so nothing on modern architectures (Volta / Ampere / Hopper).</li>
<li class="">Warp-level primitives (<code>__shfl_sync</code>, etc.) are not covered — you'll need NVIDIA's Programming Guide for those.</li>
</ul>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="who-should-read-it">Who Should Read It<a href="https://hwkim-dev.github.io/hwkim-dev/blog/review-first#who-should-read-it" class="hash-link" aria-label="Direct link to Who Should Read It" title="Direct link to Who Should Read It" translate="no">​</a></h2>
<p>Anyone who knows C and wants to get started with CUDA — <strong>strongly recommended</strong>.
Once you finish it, move on to the NVIDIA Programming Guide and GTC session slides for deeper optimization.</p>
<p><strong>Score: 4 / 5</strong> ⭐⭐⭐⭐☆</p>]]></content:encoded>
            <category>review</category>
            <category>CUDA</category>
            <category>book</category>
        </item>
        <item>
            <title><![CDATA[[Study] CUDA Kernel Optimization — Memory Access Patterns]]></title>
            <link>https://hwkim-dev.github.io/hwkim-dev/blog/study-first</link>
            <guid>https://hwkim-dev.github.io/hwkim-dev/blog/study-first</guid>
            <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[While studying deep learning inference optimization, I explored how memory access patterns in CUDA kernels dramatically affect performance.]]></description>
            <content:encoded><![CDATA[<p>While studying deep learning inference optimization, I explored how memory access patterns in CUDA kernels dramatically affect performance.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="key-concepts">Key Concepts<a href="https://hwkim-dev.github.io/hwkim-dev/blog/study-first#key-concepts" class="hash-link" aria-label="Direct link to Key Concepts" title="Direct link to Key Concepts" translate="no">​</a></h2>
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="coalesced-memory-access">Coalesced Memory Access<a href="https://hwkim-dev.github.io/hwkim-dev/blog/study-first#coalesced-memory-access" class="hash-link" aria-label="Direct link to Coalesced Memory Access" title="Direct link to Coalesced Memory Access" translate="no">​</a></h3>
<p>When threads within a warp access <strong>contiguous addresses</strong> in global memory, the GPU coalesces them into a single transaction.
Strided (non-contiguous) access multiplies the number of transactions and tanks bandwidth efficiency.</p>
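<p>A toy model makes the effect concrete. The sketch below is my own simplification (one transaction per 128-byte segment, 4-byte elements; real hardware is more nuanced), counting how many transactions one warp needs when thread t reads element t × stride:</p>

```python
def warp_transactions(stride, warp_size=32, elem_bytes=4, seg_bytes=128):
    """Count the distinct 128-byte memory segments one warp touches."""
    byte_addrs = [t * stride * elem_bytes for t in range(warp_size)]
    segments = {addr // seg_bytes for addr in byte_addrs}
    return len(segments)

print(warp_transactions(stride=1))   # coalesced: 1 transaction for the whole warp
print(warp_transactions(stride=32))  # strided:   32 transactions, 32x the traffic
```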
<h3 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="shared-memory-tiling">Shared Memory Tiling<a href="https://hwkim-dev.github.io/hwkim-dev/blog/study-first#shared-memory-tiling" class="hash-link" aria-label="Direct link to Shared Memory Tiling" title="Direct link to Shared Memory Tiling" translate="no">​</a></h3>
<p>Shared Memory is on-chip SRAM physically co-located with L1 cache. By loading data in tiles, we drastically reduce round-trips to global memory.</p>
<div class="language-c codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#bfc7d5;--prism-background-color:#292d3e"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-c codeBlock_bY9V thin-scrollbar" style="color:#bfc7d5;background-color:#292d3e"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#bfc7d5"><span class="token plain">__global__ </span><span class="token keyword" style="font-style:italic">void</span><span class="token plain"> </span><span class="token function" style="color:rgb(130, 170, 255)">matmul_tiled</span><span class="token punctuation" style="color:rgb(199, 146, 234)">(</span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> </span><span class="token operator" style="color:rgb(137, 221, 255)">*</span><span class="token plain">A</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> </span><span class="token operator" style="color:rgb(137, 221, 255)">*</span><span class="token plain">B</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> </span><span class="token operator" style="color:rgb(137, 221, 255)">*</span><span class="token plain">C</span><span class="token punctuation" style="color:rgb(199, 146, 234)">,</span><span class="token plain"> </span><span class="token keyword" style="font-style:italic">int</span><span class="token plain"> N</span><span class="token punctuation" style="color:rgb(199, 146, 234)">)</span><span class="token plain"> </span><span class="token punctuation" style="color:rgb(199, 146, 234)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    __shared__ </span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> sA</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">TILE</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">TILE</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    __shared__ </span><span class="token keyword" style="font-style:italic">float</span><span class="token plain"> sB</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">TILE</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">[</span><span class="token plain">TILE</span><span class="token punctuation" style="color:rgb(199, 146, 234)">]</span><span class="token punctuation" style="color:rgb(199, 146, 234)">;</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain">    </span><span class="token comment" style="color:rgb(105, 112, 152);font-style:italic">// ...</span><span class="token plain"></span><br></span><span class="token-line" style="color:#bfc7d5"><span class="token plain"></span><span class="token punctuation" style="color:rgb(199, 146, 234)">}</span><br></span></code></pre></div></div>
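<p>A rough load-count model (my own back-of-the-envelope sketch, ignoring caches and registers) shows where the savings come from: with tiling, each global element is loaded once per tile step instead of once per output element:</p>

```python
def global_loads_naive(n):
    # Each of the n*n outputs reads a full row of A and a column of B: 2n loads each.
    return 2 * n ** 3

def global_loads_tiled(n, tile):
    # (n/tile)^2 output blocks, each iterating over n/tile k-steps,
    # loading two tile*tile blocks (one from A, one from B) per step.
    return 2 * (n // tile) ** 3 * tile ** 2

n, tile = 1024, 32
print(global_loads_naive(n) // global_loads_tiled(n, tile))  # reuse factor equals tile
```

<p>The reduction factor is exactly the tile width, which is why bumping TILE up (until shared memory or occupancy limits bite) keeps paying off.</p>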
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="benchmark-results">Benchmark Results<a href="https://hwkim-dev.github.io/hwkim-dev/blog/study-first#benchmark-results" class="hash-link" aria-label="Direct link to Benchmark Results" title="Direct link to Benchmark Results" translate="no">​</a></h2>
<table><thead><tr><th>Implementation</th><th>Throughput (GFLOPS)</th></tr></thead><tbody><tr><td>Naive (global memory)</td><td>42</td></tr><tr><td>Coalesced access</td><td>198</td></tr><tr><td>+ Shared memory tiling</td><td>573</td></tr></tbody></table>
<p>Relative to the naive kernel, coalesced access plus shared memory tiling yielded a <strong>~13.6× speedup</strong> (573 / 42 GFLOPS); tiling alone accounts for roughly 2.9× of that on top of coalescing.</p>
<h2 class="anchor anchorTargetHideOnScrollNavbar_vjPI" id="next-goals">Next Goals<a href="https://hwkim-dev.github.io/hwkim-dev/blog/study-first#next-goals" class="hash-link" aria-label="Direct link to Next Goals" title="Direct link to Next Goals" translate="no">​</a></h2>
<ul>
<li class="">Analyze bank conflicts and test padding strategies</li>
<li class="">Explore <code>__ldg()</code> read-only cache</li>
<li class="">Minimize warp divergence patterns</li>
</ul>]]></content:encoded>
            <category>study</category>
            <category>CUDA</category>
            <category>GPU</category>
        </item>
    </channel>
</rss>