| # | Question | Answer |
|---|----------|--------|
| Q1 | CNNs and RNNs don’t use positional embeddings. Why do Transformers use positional embeddings? | Answer |
| Q2 | Tell me the basic steps involved in running an inference query on an LLM. | Answer |
| Q3 | Explain how the KV cache accelerates LLM inference. | Answer |
| Q4 | How does quantization affect inference speed and memory requirements? | Answer |
| Q5 | How do you handle the large memory requirements of the KV cache in LLM inference? | Answer |
| Q6 | After tokenization, how are tokens converted into embeddings in the Transformer model? | Answer |
| Q7 | Explain why subword tokenization is preferred over word-level tokenization in the Transformer model. | Answer |
| Q8 | Explain the trade-offs in using a large vocabulary in LLMs. | Answer |
| Q9 | Explain how self-attention is computed in the Transformer model step by step. | Answer |
| Q10 | What is the computational complexity of self-attention in the Transformer model? | Answer |
| Q11 | How do Transformer models address the vanishing gradient problem? | Answer |
| Q12 | What is tokenization, and why is it necessary in LLMs? | Answer |
| Q13 | Explain the role of token embeddings in the Transformer model. | Answer |
| Q14 | Explain the working of the embedding layer in the Transformer model. | Answer |
| Q15 | What is the role of self-attention in the Transformer model, and why is it called “self-attention”? | Answer |
| Q16 | What is the purpose of the encoder in a Transformer model? | Answer |
| Q17 | What is the purpose of the decoder in a Transformer model? | Answer |
| Q18 | How does the encoder-decoder structure work at a high level in the Transformer model? | Answer |
| Q19 | What is the purpose of scaling in the self-attention mechanism in the Transformer model? | Answer |
| Q20 | Why does the Transformer model use multiple self-attention heads instead of a single self-attention head? | Answer |
| Q21 | How are the outputs of multiple heads combined and projected back in multi-head attention in the Transformer model? | Answer |
| Q22 | How does masked self-attention differ from regular self-attention, and where is it used in a Transformer? | Answer |
| Q23 | Discuss the pros and cons of the self-attention mechanism in the Transformer model. | Answer |
| Q24 | What is the purpose of masked self-attention in the Transformer decoder? | Answer |
| Q25 | Explain how masking works in masked self-attention in the Transformer. | Answer |
| Q26 | Explain why the encoder-decoder attention in the Transformer decoder is referred to as cross-attention. How does it differ from self-attention in the encoder? | Answer |
| Q27 | What is the softmax function, and where is it applied in Transformers? | Answer |
| Q28 | What is the purpose of residual (skip) connections in Transformer layers? | Answer |
| Q29 | Why is layer normalization used, and where is it applied in Transformers? | Answer |
| Q30 | What is cross-entropy loss, and how is it applied during Transformer training? | Answer |
| Q31 | Compare Transformers and RNNs in terms of handling long-range dependencies. | Answer |
| Q32 | What are the fundamental limitations of the Transformer model? | Answer |
| Q33 | How do Transformers address the limitations of CNNs and RNNs? | Answer |
| Q34 | How do Transformer models address the vanishing gradient problem? | Answer |
| Q35 | What is the purpose of the position-wise feed-forward sublayer? | Answer |
| Q36 | Can you briefly explain the difference between LLM training and inference? | Answer |
| Q37 | What is latency in LLM inference, and why is it important? | Answer |
| Q38 | What is batch inference, and how does it differ from single-query inference? | Answer |
| Q39 | How does batching generally help with LLM inference efficiency? | Answer |
| Q40 | Explain the trade-offs between batching and latency in LLM serving. | Answer |
| Q41 | How can techniques like mixture-of-experts (MoE) optimize inference efficiency? | Answer |
| Q42 | Explain the role of decoding strategy in LLM text generation. | Answer |
| Q43 | What are the different decoding strategies in LLMs? | Answer |
| Q44 | Explain the impact of the decoding strategy on LLM-generated output quality and latency. | Answer |
| Q45 | Explain the greedy search decoding strategy and its main drawback. | Answer |
| Q46 | How does Beam Search improve upon Greedy Search, and what is the role of the beam width parameter? | Answer |
| Q47 | When is a deterministic strategy (like Beam Search) preferable to a stochastic (sampling) strategy? Provide a specific use case. | Answer |
| Q48 | Discuss the primary trade-off between the computational cost and the output quality when comparing Greedy Search and Beam Search. | Answer |
| Q49 | When you set the temperature to 0.0, which decoding strategy are you using? | Answer |
| Q50 | How is Beam Search fundamentally different from a Breadth-First Search (BFS) or Depth-First Search (DFS)? | Answer |
| Q51 | Explain the criteria for choosing different decoding strategies. | Answer |
| Q52 | Compare deterministic and stochastic decoding methods in LLMs. | Answer |
| Q53 | What is the role of the context window during LLM inference? | Answer |
| Q54 | Explain the pros and cons of large and small context windows in LLM inference. | Answer |
| Q55 | What is the purpose of temperature in LLM inference, and how does it affect the output? | Answer |
| Q56 | What is autoregressive generation in the context of LLMs? | Answer |
| Q57 | Explain the strengths and limitations of autoregressive text generation in LLMs. | Answer |
| Q58 | Explain how diffusion language models (DLMs) differ from Large Language Models (LLMs). | Answer |
| Q59 | Do you prefer DLMs or LLMs for latency-sensitive applications? | Answer |
| Q60 | Explain the concept of token streaming during inference. | Answer |
| Q61 | What is speculative decoding, and when would you use it? | Answer |
| Q62 | What are the challenges in performing distributed inference across multiple GPUs? | Answer |
| Q63 | How would you design a scalable LLM inference system for real-time applications? | Answer |
| Q64 | Explain the role of Flash Attention in reducing memory bottlenecks. | Answer |
| Q65 | What is continuous batching, and how does it differ from static batching? | Answer |
| Q66 | What is mixed precision, and why is it used during inference? | Answer |
| Q67 | Differentiate between online and offline LLM inference deployment scenarios and discuss their respective requirements. | Answer |
| Q68 | Explain the throughput vs. latency trade-off in LLM inference. | Answer |
| Q69 | What are the various bottlenecks in a typical LLM inference pipeline when running on a modern GPU? | Answer |
| Q70 | How do you measure LLM inference performance? | Answer |
| Q71 | What are the different LLM inference engines available? Which one do you prefer? | Answer |
| Q72 | What are the challenges in LLM inference? | Answer |
| Q73 | What are the possible options for accelerating LLM inference? | Answer |
| Q74 | What is Chain-of-Thought prompting, and when is it useful? | Answer |
| Q75 | Explain the reason behind the effectiveness of Chain-of-Thought (CoT) prompting. | Answer |
| Q76 | Explain the trade-offs in using CoT prompting. | Answer |
| Q77 | What is prompt engineering, and why is it important for LLMs? | Answer |
| Q78 | What is the difference between zero-shot and few-shot prompting? | Answer |
| Q79 | What are the different approaches for choosing examples for few-shot prompting? | Answer |
| Q80 | Why is context length important when designing prompts for LLMs? | Answer |
| Q81 | What is a system prompt, and how does it differ from a user prompt? | Answer |
| Q82 | What is In-Context Learning (ICL), and how is few-shot prompting related? | Answer |
| Q83 | What is self-consistency prompting, and how does it improve reasoning? | Answer |
| Q84 | Why is context important in prompt design? | Answer |
| Q85 | Describe a strategy for reducing hallucinations via prompt design. | Answer |
| Q86 | How would you structure a prompt to ensure the LLM output is in a specific format, like JSON? | Answer |
| Q87 | Explain the purpose of ReAct prompting in AI agents. | Answer |
| Q88 | What are the different phases in LLM development? | Answer |
| Q89 | What are the different types of LLM fine-tuning? | Answer |
| Q90 | What role does instruction tuning play in improving an LLM’s usability? | Answer |
| Q91 | What role does alignment tuning play in improving an LLM's usability? | Answer |
| Q92 | How do you prevent overfitting during fine-tuning? | Answer |
| Q93 | What is catastrophic forgetting, and why is it a concern in fine-tuning? | Answer |
| Q94 | What are the strengths and limitations of full fine-tuning? | Answer |
| Q95 | Explain how parameter-efficient fine-tuning addresses the limitations of full fine-tuning. | Answer |
| Q96 | When might prompt engineering be preferred over task-specific fine-tuning? | Answer |
| Q97 | When should you use fine-tuning vs. RAG? | Answer |
| Q98 | What are the limitations of using RAG over fine-tuning? | Answer |
| Q99 | What are the limitations of fine-tuning compared to RAG? | Answer |
| Q100 | When should you prefer task-specific fine-tuning over prompt engineering? | Answer |
| Q101 | What is LoRA, and how does it work? | Answer |
| Q102 | Explain the key ingredient behind the effectiveness of the LoRA technique. | Answer |
| Q103 | What is QLoRA, and how does it differ from LoRA? | Answer |
| Q104 | When would you use QLoRA instead of standard LoRA? | Answer |
| Q105 | How would you handle LLM fine-tuning on consumer hardware with limited GPU memory? | Answer |
| Q106 | Explain different preference alignment methods and their trade-offs. | Answer |
| Q107 | What is gradient accumulation, and how does it help with fine-tuning large models? | Answer |
| Q108 | What are the possible options to speed up LLM fine-tuning? | Answer |
| Q109 | Explain the pretraining objective used in LLM pretraining. | Answer |
| Q110 | What is the difference between causal language modeling and masked language modeling? | Answer |
| Q111 | How do LLMs handle out-of-vocabulary (OOV) words? | Answer |
| Q112 | In the context of LLM pretraining, what are scaling laws? | Answer |
| Q113 | Explain the concept of Mixture-of-Experts (MoE) architecture and its role in LLM pretraining. | Answer |
| Q114 | What is model parallelism, and how is it used in LLM pretraining? | Answer |
| Q115 | What is the significance of self-supervised learning in LLM pretraining? | Answer |