0x01 // Foundations #
Updated: 2026-04-24

AI is no longer just a chatbot wrapped in hype. For security testers, researchers, and bug hunters, it has become a working tool: closer to a decompiler, proxy, scanner, script, or shell than a novelty. It has not replaced skilled people. It has made skilled people faster.
That matters because a lot of technical work is not one big flash of genius. It is pattern recognition, repeated testing, dead ends, documentation, revisiting old ideas, comparing outputs, cleaning up rough notes, and noticing connections other people missed. AI is useful there. It can help surface angles faster, summarize noise, reduce repetitive work, and turn raw output into something you can actually act on. For security people, that means faster triage, faster exploration, better documentation, and more time spent on the work that actually finds bugs.
And when you can run that power locally with more privacy, more control, and fewer dependencies on somebody else's platform, it becomes even more useful.
This book is not about vague futurism, hype, or "AI changes everything" filler. It is about using AI in a way that is actually useful: running models locally, understanding what the files and numbers mean, making smart hardware decisions, and building workflows that help with research, coding, reporting, recon, and automation.
The progression in this book is simple:
zero -> local models -> practical workflow improvements -> agentic setups -> model tuning -> real-world attacks and research
That means payload research, exploit analysis, recon automation, agentic tooling, and the places those workflows break under pressure.
If you are new to local AI, this chapter gives you the groundwork. By the end, the terms will make sense, the model names will stop looking random, and you will have a realistic picture of what local AI can and cannot do.
Before we start: a few terms that matter #
Before getting into hardware, runtimes, and workflows, it helps to lock down a few terms early. You do not need a giant glossary to use local AI, but there are a few ideas you need to be comfortable with so the rest of the chapter does not feel like guesswork.
Generative AI is the broad category for systems that generate content. That can mean text, code, summaries, images, shell one-liners, malware analysis notes, documentation drafts, or anything else that looks like a human created it.
LLMs, or Large Language Models, are the text engines behind most of what people interact with day to day. They are trained on massive amounts of text and learn statistical patterns in language, code, and structure. When you ask a model a question, it is not "thinking" like a human. It is predicting the next token based on everything it has seen and the context you gave it.
Tokens are how models break text apart internally. Tokens are not exactly words. Sometimes a token is a full word, sometimes part of a word, punctuation, or whitespace. In practice, prompt size, output length, speed, and memory usage are often discussed in tokens.
Context window is the amount of text the model can keep in view at once. That limit is measured in tokens too. This matters immediately in security work. Can the model hold a full source file? A full HTTP response? A session's worth of recon notes? A vulnerable function plus the route that reaches it? A model with a 4K context window is a different tool than a model with 128K. One is good for short prompts and tight tasks. The other can carry more evidence before it starts forgetting what you already gave it.
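A quick way to build that instinct: estimate tokens before you paste something in. The snippet below assumes a hypothetical recon-notes.md and uses a rough four-characters-per-token rule of thumb, which is loose but close enough to tell you whether you are about to blow past a 4K window.

# Rough token estimate: bytes divided by ~4. Real tokenizers vary, so treat this
# as a ballpark, not a count.
$ echo $(( $(wc -c < recon-notes.md) / 4 ))

If that number is anywhere near the model's context limit, expect it to start forgetting the beginning of what you gave it.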
VRAM is your GPU's memory, and when you are running models locally, it becomes one of the most important hardware constraints. You can have a decent GPU core and still have a bad local AI experience if you do not have enough VRAM to fit the model efficiently.
Agentic AI is where things start getting more interesting. Instead of only generating text, the model is allowed to take actions through tools. That can mean reading files, running shell commands, calling APIs, writing code, or moving through multi-step workflows. Done right, this is where AI starts acting less like a chatbot and more like an operator assistant.
You do not need to master all of that on day one. You just need enough understanding to stop treating model selection and setup like guesswork.
Reality check: bigger does not always mean better #
Once the terms make sense, the next mistake to kill early is the assumption that the biggest model you can download is automatically the best one to use.
That is not how this works.
A smaller model that responds quickly is often more useful than a larger model that drags. A 7B model running at 30 or more tokens per second feels responsive. You can ask a question, refine your prompt, iterate on code, fix a script, summarize findings, and keep moving without breaking your flow.
That matters more than people think.
In practical work, speed changes how often you actually use the tool. If the model feels instant, it becomes part of your workflow. If every reply feels like waiting on a slow coworker to type one character at a time, you will stop reaching for it unless the task is important enough to justify the delay.
That is where larger models start to make sense.
Once you get into harder tasks like multi-step reasoning, code analysis, exploit explanation, architecture comparisons, tradeoff analysis, or deep writeups, larger models can absolutely earn their keep. A 32B model on the right prompt can outperform a smaller model in ways that are obvious: better structure, better context retention, fewer shallow mistakes, and stronger reasoning under pressure.
But there is a cost.
On something like a 12GB GPU, a 32B model is usually going to be slow. Depending on the quantization and backend, you might see roughly 2 to 5 tokens per second. That is usable, but it is not interactive in the same way a smaller model is.
So stop thinking in terms of "best model" and start thinking in terms of best model for the job.
A good mental model is:
- Small models are for speed, iteration, and everyday utility
- Mid-size models are often the sweet spot for serious local use
- Large models are for harder problems where quality matters more than latency
Treat a large model like a slower, smarter colleague. Give it the hard problem, let it work, and use smaller models for the rapid-fire stuff.
Where models come from #
Once you stop chasing size for its own sake, the next thing to understand is where these models actually come from and what your role really is.
A lot of people getting into local AI assume they need to train a model themselves.
You almost never do.
Training a model from scratch is expensive, resource-heavy, and completely unnecessary for most people reading this. You are not going to casually build a competitive foundation model at home because you watched two videos and downloaded a repo. That is not the game.
The real world of local AI is built on an open model ecosystem.
People and organizations release pre-trained models, fine-tuned models, instruct models, distilled models, coding models, reasoning models, multimodal models, and quantized variants that are ready to run. The biggest hub for this ecosystem is Hugging Face, along with model communities, GitHub repos, Ollama registries, and GGUF distribution pages.
Base and instruct models are not the same thing. A base model completes text. That is its raw behavior. Give it a sentence and it continues the pattern. An instruct model has been tuned to follow instructions, answer questions, refuse some requests, format responses, and behave more like the AI tools people are used to. For security workflows, this distinction matters immediately. Base models are usually for training pipelines, experiments, and people who know why they want raw completion behavior. Instruct models are what you usually want for actual use.
That means your job is usually not "train a model."
Your job is to:
- pick the right family of model
- pick the right size
- pick the right quantization
- run it with the right backend
- test whether it actually helps with your work
That is a much more realistic and much more useful skill set.
In practice, you are standing on top of work that was already done for you. Someone trained or adapted the model. Someone exported it. Someone quantized it. Someone benchmarked it. Your job is to understand enough to make good decisions instead of downloading random files and hoping for the best.
Understanding model filenames without pretending they make sense at first glance #
Once you start downloading models, one of the first things you run into is the filenames. At first glance they look like garbage.
Then you realize they are actually compact metadata.
Take this example:
DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
That looks ugly, but it tells you a lot.
.gguf #
This is the file format. GGUF is a common format for running quantized models locally, especially in the llama.cpp ecosystem and the tools built on top of it. If you are downloading models for local inference on consumer hardware, you are going to see GGUF constantly.
14B #
This is the model size, meaning roughly 14 billion parameters. Bigger parameter counts usually mean more capability, but also more memory usage and slower inference.
Q4 #
This refers to quantization level. Quantization is one of the tricks that makes local AI practical. Instead of storing model weights in higher-precision formats that take a lot of memory, the weights are compressed into lower-precision representations.
The tradeoff is simple:
- lower precision = smaller and faster, but potentially lower quality
- higher precision = larger and slower, but potentially better output fidelity
K_M #
This is part of the quantization method naming. You do not need to become a compression researcher to use local models, but you should recognize that not all Q4 files are the same. The exact quantization method affects quality, speed, and memory behavior.
For many users, Q4_K_M is a very solid default. It often gives a good balance between quality and performance, especially when you are trying to fit useful models onto limited hardware.
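Many registries put the quantization right in the tag, so you can ask for it explicitly instead of taking whatever the default alias points at. The tag below is illustrative; naming varies between models, so check the tags page before pulling.

$ ollama pull qwen2.5:14b-instruct-q4_K_M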
It is also worth knowing where things start going wrong. Once you push quantization too far, the model can get small fast, but also noticeably worse. Below a certain point, you are not making the same model more efficient anymore. You are making it dumber. A heavily crushed 70B at very low precision can end up less useful than a smaller model running at a healthier quantization level.
Distill and the family names #
This part tells you about lineage. Distill usually means the model was trained to inherit useful behavior from a stronger model, while names like Qwen, Llama, Mistral, or DeepSeek point to the base family it came from. That matters because model families have personalities: some are better at code, some at reasoning, some at multilingual work, and some are just easier to prompt well.
A rough field guide looks like this:
- Qwen 2.5: handles structured output and code well. Good for report cleanup, command generation, and turning messy findings into readable paragraphs.
- DeepSeek R1: reasoning-focused and slower. Use it when you need to work through logic chains, understand a vulnerability class, or analyze something that needs more than autocomplete.
- Llama 3.x: widest ecosystem support, most community fine-tunes, easiest to find security-adapted variants. Good default family when tooling compatibility matters.
- Mistral: lean and fast. Good daily driver on tighter hardware when you need something responsive for iteration and short tasks.
- Phi-4: smaller than it acts. Handles code and technical explanation better than its size suggests. Worth testing if you are working on constrained hardware and need more than a toy.
Do not treat family names like brand loyalty. Treat them like tool behavior. A model that is great at explaining a Python traceback might be mediocre at long recon summarization. A reasoning model that helps with exploit logic might be painfully slow for report cleanup. Test the family against your actual work.
So when you look at a filename, stop seeing noise. It is telling you:
- what the model is
- what family it came from
- how large it is
- how it was quantized
- what runtime ecosystem it likely fits into
Once you get comfortable reading that, you stop downloading blind.
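In practice, that means pulling the exact file you want instead of clicking the first link that looks right. The repo and filename below are an example of the pattern, not a recommendation; verify both on the model page before you download anything.

$ huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
    DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf --local-dir .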
Hardware reality: VRAM matters more than most people expect #
Understanding filenames is useful, but it only matters if you connect it back to hardware reality.
If you want one sentence that will save you time and money, here it is:
VRAM is one of the most important resources in local AI.
People obsess over GPU branding, CUDA cores, and marketing names, but for running local models, the amount of VRAM often determines what is practical far more than people expect.
Why? Because the model weights have to live somewhere.
If the model does not fit cleanly into available VRAM, performance drops, offloading happens, things spill into system RAM, and your setup suddenly feels like a miserable science project.
A rough rule people use is:
parameters x 2 bytes = memory for unquantized 16-bit (FP16) weights
That is a simplification, but it gives you a useful instinct. Quantization brings that number down, which is exactly why quantized formats matter so much for consumer hardware.
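If you want to turn that instinct into a quick check before downloading anything, a back-of-the-envelope calculation is enough. The sketch below assumes roughly 2 bytes per weight for FP16, roughly 0.6 bytes per weight for a Q4_K_M-class quant, and a loose 20 percent on top for KV cache and runtime buffers. None of those factors are exact. They just tell you which side of your VRAM limit you are on.

# Rough weight-memory estimate: billions of parameters x bytes per weight,
# plus ~20% overhead. Every factor here is an approximation.
estimate_vram() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f GB\n", p * b * 1.2 }'; }

estimate_vram 14 2     # 14B at FP16: ~33.6 GB, nowhere near a 12GB card
estimate_vram 14 0.6   # 14B at ~Q4_K_M: ~10.1 GB, workable on 12GB
estimate_vram 32 0.6   # 32B at ~Q4: ~23 GB, expect spillover on 12GB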
Training is even more expensive. Once you move from inference to fine-tuning or full training, memory use jumps hard. Optimizer states, gradients, activations, checkpoints, and batch sizing all add overhead.
A rough mental model for training is that it can require 4x to 6x or more of the base model memory footprint depending on the setup.
That is why there is such a big difference between:
- "I can run this model locally"
- "I can fine-tune this model locally"
- "I can train something serious from scratch"
Those are completely different tiers of hardware reality.
For local inference, you are mostly balancing four things:
- model size
- quantization
- available VRAM
- patience
If you have less VRAM, you lean harder on smaller models and more aggressive quantization. If you have more VRAM, you can move into larger and more capable models without making inference painful.
There is no shame in using the model that fits your machine well. A fast model you actually use beats a huge model you launch twice and then abandon.
Fine-tuning overview: you probably want adaptation, not reinvention #
Once hardware constraints make sense, the next question is usually whether you can bend a model toward your own workflow instead of just running it as-is.
That is where fine-tuning enters the picture.
For most people, the goal is not reinvention. It is adaptation.
You are usually not trying to create a brand-new foundation model. You are trying to make an existing one more useful for your own domain, writing style, task format, internal vocabulary, or workflow. That difference matters, because it changes what "good enough" looks like.
A lot of people hear fine-tuning and immediately imagine huge GPU clusters, impossible costs, and research-lab complexity. Sometimes that is true. Most of the time, for the people reading this book, it is not.
What matters in practice is that there are lighter ways to adapt models without retraining everything.
LoRA #
LoRA stands for Low-Rank Adaptation. In practical terms, it lets you fine-tune a model without retraining every parameter. Instead of modifying the full model, you train a much smaller set of adapter weights that influence how the model behaves.
Why this matters is simple: it takes model customization out of datacenter territory and puts it into reach of normal hardware.
That means if you want a model to get better at a certain report style, internal format, domain vocabulary, or repeated task pattern, you do not need to rebuild the whole thing from the ground up. You can adapt it.
That is the reason LoRA matters. It is not just a technical trick. It is what makes specialization practical.
QLoRA #
QLoRA pushes that idea further by combining quantization with low-rank adaptation. In plain terms, it lowers the memory cost again, which makes it possible to adapt larger models on more modest hardware than most people would expect.
That is a big deal because it changes the question from "Can I afford to tune anything at all?" to "What is worth adapting for my workflow?"
For independent researchers, testers, and hobbyists, that is the line that matters.
Full fine-tuning #
This is the expensive route. Full fine-tuning updates the entire model and usually needs a lot more compute, memory, storage, and care. It can produce very strong results, but for most independent researchers, security testers, and hobbyists, it is overkill unless there is a very specific reason to do it.
The common mistake is thinking full fine-tuning is the "real" method and LoRA is a toy.
That is the wrong mindset.
For a lot of practical use cases, LoRA or QLoRA is the smart path because it gets you most of the value without demanding ridiculous hardware. The practical question is usually not "How do I do the biggest possible training job?" It is "How do I make this model more useful for the exact work I already do?"
That is why adaptation matters more than reinvention.
Where this breaks #
This tool fails in ways that matter.
It will hallucinate CVE details with confidence. It will mix real vulnerability names with fake version ranges. It will invent flags for tools you use every day. It will write exploit code that looks clean and does nothing. It will produce Python that imports libraries that do not exist, calls functions with the wrong arguments, or quietly skips the part that matters.
It also has a training cutoff problem. If a vulnerability dropped last week, your local model probably does not know it unless you give it the details. If a tool changed its syntax recently, the model may give you the old way. If a target technology is niche, proprietary, or poorly documented, the model may fill gaps with nonsense.
Long agentic tasks break in a different way. The model can lose coherence, forget earlier constraints, repeat failed steps, overwrite useful work, or chase the wrong branch of the problem because the last output looked convincing. Small models are worse here. They can miss subtle logic flaws, misunderstand auth flows, flatten important details, or summarize away the one thing you needed to notice.
This is not a reason to avoid the tool.
It is a reason to operate it correctly.
Use it to accelerate thinking, not replace verification. Make it show its assumptions. Feed it real evidence. Keep the scope tight. Test the commands. Read the code. Confirm the CVEs yourself. Treat confident output as untrusted until it survives contact with the terminal.
Why this matters for security testers #
Once you understand that models can be adapted, the obvious next question is what any of this actually changes for real security work.
This is where local AI stops being a novelty and starts becoming useful.
For security testers, bug hunters, red teamers, reverse engineers, malware analysts, and technical writers, local AI can act like a workflow multiplier.
A huge amount of offensive and defensive work involves repetitive thinking, pattern extraction, documentation, context shifting, and turning messy technical output into something structured.
That is exactly where these models can help.
Report writing #
A local model can help clean up rough notes, turn findings into readable paragraphs, rewrite awkward sections, normalize tone across a report, and help structure remediation language. You still need to know what is true and what matters, but you do not need to waste the same amount of time fighting sentence structure.
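A minimal version of that looks like this; the model tag and notes file are placeholders for whatever you actually run:

$ ollama run qwen2.5:14b "Rewrite these raw finding notes as report-ready paragraphs.
Keep every technical detail, fix the grammar, and do not add anything I did not say.
$(cat finding-notes.txt)"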
Recon analysis #
Recon dumps are messy. Local AI is good at helping organize targets, summarize exposed services, cluster findings, explain naming patterns, spot themes across subdomains, and turn raw reconnaissance into leads worth investigating.
Here is the kind of boring example that actually saves time:
$ cat nmap-output.log
PORT     STATE SERVICE   VERSION
22/tcp   open  ssh       OpenSSH 8.9p1 Ubuntu
80/tcp   open  http      nginx 1.18.0
443/tcp  open  ssl/http  nginx 1.18.0
8080/tcp open  http      Jetty 9.4.48
8443/tcp open  ssl/http  Apache Tomcat/Coyote JSP engine 1.1
$ ollama run qwen2.5:14b "Summarize this nmap output for a web test. Focus on what to check next. Keep it short.
$(cat nmap-output.log)"
Likely web targets:
- 80/443: nginx. Check vhosts, TLS config, redirects, headers, and exposed default content.
- 8080: Jetty. Check admin consoles, default paths, old apps, and proxy exposure.
- 8443: Tomcat/Coyote. Check manager/html, host-manager, default creds, leaked WARs, and version-specific issues.
- 22: SSH. Note version and auth policy, but web services are the priority.
That is not magic. It is triage. The model did not hack anything. It cleaned up the first pass and gave you a checklist you can verify.
Payload generation and scripting support #
Models can help draft one-liners, explain shell behavior, rewrite fragile scripts, convert logic between languages, generate regex, build parsers, and reduce the friction of moving from idea to test.
That does not make the output automatically safe or correct. You still verify everything. But it can save serious time.
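One low-risk way to use it is making the model explain before you execute. Something like the following, with the model tag and file as placeholders:

$ ollama run qwen2.5:14b "Explain this one-liner flag by flag, state your assumptions,
and call out anything destructive before I run it:
$(cat oneliner.sh)"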
Automation and workflow glue #
Once you start combining models with tools, the value jumps. Feeding notes into a local model, having it summarize findings, classify output, suggest next steps, or generate structured content can reduce overhead across repeated tasks.
This is where local AI becomes more than "ask bot question, get bot answer."
It becomes part of an actual operating workflow.
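A minimal sketch of that glue, assuming a notes/ directory of markdown files and a small local model; swap the model tag and paths for your own setup:

# Summarize every note file into a matching file under summaries/.
mkdir -p summaries
for f in notes/*.md; do
  ollama run qwen2.5:7b "Summarize these engagement notes as short bullet points:
$(cat "$f")" > "summaries/$(basename "$f")"
done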
Privacy and control #
For security work, this part matters a lot.
Running a local model means you are not automatically pasting client data, internal notes, target information, or sensitive findings into a third-party service. That alone is enough reason for many people to care.
For some readers, local is not just about preference. It is about control. It is about being able to keep work in a zero-log environment you understand, or even inside an air-gapped workflow when the engagement, client, or lab setup calls for it. In those situations, "local" stops being a nice feature and starts looking a lot more like a compliance, confidentiality, and operational security requirement.
You get more control over:
- what data leaves your machine
- what logs may exist
- what tooling you connect
- how your workflow is structured
That does not automatically make it secure. Local does not mean invincible. You still have to think about model files, supply chain trust, exposed APIs, insecure agent tooling, local secrets, and accidental data leakage.
But it gives you a level of control that cloud-only workflows do not.
The security angle people skip #
And that control cuts both ways.
A lot of AI content focuses on prompts, benchmarks, and vibes. Not enough of it talks about security tradeoffs.
That matters, especially for the audience of this book.
When you start downloading models, running local servers, exposing web UIs, connecting tools, and experimenting with agentic workflows, you are creating a new attack surface.
At minimum, think about the following:
Model trust #
Do not blindly download random model files and run them just because the filename looks exciting. Pay attention to where the model came from, who published it, whether the repo is reputable, and whether the community has actually used it.
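At minimum, confirm the file you ended up with is the file the publisher lists, assuming the model page publishes a checksum at all:

$ sha256sum DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf

Compare that hash against the one on the model page before loading the file into anything.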
Exposed inference endpoints #
A lot of local tools start web servers by default. Some bind only to localhost. Some are easy to expose by accident. If you do not understand what is listening and where, you can end up handing access to anyone who can reach that interface.
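Checking takes seconds. Ollama binds to 127.0.0.1:11434 by default; the other ports below are just common defaults for local AI web UIs, so swap them for whatever you actually run.

$ ss -tlnp | grep -E '11434|3000|7860|8080'
$ curl -s http://127.0.0.1:11434/api/tags | head

If something is bound to 0.0.0.0 that you did not intend, fix that before you worry about prompt injection.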
Tool abuse in agentic workflows #
Once a model can run shell commands, read directories, hit APIs, or write files, bad prompt handling stops being a silly chatbot issue and starts becoming an operational risk. Prompt injection, unsafe tool access, and over-trusting model output become real problems fast.
Sensitive data spillage #
If you are feeding target notes, client evidence, logs, source code, or credentials into a model pipeline, you need to know exactly where that data goes, what stores it, and what can read it later.
Security people should be better than average at thinking this way.
The irony would be using AI to help with security work while building an insecure AI workflow in the process.
Benchmarks are useful, but only if you understand what they are not telling you #
Security tradeoffs are one side of the picture. Performance is the other.
People love benchmark charts because they look objective.
Benchmarks do matter, but they are not the whole story.
A model can score well on a benchmark and still be annoying in real use. It can benchmark well and still format output badly. It can solve canned tasks but struggle with your actual workflow. It can also benchmark lower than another model and still be the one you use daily because it is faster, more stable, and better aligned with the kind of questions you ask.
So treat benchmarks as input, not gospel.
For this audience, one of the most useful real-world measurements is not an abstract leaderboard score, but how the model behaves on your machine.
Things like this matter immediately:
- how fast it starts generating
- how many tokens per second you get
- whether it stays coherent over long outputs
- whether it follows structure well
- whether it handles code, logs, and technical language cleanly
That is why real terminal output is often more useful than pretty charts.
Here is an example from a real run of DeepSeek-R1 32B on a 12GB GPU:
$ ollama run deepseek-r1:32b "say hello" --verbose
Hello! How can I assist you today?
total duration: 26.404497393s
load duration: 191.073865ms
prompt eval count: 5 token(s)
prompt eval duration: 314.954855ms
prompt eval rate: 15.88 tokens/s
eval count: 16 token(s)
eval duration: 4.422921695s
eval rate: 3.62 tokens/s
3.62 tokens per second is what a 32B model looks like when it is spilling out of VRAM onto system RAM. The model is too large to fit cleanly, so it gets slow. That single number tells you more about your hardware/model fit than any benchmark chart will.
When you run a model with verbose output and see actual throughput on your own hardware, you are getting data that matters to your setup, not just a leaderboard.
In the next chapter, when you start running models directly, that kind of hands-on benchmarking will make more sense.
Your first model decision #
Before you close this chapter, you should have a starting point.
If you have less than 12GB of VRAM, do not try to make your machine something it is not. Start with smaller instruct models. Look at 3B, 7B, and efficient 8B-class models in sane quantization levels. Use them for report cleanup, recon summarization, command explanation, scripting help, and fast iteration. This is not the tier for pretending you have a local research lab.
If you have 16GB to 24GB of VRAM, you are in the useful middle. This is where 14B, 27B, and some 32B quantized models start making sense depending on backend, context size, and patience. Use this tier for heavier coding help, longer analysis, more serious reasoning, and security writeups that need structure instead of autocomplete.
If you are CPU-only, keep your expectations sharp. You can still run local models. They will just be slower. Start small, quantized, and practical. Use them for offline notes, short summaries, prompt experiments, and lightweight workflows. Do not judge local AI by forcing a giant model through a CPU and then complaining that it feels dead.
Then pick by use case.
For reasoning, start with a reasoning-tuned instruct model and accept that it may be slower.
For coding, start with a coding-capable instruct model from a strong family and test it against code you actually write.
For general security workflow help, start with a fast general-purpose instruct model that follows structure well.
A practical first setup is simple: pick one fast daily model and one slower stronger model. Use the fast one constantly. Use the stronger one when the problem deserves it. That will teach you more than downloading ten random models and benchmarking yourself into confusion.
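On something like a 12GB card, that pairing can be as simple as the following. The exact tags are examples, not the only right answer:

$ ollama pull qwen2.5:7b        # fast daily driver
$ ollama pull deepseek-r1:14b   # slower, stronger, for problems worth the wait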
The goal is operational usefulness, not model collecting.
What chapter 0x01 should leave you with #
You do not need to know everything yet.
You need enough to stop guessing.
Local AI is not magic and it is not reserved for researchers with datacenter budgets. You can run capable models on hardware you already own. Bigger is not always better. Filenames carry real information. VRAM is the constraint that matters most. Quantization is what makes it practical. Fine-tuning does not mean starting from scratch.
For security testers and researchers, that combination is already useful because it removes friction from the parts of the job that slow you down the most. The point of this chapter is not to make you an expert overnight. It is to make the landscape feel legible enough that the next steps stop looking random.
Treat it like any other tool in your arsenal. Learn the parts. Understand the tradeoffs. Test what actually works on your machine.
In the next chapter, we get hands-on: runtimes, setup, and running your first local model.