Every time I mention running AI models locally in a workshop, I see the same mix of curiosity and scepticism. People assume it requires a server farm, a computer science degree, or at least a weekend of fiddly Linux configuration. The reality in 2026 is very different. With tools like Ollama and LM Studio, you can have Google's Gemma 4 running on your own machine in under ten minutes -- no cloud subscription, no data leaving your network, and no usage caps.
I started running models locally about two years ago, initially out of curiosity. Now it is a core part of how I work. For sensitive client data, for offline train journeys, for rapid prototyping without worrying about API costs -- local inference has become genuinely practical. This guide walks you through the entire setup process from scratch, whether you prefer a command-line workflow or a graphical interface.
Why Run AI Locally?
Before diving into the how, it is worth understanding the why. Cloud-based AI tools like ChatGPT, Claude, and Gemini are excellent, and I use them daily. But they come with trade-offs that matter in specific situations:
- Privacy and data control. When you send a prompt to a cloud AI service, your data travels to their servers. For most casual use, this is fine. But if you are working with client financial data, patient records, proprietary research, or any information governed by compliance requirements, keeping everything on your own hardware eliminates an entire category of risk.
- Zero ongoing cost. Cloud AI services either charge per API call or limit free-tier usage. Running Gemma 4 locally costs nothing after the initial setup. There are no token limits, no monthly caps, and no surprise bills. If you make heavy use of AI throughout the day, the savings add up quickly.
- Offline availability. Local models work without an internet connection. I have written entire sections of training materials on flights and train rides using Gemma running on my laptop. Cloud services are useless in those situations.
- Customisation and integration. Running locally gives you full control over the model's behaviour. You can adjust parameters like temperature and context length, create custom system prompts, and integrate the model into your own scripts and applications through a local API.
- No rate limiting. Cloud free tiers throttle usage during peak times. Locally, the only bottleneck is your hardware, and it is consistently available.
The trade-off is that local models are generally slower than cloud-hosted ones (unless you have high-end hardware), and the largest models may not fit on consumer machines. The key is choosing the right model size for your hardware. For a detailed comparison of how Gemma 4 stacks up against cloud services, see my Gemma 4 vs ChatGPT vs Claude vs Copilot comparison.
Hardware Requirements
Gemma 4 comes in several sizes, and the hardware you need depends entirely on which variant you choose. Here is a practical breakdown based on my testing across different machines:
| Gemma 4 Variant | Parameters | Minimum RAM | Recommended GPU VRAM | Disk Space | Suitable For |
|---|---|---|---|---|---|
| Gemma 4 2B | 2 billion | 8 GB | None required (CPU-only works) | ~2 GB | Quick tasks, older laptops, testing |
| Gemma 4 9B | 9 billion | 16 GB | 6 GB+ (recommended) | ~6 GB | General use, good quality-speed balance |
| Gemma 4 12B | 12 billion | 16 GB | 8 GB+ | ~8 GB | Strong reasoning, Excel formulas, writing |
| Gemma 4 27B | 27 billion | 32 GB | 16-24 GB | ~18 GB | Best quality, complex analysis, coding |
A note on Apple Silicon Macs: If you have an M1, M2, M3, or M4 Mac, you are in a particularly good position. Apple Silicon uses unified memory, meaning the GPU and CPU share the same RAM pool. A MacBook Pro with 32 GB of unified memory can comfortably run the 27B model. A MacBook Air with 16 GB handles the 12B model well. The Metal framework acceleration in both Ollama and LM Studio is mature and performs excellently on these machines.
For Windows and Linux users: The most important component is your GPU. An NVIDIA RTX 3060 (12 GB VRAM) or better handles the 12B model comfortably. An RTX 4070 or 4080 handles the 27B model. AMD GPUs work through ROCm support in Ollama, though NVIDIA CUDA remains the more stable option. Running on CPU only is possible for smaller models but will be noticeably slower.
Method 1: Ollama -- The Command-Line Approach
Ollama is the tool I recommend for most power users. It is fast, lightweight, and stays out of your way. It runs as a background service and provides both a CLI and a local REST API, making it easy to integrate into scripts, applications, and automation workflows.
Step 1: Install Ollama
Download Ollama from ollama.com. Installers are available for macOS, Windows, and Linux.
On macOS: Download the .dmg file, drag Ollama to Applications, and launch it. It will install itself and run as a menu bar application.
On Windows: Download the installer and run it. Ollama will install as a system service that starts automatically.
On Linux: The simplest installation is through the install script:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
After installation, verify it is working by opening a terminal and running:
```shell
ollama --version
```
You should see a version number. If you see a "command not found" error on Windows, close and reopen your terminal to pick up the PATH changes.
Step 2: Pull the Gemma 4 Model
With Ollama installed, downloading Gemma 4 is a single command. Choose the variant that matches your hardware:
```shell
# For the 2B model (fastest, lowest requirements)
ollama pull gemma4:2b

# For the 12B model (good balance of quality and speed)
ollama pull gemma4:12b

# For the 27B model (best quality, needs strong hardware)
ollama pull gemma4:27b
```
Download sizes roughly match the disk-space column in the table above -- about 2 GB for the 2B model up to about 18 GB for the 27B model. On a typical broadband connection, the 12B model takes 5-10 minutes to download. Ollama handles quantisation automatically -- the default quantisation (Q4_K_M) provides an excellent balance between quality and performance.
Step 3: Run Your First Prompt
Once the model is downloaded, start an interactive chat session:
```shell
ollama run gemma4:12b
```
This drops you into a chat interface in your terminal. Type a prompt and press Enter:
```
>>> Write an Excel SUMIFS formula that totals column D where column B is "North" and column C is "Electronics"

=SUMIFS(D:D, B:B, "North", C:C, "Electronics")

This formula uses SUMIFS to sum values in column D based on two criteria:
- Column B must equal "North"
- Column C must equal "Electronics"
...
```
To exit the chat, type /bye and press Enter.
Step 4: Use the Ollama API
One of Ollama's best features is its built-in REST API, which runs on localhost:11434 by default. This lets you integrate Gemma 4 into scripts, applications, and workflows. Here is a simple example using curl:
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:12b",
  "prompt": "Write a nested IF formula that classifies exam scores: 90+ is Distinction, 75-89 is Merit, 60-74 is Pass, below 60 is Fail.",
  "stream": false
}'
```
The API is compatible with many tools and libraries. If you use Python, the ollama Python package provides a clean interface:
```python
import ollama

response = ollama.generate(
    model='gemma4:12b',
    prompt='Write an INDEX-MATCH formula with two criteria columns.'
)
print(response['response'])
```
This API compatibility makes Ollama particularly useful for building custom tools -- for example, a batch formula generator that processes a list of Excel formula requests from a text file.
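As a sketch of that batch idea, here is a minimal runner using only the Python standard library. The endpoint and the `gemma4:12b` tag come from the examples above; the helper names and the `requests.txt` file are illustrative, not part of Ollama itself:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def build_payload(prompt, model="gemma4:12b"):
    """Build the JSON body for a non-streaming /api/generate request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})


def generate(prompt, model="gemma4:12b"):
    """POST one prompt to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def batch_generate(prompts, ask=generate):
    """Run each formula request through the model in turn."""
    return [ask(p) for p in prompts]


# Usage (requires a running Ollama server):
# formulas = batch_generate(open("requests.txt").read().splitlines())
```

Factoring the network call into the `ask` parameter keeps the batching logic testable without a server running.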
Step 5: Customise With a Modelfile
For recurring use cases, you can create a custom Modelfile that sets a system prompt, adjusts parameters, and gives your configuration a name. Create a file called ExcelHelper with the following content:
```
FROM gemma4:12b
SYSTEM "You are an expert Excel formula assistant. Always provide the formula first, then explain each component. Mention any version requirements (e.g., Microsoft 365 only). Flag potential issues with blank cells or errors. Use British English spelling."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
```
Then create the custom model:
```shell
ollama create excel-helper -f ExcelHelper
```
Now you can use it with:
```shell
ollama run excel-helper
```
The lower temperature (0.3) makes the model more deterministic, which is exactly what you want for formula generation where creativity is less valuable than precision.
Method 2: LM Studio -- The GUI Approach
If you prefer a visual interface over the command line, LM Studio is an excellent alternative. It provides a ChatGPT-style chat interface, a model browser, and a built-in server -- all in a desktop application.
Step 1: Install LM Studio
Download LM Studio from lmstudio.ai. It is available for macOS, Windows, and Linux. The installation is a standard desktop application installer -- no terminal required.
Step 2: Download Gemma 4
Open LM Studio and navigate to the Discover tab (the magnifying glass icon). Search for "Gemma 4" and you will see available variants. Click the download button next to the variant you want. LM Studio shows you the file size, quantisation level, and estimated RAM usage before you download, which helps you choose the right model for your hardware.
I recommend starting with the 12B Q4_K_M quantisation. It offers the best quality-to-performance ratio for most consumer hardware. If your machine has less than 16 GB of RAM, go with the 9B or 2B variant instead.
Step 3: Start a Chat
Switch to the Chat tab (the speech bubble icon). Select your downloaded Gemma 4 model from the dropdown at the top. You will see a familiar chat interface where you can type prompts and receive responses, exactly like ChatGPT but running entirely on your machine.
The first response after loading a model takes a few seconds longer as the model loads into memory. Subsequent responses are faster. You can adjust settings like temperature, max tokens, and system prompt from the right-hand panel.
Step 4: Use the Built-in Server
LM Studio also includes a local server that exposes an OpenAI-compatible API. Navigate to the Server tab (the arrows icon), select your model, and click Start Server. This creates an API endpoint at http://localhost:1234/v1 that is compatible with any tool or library designed for the OpenAI API.
This means you can use LM Studio as a drop-in replacement for OpenAI in many applications by simply changing the base URL to http://localhost:1234/v1 and using any string as the API key.
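A stdlib-only sketch of that drop-in usage, without the OpenAI client library, might look like this. The request shape follows the OpenAI chat completions format; the model name `gemma-4-12b` is a placeholder -- LM Studio shows the exact identifier for each model you load:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address


def chat_payload(prompt, model="gemma-4-12b", temperature=0.3):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def chat(prompt, **kwargs):
    """Send a prompt to LM Studio's local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(prompt, **kwargs)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",  # any string works as the key
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Swapping `BASE_URL` back to a cloud provider's endpoint (and using a real key) is the only change needed, which is the point of the OpenAI-compatible server.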
Performance Comparison: Local Gemma 4 vs Cloud AI
Here is what you can realistically expect in terms of performance, based on my testing across different hardware configurations:
| Metric | Gemma 4 12B (Local, RTX 4070) | Gemma 4 27B (Local, RTX 4080) | ChatGPT (GPT-4o, Cloud) | Claude (Cloud) |
|---|---|---|---|---|
| First Token Latency | ~1.5 seconds | ~3 seconds | ~0.5 seconds | ~0.8 seconds |
| Generation Speed | ~20-30 tokens/sec | ~10-15 tokens/sec | ~50-80 tokens/sec | ~40-60 tokens/sec |
| Excel Formula Quality | Very good | Excellent | Excellent | Excellent |
| Cost Per Query | Electricity only | Electricity only | Free tier limited | Paid |
| Rate Limits | None | None | Yes (free tier) | Yes |
| Data Privacy | Full -- nothing leaves your machine | Full -- nothing leaves your machine | Data sent to OpenAI | Data sent to Anthropic |
| Offline Use | Yes | Yes | No | No |
The speed difference is noticeable but not deal-breaking for most workflows. When I use Gemma 4 locally for Excel formula generation, a typical response takes 3-5 seconds including a full explanation. That is slower than ChatGPT's near-instant replies, but fast enough that it does not disrupt my workflow. For a detailed accuracy comparison of Gemma 4 against GPT-4o and Llama 4 specifically for Excel formulas, see my Gemma 4 vs GPT-4o vs Llama 4 for Excel comparison.
When Local Makes Sense vs When Cloud Is Better
Running AI locally is not always the right choice. Here is how I think about the decision:
Use Local Gemma 4 When:
- You are working with sensitive or confidential data that cannot leave your network
- You need consistent, unlimited access without worrying about rate limits or subscription costs
- You are working offline or in environments with unreliable internet
- You want to build custom tools, scripts, or integrations using the local API
- You are doing repetitive batch processing (for example, generating formulas for 50 different Excel scenarios)
- You want full control over model parameters and system prompts
Use Cloud AI (ChatGPT, Claude, Gemini) When:
- You need the fastest possible response times and your data is not sensitive
- You need access to the absolute largest models (GPT-4o, Claude Opus, Gemini Ultra) which are too large for consumer hardware
- You want multimodal capabilities like image analysis or file uploads
- You are working on a machine without a capable GPU (for example, a Chromebook or thin client)
- You need features like web browsing, plugin ecosystems, or built-in tools that cloud platforms provide
In practice, most power users end up using both. I use Gemma 4 locally for formula generation, code assistance, and data analysis prompts. I use cloud services when I need image understanding, very long context windows, or the latest model capabilities. For a deeper dive into where Gemma 4 outperforms even paid cloud models, see 5 real-world tasks where Gemma 4 beats paid AI models.
Practical Tips for Getting the Best Output Quality Locally
Running a model locally gives you more control, but it also means you are responsible for the configuration. Here are the adjustments that make the biggest difference in my experience:
Choose the Right Quantisation
Quantisation reduces the model's memory footprint by using lower-precision numbers. The default Q4_K_M quantisation is the best starting point for most users -- it reduces memory usage by roughly 60% with minimal quality loss. If you have plenty of VRAM, try Q5_K_M or Q6_K for slightly better quality. Avoid Q2 or Q3 quantisations unless you are severely hardware-constrained, as the quality drop becomes noticeable.
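A quick back-of-envelope calculation shows why this matters. A model file's size is roughly parameter count times bits per weight; the ~4.5 bits-per-weight figure for Q4_K_M is a common rule of thumb rather than an exact specification, and the estimate ignores file overhead and mixed-precision layers:

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters x bits per weight, converted to GB."""
    # 1e9 parameters and 1e9 bytes per GB cancel out
    return params_billion * bits_per_weight / 8


# A 12B model at ~4.5 bits/weight (Q4_K_M) vs full 16-bit precision:
q4_size = approx_size_gb(12, 4.5)    # ~6.75 GB
fp16_size = approx_size_gb(12, 16)   # 24.0 GB
```

The roughly 3.5x reduction is what lets a 12B model fit on an 8 GB GPU at all.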
Set an Appropriate Context Length
The context length determines how much conversation history the model can consider. For Excel formula work, 4096 tokens is usually sufficient. For longer conversations or document analysis, increase it to 8192 or beyond. Note that longer context lengths require more memory. In Ollama, set this with the num_ctx parameter. In LM Studio, adjust it in the model settings panel.
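If you call the model through the Ollama API rather than the CLI, the same setting travels in the request's `options` object. A sketch of a request body (the `options.num_ctx` field is documented by Ollama; the prompt is illustrative):

```json
{
  "model": "gemma4:12b",
  "prompt": "Summarise the assumptions in this report...",
  "options": { "num_ctx": 8192 },
  "stream": false
}
```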
Use System Prompts Effectively
A well-crafted system prompt dramatically improves output consistency. For Excel work, I use a system prompt that instructs the model to provide the formula first, explain each component, mention version requirements, and flag potential issues. This eliminates the need to ask follow-up questions for every response.
Lower the Temperature for Technical Tasks
Temperature controls randomness in the model's output. For creative writing, higher temperature (0.7-0.9) adds variety. For technical tasks like formula generation, code writing, or data analysis, lower temperature (0.1-0.4) produces more consistent, deterministic output. I use 0.3 for Excel formula work and 0.2 for code generation.
Keep the Model Loaded
Both Ollama and LM Studio keep the model in memory after the first prompt, making subsequent responses faster. Avoid stopping and restarting the model between prompts. In Ollama, the model stays loaded for 5 minutes of inactivity by default. You can extend this with the OLLAMA_KEEP_ALIVE environment variable.
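For example, to keep models resident for an hour, set the variable in whatever environment launches the Ollama service, then restart it (the `1h` duration format follows Ollama's convention):

```shell
# Keep loaded models in memory for one hour instead of the default five minutes
export OLLAMA_KEEP_ALIVE=1h
```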
Understanding the Relationship Between Gemma and Gemini
One of the most common questions I get is about the difference between Gemma and Gemini. Both are from Google, but they serve very different purposes. Gemma is the open-weight model family designed for local deployment and customisation. Gemini is Google's proprietary cloud model integrated into Google Workspace, Bard, and the Gemini API. For a thorough comparison of capabilities, use cases, and when to choose which, see my dedicated Gemma 4 vs Gemini guide.
Frequently Asked Questions
What hardware do I need to run Gemma 4 locally?
The minimum hardware depends on the model size. The 2B parameter variant runs on 8 GB of RAM with no dedicated GPU. The 12B variant needs 16 GB of RAM and runs best with a GPU that has at least 8 GB of VRAM. The 27B variant requires 32 GB of RAM and a GPU with 16-24 GB of VRAM for comfortable performance. Apple Silicon Macs with unified memory handle these models particularly well.
Is Ollama or LM Studio better for running Gemma 4?
Ollama is better if you prefer a command-line workflow, want to use the API for automation, or need to integrate Gemma 4 into scripts and applications. LM Studio is better if you want a visual chat interface, prefer downloading and managing models through a GUI, or are new to running local AI models. Both are free and both produce identical output quality for the same model.
How does local Gemma 4 compare to cloud ChatGPT in speed?
Cloud ChatGPT is faster for most users because OpenAI runs it on enterprise-grade hardware. Local Gemma 4 on a mid-range GPU (RTX 4070 or similar) generates roughly 15-30 tokens per second with the 12B model, which feels responsive for conversational use. The 27B model is slower, at 8-15 tokens per second on comparable hardware. The trade-off is that local runs have no rate limits and complete privacy.
Can I use Gemma 4 locally for Excel formula generation?
Yes, and it works very well. The 12B and 27B variants of Gemma 4 produce accurate Excel formulas including SUMIFS, INDEX-MATCH, nested IFs, and array formulas. You can prompt it through the Ollama CLI, through LM Studio's chat interface, or by building a simple script that calls the Ollama API. See my companion guide on Gemma 4 vs GPT-4o vs Llama 4 for Excel for detailed accuracy comparisons.
Sources & Further Reading
- Google Gemma -- Official Documentation
- Ollama -- Official Site and Downloads
- LM Studio -- Official Site and Downloads
- Google Models on Hugging Face
Related Posts
- Gemma 4 vs Gemini: What's the Difference and When to Use Which
- Gemma 4 vs ChatGPT vs Claude vs Copilot: Best AI Model Comparison in 2026
- Gemma 4 vs GPT-4o vs Llama 4: Which Free AI Model Is Best for Excel Formulas?
- 5 Real-World Tasks Where Gemma 4 Beats Paid AI Models
- Gemma 4 for Data Analysis: Can It Replace ChatGPT for Spreadsheet Work?
Want to use AI tools more effectively?
My courses cover practical AI workflows, from spreadsheet formulas to app development, with real projects and honest tool comparisons.
Browse all courses