Gemma 4 running locally means your data never leaves your machine. That makes it practical for workflows involving sensitive files, proprietary code, or data you simply do not want to upload to a cloud API.
This guide covers specific local AI workflows you can build with Gemma 4 — not just chatting with the model, but automating real tasks on your own files.
Quick answer
Run Gemma 4 locally with Ollama or LM Studio, then build scripts or pipelines that send your files and prompts to the local model. Common workflows include code review, document summarisation, data extraction, and batch text processing.
This approach fits when:
- You need AI processing but cannot send data to external APIs.
- You want to automate repetitive tasks on local files.
- You are building internal tools that need AI without ongoing API costs.
Setting up Gemma 4 for workflow use
Install Ollama or LM Studio and pull the Gemma 4 model. For workflow automation, this guide uses Ollama because it exposes a plain HTTP API on localhost:11434 that you can call from any language, with no SDK lock-in and no cloud relay.
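To make the "plain HTTP API" point concrete, here is a minimal sketch that talks to the local server using only the Python standard library. It assumes Ollama's /api/generate endpoint on the default port with a non-streaming request; the function names build_request and ask_over_http are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL = "gemma4:12b"  # the tag used throughout this guide


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming POST request for the local generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_over_http(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running and the model pulled, ask_over_http("Reply with the single word: ready") should return a short reply. The same request shape works from any language that can send JSON over HTTP.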
Choose the right model size for your hardware. Gemma 4 12B works on most machines with 16GB RAM and handles extraction, summarisation, and code review cleanly. Drop to 4B on 8GB machines. Step up to 27B only if you have 32GB+ RAM and quality matters more than throughput — the 27B pass is roughly three times slower per request.
Three-minute setup check
- Install Ollama from ollama.com/download, then pull the model: ollama pull gemma4:12b.
- Verify the server is running: curl http://localhost:11434/api/tags. You should see gemma4:12b in the JSON response.
- Install the Python client once per project: pip install ollama. That is the only dependency you need for every pattern below.
If you need a deeper walkthrough of the install, see how to run Gemma 4 locally. The rest of this tutorial assumes the three-step check above passes.
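If you would rather script the first two checks than run them by hand, something like the following could sit at the top of a batch script. It is a sketch that assumes the /api/tags response lists models under a "models" array with a "name" field; model_names and check_setup are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_payload: dict) -> list:
    """Return the model tags listed in an /api/tags response body."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def check_setup(model: str = "gemma4:12b", timeout: float = 5.0) -> str:
    """Return a one-line status: server down, model missing, or ready."""
    try:
        with urllib.request.urlopen(TAGS_URL, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError) as exc:
        return f"server not reachable: {exc}"
    if model not in model_names(payload):
        return f"server up, but {model} is not pulled"
    return "ready"
```

Calling check_setup() before a long batch and stopping unless it returns "ready" avoids discovering a missing model halfway through the night.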
File processing workflows
The most immediately useful local AI workflow is batch file processing — sending multiple files through the model for analysis, extraction, or transformation.
Write a simple script that reads each file, sends it to the local model with a specific prompt, and saves the result.
- Code review: analyse files for bugs, style issues, or security concerns
- Document summarisation: condense long documents into key points
- Data extraction: pull structured data from unstructured text files
- Content tagging: classify or tag files based on their content
Four ready-to-run pipeline patterns
Most local-AI batch workflows reduce to one of four shapes. Each one is a handful of lines of Python you can lift straight into a script. All four assume you ran the pip install ollama step above.
The shared helper used in every pattern below — keep it in one place and call it from each script:

from ollama import generate

MODEL = "gemma4:12b"

def ask_gemma(prompt: str, context: str = "") -> str:
    # Put the file content first and the instruction last.
    full_prompt = f"{context}\n\n{prompt}" if context else prompt
    # The generated text is under the "response" key of the reply.
    return generate(model=MODEL, prompt=full_prompt)["response"]

Pattern 1 — Map: one prompt per file, results to disk
The workhorse pattern. Use it for summarisation, classification, code review, anything where each file stands alone. Writing the result to disk per iteration matters — if the batch crashes at file 73, you still have files 1 through 72 on disk.

from pathlib import Path

OUT = Path("summaries")
OUT.mkdir(exist_ok=True)

for path in Path("docs").glob("*.txt"):
    out_file = OUT / f"{path.stem}.md"
    if out_file.exists():
        continue  # resume-safe: skip files already done
    summary = ask_gemma(
        "Summarise in 3 bullet points. Return markdown only.",
        path.read_text(encoding="utf-8"),
    )
    out_file.write_text(summary, encoding="utf-8")
    print(f"✓ {path.name}")

Pattern 2 — Filter: route files by their content
Ask the model a yes/no question per file and move the file to a folder based on the answer. Useful for triaging log files, sorting support tickets, or flagging code that touches sensitive modules.

import shutil
from pathlib import Path

FLAGGED = Path("flagged")
FLAGGED.mkdir(exist_ok=True)
CLEAN = Path("clean")
CLEAN.mkdir(exist_ok=True)

# sorted() takes a snapshot of the listing before any file is moved.
for path in sorted(Path("inbox").glob("*.log")):
    verdict = ask_gemma(
        "Reply with exactly one word: FLAG or CLEAN. "
        "Reply FLAG if the log shows a stack trace, auth failure, or "
        "data-loss error. Otherwise CLEAN.",
        path.read_text(encoding="utf-8")[:6000],
    ).strip().upper()
    dest = FLAGGED if verdict.startswith("FLAG") else CLEAN
    shutil.move(str(path), dest / path.name)

Two tricks here: forcing single-word output keeps parsing trivial, and trimming to the first 6,000 characters prevents context-window overflow on huge log files.
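One caveat with the single-word approach: any reply that does not start with FLAG is treated as clean, including a chatty answer that ignores the instruction. A possible refinement, sketched below, is a small parser with a third outcome so unexpected replies can go to a review folder instead. The UNSURE bucket and the parse_verdict name are assumptions for illustration, not part of the pattern above.

```python
def parse_verdict(raw: str) -> str:
    """Map a model reply to FLAG, CLEAN, or UNSURE."""
    words = raw.strip().upper().split()
    # Look only at the first word, ignoring punctuation and markdown markers.
    first = words[0].strip(".,:;!\"'`*") if words else ""
    if first in ("FLAG", "CLEAN"):
        return first
    return "UNSURE"  # anything else is left for a human to check
```

Routing UNSURE files to their own folder keeps a misbehaving prompt from silently marking every log as clean.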
Pattern 3 — Extract: unstructured text to structured JSON
Ask for JSON, parse it, and use it. This turns Gemma 4 into a schema-aware extraction tool that runs entirely offline — no third-party OCR-to-JSON service touching your contracts, CVs, or invoices.

import json
from pathlib import Path

rows = []
for path in Path("invoices").glob("*.txt"):
    raw = ask_gemma(
        "Extract this invoice as JSON with keys: vendor, invoice_number, "
        "date_iso, total_gbp. Return only the JSON object, no prose.",
        path.read_text(encoding="utf-8"),
    )
    # Drop a markdown fence if present: the backticks first, then the
    # "json" language tag they leave behind (removeprefix needs Python 3.9+).
    cleaned = raw.strip().strip("`").strip().removeprefix("json").strip()
    try:
        rows.append(json.loads(cleaned))
    except json.JSONDecodeError:
        print(f"⚠ parse failed: {path.name}")

Path("invoices.json").write_text(json.dumps(rows, indent=2), encoding="utf-8")

Two habits that save hours: log the filename when JSON parsing fails (so you know which file to inspect), and strip the backticks and the json language tag in case the model wraps output in a markdown code fence.
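Parsing is only half of extraction: a reply can be valid JSON and still miss a key. The sketch below shows one way to validate a reply before trusting it, reusing the four keys from the prompt above. The parse_invoice helper and its "return None on any problem" policy are illustrative choices, not the only reasonable design.

```python
import json
from typing import Optional

REQUIRED_KEYS = {"vendor", "invoice_number", "date_iso", "total_gbp"}


def parse_invoice(raw: str) -> Optional[dict]:
    """Return the parsed invoice, or None if the reply is not usable."""
    # Keep only the outermost {...} span, which drops fences and stray prose.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    # Reject replies that parse but lack one of the requested keys.
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data
```

A None result is a signal to log the filename and move on, exactly as the parse-failure branch does in the loop above.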
Pattern 4 — Map-reduce: documents larger than the context window
When a file exceeds Gemma 4's context window, split it into chunks, summarise each chunk, then summarise the summaries. This keeps the first pass cheap and lets the second pass see the whole document compressed into its key points.

from pathlib import Path

def chunk_by_paragraphs(text: str, size: int = 20) -> list[str]:
    # Group the non-empty paragraphs into blocks of `size` paragraphs each.
    paras = [p for p in text.split("\n\n") if p.strip()]
    return ["\n\n".join(paras[i:i+size]) for i in range(0, len(paras), size)]

long_doc = Path("report.md").read_text(encoding="utf-8")

# Map step: summarise each chunk on its own.
partials = [
    ask_gemma("Summarise in 5 bullet points:", chunk)
    for chunk in chunk_by_paragraphs(long_doc)
]

# Reduce step: merge the partial summaries into one.
final = ask_gemma(
    "Merge these section summaries into one 10-bullet executive summary.",
    "\n\n---\n\n".join(partials),
)
Path("report-summary.md").write_text(final, encoding="utf-8")

Performance and context limits
Local models are slower than cloud APIs. Plan for 10-30 seconds per request depending on input length and hardware. For batch jobs, this adds up — 100 files at 20 seconds each is over 30 minutes.
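The arithmetic is worth doing before you start a batch. A tiny estimator like this one, which assumes requests run strictly one after another, makes the planning explicit (batch_minutes is a hypothetical helper name):

```python
def batch_minutes(n_files: int, seconds_per_request: float,
                  requests_per_file: int = 1) -> float:
    """Estimated wall-clock minutes for a sequential batch."""
    return n_files * requests_per_file * seconds_per_request / 60
```

For the example above, batch_minutes(100, 20) comes to about 33 minutes; at the 10 and 30 second ends of the range, the same 100 files take roughly 17 and 50 minutes.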
Watch context window limits. Gemma 4 handles 8K-128K tokens depending on the variant. If your files are large, split them or summarise sections separately.
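A quick pre-check can tell you which files need splitting before you send anything. The sketch below uses a rough ratio of four characters per token, which is an assumption rather than a real tokenizer count, and reserves some room for the instruction and the reply. Both function names are illustrative.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the ratio is an assumption, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_context(text: str, context_tokens: int, reserve_tokens: int = 1024) -> bool:
    """True if the text likely fits, leaving room for the prompt and the reply."""
    return estimate_tokens(text) <= context_tokens - reserve_tokens
```

Files that fail fits_context are candidates for the map-reduce pattern. Note that Ollama applies its own context length setting (the num_ctx option), which may be lower than the model's maximum, so it is worth checking the effective value on your install.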
When to stay local vs. use an API
Use local Gemma 4 when privacy matters, when you are processing many files (avoiding per-token API costs), or when you need to work offline. Use a cloud API when you need the fastest possible response, the highest quality model, or when the data is not sensitive.
You can also build hybrid workflows — use local Gemma for initial processing and a cloud API for final quality checks on the most important outputs.
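One way to picture a hybrid workflow is as a small routing function. In this sketch the three callables are placeholders you would supply yourself: local_ask wraps the local model, cloud_ask wraps whichever cloud API you use, and is_important decides which inputs deserve the second pass. None of them refers to a real library.

```python
from typing import Callable


def hybrid_review(
    text: str,
    local_ask: Callable[[str], str],
    cloud_ask: Callable[[str], str],
    is_important: Callable[[str], bool],
) -> str:
    """Draft locally; send only the draft of important items for a cloud check."""
    draft = local_ask(text)
    if not is_important(text):
        return draft  # most items never leave the machine
    # Only the locally generated draft is sent on, not the original text.
    return cloud_ask(f"Check and correct this draft:\n\n{draft}")
```

Keep in mind that the draft is derived from the original file, so it can still carry sensitive details. If the source data must stay private, the draft probably should too.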
Worked example: local code documentation generator
You point a script at your project directory. It sends each source file to local Gemma 4, which generates a one-paragraph summary of what the file does. The script collects all summaries into a project overview document. Total cost: zero API fees. Total data leaked: none.
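A minimal sketch of that generator might look like the following. It takes the model call as a parameter named ask with the same (prompt, context) signature as the ask_gemma helper, so it can be tried with a stand-in function first. The build_overview name, the prompt wording, and the default "*.py" pattern are assumptions for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable

PROMPT = "Describe in one paragraph what this source file does."


def build_overview(
    project_dir: Path,
    ask: Callable[[str, str], str],
    patterns: Iterable[str] = ("*.py",),
) -> str:
    """Summarise each matching source file and join the results into one document."""
    sections = []
    for pattern in patterns:
        for path in sorted(project_dir.rglob(pattern)):
            source = path.read_text(encoding="utf-8", errors="replace")
            summary = ask(PROMPT, source).strip()
            # One section per file: relative path, then the summary.
            sections.append(f"{path.relative_to(project_dir)}\n{summary}")
    return "\n\n".join(sections)
```

In a real run you would pass ask_gemma as the ask argument and write the returned string to a file of your choosing. For large projects, consider writing each summary to disk as it arrives rather than joining everything at the end.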
Common mistakes
- Trying to process files larger than the model's context window without splitting. Use Pattern 4 (map-reduce) above.
- Using a model variant too large for your hardware, which causes swapping and extreme slowdown. Verify GPU offload with ollama ps.
- Building complex multi-step prompts instead of simple pipelines. Three focused steps beat one ambitious prompt every time.
- Accumulating results in memory instead of writing each one to disk. One crash and the whole batch is gone.
- Restarting the Ollama server between requests. It reloads the model from disk every time, costing 10-40 seconds per call.
Troubleshooting batch workflows
Five errors account for almost every failed overnight batch. Check them in this order before reaching for a bigger debug tool.
- "connection refused" on port 11434. The Ollama server is not running. On Windows, open the Ollama app; on macOS/Linux run ollama serve in a spare terminal. Confirm with curl http://localhost:11434/api/tags, which should return JSON, not an error.
- "model not found: gemma4". You pulled a different tag than the script expects. Run ollama list to see what is actually on disk, then either pull the tag the script uses (ollama pull gemma4:12b) or change MODEL in the helper to match what you have.
- Batch speed collapses after the first few files. The model is being unloaded between requests. Set OLLAMA_KEEP_ALIVE=24h in your environment before launching Ollama, so the model stays resident in memory for the whole batch.
- Random JSON parse failures on Pattern 3. Gemma 4 occasionally wraps its output in ```json … ``` fences. Strip the backticks before json.loads (the Pattern 3 snippet strips them), remove the leading json language tag that backtick-stripping alone leaves behind, and tighten the prompt: "Return only the JSON object. Do not wrap in markdown."
- GPU offload silently failed and the model is running on CPU. Check ollama ps: if the PROCESSOR column shows 100% CPU, the model is not using the GPU, most often because it did not fit in VRAM. Drop to gemma4:4b, or lower num_gpu layers via a custom Modelfile so part of the model runs on GPU and part on CPU.
FAQ
Which Gemma 4 variant should I use for batch file processing?
Gemma 4 12B is the best starting point on a 16GB machine — large enough to follow extraction instructions cleanly, small enough to finish a 100-file batch in reasonable time. Drop to 4B on 8GB machines, or step up to 27B if you have 32GB+ RAM and quality matters more than throughput.
How do I call Ollama from Python for a batch workflow?
Install the ollama Python package (pip install ollama) and call ollama.generate(model='gemma4:12b', prompt=...). That hits the local API on port 11434 and returns a response you can index like a dict, with the generated text under the "response" key. The request never leaves your machine and needs no API key. For batch work, wrap the call in a loop over pathlib.Path.glob() and write each result to disk so a crash mid-batch does not lose finished work.
Why is my batch slow even on a powerful machine?
Three usual culprits: model reloading between requests (keep the Ollama server running and send sequential calls — do not restart it), context window bloat (trim each file down to what the prompt actually needs), and CPU fallback (verify GPU offload with ollama ps — if layers show 0% GPU, your model is too big for VRAM).
How do I split files that exceed the context window?
Chunk by semantic boundary, not by character count. For code, split per function or per class. For prose, split per H2 or per block of 10 to 20 paragraphs. Summarise each chunk with the same prompt, then run a second pass that summarises the summaries. Pattern 4 above implements this map-reduce using 20-paragraph blocks.
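For markdown prose, splitting per H2 can be done with a few lines of string handling. This is a sketch that treats any line starting with "## " as a section boundary, so a line like that inside a code sample would also be treated as a heading. The split_by_h2 name is hypothetical.

```python
def split_by_h2(markdown: str) -> list:
    """Split markdown text into sections, starting a new one at each '## ' heading."""
    sections = [[]]
    for line in markdown.splitlines():
        # Start a new section at a heading, unless the current one is still empty.
        if line.startswith("## ") and sections[-1]:
            sections.append([])
        sections[-1].append(line)
    # Drop sections that contain only whitespace.
    return ["\n".join(s).strip() for s in sections if "".join(s).strip()]
```

Each returned section keeps its heading, which gives the model useful context when it summarises the chunk. Text before the first heading comes back as its own section.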
Can I run a Gemma 4 batch overnight without supervision?
Yes, but add three guardrails: write each result to disk immediately (do not accumulate in memory), wrap each call in a try/except that logs the filename and continues, and keep a simple processed.txt ledger so a re-run skips files already done. Pair that with OLLAMA_KEEP_ALIVE=24h so the model stays warm overnight and you do not pay a reload cost on every request.
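The three guardrails fit into one short loop. The sketch below is one possible arrangement, with the model call passed in as ask (same signature as the ask_gemma helper) so the loop can be exercised with a stand-in first. The run_batch name and the errors.log file are assumptions; processed.txt is the ledger described above.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_batch(
    paths: Iterable[Path],
    ask: Callable[[str, str], str],
    prompt: str,
    out_dir: Path,
    ledger: Path = Path("processed.txt"),
    errors: Path = Path("errors.log"),
) -> int:
    """Process each file once; return how many files succeeded in this run."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = set(ledger.read_text(encoding="utf-8").splitlines()) if ledger.exists() else set()
    succeeded = 0
    for path in paths:
        key = str(path)
        if key in done:
            continue  # finished in an earlier run
        try:
            result = ask(prompt, path.read_text(encoding="utf-8"))
            # Guardrail 1: write the result to disk immediately.
            (out_dir / f"{path.stem}.md").write_text(result, encoding="utf-8")
        except Exception as exc:
            # Guardrail 2: log the filename and keep going.
            with errors.open("a", encoding="utf-8") as log:
                log.write(f"{key}\t{exc!r}\n")
            continue
        # Guardrail 3: record the file only after its result is on disk.
        with ledger.open("a", encoding="utf-8") as log:
            log.write(key + "\n")
        succeeded += 1
    return succeeded
```

Failed files are not added to the ledger, so a plain re-run retries only those. Output names are based on the file stem, so two inputs with the same stem in different folders would overwrite each other; include the parent folder in the name if that applies to you.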
When to use something else
To set up Gemma 4 locally, see running Gemma 4 on your own machine. For using Gemma 4 in your code editor, see Gemma 4 in VS Code.
How to apply this in a real AI project
The patterns in this guide become much more useful once they are tied to the rest of the workflow around them. In real work, the result depends on model selection, prompt design, tool integration, evaluation, and the operational reality of shipping AI features, not only on following one tip correctly.
That is why the biggest win rarely comes from one clever move in isolation. It comes from making the surrounding process easier to review, easier to repeat, and easier to hand over when another person inherits the codebase later.
- Test with realistic inputs before shipping, not just the examples that inspired the idea.
- Keep the human review step visible so the workflow stays trustworthy as it scales.
- Measure what matters for your use case instead of relying on general benchmarks.
How to extend the workflow after this guide
Once the core technique works, the next leverage usually comes from standardising it. That might mean naming inputs more clearly, keeping one review checklist, or pairing this page with neighbouring guides so the process becomes repeatable rather than person-dependent.
The follow-on guides below are the most natural next steps from this guide:
- How to Run Gemma 4 on Your Own Machine, for the installation and setup side of the workflow.
- How to Use Local AI on Your Own Files, for interactive rather than batch use.
Related tutorials on this site
These tutorials cover the Gemma 4 hub, installation, and model selection decisions that pair naturally with automation pipelines.
- How to Use Gemma 4 in VS Code: Setup, Extensions, and Coding Workflows — the hub tutorial for Gemma 4 in your editor.
- How to Run Gemma 4 Locally for Free: A Beginner's Guide With Ollama and LM Studio — the install prerequisite.
- How to Use Local AI on Your Own Files — conceptual companion covering non-batch, interactive use.
- How to Choose Between Open Models and API Models — when to stay local vs. pay per token.