How to Use Gemma 4 for Local AI Workflows

Coding Liquids blog cover featuring Sagnik Bhattacharya for using Gemma 4 for local AI workflows.

Gemma 4 running locally means your data never leaves your machine. That makes it practical for workflows involving sensitive files, proprietary code, or data you simply do not want to upload to a cloud API.

This guide covers specific local AI workflows you can build with Gemma 4 — not just chatting with the model, but automating real tasks on your own files.

Quick answer

Run Gemma 4 locally with Ollama or LM Studio, then build scripts or pipelines that send your files and prompts to the local model. Common workflows include code review, document summarisation, data extraction, and batch text processing.

This approach fits when:

  • You need AI processing but cannot send data to external APIs.
  • You want to automate repetitive tasks on local files.
  • You are building internal tools that need AI without ongoing API costs.

Setting up Gemma 4 for workflow use

Install Ollama or LM Studio and pull the Gemma 4 model. For workflow automation, Ollama is the right choice because it exposes a plain HTTP API on localhost:11434 that you can call from any language — no SDK lock-in, no cloud relay.
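To make the "plain HTTP API" point concrete, here is a minimal sketch that calls the local server with nothing but the Python standard library. It assumes Ollama's /api/generate endpoint with streaming turned off, so the server answers with a single JSON object. The function names build_request and ask_over_http are illustrative, not part of any library.

```python
import json
import urllib.request

# Ollama's default local endpoint; change the port if you run it elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gemma4:12b") -> urllib.request.Request:
    """Build a non-streaming POST request for the local generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_over_http(prompt: str, model: str = "gemma4:12b", timeout: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Calling ask_over_http("Say hello in five words.") should return the model's reply as a string once the server is running. The same request shape works from any language that can send an HTTP POST.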

Choose the right model size for your hardware. Gemma 4 12B works on most machines with 16GB RAM and handles extraction, summarisation, and code review cleanly. Drop to 4B on 8GB machines. Step up to 27B only if you have 32GB+ RAM and quality matters more than throughput — the 27B pass is roughly three times slower per request.

Three-minute setup check

  1. Install Ollama from ollama.com/download, then pull the model: ollama pull gemma4:12b.
  2. Verify the server is running: curl http://localhost:11434/api/tags — you should see gemma4:12b in the JSON response.
  3. Install the Python client once per project: pip install ollama. That is the only dependency you need for every pattern below.
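If you would rather run step 2 from a script than from curl, a rough sketch follows. It assumes the /api/tags response is a JSON object with a "models" list whose entries carry a "name" field. The helper names are hypothetical.

```python
import json
import urllib.request


def model_is_pulled(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [entry.get("name", "") for entry in tags_payload.get("models", [])]
    return model in names


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch and parse /api/tags from the local Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A batch script can call model_is_pulled(fetch_tags(), "gemma4:12b") once at startup and exit early with a clear message instead of failing on the first file.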

If you need a deeper walkthrough of the install, see how to run Gemma 4 locally. The rest of this tutorial assumes the three-step check above passes.

File processing workflows

The most immediately useful local AI workflow is batch file processing — sending multiple files through the model for analysis, extraction, or transformation.

Write a simple script that reads each file, sends it to the local model with a specific prompt, and saves the result.

  • Code review: analyse files for bugs, style issues, or security concerns
  • Document summarisation: condense long documents into key points
  • Data extraction: pull structured data from unstructured text files
  • Content tagging: classify or tag files based on their content

Four ready-to-run pipeline patterns

Every reliable local-AI workflow boils down to four shapes. Each one is a handful of lines of Python you can lift straight into a script. All four assume you ran the pip install ollama step above.

The shared helper used in every pattern below — keep it in one place and call it from each script:

from ollama import generate

MODEL = "gemma4:12b"

def ask_gemma(prompt: str, context: str = "") -> str:
    full_prompt = f"{context}\n\n{prompt}" if context else prompt
    return generate(model=MODEL, prompt=full_prompt)["response"]

Pattern 1 — Map: one prompt per file, results to disk

The workhorse pattern. Use it for summarisation, classification, code review, anything where each file stands alone. Writing the result to disk per iteration matters — if the batch crashes at file 73, you still have files 1 through 72 on disk.

from pathlib import Path

OUT = Path("summaries")
OUT.mkdir(exist_ok=True)

for path in Path("docs").glob("*.txt"):
    out_file = OUT / f"{path.stem}.md"
    if out_file.exists():
        continue  # resume-safe: skip files already done
    summary = ask_gemma(
        "Summarise in 3 bullet points. Return markdown only.",
        path.read_text(encoding="utf-8"),
    )
    out_file.write_text(summary, encoding="utf-8")
    print(f"✓ {path.name}")

Pattern 2 — Filter: route files by their content

Ask the model a yes/no question per file and move the file to a folder based on the answer. Useful for triaging log files, sorting support tickets, or flagging code that touches sensitive modules.

import shutil
from pathlib import Path

FLAGGED = Path("flagged"); FLAGGED.mkdir(exist_ok=True)
CLEAN = Path("clean"); CLEAN.mkdir(exist_ok=True)

for path in Path("inbox").glob("*.log"):
    verdict = ask_gemma(
        "Reply with exactly one word: FLAG or CLEAN. "
        "Reply FLAG if the log shows a stack trace, auth failure, or "
        "data-loss error. Otherwise CLEAN.",
        path.read_text(encoding="utf-8")[:6000],
    ).strip().upper()
    dest = FLAGGED if verdict.startswith("FLAG") else CLEAN
    shutil.move(str(path), dest / path.name)

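Two tricks here: forcing single-word output keeps parsing trivial, and trimming to the first 6,000 characters keeps huge log files inside the context window. The trade-off is that only the start of each log is inspected, so errors that appear later in a long file are not seen.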
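Models do not always obey a one-word instruction, so it can help to treat anything unexpected as a separate case rather than defaulting it to clean. The sketch below is one possible stricter parser. The REVIEW bucket is an assumption added for illustration and is not part of the pattern above.

```python
import re


def parse_verdict(reply: str) -> str:
    """Map a model reply to FLAG, CLEAN, or REVIEW (hypothetical third bucket)."""
    # Keep only alphabetic words so punctuation or markdown emphasis is ignored.
    words = re.findall(r"[A-Za-z]+", reply.upper())
    if words == ["FLAG"]:
        return "FLAG"
    if words == ["CLEAN"]:
        return "CLEAN"
    return "REVIEW"  # anything else gets a human look instead of a silent default
```

With this in place, a reply such as "The log looks clean." lands in a review folder instead of being filed as clean without anyone noticing.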

Pattern 3 — Extract: unstructured text to structured JSON

Ask for JSON, parse it, and use it. This turns Gemma 4 into a schema-aware extraction tool that runs entirely offline — no third-party OCR-to-JSON service touching your contracts, CVs, or invoices.

import json
from pathlib import Path

rows = []
for path in Path("invoices").glob("*.txt"):
    raw = ask_gemma(
        "Extract this invoice as JSON with keys: vendor, invoice_number, "
        "date_iso, total_gbp. Return only the JSON object, no prose.",
        path.read_text(encoding="utf-8"),
    )
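    try:
        # Keep only the outermost {...} so a markdown json fence or stray prose cannot break parsing.
        start, end = raw.find("{"), raw.rfind("}")
        rows.append(json.loads(raw[start:end + 1]))
    except json.JSONDecodeError:
        print(f"⚠ parse failed: {path.name}")

Path("invoices.json").write_text(json.dumps(rows, indent=2), encoding="utf-8")

Two habits that save hours: log the filename when JSON parsing fails (so you know which file to inspect), and slice out the JSON object rather than only stripping backticks, because a markdown code fence usually carries a json language tag that json.loads rejects.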
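Parsing successfully is not the same as extracting correctly. A small check on each parsed row catches missing keys and malformed values before they reach invoices.json. This is a sketch that assumes the four keys requested in the prompt above. The function name and the exact checks are illustrative.

```python
from datetime import date

# The keys requested in the extraction prompt.
REQUIRED_KEYS = {"vendor", "invoice_number", "date_iso", "total_gbp"}


def invoice_problems(row: dict) -> list[str]:
    """Return human-readable problems; an empty list means the row looks usable."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(row))]
    if "date_iso" in row:
        try:
            date.fromisoformat(str(row["date_iso"]))
        except ValueError:
            problems.append(f"date_iso is not an ISO date: {row['date_iso']!r}")
    if "total_gbp" in row:
        try:
            float(row["total_gbp"])
        except (TypeError, ValueError):
            problems.append(f"total_gbp is not numeric: {row['total_gbp']!r}")
    return problems
```

Rows that come back with problems can be written to a separate file for manual review instead of being mixed in with the clean ones.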

Pattern 4 — Map-reduce: documents larger than the context window

When a file exceeds Gemma 4's context window, split it into chunks, summarise each chunk, then summarise the summaries. This keeps the first pass cheap and lets the second pass see the whole document compressed into its key points.

from pathlib import Path

def chunk_by_paragraphs(text: str, size: int = 20) -> list[str]:
    paras = [p for p in text.split("\n\n") if p.strip()]
    return ["\n\n".join(paras[i:i+size]) for i in range(0, len(paras), size)]

long_doc = Path("report.md").read_text(encoding="utf-8")
partials = [
    ask_gemma("Summarise in 5 bullet points:", chunk)
    for chunk in chunk_by_paragraphs(long_doc)
]
final = ask_gemma(
    "Merge these section summaries into one 10-bullet executive summary.",
    "\n\n---\n\n".join(partials),
)
Path("report-summary.md").write_text(final, encoding="utf-8")
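For very long documents, the joined section summaries can themselves outgrow the context window. One way to handle that is to reduce in rounds: pack summaries into groups that fit a size budget, merge each group, and repeat until one summary remains. The sketch below assumes a character budget as a stand-in for tokens, and takes the model call as an `ask` parameter with the same (prompt, context) shape as ask_gemma. Both function names are hypothetical.

```python
from typing import Callable


def group_by_budget(parts: list[str], max_chars: int) -> list[list[str]]:
    """Greedily pack parts into groups whose combined length stays within max_chars."""
    groups: list[list[str]] = []
    current: list[str] = []
    used = 0
    for part in parts:
        if current and used + len(part) > max_chars:
            groups.append(current)
            current, used = [], 0
        current.append(part)
        used += len(part)
    if current:
        groups.append(current)
    return groups


def reduce_summaries(
    partials: list[str],
    ask: Callable[[str, str], str],
    max_chars: int = 12000,
    max_rounds: int = 5,
) -> str:
    """Merge summaries in rounds until one remains or max_rounds is reached."""
    for _ in range(max_rounds):
        if len(partials) <= 1:
            break
        partials = [
            ask("Merge these section summaries into one shorter summary.",
                "\n\n---\n\n".join(group))
            for group in group_by_budget(partials, max_chars)
        ]
    return "\n\n".join(partials)
```

The max_rounds cap is there so the loop still ends if the model returns text that is no shorter than what it was given.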

Performance and context limits

Local models are slower than cloud APIs. Plan for 10-30 seconds per request depending on input length and hardware. For batch jobs, this adds up — 100 files at 20 seconds each is over 30 minutes.

Watch context window limits. Gemma 4 handles 8K-128K tokens depending on the variant. If your files are large, split them or summarise sections separately.
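A quick way to decide whether a file needs splitting is to estimate its token count before sending it. The sketch below uses a rough four-characters-per-token ratio, which is an assumption rather than a property of Gemma's tokenizer, so treat the result as a guide and leave headroom. Note also that the context size Ollama actually loads (its num_ctx setting) may be lower than the model's maximum.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def fits_context(text: str, context_tokens: int, reserve_for_reply: int = 1024) -> bool:
    """True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) <= context_tokens - reserve_for_reply
```

A batch script can route any file that fails fits_context to the map-reduce pattern and send the rest through in a single call.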

When to stay local vs. use an API

Use local Gemma 4 when privacy matters, when you are processing many files (avoiding per-token API costs), or when you need to work offline. Use a cloud API when you need the fastest possible response, the highest quality model, or when the data is not sensitive.

You can also build hybrid workflows — use local Gemma for initial processing and a cloud API for final quality checks on the most important outputs.

Worked example: local code documentation generator

You point a script at your project directory. It sends each source file to local Gemma 4, which generates a one-paragraph summary of what the file does. The script collects all summaries into a project overview document. Total cost: zero API fees. Total data leaked: none.
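The worked example could look roughly like the sketch below. It takes the model call as an `ask` parameter so you can pass in the ask_gemma helper defined earlier. The file suffixes, the character cap, and the markdown layout of the overview are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable


def build_overview(
    root: Path,
    ask: Callable[[str, str], str],
    suffixes: Iterable[str] = (".py", ".js"),
    max_chars: int = 12000,
) -> str:
    """Summarise each source file under root and return one markdown overview."""
    wanted = set(suffixes)
    sections = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in wanted:
            continue
        # Cap the input so one huge file cannot overflow the context window.
        source = path.read_text(encoding="utf-8", errors="replace")[:max_chars]
        summary = ask("Describe in one paragraph what this source file does.", source)
        sections.append(f"## {path.relative_to(root).as_posix()}\n\n{summary.strip()}")
    return "# Project overview\n\n" + "\n\n".join(sections) + "\n"
```

Typical use would be Path("OVERVIEW.md").write_text(build_overview(Path("my_project"), ask_gemma), encoding="utf-8"), where my_project is a placeholder for your own directory. For large projects, writing each summary to disk as it is produced, as in Pattern 1, is the safer variant.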

Common mistakes

  • Trying to process files larger than the model's context window without splitting — use Pattern 4 (map-reduce) above.
  • Using a model variant too large for your hardware (causes swapping and extreme slowdown) — verify GPU offload with ollama ps.
  • Building complex multi-step prompts instead of simple pipelines — three focused steps beat one ambitious prompt every time.
  • Accumulating results in memory instead of writing each one to disk — one crash and the whole batch is gone.
  • Restarting the Ollama server between requests — it reloads the model from disk every time, costing 10-40 seconds per call.

Troubleshooting batch workflows

Five errors account for almost every failed overnight batch. Check them in this order before reaching for a bigger debug tool.

  1. "connection refused" on port 11434. The Ollama server is not running. On Windows, open the Ollama app; on macOS/Linux run ollama serve in a spare terminal. Confirm with curl http://localhost:11434/api/tags — it should return JSON, not an error.
  2. "model not found: gemma4". You pulled a different tag than the script expects. Run ollama list to see what is actually on disk, then either pull the tag the script uses (ollama pull gemma4:12b) or change MODEL in the helper to match what you have.
  3. Batch speed collapses after the first few files. The model is being unloaded between requests. Set OLLAMA_KEEP_ALIVE=24h in your environment before launching Ollama, so the model stays resident in memory for the whole batch.
  4. Random JSON parse failures on Pattern 3. Gemma 4 occasionally wraps its output in ```json … ``` fences. Stripping backticks alone leaves the json language tag behind, so parse only the text between the first { and the last }, and tighten the prompt: "Return only the JSON object. Do not wrap in markdown."
  5. GPU offload silently failed and the model is running on CPU. Check ollama ps — if the PROCESSOR column shows 100% CPU, the model did not fit in VRAM. Drop to gemma4:4b, or lower num_gpu layers via a custom Modelfile so part of the model runs on GPU and part on CPU.

FAQ

Which Gemma 4 variant should I use for batch file processing?

Gemma 4 12B is the best starting point on a 16GB machine — large enough to follow extraction instructions cleanly, small enough to finish a 100-file batch in reasonable time. Drop to 4B on 8GB machines, or step up to 27B if you have 32GB+ RAM and quality matters more than throughput.

How do I call Ollama from Python for a batch workflow?

Install the ollama Python package (pip install ollama) and call ollama.generate(model='gemma4:12b', prompt=...). That hits the local API on port 11434 and returns a response you can index like a dict, for example with ["response"]. The request never leaves your machine and needs no API key. For batch work, wrap the call in a loop over pathlib.Path.glob() and write each result to disk so a crash mid-batch does not lose finished work.

Why is my batch slow even on a powerful machine?

Three usual culprits: model reloading between requests (keep the Ollama server running, send sequential calls, and do not restart it), context window bloat (trim each file down to what the prompt actually needs), and CPU fallback (verify GPU offload with ollama ps; if the PROCESSOR column shows 100% CPU, your model is too big for VRAM).

How do I split files that exceed the context window?

Chunk by semantic boundary, not by character count. For code, split per function or per class. For prose, split per H2 or per 10-paragraph block. Summarise each chunk with the same prompt, then run a second pass that summarises the summaries. Pattern 4 above implements this map-reduce for prose, using 20-paragraph blocks by default.
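For markdown prose, splitting per H2 can be sketched with a single regular expression. This assumes headings are written in the "## Title" form at the start of a line. The function name is illustrative.

```python
import re


def split_by_h2(markdown: str) -> list[str]:
    """Split a markdown document into chunks that each start at an H2 heading."""
    # Zero-width split before every line that begins with "## ".
    # Deeper headings such as "### " do not match and stay inside their section.
    pieces = re.split(r"(?m)^(?=## )", markdown)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Any text before the first H2 comes back as its own chunk, so an introduction is not lost. The resulting list can be fed to the same summarise-then-merge loop used in Pattern 4.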

Can I run a Gemma 4 batch overnight without supervision?

Yes, but add three guardrails: write each result to disk immediately (do not accumulate in memory), wrap each call in a try/except that logs the filename and continues, and keep a simple processed.txt ledger so a re-run skips files already done. Pair that with OLLAMA_KEEP_ALIVE=24h so the model stays warm overnight and you do not pay a reload cost on every request.
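The three guardrails fit into one small loop. The sketch below is one way to combine them, assuming an `ask` callable with the same (prompt, context) shape as ask_gemma and a ledger that stores one finished filename per line. The function name and the .md output naming are illustrative.

```python
from pathlib import Path
from typing import Callable


def run_batch(
    inputs: list[Path],
    out_dir: Path,
    ask: Callable[[str, str], str],
    prompt: str,
    ledger: Path = Path("processed.txt"),
) -> list[str]:
    """Process files one by one; return the names of files that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = set(ledger.read_text(encoding="utf-8").splitlines()) if ledger.exists() else set()
    failed = []
    for path in inputs:
        if path.name in done:
            continue  # finished in an earlier run
        try:
            result = ask(prompt, path.read_text(encoding="utf-8"))
            (out_dir / f"{path.stem}.md").write_text(result, encoding="utf-8")
        except Exception as exc:  # log and continue so one bad file cannot stop the batch
            print(f"failed: {path.name}: {exc}")
            failed.append(path.name)
            continue
        with ledger.open("a", encoding="utf-8") as fh:
            fh.write(path.name + "\n")  # record only after the result is on disk
    return failed
```

Because a filename is added to the ledger only after its output has been written, re-running the same command in the morning retries just the files that failed or were never reached.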

When to use something else

To set up Gemma 4 locally, see running Gemma 4 on your own machine. For using Gemma 4 in your code editor, see Gemma 4 in VS Code.

How to apply this in a real AI project

The patterns in this guide become much more useful once they are tied to the rest of the workflow around them. In real work, the result depends on model selection, prompt design, tool integration, evaluation, and the operational reality of shipping AI features, not only on following one tip correctly.

That is why the biggest win rarely comes from one clever move in isolation. It comes from making the surrounding process easier to review, easier to repeat, and easier to hand over when another person inherits the scripts or codebase later.

  • Test with realistic inputs before shipping, not just the examples that inspired the idea.
  • Keep the human review step visible so the workflow stays trustworthy as it scales.
  • Measure what matters for your use case instead of relying on general benchmarks.

How to extend the workflow after this guide

Once the core technique works, the next leverage usually comes from standardising it. That might mean naming inputs more clearly, keeping one review checklist, or pairing this page with neighbouring guides so the process becomes repeatable rather than person-dependent.
