Every web server already tracks every request. If you're running Apache, there's a detailed log file sitting right there, recording every page view, every referrer, every user agent. Most people ignore it and install a third-party tracking script instead.
That's fine. But you lose something. You lose understanding of what your data actually looks like, you gain a cookie banner, and you hand your visitors' information to someone else.
Here's how I built a complete analytics pipeline for this site using nothing but Apache logs, a PHP script, and SQLite.
The architecture
Three moving parts:
- Apache access logs — the raw data, already being collected
- A PHP CLI parser — reads log lines, extracts the useful bits, writes to SQLite
- A PHP dashboard — reads from SQLite, renders charts and tables
No JavaScript tracking snippets. No cookies. No third-party services. The data stays on your server and never touches a visitor's browser.
Parsing Apache logs
Apache's combined log format gives you a lot per line:
192.168.1.1 - - [25/Feb/2026:14:30:22 +0000] "GET /blog/how-dns-works HTTP/1.1" 200 8432 "https://google.com/" "Mozilla/5.0..."
That's the IP, timestamp, request method and path, status code, response size, referrer, and user agent. You need a regex to pull it apart:
$pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]+" (\d+) (\S+) "([^"]*)" "([^"]*)"/';
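As a quick sanity check, here is a sketch of running that pattern over the sample line with `preg_match`; the variable names are mine, not from the parser itself:

```php
<?php
// Combined-log pattern from above, applied to one line.
$pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]+" (\d+) (\S+) "([^"]*)" "([^"]*)"/';

$line = '192.168.1.1 - - [25/Feb/2026:14:30:22 +0000] '
      . '"GET /blog/how-dns-works HTTP/1.1" 200 8432 '
      . '"https://google.com/" "Mozilla/5.0"';

if (preg_match($pattern, $line, $m)) {
    // $m[0] is the full match; groups 1-8 are the fields in order.
    [, $ip, $time, $method, $path, $status, $bytes, $referrer, $ua] = $m;
}
```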
This captures everything worth keeping. From each line, you extract:
- Path — what page they visited
- Status code — did it actually load? (filter out 404s and 301s)
- Referrer domain — where they came from
- User agent — which browser (you only need the family, not the full string)
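The last two reductions — referrer URL down to a domain, user-agent string down to a family — might look like this. Both helper names are mine, and the family list is a plausible sketch, not the post's exact implementation:

```php
<?php
// Reduce a full referrer URL to just its host, or "(direct)" if absent.
function referrer_domain(string $referrer): string
{
    $host = parse_url($referrer, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : '(direct)';
}

// Reduce a full UA string to a browser family. Order matters:
// Chrome UAs contain "Safari", and Edge UAs contain "Chrome".
function browser_family(string $ua): string
{
    $families = [
        'Edg'     => 'Edge',
        'OPR'     => 'Opera',
        'Firefox' => 'Firefox',
        'Chrome'  => 'Chrome',
        'Safari'  => 'Safari',
    ];
    foreach ($families as $needle => $family) {
        if (strpos($ua, $needle) !== false) {
            return $family;
        }
    }
    return 'Other';
}
```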
The SQLite schema
Three tables. Keep it simple.
CREATE TABLE page_views (
    view_date TEXT,
    view_hour INTEGER,
    path TEXT,
    count INTEGER DEFAULT 1,
    UNIQUE(view_date, view_hour, path)
);

CREATE TABLE referrers (
    view_date TEXT,
    referrer_domain TEXT,
    count INTEGER DEFAULT 1,
    UNIQUE(view_date, referrer_domain)
);

CREATE TABLE browser_families (
    view_date TEXT,
    family TEXT,
    count INTEGER DEFAULT 1,
    UNIQUE(view_date, family)
);
Notice: no IP addresses stored. No individual user tracking. Every row is an aggregate — "on this date, this page got N views." You get useful data without collecting anything personal.
The UNIQUE constraint lets you increment counts with an upsert — INSERT ... ON CONFLICT DO UPDATE. (Plain INSERT OR REPLACE would overwrite the row and reset the count.) One row per page per day per hour, not one row per visit.
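A minimal sketch of that upsert against the page_views table, assuming PDO with the SQLite driver; note the ON CONFLICT ... DO UPDATE syntax needs SQLite 3.24 or newer:

```php
<?php
// In-memory database for illustration; the real parser opens a file.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE page_views (
    view_date TEXT,
    view_hour INTEGER,
    path TEXT,
    count INTEGER DEFAULT 1,
    UNIQUE(view_date, view_hour, path)
)');

// First execute inserts a row with count = 1 (the default);
// every later execute for the same key bumps count instead.
$stmt = $db->prepare(
    'INSERT INTO page_views (view_date, view_hour, path) VALUES (?, ?, ?)
     ON CONFLICT(view_date, view_hour, path) DO UPDATE SET count = count + 1'
);
$stmt->execute(['2026-02-25', 14, '/blog/how-dns-works']);
$stmt->execute(['2026-02-25', 14, '/blog/how-dns-works']);
```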
Filtering the noise
Raw logs are mostly noise. Bots, asset requests, health checks. You need to filter aggressively:
- Skip non-GET requests. POST, HEAD, OPTIONS — not page views.
- Skip static assets. Anything ending in .css, .js, .png, .ico, .svg, .woff2.
- Skip bot user agents. Look for "bot", "crawler", "spider", "curl", "wget" in the UA string.
- Skip non-200 responses. A 404 isn't a page view.
- Skip known probe paths. /wp-admin, /xmlrpc.php, /.env — scanners, not visitors.
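The rules above collapse into a single predicate. This is a sketch under the post's lists; the function name and signature are my own:

```php
<?php
// Returns true only for requests that count as human page views.
function is_page_view(string $method, string $path, string $status, string $ua): bool
{
    // Non-GET requests and non-200 responses are not page views.
    if ($method !== 'GET' || $status !== '200') {
        return false;
    }
    // Static assets.
    if (preg_match('/\.(css|js|png|ico|svg|woff2)$/i', $path)) {
        return false;
    }
    // Bot user agents, matched case-insensitively.
    foreach (['bot', 'crawler', 'spider', 'curl', 'wget'] as $needle) {
        if (stripos($ua, $needle) !== false) {
            return false;
        }
    }
    // Known probe paths from vulnerability scanners.
    foreach (['/wp-admin', '/xmlrpc.php', '/.env'] as $probe) {
        if (strpos($path, $probe) === 0) {
            return false;
        }
    }
    return true;
}
```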
After filtering, you go from thousands of log lines to dozens of meaningful page views. That ratio is normal. Most of your traffic isn't human.
Incremental parsing
You don't want to re-read the entire log file every time. Track your byte offset:
CREATE TABLE parse_state (
key TEXT PRIMARY KEY,
value TEXT
);
Store the file path and byte offset after each parse run. Next time, seek to that position and read only the new lines. This keeps each run fast no matter how large the log file grows. Log rotation? Detect it by checking whether the current file is smaller than your stored offset, and reset to zero.
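The seek-and-read step might look like the sketch below; the function name is mine, and persisting the returned offset into parse_state is left out:

```php
<?php
// Read only the lines appended since $offset; return them with the new offset.
function read_new_lines(string $logPath, int $offset): array
{
    // filesize() results are cached per-request; clear before re-checking.
    clearstatcache(true, $logPath);
    if (filesize($logPath) < $offset) {
        $offset = 0; // file shrank: log was rotated, start over
    }

    $fh = fopen($logPath, 'r');
    fseek($fh, $offset);

    $lines = [];
    while (($line = fgets($fh)) !== false) {
        $lines[] = $line;
    }

    $newOffset = ftell($fh); // store this in parse_state for the next run
    fclose($fh);

    return [$lines, $newOffset];
}
```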
The dashboard
The query side is straightforward SQL. Total views, daily trend, top pages, top referrers, browser breakdown:
-- Daily views for the last 30 days
SELECT view_date, SUM(count) as views
FROM page_views
WHERE view_date >= date('now', '-30 days')
GROUP BY view_date ORDER BY view_date;
-- Top pages
SELECT path, SUM(count) as views
FROM page_views
GROUP BY path ORDER BY views DESC LIMIT 15;
Render bar charts with CSS — no charting library needed. A div with width set as a percentage of the max value. Done.
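One way to emit those bars from PHP — the data, class name, and markup here are illustrative, not the dashboard's actual output:

```php
<?php
// Each bar's width is its value as a percentage of the largest value.
$pages = [
    '/blog/how-dns-works' => 120,
    '/tools/subnet-calc'  => 85,
    '/about'              => 30,
];
$max = max($pages);

$html = '';
foreach ($pages as $path => $views) {
    $pct = (int) round($views / $max * 100);
    $html .= sprintf(
        '<div class="bar" style="width:%d%%">%s (%d)</div>' . "\n",
        $pct,
        htmlspecialchars($path),
        $views
    );
}
echo $html;
```

Give `.bar` a background color and a little padding in the stylesheet and the chart is done.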
What you learn
The best part isn't the dashboard. It's what you notice by looking at raw data.
I learned that bot scanners found my site within 48 hours of going live, probing for WordPress admin panels and PHP debug modes. I learned that my tools pages consistently rank in the top 15, even though the blog posts get more direct traffic. I learned that Google started sending referrals after two weeks — just a trickle, but enough to confirm the site is being indexed.
None of that would have been as visceral from a Google Analytics chart. Building it yourself means you understand every number because you wrote the code that produced it.
Trade-offs
This approach has real limitations:
- No real-time data. You parse logs on demand, not continuously.
- No unique visitors. Without cookies or fingerprinting, you can't distinguish repeat visits. Every page load is a page view. That's a privacy feature disguised as a limitation.
- No client-side events. You can't track button clicks or scroll depth. Server-side only.
- Manual execution. Without cron access, you run the parser by hand.
For a small site, these trade-offs are worth it. You get honest numbers, no tracking burden on visitors, and full control of your data. If the site outgrows this, the data is in SQLite — easy to query, easy to migrate.
The point
You don't need a SaaS product to understand your traffic. The data is already there. A weekend of PHP and SQL gives you a dashboard that's private by default, runs on your own server, and teaches you more about your traffic than any drag-and-drop tool ever will.