tulkki
See your site through an AI crawler's eyes.

A local command-line tool that diagnoses what AI crawlers actually see on your website. Fetches a URL three ways — raw HTML bytes, extracted content, and post-JavaScript DOM — and tells you which AI systems will see your content, which won't, and why.

v0.2.0 · MIT license · Python 3.12+

Why tulkki

SEO made your site findable to Google. GEO (generative engine optimization) makes your content knowable to AI. tulkki tells you which AI systems actually know your content, which ones don’t, and why.

AI visibility is not one thing — it's three different thresholds. A page can be 100% visible to one AI crawler, 5% to another, and invisible to a third, all on the same URL. Most tools give you one number. tulkki measures all three.

#1 Raw HTML bytes
What any pipeline tokenizing raw HTML directly would see, including training crawlers that read script tags.
#2 Standard extraction
What Common Crawl WET files, readability.js, and most content pipelines actually see after boilerplate stripping.
#3 Post-JS DOM
What human browsers see, and what AI systems that execute JavaScript first (such as the rendering crawler behind Google's AI Overviews) can reach.

Different AI systems use different pipelines and hit different thresholds. Without measuring all three, you can't tell whether a visibility problem is your framework, your rendering strategy, or something else entirely.
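
To make the gap between the first two views concrete, here is a minimal sketch in Python using only the standard library. It is not tulkki's implementation: `TextCollector`, `extractor_like_view`, and `SAMPLE_HTML` are hypothetical names, and skipping script and style contents is only a crude stand-in for a real boilerplate-stripping extractor. The third view needs a real browser and is left out.

```python
from html.parser import HTMLParser

# Hypothetical page: a client-rendered shell whose real content lives in a script tag.
SAMPLE_HTML = (
    "<html><body><nav>Home Research</nav>"
    '<div id="root">Loading</div>'
    '<script>window.__DATA__ = {"headline": "California ranks first"};</script>'
    "</body></html>"
)


class TextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside script, style, or noscript."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside the skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extractor_like_view(raw_html: str) -> str:
    """Approximates view #2: the text left once script contents are dropped."""
    parser = TextCollector()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


headline = "California ranks first"
in_raw_bytes = headline in SAMPLE_HTML                          # view #1
in_extracted = headline in extractor_like_view(SAMPLE_HTML)     # view #2
print(in_raw_bytes, in_extracted, repr(extractor_like_view(SAMPLE_HTML)))
```

On this sample the headline exists in the raw bytes, but only inside a script tag, so the extraction-style view is left with the navigation text and the word "Loading".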

A real example

Anthropic has a page about how AI affects the economy. Open it in Chrome and you see a world map, rankings for 50 states, task categories with percentages, and methodology notes — about 689 words of content.

Here's what different AI systems see when they visit the exact same URL:

Most AI training crawlers (GPTBot, ClaudeBot, Common Crawl WET)
53 words. Mostly the navigation menu and the word “Loading”.
AI pipelines reading raw HTML bytes (training pipelines that tokenize script tag contents)
More content is technically present, but buried inside framework code and hard to use as training signal.
AI systems that run JavaScript first (Google’s AI Overviews, some real-time agents)
All 689 words. Same as a human user.

Same URL. Three AI systems. Three completely different answers. A page about AI, invisible to most AI — and the page owner has no way to know until a tool tells them.

Raw HTML coverage: 40.0%
Extractor visibility: 5.4%

Raw HTML coverage (raw_presence_score)
The fraction of human-visible content that can be found as literal text inside the raw HTML bytes, before any extraction or JavaScript runs. This is the upper bound of what any crawler that does not execute JavaScript could see, including training pipelines that tokenize raw HTML directly (script tag contents included).
Extractor visibility (visibility_score)
The fraction of human-visible content that a standard boilerplate-stripping extractor (trafilatura) recovers from the raw HTML. This represents what Common Crawl’s WET files, readability.js, and most AI training pipelines actually see in practice. When this is low, your content is invisible to the majority of training crawlers even if it’s technically present in the bytes.
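
As a rough illustration of how these two fractions can diverge on one page, the sketch below checks which human-visible text segments appear literally in a candidate text and weights each segment by its word count. This weighting is an assumption made for illustration, not tulkki's actual scoring method. `literal_coverage` and the sample data are hypothetical; only the two metric names come from the definitions above.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.split()).lower()


def literal_coverage(human_segments: list[str], candidate: str) -> float:
    """Word-weighted fraction of human-visible segments found verbatim in candidate."""
    haystack = normalize(candidate)
    total_words = 0
    found_words = 0
    for segment in human_segments:
        words = len(segment.split())
        total_words += words
        if normalize(segment) in haystack:
            found_words += words
    return found_words / total_words if total_words else 0.0


# Hypothetical data: what a browser shows, the raw bytes, and an extractor's output.
human_segments = ["AI and the economy", "California ranks first", "Methodology notes"]
raw_html = (
    "<html><body><h1>AI and the economy</h1><div id='root'>Loading</div>"
    '<script>self.__data = ["California ranks first"]</script></body></html>'
)
extracted_text = "AI and the economy Loading"

# Raw bytes contain the heading and the script-tag string: 7 of 9 words.
raw_presence_score = literal_coverage(human_segments, raw_html)
# The extractor output contains only the heading: 4 of 9 words.
visibility_score = literal_coverage(human_segments, extracted_text)

print(f"raw: {raw_presence_score:.1%}, extractor: {visibility_score:.1%}")
```

The third segment appears in neither text because, in this made-up page, it only exists after JavaScript runs. That is why both scores stay below 100%.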

tulkki runs on any URL in about 15 seconds and produces a shareable HTML report like the ones below. You can hand it to a client, attach it to a pull request, or diff it against last week’s version.

Install and try

tulkki runs locally with no API keys. Install from source with uv:

git clone https://github.com/wcl-dev/tulkki
cd tulkki
uv sync
uv run playwright install chromium
uv run tulkki check https://example.com

Run a diagnostic

# Terminal report + AI-view and human-view markdown files
tulkki check https://example.com

# Add a self-contained HTML report you can share
tulkki check https://example.com --html

# Plain-English explanation of every metric
tulkki explain

Automate in CI

# Exit 1 if extractor visibility drops below 80%
tulkki check https://example.com --quiet --fail-below 80

# Also fail if raw HTML coverage drops below 60%
tulkki check https://example.com --quiet \
  --fail-below 80 --fail-below-raw 60

Who this is for

tulkki is built for the niche of practitioners currently underserved by SEO tools and marketing score-checkers: