Dokimos CLI

Dokimos is a command-line tool for:

discovering plagiarism evidence from free remote sources
optionally comparing against a locally indexed corpus
scoring documents for AI-likeness using stylometric heuristics
emitting either human-readable summaries or structured JSON reports

The Python distribution name is dokimos-cli, but the primary command is dokimos.

What Dokimos Does

Dokimos reads a document, extracts text, splits it into chunks, and can run up to two analyses:

Plagiarism Detection: queries free remote providers, fetches source text where possible, and verifies overlap locally using shingling, Jaccard similarity, and RapidFuzz reranking.
AI-Likeness Scoring: computes stylometric signals that may indicate unusually uniform, repetitive, or templated writing.
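
The shingling and Jaccard steps named above can be illustrated with a minimal sketch. It is not Dokimos's implementation: the function names `word_shingles` and `jaccard`, the tokenization regex, and the sample sentences are assumptions for illustration, and the RapidFuzz reranking step is left out. The shingle size of 5 and the 0.10 retention threshold mirror the documented defaults.

```python
import re


def word_shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample texts, not taken from any Dokimos fixture.
chunk = "the quick brown fox jumps over the lazy dog near the river bank"
source = "a quick brown fox jumps over the lazy dog every single morning"

score = jaccard(word_shingles(chunk), word_shingles(source))
# 0.10 mirrors the documented DOKIMOS_PLAGIARISM_JACCARD_THRESHOLD default.
print(f"jaccard={score:.3f} retained={score >= 0.10}")
```

Because shingles are sets of word sequences, a run of shared wording contributes several overlapping shingles, while isolated shared words contribute none.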

Important Caveats

AI-likeness output is heuristic and statistical. It is not proof of AI authorship.
Remote plagiarism coverage depends on what free providers can discover and what source text is publicly accessible.
Paywalled, private, or blocked pages are outside the reach of the free remote backends.
Local indexing is optional and only helps when you have a private corpus you want to compare against.
Short documents produce weaker AI-likeness signals and receive an explicit caveat.

Getting Started

Prerequisites

Python 3.12+
A virtual environment is strongly recommended

Install From This Repository

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e ".[dev,pdf,docx]"

Optional extras:

pdf installs pymupdf so .pdf files can be read.
docx installs python-docx so .docx files can be read.
If you only need plain text and Markdown, pip install -e ".[dev]" is enough.

Installed Command Names

After installation, these invocation forms are available:

dokimos ...
python -m dokimos ...
dokimos-cli ...

If dokimos is not on your shell PATH yet, refresh the editable install:

pip install -e .

Run Directly From Source

If you have not installed the package into the environment, run from the repository root:

PYTHONPATH=src python -m dokimos --help

Supported Input Formats

Dokimos currently supports:

.txt
.md
.docx
.pdf

Format support notes:

.txt and .md are read as UTF-8 text.
.docx requires the docx extra.
.pdf requires the pdf extra.

If a dependency is missing, the CLI returns a structured error and exits with status code 1.

Quick Start

Run a full analysis

By default, Dokimos uses the hybrid plagiarism backend. That means:

it queries free remote providers
it also checks a local corpus if you have indexed one
if you have not indexed anything locally, the remote path still runs normally

dokimos analyze essay.txt --format json

Analyze the example PDF

dokimos analyze examples/Research-Paper.pdf --format json

Build a local corpus index

If you want Dokimos to merge remote matches with your own document set, build a corpus index:

dokimos index-sources ./corpus

Typical output:

Scanned 42 file(s) (recursive)
  Indexed: 38
  Skipped: 1
  Up-to-date: 3
  Index: corpus/index.json

Write a report to disk

dokimos analyze essay.txt --json-out reports/essay.json

For command details, see CLI Reference. For environment variables, see Configuration.

CLI Reference

Top-level help:

dokimos --help

Global Options

--log-level TEXT: set log level such as DEBUG, INFO, WARNING, or ERROR
--install-completion: install shell completion
--show-completion: print completion for the current shell
--help: show help and exit

Commands

analyze
plagiarism
ai-check
index-sources

analyze

Runs both Plagiarism Detection and AI-Likeness Scoring unless one is explicitly disabled.

dokimos analyze FILE_PATH [OPTIONS]

Arguments:

FILE_PATH: path to the document to analyze

Options:

--format [json|text]: output format, default text
--json-out PATH: write the JSON report to a file instead of stdout
--no-plagiarism: skip plagiarism analysis
--no-ai-check: skip AI-likeness analysis
--help: show help and exit

Examples:

dokimos analyze paper.pdf
dokimos analyze paper.pdf --format json
dokimos analyze paper.pdf --json-out output/paper.json
dokimos analyze paper.pdf --no-plagiarism
dokimos analyze paper.pdf --no-ai-check

plagiarism

Runs Plagiarism Detection only.

dokimos plagiarism FILE_PATH [OPTIONS]

Arguments:

FILE_PATH: path to the document to analyze

Options:

--format [json|text]: output format, default text
--json-out PATH: write the JSON report to a file instead of stdout
--help: show help and exit

Example:

dokimos plagiarism essay.txt --format json

ai-check

Runs AI-Likeness Scoring only.

dokimos ai-check FILE_PATH [OPTIONS]

Arguments:

FILE_PATH: path to the document to analyze

Options:

--format [json|text]: output format, default text
--json-out PATH: write the JSON report to a file instead of stdout
--help: show help and exit

Example:

dokimos ai-check essay.txt --format json

index-sources

Indexes a directory of source documents into the optional local corpus index used by local and hybrid plagiarism backends.

dokimos index-sources DIRECTORY_PATH [OPTIONS]

Arguments:

DIRECTORY_PATH: directory containing source documents

Options:

--recursive / --no-recursive: recurse into subdirectories, default --recursive
--help: show help and exit

Examples:

dokimos index-sources ./corpus
dokimos index-sources ./corpus --no-recursive

Behavior notes:

Files are indexed only if they have a supported extension.
Re-indexing is incremental and skips files that have not changed.
The on-disk index is written to corpus/index.json by default.
Remote plagiarism analysis does not require this index.
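
One common way to make re-indexing incremental is to compare a content hash against the value stored from the previous run. The sketch below shows that idea only. How Dokimos actually detects unchanged files is not documented here, so the hashing approach and the names `content_digest` and `plan_reindex` are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, used as a change fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(
    files: list[Path], previous: dict[str, str]
) -> tuple[list[Path], list[Path]]:
    """Split files into (needs indexing, up to date) using stored digests."""
    to_index: list[Path] = []
    up_to_date: list[Path] = []
    for path in files:
        if previous.get(str(path)) == content_digest(path):
            up_to_date.append(path)
        else:
            to_index.append(path)
    return to_index, up_to_date


# Demonstration in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.txt"
    b = Path(tmp) / "b.txt"
    a.write_text("unchanged source", encoding="utf-8")
    b.write_text("first draft", encoding="utf-8")
    previous = {str(p): content_digest(p) for p in (a, b)}

    b.write_text("second draft", encoding="utf-8")  # only b changes
    to_index, up_to_date = plan_reindex([a, b], previous)
    changed_names = [p.name for p in to_index]
    unchanged_names = [p.name for p in up_to_date]

print(changed_names, unchanged_names)
```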

Output Behavior

Dokimos supports two output modes.

Text Output

--format text is the default. It prints a human-readable summary using Rich.

Important behavior:

Text output is written to stderr, not stdout.
This is useful for interactive terminal use.
If you are scripting, prefer --format json.

JSON Output

--format json prints strict JSON to stdout.

This is the best choice for:

shell pipelines
CI jobs
API handoff
storing analysis artifacts
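
For scripting, the split between stdout and stderr means a caller can parse stdout as JSON without filtering out human-readable text. The sketch below shows that pattern using only the Python standard library. The helper name `run_and_parse_json` is hypothetical, and a stand-in child process is used so the example runs without Dokimos installed.

```python
import json
import subprocess
import sys


def run_and_parse_json(command: list[str]) -> dict:
    """Run a command and parse its stdout as JSON; stderr is captured separately."""
    completed = subprocess.run(
        command,
        capture_output=True,  # stdout and stderr are kept in separate buffers
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)


# Stand-in child process: JSON on stdout, a human-readable line on stderr.
# With Dokimos on PATH, the command would instead be
# ["dokimos", "analyze", "essay.txt", "--format", "json"].
stand_in = [
    sys.executable,
    "-c",
    "import json, sys; print(json.dumps({'status': 'complete'})); "
    "print('human-readable summary', file=sys.stderr)",
]

report = run_and_parse_json(stand_in)
print(report["status"])
```

Using `check=True` surfaces a non-zero exit status as an exception instead of a JSON parse error on empty output.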

json-out

--json-out PATH always writes JSON to the specified file, regardless of the selected --format value.

Example:

dokimos analyze essay.txt --format text --json-out reports/essay.json

In that case:

the JSON report is written to the file
the CLI prints a confirmation message
JSON is not emitted to stdout

JSON Report Shape

The report includes top-level fields like:

schema_version
document_id
version
status
generated_at
document
summary
plagiarism
ai_likelihood
caveats

Abbreviated example (not every top-level field is shown):

{
  "schema_version": "1.0",
  "document_id": "...",
  "status": "complete",
  "document": {
    "source_path": "essay.txt",
    "filename": "essay.txt",
    "file_format": ".txt",
    "character_count": 1234
  },
  "summary": {
    "analyses_run": ["plagiarism", "ai_check"],
    "plagiarism_match_count": 2,
    "plagiarism_overall_score": 0.85,
    "ai_likeness_score": 0.34,
    "automated_writing_risk": "medium",
    "human_review_recommended": true
  },
  "plagiarism": { "...": "..." },
  "ai_likelihood": { "...": "..." },
  "caveats": []
}
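
A consumer of the report might pull a one-line summary out of the `summary` object. The following sketch assumes only the field names shown in the example above. The `summarize` helper is hypothetical, and the embedded sample is a trimmed copy of that example rather than real tool output.

```python
import json

# Trimmed copy of the example report; a real report carries more fields.
SAMPLE_REPORT = """
{
  "schema_version": "1.0",
  "status": "complete",
  "summary": {
    "analyses_run": ["plagiarism", "ai_check"],
    "plagiarism_match_count": 2,
    "plagiarism_overall_score": 0.85,
    "ai_likeness_score": 0.34,
    "automated_writing_risk": "medium",
    "human_review_recommended": true
  },
  "caveats": []
}
"""


def summarize(report: dict) -> str:
    """Build a one-line summary; missing fields show up as None instead of raising."""
    summary = report.get("summary", {})
    return (
        f"status={report.get('status')} "
        f"matches={summary.get('plagiarism_match_count')} "
        f"ai_risk={summary.get('automated_writing_risk')} "
        f"review={summary.get('human_review_recommended')}"
    )


line = summarize(json.loads(SAMPLE_REPORT))
print(line)
```

Reading fields with `.get` keeps the consumer tolerant of reports where an analysis was skipped with --no-plagiarism or --no-ai-check.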

Configuration

All settings can be overridden with DOKIMOS_-prefixed environment variables.

Core Settings

Variable	Default	Description
`DOKIMOS_LOG_LEVEL`	`INFO`	Default application log level
`DOKIMOS_CORPUS_PATH`	`corpus`	Directory used for the internal source corpus
`DOKIMOS_INDEX_FILE`	`corpus/index.json`	JSON index file path
`DOKIMOS_OUTPUT_DIR`	`output`	Default output directory for reports

Chunking Settings

Variable	Default	Description
`DOKIMOS_CHUNK_STRATEGY`	`paragraph`	One of `paragraph`, `sentence`, `fixed`
`DOKIMOS_CHUNK_SIZE`	`500`	Approximate words per chunk for fixed chunking
`DOKIMOS_CHUNK_OVERLAP`	`50`	Overlap between fixed chunks
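
To show how chunk size and overlap interact under the fixed strategy, here is a minimal sliding-window sketch. It assumes the overlap is measured in words, which the table does not state, and `fixed_chunks` is a hypothetical function rather than part of Dokimos.

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Small numbers make the overlap visible; the documented defaults are 500 and 50.
sample = " ".join(f"w{i}" for i in range(12))
chunks = fixed_chunks(sample, size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

With a size of 5 and an overlap of 2, each window starts 3 words after the previous one, so the last 2 words of one chunk reappear at the start of the next.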

Plagiarism Settings

Variable	Default	Description
`DOKIMOS_PLAGIARISM_BACKEND`	`hybrid`	One of `local`, `remote`, or `hybrid`
`DOKIMOS_SHINGLE_SIZE`	`5`	Number of words per shingle
`DOKIMOS_PLAGIARISM_JACCARD_THRESHOLD`	`0.10`	Minimum Jaccard similarity to retain a candidate
`DOKIMOS_PLAGIARISM_FUZZ_THRESHOLD`	`60.0`	Minimum RapidFuzz score to keep a reranked match
`DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHUNKS`	`5`	Maximum input chunks used as remote search queries
`DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHARS`	`180`	Maximum characters sent in each remote query
`DOKIMOS_PLAGIARISM_REMOTE_PER_PROVIDER_RESULTS`	`3`	Maximum candidate sources requested from each provider per query
`DOKIMOS_PLAGIARISM_REMOTE_TIMEOUT_SECONDS`	`10.0`	Remote request timeout in seconds
`DOKIMOS_PLAGIARISM_REMOTE_MAX_SOURCE_CHARS`	`20000`	Maximum retained text size per fetched remote source
`DOKIMOS_PLAGIARISM_REMOTE_FETCH_FULL_TEXT`	`true`	Attempt to fetch source pages or PDFs instead of relying on metadata
`DOKIMOS_PLAGIARISM_REMOTE_CONTACT_EMAIL`	unset	Optional contact email included where provider etiquette recommends it
`DOKIMOS_PLAGIARISM_REMOTE_ENABLE_OPENALEX`	`true`	Enable OpenAlex provider
`DOKIMOS_PLAGIARISM_REMOTE_ENABLE_CROSSREF`	`true`	Enable Crossref provider
`DOKIMOS_PLAGIARISM_REMOTE_ENABLE_ARXIV`	`true`	Enable arXiv provider
`DOKIMOS_PLAGIARISM_REMOTE_ENABLE_DUCKDUCKGO`	`true`	Enable DuckDuckGo HTML web-search provider

AI-Likeness Settings

Variable	Default	Description
`DOKIMOS_AI_SHORT_SENTENCE_THRESHOLD`	`8`	Sentences at or below this word count are considered short
`DOKIMOS_AI_LONG_SENTENCE_THRESHOLD`	`40`	Sentences at or above this word count are considered long
`DOKIMOS_AI_SIGNAL_TRIGGER_THRESHOLD`	`0.5`	Per-signal threshold to mark a signal as triggered
`DOKIMOS_AI_RISK_HIGH_THRESHOLD`	`0.6`	Aggregate score threshold for `high` risk
`DOKIMOS_AI_RISK_MEDIUM_THRESHOLD`	`0.3`	Aggregate score threshold for `medium` risk
`DOKIMOS_AI_SHORT_DOCUMENT_WORDS`	`80`	Documents below this word count receive the short-document caveat

Typical Workflows

Analyze a single paper against free remote sources

dokimos analyze submissions/paper-01.pdf --format json

Combine remote discovery with a local course corpus

export DOKIMOS_INDEX_FILE=/tmp/course-a-index.json
dokimos index-sources ./course-a-corpus
dokimos analyze submissions/paper-01.pdf --format json

Force remote-only analysis

export DOKIMOS_PLAGIARISM_BACKEND=remote
dokimos analyze essay.txt --format json

Force local-only offline analysis

export DOKIMOS_PLAGIARISM_BACKEND=local
dokimos index-sources ./course-a-corpus
dokimos analyze essay.txt --format json

Use sentence chunking for a short-form writing set

export DOKIMOS_CHUNK_STRATEGY=sentence
dokimos analyze essay.txt --format json

Generate a saved report for later inspection

dokimos analyze essay.txt --json-out reports/essay.json

AI-Likeness

Dokimos includes a rule-based stylometric AI-likeness engine.

It computes six signals per chunk, aggregates them into a document-level score, and maps that score to a low, medium, or high automated-writing-risk band.

This score is heuristic and statistical. It is not proof of AI authorship and should not be treated as such.

The Six Signals

The current engine computes these signals:

sentence_length_uniformity: Measures how uniform sentence lengths are. Low variation can correlate with machine-generated text.
short_sentence_ratio: Measures the fraction of very short sentences. A high ratio can suggest formulaic or list-like writing.
long_sentence_ratio: Measures the fraction of very long sentences. A high ratio can also indicate auto-generated prose.
lexical_diversity: Uses type-token ratio and inverts it into a risk contribution. Low vocabulary variety increases risk.
sentence_start_repetition: Measures whether multiple sentences begin with the same word. Repeated openings can suggest template-driven generation.
avg_word_length_uniformity: Measures how uniform average word length is across sentences. Unusually regular patterns can be a machine-writing indicator.
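
Two of these signals are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the engine's code: the sentence splitter, the tokenization, and the plain `1 - TTR` inversion are guesses at reasonable behavior, and the function names are hypothetical. The threshold of 8 words mirrors the documented DOKIMOS_AI_SHORT_SENTENCE_THRESHOLD default.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def short_sentence_ratio(sentences: list[str], threshold: int = 8) -> float:
    """Fraction of sentences with at most `threshold` words."""
    if not sentences:
        return 0.0
    short = sum(1 for s in sentences if len(s.split()) <= threshold)
    return short / len(sentences)


def lexical_diversity_risk(text: str) -> float:
    """Invert the type-token ratio so that low vocabulary variety gives a higher value."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    type_token_ratio = len(set(words)) / len(words)
    return 1.0 - type_token_ratio


# Deliberately repetitive sample text.
sample = "The cat sat. The cat sat again. The cat sat."
ratio = short_sentence_ratio(split_sentences(sample))
risk = lexical_diversity_risk(sample)
print(f"short_sentence_ratio={ratio:.2f} lexical_diversity_risk={risk:.2f}")
```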

Aggregation and Risk Bands

Each signal contributes to a weighted mean. Most signals use a weight of 1.0. The sentence_start_repetition signal uses a lower weight because it is less reliable on short text.
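
A weighted mean of this kind can be sketched as follows. The weight of 0.5 for sentence_start_repetition is an assumed placeholder, since the actual value is not stated here. The `aggregate` function and the sample signal values are likewise hypothetical.

```python
# Weight 1.0 for most signals; 0.5 for sentence_start_repetition is an
# assumption for illustration, not the engine's documented value.
WEIGHTS = {
    "sentence_length_uniformity": 1.0,
    "short_sentence_ratio": 1.0,
    "long_sentence_ratio": 1.0,
    "lexical_diversity": 1.0,
    "sentence_start_repetition": 0.5,
    "avg_word_length_uniformity": 1.0,
}


def aggregate(signals: dict[str, float]) -> float:
    """Weighted mean of signal values: sum(w * x) / sum(w)."""
    total_weight = sum(WEIGHTS[name] for name in signals)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(WEIGHTS[name] * value for name, value in signals.items())
    return weighted_sum / total_weight


# Hypothetical signal values: one strongly triggered signal, five mild ones.
signals = {name: 0.2 for name in WEIGHTS}
signals["sentence_start_repetition"] = 1.0

score = aggregate(signals)
print(f"aggregate score = {score:.4f}")
```

In this example the down-weighted signal moves the aggregate less than it would in an unweighted average of the same six values.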

The document-level score is then mapped to the configured risk bands:

high when score is at least DOKIMOS_AI_RISK_HIGH_THRESHOLD (default 0.6)
medium when score is at least DOKIMOS_AI_RISK_MEDIUM_THRESHOLD (default 0.3) but below the high threshold
low otherwise
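
The band rules above amount to two ordered comparisons. A minimal sketch, using the documented default thresholds and a hypothetical `risk_band` function:

```python
def risk_band(score: float, high: float = 0.6, medium: float = 0.3) -> str:
    """Map an aggregate score to a band; the high threshold is checked first."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# 0.34 is the ai_likeness_score used in the JSON report example.
for value in (0.29, 0.30, 0.34, 0.60):
    print(value, risk_band(value))
```

Both thresholds are inclusive, so a score exactly equal to a threshold lands in the higher band.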

Chunk-level findings are also included so you can see where signals cluster inside the document.

Caveats

The engine always emits a stylometric_only caveat:

Stylometric heuristics only — not a substitute for human review.

It also emits a short_document caveat when total word count is below DOKIMOS_AI_SHORT_DOCUMENT_WORDS (default 80):

Short documents produce less reliable signals.

Current Evaluation Status

Dokimos does not currently publish a formal benchmark with false-positive and false-negative rates.

What exists today is regression and ordering coverage in tests/test_evaluation.py. Those tests verify that:

clearly AI-like fixture text scores higher than natural human text
templated text scores higher than clearly human writing
mixed text remains within a sensible range
risk bands line up with configured thresholds
caveats and explanations are present when expected
edge cases such as empty documents and single-sentence input do not crash the engine

That is useful for guarding implementation drift, but it is not the same thing as a published accuracy study.

In other words:

the current tests support relative ordering and stability claims
they do not support a strong claim like “the AI detector is X% accurate”

Practical Guidance

Treat AI-likeness as a review signal, not a verdict.
Expect domain, genre, and document length to affect the score.
Use the JSON output when you want to inspect individual indicators, caveats, and chunk findings.
For higher-stakes decisions, keep a human review step in the loop.

Troubleshooting

dokimos: command not found

Refresh the editable install:

pip install -e .

If you are intentionally running directly from source instead of installing:

PYTHONPATH=src python -m dokimos --help

No module named dokimos

You are likely running from source without an install and without PYTHONPATH=src.

Use one of these:

pip install -e .

PYTHONPATH=src python -m dokimos --help

PDF or DOCX files fail to open

Install the required extras:

pip install -e ".[pdf,docx]"

Plagiarism returns zero matches

Check the basics:

make sure the machine has internet access if you are using remote or hybrid
make sure the remote providers you want are enabled
try increasing DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHARS slightly for longer phrases
if you want local matches too, make sure you indexed the source corpus first
make sure DOKIMOS_INDEX_FILE points to the expected index when using local or hybrid
remember that paywalled or inaccessible pages cannot be fetched by the free providers

Output is not appearing on stdout

That is expected in text mode.

--format text writes the human-readable summary to stderr
--format json writes structured JSON to stdout

If you are scripting, use:

dokimos analyze essay.txt --format json

AI-likeness score seems surprising

Remember the score is heuristic, not proof of authorship.

Common reasons for surprising results:

very short documents produce weaker signals
domain-specific or highly repetitive human writing can look stylometrically unusual
mixed human and AI-assisted text can sit between the obvious extremes

Inspect the JSON output when you need more detail about individual indicators and caveats.
