Dokimos CLI

Dokimos is a command-line tool for:

  • discovering plagiarism evidence from free remote sources
  • optionally comparing against a locally indexed corpus
  • scoring documents for AI-likeness using stylometric heuristics
  • emitting either human-readable summaries or structured JSON reports

The Python distribution name is dokimos-cli, but the primary command is dokimos.

What Dokimos Does

Dokimos reads a document, extracts text, splits it into chunks, and can run up to two analyses:

  • Plagiarism Detection: queries free remote providers, fetches source text where possible, and verifies overlap locally using shingling, Jaccard similarity, and RapidFuzz reranking.
  • AI-Likeness Scoring: computes stylometric signals that may indicate unusually uniform, repetitive, or templated writing.

Important Caveats

  • AI-likeness output is heuristic and statistical. It is not proof of AI authorship.
  • Remote plagiarism coverage depends on what free providers can discover and what source text is publicly accessible.
  • Paywalled, private, or blocked pages are outside the reach of the free remote backends.
  • Local indexing is optional and only helps when you have a private corpus you want to compare against.
  • Short documents produce weaker AI-likeness signals and receive an explicit caveat.

Getting Started

Prerequisites

  • Python 3.12+
  • A virtual environment is strongly recommended

Install From This Repository

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e ".[dev,pdf,docx]"

Optional extras:

  • pdf installs pymupdf so .pdf files can be read.
  • docx installs python-docx so .docx files can be read.
  • If you only need plain text and Markdown, pip install -e ".[dev]" is enough.

Installed Command Names

After installation, these invocation forms are available:

  • dokimos ...
  • python -m dokimos ...
  • dokimos-cli ...

If dokimos is not on your shell PATH yet, refresh the editable install:

pip install -e .

Run Directly From Source

If you have not installed the package into the environment, run from the repository root:

PYTHONPATH=src python -m dokimos --help

Supported Input Formats

Dokimos currently supports:

  • .txt
  • .md
  • .docx
  • .pdf

Format support notes:

  • .txt and .md are read as UTF-8 text.
  • .docx requires the docx extra.
  • .pdf requires the pdf extra.

If a dependency is missing, the CLI returns a structured error and exits with status code 1.

Quick Start

Run a full analysis

By default, Dokimos uses the hybrid plagiarism backend. That means:

  • it queries free remote providers
  • it also checks a local corpus if you have indexed one
  • if you have not indexed anything locally, the remote path still runs normally

dokimos analyze essay.txt --format json

Analyze the example PDF

dokimos analyze examples/Research-Paper.pdf --format json

Build a local corpus index

If you want Dokimos to merge remote matches with your own document set, build a corpus index:

dokimos index-sources ./corpus

Typical output:

Scanned 42 file(s) (recursive)
  Indexed: 38
  Skipped: 1
  Up-to-date: 3
  Index: corpus/index.json

Write a report to disk

dokimos analyze essay.txt --json-out reports/essay.json

For command details, see CLI Reference. For environment variables, see Configuration.

CLI Reference

Top-level help:

dokimos --help

Global Options

  • --log-level TEXT: set log level such as DEBUG, INFO, WARNING, or ERROR
  • --install-completion: install shell completion
  • --show-completion: print completion for the current shell
  • --help: show help and exit

Commands

  • analyze
  • plagiarism
  • ai-check
  • index-sources

analyze

Runs both Plagiarism Detection and AI-Likeness Scoring unless one is explicitly disabled.

dokimos analyze FILE_PATH [OPTIONS]

Arguments:

  • FILE_PATH: path to the document to analyze

Options:

  • --format [json|text]: output format, default text
  • --json-out PATH: write the JSON report to a file instead of stdout
  • --no-plagiarism: skip plagiarism analysis
  • --no-ai-check: skip AI-likeness analysis
  • --help

Examples:

dokimos analyze paper.pdf
dokimos analyze paper.pdf --format json
dokimos analyze paper.pdf --json-out output/paper.json
dokimos analyze paper.pdf --no-plagiarism
dokimos analyze paper.pdf --no-ai-check

plagiarism

Runs Plagiarism Detection only.

dokimos plagiarism FILE_PATH [OPTIONS]

Arguments:

  • FILE_PATH: path to the document to analyze

Options:

  • --format [json|text]: output format, default text
  • --json-out PATH: write the JSON report to a file instead of stdout
  • --help

Example:

dokimos plagiarism essay.txt --format json

ai-check

Runs AI-Likeness Scoring only.

dokimos ai-check FILE_PATH [OPTIONS]

Arguments:

  • FILE_PATH: path to the document to analyze

Options:

  • --format [json|text]: output format, default text
  • --json-out PATH: write the JSON report to a file instead of stdout
  • --help

Example:

dokimos ai-check essay.txt --format json

index-sources

Indexes a directory of source documents into the optional local corpus index used by local and hybrid plagiarism backends.

dokimos index-sources DIRECTORY_PATH [OPTIONS]

Arguments:

  • DIRECTORY_PATH: directory containing source documents

Options:

  • --recursive / --no-recursive: recurse into subdirectories, default --recursive
  • --help

Examples:

dokimos index-sources ./corpus
dokimos index-sources ./corpus --no-recursive

Behavior notes:

  • Files are indexed only if they match supported extensions.
  • Re-indexing is incremental and skips files that have not changed.
  • The on-disk index is written to corpus/index.json by default.
  • Remote plagiarism analysis does not require this index.
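
The documentation does not say how unchanged files are detected. One common approach, sketched here purely as an assumption, is to compare a content hash against the value stored during the previous run. The function names and the path-to-digest mapping are hypothetical and do not describe the real layout of corpus/index.json.

```python
import hashlib
import tempfile
from pathlib import Path

# Extensions listed under Supported Input Formats.
SUPPORTED = {".txt", ".md", ".docx", ".pdf"}


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, used here as the change detector."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(directory: Path, known: dict[str, str], recursive: bool = True):
    """Split supported files into (to_index, up_to_date).

    `known` maps a file path to the digest recorded on a previous run
    (a hypothetical structure, not the real index format).
    """
    pattern = "**/*" if recursive else "*"
    to_index, up_to_date = [], []
    for path in sorted(directory.glob(pattern)):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue  # unsupported extension or a directory
        if known.get(str(path)) == file_digest(path):
            up_to_date.append(path)
        else:
            to_index.append(path)
    return to_index, up_to_date


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("first source", encoding="utf-8")
    (root / "b.md").write_text("second source", encoding="utf-8")
    (root / "notes.log").write_text("ignored", encoding="utf-8")
    known = {str(root / "a.txt"): file_digest(root / "a.txt")}
    to_index, up_to_date = plan_reindex(root, known)
    print([p.name for p in to_index], [p.name for p in up_to_date])
```

Hashing content rather than checking modification times avoids re-indexing files that were merely touched, at the cost of reading every file on each run.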

Output Behavior

Dokimos supports two output modes.

Text Output

--format text is the default. It prints a human-readable summary using Rich.

Important behavior:

  • Text output is written to stderr, not stdout.
  • This is useful for interactive terminal use.
  • If you are scripting, prefer --format json.

JSON Output

--format json prints strict JSON to stdout.

This is the best choice for:

  • shell pipelines
  • CI jobs
  • API handoff
  • storing analysis artifacts

--json-out

--json-out PATH always writes JSON to the specified file, regardless of the selected --format value.

Example:

dokimos analyze essay.txt --format text --json-out reports/essay.json

In that case:

  • the JSON report is written to the file
  • the CLI prints a confirmation message
  • JSON is not emitted to stdout

JSON Report Shape

The report includes top-level fields like:

  • schema_version
  • document_id
  • version
  • status
  • generated_at
  • document
  • summary
  • plagiarism
  • ai_likelihood
  • caveats

Example:

{
  "schema_version": "1.0",
  "document_id": "...",
  "status": "complete",
  "document": {
    "source_path": "essay.txt",
    "filename": "essay.txt",
    "file_format": ".txt",
    "character_count": 1234
  },
  "summary": {
    "analyses_run": ["plagiarism", "ai_check"],
    "plagiarism_match_count": 2,
    "plagiarism_overall_score": 0.85,
    "ai_likeness_score": 0.34,
    "automated_writing_risk": "medium",
    "human_review_recommended": true
  },
  "plagiarism": { "...": "..." },
  "ai_likelihood": { "...": "..." },
  "caveats": []
}
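
As a minimal sketch of consuming this report from a script, the following Python reads the summary fields shown above. The helper names are hypothetical, and the embedded JSON is a trimmed copy of the example rather than real tool output.

```python
import json

# Trimmed copy of the example report above.
REPORT_JSON = """
{
  "schema_version": "1.0",
  "status": "complete",
  "summary": {
    "analyses_run": ["plagiarism", "ai_check"],
    "plagiarism_match_count": 2,
    "plagiarism_overall_score": 0.85,
    "ai_likeness_score": 0.34,
    "automated_writing_risk": "medium",
    "human_review_recommended": true
  },
  "caveats": []
}
"""


def needs_review(report: dict) -> bool:
    """Return True when the report's summary recommends human review."""
    return bool(report.get("summary", {}).get("human_review_recommended", False))


def one_line_summary(report: dict) -> str:
    """Condense the summary block into a single log-friendly line."""
    summary = report.get("summary", {})
    return (
        f"{report.get('status', 'unknown')}: "
        f"{summary.get('plagiarism_match_count', 0)} plagiarism match(es), "
        f"AI-likeness {summary.get('ai_likeness_score')} "
        f"({summary.get('automated_writing_risk', 'n/a')})"
    )


report = json.loads(REPORT_JSON)
print(one_line_summary(report))
print("needs review:", needs_review(report))
```

Using .get with defaults keeps the script tolerant of reports where an analysis was skipped, for example with --no-plagiarism.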

Configuration

All settings can be overridden with DOKIMOS_-prefixed environment variables.

Core Settings

| Variable | Default | Description |
| --- | --- | --- |
| DOKIMOS_LOG_LEVEL | INFO | Default application log level |
| DOKIMOS_CORPUS_PATH | corpus | Directory used for the internal source corpus |
| DOKIMOS_INDEX_FILE | corpus/index.json | JSON index file path |
| DOKIMOS_OUTPUT_DIR | output | Default output directory for reports |

Chunking Settings

| Variable | Default | Description |
| --- | --- | --- |
| DOKIMOS_CHUNK_STRATEGY | paragraph | One of paragraph, sentence, fixed |
| DOKIMOS_CHUNK_SIZE | 500 | Approximate words per chunk for fixed chunking |
| DOKIMOS_CHUNK_OVERLAP | 50 | Overlap between fixed chunks |
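
To make the fixed strategy concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes the overlap is measured in words, which the table does not state, and the function name is hypothetical.

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk. Overlap in words is an assumption."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Tiny demonstration with 12 placeholder words, 5 per chunk, 2 shared.
sample = " ".join(f"w{i}" for i in range(12))
demo = fixed_chunks(sample, size=5, overlap=2)
for chunk in demo:
    print(chunk)
```

Overlap means a copied passage that straddles a chunk boundary still appears whole in at least one chunk, as long as it is no longer than the overlap.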

Plagiarism Settings

| Variable | Default | Description |
| --- | --- | --- |
| DOKIMOS_PLAGIARISM_BACKEND | hybrid | One of local, remote, or hybrid |
| DOKIMOS_SHINGLE_SIZE | 5 | Number of words per shingle |
| DOKIMOS_PLAGIARISM_JACCARD_THRESHOLD | 0.10 | Minimum Jaccard similarity to retain a candidate |
| DOKIMOS_PLAGIARISM_FUZZ_THRESHOLD | 60.0 | Minimum RapidFuzz score to keep a reranked match |
| DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHUNKS | 5 | Maximum input chunks used as remote search queries |
| DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHARS | 180 | Maximum characters sent in each remote query |
| DOKIMOS_PLAGIARISM_REMOTE_PER_PROVIDER_RESULTS | 3 | Maximum candidate sources requested from each provider per query |
| DOKIMOS_PLAGIARISM_REMOTE_TIMEOUT_SECONDS | 10.0 | Remote request timeout |
| DOKIMOS_PLAGIARISM_REMOTE_MAX_SOURCE_CHARS | 20000 | Maximum retained text size per fetched remote source |
| DOKIMOS_PLAGIARISM_REMOTE_FETCH_FULL_TEXT | true | Attempt to fetch source pages or PDFs instead of relying on metadata |
| DOKIMOS_PLAGIARISM_REMOTE_CONTACT_EMAIL | unset | Optional contact email included where provider etiquette recommends it |
| DOKIMOS_PLAGIARISM_REMOTE_ENABLE_OPENALEX | true | Enable OpenAlex provider |
| DOKIMOS_PLAGIARISM_REMOTE_ENABLE_CROSSREF | true | Enable Crossref provider |
| DOKIMOS_PLAGIARISM_REMOTE_ENABLE_ARXIV | true | Enable arXiv provider |
| DOKIMOS_PLAGIARISM_REMOTE_ENABLE_DUCKDUCKGO | true | Enable DuckDuckGo HTML web-search provider |

AI-Likeness Settings

| Variable | Default | Description |
| --- | --- | --- |
| DOKIMOS_AI_SHORT_SENTENCE_THRESHOLD | 8 | Sentences at or below this word count are considered short |
| DOKIMOS_AI_LONG_SENTENCE_THRESHOLD | 40 | Sentences at or above this word count are considered long |
| DOKIMOS_AI_SIGNAL_TRIGGER_THRESHOLD | 0.5 | Per-signal threshold to mark a signal as triggered |
| DOKIMOS_AI_RISK_HIGH_THRESHOLD | 0.6 | Aggregate score threshold for high risk |
| DOKIMOS_AI_RISK_MEDIUM_THRESHOLD | 0.3 | Aggregate score threshold for medium risk |
| DOKIMOS_AI_SHORT_DOCUMENT_WORDS | 80 | Documents below this size receive the short-document caveat |
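
The DOKIMOS_ prefix convention can be sketched as a small settings loader. This is an illustration only: the Settings class, the load_settings function, and the boolean parsing rules are assumptions, and only four of the documented variables are modeled.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the tables in this section.
    plagiarism_backend: str = "hybrid"
    shingle_size: int = 5
    plagiarism_jaccard_threshold: float = 0.10
    plagiarism_remote_fetch_full_text: bool = True


def load_settings(env: Mapping[str, str] = os.environ,
                  prefix: str = "DOKIMOS_") -> Settings:
    """Build Settings, overriding defaults from prefixed environment variables."""
    overrides = {}
    for field in fields(Settings):
        raw = env.get(prefix + field.name.upper())
        if raw is None:
            continue  # variable not set: keep the default
        if isinstance(field.default, bool):
            # Assumed truthy spellings; the real parser may differ.
            overrides[field.name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        else:
            overrides[field.name] = type(field.default)(raw)
    return Settings(**overrides)


# Demonstration with an explicit mapping instead of the real environment.
settings = load_settings({
    "DOKIMOS_PLAGIARISM_BACKEND": "local",
    "DOKIMOS_SHINGLE_SIZE": "7",
    "DOKIMOS_PLAGIARISM_REMOTE_FETCH_FULL_TEXT": "false",
})
print(settings)
```

Passing the environment as a mapping keeps the loader easy to test without mutating os.environ.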

Typical Workflows

Analyze a single paper against free remote sources

dokimos analyze submissions/paper-01.pdf --format json

Combine remote discovery with a local course corpus

export DOKIMOS_INDEX_FILE=/tmp/course-a-index.json
dokimos index-sources ./course-a-corpus
dokimos analyze submissions/paper-01.pdf --format json

Force remote-only analysis

export DOKIMOS_PLAGIARISM_BACKEND=remote
dokimos analyze essay.txt --format json

Force local-only offline analysis

export DOKIMOS_PLAGIARISM_BACKEND=local
dokimos index-sources ./course-a-corpus
dokimos analyze essay.txt --format json

Use sentence chunking for a short-form writing set

export DOKIMOS_CHUNK_STRATEGY=sentence
dokimos analyze essay.txt --format json

Generate a saved report for later inspection

dokimos analyze essay.txt --json-out reports/essay.json

AI-Likeness

Dokimos includes a rule-based stylometric AI-likeness engine.

It computes six signals per chunk, aggregates them into a document-level score, and maps that score to a low, medium, or high automated-writing-risk band.

This score is heuristic and statistical. It is not proof of AI authorship and should not be treated as such.

The Six Signals

The current engine computes these signals:

  1. sentence_length_uniformity: Measures how uniform sentence lengths are. Low variation can correlate with machine-generated text.

  2. short_sentence_ratio: Measures the fraction of very short sentences. A high ratio can suggest formulaic or list-like writing.

  3. long_sentence_ratio: Measures the fraction of very long sentences. A high ratio can also indicate auto-generated prose.

  4. lexical_diversity: Uses type-token ratio and inverts it into a risk contribution. Low vocabulary variety increases risk.

  5. sentence_start_repetition: Measures whether multiple sentences begin with the same word. Repeated openings can suggest template-driven generation.

  6. avg_word_length_uniformity: Measures how uniform average word length is across sentences. Unusually regular patterns can be a machine-writing indicator.
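
The following sketch shows how four of these signals might be computed. The exact formulas Dokimos uses are not documented here, so the sentence splitter, the normalization, and every function name are assumptions. The thresholds of 8 and 40 words mirror the documented defaults, and the two uniformity signals are omitted.

```python
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def short_sentence_ratio(sentences: list[str], threshold: int = 8) -> float:
    """Fraction of sentences with at most `threshold` words."""
    if not sentences:
        return 0.0
    return sum(len(s.split()) <= threshold for s in sentences) / len(sentences)


def long_sentence_ratio(sentences: list[str], threshold: int = 40) -> float:
    """Fraction of sentences with at least `threshold` words."""
    if not sentences:
        return 0.0
    return sum(len(s.split()) >= threshold for s in sentences) / len(sentences)


def lexical_diversity_risk(text: str) -> float:
    """Inverted type-token ratio: 1 - unique_words / total_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return 1.0 - len(set(words)) / len(words) if words else 0.0


def sentence_start_repetition(sentences: list[str]) -> float:
    """Fraction of sentences whose first word also opens another sentence."""
    firsts = [s.split()[0].lower() for s in sentences if s.split()]
    counts = Counter(firsts)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(firsts) if firsts else 0.0


text = "The model works well. The model runs fast. The model is small. Results vary."
sentences = split_sentences(text)
print(short_sentence_ratio(sentences), long_sentence_ratio(sentences))
print(round(lexical_diversity_risk(text), 3), sentence_start_repetition(sentences))
```

Each function returns a value between 0 and 1, so the signals can be combined on a common scale.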

Aggregation and Risk Bands

Each signal contributes to a weighted mean. Most signals use weight 1.0. sentence_start_repetition uses a lower weight because it is less reliable on short text.

The document-level score is then mapped to the configured risk bands:

  • high when score is at least DOKIMOS_AI_RISK_HIGH_THRESHOLD (default 0.6)
  • medium when score is at least DOKIMOS_AI_RISK_MEDIUM_THRESHOLD (default 0.3)
  • low otherwise
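
The aggregation and band mapping above can be sketched as a weighted mean followed by two threshold checks. The 0.5 weight for sentence_start_repetition and the sample signal values are assumptions; the documentation says only that this weight is lower than 1.0.

```python
def aggregate(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of signal values; signals without a weight default to 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in signals)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(value * weights.get(name, 1.0) for name, value in signals.items())
    return weighted_sum / total_weight


def risk_band(score: float, high: float = 0.6, medium: float = 0.3) -> str:
    """Map a document-level score to low, medium, or high."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# Assumed weight: the real value is not documented.
WEIGHTS = {"sentence_start_repetition": 0.5}

# Hypothetical signal values for one document.
signals = {
    "sentence_length_uniformity": 0.4,
    "short_sentence_ratio": 0.2,
    "long_sentence_ratio": 0.0,
    "lexical_diversity": 0.5,
    "sentence_start_repetition": 0.8,
    "avg_word_length_uniformity": 0.3,
}

score = aggregate(signals, WEIGHTS)
print(f"score={score:.3f} band={risk_band(score)}")
```

Because both comparisons use "at least", a score exactly equal to a threshold lands in the higher band.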

Chunk-level findings are also included so you can see where signals cluster inside the document.

Caveats

The engine always emits a stylometric_only caveat:

  • Stylometric heuristics only — not a substitute for human review.

It also emits a short_document caveat when total word count is below DOKIMOS_AI_SHORT_DOCUMENT_WORDS (default 80):

  • Short documents produce less reliable signals.

Current Evaluation Status

Dokimos does not currently publish a formal benchmark with false-positive and false-negative rates.

What exists today is regression and ordering coverage in tests/test_evaluation.py. Those tests verify that:

  • clearly AI-like fixture text scores higher than natural human text
  • templated text scores higher than clearly human writing
  • mixed text remains within a sensible range
  • risk bands line up with configured thresholds
  • caveats and explanations are present when expected
  • edge cases such as empty documents and single-sentence input do not crash the engine

That is useful for guarding implementation drift, but it is not the same thing as a published accuracy study.

In other words:

  • the current tests support relative ordering and stability claims
  • they do not support a strong claim like “the AI detector is X% accurate”

Practical Guidance

  • Treat AI-likeness as a review signal, not a verdict.
  • Expect domain, genre, and document length to affect the score.
  • Use the JSON output when you want to inspect individual indicators, caveats, and chunk findings.
  • For higher-stakes decisions, keep a human review step in the loop.

Troubleshooting

dokimos: command not found

Refresh the editable install:

pip install -e .

If you are intentionally running directly from source instead of installing:

PYTHONPATH=src python -m dokimos --help

No module named dokimos

You are likely running from source without an install and without PYTHONPATH=src.

Use one of these:

pip install -e .
PYTHONPATH=src python -m dokimos --help

PDF or DOCX files fail to open

Install the required extras:

pip install -e ".[pdf,docx]"

Plagiarism returns zero matches

Check the basics:

  • make sure the machine has internet access if you are using remote or hybrid
  • make sure the remote providers you want are enabled
  • try increasing DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHARS slightly for longer phrases
  • if you want local matches too, make sure you indexed the source corpus first
  • make sure DOKIMOS_INDEX_FILE points to the expected index when using local or hybrid
  • remember that paywalled or inaccessible pages cannot be fetched by the free providers

Output is not appearing on stdout

That is expected in text mode.

  • --format text writes the human-readable summary to stderr
  • --format json writes structured JSON to stdout

If you are scripting, use:

dokimos analyze essay.txt --format json
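
From a Python script, the same command can be wrapped so that stdout is parsed as JSON and stderr is kept for diagnostics. This is a sketch that assumes dokimos is installed and on PATH. The analyze_to_dict name and the injectable run parameter are hypothetical conveniences, and treating any non-zero exit status as a failure is an assumption.

```python
import json
import subprocess
from typing import Any, Callable


def analyze_to_dict(path: str, run: Callable[..., Any] = subprocess.run) -> dict:
    """Run `dokimos analyze PATH --format json` and return the parsed report.

    `run` is injectable so the function can be exercised without the CLI.
    """
    cmd = ["dokimos", "analyze", path, "--format", "json"]
    proc = run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface whatever the CLI wrote to stderr.
        raise RuntimeError(
            f"dokimos exited with status {proc.returncode}: {proc.stderr.strip()}"
        )
    return json.loads(proc.stdout)
```

Calling analyze_to_dict("essay.txt") then returns the report as a dictionary, while any log output on stderr stays out of the parsed data.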

AI-likeness score seems surprising

Remember the score is heuristic, not proof of authorship.

Common reasons for surprising results:

  • very short documents produce weaker signals
  • domain-specific or highly repetitive human writing can look stylometrically unusual
  • mixed human and AI-assisted text can sit between the obvious extremes

Inspect the JSON output when you need more detail about individual indicators and caveats.