Dokimos CLI
Dokimos is a command-line tool for:
- discovering plagiarism evidence from free remote sources
- optionally comparing against a locally indexed corpus
- scoring documents for AI-likeness using stylometric heuristics
- emitting either human-readable summaries or structured JSON reports
The Python distribution name is dokimos-cli, but the primary command is dokimos.
What Dokimos Does
Dokimos reads a document, extracts text, splits it into chunks, and can run up to two analyses:
- Plagiarism Detection: queries free remote providers, fetches source text where possible, and verifies overlap locally using shingling, Jaccard similarity, and RapidFuzz reranking.
- AI-Likeness Scoring: computes stylometric signals that may indicate unusually uniform, repetitive, or templated writing.
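To make the local verification step concrete, here is a minimal sketch of word shingling and Jaccard similarity. It is illustrative only and is not taken from the Dokimos source: the function names word_shingles and jaccard, the lowercase word tokenization, and the handling of very short inputs are all assumptions. The shingle size of 5 mirrors the documented DOKIMOS_SHINGLE_SIZE default, and the RapidFuzz reranking stage is left out.

```python
import re


def word_shingles(text: str, size: int = 5) -> set:
    """Return the set of overlapping word n-grams (shingles) found in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Assumption: a text shorter than one shingle becomes a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


chunk = "the quick brown fox jumps over the lazy dog"
source = "a quick brown fox jumps over the lazy dog today"
score = jaccard(word_shingles(chunk), word_shingles(source))
print(f"{score:.3f}")  # 0.571 (4 shared shingles out of 7 distinct ones)
```

In a pipeline like the one described above, a candidate source would be kept only when this score reaches the configured Jaccard threshold, and surviving candidates would then be reranked.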
Important Caveats
- AI-likeness output is heuristic and statistical. It is not proof of AI authorship.
- Remote plagiarism coverage depends on what free providers can discover and what source text is publicly accessible.
- Paywalled, private, or blocked pages are outside the reach of the free remote backends.
- Local indexing is optional and only helps when you have a private corpus you want to compare against.
- Short documents produce weaker AI-likeness signals and receive an explicit caveat.
Getting Started
Prerequisites
- Python 3.12+
- A virtual environment is strongly recommended
Install From This Repository
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e ".[dev,pdf,docx]"
Optional extras:
- pdf installs pymupdf so .pdf files can be read.
- docx installs python-docx so .docx files can be read.
- If you only need plain text and Markdown, pip install -e ".[dev]" is enough.
Installed Command Names
After installation, these invocation forms are available:
- dokimos ...
- python -m dokimos ...
- dokimos-cli ...
If dokimos is not on your shell PATH yet, refresh the editable install:
pip install -e .
Run Directly From Source
If you have not installed the package into the environment, run from the repository root:
PYTHONPATH=src python -m dokimos --help
Supported Input Formats
Dokimos currently supports:
- .txt
- .md
- .docx
- .pdf
Format support notes:
- .txt and .md are read as UTF-8 text.
- .docx requires the docx extra.
- .pdf requires the pdf extra.
If a dependency is missing, the CLI returns a structured error and exits with status code 1.
Quick Start
Run a full analysis
By default, Dokimos uses the hybrid plagiarism backend. That means:
- it queries free remote providers
- it also checks a local corpus if you have indexed one
- if you have not indexed anything locally, the remote path still runs normally
dokimos analyze essay.txt --format json
Analyze the example PDF
dokimos analyze examples/Research-Paper.pdf --format json
Build a local corpus index
If you want Dokimos to merge remote matches with your own document set, build a corpus index:
dokimos index-sources ./corpus
Typical output:
Scanned 42 file(s) (recursive)
Indexed: 38
Skipped: 1
Up-to-date: 3
Index: corpus/index.json
Write a report to disk
dokimos analyze essay.txt --json-out reports/essay.json
For command details, see CLI Reference. For environment variables, see Configuration.
CLI Reference
Top-level help:
dokimos --help
Global Options
- --log-level TEXT: set log level such as DEBUG, INFO, WARNING, or ERROR
- --install-completion: install shell completion
- --show-completion: print completion for the current shell
- --help: show help and exit
Commands
- analyze
- plagiarism
- ai-check
- index-sources
analyze
Runs both Plagiarism Detection and AI-Likeness Scoring unless one is explicitly disabled.
dokimos analyze FILE_PATH [OPTIONS]
Arguments:
FILE_PATH: path to the document to analyze
Options:
- --format [json|text]: output format, default text
- --json-out PATH: write the JSON report to a file instead of stdout
- --no-plagiarism: skip plagiarism analysis
- --no-ai-check: skip AI-likeness analysis
- --help
Examples:
dokimos analyze paper.pdf
dokimos analyze paper.pdf --format json
dokimos analyze paper.pdf --json-out output/paper.json
dokimos analyze paper.pdf --no-plagiarism
dokimos analyze paper.pdf --no-ai-check
plagiarism
Runs Plagiarism Detection only.
dokimos plagiarism FILE_PATH [OPTIONS]
Arguments:
FILE_PATH: path to the document to analyze
Options:
- --format [json|text]: output format, default text
- --json-out PATH: write the JSON report to a file instead of stdout
- --help
Example:
dokimos plagiarism essay.txt --format json
ai-check
Runs AI-Likeness Scoring only.
dokimos ai-check FILE_PATH [OPTIONS]
Arguments:
FILE_PATH: path to the document to analyze
Options:
- --format [json|text]: output format, default text
- --json-out PATH: write the JSON report to a file instead of stdout
- --help
Example:
dokimos ai-check essay.txt --format json
index-sources
Indexes a directory of source documents into the optional local corpus index used by local and hybrid plagiarism backends.
dokimos index-sources DIRECTORY_PATH [OPTIONS]
Arguments:
DIRECTORY_PATH: directory containing source documents
Options:
- --recursive / --no-recursive: recurse into subdirectories, default --recursive
- --help
Examples:
dokimos index-sources ./corpus
dokimos index-sources ./corpus --no-recursive
Behavior notes:
- Files are indexed only if they match supported extensions.
- Re-indexing is incremental and skips files that have not changed.
- The on-disk index is written to corpus/index.json by default.
- Remote plagiarism analysis does not require this index.
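One plausible way to get the incremental behavior described above is to store a content hash per file and compare it on the next run. The sketch below assumes exactly that; Dokimos may instead rely on modification times or some other mechanism. The names content_digest and plan_reindex are hypothetical, and treating unsupported extensions as the skipped category is also an assumption. It operates on in-memory bytes so it stays self-contained.

```python
import hashlib
from pathlib import Path

# Extensions listed under "Supported Input Formats".
SUPPORTED_EXTENSIONS = {".txt", ".md", ".docx", ".pdf"}


def content_digest(data: bytes) -> str:
    """Return a stable fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def plan_reindex(files: dict, previous: dict) -> dict:
    """Sort files into index / skip / up_to_date buckets.

    files maps a file name to its current bytes; previous maps a file name
    to the digest recorded during the last indexing run.
    """
    plan = {"index": [], "skip": [], "up_to_date": []}
    for name, data in sorted(files.items()):
        if Path(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
            plan["skip"].append(name)          # unsupported extension
        elif previous.get(name) == content_digest(data):
            plan["up_to_date"].append(name)    # unchanged since last run
        else:
            plan["index"].append(name)         # new or modified
    return plan


files = {"a.txt": b"one", "b.md": b"two", "c.png": b"x"}
previous = {"a.txt": content_digest(b"one"), "b.md": content_digest(b"old")}
print(plan_reindex(files, previous))
```

Hashing contents rather than checking timestamps costs a full read of every file, but it is not fooled by files that were touched without being changed.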
Output Behavior
Dokimos supports two output modes.
Text Output
--format text is the default. It prints a human-readable summary using Rich.
Important behavior:
- Text output is written to stderr, not stdout.
- This is useful for interactive terminal use.
- If you are scripting, prefer --format json.
JSON Output
--format json prints strict JSON to stdout.
This is the best choice for:
- shell pipelines
- CI jobs
- API handoff
- storing analysis artifacts
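For scripting from Python, a small wrapper along these lines can run the CLI and parse the JSON it prints to stdout. This is a sketch under assumptions: the helper names build_command and run_analysis are hypothetical, it only uses flags documented in the CLI Reference, and it treats any non-zero exit status as a failure without trying to interpret the structured error.

```python
import json
import subprocess


def build_command(path: str, *, plagiarism: bool = True, ai_check: bool = True) -> list:
    """Build the argument list for a JSON-mode analyze run."""
    cmd = ["dokimos", "analyze", path, "--format", "json"]
    if not plagiarism:
        cmd.append("--no-plagiarism")
    if not ai_check:
        cmd.append("--no-ai-check")
    return cmd


def run_analysis(path: str, **flags) -> dict:
    """Run dokimos and return the parsed report (requires dokimos on PATH)."""
    result = subprocess.run(build_command(path, **flags), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"dokimos exited with status {result.returncode}: {result.stderr.strip()}"
        )
    return json.loads(result.stdout)


# Building the command has no side effects, so it is safe to inspect first.
print(build_command("essay.txt", ai_check=False))
```

Calling run_analysis("essay.txt") would then return the report as a dictionary. Keeping stdout and stderr separate matters here because only stdout is expected to carry JSON.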
json-out
--json-out PATH always writes JSON to the specified file, regardless of the selected --format value.
Example:
dokimos analyze essay.txt --format text --json-out reports/essay.json
In that case:
- the JSON report is written to the file
- the CLI prints a confirmation message
- JSON is not emitted to stdout
JSON Report Shape
The report includes top-level fields like:
- schema_version
- document_id
- version
- status
- generated_at
- document
- summary
- plagiarism
- ai_likelihood
- caveats
Example:
{
  "schema_version": "1.0",
  "document_id": "...",
  "status": "complete",
  "document": {
    "source_path": "essay.txt",
    "filename": "essay.txt",
    "file_format": ".txt",
    "character_count": 1234
  },
  "summary": {
    "analyses_run": ["plagiarism", "ai_check"],
    "plagiarism_match_count": 2,
    "plagiarism_overall_score": 0.85,
    "ai_likeness_score": 0.34,
    "automated_writing_risk": "medium",
    "human_review_recommended": true
  },
  "plagiarism": { "...": "..." },
  "ai_likelihood": { "...": "..." },
  "caveats": []
}
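As an illustration of consuming this shape, the following sketch condenses a parsed report into one line for a log or dashboard. The helper name one_line_summary is hypothetical, and it only reads fields that appear in the example above, skipping any that are absent (for instance when one analysis was disabled).

```python
def one_line_summary(report: dict) -> str:
    """Condense a parsed Dokimos report into a single status line."""
    doc = report.get("document", {})
    summary = report.get("summary", {})
    parts = [doc.get("filename", "<unknown>")]
    if "plagiarism_match_count" in summary:
        parts.append(f"matches={summary['plagiarism_match_count']}")
    if "ai_likeness_score" in summary:
        parts.append(f"ai={summary['ai_likeness_score']:.2f}")
    if "automated_writing_risk" in summary:
        parts.append(f"risk={summary['automated_writing_risk']}")
    if summary.get("human_review_recommended"):
        parts.append("REVIEW")
    return " ".join(parts)


sample = {
    "document": {"filename": "essay.txt"},
    "summary": {
        "plagiarism_match_count": 2,
        "ai_likeness_score": 0.34,
        "automated_writing_risk": "medium",
        "human_review_recommended": True,
    },
    "caveats": [],
}
print(one_line_summary(sample))  # essay.txt matches=2 ai=0.34 risk=medium REVIEW
```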
Configuration
All settings can be overridden with DOKIMOS_-prefixed environment variables.
Core Settings
| Variable | Default | Description |
|---|---|---|
| DOKIMOS_LOG_LEVEL | INFO | Default application log level |
| DOKIMOS_CORPUS_PATH | corpus | Directory used for the internal source corpus |
| DOKIMOS_INDEX_FILE | corpus/index.json | JSON index file path |
| DOKIMOS_OUTPUT_DIR | output | Default output directory for reports |
Chunking Settings
| Variable | Default | Description |
|---|---|---|
| DOKIMOS_CHUNK_STRATEGY | paragraph | One of paragraph, sentence, fixed |
| DOKIMOS_CHUNK_SIZE | 500 | Approximate words per chunk for fixed chunking |
| DOKIMOS_CHUNK_OVERLAP | 50 | Overlap between fixed chunks |
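To show how the size and overlap settings interact, here is a minimal sketch of fixed chunking. It assumes the overlap is counted in words and that each chunk starts size minus overlap words after the previous one; the table does not state the overlap unit, and the function name fixed_chunks is hypothetical.

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split text into word windows of `size`, each sharing `overlap` words with the last."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(str(i) for i in range(10))
print(fixed_chunks(text, size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

A non-zero overlap means a passage that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some repeated work.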
Plagiarism Settings
| Variable | Default | Description |
|---|---|---|
| DOKIMOS_PLAGIARISM_BACKEND | hybrid | One of local, remote, or hybrid |
| DOKIMOS_SHINGLE_SIZE | 5 | Number of words per shingle |
| DOKIMOS_PLAGIARISM_JACCARD_THRESHOLD | 0.10 | Minimum Jaccard similarity to retain a candidate |
| DOKIMOS_PLAGIARISM_FUZZ_THRESHOLD | 60.0 | Minimum RapidFuzz score to keep a reranked match |
| DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHUNKS | 5 | Maximum input chunks used as remote search queries |
| DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHARS | 180 | Maximum characters sent in each remote query |
| DOKIMOS_PLAGIARISM_REMOTE_PER_PROVIDER_RESULTS | 3 | Maximum candidate sources requested from each provider per query |
| DOKIMOS_PLAGIARISM_REMOTE_TIMEOUT_SECONDS | 10.0 | Remote request timeout |
| DOKIMOS_PLAGIARISM_REMOTE_MAX_SOURCE_CHARS | 20000 | Maximum retained text size per fetched remote source |
| DOKIMOS_PLAGIARISM_REMOTE_FETCH_FULL_TEXT | true | Attempt to fetch source pages or PDFs instead of relying on metadata |
| DOKIMOS_PLAGIARISM_REMOTE_CONTACT_EMAIL | unset | Optional contact email included where provider etiquette recommends it |
| DOKIMOS_PLAGIARISM_REMOTE_ENABLE_OPENALEX | true | Enable OpenAlex provider |
| DOKIMOS_PLAGIARISM_REMOTE_ENABLE_CROSSREF | true | Enable Crossref provider |
| DOKIMOS_PLAGIARISM_REMOTE_ENABLE_ARXIV | true | Enable arXiv provider |
| DOKIMOS_PLAGIARISM_REMOTE_ENABLE_DUCKDUCKGO | true | Enable DuckDuckGo HTML web-search provider |
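The two query limits can be pictured with a short sketch. It assumes the first chunks of the document are the ones used as queries and that over-long queries are trimmed at a word boundary; how Dokimos actually selects and trims chunks is not documented here, and build_queries is a hypothetical name. The defaults mirror the table above.

```python
def build_queries(chunks: list, max_chunks: int = 5, max_chars: int = 180) -> list:
    """Turn document chunks into at most max_chunks queries of at most max_chars each."""
    queries = []
    for chunk in chunks[:max_chunks]:
        text = " ".join(chunk.split())  # collapse newlines and repeated spaces
        if len(text) > max_chars:
            window = text[:max_chars + 1]
            # Prefer cutting at the last space so no word is split in half.
            text = window.rsplit(" ", 1)[0] if " " in window else text[:max_chars]
        if text:
            queries.append(text)
    return queries


print(build_queries(["alpha beta gamma"], max_chars=10))  # ['alpha beta']
```

Raising the character limit sends longer phrases, which tends to make matches more specific but can reduce the number of results a provider returns.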
AI-Likeness Settings
| Variable | Default | Description |
|---|---|---|
| DOKIMOS_AI_SHORT_SENTENCE_THRESHOLD | 8 | Sentences at or below this word count are considered short |
| DOKIMOS_AI_LONG_SENTENCE_THRESHOLD | 40 | Sentences at or above this word count are considered long |
| DOKIMOS_AI_SIGNAL_TRIGGER_THRESHOLD | 0.5 | Per-signal threshold to mark a signal as triggered |
| DOKIMOS_AI_RISK_HIGH_THRESHOLD | 0.6 | Aggregate score threshold for high risk |
| DOKIMOS_AI_RISK_MEDIUM_THRESHOLD | 0.3 | Aggregate score threshold for medium risk |
| DOKIMOS_AI_SHORT_DOCUMENT_WORDS | 80 | Documents below this size receive the short-document caveat |
Typical Workflows
Analyze a single paper against free remote sources
dokimos analyze submissions/paper-01.pdf --format json
Combine remote discovery with a local course corpus
export DOKIMOS_INDEX_FILE=/tmp/course-a-index.json
dokimos index-sources ./course-a-corpus
dokimos analyze submissions/paper-01.pdf --format json
Force remote-only analysis
export DOKIMOS_PLAGIARISM_BACKEND=remote
dokimos analyze essay.txt --format json
Force local-only offline analysis
export DOKIMOS_PLAGIARISM_BACKEND=local
dokimos index-sources ./course-a-corpus
dokimos analyze essay.txt --format json
Use sentence chunking for a short-form writing set
export DOKIMOS_CHUNK_STRATEGY=sentence
dokimos analyze essay.txt --format json
Generate a saved report for later inspection
dokimos analyze essay.txt --json-out reports/essay.json
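When these workflows are driven from Python rather than a shell, the same overrides can be passed per invocation instead of being exported globally. The sketch below builds such an environment; the helper name dokimos_env is hypothetical, and the allowed values are copied from the Configuration tables. Validating them up front is a convenience of this sketch, not documented Dokimos behavior.

```python
import os
from typing import Optional

# Allowed values taken from the Configuration tables.
ALLOWED_VALUES = {
    "DOKIMOS_PLAGIARISM_BACKEND": {"local", "remote", "hybrid"},
    "DOKIMOS_CHUNK_STRATEGY": {"paragraph", "sentence", "fixed"},
}


def dokimos_env(overrides: dict, base: Optional[dict] = None) -> dict:
    """Return a copy of the environment with validated DOKIMOS_ overrides applied."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if not name.startswith("DOKIMOS_"):
            raise ValueError(f"not a Dokimos setting: {name}")
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
        env[name] = value
    return env


env = dokimos_env({"DOKIMOS_PLAGIARISM_BACKEND": "remote"}, base={"PATH": "/usr/bin"})
print(env["DOKIMOS_PLAGIARISM_BACKEND"])  # remote
```

The resulting dictionary can be passed as the env argument of subprocess.run, which keeps one run's settings from leaking into the next.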
AI-Likeness
Dokimos includes a rule-based stylometric AI-likeness engine.
It computes six signals per chunk, aggregates them into a document-level score, and maps that score to a low, medium, or high automated-writing-risk band.
This score is heuristic and statistical. It is not proof of AI authorship and should not be treated as such.
The Six Signals
The current engine computes these signals:
- sentence_length_uniformity: Measures how uniform sentence lengths are. Low variation can correlate with machine-generated text.
- short_sentence_ratio: Measures the fraction of very short sentences. A high ratio can suggest formulaic or list-like writing.
- long_sentence_ratio: Measures the fraction of very long sentences. A high ratio can also indicate auto-generated prose.
- lexical_diversity: Uses type-token ratio and inverts it into a risk contribution. Low vocabulary variety increases risk.
- sentence_start_repetition: Measures whether multiple sentences begin with the same word. Repeated openings can suggest template-driven generation.
- avg_word_length_uniformity: Measures how uniform average word length is across sentences. Unusually regular patterns can be a machine-writing indicator.
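Two of these signals are simple enough to sketch directly. The code below is an approximation for illustration, not the engine's implementation: the sentence splitter is deliberately naive, the function names are hypothetical, and the exact normalization Dokimos applies is not specified in this document. The short-sentence threshold of 8 words matches the documented DOKIMOS_AI_SHORT_SENTENCE_THRESHOLD default.

```python
import re


def split_sentences(text: str) -> list:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def short_sentence_ratio(text: str, threshold: int = 8) -> float:
    """Fraction of sentences with at most `threshold` words."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    short = sum(1 for s in sentences if len(s.split()) <= threshold)
    return short / len(sentences)


def lexical_diversity_risk(text: str) -> float:
    """Inverted type-token ratio: 0.0 means every word is unique, higher means more repetition."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


sample = "The cat sat. The cat sat. The cat sat on the mat with the other cat today."
print(round(short_sentence_ratio(sample), 3))            # 0.667
print(lexical_diversity_risk("the cat sat the cat sat"))  # 0.5
```

Type-token ratio falls as any text grows longer, which is one reason document length affects the score.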
Aggregation and Risk Bands
Each signal contributes to a weighted mean. Most signals use weight 1.0. sentence_start_repetition uses a lower weight because it is less reliable on short text.
The document-level score is then mapped to the configured risk bands:
- high when score is at least DOKIMOS_AI_RISK_HIGH_THRESHOLD (default 0.6)
- medium when score is at least DOKIMOS_AI_RISK_MEDIUM_THRESHOLD (default 0.3)
- low otherwise
Chunk-level findings are also included so you can see where signals cluster inside the document.
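The weighted mean and band mapping described above can be sketched as follows. The function names are hypothetical, and the weight of 0.5 for sentence_start_repetition is a placeholder chosen for the example; this document only says that weight is lower than 1.0. The band thresholds use the documented defaults.

```python
from typing import Optional


def aggregate(signals: dict, weights: Optional[dict] = None) -> float:
    """Weighted mean of signal values; any signal without an explicit weight uses 1.0."""
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in signals)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(value * weights.get(name, 1.0) for name, value in signals.items())
    return weighted_sum / total_weight


def risk_band(score: float, high: float = 0.6, medium: float = 0.3) -> str:
    """Map an aggregate score to low / medium / high using at-least comparisons."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


signals = {"lexical_diversity": 0.4, "sentence_start_repetition": 1.0}
weights = {"sentence_start_repetition": 0.5}  # placeholder value, not the real weight
score = aggregate(signals, weights)  # (0.4 * 1.0 + 1.0 * 0.5) / 1.5
print(round(score, 2), risk_band(0.34))  # 0.6 medium
```

Because the lower-weighted signal counts for less in both the numerator and the denominator, a single noisy signal moves the aggregate less than it would in a plain average.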
Caveats
The engine always emits a stylometric_only caveat:
Stylometric heuristics only — not a substitute for human review.
It also emits a short_document caveat when total word count is below DOKIMOS_AI_SHORT_DOCUMENT_WORDS (default 80):
Short documents produce less reliable signals.
Current Evaluation Status
Dokimos does not currently publish a formal benchmark with false-positive and false-negative rates.
What exists today is regression and ordering coverage in tests/test_evaluation.py. Those tests verify that:
- clearly AI-like fixture text scores higher than natural human text
- templated text scores higher than clearly human writing
- mixed text remains within a sensible range
- risk bands line up with configured thresholds
- caveats and explanations are present when expected
- edge cases such as empty documents and single-sentence input do not crash the engine
That is useful for guarding implementation drift, but it is not the same thing as a published accuracy study.
In other words:
- the current tests support relative ordering and stability claims
- they do not support a strong claim like “the AI detector is X% accurate”
Practical Guidance
- Treat AI-likeness as a review signal, not a verdict.
- Expect domain, genre, and document length to affect the score.
- Use the JSON output when you want to inspect individual indicators, caveats, and chunk findings.
- For higher-stakes decisions, keep a human review step in the loop.
Troubleshooting
dokimos: command not found
Refresh the editable install:
pip install -e .
If you are intentionally running directly from source instead of installing:
PYTHONPATH=src python -m dokimos --help
No module named dokimos
You are likely running from source without an install and without PYTHONPATH=src.
Use one of these:
pip install -e .
PYTHONPATH=src python -m dokimos --help
PDF or DOCX files fail to open
Install the required extras:
pip install -e ".[pdf,docx]"
Plagiarism returns zero matches
Check the basics:
- make sure the machine has internet access if you are using remote or hybrid
- make sure the remote providers you want are enabled
- try increasing DOKIMOS_PLAGIARISM_REMOTE_QUERY_MAX_CHARS slightly for longer phrases
- if you want local matches too, make sure you indexed the source corpus first
- make sure DOKIMOS_INDEX_FILE points to the expected index when using local or hybrid
- remember that paywalled or inaccessible pages cannot be fetched by the free providers
Output is not appearing on stdout
That is expected in text mode.
- --format text writes the human-readable summary to stderr
- --format json writes structured JSON to stdout
If you are scripting, use:
dokimos analyze essay.txt --format json
AI-likeness score seems surprising
Remember the score is heuristic, not proof of authorship.
Common reasons for surprising results:
- very short documents produce weaker signals
- domain-specific or highly repetitive human writing can look stylometrically unusual
- mixed human and AI-assisted text can sit between the obvious extremes
Inspect the JSON output when you need more detail about individual indicators and caveats.