AI-Likeness
Dokimos includes a rule-based stylometric AI-likeness engine.
It computes six signals per chunk, aggregates them into a document-level score, and maps that score to a low, medium, or high automated-writing-risk band.
This score is heuristic and statistical. It is not proof of AI authorship and should not be treated as such.
The Six Signals
The current engine computes these signals:
-
sentence_length_uniformityMeasures how uniform sentence lengths are. Low variation can correlate with machine-generated text. -
short_sentence_ratioMeasures the fraction of very short sentences. A high ratio can suggest formulaic or list-like writing. -
long_sentence_ratioMeasures the fraction of very long sentences. A high ratio can also indicate auto-generated prose. -
lexical_diversityUses type-token ratio and inverts it into a risk contribution. Low vocabulary variety increases risk. -
sentence_start_repetitionMeasures whether multiple sentences begin with the same word. Repeated openings can suggest template-driven generation. -
avg_word_length_uniformityMeasures how uniform average word length is across sentences. Unusually regular patterns can be a machine-writing indicator.
Aggregation and Risk Bands
Each signal contributes to a weighted mean. Most signals use weight 1.0. sentence_start_repetition uses a lower weight because it is less reliable on short text.
The document-level score is then mapped to the configured risk bands:
highwhen score is at leastDOKIMOS_AI_RISK_HIGH_THRESHOLD(default0.6)mediumwhen score is at leastDOKIMOS_AI_RISK_MEDIUM_THRESHOLD(default0.3)lowotherwise
Chunk-level findings are also included so you can see where signals cluster inside the document.
Caveats
The engine always emits a stylometric_only caveat:
Stylometric heuristics only — not a substitute for human review.
It also emits a short_document caveat when total word count is below DOKIMOS_AI_SHORT_DOCUMENT_WORDS (default 80):
Short documents produce less reliable signals.
Current Evaluation Status
Dokimos does not currently publish a formal benchmark with false-positive and false-negative rates.
What exists today is regression and ordering coverage in tests/test_evaluation.py. Those tests verify that:
- clearly AI-like fixture text scores higher than natural human text
- templated text scores higher than clearly human writing
- mixed text remains within a sensible range
- risk bands line up with configured thresholds
- caveats and explanations are present when expected
- edge cases such as empty documents and single-sentence input do not crash the engine
That is useful for guarding implementation drift, but it is not the same thing as a published accuracy study.
In other words:
- the current tests support relative ordering and stability claims
- they do not support a strong claim like “the AI detector is X% accurate”
Practical Guidance
- Treat AI-likeness as a review signal, not a verdict.
- Expect domain, genre, and document length to affect the score.
- Use the JSON output when you want to inspect individual indicators, caveats, and chunk findings.
- For higher-stakes decisions, keep a human review step in the loop.