This bookmarklet calculates and displays the unique words ratio (total word count divided by the number of distinct words) for the visible text of the current page.
(Drag this button to your bookmarks toolbar)
var segmenter = new Intl.Segmenter([], {granularity: 'word'});
var segmentedText = segmenter.segment(document.body.innerText);
var words = [...segmentedText].filter(s => s.isWordLike).map(s => s.segment);
var uniqueWords = [...new Set(words)];
alert('Unique words ratio: ' + Math.round(words.length / uniqueWords.length * 10.0) / 10.0);
Note that it counts only words in the visible text of the page. This excludes hidden content, <img alt=""> text, and anything inside <head>. All words are counted as is, without any filtering or NLP preprocessing: no stop-word removal, no stemming, and so on.
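To illustrate what "counted as is" means in practice, here is a minimal sketch that applies the same computation to a plain string. The helper name `uniqueWordsRatio` and its optional `normalize` parameter are hypothetical and not part of the bookmarklet. The case-folding variant is shown only for contrast, since the bookmarklet itself does no such preprocessing.

```javascript
// Hypothetical helper (not part of the bookmarklet): same computation on a plain string.
// `normalize` is an illustrative hook; the default leaves every word untouched.
function uniqueWordsRatio(text, normalize = (w) => w) {
  const segmenter = new Intl.Segmenter([], { granularity: 'word' });
  const words = [...segmenter.segment(text)]
    .filter((s) => s.isWordLike)          // drop spaces and punctuation
    .map((s) => normalize(s.segment));
  if (words.length === 0) return NaN;     // no words: the ratio is undefined
  const unique = new Set(words);
  return Math.round((words.length / unique.size) * 10) / 10;
}

// Counted as is: "The" and "the" are two different words (5 words, 4 unique).
const asIs = uniqueWordsRatio('The cat saw the cat');
console.log(asIs);   // 1.3

// With case folding (for contrast only): "The" and "the" merge (5 words, 3 unique).
const folded = uniqueWordsRatio('The cat saw the cat', (w) => w.toLowerCase());
console.log(folded); // 1.7
```

A higher ratio means more repetition. Because there is no normalization, differences in capitalization or inflection make a page look slightly less repetitive than it would after preprocessing.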
Works with any language, including CJK and similar.
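The language independence comes from `Intl.Segmenter`, which finds word boundaries even in text written without spaces. Here is a small sketch, assuming a runtime with full ICU data (current browsers and standard Node.js builds). The sample sentence and variable names are illustrative only, and the exact split depends on the runtime's segmentation data, so no expected output is shown.

```javascript
// Illustrative sample (not from the bookmarklet): a Chinese sentence with no spaces.
const sampleText = '我喜欢学习中文';

// Same segmenter setup as the bookmarklet: default locale, word granularity.
const cjkSegmenter = new Intl.Segmenter([], { granularity: 'word' });

// Keep only word-like segments, exactly as the bookmarklet does.
const cjkWords = [...cjkSegmenter.segment(sampleText)]
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);

// Prints an array of word segments; the exact boundaries depend on the runtime.
console.log(cjkWords);
```

Splitting on whitespace would treat the whole sentence as a single word, which is why the bookmarklet relies on the segmenter instead.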
The idea is based on the paper "Detecting spam web pages through content analysis".
Leave a comment here: @ugnich