Calculate the ratio of unique words on a page

This bookmarklet calculates and displays the unique words ratio for the visible text of the current page: the total number of words divided by the number of distinct words, rounded to one decimal place.

(Drag this button to your bookmarks toolbar)

Source code

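// Split the visible text of the page into segments at word boundaries.
var segmenter = new Intl.Segmenter([], {granularity: 'word'});
var segmentedText = segmenter.segment(document.body.innerText);
// Keep only word-like segments, dropping whitespace and punctuation.
var words = [...segmentedText].filter(s => s.isWordLike).map(s => s.segment);
// Deduplicate the words (exact, case-sensitive match).
var uniqueWords = [...new Set(words)];
// Show total words divided by unique words, rounded to one decimal place.
alert('Unique words ratio: ' + Math.round(words.length / uniqueWords.length * 10.0) / 10.0);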
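
To try the calculation outside a browser page, the same logic can be wrapped in a function that takes the text as an argument. This is only a sketch: the name uniqueWordsRatio is hypothetical and not part of the bookmarklet, and it assumes a runtime that provides Intl.Segmenter.

```javascript
// Hypothetical helper: same logic as the bookmarklet, but the text is passed in.
function uniqueWordsRatio(text) {
  const segmenter = new Intl.Segmenter([], { granularity: 'word' });
  const words = [...segmenter.segment(text)]
    .filter(s => s.isWordLike)
    .map(s => s.segment);
  const uniqueWords = new Set(words);
  // Total words divided by distinct words, rounded to one decimal place.
  return Math.round(words.length / uniqueWords.size * 10) / 10;
}

// 6 words in total, 4 distinct ("to", "be", "or", "not"): 6 / 4 = 1.5
console.log(uniqueWordsRatio('to be or not to be')); // 1.5
```

A higher value means more repetition: a text in which every word appears once gives 1.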

Note that the bookmarklet counts only words in the visible text of the page. Hidden content, <img alt=""> text and anything inside <head> are not included. All words are counted as is, without any filtering or NLP preprocessing: no stop word removal, no stemming, and so on.
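
One consequence of counting words as is: the comparison is case-sensitive, so "The" and "the" are two different words. The sketch below makes this visible by counting the same sentence twice. The countWords helper and its optional normalize parameter are hypothetical, and the lowercasing step is an assumption added for comparison only; the bookmarklet does not do it.

```javascript
// Hypothetical helper: count word-like segments, optionally normalizing each word.
function countWords(text, normalize = w => w) {
  const segmenter = new Intl.Segmenter([], { granularity: 'word' });
  const words = [...segmenter.segment(text)]
    .filter(s => s.isWordLike)
    .map(s => normalize(s.segment));
  return { total: words.length, unique: new Set(words).size };
}

const sample = 'The cat saw the other cat';

// As the bookmarklet counts: "The" and "the" are distinct.
console.log(countWords(sample)); // { total: 6, unique: 5 }

// Assumption, not done by the bookmarklet: lowercase before deduplicating.
console.log(countWords(sample, w => w.toLowerCase())); // { total: 6, unique: 4 }
```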

Works with any language, including CJK and other languages written without spaces between words.
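
A quick way to see this is to segment a sentence that contains no spaces. The Japanese sample sentence below is an arbitrary choice for illustration. The exact split depends on the segmentation data shipped with the JavaScript engine, so no expected output is shown.

```javascript
// Same segmenter settings as the bookmarklet.
const jaSegmenter = new Intl.Segmenter([], { granularity: 'word' });

// Arbitrary sample sentence with no spaces between words.
const jaText = '今日は天気がいいですね';

// Keep only the word-like segments, as the bookmarklet does.
const jaWords = [...jaSegmenter.segment(jaText)]
  .filter(s => s.isWordLike)
  .map(s => s.segment);

// Prints the words the engine found inside the unbroken sentence.
console.log(jaWords);
```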

The idea is based on the paper "Detecting spam web pages through content analysis".
