Inverted Wordles

Jutta asked: "Wordles or word clouds centre and increase the size of majority words. I would like to flip the algorithm so that it is the minority or novel words that are centred and amplified. We did that with a proprietary Word Cloud software for UofT class I taught."

How to characterise a "minority or novel" word is an interesting issue. There are lots of well-established packages which will accept an array of words together with a measure of "importance", and these typically derive "importance" from a simple count. The naive approach to inverting this algorithm, and instead using it to visualise the margins, would be to simply flip the array of frequencies and make importance inversely proportional to frequency. This is likely to produce results which are highly "noisy" and unstable. Especially in short texts, there may be any number of reasons why, say, a term has a count of 1, and the fact that it occurs just once will have little relationship with the structure and intention of the text. What might be more interesting is a comparison of these low-frequency terms with the background frequency of those terms in language at large. That is, a good candidate to be assigned high importance in the visualisation is a term that:

i) Has a low frequency in the text itself (although in short texts there is a fairly high floor on how low this frequency can be, since any term that appears at all occurs at least once), but also

ii) Has a much lower frequency in some background, very large, language corpus from which the text might be drawn.
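A minimal sketch of the difference between the naive inversion and the two criteria above, in Python. The function names, the `max_count` cut-off, the `floor` value for words missing from the background list, and the background frequencies used in any example are all illustrative assumptions, not part of any existing package or real corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_inverse_weights(text):
    """Naive inversion: importance is 1 / count.

    Every word that occurs once gets the same top weight, which is
    why this tends to look noisy on short texts.
    """
    counts = Counter(tokenize(text))
    return {word: 1.0 / count for word, count in counts.items()}


def background_weights(text, background_freq, max_count=1, floor=1e-9):
    """Weight low-count words by how improbable they are in the background.

    background_freq maps word -> relative frequency in a large corpus
    (hypothetical input). Words absent from it are given `floor`, an
    assumed stand-in for "too rare to have been measured".
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    weights = {}
    for word, count in counts.items():
        if count > max_count:  # criterion i): keep only low-frequency terms
            continue
        p_text = count / total
        p_background = max(background_freq.get(word, 0.0), floor)
        weights[word] = p_text / p_background  # criterion ii)
    return weights
```

With the naive version, all single-occurrence words tie for first place; the background comparison separates an everyday word that happens to occur once from one that is rare in the language at large.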

Interestingly, this idea, applied not to individual words but to clusters of words, seems to have been pursued by Amazon for several years under the name of "Statistically Improbable Phrases" (SIPs), perhaps between 2005 and 2013, in order to epitomise what a text was about. A discussion thread on the topic also refers to a measure named "term frequency–inverse document frequency" (TF-IDF), which has links with information theory.
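For reference, term frequency–inverse document frequency can be sketched in a few lines of plain Python. This uses one common smoothed form of the inverse document frequency, log((1 + N) / (1 + df)); other variants exist, and the function name and the toy corpus in the usage note are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each term of one document against a corpus of documents.

    doc_tokens: list of tokens for the document of interest.
    corpus: list of token lists, one per background document.
    """
    n_docs = len(corpus)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    term_counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in term_counts.items():
        tf = count / total
        # Smoothed idf, so terms unseen in the corpus do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term]))
        scores[term] = tf * idf
    return scores
```

For a toy corpus of three two-word documents, `["the", "cat"]`, `["the", "dog"]` and `["the", "quokka"]`, scoring the third document gives "the" a score of zero, because it appears in every document, and "quokka" a positive score.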

Given that correlating these two criteria i) and ii) could be done in a number of ways, it might be a useful exploration to produce a variety of means of trading off one against the other in SIP-like ways, allowing users to see the effect on the appearance of the resulting Wordle. One prerequisite of this is to lay our hands on a word frequency list that is fairly accurate at the opposite end from the usual one, that is, in assessing the frequency of rare words. In contrast, readily available corpora of this kind concentrate on determining the frequencies of more common words, e.g. the most common 220,000 words.
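One possible way to expose such a trade-off is a single parameter that blends the two criteria, sketched below. The blend formula, the parameter name `alpha`, and the `floor` used for words missing from the background list are all assumptions for illustration; this is one of many ways the criteria could be combined, not an established method.

```python
import math
from collections import Counter


def tradeoff_weights(tokens, background_freq, alpha=0.5, floor=1e-9):
    """Blend rarity in the text with improbability against a background.

    alpha = 1.0 uses only criterion i) (rarity within the text);
    alpha = 0.0 uses only criterion ii) (text frequency relative to
    the background frequency). Returns weights rescaled to [0, 1].
    """
    counts = Counter(tokens)
    total = len(tokens)
    raw = {}
    for word, count in counts.items():
        p_text = count / total
        # Assumed floor for words the background list does not cover.
        p_background = max(background_freq.get(word, 0.0), floor)
        rarity_in_text = -math.log(p_text)
        surprise = math.log(p_text / p_background)
        raw[word] = alpha * rarity_in_text + (1 - alpha) * surprise
    if not raw:
        return {}
    lowest, highest = min(raw.values()), max(raw.values())
    span = (highest - lowest) or 1.0  # avoid dividing by zero when all tie
    return {word: (score - lowest) / span for word, score in raw.items()}
```

The resulting word-to-weight mapping is the kind of input that word cloud packages typically accept, so a slider bound to `alpha` would let a user watch the Wordle shift between the two extremes. The choice of `floor` matters a great deal here, which is another reason an accurate frequency list for rare words is a prerequisite.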