TF-IDF Killed The Copywriting Spam
Note: If you want to play with TF-IDF, download this spreadsheet. The first tab is a simple TF-IDF calculator. Enter the occurrences of a word, the number of words in the document, the total number of documents and the number of documents containing the phrase. It does the rest. The second tab shows how TF-IDF falls as the number of documents containing a phrase goes up.
Term Frequency-Inverse Document Frequency (TF-IDF) proves that, in the modern SEO game, quality trumps quantity.
TF-IDF is a text-mining algorithm …
Wait! Don’t run! I’m not here to teach you the math behind TF-IDF. Truth is, I barely understand it myself. But Term Frequency, Inverse Document Frequency (TF-IDF – a great phrase for the next SEO cocktail party you attend) contains some crucial lessons for us copywriters.
Here’s a very brief description of TF-IDF and how it works (a little fancy math involved):
TF = Term Frequency
We all know this one:
If our key phrase is “flibbergibbet,” and it occurs 4 times in a document that’s 400 words in length, then the TF for “flibbergibbet” is:
4 / 400 = 0.01, or 1%
Some folks call this keyword density. But we’re past that now.
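If you'd rather see the arithmetic as code, here is a minimal Python sketch of the term frequency calculation above. The function name term_frequency is my own label for illustration, not something from a library.

```python
def term_frequency(occurrences: int, total_words: int) -> float:
    """Return the share of a document's words taken up by the term."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return occurrences / total_words


# "flibbergibbet" occurs 4 times in a 400-word document
tf = term_frequency(4, 400)
print(f"{tf:.0%}")  # prints 1%
```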
Inverse Document Frequency
Inverse Document Frequency (IDF) is the inverse of the number of documents in which a phrase occurs. That’s a terrible description – I know that because the mathematicians I know all punched me in the arm after I said it. But it’ll work for our purposes.
In case you want to know:
IDF = log(total documents / number of documents with phrase), where log is the base-10 logarithm
So, if “flibbergibbet” appears in 250 out of 1000 documents, the IDF is:
log(1000/250) = log(4) ≈ 0.6
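The same calculation as a short Python sketch. I'm assuming a base-10 logarithm here, because that is what produces the 0.6 in the example. The function name is made up for illustration.

```python
import math


def inverse_document_frequency(total_documents: int, documents_with_phrase: int) -> float:
    """Base-10 log of (total documents / documents containing the phrase)."""
    if documents_with_phrase <= 0 or documents_with_phrase > total_documents:
        raise ValueError("documents_with_phrase must be between 1 and total_documents")
    return math.log10(total_documents / documents_with_phrase)


# "flibbergibbet" appears in 250 out of 1000 documents
idf = inverse_document_frequency(1000, 250)
print(round(idf, 1))  # prints 0.6
```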
TF-IDF
TF-IDF is the Term Frequency times the Inverse Document Frequency, or TF*IDF.
Here’s the thing about IDF that you must understand: As the number of documents containing a phrase goes up, the TF-IDF score goes down. Have a look at this graph — document frequency goes up as you move to the right:
Yikes. So, the more documents in which you mention a phrase, the less important that phrase appears on any one page.
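To see the drop in numbers, here is a rough Python sketch. It holds term frequency at the 1% from the earlier example, assumes a 1,000-document collection and a base-10 logarithm, and only changes how many documents contain the phrase. The document counts are values I picked for illustration.

```python
import math


def tf_idf(tf: float, total_documents: int, documents_with_phrase: int) -> float:
    """Term frequency times base-10 inverse document frequency."""
    return tf * math.log10(total_documents / documents_with_phrase)


TF = 0.01               # 4 occurrences in 400 words, as in the earlier example
TOTAL_DOCUMENTS = 1000  # assumed size of the collection

# Illustrative counts of documents containing the phrase
document_counts = [1, 10, 100, 250, 500, 1000]
scores = [tf_idf(TF, TOTAL_DOCUMENTS, n) for n in document_counts]

for n, score in zip(document_counts, scores):
    print(f"{n:>5} documents with phrase -> TF-IDF {score:.4f}")
```

The score shrinks at every step and reaches zero once every document contains the phrase, because log(1) is 0.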
What It All Means
We don’t know for certain if the search engines use TF-IDF to determine the importance of a word on a page. But it’s likely they use it or something very like it.
Say you want your website to rank well for our favorite word. You include the word at least 3 times on every single page of your site. That actually reduces the TF-IDF score of each page for “flibbergibbet.”
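Here is a hedged sketch of that scenario in Python. It assumes the collection being scored is just your own site, which is my simplification and not a known detail of how any search engine works. The 150-page site, 400-word pages and base-10 logarithm are also assumptions for illustration.

```python
import math

SITE_PAGES = 150       # assumed site size
WORDS_PER_PAGE = 400   # assumed page length
MENTIONS_PER_PAGE = 3  # the "at least 3 times" from the scenario


def site_tf_idf(pages_with_phrase: int) -> float:
    """TF-IDF of the phrase on one page, treating the site as the whole collection."""
    tf = MENTIONS_PER_PAGE / WORDS_PER_PAGE
    idf = math.log10(SITE_PAGES / pages_with_phrase)
    return tf * idf


everywhere = site_tf_idf(150)  # phrase stuffed onto every page
focused = site_tf_idf(5)       # phrase kept to 5 relevant pages

print(f"Every page:  {everywhere:.4f}")
print(f"Five pages:  {focused:.4f}")
```

Under these assumptions, putting the phrase on every page drives each page's score to zero, while keeping it to a handful of pages leaves each of them with a positive score.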
Of course, there are many, many other ranking factors. Thousands. If your site is 150 pages of fantastic content, and it:
- Has a unique, fully descriptive title tag for every page
- Has a unique structure for every page
- Doesn’t spin or duplicate content
- Uses fully-descriptive ALT attributes, etc. etc.
… then TF-IDF probably doesn’t hurt you at all. A visiting search engine can use other signals to determine page relevance.
But content farmers, beware. If you crank out 999 pages of total crap, using your key phrase 5-10 times per page, all you’ve done is make it harder for a search engine to figure out which page is most important for that phrase.
If I were a search engine (and I’m not), I’d take that as a signal of a poorly-organized site.
Wouldn’t you?
The Lesson
In the past, site owners created page after page expounding on a specific key phrase, repeating it time after time in articles that were barely different, poorly written and poorly structured. That’s still a standard “SEO copywriting” tactic. I use quotes because it’s not SEO copywriting at all.
TF-IDF explains why that tactic has lost its power. It also shows why the cliche “If you want to rank, write good stuff,” really is the right strategy. TF-IDF means more isn’t necessarily better. So, write good stuff!
About Ian Lurie
Ian Lurie is Chief Marketing Curmudgeon and President at Portent Interactive, an Internet marketing company he started in 1995. Ian started practicing SEO in 1997, and has been addicted ever since. For more of Ian Lurie’s smarts, raves and rants, check out his Conversation Marketing blog. He’s also published several reader-friendly, no-nonsense ebooks on SEO copywriting, including The Unscary, Real World Guide to SEO Copywriting and Fat Free Guide to Google Analytics. Follow him on Twitter: @portentint.
Huh? I thought that the IDF was calculated from *all* known documents – not just the ones on your site – and was thus used to filter out common words whose TF should not be used as a ranking factor.
e.g. The word “good” might have 5% TF in an (uncreatively written) blog post about my good times on vacation but that doesn’t mean it’s a relevant result for “good” queries. Search engines can use IDF to determine that – across the interwebs – “good” is a common phrase and its appearance on my page isn’t especially noteworthy / shouldn’t be heavily weighted as a ranking factor.
Seems to me that if single-site IDF was used as a ranking factor many sites would have a harder time ranking for their brand terms or core products than a competing site that only mentions your brand / products on one page.