THE SQL Server Blog Spot on the Web

John Paul Cook

Analyzing Complexity of Text

Storing text in a database is common. What isn’t common is needing to know the reading ease and grade level of that text, but I was presented with such a requirement (actually it was more of a wish list item) this week. There are ways of solving this problem. In the conclusion to this post, I outline the steps for implementing T-SQL code to estimate reading complexity. I think the topic of reading ease and grade level ratings is potentially of greater general interest than you might at first think. For example, you could have data-driven web pages accessed by the general public where choices entered by the users deliver custom content. Perhaps you need to deliver product-specific operating or safety instructions.

For purposes of this post, I’ve assumed that you have a large body of unanalyzed text in VARCHAR and NVARCHAR columns. Text inside Word documents that are stored in SQL Server FILESTREAM is out of scope for this post.

There are several well-known and relatively simple algorithms for estimating the grade level and reading ease of text. Microsoft Office includes both the Flesch-Kincaid algorithm, which gives an estimated grade level, and the Flesch Reading Ease algorithm. Readability analysis in Word is off by default, so you’ll need to enable it. See http://blogs.office.com/b/microsoft-word/archive/2007/06/26/can-word-improve-your-writing.aspx and follow the easy instructions for doing this. Notice that the page shows an analysis of something written by Dr. Seuss, which has a grade level of zero and a reading ease of 100. For comparison purposes, I analyzed the United States Internal Revenue Service instructions for completing a Form 1040 income tax return. Notice that income tax instructions have a much lower ease of reading than Dr. Seuss, but somehow I think you already knew that.

Figure 1. United States income tax instructions analysis.

An analysis of the product information for acetaminophen (a.k.a. paracetamol, brand names Tylenol, Panadol, Acamol, Biogesic, Crocin) found at http://www.nlm.nih.gov/medlineplus/druginfo/meds/a681004.html shows an even lower reading ease than income tax instructions and a higher grade level.

Figure 2. Acetaminophen product information analysis.

Before you can write code to calculate reading difficulty, you need to pick an algorithm. The Flesch and Flesch-Kincaid algorithms require that you know the total syllables, total words, and total sentences in the body of text to be analyzed. The Simple Measure of Gobbledygook (SMOG) and the Gunning fog index also depend on syllables, because they count the words that have three or more of them. The Coleman-Liau index avoids syllables and works from counts of letters, words, and sentences. If you want to implement something simple using T-SQL, finding the number of syllables is too difficult. The Dale-Chall and Spache algorithms require that you use a list of words considered to be common so that you can find the percentage of complex words. Finding a copy of one of these word lists in a single column format is a bit of a challenge. I found the updated and expanded list of Dale-Chall words at http://lindacarlton.net/thoughts/2010/02/dalechall_list.php if you need to implement something possibly more accurate than the algorithm in the next paragraph.
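To make those inputs concrete, here is a minimal sketch in Python (not T-SQL; it only illustrates the arithmetic). The Flesch and Flesch-Kincaid constants are the commonly published ones. The function names, the tokenizing rule, and the tiny word list are hypothetical choices for illustration; a real Dale-Chall or Spache implementation would use the full published word list and its own rules for word forms.

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease from raw counts; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from the same three counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def percent_unfamiliar(text, common_words):
    """Percentage of words in text that are not on the common-word list.

    Assumption: a word is any run of letters or apostrophes, compared in
    lowercase. Real word-list formulas define words more carefully.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    unfamiliar = sum(1 for token in tokens if token not in common_words)
    return 100.0 * unfamiliar / len(tokens)


if __name__ == "__main__":
    # Hypothetical counts; getting the syllable count is the hard part.
    print(flesch_reading_ease(words=100, sentences=10, syllables=100))
    print(flesch_kincaid_grade(words=100, sentences=10, syllables=100))

    # Hypothetical four-word "common" list, far smaller than a real one.
    common = {"the", "cat", "sat", "on"}
    print(percent_unfamiliar("The cat sat on the ottoman", common))
```

Both Flesch functions take the counts as arguments on purpose: the formulas are trivial once you have the counts, and the syllable count is the part that is hard to produce in set-based code.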

The Automated Readability Index is sufficiently easy to code. The greatest difficulty you will likely encounter is in determining the number of sentences. Since computing readability isn’t an exact science, you could count the number of periods in a block of text to estimate the number of sentences. The accuracy could be improved by leaving the periods that form ellipses (…) out of that count, since an ellipsis typed as three periods would otherwise be counted as three sentences. As the linked document shows, the Automated Readability Index was developed for the United States Air Force in 1967. The document shows both a multiple regression version of the algorithm to estimate grade level as well as a simplified equation to compute the Automated Readability Index:

grade level = ( 0.50 * (number of words/number of sentences) ) + ( 4.71 * (number of characters/number of words) ) - 21.43

Automated Readability Index = (number of words/number of sentences) + ( 9 * (number of characters/number of words) )
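The two equations above can be sketched as follows. This is Python rather than T-SQL and is only meant to show the arithmetic under some stated assumptions: words are whitespace-separated tokens, characters are letters and digits only, and sentences are estimated by counting periods after discarding ellipses. All function names are hypothetical.

```python
import re


def count_sentences(text):
    """Estimate sentences by counting periods, ignoring ellipses.

    Assumption: a run of two or more periods, or the single ellipsis
    character, never ends a sentence. At least 1 is returned so the
    later divisions are safe.
    """
    without_ellipses = re.sub(r"\.{2,}|…", " ", text)
    return max(without_ellipses.count("."), 1)


def count_words(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def count_characters(text):
    """Count letters and digits only (punctuation and spaces excluded)."""
    return sum(1 for ch in text if ch.isalnum())


def ari_grade_level(text):
    """Multiple regression form: estimated grade level."""
    words = count_words(text)
    if words == 0:
        raise ValueError("text contains no words")
    sentences = count_sentences(text)
    characters = count_characters(text)
    return 0.50 * (words / sentences) + 4.71 * (characters / words) - 21.43


def ari_index(text):
    """Simplified form: the Automated Readability Index itself."""
    words = count_words(text)
    if words == 0:
        raise ValueError("text contains no words")
    sentences = count_sentences(text)
    characters = count_characters(text)
    return (words / sentences) + 9 * (characters / words)


if __name__ == "__main__":
    sample = "The cat sat on the mat. It was warm... and very soft."
    print(count_sentences(sample), count_words(sample), count_characters(sample))
    print(ari_grade_level(sample))
    print(ari_index(sample))
```

Very short, simple samples like this one can produce a negative grade level, which is a reminder that the formula was fitted to longer passages. The same three counts translate naturally to string functions in T-SQL, but that translation is not shown here.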

At this point, all that has been asked of me is to explain what it would take to analyze existing textual data for readability. I’ve presented algorithms that can be implemented. Do you remember how your advanced math textbooks would say the proof is obvious and is left as an exercise for the reader? I won’t provide the code for your stored procedure or function today. I leave the implementation details as an exercise for the reader.

This document has a Flesch Reading Ease of 44.7 and a Flesch-Kincaid Grade Level of 11.5. If reading something this complex gives you a headache, at least I provided a link to information about an analgesic!

Published Saturday, January 05, 2013 1:32 PM by John Paul Cook

Comments

 

Diane Davis said:

Loved the humor!!

December 3, 2013 1:31 PM

About John Paul Cook

John Paul Cook is a Technology Solutions Professional for Microsoft's data platform and works out of Microsoft's Houston office. Prior to joining Microsoft, he was a Microsoft SQL Server MVP. He is experienced in Microsoft SQL Server and Oracle database application design, development, and implementation. He has spoken at many conferences including Microsoft TechEd and the SQL PASS Summit. He has worked in the oil and gas, financial, manufacturing, and healthcare industries. John is also a Registered Nurse who graduated from Vanderbilt University with a Master of Science in Nursing Informatics and is an active member of the Sigma Theta Tau nursing honor society. He volunteers as a nurse at safety net clinics. He is a contributing author to SQL Server MVP Deep Dives and SQL Server MVP Deep Dives Volume 2. Opinions expressed in John's blog are strictly his own and do not represent Microsoft in any way.
