Automatic Recognition of Text Difficulty from Consumer Health Information

Author:
Type: Article
Conference: Proceedings of the 19th IEEE International Symposium on Computer-Based Medical Systems, June 22-23, 2006, Salt Lake City, Utah, USA
Abstract: The Internet is one of the major sources of health information. However, some studies show that the health information presented on health web sites is difficult for many consumers to read. Readability formulas usually measure the difficulty of writing style rather than the difficulty of content. In order to recommend health information at an appropriate reading level to consumers, we investigate the feasibility of identifying the text difficulty of health information using machine learning methods. A Support Vector Machine is used to classify consumer health information into two classes: easy to read, and written at a reading level for the general public. Three feature sets (surface linguistic features, word difficulty features, and unigrams) and their combinations are compared in terms of classification accuracy. Unigram features alone reach an accuracy of 80.71%, and the combination of the three feature sets is the most effective, with an accuracy of 84.06%. Both are significantly better than surface linguistic features, word difficulty features, and their combination.
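As a rough illustration of the three feature families named in the abstract, the following Python sketch builds one combined feature dictionary per document. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the naive sentence splitter, the specific surface measures, and the tiny `EASY_WORDS` list are all hypothetical stand-ins, since this record does not describe the actual features or word difficulty resource used in the paper. The SVM training step is omitted.

```python
import re
from collections import Counter

# Hypothetical stand-in for a familiar-word list; the paper's actual
# word difficulty resource is not described in this record.
EASY_WORDS = {"the", "a", "is", "to", "of", "and", "your", "you", "heart", "doctor"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive split on '.', '!' and '?' (an assumption, not the paper's method)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def surface_features(text):
    """Surface linguistic features: average sentence and word length."""
    words = tokenize(text)
    sentences = split_sentences(text)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }


def word_difficulty_features(text):
    """Word difficulty feature: share of tokens outside the easy-word list."""
    words = tokenize(text)
    hard = [w for w in words if w not in EASY_WORDS]
    return {"hard_word_ratio": len(hard) / max(len(words), 1)}


def unigram_features(text):
    """Unigram features: raw count of each word token, keyed as 'uni=<word>'."""
    return {f"uni={w}": c for w, c in Counter(tokenize(text)).items()}


def combined_features(text):
    """Merge the three feature sets into a single feature dictionary."""
    features = {}
    features.update(surface_features(text))
    features.update(word_difficulty_features(text))
    features.update(unigram_features(text))
    return features
```

Dictionaries of this shape could then be vectorized and passed to any SVM implementation for the two-class decision; which kernel, weighting scheme, or toolkit the authors used is not stated in this record.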
Publication date:
Language: English
Affiliation: NRC Institute for Information Technology; National Research Council Canada
Peer reviewed: No
NRC number: 48548
NPARC number: 8913290
Record identifier: 9b4f73e6-91e7-4aa0-9bae-c0602e10c618
Record created: 2009-04-22
Record modified: 2016-05-09