For Stanford’s Machine Learning class, I worked on a project to alleviate the impact of healthcare spending on the the “last mile” of medicine and predict a user’s illness simply based on a description of their symptoms. While handling multiple online data sources we found that a large part of the challenge was consolidating information across data sources in a consistent matter. We explore several techniques for automatic document matching across data bases of documents,with the goal of finding matching documents across Freebase, Mayo Clinic, andWikipedia. We define matching documents to be articles which describe the same disease, then we found matching documents between databases with 91% accuracy.
Using data from Freebase, Mayo Clinic, and Wikipedia, we trained a Naive Bayes, Logistic Regression, Random Trees, and many other ML models. We obtain nearly 90% top 5 accuracy and about 60% top 1 accuracy