Data Mining And Topic Modeling – Provenance and Traceability Research Group

About

In the first phase of this project, we used data mining and topic modeling techniques to discover relationships between the files of a software project. We analyzed the version history of the software using association mining, a data mining technique. The result of this analysis was sets of files that are frequently modified together, which indicates that they are related to each other. However, some projects may not have enough version history to produce good results. We therefore also analyzed the project's source code files using topic modeling, specifically Latent Dirichlet Allocation (LDA). This technique yields a topic distribution for each document. We used the topic distribution percentages to determine how files are related to each other. The results of the two techniques were combined to produce the final recommendations. We have published a conference paper (eKNOW) and a journal article on this work. Both are available on the wiki.
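To make the first phase concrete, here is a minimal Java sketch of the three steps described above: finding files that change together, finding files that share a topic, and combining the two results. It is an illustration under stated assumptions, not the project's implementation. It counts co-changed pairs directly rather than running FP-Growth, it takes the per-file topic distributions as given rather than running LDA, and it combines the results with a plain set union. The class name, method names, and threshold values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only; all names and thresholds are assumptions.
final class FileRelationSketch {

    /** Canonical key for an unordered pair of file names. */
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + " | " + b : b + " | " + a;
    }

    /** Pairs of files that appear together in at least minSupport commits. */
    static Set<String> coChangePairs(List<Set<String>> commits, int minSupport) {
        Map<String, Integer> support = new TreeMap<>();
        for (Set<String> commit : commits) {
            List<String> files = new ArrayList<>(commit);
            for (int i = 0; i < files.size(); i++) {
                for (int j = i + 1; j < files.size(); j++) {
                    support.merge(pairKey(files.get(i), files.get(j)), 1, Integer::sum);
                }
            }
        }
        Set<String> related = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : support.entrySet()) {
            if (entry.getValue() >= minSupport) {
                related.add(entry.getKey());
            }
        }
        return related;
    }

    /**
     * Pairs of files that both give at least minShare of their topic
     * distribution to the same topic (an assumed relatedness rule).
     */
    static Set<String> topicPairs(Map<String, double[]> topicsByFile, double minShare) {
        List<String> files = new ArrayList<>(topicsByFile.keySet());
        Set<String> related = new TreeSet<>();
        for (int i = 0; i < files.size(); i++) {
            for (int j = i + 1; j < files.size(); j++) {
                double[] p = topicsByFile.get(files.get(i));
                double[] q = topicsByFile.get(files.get(j));
                for (int t = 0; t < Math.min(p.length, q.length); t++) {
                    if (p[t] >= minShare && q[t] >= minShare) {
                        related.add(pairKey(files.get(i), files.get(j)));
                        break;
                    }
                }
            }
        }
        return related;
    }

    /** Combines both result sets; a plain union is assumed here. */
    static Set<String> combine(Set<String> mined, Set<String> byTopic) {
        Set<String> all = new TreeSet<>(mined);
        all.addAll(byTopic);
        return all;
    }

    public static void main(String[] args) {
        // Toy version history: each set is the files touched by one commit.
        List<Set<String>> commits = List.of(
                Set.of("Parser.java", "Lexer.java"),
                Set.of("Parser.java", "Lexer.java", "Ast.java"),
                Set.of("Ast.java", "Printer.java"));
        // Toy topic distributions over two topics, as LDA might produce.
        Map<String, double[]> topics = Map.of(
                "Parser.java", new double[] {0.8, 0.2},
                "Printer.java", new double[] {0.7, 0.3},
                "Ast.java", new double[] {0.1, 0.9});
        Set<String> result = combine(coChangePairs(commits, 2), topicPairs(topics, 0.6));
        System.out.println(result);
        // prints [Lexer.java | Parser.java, Parser.java | Printer.java]
    }
}
```

In this toy run the Parser/Printer pair never changes together in the version history, so only the topic comparison finds it. That is the kind of relationship the second technique is meant to add when history is sparse.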

In the second phase of this project, we are refining our techniques by applying a genetic algorithm. The techniques described above require input parameters in addition to the software project itself. These parameters play a major role in determining the accuracy of the results. In the earlier phase, the parameters were determined manually in advance, and the same values were used for all projects. We are now working on finding the optimal input parameters for each individual project using a genetic algorithm. The algorithm uses crossover, mutation, and a fitness function to search for the best set of inputs. Better inputs lead to more accurate results.
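The parameter search can be sketched as a small genetic algorithm in Java. This is a generic sketch under assumptions, not the group's implementation. Each individual is an array of integer parameter values within given bounds, parents are chosen by two-way tournament, crossover is uniform, and the best individual found so far is always carried over. The two parameters in the example (a minimum support and a topic count) and the toy fitness function are hypothetical. A real fitness function would presumably run the mining and topic modeling steps with the candidate parameters and score the resulting recommendations.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Illustrative sketch only; parameter names and the fitness function are assumptions.
final class ParameterSearchSketch {

    /** Draws a value uniformly from the inclusive range {min, max}. */
    static int randomGene(int[] range, Random rng) {
        return range[0] + rng.nextInt(range[1] - range[0] + 1);
    }

    /** Two-way tournament: picks two individuals at random, returns the fitter index. */
    static int tournament(double[] scores, Random rng) {
        int a = rng.nextInt(scores.length);
        int b = rng.nextInt(scores.length);
        return scores[a] >= scores[b] ? a : b;
    }

    /** Returns the best parameter vector found; higher fitness is better. */
    static int[] evolve(int[][] bounds, ToDoubleFunction<int[]> fitness,
                        int populationSize, int generations,
                        double mutationRate, long seed) {
        Random rng = new Random(seed);
        int[][] population = new int[populationSize][bounds.length];
        for (int[] individual : population) {
            for (int k = 0; k < bounds.length; k++) {
                individual[k] = randomGene(bounds[k], rng);
            }
        }
        int[] best = population[0].clone();
        double bestScore = fitness.applyAsDouble(best);

        for (int g = 0; g <= generations; g++) {
            // Evaluate the current population and remember the best individual.
            double[] scores = new double[populationSize];
            for (int i = 0; i < populationSize; i++) {
                scores[i] = fitness.applyAsDouble(population[i]);
                if (scores[i] > bestScore) {
                    bestScore = scores[i];
                    best = population[i].clone();
                }
            }
            if (g == generations) {
                break; // the last generation is evaluated but not bred
            }
            int[][] next = new int[populationSize][];
            next[0] = best.clone(); // elitism: keep the best found so far
            for (int i = 1; i < populationSize; i++) {
                int[] mother = population[tournament(scores, rng)];
                int[] father = population[tournament(scores, rng)];
                int[] child = new int[bounds.length];
                for (int k = 0; k < bounds.length; k++) {
                    // Uniform crossover: each gene comes from either parent.
                    child[k] = rng.nextBoolean() ? mother[k] : father[k];
                    // Mutation: occasionally replace the gene with a fresh value.
                    if (rng.nextDouble() < mutationRate) {
                        child[k] = randomGene(bounds[k], rng);
                    }
                }
                next[i] = child;
            }
            population = next;
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical parameters: {minSupport, topicCount} with inclusive bounds.
        int[][] bounds = {{1, 20}, {5, 100}};
        // Toy stand-in for a real accuracy measure; it peaks at (4, 30).
        ToDoubleFunction<int[]> toyFitness =
                p -> -(Math.pow(p[0] - 4, 2) + Math.pow(p[1] - 30, 2));
        int[] best = evolve(bounds, toyFitness, 30, 50, 0.1, 42L);
        System.out.println(Arrays.toString(best));
    }
}
```

The fixed seed makes a run repeatable, which helps when comparing parameter sets across projects. Because each fitness evaluation in a real setting would mean a full mining run, the population size and generation count trade search quality against running time.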

For students who are interested in this project, we use the following technologies: Java and a MySQL database. The data mining algorithm we use is FP-Growth.

Publication

Namita Dave, Karen Potts, Vu Dinh, Hazeline U. Asuncion. Combining Association Mining with Topic Modeling to Discover More File Relationships. International Journal on Advances in Software, December 2014.

Namita Dave, Delmar B. Davis, Karen Potts, Hazeline U. Asuncion. Uncovering File Relationships using Association Mining and Topic Modeling. In the Sixth International Conference on Information, Process, and Knowledge Management (eKNOW), March 2014.

Links

This work is based upon work supported by the US National Science Foundation under Grant No. CCF 1218266. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.