Git OSS-um

Solution

Solution Overview

Our solution is a web application that assists newcomers to open source software. We have implemented three features to support them: displaying simple, easy-to-understand visualizations and statistics that do not overload users with information, gauging the activity of repositories, and showing users the acceptance rate of submitted contributions. The technologies we have used are detailed below.

Technologies

To accomplish our mission of helping newcomers find GitHub projects that best suit their needs, Git OSS-um must answer the following questions with our selected technologies: Python, Django, Celery, RabbitMQ, MongoDB, Plotly, and Digital Ocean.

What information is important to mine?

We mine statistics for each repository as a whole, as well as statistics on the pull requests submitted within it. The data points for each category are as follows (a sketch of how one mined repository could be assembled from these fields appears after the lists):

Repository Data

  • Repo Name
  • Description
  • Created Date
  • Last Updated Date
  • Clone URL
  • Homepage
  • Stargazers Count
  • Language
  • Has a Wiki or Not
  • License Key
  • License Name
  • Number of Open Issues
  • Number of Forks
  • Number of Watchers

Pull Request (PR) Data

  • PR Number
  • State
  • Title
  • User Who Submitted
  • PR Body
  • Date Created
  • Date Closed
  • Date Merged
  • Labels
  • Author Association
  • Merged or Not
  • Merged By
  • Comments
  • Review Comments
  • Commits
  • Additions
  • Deletions
  • Changed Files
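
To make these fields concrete, the sketch below shows roughly how one mined repository could be assembled as a Python dictionary before storage. The field names and values here are illustrative placeholders, not our exact schema.

```python
# Illustrative sketch only: one mined repository as a Python dict.
# Field names and values are placeholders, not the exact Git OSS-um schema.
repo_doc = {
    "name": "octocat/Hello-World",
    "description": "My first repository on GitHub!",
    "created_at": "2011-01-26T19:01:12Z",
    "updated_at": "2019-02-10T14:47:55Z",
    "clone_url": "https://github.com/octocat/Hello-World.git",
    "homepage": "https://github.com",
    "stargazers_count": 1500,
    "language": "Python",
    "has_wiki": True,
    "license": {"key": "mit", "name": "MIT License"},
    "open_issues": 42,
    "forks": 300,
    "watchers": 90,
    "pull_requests": [
        {
            "number": 1,
            "state": "closed",
            "title": "Fix typo in README",
            "user": "newcomer123",
            "body": "Corrects a small typo.",
            "created_at": "2019-01-01T10:00:00Z",
            "closed_at": "2019-01-02T09:30:00Z",
            "merged_at": "2019-01-02T09:30:00Z",
            "labels": ["documentation"],
            "author_association": "CONTRIBUTOR",
            "merged": True,
            "merged_by": "maintainer42",
            "comments": 2,
            "review_comments": 1,
            "commits": 1,
            "additions": 3,
            "deletions": 3,
            "changed_files": 1,
        }
    ],
}
```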

How is information mined from GitHub? Python, RabbitMQ, Celery

We use the Python programming language as the backbone of our product because it integrates well with the other frameworks in our stack, including Plotly and Django. Python mines data directly from GitHub through the GitHub API, with the PyGithub library providing a convenient wrapper around that API. We use Celery, with RabbitMQ as the message broker, to mine data asynchronously and to update the repositories already in our database.
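
A minimal sketch of how such a mining task could look is shown below. It assumes a Celery application named `gitossum`, a RabbitMQ broker on localhost, and a GitHub token in a `GITHUB_TOKEN` environment variable; the task name and the subset of fields shown are illustrative, not our exact code.

```python
# Minimal sketch, not our exact code: a Celery task that mines a repository
# with PyGithub. Assumes RabbitMQ running locally and a GITHUB_TOKEN env var.
import os

from celery import Celery
from github import Github

app = Celery("gitossum", broker="amqp://localhost")


@app.task
def mine_repository(full_name):
    """Fetch a subset of the repository and pull request fields listed above."""
    gh = Github(os.environ["GITHUB_TOKEN"])
    repo = gh.get_repo(full_name)  # e.g. "owner/project"
    return {
        "name": repo.full_name,
        "description": repo.description,
        "created_at": repo.created_at.isoformat(),
        "updated_at": repo.updated_at.isoformat(),
        "stargazers_count": repo.stargazers_count,
        "language": repo.language,
        "has_wiki": repo.has_wiki,
        "open_issues": repo.open_issues_count,
        "forks": repo.forks_count,
        # Pull request data is gathered the same way via repo.get_pulls().
        "pull_requests": [
            {"number": pr.number, "state": pr.state, "title": pr.title}
            for pr in repo.get_pulls(state="all")[:50]
        ],
    }
```

A Django view can then call `mine_repository.delay("owner/project")` so the web request returns immediately while RabbitMQ queues the work for a Celery worker.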

How is information being stored? MongoDB

Because our application mines a great deal of data, it is essential that the information we gather be stored in an accessible way. For this reason we use MongoDB, a document-oriented database. Our MongoDB database is connected directly to the Django front-end, which reads it to display mined repositories, and to our mining scripts, which write new repository data into it.
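
As an illustration of this link, the sketch below uses the `pymongo` driver to upsert a mined repository document. The database and collection names ("gitossum", "repositories") are assumptions for the example, not necessarily the names we use.

```python
# Illustrative sketch: storing a mined repository document with pymongo.
# Database/collection names ("gitossum", "repositories") are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["gitossum"]


def save_repository(repo_doc):
    """Insert the document, or replace it if the repository was mined before."""
    db.repositories.replace_one(
        {"name": repo_doc["name"]},  # match on the repository's full name
        repo_doc,
        upsert=True,
    )
```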

How is the information being analyzed? Plotly

Knowing what information is essential to collect and display, we use the Plotly framework to visualize and represent our data. Plotly presents the mined data in a simple form so that users can study it and draw their own conclusions. We have also implemented a table of raw data that users can look over.
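
For example, a chart of merged versus unmerged pull requests, rendered as an HTML snippet that a Django template can embed, could be built roughly as follows. This is a sketch under those assumptions, not the exact figures used in Git OSS-um.

```python
# Sketch: a simple Plotly pie chart of merged vs. unmerged pull requests,
# rendered as an HTML <div> that a Django template can embed.
import plotly.graph_objs as go
from plotly.offline import plot


def acceptance_chart(pull_requests):
    merged = sum(1 for pr in pull_requests if pr.get("merged"))
    unmerged = len(pull_requests) - merged
    fig = go.Figure(
        data=[go.Pie(labels=["Merged", "Not merged"], values=[merged, unmerged])]
    )
    # output_type="div" returns an HTML snippet instead of writing a file.
    return plot(fig, output_type="div", include_plotlyjs=False)
```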

How is the information being presented to the user? Plotly Interactive Graphics

As described above, we use the Plotly framework to visualize and represent our data. Our graphics are interactive: users can hover over specific parts of a graph to view the underlying data, and can zoom and pan to examine an area in more detail.
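
One way to hand such a figure to the user is to pass the generated `<div>` from a Django view into a template. The sketch below reuses the hypothetical `db` and `acceptance_chart` names from the earlier MongoDB and Plotly sketches; the view and template names are likewise assumptions.

```python
# Hypothetical Django view: look up a mined repository and render its chart.
# `db` and `acceptance_chart` come from the MongoDB and Plotly sketches above.
from django.shortcuts import render


def repository_detail(request, full_name):
    repo_doc = db.repositories.find_one({"name": full_name})
    context = {
        "repo": repo_doc,
        "plot_div": acceptance_chart(repo_doc["pull_requests"]),
    }
    return render(request, "repository_detail.html", context)
```

In the template, the chart would be embedded with `{{ plot_div|safe }}` so Django does not escape the Plotly markup.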

How is this application accessible? Digital Ocean

Our server space is a Linux-based Digital Ocean server. A single instance hosts both the back end and the front end of our product. The Django framework handles the front end and much of the back end, alongside our Python mining scripts. When users visit our product, they see the mined repositories from our database, along with HTML templates and forms for signing up, logging in, and submitting a mining request.
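
The mining-request form mentioned above can be expressed as a standard Django form. The sketch below is hypothetical and shows only the general shape of such a form, not our exact fields or validation.

```python
# Hypothetical sketch of the mining-request form as a Django form class.
from django import forms


class MiningRequestForm(forms.Form):
    # Full repository name in "owner/project" form, e.g. "octocat/Hello-World".
    repo_name = forms.CharField(max_length=200, label="Repository (owner/project)")
    email = forms.EmailField(required=False, label="Notify me when mining finishes")
```

On a valid POST, the corresponding view would queue the Celery task from the mining sketch, for example `mine_repository.delay(form.cleaned_data["repo_name"])`, and then show a confirmation page.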