Research API calls vs cloning for LOC statistics

Task description

Figure out the best way to retrieve the LOC statistics that we want: is this by cloning or through API calls? We want to retrieve the following information (a hypothetical record shape is sketched after the list).

  • Who authored a line.
  • The type of a line: code, comment, doc, empty (etc.?)
    • For this we need to know the file type and line contents
  • The file type of a line
  • When the line was added and potentially removed
  • The type of modification: addition, removal, change
    • Change might be hard and can be ignored for this issue, but it would be nice to think about it already
  • The branch of a line
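
To make the target concrete, here is a hypothetical sketch of the per-line record these points imply; all names and types are illustrative, not an agreed design.

import java.time.Instant;

enum LineType { CODE, COMMENT, DOC, EMPTY }

enum ChangeType { ADDITION, REMOVAL, CHANGE }

record LineStat(
        String author,      // who authored the line
        LineType type,      // derived from the file type and the line contents
        String fileType,    // e.g. "java", derived from the file name
        Instant addedAt,    // commit time at which the line was added
        Instant removedAt,  // null while the line still exists
        ChangeType change,  // the type of the last modification
        String branch) {    // the branch the line was observed on
}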

Technical considerations

There are a few technical details to consider.

Current state and history

It is possible, both via the API and with git blame, to know who authored a line. This, however, does not take the history of the line into account: if someone overwrote the line, for example by re-indenting it, the line becomes 'theirs'. You will probably need to traverse the git history and get the diff for every commit.
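
A minimal sketch of such a history walk using JGit is shown below; the repository path is a placeholder, error handling is omitted, and merge commits are skipped here (they are the subject of the next section).

import java.io.File;
import java.util.List;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.diff.DiffEntry;
import org.eclipse.jgit.diff.DiffFormatter;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.util.io.DisabledOutputStream;

public class HistoryWalk {
    public static void main(String[] args) throws Exception {
        // "/path/to/clone" is a placeholder for a locally cloned repository
        try (Git git = Git.open(new File("/path/to/clone"))) {
            Repository repo = git.getRepository();
            for (RevCommit commit : git.log().call()) {
                // skip merge commits and the root commit for now
                if (commit.getParentCount() != 1) continue;
                try (DiffFormatter df = new DiffFormatter(DisabledOutputStream.INSTANCE)) {
                    df.setRepository(repo);
                    List<DiffEntry> diffs = df.scan(commit.getParent(0), commit);
                    for (DiffEntry diff : diffs) {
                        // getChangeType() is ADD, DELETE, MODIFY, RENAME or COPY;
                        // df.toFileHeader(diff).toEditList() gives the per-line edit ranges
                        System.out.println(commit.getName() + " "
                                + diff.getChangeType() + " " + diff.getNewPath());
                    }
                }
            }
        }
    }
}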

Merges

Merges produce merge commits. Counting these merge commits might double-count lines, as a merge commit contains all changes aggregated from a single merge. There are two approaches here: either exclude merge commits, or exclude all commits that are contained in a merge commit.

We cannot assume all changes will come from merges. It is possible that branches are unprotected and students push to them directly. It is also possible that the course staff pushed a commit directly to a protected branch.
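
Both approaches can be sketched with JGit, continuing from the example above (imports omitted; the branch name is illustrative). Excluding merge commits maps to RevFilter.NO_MERGES, while excluding the commits contained in a merge maps to a first-parent walk:

// Approach 1: skip the merge commits themselves
Iterable<RevCommit> noMerges = git.log().setRevFilter(RevFilter.NO_MERGES).call();

// Approach 2: follow only the first parent, so every merge is seen as a single
// aggregated change and the commits contained in it are never visited
try (RevWalk walk = new RevWalk(git.getRepository())) {
    walk.setFirstParent(true); // must be set before the walk starts
    walk.markStart(walk.parseCommit(git.getRepository().resolve("refs/heads/main")));
    for (RevCommit commit : walk) {
        // process commit
    }
}

Note that the first-parent walk still visits commits pushed directly to the branch, which covers the unprotected-branch and course-staff cases above.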

Rebasing and force pushes

It is realistic that someone force-pushed to a branch, because it is not protected or because only they are working on that branch. Additionally, the setting that merges MRs by rebasing may be enabled. Both of these actions result in the history being rewritten. This means that if you calculated LOC statistics based on a previous state, you cannot simply start from that state and process new commits; you potentially need to process the entire history again. An assumption you can make is that the hash of a commit remains the same if its contents are the same (so you could, for example, go back to the latest commit whose hash was already processed and start from that point).
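
One way to implement that assumption with JGit (imports omitted; lastProcessedHash and the branch name are hypothetical) is to check whether the previously processed commit is still reachable from the branch tip, and only fall back to a full scan when it is not:

// lastProcessedHash is a hypothetical value persisted after the previous run
try (RevWalk walk = new RevWalk(repo)) {
    RevCommit tip = walk.parseCommit(repo.resolve("refs/heads/development"));
    boolean stillReachable;
    try {
        RevCommit old = walk.parseCommit(ObjectId.fromString(lastProcessedHash));
        stillReachable = walk.isMergedInto(old, tip);
    } catch (MissingObjectException e) {
        stillReachable = false; // the commit no longer exists after the rewrite
    }
    if (stillReachable) {
        // incremental update: only process commits between old and tip
    } else {
        // rebase or force push detected: reprocess the entire history
    }
}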

Analysis of a cloned repository

It is possible to perform the analysis by using git commands, e.g. git diff. An alternative approach is to analyse the files in the .git folder directly, which might be faster as it does not have to interact with the git process. It is also very possible that this is not faster, as git is highly optimised. Use whichever is technically feasible and, if both are, whichever is faster.
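
If the git-command route is chosen, a minimal sketch of driving the CLI from Java could look as follows; the repository path is a placeholder, and git diff --numstat is used because its tab-separated output is easy to parse.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class GitDiffNumstat {
    public static void main(String[] args) throws Exception {
        // "/path/to/clone" is a placeholder; --numstat prints "<added>\t<removed>\t<path>"
        Process process = new ProcessBuilder(
                "git", "-C", "/path/to/clone", "diff", "--numstat", "HEAD~1", "HEAD")
                .redirectErrorStream(true)
                .start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        process.waitFor();
    }
}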

Experimental setup

To get the most accurate measurements, you can perform the experiments in Java code in GitBull itself. Create an endpoint /run-experiments that calls a service method in which you execute the experiments. First execute each experiment a few times (~3) as a warm-up and discard the results, then execute it multiple times (~10) and average the results.
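
Assuming GitBull is a Spring application (if not, adjust to the actual web framework), the endpoint could be a thin controller delegating to a hypothetical ExperimentService:

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ExperimentController {

    private final ExperimentService experimentService; // hypothetical service

    public ExperimentController(ExperimentService experimentService) {
        this.experimentService = experimentService;
    }

    // POST /run-experiments kicks off the full experiment suite
    @PostMapping("/run-experiments")
    public String runExperiments() {
        experimentService.runAll();
        return "experiments finished";
    }
}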

Pseudocode
// run twice: once for the local clone and once for the API calls
for each experiment
    repeat 3 times              // warm-up runs, results discarded
        run experiment
    results = []
    repeat 10 times             // measured runs
        run experiment
        add time taken to results
    write results
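
A direct Java translation of this pseudocode might look as follows; Experiment, experiments and writeResults are hypothetical, and System.nanoTime() is used for the timing.

for (Experiment experiment : experiments) {
    for (int i = 0; i < 3; i++) {
        experiment.run(); // warm-up runs, results discarded
    }
    List<Long> results = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
        long start = System.nanoTime();
        experiment.run();
        results.add(System.nanoTime() - start); // elapsed time in nanoseconds
    }
    writeResults(experiment, results);
}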

Deliverables

Push a branch that contains the Java code used. On that branch, also include an Excel file containing the following data.

  • On an example repository (e.g. GitBull), the time it takes to calculate LOC statistics for a single branch.
    • e.g. main
  • The time it takes provided you have already calculated LOC statistics for a similar branch.
    • e.g. development, which should be ahead of main by some commits, and behind by some merge commits
  • The time it takes on a newer version of that same branch.
    • e.g. calculate for development minus 5 commits, and then measure for development at its current head
  • The time it takes on a newer version of that same branch provided the git history was changed (e.g. force push).
    • e.g. calculate for a branch, then rebase -i HEAD~5 and add some lines, and then measure for that same branch
  • All of these measurements should be done both using API calls and using a locally cloned repository.
  • For a local repository, also calculate how much space it would take if we were to clone ~1000 repositories (feel free to just take the space on disk * 1000)
  • For API calls, also determine a formula for the number of API calls necessary for one branch, in terms of the number of new commits, files, and potentially other variables (an illustrative first guess is sketched after this list).
  • Include the times of all runs except the warm-up runs of each experiment, as well as the averages.
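
For illustration only, since the real formula is exactly what this deliverable should establish: with GitLab's REST API, listing the new commits of a branch takes roughly ceil(c / 100) paginated calls to /projects/:id/repository/commits, and fetching the diff of each commit takes one call to /projects/:id/repository/commits/:sha/diff, which suggests a first guess of

    calls(c) ≈ ceil(c / 100) + c

with additional calls wherever a single diff is itself paginated or file contents have to be fetched separately.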

Resources

  • API
  • Git commands
  • .git folder
