Clean Security Issues from Git History

Git is great for project development because every change by every user is preserved, so divergent paths or new functions can be merged together. There is little fear of ruining a project, because it can always be reverted to any point in its history, and tools like cherry-pick, stash, and patch even let you choose which individual lines to retrieve from different points in history or from your local workspace. But this also means that if a password, username, or server name is accidentally committed, it stays in the visible history.

Best Practice

  • Don’t put passwords or file paths in your code
    • Only have these values in your local workspace, and do not commit them.
    • It’s easier to avoid accidental commits if the information is in a separate file like .key or .config or .yourext.
      • This file should be outside your commit history
      • Allows automated processes to run without publicly storing the sensitive info
      • It can be helpful to commit an example file for others, with paths like fpath = \\server\yourdir
    • Make these values command line entries or user inputs
  • If you are carefully checking your commits and find this mistake before pushing, immediately use git reset --soft HEAD~1
    • This removes the last commit from the branch but keeps the file changes, which stay staged
    • Do NOT use revert. It undoes the file changes with a new commit, but the original commit stays in the history
    • Remove the sensitive values, then run git status before committing again to make sure they are not added back
  • Don’t push commits to the remote!!
    • Once pushed, the sensitive information is on a server outside your machine, and vulnerable…
    • Now you have 2 copies that need to be cleaned. If someone else pulls or fetches the commit, you have 3 copies to clean…
    • You can still fix your copy and the remote copy with git reset --soft HEAD~1 followed by git push --force origin
    • Time to talk to your co-workers about fixing those other copies, or be a jerk and ruin their workspace by deleting the changes they haven’t committed yet with git reset --hard @{upstream}
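
For the "command line entries or user inputs" suggestion above, here is a minimal Python sketch. The function name parse_connection and the --server and --user options are made up for illustration. The idea is that the server and account arrive as arguments and the password is typed at a prompt, so none of them are ever written into a tracked file.

```python
import argparse
import getpass


def parse_connection(argv=None, prompt=getpass.getpass):
    """Collect connection details at run time instead of hard-coding them.

    argv and prompt are parameters only so the function can be exercised
    without a terminal; both names are illustrative.
    """
    parser = argparse.ArgumentParser(description="Connect without stored secrets")
    parser.add_argument("--server", required=True, help="server name or share")
    parser.add_argument("--user", required=True, help="account name")
    args = parser.parse_args(argv)
    # The password is prompted for rather than passed as an argument,
    # so it does not end up in the code or in the shell history.
    password = prompt(f"Password for {args.user}: ")
    return args.server, args.user, password
```

A script would call parse_connection() with no arguments and be started with something like python yourscript.py --server yourserver --user yourname, where all three names are placeholders.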

Ooops, No One Noticed Until Now

Well, 2 years ago something bad was committed. Now, 50 commits later, you notice…

Now you need to rewrite the offending commits, and every commit that followed from them.

Warning

Changing commit histories is always a dicey idea. Danger, Will Robinson!! You might be about to explode your work and could lose a lot of valuable code. The steps below rewrite commits, so they get new commit hashes, and tags are rewritten to match. You’ll need a fresh clone on every machine you and all of your co-workers use.

git filter-branch was made for this purpose, but even its doc page now warns against using it and recommends git filter-repo instead.

git filter-repo

This tool seems to work great, but it’s not actually integrated into the git program. It ran in 2 or 3 seconds for me. Go to the git-filter-repo repository (https://github.com/newren/git-filter-repo) and clone or download it.

Windows Install

This whole thing runs off of a single file with no file extension. Copy that file and paste it into the git path. If you’re not sure where that is:

On my machine (and I guess most PCs) Python is on the path as python, not python3. You can check your path, or just type python3 into a terminal and see if it works. If it doesn’t, change the first line of the file git-filter-repo to #!/usr/bin/env python.

I was also able to do a pip install git-filter-repo. If you have pip installed it’s easy to do, but I think this only helps if you want to call some of its functions from Python.

Cleaning Your Repo

There are a lot of good methods explained in the docs for removing files, directories, and other things: https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#EXAMPLES

You should read the docs carefully. Do this stuff on a *fresh clone* instead of your working copy. I recommend going into the parent directory of your working repo and making a new directory called ./clean_backup or ./fresh_clone.

I found this the most helpful:

DocQuote:

If you want to modify file contents, you can do so based on a list of expressions in a file, one per line. For
example, with a file named expressions.txt containing

```
p455w0rd
foo==>bar
glob:*666*==>
regex:\bdriver\b==>pilot
literal:MM/DD/YYYY==>YYYY-MM-DD
regex:([0-9]{2})/([0-9]{2})/([0-9]{4})==>\3-\1-\2
```

then running

```
git filter-repo --replace-text expressions.txt
```

will go through and replace p455w0rd with ***REMOVED***, foo with bar, any line containing 666 with a blank
line, the word driver with pilot (but not if it has letters before or after; e.g. drivers will be unmodified),
replace the exact text MM/DD/YYYY with YYYY-MM-DD and replace date strings of the form MM/DD/YYYY with ones of
the form YYYY-MM-DD. In the expressions file, there are a few things to note:

Every line has a replacement, given by whatever is on the right of ==>. If ==> does not appear on the line, the
default replacement is ***REMOVED***.

Lines can start with literal:, glob:, or regex: to specify whether to do literal string matches, globs
(see https://docs.python.org/3/library/fnmatch.html), or regular expressions
(see https://docs.python.org/3/library/re.html#regular-expression-syntax). If none of these are specified,
literal: is assumed.

If multiple matches are found, all are replaced.

globs and regexes are applied to the entire file, but without any special flags turned on. Some folks may be
interested in adding (?m) to the regex to turn on MULTILINE mode, so that ^ and $ match the beginning and ends
of lines rather than the beginning and end of file. See https://docs.python.org/3/library/re.html for details.
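
The regex behavior the quote describes can be checked in plain Python before touching a repository. This is only a sketch using Python's re module on made-up sample strings. It does not call git filter-repo, and the host/password/port text is a hypothetical config file.

```python
import re

sample = "driver drivers 12/31/2020"

# \b is a word boundary, so "drivers" is left alone.
word_result = re.sub(r"\bdriver\b", "pilot", sample)

# Capture groups reorder MM/DD/YYYY into YYYY-MM-DD.
date_result = re.sub(r"([0-9]{2})/([0-9]{2})/([0-9]{4})", r"\3-\1-\2", sample)

# Hypothetical file contents with the secret on the second line.
config = "host=example\npassword=hunter2\nport=22"
line_pattern = r"^password=.*$"

# Without flags, ^ only anchors at the start of the whole text,
# so a secret on the second line is not matched.
no_flag = re.sub(line_pattern, "password=***REMOVED***", config)

# With (?m), ^ and $ anchor at the start and end of every line.
multiline = re.sub("(?m)" + line_pattern, "password=***REMOVED***", config)

print(word_result)
print(date_result)
print(no_flag == config)
print(multiline)
```

The third print shows whether the pattern without (?m) left the text unchanged, which is the case the quoted note about MULTILINE mode warns about.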

Correcting the Target Lines (regex)

I checked out a commit that I knew had problems and then tested my regex in Notepad++ using the Find in Files option with the Regular Expression box checked. This got me pretty close, but it was still missing some lines I wanted to catch.

I copied and pasted one of my problem files into the program below, and this is where I really got the regex kinks out.

https://pythex.org/
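
The same kind of find-in-files check can also be scripted, which makes it easy to rerun on a checked-out commit after each regex tweak. This is a rough sketch and not part of git-filter-repo. The server name myserver and the function find_matches are placeholders, and the pattern is just one assumed way to catch a share path written with either slash in any capitalization.

```python
import re
from pathlib import Path

# Hypothetical pattern: //myserver/..., \\myserver\..., or a mix of both,
# matched case-insensitively via (?i). "myserver" is a placeholder name.
PATTERN = re.compile(r"(?i)[\\/]{2}myserver[\\/][^\s\"']*")


def find_matches(root, pattern=PATTERN):
    """Return (path, line number, matched text) for every hit under root."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        # Skip directories and git's own bookkeeping files.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        # errors="ignore" lets the scan pass over binary files.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            for match in pattern.finditer(line):
                hits.append((path, number, match.group(0)))
    return hits
```

Calling find_matches(".") from the top of a checked-out commit lists every line the pattern would catch, so missed lines show up by comparison with what you expected to find.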

Of course, mixed lowercase and uppercase adds some trouble, but the slipperiest issue was path separators: the Unix /, versus a single \ (which is sometimes a valid path), versus \\. I found a couple of solutions:

For capitalization problems:

Good regex primers:

Applying fixes to remote

The easiest way to do this is to make a new remote. Yes, the rest of your team might grumble, but if you already ran git filter-repo, then you are already forcing them to start over with a clean repo.

I recommend a few safety checks:

  1. Go into your old working directory: delete the remote links and rename the folder so that you don’t accidentally use it as a workspace. Example: repo_local_archive
  2. Make a copy of the cleaned repo before connecting it to a remote. Example: repo_security_clean

If you are insistent on keeping your current remote, proceed slowly. You no longer have any commits that match any part of the history on your remote. Any links to issues or commits will now break. Each branch needs to be updated, and there is the risk that some branches have had their commits removed.

  1. Make a fresh clone of your remote before you erase it. Clearly label the directory so you know not to use it as a working directory (repo_remote_backup) and delete its remote links.
  2. Your filter-repo clone is your new working directory. Add your remote connection.
  3. Do a force push of all branches using the sequence below. --prune ensures dead branches or tags are removed. --mirror pushes all branches and tags (--all only pushes branches).

The following worked for me:

From a closer look at the docs, you may be able to push all in one shot by replacing the quoted path with --mirror.

DocQuote:

--all
    Push all branches (i.e. refs under refs/heads/); cannot be used with other <refspec>.
--prune
    Remove remote branches that don’t have a local counterpart. For example a remote branch
    tmp will be removed if a local branch with the same name doesn’t exist any more. This also
    respects refspecs, e.g. git push --prune remote refs/heads/*:refs/tmp/* would make sure that
    remote refs/tmp/foo will be removed if refs/heads/foo doesn’t exist.
--mirror
    Instead of naming each ref to push, specifies that all refs under refs/ (which includes but
    is not limited to refs/heads/, refs/remotes/, and refs/tags/) be mirrored to the remote
    repository. Newly created local refs will be pushed to the remote end, locally updated refs will
    be force updated on the remote end, and deleted refs will be removed from the remote end. This
    is the default if the configuration option remote.<remote>.mirror is set.

There’s a helpful description here: https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html

Keeping Paths and Credentials Out of Git History

Server names, passwords, and usernames are the 3 components someone would need to access our system and do damage to it. They are also the 3 things that allow us to access our system and use it. So our code always needs these things, but our collaborations can never include them. This is a common source of security breaches in git, but also in Dropbox, Box, Google Drive, and OneDrive. These 3 things are kind of like underwear: they should always be with you (on your machine), should not be accessible by others, and each person should have their own.

Filepaths, while less directly threatening, reveal the inner workings of our computers and should be kept private too.

So, how do we do that?

Basically, there are 3 strategies:

  1. Keep these in a secret file (e.g. .ini, .config, .env, .json, .yaml) in the same directory as your code, but put the file in .gitignore and do not share it.
  2. Store them in an existing secrets service that you log in to, such as the JupyterLab credential store, AWS Secrets Manager, Azure Key Vault, or the macOS Keychain.
  3. Type them in privately each time the code runs, with something like Python’s getpass.
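
As a rough illustration of the first strategy, the sketch below reads the three values from an INI file with Python's configparser. The file name secrets.ini, the [database] section, and the key names are all assumptions made for this example. The file itself would be listed in .gitignore, with only an example copy committed.

```python
import configparser
from pathlib import Path


def load_credentials(path="secrets.ini", section="database"):
    """Read server, username, and password from a git-ignored INI file.

    The file name, section, and key names are hypothetical.
    """
    config_path = Path(path)
    if not config_path.exists():
        raise FileNotFoundError(
            f"{config_path} not found; copy the example file and fill in your own values"
        )
    # interpolation=None keeps a literal % in a password from being
    # treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path, encoding="utf-8")
    values = parser[section]
    return values["server"], values["username"], values["password"]
```

The code that gets committed then only mentions the file name, and each collaborator keeps their own copy of the file with their own values.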

Here are some great docs on the topic.

https://towardsdatascience.com/keeping-credentials-safe-in-jupyter-notebooks-fbd215a8e311

https://vickiboykis.com/2020/02/25/securely-storing-configuration-credentials-in-a-jupyter-notebook/