
2018

Dedupe Google Drive with RClone

An issue with duplicates

I migrated from Amazon Cloud Drive to a paid Google Drive account. To facilitate the move, I used a paid service called MultCloud. Since Comcast caps my data, downloading 1.5TB of video and photography files and then re-uploading them to Google Drive would have been challenging to manage.

I ran into issues with MultCloud hitting rate limits. As a result, I had to work through their awkward user interface to relaunch those jobs, which still had failures. I was left not really trusting that all my files had transferred successfully.

What's worse, MultCloud's programmatic access seemed to be creating duplicates in the drive. Apparently Google Drive will let you have two files side by side with the same name; it doesn't behave like Windows in this regard, as each file is considered unique regardless of its name. Same with folders.

Duplicate Images

RClone

I ran across RClone a while ago and had passed over it, only to land back on its Google Drive documentation and realize it has specific functionality for this: dedupe. After working through some initial issues, my situation has improved, and Google Drive is usable once again. In fact, it's time for some house cleaning.

Successfully Running

I suggest you find the developer API section and create an API access key. If you don't do this and just use OAuth2, you are going to get the dreaded message Error 403: Rate Limit Exceeded and likely end up spending 30+ minutes trying to track down what to do about it.

403 Rate Limit Messages

You'll see activity start to show up in the developer console, where you can see how you are doing against your rate limits.

Developer Console
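Once you have your own client ID and secret from the developer console, you can drop them into the Google Drive remote in your rclone config (rclone config file will show you where it lives). A minimal sketch, with the remote name and values as placeholders:

[gdrive]
type = drive
client_id = your-client-id.apps.googleusercontent.com
client_secret = your-client-secret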

Start Simple and Work Up From There

To avoid big mistakes, and to confirm the behavior is what you expect, start small. The script at the bottom walks through what I did.
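Since the script itself isn't embedded above, here's a minimal sketch of the kind of commands I mean; the remote name and folder paths are placeholders for your own:

# preview against one small test folder first
rclone dedupe --dedupe-mode newest --dry-run gdrive:Photos/TestFolder
# happy with the plan? run it for real on that folder
rclone dedupe --dedupe-mode newest gdrive:Photos/TestFolder
# the default interactive mode prompts you for each set of duplicates
rclone dedupe gdrive:Photos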

Magic As It Works

Other Cool Uses for RClone

Merging

While I think the dedupe command in RClone is specific to Google Drive, you can leverage its logic for merging folders on other systems, as well as issue remote commands that run server side and don't require downloading locally before proceeding.

Server Side Operations

This means I basically could have saved the money I spent on MultCloud and instead used RClone to copy from Amazon Cloud Drive to Google Drive, all remotely with server-side execution and no local downloads. This has some great applications for data migration.

For an updated list of what they support for server-side operations, take a look at this page: Server Side Operations
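As a rough sketch, a remote-to-remote copy is a one-liner; the remote names here are placeholders, and whether it actually runs server side depends on what the two providers support (see the page above):

rclone copy acd:Archive gdrive:Archive -v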

AWS

This includes quite a few nifty S3 operations. Even though I'm more experienced with the AWSPowerShell module, this might offer some great alternatives for syncing to an S3 bucket.
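For example, a one-way sync of a local folder up to a bucket might look like this (the remote and bucket names are placeholders):

rclone sync C:\Backups s3remote:my-bucket/backups -v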

Mounting Remote Storage As Local

Buried in there was also mention of the ability to mount any of the storage systems as a local drive in Windows. See the RClone Mount documentation. This means you could mount an S3 bucket as a local drive with RClone. I'll try to post an update after I try it out. It's pretty promising.
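A sketch of what that might look like (the remote and bucket names are placeholders; on Windows the mount feature also needs WinFsp installed):

rclone mount s3remote:my-bucket X: --vfs-cache-mode writes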

NTFS Compression and SQL Server Do Not Play Well Together

I wanted to be proactive and move a database that was in the default path on C:\ to a secondary drive, as it was growing pretty heavily.

What I didn't realize was the adventure that would ensue.

Lesson 1

Don't move a SQL Server database to a volume that someone has set NTFS Compression on at the drive level.

Lesson 2

Copy the database next time, instead of moving it. That would have eased my anxious DBA mind, since I didn't have a backup. Before you judge me... it was a dev-oriented environment, not production... disclaimer finished.

The Nasty Errors and Warnings Ensue

First, you'll get an error message if you try to attach the database after it has been compressed. Since I'd never done this before, I didn't realize the mess I was getting into. SQL Server will tell you that you can't attach the database except as read-only, because it's a compressed file.

Ok... so just go to file explorer > properties > advanced > uncheck compress ... right?

Nope...

Changing File Attributes 'E:\DATA\FancyTacos.mdf': The requested operation could not be completed due to a file system limitation

I found that message about as helpful as the favorite .NET error message, object reference not set to an instance of an object, which is of course so easy to immediately fix.

The Fix

  • Pull up the volume properties and uncheck "Compress this drive", OR
  • If you really want drive compression, then make sure to uncompress just the folders containing the SQL Server files and apply (a command-line sketch follows this list).
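For the folder-level route, the compact utility can clear the compression attribute recursively. A rough sketch, with an example path:

# run from an elevated prompt; E:\DATA is an example path
Set-Location E:\DATA
compact.exe /u /s /i *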

Since I wasn't able to fix this by toggling the attribute on a file this large (it was 100GB+), I figured I'd keep it simple: copy the database back to the original drive, clear the compression attribute there, then copy it back to the drive I had removed compression from and see if that worked. While it sounded like a typical "IT Crowd" fix (have you tried turning it off and on again), I figured I'd give it a shot.

... It worked. Amazingly enough it just worked.

Here's a helpful script to get you on your way in case it takes a while. Use at your own risk, and please... always have backups! #DontBlameMeIfYouDidntBackThingsUp #CowsayChangedMyLife
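The original gist isn't shown here, but a rough sketch of the copy step might look like this; the database name and paths are examples, not my actual script:

# detach or take the database offline first, then copy with restartable, unbuffered I/O
robocopy 'E:\DATA' 'C:\SQLData' 'FancyTacos.mdf' 'FancyTacos_log.ldf' /J /ETA
if ($LASTEXITCODE -lt 8) { Write-Host 'Copy finished without failures' }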

and finally to remount the database after copying it back to your drive ...
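Again, the gist isn't embedded here, but the re-attach boils down to something like this; the instance, database, and file names are examples:

# re-attach the copied files (Invoke-Sqlcmd comes from the SqlServer/SQLPS module)
Invoke-Sqlcmd -ServerInstance 'localhost' -Query @"
CREATE DATABASE [FancyTacos]
ON (FILENAME = 'E:\DATA\FancyTacos.mdf'),
   (FILENAME = 'E:\DATA\FancyTacos_log.ldf')
FOR ATTACH;
"@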

Deleting a Directory That Has a Trailing Space Shouldn't Be This Hard

It shouldn't be this hard. This is a consummate #windowsmoment.

Removing Folder Fails

If you occasionally use something like Robocopy or another command-line tool, it's possible to create a directory with a trailing space. For instance:

robocopy "C:\test\Taco" "C:\Burritos\are\delicious "

The trailing space is actually used by Robocopy to create a directory with a trailing space in its name. This can look like a duplicate in Explorer, until you try to rename it and notice the trailing space. Attempts to rename, delete, or otherwise manipulate this directory fail, with Windows indicating that it can't find the directory because it no longer exists or might have been moved.

To resolve this, I found details on Stack Overflow in answer to the question "Can't Delete a Folder on Windows 7 With a Trailing Space".

Apparently it's an issue with NTFS path handling. To resolve it, you use cmd.exe to rd (remove directory) the folder, referencing it with an extended-length \\?\ path instead of the normal local path.

To remove the directory, you'd run:

rd "\\?\C:\Burritos\are\delicious "

To check whether PowerShell could resolve this, I did a quick test by running:

cd C:\temp
md ".\ Taco "
# Fails - No error
remove-item "\\?\C:\temp\taco "

# Fails with error: Remove-Item : Cannot find path '\\localhost\c$\temp\taco ' because it does not exist.
$verbosepreference = 'continue'; Remove-Item "\\localhost\c$\temp\taco "

# SUCCESS: Succeeds to remove it
GCI C:\Temp | Where-Object { $_.FullName -match 'taco'} | Remove-Item

So for me, I wanted to confirm that PowerShell was truly unable to resolve the issue without resorting to cmd.exe for this. Turns out it can, but you need to pass the matched object in, not expect it to match the filepath directly.

Now to go eat some tacos....

SQL .NET Requirements

SQL Server Install Requirements

SQL Server installation requirements indicate .NET 3.5, 4.0, or 4.6 depending on the version. This does not include SSMS. At this point you shouldn't install SSMS from any SQL ISO; just install SQL Server Management Studio directly.

See these posts for more details: Improvements with SSMS 2016 and Update SSMS With PS1.

From a quick review here's what you have regarding .NET requirements for the database engine.

SQL Version | .NET Required
>= SQL 2016 RC1 (SQL 2017 included) | .NET 4.6
SQL 2014 | .NET 3.5 (manual install required), .NET 4.0 (automatic)
SQL 2012 | .NET 3.5 (manual install required), .NET 4.0 (automatic)

Specifically noted in SQL 2012-2014 documentation is:

.NET 3.5 SP1 is a requirement for SQL Server 2014 when you select Database Engine, Reporting Services, Master Data Services, Data Quality Services, Replication, or SQL Server Management Studio, and it is no longer installed by SQL Server Setup.

When .NET 3.5 Install Just Won't Cooperate

If you need to install a SQL Server version that requires .NET 3.5, things can get a little tricky. It's a core Windows feature, so typically it's just a matter of going to Features and enabling it, both on Windows 10 and Windows Server.

However, if a tighter GPO impacts your Windows Update settings, then you'll probably need to get this whitelisted. If you are on a time crunch or unable to get the blocking of .NET 3.5 fixed, you can also resolve the situation with a manual offline install of .NET 3.5. Even the setup package Microsoft offers tries to go online and therefore typically fails in those situations.

Offline Approach

Surprisingly, I had to dig quite a bit to find a solution, as the .NET 3.5 installers I downloaded still attempted online connections, resulting in installation failure.

Turns out that to get an offline install correctly working you need a folder from the Windows install image (ISO) located at sources\sxs.

Since I can't provide that folder directly, here are the basic steps to take.

Get NetFx3 Cab

  1. Download ISO of Windows 10 (I'm guessing the version won't really matter as you just want the contents in one folder)
  2. Mount ISO
  3. Navigate to: MountedISO > sources and copy the sxs directory to your location. It should contain microsoft-windows-netfx3-ondemand-package.cab. This is the big difference, as the other methods provide an MSI, not the cab file.

Create Package

Next, create a reusable package:

  1. Create a directory: Install35Offline

  2. Copy the sxs directory into it

  3. Create two files (gist below to save you some time):

    1. Install35Offline.ps1
    2. Install35Offline.bat
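The gist itself isn't embedded here, but the heart of Install35Offline.ps1 is roughly the following; it assumes the sxs folder sits next to the script, and the .bat file is just a wrapper that launches the script elevated:

# Install35Offline.ps1 - rough sketch
# enable .NET 3.5 from the local sxs payload only, without hitting Windows Update
$source = Join-Path $PSScriptRoot 'sxs'
Enable-WindowsOptionalFeature -Online -FeatureName 'NetFx3' -All -LimitAccess -Source $source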

Hopefully this saves you some effort, as it took me a little while to figure out how to wrap it all up and make it easy to run.

Packaging this up in an internal Chocolatey package would be a helpful way to fix this for any developers needing the help of their local DBA wizard, and might even earn you some dev karma.

Git Cracking

Resources: GitKraken, Source Tree, Posh-Git, Cmder

Git Some Pain

Having come from a Team Foundation Server background, I found Git to be a bit confusing. The problem is primarily the big difference between a distributed and a non-distributed version control system. On top of that complexity, the terminology is not exactly intuitive; terms like pull mean different things depending on which step you are in.

Here's Your Sign

Here's my version of "Here's Your Sign" For Newbie Git Users That Are Coming from TFS Background

You must be a TFS'er using Git when...

  • You commit changes, and refresh the TFS Source Control Server website trying to see your changes... but nothing ... ever... changes.
  • You pull changes to get things locally, but then get confused about why you are submitting a pull request to give someone else changes?
  • You want to use a GUI
  • You use force options often because: 1) You are used to forcing Get Latest to fix esoteric issues 2) Force makes things work better in TFS (no comment)
  • You are googling ways to forcibly reset your repository to one version because you don't know what the heck is out of sync and are tired of merging your own mistakes.
  • You think branching is a big deal
  • You think it's magical that you can download a Git repo onto a phone, edit, commit, and all without a Visual Studio Installation taking up half your lifespan.

I claim I'm innocent of any of those transgressions. And yes, I use the command line through Cmder to pretend I have some geek cred, then I go back to my GUI. :-) I have more to learn before I become a Git command-line pro. I need pictures.

The Key Difference From TFS

The biggest difference to wrap my head around was that I was working with a DVCS (Distributed Version Control System). This is a whole different approach from TFS, though they have many overlaps. I won't go into the pros/cons in detail, but here are the basics I've pulled (pun intended) from this.

Pros

  • I can save my work constantly in local commits before I need to push remotely (almost like shelving each piece of work, except that when pushing to the server I send all my work along with its history).
  • File-based workspace. Local workspaces in TFS have the benefit of recognizing additions and other changes, but it's tedious to work with. Git makes this much cleaner.
  • Branching! Wow. This is the best. I honestly don't mess around with branching in TFS. It has more overhead from what I've seen, and it's not some lightweight process that's constantly used for experimentation. (Comment if you feel differently; I'm not a pro at TFS branching.) With Git, I finally realized that instead of sitting on in-progress work that might break something, I could branch, experiment, and either merge or discard, all very easily. This is probably my favorite thing, and I'll be using it a lot more (a quick command sketch follows this list).
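A rough sketch of that branch, experiment, merge-or-discard loop (the branch name is just an example):

git checkout -b experiment-tacos    # branch off and hack away
git commit -am "try something risky"
# happy with it? merge it back into master
git checkout master
git merge experiment-tacos
# not happy? throw the branch away instead
git branch -D experiment-tacos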

Cons

  • The wording.
  • Merging and branching seem a little more complex with a DVCS than with a non-distributed system like TFS, but that's just my high-level impression. YMMV.

GitKraken

GitKraken, a Git GUI to solve your learning woes.

Git GUI Goodness

I'm a PowerShell prompt addict and prefer the command line when possible. However, I think GitKraken helped make this process a bit easier for me. I was using posh-git and Cmder initially, then VS Code with GitLens. Beyond basic commit/pull, though, I've found myself relying on GitKraken a lot more, as it's just fast, intuitive, and easier to understand with my addled brain. I'd rather save my energy for figuring out things like Query Optimization Through Minification.

Timeline

To be honest, the timeline view and the navigation and staging of changes seemed pretty intuitive compared to what I'd seen in other tools. Overall, I found it easier to wrap my head around Git concepts with it, and I had less fear of merging changes from remote since I could easily review and accept changes through its built-in merge tool.

GitKraken

Overall Impression

My overall impression is positive. It's a nice solution for understanding Git and getting up and running faster than some other options, or than using Git via the command line alone. While that's a worthy goal, easily reviewing changes, amending commits, pulling and merging remote changes from multiple sources, and more are not things I'm sure a newbie could do anywhere near as well as with a little effort in GitKraken. So overall, it's a win. I've used it for this blog and am pretty darn happy with it. The Pro version is reasonably priced if you're in a work environment and need better profile handling and integration with VSTS and other services. For those just working with some GitHub open source repos and Jekyll blogs, there's a free community version, so it's a win!

A Free Alternative

Source Tree from Atlassian is a pretty solid product as well, and I've used it. Unfortunately, I've had stability issues with it lately, and it lacks the most important feature required of all good code tools... a dark theme :-) ... on Windows, at least as of now. No success getting dark theme traction except on Mac. -1 demerits for this omission! Overall it has promise, but it tends toward so many options that it can be daunting. I'd lean towards GitKraken's implementation as much cleaner, designed for simplicity and flexibility.

Disclaimer: I like to review developer software from time to time, and occasionally receive a copy to continue using. This does not impact my reviews whatsoever, as I only use the stuff I find helpful and worth sharing. Good software makes the world go round!

Migration To Jekyll

I've been in the process of migrating my site to its final home (as far as my inner geek can be satisfied staying with one platform)... Jekyll.

Jekyll

Jekyll is a static website generator that takes plain Markdown files and runs them through template files to produce the final HTML content, allowing a lot of flexibility in content generation. The result is a static website with beautifully generated typography, search, pagination, and other great features for a blogging engine. You also keep the benefit of writing in our beloved Markdown, which makes it easy to source control your blog.

This site is basically a GitHub repo at this point. Upon commit, Netlify, an amazing free resource for developers, automatically launches a remote build, minifies, pushes the content to their CDN, and publishes the changes to your site upon a successful build. Pretty amazing! They also offer more flexibility than GitHub Pages in that you can use other Ruby-based Jekyll plugins, while GitHub limits the plugins available, leaving fewer Jekyll features at your disposal.

Worth It?

This was a pretty exhaustive migration process, primarily because I worked on ensuring all links were correctly remapped, features like tag pages were in place, and all assets were migrated from Cloudinary and other locations. Overall it was a very time-consuming affair, but considering free hosting that will scale for any load versus $144 at Squarespace, I think it's a win. In addition, no MySQL databases to manage, no Apache web servers to maintain, no PHP editions to troubleshoot... well, that sold me.

Using Fuzzy String Matching To Fix Urls


I noticed a lot of broken URLs when migrating my site, so I grabbed a list of the old broken URLs and wanted to compare them against a list of current URLs to find the best match for each. For instance, with the Jekyll title generating the link, I had issues with a URL change like this:

I generated a sitemap csv by using ConvertFrom-Sitemap, originally written by Michael Hompus at TechCenter.

Original | New
https://www.sheldonhull.com/blog/syncovery-arq | https://www.sheldonhull.com/blog/syncovery-&-arq-syncing-&-backup

What I wanted was a way to do a fuzzy match on the URL to give me the best-guess match, even if a few characters were different... and I did not want to write this from scratch in the time I had.

I found a reference to a great library called Communary.PASM and installed it in PowerShell with: Install-Package Communary.PASM -Scope CurrentUser

The resulting adhoc script I created:
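The gist isn't embedded here, but the adhoc script boiled down to scoring every new URL against each old one and keeping the best hit. A rough sketch, assuming the module's Get-FuzzyMatchScore function with -Search/-String parameters (verify the exact names with Get-Command -Module Communary.PASM, I'm going from memory) and hypothetical CSV file names:

Import-Module Communary.PASM
$old = Import-Csv .\old-urls.csv    # column: Url
$new = Import-Csv .\new-urls.csv    # column: Url
foreach ($o in $old) {
    # score every candidate, keep the highest-scoring one
    $best = $new |
        Select-Object Url, @{ n = 'Score'; e = { Get-FuzzyMatchScore -Search $o.Url -String $_.Url } } |
        Sort-Object Score -Descending |
        Select-Object -First 1
    [pscustomobject]@{ Original = $o.Url; FuzzyMatched = $best.Url }
}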

The resulting matches were helpful in saving me a lot of time, finding partial matches when a few characters were off.

Original | FuzzyMatched
/blog/transaction-logging-recovery-101 | /blog/transaction-logging-&-recovery-(101)
/blog/transaction-logging-recovery-part-2 | /blog/transaction-logging-&-recovery-(part-2)
/blog/transaction-logging-&-recovery-(part-3) | /blog/transaction-logging-recovery-part-3

I then tested several other algorithms. I had come across references to the Levenshtein algorithm when reading about string matching on Stack Overflow, so I added that logic into my script and watched paint dry while it ran. It wasn't a good fit for my basic string matching. Learning more about string matching sounds interesting though, as it seems to be a common need in development, and I'm all for anything that lets me write less regex :-)

For my rough purposes the fuzzy match was the best fit, as most of the title was the same, typically just missing the end or with a slight variance in the delimiter.

I had other manual cleanup to do, but still, it was an interesting experiment. It gave me an appreciation for a consistent URL naming schema, as migration to a new engine can be made very painful by changes in post naming. After some more digging, I decided not to worry about older post URLs and mostly just focused on migrating any comments. I think the most interesting part was learning a little about the various string matching algorithms out there... It's got my inner data geek interested in learning more.