My wife is doing a web course that has her running her websites in a virtual machine, accessed via Git Bash, but she had been having problems running Grunt. This is actually a pretty complicated setup for a complete beginner, and when something goes wrong the error messages are completely opaque.

I’m not a regular Linux user either, and Git Bash and Node.js are both things I have only lightly touched on (well, this is a .NET blog), so it took me some time to figure out what was wrong. In the end it came down to three things:

  1. sudo (superuser do) needs to prefix every install command, or you get a lot of folder/file creation permission problems when the virtual machine is running on Windows
  2. “--no-bin-links” needs to be appended to any npm install command when the virtual machine is running on Windows. Without it, npm generated a lot of symlink errors.
  3. The installed version of Node was 4.0.0 instead of the expected 0.10.x, and no longer matched up with the installed version of npm.

While doing all of this, besides figuring out how to navigate my way around a largely unfamiliar system, I found the following commands essential for the Node reinstall.

Strip out the Node and npm folders (run from your root directory; from Stack Overflow, although I can’t find the original post):

sudo rm -rf /usr/local/{lib/node{,/.npm,_modules},bin,share/man}/{npm*,node*,man1/node*}

Pull down the latest Node package list from NodeSource using curl:

curl -sL https://deb.nodesource.com/setup | sudo bash -

And install Node from the updated list:

sudo apt-get install -y nodejs

Once Node is installed, install the Grunt CLI globally (run from the project directory):

sudo npm install -g grunt-cli --no-bin-links


Grunt is now up and running, and changes are being picked up and published automatically! The world is no longer ending.


Over the next few weeks I’m going to put together a series of posts on git, continuous integration, continuous deployment, package management, and publishing into the cloud.

To do this, I am going to start with a small project. The goals of the project will be as follows:

  • Create a dependency library project on GitHub
  • Publish the dependency library automatically from the master branch to a NuGet package feed using AppVeyor
  • Create a service endpoint that is dependent on the previous package
  • Publish the service automatically from the master branch to an Azure server via AppVeyor
  • Demonstrate incrementing the versions of the package and service, and the publishing process
  • Provide usage statistics using Application Insights

I’m hoping this will provide some interesting content covering end-to-end application lifecycle management. (And it should also keep me busy!)


I’ve been thinking about performance a lot recently - it’s a recurring problem in any system that lasts more than a few years and has a constantly growing data set and user base. I’ve seen a variety of different approaches to this problem, but only two really stand out. So how do we as developers go about identifying the areas that need optimisation, and then solving the root cause of the problem?

Let’s start with the following scenario:

  • You have an application that is experiencing poor performance.
  • You measure the performance of the regions of the application that get the most complaints.
  • You find the pages or screens that are causing the problem, and identify the service calls they are making.
  • You measure these calls and find that they take 45 seconds, run thousands of queries, and generate almost a million objects.
  • You are tasked with improving the performance of these areas.


From here, I have seen two types of approach:

  • Re-structure the higher-level system so that it does not need to load 1,000,000 objects, and instead only loads 50-100 records at a time, on demand (see the sketch after this list).
  • Create a system to load the objects using many concurrent queries, compress the data to transmit it to the client, and perform the operations there.
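
As a very rough sketch of the first approach, assuming an Entity Framework-style IQueryable data source (the Order and OrderSummary types below are purely illustrative, not from any real system), paging the data on demand might look like this:

using System;
using System.Collections.Generic;
using System.Linq;

// Paging sketch: pull back only the 50-100 records the screen actually needs,
// rather than materialising the full data set. Order and OrderSummary are
// illustrative types only.
public class Order
{
    public int Id { get; set; }
    public decimal Total { get; set; }
    public DateTime CreatedOn { get; set; }
}

public class OrderSummary
{
    public int Id { get; set; }
    public decimal Total { get; set; }
}

public static class OrderPaging
{
    public static IList<OrderSummary> GetPage(IQueryable<Order> orders, int pageNumber, int pageSize = 50)
    {
        return orders
            .OrderBy(o => o.CreatedOn)               // a stable sort order is needed for paging
            .Skip(pageNumber * pageSize)             // an EF-style provider translates this to SQL
            .Take(pageSize)
            .Select(o => new OrderSummary { Id = o.Id, Total = o.Total })
            .ToList();                               // only this page is materialised and returned
    }
}

The point is not the LINQ itself, but how little data needs to move compared with materialising a million objects up front.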


It seems obvious that loading 1,000,000 objects with thousands of queries in the first place is cause for concern. There is also the huge network overhead of transmitting data from the DB to the server, then from the server to the client, combined with the cost of serialising our objects and then compressing and decompressing the stream at each end… not to mention the impact of garbage collecting the objects once we’re done with them. That is a massive amount of work, especially considering that in some scenarios the client data is invalidated quite frequently.

Using multi-threading on this problem - which was caused by loading too much data unnecessarily in the first place - doesn’t really solve the underlying issue, as we still have all the other peripheral inefficiencies. This brings us back to the focus of this post: optimisation is not simply an increase in the speed at which we load a large volume of data, but an improvement in the efficiency of the system as a whole. Macro optimisations at the system or workflow level will have a much greater impact on performance than micro optimisations. (This isn’t to say that micro optimisations are not important, but consider: is it better to improve a process called a million times by a tiny fraction, or to change the system so you only have to call the process a few times?) This is even more critical when you consider the network traffic cost for a business.

Which leads me to…

Optimisation anti-pattern: Accelerated inefficiency

The process of spending significant time and effort finding ways to make an inefficient task faster by applying more computer resources, rather than finding ways to make the task as a whole more efficient - or entirely redundant.


Wow… it’s been a while. Work has been busy, so I haven’t written anything and need to get my momentum back.

I have, however, been working on some really interesting back-end systems, and felt inspired to write something on caching.

What’s the problem?

Large systems run a lot of complex, expensive calculations and reports over big sets of data. As a system’s data and user base grow, the cost of unnecessary operations starts to cause scalability problems. Consider the following system handling a request:


Every client connecting to this service goes through the same steps: connect to the service, get the data, generate the response. If there are only a few users and the query only takes a second, there is no problem. But over time the duration of the GetData() call increases with the volume of data, and we find ourselves in the situation below:


What is caching?

Caching is the process of storing data at a more local level to avoid recalculating or re-fetching it unnecessarily, and to provide better system responsiveness and availability. This becomes more relevant the greater the number of concurrent users you have in your system.

Read-through Cache

The most basic type of cache is a Read-through Cache, where data is stored en route from the data source and re-used for subsequent requests over a period of time. If we apply this to the previous scenario, we can achieve a 75% reduction in the number of calls to the database:



This benefit is amplified as the number of users grows: the longer the data can be re-used, the lower the hit to the database.
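
As a minimal sketch, assuming System.Runtime.Caching’s MemoryCache (the ReportData type, the report service, and the five-minute expiry are illustrative assumptions rather than anything from a real system), a read-through wrapper might look something like this:

using System;
using System.Runtime.Caching;

// Read-through sketch: check the cache first, fall back to the data source on a
// miss, and keep the result for later requests. ReportData and
// LoadReportFromDatabase are illustrative placeholders.
public class CachedReportService
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public ReportData GetReport(string reportId)
    {
        var cached = Cache.Get(reportId) as ReportData;
        if (cached != null)
        {
            return cached;                                  // cache hit - no database call
        }

        var report = LoadReportFromDatabase(reportId);      // cache miss - the expensive GetData()-style call
        Cache.Set(reportId, report, DateTimeOffset.UtcNow.AddMinutes(5));
        return report;
    }

    private ReportData LoadReportFromDatabase(string reportId)
    {
        // ... the original expensive query would live here ...
        throw new NotImplementedException();
    }
}

public class ReportData { /* illustrative payload */ }

Every request that arrives inside the expiry window is served from memory and never touches the database.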

Write-through Cache

A cache does not have to be read-only - you can also update the cache when you update the data source. This is called a Write-through Cache, where data is updated in memory en route to the data source.

The benefit of this is that it gives us data persistence combined with the performance boost of keeping the data in memory. If the system crashes, the data can be restored to the cache when it starts up, and if the database is unavailable, the data is still retained in the cache.
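
Sticking with the same assumptions (and re-using the illustrative ReportData type from the sketch above), a write-through update persists the change and refreshes the cached copy in one operation; SaveReportToDatabase is a hypothetical placeholder:

using System;
using System.Runtime.Caching;

// Write-through sketch: every update goes to the data source and the cache
// together, so later reads are served from memory without risking data loss
// on restart.
public class WriteThroughReportService
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public void SaveReport(string reportId, ReportData report)
    {
        SaveReportToDatabase(reportId, report);                            // persist first
        Cache.Set(reportId, report, DateTimeOffset.UtcNow.AddMinutes(5));  // then refresh the cached copy
    }

    private void SaveReportToDatabase(string reportId, ReportData report)
    {
        // ... write to the underlying repository here ...
        throw new NotImplementedException();
    }
}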

Consider the below:


The database is the primary bottleneck in most systems, as it is a remote server running an IO-bound data repository. To get the maximum performance from your system, you need to minimise the load on the database. This approach goes a long way toward that goal, as the database is only read from and updated once across four complete transactions.

Consider this as it scales to larger systems and you can immediately see the savings.


By maximising the number of scenarios like this in your system, you can scale much more linearly and not be as affected by higher load. If the goal is to minimise the load on the database while at the same time maintaining data integrity, this is a reliable approach.

It does, however, come at the cost of additional complexity. Subtle bugs caused by stale data - data that is no longer up to date with what is in the data source - can produce odd and unexpected problems whose root cause is difficult to track down. Symptoms can include unknown IDs appearing in drop-down lists, or being unable to act on a record because it has already been deleted. It’s important to keep these in mind when developing and diagnosing faults in systems that use caching.
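
One common way to shrink the window for this kind of bug, sketched under the same assumptions as the earlier examples, is to evict the cached entry whenever the underlying record is deleted, so the next read falls back to the data source; DeleteReportFromDatabase is a hypothetical placeholder:

// Eviction sketch (an extra method on the write-through service above): remove the
// cached entry when the record is deleted, so callers never see a record that no
// longer exists in the data source.
public void DeleteReport(string reportId)
{
    DeleteReportFromDatabase(reportId);    // remove from the data source
    Cache.Remove(reportId);                // evict the now-stale cached entry
}

private void DeleteReportFromDatabase(string reportId)
{
    // ... delete from the underlying repository here ...
    throw new NotImplementedException();
}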

This covers the most basic caching scenarios. Caching is by far the cheapest way to achieve significant performance gains for minimal effort, and is something all developers should be aware of.