When putting together new deployments, the Octopus Deploy interface does a great job. Unfortunately, when you have hundreds of deployments and thousands of variables, it can become difficult to find and navigate the variables without search and filter functions.
As a stop-gap until Octopus hopefully adds this feature, I’ve created OctoSearch. This allows you to download and cache the variable sets with their variable collections locally as Text, Json or Html.
OctoSearch itself is a package available from NuGet. It has been compiled down to a native executable so a .NET Core installation is not needed.
The first step is to log in to the Octopus server so we can create, download and cache an API token. This will be used for subsequent calls to Octopus.
Now that we’re authenticated we can download and cache the variable sets and their variable collections. This cache will be used for our searches to reduce the load on the Octopus server. Variables marked as sensitive won’t have their values downloaded or cached; their variable names will be searchable but not their values.
With the variables cached locally you can run fast searches and regenerate them into either Json or Html documents. To run a basic command line search you can use the search verb. It takes a regex so you can pass in basic text or more advanced text searches when you need to.
To output the search results into a text file you can do:
To display all the variables in an HTML report we omit the regex so that it defaults to a greedy regex \w.. The HTML report has a client-side search facility to filter variables for easier exploration.
This will give you a search UI that looks like:
If you would prefer it in Json:
Once we have it in Json we can load and analyse it within PowerShell. For example, to get all the variables marked as IsSensitive we could do:
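As a rough illustration of the same filter outside PowerShell, here is a minimal Python sketch. The JSON layout (a list of variable sets, each with a `Name` and a `Variables` list whose items carry `Name` and `IsSensitive`) is an assumption made for the example and may not match what OctoSearch actually writes, so adjust the keys to your own export.

```python
import json


def sensitive_variables(document):
    """Return (variable set name, variable name) pairs flagged as sensitive.

    Assumes a hypothetical layout: a list of sets, each with a "Variables" list.
    """
    found = []
    for variable_set in document:
        for variable in variable_set.get("Variables", []):
            if variable.get("IsSensitive"):
                found.append((variable_set.get("Name"), variable.get("Name")))
    return found


# Hypothetical sample data, only here to make the sketch runnable.
sample = json.loads("""
[
  {"Name": "Web", "Variables": [
    {"Name": "ConnectionString", "Value": null, "IsSensitive": true},
    {"Name": "LogLevel", "Value": "Info", "IsSensitive": false}
  ]},
  {"Name": "Worker", "Variables": [
    {"Name": "ApiKey", "Value": null, "IsSensitive": true}
  ]}
]
""")

for set_name, variable_name in sensitive_variables(sample):
    print(set_name, variable_name)
```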
At my current workplace, some of our systems are approaching 1 billion requests per day. At these volumes, sub-optimal configuration between systems can cause significant issues and subtle performance degradation. To understand some of the issues we are facing I’m going back to basics. You can find the code for this post at https://github.com/naeemkhedarun/TestHttpClient.
There are two DNS level scenarios that I want to investigate:
cloudapp.net DNS name which points to the Azure Load Balancer distributing traffic over the nodes.
The transient client eventually behaves as expected despite taking 133 seconds to respect the change. The ServicePointManager.DnsRefreshTimeout defaults to 120 seconds. This still leaves 13 seconds unaccounted for which I suspect is the final socket connection timeout.
A test isolating the connection to the non-responsive endpoint yields:
I wasn’t able to find any configuration for this timeout within .NET, but I did manage to trace the framework source to an enumeration, WSAETIMEDOUT. The timeout is controlled by the OS, as documented here.
TCP/IP adjusts the frequency of retransmissions over time. The delay between the first and second retransmission is three seconds. This delay doubles after each attempt. After the final attempt, TCP/IP waits for an interval equal to double the last delay, and then it closes the connection request.
You can find the default values for your OS (in my case Windows Server 2016) by running:
So the result should be
(1 * 3000) + (2 * 3000) = 12000ms which explains the extra time. Now the result is understood, let’s re-run the test after dropping the DNS refresh timeout to 10 seconds.
So with a transient HttpClient, a working way to stay up to date with traffic manager configuration is to tune the DnsRefreshTimeout property to a good value for your application.
Using a singleton client will reuse the connection for many requests to reduce the overhead with starting new TCP connections. In this setup we still want the connection to be recreated occasionally so we get the latest DNS configuration.
Cancelled after 180000
With a singleton HttpClient the connection is kept alive by default. This can be undesirable in configuration changes or scale out scenarios where you want your clients to connect to and use the new resources. Let’s try the
Cancelled after 180000
Since the connection is open and kept open, we need to find a way to close it. There is another setting which controls the length of time a connection is held open for called ServicePointManager.ConnectionLeaseTimeout.
Unfortunately, having this setting alone isn’t enough based on our previous transient experiments; the DNS is still cached. Let’s combine the two settings.
So now, despite using a singleton pattern within the code, our connections are being recreated and re-resolved up to every 20 seconds (both timeouts combined).
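To see why the two settings only help in combination, here is a small simulation in Python. It is a deliberately simplified model, not how .NET implements `ServicePointManager`: the class names, the fake clock and the one-request-per-second loop are all assumptions made for the sketch. A connection is reused until its lease expires, and only then asks a resolver that caches its answer for the refresh timeout.

```python
class CachedResolver:
    """Caches a DNS answer for refresh_timeout seconds (models DnsRefreshTimeout)."""

    def __init__(self, lookup, refresh_timeout):
        self.lookup = lookup
        self.refresh_timeout = refresh_timeout
        self._address = None
        self._resolved_at = None

    def resolve(self, now):
        expired = self._resolved_at is None or now - self._resolved_at >= self.refresh_timeout
        if expired:
            self._address = self.lookup(now)
            self._resolved_at = now
        return self._address


class LeasedConnection:
    """Reuses one connection for lease_timeout seconds (models ConnectionLeaseTimeout)."""

    def __init__(self, resolver, lease_timeout):
        self.resolver = resolver
        self.lease_timeout = lease_timeout
        self._address = None
        self._opened_at = None

    def send(self, now):
        expired = self._opened_at is None or now - self._opened_at >= self.lease_timeout
        if expired:
            # Reconnecting is the only moment the resolver is consulted.
            self._address = self.resolver.resolve(now)
            self._opened_at = now
        return self._address


def seconds_until_switch(dns_refresh, lease, change_at, horizon=120):
    """Send one request per simulated second; return the delay before the new endpoint is used."""
    lookup = lambda now: "new" if now >= change_at else "old"
    connection = LeasedConnection(CachedResolver(lookup, dns_refresh), lease)
    for now in range(horizon):
        if connection.send(now) == "new":
            return now - change_at
    return None  # never switched within the horizon


print(seconds_until_switch(dns_refresh=10, lease=10, change_at=1))
print(seconds_until_switch(dns_refresh=10, lease=10_000, change_at=1))
print(seconds_until_switch(dns_refresh=10_000, lease=10, change_at=1))
```

In this model a long lease alone never reconnects, and a short lease with a long DNS cache reconnects to the stale address, which matches the behaviour described above. When both are short, the worst-case delay is bounded by the sum of the two timeouts.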
We have an application setup which might be familiar to you: a cloud service in a classic virtual network (v1) which communicates with a database in an ARM virtual network (v2). Ideally we would like both of these services in a single network, but we are restricted from doing so by the deployment models. We had a discussion which covered performance, security and ideal topologies; however, this post will focus solely on performance.
Is there a difference in latency and bandwidth when they are both hosted in the same region?
To reflect the setup we have for our application, two VMs were provisioned in North Europe.
I first wanted to test the latency and number of hops between the VMs. ICMP is not available for this test as we are hitting a public IP, however we can use TCP by using nmap.
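If nmap is not available, the latency half of this test can be approximated by timing a TCP handshake directly. The following Python sketch is an alternative to, not a reproduction of, the nmap run used here; it does not count hops, and the host and port you pass in are your own (none are assumed from the post).

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=3.0):
    """Return the milliseconds taken to complete one TCP handshake."""
    started = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return (time.perf_counter() - started) * 1000.0


def median_connect_ms(host, port, samples=5):
    """Take several handshake timings and return the median to smooth out noise."""
    timings = sorted(tcp_connect_ms(host, port) for _ in range(samples))
    return timings[len(timings) // 2]
```

Running `median_connect_ms` once against the public address and once against the peered address gives a like-for-like comparison, since both measurements include the same client-side overhead.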
We can see that there are 8 hops over the public IP and, as we expect, only a single hop over the peered network. Both routes are still extremely fast with negligible ping times. This confirms my colleague's suspicions; despite connecting to a public address, the traffic probably never leaves the datacenter's perimeter network.
To measure the bandwidth available between the VMs I’m using iperf3 which is cross platform. The test is run from the windows machine as a client and flows to the iperf server hosted on the linux box.
Surprisingly, both achieve the desired bandwidth (1Gbps) for the selected VM sizes.
I was still curious whether the performance profile was the same when upgrading both VMs to support 10Gbps networking. For this test both machines were upgraded to the DS14v2 VM size. To maximise the bandwidth I used iperf's -P switch to run concurrent workers. The buffer size was also increased to see the effect it has on the bandwidth.
|Public IP (32MB)||3230|
As expected, with the default values the peered network performed better, although the difference was marginal. More surprisingly, the public network had a high throughput when the buffer size was increased, and despite running the test multiple times I am unable to explain why.
For our workload and use case, I can say the performance difference between the two approaches is irrelevant. If you are evaluating whether you might gain network performance by switching to peered networking then I hope these numbers can help guide you. I would recommend running a similar test if you are choosing different VM sizes or workload.
The new web-based fabric explorer has a much nicer interface than the old desktop application. However, we’ve lost the ability to pin it to the taskbar for quick shortcuts (win+n), and having it as a Chrome tab is less convenient than its own window.
Thankfully Chrome can help with that. Open the fabric explorer:
Create the desktop app:
Then you can pin it to the taskbar as you would normally. You’ll get a window with all the extra Chrome removed.
Don’t forget to take advantage of the default Windows taskbar keyboard shortcuts. I have it pinned as the fourth taskbar item, so it's quick to switch using
I am not familiar with the default Vim editor that comes with Git, which makes interactive rebases difficult. It took me a while to realise you can configure this. Thanks to F Boucheros this is quite easy!
git config --global core.editor "'C:\Program Files (x86)\Microsoft VS Code\code.exe' -w"
And now when you run your git rebase -i the todo log will open in VS Code.
After you’ve got your service fabric application live, you might see performance issues which you didn’t pick up in testing or simulated load tests. This could be for a number of reasons.
Reliable actors do not yet have interception interfaces to add in this kind of detailed telemetry, but with careful code it's possible to do this with a dynamic proxy. I chose to use LightInject for this but most frameworks would do the same job. I use statsd and graphite as my telemetry platform and I’ve had good experiences with this NuGet package
We need to intercept objects on both sides of the network boundary to cover these scenarios.
We can trace the former by using Service Fabric's dependency injection support to initialise the actors with a proxy in between. First we override the fabric's initialisation to use our DI container, which has dynamic proxy support.
Next we tell our DI container to resolve these types with a proxy that includes our telemetry interceptor.
This will catch the timings for any calls to actors made by the fabric system. Now we need to get the timings for all the calls we make, both actor to actor and client to actor.
Above we’ve created a factory class which should be used by clients and actors to create the proxied ActorProxies. We cache the generated proxy types in a thread safe dictionary as they are expensive to create.
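The caching idea is independent of the proxy library. As a language-neutral illustration, here is a minimal Python sketch of a thread-safe cache that builds each proxy type once and reuses it afterwards. `ProxyTypeCache` and the `build_proxy_type` callback are hypothetical names for this example, not part of LightInject or Service Fabric.

```python
import threading


class ProxyTypeCache:
    """Builds an expensive proxy type once per interface and reuses it."""

    def __init__(self, build_proxy_type):
        self._build = build_proxy_type  # hypothetical, expensive factory
        self._types = {}
        self._lock = threading.Lock()

    def get(self, interface):
        # The lock makes check-then-build atomic, so two threads never
        # generate the same proxy type twice.
        with self._lock:
            proxy_type = self._types.get(interface)
            if proxy_type is None:
                proxy_type = self._build(interface)
                self._types[interface] = proxy_type
            return proxy_type


class ICounterActor:
    """Stand-in for an actor interface."""


def build_proxy_type(interface):
    # Stand-in for real proxy generation: derive a new type from the interface.
    return type(interface.__name__ + "Proxy", (interface,), {})


cache = ProxyTypeCache(build_proxy_type)
print(cache.get(ICounterActor) is cache.get(ICounterActor))
```

A single lock is the simplest correct choice; a concurrent dictionary with a lazy factory trades a little complexity for less contention when many different interfaces are requested at once.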
Lastly we need the interceptor itself. We need to be sympathetic towards:
Wait on the task.
We can use a task continuation to handle the writing of telemetry, together with a closure to capture the timer. If there is a return value we should return it, and if for whatever reason that value is not a Task then we won’t try to add the continuation.
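The same shape can be sketched in Python with asyncio, purely as an analogy to the C# interceptor rather than a port of it. `TimingProxy` and the `record` callback are hypothetical names. The start time is captured in a closure; if the call returns an awaitable, the timing is written when it completes, and otherwise it is written straight away and the value returned untouched.

```python
import asyncio
import inspect
import time


class TimingProxy:
    """Wraps a target object and reports how long each method call takes."""

    def __init__(self, target, record):
        self._target = target
        self._record = record  # hypothetical sink: record(method_name, seconds)

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def intercepted(*args, **kwargs):
            started = time.perf_counter()  # captured by the closures below
            result = attr(*args, **kwargs)

            if not inspect.isawaitable(result):
                # Not a task: nothing to continue from, so time the call itself.
                self._record(name, time.perf_counter() - started)
                return result

            async def timed():
                try:
                    return await result  # never block on the task
                finally:
                    self._record(name, time.perf_counter() - started)

            return timed()

        return intercepted


class CounterActor:
    async def get_count(self):
        await asyncio.sleep(0)
        return 42

    def name(self):
        return "counter"


timings = []
proxy = TimingProxy(CounterActor(), lambda method, seconds: timings.append((method, seconds)))
count = asyncio.run(proxy.get_count())
```

Writing the metric in a `finally` block means failed calls are timed too, which is usually what you want when hunting slow paths.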
If you have your metrics library configured to push to a graphite backend you can use the following query to graph it:
I currently use LightInject for dependency injection primarily for its good performance and well documented features. I needed to proxy our service fabric actors to trace call timings and I wanted to see if their interception package was competitive with other open source offerings.
We’ll take a look at:
|Library||Average Time (us)|
I was quite surprised to see such a difference between the frameworks. My guess is that both NProxy and Castle cache their proxy types internally, while LightInject expects you to handle the caching yourself. Something good to bear in mind!
After caching the proxy type things are a little more competitive:
|Library||Average Time (ns)|
I still think the code can be more optimal in all cases, so I reduced everything as much as possible to a single call to activate the proxy type. I’ve included timings for Activator.CreateInstance and the standard constructor against the non-proxy type as a baseline.
|Library||Average Time (ns)|
Things are much closer now! The difference between Castle and LightInject is negligible. There might be a way to optimise NProxy further but the API didn’t yield any obvious optimisations.
Now let’s take a look at the runtime overhead of calling a proxied object. I’ve included an unproxied instance as a baseline.
|Library||Average Time (ns)|
|No Proxy||3.0494 ns|
Surprisingly there is no overhead with any of the libraries with calling the proxied object. The graph looks skewed due to how close the results are and the timings are in nanoseconds. This is great news and we can use whichever library we want guilt-free.
You can review the code for the benchmarks on github.
You might need to programmatically lookup details about a service. The FabricClient class can be used to lookup various things from the cluster.
The result of GetPartitionListAsync should never change for a service, as you can’t change the partition information after a service has been created. It would be safe, and give better performance, to cache this.
The endpoint of the primary replica however can move between machines, so this does need to be resolved more frequently. You can also cache this if you have a retry strategy that will re-resolve after an
Service fabric gives you two mechanisms out of the box when resolving which partition you hit when calling a Reliable Service. We’ll ignore the singleton partitions as they won’t help us with sharding.
Int64 range to decide which partition a numbered key falls in.
More information can be found here.
A named partition allows you to specify explicitly which partition you want to access at runtime. A common example is to specify A-Z named partitions and use the first letter of your data as the key. This splits your data into 26 partitions.
The advantages to this are that it is simple and you know which partition your data goes in without a lookup. Unfortunately as we will test later, you are unlikely to get a good distribution of your data across the partitions.
With a ranged partition the fabric tooling by default uses the entire Int64 range as keys to decide which partition. It will then convert these into ranges or buckets depending on the partition count.
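To make the bucketing concrete, here is a minimal Python sketch that splits the full Int64 range into equal-width ranges and finds the one a key falls into. It illustrates the idea only; Service Fabric's exact boundary handling may differ, and `partition_for_key` is a name invented for this example.

```python
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1


def partition_for_key(key, partition_count, low=INT64_MIN, high=INT64_MAX):
    """Return the zero-based index of the equal-width range containing key."""
    if not low <= key <= high:
        raise ValueError("key is outside the configured range")
    span = high - low + 1  # number of distinct keys in the range
    return (key - low) * partition_count // span


print(partition_for_key(INT64_MIN, 26))
print(partition_for_key(0, 26))
print(partition_for_key(INT64_MAX, 26))
```

Because the caller only supplies a key, the partition count lives entirely in configuration and the calling code does not need to know it.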
However to be able to lookup a partition we need a function which can reduce our data to an integer value. To use the configuration above we can convert our strings into an
Rather than use the ranges, you can fix your keys and plug in your own hash algorithm to resolve the partition.
We now have a key range limited to 0-25 rather than the entire Int64 range. We can resolve a client connected to this partition in the same way; however, this time we need to compute a key that fits in this smaller range. I’m using the jump consistent hash implementation in hydra.
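For readers who have not met it before, here is a Python transcription of the published jump consistent hash algorithm (Lamping and Veach). It is not the hydra implementation, and it assumes the key has already been reduced to a 64-bit integer.

```python
def jump_consistent_hash(key, num_buckets):
    """Map a 64-bit integer key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= 0xFFFFFFFFFFFFFFFF  # treat the key as an unsigned 64-bit value
    bucket, jump = -1, 0
    while jump < num_buckets:
        bucket = jump
        # 64-bit linear congruential step, wrapped to emulate unsigned overflow.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        jump = int((bucket + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return bucket


partition = jump_consistent_hash(123456789, 26)
```

The useful property is stability: growing from n to n+1 buckets only ever moves a key into the new bucket, never between existing ones.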
To benchmark the distribution we have a list of around 17000 real email addresses. This should give us an idea of how the sharding strategies will distribute the data across 26 partitions. Another advantage of using one of the Int64 methods is that they can be used with any number of partitions.
We are looking for an even number of accounts allocated to each partition.
We can see from those results that sharding using the first character of an email produces wildly different partition sizes, not what we want! Both the jump hash and integer ranging methods produced very even partition sizes.
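If you want to run the same kind of comparison on your own data, a minimal Python sketch follows. The way a string is reduced to an Int64 here (the first 8 bytes of a SHA-256 digest) is an assumption for illustration and not necessarily the conversion used in this post. The addresses are made-up placeholders that only show the mechanics, so substitute a real data set before drawing any conclusions.

```python
import hashlib
from collections import Counter

PARTITION_COUNT = 26


def first_letter_partition(text):
    """Named partitioning: the partition is the first character, upper-cased."""
    return text[0].upper()


def int64_key(text):
    """Reduce a string to a signed 64-bit integer (assumed hashing scheme)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def ranged_partition(text, partition_count=PARTITION_COUNT):
    """Ranged partitioning: equal-width buckets over the whole Int64 range."""
    return (int64_key(text) + 2**63) * partition_count // 2**64


def partition_sizes(items, assign):
    """Count how many items each partition receives."""
    return Counter(assign(item) for item in items)


# Placeholder addresses, not the data used for the results above.
addresses = ["alice@example.com", "adam@example.com", "bob@example.com",
             "anna@example.com", "zoe@example.com", "sam@example.com"]

print(partition_sizes(addresses, first_letter_partition))
print(partition_sizes(addresses, ranged_partition))
```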
Based on these results I would use the ranged partitioning method: it provides good balancing and is fast to compute. An additional advantage is that you do not need to know the partition count in the code; just map your data to an Int64 and Service Fabric will do the rest.
With most applications it's easy to get started on logging.
However, this will not log unhandled exceptions from places you couldn’t foresee. So let’s log these just in case anything goes wrong.
Any exceptions which crash the application can be handled using the
This will help you diagnose fatal errors. Unfortunately not all exceptions are fatal, and if you have any timers, unawaited async or unhandled task pool exceptions these can cause your application to behave unexpectedly without you knowing about it.
You can use the UnobservedTaskException to catch some of those:
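The same "log whatever slips through" pattern exists on other runtimes. As a loose analogy only, and not the .NET events discussed here, the Python sketch below installs a process-wide hook for fatal exceptions and a second hook for exceptions that escape background threads. The logger name `app` and the function `install_handlers` are assumptions made for the example.

```python
import logging
import sys
import threading

log = logging.getLogger("app")  # assumed logger name


def install_handlers():
    """Route otherwise unhandled exceptions to the application log."""

    def on_unhandled(exc_type, exc, traceback):
        # Called for exceptions that are about to terminate the main thread.
        log.critical("Unhandled exception", exc_info=(exc_type, exc, traceback))

    def on_thread_error(args):
        # Called when a threading.Thread dies from an uncaught exception.
        thread_name = args.thread.name if args.thread else "unknown"
        log.error("Unhandled exception in thread %s", thread_name,
                  exc_info=(args.exc_type, args.exc_value, args.exc_traceback))

    sys.excepthook = on_unhandled
    threading.excepthook = on_thread_error
```

Call `install_handlers()` once at startup, after logging has been configured, so that the hooks have somewhere to write.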
If you know any other events or ways to get more of these unexpected errors please let me know!